3 Performance and Cost Optimization Tips for Using Amazon Athena at Scale

Learn how to optimize your Amazon Athena queries with three key techniques: data partitioning, data compression, and file size optimization. These strategies will help you reduce costs and speed up your query performance, making your data analysis more efficient.

August 27, 2024


Amazon Athena is a wonderful tool for analyzing and querying the data you store on S3. What makes it powerful is that you don’t need to learn a new syntax: it uses plain SQL for querying data. As powerful as it is, it can get expensive and slow if you run queries against a data set that isn’t optimized for Athena.


TL;DR: three techniques that can help you reduce Amazon Athena query costs and improve performance:

  • Data Partitioning: Partitioning your Athena tables groups related data, reducing the amount of data scanned. This results in quicker queries and lower costs. For example, partitioning a user table by ID can significantly enhance query efficiency.
  • Data Compression: Compressing data reduces the size of files, leading to faster queries and lower costs. Athena supports multiple compression formats like gzip, Snappy, and zstd. Choose the right compression type based on your needs, balancing between compression efficiency and processing speed.
  • Optimal File Size: Queries run faster when data can be read in parallel from fewer, larger files. Avoid too many small files, which can slow down queries and even cause errors. Aim for file splits of around 128 MB for the best performance, using tools like S3DistCP on Amazon EMR to manage file sizes effectively.

In this article we focus on data storage techniques that enable Athena queries to run more efficiently, making them faster and cheaper.

1. Data Partitioning

You can read about it in detail on the official AWS page about partitioning. In short, partitioning divides your Athena table into parts and keeps related data together, so a query scans a much smaller data set, leading to faster results at lower cost. You can define table partitions during table creation as shown below.


 CREATE EXTERNAL TABLE users (
   first string,
   last string,
   username string
 )
 PARTITIONED BY (id string)
 STORED AS parquet
 LOCATION 's3://amzn-s3-demo-bucket'
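After creating a partitioned table, Athena needs to know which partitions exist before it can prune them. A minimal sketch, assuming the data for each id lives under a matching S3 prefix (the id value and prefix layout here are illustrative):

```sql
-- Register one partition explicitly (illustrative id and S3 prefix):
ALTER TABLE users ADD PARTITION (id = '1')
LOCATION 's3://amzn-s3-demo-bucket/id=1/';

-- Or, if the S3 layout follows the Hive-style id=.../ convention,
-- load all partitions in one step:
MSCK REPAIR TABLE users;

-- Filtering on the partition column lets Athena skip every other partition:
SELECT first, last FROM users WHERE id = '1';
```

With the partition filter in place, Athena scans only the objects under the matching prefix instead of the whole bucket.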

2. Use Compression

Compressing your data can make your queries run much faster. Smaller files mean less data to load from Amazon S3, which cuts down on costs. It also reduces network traffic between S3 and Athena. Athena supports a bunch of compression formats, including popular ones like gzip, Snappy, and zstd. Check out Athena’s compression support for the full list.

When querying compressed text data (like JSON or CSV), there’s something to keep in mind. Athena divides files into smaller chunks called splits so it can process them in parallel. But most common compression formats aren’t splittable, meaning they have to be read from the beginning. So if you have a single large compressed CSV file, only one node can work on it at a time. When creating datasets with compressed text files, try to strike a good balance between the number of files and their size. We’ll talk more about optimizing file sizes in the next section.

Parquet and ORC files, on the other hand, are always splittable. These formats compress parts of the file separately and include metadata that tells Athena where to find each part, which makes them great for parallel processing.

gzip offers great compression and is widely supported by other tools and services. zstd is newer and strikes a good balance between speed and compression ratio. bzip2 and LZO are splittable, but they’re not the best choices if you want speed and compatibility.
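One way to pick a compression format at write time is a CTAS statement. A sketch of rewriting an existing table as zstd-compressed Parquet; the output table name and S3 location are illustrative assumptions:

```sql
-- Rewrite the users table as zstd-compressed Parquet
-- (table name and output location are illustrative):
CREATE TABLE users_zstd
WITH (
  format = 'PARQUET',
  write_compression = 'ZSTD',
  external_location = 's3://amzn-s3-demo-bucket/users-zstd/'
) AS
SELECT * FROM users;
```

Because the output is Parquet, it stays splittable regardless of the compression codec chosen.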

3. Selecting the Right File Size

Queries run faster when data can be read in parallel and in large chunks. Reading files has some overhead (like getting metadata, making requests to S3, and setting up compression). This isn’t usually a big deal, but with lots of files, it can add up. To get the best performance, try to have a good balance between the number of files and their size.

Aim for splits of around 128 MB. A split is a part of a file, like a section of an uncompressed text file or a page in a Parquet file. As we talked about with compression, most compressed text files can’t be split, so they’re processed as a single chunk. Analytics-optimized formats like Parquet and ORC can always be split.
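As a rough back-of-the-envelope check, assuming splittable files, you can estimate how many parallel work units a dataset yields at the recommended split size. This is an illustrative estimate, not Athena’s actual planner logic; the 63 GB figure matches the comparison later in this article:

```python
# Rough parallelism estimate: number of ~128 MB splits in a 63 GB dataset.
GIB = 1024 ** 3
MIB = 1024 ** 2

data_size = 63 * GIB       # total uncompressed data
split_size = 128 * MIB     # recommended split size

splits = data_size // split_size
print(splits)  # 504 potential parallel work units
```

A single unsplittable gzip file of the same data would instead be one work unit, which is why the file layout matters so much.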

Having too many small files can slow down your queries. This can happen if you’ve over-partitioned your data. If the planning phase of your query takes a long time, that’s a sign you might have too many small files. In the worst case, your queries might fail with an error from S3 saying “Please reduce your request rate.” This happens when there are so many files that Athena goes over S3’s limits. To fix this, you can use the S3DistCP utility on Amazon EMR. It can combine smaller files into larger ones. You can also use it to move large amounts of data between HDFS and S3. Another option is to reprocess your data using Athena Spark. Having fewer, larger files has some benefits: faster listing, fewer S3 requests, and less metadata to manage.
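An s3-dist-cp run to merge small files might look like the following sketch, executed as a step on an Amazon EMR cluster; the bucket paths and grouping regex are illustrative assumptions:

```shell
# Combine small CSV files under small-files/ into ~128 MB objects
# (paths and regex are illustrative; run on an EMR cluster):
s3-dist-cp \
  --src s3://amzn-s3-demo-bucket/small-files/ \
  --dest s3://amzn-s3-demo-bucket/combined/ \
  --groupBy '.*/(users)/.*\.csv' \
  --targetSize 128
```

Files whose paths match the --groupBy regex are concatenated into output objects of roughly --targetSize megabytes, giving Athena fewer, larger files to list and read.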

Below is a comparison of a query that reads 50,000 files against one that reads the same data as a single file. Both sets of files contain the same rows, stored as uncompressed text files. The total data is 63 GB.

Figure 1: Number of files vs Total Runtime

File size, the number of files, and whether they’re compressed can all affect query performance. Uncompressed data can be processed in parallel in optimal-sized chunks, making a single large uncompressed text file faster to process than 100,000 small ones. With compressed data, the number and size of files are even more important: you need enough files for Athena to process the data in parallel.

Stay tuned as we post other techniques related to this topic.

For more information

We’re ready when you are. Get in touch and a friendly, knowledgeable CostWise cloud expert will discuss your cost optimization needs, ask a few questions to get the full picture, and make a plan to follow up.
