Amazon Athena is a wonderful tool for analyzing and querying the data you store on S3. What makes it powerful is that you don’t need to learn a new syntax: it uses plain SQL to query your data. As powerful as it is, though, it can get expensive and slow if you run queries against a dataset that isn’t optimized for Athena.
TL;DR: three techniques that can help you reduce Amazon Athena query costs and improve performance:
- Data Partitioning: Partitioning your Athena tables keeps related data together, reducing the amount of data each query scans. This results in quicker queries and lower costs. For example, partitioning a user table by ID lets a query that filters on a specific ID scan only that partition.
- Data Compression: Compressing data reduces file sizes, which means faster queries and lower costs. Athena supports multiple compression formats, including gzip, Snappy, and zstd. Choose a compression type based on your needs, balancing compression ratio against processing speed.
- Optimal File Size: Files that are large enough to be read in parallel speed up queries. Avoid too many small files, which slow down queries and can even cause errors. Aim for file splits of around 128 MB, using tools like S3DistCp on Amazon EMR to manage file sizes effectively.
In this article we focus on these data storage techniques, which let Athena queries run more efficiently, making them faster and cheaper.
1. Data Partitioning
You can read about this in detail on the official AWS page about partitioning. In short, partitioning divides your Athena table into parts and keeps related data together, so each query scans a much smaller dataset, leading to faster results at lower cost. You define the partition during table creation, as shown below.
```sql
CREATE EXTERNAL TABLE users (
  first string,
  last string,
  username string
)
PARTITIONED BY (id string)
STORED AS parquet
LOCATION 's3://amzn-s3-demo-bucket'
```
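After creating a partitioned table, Athena needs to know which partitions exist before it can prune them. Here is a minimal sketch of the follow-up steps, assuming your data sits under Hive-style prefixes such as s3://amzn-s3-demo-bucket/id=2/ (the specific key value is a placeholder):

```sql
-- Scan the table's S3 location for Hive-style prefixes (like .../id=2/)
-- and register them as partitions; new partitions are invisible to
-- queries until they are loaded into the metadata.
MSCK REPAIR TABLE users;

-- Filtering on the partition key lets Athena read only the matching
-- partition instead of scanning the whole table.
SELECT first, last
FROM users
WHERE id = '2';
```

If you prefer, you can register partitions one at a time with ALTER TABLE ADD PARTITION instead of scanning the whole location.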
2. Use Compression
Compressing your data can make your queries run much faster. Smaller files mean less data to load from Amazon S3, which cuts down on costs. It also reduces network traffic between S3 and Athena. Athena supports a bunch of compression formats, including popular ones like gzip, Snappy, and zstd. Check out Athena’s compression support for the full list.
When querying compressed text data (like JSON or CSV), there’s something to keep in mind. Athena divides files into smaller chunks called splits to process them in parallel, but most common compression formats aren’t splittable, meaning they have to be read from the beginning. So if you have a single large compressed CSV file, only one node can work on it at a time. When creating datasets with compressed text files, try to find a good balance between the number of files and their size; we’ll talk more about optimizing file sizes in the next section.

Parquet and ORC files, on the other hand, are always splittable. These formats compress parts of the file separately and include metadata that tells Athena where to find each part, which makes them great for parallel processing.
gzip offers great compression and works with a lot of other tools and services. zstd is newer and strikes a good balance between speed and compression ratio. bzip2 and LZO are splittable, but they’re not the best choices if you want speed and compatibility.
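One easy way to apply compression is a CTAS (CREATE TABLE AS SELECT) statement that rewrites an existing table into a compressed, splittable format. Here is a minimal sketch, assuming the users table from earlier and an empty target prefix (Athena requires the CTAS output location to be empty); the bucket and prefix names are placeholders:

```sql
-- Rewrite the users table as Snappy-compressed Parquet.
-- Parquet is splittable, so Athena can still process it in parallel.
CREATE TABLE users_parquet
WITH (
  format = 'PARQUET',
  write_compression = 'SNAPPY',
  external_location = 's3://amzn-s3-demo-bucket/users-parquet/'
) AS
SELECT * FROM users;
```

Swap SNAPPY for ZSTD or GZIP if you want a higher compression ratio at some extra CPU cost.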
3. Selecting the right file size
Queries run faster when data can be read in parallel and in large chunks. Reading a file has some overhead (fetching metadata, making requests to S3, setting up decompression). This isn’t usually a big deal, but with lots of files it can add up. To get the best performance, aim for a good balance between the number of files and their size.
Aim for splits of around 128 MB. A split is a part of a file, like a section of an uncompressed text file or a page in a Parquet file. As we talked about with compression, most compressed text files can’t be split, so they’re processed as a single chunk. Analytics-optimized formats like Parquet and ORC can always be split.
Having too many small files can slow down your queries, and it often happens when you’ve over-partitioned your data. If the planning phase of your query takes a long time, that’s a sign you might have too many small files. In the worst case, your queries might fail with an error from S3 saying “Please reduce your request rate.” This happens when there are so many files that Athena exceeds S3’s request limits.

To fix this, you can use the S3DistCp utility on Amazon EMR to combine smaller files into larger ones (it can also move large amounts of data between HDFS and S3), or you can reprocess your data using Athena Spark. Having fewer, larger files brings several benefits: faster listing, fewer S3 requests, and less metadata to manage.
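If you’d rather stay in SQL, a CTAS statement can also compact a table by rewriting it into a controlled number of files. Here is a sketch of that approach, assuming a hypothetical item table with an id column; the bucket count, column name, and S3 prefix are placeholders you’d tune so each output file lands near the 128 MB target:

```sql
-- Rewrite the item table into 16 compressed Parquet files.
-- Bucketing produces one output file per bucket, so many small
-- input files are compacted into a few large, splittable ones.
CREATE TABLE item_compacted
WITH (
  format = 'PARQUET',
  write_compression = 'ZSTD',
  external_location = 's3://amzn-s3-demo-bucket/item-compacted/',
  bucketed_by = ARRAY['id'],
  bucket_count = 16
) AS
SELECT * FROM item;
```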
Below is a table comparing a query that reads 50,000 files with one that reads the same data as a single file. Both sets of files contain the same rows, stored as uncompressed text files, 63 GB in total.
| Query | Number of files | Total runtime |
|---|---|---|
| SELECT COUNT(*) FROM item | 50,000 | 6.8 seconds |
| SELECT COUNT(*) FROM item | 1 | 3.2 seconds |
File size, the number of files, and whether they’re compressed can all affect query performance. Uncompressed data can be processed in parallel in optimally sized chunks, which makes a single large uncompressed text file faster to process than 100,000 small ones. With compressed data, the number and size of files matter even more: you need enough files for Athena to process the data in parallel.
Stay tuned as we post other techniques related to this topic.