Amazon Athena is a wonderful tool for analyzing and querying the data you store on S3. What makes it powerful is that you don’t need to learn a new syntax: it uses plain SQL to query your data. As powerful as it is, though, it can get expensive and slow if you run queries against a dataset that isn’t optimized for Athena.
TL;DR: three techniques that can help you reduce Amazon Athena query costs and improve performance:
- Data Partitioning: Partitioning your Athena tables keeps related data together, reducing the amount of data each query scans. This results in quicker queries and lower costs. For example, partitioning a user table by ID lets a query that filters on a specific ID scan only that partition.
- Data Compression: Compressing data reduces file sizes, which means faster queries and lower costs. Athena supports multiple compression formats, including gzip, Snappy, and zstd. Choose a compression type based on your needs, balancing compression ratio against processing speed.
- Optimal File Size: Files that are large enough to be read in parallel speed up queries. Avoid too many small files, which slow down queries and can even cause errors. Aim for file splits of around 128 MB, using tools like S3DistCp on Amazon EMR to manage file sizes effectively.
In this article we focus on these data storage techniques, which let Athena queries run more efficiently, making them faster and cheaper.
1. Data Partitioning
You can read about this in detail on the official AWS page about partitioning. In short, partitioning divides your Athena table into parts and keeps related data together, so each query scans a much smaller dataset, leading to faster results at lower cost. You define the partition during table creation, as shown below.
```sql
CREATE EXTERNAL TABLE users (
  first string,
  last string,
  username string
)
PARTITIONED BY (id string)
STORED AS parquet
LOCATION 's3://amzn-s3-demo-bucket'
```
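After creating a partitioned table, Athena needs to know which partitions exist before it can prune them. Here is a minimal sketch of the follow-up steps, assuming your data sits under Hive-style prefixes such as s3://amzn-s3-demo-bucket/id=2/ (the specific key value is a placeholder):

```sql
-- Scan the table's S3 location for Hive-style prefixes (like .../id=2/)
-- and register them as partitions; new partitions are invisible to
-- queries until they are loaded into the metadata.
MSCK REPAIR TABLE users;

-- Filtering on the partition key lets Athena read only the matching
-- partition instead of scanning the whole table.
SELECT first, last
FROM users
WHERE id = '2';
```

If you prefer, you can register partitions one at a time with ALTER TABLE ADD PARTITION instead of scanning the whole location.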
2. Use Compression
Compressing your data can make your queries run much faster. Smaller files mean less data to load from Amazon S3, which cuts down on costs. It also reduces network traffic between S3 and Athena. Athena supports a bunch of compression formats, including popular ones like gzip, Snappy, and zstd. Check out Athena’s compression support for the full list.
When querying compressed text data (like JSON or CSV), there’s something to keep in mind. Athena divides files into smaller chunks called splits to process them in parallel, but most common compression formats aren’t splittable, meaning they have to be read from the beginning. So if you have a single large compressed CSV file, only one node can work on it at a time. When creating datasets with compressed text files, try to find a good balance between the number of files and their size; we’ll talk more about optimizing file sizes in the next section.

Parquet and ORC files, on the other hand, are always splittable. These formats compress parts of the file separately and include metadata that tells Athena where to find each part, which makes them great for parallel processing.
gzip offers great compression and works with a lot of other tools and services. zstd is newer and strikes a good balance between speed and compression ratio. bzip2 and LZO are splittable, but they’re not the best choices if you want speed and compatibility.
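One easy way to apply compression is a CTAS (CREATE TABLE AS SELECT) statement that rewrites an existing table into a compressed, splittable format. Here is a minimal sketch, assuming the users table from earlier and an empty target prefix (Athena requires the CTAS output location to be empty); the bucket and prefix names are placeholders:

```sql
-- Rewrite the users table as Snappy-compressed Parquet.
-- Parquet is splittable, so Athena can still process it in parallel.
CREATE TABLE users_parquet
WITH (
  format = 'PARQUET',
  write_compression = 'SNAPPY',
  external_location = 's3://amzn-s3-demo-bucket/users-parquet/'
) AS
SELECT * FROM users;
```

Swap SNAPPY for ZSTD or GZIP if you want a higher compression ratio at some extra CPU cost.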
3. Selecting the right file size
Queries run faster when data can be read in parallel and in large chunks. Reading a file has some overhead (fetching metadata, making requests to S3, setting up decompression). This isn’t usually a big deal, but with lots of files it can add up. To get the best performance, aim for a good balance between the number of files and their size.
Aim for splits of around 128 MB. A split is a part of a file, like a section of an uncompressed text file or a page in a Parquet file. As we talked about with compression, most compressed text files can’t be split, so they’re processed as a single chunk. Analytics-optimized formats like Parquet and ORC can always be split.
Having too many small files can slow down your queries, and it often happens when you’ve over-partitioned your data. If the planning phase of your query takes a long time, that’s a sign you might have too many small files. In the worst case, your queries might fail with an error from S3 saying “Please reduce your request rate.” This happens when there are so many files that Athena exceeds S3’s request limits.

To fix this, you can use the S3DistCp utility on Amazon EMR to combine smaller files into larger ones (it can also move large amounts of data between HDFS and S3), or you can reprocess your data using Athena Spark. Having fewer, larger files brings several benefits: faster listing, fewer S3 requests, and less metadata to manage.
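If you’d rather stay in SQL, a CTAS statement can also compact a table by rewriting it into a controlled number of files. Here is a sketch of that approach, assuming a hypothetical item table with an id column; the bucket count, column name, and S3 prefix are placeholders you’d tune so each output file lands near the 128 MB target:

```sql
-- Rewrite the item table into 16 compressed Parquet files.
-- Bucketing produces one output file per bucket, so many small
-- input files are compacted into a few large, splittable ones.
CREATE TABLE item_compacted
WITH (
  format = 'PARQUET',
  write_compression = 'ZSTD',
  external_location = 's3://amzn-s3-demo-bucket/item-compacted/',
  bucketed_by = ARRAY['id'],
  bucket_count = 16
) AS
SELECT * FROM item;
```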
Below is a table comparing a query that reads 50,000 files with one that reads the same data as a single file. Both sets of files contain the same rows, stored as uncompressed text files, 63 GB in total.
| Query | Number of files | Total runtime |
|---|---|---|
| SELECT COUNT(*) FROM item | 50,000 | 6.8 seconds |
| SELECT COUNT(*) FROM item | 1 | 3.2 seconds |
File size, the number of files, and whether they’re compressed can all affect query performance. Uncompressed data can be processed in parallel in optimally sized chunks, which makes a single large uncompressed text file faster to process than 100,000 small ones. With compressed data, the number and size of files matter even more: you need enough files for Athena to process the data in parallel.
Stay tuned as we post other techniques related to this topic.