Amazon Athena is an interactive query service that allows you to analyze data stored in Amazon S3 using standard SQL. Athena is serverless, so you only pay for the queries that you run and there is no infrastructure to manage.
To optimize the query performance of Athena, one of the best practices is to convert the data into a columnar format, such as Apache Parquet or Apache ORC. Columnar formats store data by columns rather than by rows, which allows Athena to scan only the columns that are relevant to the query, reducing the amount of data read and improving the query speed. Columnar formats also support compression and encoding schemes that can reduce the storage space and the data scanned per query, further enhancing the performance and reducing the cost.
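The effect of column pruning described above can be illustrated with a toy model. This is only a sketch: the sample rows and both function names are hypothetical, and the "columnar" count ignores the metadata, encoding, and compression that a real Parquet or ORC file would have. It only shows why reading one column touches fewer bytes than reading whole rows.

```python
import csv
import io

# Hypothetical sample data (not from the original question).
ROWS = [
    {"user_id": "1001", "country": "US", "comment": "first order, paid by card", "amount": "10.50"},
    {"user_id": "1002", "country": "DE", "comment": "repeat customer", "amount": "7.25"},
    {"user_id": "1003", "country": "JP", "comment": "gift wrapped, express", "amount": "42.00"},
]


def csv_bytes_scanned(rows):
    """Row layout: the whole file is read, whichever columns the query needs."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return len(buf.getvalue().encode("utf-8"))


def columnar_bytes_scanned(rows, needed_columns):
    """Column layout (toy model): only values of the requested columns are read."""
    columns = {name: [row[name] for row in rows] for name in rows[0]}
    return sum(
        len(value.encode("utf-8"))
        for name in needed_columns
        for value in columns[name]
    )


if __name__ == "__main__":
    # A query such as SELECT SUM(amount) needs only the "amount" column.
    print("row layout bytes:     ", csv_bytes_scanned(ROWS))
    print("columnar layout bytes:", columnar_bytes_scanned(ROWS, ["amount"]))
```

Because Athena bills by the amount of data scanned, the same reduction in bytes read shows up in both query time and query cost.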
In contrast, plaintext CSV files store data by rows, which means that Athena has to read every row in full even if the query needs only a few columns. This increases the amount of data scanned and the query latency. Moreover, uncompressed CSV files take up more storage space and incur higher query costs. A CSV file can be compressed as a whole (for example, with GZIP), but it has no column-level encoding and still cannot be read one column at a time.
Therefore, the Machine Learning Specialist should transform the dataset to Apache Parquet format to minimize query runtime.
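One possible way to perform this conversion is an Athena CREATE TABLE AS SELECT (CTAS) statement. The sketch below only builds the SQL text. The function name, table names, and S3 location are hypothetical, and the `format`, `write_compression`, and `external_location` table properties are given as they appear in the Athena CTAS documentation to the best of my knowledge, so they should be checked against the current documentation before use.

```python
def build_ctas_query(source_table, target_table, output_location, compression="SNAPPY"):
    """Return an Athena CTAS statement that rewrites a table as Parquet.

    All arguments are illustrative placeholders; nothing is sent to AWS here.
    """
    return (
        f"CREATE TABLE {target_table}\n"
        "WITH (\n"
        "  format = 'PARQUET',\n"
        f"  write_compression = '{compression}',\n"
        f"  external_location = '{output_location}'\n"
        ")\n"
        f"AS SELECT * FROM {source_table}"
    )


if __name__ == "__main__":
    # Hypothetical table names and bucket.
    print(
        build_ctas_query(
            source_table="sales_csv",
            target_table="sales_parquet",
            output_location="s3://example-bucket/sales_parquet/",
        )
    )
```

The resulting statement could be run from the Athena console, or submitted programmatically, for example through the boto3 Athena client's `start_query_execution` call. Later queries would then target the Parquet table instead of the CSV one.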
[Reference: Top 10 Performance Tuning Tips for Amazon Athena]
[Reference: Columnar Storage Formats]
Using compression reduces the amount of data scanned by Amazon Athena and also reduces your S3 storage footprint. It’s a win-win for your AWS bill. Supported compression formats: GZIP, LZO, SNAPPY (Parquet), and ZLIB.
[Reference: https://www.cloudforecast.io/blog/using-parquet-on-athena-to-save-money-on-aws/]