The best answers are Parquet, JSON, CSV, and ORC in that order.
Parquet is the strongest choice for complex analytical queries over large structured datasets because it is a columnar format. Columnar storage allows query engines such as Amazon Athena, AWS Glue, and Spark to read only the columns required by the query instead of scanning full rows. AWS documentation states that Apache Parquet and ORC are columnar storage formats optimized for fast retrieval in analytical applications, and that column-level compression can reduce storage space and I/O during query processing. This directly matches the need to filter, aggregate, reduce query response time, and lower storage/query cost.
JSON is correct for semi-structured real-time logs because JSON supports flexible and nested data structures. AWS Glue documentation describes JSON as a format for data structures with consistent shape but flexible contents and notes that it is not row-based or column-based. That makes it appropriate for application logs, event records, and evolving schemas used later for analytics or ML ingestion.
CSV is correct for small spreadsheet exports and occasional human-readable analysis. AWS Glue describes CSV as a minimal, row-based data format. CSV is widely supported by spreadsheet tools and is easy for humans to inspect, but it is not ideal for large-scale analytical performance because it lacks efficient column pruning and rich schema support.
ORC is correct for the Apache Hive read-heavy big data pipeline. ORC is a performance-oriented, column-based format, and it is strongly associated with Hive-based analytics workloads. It provides high compression and efficient reads, making it well suited for structured data in read-heavy big data pipelines.
Submit