Query Parquet Files in S3

13 Dec 2020

Got some parquet files sitting in S3 as the result of an export of an RDS snapshot? Here's how to query them using the aws CLI command.

First, list your parquet files from the S3 bucket where they live:

$ aws --output text s3api list-objects-v2 \
    --bucket my-db-export-bucket \
    --prefix my-db-export-for-2020-12-11/mydb/ \
    --query 'Contents[?ends_with(Key, `parquet`)]|[*].[Key]'
my-db-export-for-2020-12-11/mydb/mydb.accounts/part-00000-3c4f9c6c-e173-4410-821a-a416de9d4503-c000.gz.parquet
my-db-export-for-2020-12-11/mydb/mydb.users/part-00000-ef5c6dee-380f-473b-a19b-bb7e75bdabab-c000.gz.parquet
my-db-export-for-2020-12-11/mydb/mydb.widgets/part-00000-9f9081a7-c16c-4fa5-92b9-285b2de21f60-c000.gz.parquet
...

Next, run a SQL query on a file. Use the table name "s3object" in your SQL query: it is the "table" that the parquet file represents.

$ aws s3api select-object-content \
    --bucket my-db-export-bucket \
    --key my-db-export-for-2020-12-11/mydb/mydb.accounts/part-00000-3c4f9c6c-e173-4410-821a-a416de9d4503-c000.gz.parquet \
    --expression 'select count(*) from s3object' \
    --expression-type SQL \
    --input-serialization '{"Parquet": {}}' \
    --output-serialization '{"CSV": {}}' \
    /dev/stdout
722712