https://duckdbsnippets.com/snippets/36/query-s3-access-logs
Background: If you have S3 Access Logging enabled on one of your S3 buckets, you'll have some useful information about requests to your bucket. Unfortunately, it's in a semistructured format that can be difficult to parse. This SQL query will can help in this manner, both pulling out individual fields and coersing them to native data types.
Usage: you'll want to search for the strings and , and insert the S3 bucket where your access logs are being delivered. Use (or delete) to filter to a subset of your logs.
Also, these commented out configuration settings you can either run yourself in the REPL and source this file using .read parse_s3_access_logs.sql, or you can uncomment them and supply values for yourself.
-- install https;
-- load https;
-- SET s3_region='us-west-2';
-- SET s3_access_key_id='';
-- SET s3_secret_access_key='';
WITH parsed_logs AS (
SELECT
regexp_extract(col1, '^([0-9a-zA-Z]+)\s+([a-z0-9.\-]+)\s+\[([0-9/A-Za-z: +]+)\] ([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+) ("[^"]*"|-) ([^ ]+) ([^ ]+) (\d+|-) (\d+|-) (\d+|-) (\d+|-) ("[^"]*"|-) ("[^"]*"|-) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+)(.*)$',
['bucket_owner', 'bucket', 'timestamp', 'remote_ip', 'request', 'request_id', 'operation', 's3_key', 'request_uri', 'http_status', 's3_errorcode', 'bytes_sent','object_size', 'total_time', 'turn_around_time', 'referer', 'user_agent', 'version_id', 'host_id', 'sigver', 'cyphersuite', 'auth_type', 'host_header', 'tls_version', 'access_point_arn', 'acl_required', 'extra']
) AS log_struct
FROM
-- Trick the CSV reader into reading as a single column
read_csv(
's3://<bucket>/<prefix>/*',
columns={'col1': 'VARCHAR'},
-- Use a *hopefully* nonsensical deliminator, so no ',' chars screw us up
delim='\0'
)
)
SELECT
-- Grab everything from the struct that we want as strings, exclude stuff we'll coersce to diff types
log_struct.* exclude (timestamp, bytes_sent, object_size, total_time, turn_around_time),
strptime(log_struct.timestamp, '%d/%b/%Y:%H:%M:%S %z') AS timestamp,
CASE
WHEN log_struct.bytes_sent = '-' THEN NULL
ELSE CAST(log_struct.bytes_sent AS INTEGER)
END AS bytes_sent,
CASE
WHEN log_struct.object_size = '-' THEN NULL
ELSE CAST(log_struct.object_size AS INTEGER)
END AS object_size,
CASE
WHEN log_struct.total_time = '-' THEN NULL
ELSE CAST(log_struct.total_time AS INTEGER)
END AS total_time,
CASE
WHEN log_struct.turn_around_time = '-' THEN NULL
ELSE CAST(log_struct.turn_around_time AS INTEGER)
END AS turn_around_time
FROM parsed_logs;
https://duckdbsnippets.com/snippets/36/query-s3-access-logs
Background: If you have S3 Access Logging enabled on one of your S3 buckets, you'll have some useful information about requests to your bucket. Unfortunately, it's in a semistructured format that can be difficult to parse. This SQL query will can help in this manner, both pulling out individual fields and coersing them to native data types.
Usage: you'll want to search for the strings and , and insert the S3 bucket where your access logs are being delivered. Use (or delete) to filter to a subset of your logs.
Also, these commented out configuration settings you can either run yourself in the REPL and source this file using
.read parse_s3_access_logs.sql, or you can uncomment them and supply values for yourself.