Description
Runtime: Python
Is your feature request related to a problem? Please describe
IHAC who was looking into the benefits of the logger formatting but was having a hard time getting Athena to ingest the logs from CloudWatch and create a schema out of them. They were wondering whether any consideration had been given to making the logger format friendlier to data ingestion by services outside of CloudWatch. For example, this is how it currently outputs to CloudWatch with the Powertools Logger on:
```
2021-05-10T15:26:57.772Z {"level":"INFO","location":"checkOrderInput:50","message":"Checking required fields ['contact', 'address', 'id', 'expedite', 'installDate', 'addressValidationOverride', 'product', 'asset']","timestamp":"2021-05-10 15:26:57,771+0000","service":"new-order","xray_trace_id":"1-609950be-18655ee7e321f53ab8b4f629"}
```
With this format you could extract two columns:
1. the timestamp
2. the entire message as a JSON struct column

Alternatively, you could use the Grok SerDe in Athena with a regex to try to grab a pattern. In either case you would still need to run this through some type of data processing, no different than if you had used the built-in Python logger with a custom configuration, but slightly more challenging due to the JSON structure.
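For reference, a minimal sketch of the Powertools Logger setup that produces a line like the one shown above (the service name and message are taken from the sample output; the handler itself is illustrative):

```python
from aws_lambda_powertools import Logger

logger = Logger(service="new-order")

@logger.inject_lambda_context
def lambda_handler(event, context):
    # Each call emits one structured JSON log entry to CloudWatch
    logger.info("Checking required fields [...]")
    return {"statusCode": 200}
```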
Describe the solution you'd like
Either for the Powertools Logger to output each line as a plain JSON object (removing the timestamp at the beginning of the log) so Athena/Glue can use the built-in JsonSerDe to parse it and create columns, or some documentation or examples on how to leverage this logger formatting in queries and when creating Metric Filters from the logs.
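To illustrate why the leading timestamp is the blocker, a small Python sketch (the sample line is abbreviated from the output above):

```python
import json

line = ('2021-05-10T15:26:57.772Z {"level":"INFO",'
        '"service":"new-order","message":"Checking required fields"}')

# As-is, the line is not valid JSON because of the timestamp prefix
try:
    json.loads(line)
except json.JSONDecodeError:
    pass  # this is effectively what Athena's JsonSerDe trips over

# With the prefix removed, each line is a clean JSON object that a
# row-per-line JSON SerDe could map straight to columns
record = json.loads(line[line.index("{"):])
assert record["service"] == "new-order"
```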
Describe alternatives you've considered
Using a messy DDL statement in Athena with the Grok SerDe:
```sql
CREATE EXTERNAL TABLE ugi (
  loglevel string COMMENT 'from deserializer',
  timestamp string COMMENT 'from deserializer',
  service string COMMENT 'from deserializer',
  traceid string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
  'input.format'='(?<loglevel>"level":"(.{4,10})),([^.]+)(?<timestamp>"timestamp":"(.{10,28}))"([^.]+)(?<service>"service":"(.{3,10}))",([^.]+)(?<traceid>xray_trace_id":"(.{30,40}))"')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://ugi/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0',
  'CrawlerSchemaSerializerVersion'='1.0',
  'UPDATED_BY_CRAWLER'='powertoollogs',
  'averageRecordSize'='191',
  'classification'='powertoollogs',
  'compressionType'='none',
  'grokPattern'='(?<loglevel>"level":"(.{4,10})),([^.]+)(?<timestamp>"timestamp":"(.{10,28}))"([^.]+)(?<service>"service":"(.{3,10}))",([^.]+)(?<traceid>xray_trace_id":"(.{30,40}))"',
  'objectCount'='1',
  'recordCount'='1',
  'sizeKey'='191',
  'typeOfData'='file')
```
Just using the built-in Python logger with the traditional CW-Firehose-S3 architecture for streaming the logs into an S3 bucket for Athena, avoiding the need to parse the JSON structure.
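For comparison, a rough sketch of the kind of custom configuration the built-in logger would need in order to emit one JSON object per line (the field names mirror the Powertools output; the service value is illustrative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "location": f"{record.funcName}:{record.lineno}",
            "message": record.getMessage(),
            "timestamp": self.formatTime(record),
            "service": "new-order",  # illustrative
        })

root = logging.getLogger()
root.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
root.addHandler(handler)
```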
If you provide guidance, is this something you'd like to contribute?
I am not the best developer, but sure! I could help.
Additional context
Providing some examples of how others have leveraged the Powertools Logger, or any use cases where this logger has made operational tasks easier, would be very valuable and make it easier to sell to customers. Right now, outside of the nice structured, uniform formatting it creates in CloudWatch Logs, I do not see another benefit when it comes to using this data efficiently.