
Leveraging Lambda Powertools Logger for querying in Athena or other AWS services #460

Closed
@rickyaws

Description


Runtime:
Python

Is your feature request related to a problem? Please describe
I have a customer (IHAC) who was looking into the benefits of the Logger's formatting, but was having a hard time getting Athena to ingest the logs from CloudWatch and build a schema out of them. They were wondering whether any consideration had been given to making the logger format friendlier for data ingestion by services other than CloudWatch. For example, this is how a log line currently appears in CloudWatch with the Powertools Logger enabled:

2021-05-10T15:26:57.772Z {"level":"INFO","location":"checkOrderInput:50","message":"Checking required fields ['contact', 'address', 'id', 'expedite', 'installDate', 'addressValidationOverride', 'product', 'asset']","timestamp":"2021-05-10 15:26:57,771+0000","service":"new-order","xray_trace_id":"1-609950be-18655ee7e321f53ab8b4f629"}

With this format you can only extract two columns:

  1. the timestamp
  2. the entire message as a single JSON struct column

Alternatively, you can use the Grok SerDe in Athena with a regex to grab a pattern. In either case you would still need to run the logs through some type of data processing, no different than if you had used the built-in Python logger with a custom configuration, but slightly more challenging due to the JSON structure.
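To make the preprocessing concrete, here is a minimal Python sketch (using the example line from above, with the message abridged) that splits the CloudWatch-prepended timestamp from the JSON payload before the payload can be parsed. Any ingestion pipeline would need to do something equivalent today:

```python
import json

# Example line as it appears in CloudWatch; the leading ISO-8601 timestamp
# is prepended by CloudWatch, not emitted by the Powertools Logger itself
LOG_LINE = (
    '2021-05-10T15:26:57.772Z '
    '{"level":"INFO","location":"checkOrderInput:50",'
    '"message":"Checking required fields",'
    '"timestamp":"2021-05-10 15:26:57,771+0000",'
    '"service":"new-order",'
    '"xray_trace_id":"1-609950be-18655ee7e321f53ab8b4f629"}'
)

# Split once on the first space: the left part is the CloudWatch timestamp,
# everything after it is the JSON payload the logger actually wrote
ingest_ts, _, payload = LOG_LINE.partition(" ")
record = json.loads(payload)
print(ingest_ts, record["service"], record["level"])
```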

Describe the solution you'd like

Either have the Powertools Logger output each line as a pure JSON object (removing the timestamp at the beginning of the log line so that Athena/Glue can use the built-in JSON SerDe to parse it and create columns), or provide documentation and examples on how to leverage this logger format in queries and when creating metric filters from it.
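For comparison, if each log line landed in S3 as one clean JSON object, the Athena table could be as simple as the following sketch. The table name and S3 location are hypothetical; `org.openx.data.jsonserde.JsonSerDe` is one of the JSON SerDes Athena supports:

```sql
CREATE EXTERNAL TABLE powertools_logs (
  level string,
  location string,
  message string,
  `timestamp` string,
  service string,
  xray_trace_id string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-log-bucket/powertools/';
```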
Describe alternatives you've considered

Using a messy DDL statement in Athena with the Grok SerDe:
CREATE EXTERNAL TABLE ugi (
  loglevel string COMMENT 'from deserializer',
  timestamp string COMMENT 'from deserializer',
  service string COMMENT 'from deserializer',
  traceid string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
  'input.format'='(?<loglevel>"level":"(.{4,10})),([^.]+)(?<timestamp>"timestamp":"(.{10,28}))"([^.]+)(?<service>"service":"(.{3,10}))",([^.]+)(?<traceid>xray_trace_id":"(.{30,40}))"')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://ugi/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0',
  'CrawlerSchemaSerializerVersion'='1.0',
  'UPDATED_BY_CRAWLER'='powertoollogs',
  'averageRecordSize'='191',
  'classification'='powertoollogs',
  'compressionType'='none',
  'grokPattern'='(?<loglevel>"level":"(.{4,10})),([^.]+)(?<timestamp>"timestamp":"(.{10,28}))"([^.]+)(?<service>"service":"(.{3,10}))",([^.]+)(?<traceid>xray_trace_id":"(.{30,40}))"',
  'objectCount'='1',
  'recordCount'='1',
  'sizeKey'='191',
  'typeOfData'='file')

Just using the built-in Python logger with the traditional CloudWatch-Firehose-S3 architecture for streaming the logs into an S3 bucket for Athena, to avoid parsing the JSON structure.
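In the Firehose variant, a record-transformation Lambda only has to unwrap the CloudWatch Logs subscription envelope (a gzipped, base64-encoded payload with a `logEvents` array). The handler below is a minimal sketch of that idea, not a production-tested implementation; for Powertools output, each `logEvent["message"]` is already a single JSON object:

```python
import base64
import gzip
import json

def handler(event, context):
    """Firehose transformation: emit one JSON object per line, no timestamp prefix."""
    output = []
    for record in event["records"]:
        # CloudWatch Logs delivers gzipped, base64-encoded payloads to Firehose
        payload = json.loads(gzip.decompress(base64.b64decode(record["data"])))
        if payload.get("messageType") != "DATA_MESSAGE":
            # Control messages carry no log events; drop them
            output.append({"recordId": record["recordId"], "result": "Dropped"})
            continue
        # Each message is the raw line the function wrote; join as JSON Lines
        lines = "\n".join(e["message"].strip() for e in payload["logEvents"]) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(lines.encode()).decode(),
        })
    return {"records": output}
```

With this in front of the delivery stream, the objects in S3 are plain JSON Lines that the built-in JSON SerDe can read directly.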
If you provide guidance, is this something you'd like to contribute?
I am not the best developer but sure! I could help

Additional context

Providing some examples of how others have leveraged the Powertools Logger, or any use cases where this logger has made operational tasks easier, would be very valuable and would make it easier to sell to customers. Right now, outside of the nice, uniform structured formatting it creates in CloudWatch Logs, I do not see another way to use this data efficiently.

Labels: documentation (Improvements or additions to documentation)