Sample to show how to process different types of Avro subjects in a single topic #98
Conversation
Many thanks for the contribution.
I would suggest some changes to make it simpler to understand, and also to avoid hinting at any not-so-good practices.
The `.gitignore` in the subfolder is not required.
There is a `.gitignore` at the top level; if anything is missing there, please update that one instead.
* Flink API: DataStream API
* Language: Java (11)

This example demonstrates how to serialize/deserialize Avro messages in Kafka when one topic stores multiple subject types.
Explain that this is specific to Confluent Schema Registry.
✅
* Flink version: 1.20
* Flink API: DataStream API
* Language: Java (11)
We usually add the connectors used in the example to this list. In this case, it's also important to note that the example uses Avro with Confluent Schema Registry.
✅
This example uses Avro-generated classes (more details [below](#using-avro-generated-classes)).

A `KafkaSource` produces a stream of Avro data objects (`SpecificRecord`), fetching the writer's schema from AWS Glue Schema Registry. The Avro Kafka message value must have been serialized using AWS Glue Schema Registry.
I think you mean Confluent Schema Registry?
✅
```java
env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
env.enableCheckpointing(60000);
}
env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
```
This is not really required; STREAMING is already the default runtime mode.
✅
```java
env.execute("avro-one-topic-many-subjects");
}

private static void setupAirQualityGenerator(String bootstrapServers, String sourceTopic, String schemaRegistryUrl, Map<String, Object> schemaRegistryConfig, StreamExecutionEnvironment env) {
```
Even though having the data generator within the same Flink app works, we deliberately avoid doing it in any of the examples, because building jobs with multiple dataflows is strongly discouraged. We avoid any bad practice in the examples, so as not to suggest it may be a good idea.
I reckon it's more complicated, but you can add a separate module with a standalone Java application which generates data, similar to what we do in this example, even though in that case it's Kinesis.
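As a rough illustration of the standalone-generator pattern suggested here, a minimal sketch of a plain Java producer publishing Avro records through Confluent Schema Registry. The `DataGenerator` class name, the `AirQuality` field names, and the endpoints are hypothetical, not taken from the PR:

```java
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class DataGenerator {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical endpoint
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", KafkaAvroSerializer.class.getName());
        props.put("schema.registry.url", "http://localhost:8081"); // hypothetical endpoint
        // One subject per record type, so several Avro types can share the topic
        props.put("value.subject.name.strategy", "io.confluent.kafka.serializers.subject.RecordNameStrategy");

        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            while (true) {
                // AirQuality is the Avro-generated class; these field names are made up
                AirQuality record = AirQuality.newBuilder()
                        .setRoom("kitchen")
                        .setCo2(400 + (int) (Math.random() * 100))
                        .setTimestamp(System.currentTimeMillis())
                        .build();
                producer.send(new ProducerRecord<>("source-topic", "kitchen", record));
                Thread.sleep(1000);
            }
        }
    }
}
```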
✅
```java
 * strategies and event time extraction. However, for those scenarios to work
 * all subjects should have a standard set of fields.
 */
class Option {
```
Maybe you can use `org.apache.flink.types.SerializableOptional<T>`, which comes with Flink.
The `Option` type in this PR is a container type to hold any possible deserialized value. `SerializableOptional<T>` is for optional values, so I guess it would not be the right choice here? Am I missing something? 👀 🙏🏾

BTW: I could have used `Object` instead of creating an `Option` type with an `Object`-typed `value` field. However, having the `Option` type helps if we want to generate watermarks in the source operator using a common timestamp field.
Is there a better way to do this?
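For context, a minimal sketch of the container type being discussed, assuming the `Object`-typed `value` field and a common timestamp field as described (names are illustrative):

```java
import java.io.Serializable;

// Container for whichever record type was deserialized from the topic, plus a
// common event-time field that all subjects are assumed to share. Public fields
// and the no-arg constructor keep the type a Flink POJO.
public class Option implements Serializable {
    public Object value;
    public long timestamp;

    public Option() {}

    public Option(Object value, long timestamp) {
        this.value = value;
        this.timestamp = timestamp;
    }
}
```

A single watermark strategy could then be applied to the mixed stream, e.g. `WatermarkStrategy.<Option>forMonotonousTimestamps().withTimestampAssigner((opt, ts) -> opt.timestamp)`.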
```java
}

// Custom deserialization schema for handling multiple generic Avro record types
class OptionDeserializationSchema implements KafkaRecordDeserializationSchema<Option> {
```
Please move this to a top-level class for readability.
✅
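For context, a hedged sketch of what such a top-level deserialization schema might look like with Confluent's deserializer; an illustration under the assumptions above (the `Option` container, a serializable config map), not the PR's exact code:

```java
import io.confluent.kafka.serializers.KafkaAvroDeserializer;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.connector.kafka.source.reader.deserializer.KafkaRecordDeserializationSchema;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.consumer.ConsumerRecord;

import java.io.IOException;
import java.util.Map;

public class OptionDeserializationSchema implements KafkaRecordDeserializationSchema<Option> {
    // Must contain schema.registry.url; use a serializable map such as HashMap
    private final Map<String, Object> registryConfig;
    private transient KafkaAvroDeserializer deserializer;

    public OptionDeserializationSchema(Map<String, Object> registryConfig) {
        this.registryConfig = registryConfig;
    }

    @Override
    public void open(DeserializationSchema.InitializationContext context) {
        deserializer = new KafkaAvroDeserializer();
        // With specific.avro.reader=true the deserializer returns the generated
        // SpecificRecord class matching whichever subject wrote the message
        deserializer.configure(registryConfig, false);
    }

    @Override
    public void deserialize(ConsumerRecord<byte[], byte[]> record, Collector<Option> out) throws IOException {
        Object value = deserializer.deserialize(record.topic(), record.value());
        out.collect(new Option(value, record.timestamp()));
    }

    @Override
    public TypeInformation<Option> getProducedType() {
        return TypeInformation.of(Option.class);
    }
}
```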
```java
    }
}

class RecordNameSerializer<T> implements KafkaRecordSerializationSchema<T>
```
Move this to a top-level class as well.
✅
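Similarly, a sketch of a top-level `RecordNameSerializer` built on Confluent's `RecordNameStrategy`, which is what lets each record type have its own subject within a single topic (an illustration, not the PR's exact code):

```java
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Map;

public class RecordNameSerializer<T> implements KafkaRecordSerializationSchema<T> {
    private final String topic;
    // Must contain schema.registry.url and, for one subject per record type,
    // value.subject.name.strategy = io.confluent.kafka.serializers.subject.RecordNameStrategy
    private final Map<String, Object> registryConfig;
    private transient KafkaAvroSerializer serializer;

    public RecordNameSerializer(String topic, Map<String, Object> registryConfig) {
        this.topic = topic;
        this.registryConfig = registryConfig;
    }

    @Override
    public ProducerRecord<byte[], byte[]> serialize(T element, KafkaSinkContext context, Long timestamp) {
        if (serializer == null) { // lazy init: the Confluent serializer is not serializable
            serializer = new KafkaAvroSerializer();
            serializer.configure(registryConfig, false);
        }
        return new ProducerRecord<>(topic, null, timestamp, null, serializer.serialize(topic, element));
    }
}
```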
```java
env.setRuntimeMode(RuntimeExecutionMode.STREAMING);

Properties applicationProperties = loadApplicationProperties(env).get(APPLICATION_CONFIG_GROUP);
String bootstrapServers = Preconditions.checkNotNull(applicationProperties.getProperty("bootstrap.servers"), "bootstrap.servers not defined");
```
The code building the dataflow is a bit hard to follow.
I would suggest doing what we tend to do in other examples (a sketch following these patterns appears after this comment):

- In the runtime configuration, use a PropertyGroup for each source and sink, even if some configurations are repeated.
- Instantiate each Source and Sink in a local method, passing the `Properties` which contain all the configuration for that specific component. Extract specific properties, like the topic name, within the method rather than directly in `main()`.
- Build the dataflow by just attaching the operators one after the other, using intermediate stream variables only when it helps readability.
- Avoid methods that attach operators to the dataflow. Practically, any method which expects a `DataStream` or `StreamExecutionEnvironment` as a parameter should be avoided.
- If an operator implementation, like a map or a filter, is simple, try using a lambda and inlining it. If the operator implementation is complex, externalize it to a separate class.

See examples here.

We are not following these patterns in all examples, but we are trying to converge as much as possible.
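To make the suggested structure concrete, a hedged sketch of how the job's `main()` could look under these patterns. It assumes the PR's `loadApplicationProperties` helper plus the `Option`, `OptionDeserializationSchema`, and `RecordNameSerializer` classes sketched earlier; the property-group and key names are illustrative:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class StreamingJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // One PropertyGroup per component, even if some values repeat
        Map<String, Properties> appProps = loadApplicationProperties(env);
        KafkaSource<Option> source = createSource(appProps.get("InputTopic"));
        KafkaSink<Object> sink = createSink(appProps.get("OutputTopic"));

        // Attach operators one after the other; simple logic stays inline
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "Avro source")
                .map(option -> option.value) // unwrap the container before writing out
                .sinkTo(sink);

        env.execute("avro-one-topic-many-subjects");
    }

    private static KafkaSource<Option> createSource(Properties props) {
        // Component-specific properties are extracted here, not in main()
        Map<String, Object> registryConfig = new HashMap<>();
        registryConfig.put("schema.registry.url", props.getProperty("schema.registry.url"));
        registryConfig.put("specific.avro.reader", true);
        return KafkaSource.<Option>builder()
                .setBootstrapServers(props.getProperty("bootstrap.servers"))
                .setTopics(props.getProperty("topic"))
                .setDeserializer(new OptionDeserializationSchema(registryConfig))
                .build();
    }

    private static KafkaSink<Object> createSink(Properties props) {
        Map<String, Object> registryConfig = new HashMap<>();
        registryConfig.put("schema.registry.url", props.getProperty("schema.registry.url"));
        registryConfig.put("value.subject.name.strategy",
                "io.confluent.kafka.serializers.subject.RecordNameStrategy");
        return KafkaSink.<Object>builder()
                .setBootstrapServers(props.getProperty("bootstrap.servers"))
                .setRecordSerializer(new RecordNameSerializer<>(props.getProperty("topic"), registryConfig))
                .build();
    }
}
```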
✅
@nicusX Thanks for reviewing this PR. I've addressed your points. Could you please take another look? 🙏🏾
@nicusX Do you think this is a good approach to share run configurations with IntelliJ users?