A data engineer is designing a pipeline in Databricks that processes records from a Kafka stream where late-arriving data is common.
Which approach should the data engineer use?
A. Implement a custom solution using Databricks Jobs to periodically reprocess all historical data.
B. Use batch processing and overwrite the entire output table each time to ensure late data is incorporated correctly.
C. Use an Auto CDC pipeline with batch tables to simplify late data handling.
D. Use a watermark to specify the allowed lateness to accommodate records that arrive after their expected window, ensuring correct aggregation and state management.
Correct answer: D.

In Structured Streaming, event-time watermarks control how long the engine waits for late-arriving data before finalizing aggregations. By setting an appropriate watermark, Databricks can handle late data gracefully, incorporating records that arrive within the allowed lateness while discarding excessively delayed events. This keeps aggregations accurate, bounds the size of the streaming state, and prevents unbounded memory growth.
Manual reprocessing (A) and overwriting entire datasets (B) are inefficient and costly, while Auto CDC (C) is designed for change tracking in Delta tables, not for handling event-time lateness in a stream. Watermarking (D) is therefore the recommended approach for managing late data in streaming pipelines.
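In Spark Structured Streaming the allowed lateness is declared on the streaming DataFrame with `withWatermark("eventTime", "10 minutes")` before a windowed aggregation. As a rough illustration of the semantics only (this is a plain-Python sketch, not Spark engine code, and the class name and 10-minute delay are illustrative assumptions), the snippet below simulates how a watermark decides whether a late record is still accepted:

```python
from datetime import datetime, timedelta

# Illustrative simulation of event-time watermarking (not the Spark engine).
# Watermark = max event time seen so far minus the allowed lateness; records
# whose event time falls behind the watermark are treated as too late.

ALLOWED_LATENESS = timedelta(minutes=10)  # analogous to withWatermark(..., "10 minutes")

class WatermarkFilter:
    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = None

    def accept(self, event_time):
        # The watermark only advances as newer events arrive.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        watermark = self.max_event_time - self.allowed_lateness
        # A record is kept only if it is not older than the current watermark.
        return event_time >= watermark

f = WatermarkFilter(ALLOWED_LATENESS)
t0 = datetime(2024, 1, 1, 12, 0)

print(f.accept(t0))                          # on time -> True
print(f.accept(t0 + timedelta(minutes=15)))  # advances watermark to 12:05 -> True
print(f.accept(t0 + timedelta(minutes=8)))   # 12:08 >= 12:05, late but within lateness -> True
print(f.accept(t0 + timedelta(minutes=2)))   # 12:02 < 12:05, excessively late -> False (dropped)
```

Note that the watermark never moves backwards: once a newer event has advanced it, older records are judged against the advanced watermark, which is what lets the engine finalize windows and discard their state.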