A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed:
(spark.readStream
    .format("parquet")
    .load("/mnt/raw_orders/")
    .withWatermark("time", "2 hours")
    .dropDuplicates(["customer_id", "order_id"])
    .writeStream
    .trigger(once=True)
    .table("orders")
)
Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system. If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?
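For context, the Structured Streaming documentation's pattern for bounded-state deduplication includes the watermarked event-time column among the deduplication keys so that old state can be expired. The sketch below illustrates that pattern and is not the code from the scenario above; the checkpoint path and trigger are illustrative assumptions:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # in a Databricks notebook, spark is already defined

deduped = (spark.readStream
    .format("parquet")
    .load("/mnt/raw_orders/")
    # The watermark bounds how long deduplication state is retained.
    .withWatermark("time", "2 hours")
    # Including the event-time column lets Spark drop state once the watermark passes it.
    .dropDuplicates(["customer_id", "order_id", "time"])
)

(deduped.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/orders_dedup")  # illustrative path
    .trigger(availableNow=True)
    .toTable("orders")
)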
A data pipeline uses Structured Streaming to ingest data from Kafka into Delta Lake. Data is being stored in a bronze table and includes the Kafka-generated timestamp, key, and value. Three months after the pipeline was deployed, the data engineering team noticed latency issues at certain times of the day.
A senior data engineer updates the Delta table's schema and ingestion logic to include the current timestamp (as recorded by Apache Spark) as well as the Kafka topic and partition. The team plans to use the additional metadata fields to diagnose the transient processing delays.
Which limitation will the team face while diagnosing this problem?
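For reference, a minimal sketch of the kind of ingestion logic being described, capturing the Kafka-generated timestamp alongside a Spark-recorded processing timestamp, the topic, and the partition; the broker, topic, checkpoint path, and table name are illustrative assumptions:
from pyspark.sql import functions as F

bronze = (spark.readStream          # spark: the ambient SparkSession in a Databricks notebook
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders_topic")
    .load()
    .select(
        "key",
        "value",
        F.col("timestamp").alias("kafka_timestamp"),     # Kafka-generated timestamp
        "topic",
        "partition",
        F.current_timestamp().alias("ingest_timestamp")  # recorded by Spark at processing time
    )
)

(bronze.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/bronze_orders")
    .toTable("bronze_orders")
)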
A data engineer is using the expectations feature of Lakeflow Declarative Pipelines to track the data quality of their incoming sensor data. Periodically, sensors send bad readings that are out of range; these rows are currently flagged with a warning and written to the silver table along with the good data. The team has been given a new requirement: the bad rows need to be quarantined in a separate quarantine table and no longer included in the silver table.
This is the existing code for their silver table:
@dlt.table
@dlt.expect("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")
What code will satisfy the requirements?
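One pattern that satisfies this kind of requirement (shown here only as a hedged sketch, not as the exam's expected answer text) is to drop the bad rows from the silver table and define a second table whose filter is the inverse of the expectation:
import dlt

@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    # Only in-range readings reach the silver table.
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
def quarantined_sensor_readings():
    # Inverse of the silver rule: keep only the out-of-range rows.
    return spark.readStream.table("bronze_sensor_readings").where("reading >= 120")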
An external object storage container has been mounted to the location /mnt/finance_eda_bucket.
The following logic was executed to create a database for the finance team:

After the database was successfully created and permissions configured, a member of the finance team runs the following code:

If all users on the finance team are members of the finance group, which statement describes how the tx_sales table will be created?
To reduce storage and compute costs, the data engineering team has been tasked with curating a series of aggregate tables leveraged by business intelligence dashboards, customer-facing applications, production machine learning models, and ad hoc analytical queries.
The data engineering team has been made aware of new requirements from a customer-facing application, which is the only downstream workload they manage entirely. As a result, an aggregate table used by numerous teams across the organization will need to have a number of fields renamed, and additional fields will also be added.
Which solution addresses the situation while minimally interrupting other teams in the organization and without increasing the number of tables that need to be managed?
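As background for weighing the options, a view can present renamed or additional fields on top of an existing aggregate table without adding another table to manage; a minimal sketch with purely illustrative table and column names:
spark.sql("""
    CREATE OR REPLACE VIEW app_orders_agg AS
    SELECT
        order_total    AS total_amount,   -- field renamed for the customer-facing application
        order_count    AS num_orders,     -- field renamed
        region,
        current_date() AS snapshot_date   -- additional field
    FROM shared_orders_agg
""")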
An organization processes customer data from web and mobile applications. Data includes names, emails, phone numbers, and location history. Data arrives both as batch files (from SFTP daily) and streaming JSON events (from Kafka in real-time).
To comply with data privacy policies, the following requirements must be met:
Personally Identifiable Information (PII) such as email, phone number, and IP address must be masked or anonymized before storage.
Both batch and streaming pipelines must apply consistent PII handling.
Masking logic must be auditable and reproducible.
The masked data must remain usable for downstream analytics.
How should the data engineer design a compliant data pipeline on Databricks that supports both batch and streaming modes, applies data masking to PII, and maintains traceability for audits?
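A hedged sketch of the kind of shared masking logic these requirements imply, assuming a salted-hash approach and illustrative column, path, topic, and broker names; keeping the function in one version-controlled module is what makes the batch and streaming paths consistent and auditable:
from pyspark.sql import DataFrame, functions as F

PII_COLUMNS = ["email", "phone_number", "ip_address"]  # illustrative column names

def mask_pii(df: DataFrame) -> DataFrame:
    # Salted SHA-256 keeps values joinable downstream while hiding the raw PII.
    for c in PII_COLUMNS:
        df = df.withColumn(c, F.sha2(F.concat(F.lit("static_salt_v1"), F.col(c)), 256))
    return df

# Batch path: daily SFTP drops landed to cloud storage (path is illustrative).
batch_masked = mask_pii(spark.read.json("/mnt/landing/customer_files/"))

# Streaming path: the same function applied to the Kafka feed.
event_schema = "name STRING, email STRING, phone_number STRING, ip_address STRING"
stream_masked = mask_pii(
    spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "customer_events")
        .load()
        .select(F.from_json(F.col("value").cast("string"), event_schema).alias("r"))
        .select("r.*")
)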
An analytics team wants to run a short-term experiment in Databricks SQL on the customer transactions Delta table (about 20 billion records) created by the data engineering team. Which strategy should the data engineering team use to ensure minimal downtime and no impact on the ongoing ETL processes?
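For context on isolation options, Delta supports cloning: a shallow clone creates an independent, writable table that references the source's data files without copying them, so experiments do not touch the production table. A sketch with illustrative catalog and schema names:
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics_sandbox.customer_transactions_exp
    SHALLOW CLONE prod.customer_transactions
""")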
A data engineer has created a transactions Delta table on Databricks that should be used by the analytics team. The analytics team wants to use the table with another tool that requires Apache Iceberg format.
What should the data engineer do?
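For reference, Databricks can expose Iceberg metadata for an existing Delta table through UniForm by setting table properties; a sketch with an illustrative table name, following the documented pattern (which also requires column mapping):
spark.sql("""
    ALTER TABLE transactions SET TBLPROPERTIES (
        'delta.columnMapping.mode' = 'name',
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")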
A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on Task A.
If task A fails during a scheduled run, which statement describes the results of this run?
A data engineer needs to capture the settings of an existing pipeline in the workspace and use them to create and version a JSON file that can be used to create a new pipeline.
Which command should the data engineer enter in a web terminal configured with the Databricks CLI?
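As a hedged illustration only: the Databricks CLI's pipelines command group can return an existing pipeline's settings as JSON, which could then be redirected to a file and committed to version control. The pipeline ID and file name below are placeholders:
databricks pipelines get <pipeline-id> > pipeline_settings.json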