
Pass the Databricks Certification Databricks-Certified-Professional-Data-Engineer Questions and Answers with CertsForce

Viewing page 5 out of 6 pages
Viewing questions 41-50
Question # 41:

A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed:

(spark.readStream
    .format("parquet")
    .load("/mnt/raw_orders/")
    .withWatermark("time", "2 hours")
    .dropDuplicates(["customer_id", "order_id"])
    .writeStream
    .trigger(once=True)
    .table("orders")
)

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system. If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?

Options:

A.

The orders table will not contain duplicates, but records arriving more than 2 hours late will be ignored and missing from the table.


B.

The orders table will contain only the most recent 2 hours of records and no duplicates will be present.


C.

All records will be held in the state store for 2 hours before being deduplicated and committed to the orders table.


D.

Duplicate records enqueued more than 2 hours apart may be retained and the orders table may contain duplicate records with the same customer_id and order_id.


Expert Solution
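
Study note: the 2-hour watermark bounds how long dropDuplicates keeps deduplication state, so duplicate entries enqueued farther apart than the watermark can both survive. If exact uniqueness is required regardless of lateness, a batch-style dedup over the full table is one option. A minimal sketch, reusing the question's column names (assumes a Databricks notebook where spark is predefined):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Keep the earliest record per composite key; unlike watermarked
# streaming dedup, this pass sees the whole table, so duplicates
# enqueued hours apart are still collapsed.
w = Window.partitionBy("customer_id", "order_id").orderBy(F.col("time").asc())
deduped = (
    spark.read.table("orders")
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn")
)
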
Question # 42:

A data pipeline uses Structured Streaming to ingest data from Kafka into Delta Lake. Data is being stored in a bronze table, and includes the Kafka-generated timestamp, key, and value. Three months after the pipeline was deployed, the data engineering team noticed latency issues during certain times of the day.

A senior data engineer updates the Delta table's schema and ingestion logic to include the current timestamp (as recorded by Apache Spark) as well as the Kafka topic and partition. The team plans to use the additional metadata fields to diagnose the transient processing delays:

Which limitation will the team face while diagnosing this problem?

Options:

A.

New fields will not be computed for historic records.


B.

Updating the table schema will invalidate the Delta transaction log metadata.


C.

Updating the table schema requires a default value provided for each file added.


D.

Spark cannot capture the topic and partition fields from the Kafka source.


Expert Solution
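
Study note: a sketch of the updated ingestion described above; the broker address and topic name are placeholders. The Kafka source already exposes topic and partition columns, and the Spark-side timestamp is computed at read time, which is why none of these values can exist for records ingested before the schema change:

from pyspark.sql import functions as F

bronze = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder address
    .option("subscribe", "events")                     # placeholder topic
    .load()
    .select(
        "timestamp", "key", "value",  # original bronze fields
        "topic", "partition",         # metadata exposed by the Kafka source
        F.current_timestamp().alias("processing_time"),  # Spark-recorded time
    )
)
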
Question # 43:

A data engineer is using the Lakeflow Declarative Pipelines Expectations feature to track the data quality of their incoming sensor data. Periodically, sensors send bad readings that are out of range; these rows are currently flagged with a warning and written to the silver table along with the good data. The team has been given a new requirement: the bad rows need to be quarantined in a separate quarantine table and no longer included in the silver table.

This is the existing code for their silver table:

@dlt.table
@dlt.expect("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

What code will satisfy the requirements?

Options:

A.

@dlt.table
@dlt.expect("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")


B.

@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect("invalid_sensor_reading", "reading < 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")


C.

@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect_or_drop("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")


D.

@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")


Expert Solution
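
Study note: besides expectations, a quarantine table can also be expressed as a plain filter on the same bronze source. A minimal sketch using the question's table and column names, with the inverse predicate routing only the out-of-range rows:

import dlt
from pyspark.sql import functions as F

@dlt.table
def quarantined_sensor_readings():
    # Inverse of the silver predicate: only bad readings land here.
    return (spark.readStream.table("bronze_sensor_readings")
            .where(F.col("reading") >= 120))
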
Question # 44:

An external object storage container has been mounted to the location /mnt/finance_eda_bucket.

The following logic was executed to create a database for the finance team:

[code image not preserved]

After the database was successfully created and permissions configured, a member of the finance team runs the following code:

[code image not preserved]

If all users on the finance team are members of the finance group, which statement describes how the tx_sales table will be created?

Options:

A.

A logical table will persist the query plan to the Hive Metastore in the Databricks control plane.


B.

An external table will be created in the storage container mounted to /mnt/finance_eda_bucket.


C.

A logical table will persist the physical plan to the Hive Metastore in the Databricks control plane.


D.

A managed table will be created in the storage container mounted to /mnt/finance_eda_bucket.


E.

A managed table will be created in the DBFS root storage container.


Expert Solution
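
Study note: the question's code images are not preserved, but the pattern being tested can be sketched as follows; the database name, location, and source table are assumptions, not the original code. When a database is created with a LOCATION on the mounted container, managed tables created in it store their data under that location rather than the DBFS root:

spark.sql("""
    CREATE DATABASE IF NOT EXISTS finance_eda
    LOCATION '/mnt/finance_eda_bucket/finance_eda.db'
""")

# A CTAS with no explicit LOCATION creates a managed table; its files
# go under the database's LOCATION shown above.
spark.sql("""
    CREATE TABLE finance_eda.tx_sales AS
    SELECT * FROM sales_raw  -- source table name is a placeholder
""")
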
Question # 45:

To reduce storage and compute costs, the data engineering team has been tasked with curating a series of aggregate tables leveraged by business intelligence dashboards, customer-facing applications, production machine learning models, and ad hoc analytical queries.

The data engineering team has been made aware of new requirements from a customer-facing application, which is the only downstream workload they manage entirely. As a result, an aggregate table used by numerous teams across the organization will need to have a number of fields renamed, and additional fields will also be added.

Which of the solutions addresses the situation while minimally interrupting other teams in the organization without increasing the number of tables that need to be managed?

Options:

A.

Send all users notice that the schema for the table will be changing; include in the communication the logic necessary to revert the new table schema to match historic queries.


B.

Configure a new table with all the requisite fields and new names and use this as the source for the customer-facing application; create a view that maintains the original data schema and table name by aliasing select fields from the new table.


C.

Create a new table with the required schema and new fields and use Delta Lake's deep clone functionality to keep changes committed to one table in sync with the corresponding table.


D.

Replace the current table definition with a logical view defined with the query logic currently writing the aggregate table; create a new table to power the customer-facing application.


E.

Add a table comment warning all users that the table schema and field names will be changing on a given date; overwrite the table in place to the specifications of the customer-facing application.


Expert Solution
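
Study note: one way to rename fields without breaking existing readers is to move the data to a new table and keep the original table name alive as a view that aliases the new columns back to the old names. A sketch with hypothetical table and column names:

# Hypothetical names throughout; the view preserves the legacy contract
# while the new table serves the customer-facing application.
spark.sql("""
    CREATE OR REPLACE VIEW agg_orders AS
    SELECT
        order_total  AS total,    -- new field aliased to its old name
        customer_key AS cust_id,
        region
    FROM agg_orders_v2            -- new table with renamed/added fields
""")
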
Question # 46:

An organization processes customer data from web and mobile applications. Data includes names, emails, phone numbers, and location history. Data arrives both as batch files (from SFTP daily) and streaming JSON events (from Kafka in real-time).

To comply with data privacy policies, the following requirements must be met:

    Personally Identifiable Information (PII) such as email, phone number, and IP address must be masked or anonymized before storage.

    Both batch and streaming pipelines must apply consistent PII handling.

    Masking logic must be auditable and reproducible.

    The masked data must remain usable for downstream analytics.

How should the data engineer design a compliant data pipeline on Databricks that supports both batch and streaming modes, applies data masking to PII, and maintains traceability for audits?

Options:

A.

Allow PII to be stored unmasked in Bronze for lineage tracking, then apply masking logic in Gold tables used for reporting.


B.

Load batch data with notebooks and ingest streaming data with SQL Warehouses; use Unity Catalog column masks on Silver tables to redact fields after storage.


C.

Ingest both batch and streaming data using Lakeflow Declarative Pipelines, and apply masking via Unity Catalog column masks at read time to avoid modifying the data during ingestion.


D.

Use Lakeflow Declarative Pipelines for batch and streaming ingestion, define a PII masking function, and apply it during Bronze ingestion before writing to Delta Lake.


Expert Solution
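
Study note: a minimal sketch of a shared masking function applied at Bronze ingestion; column and table names are assumed, and the salt would normally come from a secret scope. Because one function serves both batch and streaming inputs, PII handling stays consistent, and versioning the function with the pipeline code keeps it auditable:

import dlt
from pyspark.sql import DataFrame, functions as F

def mask_pii(df: DataFrame) -> DataFrame:
    # Deterministic salted hashing keeps masked values joinable for
    # analytics while remaining reproducible for audits.
    salt = F.lit("pipeline-salt")  # placeholder; use a secret in practice
    for c in ("email", "phone", "ip_address"):
        df = df.withColumn(c, F.sha2(F.concat(F.col(c), salt), 256))
    return df

@dlt.table
def bronze_events():
    # Streaming path shown; a daily batch path can call mask_pii unchanged.
    return mask_pii(spark.readStream.table("raw_events"))
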
Question # 47:

An analytics team wants to run a short-term experiment in Databricks SQL on the customer transactions Delta table (about 20 billion records) created by the data engineering team. Which strategy should the data engineering team use to ensure minimal downtime and no impact on the ongoing ETL processes?

Options:

A.

Create a new table for the analytics team using a CTAS statement.


B.

Deep clone the table for the analytics team.


C.

Give the analytics team direct access to the production table.


D.

Shallow clone the table for the analytics team.


Expert Solution
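
Study note: a shallow clone copies only table metadata, so it is created almost instantly even at 20 billion rows, and the analytics team's writes never touch production data files. A sketch with assumed schema and table names:

# Metadata-only copy: no data files are duplicated, and the ongoing ETL
# against the production table is unaffected.
spark.sql("""
    CREATE TABLE analytics.customer_transactions_exp
    SHALLOW CLONE prod.customer_transactions
""")
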
Question # 48:

A data engineer has created a transactions Delta table on Databricks that should be used by the analytics team. The analytics team wants to use the table with another tool that requires Apache Iceberg format.

What should the data engineer do?

Options:

A.

Require the analytics team to use a tool that supports Delta table.


B.

Enable UniForm on the transactions table with format 'iceberg' so that the table can be read as an Iceberg table.


C.

Create an Iceberg copy of the transactions Delta table which can be used by the analytics team.


D.

Convert the transactions Delta table to Iceberg and enable UniForm so that the table can be read as a Delta table.


Expert Solution
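
Study note: Delta UniForm is switched on through table properties, after which Iceberg metadata is generated alongside the Delta log so Iceberg clients can read the same table. A sketch using the property names from the UniForm documentation (table name assumed):

# Column mapping is a prerequisite for UniForm's Iceberg support.
spark.sql("""
    ALTER TABLE transactions SET TBLPROPERTIES (
        'delta.columnMapping.mode' = 'name',
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
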
Question # 49:

A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on Task A.

If task A fails during a scheduled run, which statement describes the results of this run?

Options:

A.

Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.


B.

Tasks B and C will attempt to run as configured; any changes made in task A will be rolled back due to task failure.


C.

Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task A failed, all commits will be rolled back automatically.


D.

Tasks B and C will be skipped; some logic expressed in task A may have been committed before task failure.


E.

Tasks B and C will be skipped; task A will not commit any changes because of stage failure.


Expert Solution
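
Study note: the described graph, expressed as a Jobs API 2.1-style task list (a Python dict sketch; task keys are assumed). Because task_b and task_c declare depends_on task_a, a failure in task_a leaves them unscheduled, while anything task_a committed before failing remains in place:

# Sketch of the job's dependency graph as a Jobs API payload fragment.
job_config = {
    "tasks": [
        {"task_key": "task_a"},
        {"task_key": "task_b", "depends_on": [{"task_key": "task_a"}]},
        {"task_key": "task_c", "depends_on": [{"task_key": "task_a"}]},
    ]
}
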
Question # 50:

A data engineer needs to capture pipeline settings from an existing pipeline in the workspace and use them to create and version a JSON file that will be used to create a new pipeline.

Which command should the data engineer enter in a web terminal configured with the Databricks CLI?

Options:

A.

Use the get command to capture the settings for the existing pipeline; remove the pipeline_id and rename the pipeline; use this in a create command


B.

Stop the existing pipeline; use the returned settings in a reset command


C.

Use the clone command to create a copy of an existing pipeline; use the get JSON command to get the pipeline definition; save this to git


D.

Use list pipelines to get the specs for all pipelines; parse the pipeline spec from the returned results and use this to create a pipeline


Expert Solution
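
Study note: the get-edit-create flow can also be sketched against the Pipelines REST API that the CLI wraps; the host, token, and pipeline id below are placeholders, and the exact response shape should be checked against the API docs:

import json
import requests

host = "https://<workspace-host>"              # placeholder
headers = {"Authorization": "Bearer <token>"}  # placeholder

# Capture the existing pipeline's settings.
resp = requests.get(f"{host}/api/2.0/pipelines/<pipeline-id>", headers=headers)
spec = resp.json()["spec"]

spec.pop("id", None)                   # drop the old pipeline's id
spec["name"] = spec["name"] + "_copy"  # rename before re-creating

# Version the settings as a JSON file (e.g., commit it to git).
with open("pipeline.json", "w") as f:
    json.dump(spec, f, indent=2)

# Create the new pipeline from the saved spec.
requests.post(f"{host}/api/2.0/pipelines", headers=headers, json=spec)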