
Pass the Databricks Certification Databricks-Certified-Professional-Data-Engineer questions and answers with CertsForce

Viewing page 4 of 6 (questions 31-40)
Question # 31:

A data engineer is designing a pipeline in Databricks that processes records from a Kafka stream where late-arriving data is common.

Which approach should the data engineer use?

Options:

A.

Implement a custom solution using Databricks Jobs to periodically reprocess all historical data.


B.

Use batch processing and overwrite the entire output table each time to ensure late data is incorporated correctly.


C.

Use an Auto CDC pipeline with batch tables to simplify late data handling.


D.

Use a watermark to specify the allowed lateness to accommodate records that arrive after their expected window, ensuring correct aggregation and state management.


Expert Solution
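A watermark (option D) is the documented Structured Streaming mechanism for bounding late data: it tells the engine how long to retain aggregation state before dropping records that arrive too late. A minimal PySpark sketch; the rate source, column names, and the 15-minute threshold are illustrative assumptions, not from the question:

    from pyspark.sql.functions import window

    # Toy stream so the sketch runs end to end (the rate source emits a 'timestamp').
    events = (spark.readStream.format("rate").option("rowsPerSecond", "10").load()
              .withColumnRenamed("timestamp", "event_time"))

    counts = (events
        .withWatermark("event_time", "15 minutes")   # accept records up to 15 minutes late
        .groupBy(window("event_time", "5 minutes"))  # non-overlapping 5-minute windows
        .count())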
Question # 32:

A data engineer has a Delta table orders with deletion vectors enabled. The engineer executes the following command:

DELETE FROM orders WHERE status = 'cancelled';

What should be the behavior of deletion vectors when the command is executed?

Options:

A.

Rows are marked as deleted both in metadata and in files.


B.

Delta automatically removes all cancelled orders permanently.


C.

Files are physically rewritten without the deleted rows.


D.

Rows are marked as deleted in metadata, not in files.


Expert Solution
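Background for this question: with deletion vectors enabled, a DELETE writes a small deletion-vector file that marks the affected rows as removed in table metadata; the underlying data files are not rewritten until a later maintenance operation. A hedged Python sketch of that lifecycle (the table property is the documented Delta property; everything else mirrors the question):

    # One-time: enable deletion vectors on the table.
    spark.sql("ALTER TABLE orders SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')")

    # Soft delete: rows are marked in a deletion vector; data files stay untouched.
    spark.sql("DELETE FROM orders WHERE status = 'cancelled'")

    # Physical rewrite happens later, e.g. when OPTIMIZE compacts the files.
    spark.sql("OPTIMIZE orders")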
Question # 33:

A junior member of the data engineering team is exploring the language interoperability of Databricks notebooks. The intended outcome of the code below is to register a view of all sales that occurred in countries on the continent of Africa that appear in the geo_lookup table.

Before executing the code, running SHOW TABLES on the current database indicates the database contains only two tables: geo_lookup and sales.

[Code cells (Cmd 1 and Cmd 2) not reproduced in this dump.]

Which statement correctly describes the outcome of executing these command cells in order in an interactive notebook?

Options:

A.

Both commands will succeed. Executing show tables will show that countries_af and sales_af have been registered as views.


B.

Cmd 1 will succeed. Cmd 2 will search all accessible databases for a table or view named countries_af; if this entity exists, Cmd 2 will succeed.


C.

Cmd 1 will succeed and Cmd 2 will fail; countries_af will be a Python variable representing a PySpark DataFrame.


D.

Both commands will fail. No new variables, tables, or views will be created.


E.

Cmd 1 will succeed and Cmd 2 will fail; countries_af will be a Python variable containing a list of strings.


Expert Solution
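Since the code cells are not reproduced above, the following is only a plausible reconstruction of the pattern this question tests, with names taken from the option text: Cmd 1 (Python) collects a list of strings, and Cmd 2 (SQL) then fails because a SQL cell cannot reference a Python variable:

    # Cmd 1 (Python): countries_af becomes a plain Python list of strings,
    # not a table or view.
    countries_af = [row.country for row in
                    spark.table("geo_lookup")
                         .filter("continent = 'AF'")
                         .select("country")
                         .collect()]

    # Cmd 2 (%sql): fails; SQL has no visibility into Python variables.
    # CREATE VIEW sales_af AS
    # SELECT * FROM sales WHERE country IN countries_af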
Question # 34:

Two data engineers are working on the same Databricks notebook in separate branches. Both have edited the same section of code. When one tries to merge the other’s branch into their own using the Databricks Git folders UI, a merge conflict occurs on that notebook file. The UI highlights the conflict and presents options for resolution.

How should the data engineers resolve this merge conflict using Databricks Git folders?

Options:

A.

Abort the merge, discard all local changes, and try the merge operation again without reviewing the conflicting code.


B.

Delete the conflicted notebook file via the Databricks workspace UI, commit the deletion, and recreate the notebook from scratch in a new commit to bypass the conflict entirely.


C.

Use the Git CLI in the cluster’s web terminal to force-push the conflicted merge (git push --force), overriding the remote branch with the local version and discarding changes.


D.

Use the Git folders UI to manually edit the notebook file, selecting the desired lines from both versions and removing the conflict markers, then mark the conflict as resolved.


Expert Solution
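For reference, resolving in place (option D) is the standard Git workflow: the conflicted notebook source shows both versions between conflict markers, and resolution means keeping the desired lines and deleting the markers. An illustrative fragment (the surrounding code is invented for illustration):

    <<<<<<< HEAD
    df = spark.table("sales").filter("region = 'EMEA'")
    =======
    df = spark.table("sales").filter("region = 'APAC'")
    >>>>>>> feature-branch

    # After resolution (markers removed, desired line kept):
    df = spark.table("sales").filter("region = 'EMEA'")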
Question # 35:

A data engineering team is setting up deployment automation. To deploy workspace assets remotely using the Databricks CLI, they must configure it with proper authentication.

Which authentication approach will provide the highest level of security?

Options:

A.

Use a service principal with OAuth token federation.


B.

Use a service principal ID and its OAuth client secret.


C.

Use a service principal and its Personal Access Token.


D.

Use a shared user account and its OAuth client secret.


Expert Solution
Question # 36:

A data engineer wants to ingest a large collection of image files (JPEG and PNG) from cloud object storage into a Unity Catalog–managed table for analysis and visualization.

Which two configurations and practices are recommended to incrementally ingest these images into the table? (Choose 2 answers)

Options:

A.

Move files to a volume and read with SQL editor.


B.

Use Auto Loader and set cloudFiles.format to "binaryFile".


C.

Use Auto Loader and set cloudFiles.format to "text".


D.

Use Auto Loader and set cloudFiles.format to "image".


E.

Use the pathGlobFilter option to select only image files (e.g., "*.jpg,*.png").


Expert Solution
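For orientation: Auto Loader has no image format, so binary files such as JPEG/PNG are ingested with the binaryFile format, and pathGlobFilter narrows ingestion to matching files. A minimal sketch; the storage path, glob spelling, checkpoint location, and target table name are illustrative assumptions:

    # Incrementally ingest images as binary content plus file metadata.
    images = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "binaryFile")
        .option("pathGlobFilter", "*.{jpg,png}")   # Hadoop-style glob alternation
        .load("s3://example-bucket/raw/images/"))

    (images.writeStream
        .option("checkpointLocation", "/Volumes/main/default/checkpoints/images")
        .trigger(availableNow=True)                # process available files, then stop
        .toTable("main.default.images_bronze"))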
Question # 37:

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data.

Streaming DataFrame df has the following schema:

" device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT "

Code block:

[Code block not reproduced in this dump.]

Choose the response that correctly fills in the blank within the code block to complete this task.

Options:

A.

withWatermark("event_time", "10 minutes")


B.

awaitArrival("event_time", "10 minutes")


C.

await("event_time + '10 minutes'")


D.

slidingWindow("event_time", "10 minutes")


E.

delayWrite("event_time", "10 minutes")


Expert Solution
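Of the five calls offered, only withWatermark is a real Structured Streaming API; the others are distractors. Because the code block image is not reproduced above, the following is a hedged reconstruction of the full pipeline implied by the question's requirements (only the alias names are invented):

    from pyspark.sql.functions import avg, window

    result = (df
        .withWatermark("event_time", "10 minutes")    # retain state 10 min for late data
        .groupBy(window("event_time", "5 minutes"))   # non-overlapping 5-minute windows
        .agg(avg("humidity").alias("avg_humidity"),
             avg("temp").alias("avg_temp")))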
Question # 38:

A data architect has designed a system in which two Structured Streaming jobs will concurrently write to a single bronze Delta table. Each job is subscribing to a different topic from an Apache Kafka source, but they will write data with the same schema. To keep the directory structure simple, a data engineer has decided to nest a checkpoint directory to be shared by both streams.

The proposed directory structure is displayed below:

[Proposed directory structure not reproduced in this dump: a single checkpoint directory nested under the table directory and shared by both streams.]

Which statement describes whether this checkpoint directory structure is valid for the given scenario and why?

Options:

A.

No; Delta Lake manages streaming checkpoints in the transaction log.


B.

Yes; both of the streams can share a single checkpoint directory.


C.

No; only one stream can write to a Delta Lake table.


D.

Yes; Delta Lake supports infinite concurrent writers.


E.

No; each of the streams needs to have its own checkpoint directory.


Expert Solution
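Context for this question: Structured Streaming requires a dedicated checkpoint location per query, even when multiple queries write to the same Delta table, because each checkpoint tracks one query's offsets and state. A minimal sketch of the corrected layout; the broker address, topic names, and paths are illustrative assumptions:

    def start_bronze_stream(topic, checkpoint):
        return (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
            .option("subscribe", topic)
            .load()
            .writeStream
            .option("checkpointLocation", checkpoint)          # unique per stream
            .toTable("bronze"))

    q1 = start_bronze_stream("topic_a", "/checkpoints/bronze/topic_a")
    q2 = start_bronze_stream("topic_b", "/checkpoints/bronze/topic_b")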
Question # 39:

A Delta Lake table named customer_churn_params, with Change Data Feed (CDF) enabled, is used for churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting it with the current valid values derived from those sources. The churn prediction model is fairly stable in production, and the ML team is only interested in making predictions on records that have changed in the past 24 hours.

Which approach would simplify the identification of these changed records?

Options:

A.

Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.


B.

Modify the overwrite logic to include a field populated by calling current_timestamp() as data are being written; use this field to identify records written on a particular date.


C.

Replace the current overwrite logic with a MERGE statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the Change Data Feed.


D.

Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.


Expert Solution
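Background: CDF records row-level changes only when rows actually change, which a MERGE produces and a nightly full overwrite does not. A hedged sketch of the pattern described in option C; the source table, join key, comparison column, and timestamp are assumptions for illustration:

    # Upsert only changed records so CDF captures genuine row-level changes.
    spark.sql("""
        MERGE INTO customer_churn_params AS t
        USING upstream_updates AS s
        ON t.customer_id = s.customer_id
        WHEN MATCHED AND t.params <> s.params THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

    # Read only the last 24 hours of changes for scoring.
    changes = (spark.read
        .option("readChangeFeed", "true")
        .option("startingTimestamp", "2024-06-01 00:00:00")  # i.e. now minus 24 hours
        .table("customer_churn_params"))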
Question # 40:

When evaluating the Ganglia Metrics for a given cluster with 3 executor nodes, which indicator would signal proper utilization of the VM's resources?

Options:

A.

The Five-Minute Load Average remains consistent/flat


B.

Bytes Received never exceeds 80 million bytes per second


C.

Network I/O never spikes


D.

Total Disk Space remains constant


E.

CPU Utilization is around 75%


Expert Solution