
Pass the Databricks Certification Databricks-Certified-Professional-Data-Engineer questions and answers with CertsForce

Viewing page 3 out of 6 pages
Viewing questions 21-30
Question # 21:

A data engineer is configuring Delta Sharing for a Databricks-to-Databricks scenario to optimize read performance. The recipient needs to perform time travel queries and streaming reads on shared sales data.

Which configuration will provide the optimal performance while enabling these capabilities?

Options:

A.

Share tables WITH HISTORY, ensure tables don't have partitioning enabled, and enable CDF before sharing.


B.

Share tables WITHOUT HISTORY and enable partitioning for better query performance.


C.

Share the entire schema WITHOUT HISTORY and rely on recipient-side caching for performance.


D.

Use the open sharing protocol instead of Databricks-to-Databricks sharing for better performance.
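
For reference, a minimal sketch of how history sharing is typically enabled on the provider side of a Databricks-to-Databricks share, and how a recipient can then query it; the share, table, and catalog names (sales_share, prod.sales.orders, provider_catalog) are illustrative assumptions, not part of the question:

# Provider side: add the table to the share together with its history so that
# recipients can run time travel queries and streaming reads against it.
spark.sql("""
    ALTER SHARE sales_share
    ADD TABLE prod.sales.orders
    WITH HISTORY
""")

# Recipient side (Databricks-to-Databricks): the shared table behaves like a
# regular Delta table, so time travel and streaming reads are possible.
historical_df = spark.read.option("versionAsOf", 5).table("provider_catalog.sales.orders")
orders_stream = spark.readStream.table("provider_catalog.sales.orders")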


Question # 22:

A data engineer is designing a Lakeflow Declarative Pipeline to process streaming order data. The pipeline uses Auto Loader to ingest data and must enforce data quality by ensuring customer_id is not null and amount is greater than zero. Invalid records should be dropped.

Which Lakeflow Declarative Pipelines configuration implements this requirement using Python?

Options:

A.

@dlt.table
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
        .expect_or_drop("valid_customer", "customer_id IS NOT NULL")
        .expect_or_drop("valid_amount", "amount > 0")
    )


B.

@dlt.table
@dlt.expect("valid_customer", "customer_id IS NOT NULL")
@dlt.expect("valid_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders")


C.

@dlt.table
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
        .expect("valid_customer", "customer_id IS NOT NULL")
        .expect("valid_amount", "amount > 0")
    )


D.

@dlt.table
@dlt.expect_or_drop("valid_customer", "customer_id IS NOT NULL")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders")
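
For context, a minimal sketch of how Auto Loader ingestion and expectation decorators typically compose in a declarative pipeline; the source path and table names are illustrative assumptions, not taken from the question:

import dlt

# Bronze layer: Auto Loader incrementally ingests raw JSON order files.
@dlt.table
def bronze_orders():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/default/raw_orders/")   # illustrative path
    )

# Silver layer: expectations are declared as decorators on the table function;
# with expect_or_drop, rows violating a constraint are dropped rather than
# failing the update.
@dlt.table
@dlt.expect_or_drop("valid_customer", "customer_id IS NOT NULL")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders")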


Question # 23:

The data governance team is reviewing user requests to delete records for compliance with GDPR. The following logic has been implemented to propagate delete requests from the user_lookup table to the user_aggregates table.

[Image: delete-propagation logic referenced above; not reproduced]

Assuming that user_id is a unique identifying key and that all users who have requested deletion have been removed from the user_lookup table, which statement describes whether successfully executing the above logic guarantees that the records to be deleted from the user_aggregates table are no longer accessible, and why?

Options:

A.

No: files containing deleted records may still be accessible with time travel until a VACUUM command is used to remove invalidated data files.


B.

Yes: Delta Lake ACID guarantees provide assurance that the DELETE command succeeded and fully and permanently purged these records.


C.

No: the change data feed only tracks inserts and updates, not deleted records.


D.

No: the Delta Lake DELETE command only provides ACID guarantees when combined with the MERGE INTO command
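
For context, a hedged sketch of the Delta Lake behavior these options hinge on: a DELETE removes rows from the current table version, but the underlying data files stay reachable through time travel until VACUUM removes them (the table name, predicate, version, and retention window below are illustrative):

# DELETE rewrites data files, but earlier table versions still reference the old ones.
spark.sql("DELETE FROM user_aggregates WHERE user_id = 42")

# Pre-delete versions, including the removed rows, remain readable via time travel...
spark.sql("SELECT * FROM user_aggregates VERSION AS OF 10").show()

# ...until VACUUM physically deletes files outside the retention window.
spark.sql("VACUUM user_aggregates RETAIN 168 HOURS")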


Question # 24:

The data engineering team is configuring environments for development, testing, and production before beginning migration of a new data pipeline. The team requires extensive testing on both the code and the data resulting from code execution, and the team wants to develop and test against data as similar to production data as possible.

A junior data engineer suggests that production data can be mounted to the development and testing environments, allowing pre-production code to execute against production data. Because all users have Admin privileges in the development environment, the junior data engineer has offered to configure permissions and mount this data for the team.

Which statement captures best practices for this situation?

Options:

A.

Because access to production data will always be verified using passthrough credentials, it is safe to mount data to any Databricks development environment.


B.

All development, testing, and production code and data should exist in a single unified workspace; creating separate environments for testing and development further reduces risks.


C.

In environments where interactive code will be executed, production data should only be accessible with read permissions; creating isolated databases for each environment further reduces risks.


D.

Because Delta Lake versions all data and supports time travel, it is not possible for user error or malicious actors to permanently delete production data; as such, it is generally safe to mount production data anywhere.
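
As a point of reference, a minimal sketch (catalog, schema, and group names are illustrative assumptions) of granting read-only access to production data for use from a non-production environment, in line with the principle that interactive environments should only be able to read, never modify, production data:

# Allow the development group to discover and query, but not modify, production tables.
spark.sql("GRANT USE CATALOG ON CATALOG prod TO `dev_engineers`")
spark.sql("GRANT USE SCHEMA ON SCHEMA prod.sales TO `dev_engineers`")
spark.sql("GRANT SELECT ON SCHEMA prod.sales TO `dev_engineers`")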


Question # 25:

A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max Durations for tasks in a particular stage show the minimum and median time to complete a task as roughly the same, but the max duration for a task is roughly 100 times as long as the minimum.

Which situation is causing increased duration of the overall job?

Options:

A.

Task queueing resulting from improper thread pool assignment.


B.

Spill resulting from attached volume storage being too small.


C.

Network latency due to some cluster nodes being in different regions from the source data


D.

Skew caused by more data being assigned to a subset of Spark partitions.


E.

Credential validation errors while pulling data from an external system.
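
For context, a hedged sketch of how such an imbalance can be confirmed from a notebook and how adaptive query execution can mitigate it; df is a stand-in for the DataFrame processed by the slow stage:

from pyspark.sql import functions as F

# Count rows per Spark partition; a few partitions holding far more rows than
# the rest matches the min/median vs. max task-duration pattern described above.
(df.groupBy(F.spark_partition_id().alias("partition_id"))
   .count()
   .orderBy(F.col("count").desc())
   .show())

# Adaptive Query Execution can split skewed shuffle partitions automatically.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")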


Question # 26:

A data engineering team uses Databricks Lakehouse Monitoring to track the percent_null metric for a critical column in their Delta table.

The profile metrics table (prod_catalog.prod_schema.customer_data_profile_metrics) stores hourly percent_null values.

The team wants to:

    Trigger an alert when the daily average of percent_null exceeds 5% for three consecutive days.

    Ensure that notifications are not spammed during sustained issues.

Which alert query and notification configuration meets these requirements?

Options:

A.

SELECT percent_null
FROM prod_catalog.prod_schema.customer_data_profile_metrics
WHERE window.end >= CURRENT_TIMESTAMP - INTERVAL '1' DAY

Alert Condition: percent_null > 5
Notification Frequency: At most every 24 hours


B.

WITH daily_avg AS (
    SELECT DATE_TRUNC('DAY', window.end) AS day,
           AVG(percent_null) AS avg_null
    FROM prod_catalog.prod_schema.customer_data_profile_metrics
    GROUP BY DATE_TRUNC('DAY', window.end)
)
SELECT day, avg_null
FROM daily_avg
ORDER BY day DESC
LIMIT 3

Alert Condition: ALL avg_null > 5 for the latest 3 rows
Notification Frequency: Just once


C.

SELECT AVG(percent_null) AS daily_avg
FROM prod_catalog.prod_schema.customer_data_profile_metrics
WHERE window.end >= CURRENT_TIMESTAMP - INTERVAL '3' DAY

Alert Condition: daily_avg > 5
Notification Frequency: Each time alert is evaluated


D.

SELECT SUM(CASE WHEN percent_null > 5 THEN 1 ELSE 0 END) AS violation_days
FROM prod_catalog.prod_schema.customer_data_profile_metrics
WHERE window.end >= CURRENT_TIMESTAMP - INTERVAL '3' DAY

Alert Condition: violation_days >= 3
Notification Frequency: Just once


Question # 27:

Given the following error traceback (from display(df.select(3*"heartrate"))), which shows AnalysisException: cannot resolve 'heartrateheartrateheartrate', which statement describes the error being raised?

Options:

A.

There is a type error because a DataFrame object cannot be multiplied.


B.

There is a syntax error because the heartrate column is not correctly identified as a column.


C.

There is no column in the table named heartrateheartrateheartrate.


D.

There is a type error because a column object cannot be multiplied.
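
For context, a short sketch of why the expression fails; df is the DataFrame from the question, and the corrected expression is only an assumption about the intended calculation:

# In Python, multiplying a string repeats it, so the argument passed to
# select() is the (non-existent) column name 'heartrateheartrateheartrate'.
repeated_name = 3 * "heartrate"   # 'heartrateheartrateheartrate'

# To multiply the column's values instead, reference the column object:
from pyspark.sql import functions as F
display(df.select((3 * F.col("heartrate")).alias("heartrate_x3")))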


Question # 28:

In order to facilitate near real-time workloads, a data engineer is creating a helper function to leverage the schema detection and evolution functionality of Databricks Auto Loader. The desired function will automatically detect the schema of the source directory, incrementally process JSON files as they arrive in the source directory, and automatically evolve the schema of the table when new fields are detected.

The function is displayed below with a blank:

[Image: helper function with the blank to fill; not reproduced]

Which response correctly fills in the blank to meet the specified requirements?

[Image not reproduced]

Options:

A.

Option A


B.

Option B


C.

Option C


D.

Option D


E.

Option E
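
Because the helper function and the answer options are only available as images, here is a hedged sketch (not one of the exam's options) of the general Auto Loader pattern the question describes, with schema inference and evolution; the parameter names and option values are illustrative assumptions:

def ingest_json_stream(source_path, checkpoint_path, target_table):
    # Auto Loader infers the schema, tracks it at schemaLocation, and with
    # addNewColumns evolves it when new fields appear in arriving JSON files.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", checkpoint_path)
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
        .load(source_path)
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        .option("mergeSchema", "true")
        .trigger(availableNow=True)
        .toTable(target_table)
    )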


Question # 29:

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating the job run request was submitted successfully includes a field run_id. Which statement describes what the number alongside this field represents?

Options:

A.

The job_id and number of times the job has been run are concatenated and returned.


B.

The globally unique ID of the newly triggered run.


C.

The job_id is returned in this field.


D.

The number of times the job definition has been run in this workspace.
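
For context, a hedged sketch of the REST call that the CLI command wraps; the host, token, and job_id values are illustrative placeholders:

import requests

# POST /api/2.1/jobs/run-now triggers a run of an existing job.
resp = requests.post(
    "https://<workspace-host>/api/2.1/jobs/run-now",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={"job_id": 12345},
)
# The response JSON contains run_id, which identifies the newly triggered run
# and can be passed to the runs/get endpoint to poll that run's status.
print(resp.json())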


Question # 30:

The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE".

[Image: code applying the MLflow production model to produce the preds DataFrame; not reproduced]

The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day.

Which code block accomplishes this task while minimizing potential compute costs?

Options:

A.

preds.write.mode("append").saveAsTable("churn_preds")


B.

preds.write.format("delta").save("/preds/churn_preds")

C.

[Image-based option; code not reproduced in this page extract]


D.

[Image-based option; code not reproduced in this page extract]


E.

[Image-based option; code not reproduced in this page extract]
