
Pass the Databricks Certification Databricks-Certified-Professional-Data-Engineer Questions and Answers with CertsForce

Viewing page 1 of 6 (questions 1-10)
Question # 1:

A data engineer, User A, has promoted a new pipeline to production by using the REST API to programmatically create several jobs. A DevOps engineer, User B, has configured an external orchestration tool to trigger job runs through the REST API. Both users authorized the REST API calls using their personal access tokens.

Which statement describes the contents of the workspace audit logs concerning these events?

Options:

A.

Because the REST API was used for job creation and triggering runs, a Service Principal will be automatically used to identify these events.


B.

Because User B last configured the jobs, their identity will be associated with both the job creation events and the job run events.


C.

Because these events are managed separately, User A will have their identity associated with the job creation events and User B will have their identity associated with the job run events.


D.

Because the REST API was used for job creation and triggering runs, user identity will not be captured in the audit logs.


E.

Because User A created the jobs, their identity will be associated with both the job creation events and the job run events.


Expert Solution
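
For context, workspace audit logs attribute each REST API event to the identity that owned the personal access token used for the call, so job-creation events and run-trigger events can carry different user identities. Below is a minimal sketch of inspecting this, assuming Unity Catalog audit-log system tables (system.access.audit) are enabled; the filter values are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each row records the acting identity; job creation and job runs are
# logged as separate events and may belong to different users.
events = spark.sql("""
    SELECT event_time, user_identity.email, action_name
    FROM system.access.audit
    WHERE service_name = 'jobs'
    ORDER BY event_time DESC
""")
events.show(truncate=False)
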
Question # 2:

A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.

In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?

Options:

A.

Set the configuration delta.deduplicate = true.


B.

VACUUM the Delta table after each batch completes.


C.

Perform an insert-only merge with a matching condition on a unique key.


D.

Perform a full outer join on a unique key and overwrite existing data.


E.

Rely on Delta Lake schema enforcement to prevent duplicate records.


Expert Solution
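
For background, deduplicating an incoming batch against records already written to a Delta table is typically done with an insert-only merge on a unique key. A minimal sketch using the Delta Lake Python API, assuming an active Spark session; the table name "events" and key column "event_id" are hypothetical:

from delta.tables import DeltaTable

# Hypothetical incoming batch, already de-duplicated within itself.
batch_df = spark.createDataFrame([(1, "a"), (2, "b")], ["event_id", "payload"])

target = DeltaTable.forName(spark, "events")

(target.alias("t")
    .merge(batch_df.alias("s"), "t.event_id = s.event_id")
    .whenNotMatchedInsertAll()  # insert only keys not already present
    .execute())
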
Question # 3:

Which statement about Delta Lake is true?

Options:

A.

Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at all times.


B.

Delta Lake automatically collects statistics on the first 32 columns of each table, which are leveraged in data skipping based on query filters.


C.

Z-ORDER can only be applied to numeric values stored in Delta Lake tables.


D.

Primary and foreign key constraints can be leveraged to ensure duplicate values are never entered into a dimension table.


Expert Solution
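
As background for option B: Delta Lake collects file-level statistics on the first 32 columns of a table by default, and queries use those statistics for data skipping. The column count is tunable per table; a minimal sketch, with a hypothetical table name:

# Raise the number of columns indexed for data skipping above the default 32.
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '40')
""")
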
Question # 4:

The business reporting team requires that data for their dashboards be updated every hour. The pipeline that extracts, transforms, and loads this data completes in 10 minutes. Assuming normal operating conditions, which configuration will meet their service-level agreement requirements at the lowest cost?

Options:

A.

Schedule a job to execute the pipeline once an hour on a dedicated interactive cluster.


B.

Schedule a job to execute the pipeline once an hour on a new job cluster.


C.

Schedule a Structured Streaming job with a trigger interval of 60 minutes.


D.

Configure a job that executes every time new data lands in a given directory.


Expert Solution
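
For context, the scenario describes an hourly schedule on an ephemeral job cluster, which can be created through the Jobs API 2.1. A minimal sketch; the host, token, notebook path, and cluster settings are hypothetical placeholders:

import requests

host = "https://<workspace-host>"
token = "<personal-access-token>"

payload = {
    "name": "hourly-dashboard-etl",
    "schedule": {
        "quartz_cron_expression": "0 0 * * * ?",  # top of every hour
        "timezone_id": "UTC",
    },
    "tasks": [
        {
            "task_key": "etl",
            "notebook_task": {"notebook_path": "/pipelines/dashboard_etl"},
            # A new job cluster spins up per run and terminates afterwards.
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"job_id": 123}
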
Question # 5:

A query is taking too long to run. After investigating the Spark UI, the data engineer discovered a significant amount of disk spill. The compute instance being used has a core-to-memory ratio of 1:2.

What are the two steps the data engineer should take to minimize spillage? (Choose 2 answers)

Options:

A.

Choose a compute instance with a higher core-to-memory ratio.


B.

Choose a compute instance with more disk space.


C.

Increase spark.sql.files.maxPartitionBytes.


D.

Reduce spark.sql.files.maxPartitionBytes.


E.

Choose a compute instance with more network bandwidth.


Expert Solution
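
For background on options C and D: spark.sql.files.maxPartitionBytes caps how much file data goes into each input partition (128 MB by default), so lowering it produces smaller partitions that are less likely to exceed executor memory and spill. A minimal sketch:

# Halve the default partition size so each task buffers less data at once.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))
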
Question # 6:

A developer has successfully configured credentials for Databricks Repos and cloned a remote Git repository. They do not have privileges to make changes to the main branch, which is the only branch currently visible in their workspace.

After using Repos to pull changes from the remote Git repository, how can the developer commit and push their changes back to the remote repository?

Options:

A.

Use Repos to merge all differences and make a pull request back to the remote repository.


B.

Use Repos to merge all differences and make a pull request back to the remote repository.


C.

Use Repos to create a new branch, commit all changes, and push changes to the remote Git repository.


D.

Use Repos to create a fork of the remote repository, commit all changes, and make a pull request on the source repository.


Expert Solution
Question # 7:

Which of the following technologies can be used to identify key areas of text when parsing Spark Driver log4j output?

Options:

A.

Regex


B.

Julia


C.

pyspark.ml.feature


D.

Scala Datasets


E.

C++


Expert Solution
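
For context, regular expressions are the standard tool for pulling key fields out of log4j-formatted text. A minimal sketch in Python; the sample line and pattern assume a typical log4j layout and should be adapted to the actual output:

import re

line = "23/05/01 12:34:56 ERROR TaskSchedulerImpl: Lost executor 3"

# Capture groups: timestamp, level, logger name, message.
pattern = re.compile(r"^(\S+ \S+) (\w+) ([\w.]+): (.*)$")

match = pattern.match(line)
if match:
    timestamp, level, logger, message = match.groups()
    print(level, logger, message)
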
Question # 8:

Spill occurs as a result of executing various wide transformations. However, diagnosing spill requires one to proactively look for key indicators.

Where in the Spark UI are two of the primary indicators that a partition is spilling to disk?

Options:

A.

Stage’s detail screen and Executor’s files


B.

Stage’s detail screen and Query’s detail screen


C.

Driver’s and Executor’s log files


D.

Executor’s detail screen and Executor’s log files


Expert Solution
Question # 9:

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

df has the following schema: device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT

Code block:

df.withWatermark("event_time", "10 minutes")
    .groupBy(
        ________,
        "device_id"
    )
    .agg(
        avg("temp").alias("avg_temp"),
        avg("humidity").alias("avg_humidity")
    )
    .writeStream
    .format("delta")
    .saveAsTable("sensor_avg")

Which line of code correctly fills in the blank within the code block to complete this task?

Options:

A.

window( " event_time " , " 5 minutes " ).alias( " time " )


B.

to_interval( " event_time " , " 5 minutes " ).alias( " time " )


C.

" event_time "


D.

lag( " event_time " , " 5 minutes " ).alias( " time " )


Expert Solution
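
For reference, a tumbling window expression over event_time completes the aggregation described above. A minimal sketch of the grouped-aggregation step (writing the stream out additionally requires a checkpoint location):

from pyspark.sql.functions import avg, window

agg_df = (
    df.withWatermark("event_time", "10 minutes")
      .groupBy(
          # Non-overlapping five-minute buckets, per device.
          window("event_time", "5 minutes").alias("time"),
          "device_id",
      )
      .agg(
          avg("temp").alias("avg_temp"),
          avg("humidity").alias("avg_humidity"),
      )
)
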
Question # 10:

A data engineer has created a new cluster using shared access mode with default configurations. The data engineer needs to allow the development team to view the driver logs if needed.

What are the minimal cluster permissions that allow the development team to accomplish this?

Options:

A.

CAN ATTACH TO


B.

CAN MANAGE


C.

CAN VIEW


D.

CAN RESTART


Expert Solution
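
For context, cluster permission levels can also be granted through the REST API. A minimal sketch using the workspace Permissions API; the host, token, cluster ID, and group name are hypothetical placeholders:

import requests

host = "https://<workspace-host>"
token = "<personal-access-token>"
cluster_id = "<cluster-id>"

resp = requests.patch(
    f"{host}/api/2.0/permissions/clusters/{cluster_id}",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "access_control_list": [
            {"group_name": "dev-team", "permission_level": "CAN_ATTACH_TO"}
        ]
    },
)
resp.raise_for_status()
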