Pass the Databricks Certification Databricks-Certified-Professional-Data-Engineer Questions and Answers with CertsForce

Viewing page 3 out of 4 pages
Viewing questions 21-30
Question # 21:

Which statement characterizes the general programming model used by Spark Structured Streaming?

Options:

A.

Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.


B.

Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.


C.

Structured Streaming uses specialized hardware and I/O streams to achieve sub-second latency for data transfer.


D.

Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.


E.

Structured Streaming relies on a distributed network of nodes that hold incremental state values for cached stages.


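For context on the model being described, the sketch below is a minimal Structured Streaming job: each micro-batch of newly arrived input is treated as rows appended to a conceptually unbounded input table, and the query incrementally processes those appended rows. The source path, schema, and sink locations are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Each new file landing in the (hypothetical) source directory is treated as new
# rows appended to a logically unbounded input table.
events = (
    spark.readStream
    .format("json")
    .schema("event_id STRING, ts TIMESTAMP, value DOUBLE")
    .load("/tmp/demo/incoming_events")          # hypothetical path
)

# Each trigger incrementally processes whatever was appended since the last one
# and appends the results to the sink, mirroring the unbounded-table model.
query = (
    events.filter(col("value") > 0)
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/demo/_checkpoints/events")  # hypothetical path
    .outputMode("append")
    .start("/tmp/demo/events_clean")
)
```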
Question # 22:

A junior data engineer on your team has implemented the following code block.

[Code block image not reproduced in this transcript.]

The view new_events contains a batch of records with the same schema as the events Delta table. The event_id field serves as a unique key for this table.

When this query is executed, what will happen with new records that have the same event_id as an existing record?

Options:

A.

They are merged.


B.

They are ignored.


C.

They are updated.


D.

They are inserted.


E.

They are deleted.


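The code block referenced above is not reproduced in this transcript. Purely for illustration, a common pattern this scenario tests is an insert-only MERGE keyed on event_id; the sketch below assumes that shape and reuses the view and table names from the question.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical insert-only merge: rows in new_events whose event_id already exists
# in events satisfy the ON clause, and with no WHEN MATCHED clause they are left
# untouched; only event_ids not yet present in the target are inserted.
spark.sql("""
    MERGE INTO events AS target
    USING new_events AS source
    ON target.event_id = source.event_id
    WHEN NOT MATCHED THEN INSERT *
""")
```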
Question # 23:

The data architect has mandated that all tables in the Lakehouse should be configured as external (also known as "unmanaged") Delta Lake tables.

Which approach will ensure that this requirement is met?

Options:

A.

When a database is being created, make sure that the LOCATION keyword is used.


B.

When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.


C.

When data is saved to a table, make sure that a full file path is specified alongside the Delta format.


D.

When tables are created, make sure that the EXTERNAL keyword is used in the CREATE TABLE statement.


E.

When the workspace is being configured, make sure that external cloud object storage has been mounted.


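As a refresher on the managed-versus-external distinction this question turns on, a Delta table is external (unmanaged) when its data lives at an explicitly supplied storage path rather than under the schema's default location. The table names and paths below are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Managed table: no path supplied, so Databricks controls the storage location.
spark.sql("CREATE TABLE IF NOT EXISTS sales_managed (id INT, amount DOUBLE) USING DELTA")

# External (unmanaged) table: an explicit LOCATION is supplied, so dropping the
# table leaves the underlying Delta files in place.  Path is hypothetical.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_external (id INT, amount DOUBLE)
    USING DELTA
    LOCATION '/mnt/datalake/tables/sales_external'
""")

# Equivalent DataFrame-writer form: saving with an explicit path (rather than
# saveAsTable with no path) likewise keeps the data outside managed storage.
df = spark.createDataFrame([(1, 9.99)], "id INT, amount DOUBLE")
df.write.format("delta").mode("append").save("/mnt/datalake/tables/sales_external")
```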
Question # 24:

A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on task A.

If tasks A and B complete successfully but task C fails during a scheduled run, which statement describes the resulting state?

Options:

A.

All logic expressed in the notebook associated with tasks A and B will have been successfully completed; some operations in task C may have completed successfully.


B.

All logic expressed in the notebook associated with tasks A and B will have been successfully completed; any changes made in task C will be rolled back due to task failure.


C.

All logic expressed in the notebook associated with task A will have been successfully completed; tasks B and C will not commit any changes because of stage failure.


D.

Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.


E.

Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task C failed, all commits will be rolled back automatically.


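For reference, the dependency structure described in this scenario (B and C each depending on A, A with no dependencies) can be written down as a job definition. The sketch below is a hypothetical, abbreviated payload in the style of the Databricks Jobs 2.1 task format, shown as a Python dict rather than actually submitted; task keys and notebook paths are made up, and cluster settings are omitted.

```python
# Hypothetical, abbreviated job specification mirroring the dependency graph above.
job_spec = {
    "name": "three_task_pipeline",
    "tasks": [
        {"task_key": "task_a",
         "notebook_task": {"notebook_path": "/Pipelines/task_a"}},
        {"task_key": "task_b",
         "depends_on": [{"task_key": "task_a"}],
         "notebook_task": {"notebook_path": "/Pipelines/task_b"}},
        {"task_key": "task_c",
         "depends_on": [{"task_key": "task_a"}],
         "notebook_task": {"notebook_path": "/Pipelines/task_c"}},
    ],
}

# Each task is an independent notebook run with its own commits; there is no
# cross-task transaction that would roll back A's or B's work if C later fails.
```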
Question # 25:

A data engineer has created a new cluster using shared access mode with default configurations. The data engineer needs to allow the development team access to view the driver logs if needed.

What are the minimal cluster permissions that allow the development team to accomplish this?

Options:

A.

CAN ATTACH TO


B.

CAN MANAGE


C.

CAN VIEW


D.

CAN RESTART


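For context on how cluster permissions like these are granted, the sketch below calls the workspace Permissions REST API. The endpoint, field names, and permission-level spelling reflect my understanding of that API and should be treated as assumptions; the workspace URL, token, cluster ID, and group name are placeholders, and CAN_ATTACH_TO is shown only as an example level.

```python
import requests

# Placeholder values; substitute a real workspace URL, token, cluster ID, and group.
HOST = "https://example-workspace.cloud.databricks.com"
TOKEN = "dapiXXXXXXXXXXXXXXXX"
CLUSTER_ID = "0123-456789-abcde000"

# Grant a group a cluster permission level (CAN_ATTACH_TO shown as an example;
# endpoint and payload shape are assumptions based on the Permissions API).
payload = {
    "access_control_list": [
        {"group_name": "dev-team", "permission_level": "CAN_ATTACH_TO"}
    ]
}

resp = requests.patch(
    f"{HOST}/api/2.0/permissions/clusters/{CLUSTER_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())
```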
Question # 26:

The data engineer is using Spark's MEMORY_ONLY storage level.

Which indicators should the data engineer look for in the Spark UI's Storage tab to signal that a cached table is not performing optimally?

Options:

A.

Size on Disk is > 0


B.

The number of Cached Partitions > the number of Spark Partitions


C.

The RDD Block Name includes the '' annotation, signaling a failure to cache


D.

On-Heap Memory Usage is within 75% of Off-Heap Memory Usage


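To reproduce the situation the question describes, a DataFrame can be persisted at the MEMORY_ONLY storage level and then materialized with an action, after which the Spark UI's Storage tab reports the cached entry. The table name below is hypothetical.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# MEMORY_ONLY keeps partitions only in on-heap memory: partitions that do not fit
# are recomputed when needed rather than spilled to disk.
df = spark.table("sales_facts")          # hypothetical table
df.persist(StorageLevel.MEMORY_ONLY)

# An action is required to actually populate the cache.
df.count()

# The Storage tab now shows this entry with its Cached Partitions, Fraction Cached,
# Size in Memory, and Size on Disk columns, which are the indicators the question
# asks about.
```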
Question # 27:

In order to prevent accidental commits to production data, a senior data engineer has instituted a policy that all development work will reference clones of Delta Lake tables. After testing both deep and shallow clone, development tables are created using shallow clone.

A few weeks after initial table creation, the cloned versions of several tables implemented as Type 1 Slowly Changing Dimension (SCD) stop working. The transaction logs for the source tables show that vacuum was run the day before.

Why are the cloned tables no longer working?

Options:

A.

The data files compacted by vacuum are not tracked by the cloned metadata; running refresh on the cloned table will pull in recent changes.


B.

Because Type 1 changes overwrite existing records, Delta Lake cannot guarantee data consistency for cloned tables.


C.

The metadata created by the clone operation is referencing data files that were purged as invalid by the vacuum command.


D.

Running vacuum automatically invalidates any shallow clones of a table; deep clone should always be used when a cloned table will be repeatedly queried.


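To make the failure mode concrete, the sketch below creates a shallow clone and then vacuums the source table after Type 1 overwrites. The schema and table names are hypothetical, and the retention override is shown only to compress the question's weeks-long timeline into a single step.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A shallow clone copies only the Delta metadata/transaction log; its metadata
# continues to point at the source table's data files.  Names are hypothetical.
spark.sql("CREATE TABLE dev.customers_clone SHALLOW CLONE prod.customers")

# Type 1 SCD changes overwrite records, rewriting data files and leaving the old
# files unreferenced by the source table (though still referenced by the clone).
# VACUUM then physically removes those unreferenced files.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM prod.customers RETAIN 0 HOURS")

# Subsequent reads of dev.customers_clone can fail with missing-file errors,
# because its metadata references files that VACUUM has purged.
```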
Question # 28:

A junior data engineer has manually configured a series of jobs using the Databricks Jobs UI. Upon reviewing their work, the engineer realizes that they are listed as the "Owner" for each job. They attempt to transfer "Owner" privileges to the "DevOps" group, but cannot successfully accomplish this task.

Which statement explains what is preventing this privilege transfer?

Options:

A.

Databricks jobs must have exactly one owner; "Owner" privileges cannot be assigned to a group.


B.

The creator of a Databricks job will always have "Owner" privileges; this configuration cannot be changed.


C.

Other than the default "admins" group, only individual users can be granted privileges on jobs.


D.

A user can only transfer job ownership to a group if they are also a member of that group.


E.

Only workspace administrators can grant "Owner" privileges to a group.


Question # 29:

A table is registered with the following code:

[Table registration code not reproduced in this transcript.]

Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?

Options:

A.

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.


B.

All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.


C.

Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.


D.

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.


E.

The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.


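The registration code referenced above is not reproduced in this transcript. Purely for illustration, one plausible shape for it is a view defined over the two Delta tables; the sketch below assumes that shape, with hypothetical columns and filter.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical reconstruction: a view stores only the query text, so the join is
# evaluated against the source tables whenever recent_orders is queried.
spark.sql("""
    CREATE OR REPLACE VIEW recent_orders AS
    SELECT u.user_id, u.email, o.order_id, o.order_ts
    FROM orders o
    JOIN users u
      ON o.user_id = u.user_id
    WHERE o.order_ts >= date_sub(current_date(), 7)
""")
```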
Question # 30:

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.

Which of the following adjustments will give a more accurate measure of how code is likely to perform in production?

Options:

A.

Scala is the only language that can be accurately tested using interactive notebooks; because the best performance is achieved by using Scala code compiled to JARs, all PySpark and Spark SQL logic should be refactored.


B.

The only way to meaningfully troubleshoot code execution times in development notebooks is to use production-sized data and production-sized clusters with Run All execution.


C.

Production code development should only be done using an IDE; executing code against a local build of open source Spark and Delta Lake will provide the most accurate benchmarks for how code will perform in production.


D.

Calling display() forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results.


E.

The Jobs UI should be leveraged to occasionally run the notebook as a job and track execution time during incremental code development, because Photon can only be enabled on clusters launched for scheduled jobs.


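As background on why interactive cell timings can mislead, the sketch below contrasts lazily building a transformation chain with forcing a full, uncached execution; the table name is hypothetical, and the noop sink is used only as a throwaway target for benchmarking.

```python
import time

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Transformations only extend the logical plan; nothing executes yet.
df = (
    spark.table("raw_events")                      # hypothetical table
    .filter(col("event_type") == "purchase")
    .groupBy("user_id")
    .count()
)

# display(df) in a notebook triggers a job, but generally only enough work to
# render a limited number of rows, and rerunning the cell can be served from cache.
# Clearing the cache and forcing a full pass gives a measurement closer to how the
# logic behaves in production.
spark.catalog.clearCache()
start = time.time()
df.write.format("noop").mode("overwrite").save()   # runs the full plan, discards output
print(f"Full execution took {time.time() - start:.1f}s")
```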