Pass the Databricks Certification Databricks-Certified-Professional-Data-Engineer questions and answers with CertsForce

Viewing page 2 of 4
Viewing questions 11-20
Questions # 11:

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

Options:

A.

Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.


B.

Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.


C.

Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.


D.

Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.


E.

Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.


Expert Solution
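
A minimal PySpark sketch of the no-shuffle approach discussed in Question 11: tuning spark.sql.files.maxPartitionBytes so input partitions (and therefore output part files) land near 512 MB. The source path, output path, and transformation below are illustrative assumptions, not part of the question.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Ask Spark to pack roughly 512 MB of source data into each input partition.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

df = spark.read.json("/mnt/raw/events/")                    # hypothetical source path
cleaned = df.withColumn("ingest_date", F.current_date())    # narrow transformation only

# With only narrow transformations applied, the input partitioning is preserved,
# so each task writes one ~512 MB Parquet part file and no shuffle is triggered.
cleaned.write.mode("overwrite").parquet("/mnt/silver/events_parquet/")
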
Questions # 12:

A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max task durations for a particular stage show the minimum and median times to complete a task as roughly the same, while the maximum duration is roughly 100 times as long as the minimum.

Which situation is causing increased duration of the overall job?

Options:

A.

Task queueing resulting from improper thread pool assignment.


B.

Spill resulting from attached volume storage being too small.


C.

Network latency due to some cluster nodes being in different regions from the source data.


D.

Skew caused by more data being assigned to a subset of Spark partitions.


E.

Credential validation errors while pulling data from an external system.


Expert Solution
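
For the symptom described in Question 12, a quick way to confirm suspected skew is to count rows per Spark partition. The DataFrame name and the mitigation settings below are assumptions, offered as a sketch rather than a prescribed fix.

from pyspark.sql import functions as F

# `spark` and `df` refer to the ambient SparkSession and the suspect DataFrame
# in a Databricks notebook (assumed names).
rows_per_partition = (
    df.withColumn("pid", F.spark_partition_id())
      .groupBy("pid")
      .count()
)
rows_per_partition.orderBy(F.desc("count")).show(10)

# One common mitigation on Spark 3.x: let AQE split skewed shuffle partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
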
Questions # 13:

The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.

What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?

Options:

A.

Can Manage


B.

Can Edit


C.

Can Run


D.

Can Read


Expert Solution
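
If the read-only grant from Question 13 were applied programmatically rather than through the UI, it could look roughly like the sketch below, which uses the Databricks Permissions REST API. The workspace host, token, notebook ID, and user name are placeholder assumptions, and the exact payload should be verified against the Permissions API documentation.

import requests

host = "https://<workspace-host>"       # placeholder
token = "<personal-access-token>"       # placeholder
notebook_id = "<notebook-object-id>"    # placeholder

resp = requests.patch(
    f"{host}/api/2.0/permissions/notebooks/{notebook_id}",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "access_control_list": [
            # CAN_READ lets the new hire view the notebook without running or editing it.
            {"user_name": "new.engineer@example.com", "permission_level": "CAN_READ"}
        ]
    },
)
resp.raise_for_status()
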
Questions # 14:

A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create.

(Image: the JSON payload posted to 2.0/jobs/create)

Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?

Options:

A.

Three new jobs named "Ingest new data" will be defined in the workspace, and they will each run once daily.


B.

The logic defined in the referenced notebook will be executed three times on new clusters with the configurations of the provided cluster ID.


C.

Three new jobs named "Ingest new data" will be defined in the workspace, but no jobs will be executed.


D.

One new job named "Ingest new data" will be defined in the workspace, but it will not be executed.


E.

The logic defined in the referenced notebook will be executed three times on the referenced existing all-purpose cluster.


Expert Solution
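
A sketch of what Question 14 describes: each POST to 2.0/jobs/create registers a new job definition and does not start a run. The payload below is an illustrative assumption, not the JSON shown in the question's image.

import requests

host = "https://<workspace-host>"     # placeholder
token = "<personal-access-token>"     # placeholder
payload = {
    "name": "Ingest new data",
    "existing_cluster_id": "<cluster-id>",                       # assumption
    "notebook_task": {"notebook_path": "/Repos/prod/ingest"},    # assumption
}

for _ in range(3):
    r = requests.post(
        f"{host}/api/2.0/jobs/create",
        headers={"Authorization": f"Bearer {token}"},
        json=payload,
    )
    # Each call returns a distinct job_id; jobs/create itself triggers no run.
    print(r.json())
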
Questions # 15:

Each configuration below is identical to the extent that each cluster has 400 GB of RAM and 160 cores in total, with only one Executor per VM.

Given a job with at least one wide transformation, which of the following cluster configurations will result in maximum performance?

Options:

A.

• Total VMs: 1

• 400 GB per Executor

• 160 Cores / Executor


B.

• Total VMs: 8

• 50 GB per Executor

• 20 Cores / Executor


C.

• Total VMs: 4

• 100 GB per Executor

• 40 Cores / Executor


D.

• Total VMs: 2

• 200 GB per Executor

• 80 Cores / Executor


Expert Solution
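
As a reminder of what "wide transformation" means for Question 15: any aggregation or join shuffles data across the cluster's 160 total cores, and shuffle parallelism is usually tuned against that total rather than the per-VM split. The table, column, and partition count below are assumptions for illustration.

# `spark` is the SparkSession provided in a Databricks notebook.
# A groupBy is a wide transformation: it forces a shuffle across the cluster.
spark.conf.set("spark.sql.shuffle.partitions", "320")   # e.g. ~2x the 160 total cores (assumption)

orders = spark.table("orders")                                   # hypothetical table
totals = orders.groupBy("customer_id").agg({"amount": "sum"})    # wide transformation -> shuffle
totals.write.mode("overwrite").saveAsTable("order_totals")
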
Questions # 16:

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.

The silver_device_recordings table will be used downstream for highly selective joins on a number of fields, and will also be leveraged by the machine learning team to filter on a handful of relevant fields. In total, 15 fields have been identified that will often be used for filter and join logic.

The data engineer is trying to determine the best approach for dealing with these nested fields before declaring the table schema.

Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?

Options:

A.

Because Delta Lake uses Parquet for data storage, Dremel encoding information for nesting can be directly referenced by the Delta transaction log.


B.

Tungsten encoding used by Databricks is optimized for storing string data: newly-added native support for querying JSON strings means that string types are always most efficient.


C.

Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.


D.

By default Delta Lake collects statistics on the first 32 columns in a table; these statistics are leveraged for data skipping when executing selective queries.


Expert Solution
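
To make the statistics point in Question 16 concrete, the sketch below declares the frequently filtered/joined fields ahead of rarely used ones and optionally raises Delta's indexed-column count. The column names and the decision to widen the default of 32 are assumptions.

# `spark` is the SparkSession provided in a Databricks notebook.
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver_device_recordings (
        device_id    BIGINT,        -- assumed join key
        event_time   TIMESTAMP,     -- assumed filter column
        user_id      BIGINT,        -- assumed join key
        -- ... the remaining frequently queried fields go here, ahead of the
        -- rarely used nested payload, so they fall inside the statistics window
        raw_payload  STRING
    ) USING DELTA
""")

# Optionally widen the statistics window beyond the default of 32 columns.
spark.sql("""
    ALTER TABLE silver_device_recordings
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '40')
""")
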
Questions # 17:

A data engineer wants to create a cluster using the Databricks CLI for a big ETL pipeline. The cluster should have five workers, one driver of type i3.xlarge, and should use the '14.3.x-scala2.12' runtime.

Which command should the data engineer use?

Options:

A.

databricks clusters create 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name DataEngineer_cluster


B.

databricks clusters add 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster


C.

databricks compute add 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster


D.

databricks compute create 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster


Expert Solution
Questions # 18:

A data engineer has created a transactions Delta table on Databricks that should be used by the analytics team. The analytics team wants to use the table with another tool that requires Apache Iceberg format.

What should the data engineer do?

Options:

A.

Require the analytics team to use a tool that supports Delta table.


B.

Enable UniForm on the transactions table with 'iceberg' so that the table can be read as an Iceberg table.


C.

Create an Iceberg copy of the transactions Delta table which can be used by the analytics team.


D.

Convert the transactions Delta table to Iceberg and enable UniForm so that the table can be read as a Delta table.


Expert Solution
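
A sketch of enabling Delta UniForm for the scenario in Question 18 so the existing transactions Delta table can also be read as Iceberg. The property names follow the UniForm documentation, but runtime prerequisites (such as column mapping) should be verified for your Databricks Runtime version.

# `spark` is the SparkSession provided in a Databricks notebook.
spark.sql("""
    ALTER TABLE transactions SET TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
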
Questions # 19:

The data science team has requested assistance in accelerating queries on free-form text from user reviews. The data is currently stored in Parquet with the below schema:

item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING

The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field.

A junior data engineer suggests converting this data to Delta Lake will improve query performance.

Which response to the junior data engineer's suggestion is correct?

Options:

A.

Delta Lake statistics are not optimized for free text fields with high cardinality.


B.

Text data cannot be stored with Delta Lake.


C.

ZORDER ON review will need to be run to see performance gains.


D.

The Delta log creates a term matrix for free text fields to support selective filtering.


E.

Delta Lake statistics are only collected on the first 4 columns in a table.


Expert Solution
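
For Question 19, the keyword search itself looks like the sketch below; because substring matches on a high-cardinality free-text column cannot be answered from min/max file statistics, every file's review column must still be scanned whether the data lives in Parquet or Delta. The keyword list and path are assumptions.

from functools import reduce
from pyspark.sql import functions as F

# `spark` is the SparkSession provided in a Databricks notebook.
keywords = ["refund", "broken", "excellent"]              # 3 of the 30 key words, for illustration
reviews = spark.read.parquet("/mnt/raw/user_reviews/")    # hypothetical path

# Build an OR of substring matches over the review column.
keyword_filter = reduce(
    lambda acc, kw: acc | F.col("review").contains(kw),
    keywords,
    F.lit(False),
)
flagged = reviews.filter(keyword_filter)   # still requires scanning the review column in every file
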
Questions # 20:

The business intelligence team has a dashboard configured to track various summary metrics for retail stores. This includes total sales for the previous day alongside totals and averages for a variety of time periods. The fields required to populate this dashboard have the following schema:

(Image: schema of the fields required by the dashboard)

For demand forecasting, the Lakehouse contains a validated table of all itemized sales updated incrementally in near real-time. This table, named products_per_order, includes the following fields:

(Image: schema of the products_per_order table)

Because reporting on long-term sales trends is less volatile, analysts using the new dashboard only require data to be refreshed once daily. Because the dashboard will be queried interactively by many users throughout a normal business day, it should return results quickly and reduce total compute associated with each materialization.

Which solution meets the expectations of the end users while controlling and limiting possible costs?

Options:

A.

Use the Delta Cache to persist the products_per_order table in memory to quickly update the dashboard with each query.


B.

Populate the dashboard by configuring a nightly batch job to save the required values as a table used to quickly update the dashboard with each query.


C.

Use Structured Streaming to configure a live dashboard against the products_per_order table within a Databricks notebook.


D.

Define a view against the products_per_order table and define the dashboard against this view.


Expert Solution
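
A sketch of the nightly-materialization pattern weighed in Question 20: a scheduled batch job pre-aggregates products_per_order into a small summary table, and the dashboard queries that table instead of recomputing on every interactive refresh. The column and target table names are assumptions, since the actual schemas appear only as images in the question.

from pyspark.sql import functions as F

# `spark` is the SparkSession provided in a Databricks notebook.
daily_summary = (
    spark.table("products_per_order")
         .groupBy(F.to_date("order_timestamp").alias("order_date"), "store_id")   # assumed columns
         .agg(
             F.sum("sales_price").alias("total_sales"),     # assumed column
             F.avg("sales_price").alias("avg_sale"),
         )
)

# Overwrite the small summary table once per day; the dashboard is defined against this table.
daily_summary.write.mode("overwrite").saveAsTable("gold.daily_store_sales_summary")   # assumed schema/table
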