Databricks Databricks-Certified-Professional-Data-Engineer Exam Questions Free Practice Test

Viewing page 2 out of 4 pages

Viewing questions 11-20 out of questions

Questions # 11:

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

Options:

Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.

Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.

Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.

Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB* 1024*1024/512), and then write to parquet.

Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.

Expert Solution

Answer

Explanation

For this scenario, where a one-TB JSON dataset must be converted to Parquet without Delta Lake's auto-sizing features, the goal is to avoid unnecessary shuffles while still producing part-files close to the target size. Here’s a breakdown of why option A (the first option, which sets spark.sql.files.maxPartitionBytes) is the most suitable:

Setting maxPartitionBytes: The spark.sql.files.maxPartitionBytes configuration sets the maximum number of bytes packed into a single partition when Spark reads files (in this case, the JSON files). When only narrow transformations follow and no repartition or coalesce is applied, each input partition is written out as one part-file, so the setting also influences output file size. Setting it to 512 MB therefore addresses the file-size requirement without a shuffle.

Data Ingestion and Processing:

Ingesting Data: Load the JSON dataset into a DataFrame.

Applying Transformations: Perform any required narrow transformations that do not involve shuffling data (like filtering or adding new columns).

Writing to Parquet: Write the transformed DataFrame directly to Parquet files. With maxPartitionBytes set, each part-file corresponds to one input partition of up to roughly 512 MB, meeting the part-file size requirement without additional steps to repartition or coalesce the data.
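The three steps above can be sketched in PySpark roughly as follows. This is a minimal sketch under stated assumptions, not the exam's reference solution: the function names, the paths, and the added ingested_at column are hypothetical, and spark is assumed to be an already active SparkSession. Because Parquet is compressed, the written part-files will often come out smaller than the input splits, so the value may need tuning in practice.

```python
def mb_to_bytes(megabytes: int) -> int:
    """Convert megabytes to bytes, treating 1 MB as 1024 * 1024 bytes."""
    return megabytes * 1024 * 1024


def json_to_parquet(spark, source_path: str, target_path: str, target_mb: int = 512) -> None:
    """Read JSON, apply a narrow transformation, and write Parquet without a shuffle.

    `spark` is assumed to be an active SparkSession; both paths are hypothetical.
    """
    # Cap the number of bytes packed into each input partition when reading files.
    spark.conf.set("spark.sql.files.maxPartitionBytes", str(mb_to_bytes(target_mb)))

    df = spark.read.json(source_path)

    # Narrow transformation: adds a column row by row, no data moves between partitions.
    transformed = df.selectExpr("*", "current_timestamp() AS ingested_at")

    # No repartition or coalesce, so each input partition becomes its own part-file.
    transformed.write.mode("overwrite").parquet(target_path)
```

A call would look like json_to_parquet(spark, "/path/to/json", "/path/to/parquet"), with both paths replaced by real locations.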

Performance Consideration: This approach is optimal because:

It avoids the overhead of shuffling data, which can be significant, especially with large datasets.

It directly ties the read/write operations to a configuration that matches the target output size, making it efficient in terms of both computation and I/O operations.

Alternative Options Analysis:

Options B and D: Both trigger a shuffle (B by sorting, D by calling repartition), which contradicts the requirement to avoid shuffling for performance reasons.

Option C: Uses coalesce, which is less intensive than repartition but can still lead to uneven partition sizes and does not control the output file size as directly as setting maxPartitionBytes.

Option E: Setting spark.sql.shuffle.partitions to 512 does not control the output file size for writing to Parquet. The setting applies only when a shuffle occurs, and this job performs only narrow transformations.
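The 2,048 figure quoted in options B, C, and D is simply the dataset size divided by the target file size. A small worked check, assuming binary units (1 TB = 1024 * 1024 MB); the variable names are illustrative only:

```python
# Dataset and target sizes expressed in MB, assuming binary units.
dataset_mb = 1 * 1024 * 1024  # 1 TB = 1,048,576 MB
target_file_mb = 512

# Number of 512 MB partitions needed to cover 1 TB.
partition_count = dataset_mb // target_file_mb
print(partition_count)
```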

Questions # 12:

A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max Durations for tasks in a particular stage show the minimum and median time to complete a task as roughly the same, but the max duration for a task to be roughly 100 times as long as the minimum.

Which situation is causing increased duration of the overall job?

Options:

Task queueing resulting from improper thread pool assignment.

Spill resulting from attached volume storage being too small.

Network latency due to some cluster nodes being in different regions from the source data.

Skew caused by more data being assigned to a subset of spark-partitions.

Credential validation errors while pulling data from an external system.
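The comparison described in the question can be reproduced outside the Spark UI. The sketch below uses only the Python standard library to summarize a list of task durations and flag a stage whose slowest task far exceeds the median. The sample durations and the 10x threshold are made-up assumptions, not values taken from Spark.

```python
from statistics import median


def summarize_task_durations(durations_s, threshold=10.0):
    """Return min, median, max, and whether max exceeds threshold * median."""
    lo, mid, hi = min(durations_s), median(durations_s), max(durations_s)
    return {"min": lo, "median": mid, "max": hi, "straggler": hi > threshold * mid}


# Hypothetical stage: most tasks take about 2 s, one takes about 200 s.
sample = [2.0, 2.1, 1.9, 2.0, 2.2, 200.0]
summary = summarize_task_durations(sample)
print(summary)
```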

Questions # 13:

The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.

What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?

Options:

Can manage

Can edit

Can run

Can Read

Questions # 14:

A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create.

Question # 14

Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?

Options:

Three new jobs named "Ingest new data" will be defined in the workspace, and they will each run once daily.

The logic defined in the referenced notebook will be executed three times on new clusters with the configurations of the provided cluster ID.

Three new jobs named "Ingest new data" will be defined in the workspace, but no jobs will be executed.

One new job named "Ingest new data" will be defined in the workspace, but it will not be executed.

The logic defined in the referenced notebook will be executed three times on the referenced existing all-purpose cluster.

Questions # 15:

Each configuration below is identical to the extent that each cluster has 400 GB total of RAM, 160 total cores and only one Executor per VM.

Given a job with at least one wide transformation, which of the following cluster configurations will result in maximum performance?

Options:

Total VMs: 1; 400 GB per Executor; 160 Cores / Executor

Total VMs: 8; 50 GB per Executor; 20 Cores / Executor

Total VMs: 4; 100 GB per Executor; 40 Cores / Executor

Total VMs: 2; 200 GB per Executor; 80 Cores / Executor
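The premise that every configuration offers the same total resources can be verified with a few lines of arithmetic. The tuples below are transcribed from the options above; the variable names are illustrative.

```python
# (total VMs, GB of RAM per executor, cores per executor), one executor per VM.
configs = [(1, 400, 160), (8, 50, 20), (4, 100, 40), (2, 200, 80)]

# Multiply per-executor resources by the VM count to get cluster totals.
totals = [(vms * ram_gb, vms * cores) for vms, ram_gb, cores in configs]
for (vms, _, _), (total_ram, total_cores) in zip(configs, totals):
    print(f"{vms} VM(s): {total_ram} GB RAM, {total_cores} cores in total")
```

Every configuration totals 400 GB and 160 cores, so the options differ only in how those resources are split across VMs.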

Questions # 16:

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.

The silver_device_recordings table will be used downstream for highly selective joins on a number of fields, and will also be leveraged by the machine learning team to filter on a handful of relevant fields. In total, 15 fields have been identified that will often be used for filter and join logic.

The data engineer is trying to determine the best approach for dealing with these nested fields before declaring the table schema.

Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?

Options:

Because Delta Lake uses Parquet for data storage, Dremel encoding information for nesting can be directly referenced by the Delta transaction log.

Tungsten encoding used by Databricks is optimized for storing string data: newly-added native support for querying JSON strings means that string types are always most efficient.

Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.

By default Delta Lake collects statistics on the first 32 columns in a table; these statistics are leveraged for data skipping when executing selective queries.

Questions # 17:

A data engineer wants to create a cluster using the Databricks CLI for a big ETL pipeline. The cluster should have five workers, one driver of type i3.xlarge, and should use the '14.3.x-scala2.12' runtime.

Which command should the data engineer use?

Options:

databricks clusters create 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name DataEngineer_cluster

databricks clusters add 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name DataEngineer_cluster

databricks compute add 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name DataEngineer_cluster

databricks compute create 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name DataEngineer_cluster

Questions # 18:

A data engineer has created a transactions Delta table on Databricks that should be used by the analytics team. The analytics team wants to use the table with another tool that requires Apache Iceberg format.

What should the data engineer do?

Options:

Require the analytics team to use a tool that supports Delta table.

Enable UniForm for 'iceberg' on the transactions table so that the table can be read as an Iceberg table.

Create an Iceberg copy of the transactions Delta table which can be used by the analytics team.

Convert the transactions Delta table to Iceberg and enable UniForm so that the table can be read as a Delta table.

Questions # 19:

The data science team has requested assistance in accelerating queries on free-form text from user reviews. The data is currently stored in Parquet with the schema below:

item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING

The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field.

A junior data engineer suggests converting this data to Delta Lake will improve query performance.

Which response to the junior data engineer's suggestion is correct?

Options:

Delta Lake statistics are not optimized for free text fields with high cardinality.

Text data cannot be stored with Delta Lake.

ZORDER ON review will need to be run to see performance gains.

The Delta log creates a term matrix for free text fields to support selective filtering.

Delta Lake statistics are only collected on the first 4 columns in a table.

Questions # 20:

The business intelligence team has a dashboard configured to track various summary metrics for retail stores. This includes total sales for the previous day alongside totals and averages for a variety of time periods. The fields required to populate this dashboard have the following schema:

Question # 20

For demand forecasting, the Lakehouse contains a validated table of all itemized sales, updated incrementally in near real-time. This table, named products_per_order, includes the following fields:

Question # 20

Because reporting on long-term sales trends is less volatile, analysts using the new dashboard only require data to be refreshed once daily. Because the dashboard will be queried interactively by many users throughout a normal business day, it should return results quickly and reduce total compute associated with each materialization.

Which solution meets the expectations of the end users while controlling and limiting possible costs?

Options:

Use the Delta Cache to persist the products_per_order table in memory to quickly update the dashboard with each query.

Populate the dashboard by configuring a nightly batch job to save the required values to quickly update the dashboard with each query.

Use Structured Streaming to configure a live dashboard against the products_per_order table within a Databricks notebook.

Define a view against the products_per_order table and define the dashboard against this view.
