
Pass the Databricks Certification Databricks-Certified-Professional-Data-Engineer Questions and Answers with CertsForce

Viewing page 3 of 4 (questions 21-30)
Question # 21:

A team of data engineers is adding tables to a DLT pipeline that contain repetitive expectations for many of the same data quality checks.

One member of the team suggests reusing these data quality rules across all tables defined for this pipeline.

What approach would allow them to do this?

Options:

A.

Maintain data quality rules in a Delta table outside of this pipeline’s target schema, providing the schema name as a pipeline parameter.


B.

Use global Python variables to make expectations visible across DLT notebooks included in the same pipeline.


C.

Add data quality constraints to tables in this pipeline using an external job with access to pipeline configuration files.


D.

Maintain data quality rules in a separate Databricks notebook that each DLT notebook or file can reference.


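One common way to share expectations across tables (an illustrative sketch only, not tied to any particular option above) is to keep the rules in a single Python dictionary and apply them to every table with dlt.expect_all; the rule names, constraints, and source tables below are hypothetical.

import dlt

# Hypothetical reusable rules, defined once (e.g. in a shared module or a common cell).
COMMON_RULES = {
    "valid_id": "id IS NOT NULL",
    "valid_timestamp": "event_ts IS NOT NULL",
}

@dlt.table
@dlt.expect_all(COMMON_RULES)   # the same dictionary is reused on each table
def bronze_orders():
    return spark.read.table("raw.orders")

@dlt.table
@dlt.expect_all(COMMON_RULES)
def bronze_customers():
    return spark.read.table("raw.customers")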
Question # 22:

An external object storage container has been mounted to the location /mnt/finance_eda_bucket.

The following logic was executed to create a database for the finance team:

[code image not shown]

After the database was successfully created and permissions configured, a member of the finance team runs the following code:

[code image not shown]

If all users on the finance team are members of the finance group, which statement describes how the tx_sales table will be created?

Options:

A.

A logical table will persist the query plan to the Hive Metastore in the Databricks control plane.


B.

An external table will be created in the storage container mounted to /mnt/finance_eda_bucket.


C.

A logical table will persist the physical plan to the Hive Metastore in the Databricks control plane.


D.

A managed table will be created in the storage container mounted to /mnt/finance_eda_bucket.


E.

A managed table will be created in the DBFS root storage container.


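The two code snippets referenced above are images that are not reproduced here. Purely as an illustration of the scenario they describe (all statements, names, and source tables below are hypothetical, not the question's actual code), creating a database at a mounted location and then creating a table inside it might look like:

# Hypothetical sketch; not the actual snippets from the question.
spark.sql("""
    CREATE DATABASE IF NOT EXISTS finance_eda
    LOCATION '/mnt/finance_eda_bucket/finance_eda.db'
""")

# Later, a member of the finance group creates a table without specifying a path.
spark.sql("""
    CREATE TABLE finance_eda.tx_sales AS
    SELECT * FROM tx_sales_source
""")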
Question # 23:

The marketing team is looking to share data in an aggregate table with the sales organization, but the field names used by the teams do not match, and a number of marketing-specific fields have not been approved for the sales org.

Which of the following solutions addresses the situation while emphasizing simplicity?

Options:

A.

Create a view on the marketing table selecting only those fields approved for the sales team; alias the names of any fields that should be standardized to the sales naming conventions.


B.

Use a CTAS statement to create a derivative table from the marketing table and configure a production job to propagate changes.


C.

Add a parallel table write to the current production pipeline, updating a new sales table that varies as required from the marketing table.


D.

Create a new table with the required schema and use Delta Lake's DEEP CLONE functionality to sync changes committed to one table to the corresponding table.


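As a sketch of the view-based approach described in option A (all table, view, and column names here are hypothetical):

# Expose only approved columns, renamed to the sales naming conventions.
spark.sql("""
    CREATE OR REPLACE VIEW sales.campaign_summary AS
    SELECT
        campaign_id     AS promo_id,
        total_spend     AS spend_usd,
        conversion_rate
    FROM marketing.campaign_agg
""")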
Question # 24:

A data architect has heard about Delta Lake's built-in versioning and time travel capabilities. For auditing purposes, they have a requirement to maintain a full history of all valid street addresses as they appear in the customers table.

The architect is interested in implementing a Type 1 table, overwriting existing records with new values and relying on Delta Lake time travel to support long-term auditing. A data engineer on the project feels that a Type 2 table will provide better performance and scalability.

Which piece of information is critical to this decision?

Options:

A.

Delta Lake time travel does not scale well in cost or latency to provide a long-term versioning solution.


B.

Delta Lake time travel cannot be used to query previous versions of these tables because Type 1 changes modify data files in place.


C.

Shallow clones can be combined with Type 1 tables to accelerate historic queries for long-term versioning.


D.

Data corruption can occur if a query fails in a partially completed state because Type 2 tables require setting multiple fields in a single update.


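For reference, Delta Lake time travel reads against the customers table would look like the following (the version number and timestamp are placeholders):

# Query earlier snapshots of a Delta table by version or by timestamp.
v12 = spark.sql("SELECT * FROM customers VERSION AS OF 12")
jan1 = spark.sql("SELECT * FROM customers TIMESTAMP AS OF '2024-01-01'")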
Question # 25:

Given the following error traceback:

AnalysisException: cannot resolve 'heartrateheartrateheartrate' given input columns:

[spark_catalog.database.table.device_id, spark_catalog.database.table.heartrate,

spark_catalog.database.table.mrn, spark_catalog.database.table.time]

The code snippet was:

display(df.select(3*"heartrate"))

Which statement describes the error being raised?

Options:

A.

There is a type error because a DataFrame object cannot be multiplied.


B.

There is a syntax error because the heartrate column is not correctly identified as a column.


C.

There is no column in the table named heartrateheartrateheartrate.


D.

There is a type error because a column object cannot be multiplied.


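To see why the message names heartrateheartrateheartrate: multiplying a Python string repeats it, so the select receives a single column name that does not exist in the table. A minimal sketch of the difference, assuming df is the DataFrame from the question and the alias is arbitrary:

from pyspark.sql import functions as F

repeated = 3 * "heartrate"      # Python string repetition -> "heartrateheartrateheartrate"
# df.select(repeated)           # would raise the AnalysisException above: no such column exists

df.select((F.col("heartrate") * 3).alias("heartrate_x3"))   # multiplies the column's values instead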
Question # 26:

The business reporting team requires that data for their dashboards be updated every hour. The total processing time for the pipeline that extracts, transforms, and loads the data is 10 minutes.

Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?

Options:

A.

Schedule a job to execute the pipeline once an hour on a dedicated interactive cluster.


B.

Schedule a Structured Streaming job with a trigger interval of 60 minutes.


C.

Schedule a job to execute the pipeline once an hour on a new job cluster.


D.

Configure a job that executes every time new data lands in a given directory.


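For context, an hourly job that runs on an ephemeral job cluster is normally captured in the job's settings; a rough sketch in Jobs API 2.1 style, where the job name, cluster sizing, and notebook path are all placeholders:

# Rough sketch of job settings; every value below is a placeholder.
job_settings = {
    "name": "hourly_reporting_etl",
    "schedule": {
        "quartz_cron_expression": "0 0 * * * ?",   # top of every hour
        "timezone_id": "UTC",
    },
    "tasks": [
        {
            "task_key": "run_pipeline",
            "notebook_task": {"notebook_path": "/Repos/reporting/etl"},
            "new_cluster": {                       # job cluster created per run, terminated after
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
}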
Question # 27:

The data science team has requested assistance in accelerating queries on free-form text from user reviews. The data is currently stored in Parquet with the schema below:

item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING

The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field.

A junior data engineer suggests that converting this data to Delta Lake will improve query performance.

Which response to the junior data engineer's suggestion is correct?

Options:

A.

Delta Lake statistics are not optimized for free text fields with high cardinality.


B.

Text data cannot be stored with Delta Lake.


C.

ZORDER ON review will need to be run to see performance gains.


D.

The Delta log creates a term matrix for free text fields to support selective filtering.


E.

Delta Lake statistics are only collected on the first 4 columns in a table.


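For context, the workload in question amounts to a substring search over the review column; a minimal sketch of the conversion and the query pattern, where the path and the sample keywords are placeholders:

from pyspark.sql import functions as F

# Convert the existing Parquet directory to Delta in place (path is a placeholder).
spark.sql("CONVERT TO DELTA parquet.`/mnt/reviews/parquet`")

# The team's workload: checking whether any of ~30 key words appear in the free text.
keywords = ["refund", "broken", "excellent"]        # a few hypothetical examples
pattern = "|".join(keywords)
matches = (spark.read.format("delta")
    .load("/mnt/reviews/parquet")
    .filter(F.col("review").rlike(pattern)))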
Question # 28:

The data architect has mandated that all tables in the Lakehouse should be configured as external (also known as "unmanaged") Delta Lake tables.

Which approach will ensure that this requirement is met?

Options:

A.

When a database is being created, make sure that the LOCATION keyword is used.


B.

When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.


C.

When data is saved to a table, make sure that a full file path is specified alongside the Delta format.


D.

When tables are created, make sure that the EXTERNAL keyword is used in the CREATE TABLE statement.


E.

When the workspace is being configured, make sure that external cloud object storage has been mounted.


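For reference, an external (unmanaged) Delta table is one whose data lives at a path supplied at creation time; two common ways to create one, with all names, schemas, and paths below as placeholders:

# SQL: an explicit LOCATION makes the table external.
spark.sql("""
    CREATE TABLE sales_external (sale_id INT, amount DOUBLE)
    USING DELTA
    LOCATION '/mnt/datalake/tables/sales'
""")

# DataFrame API: supplying a path with saveAsTable also produces an external table.
df = spark.table("staging.orders")      # placeholder source
(df.write
    .format("delta")
    .option("path", "/mnt/datalake/tables/orders")
    .saveAsTable("orders_external"))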
Question # 29:

A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that in the task summary metrics for a particular stage, the minimum and median task durations are roughly the same, while the maximum duration is roughly 100 times as long as the minimum.

Which situation is causing increased duration of the overall job?

Options:

A.

Task queueing resulting from improper thread pool assignment.


B.

Spill resulting from attached volume storage being too small.


C.

Network latency due to some cluster nodes being in different regions from the source data.


D.

Skew caused by more data being assigned to a subset of Spark partitions.


E.

Credential validation errors while pulling data from an external system.


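One way to confirm and mitigate skew of this kind (the table and key column below are placeholders):

from pyspark.sql import functions as F

# Confirm skew: compare row counts per join/grouping key.
(spark.table("events")
    .groupBy("customer_id")
    .count()
    .orderBy(F.desc("count"))
    .show(10))

# Let Adaptive Query Execution split oversized partitions during joins.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")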
Question # 30:

The data engineering team maintains the following code:

[code image not shown]

Assuming that this code produces logically correct results and the data in the source table has been de-duplicated and validated, which statement describes what will occur when this code is executed?

Options:

A.

The silver_customer_sales table will be overwritten by aggregated values calculated from all records in the gold_customer_lifetime_sales_summary table as a batch job.


B.

A batch job will update the gold_customer_lifetime_sales_summary table, replacing only those rows that have different values than the current version of the table, using customer_id as the primary key.


C.

The gold_customer_lifetime_sales_summary table will be overwritten by aggregated values calculated from all records in the silver_customer_sales table as a batch job.


D.

An incremental job will leverage running information in the state store to update aggregate values in the gold_customer_lifetime_sales_summary table.


E.

An incremental job will detect if new rows have been written to the silver_customer_sales table; if new rows are detected, all aggregates will be recalculated and used to overwrite the gold_customer_lifetime_sales_summary table.


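The code referenced above is an image that is not reproduced here. Purely as an illustration of the pattern the options describe (the aggregation columns are hypothetical, and this is not the question's actual code), a batch job that rebuilds a gold aggregate from a silver table has roughly this shape:

from pyspark.sql import functions as F

# Hypothetical shape of a batch aggregate-and-overwrite job.
(spark.table("silver_customer_sales")
    .groupBy("customer_id")
    .agg(
        F.sum("sale_amount").alias("lifetime_sales"),   # column names are placeholders
        F.count("*").alias("order_count"),
    )
    .write
    .mode("overwrite")
    .saveAsTable("gold_customer_lifetime_sales_summary"))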