Databricks Databricks-Certified-Professional-Data-Engineer Exam Questions Free Practice Test

Viewing page 4 out of 4 pages

Viewing questions 31-40 out of questions

Questions # 31:

The data engineering team maintains a table of aggregate statistics through nightly batch updates. This includes total sales for the previous day alongside totals and averages for a variety of time periods including the 7 previous days, year-to-date, and quarter-to-date. This table is named store_sales_summary and the schema is as follows:

Question # 31

The table daily_store_sales contains all the information needed to update store_sales_summary. The schema for this table is:

store_id INT, sales_date DATE, total_sales FLOAT

If daily_store_sales is implemented as a Type 1 table and the total_sales column might be adjusted after manual data auditing, which approach is the safest to generate accurate reports in the store_sales_summary table?

Options:

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and overwrite the store_sales_summary table with each update.

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and append new rows nightly to the store_sales_summary table.

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.

Implement the appropriate aggregate logic as a Structured Streaming read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.

Use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update.

Expert Solution

Answer

Explanation

The daily_store_sales table contains all the information needed to update store_sales_summary. The schema of the table is:

store_id INT, sales_date DATE, total_sales FLOAT

The daily_store_sales table is implemented as a Type 1 table, which means that old values are overwritten by new values and no history is maintained. The total_sales column might be adjusted after manual data auditing, which means that the data in the table may change over time.
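To make the Type 1 behaviour concrete, here is a minimal plain-Python sketch (not Spark code). The dictionary, the upsert_type1 and store_total helpers, and the sample figures are hypothetical, chosen only to show that a correction replaces the stored value and that a total computed before the correction goes stale.

```python
from datetime import date

# Hypothetical in-memory stand-in for daily_store_sales,
# keyed by (store_id, sales_date) -> total_sales.
daily_store_sales = {
    (1, date(2024, 1, 1)): 100.0,
    (1, date(2024, 1, 2)): 150.0,
}


def upsert_type1(table, store_id, sales_date, total_sales):
    """Type 1 write: overwrite the current value and keep no history."""
    table[(store_id, sales_date)] = total_sales


def store_total(table, store_id):
    """Recompute a store's total from the current table contents."""
    return sum(v for (sid, _), v in table.items() if sid == store_id)


# Total computed before the audit.
summary_before = store_total(daily_store_sales, 1)

# A manual audit corrects the figure for 1 January.
upsert_type1(daily_store_sales, 1, date(2024, 1, 1), 120.0)

# Total recomputed after the audit; summary_before is now out of date.
summary_after = store_total(daily_store_sales, 1)
```

The row count stays the same after the correction, so nothing in the table itself signals that an earlier summary needs to be refreshed.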

The safest approach to generate accurate reports in the store_sales_summary table is to use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update. Structured Streaming is a scalable and fault-tolerant stream processing engine built on Spark SQL. Structured Streaming allows processing data streams as if they were tables or DataFrames, using familiar operations such as select, filter, groupBy, or join. Structured Streaming also supports output modes that specify how to write the results of a streaming query to a sink, such as append, update, or complete. Structured Streaming can handle both streaming and batch data sources in a unified manner.

The change data feed is a Delta Lake feature that records row-level changes (inserts, updates, and deletes) made to a Delta table between versions. A Structured Streaming query can subscribe to these changes as ordered events for processing by downstream applications or services. Reads of the change data feed can start from a specific table version or timestamp.

By using Structured Streaming to subscribe to the change data feed for daily_store_sales, one can capture and process any changes made to the total_sales column due to manual data auditing. By applying these changes to the aggregates in the store_sales_summary table with each update, one can ensure that the reports are always consistent and accurate with the latest data. Verified References: [Databricks Certified Data Engineer Professional], under “Spark Core” section; Databricks Documentation, under “Structured Streaming” section; Databricks Documentation, under “Delta Change Data Feed” section.
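As a rough illustration of applying change-feed rows to an aggregate, the plain-Python sketch below mimics the _change_type values that the Delta Lake change data feed emits (insert, update_preimage, update_postimage, delete). The apply_changes helper, the totals dictionary, and the sample rows are hypothetical; a real pipeline would read the feed with Spark rather than from a list.

```python
from collections import defaultdict


def apply_changes(totals, changes):
    """Adjust per-store running totals from change-feed style rows.

    Each row is a dict with store_id, total_sales and _change_type.
    """
    for row in changes:
        kind = row["_change_type"]
        if kind in ("insert", "update_postimage"):
            # The new value enters the aggregate.
            totals[row["store_id"]] += row["total_sales"]
        elif kind in ("delete", "update_preimage"):
            # The old value leaves the aggregate.
            totals[row["store_id"]] -= row["total_sales"]
        else:
            raise ValueError(f"unexpected change type: {kind}")
    return totals


totals = defaultdict(float)

# Initial load: two days of sales for store 1.
apply_changes(totals, [
    {"store_id": 1, "total_sales": 100.0, "_change_type": "insert"},
    {"store_id": 1, "total_sales": 150.0, "_change_type": "insert"},
])

# Audit correction: 100.0 becomes 120.0, emitted as a pre/post image pair.
apply_changes(totals, [
    {"store_id": 1, "total_sales": 100.0, "_change_type": "update_preimage"},
    {"store_id": 1, "total_sales": 120.0, "_change_type": "update_postimage"},
])
```

Sums can be adjusted incrementally in this way; an average would also need a row count kept alongside the sum.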

Questions # 32:

A data team's Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new field to track the number of times a promotion code is used for each item. A junior data engineer suggests updating the existing query as follows. Note that the proposed changes are in bold.

Question # 32

Which step must also be completed to put the proposed query into production?

Options:

Increase the shuffle partitions to account for additional aggregates

Specify a new checkpointLocation

Run REFRESH TABLE delta.`/item_agg`

Remove .option('mergeSchema', 'true') from the streaming write

Expert Solution

Questions # 33:

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

email STRING, age INT, ltv INT

The following view definition is executed:

Question # 33

An analyst who is not a member of the marketing group executes the following query:

SELECT * FROM email_ltv

Which statement describes the results returned by this query?

Options:

Three columns will be returned, but one column will be named "redacted" and contain only null values.

Only the email and ltv columns will be returned; the email column will contain all null values.

The email and ltv columns will be returned with the values in user_ltv.

The email, age, and ltv columns will be returned with the values in user_ltv.

Only the email and ltv columns will be returned; the email column will contain the string "REDACTED" in each row.

Expert Solution

Questions # 34:

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.

The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications.

The data engineer is trying to determine the best approach for dealing with schema declaration given the highly nested structure of the data and the numerous fields.

Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?

Options:

The Tungsten encoding used by Databricks is optimized for storing string data; newly-added native support for querying JSON strings means that string types are always most efficient.

Because Delta Lake uses Parquet for data storage, data types can be easily evolved by just modifying file footer information in place.

Human labor in writing code is the largest cost associated with data engineering workloads; as such, automating table declaration logic should be a priority in all migration workloads.

Because Databricks will infer schema using types that allow all observed data to be processed, setting types manually provides greater assurance of data quality enforcement.

Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.

Expert Solution

Questions # 35:

The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE".

Question # 35

The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day.

Which code block accomplishes this task while minimizing potential compute costs?

Options:

preds.write.mode("append").saveAsTable("churn_preds")

preds.write.format("delta").save("/preds/churn_preds")

Option A

Option B

Option C

Option D

Option E

Expert Solution

Questions # 36:

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

Question # 36

An analyst who is not a member of the auditing group executes the following query:

Question # 36

Which result will be returned by this query?

Options:

All columns will be displayed normally for those records that have an age greater than 18; records not meeting this condition will be omitted.

All columns will be displayed normally for those records that have an age greater than 17; records not meeting this condition will be omitted.

All age values less than 18 will be returned as null values; all other columns will be returned with the values in user_ltv.

All records from all columns will be displayed with the values in user_ltv.

Expert Solution

Questions # 37:

An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code:

df = spark.read.format("parquet").load(f"/mnt/source/{date}")

Which code block should be used to create the date Python variable used in the above code block?

Options:

date = spark.conf.get("date")

input_dict = input()

date = input_dict["date"]

import sys

date = sys.argv[1]

date = dbutils.notebooks.getParam("date")

dbutils.widgets.text("date", "null")

date = dbutils.widgets.get("date")

Expert Solution

Answer

Explanation

The code block that should be used to create the date Python variable used in the above code block is:

dbutils.widgets.text("date", "null")
date = dbutils.widgets.get("date")

This code block uses the dbutils.widgets API to create a text widget named "date" and read its value as a string [1]. The default value of the widget is "null", so if no parameter is passed, the date variable will be the string "null". If a parameter is passed through the Databricks Jobs API, the date variable takes the value of that parameter. For example, if the parameter is "2021-11-01", the date variable will be "2021-11-01". The notebook can then use the date variable to load data from the specified path.

The other options are not correct, because:

Option A is incorrect because spark.conf.get("date") does not read a parameter passed through the Databricks Jobs API. The spark.conf API gets or sets Spark configuration properties, not notebook parameters [2].

Option B is incorrect because input() does not read a parameter passed through the Databricks Jobs API. The input() function reads user input from the standard input stream, not from the API request [3].

Option C is incorrect because sys.argv[1] does not read a parameter passed through the Databricks Jobs API. The sys.argv list holds the command-line arguments passed to a Python script, not parameters passed to a notebook [4].

Option D is incorrect because dbutils.notebooks.getParam("date") is not a valid call. The notebook utility is dbutils.notebook (singular), which provides run and exit for chaining notebooks; it has no getParam method for reading job parameters [5].

References: [1] Widgets, [2] Spark Configuration, [3] input(), [4] sys.argv, [5] Notebooks

Questions # 38:

A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.

In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?

Options:

Set the configuration delta.deduplicate = true.

VACUUM the Delta table after each batch completes.

Perform an insert-only merge with a matching condition on a unique key.

Perform a full outer join on a unique key and overwrite existing data.

Rely on Delta Lake schema enforcement to prevent duplicate records.

Expert Solution
