Pass the Databricks Certification Databricks-Certified-Professional-Data-Engineer questions and answers with CertsForce

Question # 31:

The data engineering team maintains a table of aggregate statistics through nightly batch updates. This includes total sales for the previous day alongside totals and averages for a variety of time periods, including the 7 previous days, year-to-date, and quarter-to-date. This table is named store_sales_summary and the schema is as follows:

(store_sales_summary schema shown in an image; not reproduced here)

The table daily_store_sales contains all the information needed to update store_sales_summary. The schema for this table is:

store_id INT, sales_date DATE, total_sales FLOAT

If daily_store_sales is implemented as a Type 1 table and the total_sales column might be adjusted after manual data auditing, which approach is the safest to generate accurate reports in the store_sales_summary table?

Options:

A.

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and overwrite the store_sales_summary table with each update.


B.

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and append new rows nightly to the store_sales_summary table.


C.

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.


D.

Implement the appropriate aggregate logic as a Structured Streaming read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.


E.

Use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update.
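For context on the batch-recompute pattern named in option A, here is a minimal, hedged sketch: the summary is fully recomputed from daily_store_sales each night and the result overwrites store_sales_summary, so audited corrections to total_sales are always reflected. The aggregate column names are assumptions, since the store_sales_summary schema image is not reproduced, and spark is assumed to be the active SparkSession.

# Hypothetical nightly batch job; aggregate column names are illustrative only.
from pyspark.sql import functions as F

daily = spark.table("daily_store_sales")

summary = (daily.groupBy("store_id").agg(
    F.sum(F.when(F.col("sales_date") == F.date_sub(F.current_date(), 1),
                 F.col("total_sales"))).alias("previous_day_sales"),
    F.sum(F.when(F.col("sales_date") >= F.date_sub(F.current_date(), 7),
                 F.col("total_sales"))).alias("trailing_7_day_sales")))

# Overwrite so corrected source rows are always picked up by the summary.
summary.write.mode("overwrite").saveAsTable("store_sales_summary")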


Expert Solution
Question # 32:

A data team's Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new field to track the number of times a promotion code is used for each item. A junior data engineer suggests updating the existing query as follows. Note that proposed changes are in bold.

(proposed streaming query shown in an image; not reproduced here)

Which step must also be completed to put the proposed query into production?

Options:

A.

Increase the shuffle partitions to account for additional aggregates


B.

Specify a new checkpoint location


C.

Run REFRESH TABLE delta.`/item_agg`


D.

Remove .option("mergeSchema", "true") from the streaming write
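As background for the checkpoint discussion in the options above, a minimal sketch of restarting a changed streaming aggregation against a brand-new checkpoint location; the source table, column names, and paths are assumptions.

from pyspark.sql import functions as F

item_sales = spark.readStream.table("item_sales")   # assumed streaming source table

updated_agg = (item_sales.groupBy("item_id").agg(
    F.sum("quantity").alias("total_sold"),
    F.count("promo_code").alias("promo_code_uses")))   # newly added aggregate

(updated_agg.writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/checkpoints/item_agg_v2")   # fresh location for the changed query
    .toTable("item_agg"))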


Expert Solution
Question # 33:

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

email STRING, age INT, ltv INT

The following view definition is executed:

(view definition shown in an image; not reproduced here)

An analyst who is not a member of the marketing group executes the following query:

SELECT * FROM email_ltv

Which statement describes the results returned by this query?

Options:

A.

Three columns will be returned, but one column will be named "redacted" and contain only null values.


B.

Only the email and ltv columns will be returned; the email column will contain all null values.


C.

The email and ltv columns will be returned with the values in user_ltv.


D.

The email, age, and ltv columns will be returned with the values in user_ltv.


E.

Only the email and ltv columns will be returned; the email column will contain the string "REDACTED" in each row.
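The actual view definition is only shown in an image, but a commonly documented shape for this kind of column-redacting dynamic view looks roughly like the following; the group name and the redaction literal are assumptions.

spark.sql("""
    CREATE OR REPLACE VIEW email_ltv AS
    SELECT
      CASE WHEN is_member('marketing') THEN email ELSE 'REDACTED' END AS email,
      ltv
    FROM user_ltv
""")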


Expert Solution
Question # 34:

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.

The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications.

The data engineer is trying to determine the best approach for dealing with schema declaration given the highly-nested structure of the data and the numerous fields.

Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?

Options:

A.

The Tungsten encoding used by Databricks is optimized for storing string data; newly-added native support for querying JSON strings means that string types are always most efficient.


B.

Because Delta Lake uses Parquet for data storage, data types can be easily evolved by just modifying file footer information in place.


C.

Human labor in writing code is the largest cost associated with data engineering workloads; as such, automating table declaration logic should be a priority in all migration workloads.


D.

Because Databricks will infer schema using types that allow all observed data to be processed, setting types manually provides greater assurance of data quality enforcement.


E.

Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.
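To make the trade-off concrete, a hedged sketch of explicitly declaring a schema for a handful of fields instead of relying on inference; the field names, types, and source path are hypothetical.

from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType

declared_schema = StructType([
    StructField("device_id", LongType(), nullable=False),
    StructField("reading", DoubleType(), nullable=True),
    StructField("metadata", StringType(), nullable=True),   # keep rarely-used nested fields as a raw string
])

# Records that do not conform to the declared types surface immediately
# instead of silently widening an inferred schema.
raw = spark.read.schema(declared_schema).json("/mnt/source/device_recordings/")
raw.write.format("delta").mode("append").saveAsTable("silver_device_recordings")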


Expert Solution
Question # 35:

The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE".

(model scoring code shown in an image; not reproduced here)
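A hedged guess at the kind of code such an image typically shows: loading the registered model as a Spark UDF and scoring a feature table. The model URI and the feature table name are assumptions, not taken from the question.

import mlflow
from pyspark.sql import functions as F

predict = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn_model/Production")

features = spark.table("churn_features")   # assumed feature table
preds = (features
    .select("customer_id",
            predict(*[c for c in features.columns if c != "customer_id"]).alias("predictions"))
    .withColumn("date", F.current_date()))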

The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day.

Which code block accomplishes this task while minimizing potential compute costs?

Options:

A.

preds.write.mode("append").saveAsTable("churn_preds")


B.

preds.write.format("delta").save("/preds/churn_preds")


C.

(Option C code shown in an image; not reproduced here)


D.

(Option D code shown in an image; not reproduced here)


E.

(Option E code shown in an image; not reproduced here)


Expert Solution
Question # 36:

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

(user_ltv schema shown in an image; not reproduced here)

An analyst who is not a member of the auditing group executes the following query:

(query shown in an image; not reproduced here)

Which result will be returned by this query?

Options:

A.

All columns will be displayed normally for those records that have an age greater than 18; records not meeting this condition will be omitted.


B.

All columns will be displayed normally for those records that have an age greater than 17; records not meeting this condition will be omitted.


C.

All age values less than 18 will be returned as null values; all other columns will be returned with the values in user_ltv.


D.

All records from all columns will be displayed with the values in user_ltv.
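Since the view in this question is also only shown as an image, here is the typical row-filtering counterpart to the column-redacting dynamic view shown earlier; the view name, group name, and age predicate are assumptions, not the question's actual definition.

spark.sql("""
    CREATE OR REPLACE VIEW restricted_user_ltv AS
    SELECT email, age, ltv
    FROM user_ltv
    WHERE is_member('auditing') OR age >= 18   -- hypothetical predicate
""")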


Expert Solution
Question # 37:

An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code:

df = spark.read.format("parquet").load(f"/mnt/source/{date}")

Which code block should be used to create the date Python variable used in the above code block?

Options:

A.

date = spark.conf.get("date")


B.

input_dict = input()

date= input_dict["date"]


C.

import sys

date = sys.argv[1]


D.

date = dbutils.notebooks.getParam("date")


E.

dbutils.widgets.text("date", "null")

date = dbutils.widgets.get("date")
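For reference, a minimal sketch of the widget-based pattern from option E wired into the question's load; the mount path comes from the question, and the default value is illustrative.

dbutils.widgets.text("date", "null")          # default when no parameter is supplied
date = dbutils.widgets.get("date")            # populated by the Jobs API parameter

df = spark.read.format("parquet").load(f"/mnt/source/{date}")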


Expert Solution
Question # 38:

A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.

In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?

Options:

A.

Set the configuration delta.deduplicate = true.


B.

VACUUM the Delta table after each batch completes.


C.

Perform an insert-only merge with a matching condition on a unique key.


D.

Perform a full outer join on a unique key and overwrite existing data.


E.

Rely on Delta Lake schema enforcement to prevent duplicate records.
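As a point of reference for the merge-based option, a hedged sketch of an insert-only MERGE keyed on a unique identifier; the table names, key column, and staging source are assumptions.

from delta.tables import DeltaTable

new_records = spark.table("staged_events").dropDuplicates(["event_id"])   # assumed incoming batch

target = DeltaTable.forName(spark, "events")

(target.alias("t")
    .merge(new_records.alias("s"), "t.event_id = s.event_id")
    .whenNotMatchedInsertAll()   # insert only rows whose key is not already present
    .execute())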


Expert Solution