
Pass the Databricks Certification Databricks-Certified-Professional-Data-Engineer questions and answers with CertsForce

Question # 31:

Where in the Spark UI can one diagnose a performance problem induced by not leveraging predicate push-down?

Options:

A.

In the Executor's log file, by grepping for "predicate push-down"


B.

In the Stage's Detail screen, in the Completed Stages table, by noting the size of data read from the Input column


C.

In the Storage Detail screen, by noting which RDDs are not stored on disk


D.

In the Delta Lake transaction log, by noting the column statistics


E.

In the Query Detail screen, by interpreting the Physical Plan
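
For study purposes, here is a minimal sketch (not part of the exam item) of how a pushed-down predicate can be confirmed from a notebook: the same physical plan that the Spark UI renders on the Query Details screen can be printed with explain(), and for Parquet-backed sources a pushed-down filter appears in the scan node as PushedFilters. The table and column names below are taken from a later question and are assumptions here, and spark is assumed to be the pre-defined SparkSession in a Databricks notebook.

# Hedged sketch: print the physical plan to check for predicate push-down.
# Table/column names are assumptions; `spark` is the notebook's SparkSession.
df = spark.read.table("weather").filter("latitude > 66.3")

# For Parquet-backed sources, a pushed-down predicate shows up in the scan
# node (e.g., PushedFilters: [IsNotNull(latitude), GreaterThan(latitude,66.3)]).
df.explain(mode="formatted")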


Question # 32:

A data engineer wants to create a cluster using the Databricks CLI for a big ETL pipeline. The cluster should have five workers, one driver of type i3.xlarge, and should use the '14.3.x-scala2.12' runtime.

Which command should the data engineer use?

Options:

A.

databricks clusters create 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name DataEngineer_cluster


B.

databricks clusters add 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster


C.

databricks compute add 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster


D.

databricks compute create 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster
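
Separately from the CLI commands listed above, the same cluster can be created programmatically. The following is an illustrative sketch using the Databricks SDK for Python (databricks-sdk), which is not the tool the question asks about; it assumes the SDK is installed and that authentication is already configured.

# Hedged sketch using the Databricks SDK for Python, not the Databricks CLI
# that the question asks about. Assumes `pip install databricks-sdk` and that
# authentication (e.g., DATABRICKS_HOST / DATABRICKS_TOKEN) is configured.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

cluster = w.clusters.create(
    cluster_name="DataEngineer_cluster",
    spark_version="14.3.x-scala2.12",
    node_type_id="i3.xlarge",
    num_workers=5,
).result()  # waits until the cluster reaches a running state

print(cluster.cluster_id)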


Question # 33:

A Delta table of weather records is partitioned by date and has the following schema:

date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT

To find all the records from within the Arctic Circle, you execute a query with the following filter:

latitude > 66.3

Which statement describes how the Delta engine identifies which files to load?

Options:

A.

All records are cached to an operational database and then the filter is applied


B.

The Parquet file footers are scanned for min and max statistics for the latitude column


C.

All records are cached to attached storage and then the filter is applied


D.

The Delta log is scanned for min and max statistics for the latitude column


E.

The Hive metastore is scanned for min and max statistics for the latitude column
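
As a study aid, the query in question can be expressed as follows; this is a hedged sketch assuming a Databricks notebook where spark is already defined. Because latitude is not the partition column, file pruning relies on the per-file column statistics that Delta records for each data file.

# Hedged sketch: selective read against the weather table from the question.
# `latitude` is not the partition column, so pruning depends on the per-file
# column statistics Delta records for each data file.
arctic = spark.read.table("weather").filter("latitude > 66.3")

# Print the plan; runtime metrics such as files read/pruned are visible on
# the query's detail page in the Spark UI after the job runs.
arctic.explain(mode="formatted")
print(arctic.count())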


Question # 34:

A Structured Streaming job deployed to production has been resulting in higher than expected cloud storage costs. At present, during normal execution, each micro-batch of data is processed in less than 3 seconds; at least 12 times per minute, a micro-batch is processed that contains 0 records. The streaming write was configured using the default trigger settings. The production job is currently scheduled alongside many other Databricks jobs in a workspace with instance pools provisioned to reduce start-up time for jobs with batch execution. Holding all other variables constant and assuming records need to be processed in less than 10 minutes, which adjustment will meet the requirement?

Options:

A.

Set the trigger interval to 500 milliseconds; setting a small but non-zero trigger interval ensures that the source is not queried too frequently.


B.

Set the trigger interval to 3 seconds; the default trigger interval is consuming too many records per batch, resulting in spill to disk that can increase volume costs.


C.

Set the trigger interval to 10 minutes; each batch calls APIs in the source storage account, so decreasing trigger frequency to the maximum allowable threshold should minimize this cost.


D.

Use the trigger once option and configure a Databricks job to execute the query every 10 minutes; this approach minimizes costs for both compute and storage.
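
For reference, the trigger settings mentioned in these options map onto Structured Streaming's trigger API. The sketch below is illustrative only: df is assumed to be a streaming DataFrame, and the checkpoint path and table name are placeholders.

# Hedged sketch of the trigger settings discussed above. `df` is assumed to
# be a streaming DataFrame; the checkpoint path and table name are placeholders.

# Fixed-interval trigger: start a micro-batch at most every 10 minutes.
query = (
    df.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/example")
    .trigger(processingTime="10 minutes")
    .toTable("example_target")
)

# Alternative: process all available data and stop, letting a scheduled
# Databricks job re-run the query (e.g., every 10 minutes):
#   .trigger(availableNow=True)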


Question # 35:

In order to prevent accidental commits to production data, a senior data engineer has instituted a policy that all development work will reference clones of Delta Lake tables. After testing both deep and shallow clone, development tables are created using shallow clone.

A few weeks after initial table creation, the cloned versions of several tables implemented as Type 1 Slowly Changing Dimension (SCD) stop working. The transaction logs for the source tables show that vacuum was run the day before.

Why are the cloned tables no longer working?

Options:

A.

The data files compacted by vacuum are not tracked by the cloned metadata; running refresh on the cloned table will pull in recent changes.


B.

Because Type 1 changes overwrite existing records, Delta Lake cannot guarantee data consistency for cloned tables.


C.

The metadata created by the clone operation is referencing data files that were purged as invalid by the vacuum command


D.

Running vacuum automatically invalidates any shallow clones of a table; deep clone should always be used when a cloned table will be repeatedly queried.
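
For context, deep and shallow clones are created with the syntax below (an illustrative sketch; the table names are hypothetical). A shallow clone copies only the table metadata and continues to reference the source table's data files, while a deep clone copies the data files as well.

# Hedged sketch: creating development clones. Table names are hypothetical.
# A shallow clone references the source table's data files; a deep clone
# copies them into the clone's own location.
spark.sql("CREATE TABLE dev.customers_shallow SHALLOW CLONE prod.customers")
spark.sql("CREATE TABLE dev.customers_deep DEEP CLONE prod.customers")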


Question # 36:

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

email STRING, age INT, ltv INT

The following view definition is executed:

(The view definition is shown as an image in the original question and is not reproduced here.)
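
For context only, a typical Databricks dynamic-view redaction pattern looks like the following hypothetical sketch; the definition used in the actual exam item is not shown on this page and may differ.

# Hypothetical sketch of a dynamic view; the exam item's real definition is
# not shown on this page and may differ.
spark.sql("""
    CREATE OR REPLACE VIEW email_ltv AS
    SELECT
      CASE WHEN is_member('marketing') THEN email ELSE 'REDACTED' END AS email,
      ltv
    FROM user_ltv
""")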

An analyst who is not a member of the marketing group executes the following query:

SELECT * FROM email_ltv

Which statement describes the results returned by this query?

Options:

A.

Three columns will be returned, but one column will be named "redacted" and contain only null values.


B.

Only the email and ltv columns will be returned; the email column will contain all null values.


C.

The email and ltv columns will be returned with the values in user_ltv.


D.

The email, age, and ltv columns will be returned with the values in user_ltv.


E.

Only the email and ltv columns will be returned; the email column will contain the string "REDACTED" in each row.


Question # 37:

Which statement is true regarding the retention of job run history?

Options:

A.

It is retained until you export or delete job run logs


B.

It is retained for 30 days, during which time you can deliver job run logs to DBFS or S3


C.

It is retained for 60 days, during which you can export notebook run results to HTML


D.

It is retained for 60 days, after which logs are archived


E.

It is retained for 90 days or until the run-id is re-used through custom run configuration


Question # 38:

The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.

The following logic is used to process these records.

MERGE INTO customers
USING (
  SELECT updates.customer_id AS merge_key, updates.*
  FROM updates
  UNION ALL
  SELECT NULL AS merge_key, updates.*
  FROM updates JOIN customers
    ON updates.customer_id = customers.customer_id
  WHERE customers.current = true AND updates.address <> customers.address
) staged_updates
ON customers.customer_id = staged_updates.merge_key
WHEN MATCHED AND customers.current = true AND customers.address <> staged_updates.address THEN
  UPDATE SET current = false, end_date = staged_updates.effective_date
WHEN NOT MATCHED THEN
  INSERT (customer_id, address, current, effective_date, end_date)
  VALUES (staged_updates.customer_id, staged_updates.address, true, staged_updates.effective_date, null)

Which statement describes this implementation?


Options:

A.

The customers table is implemented as a Type 1 table; old values are overwritten by new values and no history is maintained.


B.

The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current and new values are inserted.


C.

The customers table is implemented as a Type 0 table; all writes are append only with no changes to existing values.


D.

The customers table is implemented as a Type 2 table; old values are overwritten and new customers are appended.
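
For reference, the same Type 2 pattern can also be written with the Delta Lake Python API. The sketch below mirrors the SQL above and is illustrative only; it assumes the delta-spark package plus the same customers table and updates view.

# Hedged sketch: the staged merge above expressed with the Delta Lake Python
# API. Assumes the delta-spark package and the same customers table and
# updates view.
from delta.tables import DeltaTable

staged_updates = spark.sql("""
    SELECT updates.customer_id AS merge_key, updates.*
    FROM updates
    UNION ALL
    SELECT NULL AS merge_key, updates.*
    FROM updates JOIN customers
      ON updates.customer_id = customers.customer_id
    WHERE customers.current = true AND updates.address <> customers.address
""")

(DeltaTable.forName(spark, "customers").alias("customers")
    .merge(staged_updates.alias("staged_updates"),
           "customers.customer_id = staged_updates.merge_key")
    .whenMatchedUpdate(
        condition="customers.current = true AND customers.address <> staged_updates.address",
        set={"current": "false", "end_date": "staged_updates.effective_date"})
    .whenNotMatchedInsert(
        values={"customer_id": "staged_updates.customer_id",
                "address": "staged_updates.address",
                "current": "true",
                "effective_date": "staged_updates.effective_date",
                "end_date": "null"})
    .execute())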


Question # 39:

The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day.

Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already-configured cluster?

Options:

A.

"Can Manage" privileges on the required cluster


B.

Workspace Admin privileges, cluster creation allowed, "Can Attach To" privileges on the required cluster


C.

Cluster creation allowed, "Can Attach To" privileges on the required cluster


D.

"Can Restart" privileges on the required cluster


E.

Cluster creation allowed, "Can Restart" privileges on the required cluster


Question # 40:

A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.

Which approach would simplify the identification of these changed records?

Options:

A.

Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.


B.

Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.


C.

Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.


D.

Modify the overwrite logic to include a field populated by calling spark.sql.functions.current_timestamp() as data are being written; use this field to identify records written on a particular date.


E.

Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.
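
For context, the change data feed referenced in the last option can be read incrementally as shown in this hedged sketch; it assumes the table property delta.enableChangeDataFeed was set to true on the table before the changes were written.

# Hedged sketch: read only the rows changed in the last 24 hours via the
# Delta change data feed. Assumes delta.enableChangeDataFeed = true was set
# on customer_churn_params before those changes were written.
from datetime import datetime, timedelta

start = (datetime.utcnow() - timedelta(hours=24)).strftime("%Y-%m-%d %H:%M:%S")

changed = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingTimestamp", start)
    .table("customer_churn_params")
)

# _change_type distinguishes inserts, update pre/post images, and deletes.
changed.where("_change_type != 'delete'").show()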

