
Pass the Databricks Certification Databricks-Certified-Professional-Data-Engineer Questions and Answers with CertsForce

Viewing page 6 of 6 (questions 51-60)
Question # 51:

The data governance team has instituted a requirement that all tables containing Personal Identifiable Information (PII) must be clearly annotated. This includes adding column comments, table comments, and setting the custom table property 'contains_pii' = true.

The following SQL DDL statement is executed to create a new table:

[CREATE TABLE DDL statement shown as an image in the original; not reproduced here]

Which command allows manual confirmation that these three requirements have been met?

Options:

A.

DESCRIBE EXTENDED dev.pii_test


B.

DESCRIBE DETAIL dev.pii_test


C.

SHOW TBLPROPERTIES dev.pii_test


D.

DESCRIBE HISTORY dev.pii_test


E.

SHOW TABLES dev


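For reference, a minimal sketch of the kind of annotated DDL this question describes plus the verification step, written as PySpark SQL calls; the column names are illustrative, since the original DDL image is not reproduced:

    # Hypothetical table meeting the three governance requirements:
    # column comments, a table comment, and the contains_pii property.
    spark.sql("""
        CREATE TABLE dev.pii_test (
            id   INT    COMMENT 'Internal identifier',
            name STRING COMMENT 'PII: customer name'
        )
        COMMENT 'Contains PII'
        TBLPROPERTIES ('contains_pii' = 'true')
    """)

    # DESCRIBE EXTENDED returns column comments, the table comment, and
    # table properties in a single output, so all three requirements can
    # be confirmed manually with one command.
    spark.sql("DESCRIBE EXTENDED dev.pii_test").show(truncate=False)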
Question # 52:

Which statement regarding Spark configuration on the Databricks platform is true?

Options:

A.

Spark configuration properties set for an interactive cluster with the Clusters UI will impact all notebooks attached to that cluster.


B.

When the same Spark configuration property is set for an interactive cluster and for a notebook attached to that cluster, the notebook setting will always be ignored.


C.

Spark configuration set within a notebook will affect all SparkSessions attached to the same interactive cluster.


D.

The Databricks REST API can be used to modify the Spark configuration properties for an interactive cluster without interrupting jobs.


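As a hedged illustration of the scoping at issue here: properties set with spark.conf.set() in a notebook apply only to that notebook's SparkSession, while properties set for the cluster in the Clusters UI apply to every attached notebook. The property name below is just an example:

    # Affects only this notebook's SparkSession, not other notebooks
    # attached to the same interactive cluster.
    spark.conf.set("spark.sql.shuffle.partitions", "64")

    # Cluster-level properties set in the Clusters UI are visible from
    # every attached notebook when read back like this.
    print(spark.conf.get("spark.sql.shuffle.partitions"))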
Question # 53:

A company wants to implement Lakehouse Federation across multiple data sources but is concerned about data consistency and ensuring that all teams access the same authoritative version of their data.

Which statement is applicable to Lakehouse Federation for maintaining data consistency?

Options:

A.

Federation provides read-only access that reflects the current state of source systems.


B.

Federation implements change data capture (CDC) from all sources.


C.

A separate data synchronization service must be deployed.


D.

Federation creates local copies that must be manually refreshed.


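As context, a hedged sketch of how a federated source is typically registered with Unity Catalog; the connection, catalog, database, and secret names here are all hypothetical:

    # Register a connection to an external PostgreSQL source.
    spark.sql("""
        CREATE CONNECTION IF NOT EXISTS pg_conn TYPE postgresql
        OPTIONS (
            host 'pg.example.com',
            port '5432',
            user secret('fed_scope', 'pg_user'),
            password secret('fed_scope', 'pg_password')
        )
    """)

    # Expose the source as a read-only foreign catalog; queries pass
    # through to the live source system, so every team sees the same
    # authoritative, current state of the data.
    spark.sql("""
        CREATE FOREIGN CATALOG IF NOT EXISTS pg_sales
        USING CONNECTION pg_conn OPTIONS (database 'sales')
    """)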
Question # 54:

A Delta Lake table representing metadata about content posts from users has the following schema:

user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE

This table is partitioned by the date column. A query is run with the following filter:

longitude < 20 and longitude > -20

Which statement describes how data will be filtered?

Options:

A.

Statistics in the Delta Log will be used to identify partitions that might include files in the filtered range.


B.

No file skipping will occur because the optimizer does not know the relationship between the partition column and the longitude.


C.

The Delta Engine will use row-level statistics in the transaction log to identify the files that meet the filter criteria.


D.

Statistics in the Delta Log will be used to identify data files that might include records in the filtered range.


E.

The Delta Engine will scan the parquet file footers to identify each row that meets the filter criteria.


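A brief sketch of the query in question, assuming a hypothetical table name. Delta records per-file min/max statistics for data columns in the transaction log, so files whose longitude range falls entirely outside (-20, 20) can be skipped even though longitude is not the partition column:

    # Hypothetical table holding the post metadata described above.
    posts = spark.read.table("posts")

    # File-level (not row-level) min/max statistics in the Delta log let
    # the engine skip data files that cannot contain matching records.
    filtered = posts.filter("longitude < 20 AND longitude > -20")
    filtered.explain()  # inspect the pushed-down data filters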
Question # 55:

A data engineer created a daily batch ingestion pipeline using a cluster with the latest DBR version to store banking transaction data, and persisted it in a MANAGED DELTA table called prod.gold.all_banking_transactions_daily. The data engineer is constantly receiving complaints from business users who query this table ad hoc through a SQL Serverless Warehouse about poor query performance. Upon analysis, the data engineer identified that these users frequently use high-cardinality columns as filters. The engineer now seeks to implement a data layout optimization technique that is incremental, easy to maintain, and can evolve over time.

Which command should the data engineer implement?

Options:

A.

Alter the table to use Hive-Style Partitions + Z-ORDER and implement a periodic OPTIMIZE command.


B.

Alter the table to use Liquid Clustering and implement a periodic OPTIMIZE command.


C.

Alter the table to use Hive-Style Partitions and implement a periodic OPTIMIZE command.


D.

Alter the table to use Z-ORDER and implement a periodic OPTIMIZE command.


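A hedged sketch of the Liquid Clustering approach, using illustrative stand-ins for the high-cardinality columns the business users filter on:

    # Switch the table to Liquid Clustering; unlike Hive-style partitions,
    # the clustering keys can be changed later without rewriting the table.
    spark.sql("""
        ALTER TABLE prod.gold.all_banking_transactions_daily
        CLUSTER BY (account_id, transaction_id)
    """)

    # Periodic OPTIMIZE incrementally clusters newly written data.
    spark.sql("OPTIMIZE prod.gold.all_banking_transactions_daily")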
Question # 56:

A Delta Lake table representing metadata about content posts from users has the following schema:

    user_id LONG

    post_text STRING

    post_id STRING

    longitude FLOAT

    latitude FLOAT

    post_time TIMESTAMP

    date DATE

Based on the above schema, which column is a good candidate for partitioning the Delta Table?

Options:

A.

date


B.

user_id


C.

post_id


D.

post_time


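For illustration, a minimal sketch of creating the table partitioned on date (the table name is hypothetical): date is low-cardinality and commonly filtered on, whereas high-cardinality columns such as user_id, post_id, or post_time would produce an excessive number of small partitions:

    spark.sql("""
        CREATE TABLE posts (
            user_id LONG, post_text STRING, post_id STRING,
            longitude FLOAT, latitude FLOAT,
            post_time TIMESTAMP, date DATE
        )
        PARTITIONED BY (date)
    """)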
Question # 57:

Which statement is true regarding the retention of job run history?

Options:

A.

It is retained until you export or delete job run logs


B.

It is retained for 30 days, during which time you can deliver job run logs to DBFS or S3


C.

It is retained for 60 days, during which you can export notebook run results to HTML


D.

It is retained for 60 days, after which logs are archived


E.

It is retained for 90 days or until the run-id is re-used through custom run configuration


Question # 58:

A Structured Streaming job deployed to production has been resulting in higher than expected cloud storage costs. At present, during normal execution, each micro-batch of data is processed in less than 3 seconds; at least 12 times per minute, a micro-batch is processed that contains 0 records. The streaming write was configured using the default trigger settings. The production job is currently scheduled alongside many other Databricks jobs in a workspace with instance pools provisioned to reduce start-up time for jobs with batch execution.

Holding all other variables constant and assuming records need to be processed in less than 10 minutes, which adjustment will meet the requirement?

Options:

A.

Set the trigger interval to 500 milliseconds; setting a small but non-zero trigger interval ensures that the source is not queried too frequently.


B.

Set the trigger interval to 3 seconds; the default trigger interval is consuming too many records per batch, resulting in spill to disk that can increase volume costs.


C.

Set the trigger interval to 10 minutes; each batch calls APIs in the source storage account, so decreasing trigger frequency to the maximum allowable threshold should minimize this cost.


D.

Use the trigger once option and configure a Databricks job to execute the query every 10 minutes; this approach minimizes costs for both compute and storage.


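A hedged sketch of the pattern option D describes: instead of the default trigger polling storage every 500 ms (and paying API costs for empty micro-batches), run the stream so it drains available data and stops, scheduled as a Databricks job every 10 minutes. Table and checkpoint names are hypothetical; on recent DBR versions, trigger(availableNow=True) is the successor to the classic trigger-once option:

    (spark.readStream
          .table("bronze.events")
          .writeStream
          .trigger(availableNow=True)  # process available data, then stop
          .option("checkpointLocation", "/chk/bronze_to_silver")
          .toTable("silver.events"))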
Question # 59:

A data engineer is designing a Pandas UDF to process financial time-series data with complex calculations that require maintaining state across rows within each stock symbol group, and must ensure the function is efficient and scalable.

Which approach will solve the problem with minimum overhead while preserving data integrity?

Options:

A.

Use a scalar_iter Pandas UDF with iterator-based processing, implementing state management through persistent storage (Delta tables) that gets updated after each batch to maintain continuity across iterator chunks.


B.

Use a scalar Pandas UDF that processes the entire dataset at once, implementing custom partitioning logic within the UDF to group by stock symbol and maintain state using global variables shared across all executor processes.


C.

Use applyInPandas on a Spark DataFrame so that each stock symbol group is received as a pandas DataFrame, allowing processing within each group while maintaining state variables local to each group’s processing function.


D.

Use a grouped-aggregate Pandas UDF that processes each stock symbol group independently, maintaining state through intermediate aggregation results that get passed between successive UDF calls via broadcast variables.


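A hedged sketch of the applyInPandas approach from option C; the DataFrame and column names (trades_df, symbol, trade_time, volume) are illustrative:

    import pandas as pd

    def add_running_volume(pdf: pd.DataFrame) -> pd.DataFrame:
        # Each call receives every row for ONE stock symbol as a single
        # pandas DataFrame, so state lives in plain local variables
        # scoped to that group's processing.
        pdf = pdf.sort_values("trade_time")
        pdf["cum_volume"] = pdf["volume"].cumsum()
        return pdf

    result = (trades_df
              .groupBy("symbol")
              .applyInPandas(
                  add_running_volume,
                  schema="symbol STRING, trade_time TIMESTAMP, "
                         "volume LONG, cum_volume LONG"))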
Question # 60:

The data science team has requested assistance in accelerating queries on free-form text from user reviews. The data is currently stored in Parquet with the schema below:

item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING

The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 keywords exist in this field.

A junior data engineer suggests converting this data to Delta Lake will improve query performance.

Which response to the junior data engineer's suggestion is correct?

Options:

A.

Delta Lake statistics are not optimized for free text fields with high cardinality.


B.

Text data cannot be stored with Delta Lake.


C.

ZORDER ON review will need to be run to see performance gains.


D.

The Delta log creates a term matrix for free text fields to support selective filtering.


E.

Delta Lake statistics are only collected on the first 4 columns in a table.


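For context, a sketch of the keyword search the team wants to run, with a hypothetical table name and an abbreviated keyword list. Delta's per-file min/max statistics cannot prune files for substring matches inside a long free-text column, which is the point option A makes:

    # Build an OR of LIKE predicates over the 30 keywords (3 shown here).
    keywords = ["refund", "broken", "excellent"]
    cond = " OR ".join(f"review LIKE '%{k}%'" for k in keywords)

    # Min/max stats on a high-cardinality free-text column cannot rule
    # files in or out for these predicates, so every file is still read.
    hits = spark.read.table("reviews").where(cond)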