
Pass the Databricks Certification Databricks-Certified-Professional-Data-Engineer Questions and Answers with CertsForce

Viewing page 6 of 6 (questions 51-60)
Question # 51:

The data governance team has instituted a requirement that all tables containing Personal Identifiable Information (PII) must be clearly annotated. This includes adding column comments, table comments, and setting the custom table property 'contains_pii' = true.

The following SQL DDL statement is executed to create a new table:

[CREATE TABLE DDL statement shown as an image in the original; not reproduced here]

Which command allows manual confirmation that these three requirements have been met?

Options:

A.

DESCRIBE EXTENDED dev.pii_test


B.

DESCRIBE DETAIL dev.pii_test


C.

SHOW TBLPROPERTIES dev.pii_test


D.

DESCRIBE HISTORY dev.pii_test


E.

SHOW TABLES dev


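For reference, a minimal sketch of the kind of annotated DDL this question describes plus the verification step, written as PySpark SQL calls; the column names are illustrative, since the original DDL image is not reproduced:

    # Hypothetical table meeting the three governance requirements:
    # column comments, a table comment, and the contains_pii property.
    spark.sql("""
        CREATE TABLE dev.pii_test (
            id   INT    COMMENT 'Internal identifier',
            name STRING COMMENT 'PII: customer name'
        )
        COMMENT 'Contains PII'
        TBLPROPERTIES ('contains_pii' = 'true')
    """)

    # DESCRIBE EXTENDED returns column comments, the table comment, and
    # table properties in a single output, so all three requirements can
    # be confirmed manually with one command.
    spark.sql("DESCRIBE EXTENDED dev.pii_test").show(truncate=False)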
Question # 52:

Which statement regarding Spark configuration on the Databricks platform is true?

Options:

A.

Spark configuration properties set for an interactive cluster with the Clusters UI will impact all notebooks attached to that cluster.


B.

When the same Spark configuration property is set for an interactive cluster and for a notebook attached to that cluster, the notebook setting will always be ignored.


C.

Spark configuration set within a notebook will affect all SparkSessions attached to the same interactive cluster.


D.

The Databricks REST API can be used to modify the Spark configuration properties for an interactive cluster without interrupting jobs.


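As a hedged illustration of the scoping at issue here: properties set with spark.conf.set() in a notebook apply only to that notebook's SparkSession, while properties set for the cluster in the Clusters UI apply to every attached notebook. The property name below is just an example:

    # Affects only this notebook's SparkSession, not other notebooks
    # attached to the same interactive cluster.
    spark.conf.set("spark.sql.shuffle.partitions", "64")

    # Cluster-level properties set in the Clusters UI are visible from
    # every attached notebook when read back like this.
    print(spark.conf.get("spark.sql.shuffle.partitions"))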
Question # 53:

A company wants to implement Lakehouse Federation across multiple data sources but is concerned about data consistency and ensuring that all teams access the same authoritative version of their data.

Which statement is applicable to Lakehouse Federation for maintaining data consistency?

Options:

A.

Federation provides read-only access that reflects the current state of source systems.


B.

Federation implements change data capture (CDC) from all sources.


C.

A separate data synchronization service must be deployed.


D.

Federation creates local copies that must be manually refreshed.


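As context, a hedged sketch of how a federated source is typically registered with Unity Catalog; the connection, catalog, database, and secret names here are all hypothetical:

    # Register a connection to an external PostgreSQL source.
    spark.sql("""
        CREATE CONNECTION IF NOT EXISTS pg_conn TYPE postgresql
        OPTIONS (
            host 'pg.example.com',
            port '5432',
            user secret('fed_scope', 'pg_user'),
            password secret('fed_scope', 'pg_password')
        )
    """)

    # Expose the source as a read-only foreign catalog; queries pass
    # through to the live source system, so every team sees the same
    # authoritative, current state of the data.
    spark.sql("""
        CREATE FOREIGN CATALOG IF NOT EXISTS pg_sales
        USING CONNECTION pg_conn OPTIONS (database 'sales')
    """)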
Question # 54:

A Delta Lake table representing metadata about content posts from users has the following schema:

user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE

This table is partitioned by the date column. A query is run with the following filter:

longitude < 20 and longitude > -20

Which statement describes how data will be filtered?

Options:

A.

Statistics in the Delta Log will be used to identify partitions that might include files in the filtered range.


B.

No file skipping will occur because the optimizer does not know the relationship between the partition column and the longitude.


C.

The Delta Engine will use row-level statistics in the transaction log to identify the files that meet the filter criteria.


D.

Statistics in the Delta Log will be used to identify data files that might include records in the filtered range.


E.

The Delta Engine will scan the parquet file footers to identify each row that meets the filter criteria.


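A brief sketch of the query in question, assuming a hypothetical table name. Delta records per-file min/max statistics for data columns in the transaction log, so files whose longitude range falls entirely outside (-20, 20) can be skipped even though longitude is not the partition column:

    # Hypothetical table holding the post metadata described above.
    posts = spark.read.table("posts")

    # File-level (not row-level) min/max statistics in the Delta log let
    # the engine skip data files that cannot contain matching records.
    filtered = posts.filter("longitude < 20 AND longitude > -20")
    filtered.explain()  # inspect the pushed-down data filters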
Question # 55:

A data engineer created a daily batch ingestion pipeline using a cluster with the latest DBR version to store banking transaction data, and persisted it in a MANAGED DELTA table called prod.gold.all_banking_transactions_daily. The data engineer is constantly receiving complaints from business users who query this table ad hoc through a SQL Serverless Warehouse about poor query performance. Upon analysis, the data engineer identified that these users frequently use high-cardinality columns as filters. The engineer now seeks to implement a data layout optimization technique that is incremental, easy to maintain, and can evolve over time.

Which command should the data engineer implement?

Options:

A.

Alter the table to use Hive-Style Partitions + Z-ORDER and implement a periodic OPTIMIZE command.


B.

Alter the table to use Liquid Clustering and implement a periodic OPTIMIZE command.


C.

Alter the table to use Hive-Style Partitions and implement a periodic OPTIMIZE command.


D.

Alter the table to use Z-ORDER and implement a periodic OPTIMIZE command.


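A hedged sketch of the Liquid Clustering approach, using illustrative stand-ins for the high-cardinality columns the business users filter on:

    # Switch the table to Liquid Clustering; unlike Hive-style partitions,
    # the clustering keys can be changed later without rewriting the table.
    spark.sql("""
        ALTER TABLE prod.gold.all_banking_transactions_daily
        CLUSTER BY (account_id, transaction_id)
    """)

    # Periodic OPTIMIZE incrementally clusters newly written data.
    spark.sql("OPTIMIZE prod.gold.all_banking_transactions_daily")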
Question # 56:

A Delta Lake table representing metadata about content posts from users has the following schema:

    user_id LONG

    post_text STRING

    post_id STRING

    longitude FLOAT

    latitude FLOAT

    post_time TIMESTAMP

    date DATE

Based on the above schema, which column is a good candidate for partitioning the Delta Table?

Options:

A.

date


B.

user_id


C.

post_id


D.

post_time


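For illustration, a minimal sketch of creating the table partitioned on date (the table name is hypothetical): date is low-cardinality and commonly filtered on, whereas high-cardinality columns such as user_id, post_id, or post_time would produce an excessive number of small partitions:

    spark.sql("""
        CREATE TABLE posts (
            user_id LONG, post_text STRING, post_id STRING,
            longitude FLOAT, latitude FLOAT,
            post_time TIMESTAMP, date DATE
        )
        PARTITIONED BY (date)
    """)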
Question # 57:

Which statement is true regarding the retention of job run history?

Options:

A.

It is retained until you export or delete job run logs


B.

It is retained for 30 days, during which time you can deliver job run logs to DBFS or S3


C.

It is retained for 60 days, during which you can export notebook run results to HTML


D.

It is retained for 60 days, after which logs are archived


E.

It is retained for 90 days or until the run-id is re-used through custom run configuration


Question # 58:

A Structured Streaming job deployed to production has been resulting in higher than expected cloud storage costs. At present, during normal execution, each micro-batch of data is processed in less than 3 seconds; at least 12 times per minute, a micro-batch is processed that contains 0 records. The streaming write was configured using the default trigger settings. The production job is currently scheduled alongside many other Databricks jobs in a workspace with instance pools provisioned to reduce start-up time for jobs with batch execution.

Holding all other variables constant and assuming records need to be processed in less than 10 minutes, which adjustment will meet the requirement?

Options:

A.

Set the trigger interval to 500 milliseconds; setting a small but non-zero trigger interval ensures that the source is not queried too frequently.


B.

Set the trigger interval to 3 seconds; the default trigger interval is consuming too many records per batch, resulting in spill to disk that can increase volume costs.


C.

Set the trigger interval to 10 minutes; each batch calls APIs in the source storage account, so decreasing trigger frequency to the maximum allowable threshold should minimize this cost.


D.

Use the trigger once option and configure a Databricks job to execute the query every 10 minutes; this approach minimizes costs for both compute and storage.


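A hedged sketch of the pattern option D describes: instead of the default trigger polling storage every 500 ms (and paying API costs for empty micro-batches), run the stream so it drains available data and stops, scheduled as a Databricks job every 10 minutes. Table and checkpoint names are hypothetical; on recent DBR versions, trigger(availableNow=True) is the successor to the classic trigger-once option:

    (spark.readStream
          .table("bronze.events")
          .writeStream
          .trigger(availableNow=True)  # process available data, then stop
          .option("checkpointLocation", "/chk/bronze_to_silver")
          .toTable("silver.events"))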
Question # 59:

A data engineer is designing a Pandas UDF to process financial time-series data with complex calculations that require maintaining state across rows within each stock symbol group, and must ensure the function is efficient and scalable.

Which approach will solve the problem with minimum overhead while preserving data integrity?

Options:

A.

Use a scalar_iter Pandas UDF with iterator-based processing, implementing state management through persistent storage (Delta tables) that gets updated after each batch to maintain continuity across iterator chunks.


B.

Use a scalar Pandas UDF that processes the entire dataset at once, implementing custom partitioning logic within the UDF to group by stock symbol and maintain state using global variables shared across all executor processes.


C.

Use applyInPandas on a Spark DataFrame so that each stock symbol group is received as a pandas DataFrame, allowing processing within each group while maintaining state variables local to each group’s processing function.


D.

Use a grouped-aggregate Pandas UDF that processes each stock symbol group independently, maintaining state through intermediate aggregation results that get passed between successive UDF calls via broadcast variables.


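A hedged sketch of the applyInPandas approach from option C; the DataFrame and column names (trades_df, symbol, trade_time, volume) are illustrative:

    import pandas as pd

    def add_running_volume(pdf: pd.DataFrame) -> pd.DataFrame:
        # Each call receives every row for ONE stock symbol as a single
        # pandas DataFrame, so state lives in plain local variables
        # scoped to that group's processing.
        pdf = pdf.sort_values("trade_time")
        pdf["cum_volume"] = pdf["volume"].cumsum()
        return pdf

    result = (trades_df
              .groupBy("symbol")
              .applyInPandas(
                  add_running_volume,
                  schema="symbol STRING, trade_time TIMESTAMP, "
                         "volume LONG, cum_volume LONG"))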
Question # 60:

The data science team has requested assistance in accelerating queries on free-form text from user reviews. The data is currently stored in Parquet with the schema below:

item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING

The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 keywords exist in this field.

A junior data engineer suggests converting this data to Delta Lake will improve query performance.

Which response to the junior data engineer's suggestion is correct?

Options:

A.

Delta Lake statistics are not optimized for free text fields with high cardinality.


B.

Text data cannot be stored with Delta Lake.


C.

ZORDER ON review will need to be run to see performance gains.


D.

The Delta log creates a term matrix for free text fields to support selective filtering.


E.

Delta Lake statistics are only collected on the first 4 columns in a table.


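For context, a sketch of the keyword search the team wants to run, with a hypothetical table name and an abbreviated keyword list. Delta's per-file min/max statistics cannot prune files for substring matches inside a long free-text column, which is the point option A makes:

    # Build an OR of LIKE predicates over the 30 keywords (3 shown here).
    keywords = ["refund", "broken", "excellent"]
    cond = " OR ".join(f"review LIKE '%{k}%'" for k in keywords)

    # Min/max stats on a high-cardinality free-text column cannot rule
    # files in or out for these predicates, so every file is still read.
    hits = spark.read.table("reviews").where(cond)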