A data engineer is designing a Pandas UDF to process financial time-series data with complex calculations that require maintaining state across rows within each stock-symbol group. The function must be efficient and scalable. Which approach solves the problem with minimum overhead while preserving data integrity?
A.
Use a scalar_iter Pandas UDF with iterator-based processing, implementing state management through persistent storage (Delta tables) that gets updated after each batch to maintain continuity across iterator chunks.
B.
Use a scalar Pandas UDF that processes the entire dataset at once, implementing custom partitioning logic within the UDF to group by stock symbol and maintain state using global variables shared across all executor processes.
C.
Use applyInPandas on a Spark DataFrame so that each stock symbol group is received as a pandas DataFrame, allowing processing within each group while maintaining state variables local to each group’s processing function.
D.
Use a grouped-aggregate Pandas UDF that processes each stock symbol group independently, maintaining state through intermediate aggregation results that get passed between successive UDF calls via broadcast variables.
Answer: C
applyInPandas is the documented grouped Pandas API for processing each group as a pandas DataFrame. Spark passes all columns for each group together, which allows per-group state to be maintained naturally inside the function. By contrast, scalar Pandas UDFs are batch-oriented Series-to-Series operations, not group-state processing tools. (Apache Spark)
This is why option C is the intended best answer among the listed choices. Option A adds unnecessary external persistence overhead, option B relies on unsupported global executor state, and option D misuses grouped aggregation semantics for row-by-row stateful logic. Spark also documents applyInPandas specifically as a grouped operation, while scalar Pandas UDFs process row batches and concatenate results rather than preserving grouped state semantics. (Apache Spark)
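The per-group stateful logic that option C describes can be sketched with plain pandas, since applyInPandas hands each group to the function as an ordinary pandas DataFrame. This is a minimal illustration with made-up price/volume columns and a running VWAP as the stateful calculation; with Spark the same function would be passed to df.groupBy("symbol").applyInPandas(func, schema=...).

```python
import pandas as pd

def add_running_vwap(pdf: pd.DataFrame) -> pd.DataFrame:
    # State local to this group's call: running sums for a cumulative
    # volume-weighted average price (an illustrative stateful metric).
    cum_pv = 0.0
    cum_vol = 0.0
    vwap = []
    for price, vol in zip(pdf["price"], pdf["volume"]):
        cum_pv += price * vol
        cum_vol += vol
        vwap.append(cum_pv / cum_vol)
    out = pdf.copy()
    out["running_vwap"] = vwap
    return out

# Emulate what Spark does under applyInPandas: hand each symbol's
# rows to the function as a standalone pandas DataFrame.
trades = pd.DataFrame({
    "symbol": ["AAPL", "AAPL", "MSFT"],
    "price": [10.0, 20.0, 100.0],
    "volume": [1.0, 1.0, 2.0],
})
pieces = [add_running_vwap(pdf) for _, pdf in trades.groupby("symbol")]
result = pd.concat(pieces).sort_index()
```

Because the state variables live inside the function, each group's computation is isolated, which is exactly what the global-variable approach in option B cannot guarantee across executor processes.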
======
QUESTION NO: 13
To identify the top users consuming compute resources, a data engineering team needs to monitor usage within their Databricks workspace for better resource utilization and cost control. The team decided to use Databricks system tables, available under the system catalog in Unity Catalog, to gain detailed visibility into workspace activity. Which SQL query should the team run from the system catalog to achieve this?
A.
SELECT
sku_name,
identity_metadata.created_by AS user_email,
SUM(usage_quantity * usage_unit) AS total_dbus
FROM system.billing.usage
GROUP BY user_email, sku_name
ORDER BY total_dbus DESC
LIMIT 10
B.
SELECT
sku_name,
identity_metadata.created_by AS user_email,
COUNT(usage_quantity) AS total_dbus
FROM system.billing.usage
GROUP BY user_email, sku_name
ORDER BY total_dbus DESC
LIMIT 10
C.
SELECT
identity_metadata.run_as AS user_email,
SUM(usage_quantity) AS total_dbus
FROM system.billing.usage
GROUP BY user_email
ORDER BY total_dbus DESC
LIMIT 10
D.
SELECT
sku_name,
usage_metadata.run_name AS user_email,
SUM(usage_quantity) AS total_dbus
FROM system.billing.usage
GROUP BY user_email, sku_name
ORDER BY total_dbus DESC
LIMIT 10
Answer: C
Databricks documents system.billing.usage as the correct system table for billable usage analysis, and it documents identity_metadata.run_as as the field that records who ran supported workloads such as jobs, notebooks, and Lakeflow Spark Declarative Pipelines. For “top users consuming compute resources,” summing usage_quantity by identity_metadata.run_as is the correct conceptual approach. (Databricks Documentation)
The other options are not aligned with the documented schema or metric usage. identity_metadata.created_by is not the general compute-consumer identity field for jobs and notebook workloads; it applies to specific products such as Databricks Apps and certain agent workloads. usage_quantity should be summed, not counted, and usage_unit is not something you multiply into DBUs in the way shown. usage_metadata.run_name is not the documented user identity field for this purpose. As written, option C is the only option that matches the official identity model for user-attributed compute consumption. (Databricks Documentation)
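The logic option C expresses in SQL can be sketched in plain Python; the sample rows and identities below are made up for illustration, shaped loosely like system.billing.usage records:

```python
from collections import defaultdict

# Hypothetical rows: each carries the identity_metadata.run_as identity
# and a usage_quantity measured in DBUs.
usage_rows = [
    {"run_as": "alice@example.com", "usage_quantity": 4.0},
    {"run_as": "bob@example.com", "usage_quantity": 1.5},
    {"run_as": "alice@example.com", "usage_quantity": 2.5},
]

# SUM(usage_quantity) GROUP BY run_as: total DBUs per user. A COUNT,
# as in option B, would only tally rows and lose actual consumption.
totals = defaultdict(float)
for row in usage_rows:
    totals[row["run_as"]] += row["usage_quantity"]

top_users = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Sorting the summed totals in descending order and taking the head is the Python analogue of ORDER BY total_dbus DESC LIMIT 10.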
======
QUESTION NO: 15
Which approach demonstrates a modular and testable way to use DataFrame.transform for ETL code in PySpark?
A.
def transform_data(input_df):
    # transformation logic here
    return output_df
test_input = spark.createDataFrame([(1, "a")], ["id", "value"])
assertDataFrameEqual(transform_data(test_input), expected)
B.
def upper_value(df):
    return df.withColumn("value_upper", upper(col("value")))
def filter_positive(df):
    return df.filter(df["id"] > 0)
pipeline_df = df.transform(upper_value).transform(filter_positive)
C.
class Pipeline:
    def transform(self, df):
        return df.withColumn("value_upper", upper(col("value")))
pipeline = Pipeline()
assertDataFrameEqual(pipeline.transform(test_input), expected)
D.
def upper_transform(df):
    return df.withColumn("value_upper", upper(col("value")))
actual = test_input.transform(upper_transform)
assertDataFrameEqual(actual, expected)
Answer: B
Apache Spark documents DataFrame.transform(func, *args, **kwargs) as concise syntax for chaining custom transformations, where the function takes a DataFrame and returns a DataFrame. The official example explicitly shows chained transforms, which makes option B the most modular and idiomatic ETL design among the choices. (Apache Spark)
Option D shows a valid single transform and test, but it does not demonstrate the modular pipeline composition aspect as clearly as B. Option A does not actually use DataFrame.transform, and option C wraps logic in a class method but does not demonstrate the documented chaining pattern that transform is designed for. (Apache Spark)
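The same composition pattern can be exercised locally without a Spark session: pandas' DataFrame.pipe plays the role that Spark's DataFrame.transform plays in option B. This is a minimal sketch with hypothetical id/value columns:

```python
import pandas as pd

def upper_value(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(value_upper=df["value"].str.upper())

def filter_positive(df: pd.DataFrame) -> pd.DataFrame:
    return df[df["id"] > 0]

df = pd.DataFrame({"id": [1, -2, 3], "value": ["a", "b", "c"]})
# Each step takes a DataFrame and returns a DataFrame, so steps stay
# independently unit-testable and compose into a readable pipeline.
pipeline_df = df.pipe(upper_value).pipe(filter_positive)
```

The design payoff is identical in both APIs: each function can be tested in isolation, and the pipeline reads as a linear chain rather than nested calls.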
======
QUESTION NO: 19
A data engineer is configuring a Databricks Asset Bundle to deploy a job with granular permissions. The requirements are:
Grant the data-engineers group CAN_MANAGE access to the job.
Ensure the auditors group can view the job but not modify or run it.
Avoid granting unintended permissions to other users or groups.
How should the data engineer deploy the job while meeting the requirements?
A.
resources:
  jobs:
    my-job:
      name: data-pipeline
      tasks: (...)
      job_clusters: (...)
      permissions:
        - group_name: data-engineers
          level: CAN_MANAGE
        - group_name: auditors
          level: CAN_VIEW
B.
resources:
  jobs:
    my-job:
      name: data-pipeline
      tasks: (...)
      job_clusters: (...)
      permissions:
        - group_name: data-engineers
          level: CAN_MANAGE
        - group_name: auditors
          level: CAN_VIEW
C.
resources:
  jobs:
    my-job:
      name: data-pipeline
      tasks: [...]
      job_clusters: [...]
      permissions:
        - group_name: data-engineers
          level: CAN_MANAGE
      permissions:
        - group_name: auditors
          level: CAN_VIEW
D.
permissions:
  - group_name: data-engineers
    level: CAN_MANAGE
  - group_name: auditors
    level: CAN_VIEW
resources:
  jobs:
    my-job:
      name: data-pipeline
      tasks: [...]
      job_clusters: [...]
Answer: B
Databricks documents that resource-specific permissions for bundle resources can be defined under the resource itself, such as resources.jobs.<job>.permissions, using group_name and level. The documented syntax supports CAN_VIEW, CAN_MANAGE, and related permission levels, which matches option B. (Databricks Documentation)
Option C is invalid because it repeats the permissions key incorrectly. Option D applies top-level permissions more broadly across bundle resources instead of scoping them specifically to the job, which does not best satisfy the “avoid unintended permissions” requirement. Option B is therefore the correct and properly scoped configuration. (Databricks Documentation)
======
QUESTION NO: 21
A data engineering team uses Databricks Lakehouse Monitoring to track the percent_null metric for a critical column in their Delta table. The profile metrics table (prod_catalog.prod_schema.customer_data_profile_metrics) stores hourly percent_null values. The team wants to trigger an alert when the daily average of percent_null exceeds 5% for three consecutive days, while ensuring notifications are not spammed during sustained issues. Which SQL alert configuration achieves this goal while minimizing false positives and redundant notifications?
A.
WITH daily_avg AS (
SELECT
DATE_TRUNC('DAY', window.end) AS day,
AVG(percent_null) AS avg_null
FROM prod_catalog.prod_schema.customer_data_profile_metrics
GROUP BY DATE_TRUNC('DAY', window.end)
)
SELECT day, avg_null
FROM daily_avg
ORDER BY day DESC
LIMIT 3
Alert Condition: ALL avg_null > 5 for the latest 3 rows
Notification Frequency: Just once
B.
SELECT percent_null
FROM prod_catalog.prod_schema.customer_data_profile_metrics
WHERE window.end >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
Alert Condition: percent_null > 5
Notification Frequency: At most every 24 hours
C.
SELECT AVG(percent_null) AS daily_avg
FROM prod_catalog.prod_schema.customer_data_profile_metrics
WHERE window.end >= CURRENT_TIMESTAMP - INTERVAL '3' DAY
Alert Condition: daily_avg > 5
Notification Frequency: Each time alert is evaluated
D.
SELECT SUM(CASE WHEN percent_null > 5 THEN 1 ELSE 0 END) AS violation_days
FROM prod_catalog.prod_schema.customer_data_profile_metrics
WHERE window.end >= CURRENT_TIMESTAMP - INTERVAL '3' DAY
Alert Condition: violation_days >= 3
Notification Frequency: Just once
Answer: A
Databricks SQL alerts support alert conditions over aggregated query results, including AVG, and they support notification-frequency behavior that avoids repeated alerts during a sustained triggered state. Databricks documents that with Just Once, a notification is sent when the alert changes from OK to TRIGGERED, but not repeatedly while it remains triggered. (Databricks Documentation)
Option A is the only choice that correctly computes a daily average, checks the latest three daily rows, and pairs that with the anti-spam Just Once notification behavior. Option B checks only raw hourly values over one day, option C averages across the entire three-day span rather than requiring three consecutive daily breaches, and option D counts hourly threshold violations instead of true daily-average violations. (Databricks Documentation)
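The alert predicate in option A can be sketched as a small pure function: the query returns the latest three daily averages (newest first), and the alert fires only when all of them breach the 5% threshold.

```python
def should_trigger(latest_daily_avgs, threshold=5.0, days=3):
    """Trigger only when every one of the latest `days` daily
    averages exceeds the threshold (mirrors 'ALL avg_null > 5
    for the latest 3 rows')."""
    window = latest_daily_avgs[:days]
    # Require a full window: fewer than `days` rows cannot trigger.
    return len(window) == days and all(v > threshold for v in window)
```

A single healthy day resets the condition, which is what distinguishes option A from option C's single average over the whole three-day span.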
======
QUESTION NO: 24
A data engineer is using Lakeflow Spark Declarative Pipelines Expectations to track the data quality of incoming sensor data. Periodically, sensors send bad readings that are out of range, and the team is currently flagging those rows with a warning and writing them to the silver table along with the good data. They have been given a new requirement: the bad rows need to be quarantined in a separate quarantine table and no longer included in the silver table.
This is the existing code for the silver table:
@dlt.table
@dlt.expect("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")
Which code will satisfy the requirements?
A.
@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")
B.
@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")
@dlt.table
@dlt.expect("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")
C.
@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")
@dlt.table
@dlt.expect("invalid_sensor_reading", "reading < 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")
D.
@dlt.table
@dlt.expect("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")
@dlt.table
@dlt.expect("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")
Answer: B
Databricks documents that expect retains invalid records in the target dataset, while expect_or_drop drops invalid records before writing to the target. Therefore, the silver table must use expect_or_drop so bad records are excluded from silver. (Databricks Documentation)
Databricks also documents a quarantine pattern in which invalid records are separated for downstream processing, but the fully documented pattern uses an intermediate quarantine dataset with an is_quarantined flag and then derives valid and invalid paths from it. None of the listed options exactly matches the official quarantine pattern. As written, option B is the closest intended answer because it at least creates a separate quarantine table and removes invalid rows from silver, but strictly speaking, the documented quarantine implementation is more explicit than any option shown here. (Databricks Documentation)
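The flag-based quarantine pattern described above can be sketched with plain pandas (sample sensor rows are made up; in a pipeline the flagged dataset would be an intermediate table):

```python
import pandas as pd

# Flag every row with is_quarantined, then derive both outputs
# from the single flagged dataset.
readings = pd.DataFrame({"sensor": ["s1", "s2", "s3"],
                         "reading": [95.0, 130.0, 110.0]})
flagged = readings.assign(is_quarantined=readings["reading"] >= 120)
silver = flagged[~flagged["is_quarantined"]].drop(columns="is_quarantined")
quarantine = flagged[flagged["is_quarantined"]].drop(columns="is_quarantined")
```

Deriving both tables from one flagged dataset guarantees the two predicates cannot drift apart, which is the advantage the documented pattern has over option B's pair of independent expectations.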
======
QUESTION NO: 26
A facilities-monitoring team is building a near-real-time Power BI dashboard off the Delta table device_readings:
device_id STRING — unique sensor ID
event_ts TIMESTAMP — ingestion timestamp (UTC)
temperature_c DOUBLE — temperature in °C
notes STRING
For each sensor, the team needs one row per non-overlapping 5-minute interval, offset by 2 minutes (for example, intervals like 00:02–00:07, 00:07–00:12, and so on), showing the average temperature in that slice. The result must include each interval’s start and end timestamps so downstream tools can plot time-series bars correctly. Which query satisfies the requirement?
A.
WITH buckets AS (
SELECT
device_id,
window(event_ts, '5 minutes', '5 minutes', '2 minutes') AS win,
temperature_c
FROM device_readings
)
SELECT
device_id,
win.start AS bucket_start,
win.end AS bucket_end,
AVG(temperature_c) AS avg_temp_5m
FROM buckets
GROUP BY device_id, win
ORDER BY device_id, bucket_start;
B.
SELECT
device_id,
window.start AS bucket_start,
window.end AS bucket_end,
AVG(temperature_c) AS avg_temp_5m
FROM device_readings
GROUP BY device_id, window(event_ts, '5 minutes', '5 minutes', '2 minutes')
ORDER BY device_id, bucket_start;
C.
SELECT
device_id,
date_trunc('minute', event_ts - INTERVAL 2 MINUTES) + INTERVAL 2 MINUTES AS bucket_start,
date_trunc('minute', event_ts - INTERVAL 2 MINUTES) + INTERVAL 7 MINUTES AS bucket_end,
AVG(temperature_c) AS avg_temp_5m
FROM device_readings
GROUP BY device_id, date_trunc('minute', event_ts - INTERVAL 2 MINUTES)
ORDER BY device_id, bucket_start;
D.
SELECT
device_id,
event_ts,
AVG(temperature_c) OVER (
PARTITION BY device_id
ORDER BY event_ts
RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING AND CURRENT ROW
) AS avg_temp_5m
FROM device_readings
WINDOW w AS (window(event_ts, '5 minutes', '2 minutes'));
Answer: A
Spark documents window(timeColumn, windowDuration, slideDuration=None, startTime=None) for time bucketing. The startTime argument is specifically the offset from the epoch used to align window boundaries, and the output is a window struct with start and end fields. That exactly matches the requirement for 5-minute non-overlapping intervals offset by 2 minutes. (Apache Spark)
Option A correctly uses window(event_ts, '5 minutes', '5 minutes', '2 minutes'), which creates tumbling 5-minute windows offset by 2 minutes and then exposes win.start and win.end. Option B is malformed in how it references the generated window column, option C creates minute-aligned groupings rather than true 5-minute tumbling windows, and option D computes a rolling window average instead of one row per non-overlapping time bucket. (Apache Spark)
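The boundary-alignment arithmetic behind startTime can be sketched in pure Python: floor the timestamp relative to the 2-minute offset, in whole 5-minute steps. The helper below is illustrative, not Spark's implementation.

```python
from datetime import datetime, timedelta, timezone

def bucket(ts, width_min=5, offset_min=2):
    """Return (start, end) of the tumbling window containing ts,
    with windows of width_min minutes aligned to offset_min past
    each width boundary (e.g. 00:02-00:07, 00:07-00:12, ...)."""
    width = timedelta(minutes=width_min)
    offset = timedelta(minutes=offset_min)
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # Count whole windows elapsed since the offset-shifted epoch.
    n = (ts - epoch - offset) // width
    start = epoch + offset + n * width
    return start, start + width
```

A reading at 00:03 falls in the 00:02–00:07 bucket and one at 00:08 falls in 00:07–00:12, matching the intervals the question specifies.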
