Databricks Certified Associate Developer for Apache Spark 3.5 – Python Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Question # 13 Topic 2 Discussion

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Exam Topic 2 Question 13 Discussion:
Question #: 13
Topic #: 2

42 of 55.

A developer needs to write the output of a complex chain of Spark transformations to a Parquet table called events.liveLatest.

Consumers of this table query it frequently with filters on both year and month of the event_ts column (a timestamp).

The current code:

from pyspark.sql import functions as F

final = df.withColumn("event_year", F.year("event_ts")) \
    .withColumn("event_month", F.month("event_ts")) \
    .write \
    .bucketBy(42, ["event_year", "event_month"]) \
    .saveAsTable("events.liveLatest")

However, consumers report poor query performance.

Which change will enable efficient querying by year and month?


A. Replace .bucketBy() with .partitionBy("event_year", "event_month")

B. Change the bucket count (42) to a lower number

C. Add .sortBy() after .bucketBy()

D. Replace .bucketBy() with .partitionBy("event_year") only
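To make the partitioning options concrete: writing with .partitionBy("event_year", "event_month") produces a Hive-style directory layout with one directory per (year, month) pair, so a query that filters on both columns only needs to read the matching directory. The sketch below is a plain-Python simulation of that layout and of directory pruning. It does not call Spark, and the helper names (partition_path, write_partitioned, read_with_pruning) and the sample rows are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from datetime import datetime


def partition_path(ts):
    # Hive-style directory name derived from the timestamp (illustrative only).
    return f"event_year={ts.year}/event_month={ts.month}"


def write_partitioned(events):
    # Group rows by partition directory, mimicking one directory per (year, month).
    layout = defaultdict(list)
    for event in events:
        layout[partition_path(event["event_ts"])].append(event)
    return dict(layout)


def read_with_pruning(layout, year, month):
    # Only the directory matching both filter values is scanned.
    wanted = f"event_year={year}/event_month={month}"
    scanned = [path for path in layout if path == wanted]
    rows = [row for path in scanned for row in layout[path]]
    return scanned, rows


# Hypothetical sample data.
events = [
    {"id": 1, "event_ts": datetime(2023, 12, 31, 23, 59)},
    {"id": 2, "event_ts": datetime(2024, 3, 1, 8, 0)},
    {"id": 3, "event_ts": datetime(2024, 3, 15, 12, 30)},
    {"id": 4, "event_ts": datetime(2024, 4, 2, 9, 45)},
]

layout = write_partitioned(events)
scanned, rows = read_with_pruning(layout, 2024, 3)

print(sorted(layout))
print(scanned, [row["id"] for row in rows])
```

With option D's layout (year only), the same query would have to read every month of 2024 and filter afterwards, which is why partitioning on both columns matches the stated access pattern more closely.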

