Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Exam Questions Free Practice Test

Viewing page 2 out of 4 pages

Viewing questions 11-20 out of questions

Questions # 11:

11 of 55.

Which Spark configuration controls the number of tasks that can run in parallel on an executor?

Options:

spark.executor.cores

spark.task.maxFailures

spark.executor.memory

spark.sql.shuffle.partitions

Expert Solution

Questions # 12:

13 of 55.

A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:

region_id

region_name

North

East

West

The resulting Python dictionary must contain a mapping of region_id to region_name, containing the smallest 3 region_id values.

Which code fragment meets the requirements?

Options:

regions_dict = dict(regions.take(3))

regions_dict = regions.select("region_id", "region_name").take(3)

regions_dict = dict(regions.select("region_id", "region_name").rdd.collect())

regions_dict = dict(regions.orderBy("region_id").limit(3).rdd.map(lambda x: (x.region_id, x.region_name)).collect())

Expert Solution

Questions # 13:

42 of 55.

A developer needs to write the output of a complex chain of Spark transformations to a Parquet table called events.liveLatest.

Consumers of this table query it frequently with filters on both year and month of the event_ts column (a timestamp).

The current code:

from pyspark.sql import functions as F

final = df.withColumn("event_year", F.year("event_ts")) \

.withColumn("event_month", F.month("event_ts")) \

.bucketBy(42, ["event_year", "event_month"]) \

.saveAsTable("events.liveLatest")

However, consumers report poor query performance.

Which change will enable efficient querying by year and month?

Options:

Replace .bucketBy() with .partitionBy("event_year", "event_month")

Change the bucket count (42) to a lower number

Add .sortBy() after .bucketBy()

Replace .bucketBy() with .partitionBy("event_year") only

Expert Solution

Questions # 14:

27 of 55.

A data engineer needs to add all the rows from one table to all the rows from another, but not all the columns in the first table exist in the second table.

The error message is:

AnalysisException: UNION can only be performed on tables with the same number of columns.

The existing code is:

au_df.union(nz_df)

The DataFrame au_df has one extra column that does not exist in the DataFrame nz_df, but otherwise both DataFrames have the same column names and data types.

What should the data engineer fix in the code to ensure the combined DataFrame can be produced as expected?

Options:

df = au_df.unionByName(nz_df, allowMissingColumns=True)

df = au_df.unionAll(nz_df)

df = au_df.unionByName(nz_df, allowMissingColumns=False)

df = au_df.union(nz_df, allowMissingColumns=True)

Expert Solution

Questions # 15:

What is the difference between df.cache() and df.persist() in Spark DataFrame?

Options:

Both cache() and persist() can be used to set the default storage level (MEMORY_AND_DISK_SER)

Both functions perform the same operation. The persist() function provides improved performance as its default storage level is DISK_ONLY.

persist() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK_SER) and cache() - Can be used to set different storage levels to persist the contents of the DataFrame.

cache() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK) and persist() - Can be used to set different storage levels to persist the contents of the DataFrame

Expert Solution

Questions # 16:

Given a DataFrame df that has 10 partitions, after running the code:

result = df.coalesce(20)

How many partitions will the result DataFrame have?

Options:

Same number as the cluster executors

Expert Solution

Questions # 17:

A data engineer wants to write a Spark job that creates a new managed table. If the table already exists, the job should fail and not modify anything.

Which save mode and method should be used?

Options:

saveAsTable with mode ErrorIfExists

saveAsTable with mode Overwrite

save with mode Ignore

save with mode ErrorIfExists

Expert Solution

Questions # 18:

Given the schema:

Question # 18

event_ts TIMESTAMP,

sensor_id STRING,

metric_value LONG,

ingest_ts TIMESTAMP,

source_file_path STRING

The goal is to deduplicate based on: event_ts, sensor_id, and metric_value.

Options:

dropDuplicates on all columns (wrong criteria)

dropDuplicates with no arguments (removes based on all columns)

groupBy without aggregation (invalid use)

dropDuplicates on the exact matching fields

Expert Solution

Questions # 19:

A developer wants to refactor some older Spark code to leverage built-in functions introduced in Spark 3.5.0. The existing code performs array manipulations manually. Which of the following code snippets utilizes new built-in functions in Spark 3.5.0 for array operations?

Question # 19

Options:

result_df = prices_df \

.withColumn("valid_price", F.when(F.col("spot_price") > F.lit(min_price), 1).otherwise(0))

result_df = prices_df \

.agg(F.count_if(F.col("spot_price") >= F.lit(min_price)))

result_df = prices_df \

.agg(F.min("spot_price"), F.max("spot_price"))

result_df = prices_df \

.agg(F.count("spot_price").alias("spot_price")) \

.filter(F.col("spot_price") > F.lit("min_price"))

Expert Solution

Questions # 20:

A Spark developer is building an app to monitor task performance. They need to track the maximum task processing time per worker node and consolidate it on the driver for analysis.

Which technique should be used?

Options:

Use an RDD action like reduce() to compute the maximum time

Use an accumulator to record the maximum time on the driver

Broadcast a variable to share the maximum time among workers

Configure the Spark UI to automatically collect maximum times

Expert Solution

Viewing page 2 out of 4 pages

Viewing questions 11-20 out of questions

Big Halloween Sale Limited Time 70% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: simple70

Pass the Databricks Databricks Certification Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Questions and answers with CertsForce