
Pass the Databricks Certification Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 questions and answers with CertsForce

Viewing page 2 of 4 (questions 11-20)
Question # 11:

11 of 55.

Which Spark configuration controls the number of tasks that can run in parallel on an executor?

Options:

A. spark.executor.cores
B. spark.task.maxFailures
C. spark.executor.memory
D. spark.sql.shuffle.partitions


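For context, a minimal, hedged sketch of setting this option when building a SparkSession (the app name and values are illustrative, not from the question): with spark.executor.cores=4 and the default spark.task.cpus=1, each executor can run up to 4 tasks in parallel.

from pyspark.sql import SparkSession

# Illustrative configuration only: 4 cores per executor and 1 CPU per task
# means each executor can run up to 4 tasks concurrently.
spark = (
    SparkSession.builder
    .appName("executor-cores-demo")
    .config("spark.executor.cores", "4")
    .config("spark.task.cpus", "1")
    .getOrCreate()
)
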
Question # 12:

13 of 55.

A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:

region_id | region_name
----------|------------
10        | North
12        | East
14        | West

The resulting Python dictionary must map region_id to region_name for the three smallest region_id values.

Which code fragment meets the requirements?

Options:

A. regions_dict = dict(regions.take(3))
B. regions_dict = regions.select("region_id", "region_name").take(3)
C. regions_dict = dict(regions.select("region_id", "region_name").rdd.collect())
D. regions_dict = dict(regions.orderBy("region_id").limit(3).rdd.map(lambda x: (x.region_id, x.region_name)).collect())


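As a self-contained sketch of the orderBy/limit pattern (the DataFrame below is an invented stand-in for the Parquet table), the id-to-name mapping for the three smallest region_id values can be built like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("regions-dict-demo").getOrCreate()

# Hypothetical stand-in for the small Parquet table shown above.
regions = spark.createDataFrame(
    [(14, "West"), (10, "North"), (12, "East")],
    ["region_id", "region_name"],
)

# Keep the three smallest region_id values, then build the id -> name mapping.
regions_dict = dict(
    regions.orderBy("region_id")
    .limit(3)
    .rdd.map(lambda row: (row.region_id, row.region_name))
    .collect()
)

print(regions_dict)  # {10: 'North', 12: 'East', 14: 'West'}
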
Question # 13:

42 of 55.

A developer needs to write the output of a complex chain of Spark transformations to a Parquet table called events.liveLatest.

Consumers of this table query it frequently with filters on both year and month of the event_ts column (a timestamp).

The current code:

from pyspark.sql import functions as F

final = df.withColumn("event_year", F.year("event_ts")) \
    .withColumn("event_month", F.month("event_ts")) \
    .write \
    .bucketBy(42, ["event_year", "event_month"]) \
    .saveAsTable("events.liveLatest")

However, consumers report poor query performance.

Which change will enable efficient querying by year and month?

Options:

A. Replace .bucketBy() with .partitionBy("event_year", "event_month")
B. Change the bucket count (42) to a lower number
C. Add .sortBy() after .bucketBy()
D. Replace .bucketBy() with .partitionBy("event_year") only


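For reference, a minimal sketch of the partition-based layout described in option A (assuming df and event_ts exist as in the question, and that overwriting the table is acceptable); partitioning on the derived year and month columns lets Spark prune whole directories when queries filter on them:

from pyspark.sql import functions as F

(df.withColumn("event_year", F.year("event_ts"))
   .withColumn("event_month", F.month("event_ts"))
   .write
   .partitionBy("event_year", "event_month")
   .mode("overwrite")  # assumption: a full refresh of the table is acceptable
   .saveAsTable("events.liveLatest"))
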
Question # 14:

27 of 55.

A data engineer needs to combine all the rows from one table with all the rows from another, but not all the columns in the first table exist in the second table.

The error message is:

AnalysisException: UNION can only be performed on tables with the same number of columns.

The existing code is:

au_df.union(nz_df)

The DataFrame au_df has one extra column that does not exist in the DataFrame nz_df, but otherwise both DataFrames have the same column names and data types.

What should the data engineer fix in the code to ensure the combined DataFrame can be produced as expected?

Options:

A. df = au_df.unionByName(nz_df, allowMissingColumns=True)
B. df = au_df.unionAll(nz_df)
C. df = au_df.unionByName(nz_df, allowMissingColumns=False)
D. df = au_df.union(nz_df, allowMissingColumns=True)


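A self-contained sketch of the allowMissingColumns behaviour (the column names and rows below are invented for illustration); the column that is missing from nz_df is filled with nulls in the combined result:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("union-by-name-demo").getOrCreate()

# au_df has one extra column ("state") that nz_df lacks.
au_df = spark.createDataFrame([("Alice", 34, "NSW")], ["name", "age", "state"])
nz_df = spark.createDataFrame([("Bob", 41)], ["name", "age"])

# Columns are matched by name; the missing "state" column becomes null for nz_df rows.
combined = au_df.unionByName(nz_df, allowMissingColumns=True)
combined.show()
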
Question # 15:

What is the difference between df.cache() and df.persist() for a Spark DataFrame?

Options:

A. Both cache() and persist() can be used to set the default storage level (MEMORY_AND_DISK_SER)
B. Both functions perform the same operation. The persist() function provides improved performance as its default storage level is DISK_ONLY.
C. persist() persists the DataFrame with the default storage level (MEMORY_AND_DISK_SER), and cache() can be used to set different storage levels to persist the contents of the DataFrame.
D. cache() persists the DataFrame with the default storage level (MEMORY_AND_DISK), and persist() can be used to set different storage levels to persist the contents of the DataFrame.


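As a short sketch (the DataFrames are invented), the practical difference shows up in the storage-level argument: cache() always uses the default level, while persist() accepts an explicit one.

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-vs-persist-demo").getOrCreate()

df = spark.range(1_000_000)
df.cache()  # always uses the default storage level for DataFrames

df2 = spark.range(1_000_000)
df2.persist(StorageLevel.DISK_ONLY)  # explicit storage level when the default is not wanted
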
Question # 16:

Given a DataFrame df that has 10 partitions, after running the code:

result = df.coalesce(20)

How many partitions will the result DataFrame have?

Options:

A. 10
B. Same number as the cluster executors
C. 1
D. 20


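A runnable sketch (the data is arbitrary) that illustrates the behaviour: coalesce() can only reduce the number of partitions, so asking for more than the current count is a no-op, whereas repartition() can increase it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-demo").getOrCreate()

df = spark.range(1000).repartition(10)            # start with 10 partitions
print(df.rdd.getNumPartitions())                  # 10
print(df.coalesce(20).rdd.getNumPartitions())     # still 10: coalesce cannot increase
print(df.repartition(20).rdd.getNumPartitions())  # 20: repartition shuffles to more partitions
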
Question # 17:

A data engineer wants to write a Spark job that creates a new managed table. If the table already exists, the job should fail and not modify anything.

Which save mode and method should be used?

Options:

A. saveAsTable with mode ErrorIfExists
B. saveAsTable with mode Overwrite
C. save with mode Ignore
D. save with mode ErrorIfExists


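A minimal sketch (the table name and data are illustrative): with mode "errorifexists", which is also the default for saveAsTable, the write fails if the managed table already exists and leaves it untouched.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("save-mode-demo").getOrCreate()

df = spark.range(10)  # hypothetical data to write

# Fails with an AnalysisException if demo_managed_table already exists.
df.write.mode("errorifexists").saveAsTable("demo_managed_table")
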
Question # 18:

Given the schema:


event_ts TIMESTAMP,
sensor_id STRING,
metric_value LONG,
ingest_ts TIMESTAMP,
source_file_path STRING

The goal is to deduplicate based on event_ts, sensor_id, and metric_value. Which approach meets this requirement?

Options:

A. dropDuplicates on all columns (wrong criteria)
B. dropDuplicates with no arguments (removes based on all columns)
C. groupBy without aggregation (invalid use)
D. dropDuplicates on the exact matching fields


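A self-contained sketch of deduplicating on exactly those three fields (the rows are invented, and the timestamp columns are kept as strings for brevity): rows that differ only in ingest_ts or source_file_path are treated as duplicates.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-demo").getOrCreate()

# Hypothetical rows matching the schema above (timestamps as strings for brevity).
events = spark.createDataFrame(
    [
        ("2024-01-01 00:00:00", "s1", 10, "2024-01-01 00:05:00", "/raw/a.json"),
        ("2024-01-01 00:00:00", "s1", 10, "2024-01-01 00:06:00", "/raw/b.json"),
    ],
    ["event_ts", "sensor_id", "metric_value", "ingest_ts", "source_file_path"],
)

# Deduplicate on the three business-key columns only.
deduped = events.dropDuplicates(["event_ts", "sensor_id", "metric_value"])
deduped.show(truncate=False)
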
Question # 19:

A developer wants to refactor some older Spark code to leverage built-in functions introduced in Spark 3.5.0. The existing code performs array manipulations manually. Which of the following code snippets utilizes new built-in functions in Spark 3.5.0 for array operations?


Options:

A. result_df = prices_df.withColumn("valid_price", F.when(F.col("spot_price") > F.lit(min_price), 1).otherwise(0))
B. result_df = prices_df.agg(F.count_if(F.col("spot_price") >= F.lit(min_price)))
C. result_df = prices_df.agg(F.min("spot_price"), F.max("spot_price"))
D. result_df = prices_df.agg(F.count("spot_price").alias("spot_price")).filter(F.col("spot_price") > F.lit("min_price"))


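For reference, a runnable sketch of count_if(), one of the aggregate functions added to the DataFrame API in Spark 3.5.0 (prices_df and min_price are invented for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("count-if-demo").getOrCreate()

prices_df = spark.createDataFrame([(9.5,), (12.0,), (15.25,)], ["spot_price"])
min_price = 10.0

# count_if() counts the rows for which the boolean expression is true.
result_df = prices_df.agg(
    F.count_if(F.col("spot_price") >= F.lit(min_price)).alias("num_valid_prices")
)
result_df.show()
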
Question # 20:

A Spark developer is building an app to monitor task performance. They need to track the maximum task processing time per worker node and consolidate it on the driver for analysis.

Which technique should be used?

Options:

A. Use an RDD action like reduce() to compute the maximum time
B. Use an accumulator to record the maximum time on the driver
C. Broadcast a variable to share the maximum time among workers
D. Configure the Spark UI to automatically collect maximum times


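A minimal sketch of the action-based approach from option A (the timing records are invented): per-worker maxima are computed on the executors, and the small result is brought back to the driver by the action.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("max-task-time-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical (worker, task_duration_ms) records collected from task metrics.
task_times = sc.parallelize([("w1", 120), ("w2", 340), ("w1", 95), ("w2", 210)])

max_per_worker = task_times.reduceByKey(max).collect()      # max per worker node
overall_max = task_times.map(lambda kv: kv[1]).reduce(max)  # single overall max via reduce()

print(max_per_worker, overall_max)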