
Pass the Databricks Certification Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Questions and Answers with CertsForce

Question # 1:

In the code block below, aggDF contains aggregations on a streaming DataFrame:

aggDF.writeStream \
    .format("console") \
    .outputMode("???") \
    .start()

Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

Options:

A.

AGGREGATE


B.

COMPLETE


C.

REPLACE


D.

APPEND


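For reference, a minimal sketch of the "complete" mode (the real modes are append, update, and complete), assuming a small stateful aggregation over Spark's built-in rate test source:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("output-modes").getOrCreate()

# Hypothetical aggregation; any streaming groupBy/count works here.
counts = (spark.readStream.format("rate").load()
          .groupBy("value").count())

# outputMode("complete") rewrites the entire result table to the
# console on every trigger; "append" and "update" emit only new or
# changed rows.
query = (counts.writeStream
         .format("console")
         .outputMode("complete")
         .start())
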
Question # 2:

What is a feature of Spark Connect?

Options:

A.

It supports DataStreamReader, DataStreamWriter, StreamingQuery, and Streaming APIs


B.

It supports the DataFrame, Functions, Column, and SparkContext PySpark APIs


C.

It supports only PySpark applications


D.

It has built-in authentication


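As a sketch of what Spark Connect looks like in practice (Spark 3.4+), a thin PySpark client attaches to a server through a remote URL; the host and port below are placeholders:

from pyspark.sql import SparkSession

# sc:// is the Spark Connect URL scheme; 15002 is the default port.
# DataFrame, Column, and functions APIs travel over this client, while
# low-level SparkContext/RDD APIs are not part of the protocol.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.range(5).show()
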
Question # 3:

An organization has been running a Spark application in production and is considering disabling the Spark History Server to reduce resource usage.

What will be the impact of disabling the Spark History Server in production?

Options:

A.

Prevention of driver log accumulation during long-running jobs


B.

Improved job execution speed due to reduced logging overhead


C.

Loss of access to past job logs and reduced debugging capability for completed jobs


D.

Enhanced executor performance due to reduced log size


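For context, the History Server only serves event logs that applications write; a minimal sketch of the logging side (the directory is a placeholder):

from pyspark.sql import SparkSession

# These event logs are what the History Server replays to show the UI
# of completed applications; without the server, the after-the-fact
# view of finished jobs is lost, though the log files themselves remain.
spark = (SparkSession.builder
         .appName("history-demo")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "file:///tmp/spark-events")
         .getOrCreate())
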
Question # 4:

How can a Spark developer ensure optimal resource utilization when running Spark jobs in Local Mode for testing?

Options:

A.

Configure the application to run in cluster mode instead of local mode.


B.

Increase the number of local threads based on the number of CPU cores.


C.

Use the spark.dynamicAllocation.enabled property to scale resources dynamically.


D.

Set the spark.executor.memory property to a large value.


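A minimal sketch of matching local-mode threads to the hardware: local[*] starts one worker thread per CPU core, while local[N] pins exactly N threads:

from pyspark.sql import SparkSession

# One thread per core; swap in e.g. "local[8]" to pin 8 threads.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("local-mode-test")
         .getOrCreate())
print(spark.sparkContext.defaultParallelism)  # roughly the core count
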
Question # 5:

A Spark developer is developing a Spark application to monitor task performance across a cluster.

One requirement is to track the maximum processing time for tasks on each worker node and consolidate this information on the driver for further analysis.

Which technique should the developer use?

Options:

A.

Broadcast a variable to share the maximum time among workers.


B.

Configure the Spark UI to automatically collect maximum times.


C.

Use an RDD action like reduce() to compute the maximum time.


D.

Use an accumulator to record the maximum time on the driver.


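As an illustration of the accumulator approach, PySpark accepts a custom AccumulatorParam; a sketch that keeps a running maximum (the per-task timing is simulated with plain numbers):

from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam

class MaxAccumulatorParam(AccumulatorParam):
    def zero(self, initial):
        return initial
    def addInPlace(self, v1, v2):
        # Merge by keeping the larger value across tasks.
        return max(v1, v2)

sc = SparkContext.getOrCreate()
max_time = sc.accumulator(0.0, MaxAccumulatorParam())

def record(duration):
    # Stand-in for a measured task processing time.
    max_time.add(float(duration))
    return duration

sc.parallelize([3.2, 7.5, 1.1], 3).map(record).count()  # action runs the tasks
print(max_time.value)  # 7.5, readable only on the driver
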
Question # 6:

A data engineer is investigating a Spark cluster that is experiencing underutilization during scheduled batch jobs.

After checking the Spark logs, they noticed that tasks are often killed due to timeout errors and that the logs contain several warnings about insufficient resources.

Which action should the engineer take to resolve the underutilization issue?

Options:

A.

Set the spark.network.timeout property to allow tasks more time to complete without being killed.


B.

Increase the executor memory allocation in the Spark configuration.


C.

Reduce the size of the data partitions to improve task scheduling.


D.

Increase the number of executor instances to handle more concurrent tasks.


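For orientation, a hedged sketch of the properties the options refer to; the values are placeholders, and the right knob depends on whether tasks are starved for time, memory, or executor slots:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.network.timeout", "300s")   # longer before timeouts kill tasks
         .config("spark.executor.memory", "8g")     # more memory per executor
         .config("spark.executor.instances", "8")   # more concurrent task slots
         .getOrCreate())
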
Question # 7:

A data scientist is analyzing a large dataset and has written a PySpark script that includes several transformations and actions on a DataFrame. The script ends with a collect() action to retrieve the results.

How does Apache Spark™'s execution hierarchy process the operations when the data scientist runs this script?

Options:

A.

The script is first divided into multiple applications, then each application is split into jobs, stages, and finally tasks.


B.

The entire script is treated as a single job, which is then divided into multiple stages, and each stage is further divided into tasks based on data partitions.


C.

The collect() action triggers a job, which is divided into stages at shuffle boundaries, and each stage is split into tasks that operate on individual data partitions.


D.

Spark creates a single task for each transformation and action in the script, and these tasks are grouped into stages and jobs based on their dependencies.


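A small sketch that makes the hierarchy visible: the single collect() triggers one job, the shuffle introduced by groupBy splits that job into stages, and each stage runs one task per partition:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(0, 1000, numPartitions=4)              # 4 partitions -> 4 tasks per stage
agg = df.groupBy((df.id % 10).alias("bucket")).count()  # shuffle = stage boundary

rows = agg.collect()  # the action: one job, two stages, tasks per partition
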
Question # 8:

What is the benefit of using Pandas API on Spark for data transformations?

Options:

A.

It executes queries faster by using all the available cores in the cluster, while providing pandas's rich set of features.


B.

It is available only with Python, thereby reducing the learning curve.


C.

It runs on a single node only, utilizing memory efficiently.


D.

It computes results immediately using eager execution.


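A minimal sketch of the Pandas API on Spark: pandas-style syntax, with execution planned by Spark and spread across the cluster's cores:

import pyspark.pandas as ps

psdf = ps.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]})
print(psdf["x"].mean())             # computed by Spark, returned like pandas
print(psdf[psdf["y"] > 15].head())  # familiar boolean filtering
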
Question # 9:

An MLOps engineer is building a Pandas UDF that applies a language model that translates English strings into Spanish. The initial code is loading the model on every call to the UDF, which is hurting the performance of the data pipeline.

The initial code is:

def in_spanish_inner(df: pd.Series) -> pd.Series:
    model = get_translation_model(target_lang='es')  # reloaded on every batch
    return df.apply(model)

in_spanish = sf.pandas_udf(in_spanish_inner, StringType())

How can the MLOps engineer change this code to reduce how many times the language model is loaded?

Options:

A.

Convert the Pandas UDF to a PySpark UDF


B.

Convert the Pandas UDF from a Series → Series UDF to a Series → Scalar UDF


C.

Run the in_spanish_inner() function in a mapInPandas() function call


D.

Convert the Pandas UDF from a Series → Series UDF to an Iterator[Series] → Iterator[Series] UDF


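A hedged sketch of the pattern in option D: an Iterator[pd.Series] -> Iterator[pd.Series] pandas UDF loads the model once per task and reuses it across all of that task's batches (get_translation_model is the question's own placeholder):

from typing import Iterator
import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

@sf.pandas_udf(StringType())
def in_spanish(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = get_translation_model(target_lang='es')  # loaded once, not per batch
    for batch in batches:
        yield batch.apply(model)
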
Question # 10:

What is the benefit of Adaptive Query Execution (AQE)?

Options:

A.

It allows Spark to optimize the query plan before execution but does not adapt during runtime.


B.

It enables the adjustment of the query plan during runtime, handling skewed data, optimizing join strategies, and improving overall query performance.


C.

It optimizes query execution by parallelizing tasks and does not adjust strategies based on runtime metrics like data skew.


D.

It automatically distributes tasks across nodes in the clusters and does not perform runtime adjustments to the query plan.


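For reference, a sketch of turning AQE and its runtime optimizations on explicitly (these properties exist in Spark 3.x; AQE is enabled by default from Spark 3.2):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.adaptive.enabled", "true")                     # re-plan mid-query
         .config("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small partitions
         .getOrCreate())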