
Pass the Databricks Certification Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Questions and Answers with CertsForce

Viewing page 3 of 4 (questions 21-30)
Question # 21:

A Spark application suffers from too many small tasks due to excessive partitioning. How can this be fixed without a full shuffle?

Options:

A.

Use the distinct() transformation to combine similar partitions


B.

Use the coalesce() transformation with a lower number of partitions


C.

Use the sortBy() transformation to reorganize the data


D.

Use the repartition() transformation with a lower number of partitions


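For reference, a minimal PySpark sketch of the shuffle-free approach; the DataFrame and partition counts below are illustrative, not taken from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-sketch").getOrCreate()

# Illustrative DataFrame that ended up with far too many small partitions
df = spark.range(1_000_000).repartition(2000)
print(df.rdd.getNumPartitions())  # 2000

# coalesce() merges existing partitions in place, so no full shuffle is needed;
# repartition(200) would reach the same count but would trigger a shuffle.
df_fewer = df.coalesce(200)
print(df_fewer.rdd.getNumPartitions())  # 200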
Question # 22:

A data engineer observes that an upstream streaming source sends duplicate records, where duplicates share the same key and have at most a 30-minute difference in event_timestamp. The engineer adds:

df = df.withWatermark("event_timestamp", "30 minutes").dropDuplicatesWithinWatermark()

What is the result?

Options:

A.

It is not able to handle deduplication in this scenario


B.

It removes duplicates that arrive within the 30-minute window specified by the watermark


C.

It removes all duplicates regardless of when they arrive


D.

It accepts watermarks in seconds and the code results in an error


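For reference, a self-contained sketch of the pattern above; the rate source and the column renames are placeholders for the real upstream feed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

# Placeholder streaming source standing in for the real upstream feed;
# the columns are renamed to mirror the scenario above.
events = (
    spark.readStream.format("rate").load()
    .withColumnRenamed("value", "key")
    .withColumnRenamed("timestamp", "event_timestamp")
)

# Declare a 30-minute event-time watermark, then drop records whose key
# was already seen within that watermark window.
deduped = (
    events
    .withWatermark("event_timestamp", "30 minutes")
    .dropDuplicatesWithinWatermark(["key"])
)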
Question # 23:

What is the benefit of using Pandas on Spark for data transformations?

Options:

A.

It is available only with Python, thereby reducing the learning curve.


B.

It computes results immediately using eager execution, making it simple to use.


C.

It runs on a single node only, keeping DataFrames bound to that node's memory, and is hence cost-efficient.


D.

It executes queries faster by using all the available cores in the cluster, while providing Pandas’s rich set of features.


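For reference, a minimal Pandas API on Spark sketch (the data below is illustrative): the pandas-style syntax is preserved while execution is distributed as Spark jobs across the cluster.

import pyspark.pandas as ps

# pandas-style DataFrame, but operations execute on the Spark cluster
psdf = ps.DataFrame({"amount": [10.0, 25.5, 7.25], "region": ["EU", "US", "EU"]})

# A familiar pandas groupby/sum runs as a distributed Spark job under the hood
totals = psdf.groupby("region")["amount"].sum()
print(totals)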
Question # 24:

A data analyst builds a Spark application to analyze finance data and performs the following operations: filter, select, groupBy, and coalesce.

Which operation results in a shuffle?

Options:

A.

groupBy


B.

filter


C.

select


D.

coalesce


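For reference, a minimal sketch showing how the physical plan can be inspected to see which of these operations introduces an exchange (shuffle); the sample data is illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as sum_

spark = SparkSession.builder.appName("shuffle-check").getOrCreate()

# Illustrative finance-style data
df = spark.createDataFrame(
    [("acct1", 100.0), ("acct2", 250.0), ("acct1", 75.0)],
    ["account", "amount"],
)

result = (
    df.filter(col("amount") > 50)          # narrow transformation
      .select("account", "amount")         # narrow transformation
      .groupBy("account")                  # aggregation requires a shuffle
      .agg(sum_("amount").alias("total"))
      .coalesce(1)                         # reduces partitions without a full shuffle
)

# The plan shows an Exchange introduced by the groupBy aggregation
result.explain()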
Question # 25:

A data engineer wants to create an external table from a JSON file located at /data/input.json with the following requirements:

Create an external table named users

Automatically infer schema

Merge records with differing schemas

Which code snippet should the engineer use?

Options:

A.

CREATE TABLE users USING json OPTIONS (path '/data/input.json')


B.

CREATE EXTERNAL TABLE users USING json OPTIONS (path '/data/input.json')


C.

CREATE EXTERNAL TABLE users USING json OPTIONS (path '/data/input.json', mergeSchema 'true')


D.

CREATE EXTERNAL TABLE users USING json OPTIONS (path '/data/input.json', schemaMerge 'true')


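For context, DDL like the snippets above is usually issued from PySpark through spark.sql(); a minimal sketch of that pattern follows. It deliberately omits the schema-merge option rather than endorsing any particular answer, and the path comes from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ddl-sketch").getOrCreate()

# Providing a path in OPTIONS makes the table unmanaged (external)
spark.sql(
    "CREATE TABLE IF NOT EXISTS users USING json OPTIONS (path '/data/input.json')"
)

# The inferred schema can then be inspected
spark.table("users").printSchema()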
Question # 26:


An engineer has two DataFrames — df1 (small) and df2 (large). To optimize the join, the engineer uses a broadcast join:

from pyspark.sql.functions import broadcast

df_result = df2.join(broadcast(df1), on="id", how="inner")

What is the purpose of using broadcast() in this scenario?

Options:

A.

It increases the partition size for df1 and df2.


B.

It ensures that the join happens only when the id values are identical.


C.

It reduces the number of shuffle operations by replicating the smaller DataFrame to all nodes.


D.

It filters the id values before performing the join.


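For reference, a self-contained sketch of the same broadcast join with an explain() call; in the physical plan, a BroadcastHashJoin (rather than a SortMergeJoin) confirms that the large side is not shuffled. The sample data is illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-sketch").getOrCreate()

df1 = spark.createDataFrame([(1, "gold"), (2, "silver")], ["id", "tier"])  # small
df2 = spark.range(1_000_000)  # large; has a single column named id

# The small DataFrame is replicated to every executor, so the large one
# can be joined without shuffling it.
df_result = df2.join(broadcast(df1), on="id", how="inner")
df_result.explain()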
Question # 27:

A Spark engineer must select an appropriate deployment mode for the Spark jobs.

What is the benefit of using cluster mode in Apache Spark™?

Options:

A.

In cluster mode, resources are allocated from a resource manager on the cluster, enabling better performance and scalability for large jobs


B.

In cluster mode, the driver is responsible for executing all tasks locally without distributing them across the worker nodes.


C.

In cluster mode, the driver runs on the client machine, which can limit the application's ability to handle large datasets efficiently.


D.

In cluster mode, the driver program runs on one of the worker nodes, allowing the application to fully utilize the distributed resources of the cluster.


Question # 28:


A data engineer has noticed that upgrading the Spark version in their applications from Spark 3.0 to Spark 3.5 has improved the runtime of some scheduled Spark applications.

Looking further, the data engineer realizes that Adaptive Query Execution (AQE) is now enabled.

Which operation does AQE implement to automatically improve the performance of the Spark application?

Options:

A.

Dynamically switching join strategies


B.

Collecting persistent table statistics and storing them in the metastore for future use


C.

Improving the performance of single-stage Spark jobs


D.

Optimizing the layout of Delta files on disk


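For reference, a minimal sketch of the AQE settings involved; AQE is enabled by default in recent Spark releases, and its runtime re-optimizations include coalescing shuffle partitions, handling skewed joins, and dynamically switching join strategies based on runtime statistics:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aqe-sketch").getOrCreate()

# Make the AQE settings explicit (they are on by default in recent releases)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# With these enabled, a plan that started as a sort-merge join can be switched
# to a broadcast join at runtime if one side turns out to be small.
print(spark.conf.get("spark.sql.adaptive.enabled"))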
Question # 29:

A data scientist wants to ingest a directory full of plain text files so that each record in the output DataFrame contains the entire contents of a single file and the full path of the file the text was read from.

The first attempt does read the text files, but each record contains a single line. This code is shown below:

from pyspark.sql.functions import input_file_name

txt_path = "/datasets/raw_txt/*"

df = spark.read.text(txt_path) # one row per line by default

df = df.withColumn("file_path", input_file_name()) # add full path of the source file

Which code change produces a DataFrame that meets the data scientist's requirements?

Options:

A.

Add the option wholetext to the text() function.


B.

Add the option lineSep to the text() function.


C.

Add the option wholetext=False to the text() function.


D.

Add the option lineSep=", " to the text() function.


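For reference, a minimal sketch of the whole-file read: wholetext=True makes each file a single record, and input_file_name() adds the source path (the directory path comes from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.appName("wholetext-sketch").getOrCreate()

txt_path = "/datasets/raw_txt/*"

# One record per file instead of one record per line
df = spark.read.text(txt_path, wholetext=True)
df = df.withColumn("file_path", input_file_name())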
Question # 30:

A data engineer is working on a real-time analytics pipeline using Apache Spark Structured Streaming. The engineer wants to process incoming data and ensure that triggers control when the query is executed. The system needs to process data in micro-batches with a fixed interval of 5 seconds.

Which code snippet could the data engineer use to fulfil this requirement?

The four candidate snippets (A, B, C, D) were shown as images; each is summarized in the options below.

Options:

A.

Uses trigger(continuous='5 seconds') – continuous processing mode.


B.

Uses trigger() – default micro-batch trigger without interval.


C.

Uses trigger(processingTime='5 seconds') – correct micro-batch trigger with interval.


D.

Uses trigger(processingTime=5000) – invalid, as processingTime expects a string.


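For reference, a minimal sketch of a micro-batch trigger that fires every 5 seconds; the rate source and console sink below are placeholders for the real pipeline, and note that processingTime takes a string:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-sketch").getOrCreate()

# Placeholder streaming source for the sketch
stream_df = spark.readStream.format("rate").load()

# Micro-batch processing with a fixed 5-second interval
query = (
    stream_df.writeStream
    .format("console")
    .trigger(processingTime="5 seconds")
    .start()
)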