Pass the Databricks Certification Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Questions and Answers with CertsForce

Viewing page 1 of 3 (questions 1-10)
Question # 1:

An engineer has a large ORC file located at /file/test_data.orc and wants to read only specific columns to reduce memory usage.

Which code fragment will select only the columns col1 and col2 during the reading process?

Options:

A.

spark.read.orc("/file/test_data.orc").filter("col1 = 'value' ").select("col2")


B.

spark.read.format("orc").select("col1", "col2").load("/file/test_data.orc")


C.

spark.read.orc("/file/test_data.orc").selected("col1", "col2")


D.

spark.read.format("orc").load("/file/test_data.orc").select("col1", "col2")


Expert Solution
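
For reference, a minimal PySpark sketch of column pruning at read time, using the path and column names given in the question: projecting the columns immediately after load() lets Spark push the column selection down to the ORC reader, so only those columns are brought into memory.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-column-pruning").getOrCreate()

# Load the ORC file, then keep only the needed columns; Spark pushes
# this projection down to the ORC reader so the other columns are
# never materialized.
df = spark.read.format("orc").load("/file/test_data.orc").select("col1", "col2")
df.printSchema()

Note that select() is a DataFrame method, so it can only be called after load() (or read.orc()) has returned a DataFrame; it is not available on the DataFrameReader itself.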
Question # 2:

Given a DataFrame df that has 10 partitions, after running the code:

result = df.coalesce(20)

How many partitions will the result DataFrame have?

Options:

A.

10


B.

Same number as the cluster executors


C.

1


D.

20


Expert Solution
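
As background, coalesce() only reduces the number of partitions; requesting more partitions than currently exist leaves the count unchanged, whereas repartition() performs a full shuffle and can increase it. A small sketch (the DataFrame is invented for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-demo").getOrCreate()

# Start from a DataFrame with 10 partitions, as in the question.
df = spark.range(1000).repartition(10)
print(df.rdd.getNumPartitions())                  # 10

# coalesce() never increases the partition count, so asking for 20
# still leaves 10 partitions.
print(df.coalesce(20).rdd.getNumPartitions())     # 10

# repartition() shuffles the data and can scale up to 20 partitions.
print(df.repartition(20).rdd.getNumPartitions())  # 20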
Question # 3:

A data engineer is working with a large JSON dataset containing order information. The dataset is stored in a distributed file system and needs to be loaded into a Spark DataFrame for analysis. The data engineer wants to ensure that the schema is correctly defined and that the data is read efficiently.

Which approach should the data engineer use to efficiently load the JSON data into a Spark DataFrame with a predefined schema?

Options:

A.

Use spark.read.json() to load the data, then use DataFrame.printSchema() to view the inferred schema, and finally use DataFrame.cast() to modify column types.


B.

Use spark.read.json() with the inferSchema option set to true


C.

Use spark.read.format("json").load() and then use DataFrame.withColumn() to cast each column to the desired data type.


D.

Define a StructType schema and use spark.read.schema(predefinedSchema).json() to load the data.


Expert Solution
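
For context, a hedged sketch of loading JSON with an explicitly defined StructType, which avoids the extra pass over the files that schema inference requires. The field names, types, and path below are assumptions for illustration, not taken from the question.

from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, LongType, StringType, DoubleType, TimestampType
)

spark = SparkSession.builder.appName("json-with-schema").getOrCreate()

# Hypothetical order schema -- adjust the fields to the real dataset.
order_schema = StructType([
    StructField("order_id", LongType(), nullable=False),
    StructField("customer", StringType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
    StructField("order_ts", TimestampType(), nullable=True),
])

# Supplying the schema up front means Spark does not have to scan the
# data to infer column names and types.
orders_df = spark.read.schema(order_schema).json("/data/orders/")
orders_df.printSchema()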
Question # 4:

A developer initializes a SparkSession:


spark = SparkSession.builder \
    .appName("Analytics Application") \
    .getOrCreate()

Which statement describes the spark SparkSession?

Options:

A.

The getOrCreate() method explicitly destroys any existing SparkSession and creates a new one.


B.

A SparkSession is unique for each appName, and calling getOrCreate() with the same name will return an existing SparkSession once it has been created.


C.

If a SparkSession already exists, this code will return the existing session instead of creating a new one.


D.

A new SparkSession is created every time the getOrCreate() method is invoked.


Expert Solution
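
A small sketch of getOrCreate() semantics: if a SparkSession is already running, the builder returns that same session object rather than creating a new one, even when the builder options differ (the second app name below is made up for illustration).

from pyspark.sql import SparkSession

# First call: no session exists yet, so one is created.
spark = SparkSession.builder \
    .appName("Analytics Application") \
    .getOrCreate()

# Second call: the existing session is returned, not a new one,
# so both names refer to the same object.
spark2 = SparkSession.builder.appName("Another Name").getOrCreate()
print(spark is spark2)  # True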
Question # 5:

The following code fragment results in an error:

[Code fragment shown as an image in the original question]

Which code fragment should be used instead?

A) [Shown as an image]

B) [Shown as an image]

C) [Shown as an image]

D) [Shown as an image]


Expert Solution
Question # 6:

A Spark engineer must select an appropriate deployment mode for the Spark jobs.

What is the benefit of using cluster mode in Apache Spark™?

Options:

A.

In cluster mode, resources are allocated from a resource manager on the cluster, enabling better performance and scalability for large jobs


B.

In cluster mode, the driver is responsible for executing all tasks locally without distributing them across the worker nodes.


C.

In cluster mode, the driver runs on the client machine, which can limit the application's ability to handle large datasets efficiently.


D.

In cluster mode, the driver program runs on one of the worker nodes, allowing the application to fully utilize the distributed resources of the cluster.


Expert Solution
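
As a point of reference, cluster deploy mode is normally requested when the application is submitted (for example with spark-submit --deploy-mode cluster), and the resolved mode can be read back from the Spark configuration. The sketch below only inspects that setting and assumes it was populated by the submitter.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("deploy-mode-check").getOrCreate()

# "cluster" means the driver was launched on a node allocated by the
# cluster's resource manager; "client" means it runs on the machine
# that submitted the job.
print(spark.sparkContext.getConf().get("spark.submit.deployMode", "client"))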
Question # 7:

What is the benefit of using Pandas on Spark for data transformations?

Options:

A.

It is available only with Python, thereby reducing the learning curve.


B.

It computes results immediately using eager execution, making it simple to use.


C.

It runs on a single node only, utilizing the memory with memory-bound DataFrames and hence cost-efficient.


D.

It executes queries faster using all the available cores in the cluster as well as provides Pandas’s rich set of features.


Expert Solution
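
For context, a minimal sketch of the pandas API on Spark (the pyspark.pandas module, available in Spark 3.2+): it exposes a pandas-like interface while the actual work is planned and executed by the distributed Spark engine. The column values are invented for illustration.

import pyspark.pandas as ps

# A pandas-on-Spark DataFrame: the familiar pandas-style API, but the
# underlying computation runs on Spark across the cluster's cores.
psdf = ps.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [10.0, 25.5, 7.25, 40.0],
})

print(psdf["amount"].mean())
print(psdf.describe())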
Question # 8:

A data engineer is reviewing a Spark application that applies several transformations to a DataFrame but notices that the job does not start executing immediately.

Which two characteristics of Apache Spark's execution model explain this behavior?

Choose 2 answers:

Options:

A.

The Spark engine requires manual intervention to start executing transformations.


B.

Only actions trigger the execution of the transformation pipeline.


C.

Transformations are executed immediately to build the lineage graph.


D.

The Spark engine optimizes the execution plan during the transformations, causing delays.


E.

Transformations are evaluated lazily.


Expert Solution
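
A short sketch of lazy evaluation: the filter and withColumn calls below only extend the logical plan, and no Spark job starts until the count() action is invoked (the DataFrame is invented for illustration).

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

df = spark.range(1_000_000)

# Transformations: nothing executes yet; Spark only records the lineage.
transformed = (
    df.filter(F.col("id") % 2 == 0)
      .withColumn("doubled", F.col("id") * 2)
)

# The action triggers optimization and execution of the whole pipeline.
print(transformed.count())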
Question # 9:

A data scientist has identified that some records in the user profile table contain null values in one or more fields, and such records should be removed from the dataset before processing. The schema includes fields such as user_id, username, date_of_birth, and created_ts.

The schema of the user profile table looks like this:

[Schema shown as an image in the original question]

Which block of Spark code can be used to achieve this requirement?

Options:

A.

filtered_df = users_raw_df.na.drop(thresh=0)


B.

filtered_df = users_raw_df.na.drop(how='all')


C.

filtered_df = users_raw_df.na.drop(how='any')


D.

filtered_df = users_raw_df.na.drop(how='all', thresh=None)


Expert Solution
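
For reference, a hedged sketch of DataFrame.na.drop(): how='any' drops a row if at least one column is null, while how='all' only drops rows in which every column is null. The tiny DataFrame below is invented for illustration and uses a simplified version of the schema described in the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("na-drop-demo").getOrCreate()

users_raw_df = spark.createDataFrame(
    [
        (1, "alice", "1990-01-01"),
        (2, None, "1985-06-30"),      # one null field
        (None, None, None),           # all fields null
    ],
    schema="user_id INT, username STRING, date_of_birth STRING",
)

# how='any' removes every row that contains at least one null field.
filtered_df = users_raw_df.na.drop(how="any")
filtered_df.show()   # only the fully populated row remains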
Question # 10:

A data engineer is running a Spark job to process a dataset of 1 TB stored in distributed storage. The cluster has 10 nodes, each with 16 CPUs. The Spark UI shows:

Low number of Active Tasks

Many tasks complete in milliseconds

Fewer tasks than available CPUs

Which approach should be used to adjust the partitioning for optimal resource allocation?

Options:

A.

Set the number of partitions equal to the total number of CPUs in the cluster


B.

Set the number of partitions to a fixed value, such as 200


C.

Set the number of partitions equal to the number of nodes in the cluster


D.

Set the number of partitions by dividing the dataset size (1 TB) by a reasonable partition size, such as 128 MB


Expert Solution
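
A hedged back-of-the-envelope sketch of sizing partitions from data volume: at roughly 128 MB per partition, 1 TB works out to about 8192 partitions, comfortably more than the cluster's 160 CPUs, so every core stays busy with reasonably sized tasks. The input path and file format below are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-sizing").getOrCreate()

dataset_size_bytes = 1 * 1024**4         # 1 TB, as in the question
target_partition_bytes = 128 * 1024**2   # ~128 MB per partition

num_partitions = dataset_size_bytes // target_partition_bytes
print(num_partitions)                    # 8192

# Hypothetical input; spreading the data over ~8192 partitions keeps
# all 160 cores supplied with non-trivial tasks.
df = spark.read.parquet("/data/large_dataset/")
repartitioned_df = df.repartition(num_partitions)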