Pass the Databricks Certification Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Questions and Answers with CertsForce

Question # 1:

The code block displayed below contains an error. The code block should return a DataFrame in which column predErrorAdded contains the results of Python function add_2_if_geq_3 as applied to numeric and nullable column predError in DataFrame transactionsDf. Find the error.

Code block:

def add_2_if_geq_3(x):
    if x is None:
        return x
    elif x >= 3:
        return x+2
    return x

add_2_if_geq_3_udf = udf(add_2_if_geq_3)

transactionsDf.withColumnRenamed("predErrorAdded", add_2_if_geq_3_udf(col("predError")))

Options:

A.

The operator used to add the column does not add column predErrorAdded to the DataFrame.


B.

Instead of col("predError"), the actual DataFrame with the column needs to be passed, like so: transactionsDf.predError.


C.

The udf() method does not declare a return type.


D.

UDFs are only available through the SQL API, but not in the Python API as shown in the code block.


E.

The Python function is unable to handle null values, resulting in the code block crashing on execution.
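
For reference, a minimal sketch of how the intended column could be added, reusing the function defined above and assuming withColumn (rather than withColumnRenamed) is the operator the task calls for:

from pyspark.sql.functions import udf, col

add_2_if_geq_3_udf = udf(add_2_if_geq_3)

# withColumn adds a new column; withColumnRenamed only renames an existing one
transactionsDf.withColumn("predErrorAdded", add_2_if_geq_3_udf(col("predError")))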


Question # 2:

The code block displayed below contains an error. The code block should return all rows of DataFrame transactionsDf, but including only columns storeId and predError. Find the error.

Code block:

spark.collect(transactionsDf.select("storeId", "predError"))

Options:

A.

Instead of select, DataFrame transactionsDf needs to be filtered using the filter operator.


B.

Columns storeId and predError need to be represented as a Python list, so they need to be wrapped in brackets ([]).


C.

The take method should be used instead of the collect method.


D.

Instead of collect, collectAsRows needs to be called.


E.

The collect method is not a method of the SparkSession object.
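
A minimal sketch of what a working version could look like, assuming collect is meant to be called on the DataFrame rather than on the SparkSession:

# collect is a DataFrame method, not a method of the SparkSession object
transactionsDf.select("storeId", "predError").collect()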


Question # 3:

Which of the following describes Spark actions?

Options:

A.

Writing data to disk is the primary purpose of actions.


B.

Actions are Spark's way of exchanging data between executors.


C.

The driver receives data upon request by actions.


D.

Stage boundaries are commonly established by actions.


E.

Actions are Spark's way of modifying RDDs.
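
For illustration, a short sketch of the distinction the question is getting at: a transformation such as select() only builds up the query plan, while an action such as count() triggers execution and sends the result back to the driver.

# select() is a transformation: nothing is executed yet
selected = transactionsDf.select("predError")

# count() is an action: a job runs and the result is returned to the driver
row_count = selected.count()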


Question # 4:

Which of the following describes the most efficient way to resize a DataFrame from 16 to 8 partitions?

Options:

A.

Use operation DataFrame.repartition(8) to shuffle the DataFrame and reduce the number of partitions.


B.

Use operation DataFrame.coalesce(8) to fully shuffle the DataFrame and reduce the number of partitions.


C.

Use a narrow transformation to reduce the number of partitions.


D.

Use a wide transformation to reduce the number of partitions.


E.

Use operation DataFrame.coalesce(0.5) to halve the number of partitions in the DataFrame.
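
A minimal sketch of the coalesce approach, using transactionsDf from the earlier questions as a stand-in for the 16-partition DataFrame:

# coalesce merges existing partitions without a full shuffle,
# which is usually cheaper than repartition when only reducing the partition count
resizedDf = transactionsDf.coalesce(8)
resizedDf.rdd.getNumPartitions()  # 8, assuming transactionsDf had 16 partitions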


Question # 5:

Which of the following code blocks returns a one-column DataFrame for which every row contains an array of all integer numbers from 0 up to and including the number given in column predError of DataFrame transactionsDf, and null if predError is null?

Sample of DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

Options:

A.

def count_to_target(target):
    if target is None:
        return

    result = [range(target)]
    return result

count_to_target_udf = udf(count_to_target, ArrayType[IntegerType])

transactionsDf.select(count_to_target_udf(col('predError')))


B.

def count_to_target(target):
    if target is None:
        return

    result = list(range(target))
    return result

transactionsDf.select(count_to_target(col('predError')))


C.

def count_to_target(target):
    if target is None:
        return

    result = list(range(target))
    return result

count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))

transactionsDf.select(count_to_target_udf('predError'))

(Correct)


D.

def count_to_target(target):
    result = list(range(target))
    return result

count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))

df = transactionsDf.select(count_to_target_udf('predError'))


E.

def count_to_target(target):
    if target is None:
        return

    result = list(range(target))
    return result

count_to_target_udf = udf(count_to_target)

transactionsDf.select(count_to_target_udf('predError'))
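
A note on setup: the code in the options above assumes that the UDF helper and the type classes have already been imported, along these lines:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, IntegerType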


Question # 6:

The code block displayed below contains multiple errors. The code block should return a DataFrame that contains only columns transactionId, predError, value and storeId of DataFrame transactionsDf. Find the errors.

Code block:

transactionsDf.select([col(productId), col(f)])

Sample of transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+

Options:

A.

The column names should be listed directly as arguments to the operator and not as a list.


B.

The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all column names should be expressed as strings without being wrapped in a col() operator.


C.

The select operator should be replaced by a drop operator.


D.

The column names should be listed directly as arguments to the operator and not as a list, and, following the pattern of how column names are expressed in the code block, columns productId and f should be replaced by transactionId, predError, value and storeId.


E.

The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all col() operators should be removed.
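
A minimal sketch of what the intended selection could look like, assuming plain string column names are passed directly to select:

transactionsDf.select("transactionId", "predError", "value", "storeId")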


Question # 7:

Which of the following describes a shuffle?

Options:

A.

A shuffle is a process that is executed during a broadcast hash join.


B.

A shuffle is a process that compares data across executors.


C.

A shuffle is a process that compares data across partitions.


D.

A shuffle is a Spark operation that results from DataFrame.coalesce().


E.

A shuffle is a process that allocates partitions to executors.
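
For illustration, a short sketch of an operation that typically triggers a shuffle, i.e. data being redistributed across partitions (using transactionsDf from the earlier questions as the example DataFrame):

# groupBy needs all rows with the same key in the same partition,
# so Spark redistributes (shuffles) data across partitions to compute the counts
transactionsDf.groupBy("storeId").count().collect()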


Question # 8:

Which of the following code blocks returns a single-column DataFrame showing the number of words in column supplier of DataFrame itemsDf?

Sample of DataFrame itemsDf:

+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+

Options:

A.

itemsDf.split("supplier", " ").count()


B.

itemsDf.split("supplier", " ").size()


C.

itemsDf.select(word_count("supplier"))


D.

spark.select(size(split(col(supplier), " ")))


E.

itemsDf.select(size(split("supplier", " ")))
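
A note on setup: size and split live in pyspark.sql.functions, so a working word count along the lines of the last option would assume imports such as:

from pyspark.sql.functions import size, split

# split the supplier string on spaces, then count the elements of the resulting array
itemsDf.select(size(split("supplier", " ")))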


Question # 9:

Which of the following statements about RDDs is incorrect?

Options:

A.

An RDD consists of a single partition.


B.

The high-level DataFrame API is built on top of the low-level RDD API.


C.

RDDs are immutable.


D.

RDD stands for Resilient Distributed Dataset.


E.

RDDs are great for precisely instructing Spark on how to do a query.
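
For illustration, a small sketch showing that an RDD is normally split across several partitions rather than consisting of a single one:

# parallelize distributes the data over multiple partitions; here 4 are requested explicitly
rdd = spark.sparkContext.parallelize(range(100), 4)
rdd.getNumPartitions()  # 4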


Question # 10:

Which of the following code blocks creates a new DataFrame with two columns season and wind_speed_ms, where column season is of data type string and column wind_speed_ms is of data type double?

Options:

A.

spark.DataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})


B.

spark.createDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])


C.

from pyspark.sql import types as T
spark.createDataFrame((("summer", 4.5), ("winter", 7.5)), T.StructType([T.StructField("season", T.CharType()), T.StructField("season", T.DoubleType())]))


D.

spark.newDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])


E.

spark.createDataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})
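
A minimal sketch of how the resulting schema could be checked, assuming the list-of-tuples plus column-names form of createDataFrame:

df = spark.createDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])
df.printSchema()
# root
#  |-- season: string (nullable = true)
#  |-- wind_speed_ms: double (nullable = true)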

