Pass the Databricks Certification Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Questions and Answers with CertsForce

Viewing page 5 of 6 (questions 41-50)
Question # 41:

Which of the following code blocks reads in the JSON file stored at filePath as a DataFrame?

Options:

A. spark.read.json(filePath)
B. spark.read.path(filePath, source="json")
C. spark.read().path(filePath)
D. spark.read().json(filePath)
E. spark.read.path(filePath)


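For reference, a minimal sketch of reading a JSON file into a DataFrame, assuming spark is an active SparkSession and filePath points to a JSON file; note that spark.read is a property (no parentheses) that returns a DataFrameReader:

# Read the JSON file at filePath into a DataFrame.
df = spark.read.json(filePath)

# Equivalent generic form using the format/load API.
df = spark.read.format("json").load(filePath)

df.printSchema()
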
Question # 42:

The code block displayed below contains at least one error. The code block should return a DataFrame with only one column, result. That column should include all values in column value from DataFrame transactionsDf raised to the power of 5, and a null value for rows in which there is no value in column value. Find the error(s).

Code block:

from pyspark.sql.functions import udf
from pyspark.sql import types as T

transactionsDf.createOrReplaceTempView('transactions')

def pow_5(x):
    return x**5

spark.udf.register(pow_5, 'power_5_udf', T.LongType())
spark.sql('SELECT power_5_udf(value) FROM transactions')

Options:

A. The pow_5 method is unable to handle empty values in column value and the name of the column in the returned DataFrame is not result.
B. The returned DataFrame includes multiple columns instead of just one column.
C. The pow_5 method is unable to handle empty values in column value, the name of the column in the returned DataFrame is not result, and the SparkSession cannot access the transactionsDf DataFrame.
D. The pow_5 method is unable to handle empty values in column value, the name of the column in the returned DataFrame is not result, and the Spark driver does not call the UDF function appropriately.
E. The pow_5 method is unable to handle empty values in column value, the UDF function is not registered properly with the Spark driver, and the name of the column in the returned DataFrame is not result.


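For comparison, a sketch of how the intended behavior could be written; the None check, the argument order of spark.udf.register (name first, then the function, then the return type), and the AS result alias are exactly the points the options debate, and transactionsDf is assumed to exist:

from pyspark.sql import types as T

transactionsDf.createOrReplaceTempView('transactions')

def pow_5(x):
    # Return None for missing values instead of failing on None**5.
    return x**5 if x is not None else None

# spark.udf.register(name, f, returnType): the UDF's SQL name comes first.
spark.udf.register('power_5_udf', pow_5, T.LongType())

# Alias the expression so the single returned column is named result.
spark.sql('SELECT power_5_udf(value) AS result FROM transactions')
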
Question # 43:

Which of the following is not a feature of Adaptive Query Execution?

Options:

A. Replace a sort merge join with a broadcast join, where appropriate.
B. Coalesce partitions to accelerate data processing.
C. Split skewed partitions into smaller partitions to avoid differences in partition processing time.
D. Reroute a query in case of an executor failure.
E. Collect runtime statistics during query execution.


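As background, Adaptive Query Execution collects runtime statistics at shuffle boundaries and is driven by configuration; a minimal sketch of the Spark 3.x settings behind the features listed above:

# Enable AQE (reoptimizes plans using runtime statistics).
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Coalesce small shuffle partitions after a shuffle.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Split skewed partitions during joins.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
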
Question # 44:

The code block displayed below contains an error. The code block should merge the rows of DataFrames transactionsDfMonday and transactionsDfTuesday into a new DataFrame, matching column names and inserting null values where column names do not appear in both DataFrames. Find the error.

Sample of DataFrame transactionsDfMonday:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

Sample of DataFrame transactionsDfTuesday:

+-------+-------------+---------+-----+
|storeId|transactionId|productId|value|
+-------+-------------+---------+-----+
|     25|            1|        1|    4|
|      2|            2|        2|    7|
|      3|            4|        2| null|
|   null|            5|        2| null|
+-------+-------------+---------+-----+

Code block:

sc.union([transactionsDfMonday, transactionsDfTuesday])

Options:

A. The DataFrames' RDDs need to be passed into the sc.union method instead of the DataFrame variable names.
B. Instead of union, the concat method should be used, making sure to not use its default arguments.
C. Instead of the Spark context, transactionsDfMonday should be called with the join method instead of the union method, making sure to use its default arguments.
D. Instead of the Spark context, transactionsDfMonday should be called with the union method.
E. Instead of the Spark context, transactionsDfMonday should be called with the unionByName method instead of the union method, making sure to not use its default arguments.


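For reference, a sketch of merging two DataFrames by column name; the call is made on a DataFrame rather than on the SparkContext, and allowMissingColumns (which fills columns absent from one DataFrame with nulls) is available from Spark 3.1 onward:

# Match columns by name and fill missing columns (e.g. predError, f) with nulls.
combined = transactionsDfMonday.unionByName(transactionsDfTuesday, allowMissingColumns=True)
combined.show()
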
Question # 45:

Which of the following is the idea behind dynamic partition pruning in Spark?

Options:

A. Dynamic partition pruning is intended to skip over the data you do not need in the results of a query.
B. Dynamic partition pruning concatenates columns of similar data types to optimize join performance.
C. Dynamic partition pruning performs wide transformations on disk instead of in memory.
D. Dynamic partition pruning reoptimizes physical plans based on data types and broadcast variables.
E. Dynamic partition pruning reoptimizes query plans based on runtime statistics collected during query execution.


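As background, dynamic partition pruning lets Spark skip reading partitions of a partitioned table that, based on the other side of a join, cannot appear in the result; it is enabled by default in Spark 3.x via a configuration flag:

# On by default; shown here only to illustrate the switch that controls the feature.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
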
Question # 46:

Which of the following describes how Spark achieves fault tolerance?

Options:

A. Spark helps fast recovery of data in case of a worker fault by providing the MEMORY_AND_DISK storage level option.
B. If an executor on a worker node fails while calculating an RDD, that RDD can be recomputed by another executor using the lineage.
C. Spark builds a fault-tolerant layer on top of the legacy RDD data system, which by itself is not fault tolerant.
D. Due to the mutability of DataFrames after transformations, Spark reproduces them using observed lineage in case of worker node failure.
E. Spark is only fault-tolerant if this feature is specifically enabled via the spark.fault_recovery.enabled property.


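As background, Spark tracks the chain of transformations (the lineage) used to produce each RDD, and lost partitions are recomputed from that lineage rather than restored from replicas; a small sketch that prints a lineage (toDebugString returns bytes in PySpark):

rdd = (spark.sparkContext
       .parallelize(range(10))
       .map(lambda x: x * 2)
       .filter(lambda x: x > 5))

# This description is what Spark replays if an executor holding a partition fails.
print(rdd.toDebugString().decode())
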
Question # 47:

Which of the following code blocks displays various aggregated statistics of all columns in DataFrame transactionsDf, including the standard deviation and minimum of values in each column?

Options:

A. transactionsDf.summary()
B. transactionsDf.agg("count", "mean", "stddev", "25%", "50%", "75%", "min")
C. transactionsDf.summary("count", "mean", "stddev", "25%", "50%", "75%", "max").show()
D. transactionsDf.agg("count", "mean", "stddev", "25%", "50%", "75%", "min").show()
E. transactionsDf.summary().show()


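For reference, DataFrame.summary() computes count, mean, stddev, min, the 25%/50%/75% approximate percentiles, and max by default, and show() is needed to display the result; a minimal sketch assuming transactionsDf exists:

# All default statistics for every column.
transactionsDf.summary().show()

# Or request a specific subset of statistics.
transactionsDf.summary("count", "min", "stddev").show()
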
Question # 48:

Which of the following code blocks adds a column predErrorSqrt to DataFrame transactionsDf that is the square root of column predError?

Options:

A. transactionsDf.withColumn("predErrorSqrt", sqrt(predError))
B. transactionsDf.select(sqrt(predError))
C. transactionsDf.withColumn("predErrorSqrt", col("predError").sqrt())
D. transactionsDf.withColumn("predErrorSqrt", sqrt(col("predError")))
E. transactionsDf.select(sqrt("predError"))


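For reference, pyspark.sql.functions.sqrt takes a Column (or a column name) and withColumn appends the computed column to the existing ones; a minimal sketch assuming transactionsDf exists:

from pyspark.sql.functions import col, sqrt

# Adds predErrorSqrt while keeping all existing columns of transactionsDf.
transactionsDf.withColumn("predErrorSqrt", sqrt(col("predError"))).show()
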
Question # 49:

Which of the following code blocks applies the boolean-returning Python function evaluateTestSuccess to column storeId of DataFrame transactionsDf as a user-defined function?

Options:

A.
from pyspark.sql import types as T
evaluateTestSuccessUDF = udf(evaluateTestSuccess, T.BooleanType())
transactionsDf.withColumn("result", evaluateTestSuccessUDF(col("storeId")))

B.
evaluateTestSuccessUDF = udf(evaluateTestSuccess)
transactionsDf.withColumn("result", evaluateTestSuccessUDF(storeId))

C.
from pyspark.sql import types as T
evaluateTestSuccessUDF = udf(evaluateTestSuccess, T.IntegerType())
transactionsDf.withColumn("result", evaluateTestSuccess(col("storeId")))

D.
evaluateTestSuccessUDF = udf(evaluateTestSuccess)
transactionsDf.withColumn("result", evaluateTestSuccessUDF(col("storeId")))

E.
from pyspark.sql import types as T
evaluateTestSuccessUDF = udf(evaluateTestSuccess, T.BooleanType())
transactionsDf.withColumn("result", evaluateTestSuccess(col("storeId")))


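For reference, a Python function is turned into a UDF with udf(function, returnType), and it is the wrapped UDF, not the plain function, that is applied to a column; a sketch using a placeholder evaluateTestSuccess, since the real function is not shown in the question:

from pyspark.sql.functions import col, udf
from pyspark.sql import types as T

def evaluateTestSuccess(store_id):
    # Placeholder logic for illustration only.
    return store_id is not None and store_id > 0

evaluateTestSuccessUDF = udf(evaluateTestSuccess, T.BooleanType())
transactionsDf.withColumn("result", evaluateTestSuccessUDF(col("storeId")))
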
Question # 50:

Which of the following statements about executors is correct?

Options:

A. Executors are launched by the driver.
B. Executors stop upon application completion by default.
C. Each node hosts a single executor.
D. Executors store data in memory only.
E. An executor can serve multiple applications.


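As background, executors are per-application worker processes whose resources are requested when the application starts and released when it ends; a minimal sketch of the standard configuration keys (the values are illustrative only):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("executor-config-example")
         .config("spark.executor.instances", "2")  # number of executors requested
         .config("spark.executor.memory", "2g")    # memory per executor
         .config("spark.executor.cores", "2")      # cores per executor
         .getOrCreate())
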