Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Exam Questions Free Practice Test

Viewing page 2 out of 6 pages

Viewing questions 11-20 out of questions

Questions # 11:

Which of the following code blocks stores a part of the data in DataFrame itemsDf on executors?

Options:

itemsDf.cache().count()

itemsDf.cache(eager=True)

cache(itemsDf)

itemsDf.cache().filter()

itemsDf.rdd.storeCopy()


Questions # 12:

The code block shown below should return the number of columns in the CSV file stored at location filePath. Only lines that do not start with a # character should be read from the CSV file. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

__1__(__2__.__3__.csv(filePath, __4__).__5__)

Options:

1. size

2. spark

3. read()

4. escape='#'

5. columns

1. DataFrame

2. spark

3. read()

4. escape='#'

5. shape[0]

1. len

2. pyspark

3. DataFrameReader

4. comment='#'

5. columns

1. size

2. pyspark

3. DataFrameReader

4. comment='#'

5. columns

1. len

2. spark

3. read

4. comment='#'

5. columns

Expert Solution

Answer

Explanation

Correct code block:

len(spark.read.csv(filePath, comment='#').columns)

This is a challenging question with difficulties in an unusual context: the boundary between DataFrame and DataFrameReader. A question of this difficulty level is unlikely to appear in the exam. However, solving it helps you get more comfortable with the DataFrameReader, a subject you will likely have to deal with in the exam.

Before dealing with the inner parentheses, it is easier to figure out the outer parentheses, gaps 1 and 5. Given the code block, the object in gap 5 would have to be evaluated by the object in gap 1, returning the number of columns in the read-in CSV. One answer option includes DataFrame in gap 1 and shape[0] in gap 5. Since DataFrame cannot be used to evaluate shape[0], we can discard this answer option.

Other answer options include size in gap 1. size() is not a built-in Python function, so if we use it, it would have to come from somewhere else. pyspark.sql.functions includes a size() function, but it only returns the length of an array or map stored within a column (documentation linked below). So, using size() is not an option here. This leaves us with two potentially valid answers.
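The point about len versus size can be checked in plain Python without a Spark cluster. The sketch below is only an illustration: the list columns is a hypothetical stand-in for the list of column names that a DataFrame's columns attribute would hold, and its values are made up.

```python
import builtins

# len is a Python built-in function; size is not.
has_len = hasattr(builtins, "len")
has_size = hasattr(builtins, "size")

# Hypothetical stand-in for the list of column names held by DataFrame.columns.
columns = ["transactionId", "value", "storeId"]

# Gap 1 (len) evaluates gap 5 (columns) to give the number of columns.
column_count = len(columns)

print(has_len, has_size, column_count)
```

Because the column names form an ordinary Python list, the built-in len is enough to count them.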

We have to pick between gaps 2 and 3 being spark.read or pyspark.DataFrameReader. Looking at the documentation (linked below), DataFrameReader is actually a class in the pyspark.sql module, which means that we cannot access it as pyspark.DataFrameReader. Moreover, spark.read makes sense because on Databricks, spark references the current Spark session (pyspark.sql.SparkSession), and spark.read therefore returns a DataFrameReader (also see documentation below). This leaves only one correct answer option.
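Some answer options write read() with parentheses. In PySpark, SparkSession.read is a property, so it is accessed without calling it. The sketch below mimics that with two hypothetical stand-in classes, FakeSession and FakeReader; they are not Spark classes and only illustrate the attribute-versus-call distinction.

```python
class FakeReader:
    """Hypothetical stand-in for a DataFrameReader; not a Spark class."""

    def csv(self, path, comment=None):
        # Just echo the arguments instead of reading a file.
        return f"csv({path!r}, comment={comment!r})"


class FakeSession:
    """Hypothetical stand-in for a SparkSession; not a Spark class."""

    @property
    def read(self):
        # A property: accessing session.read already returns the reader.
        return FakeReader()


session = FakeSession()

# Attribute access, as in spark.read.csv(filePath, comment='#').
ok = session.read.csv("data.csv", comment="#")

# Calling the property's result fails because the reader object is not callable.
try:
    session.read()
    call_failed = False
except TypeError:
    call_failed = True

print(ok, call_failed)
```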

More info:

- pyspark.sql.functions.size — PySpark 3.1.2 documentation

- pyspark.sql.DataFrameReader.csv — PySpark 3.1.2 documentation

- pyspark.sql.SparkSession.read — PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3, question 50 (Databricks import instructions)

Questions # 13:

Which of the following code blocks returns a DataFrame where columns predError and productId are removed from DataFrame transactionsDf?

Sample of DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+

Options:

transactionsDf.withColumnRemoved("predError", "productId")

transactionsDf.drop(["predError", "productId", "associateId"])

transactionsDf.drop("predError", "productId", "associateId")

transactionsDf.dropColumns("predError", "productId", "associateId")

transactionsDf.drop(col("predError", "productId"))


Questions # 14:

The code block shown below should return a one-column DataFrame where the column storeId is converted to string type. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__(__2__.__3__(__4__))

Options:

1. select

2. col("storeId")

3. cast

4. StringType

1. select

2. col("storeId")

3. as

4. StringType

1. cast

2. "storeId"

3. as

4. StringType()

1. select

2. col("storeId")

3. cast

4. StringType()

1. select

2. storeId

3. cast

4. StringType()


Questions # 15:

Which of the following code blocks stores DataFrame itemsDf in executor memory and, if insufficient memory is available, serializes it and saves it to disk?

Options:

itemsDf.persist(StorageLevel.MEMORY_ONLY)

itemsDf.cache(StorageLevel.MEMORY_AND_DISK)

itemsDf.store()

itemsDf.cache()

itemsDf.write.option('destination', 'memory').save()


Questions # 16:

The code block displayed below contains an error. The code block should sort the rows of DataFrame transactionsDf by two columns: first by column value, showing smaller numbers at the top and greater numbers at the bottom, and then by column predError, for which all values should be arranged in the inverse of the order used for column value. Find the error.

Code block:

transactionsDf.orderBy('value', asc_nulls_first(col('predError')))

Options:

Two orderBy statements with calls to the individual columns should be chained, instead of having both columns in one orderBy statement.

Column value should be wrapped by the col() operator.

Column predError should be sorted in a descending way, putting nulls last.

Column predError should be sorted by desc_nulls_first() instead.

Instead of orderBy, sort should be used.


Questions # 17:

Which of the following code blocks creates a new one-column, two-row DataFrame dfDates with column date of type timestamp?

Options:

dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])
dfDates = dfDates.withColumn("date", to_timestamp("dd/MM/yyyy HH:mm:ss", "date"))

dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
dfDates = dfDates.withColumnRenamed("date", to_timestamp("date", "yyyy-MM-dd HH:mm:ss"))

dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
dfDates = dfDates.withColumn("date", to_timestamp("date", "dd/MM/yyyy HH:mm:ss"))

dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])
dfDates = dfDates.withColumnRenamed("date", to_datetime("date", "yyyy-MM-dd HH:mm:ss"))

dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])

Expert Solution

Answer

Explanation

This question is tricky. Two things are important to know here:

First, the syntax for createDataFrame: here you need a list of tuples, like so: [(1,), (2,)]. To define a tuple in Python with just a single item in it, it is important to put a comma after the item so that Python interprets it as a tuple and not just a parenthesized expression.
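The effect of the trailing comma can be seen in plain Python, independent of Spark. This is a minimal sketch; the variable names are illustrative only.

```python
# With a trailing comma, each element is a one-item tuple (one row, one column).
rows_as_tuples = [("23/01/2022 11:28:12",), ("24/01/2022 10:58:34",)]

# Without the comma, the parentheses are ignored and each element is a plain string.
rows_as_strings = [("23/01/2022 11:28:12"), ("24/01/2022 10:58:34")]

types_with_comma = [type(row).__name__ for row in rows_as_tuples]
types_without_comma = [type(row).__name__ for row in rows_as_strings]

print(types_with_comma)     # ['tuple', 'tuple']
print(types_without_comma)  # ['str', 'str']
```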

Second, you should understand the to_timestamp syntax. You can find out more about it in the documentation linked below.

For good measure, let's examine in detail why the incorrect options are wrong:

dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])

This code snippet does everything the question asks for, except that the data type of the date column is a string and not a timestamp. When no schema is specified, Spark infers each column's type from the data, and the values here are strings.

dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])

dfDates = dfDates.withColumn("date", to_timestamp("dd/MM/yyyy HH:mm:ss", "date"))

In the first line of this code, Spark throws the following error: TypeError: Can not infer schema for type: <class 'str'>. This is because Spark expects to find row information, but instead finds strings. This is why you need to specify the data as tuples. Fortunately, the Spark documentation (linked below) shows a number of examples for creating DataFrames that should help you get on the right track here.

dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])

dfDates = dfDates.withColumnRenamed("date", to_timestamp("date", "yyyy-MM-dd HH:mm:ss"))

The issue with this answer is that the method withColumnRenamed is used. This method simply renames a column, but it has no power to modify its actual content. This is why withColumn should be used instead. In addition, the date format yyyy-MM-dd HH:mm:ss does not reflect the format of the actual timestamp: "23/01/2022 11:28:12".
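Why the pattern has to mirror the layout of the string can be illustrated with Python's standard datetime module. This is only an analogy and rests on an assumption: Spark uses its own datetime pattern letters (such as dd/MM/yyyy HH:mm:ss), and the strptime codes below are hand-translated equivalents. How Spark itself reacts to a mismatched pattern may differ from the ValueError raised here.

```python
from datetime import datetime

raw = "23/01/2022 11:28:12"

# Hand-translated strptime equivalent of the pattern dd/MM/yyyy HH:mm:ss.
parsed = datetime.strptime(raw, "%d/%m/%Y %H:%M:%S")

# Hand-translated strptime equivalent of yyyy-MM-dd HH:mm:ss, which does not
# match the layout of the string, so parsing fails.
try:
    datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

print(parsed, mismatch_detected)
```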

dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])

dfDates = dfDates.withColumnRenamed("date", to_datetime("date", "yyyy-MM-dd HH:mm:ss"))

Here, withColumnRenamed is used instead of withColumn (see above). In addition, the rows are not expressed correctly: they should be written as tuples, using parentheses and a trailing comma. Finally, even the date format is off here (see above).

More info: pyspark.sql.functions.to_timestamp — PySpark 3.1.2 documentation and pyspark.sql.SparkSession.createDataFrame — PySpark 3.1.1 documentation

Static notebook | Dynamic notebook: See test 2, question 38 (Databricks import instructions)

Questions # 18:

Which of the following describes characteristics of the Dataset API?

Options:

The Dataset API does not support unstructured data.

In Python, the Dataset API mainly resembles Pandas' DataFrame API.

In Python, the Dataset API's schema is constructed via type hints.

The Dataset API is available in Scala, but it is not available in Python.

The Dataset API does not provide compile-time type safety.


Questions # 19:

In which order should the code blocks shown below be run to create a DataFrame that shows the mean of column predError of DataFrame transactionsDf per column storeId and productId, where productId should be either 2 or 3 and the returned DataFrame should be sorted in ascending order by column storeId, leaving out any nulls in that column?

DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

1. .mean("predError")

2. .groupBy("storeId")

3. .orderBy("storeId")

4. transactionsDf.filter(transactionsDf.storeId.isNotNull())

5. .pivot("productId", [2, 3])

Options:

4, 5, 2, 3, 1

4, 2, 1

4, 1, 5, 2, 3

4, 2, 5, 1, 3

4, 3, 2, 5, 1


Questions # 20:

Which of the following describes the conversion of a computational query into an execution plan in Spark?

Options:

Spark uses the catalog to resolve the optimized logical plan.

The catalog assigns specific resources to the optimized memory plan.

The executed physical plan depends on a cost optimization from a previous stage.

Depending on whether DataFrame API or SQL API are used, the physical plan may differ.

The catalog assigns specific resources to the physical plan.

