Pass the Databricks Certification Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 questions and answers with CertsForce

Viewing page 3 of 6
Viewing questions 21-30
Question # 21:

The code block shown below should return an exact copy of DataFrame transactionsDf that does not include rows in which values in column storeId have the value 25. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Options:

A.

transactionsDf.remove(transactionsDf.storeId==25)


B.

transactionsDf.where(transactionsDf.storeId!=25)


C.

transactionsDf.filter(transactionsDf.storeId==25)


D.

transactionsDf.drop(transactionsDf.storeId==25)


E.

transactionsDf.select(transactionsDf.storeId!=25)


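For reference on Question # 21, a minimal sketch of filtering rows out by a column condition, assuming the transactionsDf DataFrame from the question already exists:

# where() and filter() are aliases; both keep only the rows for which the condition is true
filtered = transactionsDf.where(transactionsDf.storeId != 25)

Note that DataFrame has no remove() method, and drop() removes columns rather than rows.
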
Question # 22:

The code block displayed below contains an error. When the code block below has executed, it should have divided DataFrame transactionsDf into 14 parts, based on columns storeId and transactionDate (in this order). Find the error.

Code block:

transactionsDf.coalesce(14, ("storeId", "transactionDate"))

Options:

A.

The parentheses around the column names need to be removed and .select() needs to be appended to the code block.


B.

Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .count() needs to be appended to the code block.

(Correct)


C.

Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .select() needs to be appended to the code block.


D.

Operator coalesce needs to be replaced by repartition and the parentheses around the column names need to be replaced by square brackets.


E.

Operator coalesce needs to be replaced by repartition.


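For reference on Question # 22, a minimal sketch of column-based repartitioning, assuming the transactionsDf DataFrame from the question. coalesce() can only reduce the partition count and does not accept partitioning columns, while repartition() shuffles by the given columns; since repartition() is a lazy transformation, an action is needed before the data is actually redistributed:

# hash-partition transactionsDf into 14 parts by storeId and transactionDate
repartitioned = transactionsDf.repartition(14, "storeId", "transactionDate")
repartitioned.count()  # any action triggers the shuffle
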
Question # 23:

The code block shown below should return a two-column DataFrame with columns transactionId and supplier, with combined information from DataFrames itemsDf and transactionsDf. The code block should merge rows in which column productId of DataFrame transactionsDf matches the value of column itemId in DataFrame itemsDf, but only where column storeId of DataFrame transactionsDf does not match column itemId of DataFrame itemsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(itemsDf, __2__).__3__(__4__)

Options:

A.

1. join

2. transactionsDf.productId==itemsDf.itemId, how="inner"

3. select

4. "transactionId", "supplier"


B.

1. select

2. "transactionId", "supplier"

3. join

4. [transactionsDf.storeId!=itemsDf.itemId, transactionsDf.productId==itemsDf.itemId]


C.

1. join

2. [transactionsDf.productId==itemsDf.itemId, transactionsDf.storeId!=itemsDf.itemId]

3. select

4. "transactionId", "supplier"


D.

1. filter

2. "transactionId", "supplier"

3. join

4. "transactionsDf.storeId!=itemsDf.itemId, transactionsDf.productId==itemsDf.itemId"


E.

1. join

2. transactionsDf.productId==itemsDf.itemId, transactionsDf.storeId!=itemsDf.itemId

3. filter

4. "transactionId", "supplier"


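For reference on Question # 23, a minimal sketch of joining on a composite condition and then selecting columns, assuming the itemsDf and transactionsDf DataFrames from the question:

# a list of Column conditions is combined with AND; the default join type is "inner"
joined = transactionsDf.join(
    itemsDf,
    [transactionsDf.productId == itemsDf.itemId,
     transactionsDf.storeId != itemsDf.itemId])
result = joined.select("transactionId", "supplier")
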
Question # 24:

Which of the following code blocks reads the parquet file stored at filePath into DataFrame itemsDf, using a valid schema for the sample of itemsDf shown below?

Sample of itemsDf:

+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+

Options:

A.

itemsDfSchema = StructType([
    StructField("itemId", IntegerType()),
    StructField("attributes", StringType()),
    StructField("supplier", StringType())])

itemsDf = spark.read.schema(itemsDfSchema).parquet(filePath)


B.

itemsDfSchema = StructType([
    StructField("itemId", IntegerType),
    StructField("attributes", ArrayType(StringType)),
    StructField("supplier", StringType)])

itemsDf = spark.read.schema(itemsDfSchema).parquet(filePath)


C.

itemsDf = spark.read.schema('itemId integer, attributes , supplier string').parquet(filePath)


D.

itemsDfSchema = StructType([
    StructField("itemId", IntegerType()),
    StructField("attributes", ArrayType(StringType())),
    StructField("supplier", StringType())])

itemsDf = spark.read.schema(itemsDfSchema).parquet(filePath)


E.

itemsDfSchema = StructType([
    StructField("itemId", IntegerType()),
    StructField("attributes", ArrayType([StringType()])),
    StructField("supplier", StringType())])

itemsDf = spark.read(schema=itemsDfSchema).parquet(filePath)


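For reference on Question # 24, a minimal sketch of declaring a schema with an array column and using it to read parquet, assuming a SparkSession named spark and the filePath variable from the question:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType

# attributes holds lists such as [blue, winter, cozy], so it needs ArrayType(StringType())
itemsDfSchema = StructType([
    StructField("itemId", IntegerType()),
    StructField("attributes", ArrayType(StringType())),
    StructField("supplier", StringType())])

itemsDf = spark.read.schema(itemsDfSchema).parquet(filePath)
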
Question # 25:

The code block shown below should read all files with the file ending .png in directory path into Spark. Choose the answer that correctly fills the blanks in the code block to accomplish this.

spark.__1__.__2__(__3__).option(__4__, "*.png").__5__(path)

Options:

A.

1. read()

2. format

3. "binaryFile"

4. "recursiveFileLookup"

5. load


B.

1. read

2. format

3. "binaryFile"

4. "pathGlobFilter"

5. load


C.

1. read

2. format

3. binaryFile

4. pathGlobFilter

5. load


D.

1. open

2. format

3. "image"

4. "fileType"

5. open


E.

1. open

2. as

3. "binaryFile"

4. "pathGlobFilter"

5. load


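For reference on Question # 25, a minimal sketch of reading all .png files in a directory via the binaryFile data source, assuming a SparkSession named spark and the path variable from the question:

# pathGlobFilter keeps only files whose names match the given glob pattern
pngDf = (spark.read
         .format("binaryFile")
         .option("pathGlobFilter", "*.png")
         .load(path))
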
Question # 26:

Which of the following describes Spark's way of managing memory?

Options:

A.

Spark uses a subset of the reserved system memory.


B.

Storage memory is used for caching partitions derived from DataFrames.


C.

As a general rule for garbage collection, Spark performs better on many small objects than few big objects.


D.

Disabling serialization potentially greatly reduces the memory footprint of a Spark application.


E.

Spark's memory usage can be divided into three categories: Execution, transaction, and storage.


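For reference on Question # 26, a small sketch of how storage memory is used in practice, assuming a hypothetical DataFrame named df: persisting a DataFrame caches its partitions in Spark's storage memory region.

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_ONLY)  # cached partitions live in storage memory
df.count()                            # an action materializes the cache
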
Question # 27:

Which of the following describes a narrow transformation?

Options:

A.

A narrow transformation is an operation in which data is exchanged across partitions.


B.

A narrow transformation is a process in which data from multiple RDDs is used.


C.

A narrow transformation is a process in which 32-bit float variables are cast to smaller float variables, like 16-bit or 8-bit float variables.


D.

A narrow transformation is an operation in which data is exchanged across the cluster.


E.

A narrow transformation is an operation in which no data is exchanged across the cluster.


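For reference on Question # 27, a small illustration of narrow versus wide transformations, assuming a hypothetical DataFrame df with columns value and group:

# narrow: each output partition depends on a single input partition, so no data is exchanged across the cluster
narrow = df.filter(df.value > 0)

# wide: rows with the same key must be brought together, so data is shuffled across the cluster
wide = df.groupBy("group").count()
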
Question # 28:

The code block shown below should add column transactionDateForm to DataFrame transactionsDf. The column should express the unix-format timestamps in column transactionDate as string type, like Apr 26 (Sunday). Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__(__2__, from_unixtime(__3__, __4__))

Options:

A.

1. withColumn

2. "transactionDateForm"

3. "MMM d (EEEE)"

4. "transactionDate"


B.

1. select

2. "transactionDate"

3. "transactionDateForm"

4. "MMM d (EEEE)"


C.

1. withColumn

2. "transactionDateForm"

3. "transactionDate"

4. "MMM d (EEEE)"


D.

1. withColumn

2. "transactionDateForm"

3. "transactionDate"

4. "MM d (EEE)"


E.

1. withColumnRenamed

2. "transactionDate"

3. "transactionDateForm"

4. "MM d (EEE)"


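For reference on Question # 28, a minimal sketch of formatting unix timestamps as strings, assuming the transactionsDf DataFrame from the question with unix-format timestamps in transactionDate:

from pyspark.sql.functions import from_unixtime

# from_unixtime(column, format); the pattern "MMM d (EEEE)" renders values such as "Apr 26 (Sunday)"
transactionsDf = transactionsDf.withColumn(
    "transactionDateForm",
    from_unixtime("transactionDate", "MMM d (EEEE)"))
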
Question # 29:

Which of the following code blocks uses a schema fileSchema to read a parquet file at location filePath into a DataFrame?

Options:

A.

spark.read.schema(fileSchema).format("parquet").load(filePath)


B.

spark.read.schema("fileSchema").format("parquet").load(filePath)


C.

spark.read().schema(fileSchema).parquet(filePath)


D.

spark.read().schema(fileSchema).format(parquet).load(filePath)


E.

spark.read.schema(fileSchema).open(filePath)


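For reference on Question # 29, a minimal sketch of reading parquet with an explicit schema, assuming fileSchema is a StructType object and filePath points to a parquet file:

df = spark.read.schema(fileSchema).format("parquet").load(filePath)
# the shortcut spark.read.schema(fileSchema).parquet(filePath) is equivalent
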
Question # 30:

Which of the following code blocks returns a DataFrame that is an inner join of DataFrame itemsDf and DataFrame transactionsDf, on columns itemId and productId, respectively, and in which every itemId appears just once?

Options:

A.

itemsDf.join(transactionsDf, "itemsDf.itemId==transactionsDf.productId").distinct("itemId")


B.

itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.productId).dropDuplicates(["itemId"])


C.

itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.productId).dropDuplicates("itemId")


D.

itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.productId, how="inner").distinct(["itemId"])


E.

itemsDf.join(transactionsDf, "itemsDf.itemId==transactionsDf.productId", how="inner").dropDuplicates(["itemId"])


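For reference on Question # 30, a minimal sketch of an inner join followed by de-duplication on one column, assuming the itemsDf and transactionsDf DataFrames from the question:

# join() is inner by default; dropDuplicates() takes a list of column names
joined = itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.productId)
deduped = joined.dropDuplicates(["itemId"])
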