Pass the Databricks Certification Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 questions and answers with CertsForce

Viewing page 4 of 6 (questions 31-40)
Question # 31:

Which of the following describes slots?

Options:

A.

Slots are dynamically created and destroyed in accordance with an executor's workload.


B.

To optimize I/O performance, Spark stores data on disk in multiple slots.


C.

A Java Virtual Machine (JVM) working as an executor can be considered as a pool of slots for task execution.


D.

A slot is always limited to a single core.


E.

Slots are the communication interface for executors and are used for receiving commands and sending results to the driver.


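Background, not part of the original question: an executor JVM offers one task slot per core assigned to it (with the default spark.task.cpus of 1), so the slot count is governed by the executor-core setting. A minimal sketch of setting it when building a session; the application name is a placeholder:

from pyspark.sql import SparkSession

# With the default spark.task.cpus=1, each executor core provides one task slot,
# so an executor configured with 4 cores can run 4 tasks concurrently.
spark = (
    SparkSession.builder
    .appName("slots-example")              # placeholder application name
    .config("spark.executor.cores", "4")   # 4 slots per executor
    .getOrCreate()
)
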
Question # 32:

Which of the following statements about reducing out-of-memory errors is incorrect?

Options:

A.

Concatenating multiple string columns into a single column may guard against out-of-memory errors.


B.

Reducing partition size can help against out-of-memory errors.


C.

Limiting the amount of data being automatically broadcast in joins can help against out-of-memory errors.


D.

Setting a limit on the maximum size of serialized data returned to the driver may help prevent out-of-memory errors.


E.

Decreasing the number of cores available to each executor can help against out-of-memory errors.


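For reference, two of the levers mentioned in the options can be configured when the session is created; the values below are placeholders, not recommendations:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Cap the size of tables Spark will broadcast automatically in joins.
    .config("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)
    # Cap the total size of serialized results returned to the driver.
    .config("spark.driver.maxResultSize", "1g")
    .getOrCreate()
)
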
Question # 33:

The code block displayed below contains an error. The code block should return a copy of DataFrame transactionsDf where the name of column transactionId has been changed to transactionNumber. Find the error.

Code block:

transactionsDf.withColumn("transactionNumber", "transactionId")

Options:

A.

The arguments to the withColumn method need to be reordered.


B.

The arguments to the withColumn method need to be reordered and the copy() operator should be appended to the code block to ensure a copy is returned.


C.

The copy() operator should be appended to the code block to ensure a copy is returned.


D.

Each column name needs to be wrapped in the col() method and method withColumn should be replaced by method withColumnRenamed.


E.

The method withColumn should be replaced by method withColumnRenamed and the arguments to the method need to be reordered.


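For reference, withColumnRenamed takes the existing column name first and the new name second, and returns a new DataFrame rather than modifying the original. A minimal sketch, assuming transactionsDf is already defined:

# Returns a copy of transactionsDf in which transactionId is renamed to transactionNumber;
# transactionsDf itself is left unchanged.
renamedDf = transactionsDf.withColumnRenamed("transactionId", "transactionNumber")
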
Question # 34:

Which of the following code blocks shuffles DataFrame transactionsDf, which has 8 partitions, so that it has 10 partitions?

Options:

A.

transactionsDf.repartition(transactionsDf.getNumPartitions()+2)


B.

transactionsDf.repartition(transactionsDf.rdd.getNumPartitions()+2)


C.

transactionsDf.coalesce(10)


D.

transactionsDf.coalesce(transactionsDf.getNumPartitions()+2)


E.

transactionsDf.repartition(transactionsDf._partitions+2)


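For reference, a DataFrame's current partition count is exposed through its underlying RDD, and repartition() always performs a full shuffle to the requested number of partitions. A minimal sketch, assuming transactionsDf currently has 8 partitions:

# getNumPartitions() is an RDD method, so it is reached via transactionsDf.rdd.
currentPartitions = transactionsDf.rdd.getNumPartitions()    # 8

# repartition() shuffles the data into the requested number of partitions.
repartitionedDf = transactionsDf.repartition(currentPartitions + 2)
print(repartitionedDf.rdd.getNumPartitions())                # 10
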
Question # 35:

Which of the following describes the difference between client and cluster execution modes?

Options:

A.

In cluster mode, the driver runs on the worker nodes, while the client mode runs the driver on the client machine.


B.

In cluster mode, the driver runs on the edge node, while the client mode runs the driver in a worker node.


C.

In cluster mode, each node will launch its own executor, while in client mode, executors will exclusively run on the client machine.


D.

In client mode, the cluster manager runs on the same host as the driver, while in cluster mode, the cluster manager runs on a separate node.


E.

In cluster mode, the driver runs on the master node, while in client mode, the driver runs on a virtual machine in the cloud.


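Background, not part of the original question: the deploy mode is fixed when the application is submitted (for example via spark-submit --deploy-mode client or --deploy-mode cluster) and can be inspected from a running session. A minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# 'client' means the driver runs on the machine that submitted the application;
# 'cluster' means the driver runs on a worker node inside the cluster.
print(spark.conf.get("spark.submit.deployMode", "client"))
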
Question # 36:

Which of the following statements about the differences between actions and transformations is correct?

Options:

A.

Actions are evaluated lazily, while transformations are not evaluated lazily.


B.

Actions generate RDDs, while transformations do not.


C.

Actions do not send results to the driver, while transformations do.


D.

Actions can be queued for delayed execution, while transformations can only be processed immediately.


E.

Actions can trigger Adaptive Query Execution, while transformations cannot.


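For reference, the lazy-evaluation distinction behind these options can be observed directly. A minimal sketch, assuming transactionsDf has a numeric value column:

# filter() is a transformation: it only adds a step to the query plan, nothing runs yet.
filteredDf = transactionsDf.filter(transactionsDf.value > 0)

# count() is an action: it triggers execution of the plan and returns a result to the driver.
print(filteredDf.count())
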
Question # 37:

Which of the following code blocks returns a DataFrame with a single column in which all items in column attributes of DataFrame itemsDf are listed that contain the letter i?

Sample of DataFrame itemsDf:

+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+

Options:

A.

itemsDf.select(explode("attributes").alias("attributes_exploded")).filter(attributes_exploded.contains("i"))


B.

itemsDf.explode(attributes).alias("attributes_exploded").filter(col("attributes_exploded").contains("i"))


C.

itemsDf.select(explode("attributes")).filter("attributes_exploded".contains("i"))


D.

itemsDf.select(explode("attributes").alias("attributes_exploded")).filter(col("attributes_exploded").contains("i"))


E.

itemsDf.select(col("attributes").explode().alias("attributes_exploded")).filter(col("attributes_exploded").contains("i"))


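For reference, the explode/alias/filter pattern that several options are built around looks like this as a runnable sketch, assuming itemsDf as shown above:

from pyspark.sql.functions import col, explode

# explode() emits one row per element of the array column attributes;
# alias() names the resulting column so that filter() can reference it.
result = (
    itemsDf
    .select(explode("attributes").alias("attributes_exploded"))
    .filter(col("attributes_exploded").contains("i"))
)
result.show(truncate=False)
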
Question # 38:

The code block displayed below contains an error. The code block should return the average of rows in column value grouped by unique storeId. Find the error.

Code block:

transactionsDf.agg("storeId").avg("value")

Options:

A.

Instead of avg("value"), avg(col("value")) should be used.


B.

The avg("value") should be specified as a second argument to agg() instead of being appended to it.


C.

All column names should be wrapped in col() operators.


D.

agg should be replaced by groupBy.


E.

"storeId" and "value" should be swapped.


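For reference, grouped aggregation in PySpark pairs groupBy() with an aggregation such as avg(). A minimal sketch, assuming transactionsDf has storeId and value columns:

from pyspark.sql.functions import avg

# groupBy() groups rows by unique storeId; agg(avg(...)) computes the per-group mean of value.
transactionsDf.groupBy("storeId").agg(avg("value")).show()

# The shorthand transactionsDf.groupBy("storeId").avg("value") is equivalent.
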
Question # 39:

Which of the following code blocks reads in the parquet file stored at location filePath, given that all columns in the parquet file contain only whole numbers and are stored in the most appropriate format for this kind of data?

Options:

A.

spark.read.schema(
    StructType(
        StructField("transactionId", IntegerType(), True),
        StructField("predError", IntegerType(), True)
    )).load(filePath)


B.

spark.read.schema([
    StructField("transactionId", NumberType(), True),
    StructField("predError", IntegerType(), True)
]).load(filePath)


C.

spark.read.schema(
    StructType([
        StructField("transactionId", StringType(), True),
        StructField("predError", IntegerType(), True)]
    )).parquet(filePath)


D.

spark.read.schema(
    StructType([
        StructField("transactionId", IntegerType(), True),
        StructField("predError", IntegerType(), True)]
    )).format("parquet").load(filePath)


E.

spark.read.schema([
    StructField("transactionId", IntegerType(), True),
    StructField("predError", IntegerType(), True)
]).load(filePath, format="parquet")


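For reference, a user-defined schema is a StructType wrapping a list of StructFields, and whole numbers map naturally to IntegerType. A minimal sketch, assuming a SparkSession named spark and a parquet file at filePath:

from pyspark.sql.types import IntegerType, StructField, StructType

schema = StructType([
    StructField("transactionId", IntegerType(), True),
    StructField("predError", IntegerType(), True),
])

# .parquet(filePath) and .format("parquet").load(filePath) are interchangeable here.
df = spark.read.schema(schema).parquet(filePath)
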
Question # 40:

Which of the following code blocks can be used to save DataFrame transactionsDf to memory only, recalculating partitions that do not fit in memory when they are needed?

Options:

A.

from pyspark import StorageLevel

transactionsDf.cache(StorageLevel.MEMORY_ONLY)


B.

transactionsDf.cache()


C.

transactionsDf.storage_level('MEMORY_ONLY')


D.

transactionsDf.persist()


E.

transactionsDf.clear_persist()


F.

from pyspark import StorageLevel

transactionsDf.persist(StorageLevel.MEMORY_ONLY)


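For reference, persist() accepts an explicit storage level, whereas cache() takes no arguments and uses Spark's default level rather than MEMORY_ONLY. A minimal sketch, assuming transactionsDf is already defined:

from pyspark import StorageLevel

# MEMORY_ONLY keeps partitions in memory only; any partition that does not fit
# is recomputed from its lineage when needed instead of being spilled to disk.
transactionsDf.persist(StorageLevel.MEMORY_ONLY)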