Pass the Databricks Certification Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Questions and Answers with CertsForce

Viewing page 2 of 3 (questions 11-20)
Question # 11:

A data scientist is working on a large dataset in Apache Spark using PySpark. The data scientist has a DataFrame df with columns user_id, product_id, and purchase_amount, and needs to perform some operations on this data efficiently.

Which sequence of operations results in transformations that require a shuffle followed by transformations that do not?

Options:

A.

df.filter(df.purchase_amount > 100).groupBy("user_id").sum("purchase_amount")


B.

df.withColumn("discount", df.purchase_amount * 0.1).select("discount")


C.

df.withColumn("purchase_date", current_date()).where("total_purchase > 50")


D.

df.groupBy("user_id").agg(sum("purchase_amount").alias("total_purchase")).repartition(10)


Expert Solution
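
For reference, wide transformations such as groupBy, repartition, and joins require a shuffle, while filter, withColumn, and select are narrow and do not. A minimal PySpark sketch of the distinction, using hypothetical rows for the columns named in the question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows matching the columns in the question.
df = spark.createDataFrame(
    [(1, 10, 120.0), (1, 11, 80.0), (2, 12, 200.0)],
    ["user_id", "product_id", "purchase_amount"],
)

# groupBy/agg is a wide transformation (requires a shuffle); the
# withColumn and filter that follow are narrow (no shuffle needed).
result = (
    df.groupBy("user_id")
      .agg(F.sum("purchase_amount").alias("total_purchase"))
      .withColumn("high_value", F.col("total_purchase") > 100)
      .filter(F.col("high_value"))
)
result.explain()  # the physical plan shows an Exchange only for the aggregation
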
Question # 12:

What is the risk associated with converting a large Pandas API on Spark DataFrame back to a Pandas DataFrame?

Options:

A.

The conversion will automatically distribute the data across worker nodes


B.

The operation will fail if the Pandas DataFrame exceeds 1000 rows


C.

Data will be lost during conversion


D.

The operation will load all data into the driver's memory, potentially causing memory overflow


Expert Solution
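
For context, the pandas API on Spark keeps data distributed across executors, but converting back to plain pandas collects everything to the driver. A minimal sketch (the column names are illustrative):

import pyspark.pandas as ps

# Distributed pandas-on-Spark DataFrame; the rows live on the executors.
psdf = ps.DataFrame({"user_id": list(range(1000)), "value": list(range(1000))})

# to_pandas() pulls every row back to the driver. On a genuinely large
# dataset this can exhaust driver memory, so filter or aggregate the data
# down to a driver-sized result before converting.
small_pdf = psdf[psdf["user_id"] < 100].to_pandas()
print(type(small_pdf))  # <class 'pandas.core.frame.DataFrame'>
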
Question # 13:

A developer is working with a pandas DataFrame containing user behavior data from a web application.

Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?

Options:

A.

Use the applyInPandas API:

df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()


B.

Use the mapInPandas API:

df.mapInPandas(mean_func, schema="user_id long, value double").show()


C.

Use a regular Spark UDF:

from pyspark.sql.functions import mean

df.groupBy("user_id").agg(mean("value")).show()


D.

Use a Pandas UDF:

@pandas_udf("double")
def mean_func(value: pd.Series) -> float:
    return value.mean()

df.groupby("user_id").agg(mean_func(df["value"])).show()


Expert Solution
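
A runnable sketch of the applyInPandas pattern referenced in option A, with the mean_func and input data that the snippet assumes defined explicitly:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10.0), (1, 20.0), (2, 5.0)], ["user_id", "value"])

def mean_func(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group arrives as a pandas DataFrame on a worker; the return
    # value must match the declared output schema.
    return pd.DataFrame({"user_id": [pdf["user_id"].iloc[0]],
                         "value": [pdf["value"].mean()]})

# The grouped work runs in parallel across the executors.
df.groupby("user_id").applyInPandas(
    mean_func, schema="user_id long, value double"
).show()
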
Question # 14:

A Data Analyst needs to retrieve employees with 5 or more years of tenure.

Which code snippet filters and shows the list?

Options:

A.

employees_df.filter(employees_df.tenure >= 5).show()


B.

employees_df.where(employees_df.tenure >= 5)


C.

filter(employees_df.tenure >= 5)


D.

employees_df.filter(employees_df.tenure >= 5).collect()


Expert Solution
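
For reference, filter and where are aliases; show() displays the matching rows, while collect() returns them to the driver. A small sketch with an illustrative employees_df:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
employees_df = spark.createDataFrame(
    [("Ana", 7), ("Bo", 3), ("Cy", 5)], ["name", "tenure"]
)

# filter() and where() build the same plan; show() prints the matches.
employees_df.filter(employees_df.tenure >= 5).show()
employees_df.where(employees_df.tenure >= 5).show()
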
Question # 15:

What is the behavior of the function date_sub(start, days) if a negative value is passed into the days parameter?

Options:

A.

The same start date will be returned


B.

An error message of an invalid parameter will be returned


C.

The number of days specified will be added to the start date


D.

The number of days specified will be removed from the start date


Expert Solution
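
The behaviour can be checked directly: date_sub with a negative days value moves the date forward, mirroring date_add. A quick sketch:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2024-01-10",)], ["start"])

# date_sub(start, -5) behaves like date_add(start, 5): both return 2024-01-15.
df.select(
    F.date_sub(F.col("start"), -5).alias("date_sub_minus_5"),
    F.date_add(F.col("start"), 5).alias("date_add_plus_5"),
).show()
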
Question # 16:

Given this code:

.withWatermark("event_time","10 minutes")

.groupBy(window("event_time","15 minutes"))

.count()

What happens to data that arrives after the watermark threshold?

Options:

A.

Records that arrive later than the watermark threshold (10 minutes) will automatically be included in the aggregation if they fall within the 15-minute window.


B.

Any data arriving more than 10 minutes after the watermark threshold will be ignored and not included in the aggregation.


C.

Data arriving more than 10 minutes after the latest watermark will still be included in the aggregation but will be placed into the next window.


D.

The watermark ensures that late data arriving within 10 minutes of the latest event_time will be processed and included in the windowed aggregation.


Expert Solution
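
A self-contained sketch of this windowed aggregation, using the rate source as a stand-in for whatever input the original (unshown) snippet reads from:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.getOrCreate()

# The rate source stands in for the real input; its 'timestamp' column
# plays the role of event_time here.
events = (
    spark.readStream.format("rate").option("rowsPerSecond", 5).load()
         .withColumnRenamed("timestamp", "event_time")
)

# Records older than (max event_time seen - 10 minutes) are treated as
# too late and are dropped from the windowed aggregation state.
counts = (
    events.withWatermark("event_time", "10 minutes")
          .groupBy(window(col("event_time"), "15 minutes"))
          .count()
)

query = counts.writeStream.outputMode("update").format("console").start()
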
Question # 17:

An engineer notices a significant increase in the execution time of a Spark job. After some investigation, the engineer decides to check the logs produced by the Executors.

How should the engineer retrieve the Executor logs to diagnose performance issues in the Spark application?

Options:

A.

Locate the executor logs on the Spark master node, typically under the /tmp directory.


B.

Use the command spark-submit with the --verbose flag to print the logs to the console.


C.

Use the Spark UI to select the stage and view the executor logs directly from the stages tab.


D.

Fetch the logs by running a Spark job with the spark-sql CLI tool.


Expert Solution
Question # 18:

A data engineer uses a broadcast variable to share a DataFrame containing millions of rows across executors for lookup purposes. What will be the outcome?

Options:

A.

The job may fail if the memory on each executor is not large enough to accommodate the DataFrame being broadcasted


B.

The job may fail if the executors do not have enough CPU cores to process the broadcasted dataset


C.

The job will hang indefinitely as Spark will struggle to distribute and serialize such a large broadcast variable to all executors


D.

The job may fail because the driver does not have enough CPU cores to serialize the large DataFrame


Expert Solution
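
The question describes broadcasting a lookup table that is too large; a minimal sketch of the pattern with the DataFrame broadcast join hint (the table sizes here are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

facts = spark.range(1000000).withColumnRenamed("id", "customer_id")
lookup = spark.range(100).withColumnRenamed("id", "customer_id")

# broadcast() ships a full copy of 'lookup' to every executor. This is
# only appropriate for small lookup tables; broadcasting millions of rows
# can exceed executor memory and cause the job to fail.
joined = facts.join(broadcast(lookup), "customer_id")
joined.explain()  # the plan typically shows a BroadcastHashJoin
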
Question # 19:

A developer is trying to join two tables, sales.purchases_fct and sales.customer_dim, using the following code:

fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'))

The developer has discovered that customers in the purchases_fct table that do not exist in the customer_dim table are being dropped from the joined table.

Which change should be made to the code to stop these customer records from being dropped?

Options:

A.

fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'), 'left')


B.

fact_df = cust_df.join(purch_df, F.col('customer_id') == F.col('custid'))


C.

fact_df = purch_df.join(cust_df, F.col('cust_id') == F.col('customer_id'))


D.

fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'), 'right_outer')


Expert Solution
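
A minimal sketch of the fix with a left join, using illustrative stand-ins for sales.purchases_fct and sales.customer_dim:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

purch_df = spark.createDataFrame([(1, 9.99), (2, 5.00), (3, 7.50)],
                                 ["customer_id", "amount"])
cust_df = spark.createDataFrame([(1, "Ana"), (2, "Bo")],
                                ["custid", "name"])

# The default inner join drops purchases whose customer_id has no match in
# cust_df; a left join keeps them, with nulls for the cust_df columns.
fact_df = purch_df.join(cust_df, F.col("customer_id") == F.col("custid"), "left")
fact_df.show()
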
Question # 20:

A data engineer is working on a real-time analytics pipeline using Apache Spark Structured Streaming. The engineer wants to process incoming data and ensure that triggers control when the query is executed. The system needs to process data in micro-batches with a fixed interval of 5 seconds.

Which code snippet could the data engineer use to fulfill this requirement?

Options:

A.

Uses trigger(continuous='5 seconds') – continuous processing mode.


B.

Uses trigger() – default micro-batch trigger without interval.


C.

Uses trigger(processingTime='5 seconds') – correct micro-batch trigger with interval.


D.

Uses trigger(processingTime=5000) – invalid, as processingTime expects a string.


Expert Solution
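
A minimal sketch of a micro-batch trigger with a fixed 5-second interval, using the rate source as a stand-in for the real input stream:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

# processingTime='5 seconds' runs one micro-batch every 5 seconds;
# the interval is passed as a string.
query = (
    stream_df.writeStream
             .format("console")
             .trigger(processingTime="5 seconds")
             .start()
)
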