Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Exam Topic 2 Question 13 Discussion:
Question #: 13
Topic #: 2

A developer is working with a pandas DataFrame containing user behavior data from a web application.

Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?


A.

Use the applyInPandas API:

df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()


B.

Use the mapInPandas API:

df.mapInPandas(mean_func, schema="user_id long, value double").show()


C.

Use a regular Spark UDF:

from pyspark.sql.functions import mean

df.groupBy("user_id").agg(mean("value")).show()


D.

Use a Pandas UDF:

@pandas_udf("double")

def mean_func(value: pd.Series) -> float:

return value.mean()

df.groupby("user_id").agg(mean_func(df["value"])).show()

