A DataFrame df has columns name, age, and salary. The developer needs to sort the DataFrame by age in ascending order and salary in descending order.
Which code snippet meets the requirement of the developer?
Which Spark configuration controls the number of tasks that can run in parallel on the executor?
Options:
A data scientist is working on a project that requires processing large amounts of structured data, performing SQL queries, and applying machine learning algorithms. The data scientist is considering using Apache Spark for this task.
Which combination of Apache Spark modules should the data scientist use in this scenario?
Options:
24 of 55.
Which code should be used to display the schema of the Parquet file stored in the location events.parquet?
25 of 55.
A Data Analyst is working on employees_df and needs to add a new column where a 10% tax is calculated on the salary.
Additionally, the DataFrame contains the column age, which is not needed.
Which code fragment adds the tax column and removes the age column?
45 of 55.
Which feature of Spark Connect should be considered when designing an application that plans to enable remote interaction with a Spark cluster?
12 of 55.
A data scientist has been investigating user profile data to build features for their model. After some exploratory data analysis, the data scientist identified that some records in the user profiles contain NULL values in too many fields to be useful.
The schema of the user profile table looks like this:
user_id STRING,
username STRING,
date_of_birth DATE,
country STRING,
created_at TIMESTAMP
The data scientist decided that if any record contains a NULL value in any field, they want to remove that record from the output before further processing.
Which block of Spark code can be used to achieve these requirements?
41 of 55.
A data engineer is working on the DataFrame df1 and wants the Name with the highest count to appear first (descending order by count), followed by the next highest, and so on.
The DataFrame has columns:
id | Name | count | timestamp
---------------------------------
1 | USA | 10
2 | India | 20
3 | England | 50
4 | India | 50
5 | France | 20
6 | India | 10
7 | USA | 30
8 | USA | 40
Which code fragment should the engineer use to sort the data in the Name and count columns?
Which command overwrites an existing JSON file when writing a DataFrame?
40 of 55.
A developer wants to refactor older Spark code to take advantage of built-in functions introduced in Spark 3.5.
The original code:
from pyspark.sql import functions as F
min_price = 110.50
result_df = prices_df.filter(F.col("price") > min_price).agg(F.count("*"))
Which code block should the developer use to refactor the code?