Explanation
Use a narrow transformation to reduce the number of partitions.
Correct! DataFrame.coalesce(n) is a narrow transformation and, of all the options listed, the most efficient way to resize the DataFrame. One would run DataFrame.coalesce(8) to resize the DataFrame to 8 partitions.
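As a minimal PySpark sketch (assuming a running SparkSession; the DataFrame and its starting partition count of 64 are made up purely for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000).repartition(64)   # example DataFrame with 64 partitions
resized = df.coalesce(8)                      # narrow transformation: merge down to 8 partitions
print(resized.rdd.getNumPartitions())         # 8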
Use operation DataFrame.coalesce(8) to fully shuffle the DataFrame and reduce the number of partitions.
Wrong. The coalesce operation avoids a full shuffle, although it may still move some data between partitions if needed. This answer is incorrect because it says "fully shuffle" – this is something the coalesce operation will not do. As a general rule, it reduces the number of partitions with the least possible movement of data. More info: distributed computing - Spark - repartition() vs coalesce() - Stack Overflow
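One way to see the difference, sketched here with the spark session and df from the example above, is to compare the physical plans: coalesce appears as a Coalesce step without an Exchange, while repartition introduces an Exchange, i.e. a full shuffle.

df.coalesce(8).explain()      # physical plan shows Coalesce, no Exchange (no full shuffle)
df.repartition(8).explain()   # physical plan shows Exchange (full shuffle)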
Use operation DataFrame.coalesce(0.5) to halve the number of partitions in the DataFrame.
Incorrect, since the numPartitions parameter must be an integer specifying the exact number of partitions desired after the operation. More info: pyspark.sql.DataFrame.coalesce — PySpark 3.1.2 documentation
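If the goal really were to halve the number of partitions, a sketch of a correct approach (reusing the df from the first example) would be to read the current partition count and pass an integer:

current = df.rdd.getNumPartitions()           # e.g. 64
halved = df.coalesce(max(current // 2, 1))    # integer argument, never a fraction like 0.5
print(halved.rdd.getNumPartitions())          # 32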
Use operation DataFrame.repartition(8) to shuffle the DataFrame and reduce the number of partitions.
No. The repartition operation fully shuffles the DataFrame, so of all the listed options it is not the most efficient way to reduce the number of partitions.
Use a wide transformation to reduce the number of partitions.
No. While possible via the DataFrame.repartition(n) command, the resulting full shuffle is not the most efficient way of reducing the number of partitions.