groupBy() is a wide transformation: it causes a shuffle because all rows for a given key must end up in the same partition, which usually means moving data across partitions.
In contrast:
filter() and select() are narrow transformations and do not cause shuffles.
coalesce() reduces the number of partitions by merging existing ones, so it avoids the full shuffle that repartition() performs.
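
To make the difference concrete, here is a rough toy model in plain Python. It is not Spark code and does not use any Spark API. The helper names (filter_rows, group_by_key, coalesce) and the byte-sum "partitioner" are hypothetical, made up only to show which operations need rows to cross partition boundaries.

```python
from collections import defaultdict

# Toy model (assumption): a dataset is a list of partitions,
# and each partition is a list of (key, value) rows.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("c", 6)],
]


def filter_rows(parts, predicate):
    """Narrow: each output partition is built from exactly one input partition."""
    return [[row for row in part if predicate(row)] for part in parts]


def group_by_key(parts, num_partitions):
    """Wide: rows are re-bucketed by key, so many rows change partition."""
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    moved = 0
    for src, part in enumerate(parts):
        for key, value in part:
            # Deterministic stand-in for a hash partitioner (hypothetical).
            dest = sum(key.encode()) % num_partitions
            if dest != src:
                moved += 1  # this row would travel in a shuffle
            buckets[dest][key].append(value)
    return [dict(b) for b in buckets], moved


def coalesce(parts, num_partitions):
    """Merge whole partitions into fewer ones; rows are never re-bucketed by key."""
    merged = [[] for _ in range(num_partitions)]
    for i, part in enumerate(parts):
        merged[i % num_partitions].extend(part)
    return merged


filtered = filter_rows(partitions, lambda row: row[1] > 2)
grouped, moved = group_by_key(partitions, 3)
fewer = coalesce(partitions, 2)

print(filtered)  # same 3 partitions, rows only dropped in place
print(grouped)   # every key now lives in exactly one partition
print(moved)     # 4 of the 6 rows had to change partition
print(fewer)     # 2 partitions made of whole input partitions
```

In this sketch the filter and the coalesce never look at the key to decide where a row goes, while the grouping has to, and that key-based redistribution is what a shuffle is. Real Spark picks which partitions to merge in coalesce() differently (it considers locality), so treat the round-robin merge here as an illustration only.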