What is the difference between df.cache() and df.persist() for a Spark DataFrame?
A.
Both cache() and persist() can be used to set the default storage level (MEMORY_AND_DISK_SER)
B.
Both functions perform the same operation. The persist() function provides improved performance as its default storage level is DISK_ONLY.
C.
persist() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK_SER) and cache() - Can be used to set different storage levels to persist the contents of the DataFrame.
D.
cache() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK) and persist() - Can be used to set different storage levels to persist the contents of the DataFrame
df.cache() is shorthand for df.persist(StorageLevel.MEMORY_AND_DISK).
df.persist() accepts any storage level, such as MEMORY_ONLY, DISK_ONLY, or MEMORY_AND_DISK_SER.
When called with no argument, persist() also defaults to MEMORY_AND_DISK, so in that case the two calls are equivalent.
[Reference: Spark Programming Guide - Caching and Persistence]
Chosen Answer: D