Explanation
Due to the mutability of DataFrames after transformations, Spark reproduces them using observed lineage in case of worker node failure.
Wrong – DataFrames are immutable: a transformation never modifies an existing DataFrame, it returns a new one. Since Spark also records the lineage of transformations, it can recompute any DataFrame in case of a worker failure. These two aspects are the key to understanding fault tolerance in Spark.
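A minimal PySpark sketch of both aspects (the DataFrame and column names here are illustrative, not part of the question): every transformation returns a new DataFrame, and the plan Spark keeps for each DataFrame is what it replays to rebuild lost partitions.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()

# A tiny DataFrame; the transformations below never modify it in place.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "label"])

# Each transformation returns a *new* DataFrame; `df` itself is unchanged.
filtered = df.filter(F.col("id") > 1)
renamed = filtered.withColumnRenamed("label", "tag")

# The logical and physical plans Spark recorded for `renamed` – the lineage
# it would replay to recompute lost partitions after a worker failure.
renamed.explain(True)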
Spark builds a fault-tolerant layer on top of the legacy RDD data system, which by itself is not fault tolerant.
Wrong. RDD stands for Resilient Distributed Dataset; it is at the core of Spark, not a "legacy system", and it is fault-tolerant by design.
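As an illustrative sketch (the variable names are assumptions, not from the question), the lineage that makes RDDs resilient can be inspected directly with toDebugString:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-lineage-demo").getOrCreate()

# Build a small RDD and derive another one from it.
rdd = spark.sparkContext.parallelize(range(100))
even_squares = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

# The recorded lineage; Spark uses it to recompute lost partitions on
# surviving workers instead of failing the job.
print(even_squares.toDebugString().decode("utf-8"))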
Spark helps fast recovery of data in case of a worker fault by providing the MEMORY_AND_DISK storage level option.
This is not true. For supporting recovery in case of worker failures, Spark provides replicated storage level options, recognizable by a suffix such as "_2", for example MEMORY_AND_DISK_2. These storage levels are specifically designed to keep a copy of each cached partition on a second node. This saves time in case of a worker fault, since the copy can be used immediately instead of having to recompute the data first.
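A minimal sketch of persisting with a replicated storage level (the DataFrame here is an assumed example):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("replicated-persist-demo").getOrCreate()
df = spark.range(1_000_000)

# MEMORY_AND_DISK_2 keeps each cached partition on two cluster nodes, so if
# one worker is lost, the copy on the other node is used instead of having
# to recompute the partition from lineage.
df.persist(StorageLevel.MEMORY_AND_DISK_2)
df.count()  # an action to actually materialize the replicated cache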
Spark is only fault-tolerant if this feature is specifically enabled via the spark.fault_recovery.enabled property.
No, Spark is fault-tolerant by design; there is no spark.fault_recovery.enabled property.