By default, the spark.read.text() method reads a text file one line per record. This means that each line in a text file becomes one row in the resulting DataFrame.
To read each file as a single record, Apache Spark provides the wholetext option; when set to True, Spark treats the entire contents of each file as one string, producing one row per file.
Correct usage:
df = spark.read.option("wholetext", True).text(txt_path)
This way, each record in the DataFrame will contain the full content of one file instead of one line per record.
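The difference between the two modes can be illustrated without Spark. This stdlib-only sketch (the temporary file is illustrative) mimics what line-per-record reading versus whole-file reading produce:

```python
import os
import tempfile

# Create a small two-line text file to read back.
content = "first line\nsecond line"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(content)
    path = f.name

# Default behavior (wholetext=False): one record per line.
with open(path) as f:
    line_records = f.read().splitlines()
print(line_records)   # two records

# wholetext=True: the entire file becomes a single record.
with open(path) as f:
    file_records = [f.read()]
print(file_records)   # one record holding the full file

os.unlink(path)
```

With wholetext enabled, a DataFrame built over two files would have exactly two rows, no matter how many lines each file contains.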
To also include the file path, the function input_file_name() can be used to create an additional column that stores the complete path of the file being read:
from pyspark.sql.functions import input_file_name
df = spark.read.option("wholetext", True).text(txt_path) \
    .withColumn("file_path", input_file_name())
This approach satisfies both requirements from the question:
Each record holds the entire contents of a file.
Each record also contains the file path from which the text was read.
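The resulting two-column shape (file contents plus source path) can be sketched in plain Python; the directory and file names below are illustrative, and the tuple mimics the DataFrame's (value, file_path) columns:

```python
import tempfile
from pathlib import Path

# Write two small files into a temporary directory (illustrative names).
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "a.txt").write_text("alpha")
(tmp_dir / "b.txt").write_text("beta")

# Mimic wholetext=True plus input_file_name(): one (value, file_path)
# record per file, analogous to the DataFrame's two columns.
records = sorted((p.read_text(), str(p)) for p in tmp_dir.glob("*.txt"))
for value, file_path in records:
    print(file_path, "->", value)
```

Each tuple corresponds to one row of the DataFrame produced by the snippet above: the full file contents in the value column and the originating path in file_path.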
Why the other options are incorrect:
B or D (lineSep) – The lineSep option only defines the delimiter between lines. It does not combine the entire file content into a single record.
C (wholetext=False) – This is the default behavior, which still reads one record per line rather than per file.
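Why lineSep does not help can also be shown with a plain string split, which is conceptually what a custom record delimiter does; the "||" separator here is an arbitrary example:

```python
# lineSep only changes the delimiter used to split a file into records;
# it never merges the whole file into one record. Simulated with a split:
data = "rec1||rec2||rec3"

# With lineSep="||", Spark would still produce one record per segment.
records = data.split("||")
print(records)   # three rows, not one
```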
References (Databricks Apache Spark 3.5 – Python / Study Guide):
PySpark API Reference: DataFrameReader.text — describes the wholetext option.
PySpark Functions: input_file_name() — adds a column with the source file path.
Databricks Certified Associate Developer for Apache Spark Exam Guide (June 2025): Section “Using Spark DataFrame APIs” — covers reading files and handling DataFrames.