Explanation
This question is tricky. Two things are important to know here:
First, the syntax for createDataFrame: it expects a list of tuples, like so: [(1,), (2,)]. When a tuple contains only a single item, you must put a comma after the item so that Python interprets it as a tuple rather than as an ordinary parenthesized expression.
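The single-item tuple rule can be checked in plain Python, no Spark required:

```python
# A trailing comma is what makes a single-item tuple a tuple.
with_comma = ("23/01/2022 11:28:12",)    # tuple containing one string
without_comma = ("23/01/2022 11:28:12")  # just a parenthesized string

print(type(with_comma))    # <class 'tuple'>
print(type(without_comma)) # <class 'str'>
```

This is exactly why createDataFrame([("…",), ("…",)], ["date"]) works while a plain list of strings does not: only the first form gives Spark row-shaped data.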
Second, you should understand the to_timestamp syntax. You can find out more about it in the documentation linked below.
For good measure, let's examine in detail why the incorrect options are wrong:
dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
This code snippet does everything the question asks for – except that the data type of the date column is a string, not a timestamp. When no schema is specified, Spark infers the string data type by default.
dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])
dfDates = dfDates.withColumn("date", to_timestamp("dd/MM/yyyy HH:mm:ss", "date"))
In the first line of this command, Spark throws the following error: TypeError: Can not infer schema for type: &lt;class 'str'&gt;. This is because Spark expects to find row information, but instead finds plain strings. This is why you need to specify the data as tuples. Fortunately, the Spark documentation (linked below) shows a number of examples for creating DataFrames that should help you get on the right track here.
dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
dfDates = dfDates.withColumnRenamed("date", to_timestamp("date", "yyyy-MM-dd HH:mm:ss"))
The issue with this answer is that the operator withColumnRenamed is used. This operator simply renames a column; it cannot modify the column's contents. This is why withColumn should be used instead. In addition, the date format yyyy-MM-dd HH:mm:ss does not match the format of the actual timestamp strings, such as "23/01/2022 11:28:12".
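To see why the pattern mismatch matters, here is an analogy in plain Python. Note this is an illustration only: Python's datetime.strptime uses % codes rather than Spark's letter patterns ("%d/%m/%Y %H:%M:%S" plays the role of "dd/MM/yyyy HH:mm:ss"), but the effect of a wrong pattern is the same – the value cannot be parsed.

```python
from datetime import datetime

ts = "23/01/2022 11:28:12"

# Matching pattern: parsing succeeds.
parsed = datetime.strptime(ts, "%d/%m/%Y %H:%M:%S")
print(parsed)  # 2022-01-23 11:28:12

# yyyy-MM-dd-style pattern: does not match the string, parsing fails.
try:
    datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
except ValueError as e:
    print("mismatch:", e)
```

In Spark, to_timestamp with a non-matching pattern yields null values (or an error, depending on the parser mode) rather than the timestamps you want.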
dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])
dfDates = dfDates.withColumnRenamed("date", to_datetime("date", "yyyy-MM-dd HH:mm:ss"))
Here, withColumnRenamed is used instead of withColumn (see above). In addition, the rows are not expressed correctly – they should be written as tuples, using parentheses. Furthermore, to_datetime is not a PySpark function (it belongs to pandas); the correct function is to_timestamp. Finally, even the date format is off here (see above).
More info: pyspark.sql.functions.to_timestamp — PySpark 3.1.2 documentation and pyspark.sql.SparkSession.createDataFrame — PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 2, question 38 (Databricks import instructions)