Explanation
Correct code block:
transactionsDfMonday.unionByName(transactionsDfTuesday, True)
Output of correct code block:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId| f|
+-------------+---------+-----+-------+---------+----+
| 5| null| null| null| 2|null|
| 6| 3| 2| 25| 2|null|
| 1| null| 4| 25| 1|null|
| 2| null| 7| 2| 2|null|
| 4| null| null| 3| 2|null|
| 5| null| null| null| 2|null|
+-------------+---------+-----+-------+---------+----+
To answer this question, you need to be aware of the difference between the DataFrame.union() and DataFrame.unionByName() methods. The first matches columns purely by their position, regardless of
their names. The second matches columns by name (which is what the question asks for). It also has a useful optional argument, allowMissingColumns, which lets you
merge DataFrames that have different sets of columns, just like in this example.
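A minimal sketch of the difference, using tiny made-up DataFrames rather than the question's data:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical miniature DataFrames: same columns in a different order, plus an
# extra column f that only the second DataFrame has.
dfMonday = spark.createDataFrame([(1, 10)], ["transactionId", "value"])
dfTuesday = spark.createDataFrame([(20, 2, 99)], ["value", "transactionId", "f"])

# union() matches columns purely by position, so transactionId and value get
# mixed up here (and union() fails outright if the column counts differ).
dfMonday.union(dfTuesday.select("value", "transactionId")).show()

# unionByName() matches columns by name; allowMissingColumns=True fills columns
# that exist on only one side (here: f) with nulls.
dfMonday.unionByName(dfTuesday, allowMissingColumns=True).show()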
sc stands for SparkContext and is automatically provided when executing code on Databricks. While sc.union() lets you combine RDDs, it is not the right choice for combining DataFrames. The question's
wording, which asks for the rows to be joined "into a new DataFrame", is a hint away from sc.union().
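For illustration (the RDD contents below are made up), sc.union() expects a list of RDDs:
rddA = sc.parallelize([1, 2, 3])
rddB = sc.parallelize([4, 5])
sc.union([rddA, rddB]).collect()  # returns [1, 2, 3, 4, 5]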
concat is a function in pyspark.sql.functions. It is great for concatenating values from different columns of the same row, but it has no role in combining the rows of multiple DataFrames.
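A quick, hypothetical example of what concat is actually for (column names made up, spark session reused from the sketch above):
from pyspark.sql import functions as F

df = spark.createDataFrame([("store_", "25")], ["prefix", "storeId"])
df.select(F.concat("prefix", "storeId").alias("label")).show()  # one row: store_25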
Finally, the join method is a contender here. However, its default join type is an inner join, which does not get us closer to the goal of combining the two DataFrames as instructed,
especially since with the default arguments we cannot even define a join condition.
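Sketched with the miniature DataFrames from above: even with a join condition supplied, join matches rows side by side instead of stacking them.
dfMonday.join(dfTuesday, on="transactionId")  # how defaults to "inner"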
More info:
- pyspark.sql.DataFrame.unionByName — PySpark 3.1.2 documentation
- pyspark.SparkContext.union — PySpark 3.1.2 documentation
- pyspark.sql.functions.concat — PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See Test 3, Question 45 (Databricks import instructions)