Explanation
Correct code block:
transactionsDf.orderBy('value', desc_nulls_last('predError'))
Column predError should be sorted in a descending way, putting nulls last.
Correct! By default, Spark sorts in ascending order, putting nulls first (asc_nulls_first). The inverse of this default is therefore indeed desc_nulls_last.
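A minimal, runnable sketch of the correct expression. The question's real transactionsDf is not shown here, so the sample rows below are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import desc_nulls_last

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the question's transactionsDf.
transactionsDf = spark.createDataFrame(
    [(2, 3), (1, 5), (1, None)],
    "value INT, predError INT",
)

# Sort by value ascending (the default), then by predError descending with nulls last.
transactionsDf.orderBy('value', desc_nulls_last('predError')).show()
# Expected order: (1, 5), (1, null), (2, 3)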
Instead of orderBy, sort should be used.
No. In the DataFrame API, sort() is simply an alias for orderBy(), so swapping one for the other changes nothing. Ordering data only within partitions, without a global guarantee, is the job of DataFrame.sortWithinPartitions() instead.
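A quick sketch of the distinction, reusing the same hypothetical data as above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import desc_nulls_last

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame([(2, 3), (1, 5), (1, None)], "value INT, predError INT")

# sort() is an alias for orderBy(); both produce the same global ordering.
a = transactionsDf.orderBy('value', desc_nulls_last('predError'))
b = transactionsDf.sort('value', desc_nulls_last('predError'))

# Per-partition ordering, with no global guarantee, is a different operator:
c = transactionsDf.sortWithinPartitions('value')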
Column value should be wrapped by the col() operator.
Incorrect. DataFrame.orderBy() (like its alias sort()) accepts both column name strings and Column objects, so wrapping value in col() is possible but not necessary.
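Both invocation styles below are equivalent (same hypothetical data again):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc_nulls_last

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame([(2, 3), (1, 5), (1, None)], "value INT, predError INT")

# A plain string and a col() expression are both accepted as sort keys.
transactionsDf.orderBy('value', desc_nulls_last('predError'))
transactionsDf.orderBy(col('value'), desc_nulls_last('predError'))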
Column predError should be sorted by desc_nulls_first() instead.
Wrong. Spark's default sort order matches asc_nulls_first(), so inverting it means nulls must come last: desc_nulls_last(), not desc_nulls_first().
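Swapping in desc_nulls_first() would move the nulls to the top instead (hypothetical data as above):

from pyspark.sql import SparkSession
from pyspark.sql.functions import desc_nulls_first

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame([(2, 3), (1, 5), (1, None)], "value INT, predError INT")

transactionsDf.orderBy('value', desc_nulls_first('predError')).show()
# Expected order: (1, null), (1, 5), (2, 3) -- nulls first, the opposite of what is asked.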
Two orderBy statements with calls to the individual columns should be chained, instead of having both columns in one orderBy statement.
No. The second orderBy call would trigger a fresh global sort by the very last column alone; the order established by the first call is not preserved. Information from both columns would therefore not be taken into account, as the question requires.
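A sketch of why chaining fails (hypothetical data as above):

from pyspark.sql import SparkSession
from pyspark.sql.functions import desc_nulls_last

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame([(2, 3), (1, 5), (1, None)], "value INT, predError INT")

# The second orderBy triggers a fresh global sort by predError alone;
# the ordering by value from the first call is not preserved.
transactionsDf.orderBy('value').orderBy(desc_nulls_last('predError'))

# Correct: pass both sort keys to a single orderBy call.
transactionsDf.orderBy('value', desc_nulls_last('predError'))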
More info: pyspark.sql.DataFrame.orderBy — PySpark 3.1.2 documentation; pyspark.sql.functions.desc_nulls_last — PySpark 3.1.2 documentation; sort() vs orderBy() in Spark | Towards Data Science