May 13, 2023

PySpark sort(), orderBy()

 

sort() or orderBy()  # both are synonyms - orderBy() is an alias for sort()


  • sort() - ascending order

    • df.sort(df.age.asc()).collect()

    • df.sort(df['age']).collect()

    • df.sort(col('age').asc()).collect()

    • df.sort('age', ascending=True).collect()   # ascending order

    • For ascending order, rows with null values come first, at the top (see the sketch after the descending examples below)

  • sort() - descending order

    • df.sort('age', ascending=False).collect()  # descending order

    • df.sort(col('age').desc()).collect()  # descending order

    • For descending order, rows with null values come last, at the bottom, as shown in the sketch below
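
A minimal runnable sketch of both directions (the DataFrame and its values here are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# one null age to show where missing values land in each direction
df = spark.createDataFrame(
    [(30, 'Alice'), (None, 'Bob'), (25, 'Carol')], ['age', 'name'])

df.sort('age', ascending=True).show()   # Bob (null age) first, then Carol, Alice
df.sort('age', ascending=False).show()  # Alice, Carol first; Bob (null age) last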


  • using the asc()/desc() functions from pyspark.sql.functions

    • from pyspark.sql.functions import asc, desc

    • df.sort(asc('age'), desc('name')).collect()
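
For example, reusing the df from the sketch above - oldest first, ties broken by name:

from pyspark.sql.functions import asc, desc

df.sort(desc('age'), asc('name')).show()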


  • orderBy()

    • df.orderBy(df.age.asc()).collect()

    • df.orderBy(df['age']).collect()

    • df.orderBy(col('age').asc()).collect()
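
orderBy() is an alias of sort(), so the two calls below return the same rows (again using the df from the first sketch):

df.orderBy(df.age.asc()).show()
df.sort(df.age.asc()).show()   # identical output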

  • Dealing with nulls in sorting

    • By default, for ascending order, rows with null values come first (at the top) - this is asc_nulls_first(), the default

    • By default, for descending order, rows with null values come last (at the bottom) - this is desc_nulls_last(), the default

      • transactionsDf.sort("predError", ascending=False)

        • With DataFrame.sort() and ascending=False, the DataFrame is sorted by the specified column in descending order, putting all missing values last.

        • Equivalently: transactionsDf.sort(desc_nulls_last("predError"))


    • To override the defaults (see the sketch below):

      • asc_nulls_last()

      • desc_nulls_first()
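
A sketch of the overrides, again on the df from the first example:

from pyspark.sql.functions import asc_nulls_last, desc_nulls_first

df.sort(asc_nulls_last('age')).show()    # ascending, but the null age moves to the bottom
df.sort(desc_nulls_first('age')).show()  # descending, but the null age moves to the top

# the same overrides also exist as Column methods
df.sort(df.age.asc_nulls_last()).show()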

  • Composite sorting (multiple column sorting)

    • df.sort(['age', 'name']).collect()  # pass as a list

    • df.sort(['age', 'name'], ascending=[0, 1]).collect()  # 0/False = descending, 1/True = ascending, matched by position to the columns

    • df.sort(col('age').desc(), col('name').asc()).collect() # individual
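
A small sketch with a tie on age, to show the second sort key kicking in (data made up):

from pyspark.sql.functions import col

df2 = spark.createDataFrame(
    [(25, 'Zoe'), (25, 'Al'), (30, 'Mia')], ['age', 'name'])

# age descending first; within the two age-25 rows, name ascending
df2.sort(col('age').desc(), col('name').asc()).show()
# rows: (30, Mia), (25, Al), (25, Zoe)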

  • Prioritized sorting

    • Suppose you want to order by a custom priority rather than natural order - say Beginner, then Intermediate, then Advanced

      • Beginner = 0

      • Intermediate = 1

      • Advanced = 2

    • Map each level to its integer (0/1/2) in a helper column and sort by that column:


from pyspark.sql.functions import when, col, expr

df1 = spark.createDataFrame(
    [['False', 'test1', 'Beginner'],
     ['True', 'test2', 'Intermediate'],
     ['False', 'test3', 'Advanced']],
    ['is_student', 'name', 'type'])

# map each level to its priority, then sort on the helper column
df1.withColumn('new', when(col('type') == 'Beginner', 0)
                      .when(col('type') == 'Intermediate', 1)
                      .otherwise(2)) \
   .orderBy('new') \
   .show()

# the same mapping with a SQL CASE expression
df1.withColumn('new', expr("""
    CASE WHEN type = 'Beginner' THEN 0
         WHEN type = 'Intermediate' THEN 1
         ELSE 2
    END""")) \
   .orderBy('new') \
   .show()


