May 13, 2023

PySpark sort(), orderBy()

 

sort() or orderBy()  # both are synonyms - orderBy() is an alias for sort()


  • sort() - ascending order

    • df.sort(df.age.asc()).collect()

    • df.sort(df['age']).collect()

    • df.sort(col('age').asc()).collect()

    • df.sort('age', ascending=True).collect()   # ascending order

    • For ascending order, rows with null values come first, at the top (see the sketch after the descending examples below)

  • sort() - descending order

    • df.sort('age', ascending=False).collect()  # descending order

    • df.sort(col('age').desc()).collect()  # descending order

    • For descending order, rows with null values come last, at the bottom, as shown in the sketch below
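
A minimal runnable sketch of both directions (the DataFrame and its values here are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# one null age to show where missing values land in each direction
df = spark.createDataFrame(
    [(30, 'Alice'), (None, 'Bob'), (25, 'Carol')], ['age', 'name'])

df.sort('age', ascending=True).show()   # Bob (null age) first, then Carol, Alice
df.sort('age', ascending=False).show()  # Alice, Carol first; Bob (null age) last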


  • using the asc()/desc() functions from pyspark.sql.functions

    • from pyspark.sql.functions import asc, desc

    • df.sort(asc('age'), desc('name')).collect()
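
For example, reusing the df from the sketch above - oldest first, ties broken by name:

from pyspark.sql.functions import asc, desc

df.sort(desc('age'), asc('name')).show()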


  • orderBy()

    • df.orderBy(df.age.asc()).collect()

    • df.orderBy(df['age']).collect()

    • df.orderBy(col('age').asc()).collect()
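
orderBy() is an alias of sort(), so the two calls below return the same rows (again using the df from the first sketch):

df.orderBy(df.age.asc()).show()
df.sort(df.age.asc()).show()   # identical output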

  • Dealing with nulls in sorting

    • By default, for ascending order, rows with null values come first (at the top) - this is asc_nulls_first(), the default

    • By default, for descending order, rows with null values come last (at the bottom) - this is desc_nulls_last(), the default

      • transactionsDf.sort("predError", ascending=False)

        • With DataFrame.sort() and ascending=False, the DataFrame is sorted by the specified column in descending order, putting all missing values last.

        • Equivalently: transactionsDf.sort(desc_nulls_last("predError"))


    • To override the defaults (see the sketch below):

      • asc_nulls_last()

      • desc_nulls_first()
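
A sketch of the overrides, again on the df from the first example:

from pyspark.sql.functions import asc_nulls_last, desc_nulls_first

df.sort(asc_nulls_last('age')).show()    # ascending, but the null age moves to the bottom
df.sort(desc_nulls_first('age')).show()  # descending, but the null age moves to the top

# the same overrides also exist as Column methods
df.sort(df.age.asc_nulls_last()).show()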

  • Composite sorting (multiple column sorting)

    • df.sort(['age', 'name']).collect()  # pass as a list

    • df.sort(['age', 'name'], ascending=[0, 1]).collect()  # 0/False = descending, 1/True = ascending, matched by position to the columns

    • df.sort(col('age').desc(), col('name').asc()).collect() # individual
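
A small sketch with a tie on age, to show the second sort key kicking in (data made up):

from pyspark.sql.functions import col

df2 = spark.createDataFrame(
    [(25, 'Zoe'), (25, 'Al'), (30, 'Mia')], ['age', 'name'])

# age descending first; within the two age-25 rows, name ascending
df2.sort(col('age').desc(), col('name').asc()).show()
# rows: (30, Mia), (25, Al), (25, Zoe)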

  • Prioritized sorting

    • Suppose you want to order by a custom priority rather than natural order - say Beginner, then Intermediate, then Advanced

      • Beginner = 0

      • Intermediate = 1

      • Advanced = 2

    • Map each level to its integer (0/1/2) in a helper column and sort by that column:


from pyspark.sql.functions import when, col, expr

df1 = spark.createDataFrame(
    [['False', 'test1', 'Beginner'],
     ['True', 'test2', 'Intermediate'],
     ['False', 'test3', 'Advanced']],
    ['is_student', 'name', 'type'])

# map each level to its priority, then sort on the helper column
df1.withColumn('new', when(col('type') == 'Beginner', 0)
                      .when(col('type') == 'Intermediate', 1)
                      .otherwise(2)) \
   .orderBy('new') \
   .show()

# the same mapping with a SQL CASE expression
df1.withColumn('new', expr("""
    CASE WHEN type = 'Beginner' THEN 0
         WHEN type = 'Intermediate' THEN 1
         ELSE 2
    END""")) \
   .orderBy('new') \
   .show()


