Prabhath Kota: Pyspark drop()

May 13, 2023

Pyspark drop()

drop()

Drop Single column

df.drop(col('age')).show() #using col()

df.drop('age').show() #correct
df.drop(df['age']).show() #correct
If you drop a column that does not exist, drop() simply ignores it (it won't throw an error)

Drop multiple column(s) - Don't use col(), use only strings

df.drop('first_name', 'last_name' ).show()

df.drop(col('first_name'), col('last_name') ).show() #Error on older PySpark versions

#In case of multiple columns, pass them as strings only, not col()

The error message says: each col in the param list should be a string
If you drop a column that does not exist, drop() simply ignores it (it won't throw an error)

Drop multiple columns as a list

Use * before passing the list of columns to drop

Drop duplicate records

3 ways

distinct() #Drop exact duplicates

df.distinct().show()

drop_duplicates() (alias for dropDuplicates)
dropDuplicates()

df.dropDuplicates().show()

This is the same as df.distinct().show()

df.dropDuplicates(['department', 'salary']).show()

passing subset

df.dropDuplicates(['id']).show()

df.dropDuplicates('id').show() #Error, pass a list only
Whether you deduplicate on a single column or on multiple columns, pass them as a list

Drop Null records

When all columns are null
When any one column is null
When multiple columns are null
df.dropna() #same as df.na.drop()

df.na.drop(how='any').show()
how -> any / all

thresh -> keep only rows that have at least thresh non-null values (overrides how)

df.fillna()
df.na.replace() #same as df.replace()
