drop()
Drop Single column
df.drop(col('age')).show() #using col()
df.drop('age').show() #correct
df.drop(df['age']).show() #correct
If you drop a column that does not exist, drop() simply ignores it (no error is thrown)
Drop multiple columns - pass the column names as strings, not col()
df.drop('first_name', 'last_name' ).show()
df.drop(col('first_name'), col('last_name') ).show() #Error on older PySpark versions
#When dropping multiple columns, pass the names as strings, not col()
Each column in the argument list should be a string
If you drop a column that does not exist, drop() simply ignores it (no error is thrown)
Drop multiple columns as a list
Use * before passing the list of columns to drop
Drop duplicate records
3 ways
distinct() #Drop exact duplicates
df.distinct().show()
drop_duplicates() (alias for dropDuplicates)
dropDuplicates()
df.dropDuplicates().show()
This is the same as df.distinct().show()
df.dropDuplicates(['department', 'salary']).show()
passing subset
df.dropDuplicates(['id']).show()
df.dropDuplicates('id').show() #Error, pass list only
Whether it is a single column or multiple columns, pass the subset as a list
Drop Null records
When all columns are null
When any one column is null
When multiple columns are null
df.dropna() #same as df.na.drop()
df.na.drop(how='any').show()
how -> any / all
thresh -> keep only rows with at least this many non-null values (overrides how)
df.fillna()
df.replace() #also df.na.replace(); there is no df.replacena()