May 13, 2023

PySpark drop()

 

drop()

  • Drop Single column

    • df.drop(col('age')).show()   #using col()

      • df.drop('age').show()   #correct

      • df.drop(df['age']).show()   #correct

      • If you drop a column name that does not exist, drop() simply ignores it (no error is thrown)

  • Drop multiple columns - pass the column names as strings, not col()

    • df.drop('first_name', 'last_name' ).show()   

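      • df.drop(col('first_name'), col('last_name') ).show()  #Error on older PySpark (TypeError); multiple col() arguments appear to be accepted from 3.4 onwards

        • #For multiple columns, pass strings rather than col(); strings work on every version

      • Each argument should be a column name given as a string

      • If you drop a column name that does not exist, drop() simply ignores it (no error is thrown)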

  • Drop multiple columns as a list

    • Use * before passing the list of columns to drop

  • Drop duplicate records

    • 3 ways

      • distinct()  #Drop exact duplicates
        • df.distinct().show()

      • drop_duplicates()  (alias for dropDuplicates)
      • dropDuplicates()
        • df.dropDuplicates().show()

          • This is same as df.distinct().show()

        • df.dropDuplicates(['department', 'salary']).show()

          • passing subset

        • df.dropDuplicates(['id']).show()

          • df.dropDuplicates('id').show()   #Error, pass list only

          • Whether it is a single column or multiple columns, pass the subset as a list

  • Drop Null records

    • When all columns are null -> how='all'
    • When any one column is null -> how='any' (the default)
    • When specific columns are null -> pass subset=[...] so only those columns are checked
    • df.dropna()  #same as df.na.drop()

      • df.na.drop(how='any').show()

      • how -> 'any' / 'all'

    • thresh -> keep only rows that have at least thresh non-null values (when set, it overrides how)

  • df.fillna()

  • df.na.replace()  #same as df.replace()
