May 18, 2020

Spark Intro

  • Spark
    • Apache Spark is a unified analytics engine for large-scale data processing
    • Latest version at the time of writing: 2.4.5 (February 2020)
    • Speed
      • Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine
      • The project reports that some workloads run up to 100x faster than on Hadoop MapReduce
    • Ease of use
      • Write applications quickly in Java, Scala, Python, R and SQL
    • Generality
      • Spark SQL
      • Spark Streaming
      • MLlib
      • GraphX
    • Runs everywhere
      • Spark runs in its standalone cluster mode, on Hadoop YARN, on Apache Mesos, on Kubernetes, or in the cloud (for example, on EC2).
      • Access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.
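
As a rough illustration of the "runs everywhere" point above, the sketch below maps each cluster manager to the master URL format Spark expects for it, then hands that URL to a SparkSession. The helper names (`master_url`, `build_session`), the app name, and the default ports are assumptions made for this example, not part of Spark itself, and `build_session` only works where PySpark is installed.

```python
from typing import Optional

# Ports commonly used by each cluster manager (assumed defaults for illustration).
DEFAULT_PORTS = {"standalone": 7077, "mesos": 5050, "kubernetes": 443}

# URL prefix Spark expects for each manager that needs a host and port.
SCHEMES = {
    "standalone": "spark://",
    "mesos": "mesos://",
    "kubernetes": "k8s://https://",
}


def master_url(manager: str, host: Optional[str] = None, port: Optional[int] = None) -> str:
    """Return a Spark master URL for the given cluster manager (hypothetical helper)."""
    manager = manager.lower()
    if manager == "local":
        return "local[*]"  # run in-process, using all available cores
    if manager == "yarn":
        return "yarn"  # cluster location is read from the Hadoop configuration
    if manager not in SCHEMES:
        raise ValueError(f"unknown cluster manager: {manager}")
    if host is None:
        raise ValueError(f"{manager} needs a host")
    if port is None:
        port = DEFAULT_PORTS[manager]
    return f"{SCHEMES[manager]}{host}:{port}"


def build_session(master: str, app_name: str = "spark-intro"):
    """Create a SparkSession against the given master (requires PySpark)."""
    # Imported lazily so master_url() stays usable without PySpark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.master(master).appName(app_name).getOrCreate()
```

For instance, `build_session(master_url("local"))` would start a local session, while `master_url("standalone", "spark-master.example.com")` (a made-up host) gives `spark://spark-master.example.com:7077`. The application code stays the same; only the master URL changes between environments.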
