- Spark
- Apache Spark is a unified analytics engine for large-scale data processing
- Latest version: 2.4.5 (February 2020)
- Speed
- Apache Spark achieves high performance for both batch and streaming data using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine
- Can run workloads up to 100x faster than Hadoop MapReduce
- Ease of use
- Write applications quickly in Java, Scala, Python, R and SQL
- Generality
- Spark SQL
- Spark Streaming
- MLlib
- GraphX
- Runs everywhere
- Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources.
- You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes.
- Access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.
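The deployment options listed above mostly differ in the master URL you hand to Spark. Here is a minimal sketch of how those URLs are shaped. The helper `master_url` and its argument names are hypothetical (they are not part of Spark). The ports 7077 (standalone) and 5050 (Mesos) are the usual defaults, and defaulting the Kubernetes API server port to 443 is an assumption for illustration only.

```python
from typing import Optional

# Hypothetical helper, not a Spark API: it only builds the master URL
# string for each cluster manager mentioned in the list above.
DEFAULT_PORTS = {
    "standalone": 7077,   # usual Spark standalone master port
    "mesos": 5050,        # usual Mesos master port
    "kubernetes": 443,    # assumption: API server reachable on the HTTPS port
}

URL_PREFIXES = {
    "standalone": "spark://",
    "mesos": "mesos://",
    "kubernetes": "k8s://https://",
}


def master_url(manager: str, host: Optional[str] = None,
               port: Optional[int] = None) -> str:
    """Return the master URL string for the given cluster manager."""
    manager = manager.lower()
    if manager == "local":
        return "local[*]"   # run locally, one worker thread per core
    if manager == "yarn":
        return "yarn"       # cluster location comes from the Hadoop config
    if manager not in URL_PREFIXES:
        raise ValueError(f"unknown cluster manager: {manager!r}")
    if host is None:
        raise ValueError(f"{manager!r} requires a host")
    chosen_port = port if port is not None else DEFAULT_PORTS[manager]
    return f"{URL_PREFIXES[manager]}{host}:{chosen_port}"


if __name__ == "__main__":
    # The host names below are placeholders.
    print(master_url("local"))
    print(master_url("yarn"))
    print(master_url("standalone", "master.example.com"))
    print(master_url("mesos", "mesos.example.com"))
    print(master_url("kubernetes", "k8s.example.com"))
```

The resulting string is what you would pass to `spark-submit --master` or to `SparkSession.builder.master(...)` in PySpark. The application code itself stays the same across cluster managers.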
PySpark, BigData, SQL, Hive, AWS, Python, Unix/Linux, Shortcuts, Examples, Scripts, Perl
May 18, 2020
Spark Intro
Labels:
Cassandra,
Hadoop,
HBase,
HDFS,
Kubernetes,
pyspark,
pyspark_streaming,
python_advanced,
R,
Scala,
spark,
Yarn