📜  Kafka Streams vs. Spark Streaming

📅  Last modified: 2021-01-05 03:04:43             🧑  Author: Mango

Apache Spark

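Apache Spark is a distributed, general-purpose processing system that can handle petabytes of data at a time. It is used mainly for streaming and processing data, and a deployment can be spread across thousands of virtual servers. Large organizations use Spark to process very large datasets. Spark offers roughly 80 high-level operators, which makes applications quicker to build. It achieves high performance for both streaming and batch data through a query optimizer, a physical execution engine, and a DAG scheduler. As a result, the project has cited speeds of up to a hundred times faster than Hadoop MapReduce for some in-memory workloads.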

Spark Streaming

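Apache Spark streams large datasets through Spark Streaming. Spark Streaming is an extension of the core Spark API that lets users process live data streams. It ingests data from a variety of sources and processes it with complex algorithms. Finally, the processed data is pushed out to live dashboards, databases, and file systems.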
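
The ingest, process, and push flow described above can be sketched in plain Python. This is only a conceptual illustration, not Spark code: `run_pipeline`, `count_words`, and the in-memory sources and sinks are hypothetical names chosen for the example, and a word count stands in for the "complex algorithms" a real job would run.

```python
from collections import Counter


def run_pipeline(sources, process, sinks):
    """Ingest records from every source, process them, push the result to every sink."""
    records = [record for source in sources for record in source]
    result = process(records)
    for sink in sinks:
        sink(result)
    return result


def count_words(lines):
    """Count how often each lowercase word occurs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return dict(counts)


# Hypothetical sources standing in for a socket and a file.
socket_lines = ["spark streams data", "kafka streams data"]
file_lines = ["spark reads files"]

# Hypothetical sinks standing in for a live dashboard and a database.
dashboard = []
database = {}

result = run_pipeline(
    sources=[socket_lines, file_lines],
    process=count_words,
    sinks=[dashboard.append, database.update],
)
```

Here the dashboard receives the whole result as one update, while the database is updated key by key. In a real Spark Streaming job the sources and sinks would be connectors rather than Python lists and dictionaries.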

Kafka Streams

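Kafka Streams is a client library for processing and analyzing data stored in Kafka. It lets users build applications and microservices, and it writes its output back to the Kafka cluster. It has no external system dependencies other than Kafka itself. It processes one record at a time.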
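
To make the record-at-a-time idea concrete, here is a minimal sketch in plain Python. It does not use the Kafka Streams API: `InMemoryTopics` is a hypothetical stand-in for a Kafka cluster, and `process_stream` and `flag_large_orders` are illustrative names. The point is only that each record is handled individually and that results are written back to the same cluster.

```python
from collections import defaultdict


class InMemoryTopics:
    """Hypothetical stand-in for a Kafka cluster: topic name -> list of (key, value)."""

    def __init__(self):
        self.topics = defaultdict(list)

    def send(self, topic, key, value):
        self.topics[topic].append((key, value))

    def read(self, topic):
        # Iterate over a snapshot so writes during processing do not affect the loop.
        yield from list(self.topics[topic])


def process_stream(cluster, input_topic, output_topic, handle_record):
    """Handle records one at a time and write each result back to the cluster."""
    processed = 0
    for key, value in cluster.read(input_topic):
        result = handle_record(key, value)
        if result is not None:  # None means the record is filtered out
            out_key, out_value = result
            cluster.send(output_topic, out_key, out_value)
        processed += 1
    return processed


def flag_large_orders(key, value):
    """Keep only orders of 100 or more, tagging their value."""
    return (key, f"LARGE:{value}") if value >= 100 else None


cluster = InMemoryTopics()
for order_id, amount in [("o1", 40), ("o2", 250), ("o3", 100)]:
    cluster.send("orders", order_id, amount)

processed = process_stream(cluster, "orders", "large-orders", flag_large_orders)
```

After the run, the hypothetical `large-orders` topic holds the two flagged records, and nothing outside the cluster object was needed to store the output.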

Kafka Streams vs. Spark Streaming

| Parameter | Kafka Streams | Spark Streaming |
| --- | --- | --- |
| Developers | Kafka was originally developed at LinkedIn and later donated to the Apache Software Foundation. | Spark was originally developed at the University of California, Berkeley, and later donated to the Apache Software Foundation. |
| Infrastructure | A Java client library, so it runs wherever Java is supported. | Runs on top of the Spark stack, which can be Spark standalone, YARN, or container-based. |
| Data sources | Processes data from Kafka itself, via topics and streams. | Ingests data from files, Kafka, socket sources, and others. |
| Processing model | Processes each event as it arrives, using an event-at-a-time (continuous) model. | Uses a micro-batch model: incoming streams are split into small batches for further processing. |
| Latency | Lower latency than Spark Streaming. | Higher latency. |
| ETL transformation | Not directly supported; input and output both stay in Kafka. | Supported. |
| Fault tolerance | More complex to handle. | Comparatively easy to handle. |
| Language support | Mainly Java. | Multiple languages, including Java, Scala, R, and Python. |
| Use cases | The New York Times, Zalando, Trivago, and others use Kafka Streams to store and distribute data. | Booking.com and Yelp (ad platform) use Spark Streaming to handle millions of ad requests per day. |
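
The difference between the two processing models, and why it affects latency, can be illustrated with a small simulation. This sketch makes simplifying assumptions: processing time is treated as zero, arrival times are made-up integers in milliseconds, and a micro-batch is assumed to be processed exactly at the end of its interval. The function names are hypothetical, and real latencies depend on configuration and workload.

```python
def event_at_a_time(arrivals_ms):
    """Each event is processed the moment it arrives: (arrival, processed) pairs."""
    return [(t, t) for t in arrivals_ms]


def micro_batch(arrivals_ms, batch_interval_ms):
    """Each event waits until the end of the batch interval it arrived in."""
    schedule = []
    for t in arrivals_ms:
        batch_end = (t // batch_interval_ms + 1) * batch_interval_ms
        schedule.append((t, batch_end))
    return schedule


def mean_wait_ms(schedule):
    """Average time an event waits between arrival and processing."""
    return sum(done - arrived for arrived, done in schedule) / len(schedule)


arrivals_ms = [250, 500, 1500, 2750]  # hypothetical arrival times

continuous = event_at_a_time(arrivals_ms)
batched = micro_batch(arrivals_ms, batch_interval_ms=1000)

print("event-at-a-time mean wait:", mean_wait_ms(continuous), "ms")
print("micro-batch mean wait:", mean_wait_ms(batched), "ms")
```

Under these assumptions, the batching wait alone adds a delay of up to one batch interval per event. Shrinking the interval reduces that delay but produces more, smaller batches.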