Benchmarks for data processing systems: Pathway, Spark, Flink, Kafka Streams
https://github.com/pathwaycom/pathway-benchmarks.git
Pathway is a reactive data processing framework designed for high-throughput, low-latency real-time data processing. Its unified Rust engine runs the same Python API code in both batch and streaming mode.
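To give a flavor of this unified API, here is a minimal sketch of a streaming word count in Pathway. The input path and schema are placeholders, and connector options may vary between Pathway versions; see the Pathway documentation for the exact API.

```python
import pathway as pw

# Hypothetical input schema: a CSV stream with a single "word" column.
class InputSchema(pw.Schema):
    word: str

# The same code runs in batch mode by switching to mode="static".
words = pw.io.csv.read("./input/", schema=InputSchema, mode="streaming")
counts = words.groupby(words.word).reduce(
    words.word, count=pw.reducers.count()
)
pw.io.csv.write(counts, "./counts.csv")
pw.run()
```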
This repository contains benchmarks comparing the performance of Pathway against state-of-the-art technologies for streaming and batch data processing, including Flink, Spark, and Kafka Streams. For a complete write-up of the benchmarks, read our corresponding benchmarking article.
The benchmarks are reproducible using the code in this repository. Find the instructions below under "Reproducing the benchmarks".
The repository contains two types of benchmarks: an online streaming WordCount benchmark and an iterative graph-processing PageRank benchmark (evaluated in batch, streaming, and backfilling scenarios).
Below we present the results of the benchmarks. All benchmarks were run on dedicated machines with a 12-core AMD Ryzen 9 5900X processor, 128GB of RAM, and SSD drives. For all multithreaded benchmarks we explicitly allocate cores to ensure that threads maximally share L3 cache. This is important because the CPU is internally assembled from two 6-core halves, each with its own L3 cache, and communication between threads on different halves is slower. For this reason we report results on up to 6 cores for all frameworks.
All experiments are run using Docker, enforcing limits on used CPU cores and RAM consumption.
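As an illustration of how such limits can be enforced, the sketch below shows a hypothetical Python helper wrapping docker run with the standard --cpuset-cpus and --memory flags; the repository's own scripts may apply the equivalent settings through docker-compose instead.

```python
import subprocess

def run_container(image, cores="0-5", memory="16g", extra_args=()):
    """Hypothetical helper: pin a benchmark container to a fixed set of
    CPU cores (here one 6-core half of the 5900X) and cap its RAM."""
    cmd = [
        "docker", "run", "--rm",
        f"--cpuset-cpus={cores}",   # cores the container may use
        f"--memory={memory}",       # hard RAM limit
        image, *extra_args,
    ]
    subprocess.run(cmd, check=True)
```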
The results of the benchmarks show that:
Pathway clearly outperforms the default Flink setup in sustained throughput, and dominates the Flink minibatching setup in latency across the entire throughput spectrum we could measure. For most throughputs, Pathway also achieves lower latency than the better of the two Flink setups.
The fastest performance is achieved by the Spark GraphX implementation and the more aggressively optimized Pathway build; however, the formulation (and syntax) of the GraphX algorithm differs from the others. In an apples-to-apples comparison of equivalent logic expressed in Table APIs, Pathway is the fastest, followed by Flink and then Spark.
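For reference, all of the Table-API implementations compute the standard damped PageRank recurrence. The sketch below is not the repository's code; it is a plain-Python power iteration included only to make the computed quantity concrete (rank mass of dangling nodes is dropped for brevity):

```python
def pagerank(edges, num_iters=10, damping=0.85):
    """edges: list of (source, target) pairs."""
    nodes = {u for edge in edges for u in edge}
    out_deg = {}
    for u, _ in edges:
        out_deg[u] = out_deg.get(u, 0) + 1
    rank = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(num_iters):
        contrib = {u: 0.0 for u in nodes}
        for u, v in edges:
            contrib[v] += rank[u] / out_deg[u]  # spread rank along edges
        rank = {u: (1 - damping) / len(nodes) + damping * contrib[u]
                for u in nodes}
    return rank
```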
We evaluate only two systems on the streaming PageRank task: Pathway and Flink. We don't test Kafka Streams because it already underperformed on the streaming WordCount task. Moreover, no Spark variant supports such a complicated streaming computation: GraphX doesn't support streaming, Spark Structured Streaming doesn't allow chaining multiple groupby's and reductions, and Spark Continuous Streaming is too limited to support even simple streaming benchmarks.
We see that while both systems are able to run the streaming benchmark, Pathway maintains a large advantage over Flink. It is hard to say whether this advantage is "constant" (a factor of about 50x) or grows "asymptotically" with dataset size. Indeed, extending the benchmarks to datasets larger than those reported in Table 2 is problematic, as Flink's performance degrades due to memory issues.
In the backfilling scenario, Pathway again offers superior performance, completing the first of the datasets considered approximately 20x faster than Flink. Pathway processes the first large batch in times comparable to the pure batch scenario.
For backfilling on the complete LiveJournal dataset, Flink either ran out of memory or failed to complete the task on 6 cores within 2 hours, depending on the setup.
For a full discussion of the results obtained, read our benchmarking article.
The following sections contain information necessary for reproducing the benchmarks.
The repository provides a single Python script to run each benchmark:
- Run the run_wordcount.py script in the wordcount-online-streaming directory to launch the WordCount benchmark on all tested solutions.
- Run the run_pagerank.py script in the pagerank-iterative-graph-processing directory to launch the PageRank benchmark on all tested solutions. You will first need to download and preprocess the datasets (see below).

All benchmarks are run in Docker containers. Before launching the Python scripts, make sure you have the latest version of Docker installed and that your Docker daemon is running. You may have to increase the memory allocated per container (at least 4GB per container) and allocate the necessary number of CPU cores to your Docker containers.
The WordCount benchmarks are run on a dataset of 76 million words drawn uniformly at random from a dictionary of 5,000 random 7-letter lowercase words. We split the dataset into two parts: the first 16 million words serve as a burn-in period (to disregard the high latency at engine start-up), and only the latencies of the remaining 60 million words are included in the final results. The dataset is generated automatically when you run the run_wordcount.py script.
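The sketch below is not the repository's generator; it only illustrates, under the parameters just described, how such a dataset can be produced (function names and seeds are made up):

```python
import random
import string

def make_dictionary(size=5000, word_len=7, seed=0):
    """Build a dictionary of random lowercase words."""
    rng = random.Random(seed)
    return [
        "".join(rng.choices(string.ascii_lowercase, k=word_len))
        for _ in range(size)
    ]

def word_stream(dictionary, n_words=76_000_000, seed=1):
    """Sample words uniformly at random from the dictionary."""
    rng = random.Random(seed)
    for _ in range(n_words):
        yield rng.choice(dictionary)
```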
Note that if your username contains characters such as a dot, you should prefix the command with a USER= variable set to a simple alphanumeric name (for example, USER=runner python run_wordcount.py, where runner is an arbitrary name); otherwise you may run into an error, because the docker-compose project name is built from the username.
Results are stored in the results subdirectory.
The PageRank benchmarks are run on various subsets of the Stanford LiveJournal dataset. You can download and preprocess the datasets by running the get_datasets.sh script in the pagerank-iterative-graph-processing/datasets directory.
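SNAP edge lists such as soc-LiveJournal1.txt consist of whitespace-separated source/target node-ID pairs, preceded by comment lines starting with '#'. The get_datasets.sh script performs its own preprocessing; the sketch below only shows how the raw format can be parsed:

```python
def read_snap_edges(path):
    """Yield (source, target) node-ID pairs from a SNAP edge-list file."""
    with open(path) as f:
        for line in f:
            if line.startswith("#"):  # skip the comment header
                continue
            u, v = line.split()
            yield int(u), int(v)
```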
Results are stored in the results subdirectory.
@misc{snapnets,
  author = {Jure Leskovec and Andrej Krevl},
  title  = {{SNAP Datasets}: Stanford Large Network Dataset Collection},
  url    = {http://snap.stanford.edu/data},
  month  = jun,
  year   = 2014
}
The repository is structured as follows:
- wordcount-online-streaming contains the scripts and files necessary to run the WordCount benchmark. This is where you will find the run_wordcount.py script to reproduce the benchmark yourself;
- pagerank-iterative-graph-processing contains the scripts and files necessary to run the PageRank benchmark. This is where you will find the run_pagerank.py script to reproduce the benchmark yourself.