Tuesday, April 18, 2017

Stream Processing Systems

Distributed stream processing is a form of Big Data analytics in which a large volume of heterogeneous data is processed continuously (Pospiech & Felden, 2013). A typical stream processing solution is presented in Figure 1. Data is produced by distributed devices and inserted, via a gateway, into a message queue for processing by a collection of streaming applications (Kamburugamuve & Fox, 2016). The message queue buffers the input, enables distributed processing, supports fault tolerance, and increases scalability. The gateway provides a thin layer of abstraction, accepting messages in various protocols and inserting their payloads into the queue for processing. It plays a role similar to that of a load balancer in a traditional web application.
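The gateway and queue roles described above can be illustrated with a minimal Python sketch. It assumes an in-memory queue.Queue as a stand-in for a distributed message queue, and the names gateway_accept and drain, the two protocols, and the payload fields are hypothetical and chosen only for illustration.

```python
import json
import queue

# Hypothetical in-memory stand-in for a distributed message queue.
message_queue = queue.Queue(maxsize=1000)


def gateway_accept(raw, protocol):
    """Normalize a message sent in a given protocol and enqueue its payload."""
    if protocol == "json":
        payload = json.loads(raw)
    elif protocol == "csv":
        device_id, value = raw.split(",")
        payload = {"device": device_id, "value": float(value)}
    else:
        raise ValueError(f"unsupported protocol: {protocol}")
    message_queue.put(payload)


def drain(q):
    """Consume everything currently buffered, as a streaming application might."""
    items = []
    while not q.empty():
        items.append(q.get())
    return items


# Two devices speak different protocols; the consumer sees one uniform format.
gateway_accept('{"device": "a1", "value": 3.5}', "json")
gateway_accept("b2,7.25", "csv")
processed = drain(message_queue)
```

In this sketch the consumer never needs to know which protocol a device used, which is the abstraction the gateway is meant to provide.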
The processing logic of a stream application can be represented as a graph composed of nodes, which represent streaming operations, and edges, which represent data flow (Kamburugamuve, Ekanayake, Pathirage, & Fox, 2016). A logical, user-defined graph is created explicitly using an application programming interface (API) or implicitly using constructs in a language such as Scala, Java, or a variant of SQL. A component in the stream processing system converts the logical graph into an execution graph, which can then be scheduled. The execution graph is distributed to a set of worker nodes tasked with its execution. The specifics of each of these steps vary by streaming system, but most share this approach. Figure 2 shows a graphical representation of the logical graph and its conversion to an execution graph.
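As a rough illustration of the conversion described above, the following Python sketch expands a logical graph into an execution graph and places the resulting tasks on workers. The operator names, the parallelism values, the all-to-all wiring between task instances, and the round-robin placement are all assumptions made for this example; real systems differ in how they perform each step.

```python
from itertools import product

# Logical graph: operator name -> requested parallelism (hypothetical values).
logical_nodes = {"source": 1, "parse": 2, "count": 2}
logical_edges = [("source", "parse"), ("parse", "count")]


def to_execution_graph(nodes, edges):
    """Expand each logical operator into parallel task instances."""
    tasks = {name: [f"{name}-{i}" for i in range(n)] for name, n in nodes.items()}
    # Connect every upstream instance to every downstream instance.
    task_edges = [
        (up, down)
        for src, dst in edges
        for up, down in product(tasks[src], tasks[dst])
    ]
    return tasks, task_edges


def schedule(tasks, workers):
    """Assign task instances to workers round-robin."""
    all_tasks = [t for instances in tasks.values() for t in instances]
    return {t: workers[i % len(workers)] for i, t in enumerate(all_tasks)}


tasks, task_edges = to_execution_graph(logical_nodes, logical_edges)
placement = schedule(tasks, ["worker-a", "worker-b"])
```

Here three logical operators become five task instances connected by six edges, and the placement dictionary records which worker runs each instance.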
Two popular open source stream processing systems are Apache Storm (Apache Software Foundation, n.d.-a) and Apache Spark Streaming (Apache Software Foundation, n.d.-b). A commercial, closed source alternative is Azure Stream Analytics (Stokes, 2016b).
Apache Storm is one of the older distributed stream processing systems (Wang, 2016). It was created and released as open source by Twitter. Storm is a continuously running streaming solution that processes data contained in tuples as they arrive. Developers create logical graphs explicitly via the Storm API (Kamburugamuve & Fox, 2016). In hosted environments, the cost of running Apache Storm is based on the time the cluster is running; as such, it does not support consumption-based pricing (Stokes, 2016a). Deployment and operational monitoring are also at the cluster level.
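The tuple-at-a-time model described above can be sketched with Python generators. This is not the Storm API; the function names, the temperature conversion, and the threshold are hypothetical and serve only to show each tuple flowing through the whole pipeline as soon as it is emitted.

```python
def sensor_source(readings):
    """Emit one tuple at a time, as a continuously running source would."""
    for device, value in readings:
        yield (device, value)


def to_fahrenheit(stream):
    """Transform each tuple as soon as it arrives."""
    for device, celsius in stream:
        yield (device, celsius * 9 / 5 + 32)


def above(stream, threshold):
    """Pass through only tuples whose value exceeds the threshold."""
    for device, value in stream:
        if value > threshold:
            yield (device, value)


readings = [("a1", 20), ("b2", 40), ("c3", 35)]
# Each reading passes through both operators before the next one is read.
alerts = list(above(to_fahrenheit(sensor_source(readings)), threshold=100))
```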
Apache Spark Streaming takes a different approach to stream processing. Rather than executing long-running jobs, it uses a small-batch (micro-batch) approach (Zaharia, 2013). Since Spark Streaming builds on Spark's Resilient Distributed Datasets (RDDs), it can provide high availability without replication. This reduces the number of nodes within a cluster by approximately half, greatly reducing cost. Spark is a near real-time processing system (Morshed, Rana, & Milrad, 2016). Because it uses a micro-batch approach, the developer can trade increased latency for better performance (Chintapalli et al., 2016) by decreasing the frequency of micro-batch creation, that is, by lengthening the batch interval. A streaming application's logical graph is created explicitly through an API, using a language such as Scala. Deployment and monitoring for Spark Streaming are essentially the same as for Storm.
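The latency trade-off of micro-batching can be made concrete with a small Python sketch. It does not use Spark; the functions micro_batches and worst_case_wait and the sample events are hypothetical, and timestamps are assumed to be integer milliseconds.

```python
def micro_batches(events, interval):
    """Group (timestamp_ms, value) events into batches of `interval` ms."""
    batches = {}
    for ts, value in events:
        batches.setdefault(ts // interval, []).append((ts, value))
    return [batches[k] for k in sorted(batches)]


def worst_case_wait(batches, interval):
    """Longest time any event waits for its batch to close, in ms."""
    waits = []
    for batch in batches:
        for ts, _ in batch:
            batch_end = (ts // interval + 1) * interval
            waits.append(batch_end - ts)
    return max(waits)


events = [(200, "a"), (900, "b"), (1100, "c"), (2500, "d"), (3700, "e")]
small = micro_batches(events, interval=1000)  # more, smaller batches
large = micro_batches(events, interval=2000)  # fewer, larger batches
```

With these sample events, doubling the interval halves the number of batches to schedule (four versus two) while doubling the longest wait before an event can be processed (900 ms versus 1800 ms).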
Azure Stream Analytics (ASA) is a real-time stream processing system (Stokes, 2016b). Logical graphs are created implicitly using an SQL-like language. ASA natively supports temporal processing, such as time-based joins. This distinction is the result of ASA being a complex event processing implementation (Chen, 2016). Temporal processing is possible in Spark and Storm, but the developer must implement the desired functionality. ASA is a consumption-priced managed service (Stokes, 2016a). This means that the cost of a hosted ASA application is based on the amount of data processed, not the amount of time it is executing. Since ASA is delivered as a managed service, deployment and monitoring are at the ASA application level. ASA is based on Trill, a Microsoft Research project (Chandramouli et al., 2014). Trill benchmarked higher than Storm and Spark, which suggests that ASA likely delivers similar performance.
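To give a sense of what a developer would have to write by hand where temporal processing is not built in, here is a minimal Python sketch of a time-based join. It is not ASA, Storm, or Spark code; the function temporal_join, the event layout, and the sample data are hypothetical, and it works on finite lists rather than unbounded streams.

```python
def temporal_join(left, right, max_gap):
    """Join events sharing a key whose timestamps differ by at most max_gap.

    Each event is a (timestamp, key, value) tuple.
    """
    # Index the right-hand events by key for quick lookup.
    right_by_key = {}
    for ts, key, value in right:
        right_by_key.setdefault(key, []).append((ts, value))

    matches = []
    for l_ts, key, l_value in left:
        for r_ts, r_value in right_by_key.get(key, []):
            if abs(l_ts - r_ts) <= max_gap:
                matches.append((key, l_value, r_value))
    return matches


entries = [(10, "car1", "entry"), (20, "car2", "entry")]
exits = [(15, "car1", "exit"), (90, "car2", "exit")]
# Only car1 has an exit within 30 time units of its entry.
joined = temporal_join(entries, exits, max_gap=30)
```

A production version would also need to bound how long unmatched events are retained, which is part of the work a system with native temporal support takes off the developer.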
ASA and Storm are real-time processing systems, while Spark Streaming is near real-time. This should be taken into consideration for solutions requiring low latency. Development for Spark and Storm is done at a lower level than for ASA, which relies on an SQL-like language that is familiar and accessible to many. Storm and Spark are open source, while ASA is a proprietary system. The choice among these streaming solutions is not a simple one. It will likely be made based on other factors, such as a pre-existing cluster or the skill set of the developers tasked with building the solution. The most likely driver for the decision will be the ecosystem in which the solution is to execute.



References
Apache Software Foundation. (n.d.-a). Apache Storm. Retrieved from http://storm.apache.org/
Apache Software Foundation. (n.d.-b). Spark Streaming: Apache Spark. Retrieved from http://spark.apache.org/streaming/
Chandramouli, B., Goldstein, J., Barnett, M., DeLine, R., Fisher, D., Platt, J. C., . . . Wernsing, J. (2014). Trill: A high-performance incremental query processor for diverse analytics. Proceedings of the VLDB Endowment, 8(4), 401-412.
Chen, Z. (2016). Spark Streaming and Azure Stream Analytics.  Retrieved from https://blogs.msdn.microsoft.com/streamanalytics/2016/06/16/spark-streaming-and-azure-stream-analytics/
Chintapalli, S., Dagit, D., Evans, B., Farivar, R., Graves, T., Holderbaugh, M., . . . Poulosky, P. (2016, 23-27 May 2016). Benchmarking streaming computation engines: Storm, Flink and Spark Streaming. Paper presented at the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
Kamburugamuve, S., Ekanayake, S., Pathirage, M., & Fox, G. (2016). Towards High Performance Processing of Streaming Data in Large Data Centers. Paper presented at the HPBDC 2016 IEEE International Workshop on High-Performance Big Data Computing in conjunction with The 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2016), Chicago, Illinois USA, Friday.
Kamburugamuve, S., & Fox, G. (2016). Survey of distributed stream processing. Retrieved from
Morshed, S. J., Rana, J., & Milrad, M. (2016, 23-27 May 2016). Open source initiatives and frameworks addressing distributed real-time data analytics. Paper presented at the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
Pospiech, M., & Felden, C. (2013, 3-5 Dec. 2013). A descriptive Big Data model using Grounded Theory. Paper presented at the Computational Science and Engineering (CSE), 2013 IEEE 16th International Conference on.
Stokes, J. (2016a). Help choosing a streaming analytics platform: Apache Storm comparison to Azure Stream Analytics. Retrieved from https://azure.microsoft.com/en-us/documentation/articles/stream-analytics-comparison-storm/
Stokes, J. (2016b). What is Stream Analytics? Retrieved from https://azure.microsoft.com/en-us/documentation/articles/stream-analytics-introduction/
Wang, Y. (2016). Stream Processing Systems Benchmark: StreamBench.
Zaharia, M. A. (2013). An architecture for fast and general data processing on large clusters. (3616767 Ph.D.), University of California, Berkeley, Ann Arbor. Retrieved from http://search.proquest.com.proxy.cecybrary.com/docview/1527163897?accountid=26967 ProQuest Dissertations & Theses Global database.
