Big Data processing is often performed using Google’s MapReduce framework (Liu, Iftikhar, & Xie, 2014). In the MapReduce pattern, the only way to share data between operations is to write it to a distributed filesystem (M. Zaharia et al., 2012). This introduces significant overhead and can account for more than 90% of the execution time when running machine learning algorithms on Hadoop, an open-source implementation of MapReduce. To enable efficient multi-pass and interactive query execution, Zaharia et al. created Spark.
Spark utilizes the concept of discretized streams (D-Streams) to facilitate the sharing of information between iterations and operations (M. Zaharia et al., 2013). A D-Stream enables micro-batch processing of data: incoming records are grouped into small batches over short time windows, typically on the order of seconds. The resulting batch computations are deterministic, which enables parallel recovery when a failure occurs. They also address the challenge of slow nodes, termed stragglers, which are often treated as a failure condition. D-Streams are manifested through Resilient Distributed Datasets (RDDs). An RDD contains data and a lineage graph showing how that data was constructed. RDDs primarily reside in memory and are written to disk only by an occasional checkpointing system. This approach enables parallel recovery when a node is lost or is slow to deliver results.
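The micro-batching idea behind D-Streams can be illustrated with a small plain-Python sketch (not the Spark API; the record format and two-second window are illustrative assumptions): timestamped records are grouped into fixed time windows, and each window becomes one batch, corresponding to one RDD in Spark's model.

```python
from collections import defaultdict

def micro_batch(records, window_seconds=2):
    """Group (timestamp, value) records into fixed time windows.

    Each window corresponds to one micro-batch (one RDD in the
    D-Stream model). The window length is illustrative; Spark
    Streaming batch intervals are typically on the order of seconds.
    """
    batches = defaultdict(list)
    for ts, value in records:
        batches[int(ts // window_seconds)].append(value)
    return [batches[k] for k in sorted(batches)]

stream = [(0.5, "a"), (1.2, "b"), (2.1, "c"), (3.9, "d"), (4.0, "e")]
# windows [0,2), [2,4), [4,6) -> [["a", "b"], ["c", "d"], ["e"]]
```

Because batch boundaries depend only on timestamps, re-running this grouping over the same records always yields the same batches, which is the determinism the recovery mechanism relies on.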
Because Spark computes in memory while Hadoop relies on disk, Spark is between 40 and 80 times faster than Hadoop (M. A. Zaharia, 2013). One logistic regression algorithm ran 100 times faster in Spark than in Hadoop MapReduce (Díaz, Martín, & Rubio, 2016). It is important to note that while RDDs are typically stored in memory, Spark will persist them to disk if the data is too large. When such a spillover occurs, performance degrades to something comparable to that delivered by Hadoop MapReduce.
Spark is complementary to Hadoop. It can read and write data to any storage system that Hadoop supports (M. Zaharia et al., 2012), and its processing paradigm is similar to Hadoop’s (Casado & Younas, 2015). It provides transformations such as map, filter, and groupBy (M. Zaharia et al., 2012). A transformation constructs a new RDD; an action returns a result to the calling program. Spark also serves as a foundation for additional functionality, such as Spark Streaming (Liu et al., 2014), the Apache Spark Machine Learning Library (MLlib) (Meng et al., 2016), SQL queries (Armbrust et al., 2015), and GraphX (Shoro & Soomro, 2015). Stream processing is one area that elucidates the characteristics of Spark.
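The transformation model described above can be sketched as lazily chained operations that record their lineage and only compute when a result is requested. This is a minimal plain-Python sketch, not Spark's actual RDD class; the class name and method set are illustrative assumptions.

```python
class MiniRDD:
    """Plain-Python sketch of an RDD: a deferred computation plus its lineage."""

    def __init__(self, compute, lineage=()):
        self._compute = compute   # deferred computation (a thunk)
        self.lineage = lineage    # names of the transformations applied so far

    @classmethod
    def from_list(cls, data):
        return cls(lambda: list(data), ("parallelize",))

    def map(self, f):             # transformation: builds a new MiniRDD lazily
        return MiniRDD(lambda: [f(x) for x in self._compute()],
                       self.lineage + ("map",))

    def filter(self, pred):       # transformation: also lazy
        return MiniRDD(lambda: [x for x in self._compute() if pred(x)],
                       self.lineage + ("filter",))

    def collect(self):            # action: triggers the actual computation
        return self._compute()

rdd = MiniRDD.from_list(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has been computed yet; only the lineage has been recorded:
#   rdd.lineage == ("parallelize", "map", "filter")
# rdd.collect() materializes the result: [0, 4, 16]
```

Recording the lineage rather than eagerly materializing each intermediate result is what lets Spark rebuild lost data by replaying the recorded transformations instead of replicating every dataset.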
Spark supports a unified processing architecture that is well suited to both batch and stream processing (M. A. Zaharia, 2013). Spark Streaming creates a series of RDDs from a stream over a specific window of time (Díaz et al., 2016). The resulting collection of RDDs (a D-Stream) is processed using the same constructs as batch processing. Because of this uniformity, Spark is well suited to solutions that require both a streaming and a batch processing model.
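That uniformity can be shown in miniature: the same processing function runs unchanged over a full historical batch and over a sequence of micro-batches. The word-count job below is an assumed example in plain Python, not taken from the source and not the Spark API.

```python
from collections import Counter

def word_count(lines):
    """One job definition, reusable for both batch and stream processing."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

lines = ["spark spark hadoop", "spark streaming"]

# Batch mode: process all the data at once.
batch_result = word_count(lines)

# Streaming mode: apply the same function to each micro-batch,
# merging the incremental counts as they arrive.
stream_result = Counter()
for window in [["spark spark hadoop"], ["spark streaming"]]:
    stream_result.update(word_count(window))

# Both modes produce identical counts: {"spark": 3, "hadoop": 1, "streaming": 1}
```

Only the driving loop differs between the two modes; the job logic itself is shared, which is the practical benefit of Spark's unified model.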
Traditional stream processing systems utilize a continuous operator model (M. Zaharia et al., 2013). In this model, stateful operators receive records, update their state, and emit new records. To provide high availability, the model must rely on replication or upstream backup. Replication doubles the cost of hardware, while upstream backup requires all other nodes to wait while a recovering node replays all messages received before the failure. Distributed processing systems also encounter situations where one node is functioning but processing more slowly than the others. This straggler case is treated as a failure in traditional stream processing systems.
In contrast, Spark Streaming utilizes the D-Stream approach, relying on RDDs for recovery. When a failure occurs, a node’s data can be reconstructed using disk-based snapshots and the data lineage information contained within the RDDs. This recomputation can occur in parallel across the entire cluster of nodes (M. Zaharia et al., 2013). This parallelism allows for fast recovery during failure and partial recomputation of an RDD’s data when a node is slow to complete its processing.
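Lineage-based parallel recovery can be sketched as follows: each lost partition is rebuilt by re-reading its source data and replaying the recorded transformation, and different lost partitions can be rebuilt concurrently. In this plain-Python sketch the partition layout and the single map step are illustrative assumptions, and a `ThreadPoolExecutor` stands in for the cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Source data is assumed to remain readable (e.g. from HDFS or a checkpoint).
source_partitions = {0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]}
transform = lambda x: x * 10        # the lineage: one recorded map step

# Surviving partitions of the derived RDD; partitions 0 and 2 were lost.
derived = {1: [40, 50, 60]}

def recover(pid):
    """Rebuild one lost partition by replaying its lineage over the source."""
    return pid, [transform(x) for x in source_partitions[pid]]

lost = [pid for pid in source_partitions if pid not in derived]
with ThreadPoolExecutor() as pool:  # each recomputation could run on a different node
    for pid, data in pool.map(recover, lost):
        derived[pid] = data

# derived == {0: [10, 20, 30], 1: [40, 50, 60], 2: [70, 80, 90]}
```

Because each partition's recomputation is independent, recovery time shrinks as more nodes participate, which is why this approach also handles stragglers: a slow partition can simply be recomputed elsewhere.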
Spark and Spark Streaming are good solutions for situations where latency is measured in seconds, not milliseconds (M. Zaharia et al., 2013); other streaming technologies should be used if sub-second response is required. Spark and Spark Streaming APIs are available for Java, Scala, and Python (Anonymous, 2016).
Spark offers a high-performance alternative to Hadoop MapReduce. It leverages the existing Hadoop stack to provide an alternative for jobs that require multiple passes or interactive queries. Spark is a relatively new arrival to the Big Data space but is already gaining popularity (Xin, Armbrust, & Zaharia, 2016). Its latest revision includes performance improvements of roughly 10% on benchmarks. It is an ideal solution where both stream and batch processing are required, or where multi-pass MapReduce has previously been employed.
References
Anonymous. (2016). Apache Spark FAQ. Retrieved from http://spark.apache.org/faq.html
Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., . . . Zaharia, M. (2015). Spark SQL: Relational Data Processing in Spark. Paper presented at the Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia.
Casado, R., & Younas, M. (2015). Emerging trends and technologies in big data processing. Concurrency and Computation: Practice and Experience, 27(8), 2078-2091.
Díaz, M., Martín, C., & Rubio, B. (2016). State-of-the-art, challenges, and open issues in the integration of Internet of Things and cloud computing. Journal of Network and Computer Applications. doi:http://dx.doi.org/10.1016/j.jnca.2016.01.010
Liu, X., Iftikhar, N., & Xie, X. (2014). Survey of real-time processing systems for Big Data. Paper presented at the Proceedings of the 18th International Database Engineering & Applications Symposium.
Meng, X., Bradley, J., Yuvaz, B., Sparks, E., Venkataraman, S., Liu, D., . . . Owen, S. (2016). MLlib: Machine learning in Apache Spark. JMLR, 17(34), 1-7.
Shoro, A. G., & Soomro, T. R. (2015). Big data analysis: Apache Spark perspective. Global Journal of Computer Science and Technology, 15(1).
Xin, R., Armbrust, M., & Zaharia, M. (2016). Introducing Apache Spark 2.0. Retrieved from https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., Mccauley, M., . . . Stoica, I. (2012). Fast and interactive analytics over Hadoop data with Spark. USENIX Login, 37(4), 45-51.
Zaharia, M., Das, T., Li, H., Hunter, T., Shenker, S., & Stoica, I. (2013). Discretized streams: fault-tolerant streaming computation at scale. Paper presented at the Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, Farminton, Pennsylvania.
Zaharia, M. A. (2013). An architecture for fast and general data processing on large clusters. (3616767 Ph.D.), University of California, Berkeley, Ann Arbor. Retrieved from http://search.proquest.com.proxy.cecybrary.com/docview/1527163897?accountid=26967 ProQuest Dissertations & Theses Global database.