Saturday, May 06, 2017

Database or Big Data

Traditional database management systems (DMS) have many similarities to, and differences from, distributed Big Data Analytics (BDA) platforms such as Hadoop and Azure Data Lake. This post covers the similarities first, then the differences. Lastly, it describes a hybrid approach that shows how DMS and BDA can and should coexist.
Both DMS and BDA are concerned with the capture and storage of data (Abbasi, Sarker, & Chiang, 2016). Both are also concerned with enabling an organization to derive value from that data. A traditional DMS organizes its data in a relational data store. Data is typically moved from one place to another using extract/transform/load (ETL) processes. Transactional data is typically stored in a relational database management system (RDBMS), while reporting is often performed against a data warehouse. Data warehouses often support report generation, business intelligence tools, and possibly predictive tools and models. The idea is to capture data, convert it to information, glean knowledge from the information, make decisions based on the knowledge, and execute those decisions through some action.
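As a rough illustration of the ETL flow described above, the following Python sketch moves a few rows from a transactional store into a warehouse table and cleanses them on the way. It is only a sketch under stated assumptions: two in-memory SQLite databases stand in for the RDBMS and the data warehouse, and the table names (orders, fact_orders), the sample rows, and the function names are all hypothetical.

```python
import sqlite3

# Hypothetical transactional rows: (order_id, customer, amount as raw text).
RAW_ORDERS = [
    (1, " alice ", "10.50"),
    (2, "BOB", "n/a"),       # dirty amount, dropped during transform
    (3, "Alice", "4.50"),
]


def extract(conn):
    """Extract: read the raw rows from the transactional store."""
    return conn.execute(
        "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
    ).fetchall()


def transform(rows):
    """Transform: normalize customer names and drop non-numeric amounts."""
    clean = []
    for order_id, customer, amount in rows:
        try:
            value = float(amount)
        except ValueError:
            continue  # cleansing happens before the warehouse sees the row
        clean.append((order_id, customer.strip().lower(), value))
    return clean


def load(conn, rows):
    """Load: write the cleansed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl():
    """Run the whole pipeline and return a simple report from the warehouse."""
    source = sqlite3.connect(":memory:")      # stands in for the RDBMS
    source.execute(
        "CREATE TABLE orders (order_id INTEGER, customer TEXT, amount TEXT)"
    )
    source.executemany("INSERT INTO orders VALUES (?, ?, ?)", RAW_ORDERS)

    warehouse = sqlite3.connect(":memory:")   # stands in for the warehouse
    load(warehouse, transform(extract(source)))
    return warehouse.execute(
        "SELECT customer, SUM(amount) FROM fact_orders "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()
```

Calling run_etl() returns one total per customer. The report query only ever sees rows that survived the transform step, which is the property the next paragraph contrasts with BDA.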
BDA is concerned with the same information value chain. There are, however, several key differences. In BDA, data is often not relational in nature. This is a product of several driving forces, such as the inclusion of unstructured data and the merging of data from disparate systems. In a traditional DMS environment, data is typically cleansed during the ETL process (often before ingestion into a data warehouse). Because of the volume of data involved in BDA, that sort of movement is typically not feasible. Instead, the data is distributed across a low-cost storage mechanism, such as the Hadoop Distributed File System (Taylor, 2010). Operations on that data run near the data, with the intent of minimizing bandwidth issues. One consequence of this architecture is that the data, having skipped an up-front cleansing step, may be of low veracity. The motivation behind the architecture is to use many low-cost commodity computers rather than a few high-cost specialized clusters. Within this environment, failure is expected and planned for, rather than unforeseen and debilitating.
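The idea of running operations near the data, and of planning for failure, can be sketched in a few lines of Python. This is a toy model and not Hadoop's actual API: the node names, the log lines, and the functions local_count and run_job are hypothetical, and a task failure is simulated by skipping the first attempt on a node.

```python
from collections import Counter

# Hypothetical cluster: each node holds one block of raw log lines.
NODE_BLOCKS = {
    "node-1": ["error disk", "ok", "error net"],
    "node-2": ["ok", "ok", "error disk"],
    "node-3": ["warn", "", "ok"],   # the empty line is low-veracity input
}


def local_count(lines):
    """Runs where the block lives and returns only a small summary."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # tolerate a dirty record instead of failing the job
        counts[parts[0]] += 1
    return counts


def run_job(blocks, failing=frozenset(), max_attempts=2):
    """Merge per-node summaries, retrying a task whose first attempt fails."""
    total = Counter()
    for node, lines in blocks.items():
        for attempt in range(max_attempts):
            if node in failing and attempt == 0:
                continue  # simulated task failure; the loop retries it
            total += local_count(lines)  # only the summary crosses the network
            break
    return dict(total)
```

In this model, run_job(NODE_BLOCKS) and run_job(NODE_BLOCKS, failing={"node-2"}) produce the same counts, since a failed task is simply rerun. Only the small Counter summaries are merged centrally; the raw lines never leave their node.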
BDA systems should be combined with traditional DMS elements (Miloslavskaya & Tolstoy, 2016). For example, a data warehouse is a proven way to generate reports and facilitate traditional data exploration. A data warehouse can send data to a data lake or similar architecture, and it should also be populated with the results of computations performed in that environment. While systems such as Hive enable data warehouse functionality on top of Hadoop (Thusoo et al., 2010), there is still a need for traditional data warehouse technologies.
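One direction of that hybrid flow, results computed in the lake being published back to the warehouse, might look like the sketch below. It rests on several assumptions: the lake is represented by a list of raw JSON lines, an in-memory SQLite database stands in for the warehouse, and the event fields, the purchase_totals table, and the function names are hypothetical.

```python
import json
import sqlite3

# Hypothetical raw events as they might sit in a data lake:
# JSON lines with no enforced schema, including one unreadable record.
LAKE_EVENTS = [
    '{"user": "alice", "action": "view"}',
    '{"user": "bob", "action": "buy", "amount": 20}',
    'not json at all',
    '{"user": "alice", "action": "buy", "amount": 5}',
]


def summarize(lines):
    """Lake-side computation: total purchase amount per user."""
    totals = {}
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip records that cannot be parsed
        if event.get("action") == "buy":
            user = event["user"]
            totals[user] = totals.get(user, 0) + event.get("amount", 0)
    return totals


def publish(conn, totals):
    """Warehouse-side load: store the small result set for SQL reporting."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchase_totals "
        "(user TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO purchase_totals VALUES (?, ?)",
        list(totals.items()),
    )
    conn.commit()


def report():
    """Compute in the lake, publish to the warehouse, then query with SQL."""
    warehouse = sqlite3.connect(":memory:")
    publish(warehouse, summarize(LAKE_EVENTS))
    return warehouse.execute(
        "SELECT user, total FROM purchase_totals ORDER BY user"
    ).fetchall()
```

The warehouse receives only the aggregated totals, so existing reporting tools can query them with plain SQL while the raw, loosely structured events stay in the lake.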
As with many things, the answer to the question of DMS or BDA is both. As traditional relational systems take on features and capabilities inspired by BDA, BDA is in turn being shaped by traditional DMS needs. What is important is to have a common data platform within an organization that makes it possible to use the best tool for a particular problem.
  
References
Abbasi, A., Sarker, S., & Chiang, R. H. (2016). Big data research in information systems: Toward an inclusive research agenda. Journal of the Association for Information Systems, 17(2), 3.

Miloslavskaya, N., & Tolstoy, A. (2016, August 22-24). Application of Big Data, Fast Data, and Data Lake concepts to information security issues. Paper presented at the 2016 IEEE 4th International Conference on Future Internet of Things and Cloud Workshops (FiCloudW).

Taylor, R. C. (2010). An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics, 11(Suppl 12), S1. doi:10.1186/1471-2105-11-S12-S1

Thusoo, A., Sarma, J. S., Jain, N., Zheng, S., Chakka, P., Ning, Z., . . . Murthy, R. (2010, March 1-6). Hive - a petabyte scale data warehouse using Hadoop. Paper presented at the 2010 IEEE 26th International Conference on Data Engineering (ICDE).


