Saturday, May 06, 2017

R is a statistically oriented programming language that combines traditional programming language features with statistical operations. This post briefly discusses how various programming elements apply to processing large data sets, then considers the R language’s suitability for Big Data Analytics, and ends with a conclusion.
The base R distribution delivers core statistical capabilities. For example, it provides support for range, median, mean, standard deviation, various regression models, and probability distributions (Anonymous, 2017a). Additional statistical features can be accessed by loading additional packages. The R language also supports many general programming features (Kabacoff, 2017).
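As a minimal sketch of those built-in capabilities, the following uses only functions that ship with R. The `units` and `price` vectors are made-up values for illustration and do not come from any real dataset.

```r
# Illustrative data (hypothetical values, for demonstration only)
units <- c(12, 15, 9, 20, 14)
price <- c(30, 28, 35, 22, 29)

range(units)   # smallest and largest value
median(units)  # middle value
mean(units)    # arithmetic mean
sd(units)      # sample standard deviation

# Simple linear regression of units sold on price
fit <- lm(units ~ price)
coef(fit)      # intercept and slope

# Probability distribution: P(Z <= 1.96) for a standard normal variable
pnorm(1.96)
```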

Language Features

| Category | Application to Large Datasets |
|---|---|
| Input and Output | The ability to load and store data is essential to perform analytics on preexisting datasets. |
| Variables and Type System | Associating a data type with a variable, along with the ability to change it as needed, decreases errors associated with the improper processing of errant records. |
| Operators and Built-in Functions | Even advanced analytics often require primitive operations, such as addition and multiplication. For example, if the supplied data contains units sold and price per unit, multiplication is required to determine the total price of items sold. |
| Control Structures | Control structures, such as if/else, are useful when categorizing data. For example, it is often useful to convert a raw age of 49 to an age group of 45-49. Control structures facilitate those types of operations. |
| User-defined Functions | Extensibility is an important aspect of the R language. A user can create a custom function that replaces duplicated code. When datasets require non-trivial computation, a function provides a way to encapsulate the required logic. |
| Data Management | Data management routines, such as sorting, are relevant to processing large data sets with R. Based on professional experience, considerable effort is required to manipulate data into the state needed to perform the desired operation. For example, duplicate records must often be removed before performing regression analysis. |
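
The categories above can be sketched together in a few lines of base R. This is only an illustration under stated assumptions: the `sales` data frame, its column names, and the `age_group` helper are hypothetical and were invented for this example.

```r
# Hypothetical sales records; the column names and values are illustrative only
raw <- data.frame(
  customer   = c("A", "B", "B", "C"),
  age        = c(49, 23, 23, 67),
  units_sold = c(3, 1, 1, 5),
  unit_price = c(9.99, 24.50, 24.50, 4.00)
)

# Input and output: store the data as CSV in a temporary file, then load it
path <- tempfile(fileext = ".csv")
write.csv(raw, path, row.names = FALSE)
sales <- read.csv(path, stringsAsFactors = FALSE)

# Data management: drop exact duplicate rows before further analysis
sales <- unique(sales)

# Operators: total price is units sold times price per unit (vectorized)
sales$total_price <- sales$units_sold * sales$unit_price

# User-defined function with a control structure: map a raw age to a
# five-year age group, for example 49 becomes "45-49"
age_group <- function(age) {
  if (is.na(age) || age < 0) {
    return(NA_character_)
  }
  lower <- (age %/% 5) * 5
  paste0(lower, "-", lower + 4)
}
sales$age_group <- vapply(sales$age, age_group, character(1))

# Data management: sort by total price, largest first
sales <- sales[order(-sales$total_price), ]
print(sales)
```

Wrapping the age logic in a function keeps the categorization rule in one place, so it can be reused on other columns or datasets without duplicating the if/else logic.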

R’s Suitability for Big Data Analytics

Traditionally, R has had several limitations in its ability to process large datasets. R is typically single threaded (Microsoft Corporation, 2017). A single-threaded program uses only one core for computationally intensive tasks, even on a multiple-core machine (Dennis, 2002). R is also traditionally limited by the resources of a single computer, whereas Big Data is by definition beyond the abilities of a single computer (Bihl, Young II, & Weckman, 2016). To address these limitations, R distributions have been developed that leverage organizations’ investments in Hadoop and other Big Data platforms (Anonymous, 2017b). Microsoft’s R Server, for example, sends R code to data that is distributed across a cluster, which increases performance and enables processing of datasets beyond R’s traditional capabilities.
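The idea of splitting work, computing partial results, and combining them can be sketched on a single machine with the `parallel` package that ships with R. This only illustrates the general pattern and is not a description of how Microsoft’s R Server is implemented; the data, the chunk and worker counts, and the `chunk_summary` helper are assumptions made for the example.

```r
library(parallel)

set.seed(42)
values <- rnorm(1e5)   # stand-in for a large numeric column

# Split the data into four chunks (chunk count chosen arbitrarily)
chunk_id <- rep(1:4, length.out = length(values))
chunks   <- split(values, chunk_id)

# Hypothetical helper: partial results that can be combined later
chunk_summary <- function(x) c(sum = sum(x), n = length(x))

# Start two worker processes and summarize each chunk on a worker
cl <- makeCluster(2)
partials <- parLapply(cl, chunks, chunk_summary)
stopCluster(cl)

# Combine the partial sums and counts into the overall mean
totals <- Reduce(`+`, partials)
overall_mean <- totals[["sum"]] / totals[["n"]]

# Compare against the single-threaded result
all.equal(overall_mean, mean(values))
```

Returning a sum and a count from each chunk, rather than a per-chunk mean, keeps the combined result correct even when the chunks differ in size.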
Once the traditional limitations of R have been addressed, it provides a rich environment for addressing Big Data Analytics problems. For example, association analysis (also called affinity analysis, and often referred to as Market Basket Analysis) is a common pattern when processing Big Data. A familiar example is Amazon recommending an additional item based on one that was just added to the shopping cart. The technique is sometimes applied outside of retail. For example, a customer project that is currently in the design phase is exploring the use of association analysis to determine which employees are similar to other employees based on expertise, interest, and experience.
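To make the idea concrete, the usual measures behind association analysis (support, confidence, and lift) can be computed by hand in base R. This is a minimal sketch rather than a production approach; the `baskets` data and the `support` helper are hypothetical and exist only for this illustration.

```r
# Hypothetical shopping baskets; item names are illustrative only
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "jam"),
  c("milk", "jam"),
  c("bread", "butter", "jam")
)

# Fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

s_bread  <- support("bread", baskets)
s_butter <- support("butter", baskets)
s_both   <- support(c("bread", "butter"), baskets)

# Rule "bread -> butter": how often butter appears when bread is present
confidence <- s_both / s_bread

# Lift above 1 suggests the items co-occur more often than chance alone
lift <- confidence / s_butter

cat("support:", s_both, "confidence:", confidence, "lift:", lift, "\n")
```

In practice, a dedicated package such as `arules` (with its `apriori` function) is commonly used to search all item combinations instead of checking one hand-picked rule.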

Conclusion

R is a statistically oriented programming language.  It contains many powerful statistical operations accessible via relatively simple language constructs.  It can be applied to Big Data problems, once scalability issues have been addressed.

References

Anonymous. (2017a). R Mean, Median and Mode.   Retrieved from https://www.tutorialspoint.com/r/r_mean_median_mode.htm

Anonymous. (2017b). R Server Overview—R Data Analysis | Microsoft.   Retrieved from https://www.microsoft.com/en-us/cloud-platform/r-server

Bihl, T. J., Young II, W. A., & Weckman, G. R. (2016). Defining, understanding, and addressing Big Data. International Journal of Business Analytics (IJBAN), 3(2), 1-32.

Dennis, A. L. (2002). .NET Multithreading. Manning Publications Co.

Kabacoff, R. (2017). Quick-R: Built-in Functions.   Retrieved from http://www.statmethods.net/management/functions.html

Microsoft Corporation. (2017). The Benefits of Multithreaded Performance with Microsoft R Open · MRAN.   Retrieved from https://mran.microsoft.com/documents/rro/multithread/

