Saturday, May 06, 2017

Machine Learning

Machine learning is a multidisciplinary field focused on algorithms that learn from data, drawing on concepts from statistics, artificial intelligence, cognitive science, and many other disciplines (Qiu, Wu, Ding, Xu, & Feng, 2016). One major branch, supervised learning, uses a training set containing inputs and their desired outputs. It is typically concerned with classification, regression, and estimation.
Supervised learning can be applied to large data sets, even those in which every element is statistically significant, because it is concerned less with explaining the relationships in the data than with fitting a curve (in the case of regression) in a way that minimizes error. There are many supervised machine learning algorithms (Jordan & Mitchell, 2015). In my professional experience, finding the algorithm that works best for a given data set is often a trial-and-error exercise.
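To make "fitting a curve in a way that minimizes error" concrete, here is a minimal sketch of ordinary least squares for a straight line, using only the Python standard library. The function names and the data points are made up for illustration; a real project would more likely reach for an established library.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the line that minimizes squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares solution for a single input variable.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared difference between predictions and desired outputs."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training set: inputs (xs) and desired outputs (ys).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
print("slope:", slope, "intercept:", intercept)
print("training error:", mean_squared_error(xs, ys, slope, intercept))
```

Swapping in a different algorithm and comparing the resulting error is, in essence, the trial-and-error exercise described above.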
Machine learning, like most approaches based on the scientific method, requires a clear hypothesis (or null hypothesis). In my professional experience, many organizations think they can learn from their data without a clear idea of what they hope to learn. Our team hosts workshops with the explicit goal of determining which questions require an answer. For example, does increasing advertising spend correlate with an increase in sales?
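A question like the advertising one can be checked directly once it is stated clearly. The sketch below computes the Pearson correlation coefficient with the standard library only. The figures are invented for illustration, and the function name is my own. Keep in mind that a strong correlation does not by itself show that the spend caused the sales.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical monthly figures, not real data.
ad_spend = [10, 20, 30, 40, 50]
sales = [120, 150, 170, 210, 240]

r = pearson_r(ad_spend, sales)
print("correlation between ad spend and sales:", round(r, 3))
```

A value near 1 suggests sales rise with spend, a value near -1 suggests they fall, and a value near 0 suggests no linear relationship.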
Creating a good sample is important in both statistics and machine learning. The approach I have seen most often is to create two sets from a labeled data set. The first set is used to train the model, while the second is used to test the accuracy of the trained model. I have not seen a hard and fast recommendation on how big each should be, but I have often seen 80% designated for training and 20% for testing. The motivation behind this is to detect overtraining (overfitting) of the model. The basic idea is that an algorithm might be very good at classifying the data it has seen before but very poor when it sees new data, which would result in poor performance in practice. The selection of the items for each group is typically random.
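The random 80/20 split described above can be sketched in a few lines of standard-library Python. The function name, the seed, and the toy data set are assumptions for illustration. Libraries such as scikit-learn ship their own, more capable helpers for this.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=None):
    """Randomly split items into (train, test) without modifying the input."""
    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled data set: (input, label) pairs.
labeled = [(i, i % 2) for i in range(100)]

train, test = train_test_split(labeled, test_fraction=0.2, seed=42)
print(len(train), "training items,", len(test), "test items")
```

The model is trained only on the first list. Its accuracy on the second list, which it has never seen, is the better estimate of how it will behave on new data.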
Machine learning is a powerful way of interacting with data. Because it draws on many approaches, including statistics, it facilitates experimentation to reach the best results. Powerful as it is, a solution built on machine learning succeeds only when it has clear goals and a clear purpose.

References


Jordan, M., & Mitchell, T. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255-260.

Qiu, J., Wu, Q., Ding, G., Xu, Y., & Feng, S. (2016). A survey of machine learning for big data processing. EURASIP Journal on Advances in Signal Processing, 2016(1), 67. doi:10.1186/s13634-016-0355-x

