Saturday, May 06, 2017

Parsimony

Parsimony, also known as the economy of parameters, is desirable when building models for several reasons (Ledolter & Abraham, 1981).  Models with fewer parameters are easier to explain and understand.  Each parameter added to a model introduces additional estimation variance.  In Big Data settings, including or excluding a parameter may also change the computational requirements significantly.
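As a rough illustration of the economy-of-parameters idea, the short Python sketch below (my own example, not drawn from the cited sources) compares a one-degree and a nine-degree polynomial fit on the same noisy linear data using the Akaike Information Criterion, AIC = n*ln(RSS/n) + 2k, which charges a penalty for every additional parameter.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = 2.0 + 3.0 * x + rng.normal(scale=0.3, size=x.size)   # truly linear signal plus noise

def aic(degree):
    coeffs = np.polyfit(x, y, degree)             # least-squares polynomial fit
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    k = degree + 1                                # number of estimated parameters
    return x.size * np.log(rss / x.size) + 2 * k  # penalty grows with k

print("AIC, degree 1:", round(aic(1), 2))         # parsimonious model
print("AIC, degree 9:", round(aic(9), 2))         # over-parameterized model

Because the underlying signal here is genuinely linear, the extra parameters of the degree-nine fit buy little reduction in error, and the penalty term typically leaves the simpler model with the lower (better) AIC.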
Overfitting refers to a model that does not generalize well to previously unseen data (Jackson, 2002).  Such a model performs very well on the training data, but the results are poor when it encounters new data.  Models built with large numbers of variables and tuned closely to the training data often perform poorly in new situations because they carry unneeded predictor variables (Ledolter, 2013).
A general linear model (GLM) fit by least squares is susceptible to overfitting because it minimizes the error between the predicted values and the training values.  If many variables are supplied and a high degree of accuracy is demanded, the algorithm essentially memorizes the training values; when a novel input is encountered, it can do little better than guess.
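This memorization effect can be sketched in a few lines of Python.  The example below is illustrative only: it assumes a polynomial least-squares fit with as many parameters as training points, so the fit reproduces the training values almost exactly while generalizing poorly.

import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.1, size=10)

coeffs = np.polyfit(x_train, y_train, 9)          # 10 parameters for 10 points
train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)

x_test = rng.uniform(0, 1, 200)                   # previously unseen inputs
y_test = np.sin(2 * np.pi * x_test)
test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

print("training MSE:", train_mse)                 # essentially zero: the training values are memorized
print("test MSE    :", test_mse)                  # far larger on novel inputs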
Overfitting is a pervasive problem.  It can occur in deep belief networks (Hinton, Osindero, & Teh, 2006), which are multilayered neural networks.  Some deep learning algorithms use unsupervised pre-training as a way to reduce overfitting (Dahl, 2015; LeCun, Bengio, & Hinton, 2015).  When convolutional neural networks (CNNs) were applied to a corpus of video clips containing various sports, the data were preprocessed to reduce overfitting (Karpathy et al., 2014).  Specifically, images from the video stream were resized, randomly sampled, and flipped.
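The random sampling and flipping described by Karpathy et al. can be approximated with a generic augmentation routine such as the NumPy sketch below.  It illustrates the idea only and is not their actual preprocessing pipeline; the crop size and the frame dimensions are arbitrary choices for the example.

import numpy as np

def augment(frame: np.ndarray, crop: int, rng: np.random.Generator) -> np.ndarray:
    # Randomly crop and horizontally flip one frame so the network rarely
    # sees exactly the same pixels twice.
    h, w, _ = frame.shape
    top = rng.integers(0, h - crop + 1)           # random crop position
    left = rng.integers(0, w - crop + 1)
    patch = frame[top:top + crop, left:left + crop, :]
    if rng.random() < 0.5:                        # flip half the time
        patch = patch[:, ::-1, :]                 # horizontal (left-right) flip
    return patch

rng = np.random.default_rng(42)
frame = rng.integers(0, 256, size=(240, 320, 3), dtype=np.uint8)  # stand-in video frame
print(augment(frame, crop=200, rng=rng).shape)    # (200, 200, 3)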
When Google applied deep neural networks to YouTube video recommendations, they encountered overfitting (Covington, Adams, & Sargin, 2016).  One specific problem stemmed from training the recommender on a surrogate task rather than on the recommendation task itself.  Deciding which variables and signals to withhold was essential to good performance; if all available information had been used during training, the resulting model would not have generalized well.
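The value of withholding a signal can be illustrated with a small, entirely hypothetical example (not the actual YouTube system): a "leaky" feature that exists only in historical logs, here a near-copy of the target, dominates a least-squares fit, and the model collapses once that feature is unavailable at serving time.

import numpy as np

rng = np.random.default_rng(7)
n = 1000
genuine = rng.normal(size=n)                      # feature also available at serving time
y = 2.0 * genuine + rng.normal(scale=0.5, size=n)
leaky = y + rng.normal(scale=0.01, size=n)        # log-only signal, near-copy of the target

def fit(X):
    w, *_ = np.linalg.lstsq(X, y, rcond=None)     # ordinary least squares
    return w

w_all = fit(np.column_stack([genuine, leaky]))    # uses every available signal
w_held = fit(genuine.reshape(-1, 1))              # leaky signal withheld

# At serving time the leaky signal does not exist; substitute zeros for it.
genuine_new = rng.normal(size=n)
y_new = 2.0 * genuine_new + rng.normal(scale=0.5, size=n)
pred_all = np.column_stack([genuine_new, np.zeros(n)]) @ w_all
pred_held = genuine_new.reshape(-1, 1) @ w_held

print("serving MSE, all signals used:", np.mean((pred_all - y_new) ** 2))   # poor
print("serving MSE, leak withheld   :", np.mean((pred_held - y_new) ** 2))  # near the noise level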



References
Covington, P., Adams, J., & Sargin, E. (2016). Deep Neural Networks for YouTube recommendations. Paper presented at the Proceedings of the 10th ACM Conference on Recommender Systems, Boston, Massachusetts, USA. http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf

Dahl, G. E. (2015). Deep learning approaches to problems in speech recognition, computational chemistry, and natural language text processing. (3715720 Ph.D.), University of Toronto (Canada), Ann Arbor. Retrieved from http://search.proquest.com.proxy.cecybrary.com/docview/1708929080?accountid=26967 ProQuest Dissertations & Theses Global database.

Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527-1554. doi:10.1162/neco.2006.18.7.1527

Jackson, J. (2002). Data mining: A conceptual overview. Communications of the Association for Information Systems, 8(1), 19.

Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. Paper presented at the Proceedings of the IEEE conference on Computer Vision and Pattern Recognition.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.

Ledolter, J. (2013). Data mining and business analytics with R. John Wiley & Sons.

Ledolter, J., & Abraham, B. (1981). Parsimony and its importance in time series forecasting. Technometrics, 23(4), 411-414.


