Friday, May 19, 2017

Unlimited Time and Money Wish-List

What would you do with unlimited time and money? This post answers that question by classifying the items into five categories: education, job/research, philosophy/religion, travel, and home.

Education

·         Complete the Doctorate in Computer Science
·         Learn to play the piano
·         Learn to compose music
·         Have the time to read all assigned readings in my coursework
·         Teach
·         Make an impact on the young, so they can see the world is full of choices and diverse destinations
·         Take more math classes
·         Take more statistics classes
·         Take history classes, and pay attention this time
·         Learn Spanish

Job/Research

·         Research a life changing technology, such as a cure for a terminal illness.
·         Leave a mark on the world; be it an interesting paper, idea, or book.
·         Write another non-fiction book
·         Write a fiction book
·         Master a topic
·         Work where research is the most important element
·         Publish articles
·         Go to conferences, present, and be included in the proceedings
·         Share my thoughts
·         Make the future happen, rather than just think about what might be

Philosophy/Religion

·         Become the sort of person who wakes early, drinks coffee on the porch, and thinks a lot
·         Visit old churches, to take in the atmosphere
·         Visit a cathedral, and contemplate the lives spent building it.
·         Learn to paint, and express faith in art
·         Re-read the Bible
·         Read the major religions' works
·         Take another philosophy class, and pay attention this time
·         Learn more about Hinduism
·         Learn more about Buddhism
·         Take time to pause, enjoy the moments, and reflect on the past

Travel

·         Run a half-marathon in all fifty states.
·         Spend a long vacation in Ireland.
·         Explore Alaska, slowly, so that the true measure of the place can be felt.
·         Go to Maine
·         Climb Machu Picchu
·         Go to Yellowstone National Park
·         Go to Iceland
·         Spend time in Australia
·         London
·         Paris

Home

·         Build a log cabin (already a work in progress)
·         Have a study with bookshelves, leather chairs, and a good reading light.
·         Build an outdoor kitchen
·         Grow grapes and make wine
·         Grow berries, such as blueberries.
·         Have a pond/lake with fish and ducks
·         Have a dock on a pond
·         Build a treehouse, for young and old alike.
·         Have a deck with lots of comfortable chairs
·         Have a garden, and actually eat the stuff I grow

Conclusion


Most of the items on these lists do not require unlimited money or time. As with many things, the issue is one of focus and priority.

Saturday, May 06, 2017

Quantitative and Qualitative Literature Reviews

Quantitative and qualitative literature reviews serve different purposes.  This post discusses the content and structure of each.  A quantitative study is not complete until the report is written (McGraw-Hill Companies Inc, 2006a).  The report is generally intended for scholars and students.  Two important elements of a quantitative study are the introduction and the literature review.

Quantitative Study

The introduction is generally one or two paragraphs in length.  Its purpose is to frame the study and state the intention. The introduction should engage the reader, encouraging them to continue reading. 
After the introduction, the report contains the literature review.  The purpose of a literature review for a quantitative research project is to further frame the research area.  The goal is to put the study into perspective.  It may include the history of the variables being studied.  The literature review should go beyond a description of the literature to include analysis, synthesis, and possibly a critique.
The problem statement should be included near the beginning of the literature review.  It serves to clarify the nature of the problem, why it is important to study, why the researchers conducted the study, and why the reader should be interested in the results.  The literature review should include empirical research results, and possibly publications that evaluate or propose a theory.  The number of articles included in a literature review varies based upon the topic and nature of the research.
The literature review can be organized in various ways.  One way is to start with the seminal work and move forward in a chronological order.  An alternative is to start from the general concepts related to the study and move to specifics.  The first paragraph of the literature review should outline the upcoming topics.  The literature review should be organized, utilizing headings and other helpers to guide the reader.  A literature review is typically written in third person and should follow proper style, such as APA.
The research questions and hypotheses should be placed at the end of the literature review.  They should emerge from the literature that was reviewed.  They should be stated as a simple question or sentence.  They may be referred to using a shorthand notation, such as H1 and H2 or RQ1 and RQ2. 

Qualitative Study

A qualitative study is similar to a quantitative study in that it is not complete until the report has been written (McGraw-Hill Companies Inc, 2006b).  Unlike a quantitative study, a qualitative study is written in a somewhat reflective way.  While a quantitative study relies on instruments to gather data, a qualitative study relies on the author to interpret and, to some extent, capture information.  The literature review in a qualitative study is a summary.





References

McGraw-Hill Companies Inc. (2006a). Chapter 17: Reading and Writing the Quantitative Research Report.   Retrieved from http://highered.mheducation.com/sites/dl/free/0073049506/240132/Chapter_17.ppt

McGraw-Hill Companies Inc. (2006b). Chapter 18: Reading and Writing the Qualitative Research Report.   Retrieved from http://highered.mheducation.com/sites/dl/free/0073049506/295001/Chapter_18.ppt


Not Just Significance but Size of Effect

Blindly trusting the output of a computer program, such as SPSS, without understanding the data may lead to misleading results.  It is possible for results to appear statistically significant when the data as a whole do not support them.  There are also significant limitations to the chi-squared test that should be taken into consideration before embracing its results.
It is common practice to consider the results of a Pearson chi-squared test significant if the p value is 0.05 or less (Penn State Eberly College of Science, n.d.). A contingency table is used to enumerate the combinations of categorical variables and values (Field, 2013).  The table shows the counts of each combination.
Without knowing more about the specifics of the management style survey referenced in this assignment, it is difficult to speak with exactness about why the overall results are significant yet of low value.  It is likely, however, that several combinations are highly correlated while the majority are not.  To determine the specific situation, a crosstabulation would be constructed and the standardized residuals compared to 1.96.  A combination whose standardized residual has an absolute value less than 1.96 does not indicate a significant relationship.
Sample size and other factors can impact tests for significance (Runkel, 2012).  It is important that when the results of a chi-squared test indicate significance, an appropriate test of strength be performed (McHugh, 2013).  Calculating a value such as Lambda, which measures the degree of association between conditions, is an appropriate means of determining the strength of the relationship (AcaStat Software, 2015).
When doing analysis, it is important to consider the overall picture the data are showing.  Rather than assuming the values produced by a statistical package tell the whole story, the researcher must dig deeper.  When a result is shown to be statistically significant, that is an indication that additional analysis, such as a measure of effect size, is required.
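The additional analysis described above can be sketched in a few lines of Python. The contingency table below is invented for illustration; the sketch computes the chi-squared statistic by hand, compares standardized residuals against 1.96, and uses Cramér's V as the measure of effect size (Lambda, mentioned above, is another option).

```python
import math

# Hypothetical 2x2 contingency table: management style (rows) vs
# employee satisfaction (columns). The counts are invented.
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under independence, the chi-squared statistic, and
# standardized residuals: (observed - expected) / sqrt(expected).
chi2 = 0.0
residuals = []
for i, row in enumerate(observed):
    res_row = []
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n
        chi2 += (obs - exp) ** 2 / exp
        res_row.append((obs - exp) / math.sqrt(exp))
    residuals.append(res_row)

# Cramer's V as an effect size: sqrt(chi2 / (n * (k - 1))), where k is
# the smaller of the number of rows and columns.
k = min(len(observed), len(observed[0]))
cramers_v = math.sqrt(chi2 / (n * (k - 1)))

# A standardized residual with absolute value above 1.96 flags a cell
# that contributes significantly to the overall result.
significant_cells = [(i, j) for i, row in enumerate(residuals)
                     for j, r in enumerate(row) if abs(r) > 1.96]
```

Here only two of the four cells drive the significant overall chi-squared value, the kind of mixed picture described above.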




References


AcaStat Software. (2015). Chi Square Measures of Association.   Retrieved from http://www.acastat.com/statbook/chisqassoc.htm

Field, A. (2013). Discovering statistics using IBM SPSS statistics: Sage.

McHugh, M. L. (2013). The Chi-square test of independence. Biochemia Medica, 23(2), 143-149. doi:10.11613/BM.2013.018

Penn State Eberly College of Science. (n.d.). 11.2 - Chi-Square Test of Independence.   Retrieved from https://onlinecourses.science.psu.edu/stat200/node/73

Runkel, P. (2012). Large Samples: Too Much of a Good Thing?  Retrieved from http://blog.minitab.com/blog/statistics-and-quality-data-analysis/large-samples-too-much-of-a-good-thing


Parametric and Non-Parametric Tests

Selecting between parametric and non-parametric tests can be a confusing task.  There are many tests, each applicable to certain types of data and situations.  This post discusses parametric and non-parametric analysis and offers advice on the selection of the appropriate test.

Parametric Analysis

Parametric tests assume a normal distribution of data (Field, 2013).  The data must also be equally dispersed, a property termed homogeneity of variance.  The data must be measured at the interval or ratio level. Lastly, the observations must be independent.  If the data satisfy the requirements for parametric analysis, then a parametric test should be utilized.
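An informal sketch of checking two of these assumptions, using invented data. In practice, formal tests such as Shapiro-Wilk (normality) and Levene's (homogeneity of variance) would be used; this sketch only compares the mean to the median as a rough symmetry check and compares group variances as a rough homogeneity check.

```python
from statistics import mean, median, variance

# Invented scores for two groups with equal means but very different spread.
group_a = [98, 102, 100, 97, 103, 99, 101, 100]
group_b = [90, 110, 85, 115, 80, 120, 95, 105]

# Rough symmetry check: a mean close to the median suggests little skew.
skew_gap_a = abs(mean(group_a) - median(group_a))

# Rough homogeneity check: a variance ratio well above ~2 is a warning
# that the equal-dispersion assumption does not hold.
var_ratio = max(variance(group_a), variance(group_b)) / \
            min(variance(group_a), variance(group_b))
```

In this invented example the groups share a mean of 100, yet the variance ratio is enormous, so a two-sample t-test's homogeneity assumption would be violated despite the identical means.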

Non-parametric Analysis

Non-parametric tests make fewer assumptions about data than do parametric tests (Field, 2013).  For example, non-parametric analysis is well suited to ordinal data, which does not satisfy the requirements of parametric tests.  Another reason to utilize a non-parametric test is if the median better represents the central tendency of the data than does the mean (Frost, 2015).  This is an indication that the data are skewed and may contain outliers.  Non-parametric tests are also applicable when the sample size is smaller than that necessary for parametric approaches.
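A pure-Python sketch of the Mann-Whitney U statistic, the rank-based counterpart of the two-sample t-test. The samples are invented, with one extreme outlier in the first group, exactly the situation where the median and a rank-based test serve better than the mean.

```python
import math

# Invented samples; group A contains one extreme outlier (220).
group_a = [31, 35, 29, 40, 33, 38, 30, 220]
group_b = [45, 50, 47, 52, 49, 51, 46, 48]

# U counts, over all cross-group pairs, how often a value from group A
# exceeds one from group B (ties count one half).
u_stat = sum(
    1.0 if a > b else 0.5 if a == b else 0.0
    for a in group_a for b in group_b
)

# Normal approximation: z-score of U under the null hypothesis.
n1, n2 = len(group_a), len(group_b)
mu_u = n1 * n2 / 2
sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (u_stat - mu_u) / sigma_u   # well below -1.96 here
```

Because U depends only on ranks, the single outlier of 220 contributes no more than any other large value, whereas it would badly inflate the mean used by a t-test.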

Comparison of Tests

Table 1 contains the various types of analysis and the corresponding parametric and non-parametric tests.  It was constructed by combining materials from Dr. Miller's lecture with other sources (Frost, 2015; Hoskin, 2014; Miller, 2017).  The table breaks down which tests are appropriate for each kind of analysis.  If the data are parametric in nature, then the test in the corresponding parametric column should be utilized.
Table 1: Parametric and Non-Parametric Tests

Type of Analysis | Parametric | Non-Parametric
Is a sample similar to a known population? | T-Test or Z-Test | Sign Test
Comparing means of independent groups | Two-Sample T-Test | Mann-Whitney U Test
Comparing two quantitative measurements from the same individual | Paired T-Test | Wilcoxon Signed-Rank Test
Comparing means between three or more independent groups | One-Way Analysis of Variance (ANOVA) | Kruskal-Wallis Test
Multiple comparisons of means | Two-Way Analysis of Variance (ANOVA) | Friedman Test
Estimating the degree of association between two quantitative variables | Pearson Correlation Coefficient | Spearman's Rank Correlation Coefficient or Kendall's Tau Coefficient

Conclusion

The selection of a test is guided by an understanding of the type of analysis to be performed and the nature of the associated data.  Enumerating and discussing each scenario and test is beyond the scope of this assignment, but Table 1, adapted from Dr. Miller's lecture, provides a starting point for selecting the appropriate test.



References
Field, A. (2013). Discovering statistics using IBM SPSS statistics: Sage.

Frost, J. (2015). Choosing Between a Nonparametric Test and a Parametric Test.  Retrieved from http://blog.minitab.com/blog/adventures-in-statistics-2/choosing-between-a-nonparametric-test-and-a-parametric-test

Hoskin, T. (2014). Parametric and nonparametric: Demystifying the terms. Mayo Clinic CTSA BERD Resource. Retrieved from http://www.mayo.edu/mayo-edu-docs/center-for-translational-science-activities-documents/berd-5-6.pdf

Miller, R. (Producer). (2017, 2/11/2017). Re-record of chat on non-parametrics. Retrieved from http://ctuadobeconnect.careeredonline.com/p29vwep0e20/?launcher=false&fcsContent=true&pbMode=normal


Maintaining the Context of Variables

Just as it is necessary for a quantitative report to clearly communicate the measures being analyzed (Huck, 2012), it is equally important that the researcher be cognizant of the connection between the numbers and what they represent. Fixation on the numbers rather than their meaning can result in inaccurate quantitative studies (Nielsen, 2004).
There are several strategies from personal and professional experience that can help keep focus on the meaning of the numbers.  It is always a good idea to perform a sanity check.  For example, if the data being studied are teacher salaries, a value of 10000000.00 is unlikely to be reasonable.  The same applies to operations upon those values: if the mean salary came out to 5000, it would be an indication that the numbers might not represent what is expected.
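A minimal sketch of the kind of sanity check described above: flag values that fall outside a plausible range before computing summary statistics. The salary figures and the bounds are invented for illustration.

```python
from statistics import mean

# Invented teacher salaries; one entry is an obvious data error.
salaries = [48000, 52000, 61000, 45500, 10000000.00, 57000]

# Assumed plausible bounds for teacher pay in this hypothetical dataset.
PLAUSIBLE_MIN, PLAUSIBLE_MAX = 20000, 200000

suspect = [s for s in salaries if not PLAUSIBLE_MIN <= s <= PLAUSIBLE_MAX]
clean = [s for s in salaries if PLAUSIBLE_MIN <= s <= PLAUSIBLE_MAX]

raw_mean = mean(salaries)   # badly distorted by the bad value
clean_mean = mean(clean)    # a far more believable figure
```

The single bad value drags the raw mean above a million dollars, the sort of result that should fail the sanity check immediately.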
Another personal strategy is to consider the units of measure.  Units of measure are relevant when operations such as multiplication are performed, along with more complex statistical operations.  During work with an Internet of Things sensor, the units of measure were important when dealing with light, pressure, and temperature.  Units of measure give guidance on how a measure was initially created.  For example, barometric pressure can be measured in pounds per square inch (PSI).  Converting from PSI to some other measure should remove the pounds and the inches from the units.
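A small sketch of keeping units explicit during a conversion, using the pressure example above: PSI (pounds per square inch) to kilopascals. The conversion factor, 6.894757 kPa per PSI, is standard; the function name is mine.

```python
def psi_to_kpa(psi: float) -> float:
    """Convert pounds per square inch to kilopascals.

    The factor carries the unit change: pounds and inches cancel,
    leaving pascals (scaled to kPa).
    """
    return psi * 6.894757

sea_level_psi = 14.696                     # standard atmosphere in PSI
sea_level_kpa = psi_to_kpa(sea_level_psi)  # about 101.325 kPa
```

Naming the function after both units makes it hard to apply the conversion to a value already in the target unit, which is one practical way to keep the numbers tied to their meaning.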
Lastly, look at all numbers with a degree of skepticism.  There are many instances where numbers are simply incorrect.  We should always question the correctness, accuracy, and validity of the data.  Sensors report incorrect values, people hit the wrong key, disks become corrupted, data files have errors, data transfers can contain noise, and people sometimes make mistakes.  Professional experience has shown that it is always a good idea to question the validity of the values presented.  The fact that someone ran a statistical operation on a set of data does not mean the data were without flaw.  We should always question the accuracy of what we analyze.




References
Huck, S. W. (2012). Reading Statistics and Research (6th ed.): Pearson.

Nielsen, J. (2004). Risks of Quantitative Studies.   Retrieved from https://www.nngroup.com/articles/risks-of-quantitative-studies/



Dependent and Independent Variables

This post discusses dependent and independent variables.  It provides an example with a single independent and dependent variable and a case with two independent variables and one dependent variable.  It discusses group comparison, and notes that the mean of a group alone is not a sufficient measure to determine the most common values.

Dependent and Independent Variables

A scenario containing an independent and a dependent variable is a dataset of speed and stopping distance (Anonymous, 2017).  The cars dataset contains fifty observations of speed (measured in miles per hour) and stopping distance (measured in feet).  In this dataset speed is the independent variable while distance is the dependent variable.  The relationship between the variables is non-linear: as speed increases, distance increases at a greater rate.
A variation of the cars dataset with two independent variables would include the type of brake.  The values for brake might be drum, disc, and antilock braking system.  Distance would continue to be the dependent variable.

Intra and Inter Group Variance

When looking at the performance of two groups on a given task, one speaks of two kinds of variance: between-groups and within-groups.
Variance measured within a group is a means of determining the level of dispersion of values within a sample.  Variance between groups is related to analysis of variance (ANOVA) (Babbie, Halley, & Zaino, 2007).  The basic idea is that if two groups are randomly drawn from the same sample, their variances should be similar and should be similar to the variance of the population in general.
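The two kinds of variance can be sketched directly, using three invented groups. Between-groups variance measures how far the group means sit from the grand mean; within-groups variance measures the spread of scores around their own group mean. Their ratio (after dividing by degrees of freedom) is the F statistic that ANOVA tests.

```python
from statistics import mean

# Three invented groups of scores.
groups = [[4, 5, 6], [7, 8, 9], [10, 11, 12]]

grand_mean = mean(x for g in groups for x in g)

# Sum of squares between groups: group size times the squared deviation
# of each group mean from the grand mean.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

# Sum of squares within groups: squared deviations from each group mean.
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

# F ratio: between-groups mean square over within-groups mean square.
df_between = len(groups) - 1
df_within = sum(len(g) for g in groups) - len(groups)
f_ratio = (ss_between / df_between) / (ss_within / df_within)
```

Here the group means (5, 8, 11) are far apart relative to the tight spread within each group, so the between-groups variance dominates and the F ratio is large.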

Manufacturing Process Comparison

When evaluating two processes, it is important to understand the distribution when attempting to determine performance.  Given the example of process A having a mean of 82.5 and process B having a mean of 78.5, one cannot infer that process A is generally better than process B.  If process A has a high degree of variance (especially with high-valued outliers), it could produce consistently lower values than process B, with the occasional high value offsetting the normally low-valued output.
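A small sketch of this situation, with invented output values chosen so the means match the example above: process A averages 82.5 only because of two large outliers, while its typical (median) value is below that of process B.

```python
from statistics import mean, median

# Invented process outputs. Process A is usually around 70-72 but has
# two large outliers; process B is consistently 78-79.
process_a = [70, 71, 72, 70, 71, 140, 71, 70, 72, 118]
process_b = [78, 79, 78, 79, 78, 79, 78, 79, 78, 79]

mean_a, mean_b = mean(process_a), mean(process_b)       # 82.5 vs 78.5
median_a, median_b = median(process_a), median(process_b)
```

By the mean, A looks better; by the median, B is better on a typical run, which is exactly why the distribution, not just the mean, must be examined.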



References

Anonymous. (2017). R: Speed and Stopping Distances of Cars.   Retrieved from https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/cars.html

Babbie, E. R., Halley, F., & Zaino, J. (2007). Adventures in social research: data analysis using SPSS 14.0 and 15.0 for Windows: Pine Forge Press.


Ensemble machine learning methods

Ensemble machine learning methods are created by combining a set of models and then using a weighted vote to classify new data points (Dietterich, 2000).  For an ensemble of classifiers to be more effective than its individual elements, it must include diverse and accurate models.
The basic idea is that by leveraging the results of multiple classifiers, the deficiencies of a given classifier can be overcome.  For example, neural networks using a gradient descent approach are subject to local optima. They can essentially “get stuck” on what they think is the best answer when a better value exists. By combining a neural network with a different approach, such as a k-nearest neighbor or Bayesian-based approach, that situation can be avoided.  The essential characteristic of an ensemble approach is that the combination of approaches produces better results than any given approach in isolation.
Ensemble methods often include bagging, boosting, and random forests.  Bagging (also known as bootstrap aggregation) is used to reduce variance by training each model on a random bootstrap sample of the training data (James, Witten, & Hastie, 2014).  Boosting is similar to bagging, with the addition of weights assigned to observations (Ledolter, 2013).
A random forest is created by drawing multiple random samples from a given dataset and creating a decision tree from each (Ahlemeyer-Stubbe & Coleman, 2014).  This combination of trees is the forest (in graph terminology).  The motivation behind the creation of a random forest is that while a single decision tree will always produce a result, that result might be weak.
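The bagging idea can be sketched in pure Python: draw bootstrap samples, fit a trivially simple model (a one-feature threshold "stump") to each, and classify new points by majority vote. The toy dataset is invented, and real work would use a library such as scikit-learn; this is only the mechanism.

```python
import random
from collections import Counter

def fit_stump(data):
    """Fit a threshold classifier: predict class 1 when x >= threshold.

    The threshold is the midpoint between the two class means, a crude
    but adequate rule for this one-dimensional toy problem.
    """
    zeros = [x for x, y in data if y == 0]
    ones = [x for x, y in data if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def bagged_predict(stumps, x):
    """Majority vote across the ensemble of stumps."""
    votes = Counter(1 if x >= t else 0 for t in stumps)
    return votes.most_common(1)[0][0]

random.seed(0)
# Invented data: class 0 clusters near 1.0, class 1 clusters near 3.0.
data = [(random.gauss(1.0, 0.4), 0) for _ in range(30)] + \
       [(random.gauss(3.0, 0.4), 1) for _ in range(30)]

# Bagging: each stump is fit to a bootstrap sample (drawn with replacement).
stumps = [fit_stump(random.choices(data, k=len(data))) for _ in range(25)]

prediction_low = bagged_predict(stumps, 0.8)   # expect class 0
prediction_high = bagged_predict(stumps, 3.2)  # expect class 1
```

Each individual stump is weak and its threshold varies from sample to sample, but the vote across 25 of them is stable, which is the variance reduction bagging is meant to provide.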





References
Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry: John Wiley & Sons.

Dietterich, T. G. (2000). Ensemble methods in machine learning. Paper presented at the International workshop on multiple classifier systems.

James, G., Witten, D., & Hastie, T. (2014). An introduction to statistical learning: With applications in R.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.


Decision trees

Decision trees provide a way of assigning a probability to a choice (Ledolter, 2013).  They are a visual representation of a decision-making process, and are useful as an aid to decision making when a cost or reward is associated with a decision and the probability of success can be estimated. Figure 1 shows a contrived example of deciding to take an umbrella. The example becomes more interesting if taking the umbrella costs something and having it saves money: for instance, if you had to rent an umbrella but having it saved you a dry-cleaning expense when it rained.
The expected value of a decision is calculated by applying the probability of each outcome to the value associated with that outcome (Kirkwood, 2002). If a decision is deconstructed into subsequent decisions, the expected values of those decisions are summed to the higher level.
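The expected-value calculation for the umbrella example can be sketched with invented numbers: say renting the umbrella costs $5, rain occurs with probability 0.3, and being caught without an umbrella causes a $20 dry-cleaning expense.

```python
# Assumed probability of rain for this hypothetical example.
P_RAIN = 0.3

def expected_cost(take_umbrella: bool) -> float:
    """Expected cost of a choice: probability times outcome cost, summed."""
    if take_umbrella:
        return 5.0  # rental fee is paid whether or not it rains
    # No umbrella: dry-cleaning cost only in the rain branch.
    return P_RAIN * 20.0 + (1 - P_RAIN) * 0.0

cost_take = expected_cost(True)    # 5.0
cost_skip = expected_cost(False)   # 0.3 * 20 = 6.0
best_choice = "take umbrella" if cost_take < cost_skip else "skip umbrella"
```

With these assumed numbers the rental is the cheaper choice in expectation, but note how sensitive the answer is: at P_RAIN = 0.25 the two branches tie, which is the kind of insight the tree makes visible.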
Decision trees require that the results of a decision point be mutually exclusive (Ahlemeyer-Stubbe & Coleman, 2014). Using the tree analogy, once a limb branches, the resulting limbs must not be connected or rejoined. The goal is to partition the data at each decision.

Comparison of Approaches

The following is a breakdown of the strengths and weaknesses of linear and logistic regression, decision trees, and neural networks (Ahlemeyer-Stubbe & Coleman, 2014, pp. 141-142). I include neural networks as it is both a research interest and a common data mining technique.

Linear and logistic regression

Advantages

·        Possible to do parameter estimation and hypothesis testing
·        Model is directly interpretable
·        Fast execution
·        Good for screening purposes
·        Standard software 

Disadvantages

·        Large amount of manual work
·        Assumptions must be met about distribution and linearity
·         Sensitive to outliers
·        Mediocre performance

Decision Trees

Advantages

·        Simple to interpret and implement results
·        No assumptions about distribution and linearity
·        Not sensitive to outliers
·        Fast execution
·        Good for screening purposes
·        Good performance

Disadvantages

·        Often too few (6–8) final nodes
·        Results of limited value because there are not enough groups
·        No hypothesis testing and parameter estimation
·        Needs specialized software

Neural Networks

Advantages

·        All types of data can be analyzed
·        No assumptions about distribution and linearity
·        Good performance
·        Generally applicable predictive equations derived
·        Little manual work

Disadvantages

·        Difficult to interpret
·        Sensitive to outliers in continuous data
·        No hypothesis testing and parameter estimation
·        Slower execution
·        Needs specialized software


Figure 1

References

Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry: John Wiley & Sons.

Kirkwood, C. W. (2002). Decision tree primer.   Retrieved from http://www.public.asu.edu/~kirkwood/DAStuff/decisiontrees/index.html

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.