The Usual Suspects - 8 Common Data Analysis Mistakes

One thing in advance: any list of the errors most often committed in data analysis remains a subjective assessment by the expert in question, and it will differ depending on the industry, the focus of the analysis and the analyst's professional experience. Nevertheless, some misunderstandings recur across many areas of data analysis. The following list summarizes the eight mistakes in applied data analysis that I claim are the most common and universal.

  1. Statistical significance versus relevance

The idea of statistical significance is often misunderstood and therefore wrongly equated with statistically proven relevance. However, the two measure very different things. Statistical significance is a measure of certainty that takes random variation into account. “Statistically significant” therefore means that it is unlikely that a certain phenomenon occurred by chance alone. “Statistically not significant” means that no systematic effect could be demonstrated beyond random variation. Important: this does not mean that there is no effect, only that none could be demonstrated. With a sufficient number of observations, however, statistical significance can be established even for very small differences. In general, the larger the sample, the smaller the differences that test as statistically significant. This is why statistical relevance differs from statistical significance.

Statistical relevance, by contrast, measures the effect size of a difference. The effect size relates the magnitude of a difference to the spread of the data and therefore does not depend on the sample size. For a given difference, the greater the variance of the random variables, the smaller the effect size.
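The contrast can be made concrete with a small simulation. The following is a minimal sketch, not a recommended test procedure: the group means, the standard deviation and the sample sizes are invented for illustration, the p-value uses a normal approximation, and Cohen's d serves as one common effect-size measure.

```python
import math
import random
import statistics

def two_sample_z_test(a, b):
    """Return (two-sided p-value, Cohen's d) for the difference in means."""
    mean_diff = statistics.fmean(b) - statistics.fmean(a)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    std_error = math.sqrt(var_a / len(a) + var_b / len(b))
    z = mean_diff / std_error
    # Two-sided tail probability under the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # Pooled standard deviation; valid here because both groups have equal size.
    pooled_sd = math.sqrt((var_a + var_b) / 2)
    return p_value, mean_diff / pooled_sd

rng = random.Random(42)
results = {}
for n in (50, 200_000):
    # Assumed data: the true difference in means is a tiny 0.5 at a spread of 15.
    group_a = [rng.gauss(100.0, 15.0) for _ in range(n)]
    group_b = [rng.gauss(100.5, 15.0) for _ in range(n)]
    results[n] = two_sample_z_test(group_a, group_b)
    print(f"n={n:>7}: p-value={results[n][0]:.4g}, Cohen's d={results[n][1]:.3f}")
```

With the large sample the p-value becomes very small although the effect size stays negligible. With the small sample the same true difference will usually go undetected.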

  2. Correlation versus causality

If a high correlation is found between two quantities, it is often concluded that one of them determines the other. In truth, even complex statistical and econometric models cannot prove causality. This holds even if the modeling rests on a theoretical foundation, because the theory can be wrong too. Researchers and analysts regularly go out on a limb by claiming effects that do not withstand close scrutiny. Any analysis that claims to have found effects should automatically be followed by three standard questions: what role do unobserved heterogeneity, reverse causality and measurement error in the variables play for the estimation result? Only when these three sources of endogeneity are controlled for, and it can also be assumed that the sample represents the population, can a causal relationship be assumed and quantified.

  3. Unobserved influencing factors

Influences that cannot be measured, and are therefore not recorded, bias the estimated parameters of the observed factors whenever the latter are correlated with the unobserved ones. In other words: the estimated effect is wrongly attributed to the observed variable if a third, unobserved variable actually determines the target variable and at the same time correlates with the observed variable. The textbook example of bias caused by unobserved quantities is the wage equation, an equation that has been researched intensively for 60 years. The difficulty in quantifying the effect of education lies in the fact that pay varies not only with age, work experience, education and the other control variables, but also with how strongly individuals are interested in high earnings and how able they are to achieve them. The challenge: there is no statistical test that flags a misspecification due to unobserved quantities. A thorough understanding of the analysis problem is therefore essential. It enables the analyst to formulate hypotheses about which unobserved variables are hiding in the error term and causing trouble through their correlation with the regressor of interest. To gather evidence for these hypotheses, clever estimation designs or sufficiently good instruments must be found.
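The mechanism can be illustrated with simulated data, where the normally unobserved variable is known by construction. This is a sketch under invented assumptions: the variable names (`ability`, `education`, `log_wage`) and all coefficients are hypothetical and not taken from any real wage study.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Sample covariance of two equally long lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def ols_one(x, y):
    """Slope from regressing y on x and an intercept."""
    return cov(x, y) / cov(x, x)

def ols_two(x1, x2, y):
    """Slopes from regressing y on x1, x2 and an intercept (2x2 normal equations)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

TRUE_RETURN = 0.08  # assumed true effect of one year of education on log wage

rng = random.Random(0)
ability = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
# Ability raises both education and wages, which creates the bias.
education = [12.0 + 2.0 * a + rng.gauss(0.0, 2.0) for a in ability]
log_wage = [1.0 + TRUE_RETURN * e + 0.10 * a + rng.gauss(0.0, 0.3)
            for e, a in zip(education, ability)]

naive = ols_one(education, log_wage)                    # ability left in the error term
controlled, _ = ols_two(education, ability, log_wage)   # only possible in a simulation

print(f"true effect:          {TRUE_RETURN:.3f}")
print(f"without ability:      {naive:.3f}")
print(f"controlling ability:  {controlled:.3f}")
```

In real data the second regression is not available, because ability is not observed. That is why the text points to estimation designs and instruments instead.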

  4. Selection bias

Selection bias exists when observations are not available for every individual or are excluded from the analysis. The basic requirement of any statistical hypothesis test is a random sample, so that the target population is properly represented. In practice, however, situations often arise in which certain characteristics can be observed for one group but not for another. For example, the effect of a health-promotion measure on the entire workforce of a large company cannot be measured from the voluntary participation of some employees. It must be checked explicitly how employees who take up the offer voluntarily differ from those who do not. There is always a risk of overestimating or underestimating effects if the composition of the sample relative to the population is not taken into account. Generalizations are then wrongly drawn from a non-representative sample, which can lead to incorrect recommendations for action.
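The health-program example can be sketched as a simulation. All numbers below are assumptions made up for illustration: a hypothetical "health awareness" trait lowers sick days and also makes voluntary participation more likely, while the program itself reduces sick days by exactly one day.

```python
import random

TRUE_EFFECT = -1.0  # assumed: the program reduces sick days by one day per year

def sick_days(health_awareness, participates, rng):
    """Invented outcome model: aware employees are sick less often anyway."""
    return (10.0 - 2.0 * health_awareness
            + TRUE_EFFECT * participates + rng.gauss(0.0, 2.0))

def mean_difference(outcomes, participates):
    """Mean outcome of participants minus mean outcome of non-participants."""
    treated = [y for y, p in zip(outcomes, participates) if p]
    untreated = [y for y, p in zip(outcomes, participates) if not p]
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)

rng = random.Random(1)
awareness = [rng.gauss(0.0, 1.0) for _ in range(20_000)]

# Voluntary participation: health-aware employees sign up more often.
voluntary = [a + rng.gauss(0.0, 1.0) > 0 for a in awareness]
# Benchmark: participation assigned by coin flip, independent of awareness.
randomized = [rng.random() < 0.5 for _ in awareness]

naive_estimate = mean_difference(
    [sick_days(a, p, rng) for a, p in zip(awareness, voluntary)], voluntary)
randomized_estimate = mean_difference(
    [sick_days(a, p, rng) for a, p in zip(awareness, randomized)], randomized)

print(f"true effect:                 {TRUE_EFFECT:.2f}")
print(f"voluntary participants only: {naive_estimate:.2f}")
print(f"randomized assignment:       {randomized_estimate:.2f}")
```

The comparison based on voluntary participation overstates the benefit, because it mixes the effect of the program with the pre-existing difference between the two groups.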

  5. Overfitting and high estimator variance

Overfitting happens when the analyst asks “too much” of the data. If the model is over-specified, the control variables explain not only the target variable but also the white noise, i.e. the random errors. In such a specification the number of regressors is too large relative to the number of observations. The problem: too few degrees of freedom and more frequent multicollinearity lead to a high variance in the distribution of the estimators. A specification with a high estimator variance can therefore produce estimates that are further from the true value than those of a biased estimator. In fact, a “wrong” sign on a coefficient is mostly an indication of multicollinearity.
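How multicollinearity inflates the variance of an estimator can be shown by re-estimating the same model on many simulated samples. This is a minimal sketch with invented settings (50 observations, 500 replications, true coefficients of 1.0, noise standard deviation of 2.0); the helper names are hypothetical.

```python
import math
import random
import statistics

def cov(xs, ys):
    """Sample covariance of two equally long lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def coefficient_on_x1(x1, x2, y):
    """OLS coefficient on x1 when y is regressed on x1, x2 and an intercept."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    return (s22 * cov(x1, y) - s12 * cov(x2, y)) / (s11 * s22 - s12 ** 2)

def simulate(rho, n_obs=50, n_reps=500, seed=0):
    """Estimate the coefficient on x1 in many samples; its true value is 1.0."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_reps):
        x1 = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        # x2 has (population) correlation rho with x1 by construction.
        x2 = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0.0, 1.0) for a in x1]
        y = [a + b + rng.gauss(0.0, 2.0) for a, b in zip(x1, x2)]
        estimates.append(coefficient_on_x1(x1, x2, y))
    return estimates

spread = {}
wrong_sign_share = {}
for rho in (0.0, 0.98):
    estimates = simulate(rho)
    spread[rho] = statistics.stdev(estimates)
    # Share of samples with a negative estimate although the true value is +1.0.
    wrong_sign_share[rho] = sum(b < 0 for b in estimates) / len(estimates)
    print(f"rho={rho:.2f}: std of estimates={spread[rho]:.2f}, "
          f"negative estimates={wrong_sign_share[rho]:.1%}")
```

With nearly collinear regressors the estimates scatter far more widely around the true value, and a noticeable share of samples even yields a negative coefficient.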

It often makes sense to adapt the specification by expressing the correlated regressors in relation to one another. In practice, it is always a matter of finding a compromise between bias and variance. The criterion for this is minimizing the mean squared error. To check whether the analyst has overshot the mark, there are also various validation methods which, depending on the method, set aside a certain amount of data, or none at all, to check the model.

  6. Missing data points

In practice, observations with missing data points are in most cases simply excluded from the analysis because that is the fastest option. Before doing so, however, you should always ask why these data points are missing. If they are missing at random, excluding the observations does not systematically change the results. However, if they are missing systematically, for example because people with certain characteristics prefer to withhold specific data, problems arise. The task is then to recover the full distribution. If it is unclear whether the data are missing randomly or systematically, the analyst should investigate the question rather than assume the benign case. Information must then be identified that helps to impute the missing data.
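The difference between random and systematic missingness can be sketched with invented income data. The income distribution, the threshold of 3500 and the withholding rates are assumptions chosen purely for illustration.

```python
import random
import statistics

rng = random.Random(7)
incomes = [rng.gauss(3000.0, 800.0) for _ in range(20_000)]  # invented monthly incomes

# Missing at random: every respondent withholds the value with probability 0.3.
observed_random = [x for x in incomes if rng.random() >= 0.3]

def withholds(income):
    """Systematic missingness: high earners withhold far more often (assumed rates)."""
    return rng.random() < (0.6 if income > 3500.0 else 0.1)

observed_systematic = [x for x in incomes if not withholds(x)]

full_mean = statistics.fmean(incomes)
random_mean = statistics.fmean(observed_random)
systematic_mean = statistics.fmean(observed_systematic)

print(f"mean of all incomes:                {full_mean:.0f}")
print(f"complete cases, random missing:     {random_mean:.0f}")
print(f"complete cases, systematic missing: {systematic_mean:.0f}")
```

Dropping incomplete observations leaves the mean essentially unchanged in the random case, but it understates the mean when high earners withhold their income more often.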

  7. Outliers

In many applications, outliers are identified using standardized procedures and removed from the data set. In many cases, however, it is worth taking these data points seriously. The prerequisite: the data points must be legitimate. Data points generated by input errors or deliberate misreporting can safely be excluded. Legitimate data points, on the other hand, are “real” values. Including outliers can sometimes make a substantive contribution to the analysis, since they too are part of the population. Retaining outliers becomes problematic if they produce relationships that do not hold for the rest of the population. Possible ways of reconciling outliers with the remaining observations are transformations of the data or robust estimation methods. Both approaches give more weight to the center of the distribution. In addition, regressions can be used, for example, to check to what extent a non-linear fit accommodates the outliers better.
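As one possible example of a robust estimation method, the sketch below compares the ordinary least squares slope with the Theil-Sen slope (the median of all pairwise slopes) on simulated data with a single extreme value. The choice of Theil-Sen and all data values are assumptions made for illustration; the text above does not name a specific method.

```python
import random
import statistics
from itertools import combinations

def ols_slope(x, y):
    """Ordinary least squares slope of y on x (with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def theil_sen_slope(x, y):
    """Robust slope: median of the slopes through all pairs of points."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2) if x[j] != x[i]]
    return statistics.median(slopes)

rng = random.Random(3)
x = list(range(20))
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]  # true slope: 2.0
y[-1] = 200.0  # one extreme observation, kept in the data set

ols = ols_slope(x, y)
robust = theil_sen_slope(x, y)
print(f"OLS slope:       {ols:.2f}")
print(f"Theil-Sen slope: {robust:.2f}")
```

The single extreme point pulls the least squares slope well away from 2.0, while the median-based estimate stays close to the relationship that holds for the remaining observations.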

  8. Specification versus modeling

Too often, complicated statistical models are built before anyone checks what a simple model can do. Before constructing complex models, however, one should first work on the specification of the model. Small adjustments such as including better variables or allowing for interactions and non-linear effects sometimes bring us closer to the truth than a complex model, and they should be exhausted before a more complex model is chosen. The simpler the model, the easier it is to keep control of it. In any case, the chosen specifications should always be backed up by sensitivity analyses. Differences in variable definitions and data selection should be both tested and reported. The analyst has good reason to change the model if it becomes apparent that the assumptions of the simple model are violated and that the model therefore does not produce valid results.

Nannette Swed

Dr. Nannette Swed is the founder and chief data analyst of Her main focus is on statistical programming and the integration of open source tools into operational data analysis and reporting. Before that, she taught data analysis with various focuses at the Humboldt University of Berlin and Universidad de La Habana.
