How do you calculate outliers

How to find outliers in your data series

Imagine you've created a series of data and created a scatter plot from it. In your opinion, one or more values ​​do not really fit into the data cloud, because they are a little outside of this structure.

You wonder whether these values ​​are so-called outliers, because they could falsify an arithmetic mean.

So you want to analyze the data for outliers. In addition to very complicated methods, there are also relatively simple ones that I would like to show you in this post.

What is an outlier?

I learned this method methodically from http://de.wikihow.com [1]. For the article, however, I will use an example of my own.

First of all, it says on the site:

“An outlier is a measuring point that differs significantly numerically from the other measured values ​​in the series of measurements. The term is mainly used in statistics and can indicate deviations in an observed data series or reveal errors in the measurements. "

This gives me a good basis to examine my example.

 

1. The readings

At a water supplier, the so-called network charge is measured and documented on a monthly basis, and on a daily basis for large municipal works. Network output is the amount of water that is pumped into the water supply network after treatment and minus a certain self-consumption (e.g. for flushing) and is thus available to the end user.

The example shows the monthly measured values ​​of a fictitious waterworks.

 

2. How to spot a possible outlier

To do this, I look at the individual values ​​in the data series. I notice that the reading for August differs from the other values. That could be an outlier and suggest that a major burst pipe was an issue somewhere in the coverage area in August.

3. Order the data points from lowest to highest

I did this. The lowest value is on the left, the highest on the right.

4. Calculate the median of the data set

The median is the data point that is exactly in the middle of the sorted data. In the case of an even number of measured values, the mean value of the two middle values ​​must be formed, in this case the mean value of 4,550 and 4,651. That's 4,601.

I work with Excel and thus use the statistical function MEDIAN.

As you would expect, the result is also 4,601.

5. Calculation of the lower quartile

The lower quartile is called Q1 and shows the data point in the series of measurements below which 25% of the measured values ​​are located.

Excel also has a function for this. The only arguments to be entered are the matrix and a 1, which denotes the first quartile.

6. Calculation of the upper quartile

The upper quartile Q3 is the data point above which 25% of the measured values ​​are still located.

I use the same function again, but insert a 3 as the quartile, which denotes the third quartile.

7. Calculation of the "antennas" of the data set

First, the difference between Q1 and Q3, the interquartile range, is calculated.

So I calculate 4,726.5 - 4,475.5 and get the result = 251.0.

The interquartile range is now multiplied by 1.5.

I calculate 251.0 * 1.5 and get = 376.5.

This number is added to Q3 and subtracted from Q1.

Q3: 4,726.5 + 376.5 = 5,103.0

Q1: 4,475.5-376.5 = 4,099.0

Then I determined the “antennas” of the data set.

All values ​​that are outside of these limit values ​​would be so-called "mild" outliers, i.e. all values

<4,099.0 and

5.103,0.

In order to identify these values ​​in the data series, I use the conditional formatting. The values ​​<4.099 should be colored yellow, the values> 5.103 red.

Values ​​<4,099 are not included, but a value> 5,103, namely 5,670, the value I already suspected.

8. Calculation of the "extreme" outliers

To do this, proceed in the same way as in Sect. 6, except that the interquartile range is not multiplied by 1.5, but by 3.0.

I calculate 251.0 * 3 and get = 753.0.

Then I calculate:

Q3: 4,726.5 + 753.0 = 5,479.5

Q1: 4,475.5-753.0 = 3,722.5.

Extreme outliers are again all values ​​that lie outside these limits.

The conditional formatting shows it:

Values ​​<3,722.5 are not included, but the value 5,670 is> 5,479.5 and thus an extreme outlier.

The reasons for this enormous deviation from all other values ​​and from the median should definitely be investigated

[1] http://de.wikihow.com/Ausrei%C3%9Fer-berechnen

Author Gerhard PundtPosted on Categories functionsTags Excel, functions, statistics