How do you calculate outliers
How to find outliers in your data series
Imagine you've created a series of data and created a scatter plot from it. In your opinion, one or more values do not really fit into the data cloud, because they are a little outside of this structure.
You wonder whether these values are so-called outliers, because they could falsify an arithmetic mean.
So you want to analyze the data for outliers. In addition to very complicated methods, there are also relatively simple ones that I would like to show you in this post.
What is an outlier?
I learned this method methodically from http://de.wikihow.com . For the article, however, I will use an example of my own.
First of all, it says on the site:
“An outlier is a measuring point that differs significantly numerically from the other measured values in the series of measurements. The term is mainly used in statistics and can indicate deviations in an observed data series or reveal errors in the measurements. "
This gives me a good basis to examine my example.
1. The readings
At a water supplier, the so-called network charge is measured and documented on a monthly basis, and on a daily basis for large municipal works. Network output is the amount of water that is pumped into the water supply network after treatment and minus a certain self-consumption (e.g. for flushing) and is thus available to the end user.
The example shows the monthly measured values of a fictitious waterworks.
2. How to spot a possible outlier
To do this, I look at the individual values in the data series. I notice that the reading for August differs from the other values. That could be an outlier and suggest that a major burst pipe was an issue somewhere in the coverage area in August.
3. Order the data points from lowest to highest
I did this. The lowest value is on the left, the highest on the right.
4. Calculate the median of the data set
The median is the data point that is exactly in the middle of the sorted data. In the case of an even number of measured values, the mean value of the two middle values must be formed, in this case the mean value of 4,550 and 4,651. That's 4,601.
I work with Excel and thus use the statistical function MEDIAN.
As you would expect, the result is also 4,601.
5. Calculation of the lower quartile
The lower quartile is called Q1 and shows the data point in the series of measurements below which 25% of the measured values are located.
Excel also has a function for this. The only arguments to be entered are the matrix and a 1, which denotes the first quartile.
6. Calculation of the upper quartile
The upper quartile Q3 is the data point above which 25% of the measured values are still located.
I use the same function again, but insert a 3 as the quartile, which denotes the third quartile.
7. Calculation of the "antennas" of the data set
First, the difference between Q1 and Q3, the interquartile range, is calculated.
So I calculate 4,726.5 - 4,475.5 and get the result = 251.0.
The interquartile range is now multiplied by 1.5.
I calculate 251.0 * 1.5 and get = 376.5.
This number is added to Q3 and subtracted from Q1.
Q3: 4,726.5 + 376.5 = 5,103.0
Q1: 4,475.5-376.5 = 4,099.0
Then I determined the “antennas” of the data set.
All values that are outside of these limit values would be so-called "mild" outliers, i.e. all values
In order to identify these values in the data series, I use the conditional formatting. The values <4.099 should be colored yellow, the values> 5.103 red.
Values <4,099 are not included, but a value> 5,103, namely 5,670, the value I already suspected.
8. Calculation of the "extreme" outliers
To do this, proceed in the same way as in Sect. 6, except that the interquartile range is not multiplied by 1.5, but by 3.0.
I calculate 251.0 * 3 and get = 753.0.
Then I calculate:
Q3: 4,726.5 + 753.0 = 5,479.5
Q1: 4,475.5-753.0 = 3,722.5.
Extreme outliers are again all values that lie outside these limits.
The conditional formatting shows it:
Values <3,722.5 are not included, but the value 5,670 is> 5,479.5 and thus an extreme outlier.
The reasons for this enormous deviation from all other values and from the median should definitely be investigated
- Is an IELTS exam required for immigration to Singapore
- What is Asymptotic Flatness in Differential Geometry
- How do women manipulate men
- Wisdom Knowledge is more important than experience
- Which is the WordPress alternative in Python
- Why don't schools teach nutrition and diet
- How does a crab patty taste
- Who teaches the best affiliate marketing
- How reliable is WWICS
- What is the USP of SAP
- Should Quora allow corporate accounts
- Diabetes can be transmitted through saliva
- When is chubby unhealthy?
- What does virtue mean
- Are cholesterol and heart disease real
- What is the best charity donation site
- Is staged by Bigg Boss TV series
- What is the currency for Cambodia
- Who is the highest paid television actor
- What is the healthiest gluten-free breakfast
- Which is the hardest math question
- Can I soak chia seeds overnight
- What was the biggest bet you lost?
- Anyone blogging without their own website
- Why is there so much censorship these days
- What is floor tile
- ACCA is difficult to pass
- How political are Singaporean churches
- Is it possible to fish under an area?
- How is the job in corporate finance
- Indian Hyderabad Muslims are Qadiani Ahmadi
- What is the StuCred app
- What are the key qualifications in netball
- Do daughters enjoy fantasizing about their fathers?