Principal Component Analysis | Explanation with Python example
No idea about Principal Component Analysis?
Then you need this tutorial!
Here's a visual explanation of how PCA works!
What is PCA?
Principal Component Analysis (PCA) simplifies, structures, and visualizes statistical data sets. PCA is a multivariate statistical method, meaning it analyzes several statistical variables at the same time.
PCA uses linear combinations (e.g. 18 = 6 * 2 + 3 * 2) to combine several statistical features into principal components.
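As a toy sketch of such a linear combination, here a "size" value is built from three vehicle features. The weights (0.6, 0.5, 0.4) are invented for illustration; PCA derives such weights from the data.

```python
# Hypothetical example: a "size" component as a linear combination
# of three vehicle features. The weights are made up for illustration.
length, width, height = 4.5, 1.8, 1.5  # metres

size_component = 0.6 * length + 0.5 * width + 0.4 * height
print(size_component)  # ≈ 4.2
```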
Example application of PCA
Multivariate statistics analyzes statistical models with many different variables (features). I can best explain how PCA works using the example of "motor vehicles" (everything that rolls on the road).
Various numerical characteristics describe a motor vehicle, which PCA groups together:
- Seats (whole numbers only)
- Engine power (measured in kilowatts)
- Load weight
PCA analyzes the statistical characteristics and looks for similar variables that are (directly) related. For example, PCA summarizes length, width, and height as "size", and engine power, displacement, and load weight as "power".
Think, for example, of the many different vehicles on German roads.
Instead of looking at all the variables, you can compare the group values (the principal components) with one another.
Applications in practice - what do you need PCA for?
Who needs PCA? What do the principal components achieve? These professions love PCA ...
- Bankers use PCA for risk management of their interest rate derivatives. A derivative hedges a risk and is based on an underlying asset (in this case, the interest rate).
- Neurologists use PCA to examine how neurons in the brain respond to stimuli.
- Economists analyze an economy with PCA, because they often have to work with a large number of data sets and features.
- Chemists use it to analyze substance samples.
How does a principal component analysis work?
Aim of the Principal Component Analysis
Before you dive into the mathematics, you should know the purpose of the calculations. With PCA, you want to reduce the number of variables in a data set.
You form generalized groups of characteristics and separate the data into principal components. With the principal components, you want to explain and predict the variation in the data better.
Prerequisite for Principal Component Analysis
PCA requires interval-scaled variables (not ordinal or nominal scales) - physical measurements such as length, power, or weight, for example.
The data for a successful PCA should not contain large error variances. You need measurements that are as accurate and error-free as possible.
Graphic solution as Principal Component Analysis
The statistics program fits a vector (the first principal component) along the longest diameter of the point cloud. The second principal component is orthogonal to the first axis. The point cloud then looks like a squashed kidney bean.
5 steps for the Principal Component Analysis
First you get an overview with a scatter diagram (2D and 3D). In data science / statistics, you should visualize what your data set looks like.
You can more easily spot mistakes and patterns and check if your results make sense!
We need these steps to go from 40 features down to a few principal components:
- Standardization of the scales
- Calculation of the covariance matrix
- Calculation of the eigenvalues and eigenvectors
- Determination of the feature vector
- Linking the vector to the data set
# 1 Standardize the scales
The example mixes features on very different scales - this results in the following challenge: the raw values are not directly comparable.
Standardization maps each feature to a comparable value range (most values then fall roughly between -1 and 1) so that the features can be compared. You can standardize the values with the formula z = (x - μ) / σ, i.e. subtract the mean μ and divide by the standard deviation σ.
If you want to implement the formula programmatically, I recommend the StandardScaler from scikit-learn. The Python library offers a variety of functions that you can use in your statistics and artificial intelligence projects.
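A minimal sketch of standardization with StandardScaler. The small vehicle table (seats, power in kW, load weight) is invented sample data:

```python
# Standardize hypothetical vehicle data (rows = vehicles,
# columns = seats, power in kW, load weight) to mean 0, std 1.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([
    [5,  85,  450.0],
    [2, 220,  300.0],
    [9, 140, 1200.0],
])

X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0))  # ≈ [0, 0, 0]
print(X_std.std(axis=0))   # ≈ [1, 1, 1]
```

After this step, every column has mean 0 and standard deviation 1, so the features are directly comparable.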
# 2 Compute the covariance matrix
In the next step, the computer calculates the covariances of all feature pairs (e.g. x, y, and z taken two at a time).
The covariance measures the relationship between two features. The result is either positive (the features increase together), negative (one increases while the other decreases), or close to zero (no linear relationship).
In Python you can compute the covariance matrix with the NumPy package. NumPy helps you to manage and compute large tables and arrays efficiently.
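A short sketch with NumPy's `np.cov`; the standardized data here is an invented example. Note that `np.cov` treats rows as variables by default, so pass `rowvar=False` for data arranged as samples-by-features:

```python
# Covariance matrix of standardized data with NumPy.
import numpy as np

X_std = np.array([       # hypothetical standardized data: 3 samples, 3 features
    [-1.0,  0.5,  0.2],
    [ 0.3, -1.2,  1.1],
    [ 0.7,  0.7, -1.3],
])

# rowvar=False: columns are variables (features), rows are observations
cov_matrix = np.cov(X_std, rowvar=False)
print(cov_matrix.shape)  # (3, 3)
```

The result is a symmetric 3 x 3 matrix: entry (i, j) is the covariance of features i and j, and the diagonal holds the variances.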
# 3 Calculate the eigenvalues and eigenvectors
When you have calculated the covariance matrix, you calculate its eigenvalues. To do this, subtract λ times the identity matrix from the covariance matrix, set the determinant to zero, and solve for λ: det(C - λI) = 0.
A capital letter in a formula means that it is a matrix.
An identity matrix is an n x n matrix that contains 1 on the diagonal and 0 everywhere else.
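In NumPy, `np.eye` builds exactly such a matrix:

```python
import numpy as np

# 3 x 3 identity matrix: ones on the diagonal, zeros everywhere else
I = np.eye(3)
print(I)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```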
You can determine the eigenvalues λ with a little algebra; each eigenvalue then yields an eigenvector by solving (C - λI)v = 0.
You can also use NumPy to calculate the eigenvalues on the computer. Take a look at this example, adapted from an explanation by the University of Münster:

```python
import numpy as np
import numpy.linalg as linalg

A = np.random.rand(100, 100)
ew, ev = linalg.eig(A)  # ew: eigenvalues, ev: eigenvectors (as columns)
```
# 4 Find the feature vector
The feature vector is made up of the eigenvectors. Sort the eigenvectors by the size of their eigenvalues (largest first), keep the ones you want, and write them next to each other - together they form the finished feature vector.
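A sketch of this selection step, using an invented symmetric covariance matrix. `np.linalg.eigh` is used because a covariance matrix is symmetric:

```python
# Build the feature vector: sort eigenvectors by descending eigenvalue
# and keep the top k as columns. The covariance matrix is hypothetical.
import numpy as np

cov_matrix = np.array([
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.2],
    [0.3, 0.2, 1.0],
])

eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)  # eigh: for symmetric matrices
order = np.argsort(eigenvalues)[::-1]                   # largest eigenvalue first

k = 2                                                   # keep 2 principal components
feature_vector = eigenvectors[:, order[:k]]
print(feature_vector.shape)  # (3, 2)
```

Keeping only the eigenvectors with the largest eigenvalues is exactly the dimensionality reduction: 3 features become 2 principal components.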
# 5 Link vector to data set
In the last step you perform a matrix multiplication of the feature vector and the transposed standardized data set (flipped diagonally). With this trick you project the data set onto the new axes.
In a matrix multiplication you multiply each value in a row by the corresponding value in the column, add the products, and work your way through the matrix row by row and column by column.
When a matrix is transposed, you swap its rows and columns: the first row becomes the first column, and so on. Mathematicians often use a superscript T to indicate transposition.
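The projection step as a sketch, with invented standardized data and an invented feature vector (3 features reduced to 2 components):

```python
# Project standardized data onto the principal components:
# transformed = (feature_vector^T @ X_std^T)^T, which equals X_std @ feature_vector.
import numpy as np

X_std = np.array([            # hypothetical standardized data: 3 samples, 3 features
    [-1.2,  0.4,  0.8],
    [ 0.5, -1.0,  0.3],
    [ 0.7,  0.6, -1.1],
])
feature_vector = np.array([   # hypothetical: 3 features -> 2 components
    [0.7,  0.1],
    [0.6, -0.3],
    [0.4,  0.9],
])

transformed = (feature_vector.T @ X_std.T).T
print(transformed.shape)  # (3, 2): 3 samples, now with only 2 components each
```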
Result: you can use the principal components for the simplified analysis of your data sets.
PCA application in data science
Data scientists use PCA for data visualization. If you have 6 features, you would have to create 15 pairwise diagrams (binomial coefficient 6 choose 2).
As a data scientist, you lose track of things. You only want one 2D or 3D diagram that shows clear separations!
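In practice, scikit-learn's PCA class does all five steps for you. A sketch with randomly generated 6-feature data:

```python
# Reduce hypothetical 6-feature data to 2 principal components,
# so one 2D scatter plot replaces 15 pairwise plots.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))   # 100 samples, 6 features (random demo data)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                            # (100, 2)
print(pca.explained_variance_ratio_.sum())   # share of total variance kept
```

The `explained_variance_ratio_` attribute tells you how much of the original variation the two principal components still capture.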
Be careful: PCA itself is no help with classification, because the mathematical method only reduces the complexity of a model. PCA helps you keep an overview.
Tip: Do you already know the Support Vector Machine? A blatant piece of math that is ultra useful in data science!
Source reference images: Icons and SVG graphics in the cover image of Microsoft PowerPoint 2019, freely available according to EULA