How does principal component analysis work?

Principal Component Analysis | Explanation with Python example

No idea about Principal Component Analysis?

Then you need this tutorial!

Here's the visual explanation of how PCA works!


What is the PCA?

The PCA, Principal Component Analysis, simplifies, structures and visualizes statistical data sets. The PCA is a method of multivariate statistics, which analyzes several statistical variables at the same time.

The PCA uses linear combinations (e.g. 18 = 6 * 2 + 3 * 2) to combine several statistical features into the main components.
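As a rough sketch (the feature values and weights here are made up for illustration, a real PCA calculates the weights itself), a main component is exactly such a weighted sum of the original features:

# hypothetical car with made-up measurements
length, width, height = 4.5, 1.8, 1.5  # metres

# made-up weights - in a real PCA these come out of the calculation
size_component = 0.6 * length + 0.3 * width + 0.1 * height

print(size_component)  # one single "size" value instead of three separate features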

Example of application of PCA

Multivariate statistics analyze statistical models with many different variables (features). I can best explain how the PCA works using the example of “motor vehicle” (everything that rolls on the road).

Various continuous characteristics describe a motor vehicle, which the PCA groups together:

  • Length
  • Width
  • Height
  • Seats (only in whole numbers)
  • Weight
  • Engine power (measured in kilowatts)
  • Displacement
  • Load weight

The PCA analyzes the statistical characteristics and looks for similar variables that are (directly) related. For example, the PCA combines length, width and height into "size" and engine power, displacement and load weight into "power".

(Figure: example vehicles on German roads.)

Instead of looking at all the variables, you can compare the main group values (the main components) with one another.

Application in reality - what do I need the PCA for?

Who needs the PCA? What do the main components achieve? These professions love the PCA ...

  • Bankers use the PCA for risk management of their interest rate derivatives. A derivative hedges a risk and runs on an underlying asset (in this case the interest rate).
  • Neurologists use the PCA to examine neurons in the brain and their responsiveness.
  • Economists examine an economy with the PCA, because economists often have to work with a large number of data sets and characteristics.
  • Chemists use it to analyze samples of substances.

How does a principal component analysis work?

Aim of the Principal Component Analysis

Before you dive into the mathematics, you should know the purpose and reason for the calculations. With the PCA you want to reduce the number of variables in a data set.

You form generalized groups of characteristics and separate the data into the main components. With the main components you want to be able to better explain and predict the variation.

Prerequisite for Principal Component Analysis

The PCA needs interval-scaled variables (not ordinal or nominal scales). Examples are:

  • Time
  • Weight
  • Speed
  • Height

The data for a successful PCA should not contain large error variances. You need measurements that are as exact as possible and free of errors.

Graphical solution of the Principal Component Analysis

The statistics program gives you a vector (the first main component) along the longest diameter of the point cloud. The second main component is orthogonal to the first axis. The point cloud then looks a bit like a squashed kidney bean.

5 steps for the Principal Component Analysis

First you get an overview with a scatter diagram (2D and 3D). In data science / statistics, you should visualize what your data set looks like.

You can more easily spot mistakes and patterns and check if your results make sense!
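A quick sketch of such an overview plot with matplotlib (the random example data is made up, your own data set goes in its place):

import numpy as np
import matplotlib.pyplot as plt

# made-up data set: 200 vehicles with two of the features, length and width
length = np.random.normal(4.5, 0.5, 200)
width = np.random.normal(1.8, 0.1, 200)

plt.scatter(length, width)  # 2D scatter diagram for a first overview
plt.xlabel("length")
plt.ylabel("width")
plt.show()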

We need the following steps to get from 40 features to a few main components:

  1. Standardization of the scales
  2. Calculation of the covariance matrix
  3. Calculation of the eigenvalues and eigenvectors
  4. Determination of the feature vector
  5. Linking the feature vector to the data set

# 1 Standardize the scales

The example analyzes nations - this results in the following challenge: the individual features are measured on completely different scales and cannot be compared directly.

The standardization maps the values to a comparable range around -1 and 1 so that you can compare them better. You can standardize the values with the following formula: z = (x - μ) / σ, where μ is the mean and σ the standard deviation of the feature.

If you want to implement the formula programmatically, I recommend the StandardScaler from scikit-learn. The Python library offers a variety of functions that you can use in your statistics / artificial intelligence projects.
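A minimal sketch of how this could look with scikit-learn (the small example matrix is made up):

import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up example data: 3 samples with 2 features on very different scales
X = np.array([[170.0, 70000.0],
              [180.0, 80000.0],
              [160.0, 60000.0]])

scaler = StandardScaler()  # subtracts the mean and divides by the standard deviation
X_std = scaler.fit_transform(X)  # every column now has mean 0 and standard deviation 1
print(X_std)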

# 2 Compute the covariance matrix

In the next step, the computer calculates the covariances of all feature pairs (e.g. x, y and z taken in pairs of two).

The covariance measures how strongly two features vary together. The result is either positive (both features increase together), negative (one increases while the other decreases) or close to zero (no linear relationship).

In Python you can use the covariance function from the NumPy package. NumPy helps you to manage and calculate large tables and arrays efficiently.
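A short sketch with numpy.cov (note that numpy.cov expects one feature per row by default, so rowvar=False is needed for a samples-by-features matrix; the example values are made up):

import numpy as np

# X_std: standardized data, one row per sample, one column per feature (made up)
X_std = np.array([[ 1.0, -0.5],
                  [-1.2,  0.7],
                  [ 0.2, -0.2]])

cov_matrix = np.cov(X_std, rowvar=False)  # covariances of all feature pairs
print(cov_matrix)  # symmetric matrix, the variances sit on the diagonal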

# 3 Calculate the eigenvalues and eigenvectors

When you have calculated the covariance matrix, you calculate its eigenvalues. To do this, you subtract the identity matrix multiplied by a variable λ from the covariance matrix and determine the values of λ for which the determinant of the result becomes zero: det(C − λ·I) = 0.

A capital letter in a formula means that it is a matrix.

An identity matrix is an n × n matrix that is filled with zeros and contains only 1s on the diagonal.

You can determine the variable λ with a little algebra ...
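A small worked example (the 2 × 2 matrix is made up so that the algebra stays simple): for C = [[2, 1], [1, 2]] you get det(C − λ·I) = (2 − λ)² − 1 = 0, which gives λ = 3 and λ = 1. You can check this with NumPy:

import numpy as np

C = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # made-up covariance matrix

eigenvalues, eigenvectors = np.linalg.eig(C)
print(eigenvalues)  # 3 and 1 (the order may vary)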

You can also use the numpy package to calculate the eigenvalues and eigenvectors on the computer. Take a look at the example from the University of Münster:

import numpy
import numpy.linalg as linalg

# random 100 x 100 example matrix
A = numpy.random.rand(100, 100)

# ew contains the eigenvalues, ev the eigenvectors (one per column)
ew, ev = linalg.eig(A)

# 4 Find the feature vector

The feature vector is the matrix that consists of the eigenvectors. You sort the eigenvectors by the size of their eigenvalues and keep the ones with the largest eigenvalues. Write these eigenvectors next to each other as columns and you have the finished feature vector.
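A small sketch of how you could build the feature vector in NumPy (the covariance matrix is made up, and keeping 2 components is an assumption for illustration):

import numpy as np

# made-up covariance matrix of 3 features
cov_matrix = np.array([[1.0, 0.8, 0.2],
                       [0.8, 1.0, 0.1],
                       [0.2, 0.1, 1.0]])

ew, ev = np.linalg.eig(cov_matrix)  # eigenvalues and eigenvectors (one per column)
order = np.argsort(ew)[::-1]  # sort indices by eigenvalue, largest first
feature_vector = ev[:, order[:2]]  # keep the 2 eigenvectors with the largest eigenvalues
print(feature_vector.shape)  # (3, 2): 3 original features, 2 main components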

# 5 Link the vector to the data set

In the last step you perform a matrix multiplication of the transposed feature vector and the transposed standardized data set (transposed means flipped along the diagonal). With this trick you project the data set onto the new axes, the main components.

With a matrix multiplication you multiply each value in a row of the first matrix by the corresponding value in a column of the second matrix and add up the products ... and you work your way through the matrices row by row and column by column.

When a matrix is transposed, you swap the columns with the rows. Instead of writing out the values from left to right, you fill the matrix from top to bottom. Mathematicians often use a superscript T (e.g. Aᵀ) to indicate transposition.
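A minimal sketch of this last step in NumPy (the standardized data and the feature vector are made-up placeholders):

import numpy as np

# made-up standardized data: 4 samples, 3 features
X_std = np.array([[ 0.5, -1.0,  0.3],
                  [-0.2,  0.8, -0.5],
                  [ 1.1,  0.1,  0.7],
                  [-1.4,  0.1, -0.5]])

# made-up feature vector: 2 kept eigenvectors as columns
feature_vector = np.array([[0.7,  0.1],
                           [0.5, -0.6],
                           [0.5,  0.8]])

# matrix multiplication of the transposed feature vector and the transposed data ...
projected = feature_vector.T @ X_std.T  # shape (2, 4): 2 main components, 4 samples
# ... which gives the same numbers as projecting each sample onto the kept eigenvectors:
print(projected.T)
print(X_std @ feature_vector)  # shape (4, 2)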

Result: You can use the main components for the simplified analysis of your data sets.

PCA application in data science

Data scientists use the PCA for data visualization. If you have 6 characteristics, you have to create 15 diagrams to look at every pair of features (binomial coefficient 6 over 2 = 15).

As a data scientist, you lose track of things. You only want one 2D or 3D diagram that shows clear separations!
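A compact sketch with the ready-made PCA class from scikit-learn (the random example data is made up, your real data set goes in its place):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# made-up data set: 100 samples with 6 features
X = np.random.rand(100, 6)

X_std = StandardScaler().fit_transform(X)  # step 1: standardize the scales
pca = PCA(n_components=2)  # keep only 2 main components
X_2d = pca.fit_transform(X_std)  # steps 2 to 5 happen internally

print(X_2d.shape)  # (100, 2): ready for a single 2D scatter diagram
print(pca.explained_variance_ratio_)  # how much variation each main component explains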

Be careful: The PCA is of no help to you with classification itself, because the mathematical method only reduces the complexity of a model. The PCA helps you keep track of things.

Tip: Do you already know the Support Vector Machine? Oh, a blatant math sh *** that is ultra useful in data science!



Source reference images: Icons and SVG graphics in the cover image of Microsoft PowerPoint 2019, freely available according to EULA