Pablo Rodriguez

Reducing the Number of Features (Optional)

Introduction to Principal Component Analysis

Principal Component Analysis (PCA) is an unsupervised learning algorithm commonly used for visualization. If a dataset has many features (say 10, 50, or even thousands), you cannot plot it directly: there is no way to draw 1,000-dimensional data.

PCA takes data with many features (50, 1,000, or more) and reduces it to two or three features, so that you can plot and visualize it.

Primary Use Case

Data scientists commonly use PCA to visualize data and figure out what might be going on in their datasets.

To illustrate PCA, consider a dataset from a collection of passenger cars with many features:

  • Length of the car
  • Width of the car
  • Diameter of the wheel
  • Height of the car
  • Many other car features

The question is: how can you use PCA to reduce the number of features for visualization? Start with a simplified dataset that has just two features:

  • x₁: Length of the car (varies quite a bit)
  • x₂: Width of the car (varies relatively little)

In most countries, because of road width constraints, car width tends not to vary much. Most cars are about 1.8 meters wide (just under six feet). If you plot length vs width, x₁ varies quite a bit while x₂ varies relatively little.

For feature reduction, you could simply take x₁ since x₂ varies little from car to car. PCA will more or less automatically decide to just take x₁.
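The length-versus-width case can be checked numerically. The sketch below is an illustration only: the car measurements are synthetic (made-up means and spreads), and PCA is computed with NumPy's SVD rather than any particular library's PCA routine. It shows that when one feature carries almost all of the variation, the first principal axis lies almost exactly along that feature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic cars (made-up numbers): length varies a lot, width barely at all.
length = rng.normal(loc=4.5, scale=0.5, size=200)   # meters
width = rng.normal(loc=1.8, scale=0.03, size=200)   # meters
X = np.column_stack([length, width])

# PCA via SVD of the mean-centered data.
X_centered = X - X.mean(axis=0)
_, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)

first_axis = Vt[0]  # unit vector giving the direction of the first principal axis
explained = singular_values**2 / np.sum(singular_values**2)

print("first principal axis (x1, x2):", np.round(first_axis, 3))
print("share of variance on that axis:", round(float(explained[0]), 4))
```

Both features are in meters here, so the sketch deliberately skips rescaling. Standardizing each feature to unit variance would erase the difference in spread that this example relies on.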

Now swap the second feature for a different one:

  • x₁: Length of the car (varies quite a bit)
  • x₂: Diameter of the wheel (varies a little bit)

Again, PCA would essentially choose the feature x₁ when applied to this dataset.

Finally, consider a third pair of features:

  • x₁: Length of the car (varies quite a bit)
  • x₂: Height of the car (also varies quite a bit)

This presents a more interesting case where some cars are bigger (longer and taller) and some cars are smaller (not as long and not as tall).

For feature reduction, you don’t want to pick just x₁ and ignore x₂, nor pick just x₂ and ignore x₁, since both have useful information.

Instead of being limited to the x₁ axis or x₂ axis, PCA introduces a new axis called the z-axis. This z-axis:

  • Is not a third dimension sticking out of the diagram
  • Is a combination of x₁ and x₂ lying flat within the plot
  • Corresponds to something about the size of the car

The idea of PCA is to find one or more new axes (such as z) so that when you measure your data’s coordinates on the new axis, you end up with very useful information about the items (cars in this example).

Instead of needing two numbers (the coordinates on the x₁ and x₂ axes for length and height), you now need only one number, the coordinate on the z-axis, to capture roughly the size of the car.
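A minimal sketch of the length-versus-height case, under stated assumptions: the data is synthetic, a hypothetical hidden "size" factor is used to generate both features, and the z-axis is taken to be the first right singular vector of the standardized data. Each car is then described by a single coordinate z instead of the pair (x₁, x₂).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical hidden "size" factor drives both length and height (made-up numbers).
size = rng.normal(size=300)
length = 4.5 + 0.5 * size + rng.normal(scale=0.1, size=300)    # meters
height = 1.6 + 0.2 * size + rng.normal(scale=0.05, size=300)   # meters
X = np.column_stack([length, height])

# Standardize so both features are on a comparable scale.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# The z-axis: first right singular vector of the standardized data.
_, s, Vt = np.linalg.svd(X_std, full_matrices=False)
z_axis = Vt[0]

# One number per car: its coordinate along the z-axis.
z = X_std @ z_axis

explained = s[0]**2 / np.sum(s**2)
print("z-axis direction (x1, x2):", np.round(z_axis, 3))
print("share of variance kept by z:", round(float(explained), 3))
print("first five z values:", np.round(z[:5], 2))
```

The z-axis weights both features rather than picking one, which is the point of the example. The sign of the axis is arbitrary, so a larger car may map to a larger or a smaller z depending on the orientation the SVD returns.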

In practice, PCA is usually used to reduce a very large number of features:

  • Input: 10, 20, 50, even thousands of features
  • Output: Maybe two or three features
  • Purpose: Visualize data in 2D or 3D plots

Consider data about different countries with 50 features:

  • GDP (x₁)
  • Per capita GDP (x₂)
  • Human Development Index (x₃)
  • Life expectancy (x₄)
  • And 46 more features…

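PCA can compress these 50 features down to two features (z₁ and z₂) for visualization. The new features do not come with labels, but you can often find a rough meaning for them by inspecting the plot.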

Possible interpretations:

  • z₁: How big is the country and what is its total GDP (larger countries tend to have higher GDP)
  • z₂: Per person GDP or economic activity per person

Country examples:

  • United States: Large country with high per-person activity → high z₁, high z₂ (up and to the right on the plot)
  • Singapore: Smaller country with high per-person activity → lower z₁, high z₂
  • Other patterns: Large countries with lower per-person activity (high z₁, lower z₂) and smaller countries with lower per-person activity (lower z₁, lower z₂)

This approach lets you take 50-dimensional data (50 features) and reduce it to 2-dimensional data, enabling you to:

  • Plot it on a 2D visualization
  • Understand patterns in your data
  • Discover unexpected relationships
  • Identify outliers or anomalies
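The 50-to-2 reduction described above can be sketched in a few lines. This is an assumption-laden illustration, not a reference implementation: `pca_project` is a hypothetical helper written for this example, the "country" data is random placeholder data generated from two hidden factors, and the projection uses NumPy's SVD on standardized features.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal axes (SVD-based sketch)."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0                      # leave constant features unscaled
    X_std = (X - X.mean(axis=0)) / std       # zero mean, unit variance per feature
    _, s, Vt = np.linalg.svd(X_std, full_matrices=False)
    Z = X_std @ Vt[:n_components].T          # coordinates on the new axes
    explained = (s**2 / np.sum(s**2))[:n_components]
    return Z, explained

rng = np.random.default_rng(2)

# Placeholder data: 100 "countries" x 50 features driven by two hidden factors.
hidden = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 50))
X = hidden @ mixing + 0.1 * rng.normal(size=(100, 50))

Z, explained = pca_project(X, n_components=2)
print("reduced shape:", Z.shape)
print("variance share of z1, z2:", np.round(explained, 3))
```

The two columns of `Z` play the role of z₁ and z₂: plotting `Z[:, 0]` against `Z[:, 1]` in any plotting library gives the 2D picture. On real data the variance share kept by two components is usually lower than on this deliberately low-rank placeholder.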

Data Exploration Value

Whenever you work with a new dataset, visualizing it helps you understand what the data looks like and can reveal whether something unexpected is happening in it.

PCA provides a powerful way to take high-dimensional data and make it understandable through visualization, helping data scientists gain insights that would be hard or impossible to see in the original high-dimensional space.