Reducing the Number of Features (Optional)

Introduction to Principal Component Analysis

Principal Component Analysis (PCA) is an unsupervised learning algorithm commonly used for visualization. If a dataset has a lot of features (say 10, 50, or even thousands), you can't plot it directly: there is no way to draw 1,000-dimensional data.

PCA lets you take data with many features (50, 1,000, even more) and reduce it to two or three features, so that you can plot and visualize it.

Primary Use Case

Data scientists commonly use PCA to visualize their data and figure out what might be going on in their datasets.

Car Dataset Example

To illustrate PCA, consider a dataset from a collection of passenger cars with many features:

Length of the car
Width of the car
Diameter of the wheel
Height of the car
Many other car features

The question is: how can you use PCA to reduce the number of features for visualization?

Simple Feature Reduction Cases

Example 1: Length vs Width

x₁: Length of the car (varies quite a bit)
x₂: Width of the car (varies relatively little)

In most countries, because of road width constraints, car width tends not to vary much. Most cars are about 1.8 meters wide (just under six feet). If you plot length vs width, x₁ varies quite a bit while x₂ varies relatively little.

For feature reduction, you could simply take x₁ since x₂ varies little from car to car. PCA will more or less automatically decide to just take x₁.

Example 2: Length vs Wheel Diameter

x₁: Length of the car (varies quite a bit)
x₂: Diameter of the wheel (varies a little bit)

Again, PCA would essentially choose the feature x₁ when applied to this dataset.
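
The two simple cases above can be checked numerically. The following is a minimal sketch, not a definitive implementation: the measurements are made-up synthetic numbers, and the variable names (`length`, `wheel_diameter`, `first_axis`) are chosen here for illustration. It assumes both features are in the same units and are only mean-centered, not rescaled.

```python
import numpy as np

# Synthetic, made-up measurements in metres (not real car data).
rng = np.random.default_rng(0)
length = rng.normal(loc=4.5, scale=0.5, size=200)            # x1: varies a lot
wheel_diameter = rng.normal(loc=0.65, scale=0.03, size=200)  # x2: varies little
X = np.column_stack([length, wheel_diameter])

# Centre each feature, then find the direction of greatest variance
# as the top eigenvector of the covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigenvalues in ascending order
first_axis = eigenvectors[:, -1]                  # axis with the largest variance

print("variance of x1, x2:", X.var(axis=0))
print("first principal axis:", first_axis)
```

The first principal axis points almost entirely along x₁ (its sign is arbitrary), which is what "PCA essentially chooses x₁" means here. Note that if each feature were first rescaled to unit variance, this difference between x₁ and x₂ would disappear.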

Complex Feature Reduction

Example 3: Length vs Height

x₁: Length of the car (varies quite a bit)
x₂: Height of the car (also varies quite a bit)

This presents a more interesting case where some cars are bigger (longer and taller) and some cars are smaller (not as long and not as tall).

For feature reduction, you don’t want to pick just x₁ and ignore x₂, nor pick just x₂ and ignore x₁, since both have useful information.

The Z-Axis Solution

Instead of being limited to the x₁ axis or x₂ axis, PCA introduces a new axis called the z-axis. This z-axis:

Is not a third dimension sticking out of the diagram
Is a combination of x₁ and x₂ lying flat within the plot
Corresponds roughly to the overall size of the car

PCA’s Core Idea

The idea of PCA is to find one or more new axes (such as z) so that when you measure your data’s coordinates on the new axis, you end up with very useful information about the items (cars in this example).

Instead of needing two numbers per car (its coordinates on the x₁ and x₂ axes, for length and height), you now need only one number, its coordinate on the z-axis, to capture roughly the size of the car.
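
To make the z-axis idea concrete, here is a rough sketch under stated assumptions. The data are synthetic: a hypothetical hidden `size` factor drives both length and height, and all coefficients are invented for illustration. The new axis is computed with a singular value decomposition of the mean-centered data, which is one standard way to obtain principal axes.

```python
import numpy as np

# Synthetic cars: a hidden "size" factor (hypothetical) drives both features.
rng = np.random.default_rng(1)
size = rng.normal(size=300)
length = 4.5 + 0.4 * size + rng.normal(scale=0.05, size=300)  # x1
height = 1.6 + 0.3 * size + rng.normal(scale=0.05, size=300)  # x2
X = np.column_stack([length, height])

# Centre the data; the rows of Vt are the principal axes.
mean = X.mean(axis=0)
X_centered = X - mean
_, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)

z_axis = Vt[0]            # unit vector lying in the (x1, x2) plane
z = X_centered @ z_axis   # one coordinate per car on the new axis

# Approximate each car's (length, height) from its single number z.
X_approx = mean + np.outer(z, z_axis)

# Fraction of the total variance captured by the z-axis alone.
explained = singular_values[0] ** 2 / np.sum(singular_values ** 2)
print("z-axis direction:", z_axis)
print("fraction of variance kept:", explained)
```

Both components of `z_axis` have the same sign, so moving along z means getting longer and taller together (or shorter and lower together), which is why z can be read as "size". The direction's overall sign is arbitrary.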

Practical Applications

Typical Usage Pattern

In practice, PCA is usually used to reduce a very large number of features:

Input: 10, 20, 50, even thousands of features
Output: Maybe two or three features
Purpose: Visualize data in 2D or 3D plots
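
As a sketch of this usage pattern, the helper below takes a data matrix with any number of features and returns two coordinates per example. `pca_reduce` is a hypothetical name introduced here, not a library function, and the input is random placeholder data standing in for a real dataset.

```python
import numpy as np

def pca_reduce(X, n_components=2):
    """Project the rows of X onto the top principal axes (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)          # PCA assumes zero-mean features
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T  # shape: (n_examples, n_components)

# Placeholder data: 500 examples with 50 features each.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 50))

Z = pca_reduce(X, n_components=2)
print(Z.shape)  # (500, 2)
```

The two columns of `Z` can then be used as the horizontal and vertical coordinates of a scatter plot.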

Multi-Country Dataset Example

Consider data about different countries with 50 features:

GDP (x₁)
Per capita GDP (x₂)
Human Development Index (x₃)
Life expectancy (x₄)
And 46 more features…

PCA can compress these 50 features down to two features (z₁ and z₂) for visualization:

Possible interpretations:

z₁: Overall size of the country and its total GDP (larger countries tend to have higher total GDP)
z₂: Per-person GDP, or economic activity per person

Country examples:

United States: Large country with high per-person activity → high z₁, high z₂ (up and to the right on the plot)
Singapore: Smaller country with high per-person activity → lower z₁, high z₂
Other patterns: Large countries with lower per-person activity, smaller countries with lower per-person activity
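
The country example can be sketched along the same lines. Everything in the block below is an assumption made for illustration: the data are synthetic, generated from two hypothetical hidden factors, and no real country statistics are used. Because features such as total GDP and life expectancy are measured in very different units, the sketch standardizes each feature before applying PCA, which is a common precaution rather than something the example above prescribes.

```python
import numpy as np

rng = np.random.default_rng(3)
n_countries, n_features = 120, 50

# Two hypothetical hidden factors per country; real data would be messier.
factors = rng.normal(size=(n_countries, 2))
loadings = rng.normal(size=(2, n_features))
noise = 0.1 * rng.normal(size=(n_countries, n_features))
scales = rng.uniform(1.0, 1e6, size=n_features)  # wildly different units
X = (factors @ loadings + noise) * scales

# Standardise each feature so large-unit features do not dominate.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Project onto the first two principal axes to get z1 and z2.
_, s, Vt = np.linalg.svd(X_std, full_matrices=False)
Z = X_std @ Vt[:2].T
explained = (s[:2] ** 2).sum() / (s ** 2).sum()

print("reduced shape:", Z.shape)
print("fraction of variance kept by z1 and z2:", explained)
```

Each row of `Z` is one country's (z₁, z₂) position on the plot. With real data, interpreting what z₁ and z₂ mean requires looking at which original features contribute most to each axis (the rows of `Vt`).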

Visualization Benefits

This approach lets you take 50-dimensional data (50 features) and reduce it to 2-dimensional data, enabling you to:

Plot it on a 2D visualization
Understand patterns in your data
Discover unexpected relationships
Identify outliers or anomalies

Data Exploration Value

Whenever you work with a new dataset, visualizing it helps you understand what the data looks like and can reveal whether something unexpected is happening.

PCA provides a powerful way to take high-dimensional data and make it understandable through visualization, helping data scientists gain insights that would be very hard to see in the original high-dimensional space.