
PCA Code

If features take on very different ranges of values, perform preprocessing to scale features to comparable ranges.

Example scenarios requiring scaling:

  • GDP in trillions of dollars vs. other features less than 100
  • House size in thousands vs. number of bedrooms (1-5)

Feature scaling helps PCA find a good choice of axes.
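For example, a StandardScaler step can be applied before PCA (a minimal sketch; the feature values below are made up purely for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical data: house size in square feet vs. number of bedrooms
X_raw = np.array([[2100.0, 3], [1600.0, 2], [2500.0, 4], [1800.0, 3]])

# Rescale each feature to zero mean and unit variance so the large
# "size" values don't dominate the choice of principal components
X_scaled = StandardScaler().fit_transform(X_raw)

pca = PCA(n_components=1)
pca.fit(X_scaled)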

Run PCA to “fit” the data and obtain new axes (Z₁, Z₂, Z₃, etc.). Use scikit-learn’s fit function.

Automatic Mean Normalization

The fit function in PCA automatically performs mean normalization (subtracts the mean of each feature), so you don’t need to do this separately.
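For instance, after fitting, the PCA object exposes the per-feature means it subtracted through its mean_ attribute (a small sketch with made-up numbers):

import numpy as np
from sklearn.decomposition import PCA

# Tiny 2-feature dataset, purely for illustration
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

pca_demo = PCA(n_components=1)
pca_demo.fit(X_demo)

# The means PCA subtracted before finding the axes
print(pca_demo.mean_)   # same values as X_demo.mean(axis=0)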

Results: New axes called principal components:

  • Z₁ = first principal component
  • Z₂ = second principal component
  • Z₃ = third principal component

Examine how much of the variance in your data each principal component explains using explained_variance_ratio_.

This helps determine whether projecting data onto these axes retains most of the variability/information from the original dataset.
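One common pattern is to look at the cumulative sum of explained_variance_ratio_ to decide how many components to keep (a sketch using randomly generated stand-in data, since the real dataset isn't shown here):

import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a high-dimensional dataset: 50 examples, 10 features
rng = np.random.default_rng(0)
X_high = rng.normal(size=(50, 10))

pca_full = PCA()   # keeps all components by default
pca_full.fit(X_high)

# Fraction of total variance retained by the first k components, for each k
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative)

# Smallest k that retains at least 95% of the variance
k = int(np.argmax(cumulative >= 0.95)) + 1
print(k)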

Project data onto new axes using the transform method. Each training example becomes 2-3 numbers that can be plotted for visualization.

# Dataset with 6 examples
X = np.array([[example1], [example2], [example3],
              [example4], [example5], [example6]])

# Reduce from 2 features (x₁, x₂) to 1 feature (z)
pca_1 = PCA(n_components=1)
pca_1.fit(X)

# Check explained variance ratio
print(pca_1.explained_variance_ratio_)
# Output: 0.992

Interpretation: a single axis captures 99.2% of the variability/information in the original dataset.

# Project data onto z-axis
X_transformed = pca_1.transform(X)
# Output: array with 6 numbers corresponding to 6 examples

Example: First training example [1,1] projected to z-axis gives 1.383.
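To see what that projection loses, scikit-learn's inverse_transform maps the z values back into the original feature space (a sketch that reuses pca_1 and X_transformed from the snippets above):

# Map the 1-D projections back into the original 2-D feature space
X_reconstructed = pca_1.inverse_transform(X_transformed)

# With 99.2% of the variance retained, the reconstruction of the first
# example should land close to, but not exactly on, the original [1, 1]
print(X_reconstructed[0])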

# Same data, but keep 2 dimensions
pca_2 = PCA(n_components=2)
pca_2.fit(X)
# Check explained variance for both components
print(pca_2.explained_variance_ratio_)
# Output: [0.992, 0.008]

  • Z₁ (first component): explains 99.2% of the variance
  • Z₂ (second component): explains 0.8% of the variance
  • Total: 99.2% + 0.8% = 100% (complete information retained)

# Transform to new coordinate system
X_transformed_2 = pca_2.transform(X)
# Each example now has coordinates on z₁ and z₂ axes

Since we kept the same number of dimensions (2D→2D):

  • No information loss occurs
  • Reconstruction gives back exactly the original data (see the quick check below)
  • First example [1,1] maps to specific coordinates on the z₁ and z₂ axes
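A quick way to check that claim (a sketch that reuses pca_2, X, and X_transformed_2 from above):

# With both components kept, inverse_transform recovers the original data
X_recovered = pca_2.inverse_transform(X_transformed_2)
print(np.allclose(X_recovered, X))   # True, up to floating-point error

Putting all of the steps together, the complete workflow looks like this: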
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

# Step 1: Prepare data (apply feature scaling if needed)
# X = your_data_array

# Step 2: Create and fit PCA
pca = PCA(n_components=2)  # or 3 for 3D visualization
pca.fit(X)

# Step 3: Analyze explained variance
variance_ratios = pca.explained_variance_ratio_
print(f"Variance explained: {variance_ratios}")
print(f"Total variance retained: {sum(variance_ratios):.3f}")

# Step 4: Transform data
X_transformed = pca.transform(X)

# Step 5: Visualize (for 2D)
plt.scatter(X_transformed[:, 0], X_transformed[:, 1])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()

PCA is frequently used for visualization: reducing each example to 2-3 numbers for plotting makes it possible to visualize high-dimensional datasets, such as country development data.

Two older applications of PCA are less common today:

  • Data compression. Past use: reduce, say, 50 features to 10 for storage/transmission savings. Current status: less common, given modern storage capacity and network speeds.
  • Speeding up training. Past use: reduce, say, 1,000 features to 100 to speed up algorithms like SVMs. Current status: modern algorithms (especially deep learning) often work better with the original high-dimensional data, so the recommendation is to feed it directly into neural networks rather than using PCA as a preprocessing step.

Modern PCA Usage

Today, the most common and valuable use of PCA is visualization - reducing high-dimensional data to 2-3 dimensions so you can plot it, understand patterns, and gain insights into your datasets.
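As a concrete sketch of that visualization workflow, here is the same recipe applied to scikit-learn's built-in iris dataset (standing in for the country-development data, which isn't available here):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Four measurements per flower, reduced to two dimensions for plotting
X_iris = load_iris().data
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X_iris))

plt.scatter(X_2d[:, 0], X_2d[:, 1])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()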

Key takeaways:

  1. Scikit-learn handles mean normalization automatically
  2. Feature scaling may be needed for different value ranges
  3. Explained variance ratio helps assess information retention
  4. Most practical value is in visualization applications
  5. Optional labs provide hands-on experience with parameter variations

PCA provides an accessible way to understand complex, high-dimensional datasets through dimensionality reduction and visualization, making it a valuable tool for exploratory data analysis.