
PCA Code

If features take on very different ranges of values, perform preprocessing to scale features to comparable ranges.

Example scenarios requiring scaling:

  • GDP in trillions of dollars vs. other features less than 100
  • House size in thousands vs. number of bedrooms (1-5)

Feature scaling helps PCA find a good choice of axes.
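For example, a StandardScaler step can be applied before PCA (a minimal sketch; the feature values below are made up purely for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical data: house size in square feet vs. number of bedrooms
X_raw = np.array([[2100.0, 3], [1600.0, 2], [2500.0, 4], [1800.0, 3]])

# Rescale each feature to zero mean and unit variance so the large
# "size" values don't dominate the choice of principal components
X_scaled = StandardScaler().fit_transform(X_raw)

pca = PCA(n_components=1)
pca.fit(X_scaled)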

Run PCA to “fit” the data and obtain new axes (Z₁, Z₂, Z₃, etc.). Use scikit-learn’s fit function.

Automatic Mean Normalization

The fit function in PCA automatically performs mean normalization (subtracts the mean of each feature), so you don’t need to do this separately.
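For instance, after fitting, the PCA object exposes the per-feature means it subtracted through its mean_ attribute (a small sketch with made-up numbers):

import numpy as np
from sklearn.decomposition import PCA

# Tiny 2-feature dataset, purely for illustration
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

pca_demo = PCA(n_components=1)
pca_demo.fit(X_demo)

# The means PCA subtracted before finding the axes
print(pca_demo.mean_)   # same values as X_demo.mean(axis=0)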

Results: New axes called principal components:

  • Z₁ = first principal component
  • Z₂ = second principal component
  • Z₃ = third principal component

Examine how much of the variance in your data each principal component explains using explained_variance_ratio_.

This helps determine whether projecting data onto these axes retains most of the variability/information from the original dataset.
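One common pattern is to look at the cumulative sum of explained_variance_ratio_ to decide how many components to keep (a sketch using randomly generated stand-in data, since the real dataset isn't shown here):

import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a high-dimensional dataset: 50 examples, 10 features
rng = np.random.default_rng(0)
X_high = rng.normal(size=(50, 10))

pca_full = PCA()   # keeps all components by default
pca_full.fit(X_high)

# Fraction of total variance retained by the first k components, for each k
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative)

# Smallest k that retains at least 95% of the variance
k = int(np.argmax(cumulative >= 0.95)) + 1
print(k)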

Project data onto new axes using the transform method. Each training example becomes 2-3 numbers that can be plotted for visualization.

# Dataset with 6 examples
X = np.array([[example1], [example2], [example3],
              [example4], [example5], [example6]])

# Reduce from 2 features (x₁, x₂) to 1 feature (z)
pca_1 = PCA(n_components=1)
pca_1.fit(X)

# Check explained variance ratio
print(pca_1.explained_variance_ratio_)
# Output: 0.992

Interpretation: a single axis captures 99.2% of the variability/information in the original dataset.

# Project data onto z-axis
X_transformed = pca_1.transform(X)
# Output: array with 6 numbers corresponding to 6 examples

Example: First training example [1,1] projected to z-axis gives 1.383.
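To see what that projection loses, scikit-learn's inverse_transform maps the z values back into the original feature space (a sketch that reuses pca_1 and X_transformed from the snippets above):

# Map the 1-D projections back into the original 2-D feature space
X_reconstructed = pca_1.inverse_transform(X_transformed)

# With 99.2% of the variance retained, the reconstruction of the first
# example should land close to, but not exactly on, the original [1, 1]
print(X_reconstructed[0])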

# Same data, but keep 2 dimensions
pca_2 = PCA(n_components=2)
pca_2.fit(X)
# Check explained variance for both components
print(pca_2.explained_variance_ratio_)
# Output: [0.992, 0.008]

  • Z₁ (first component): explains 99.2% of the variance
  • Z₂ (second component): explains 0.8% of the variance
  • Total: 99.2% + 0.8% = 100% (complete information retained)

# Transform to new coordinate system
X_transformed_2 = pca_2.transform(X)
# Each example now has coordinates on z₁ and z₂ axes

Since we kept the same number of dimensions (2D→2D):

  • No information loss occurs
  • Reconstruction gives back exactly the original data (see the quick check below)
  • First example [1,1] maps to specific coordinates on the z₁ and z₂ axes
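A quick way to check that claim (a sketch that reuses pca_2, X, and X_transformed_2 from above):

# With both components kept, inverse_transform recovers the original data
X_recovered = pca_2.inverse_transform(X_transformed_2)
print(np.allclose(X_recovered, X))   # True, up to floating-point error

Putting all of the steps together, the complete workflow looks like this: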
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

# Step 1: Prepare data (apply feature scaling if needed)
# X = your_data_array

# Step 2: Create and fit PCA
pca = PCA(n_components=2)  # or 3 for 3D visualization
pca.fit(X)

# Step 3: Analyze explained variance
variance_ratios = pca.explained_variance_ratio_
print(f"Variance explained: {variance_ratios}")
print(f"Total variance retained: {sum(variance_ratios):.3f}")

# Step 4: Transform data
X_transformed = pca.transform(X)

# Step 5: Visualize (for 2D)
plt.scatter(X_transformed[:, 0], X_transformed[:, 1])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()

PCA is frequently used for visualization: reducing each example to 2-3 numbers for plotting makes it possible to visualize high-dimensional datasets, such as country development data.

Two older applications of PCA are less common today:

  • Data compression. Past use: reduce, say, 50 features to 10 for storage/transmission savings. Current status: less common, given modern storage capacity and network speeds.
  • Speeding up training. Past use: reduce, say, 1,000 features to 100 to speed up algorithms like SVMs. Current status: modern algorithms (especially deep learning) often work better with the original high-dimensional data, so the recommendation is to feed it directly into neural networks rather than using PCA as a preprocessing step.

Modern PCA Usage

Today, the most common and valuable use of PCA is visualization - reducing high-dimensional data to 2-3 dimensions so you can plot it, understand patterns, and gain insights into your datasets.
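As a concrete sketch of that visualization workflow, here is the same recipe applied to scikit-learn's built-in iris dataset (standing in for the country-development data, which isn't available here):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Four measurements per flower, reduced to two dimensions for plotting
X_iris = load_iris().data
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X_iris))

plt.scatter(X_2d[:, 0], X_2d[:, 1])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()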

Key takeaways:

  1. Scikit-learn handles mean normalization automatically
  2. Feature scaling may be needed for different value ranges
  3. Explained variance ratio helps assess information retention
  4. Most practical value is in visualization applications
  5. Optional labs provide hands-on experience with parameter variations

PCA provides an accessible way to understand complex, high-dimensional datasets through dimensionality reduction and visualization, making it a valuable tool for exploratory data analysis.