Automatic Mean Normalization
The fit function in PCA automatically performs mean normalization (subtracts the mean of each feature), so you don’t need to do this separately.
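For example, here is a minimal sketch of this behavior (the small array X_demo and its values are made up purely for illustration; mean_ is the attribute where a fitted scikit-learn PCA object stores the per-feature means it subtracted):

import numpy as np
from sklearn.decomposition import PCA

# Toy data (hypothetical values, just for illustration)
X_demo = np.array([[10.0, 1.0], [12.0, 3.0], [14.0, 5.0]])

pca = PCA(n_components=1)
pca.fit(X_demo)      # fit centers the data internally

# The per-feature means that fit subtracted are stored in mean_
print(pca.mean_)     # -> [12.  3.]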
However, if features take on very different ranges of values (for example, in a country-development dataset, GDP measured in trillions of dollars next to per-capita metrics with much smaller values), perform preprocessing to scale the features to comparable ranges, as sketched below. Feature scaling helps PCA find good choices of axes.
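One way to do this, sketched here with scikit-learn's StandardScaler (any comparable scaler works; the data values are made up for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical data: feature 1 is on a much larger scale than feature 2
X = np.array([[50000.0, 1.2], [62000.0, 3.4], [58000.0, 2.1], [71000.0, 4.0]])

# Scale each feature to zero mean and unit variance before running PCA
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
pca.fit(X_scaled)
print(pca.explained_variance_ratio_)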
Run PCA to "fit" the data and obtain new axes (Z₁, Z₂, Z₃, etc.) using scikit-learn's fit function.
Result: new axes called the principal components.
Examine how much of the variance in your data each principal component explains using explained_variance_ratio_.
This helps determine whether projecting data onto these axes retains most of the variability/information from the original dataset.
Project data onto the new axes using the transform method. Each training example becomes 2-3 numbers that can be plotted for visualization.
# Dataset with 6 examples, each with 2 features (x₁, x₂)
X = np.array([[example1], [example2], [example3],
              [example4], [example5], [example6]])
# Reduce from 2 features (x₁, x₂) to 1 feature (z)
pca_1 = PCA(n_components=1)
pca_1.fit(X)

# Check explained variance ratio
print(pca_1.explained_variance_ratio_)
Interpretation: the single axis captures 99.2% of the variability/information in the original dataset.
# Project data onto the z-axis
X_transformed = pca_1.transform(X)
# Output: array with 6 numbers, one per example
Example: the first training example [1, 1] projected onto the z-axis gives 1.383.
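Under the hood, this projection is a dot product with the learned axis direction. A sketch of the relationship, continuing the 1-component example above and assuming the default whiten=False (components_ and mean_ are standard attributes of a fitted scikit-learn PCA object):

import numpy as np

# The projection of an example onto z is its mean-centered dot product
# with the first principal component direction
z_first = np.dot(X[0] - pca_1.mean_, pca_1.components_[0])
print(z_first)   # matches X_transformed[0, 0]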
# Same data, but keep 2 dimensions
pca_2 = PCA(n_components=2)
pca_2.fit(X)

# Check explained variance for both components
print(pca_2.explained_variance_ratio_)
# Output: [0.992, 0.008]

# Transform to the new coordinate system
X_transformed_2 = pca_2.transform(X)
# Each example now has coordinates on the z₁ and z₂ axes
Since we kept the same number of dimensions (2D→2D), the two ratios sum to 1.0: no variance/information is lost, and the data is simply re-expressed in the new coordinate system (see the sketch below).
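One way to check this, continuing the 2-component example above (inverse_transform is scikit-learn's method for mapping projected data back to the original feature space):

import numpy as np

# Map the projected data back to the original feature space
X_reconstructed = pca_2.inverse_transform(X_transformed_2)

# With 2 components kept for 2 original features, the reconstruction
# matches the original data up to floating-point error
print(np.allclose(X_reconstructed, X))   # -> True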
Putting it all together, the complete workflow:

from sklearn.decomposition import PCA
import numpy as np

# Step 1: Prepare data (scaling if needed)
# X = your_data_array

# Step 2: Create and fit PCA
pca = PCA(n_components=2)   # or 3 for 3D visualization
pca.fit(X)

# Step 3: Analyze explained variance
variance_ratios = pca.explained_variance_ratio_
print(f"Variance explained: {variance_ratios}")
print(f"Total variance retained: {sum(variance_ratios):.3f}")

# Step 4: Transform data
X_transformed = pca.transform(X)

# Step 5: Visualize (for 2D)
import matplotlib.pyplot as plt
plt.scatter(X_transformed[:, 0], X_transformed[:, 1])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()
PCA is frequently used for visualization: reducing each training example to 2-3 numbers so that high-dimensional datasets, such as country development data, can be plotted.
Modern PCA Usage
Today, the most common and valuable use of PCA is visualization - reducing high-dimensional data to 2-3 dimensions so you can plot it, understand patterns, and gain insights into your datasets.
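As an end-to-end sketch of this use (scikit-learn's built-in iris dataset stands in here for a higher-dimensional dataset such as the country development data, which is not included in these notes):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 150 examples x 4 features, standing in for a higher-dimensional dataset
X_iris = load_iris().data

# Scale the features, then reduce 4 dimensions to 2 for plotting
X_scaled = StandardScaler().fit_transform(X_iris)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_.sum())   # fraction of variance retained

plt.scatter(X_2d[:, 0], X_2d[:, 1])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()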
PCA provides an accessible way to understand complex, high-dimensional datasets through dimensionality reduction and visualization, making it a valuable tool for exploratory data analysis.