Pablo Rodriguez

PCA Programming

This programming exercise demonstrates PCA applications through multiple examples, from simple 2D cases to complex high-dimensional datasets. The lab shows how PCA reveals hidden patterns in data.

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from pca_utils import plot_widget
from bokeh.io import show, output_notebook
from bokeh.plotting import figure
import matplotlib.pyplot as plt
import plotly.offline as py
# Andrew's lecture example
X = np.array([[ 99, -1],
              [ 98, -1],
              [ 97, -2],
              [101,  1],
              [102,  1],
              [103,  2]])
# Visualize original data
plt.plot(X[:,0], X[:,1], 'ro')
# Fit PCA with 2 components
pca_2 = PCA(n_components=2)
pca_2.fit(X)
# Check explained variance
print(pca_2.explained_variance_ratio_)
# Output: [0.9924, 0.0076]

Interpretation: The first principal component retains 99.24% of the variance; the second adds only the remaining 0.76%.
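
These ratios come from the variance captured along each fitted direction. A quick way to see the pieces behind them, using the pca_2 object fitted above:

# Inspect the fitted directions and per-component variances
print(pca_2.components_)          # rows are unit-length principal directions
print(pca_2.explained_variance_)  # variance of the data along each direction
print(pca_2.mean_)                # data mean removed before projection
# Since both components are kept here, dividing each explained_variance_
# entry by their sum reproduces explained_variance_ratio_ (~[0.9924, 0.0076])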

# Transform to principal component coordinates
X_trans_2 = pca_2.transform(X)
print(X_trans_2)
# Column 1: coordinates along first principal component
# Column 2: coordinates along second principal component
# Reduce to single dimension
pca_1 = PCA(n_components=1)
pca_1.fit(X)
print(pca_1.explained_variance_ratio_)
# Output: [0.9924]
X_trans_1 = pca_1.transform(X)
print(X_trans_1)
# Single column - coordinates along first principal component only
# Reconstruct from 2 components (no information loss)
X_reduced_2 = pca_2.inverse_transform(X_trans_2)
plt.plot(X_reduced_2[:,0], X_reduced_2[:,1], 'ro')
# Result: Identical to original data
# Reconstruct from 1 component (with approximation)
X_reduced_1 = pca_1.inverse_transform(X_trans_1)
plt.plot(X_reduced_1[:,0], X_reduced_1[:,1], 'ro')
# Result: Points lie on a single line (the principal component)

Reconstruction Insight

With 1 component, the reconstructed points are confined to a single line: each point's position is determined entirely by its one coordinate along the first principal component axis.
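
The reconstruction can also be written out by hand: each reconstructed point is the data mean plus the retained coordinate times the first principal direction. A minimal sketch, equivalent to inverse_transform for this default (non-whitened) PCA:

# Manual 1-component reconstruction: mean + coordinate * first direction
X_manual = pca_1.mean_ + X_trans_1 @ pca_1.components_
print(np.allclose(X_manual, X_reduced_1))  # expected: True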

# Create 10-point dataset for visualization
X = np.array([[-0.83934975, -0.21160323],
              [ 0.67508491,  0.25113527],
              [-0.05495253,  0.36339613],
              # ... additional points
              [ 0.02775847, -0.77709049]])
# Create interactive plot
p = figure(title='10-point scatterplot',
           x_axis_label='x-axis',
           y_axis_label='y-axis')
p.scatter(X[:,0], X[:,1], marker='o', color='#C00000', size=5)
show(p)
# Launch interactive PCA visualization
plot_widget()

Widget Functionality:

  • Slider rotates projection line through center
  • Shows how different projections affect point separation
  • Demonstrates that PCA line maximizes point spread
  • Some projections squeeze points together, others keep them separated
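
The widget's claim can be checked numerically by projecting the points onto directions at different angles and comparing the variance of the projected coordinates. A rough sketch using the X array above:

# Variance of the projected coordinates as the projection direction rotates
angles = np.linspace(0, np.pi, 180)
Xc = X - X.mean(axis=0)                      # center the points first
variances = [np.var(Xc @ np.array([np.cos(a), np.sin(a)])) for a in angles]
best = angles[int(np.argmax(variances))]
print(np.array([np.cos(best), np.sin(best)]))   # direction of maximum spread
print(PCA(n_components=1).fit(X).components_)   # matches up to sign (and the
                                                # resolution of the angle grid)
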
from pca_utils import random_point_circle, plot_3d_2d_graphs
# Generate 3D data lying on 2D surface
X = random_point_circle(n=150)
# Create 3D and 2D comparison plots
deb = plot_3d_2d_graphs(X)

Concept Demonstration: Shows how 3D data that actually lies on a 2D surface can be effectively represented in 2D using PCA.
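
The same idea can be sanity-checked without the helper functions: generate points that lie on a circle embedded in 3D and confirm that two principal components capture essentially all of the variance. A sketch with synthetic data (not the pca_utils generator):

# Synthetic stand-in: 3D points that actually lie on a 2D circle
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 150)
circle_2d = np.column_stack([np.cos(t), np.sin(t)])
basis = np.array([[1.0, 0.5, 0.2],     # arbitrary embedding of the 2D plane
                  [0.1, 1.0, 0.7]])    # into 3D space
X3 = circle_2d @ basis
pca_check = PCA(n_components=2).fit(X3)
print(sum(pca_check.explained_variance_ratio_))  # ~1.0: two components suffice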

Exercise 5: High-Dimensional Pattern Discovery

# Load dataset: 500 samples, 1000 features
df = pd.read_csv("toy_dataset.csv")
print(df.head())
print(f"Dataset shape: {df.shape}") # (500, 1000)
def get_pairs(n=100):
    from random import randint
    tuples = []
    i = 0
    while i < n:  # draw n distinct feature pairs (was hard-coded to 100)
        x = df.columns[randint(0, 999)]
        y = df.columns[randint(0, 999)]
        while x == y or (x, y) in tuples or (y, x) in tuples:
            y = df.columns[randint(0, 999)]
        tuples.append((x, y))
        i += 1
    return tuples
# Create 100 random pairwise scatter plots
pairs = get_pairs()
fig, axs = plt.subplots(10, 10, figsize=(35,35))
# ... plotting code

Result: No clear patterns visible in pairwise feature combinations.
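
The plotting code is omitted above; one plausible way to fill the 10x10 grid, sketched for illustration rather than copied from the lab:

# Fill the grid with one scatter plot per random feature pair
for ax, (x_col, y_col) in zip(axs.ravel(), pairs):
    ax.scatter(df[x_col], df[y_col], s=4, color="#C00000")
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
plt.tight_layout()
plt.show()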

# Compute correlation matrix
corr = df.corr()
# Find high correlations (>0.5, excluding self-correlation)
mask = (abs(corr) > 0.5) & (abs(corr) != 1)
high_corr = corr.where(mask).stack().sort_values()
print(f"Max correlation: {high_corr.max():.3f}")
print(f"Min correlation: {high_corr.min():.3f}")

Result: The maximum absolute correlation is only around 0.63, so no pair of features shows a strong linear relationship.
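
To quantify how rare even moderate correlations are, the qualifying pairs can be counted directly; a small sketch on top of the mask computed above (the exact count depends on the dataset):

# Count feature pairs with |correlation| > 0.5
# (the symmetric matrix counts each pair twice, hence the division by 2)
n_high = int(mask.to_numpy().sum() / 2)
print(f"Pairs with |corr| > 0.5: {n_high} out of {1000 * 999 // 2}")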

# Apply PCA to reduce 1000D to 2D
pca = PCA(n_components=2)
X_pca = pca.fit_transform(df)
df_pca = pd.DataFrame(X_pca, columns=['principal_component_1',
                                      'principal_component_2'])
# Visualize results
plt.scatter(df_pca['principal_component_1'],
            df_pca['principal_component_2'],
            color="#C00000")
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Decomposition')
# Check variance preservation
variance_explained = sum(pca.explained_variance_ratio_)
print(f"Variance preserved: {variance_explained:.3f}") # ~0.146 (14.6%)

Remarkable Discovery

Despite preserving only 14.6% of variance, PCA reveals well-defined clusters that were completely invisible in the original high-dimensional space!
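
The visual impression can be confirmed algorithmically, for example by running k-means on the 2D projection; a sketch, where the choice of 10 clusters anticipates the structure described in the 3D view below:

# Cluster the 2D projection to confirm the visible groups
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X_pca)
plt.scatter(df_pca['principal_component_1'],
            df_pca['principal_component_2'],
            c=kmeans.labels_, cmap='tab10', s=10)
plt.title('K-means labels on the 2D PCA projection')
plt.show()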

# Extend to 3D for more information
pca_3 = PCA(n_components=3).fit(df)
X_t = pca_3.transform(df)
df_pca_3 = pd.DataFrame(X_t, columns=['principal_component_1',
                                      'principal_component_2',
                                      'principal_component_3'])
# 3D visualization
import plotly.express as px
fig = px.scatter_3d(df_pca_3,
                    x='principal_component_1',
                    y='principal_component_2',
                    z='principal_component_3')
fig.show()
# Check improved variance preservation
variance_3d = sum(pca_3.explained_variance_ratio_)
print(f"3D variance preserved: {variance_3d:.3f}") # ~0.19 (19%)

Enhanced Results:

  • 19% variance preservation
  • 10 clearly visible clusters
  • Much better separation and a clearer picture of the structure

Analysis Comparison:

  1. Pairwise analysis: failed to reveal any patterns
  2. Correlation analysis: showed at most moderate pairwise relationships
  3. PCA analysis: revealed clear cluster structure while preserving only a small fraction of the variance

Key Takeaways:

  • High-dimensional data often contains hidden low-dimensional structure
  • PCA can reveal patterns invisible to other analysis methods
  • Small amounts of preserved variance can contain critical information
  • Visualization is crucial for understanding complex datasets

Practical Tips:

  • Always try multiple dimensionalities (2D and 3D)
  • Check explained variance ratios to understand information retention (see the sketch below)
  • Use interactive visualizations when possible
  • Compare PCA results with other exploratory techniques
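
A practical way to follow the tip about checking explained variance ratios is to plot how the cumulative explained variance grows as components are added; a sketch on the toy dataset loaded above (the cut-off at 10 components is an arbitrary illustrative choice):

# Cumulative explained variance for the first 10 principal components
pca_full = PCA(n_components=10).fit(df)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(range(1, 11), cumulative, marker='o', color="#C00000")
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance ratio')
plt.title('Explained variance vs. number of components')
plt.show()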

The lab showcases PCA’s remarkable ability to uncover hidden patterns in high-dimensional data, making it an essential tool for data scientists working with complex datasets.