This programming exercise demonstrates PCA applications through multiple examples, from simple 2D cases to complex high-dimensional datasets. The lab shows how PCA reveals hidden patterns in data.
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from pca_utils import plot_widget
from bokeh.io import show, output_notebook
from bokeh.plotting import figure
import matplotlib.pyplot as plt
import plotly.offline as py

# Andrew's lecture example
X = np.array([[ 99, -1],
              [ 98, -1],
              [ 97, -2],
              [101,  1],
              [102,  1],
              [103,  2]])

# Visualize original data
plt.plot(X[:,0], X[:,1], 'ro')

# Fit PCA with 2 components
pca_2 = PCA(n_components=2)
pca_2.fit(X)

# Check explained variance
print(pca_2.explained_variance_ratio_)
# Output: [0.9924, 0.0076]
Interpretation: the first principal component retains 99.24% of the variance; the second component adds only the remaining 0.76%.
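To see where these numbers come from, here is a minimal sketch (using the same X as above): the ratios are the eigenvalues of the sample covariance matrix, normalized to sum to 1.

# Sketch: explained_variance_ratio_ equals the covariance eigenvalues normalized to sum to 1
cov = np.cov(X, rowvar=False)            # 2x2 sample covariance matrix
eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigenvalues, largest first
print(eigvals / eigvals.sum())           # approximately [0.9924, 0.0076]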
# Transform to principal component coordinates
X_trans_2 = pca_2.transform(X)
print(X_trans_2)
# Column 1: coordinates along first principal component
# Column 2: coordinates along second principal component

# Reduce to single dimension
pca_1 = PCA(n_components=1)
pca_1.fit(X)

print(pca_1.explained_variance_ratio_)
# Output: [0.9924]

X_trans_1 = pca_1.transform(X)
print(X_trans_1)
# Single column - coordinates along first principal component only

# Reconstruct from 2 components (no information loss)
X_reduced_2 = pca_2.inverse_transform(X_trans_2)
plt.plot(X_reduced_2[:,0], X_reduced_2[:,1], 'ro')
# Result: Identical to original data

# Reconstruct from 1 component (with approximation)
X_reduced_1 = pca_1.inverse_transform(X_trans_1)
plt.plot(X_reduced_1[:,0], X_reduced_1[:,1], 'ro')
# Result: Points lie on a single line (the principal component)
Reconstruction Insight
With 1 component, data points are now confined to a single line. Each point’s position is determined by its single coordinate along the principal component axis.
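As a sanity check, the 1-component reconstruction can be reproduced by hand; a minimal sketch, assuming the default (non-whitened) PCA, where inverse_transform maps each coordinate back along the component direction and adds the mean back:

# Sketch: manual reconstruction, equivalent to pca_1.inverse_transform (no whitening assumed)
X_manual = X_trans_1 @ pca_1.components_ + pca_1.mean_
print(np.allclose(X_manual, X_reduced_1))  # expected: True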
# Create 10-point dataset for visualization
X = np.array([[-0.83934975, -0.21160323],
              [ 0.67508491,  0.25113527],
              [-0.05495253,  0.36339613],
              # ... additional points
              [ 0.02775847, -0.77709049]])

# Create interactive plot
p = figure(title='10-point scatterplot', x_axis_label='x-axis', y_axis_label='y-axis')
p.scatter(X[:,0], X[:,1], marker='o', color='#C00000', size=5)
show(p)

# Launch interactive PCA visualization
plot_widget()
Widget Functionality: the interactive plot lets you explore how the 10 points project onto different candidate directions, illustrating which direction preserves the most variance. A standalone sketch of the same idea follows.
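The sketch below does not use pca_utils; theta is a hypothetical value standing in for the widget's slider. It projects the 10 points onto a unit direction and measures the fraction of the total variance that direction captures.

# Sketch of the widget's idea: project onto a direction at angle theta (hypothetical value)
theta = np.deg2rad(30)
u = np.array([np.cos(theta), np.sin(theta)])  # unit vector for the chosen direction
Xc = X - X.mean(axis=0)                       # center the 10 points
coords = Xc @ u                               # 1-D coordinates along the direction
print(coords.var() / Xc.var(axis=0).sum())    # fraction of total variance captured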
from pca_utils import random_point_circle, plot_3d_2d_graphs
# Generate 3D data lying on a 2D surface
X = random_point_circle(n=150)

# Create 3D and 2D comparison plots
deb = plot_3d_2d_graphs(X)
Concept Demonstration: Shows how 3D data that actually lies on a 2D surface can be effectively represented in 2D using PCA.
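A quick numerical check of this claim, as a sketch (assuming random_point_circle returns an (n, 3) array): fit a 2-component PCA and verify that it captures essentially all of the variance.

# Sketch: if the 3D points truly lie on a 2D surface, two components capture nearly all variance
pca_check = PCA(n_components=2).fit(X)
print(pca_check.explained_variance_ratio_.sum())  # expected to be close to 1.0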
# Load dataset: 500 samples, 1000 features
df = pd.read_csv("toy_dataset.csv")
print(df.head())
print(f"Dataset shape: {df.shape}")  # (500, 1000)
# Return n distinct random feature pairs (no self-pairs, no duplicates)
def get_pairs(n=100):
    from random import randint
    tuples = []
    i = 0
    while i < n:
        x = df.columns[randint(0, 999)]
        y = df.columns[randint(0, 999)]
        while x == y or (x, y) in tuples or (y, x) in tuples:
            y = df.columns[randint(0, 999)]
        tuples.append((x, y))
        i += 1
    return tuples
# Create 100 random pairwise scatter plots
pairs = get_pairs()
fig, axs = plt.subplots(10, 10, figsize=(35, 35))
# ... plotting code
Result: No clear patterns visible in pairwise feature combinations.
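For reference, the elided plotting loop might look roughly like this (a sketch, not the lab's exact code):

# Sketch: one scatter plot per random feature pair on the 10x10 grid
for ax, (x_col, y_col) in zip(axs.ravel(), pairs):
    ax.scatter(df[x_col], df[y_col], s=5, color="#C00000")
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)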
# Compute correlation matrix
corr = df.corr()

# Find high correlations (|corr| > 0.5, excluding self-correlation)
mask = (abs(corr) > 0.5) & (abs(corr) != 1)
high_corr = corr.where(mask).stack().sort_values()
print(f"Max correlation: {high_corr.max():.3f}")
print(f"Min correlation: {high_corr.min():.3f}")
Result: the strongest pairwise correlation is only about 0.63, so no two features are strongly linearly related, consistent with the lack of visible patterns in the scatter plots.
# Apply PCA to reduce 1000D to 2D
pca = PCA(n_components=2)
X_pca = pca.fit_transform(df)
df_pca = pd.DataFrame(X_pca, columns=['principal_component_1', 'principal_component_2'])

# Visualize results
plt.scatter(df_pca['principal_component_1'], df_pca['principal_component_2'], color="#C00000")
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Decomposition')

# Check variance preservation
variance_explained = sum(pca.explained_variance_ratio_)
print(f"Variance preserved: {variance_explained:.3f}")  # ~0.146 (14.6%)
Remarkable Discovery
Despite preserving only 14.6% of variance, PCA reveals well-defined clusters that were completely invisible in the original high-dimensional space!
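To put the 14.6% figure in context, a short sketch (assuming df is the 500 x 1000 toy dataset loaded above) plots the cumulative explained variance as components are added:

# Sketch: cumulative explained variance across all principal components
pca_full = PCA().fit(df)
cum = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(range(1, len(cum) + 1), cum)
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.title('Explained variance vs. number of components')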
# Extend to 3D for more information
pca_3 = PCA(n_components=3).fit(df)
X_t = pca_3.transform(df)
df_pca_3 = pd.DataFrame(X_t, columns=['principal_component_1', 'principal_component_2', 'principal_component_3'])

# 3D visualization
import plotly.express as px
fig = px.scatter_3d(df_pca_3, x='principal_component_1', y='principal_component_2', z='principal_component_3')
fig.show()

# Check improved variance preservation
variance_3d = sum(pca_3.explained_variance_ratio_)
print(f"3D variance preserved: {variance_3d:.3f}")  # ~0.19 (19%)
Enhanced Results: with three components, about 19% of the variance is preserved (up from 14.6% with two), and the clusters remain clearly visible in the 3D scatter plot.
The lab showcases PCA’s remarkable ability to uncover hidden patterns in high-dimensional data, making it an essential tool for data scientists working with complex datasets.