
PCA Algorithm

PCA works by finding new axes to represent your data. If you have a dataset with two features x₁ and x₂, your data is initially plotted using these two axes. To reduce the number of features, you need to choose a new axis (the z-axis) that captures the data well with fewer dimensions.

Before applying PCA, features should be normalized to have zero mean (subtract the mean from each feature).

If features take on very different scales, perform feature scaling before PCA. For example:

  • x₁: House size in square feet (1,000-3,000)
  • x₂: Number of bedrooms (1-5)

Without scaling, the feature with the much larger range (house size in square feet) would dominate the variance and skew PCA's choice of axes.
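
As a minimal sketch of these two preprocessing steps (assuming NumPy; the numbers below are made up to match the house-size and bedroom example):

import numpy as np

# Hypothetical raw data: column 0 = house size in square feet, column 1 = bedrooms
X = np.array([[2100.0, 3.0],
              [1600.0, 2.0],
              [2900.0, 5.0],
              [1200.0, 1.0],
              [2400.0, 4.0]])

# Zero-mean normalization: subtract each feature's mean
X_centered = X - X.mean(axis=0)

# Feature scaling: divide by each feature's standard deviation so that
# square feet and bedroom counts end up on comparable scales
X_scaled = X_centered / X.std(axis=0)

print(X_scaled.mean(axis=0))   # ~[0, 0]
print(X_scaled.std(axis=0))    # ~[1, 1]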

Given five training examples, PCA must choose one axis instead of the original two to capture what’s important about the data.

Projection process:

  • Take each example and project it onto the chosen axis
  • Use line segments at 90-degree angles to the axis
  • The projection gives each example a single coordinate on the new axis, as shown in the sketch below
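
Here is a small sketch of that projection step (assuming NumPy; the five 2-D examples and the axis direction are made up for illustration):

import numpy as np

# Five made-up training examples, each with two features (x1, x2)
X = np.array([[1.0, 1.5],
              [2.0, 2.5],
              [3.0, 2.8],
              [4.0, 4.2],
              [5.0, 4.9]])

# Unit-length direction of the chosen axis
u = np.array([0.707, 0.707])

# The dot product with the axis direction projects each example at 90 degrees
# onto the axis, giving one coordinate per example
z = X @ u
print(z)   # five numbers instead of five (x1, x2) pairs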

If you choose an axis where projections result in points that are squished together:

  • The projected points have little variance
  • You capture much less information from the original dataset
  • The choice fails to preserve the data’s spread

If you choose an axis where projections result in points that are spread apart:

  • The projected points have large variance
  • You capture a lot of the variation and information in the original dataset
  • This preserves the essential characteristics of the data (see the sketch below)
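
To make the contrast between the two cases concrete, here is a rough sketch (assuming NumPy and made-up data): an axis aligned with the data's spread keeps much more variance than one that squishes the points together.

import numpy as np

# Made-up training examples, zero-mean normalized
X = np.array([[1.0, 1.5],
              [2.0, 2.5],
              [3.0, 2.8],
              [4.0, 4.2],
              [5.0, 4.9]])
X = X - X.mean(axis=0)

spread_axis   = np.array([0.707, 0.707])    # roughly aligned with the data's spread
squished_axis = np.array([0.707, -0.707])   # roughly perpendicular to the spread

print(np.var(X @ spread_axis))     # large variance: much of the information is kept
print(np.var(X @ squished_axis))   # small variance: the projections are squished together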

Principal Component Definition

In the PCA algorithm, the optimal axis is called the principal component: the axis that, when you project the data onto it, gives you the largest possible amount of variance.

For a training example with coordinates (2, 3) and a principal component axis defined by vector [0.71, 0.71]:

Projection formula:

projection = dot_product([2, 3], [0.71, 0.71])
= 2 × 0.71 + 3 × 0.71
= 3.55

This means the distance from the origin to the projected point is 3.55, giving us one number to represent this example instead of two.
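
A quick check of this arithmetic (assuming NumPy):

import numpy as np

x = np.array([2.0, 3.0])       # the training example
u = np.array([0.71, 0.71])     # the principal component direction (approximately unit length)

z = x @ u                      # dot product = projected coordinate
print(z)                       # 3.55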

The principal component direction is represented as a length-1 vector pointing in the direction of the z-axis. In this example: [0.71, 0.71] (which is actually [0.707, 0.707] with more precision).

  • Second axis: Always at 90 degrees to the first axis
  • Third axis: At 90 degrees to both first and second axes
  • Additional axes: Each subsequent axis is perpendicular to all previous axes

If you had 50 features and wanted three principal components:

  • First axis: Chosen to maximize variance
  • Second axis: Perpendicular to first, maximizes remaining variance
  • Third axis: Perpendicular to first two, maximizes remaining variance (see the sketch below)
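
As a rough sketch of how this looks in practice (assuming scikit-learn and a randomly generated stand-in for the 50-feature dataset), the three directions come back as mutually perpendicular unit vectors:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))    # stand-in dataset: 100 examples, 50 features

pca = PCA(n_components=3)
Z = pca.fit_transform(X)          # projected data: one row per example, 3 coordinates each

U = pca.components_               # the three axis directions, one per row
print(Z.shape)                    # (100, 3)
print(np.round(U @ U.T, 6))       # ~identity matrix: unit length, mutually perpendicular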
PCA is sometimes confused with linear regression, but the two algorithms do very different things.

Linear regression:

  • Data: Features x and labels y
  • Goal: Fit a line so the predicted value is close to the ground-truth label y
  • Optimization: Minimize vertical distances (aligned with the y-axis)
  • Special treatment: y is singled out as the target variable

PCA:

  • Data: Only features x₁, x₂, etc. (no labels y)
  • Goal: Find an axis z that preserves the data's variance when projected
  • Optimization: Minimize projection distances (perpendicular to the axis)
  • Equal treatment: All features (x₁, x₂, …, x₅₀) are treated equally

Key Distinction

Linear regression predicts a target output y, while PCA reduces the number of axes needed to represent data well by treating all features equally.

  • Linear regression: Can only fit lines in one orientation (predicting y from x)
  • PCA: Can choose any orientation for the principal component based on data structure (see the sketch below)
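
A small sketch of that difference (assuming NumPy and scikit-learn, with made-up correlated data): linear regression singles out one variable to predict, while PCA just returns the direction along which the data varies most.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
x1 = rng.uniform(0, 10, size=200)
x2 = 0.5 * x1 + rng.normal(scale=1.0, size=200)   # made-up correlated features
X = np.column_stack([x1, x2])

# Linear regression: x2 is singled out as the target and predicted from x1
reg = LinearRegression().fit(x1.reshape(-1, 1), x2)
print("regression slope:", reg.coef_[0])

# PCA: both features are treated equally; the result is a direction, not a predictor
pca = PCA(n_components=1).fit(X)
print("principal component direction:", pca.components_[0])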

Given a projected value (z = 3.55), you can approximate the original coordinates:

reconstruction = z × unit_vector
= 3.55 × [0.71, 0.71]
= [2.52, 2.52]

  • Original point: (2, 3)
  • Reconstructed point: (2.52, 2.52)
  • Approximation error: Small line segment between original and reconstructed points

With just one number, you can get a reasonable approximation of the original two-dimensional coordinates.
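
A minimal sketch of this reconstruction step (assuming NumPy, with the same numbers as above):

import numpy as np

z = 3.55                       # the projected coordinate from the previous step
u = np.array([0.71, 0.71])     # the principal component direction

x_original      = np.array([2.0, 3.0])
x_reconstructed = z * u        # step back from one number to two coordinates

print(x_reconstructed)                               # [2.5205 2.5205], roughly (2.52, 2.52)
print(np.linalg.norm(x_original - x_reconstructed))  # the small approximation error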

PCA looks at original data and:

  1. Chooses new axes (z or z₁, z₂, etc.) to represent data
  2. Projects original data onto these new axes
  3. Provides a smaller set of numbers for plotting and visualization
  4. Maximizes information retention by preserving variance

The result enables visualization and analysis of high-dimensional data in lower-dimensional spaces while maintaining the most important characteristics of the original dataset.
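
Putting the steps together, a typical end-to-end sketch might look like the following (assuming NumPy, scikit-learn, and matplotlib, with a randomly generated placeholder for a real high-dimensional dataset):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 50))          # placeholder for a real 50-feature dataset

# Normalize and scale the features, then choose new axes and project
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
Z = pca.fit_transform(X_scaled)         # a smaller set of numbers per example

# Fraction of the original variance each new axis retains
print(pca.explained_variance_ratio_)

# Visualize the high-dimensional data in two dimensions
plt.scatter(Z[:, 0], Z[:, 1])
plt.xlabel("z1")
plt.ylabel("z2")
plt.show()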