I recently encountered a problem at work that required me to understand Principal Component Analysis (PCA). This required a lot of research on Linear Algebra & Machine Learning. When I finally made sense out of the whole thing, I realized it’s such a beautiful mathematical problem and felt like writing about it.
I decided to break it up into 2-3 posts so that it’s easier on the eyes and the mind. In Part I, I’ll just introduce the problem…
Let’s use the example of buying a house. You need to gather a lot of data when making this decision – price, square footage, age, number of rooms, neighborhood crime score, walk score, school rating, whether it has a backyard, HOA … and the list goes on.
Imagine plotting 50 such values on a graph. If you were trying to analyze data for 100 houses, you’d be plotting 50-dimensional data. Good luck trying to make any sense out of that!
Enter PCA. PCA allows you to reduce the dimensions of the data, a process also referred to as Dimensionality Reduction. Essentially, it reduces the number of attributes from 50 to a smaller number (whatever number you want it to be) so that you can plot the data and make sense of it. One may ask: why can’t I just consider a few attributes of the house, e.g. just the price and square footage? Well, you could, but that would not give you a true estimate of the value of the house, so chopping off attributes is not a good idea. Instead, you want to use information from all 50 attributes and come up with a whole new set of attributes, each a weighted combination of the original ones, that captures as much variability in the data as possible. That way no single attribute of the house skews its estimated value. These new attributes are what are referred to as Principal Components.
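To make the “chopping off attributes” point a bit more concrete, here’s a rough sketch using just two made-up attributes instead of 50. Everything in it is an assumption for illustration: the synthetic square footage and price numbers, the choice to standardize both columns first, and the use of NumPy’s SVD to get at the first principal component (the math behind that comes later in the series). It compares how much of the total variability you keep by holding on to one original attribute versus one principal component.

```python
import numpy as np

# Synthetic, made-up data: 100 houses with two correlated attributes.
rng = np.random.default_rng(0)
n_houses = 100
sqft = rng.normal(1800, 400, n_houses)
price = 200 * sqft + rng.normal(0, 40_000, n_houses)

X = np.column_stack([sqft, price])

# Standardize each column (assumption: we want both attributes on the same scale).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Total variability = sum of the per-column variances (2.0 after standardizing).
total_var = Z.var(axis=0).sum()

# Option 1: chop off price and keep only square footage.
kept_one_attr = Z[:, 0].var() / total_var

# Option 2: keep only the first principal component.
# The squared singular values of Z, divided by the number of rows,
# give the variance captured along each principal component.
_, s, _ = np.linalg.svd(Z, full_matrices=False)
var_per_component = s**2 / n_houses
kept_first_pc = var_per_component[0] / total_var

print(f"keeping one attribute:        {kept_one_attr:.0%} of the variability")
print(f"keeping one principal comp.:  {kept_first_pc:.0%} of the variability")
```

Either way you end up with a single column, but the principal component draws on both attributes, so it holds on to more of the variability than either original attribute alone.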
The input to your PCA algorithm is a matrix of values. Each row represents one unique vector (in our example, one unique house), and each column in that row represents a specific attribute of the house. The number of columns is the number of attributes. The number of rows is usually much larger than the number of attributes. Which makes sense, right? You would typically survey several hundred houses (rows) but probably consider a max of 50 attributes (columns). In many real-world data sets, it’s more like several hundred thousand rows and < 100 columns. So the matrix looks ‘tall and skinny’.
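Here’s a tiny sketch of what that ‘tall and skinny’ input looks like as a NumPy array. The numbers are random placeholders, not real house data, and the sizes (300 houses, 50 attributes) are just assumptions picked to match the example.

```python
import numpy as np

n_houses = 300      # rows: one per house (assumed size)
n_attributes = 50   # columns: one per attribute (assumed size)

# Random placeholder values standing in for real measurements.
rng = np.random.default_rng(42)
X = rng.normal(size=(n_houses, n_attributes))

one_house = X[0]         # a single row: all 50 attributes of one house
one_attribute = X[:, 0]  # a single column: one attribute across all houses

print(X.shape)              # (rows, columns)
print(one_house.shape)
print(one_attribute.shape)
```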
The task at hand is to take these 50 attributes for each of the houses and reduce them to a much smaller number of dimensions – it could be 1 or 3 or 20 – depending on what you require. So your output would be a matrix that is as tall as the input but skinnier, because we just reduced the number of attributes.
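As a rough sketch of that input-to-output shape change, the snippet below reduces a 300 x 50 matrix of random placeholder values to 1, 3, and 20 columns. The function name `reduce_dimensions` is hypothetical, and computing the components through NumPy’s SVD on mean-centered data is one common way to do it, not necessarily the approach the later posts will take.

```python
import numpy as np

def reduce_dimensions(X, k):
    """Project the rows of X onto its first k principal components (sketch)."""
    # Center each column so every attribute has zero mean.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by variance captured.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Keep the first k directions and express every house in terms of them.
    return X_centered @ Vt[:k].T

# Random placeholder data: 300 houses, 50 attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))

for k in (1, 3, 20):
    reduced = reduce_dimensions(X, k)
    print(k, reduced.shape)  # same number of rows, only k columns
```

The number of rows never changes; only the number of columns shrinks to whatever `k` you ask for.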
In the next post I’ll introduce the basic mathematical concepts one needs to understand in order to appreciate the solution to this problem – namely eigenvectors and eigenvalues.