This is an attempt to pen down my understanding of Alexnet. If you’re reading this, the hope is that you already know a bit about Convolutional Neural Networks (CNNs). Alexnet is a deep Convolutional Neural Network for image classification that won the ILSVRC-2012 competition with a top-5 test error rate of 15.3%, compared to 26.2% for the second-best entry. Pretty impressive, huh?
Though many more network topologies with a lot more layers have emerged since, Alexnet in my opinion was the first to make a breakthrough.
The paper is well written and surprisingly not too hard to understand. I just tried to break it down per my understanding and write it down. Because like all things, given a few months I tend to forget everything 🙂
Alexnet has 8 layers. The first 5 are convolutional and the last 3 are fully connected layers. In between we also have some ‘layers’ called pooling and activation. I’m not sure why they are not listed as separate layers. What I want to do while diving into the details of Alexnet is to also provide the intuition behind what these layers mean. I may be grossly wrong, so follow at your own risk 🙂
The network diagram is taken from the original paper.
The above diagram shows the sequence of layers in Alexnet. You can see that the work is divided into 2 parts, half executing on GPU 1 and the other half on GPU 2. The two GPUs communicate only at certain layers, which keeps the communication overhead low and helps achieve good performance overall.
- Layer 1 is a Convolution Layer,
- Input Image size is – 227 x 227 x 3 (the paper says 224 x 224 x 3, but the output size only works out with 227)
- Number of filters – 96
- Filter size – 11 x 11 x 3
- Stride – 4
- Layer 1 Output
- (227 - 11)/4 + 1 = 55, so 55 x 55 x 96 (filter size 11, stride 4)
- Split across 2 GPUs – So 55 x 55 x 48 for each GPU
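To convince myself of the Layer 1 arithmetic, here is a minimal sketch of the usual output-size formula, floor((n + 2*pad - k) / stride) + 1. The function name `conv_output_size` is my own, and I am assuming no padding for Layer 1.

```python
def conv_output_size(n, k, stride, pad=0):
    """Output size along one spatial dimension of a convolution.

    n: input size, k: filter size, pad: padding added on each side.
    """
    return (n + 2 * pad - k) // stride + 1


# Layer 1: 11 x 11 filters, stride 4, no padding (assumed)
print(conv_output_size(227, 11, 4))  # 55
print(conv_output_size(224, 11, 4))  # 54
```

With a 224 input the formula gives 54 rather than 55, which is why 227 is the size that makes the numbers line up.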
You might have heard that there are multiple ways to perform a convolution: a direct convolution, along the lines of what we’ve known in the image processing world; a convolution that uses GEMM (General Matrix Multiply) or FFT (Fast Fourier Transform); and other fancy algorithms like Winograd.
I will only elaborate a bit about the GEMM based one, because that’s the one I have heard about a lot. And since GEMM has been, and continues to be, beaten to death for the last cycle of performance, one should definitely try to reap its benefits.
- Local regions in the input image are stretched out into columns
- Operation is commonly called im2col.
- E.g., if the input is [227x227x3] and it is to be convolved with 11x11x3 filters with stride 4
- Take [11x11x3] blocks of pixels in the input
- Stretch each block into a column vector of size 11*11*3 = 363.
- Result Matrix M = [363 x 3025] (55*55=3025), where 55 comes from (227 - 11)/4 + 1
- Weight Matrix W = [96 x 363]
- Perform Matrix multiply: W x M
- Disadvantages of this approach:
- Needs reshaping after GEMM (the [96 x 3025] result goes back to 55 x 55 x 96)
- Duplication of data – due to overlapping blocks of pixels, a lot more memory is required
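The im2col steps above can be sketched in plain Python. This is a toy version under my own assumptions: the helper names `im2col` and `matmul` are made up for illustration, the image is a nested list indexed as [row][column][channel], and the sizes are tiny (7x7x3 input, 3x3x3 filters, stride 2) so that the naive loops finish instantly. A real implementation would hand the multiply to an optimized GEMM library.

```python
def im2col(image, k, stride):
    """Stretch each k x k x C block of an H x W x C image into one column.

    Returns a matrix (nested lists) with k*k*C rows and out_h*out_w columns.
    """
    h, w, c = len(image), len(image[0]), len(image[0][0])
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    blocks = []
    for i in range(out_h):
        for j in range(out_w):
            # Flatten one block in (row, column, channel) order
            blocks.append([image[i * stride + di][j * stride + dj][ch]
                           for di in range(k)
                           for dj in range(k)
                           for ch in range(c)])
    # blocks holds one block per row; transpose so each block is a column
    return [list(row) for row in zip(*blocks)]


def matmul(a, b):
    """Naive GEMM: (m x n) times (n x p) gives (m x p)."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]


# Toy input: pixel value is simply row + column + channel
image = [[[float(r + c + ch) for ch in range(3)] for c in range(7)]
         for r in range(7)]

M = im2col(image, k=3, stride=2)   # 27 x 9
W = [[1.0] * 27, [0.5] * 27]       # 2 filters, flattened in the same order
out = matmul(W, M)                 # 2 x 9, to be reshaped to 3 x 3 x 2

print(len(M), len(M[0]), len(out), len(out[0]))  # 27 9 2 9
```

The duplication is visible even here: the 7x7x3 image has 147 values, while M holds 27 * 9 = 243.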
Rectified Linear Unit (ReLU) is the activation function used in Alexnet. As fancy as the name sounds, the function is nothing more than max(x, 0). But using simple terms is against the principle of researchers and we lesser mortals have to live with that. The intuition behind an activation is similar to that of the neurons in the brain. A neuron fires when the incoming signal is above a certain threshold. In this case we are setting that threshold to 0, so negative values do not propagate through the network. Other activation functions include the sigmoid and tanh functions.
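Just to show how little there is to it, here is a tiny sketch of the activation applied to a handful of made-up values (the function name `relu` and the sample list are mine):

```python
def relu(x):
    """Rectified Linear Unit: pass positive values through, clamp the rest to 0."""
    return max(x, 0.0)


values = [-2.0, -0.5, 0.0, 1.5, 3.0]
print([relu(v) for v in values])  # [0.0, 0.0, 0.0, 1.5, 3.0]
```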
- Layer 2 is a Max Pooling Followed by Convolution
- Input – 55 x 55 x 96
- Max pooling (3 x 3 window, stride 2) – (55 - 3)/2 + 1 = 27, so 27 x 27 x 96
- Number of filters – 256
- Filter size – 5 x 5 x 48
- Layer 2 Output
- 27 x 27 x 256
- Split across 2 GPUs – So 27 x 27 x 128 for each GPU
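Pooling is sub-sampling over a small window, usually 2×2. Max pooling takes the max of the values in the window, so 4 values for a 2×2 window. Alexnet itself uses overlapping 3×3 windows with a stride of 2. The intuition behind pooling is that it reduces computation & controls overfitting.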
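Here is a rough sketch of max pooling on a single 2D feature map, with the window size and stride as parameters. The function name `max_pool` and the 4x4 sample values are made up for illustration.

```python
def max_pool(feature_map, window, stride):
    """Max pooling over a 2D feature map given as a list of rows."""
    h, w = len(feature_map), len(feature_map[0])
    out_h = (h - window) // stride + 1
    out_w = (w - window) // stride + 1
    return [[max(feature_map[i * stride + di][j * stride + dj]
                 for di in range(window)
                 for dj in range(window))
             for j in range(out_w)]
            for i in range(out_h)]


fm = [[1, 3, 2, 0],
      [4, 2, 1, 5],
      [0, 1, 7, 2],
      [3, 6, 1, 1]]
print(max_pool(fm, window=2, stride=2))  # [[4, 5], [6, 7]]
```

The same function with window=3 and stride=2 turns a 55 x 55 map into 27 x 27.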
Layers 3, 4 & 5 follow on similar lines.
- Layer 6 is fully connected
- Input – 13 x 13 x 256 from Layer 5, max pooled to 6 x 6 x 256 and then transformed into a vector of 9216 values
- And multiplied with a matrix of the following dim – (6 x 6 x 256) x 4096, i.e. 2048 outputs on each GPU
- GEMV(General Matrix Vector Multiply) is used here:
Vector X = 1 x (6x6x256)
Matrix A = (6x6x256) x 4096 – these are the layer’s weights, which come from training rather than from the input image
Output is – 1 x 4096
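The fully connected step boils down to a row vector times a matrix. Below is a minimal sketch with toy sizes (3 inputs, 2 outputs) instead of the real ones. The function name `gemv` and the numbers are mine, and a real network would call an optimized BLAS routine rather than loop in Python.

```python
def gemv(x, a):
    """Row vector x (length n) times matrix a (n x m) gives a vector of length m."""
    return [sum(x[i] * a[i][j] for i in range(len(x)))
            for j in range(len(a[0]))]


x = [1.0, 2.0, 3.0]   # flattened input
a = [[1.0, 0.0],      # weights: one row per input, one column per output
     [0.0, 1.0],
     [2.0, 2.0]]
print(gemv(x, a))     # [7.0, 8.0]
```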
Layers 7 & 8 follow on similar lines.
I’m working on similar posts for other topologies. Hopefully will get to them before they become obsolete!