Ok folks, I’m back again after a long hiatus. Just when I thought I got the hang of Alexnet & Inception, working with good old 32-bit floating point numbers, the DNN world (of which we all are a part of if we like it or not) decided that 16-bits or even 8-bits were more than sufficient for use in DNNs. While I was still reading about that, someone had the grand idea that 1 bit was more than sufficient. Obviously without a care that I hadn’t yet caught up with 16- and 8-bits yet. So off I went trying to figure out what the fuss was about, and turns out its pretty cool after all.
The blog is mostly based on the following paper http://minjekim.com/papers/icml2015_mkim.pdf combined with findings from some other sources and of course my own opinions.
Traditionally, DNNs used floating point values – mostly 32-bit. It was thought to be needed for accuracy and since CPUs and well mostly GPUs were well equipped to handle floating point operations efficiently, all was well. Which meant that DNNs could run well on systems which were equipped with GPUs which are powerful enough.
Convolution can be represented as a matrix multiplication operation. Which is essentially computing the dot product of each row of matrix A with each column of matrix B. Computing the dot product translates to nothing but Multiply-Accumulate(MAC) operations. These can be quite expensive to implement and require many logic gates. This translates to requiring greater die area and also more power consumption.
So what happens in today’s day and age, where we have so many edge devices – ranging from cellphones and watches to drones and glasses? These devices are much more limited in terms of size, power budget & memory capacity. Some of them cannot implement floating point operations efficiently.
In order to mitigate this problem, researchers in the Industry and Academia started exploring DNNs which used lower precision activations and weights. The primary concern was however, the effect of lower precision on the accuracy of the training or inference result. But turns out it can be done – with minimal accuracy loss!
Quantization is the process of compressing the floating point input/output values in a DNN to a fixed point representation. Typically these are 8-bit or 16-bit integers. For e.g. if we had floating point values ranging from -5.0 to +5.0, 0 would map to -5.0, 255 would map to +5.0 and any values in between would be scaled appropriately.
The basic idea is that DNNs are inherently known to cope well with noise. So reducing the precision can be considered as just adding another source of noise.
It becomes obvious that using 8-bit values reduces our memory storage and computation costs. It is also fairly easy to convert the many models already trained in floating point to 8-bit via quantization. It is however argued that while 8-bit works great on inference, the training process still needs to be done in 32-bit floats to maintain accuracy of the result. This seems to be changing though and there are papers showing that lower accuracy can be used for training as well.
But the question is, can we do better? Enter Binary Neural Networks.
In Binary Neural Networks, the inputs, outputs and weights are all binary values. By binary here, we mean Bipolar Binary, i.e. +1 & -1 values. In this case, -1 is actually represented as a 0. It becomes instantly obvious that replacing a 32-bit real number with a single bit would come with big savings – cheaper to store, cheaper to compute and consumes less power!
Cheaper to store is kinda obvious – it is cheaper to store 1 bit per value instead of 32 bits. But how did it get so much cheaper to compute. In order to understand that, lets take a step back and see how convolutions are computed today. In most cases, convolutions are written out as matrix multiplication problems. And most GPUs can perform this operation very efficiently today on floating point numbers. Essentially the matrix multiply translates to a dot product operation – multiply followed by accumulate of every row in matrix A, with every column in Matrix B.
for each i in width: C += A[row][i] * B[i][col]
The above equation could also apply to binary values, but surely we can do better with bits. Turns out we can substitute the fairly expensive multiply-accumulate (MAC) logic with a simple XNOR operation followed by pop-count to get the same results! How cool is that.
for each i in width: C += popcount(XNOR(A[row][i], B[i][col]))
XNOR and Popcount can be implement with much fewer gates than MAC. And whats cooler, it does not require any hardware changes unlike 8- and 16-bit integers since we are only considered with the binary representation. Most languages have built-in functions for popcount. XNOR can be implemented with == operator.
To understand the math better, I tried to write out an example:
A = 10010 B = 01111 0 is really -1 here Dot product: A * B = (1 * -1) + (-1 * 1) + (-1 * 1) + (1 * 1) + (-1 * 1) Result from traditional method = (-1) + (-1) + 1 + (-1) = -3 XNOR(A, B) = 00010 Popcount P of 00010 = 1
Wait! What just happened, -3 is clearly != 1.
Well after a LOT of hunting around, I found some sources which say that the actual result is not just XNOR followed by popcount. There is one additional step.
XNOR(A, B) = 00010 Popcount P of 00010 = 1 Result = 2*P - N, where N is the total number of bits In this case 2*1 - 5 = -3
And the world makes sense again!
It is still unclear as to why this is not called out explicitly. Some other papers also refer to an XOR instead of XNOR. But you get the basic idea!
Look here to see some good diagrams explaining the complexity of floating point networks vs BNNs.
The process of converting from a real value to binary is termed Binarization (d-uh). One of the simplest functions used for this is the Sign function.
Sign(x) = +1, if x ≥ 0 -1, else
During forward propagation, the weights and inputs are binarized at each layer. The gradients that go through back propagation are NOT binary though, they are real values. This is because the updates to the weights are very small and need higher precision to be represented.
Some of the inaccuracies from BNN can also be considered equal to dropoff. At a high level dropoff is nothing but zeroing out random activations in order to prevent over fitting.
The author also goes on to say that sometimes, BNNs need more hyperplanes (translates to more hidden units) to achieve similar accuracy as traditional DNNs. This however doesn’t increase the cost a lot because each BNN node is much cheaper to compute.
So does all of this work? Apparently yes. There are some experiments in the paper which show that MNIST worked quite well. Please do refer to that. Many of the DL Frameworks have implementations for BNN so it’s going to be fun experimenting with bits soon!