CapsNet
This is an implementation of CapsNet for MNIST based on the paper Dynamic Routing Between Capsules by Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. I used TensorFlow and the MNIST data downloaded from Yann LeCun's website.
The code is developed in a Jupyter notebook. I added naive unit tests to make sure the output of each layer matches the specs. For clarity, I tried to follow the paper's naming conventions in most cases.
The following is the main architecture of CapsNet:
Convolution layer
It consists of a convolution layer with 256 filters, 9x9 kernels, a stride of 1, and ReLU activation. Applied to a 28x28 MNIST image, this produces a 20x20x256 output.
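As a quick sanity check of the shapes (assuming 'valid' padding, i.e. no zero-padding, as in the paper):

```python
import numpy as np

# Output spatial size of a 'valid' convolution: (input - kernel) // stride + 1.
def conv_out_size(input_size, kernel, stride):
    return (input_size - kernel) // stride + 1

side = conv_out_size(28, 9, 1)   # MNIST images are 28x28
print(side)                      # 20 -> output is 20x20x256 with 256 filters
```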
Primary capsule layer
It consists of 32 channels of 6x6 grids of 8D capsules. This translates to 32 convolution layers, each with 8 filters, 9x9 kernels, a stride of 2, and linear activation. The output (u) is 32x6x6x8 and is reshaped to 1152x8, where 1152 = 32 x 6 x 6 is the total number of capsule outputs.
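A NumPy sketch of the reshape step (using random values in place of the actual convolution outputs):

```python
import numpy as np

# Second 'valid' convolution over the 20x20 feature map: (20 - 9) // 2 + 1 = 6.
grid = (20 - 9) // 2 + 1

# Simulated primary capsule output: 32 channels, each a 6x6 grid of 8D capsules.
u = np.random.randn(32, grid, grid, 8)
u = u.reshape(-1, 8)             # flatten to one capsule vector per row
print(u.shape)                   # (1152, 8): 1152 = 32 * 6 * 6
```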
Transformation
The output of the primary capsule layer is multiplied by a weight matrix (W) to create the prediction vectors u_hat. Since the DigitCaps layer has ten 16D vectors, u_hat has the shape 1152x10x16.
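In shape terms, this is one 16x8 matrix per (primary capsule, digit capsule) pair. A NumPy sketch with random weights:

```python
import numpy as np

num_primary, num_digit, in_dim, out_dim = 1152, 10, 8, 16

u = np.random.randn(num_primary, in_dim)                     # primary capsule outputs
W = np.random.randn(num_primary, num_digit, out_dim, in_dim) # one 16x8 matrix per (i, j) pair

# u_hat[i, j] = W[i, j] @ u[i]: primary capsule i's prediction for digit capsule j.
u_hat = np.einsum('ijab,ib->ija', W, u)
print(u_hat.shape)               # (1152, 10, 16)
```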
Routing
At this stage, the logits (bij) are determined iteratively by the routing algorithm. The logits are translated into coupling coefficients (cij) using a softmax function (calculated over the DigitCaps). This determines how strongly each capsule output is coupled to each parent (DigitCaps 1 to 10). The weighted sum of predictions per parent is called (s); (s) then goes through the squash function to create (v), whose norm lies between 0 and 1. The routing algorithm has to be executed for each sample in the batch, so I used tf.while_loop.
The number of routing iterations is set to two. The following shows the routing algorithm, which is executed consecutively for every sample in the batch.
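A single-sample NumPy sketch of the routing loop (the notebook's TensorFlow version runs this per sample inside tf.while_loop; this is a shape-level illustration, not the actual implementation):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Scale vectors to a norm in [0, 1) while preserving direction."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route(u_hat, iterations=2):
    """Dynamic routing for one sample; u_hat has shape (1152, 10, 16)."""
    b = np.zeros(u_hat.shape[:2])                 # logits b_ij, one per (primary, digit) pair
    for _ in range(iterations):
        c = softmax(b, axis=1)                    # coupling coefficients c_ij over the 10 parents
        s = np.einsum('ij,ijk->jk', c, u_hat)     # weighted sum of predictions per parent
        v = squash(s)                             # (10, 16) parent outputs, each with norm < 1
        b = b + np.einsum('ijk,jk->ij', u_hat, v) # agreement update: u_hat . v
    return v

v = route(np.random.randn(1152, 10, 16))
print(v.shape)                                    # (10, 16)
```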
Fully connected (decoder) layer
This layer reconstructs the input from the DigitCaps layer outputs. This forces the 16D vectors in the DigitCaps layer to represent the actual digits.
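A NumPy sketch of the decoder's forward pass, assuming the layer sizes from the paper (160 -> 512 -> 1024 -> 784) and random weights; only the true class's capsule is fed in, the rest are masked out:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1 = rng.standard_normal((160, 512)) * 0.01    # 160 = 10 capsules x 16 dims
W2 = rng.standard_normal((512, 1024)) * 0.01
W3 = rng.standard_normal((1024, 784)) * 0.01   # 784 = 28 * 28 output pixels

def decode(v, target):
    """Reconstruct a 28x28 image from the DigitCaps output v (shape (10, 16)),
    masking out every capsule except the target class."""
    mask = np.zeros((10, 1))
    mask[target] = 1.0
    x = (v * mask).reshape(-1)                 # masked, flattened DigitCaps activity
    h = relu(relu(x @ W1) @ W2)
    return sigmoid(h @ W3).reshape(28, 28)     # sigmoid -> pixel intensities in (0, 1)

img = decode(rng.standard_normal((10, 16)), target=3)
print(img.shape)                               # (28, 28)
```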
Loss
The loss is a combination of the margin loss for present classes, the margin loss for non-present classes, and the reconstruction loss, each with a different scaling factor.
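The margin loss from the paper, sketched in NumPy (m+ = 0.9, m- = 0.1, lambda = 0.5 are the paper's values; the reconstruction loss, a scaled sum of squared pixel differences, is omitted here):

```python
import numpy as np

def margin_loss(v, labels, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss: present classes are pushed above m_plus, absent classes
    below m_minus, with the absent term down-weighted by lam."""
    lengths = np.linalg.norm(v, axis=-1)                       # ||v_k|| per digit capsule
    present = labels * np.maximum(0.0, m_plus - lengths) ** 2
    absent = lam * (1.0 - labels) * np.maximum(0.0, lengths - m_minus) ** 2
    return np.sum(present + absent)

# One sample: capsule 3 is the true class and already has a long vector.
v = np.full((10, 16), 0.01)
v[3] = 0.25                                                    # length = sqrt(16 * 0.0625) = 1.0
labels = np.eye(10)[3]
print(margin_loss(v, labels))                                  # 0.0: both margins are satisfied
```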
Optimization
The Adam optimizer is used with default parameters.
Training
Training the model with a batch size of 100 for 5 epochs gave me an accuracy of 0.98 (98%) on the test data.
Interesting characteristics of coupling coefficients
After training, I built a graph to visualize how the coupling coefficients (cij) choose their parent capsules in the DigitCaps layer. I used the coupling coefficient vectors to visualize the relationship between the primary capsules and the DigitCaps on a graph. As we can see, primary capsules from different channels gather around the DigitCaps.