If you work on data-parallel distributed training for neural networks, you may have come across the Deep Gradient Compression (DGC) paper: https://arxiv.org/abs/1712.01887. It tackles the communication bottleneck in data-parallel DNN training by shrinking the amount of data transmitted: each worker sends only the gradient values whose magnitude crosses a threshold, and accumulates the rest locally. I was fascinated by the ideas in the paper and wanted to try it out quickly, so I built a version of DGC on MXNet.
Here is the code: https://github.com/anandj91/anand-mxnet (Branch: dgc)
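For anyone unfamiliar with the idea, here is a rough sketch of the sparsification step at the heart of DGC. This is just an illustration in plain NumPy with a made-up function name, not the actual code from my branch, and it leaves out the momentum correction, gradient clipping, and warm-up tricks the paper also uses to keep accuracy up.

```python
import numpy as np

def sparsify_gradient(grad, residual, sparsity=0.999):
    """Accumulate the gradient into a local residual and send only the
    largest-magnitude entries (top 0.1% by default). Entries that are
    sent are cleared from the residual; the rest stay local and keep
    accumulating until they grow large enough to cross the threshold."""
    acc = residual + grad                          # local gradient accumulation
    k = max(1, int(acc.size * (1.0 - sparsity)))   # number of values to transmit
    flat = np.abs(acc).ravel()
    threshold = np.partition(flat, -k)[-k]         # k-th largest magnitude
    mask = np.abs(acc) >= threshold                # values that cross the threshold
    sparse_update = np.where(mask, acc, 0.0)       # what actually gets sent
    new_residual = np.where(mask, 0.0, acc)        # what stays behind locally
    return sparse_update, new_residual
```

The sparse update is what gets pushed to the parameter server (or all-reduced), while the residual is carried over to the next iteration on each worker.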
I tried training ResNet-110 on CIFAR-10 with DGC. I don’t know what I’m doing wrong, but I can’t reproduce the results reported in the paper: my validation accuracy hovers around 92.8%, versus 93.66% for the baseline that sends full gradients. Most likely I’m not using the right hyperparameters.
If you have tried this before, or you work in this area and are interested in taking a stab at it, feel free to contact me. I would be grateful for any help tuning the hyperparameters and getting my DGC implementation to converge to the baseline accuracy.