I have tried to re-implement the architecture described in the paper exactly, just in TensorFlow, but I don't get the same number of trainable parameters. I can't find a breakdown of how this number is calculated, so I was hoping someone could help me out.
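For reference, this is roughly how I'm counting the parameters — a minimal sketch assuming the network is built as a `tf.keras.Model` (the `model` argument is a stand-in for whatever my build code returns):

```python
import tensorflow as tf

def count_trainable_params(model: tf.keras.Model) -> int:
    # Sum the element counts of every trainable variable
    # (conv kernels, biases, batch-norm gammas/betas, ...).
    return sum(v.shape.num_elements() for v in model.trainable_variables)
```

`model.summary()` reports the same split between trainable and non-trainable parameters. One thing I'm unsure about: Keras counts batch-norm moving means/variances as non-trainable, so if the paper's figures include them, the totals wouldn't line up.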
Paper:
56 layers: 1.5 million
103 layers: 9.4 million
My implementation:
56 layers: 1.4 million
103 layers: 9.2 million
The discrepancy is small, so normally I wouldn't care, but I also can't quite reproduce the performance reported in the paper, so perhaps tracking down the parameter mismatch could reveal a bug in my code.
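To localize the difference, I've been looking at a per-layer dump along these lines — again just a sketch assuming a `tf.keras.Model`, with layer names being whatever my build code assigns:

```python
import tensorflow as tf

def print_param_breakdown(model: tf.keras.Model) -> None:
    # Print each layer's trainable-parameter count so the missing
    # ~0.1-0.2 million parameters can be traced to a specific block.
    for layer in model.layers:
        n = sum(w.shape.num_elements() for w in layer.trainable_weights)
        if n > 0:
            print(f"{layer.name:<40} {n:>12,}")
```

Comparing this output block by block against the architecture table in the paper is how I'd hoped to spot where my count falls short, but so far nothing has jumped out.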