
[BUG] Why does training on handwritten digits produce incorrect results? #1557

@shunmian

Description


Describe the bug

I have been trying to train a handwritten-digit classifier with efficientnet_b2. However, the training is not running correctly.

To Reproduce

  • Step 1: Train
./distributed_train.sh 1 ./HWD/ --model efficientnet_b2 -b 128 --sched step --epochs 100 --decay-epochs 2.4 --decay-rate .97 --opt rmsproptf --opt-eps .001 -j 8 --warmup-lr 1e-6 --weight-decay 1e-5 --drop 0.3 --drop-path 0.2 --model-ema --model-ema-decay 0.9999 --aa rand-m9-mstd0.5 --remode pixel --reprob 0.2 --amp --lr .016 --num-classes 10

The ./HWD/ folder has the following structure (a quick layout check follows the tree):

- train
  - 0
    - 1.jpg
    - 2.jpg
    - ...
    - 350.jpg
  - 1
    - 1.jpg
    - 2.jpg
    - ...
    - 350.jpg
  - ...
  - 9
    - 1.jpg
    - 2.jpg
    - ...
    - 350.jpg

- validation
  - 0
    - 1.jpg
    - 2.jpg
    - ...
    - 150.jpg
  - 1
    - 1.jpg
    - 2.jpg
    - ...
    - 150.jpg
  - ...
  - 9
    - 1.jpg
    - 2.jpg
    - ...
    - 150.jpg
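
This is the quick check I used to confirm the folder layout and per-class image counts (a minimal sketch, assuming the ./HWD path above):

from pathlib import Path

# Count images per class folder in each split to confirm the layout
# train.py expects: <root>/<split>/<class>/<image>.jpg
root = Path("./HWD")
for split in ("train", "validation"):
    for class_dir in sorted((root / split).iterdir()):
        if class_dir.is_dir():
            n = sum(1 for _ in class_dir.glob("*.jpg"))
            print(f"{split}/{class_dir.name}: {n} images")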

When training finished, the log said "Best metric: 10.0 (epoch 0)".
The full training log is as follows (see the note on eval_loss after the log):

epoch,train_loss,eval_loss,eval_top1,eval_top5,lr
0,3.4499157269795737,2.3024010416666667,10.0,50.0,1e-06
1,3.624408103801586,2.3024010416666667,8.266666666666667,50.0,0.0032008
2,3.4603706112614385,2.3025208333333333,7.866666666666666,50.0,0.0064006
3,2.9570982544510453,2.302640625,10.0,50.0,0.0096004
4,2.7964119204768427,2.3029270833333335,10.0,50.0,0.0128002
5,2.799766266787494,2.302546875,10.0,49.46666666666667,0.015054399999999999
6,2.5497683684031167,2.3022604166666665,10.0,50.0,0.015054399999999999
7,2.4894364321673357,2.3029270833333335,10.0,49.8,0.015054399999999999
8,2.4517396291097007,2.302640625,10.0,50.0,0.014602768
9,2.4140525658925376,2.3028541666666666,10.0,50.0,0.014602768
....
90,0.9781680151268288,2.3033385416666667,10.0,50.0,0.00518410931230147
91,0.9978100569159897,2.3030052083333334,10.0,50.0,0.00518410931230147
92,0.9757056457025034,2.303671875,10.0,50.0,0.005028586032932427
93,0.9612530094605906,2.3033854166666665,10.0,50.0,0.005028586032932427
94,0.9917539711351748,2.3035520833333334,10.0,50.0,0.004877728451944453
95,0.9673527876536051,2.303765625,10.0,50.0,0.004877728451944453
96,0.957898685225734,2.302765625,10.0,50.0,0.00473139659838612
97,0.9409491817156473,2.303598958333333,10.0,50.0,0.00473139659838612
98,0.9428280658192105,2.302979166666667,10.0,50.0,0.00473139659838612
99,0.9692743840040984,2.3031458333333332,10.0,50.0,0.0045894547004345365
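
As a side note, the eval_loss column stays at roughly 2.3026 for the entire run, which, if I read it right, is exactly the cross-entropy of a uniform prediction over 10 classes, i.e. ln(10):

import math

# Cross-entropy of a uniform 10-class prediction: -ln(1/10) = ln(10)
print(math.log(10))  # 2.302585..., matching the eval_loss column above

So the model being evaluated appears to output near-uniform probabilities, even though train_loss keeps decreasing.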
  • Step 2: Inference

When I do inference with

python inference.py ./output/inference/hwd/input --model efficientnet_b2 --checkpoint ./output/train/20221121-221541-efficientnet_b2-256/last.pth.tar --output_dir ./output/inference/hwd/output --num-classes 10

It produces an unexpected result: every image gets the same top-5 prediction:

0-352.jpg,9,3,5,2,8
0-353.jpg,9,3,5,2,8
1-351.jpg,9,3,5,2,8
1-352.jpg,9,3,5,2,8
2-351.jpg,9,3,5,2,8
2-352.jpg,9,3,5,2,8
3-351.jpg,9,3,5,2,8
3-352.jpg,9,3,5,2,8
4-351.jpg,9,3,5,2,8
4-352.jpg,9,3,5,2,8
5-351.jpg,9,3,5,2,8
5-352.jpg,9,3,5,2,8
6-351.jpg,9,3,5,2,8
6-352.jpg,9,3,5,2,8
7-351.jpg,9,3,5,2,8
7-352.jpg,9,3,5,2,8
8-351.jpg,9,3,5,2,8
8-352.jpg,9,3,5,2,8
9-351.jpg,9,3,5,2,8
9-352.jpg,9,3,5,2,8

What could be the possible cause of this?
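
In case it helps narrow things down, here is the minimal standalone check I can run against the same checkpoint, bypassing inference.py (a sketch; 'some_digit.jpg' is a placeholder for any validation image):

import torch
import timm
from PIL import Image
from timm.data import resolve_data_config, create_transform

# Load the trained checkpoint directly, with the same model args as above.
model = timm.create_model(
    'efficientnet_b2',
    num_classes=10,
    checkpoint_path='./output/train/20221121-221541-efficientnet_b2-256/last.pth.tar',
)
model.eval()

# Build the eval transform from the model's resolved data config.
config = resolve_data_config({}, model=model)
transform = create_transform(**config)

# 'some_digit.jpg' is a placeholder for any image from the validation set.
img = Image.open('some_digit.jpg').convert('RGB')
with torch.no_grad():
    probs = model(transform(img).unsqueeze(0)).softmax(dim=-1)
print(probs.topk(5))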

The training data is here.
