Closed
Labels: bug (Something isn't working)
Description
Describe the bug
I have been trying to train efficientnet_b2 on handwritten digits. However, the training does not work correctly.
To Reproduce
- Step 1: Train
./distributed_train.sh 1 ./HWD/ --model efficientnet_b2 -b 128 --sched step --epochs 100 --decay-epochs 2.4 --decay-rate .97 --opt rmsproptf --opt-eps .001 -j 8 --warmup-lr 1e-6 --weight-decay 1e-5 --drop 0.3 --drop-path 0.2 --model-ema --model-ema-decay 0.9999 --aa rand-m9-mstd0.5 --remode pixel --reprob 0.2 --amp --lr .016 --num-classes 10
The ./HWD/ folder has the following structure:
- train
  - 0
    - 1.jpg
    - 2.jpg
    - ...
    - 350.jpg
  - 1
    - 1.jpg
    - 2.jpg
    - ...
    - 350.jpg
  ...
  - 9
    - 1.jpg
    - 2.jpg
    - ...
    - 350.jpg
- validation
  - 0
    - 1.jpg
    - 2.jpg
    - ...
    - 150.jpg
  - 1
    - 1.jpg
    - 2.jpg
    - ...
    - 150.jpg
  ...
  - 9
    - 1.jpg
    - 2.jpg
    - ...
    - 150.jpg
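To rule out a data-layout problem, the folder-per-class structure above can be checked with a short script. This is a minimal sketch, not part of the training code: the helper name `count_images_per_class` and the list of accepted extensions are assumptions made for illustration.

```python
from collections import Counter
from pathlib import Path


def count_images_per_class(split_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Return {class_name: image_count} for a folder-per-class split directory."""
    counts = Counter()
    for class_dir in sorted(Path(split_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files next to the class folders
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in extensions
        )
    return dict(counts)


if __name__ == "__main__":
    # Paths follow the layout described above; adjust if the data lives elsewhere.
    for split in ("train", "validation"):
        split_dir = Path("./HWD") / split
        if split_dir.is_dir():
            print(split, count_images_per_class(split_dir))
```

With the layout described here, each class under train should report 350 images and each class under validation 150.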
When training finished, the log reported "Best metric: 10.0 (epoch 0)".
The training log is as follows (middle epochs omitted):
epoch,train_loss,eval_loss,eval_top1,eval_top5,lr
0,3.4499157269795737,2.3024010416666667,10.0,50.0,1e-06
1,3.624408103801586,2.3024010416666667,8.266666666666667,50.0,0.0032008
2,3.4603706112614385,2.3025208333333333,7.866666666666666,50.0,0.0064006
3,2.9570982544510453,2.302640625,10.0,50.0,0.0096004
4,2.7964119204768427,2.3029270833333335,10.0,50.0,0.0128002
5,2.799766266787494,2.302546875,10.0,49.46666666666667,0.015054399999999999
6,2.5497683684031167,2.3022604166666665,10.0,50.0,0.015054399999999999
7,2.4894364321673357,2.3029270833333335,10.0,49.8,0.015054399999999999
8,2.4517396291097007,2.302640625,10.0,50.0,0.014602768
9,2.4140525658925376,2.3028541666666666,10.0,50.0,0.014602768
....
90,0.9781680151268288,2.3033385416666667,10.0,50.0,0.00518410931230147
91,0.9978100569159897,2.3030052083333334,10.0,50.0,0.00518410931230147
92,0.9757056457025034,2.303671875,10.0,50.0,0.005028586032932427
93,0.9612530094605906,2.3033854166666665,10.0,50.0,0.005028586032932427
94,0.9917539711351748,2.3035520833333334,10.0,50.0,0.004877728451944453
95,0.9673527876536051,2.303765625,10.0,50.0,0.004877728451944453
96,0.957898685225734,2.302765625,10.0,50.0,0.00473139659838612
97,0.9409491817156473,2.303598958333333,10.0,50.0,0.00473139659838612
98,0.9428280658192105,2.302979166666667,10.0,50.0,0.00473139659838612
99,0.9692743840040984,2.3031458333333332,10.0,50.0,0.0045894547004345365
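The eval columns in this log can be compared against chance level. For 10 balanced classes, a model that outputs the same prediction for every image scores 10% top-1 and 50% top-5, and a uniform prediction has cross-entropy ln(10) ≈ 2.3026. The sketch below checks a few rows copied from the log; the helper name `at_chance` and the tolerances are arbitrary choices for illustration.

```python
import math

num_classes = 10
chance_loss = math.log(num_classes)    # cross-entropy of a uniform prediction
chance_top1 = 100.0 / num_classes      # 10% for 10 balanced classes
chance_top5 = 100.0 * 5 / num_classes  # 50% for 10 balanced classes

# (epoch, eval_loss, eval_top1, eval_top5) copied from the log above
rows = [
    (0, 2.3024010416666667, 10.0, 50.0),
    (9, 2.3028541666666666, 10.0, 50.0),
    (90, 2.3033385416666667, 10.0, 50.0),
    (99, 2.3031458333333332, 10.0, 50.0),
]


def at_chance(eval_loss, eval_top1, eval_top5, tol_loss=0.01, tol_acc=1.0):
    """True if the eval metrics are indistinguishable from chance level."""
    return (
        abs(eval_loss - chance_loss) < tol_loss
        and abs(eval_top1 - chance_top1) < tol_acc
        and abs(eval_top5 - chance_top5) < tol_acc
    )


print(f"chance loss = {chance_loss:.4f}")
for epoch, loss, top1, top5 in rows:
    print(epoch, at_chance(loss, top1, top5))
```

Every sampled row sits at chance level even though train_loss falls from about 3.45 to about 0.97, so the weights being evaluated behave as if they had not learned anything.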
- Step 2: Inference
When I do inference with
python inference.py ./output/inference/hwd/input --model efficientnet_b2 --checkpoint ./output/train/20221121-221541-efficientnet_b2-256/last.pth.tar --output_dir ./output/inference/hwd/output --num-classes 10
It produces an unexpected result, with the same top-5 predictions for every image:
0-352.jpg,9,3,5,2,8
0-353.jpg,9,3,5,2,8
1-351.jpg,9,3,5,2,8
1-352.jpg,9,3,5,2,8
2-351.jpg,9,3,5,2,8
2-352.jpg,9,3,5,2,8
3-351.jpg,9,3,5,2,8
3-352.jpg,9,3,5,2,8
4-351.jpg,9,3,5,2,8
4-352.jpg,9,3,5,2,8
5-351.jpg,9,3,5,2,8
5-352.jpg,9,3,5,2,8
6-351.jpg,9,3,5,2,8
6-352.jpg,9,3,5,2,8
7-351.jpg,9,3,5,2,8
7-352.jpg,9,3,5,2,8
8-351.jpg,9,3,5,2,8
8-352.jpg,9,3,5,2,8
9-351.jpg,9,3,5,2,8
9-352.jpg,9,3,5,2,8
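A quick way to confirm that the predictions do not depend on the input is to count the distinct prediction tuples in the output. This is a minimal sketch that assumes the output format shown above (file name followed by the top-5 class indices); `distinct_predictions` is a hypothetical helper, and the sample is a subset of the rows above.

```python
import csv
import io

# A few rows copied from the inference output above.
sample = """0-352.jpg,9,3,5,2,8
0-353.jpg,9,3,5,2,8
1-351.jpg,9,3,5,2,8
5-352.jpg,9,3,5,2,8
9-352.jpg,9,3,5,2,8
"""


def distinct_predictions(lines):
    """Return the set of distinct top-k prediction tuples, ignoring file names."""
    reader = csv.reader(lines)
    return {tuple(row[1:]) for row in reader if row}


preds = distinct_predictions(io.StringIO(sample))
print(len(preds), preds)
```

A single distinct tuple across images of different digits means the model output is effectively constant, which matches the 10% top-1 and 50% top-5 figures in the training log.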
What would be the possible cause of that?
The training data is here.
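One setting from the Step 1 command that may be worth checking is --model-ema-decay 0.9999 in combination with such a small dataset. The arithmetic below is a rough sketch under stated assumptions: 3,500 training images (10 classes × 350), one process with batch size 128, an incomplete last batch being dropped, and a plain EMA update of the form ema = decay * ema + (1 - decay) * weights with no warmup. Whether the reported eval metrics and the inference run actually use the EMA weights is an assumption that has not been verified here.

```python
images = 10 * 350   # 10 classes x 350 training images each
batch_size = 128    # from -b 128
epochs = 100        # from --epochs 100
decay = 0.9999      # from --model-ema-decay 0.9999

# Assumption: the incomplete last batch of each epoch is dropped.
steps_per_epoch = images // batch_size
total_steps = steps_per_epoch * epochs

# After n plain EMA updates, the initial weights still contribute decay**n.
init_weight_fraction = decay ** total_steps

print(f"steps per epoch: {steps_per_epoch}")
print(f"total EMA updates: {total_steps}")
print(f"fraction of initial weights left in the EMA: {init_weight_fraction:.3f}")
```

Under these assumptions roughly three quarters of the EMA weights would still be the random initialization after 100 epochs. Evaluating the non-EMA weights, or training with a much smaller decay, would be one way to test whether this explains the chance-level results.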