
[BUG] Why does training on handwritten digits produce incorrect results? #1557

@shunmian

Description


Describe the bug

I have been trying to train a handwritten-digit classifier with efficientnet_b2. However, the training is not running correctly.

To Reproduce

  • Step 1: Train
./distributed_train.sh 1 ./HWD/ --model efficientnet_b2 -b 128 --sched step --epochs 100 --decay-epochs 2.4 --decay-rate .97 --opt rmsproptf --opt-eps .001 -j 8 --warmup-lr 1e-6 --weight-decay 1e-5 --drop 0.3 --drop-path 0.2 --model-ema --model-ema-decay 0.9999 --aa rand-m9-mstd0.5 --remode pixel --reprob 0.2 --amp --lr .016 --num-classes 10

The ./HWD/ folder has the following structure (a quick layout check follows the tree):

- train
  - 0
    - 1.jpg
    - 2.jpg
    - ...
    - 350.jpg
  - 1
    - 1.jpg
    - 2.jpg
    - ...
    - 350.jpg
  - ...
  - 9
    - 1.jpg
    - 2.jpg
    - ...
    - 350.jpg

- validation
  - 0
    - 1.jpg
    - 2.jpg
    - ...
    - 150.jpg
  - 1
    - 1.jpg
    - 2.jpg
    - ...
    - 150.jpg
  - ...
  - 9
    - 1.jpg
    - 2.jpg
    - ...
    - 150.jpg
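
This is the quick check I used to confirm the folder layout and per-class image counts (a minimal sketch, assuming the ./HWD path above):

from pathlib import Path

# Count images per class folder in each split to confirm the layout
# train.py expects: <root>/<split>/<class>/<image>.jpg
root = Path("./HWD")
for split in ("train", "validation"):
    for class_dir in sorted((root / split).iterdir()):
        if class_dir.is_dir():
            n = sum(1 for _ in class_dir.glob("*.jpg"))
            print(f"{split}/{class_dir.name}: {n} images")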

When training finished, the log said "Best metric: 10.0 (epoch 0)".
The full training log is as follows (see the note on eval_loss after the log):

epoch,train_loss,eval_loss,eval_top1,eval_top5,lr
0,3.4499157269795737,2.3024010416666667,10.0,50.0,1e-06
1,3.624408103801586,2.3024010416666667,8.266666666666667,50.0,0.0032008
2,3.4603706112614385,2.3025208333333333,7.866666666666666,50.0,0.0064006
3,2.9570982544510453,2.302640625,10.0,50.0,0.0096004
4,2.7964119204768427,2.3029270833333335,10.0,50.0,0.0128002
5,2.799766266787494,2.302546875,10.0,49.46666666666667,0.015054399999999999
6,2.5497683684031167,2.3022604166666665,10.0,50.0,0.015054399999999999
7,2.4894364321673357,2.3029270833333335,10.0,49.8,0.015054399999999999
8,2.4517396291097007,2.302640625,10.0,50.0,0.014602768
9,2.4140525658925376,2.3028541666666666,10.0,50.0,0.014602768
....
90,0.9781680151268288,2.3033385416666667,10.0,50.0,0.00518410931230147
91,0.9978100569159897,2.3030052083333334,10.0,50.0,0.00518410931230147
92,0.9757056457025034,2.303671875,10.0,50.0,0.005028586032932427
93,0.9612530094605906,2.3033854166666665,10.0,50.0,0.005028586032932427
94,0.9917539711351748,2.3035520833333334,10.0,50.0,0.004877728451944453
95,0.9673527876536051,2.303765625,10.0,50.0,0.004877728451944453
96,0.957898685225734,2.302765625,10.0,50.0,0.00473139659838612
97,0.9409491817156473,2.303598958333333,10.0,50.0,0.00473139659838612
98,0.9428280658192105,2.302979166666667,10.0,50.0,0.00473139659838612
99,0.9692743840040984,2.3031458333333332,10.0,50.0,0.0045894547004345365
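
As a side note, the eval_loss column stays at roughly 2.3026 for the entire run, which, if I read it right, is exactly the cross-entropy of a uniform prediction over 10 classes, i.e. ln(10):

import math

# Cross-entropy of a uniform 10-class prediction: -ln(1/10) = ln(10)
print(math.log(10))  # 2.302585..., matching the eval_loss column above

So the model being evaluated appears to output near-uniform probabilities, even though train_loss keeps decreasing.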
  • Step 2: Inference

When I do inference with

python inference.py ./output/inference/hwd/input --model efficientnet_b2 --checkpoint ./output/train/20221121-221541-efficientnet_b2-256/last.pth.tar --output_dir ./output/inference/hwd/output --num-classes 10

It produces an unexpected result: every image gets the same top-5 prediction:

0-352.jpg,9,3,5,2,8
0-353.jpg,9,3,5,2,8
1-351.jpg,9,3,5,2,8
1-352.jpg,9,3,5,2,8
2-351.jpg,9,3,5,2,8
2-352.jpg,9,3,5,2,8
3-351.jpg,9,3,5,2,8
3-352.jpg,9,3,5,2,8
4-351.jpg,9,3,5,2,8
4-352.jpg,9,3,5,2,8
5-351.jpg,9,3,5,2,8
5-352.jpg,9,3,5,2,8
6-351.jpg,9,3,5,2,8
6-352.jpg,9,3,5,2,8
7-351.jpg,9,3,5,2,8
7-352.jpg,9,3,5,2,8
8-351.jpg,9,3,5,2,8
8-352.jpg,9,3,5,2,8
9-351.jpg,9,3,5,2,8
9-352.jpg,9,3,5,2,8

What could be the possible cause of this?
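
In case it helps narrow things down, here is the minimal standalone check I can run against the same checkpoint, bypassing inference.py (a sketch; 'some_digit.jpg' is a placeholder for any validation image):

import torch
import timm
from PIL import Image
from timm.data import resolve_data_config, create_transform

# Load the trained checkpoint directly, with the same model args as above.
model = timm.create_model(
    'efficientnet_b2',
    num_classes=10,
    checkpoint_path='./output/train/20221121-221541-efficientnet_b2-256/last.pth.tar',
)
model.eval()

# Build the eval transform from the model's resolved data config.
config = resolve_data_config({}, model=model)
transform = create_transform(**config)

# 'some_digit.jpg' is a placeholder for any image from the validation set.
img = Image.open('some_digit.jpg').convert('RGB')
with torch.no_grad():
    probs = model(transform(img).unsqueeze(0)).softmax(dim=-1)
print(probs.topk(5))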

The training data is here.
