Changes from all commits (119 commits)
6334d1f
adding code to make ddp work
ksreenivasan Apr 26, 2022
9e1f0de
trying out how args work in ddp
ksreenivasan Apr 26, 2022
f616ab6
trying out stuff with parser_args in ddp setting
ksreenivasan Apr 26, 2022
e6b93a6
adding ddp_utils
ksreenivasan Apr 26, 2022
f202a42
adding ddp_utils
ksreenivasan Apr 26, 2022
513b8a6
looks like reimporting doesn't hurt the global values
ksreenivasan Apr 26, 2022
02a6627
adding more code to implement ddp
ksreenivasan Apr 26, 2022
7f1e0f6
testing ddp with deepcopy
ksreenivasan Apr 26, 2022
6cc557b
looks like copy.deepcopy causes sync issues
ksreenivasan Apr 26, 2022
dd0ba73
fixing typo in ddp deepcopy
ksreenivasan Apr 26, 2022
1d21c5a
finishing up ddp for sanity checks and finetune
ksreenivasan Apr 26, 2022
691cffa
minor bugfixes for ddp
ksreenivasan Apr 26, 2022
ffca648
removing .cuda() from the codebase
ksreenivasan Apr 26, 2022
e1c9810
trying to catch the rank=0 leak
ksreenivasan Apr 26, 2022
64cbf57
I think ddp works!
ksreenivasan Apr 26, 2022
eb12a2d
skipping checkpoint, trying to get rid of tqdm
ksreenivasan Apr 28, 2022
f4fbff0
trying to eliminate tqdm and progress meters
ksreenivasan Apr 28, 2022
69048e0
trainer doesn't see parser_args
ksreenivasan Apr 28, 2022
3834b3d
still removing leftover calls to tqdm and progresmeters
ksreenivasan Apr 28, 2022
3ba1f93
removing more instances of progressmeter and replacing them with vani…
ksreenivasan Apr 28, 2022
77fbe48
fixing dir creation bug
ksreenivasan Apr 28, 2022
1ffc000
adding config for serial debug
ksreenivasan Apr 28, 2022
53b90c0
stuff works, removing an unnecessary barrier
ksreenivasan Apr 29, 2022
10b1ee4
adding imagenet configs and exec script
ksreenivasan Apr 29, 2022
38550af
setting find_unused_parameters=False and adding config for sp10
ksreenivasan May 3, 2022
50805e3
pushing config for sp15 imagenet
May 4, 2022
fb1eab0
minor tweaks to configs for imagenet
May 4, 2022
5ccc7a3
adding port as param for multiple runs
ksreenivasan May 4, 2022
9052870
updating port to be a str not int
ksreenivasan May 6, 2022
2e17a43
pushing psutil memory logs
ksreenivasan May 7, 2022
bb604ab
adding more memory logs
ksreenivasan May 7, 2022
7b21e75
adding more psutil logs
ksreenivasan May 7, 2022
7ef0dfe
updating memory logs only for master
May 7, 2022
2e199d4
updating some net_utils functions which are recreating ddp objects
ksreenivasan May 8, 2022
e3a4b8c
updating validate. I think that's where the leak is
ksreenivasan May 8, 2022
da11ccb
adding more fixes to hopefully handle the memory leak
ksreenivasan May 8, 2022
c9b64cc
updating val logs with gpu
ksreenivasan May 8, 2022
f112628
adding debug loop
ksreenivasan May 8, 2022
a0603b3
adding memory log to help debug
ksreenivasan May 8, 2022
90a827b
will reset this commit. but keeping it here for debug
ksreenivasan May 9, 2022
dfdd3d8
pushing ddp debug logt
ksreenivasan May 9, 2022
f817792
modifying data loader to look more like pytorch example
ksreenivasan May 9, 2022
9c5dad2
trying more things with dataloader
ksreenivasan May 9, 2022
0d8fe44
trying to avoid using data loader object
ksreenivasan May 9, 2022
7f995a7
Merge branch 'ddp' of github.com:ksreenivasan/pruning_is_enough into ddp
ksreenivasan May 9, 2022
8e3503b
removing trainer
ksreenivasan May 9, 2022
9853555
trying to see if imports are the issue
ksreenivasan May 9, 2022
a767fb1
removing more imports
ksreenivasan May 9, 2022
df656bc
adding set_seed
ksreenivasan May 9, 2022
1eeab6c
removing parser_args
ksreenivasan May 9, 2022
af1a52a
some more minor attempts at fixes
ksreenivasan May 9, 2022
33580a2
checking vanilla data loader memory
ksreenivasan May 11, 2022
59886f9
Merge branch 'ddp' of github.com:ksreenivasan/pruning_is_enough into ddp
ksreenivasan May 11, 2022
d45d514
looks like vanilla data loader works!
ksreenivasan May 11, 2022
5096528
moving args outside main
ksreenivasan May 11, 2022
04f7f9f
Merge branch 'ddp' of github.com:ksreenivasan/pruning_is_enough into ddp
ksreenivasan May 11, 2022
d1e22fd
trying to avoid mp.spawn()
ksreenivasan May 11, 2022
a91bd06
trying to avoid mp.spawn() in main also
ksreenivasan May 11, 2022
9cb4ae4
changing main to get rank as arg
ksreenivasan May 11, 2022
3a7ea3d
bringing more functionality back in
ksreenivasan May 11, 2022
0726f75
trying original val loop without mp.spawn
ksreenivasan May 12, 2022
fdee77b
trying train without mp.spawn()
ksreenivasan May 12, 2022
e1dda18
minor bugfix
ksreenivasan May 12, 2022
28929b3
making logs camel case
ksreenivasan May 12, 2022
87daf12
minor bugfix
ksreenivasan May 12, 2022
2748435
changing log file name
ksreenivasan May 12, 2022
045387a
distributed launch
PaulCCCCCCH May 12, 2022
8d82298
fixing validate bug and trying to be careful with memory before finetune
ksreenivasan May 12, 2022
815bb55
getting rid of before round acc
ksreenivasan May 12, 2022
dd5a27e
starting from scratch. implementing just GM imagenet from the pytorch…
ksreenivasan May 13, 2022
c78e35d
Merge branch 'ddp' of github.com:ksreenivasan/pruning_is_enough into ddp
ksreenivasan May 13, 2022
c00eca9
minor bugfixes to create model with scores
ksreenivasan May 13, 2022
a40f048
deleting whitespace
ksreenivasan May 13, 2022
721b71c
adding run script
ksreenivasan May 13, 2022
94dc5ef
trying to add mixed_precision to see if I can increase batch size to 1k
ksreenivasan May 13, 2022
8a87780
okay looks like mixed precision did the trick!
ksreenivasan May 13, 2022
ae77d7b
adding timing logs and checking if it works for GM
ksreenivasan May 13, 2022
8d71891
adding few more timing logs
ksreenivasan May 13, 2022
123411d
minor bugfixes in subnetconv
ksreenivasan May 13, 2022
bf6dbe4
adding scaler in main training loop
ksreenivasan May 13, 2022
5f6aafe
adding timing logs to original pytorch example to compare
ksreenivasan May 14, 2022
6e40fca
adding lists to store results
ksreenivasan May 14, 2022
02d3a78
adding regularizer and some other features
ksreenivasan May 14, 2022
1b7313c
adding imports
ksreenivasan May 14, 2022
9834796
minor bugfixes
ksreenivasan May 14, 2022
0e0662a
adding additional project step and reordering reg_loss
ksreenivasan May 14, 2022
9205fb9
adding a commented print to check regularization_loss is working
ksreenivasan May 14, 2022
73f36ab
trying switch_to_wt to sanity check finetune
ksreenivasan May 14, 2022
a4752c5
Merge branch 'ddp' of github.com:ksreenivasan/pruning_is_enough into ddp
ksreenivasan May 14, 2022
04be2b8
adding round_all_ones before switch_to_wt
ksreenivasan May 14, 2022
d2dbd2b
trying gm with switch_to_wt to sanity check the code
ksreenivasan May 14, 2022
910486c
making sure we save model.module not DDP model itself
ksreenivasan May 14, 2022
dc77a55
adding switch_to_prune() and trying kaiming normal init for wt training
ksreenivasan May 15, 2022
f1233e1
adding prune(), get_sparsity etc.
ksreenivasan May 15, 2022
5b447a2
adding model strings to make comparing easier
ksreenivasan May 29, 2022
9cf9105
updating model with bias=False and affine bn
ksreenivasan May 29, 2022
c416362
trying to mimic torchvision resnet50
ksreenivasan May 29, 2022
51594fc
changing last layer to linear
ksreenivasan May 29, 2022
8d52ef8
minor bugfixes adding bias to final fc layer
ksreenivasan May 29, 2022
63f38ce
plugging in subnetconv model into original script to compare
ksreenivasan May 29, 2022
0b6cfdc
minor bugfixes and adding weight init to builder
ksreenivasan May 30, 2022
d263960
adding round_model to complete comparison between mymodel and torchvi…
ksreenivasan May 30, 2022
19d7044
it works! looks like the models are near identical now.
ksreenivasan May 31, 2022
3a7ad5e
changing model back to signed_constant for GM
ksreenivasan Jun 1, 2022
cfd62cb
turns out you need to switch_to_wt before instantiating ddp
ksreenivasan Jun 1, 2022
d9743ec
adding finetune feature
ksreenivasan Jun 1, 2022
188735d
finetuning seems to work well!
ksreenivasan Jun 1, 2022
ccf2867
adding save_model before finetune and bugfixes
ksreenivasan Jun 3, 2022
9d11bb3
trying cifar10 on ddp to debug
ksreenivasan Jun 3, 2022
7a933c9
changing model to resnet20
ksreenivasan Jun 3, 2022
873f1a6
adding get_layers for resnet20
ksreenivasan Jun 3, 2022
a276ffd
minor typo for cifar10 ddp, looks like it works
ksreenivasan Jun 3, 2022
782a85f
print reg_loss only for pruning
ksreenivasan Jun 3, 2022
b139623
adding optimizer and results subdir for hyperparam opt
ksreenivasan Jun 6, 2022
5c35b32
minor typo with opt
ksreenivasan Jun 6, 2022
f6116fe
adding mkdir for subfolder
ksreenivasan Jun 6, 2022
e1c68d3
updating run script
ksreenivasan Jun 7, 2022
20b98ce
minor changes. for cifar10 testing i think
ksreenivasan Jun 18, 2022
6a053b7
Merge branch 'ddp' of github.com:ksreenivasan/pruning_is_enough into ddp
ksreenivasan Jun 18, 2022
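Several commits above wrestle with the DDP wrapper itself — notably 910486c, "making sure we save model.module not DDP model itself". `DistributedDataParallel` stores the original network as `.module`, and saving the wrapper's `state_dict` prefixes every key with `module.`, which breaks loading outside DDP. A minimal sketch of the unwrapping pattern (the helper name is ours, not the repo's):

```python
def unwrap_model(model):
    """Return the underlying network for checkpointing.

    DistributedDataParallel exposes the wrapped network as `.module`;
    saving that inner module's state_dict keeps checkpoint keys free of
    the `module.` prefix. For a plain (non-wrapped) model this is a no-op.
    """
    return getattr(model, "module", model)


# At save time, something like:
#   torch.save(unwrap_model(model).state_dict(), path)
```

The `getattr` fallback lets the same save path serve both the serial-debug and DDP configs added later in this PR.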
58 changes: 40 additions & 18 deletions args_helper.py
@@ -878,43 +878,59 @@ def parse_arguments(self, jupyter_mode=False):
default=0,
help="Use mixed precision or not"
)
parser.add_argument('--transformer_emsize', type=int, default=200,
help='size of word embeddings')
parser.add_argument('--transformer_nhid', type=int, default=200,
help='number of hidden units per layer')
parser.add_argument('--transformer_nlayers', type=int, default=2,
help='number of layers')
parser.add_argument('--transformer_clip', type=float, default=0.25,
help='gradient clipping')
parser.add_argument('--transformer_bptt', type=int, default=35,
help='sequence length')
parser.add_argument('--transformer_dropout', type=float, default=0.2,
help='dropout applied to layers (0 = no dropout)')
parser.add_argument('--transformer_nhead', type=int, default=2,
help='the number of heads in the encoder/decoder of the transformer model')

parser.add_argument('--transformer_emsize',
type=int, default=200,
help='size of word embeddings'
)
parser.add_argument('--transformer_nhid',
type=int,
default=200,
help='number of hidden units per layer'
)
parser.add_argument('--transformer_nlayers',
type=int,
default=2,
help='number of layers'
)
parser.add_argument('--transformer_clip',
type=float,
default=0.25,
help='gradient clipping'
)
parser.add_argument('--transformer_bptt',
type=int,
default=35,
help='sequence length'
)
parser.add_argument('--transformer_dropout',
type=float,
default=0.2,
help='dropout applied to layers (0 = no dropout)'
)
parser.add_argument('--transformer_nhead',
type=int,
default=2,
help='the number of heads in the encoder/decoder of the transformer model'
)
parser.add_argument(
"--only-sanity",
action="store_true",
default=False,
help="Only run sanity checks on the files in specific directory or subdirectories"
)

parser.add_argument(
"--invert-sanity-check",
action="store_true",
default=False,
help="Enable this to run the inverted sanity check (for HC)"
)

parser.add_argument(
"--sanity-folder",
default=None,
type=str,
metavar="PATH",
help="directory(s) to access for only sanity check",
)

parser.add_argument(
"--sr-version",
default=1,
@@ -927,6 +943,12 @@ def parse_arguments(self, jupyter_mode=False):
default=False,
help="Enable this use full train data and not leave anything for validation"
)
parser.add_argument(
"--port",
default=29500,
type=int,
help="Specify port to use for DDP",
)

if jupyter_mode:
args = parser.parse_args("")
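The diff above adds a `--port` flag (commit 5ccc7a3, "adding port as param for multiple runs"), and commit 9052870 notes the value has to reach the environment as a string. A hedged sketch of how such a flag might feed DDP's `env://` rendezvous — the `set_rendezvous_env` helper and the loopback default address are illustrative, not from this PR:

```python
import argparse
import os

parser = argparse.ArgumentParser()
# Mirrors the new flag above: parsed as an int, cast to str below,
# because os.environ (and torch's env:// rendezvous) accept only strings.
parser.add_argument("--port", default=29500, type=int,
                    help="Specify port to use for DDP")


def set_rendezvous_env(args, addr="127.0.0.1"):
    # torch.distributed.init_process_group(init_method="env://") reads
    # MASTER_ADDR and MASTER_PORT from the environment.
    os.environ["MASTER_ADDR"] = addr
    os.environ["MASTER_PORT"] = str(args.port)


args = parser.parse_args(["--port", "29501"])
set_rendezvous_env(args)
```

Passing a distinct port per run is what allows several DDP jobs to share one machine without the rendezvous sockets colliding.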
6 changes: 3 additions & 3 deletions cifar_exec.sh
@@ -64,11 +64,11 @@ BLOCK
#:<<BLOCK
# Using validation to figure out hyperparams
# NOTE: make sure to delete/comment subfolder from the config file or else it may not work
conf_file="configs/training/resnet32/cifar100_resnet32_training"
conf_file="configs/hypercube/resnet20/resnet20_sparsity_1_44_unflagT_real"
conf_end=".yml"
log_root="resnet32_wt_"
log_root="ddp_resnet20_sp1_4_"
log_end="_log"
subfolder_root="resnet32_wt_"
subfolder_root="ddp_resnet20_sp1_4_"

for trial in 1
do
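The loop body is collapsed above, but the root/end variables suggest the usual compose-names-and-launch pattern. A sketch under that assumption — the composition and the `echo` stand in for whatever launch command the script actually runs:

```shell
conf_file="configs/hypercube/resnet20/resnet20_sparsity_1_44_unflagT_real"
conf_end=".yml"
log_root="ddp_resnet20_sp1_4_"
log_end="_log"

for trial in 1; do
    # Compose the per-trial config path and log name from the roots.
    config="${conf_file}${conf_end}"
    log="${log_root}${trial}${log_end}"
    echo "config=${config} log=${log}"
done
```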
63 changes: 63 additions & 0 deletions configs/ddp_debug/conf1.yml
@@ -0,0 +1,63 @@
# subfolder: target_sparsity_0_59_unflagT_real

# Hypercube optimization
algo: 'hc_iter'
iter_period: 5

# Architecture
arch: resnet20

# ===== Dataset ===== #
dataset: CIFAR10
name: resnet20_quantized_iter_hc

# ===== Learning Rate Policy ======== #
optimizer: sgd
lr: 0.1 #0.01
lr_policy: cosine_lr #constant_lr #multistep_lr
fine_tune_lr: 0.01
fine_tune_lr_policy: multistep_lr

# ===== Network training config ===== #
epochs: 20
wd: 0.0
momentum: 0.9
batch_size: 512

# ===== Sparsity =========== #
conv_type: SubnetConv
bn_type: NonAffineBatchNorm
freeze_weights: True
prune_type: BottomK
# enter target sparsity here
target_sparsity: 20
# decide if you want to "unflag"
unflag_before_finetune: True
init: signed_constant
score_init: unif #skew #half #bimodal #skew # bern
scale_fan: False #True

# ===== Rounding ===== #
round: naive
noise: True
noise_ratio: 0

# ===== Quantization ===== #
hc_quantized: True
quantize_threshold: 0.5

# ===== Regularization ===== #
regularization: L2
lmbda: 0.0001 # 1e-4

# ===== Hardware setup ===== #
workers: 8
# gpu: 1
multiprocessing_distributed: True
mixed_precision: True

# ===== Checkpointing ===== #
checkpoint_at_prune: False

# ==== sanity check ==== #
skip_sanity_checks: True
62 changes: 62 additions & 0 deletions configs/ddp_debug/conf2.yml
@@ -0,0 +1,62 @@
# subfolder: target_sparsity_0_59_unflagT_real

# Hypercube optimization
algo: 'hc_iter'
iter_period: 5

# Architecture
arch: resnet20

# ===== Dataset ===== #
dataset: CIFAR10
name: resnet20_quantized_iter_hc

# ===== Learning Rate Policy ======== #
optimizer: sgd
lr: 0.1 #0.01
lr_policy: cosine_lr #constant_lr #multistep_lr
fine_tune_lr: 0.01
fine_tune_lr_policy: multistep_lr

# ===== Network training config ===== #
epochs: 20
wd: 0.0
momentum: 0.9
batch_size: 512

# ===== Sparsity =========== #
conv_type: SubnetConv
bn_type: NonAffineBatchNorm
freeze_weights: True
prune_type: BottomK
# enter target sparsity here
target_sparsity: 20
# decide if you want to "unflag"
unflag_before_finetune: True
init: signed_constant
score_init: unif #skew #half #bimodal #skew # bern
scale_fan: False #True

# ===== Rounding ===== #
round: naive
noise: True
noise_ratio: 0

# ===== Quantization ===== #
hc_quantized: True
quantize_threshold: 0.5

# ===== Regularization ===== #
regularization: L2
lmbda: 0.0001 # 1e-4

# ===== Hardware setup ===== #
workers: 8
gpu: 0
mixed_precision: True

# ===== Checkpointing ===== #
checkpoint_at_prune: False

# ==== sanity check ==== #
skip_sanity_checks: True
@@ -55,7 +55,9 @@ lmbda: 0.0001 # 1e-4 #0.00005 # 5e-5

# ===== Hardware setup ===== #
workers: 4
gpu: 0
# gpu: 0
multiprocessing_distributed: True
mixed_precision: True

# ===== Checkpointing ===== #
checkpoint_at_prune: False
68 changes: 68 additions & 0 deletions configs/hypercube/resnet50/imagenet/resnet50_sparsity_10.yml
@@ -0,0 +1,68 @@
# subfolder: regular_imagenet_resnet50
# trial_num: 1
#lam_finetune_loss: 1
#num_step_finetune: 5

# Hypercube optimization
algo: 'hc_iter'
iter_period: 5

# Architecture
arch: ResNet50

# ===== Dataset ===== #
dataset: ImageNet
name: resnet50_imagenet
data: /home/ubuntu/ILSVRC2012/

# ===== Learning Rate Policy ======== #
optimizer: sgd
lr: 0.4 #0.01
lr_policy: cosine_lr #constant_lr #multistep_lr
fine_tune_lr: 0.04
fine_tune_lr_policy: multistep_lr

# ===== Network training config ===== #
epochs: 88
wd: 0.0
momentum: 0.9
batch_size: 1024
mixed_precision: True

# ===== Sparsity =========== #
conv_type: SubnetConv
bn_type: NonAffineBatchNorm
freeze_weights: True
prune_type: BottomK
# enter target sparsity here
target_sparsity: 10
# decide if you want to "unflag"
unflag_before_finetune: False
init: signed_constant
score_init: unif #skew #half #bimodal #skew # bern
scale_fan: False #True

# ===== Rounding ===== #
round: naive
noise: True
noise_ratio: 0

# ===== Quantization ===== #
hc_quantized: True
quantize_threshold: 0.5

# ===== Regularization ===== #
regularization: L2
lmbda: 0.0000001 # 1e-7

# ===== Hardware setup ===== #
workers: 12
multiprocessing_distributed: True
mixed_precision: True
# gpu: 1

# ===== Checkpointing ===== #
checkpoint_at_prune: False

# ==== sanity check ==== #
skip_sanity_checks: True
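This config pairs `lr: 0.4` with `batch_size: 1024`, which is consistent with the common linear learning-rate scaling heuristic (a 0.1 base rate at batch 256, scaled by 4); `fine_tune_lr: 0.04` is then a flat 10x reduction. A small illustrative check, assuming that heuristic is what motivated the values — the function is ours, not the repo's:

```python
def linear_scaled_lr(base_lr, batch_size, base_batch=256):
    # Linear scaling heuristic: grow the learning rate in proportion
    # to the global batch size relative to a reference batch.
    return base_lr * batch_size / base_batch


# 0.1 base at batch 256, run at the config's global batch of 1024:
# linear_scaled_lr(0.1, 1024) -> 0.4, matching `lr: 0.4` above.
```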
68 changes: 68 additions & 0 deletions configs/hypercube/resnet50/imagenet/resnet50_sparsity_15.yml
@@ -0,0 +1,68 @@
# subfolder: regular_imagenet_resnet50
# trial_num: 1
#lam_finetune_loss: 1
#num_step_finetune: 5

# Hypercube optimization
algo: 'hc_iter'
iter_period: 5

# Architecture
arch: ResNet50

# ===== Dataset ===== #
dataset: ImageNet
name: resnet50_imagenet
data: /home/ubuntu/ILSVRC2012/

# ===== Learning Rate Policy ======== #
optimizer: sgd
lr: 0.4 #0.01
lr_policy: cosine_lr #constant_lr #multistep_lr
fine_tune_lr: 0.04
fine_tune_lr_policy: multistep_lr

# ===== Network training config ===== #
epochs: 88
wd: 0.0
momentum: 0.9
batch_size: 1024
mixed_precision: True

# ===== Sparsity =========== #
conv_type: SubnetConv
bn_type: NonAffineBatchNorm
freeze_weights: True
prune_type: BottomK
# enter target sparsity here
target_sparsity: 15
# decide if you want to "unflag"
unflag_before_finetune: False
init: signed_constant
score_init: unif #skew #half #bimodal #skew # bern
scale_fan: False #True

# ===== Rounding ===== #
round: naive
noise: True
noise_ratio: 0

# ===== Quantization ===== #
hc_quantized: True
quantize_threshold: 0.5

# ===== Regularization ===== #
regularization: L2
lmbda: 0.00000005 # 5e-8

# ===== Hardware setup ===== #
workers: 12
multiprocessing_distributed: True
mixed_precision: True
# gpu: 1

# ===== Checkpointing ===== #
checkpoint_at_prune: False

# ==== sanity check ==== #
skip_sanity_checks: True