All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
- Fixed trainer by default `None` in `DDPAccelerator` (#4915)
- Fixed `LightningOptimizer` exposes optimizer attributes (#5095)
- Added "monitor" key to saved `ModelCheckpoints` (#4383)
- Added `ConfusionMatrix` class interface (#4348)
- Added multiclass AUROC metric (#4236)
- Added global step indexing to the checkpoint name for a better sub-epoch checkpointing experience (#3807)
- Added optimizer hooks in callbacks (#4379)
- Added option to log momentum (#4384)
- Added `current_score` to `ModelCheckpoint.on_save_checkpoint` (#4721)
- Added logging using `self.log` in train and evaluation for epoch end hooks (#4552, #4495, #4439, #4684, #4913)
- Added ability for DDP plugin to modify optimizer state saving (#4675)
- Added casting to python types for numpy scalars when logging hparams (#4647)
- Added `prefix` argument in loggers (#4557)
- Added printing of total num of params, trainable and non-trainable params in ModelSummary (#4521)
- Added `PrecisionRecallCurve`, `ROC`, `AveragePrecision` class metric (#4549)
- Added custom `Apex` and `NativeAMP` as `Precision plugins` (#4355)
- Added `DALI MNIST` example (#3721)
- Added `sharded plugin` for DDP for multi-gpu training memory optimizations (#4639, #4686, #4675, #4737, #4773)
- Added `experiment_id` to the NeptuneLogger (#3462)
- Added `Pytorch Geometric` integration example with Lightning (#4568)
- Added `all_gather` method to `LightningModule` which allows gradient-based tensor synchronizations for use cases such as negative sampling. (#5012)
- Enabled `self.log` in most functions (#4969)
- Added changeable extension variable for `ModelCheckpoint` (#4977)
- Tuner algorithms will be skipped if `fast_dev_run=True` (#3903)
- `WandbLogger` does not force wandb `reinit` arg to True anymore and creates a run only when needed (#4648)
- Changed `automatic_optimization` to be a model attribute (#4602)
- Changed `Simple Profiler` report to order by percentage time spent + num calls (#4880)
- Simplified optimization logic (#4984)
- Classification metrics overhaul (#4837)
- Updated `fast_dev_run` to accept integer representing num_batches (#4629)
- Refactored optimizer (#4658)
- Deprecated `prefix` argument in `ModelCheckpoint` (#4765)
- Deprecated the old way of assigning hyper-parameters through `self.hparams = ...` (#4813)
- Deprecated `mode='auto'` from `ModelCheckpoint` and `EarlyStopping` (#4695)
- Removed `reorder` parameter of the `auc` metric (#5004)
- Removed `multiclass_roc` and `multiclass_precision_recall_curve`, use `roc` and `precision_recall_curve` instead (#4549)
- Added feature to move tensors to CPU before saving (#4309)
- Fixed `LoggerConnector` to have logged metrics on root device in DP (#4138)
- Auto convert tensors to contiguous format when `gather_all` (#4907)
- Fixed `PYTHONPATH` for ddp test model (#4528)
- Fixed allowing logger to support indexing (#4595)
- Fixed DDP and manual_optimization (#4976)
- Added casting to python types for numpy scalars when logging `hparams` (#4647)
- Added warning when progress bar refresh rate is less than 20 on Google Colab to prevent crashing (#4654)
- Added `F1` class metric (#4656)
- Consistently use `step=trainer.global_step` in `LearningRateMonitor` independently of `logging_interval` (#4376)
- Metric states are no longer as default added to `state_dict` (#4685)
- Renamed class metric `Fbeta` >> `FBeta` (#4656)
- Model summary: add 1 decimal place (#4745)
- Do not override `PYTHONWARNINGS` (#4700)
- Changed `init_ddp_connection` moved from `DDP` to `DDPPlugin` (#4407)
- Fixed checkpoint `hparams` dict casting when `omegaconf` is available (#4770)
- Fixed incomplete progress bars when total batches not divisible by refresh rate (#4577)
- Updated SSIM metric (#4566)(#4656)
- Fixed batch_arg_name - add `batch_arg_name` to all calls to `_adjust_batch_size` bug (#4812)
- Fixed `torchtext` data to GPU (#4785)
- Fixed a crash bug in MLFlow logger (#4716)
- Added lambda closure to `manual_optimizer_step` (#4618)
- Change Metrics `persistent` default mode to `False` (#4685)
- LoggerConnector log_metrics will use `total_batch_idx` instead of `global_step` when logging on `training step` (#4738)
- Prevent crash if `sync_dist=True` on CPU (#4626)
- Fixed average pbar Metrics (#4534)
- Fixed `setup` callback hook to correctly pass the LightningModule through (#4608)
- Allowing decorate model init with saving `hparams` inside (#4662)
- Fixed `split_idx` set by `LoggerConnector` in `on_trainer_init` to `Trainer` (#4697)
- Added metrics aggregation in Horovod and fixed early stopping (#3775)
- Added `manual_optimizer_step` which works with `AMP Native` and `accumulated_grad_batches` (#4485)
- Added `persistent(mode)` method to metrics, to enable and disable metric states being added to `state_dict` (#4482)
- Added congratulations at the end of our notebooks (#4555)
- Added `move_metrics_to_cpu` parameter in Trainer to disable gpu leak (#4592)
- Changed `fsspec` to tuner (#4458)
- Unify SLURM/TorchElastic under backend plugin (#4578, #4580, #4581, #4582, #4583)
- Fixed feature-lack in `hpc_load` (#4526)
- Fixed metrics states being overridden in DDP mode (#4482)
- Fixed `lightning_getattr`, `lightning_hasattr` not finding the correct attributes in datamodule (#4347)
- Fixed automatic optimization AMP by `manual_optimization_step` (#4485)
- Replace `MisconfigurationException` with warning in `ModelCheckpoint` callback (#4560)
- Fixed logged keys in mlflow logger (#4412)
- Fixed `is_picklable` by catching `AttributeError` (#4508)
- Fixed multi test dataloaders dict `AttributeError` error (#4480)
- Fixed show progress bar only for `progress_rank 0` on `DDP_SLURM` (#4437)
- Added PyTorch 1.7 Stable support (#3821)
- Added timeout for `tpu_device_exists` to ensure process does not hang indefinitely (#4340)
- W&B log in sync with `Trainer` step (#4405)
- Hook `on_after_backward` is called only when `optimizer_step` is being called (#4439)
- Moved `track_and_norm_grad` into `training loop` and called only when `optimizer_step` is being called (#4439)
- Changed type checker with explicit cast of `ref_model` object (#4457)
- Changed `distributed_backend` -> `accelerator` (#4429)
- Deprecated passing `ModelCheckpoint` instance to `checkpoint_callback` Trainer argument (#4336)
- Disable saving checkpoints if not trained (#4372)
- Fixed error using `auto_select_gpus=True` with `gpus=-1` (#4209)
- Disabled training when `limit_train_batches=0` (#4371)
- Fixed that metrics do not store computational graph for all seen data (#4313)
- Fixed AMP unscale for `on_after_backward` (#4439)
- Fixed TorchScript export when module includes Metrics (#4428)
- Fixed TorchScript trace method's data to device and docstring (#4360)
- Fixed CSV logger warning (#4419)
- Fixed skip DDP parameter sync (#4301)
- Fixed `WandbLogger` `_sanitize_callable` function (#4422)
- Fixed `AMP Native` `_unscale` gradient (#4441)
- Added `dirpath` and `filename` parameter in `ModelCheckpoint` (#4213)
- Added plugins docs and DDPPlugin to customize ddp across all accelerators (#4258)
- Added `strict` option to the scheduler dictionary (#3586)
- Added `fsspec` support for profilers (#4162)
- Added autogenerated helptext to `Trainer.add_argparse_args` (#4344)
- Added support for string values in `Trainer`'s `profiler` parameter (#3656)
- Added `optimizer_closure` to `optimizer.step` when supported (#4190)
- Added unification of regression metrics (#4166)
- Added checkpoint load from Bytes (#4314)
- Improved error messages for invalid `configure_optimizers` returns (#3587)
- Allow changing the logged step value in `validation_step` (#4130)
- Allow setting `replace_sampler_ddp=True` with a distributed sampler already added (#4273)
- Fixed sanitized parameters for `WandbLogger.log_hyperparams` (#4320)
- Deprecated `filepath` in `ModelCheckpoint` (#4213)
- Deprecated `reorder` parameter of the `auc` metric (#4237)
- Deprecated bool values in `Trainer`'s `profiler` parameter (#3656)
- Fixed setting device ids in DDP (#4297)
- Fixed synchronization of best model path in `ddp_accelerator` (#4323)
- Fixed `WandbLogger` not uploading checkpoint artifacts at the end of training (#4341)
- Fixed `FBeta` computation (#4183)
- Fixed `accumulation across batches` has completed `before breaking training loop` (#4278)
- Fixed `ModelCheckpoint` don't increase current_epoch and global_step when not training (#4291)
- Fixed `COMET_EXPERIMENT_KEY` environment variable usage in comet logger (#4230)
- Added persistent flag to `Metric.add_state` (#4195)
- Added trace functionality to the function `to_torchscript` (#4142)
- Called `on_load_checkpoint` before loading `state_dict` (#4057)
- Removed duplicate metric vs step log for train loop (#4173)
- Fixed the `self.log` problem in `validation_step()` (#4169)
- Fixed `hparams` saving - save the state when `save_hyperparameters()` is called [in `__init__`] (#4163)
- Fixed runtime failure while exporting `hparams` to yaml (#4158)
- Added getstate/setstate method for torch.save serialization (#4127)
- Added Explained Variance Metric + metric fix (#4013)
- Added Metric <-> Lightning Module integration tests (#4008)
- Added parsing OS env vars in `Trainer` (#4022)
- Added classification metrics (#4043)
- Updated explained variance metric (#4024)
- Enabled plugins (#4041)
- Enabled custom clusters (#4048)
- Enabled passing in custom accelerators (#4050)
- Added `LightningModule.toggle_optimizer` (#4058)
- Added `LightningModule.manual_backward` (#4063)
- Added `output` argument to `*_batch_end` hooks (#3965, #3966)
- Added `output` argument to `*_epoch_end` hooks (#3967)
- Integrated metrics API with self.log (#3961)
- Decoupled Apex (#4052, #4054, #4055, #4056, #4058, #4060, #4061, #4062, #4063, #4064, #4065)
- Renamed all backends to `Accelerator` (#4066)
- Enabled manual returns (#4089)
- Removed support for EvalResult and TrainResult (#3968)
- Removed deprecated trainer flags: `overfit_pct`, `log_save_interval`, `row_log_interval` (#3969)
- Removed deprecated early_stop_callback (#3982)
- Removed deprecated model hooks (#3980)
- Removed deprecated callbacks (#3979)
- Removed `trainer` argument in `LightningModule.backward` (#4056)
- Fixed `current_epoch` property update to reflect true epoch number inside `LightningDataModule`, when `reload_dataloaders_every_epoch=True`. (#3974)
- Fixed to print scaler value in progress bar (#4053)
- Fixed mismatch between docstring and code regarding when `on_load_checkpoint` hook is called (#3996)
- Added new Metrics API. (#3868, #3921)
- Enable PyTorch 1.7 compatibility (#3541)
- Added `LightningModule.to_torchscript` to support exporting as `ScriptModule` (#3258)
- Added warning when dropping unpicklable `hparams` (#2874)
- Added EMB similarity (#3349)
- Added `ModelCheckpoint.to_yaml` method (#3048)
- Allow `ModelCheckpoint` monitor to be `None`, meaning it will always save (#3630)
- Disabled optimizers setup during testing (#3059)
- Added support for datamodules to save and load checkpoints when training (#3563)
- Added support for datamodule in learning rate finder (#3425)
- Added gradient clip test for native AMP (#3754)
- Added dist lib to enable syncing anything across devices (#3762)
- Added `broadcast` to `TPUBackend` (#3814)
- Added `XLADeviceUtils` class to check XLA device type (#3274)
- Refactored accelerator backends:
  - moved TPU `xxx_step` to backend (#3118)
  - refactored DDP backend `forward` (#3119)
  - refactored GPU backend `__step` (#3120)
  - refactored Horovod backend (#3121, #3122)
  - remove obscure forward call in eval + CPU backend `___step` (#3123)
  - reduced all simplified forward (#3126)
  - added hook base method (#3127)
  - refactor eval loop to use hooks - use `test_mode` for if so we can split later (#3129)
  - moved `___step_end` hooks (#3130)
  - training forward refactor (#3134)
  - training AMP scaling refactor (#3135)
  - eval step scaling factor (#3136)
  - add eval loop object to streamline eval loop (#3138)
  - refactored dataloader process hook (#3139)
  - refactored inner eval loop (#3141)
  - final inner eval loop hooks (#3154)
  - clean up hooks in `run_evaluation` (#3156)
  - clean up data reset (#3161)
  - expand eval loop out (#3165)
  - moved hooks around in eval loop (#3195)
  - remove `_evaluate` fx (#3197)
  - `Trainer.fit` hook clean up (#3198)
  - DDPs train hooks (#3203)
  - refactor DDP backend (#3204, #3207, #3208, #3209, #3210)
  - reduced accelerator selection (#3211)
  - group prepare data hook (#3212)
  - added data connector (#3285)
  - modular is_overridden (#3290)
  - adding `Trainer.tune()` (#3293)
  - move `run_pretrain_routine` -> `setup_training` (#3294)
  - move train outside of setup training (#3297)
  - move `prepare_data` to data connector (#3307)
  - moved accelerator router (#3309)
  - train loop refactor - moving train loop to own object (#3310, #3312, #3313, #3314)
  - duplicate data interface definition up into DataHooks class (#3344)
  - inner train loop (#3359, #3361, #3362, #3363, #3365, #3366, #3367, #3368, #3369, #3370, #3371, #3372, #3373, #3374, #3375, #3376, #3385, #3388, #3397)
  - all logging related calls in a connector (#3395)
  - device parser (#3400, #3405)
  - added model connector (#3407)
  - moved eval loop logging to loggers (#3408)
  - moved eval loop (#3412, #3408)
  - trainer/separate argparse (#3421, #3428, #3432)
  - move `lr_finder` (#3434)
  - organize args (#3435, #3442, #3447, #3448, #3449, #3456)
  - move specific accelerator code (#3457)
  - group connectors (#3472)
  - accelerator connector methods x/n (#3469, #3470, #3474)
  - merge backends x/n (#3476, #3477, #3478, #3480, #3482)
  - apex plugin (#3502)
  - precision plugins (#3504)
  - Result - make monitor default to `checkpoint_on` to simplify (#3571)
  - reference to the Trainer on the `LightningDataModule` (#3684)
  - add `.log` to lightning module (#3686, #3699, #3701, #3704, #3715)
  - enable tracking original metric when step and epoch are both true (#3685)
  - deprecated results obj, added support for simpler comms (#3681)
  - move backends back to individual files (#3712)
  - fixes logging for eval steps (#3763)
  - decoupled DDP, DDP spawn (#3733, #3766, #3767, #3774, #3802, #3806)
  - remove weight loading hack for ddp_cpu (#3808)
  - separate `torchelastic` from DDP (#3810)
  - separate SLURM from DDP (#3809)
  - decoupled DDP2 (#3816)
  - bug fix with logging val epoch end + monitor (#3812)
  - decoupled DDP, DDP spawn (#3733, #3817, #3819, #3927)
  - callback system and init DDP (#3836)
  - adding compute environments (#3837, #3842)
  - epoch can now log independently (#3843)
  - test selecting the correct backend. temp backends while slurm and TorchElastic are decoupled (#3848)
  - fixed `init_slurm_connection` causing hostname errors (#3856)
  - moves init apex from LM to apex connector (#3923)
  - moves sync bn to each backend (#3925)
  - moves configure ddp to each backend (#3924)
  - moved TPU
  - Deprecation warning (#3844)
- Changed `LearningRateLogger` to `LearningRateMonitor` (#3251)
- Used `fsspec` instead of `gfile` for all IO (#3320)
  - Swapped `torch.load` for `fsspec` load in DDP spawn backend (#3787)
  - Swapped `torch.load` for `fsspec` load in cloud_io loading (#3692)
  - Added support for `to_disk()` to use remote filepaths with `fsspec` (#3930)
  - Updated model_checkpoint's to_yaml to use `fsspec` open (#3801)
  - Fixed `fsspec` is inconsistent when doing `fs.ls` (#3805)
- Refactor `GPUStatsMonitor` to improve training speed (#3257)
- Changed IoU score behavior for classes absent in target and pred (#3098)
- Changed IoU `remove_bg` bool to `ignore_index` optional int (#3098)
- Changed defaults of `save_top_k` and `save_last` to `None` in ModelCheckpoint (#3680)
- `row_log_interval` and `log_save_interval` are now based on training loop's `global_step` instead of epoch-internal batch index (#3667)
- Silenced some warnings. verified ddp refactors (#3483)
- Cleaning up stale logger tests (#3490)
- Allow `ModelCheckpoint` monitor to be `None` (#3633)
- Enable `None` model checkpoint default (#3669)
- Skipped `best_model_path` if `checkpoint_callback` is `None` (#2962)
- Used `raise .. from ..` to explicitly chain exceptions (#3750)
- Mocking loggers (#3596, #3617, #3851, #3859, #3884, #3853, #3910, #3889, #3926)
- Write predictions in LightningModule instead of EvalResult (#3882)
- Deprecated `TrainResult` and `EvalResult`, use `self.log` and `self.write` from the `LightningModule` to log metrics and write predictions. `training_step` can now only return a scalar (for the loss) or a dictionary with anything you want. (#3681)
- Deprecate `early_stop_callback` Trainer argument (#3845)
- Rename Trainer arguments `row_log_interval` >> `log_every_n_steps` and `log_save_interval` >> `flush_logs_every_n_steps` (#3748)
- Removed experimental Metric API (#3868, #3943, #3949, #3946), listed changes before final removal:
  - Added `EmbeddingSimilarity` metric (#3349, #3358)
  - Added hooks to metric module interface (#2528)
  - Added error when AUROC metric is used for multiclass problems (#3350)
  - Fixed `ModelCheckpoint` with `save_top_k=-1` option not tracking the best models when a monitor metric is available (#3735)
  - Fixed counter-intuitive error being thrown in `Accuracy` metric for zero target tensor (#3764)
  - Fixed aggregation of metrics (#3517)
  - Fixed Metric aggregation (#3321)
  - Fixed RMSLE metric (#3188)
  - Renamed `reduction` to `class_reduction` in classification metrics (#3322)
  - Changed `class_reduction` similar to sklearn for classification metrics (#3322)
  - Renaming of precision recall metric (#3308)
- Fixed `on_train_batch_start` hook to end epoch early (#3700)
- Fixed `num_sanity_val_steps` is clipped to `limit_val_batches` (#2917)
- Fixed ONNX model save on GPU (#3145)
- Fixed `GpuUsageLogger` to work on different platforms (#3008)
- Fixed auto-scale batch size not dumping `auto_lr_find` parameter (#3151)
- Fixed `batch_outputs` with optimizer frequencies (#3229)
- Fixed setting batch size in `LightningModule.datamodule` when using `auto_scale_batch_size` (#3266)
- Fixed Horovod distributed backend compatibility with native AMP (#3404)
- Fixed batch size auto scaling exceeding the size of the dataset (#3271)
- Fixed getting `experiment_id` from MLFlow only once instead of each training loop (#3394)
- Fixed `overfit_batches` which now correctly disables shuffling for the training loader. (#3501)
- Fixed gradient norm tracking for `row_log_interval > 1` (#3489)
- Fixed `ModelCheckpoint` name formatting (#3164)
- Fixed auto-scale batch size (#3151)
- Fixed example implementation of AutoEncoder (#3190)
- Fixed invalid paths when remote logging with TensorBoard (#3236)
- Fixed change `t()` to `transpose()` as XLA devices do not support `.t()` on 1-dim tensor (#3252)
- Fixed (weights only) checkpoints loading without PL (#3287)
- Fixed `gather_all_tensors` cross GPUs in DDP (#3319)
- Fixed CometML save dir (#3419)
- Fixed forward key metrics (#3467)
- Fixed normalize mode at confusion matrix (replace NaNs with zeros) (#3465)
- Fixed global step increment in training loop when `training_epoch_end` hook is used (#3673)
- Fixed dataloader shuffling not getting turned off with `overfit_batches > 0` and `distributed_backend = "ddp"` (#3534)
- Fixed determinism in `DDPSpawnBackend` when using `seed_everything` in main process (#3335)
- Fixed `ModelCheckpoint` `period` to actually save every `period` epochs (#3630)
- Fixed `val_progress_bar` total with `num_sanity_val_steps` (#3751)
- Fixed Tuner dump: add `current_epoch` to dumped_params (#3261)
- Fixed `current_epoch` and `global_step` properties mismatch between `Trainer` and `LightningModule` (#3785)
- Fixed learning rate scheduler for optimizers with internal state (#3897)
- Fixed `tbptt_reduce_fx` when non-floating tensors are logged (#3796)
- Fixed model checkpoint frequency (#3852)
- Fixed logging non-tensor scalar with result breaks subsequent epoch aggregation (#3855)
- Fixed `TrainerEvaluationLoopMixin` activates `model.train()` at the end (#3858)
- Fixed `overfit_batches` when used with multiple val/test_dataloaders (#3857)
- Fixed enables `training_step` to return `None` (#3862)
- Fixed init nan for checkpointing (#3863)
- Fixed for `load_from_checkpoint` (#2776)
- Fixes incorrect `batch_sizes` when Dataloader returns a dict with multiple tensors (#3668)
- Fixed unexpected signature for `validation_step` (#3947)
- Added SyncBN for DDP (#2801, #2838)
- Added basic `CSVLogger` (#2721)
- Added SSIM metrics (#2671)
- Added BLEU metrics (#2535)
- Added support to export a model to ONNX format (#2596)
- Added support for `Trainer(num_sanity_val_steps=-1)` to check all validation data before training (#2246)
- Added struct. output:
- Added class `LightningDataModule` (#2668)
- Added support for PyTorch 1.6 (#2745)
- Added call DataModule hooks implicitly in trainer (#2755)
- Added support for Mean in DDP Sync (#2568)
- Added remaining `sklearn` metrics: `AveragePrecision`, `BalancedAccuracy`, `CohenKappaScore`, `DCG`, `Hamming`, `Hinge`, `Jaccard`, `MeanAbsoluteError`, `MeanSquaredError`, `MeanSquaredLogError`, `MedianAbsoluteError`, `R2Score`, `MeanPoissonDeviance`, `MeanGammaDeviance`, `MeanTweedieDeviance`, `ExplainedVariance` (#2562)
- Added support for `limit_{mode}_batches (int)` to work with infinite dataloader (IterableDataset) (#2840)
- Added support returning python scalars in DP (#1935)
- Added support to Tensorboard logger for OmegaConf `hparams` (#2846)
- Added tracking of basic states in `Trainer` (#2541)
- Tracks all outputs including TBPTT and multiple optimizers (#2890)
- Added GPU Usage Logger (#2932)
- Added `strict=False` for `load_from_checkpoint` (#2819)
- Added saving test predictions on multiple GPUs (#2926)
- Auto log the computational graph for loggers that support this (#3003)
- Added warning when changing monitor and using results obj (#3014)
- Added a hook `transfer_batch_to_device` to the `LightningDataModule` (#3038)
- Truncated long version numbers in progress bar (#2594)
- Enabling val/test loop disabling (#2692)
- Refactored into `accelerator` module:
- Using `.comet.config` file for `CometLogger` (#1913)
- Updated hooks arguments - breaking for `setup` and `teardown` (#2850)
- Using `gfile` to support remote directories (#2164)
- Moved optimizer creation after device placement for DDP backends (#2904)
- Support `**DictConfig` for `hparam` serialization (#2519)
- Removed callback metrics from test results obj (#2994)
- Re-enabled naming metrics in ckpt name (#3060)
- Changed progress bar epoch counting to start from 0 (#3061)
- Deprecated Trainer attribute `ckpt_path`, which will now be set by `weights_save_path` (#2681)
- Removed deprecated: (#2760)
  - core decorator `data_loader`
  - Module hook `on_sanity_check_start` and loading `load_from_metrics`
  - package `pytorch_lightning.logging`
  - Trainer arguments: `show_progress_bar`, `num_tpu_cores`, `use_amp`, `print_nan_grads`
  - LR Finder argument `num_accumulation_steps`
- Fixed `accumulate_grad_batches` for last batch (#2853)
- Fixed setup call while testing (#2624)
- Fixed local rank zero casting (#2640)
- Fixed single scalar return from training (#2587)
- Fixed Horovod backend to scale LR schedulers with the optimizer (#2626)
- Fixed `dtype` and `device` properties not getting updated in submodules (#2657)
- Fixed `fast_dev_run` to run for all dataloaders (#2581)
- Fixed `save_dir` in loggers getting ignored by default value of `weights_save_path` when user did not specify `weights_save_path` (#2681)
- Fixed `weights_save_path` getting ignored when `logger=False` is passed to Trainer (#2681)
- Fixed TPU multi-core and Float16 (#2632)
- Fixed test metrics not being logged with `LoggerCollection` (#2723)
- Fixed data transfer to device when using `torchtext.data.Field` and `include_lengths is True` (#2689)
- Fixed shuffle argument for distributed sampler (#2789)
- Fixed logging interval (#2694)
- Fixed loss value in the progress bar is wrong when `accumulate_grad_batches > 1` (#2738)
- Fixed correct CWD for ddp sub-processes when using Hydra (#2719)
- Fixed selecting GPUs using `CUDA_VISIBLE_DEVICES` (#2739, #2796)
- Fixed false `num_classes` warning in metrics (#2781)
- Fixed shell injection vulnerability in subprocess call (#2786)
- Fixed LR finder and `hparams` compatibility (#2821)
- Fixed `ModelCheckpoint` not saving the latest information when `save_last=True` (#2881)
- Fixed ImageNet example: learning rate scheduler, number of workers and batch size when using DDP (#2889)
- Fixed apex gradient clipping (#2829)
- Fixed save apex scaler states (#2828)
- Fixed a model loading issue with inheritance and variable positional arguments (#2911)
- Fixed passing `non_blocking=True` when transferring a batch object that does not support it (#2910)
- Fixed checkpointing to remote file paths (#2925)
- Fixed adding val step argument to metrics (#2986)
- Fixed an issue that caused `Trainer.test()` to stall in ddp mode (#2997)
- Fixed gathering of results with tensors of varying shape (#3020)
- Fixed batch size auto-scaling feature to set the new value on the correct model attribute (#3043)
- Fixed automatic batch scaling not working with half precision (#3045)
- Fixed setting device to root gpu (#3042)
- Removed auto val reduce (#2462)
- Flattening Wandb Hyperparameters (#2459)
- Fixed using the same DDP python interpreter and actually running (#2482)
- Fixed model summary input type conversion for models that have input dtype different from model parameters (#2510)
- Made `TensorBoardLogger` and `CometLogger` pickleable (#2518)
- Fixed a problem with `MLflowLogger` creating multiple run folders (#2502)
- Fixed global_step increment (#2455)
- Fixed TPU hanging example (#2488)
- Fixed `argparse` default value bug (#2526)
- Fixed Dice and IoU to avoid NaN by adding small eps (#2545)
- Fixed accumulate gradients schedule at epoch 0 (continued) (#2513)
- Fixed Trainer `.fit()` returning last not best weights in "ddp_spawn" (#2565)
- Fixed passing (do not pass) TPU weights back on test (#2566)
- Fixed DDP tests and `.test()` (#2512, #2570)
- Added reduce ddp results on eval (#2434)
- Added a warning when an `IterableDataset` has `__len__` defined (#2437)
- Enabled no returns from eval (#2446)
- Fixes train outputs (#2428)
- Fixes Conda dependencies (#2412)
- Fixed Apex scaling with decoupled backward (#2433)
- Fixed crashing or wrong displaying progressbar because of missing ipywidgets (#2417)
- Fixed TPU saving dir (fc26078e, 04e68f02)
- Fixed logging on rank 0 only (#2425)
- Added TorchText support for moving data to GPU (#2379)
- Changed epoch indexing from 0 instead of 1 (#2289)
- Refactor Model `backward` (#2276)
- Refactored `training_batch` + tests to verify correctness (#2327, #2328)
- Refactored training loop (#2336)
- Made optimization steps for hooks (#2363)
- Changed default apex level to 'O2' (#2362)
- Moved `TrainsLogger` to Bolts (#2384)
- Fixed parsing TPU arguments and TPU tests (#2094)
- Fixed number batches in case of multiple dataloaders and `limit_{*}_batches` (#1920, #2226)
- Fixed an issue with forward hooks not being removed after model summary (#2298)
- Fix for `load_from_checkpoint()` not working with absolute path on Windows (#2294)
- Fixed an issue how `_has_len` handles `NotImplementedError` e.g. raised by `torchtext.data.Iterator` (#2293), (#2307)
- Fixed `average_precision` metric (#2319)
- Fixed ROC metric for CUDA tensors (#2304)
- Fixed lost compatibility with custom datatypes implementing `.to` (#2335)
.to(#2335) - Fixed loading model with kwargs (#2387)
- Fixed sum(0) for `trainer.num_val_batches` (#2268)
- Fixed checking if the parameters are a `DictConfig` Object (#2216)
- Fixed SLURM weights saving (#2341)
- Fixed swaps LR scheduler order (#2356)
- Fixed adding tensorboard `hparams` logging test (#2342)
- Fixed use model ref for tear down (#2360)
- Fixed logger crash on DDP (#2388)
- Fixed several issues with early stopping and checkpoint callbacks (#1504, #2391)
- Fixed loading past checkpoints from v0.7.x (#2405)
- Fixed loading model without arguments (#2403)
- Fixed Windows compatibility issue (#2358)
- Fixed the `load_from_checkpoint` path detected as URL bug (#2244)
- Fixed hooks - added barrier (#2245, #2257, #2260)
- Fixed `hparams` - remove frame inspection on `self.hparams` (#2253)
- Fixed setup and on fit calls (#2252)
- Fixed GPU template (#2255)
- Added `overfit_batches`, `limit_{val|test}_batches` flags (overfit now uses training set for all three) (#2213)
- Added metrics
- Added type hints in `Trainer.fit()` and `Trainer.test()` to reflect that also a list of dataloaders can be passed in (#1723)
- Allow dataloaders without sampler field present (#1907)
- Added option `save_last` to save the model at the end of every epoch in `ModelCheckpoint` (#1908)
- Early stopping checks `on_validation_end` (#1458)
- Attribute `best_model_path` to `ModelCheckpoint` for storing and later retrieving the path to the best saved model file (#1799)
- Speed up single-core TPU training by loading data using `ParallelLoader` (#2033)
- Added a model hook `transfer_batch_to_device` that enables moving custom data structures to the target device (#1756)
- Added black formatter for the code with code-checker on pull (#1610)
- Added back the slow spawn ddp implementation as `ddp_spawn` (#2115)
- Added loading checkpoints from URLs (#1667)
- Added a callback method `on_keyboard_interrupt` for handling KeyboardInterrupt events during training (#2134)
- Added a decorator `auto_move_data` that moves data to the correct device when using the LightningModule for inference (#1905)
- Added `ckpt_path` option to `LightningModule.test(...)` to load particular checkpoint (#2190)
- Added `setup` and `teardown` hooks for model (#2229)
- Allow user to select individual TPU core to train on (#1729)
- Removed non-finite values from loss in `LRFinder` (#1862)
- Allow passing model hyperparameters as complete kwarg list (#1896)
- Renamed `ModelCheckpoint`'s attributes `best` to `best_model_score` and `kth_best_model` to `kth_best_model_path` (#1799)
- Re-Enable Logger's `ImportError`s (#1938)
- Changed the default value of the Trainer argument `weights_summary` from `full` to `top` (#2029)
- Raise an error when lightning replaces an existing sampler (#2020)
- Enabled `prepare_data` from correct processes - clarify local vs global rank (#2166)
- Remove explicit flush from tensorboard logger (#2126)
- Changed epoch indexing from 1 instead of 0 (#2206)
- Deprecated flags: (#2213)
  - `overfit_pct` in favour of `overfit_batches`
  - `val_percent_check` in favour of `limit_val_batches`
  - `test_percent_check` in favour of `limit_test_batches`
- Deprecated `ModelCheckpoint`'s attributes `best` and `kth_best_model` (#1799)
- Dropped official support/testing for older PyTorch versions <1.3 (#1917)
- Deprecated Trainer `proc_rank` in favour of `global_rank` (#2166, #2269)
- Removed unintended Trainer argument `progress_bar_callback`, the callback should be passed in by `Trainer(callbacks=[...])` instead (#1855)
- Removed obsolete `self._device` in Trainer (#1849)
- Removed deprecated API (#2073)
  - Packages: `pytorch_lightning.pt_overrides`, `pytorch_lightning.root_module`
  - Modules: `pytorch_lightning.logging.comet_logger`, `pytorch_lightning.logging.mlflow_logger`, `pytorch_lightning.logging.test_tube_logger`, `pytorch_lightning.overrides.override_data_parallel`, `pytorch_lightning.core.model_saving`, `pytorch_lightning.core.root_module`
  - Trainer arguments: `add_row_log_interval`, `default_save_path`, `gradient_clip`, `nb_gpu_nodes`, `max_nb_epochs`, `min_nb_epochs`, `nb_sanity_val_steps`
  - Trainer attributes: `nb_gpu_nodes`, `num_gpu_nodes`, `gradient_clip`, `max_nb_epochs`, `min_nb_epochs`, `nb_sanity_val_steps`, `default_save_path`, `tng_tqdm_dic`
- Run graceful training teardown on interpreter exit (#1631)
- Fixed user warning when apex was used together with learning rate schedulers (#1873)
- Fixed multiple calls of `EarlyStopping` callback (#1863)
- Fixed an issue with `Trainer.from_argparse_args` when passing in unknown Trainer args (#1932)
- Fixed bug related to logger not being reset correctly for model after tuner algorithms (#1933)
- Fixed root node resolution for SLURM cluster with dash in host name (#1954)
- Fixed `LearningRateLogger` in multi-scheduler setting (#1944)
- Fixed test configuration check and testing (#1804)
- Fixed an issue with Trainer constructor silently ignoring unknown/misspelled arguments (#1820)
- Fixed `save_weights_only` in ModelCheckpoint (#1780)
- Allow use of same `WandbLogger` instance for multiple training loops (#2055)
- Fixed an issue with `_auto_collect_arguments` collecting local variables that are not constructor arguments and not working for signatures that have the instance not named `self` (#2048)
- Fixed mistake in parameters' grad norm tracking (#2012)
- Fixed CPU and hanging GPU crash (#2118)
- Fixed an issue with the model summary and `example_input_array` depending on a specific ordering of the submodules in a LightningModule (#1773)
- Fixed Tpu logging (#2230)
- Fixed Pid port + duplicate `rank_zero` logging (#2140, #2231)
- Added callback for logging learning rates (#1498)
- Added transfer learning example (for a binary classification task in computer vision) (#1564)
- Added type hints in `Trainer.fit()` and `Trainer.test()` to reflect that also a list of dataloaders can be passed in (#1723)
- Added auto scaling of batch size (#1638)
- The progress bar metrics now also get updated in `training_epoch_end` (#1724)
- Enable `NeptuneLogger` to work with `distributed_backend=ddp` (#1753)
- Added option to provide seed to random generators to ensure reproducibility (#1572)
- Added override for hparams in `load_from_ckpt` (#1797)
- Added support for multi-node distributed execution under `torchelastic` (#1811, #1818)
- Added using `store_true` for bool args (#1822, #1842)
- Added dummy logger for internally disabling logging for some features (#1836)
- Enable `non-blocking` for device transfers to GPU (#1843)
- Replace meta_tags.csv with hparams.yaml (#1271)
- Reduction when `batch_size < num_gpus` (#1609)
- Updated LightningTemplateModel to look more like Colab example (#1577)
- Don't convert `namedtuple` to `tuple` when transferring the batch to target device (#1589)
- Allow passing hparams as keyword argument to LightningModule when loading from checkpoint (#1639)
- Args should come after the last positional argument (#1807)
- Made ddp the default if no backend specified with multiple GPUs (#1789)
- Deprecated `tags_csv` in favor of `hparams_file` (#1271)
- Fixed broken link in PR template (#1675)
- Fixed `ModelCheckpoint` not checking `filepath` for `None` (#1654)
- Trainer now calls `on_load_checkpoint()` when resuming from a checkpoint (#1666)
- Fixed sampler logic for ddp with iterable dataset (#1734)
- Fixed `_reset_eval_dataloader()` for IterableDataset (#1560)
- Fixed Horovod distributed backend to set the `root_gpu` property (#1669)
- Fixed wandb logger `global_step` affects other loggers (#1492)
- Fixed disabling progress bar on non-zero ranks using Horovod backend (#1709)
- Fixed bugs that prevent lr finder to be used together with early stopping and validation dataloaders (#1676)
- Fixed a bug in Trainer that prepended the checkpoint path with `version_` when it shouldn't (#1748)
- Fixed lr key name in case of param groups in LearningRateLogger (#1719)
- Fixed saving native AMP scaler state (introduced in #1561)
- Fixed accumulation parameter and suggestion method for learning rate finder (#1801)
- Fixed num processes not being set properly and auto sampler failing in ddp (#1819)
- Fixed bugs in semantic segmentation example (#1824)
- Fixed saving native AMP scaler state (#1561, #1777)
- Fixed native amp + ddp (#1788)
- Fixed `hparam` logging with metrics (#1647)
- Allow logging of metrics together with `hparams` (#1630)
- Removed Warning from trainer loop (#1634)
- Fixed ModelCheckpoint not being fixable (#1632)
- Fixed CPU DDP breaking change and DDP change (#1635)
- Tested pickling (#1636)
- Added flag `replace_sampler_ddp` to manually disable sampler replacement in DDP (#1513)
- Added speed parity tests (max 1 sec difference per epoch) (#1482)
- Added `auto_select_gpus` flag to trainer that enables automatic selection of available GPUs on exclusive mode systems
- Added learning rate finder (#1347)
- Added support for ddp mode in clusters without SLURM (#1387)
- Added `test_dataloaders` parameter to `Trainer.test()` (#1434)
- Added `terminate_on_nan` flag to trainer that performs a NaN check with each training iteration when set to `True` (#1475)
- Added `ddp_cpu` backend for testing ddp without GPUs (#1158)
- Added Horovod support as a distributed backend `Trainer(distributed_backend='horovod')` (#1529)
- Added support for 8 core distributed training on Kaggle TPU's (#1568)
- Added support for native AMP (#1561, #1580)
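The `terminate_on_nan` flag above boils down to checking the loss for NaN/inf after each training iteration. A minimal sketch of such a guard in plain Python (illustrative only; `check_finite` is a hypothetical helper, and the real Trainer also inspects model parameters):

```python
import math

def check_finite(loss: float, step: int) -> None:
    """Raise as soon as the loss becomes NaN or infinite.

    Illustrative sketch of a `terminate_on_nan`-style guard,
    not Lightning's actual implementation.
    """
    if math.isnan(loss) or math.isinf(loss):
        raise RuntimeError(
            f"Loss is {loss} at step {step}; terminating training."
        )

# Passes silently for finite losses, raises otherwise.
check_finite(0.37, step=0)
try:
    check_finite(float("nan"), step=1)
except RuntimeError as err:
    print(err)
```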
- Changed the default behaviour to no longer include a NaN check with each training iteration. (#1475)
- Decoupled the progress bar from trainer; it is a callback now and can be customized or even be replaced entirely (#1450)
- Changed lr schedule step interval behavior to update every backwards pass instead of every forwards pass (#1477)
- Defines shared proc. rank, remove rank from instances (e.g. loggers) (#1408)
- Updated semantic segmentation example with custom U-Net and logging (#1371)
- Disabled val and test shuffling (#1600)
- Deprecated `training_tqdm_dict` in favor of `progress_bar_dict` (#1450)
- Removed `test_dataloaders` parameter from `Trainer.fit()` (#1434)
- Added the possibility to pass nested metrics dictionaries to loggers (#1582)
- Fixed memory leak from opt return (#1528)
- Fixed saving checkpoint before deleting old ones (#1453)
- Fixed loggers - flushing last logged metrics even before continue, e.g. `trainer.test()` results (#1459)
- Fixed optimizer configuration when `configure_optimizers` returns dict without `lr_scheduler` (#1443)
- Fixed `LightningModule` - mixing hparams and arguments in `LightningModule.__init__()` crashes `load_from_checkpoint()` (#1505)
- Added a missing call to the `on_before_zero_grad` model hook (#1493)
- Allow use of sweeps with `WandbLogger` (#1512)
- Fixed a bug that caused the `callbacks` Trainer argument to reference a global variable (#1534)
- Fixed a bug that set all boolean CLI arguments from `Trainer.add_argparse_args` always to True (#1571)
- Fixed do not copy the batch when training on a single GPU (#1576, #1579)
- Fixed soft checkpoint removing on DDP (#1408)
- Fixed automatic parser bug (#1585)
- Fixed bool conversion from string (#1606)
- Added `rank_zero_warn` for warning only in rank 0 (#1428)
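The idea behind a rank-zero-only warning can be sketched as a decorator that executes its function only on the process with rank 0. This is a simplified sketch, not Lightning's utility; the real helper reads the rank from the distributed environment, whereas `_RANK` here is a hard-coded stand-in:

```python
import warnings

# Hypothetical stand-in for the rank detected from the distributed env.
_RANK = 0

def rank_zero_only(fn):
    """Run `fn` only on the rank-0 process; no-op everywhere else."""
    def wrapped(*args, **kwargs):
        if _RANK == 0:
            return fn(*args, **kwargs)
    return wrapped

@rank_zero_only
def rank_zero_warn(message: str) -> None:
    """Emit a warning once, from rank 0, instead of once per process."""
    warnings.warn(message)
```

On a non-zero rank the wrapped call simply returns `None`, so multi-process runs don't flood the logs with duplicate warnings.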
- Fixed default `DistributedSampler` for DDP training (#1425)
- Fixed workers warning not on windows (#1430)
- Fixed returning tuple from `run_training_batch` (#1431)
- Fixed gradient clipping (#1438)
- Fixed pretty print (#1441)
- Added same step loggers' metrics aggregation (#1278)
- Added parity test between a vanilla MNIST model and lightning model (#1284)
- Added parity test between a vanilla RNN model and lightning model (#1351)
- Added Reinforcement Learning - Deep Q-network (DQN) lightning example (#1232)
- Added support for hierarchical `dict` (#1152)
- Added `TrainsLogger` class (#1122)
- Added type hints to `pytorch_lightning.core` (#946)
- Added support for `IterableDataset` in validation and testing (#1104)
- Added support for non-primitive types in `hparams` for `TensorboardLogger` (#1130)
- Added a check that stops the training when loss or weights contain `NaN` or `inf` values (#1097)
- Added support for `IterableDataset` when `val_check_interval=1.0` (default); this will trigger validation at the end of each epoch (#1283)
- Added `summary` method to Profilers (#1259)
- Added informative errors if user defined dataloader has zero length (#1280)
- Added testing for python 3.8 (#915)
- Added a `training_epoch_end` method which is the mirror of `validation_epoch_end` (#1357)
- Added model configuration checking (#1199)
- Added support for optimizer frequencies through `LightningModule.configure_optimizers()` (#1269)
- Added option to run without an optimizer by returning `None` from `configure_optimizers` (#1279)
- Added a warning when the number of data loader workers is small (#1378)
- Changed (renamed and refactored) `TensorRunningMean` -> `TensorRunningAccum`: running accumulations were generalized (#1278)
- Changed `progress_bar_refresh_rate` trainer flag to disable progress bar when set to 0 (#1108)
- Enhanced `load_from_checkpoint` to also forward params to the model (#1307)
- Updated references to `self.forward()` to instead use the `__call__` interface (#1211)
- Changed default behaviour of `configure_optimizers` to use no optimizer rather than Adam (#1279)
- Allow to upload models on W&B (#1339)
- On DP and DDP2 unsqueeze is automated now (#1319)
- Did not always create a DataLoader during reinstantiation, but the same type as before (if subclass of DataLoader) (#1346)
- Did not interfere with a default sampler (#1318)
- Remove default Adam optimizer (#1317)
- Give warnings for unimplemented required lightning methods (#1317)
- Made `evaluate` method private >> `Trainer._evaluate(...)` (#1260)
- Simplify the PL examples structure (shallower and more readable) (#1247)
- Changed min max gpu memory to be on their own plots (#1358)
- Remove `.item` which causes sync issues (#1254)
- Changed smoothing in TQDM to decrease variability of time remaining between training / eval (#1194)
- Change default logger to dedicated one (#1064)
- Deprecated Trainer argument `print_nan_grads` (#1097)
- Deprecated Trainer argument `show_progress_bar` (#1108)
- Removed test for no test dataloader in .fit (#1495)
- Removed duplicated module `pytorch_lightning.utilities.arg_parse` for loading CLI arguments (#1167)
- Removed wandb logger's `finalize` method (#1193)
- Dropped `torchvision` dependency in tests and added own MNIST dataset class instead (#986)
- Fixed `model_checkpoint` when saving all models (#1359)
- `Trainer.add_argparse_args` classmethod fixed. Now it adds a type for the arguments (#1147)
- Fixed bug related to type checking of `ReduceLROnPlateau` lr schedulers (#1126)
- Fixed a bug to ensure lightning checkpoints to be backward compatible (#1132)
- Fixed a bug that created an extra dataloader with active `reload_dataloaders_every_epoch` (#1196)
- Fixed all warnings and errors in the docs build process (#1191)
- Fixed an issue where `val_percent_check=0` would not disable validation (#1251)
- Fixed average of incomplete `TensorRunningMean` (#1309)
- Fixed `WandbLogger.watch` with `wandb.init()` (#1311)
- Fixed an issue with early stopping that would prevent it from monitoring training metrics when validation is disabled / not implemented (#1235)
- Fixed a bug that would cause `trainer.test()` to run on the validation set when overloading `validation_epoch_end` and `test_end` (#1353)
- Fixed `WandbLogger.watch` - use of the watch method without importing `wandb` (#1311)
- Fixed `WandbLogger` to be used with 'ddp' - allow reinits in sub-processes (#1149, #1360)
- Made `training_epoch_end` behave like `validation_epoch_end` (#1357)
- Fixed `fast_dev_run` running validation twice (#1365)
- Fixed pickle error from quick patch `__code__` (#1352)
- Fixed memory leak on GPU0 (#1094, #1349)
- Fixed checkpointing interval (#1272)
- Fixed validation and training loops run the partial dataset (#1192)
- Fixed running `on_validation_end` only on main process in DDP (#1125)
- Fixed `load_spawn_weights` only in proc rank 0 (#1385)
- Fixed using deprecated `use_amp` attribute (#1145)
- Fixed Tensorboard logger error: lightning_logs directory does not exist in multi-node DDP on nodes with rank != 0 (#1377)
- Fixed `Unimplemented backend XLA` error on TPU (#1387)
- Fixed `print` issues and `data_loader` (#1080)
- Added automatic sampler setup. Depending on DDP or TPU, lightning configures the sampler correctly (user needs to do nothing) (#926)
- Added `reload_dataloaders_every_epoch=False` flag for trainer. Some users require reloading data every epoch (#926)
- Added `progress_bar_refresh_rate=50` flag for trainer. Throttle refresh rate on notebooks (#926)
- Updated governance docs
- Added a check to ensure that the metric used for early stopping exists before training commences (#542)
- Added `optimizer_idx` argument to `backward` hook (#733)
- Added `entity` argument to `WandbLogger` to be passed to `wandb.init` (#783)
- Added a tool for profiling training runs (#782)
- Improved flexibility for naming of TensorBoard logs, can now set `version` to a `str` to just save to that directory, and use `name=''` to prevent experiment-name directory (#804)
- Added option to specify `step` key when logging metrics (#808)
- Added `train_dataloader`, `val_dataloader` and `test_dataloader` arguments to `Trainer.fit()`, for alternative data parsing (#759)
- Added Tensor Processing Unit (TPU) support (#868)
- Added semantic segmentation example (#751, #876, #881)
- Split callbacks in multiple files (#849)
- Support for user defined callbacks (#889 and #950)
- Added support for multiple loggers to be passed to `Trainer` as an iterable (e.g. list, tuple, etc.) (#903)
- Added support for step-based learning rate scheduling (#941)
- Added support for logging `hparams` as dict (#1029)
- Checkpoint and early stopping now work without val. step (#1041)
- Support graceful training cleanup after Keyboard Interrupt (#856, #1019)
- Added type hints for function arguments (#912)
- Added default `argparser` for `Trainer` (#952, #1023)
- Added TPU gradient clipping (#963)
- Added max/min number of steps in `Trainer` (#728)
- Improved `NeptuneLogger` by adding `close_after_fit` argument to allow logging after training (#908)
- Changed default TQDM to use `tqdm.auto` for prettier outputs in IPython notebooks (#752)
- Changed `pytorch_lightning.logging` to `pytorch_lightning.loggers` (#767)
- Moved the default `tqdm_dict` definition from Trainer to `LightningModule`, so it can be overridden by the user (#749)
- Moved functionality of `LightningModule.load_from_metrics` into `LightningModule.load_from_checkpoint` (#995)
- Changed Checkpoint path parameter from `filepath` to `dirpath` (#1016)
- Froze models `hparams` as `Namespace` property (#1029)
- Dropped `logging` config in package init (#1015)
- Renamed model steps (#1051):
  - `training_end` >> `training_epoch_end`
  - `validation_end` >> `validation_epoch_end`
  - `test_end` >> `test_epoch_end`
- Refactor dataloading, supports infinite dataloader (#955)
- Create single file in `TensorBoardLogger` (#777)
- Deprecated `pytorch_lightning.logging` (#767)
- Deprecated `LightningModule.load_from_metrics` in favour of `LightningModule.load_from_checkpoint` (#995, #1079)
- Deprecated `@data_loader` decorator (#926)
- Deprecated model steps `training_end`, `validation_end` and `test_end` (#1051, #1056)
- Removed dependency on `pandas` (#736)
- Removed dependency on `torchvision` (#797)
- Removed dependency on `scikit-learn` (#801)
- Fixed a bug where early stopping `on_end_epoch` would be called inconsistently when `check_val_every_n_epoch == 0` (#743)
- Fixed a bug where the model checkpointer didn't write to the same directory as the logger (#771)
- Fixed a bug where the `TensorBoardLogger` class would create an additional empty log file during fitting (#777)
- Fixed a bug where `global_step` was advanced incorrectly when using `accumulate_grad_batches > 1` (#832)
- Fixed a bug when calling `self.logger.experiment` with multiple loggers (#1009)
- Fixed a bug when calling `logger.append_tags` on a `NeptuneLogger` with a single tag (#1009)
- Fixed sending back data from `.spawn` by saving and loading the trained model in/out of the process (#1017)
- Fixed port collision on DDP (#1010)
- Fixed/tested pass overrides (#918)
- Fixed comet logger to log after train (#892)
- Remove deprecated args to learning rate step function (#890)
- Added support for resuming from a specific checkpoint via `resume_from_checkpoint` argument (#516)
- Added support for `ReduceLROnPlateau` scheduler (#320)
- Added support for Apex mode `O2` in conjunction with Data Parallel (#493)
- Added option (`save_top_k`) to save the top k models in the `ModelCheckpoint` class (#128)
- Added `on_train_start` and `on_train_end` hooks to `ModelHooks` (#598)
- Added `TensorBoardLogger` (#607)
- Added support for weight summary of model with multiple inputs (#543)
- Added `map_location` argument to `load_from_metrics` and `load_from_checkpoint` (#625)
- Added option to disable validation by setting `val_percent_check=0` (#649)
- Added `NeptuneLogger` class (#648)
- Added `WandbLogger` class (#627)
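The `save_top_k` behaviour above amounts to top-k bookkeeping over a monitored score. A minimal sketch, assuming lower scores are better (as with a validation loss); `TopKCheckpoints` is a hypothetical class, not the actual `ModelCheckpoint` implementation:

```python
class TopKCheckpoints:
    """Keep only the k best checkpoints by monitored score (lower is better).

    Illustrative sketch of `save_top_k`-style retention, not
    Lightning's ModelCheckpoint code.
    """

    def __init__(self, k: int):
        self.k = k
        self.best = []  # list of (score, checkpoint path)

    def update(self, score: float, path: str) -> list:
        """Record a new checkpoint; return paths that fell out of the
        top k so the caller can delete those files."""
        self.best.append((score, path))
        self.best.sort(key=lambda pair: pair[0])
        evicted = [p for _, p in self.best[self.k:]]
        self.best = self.best[: self.k]
        return evicted

tracker = TopKCheckpoints(k=2)
tracker.update(0.9, "epoch0.ckpt")
tracker.update(0.5, "epoch1.ckpt")
evicted = tracker.update(0.7, "epoch2.ckpt")
print(evicted)  # ['epoch0.ckpt'] — the worst score falls out once k is exceeded
```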
- Changed the default progress bar to print to stdout instead of stderr (#531)
- Renamed `step_idx` to `step`, `epoch_idx` to `epoch`, `max_num_epochs` to `max_epochs` and `min_num_epochs` to `min_epochs` (#589)
- Renamed `total_batch_nb` to `total_batches`, `nb_val_batches` to `num_val_batches`, `nb_training_batches` to `num_training_batches`, `max_nb_epochs` to `max_epochs`, `min_nb_epochs` to `min_epochs`, `nb_test_batches` to `num_test_batches`, and `nb_val_batches` to `num_val_batches` (#567)
- Changed gradient logging to use parameter names instead of indexes (#660)
- Changed the default logger to `TensorBoardLogger` (#609)
- Changed the directory for tensorboard logging to be the same as model checkpointing (#706)
- Deprecated `max_nb_epochs` and `min_nb_epochs` (#567)
- Deprecated the `on_sanity_check_start` hook in `ModelHooks` (#598)
- Removed the `save_best_only` argument from `ModelCheckpoint`, use `save_top_k=1` instead (#128)
- Fixed a bug which occurred when using Adagrad with cuda (#554)
- Fixed a bug where training would be on the GPU despite setting `gpus=0` or `gpus=[]` (#561)
- Fixed an error with `print_nan_gradients` when some parameters do not require gradient (#579)
- Fixed a bug where the progress bar would show an incorrect number of total steps during the validation sanity check when using multiple validation data loaders (#597)
- Fixed support for PyTorch 1.1.0 (#552)
- Fixed an issue with early stopping when using a `val_check_interval < 1.0` in `Trainer` (#492)
- Fixed bugs relating to the `CometLogger` object that would cause it to not work properly (#481)
- Fixed a bug that would occur when returning `-1` from `on_batch_start` following an early exit or when the batch was `None` (#509)
- Fixed a potential race condition with several processes trying to create checkpoint directories (#530)
- Fixed a bug where batch 'segments' would remain on the GPU when using `truncated_bptt > 1` (#532)
- Fixed a bug when using `IterableDataset` (#547)
- Fixed a bug where `.item` was called on non-tensor objects (#602)
- Fixed a bug where `Trainer.train` would crash on an uninitialized variable if the trainer was run after resuming from a checkpoint that was already at `max_epochs` (#608)
- Fixed a bug where early stopping would begin two epochs early (#617)
- Fixed a bug where `num_training_batches` and `num_test_batches` would sometimes be rounded down to zero (#649)
- Fixed a bug where an additional batch would be processed when manually setting `num_training_batches` (#653)
- Fixed a bug when batches did not have a `.copy` method (#701)
- Fixed a bug when using `log_gpu_memory=True` in Python 3.6 (#715)
- Fixed a bug where checkpoint writing could exit before completion, giving incomplete checkpoints (#689)
- Fixed a bug where `on_train_end` was not called when early stopping (#723)
- Added option to disable default logger, checkpointer, and early stopping by passing `logger=False`, `checkpoint_callback=False` and `early_stop_callback=False` respectively
- Added `CometLogger` for use with Comet.ml
- Added `val_check_interval` argument to `Trainer` allowing validation to be performed at every given number of batches
- Added functionality to save and load hyperparameters using the standard checkpoint mechanism
- Added call to `torch.cuda.empty_cache` before training starts
- Added option for user to override the call to `backward`
- Added support for truncated backprop through time via the `truncated_bptt_steps` argument in `Trainer`
- Added option to operate on all outputs from `training_step` in DDP2
- Added a hook for modifying DDP init
- Added a hook for modifying Apex
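Truncated backprop through time, as enabled by `truncated_bptt_steps` above, starts by splitting a long sequence into fixed-size windows; gradients are then truncated (hidden state detached) at each window boundary. The splitting step can be sketched without any framework (`tbptt_split` is a hypothetical helper, not Lightning's code):

```python
def tbptt_split(sequence, truncated_bptt_steps):
    """Split a sequence (here a list of timesteps) into chunks of at
    most `truncated_bptt_steps` items. In TBPTT, backprop runs within
    each chunk and the hidden state is detached between chunks."""
    return [
        sequence[i : i + truncated_bptt_steps]
        for i in range(0, len(sequence), truncated_bptt_steps)
    ]

chunks = tbptt_split(list(range(7)), truncated_bptt_steps=3)
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6]]
```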
- Changed experiment version to be padded with zeros (e.g. `/dir/version_9` becomes `/dir/version_0009`)
- Changed callback metrics to include any metrics given in logs or progress bar
- Changed the default for `save_best_only` in `ModelCheckpoint` to `True`
- Added `tng_data_loader` for backwards compatibility
- Renamed `MLFlowLogger.client` to `MLFlowLogger.experiment` for consistency
- Moved `global_step` increment to happen after the batch has been processed
- Changed weights restore to first attempt HPC weights before restoring normally, preventing both weights being restored and running out of memory
- Changed progress bar functionality to add multiple progress bars for train/val/test
- Changed calls to `print` to use `logging` instead
- Deprecated `tng_dataloader`
- Fixed an issue where the number of batches was off by one during training
- Fixed a bug that occurred when setting a checkpoint callback and `early_stop_callback=False`
- Fixed an error when importing CometLogger
- Fixed a bug where the `gpus` argument had some unexpected behaviour
- Fixed a bug where the computed total number of batches was sometimes incorrect
- Fixed a bug where the progress bar would sometimes not show the total number of batches in test mode
- Fixed a bug when using the `log_gpu_memory='min_max'` option in `Trainer`
- Fixed a bug where checkpointing would sometimes erase the current directory
- Added `weights_summary` argument to `Trainer` to be set to `full` (full summary), `top` (just top level modules) or other
- Added `tags` argument to `MLFlowLogger`
- Changed default for `amp_level` to `O1`
- Removed the `print_weights_summary` argument from `Trainer`
- Fixed a bug where logs were not written properly
- Fixed a bug where `logger.finalize` wasn't called after training is complete
- Fixed callback metric errors in DDP
- Fixed a bug where `TestTubeLogger` didn't log to the correct directory
- Added the `LightningLoggerBase` class for experiment loggers
- Added `MLFlowLogger` for logging with `mlflow`
- Added `TestTubeLogger` for logging with `test_tube`
- Added a different implementation of DDP (`distributed_backed='ddp2'`) where every node has one model using all GPUs
- Added support for optimisers which require a closure (e.g. LBFGS)
- Added automatic `MASTER_PORT` default for DDP when not set manually
- Added new GPU memory logging options `'min_max'` (log only the min/max utilization) and `'all'` (log all the GPU memory)
- Changed schedulers to always be called with the current epoch
- Changed `test_tube` to an optional dependency
- Changed data loaders to internally use a getter instead of a python property
- Disabled auto GPU loading when restoring weights to prevent out of memory errors
- Changed logging, early stopping and checkpointing to occur by default
- Fixed a bug with samplers that do not specify `set_epoch`
- Fixed a bug when using the `MLFlowLogger` with unsupported data types; this will now raise a warning
- Fixed a bug where gradient norms were always zero using `track_grad_norm`
- Fixed a bug which causes a crash when logging memory
- Changed `data_batch` argument to `batch` throughout
- Changed `batch_i` argument to `batch_idx` throughout
- Changed `tng_dataloader` method to `train_dataloader`
- Changed `on_tng_metrics` method to `on_training_metrics`
- Changed `gradient_clip` argument to `gradient_clip_val`
- Changed `add_log_row_interval` to `row_log_interval`
- Fixed a bug with tensorboard logging in multi-gpu setup
- Added the flag `log_gpu_memory` to `Trainer` to deactivate logging of GPU memory utilization
- Added SLURM resubmit functionality (port from test-tube)
- Added optional `weight_save_path` to trainer to remove the need for a checkpoint_callback when using cluster training
- Added option to use single gpu per node with `DistributedDataParallel`
- Changed functionality of `validation_end` and `test_end` with multiple dataloaders to be given all of the dataloaders at once rather than in separate calls
- Changed `print_nan_grads` to only print the parameter value and gradients when they contain NaN
- Changed gpu API to take integers as well (e.g. `gpus=2` instead of `gpus=[0, 1]`)
- All models now loaded on to CPU to avoid device and out of memory issues in PyTorch
- Fixed a bug where data types that implement `.to` but not `.cuda` would not be properly moved onto the GPU
- Fixed a bug where data would not be re-shuffled every epoch when using a `DistributedSampler`
- Added `test_step` and `test_end` methods, used when `Trainer.test` is called
- Added `GradientAccumulationScheduler` callback which can be used to schedule changes to the number of accumulation batches
- Added option to skip the validation sanity check by setting `nb_sanity_val_steps = 0`
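The scheduling behind a gradient-accumulation scheduler is just a mapping from epoch to accumulation factor. A minimal sketch of that lookup, assuming a schedule dict like `{0: 1, 4: 4, 8: 8}` (illustrative of the idea, not the callback's actual code):

```python
def accumulation_factor(schedule: dict, epoch: int) -> int:
    """Return the accumulation factor for `epoch`: the entry with the
    largest start epoch at or before `epoch` wins; default to 1 before
    the first entry. Hypothetical helper for illustration."""
    factor = 1
    for start_epoch in sorted(schedule):
        if epoch >= start_epoch:
            factor = schedule[start_epoch]
    return factor

schedule = {0: 1, 4: 4, 8: 8}
print(accumulation_factor(schedule, 5))  # 4: epochs 4-7 accumulate over 4 batches
```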
- Fixed a bug when setting `nb_sanity_val_steps = 0`
- Changed the default `val_check_interval` to `1.0`
- Changed defaults for `nb_val_batches`, `nb_tng_batches` and `nb_test_batches` to 0
- Fixed a bug where the full validation set was used despite setting `val_percent_check`
- Fixed a bug where an `Exception` was thrown when using a data set containing a single batch
- Fixed a bug where an `Exception` was thrown if no `val_dataloader` was given
- Fixed a bug where tuples were not properly transferred to the GPU
- Fixed a bug where data of a non standard type was not properly handled by the trainer
- Fixed a bug when loading data as a tuple
- Fixed a bug where `AttributeError` could be suppressed by the `Trainer`
- Added support for data to be given as a `dict` or `list` with a single gpu
- Added support for `configure_optimizers` to return a single optimizer, two lists (optimizers and schedulers), or a single list
- Fixed a bug where returning just an optimizer list (i.e. without schedulers) from `configure_optimizers` would throw an `Exception`
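Supporting the return shapes listed above amounts to normalizing whatever `configure_optimizers` returns into one optimizer list plus one scheduler list. A sketch of such normalization (`normalize_optimizers` is a hypothetical helper, not the Trainer's exact logic):

```python
def normalize_optimizers(result):
    """Normalize a `configure_optimizers`-style return value into
    (optimizers, schedulers). Accepts a single optimizer, a single
    list of optimizers, or two lists (optimizers and schedulers)
    returned as a 2-tuple."""
    if isinstance(result, tuple) and len(result) == 2:
        optimizers, schedulers = result
        return list(optimizers), list(schedulers)
    if isinstance(result, list):
        return result, []   # list of optimizers, no schedulers
    return [result], []     # single optimizer, no schedulers

# Plain strings stand in for optimizer/scheduler objects:
print(normalize_optimizers("opt"))                 # (['opt'], [])
print(normalize_optimizers(["opt1", "opt2"]))      # (['opt1', 'opt2'], [])
print(normalize_optimizers((["opt"], ["sched"])))  # (['opt'], ['sched'])
```

Normalizing once at the boundary keeps the rest of the training loop free of shape checks.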
- Added `optimizer_step` method that can be overridden to change the standard optimizer behaviour
- Added support for multiple validation dataloaders
- Added support for latest test-tube logger (optimised for `torch==1.2.0`)
- `validation_step` and `val_dataloader` are now optional
- `lr_scheduler` is now activated after epoch
- Fixed a bug where a warning would show when using `lr_scheduler` in `torch>1.1.0`
- Fixed a bug where an `Exception` would be thrown if using `torch.DistributedDataParallel` without using a `DistributedSampler`; this now throws a `Warning` instead
- Fixed a bug where accumulate gradients would scale the loss incorrectly
- Changed install requirement to `torch==1.2.0`
- Changed install requirement to `torch==1.1.0`
- Added 16-bit support for a single GPU
- Added support for training continuation (preserves epoch, global step etc.)
- Changed `training_step` and `validation_step`; outputs will no longer be automatically reduced
- Removed need for `Experiment` object in `Trainer`
- Fixed issues with reducing outputs from generative models (such as images and text)
- Added a decorator to do lazy data loading internally
- Fixed a bug where `Experiment` object was not process safe, potentially causing logs to be overwritten