# Optimizers

Classy comes with a set of well-established predefined optimizers that you can easily plug into your experiments. At the moment, we support the following:
## Adam

One of the most famous optimizers for Natural Language Processing applications, virtually ubiquitous.

To use it, put the following YAML lines in your own profile or config:
```yaml
model:
  optim_conf:
    _target_: classy.optim.factories.AdamWithWarmupFactory
    lr: 3e-5
    warmup_steps: 5000
    total_steps: ${training.pl_trainer.max_steps}
    weight_decay: 0.01
    no_decay_params:
      - bias
      - LayerNorm.weight
```
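Here, `lr` is the peak learning rate, `warmup_steps` and `total_steps` drive the warmup schedule, and `no_decay_params` lists parameters that are exempted from weight decay. As a rough sketch of the usual pattern behind `no_decay_params` (classy's actual grouping code may differ; `module` stands for the `torch.nn.Module` being optimized):

```python
# Parameters whose name contains any of the listed substrings go into a
# group with weight decay disabled; all others keep the configured decay.
no_decay = ["bias", "LayerNorm.weight"]
grouped_parameters = [
    {
        "params": [
            p for n, p in module.named_parameters()
            if not any(nd in n for nd in no_decay)
        ],
        "weight_decay": 0.01,
    },
    {
        "params": [
            p for n, p in module.named_parameters()
            if any(nd in n for nd in no_decay)
        ],
        "weight_decay": 0.0,
    },
]
```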
## AdamW

An Adam implementation with the weight decay fix described in the original paper, [Loshchilov & Hutter (2019)](https://arxiv.org/abs/1711.05101).

To use it, put the following YAML lines in your own profile or config:
```yaml
model:
  optim_conf:
    _target_: classy.optim.factories.AdamWWithWarmupFactory
    lr: 3e-5
    warmup_steps: 5000
    total_steps: ${training.pl_trainer.max_steps}
    weight_decay: 0.01
    no_decay_params:
      - bias
      - LayerNorm.weight
```
## Adafactor

An optimizer to use when you need to reduce VRAM usage. Its performance is almost on par with AdamW's.

To use it, put the following YAML lines in your own profile or config:
```yaml
model:
  optim_conf:
    _target_: classy.optim.factories.AdafactorWithWarmupFactory
    lr: 2e-5
    warmup_steps: 5000
    total_steps: ${training.pl_trainer.max_steps}
    weight_decay: 0.01
    no_decay_params:
      - bias
      - LayerNorm.weight
```
## RAdam

A more recent optimizer that stabilizes training and lets you skip the warmup phase entirely; note that, unlike the factories above, it takes no `warmup_steps` or `total_steps`. You can replace AdamW with RAdam in almost every scenario.

To use it, put the following YAML lines in your own profile or config:
```yaml
model:
  optim_conf:
    _target_: classy.optim.factories.RAdamFactory
    lr: 3e-5
    weight_decay: 0.01
    no_decay_params:
      - bias
      - LayerNorm.weight
```
## Custom Optimizers

If you want to implement your own optimizer and learning rate scheduler, simply create a class that inherits from `classy.optim.TorchFactory` and implement the `__call__` method, returning either the optimizer alone or a dictionary containing both the optimizer and the scheduler:
```python
from typing import List, Optional

import torch
import transformers
from torch.optim import Adagrad

from classy.optim import TorchFactory


class AdagradWithWarmup(TorchFactory):
    """
    Factory for an Adagrad optimizer with a warmup learning rate scheduler.
    Reference paper for Adagrad: https://jmlr.org/papers/v12/duchi11a.html
    """

    def __init__(
        self,
        lr: float,
        warmup_steps: int,
        total_steps: int,
        weight_decay: float,
        no_decay_params: Optional[List[str]],
    ):
        super().__init__(weight_decay, no_decay_params)
        self.lr = lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

    def __call__(self, module: torch.nn.Module):
        optimizer = Adagrad(
            module.parameters(), lr=self.lr, weight_decay=self.weight_decay
        )
        # Linear warmup for `warmup_steps` steps, then linear decay until `total_steps`
        scheduler = transformers.get_linear_schedule_with_warmup(
            optimizer, self.warmup_steps, self.total_steps
        )
        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                "scheduler": scheduler,
                "interval": "step",  # step the scheduler after every optimization step
                "frequency": 1,
            },
        }
```
This `__call__` method can return any of the valid return types of pytorch_lightning's `configure_optimizers` method. But if you don't have to do anything fancy, this piece of code is everything you'll need :).
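For instance, if you don't need a learning rate scheduler at all, your factory can return the bare optimizer. A minimal sketch (the class name `PlainAdagradFactory` is purely illustrative; it reuses the imports from the example above):

```python
class PlainAdagradFactory(TorchFactory):
    """Illustrative factory returning a bare Adagrad optimizer, with no scheduler."""

    def __init__(
        self,
        lr: float,
        weight_decay: float,
        no_decay_params: Optional[List[str]],
    ):
        super().__init__(weight_decay, no_decay_params)
        self.lr = lr

    def __call__(self, module: torch.nn.Module):
        # Returning just the optimizer is also a valid
        # `configure_optimizers` return type in pytorch_lightning.
        return Adagrad(
            module.parameters(), lr=self.lr, weight_decay=self.weight_decay
        )
```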
Then, you can use your own optimizer in your experiments by specifying it in your profile or config (note that `warmup_steps` and `total_steps` are required by `AdagradWithWarmup`'s constructor):
```yaml
model:
  optim_conf:
    _target_: my_repo.optimization.AdagradWithWarmup
    lr: 3e-5
    warmup_steps: 5000
    total_steps: ${training.pl_trainer.max_steps}
    weight_decay: 0.01
    no_decay_params:
      - bias
      - LayerNorm.weight
```