# Optimizers

Classy comes with a set of well-established predefined optimizers that you can easily plug into your experiments. At the moment, we support the following:
## Adam

One of the most famous optimizers for Natural Language Processing applications, virtually ubiquitous.

To use it, put the following YAML lines in your own profile or config:
```yaml
model:
  optim_conf:
    _target_: classy.optim.factories.AdamWithWarmupFactory
    lr: 3e-5
    warmup_steps: 5000
    total_steps: ${training.pl_trainer.max_steps}
    weight_decay: 0.01
    no_decay_params:
      - bias
      - LayerNorm.weight
```
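Here, `lr` is the peak learning rate, `warmup_steps` and `total_steps` drive the warmup schedule, and `no_decay_params` lists parameters that are exempted from weight decay. As a rough sketch of the usual pattern behind `no_decay_params` (classy's actual grouping code may differ; `module` stands for the `torch.nn.Module` being optimized):

```python
# Parameters whose name contains any of the listed substrings go into a
# group with weight decay disabled; all others keep the configured decay.
no_decay = ["bias", "LayerNorm.weight"]
grouped_parameters = [
    {
        "params": [
            p for n, p in module.named_parameters()
            if not any(nd in n for nd in no_decay)
        ],
        "weight_decay": 0.01,
    },
    {
        "params": [
            p for n, p in module.named_parameters()
            if any(nd in n for nd in no_decay)
        ],
        "weight_decay": 0.0,
    },
]
```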
## AdamW

An Adam implementation with the weight decay fix described in the original paper, [Loshchilov & Hutter (2019)](https://arxiv.org/abs/1711.05101).

To use it, put the following YAML lines in your own profile or config:
```yaml
model:
  optim_conf:
    _target_: classy.optim.factories.AdamWWithWarmupFactory
    lr: 3e-5
    warmup_steps: 5000
    total_steps: ${training.pl_trainer.max_steps}
    weight_decay: 0.01
    no_decay_params:
      - bias
      - LayerNorm.weight
```
## Adafactor

An optimizer to use when you need to reduce VRAM usage. Its performance is almost on par with AdamW's.

To use it, put the following YAML lines in your own profile or config:
```yaml
model:
  optim_conf:
    _target_: classy.optim.factories.AdafactorWithWarmupFactory
    lr: 2e-5
    warmup_steps: 5000
    total_steps: ${training.pl_trainer.max_steps}
    weight_decay: 0.01
    no_decay_params:
      - bias
      - LayerNorm.weight
```
## RAdam

A more recent optimizer that stabilizes training and lets you skip the warmup phase entirely; note that, unlike the factories above, it takes no `warmup_steps` or `total_steps`. You can replace AdamW with RAdam in almost every scenario.

To use it, put the following YAML lines in your own profile or config:
```yaml
model:
  optim_conf:
    _target_: classy.optim.factories.RAdamFactory
    lr: 3e-5
    weight_decay: 0.01
    no_decay_params:
      - bias
      - LayerNorm.weight
```
## Custom Optimizers

If you want to implement your own optimizer and learning rate scheduler, simply create a class that inherits from `classy.optim.TorchFactory` and implement the `__call__` method, returning either the optimizer alone or a dictionary containing both the optimizer and the scheduler:
```python
from typing import List, Optional

import torch
import transformers
from torch.optim import Adagrad

from classy.optim import TorchFactory


class AdagradWithWarmup(TorchFactory):
    """
    Factory for an Adagrad optimizer with a warmup learning rate scheduler.
    Reference paper for Adagrad: https://jmlr.org/papers/v12/duchi11a.html
    """

    def __init__(
        self,
        lr: float,
        warmup_steps: int,
        total_steps: int,
        weight_decay: float,
        no_decay_params: Optional[List[str]],
    ):
        super().__init__(weight_decay, no_decay_params)
        self.lr = lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

    def __call__(self, module: torch.nn.Module):
        optimizer = Adagrad(
            module.parameters(), lr=self.lr, weight_decay=self.weight_decay
        )
        # Linear warmup for `warmup_steps` steps, then linear decay until `total_steps`
        scheduler = transformers.get_linear_schedule_with_warmup(
            optimizer, self.warmup_steps, self.total_steps
        )
        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                "scheduler": scheduler,
                "interval": "step",  # step the scheduler after every optimization step
                "frequency": 1,
            },
        }
```
This `__call__` method can return any of the valid return types of pytorch_lightning's `configure_optimizers` method. But if you don't have to do anything fancy, this piece of code is everything you'll need :).
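For instance, if you don't need a learning rate scheduler at all, your factory can return the bare optimizer. A minimal sketch (the class name `PlainAdagradFactory` is purely illustrative; it reuses the imports from the example above):

```python
class PlainAdagradFactory(TorchFactory):
    """Illustrative factory returning a bare Adagrad optimizer, with no scheduler."""

    def __init__(
        self,
        lr: float,
        weight_decay: float,
        no_decay_params: Optional[List[str]],
    ):
        super().__init__(weight_decay, no_decay_params)
        self.lr = lr

    def __call__(self, module: torch.nn.Module):
        # Returning just the optimizer is also a valid
        # `configure_optimizers` return type in pytorch_lightning.
        return Adagrad(
            module.parameters(), lr=self.lr, weight_decay=self.weight_decay
        )
```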
Then, you can use your own optimizer in your experiments by specifying it in your profile or config (note that `warmup_steps` and `total_steps` are required by `AdagradWithWarmup`'s constructor):
```yaml
model:
  optim_conf:
    _target_: my_repo.optimization.AdagradWithWarmup
    lr: 3e-5
    warmup_steps: 5000
    total_steps: ${training.pl_trainer.max_steps}
    weight_decay: 0.01
    no_decay_params:
      - bias
      - LayerNorm.weight
```