Custom Evaluation Metric
Adding a custom metric for evaluation is easy in classy, and you can use it for both classy evaluate and
classy train (to monitor performance or, perhaps, even early-stop). To do this, you just need to:
- Write your Evaluation class:

  class Evaluation:
      def __call__(
          self,
          path: str,
          predicted_samples: List[ClassySample],
      ) -> Dict:
          raise NotImplementedError

- Write its config
- Train, specifying your evaluation: classy train [...] -c [...] evaluation=<your evaluation name>
- Using classy evaluate now prints your custom evaluation
A Minimal Example
As an example, imagine you want to use SpanF1 to evaluate your NER (Named Entity Recognition) system. First, you implement the class:
from typing import Dict, List, Tuple

from datasets import load_metric


class SeqEvalSpanEvaluation(Evaluation):
    def __init__(self):
        self.backend_metric = load_metric("seqeval")

    def __call__(
        self,
        # each element pairs a TokensSample with the labels predicted for it
        predicted_samples: List[Tuple[TokensSample, List[str]]],
    ) -> Dict:
        metric_out = self.backend_metric.compute(
            predictions=[labels for _, labels in predicted_samples],
            references=[sample.labels for sample, _ in predicted_samples],
        )
        p, r, f1 = metric_out["overall_precision"], metric_out["overall_recall"], metric_out["overall_f1"]
        return {"precision": p, "recall": r, "f1": f1}
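If you want to sanity-check the metric outside of a classy run, you can call the class directly on (sample, predicted labels) pairs. In the sketch below, FakeSample is a made-up stand-in for TokensSample that only exposes the labels attribute used above, and the datasets and seqeval packages are assumed to be installed:

from dataclasses import dataclass
from typing import List

@dataclass
class FakeSample:
    labels: List[str]  # gold labels, mirroring the attribute accessed above

evaluation = SeqEvalSpanEvaluation()
pairs = [
    # (sample, predicted labels)
    (FakeSample(labels=["B-PER", "I-PER", "O"]), ["B-PER", "I-PER", "O"]),
    (FakeSample(labels=["B-LOC", "O", "O"]), ["O", "O", "O"]),
]
print(evaluation(pairs))  # -> {'precision': 1.0, 'recall': 0.5, 'f1': 0.666...}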
Here we use the SpanF1 metric implemented in the HuggingFace datasets library (this is what load_metric("seqeval") does). Then, you write the corresponding config:
_target_: 'classy.evaluation.span.SeqEvalSpanEvaluation'
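For context, classy configurations follow Hydra's _target_ convention, so this entry just tells classy which class to instantiate. Conceptually, it amounts to something like the sketch below (classy does this internally; you never need to write it yourself):

from hydra.utils import instantiate

# builds a SeqEvalSpanEvaluation instance from the config entry above
evaluation = instantiate({"_target_": "classy.evaluation.span.SeqEvalSpanEvaluation"})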
Finally, add this evaluation metric to your training configuration (classy train [...] -c [...] evaluation=<your evaluation name>, as shown above), train your model, and your metric will be computed automatically at evaluation time.
Monitoring at Training Time
Most of the time, you'll also want to monitor your evaluation metric on some dataset (most likely, the validation set) during training. You can achieve this by adding callbacks=evaluation to your training command (classy train [...] -c [...] callbacks=evaluation); this override is what does the magic. Behind the scenes, you are adding a callback
with the following config (which, obviously, you can modify either with -c or via a profile):
- _target_: "classy.pl_callbacks.prediction.PredictionPLCallback"
  path: null  # leave it to null to set it to validation path
  prediction_dataset_conf: ${prediction.dataset}
  on_result:
    file_dumper:
      _target_: "classy.pl_callbacks.prediction.FileDumperPredictionCallback"
    evaluation:
      _target_: "classy.pl_callbacks.prediction.EvaluationPredictionCallback"
      evaluation: ${evaluation}
  settings:
    - name: "validation"
      path: null  # leave it to null to set it to PredictionPLCallback.path
      token_batch_size: 800
      limit: 1000
      prediction_param_conf_path: null
      on_result:
        - "file_dumper"
        - "evaluation"
Left as it is, this config tells classy to use the model being trained to predict all the samples in the validation dataset,
and to run 2 callbacks on the resulting (sample, prediction) tuples:
- FileDumperPredictionCallback; this callback dumps the (sample, prediction) tuples that your model predicts at each validation epoch in a dedicated folder in your experiment directory
- EvaluationPredictionCallback (the actual magic); this callback evaluates the (sample, prediction) tuples with the evaluation metric you specified and logs the result
In more detail, PredictionPLCallback is a powerful class supporting quite a number of evaluation scenarios during training. It has 2 main arguments:
- on_result, a dictionary of (name, callback) pairs; each callback here is a classy.pl_callbacks.prediction.PredictionCallback
- settings, a list of settings where model prediction should be performed, each made up of:
  - name, the name of this setting (used, for instance, to name the logged metrics, e.g. validation-f1)
  - path, the path to the dataset you want to evaluate on
  - token_batch_size, the token batch size you want to use (remember, no gradient computation here, so larger batches are usually fine)
  - limit, the maximum number of samples to be used (taken in the order they occur in the dataset); set it to -1 to use all of them
  - prediction_param_conf_path, the path to the prediction params config file you want to use (leave it to null if not needed)
  - (optionally) on_result, a list containing the names of the on_result callbacks you want to launch on this setting; if not provided, all callbacks will be used
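To make the interplay between settings and on_result concrete, here is an illustrative sketch of the control flow they describe. It is not classy's actual implementation, and all names below are made up for the example:

from typing import Any, Callable, Dict, List, Optional, Tuple

# a prediction function: (dataset path, sample limit) -> list of (sample, prediction) pairs
PredictFn = Callable[[Optional[str], int], List[Tuple[Any, Any]]]
# an on_result callback: (setting name, list of (sample, prediction) pairs) -> None
ResultCallback = Callable[[str, List[Tuple[Any, Any]]], None]

def run_prediction_settings(
    settings: List[Dict[str, Any]],
    on_result: Dict[str, ResultCallback],
    predict: PredictFn,
) -> None:
    for setting in settings:
        # predict on the setting's dataset, honoring its sample limit
        pairs = predict(setting.get("path"), setting.get("limit", -1))
        # run the callbacks explicitly listed for this setting, or all of them if none are listed
        for name in setting.get("on_result") or list(on_result):
            on_result[name](setting["name"], pairs)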
You can use your metric for early stopping as well! Just add
-c [...] callbacks_monitor=<setting-name>-<name-of-metric-returned-in-evaluation-dict> callbacks_mode=<max-or-min>.
For instance, in our example, to early-stop on SpanF1 on the validation set,
use -c [...] callbacks_monitor=validation-f1 callbacks_mode=max.
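Under the hood, callbacks_monitor and callbacks_mode simply tell the trainer which logged quantity to track and in which direction. Conceptually, this corresponds to standard PyTorch Lightning callbacks along these lines (a sketch of the idea, not classy's actual wiring):

from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

# track the "validation-f1" quantity logged by the evaluation callback and maximize it
early_stopping = EarlyStopping(monitor="validation-f1", mode="max")
checkpointing = ModelCheckpoint(monitor="validation-f1", mode="max")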
Swapping Evaluation Metric
classy also supports changing the evaluation metric directly when using classy evaluate, regardless of the config
used in classy train. To do so, you can use the --evaluation-config CLI parameter of classy evaluate. This
parameter specifies the path to the configuration file (e.g. configurations/evaluation/span.yaml) where the config of the desired
evaluation metric is stored.
Note that interpolation to other configs is currently not supported in this setting.