Organizing your data

Data Formatting

classy requires data to be formatted in a specific way according to the task you're tackling (check out Tasks and Input Formats in the documentation).

In our case, Named Entity Recognition (i.e., Token Classification), the data must be formatted so that each line represents a single sample. For instance, our running example, Barack Obama visited Google in California, is formatted as follows:

Barack Obama visited Google in California\tPER PER O ORG O LOC

That is, a TSV (tab-separated values) file whose first column is a space-separated sequence of tokens and whose second column is a space-separated sequence of labels. The \t above stands for a tab character, and both sequences must have the same number of elements.
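The format above can be produced and checked with a few lines of plain Python. This is a minimal sketch using only the standard library; `format_sample` and `parse_sample` are hypothetical helper names and are not part of classy.

```python
from typing import List, Tuple


def format_sample(tokens: List[str], labels: List[str]) -> str:
    """Build one TSV line: tokens, a tab, then labels (hypothetical helper)."""
    if len(tokens) != len(labels):
        raise ValueError(
            f"{len(tokens)} tokens but {len(labels)} labels: lengths must match"
        )
    return " ".join(tokens) + "\t" + " ".join(labels)


def parse_sample(line: str) -> Tuple[List[str], List[str]]:
    """Split one TSV line back into tokens and labels, checking the lengths."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 2:
        raise ValueError(f"expected 2 tab-separated columns, got {len(columns)}")
    tokens = columns[0].split(" ")
    labels = columns[1].split(" ")
    if len(tokens) != len(labels):
        raise ValueError(
            f"{len(tokens)} tokens but {len(labels)} labels: lengths must match"
        )
    return tokens, labels


line = format_sample(
    ["Barack", "Obama", "visited", "Google", "in", "California"],
    ["PER", "PER", "O", "ORG", "O", "LOC"],
)
tokens, labels = parse_sample(line)
```

Running `parse_sample` over every line of a file before training is a cheap way to catch misaligned samples early.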

tip

classy by default supports .tsv and .jsonl as input formats (see the documentation), but you can add custom formats as well.

If your dataset is already formatted like this, great! Otherwise, this is the only step where coding is required. You can either convert it yourself (with a Python or bash script, whichever you're comfortable with), or you can register a custom data reader to support your dataset format.
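As an illustration of the "convert it yourself" route, here is a minimal sketch that assumes your source data is in a CoNLL-style layout: one token per line with its label in the last whitespace-separated column, and a blank line between sentences. That input layout is an assumption made for this example, and `conll_to_tsv_lines` is a hypothetical helper, not a classy function.

```python
from typing import Iterable, Iterator, List


def conll_to_tsv_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield one 'tokens<TAB>labels' line per sentence (hypothetical helper).

    Assumes CoNLL-style input: token first, label last, blank line
    between sentences.
    """
    tokens: List[str] = []
    labels: List[str] = []
    for raw in lines:
        parts = raw.split()
        if not parts:
            # A blank line closes the current sentence, if there is one.
            if tokens:
                yield " ".join(tokens) + "\t" + " ".join(labels)
                tokens, labels = [], []
            continue
        tokens.append(parts[0])
        labels.append(parts[-1])
    # Flush the last sentence when the input does not end with a blank line.
    if tokens:
        yield " ".join(tokens) + "\t" + " ".join(labels)


conll = [
    "Barack PER",
    "Obama PER",
    "visited O",
    "Google ORG",
    "in O",
    "California LOC",
    "",
]
converted = list(conll_to_tsv_lines(conll))
```

Each yielded string can be written to a .tsv file followed by a newline. If your dataset uses a different layout, only the parsing inside the loop needs to change.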

Organizing Datasets

In classy, as in standard machine learning projects, the simplest way to organize your datasets is to create a directory containing the train, validation and test datasets.

data/ner-data
├── train.tsv
├── validation.tsv
└── test.tsv

In this way, classy will automatically infer the splits of your dataset from the directory structure.
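If you start from a single file of samples, the directory above can be built with a short script. The sketch below uses only the Python standard library; the 10% validation and test fractions, the fixed seed, and the helper names `split_samples` and `write_splits` are illustrative assumptions, not classy defaults.

```python
import random
from pathlib import Path
from typing import Dict, List, Sequence


def split_samples(
    samples: Sequence[str],
    validation_fraction: float = 0.1,  # assumed value, adjust to your needs
    test_fraction: float = 0.1,        # assumed value, adjust to your needs
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Shuffle the samples reproducibly and cut them into three splits."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    n_test = int(len(shuffled) * test_fraction)
    return {
        "validation": shuffled[:n_validation],
        "test": shuffled[n_validation:n_validation + n_test],
        "train": shuffled[n_validation + n_test:],
    }


def write_splits(splits: Dict[str, List[str]], out_dir: str) -> None:
    """Write train.tsv, validation.tsv and test.tsv into out_dir."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    for name, lines in splits.items():
        text = "".join(line + "\n" for line in lines)
        (out_path / f"{name}.tsv").write_text(text, encoding="utf-8")
```

For example, `write_splits(split_samples(all_lines), "data/ner-data")` would produce the three files shown in the tree above, where `all_lines` is your list of already formatted TSV lines.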

tip

If you have multiple training files, or you want to specify the splits using a different directory structure, you can use a training coordinates file. You can find a complete guide on how to do it in the Reference Manual.