Custom Data Format

By default, classy supports only .tsv and .jsonl files. However, you can easily add support for your own file format for a given task. You just need to implement your own data driver and register it for that task and file extension:

# implement your data driver
@DataDriver.register(YOUR_TASK, YOUR_FILE_EXTENSION)
class CustomDataDriver(DataDriver):
    def read(self, lines: Iterator[str]) -> Iterator[ClassySample]:
        raise NotImplementedError

    def save(
        self,
        samples: Iterator[ClassySample],
        path: str,
        use_predicted_annotation: bool = False,
    ):
        raise NotImplementedError
caution

classy uses the tuple (task, file-extension) to determine which data driver to instantiate for a given file. This means that your files must carry a file extension, even on Unix systems, where extensions are otherwise optional.
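To illustrate why the extension matters, here is a minimal sketch of a lookup keyed on (task, file-extension). The `_REGISTRY` dictionary, the `register` and `driver_for` helpers, and `ToyJSONLDriver` are hypothetical names written for this example; they are not classy's actual implementation.

```python
from pathlib import Path
from typing import Callable, Dict, Tuple, Type

# Hypothetical registry, keyed on (task, file-extension).
_REGISTRY: Dict[Tuple[str, str], Type] = {}


def register(task: str, extension: str) -> Callable[[Type], Type]:
    """Class decorator that stores a driver class under (task, extension)."""

    def decorator(cls: Type) -> Type:
        _REGISTRY[(task, extension)] = cls
        return cls

    return decorator


def driver_for(task: str, path: str) -> Type:
    """Pick the driver class from the task and the file's extension."""
    extension = Path(path).suffix.lstrip(".")
    if not extension:
        raise ValueError(f"{path!r} has no file extension, so no driver can be selected")
    try:
        return _REGISTRY[(task, extension)]
    except KeyError:
        raise ValueError(f"no driver registered for {(task, extension)!r}") from None


@register("sequence", "jsonl")
class ToyJSONLDriver:
    """Placeholder driver, registered only to demonstrate the lookup."""
```

With this sketch, `driver_for("sequence", "data/train.jsonl")` returns `ToyJSONLDriver`, while a path such as `data/train` raises a `ValueError` because there is no extension to look up.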

A Minimal Example

For instance, imagine you were to reimplement the .jsonl data driver for Sequence Classification:

@DataDriver.register(SEQUENCE, "jsonl")
class JSONLSequenceDataDriver(SequenceDataDriver):
    pass
info

SequenceDataDriver is just a subclass of DataDriver whose sample type has been narrowed to SequenceSample.

You would first implement the read method:

def read(self, lines: Iterator[str]) -> Iterator[SequenceSample]:
    # iterate on lines
    for line in lines:
        # read json object and instantiate sequence sample
        yield SequenceSample(**json.loads(line))

and then the save method:

def save(self, samples: Iterator[SequenceSample], path: str):
    with open(path, "w") as f:
        # iterate on samples
        for sample in samples:
            # dump json object
            f.write(
                json.dumps(
                    {"sequence": sample.sequence, "label": sample.reference_annotation}
                )
                + "\n"
            )
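To check that the two methods are consistent with each other, a save-then-read round trip is a useful test. The sketch below runs without classy installed: `ToySequenceSample` and `ToyJSONLSequenceDriver` are hypothetical stand-ins, and the assumption that the sample constructor accepts a `label` keyword and exposes it as `reference_annotation` is inferred from the snippets above, not confirmed against the library.

```python
import json
import os
import tempfile
from typing import Iterable, Iterator, List, Optional, Tuple


class ToySequenceSample:
    """Hypothetical stand-in for SequenceSample, for illustration only."""

    def __init__(self, sequence: str, label: Optional[str] = None):
        self.sequence = sequence
        # Assumption: the "label" field is exposed as reference_annotation.
        self.reference_annotation = label


class ToyJSONLSequenceDriver:
    """Mirrors the read and save methods above without depending on classy."""

    def read(self, lines: Iterator[str]) -> Iterator[ToySequenceSample]:
        for line in lines:
            yield ToySequenceSample(**json.loads(line))

    def save(self, samples: Iterator[ToySequenceSample], path: str) -> None:
        with open(path, "w") as f:
            for sample in samples:
                record = {"sequence": sample.sequence, "label": sample.reference_annotation}
                f.write(json.dumps(record) + "\n")


def round_trip(samples: Iterable[ToySequenceSample]) -> List[Tuple[str, Optional[str]]]:
    """Save samples to a temporary .jsonl file and read them back."""
    driver = ToyJSONLSequenceDriver()
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "samples.jsonl")
        driver.save(iter(samples), path)
        with open(path) as f:
            return [(s.sequence, s.reference_annotation) for s in driver.read(f)]
```

If `round_trip` returns the same (sequence, label) pairs that went in, the keys written by save match the ones read expects.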
tip

While both .jsonl and .tsv are one-sample-per-line formats, your own data driver does not need to follow this behavior. As you have access to the lines iterator, you can read your file as you see fit.
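As a sketch of a format that is not one-sample-per-line, the generator below groups lines into blank-line-separated blocks. The format itself is hypothetical (first line of a block is the label, the remaining lines form the sequence), and `read_blocks` is an illustrative helper, not part of classy. In a real driver, the body of read could follow the same pattern and yield samples instead of tuples.

```python
from typing import Iterator, List, Tuple


def read_blocks(lines: Iterator[str]) -> Iterator[Tuple[str, str]]:
    """Yield (label, sequence) pairs from blank-line-separated blocks.

    Hypothetical format: the first line of a block is the label and the
    remaining lines, joined with spaces, form the sequence.
    """
    block: List[str] = []
    for line in lines:
        line = line.strip()
        if line:
            # still inside the current block
            block.append(line)
        elif block:
            # a blank line closes the current block
            yield block[0], " ".join(block[1:])
            block = []
    if block:  # flush the last block when the file has no trailing blank line
        yield block[0], " ".join(block[1:])
```

Because the function consumes the iterator lazily, it never needs to hold more than one block in memory.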