Model training
The aindo.rdml library offers two generative models for synthetic data generation:
- A TabularModel that generates all the relational data, excluding columns that contain text.
- A TextModel that generates only text columns. Users must specify a TextModel for each table containing text columns.
Tabular Model
To instantiate and build a TabularModel the user needs to use the
TabularModel.build() class method and to provide a preproc, which is
a TabularPreproc object, and a size, denoting the desired model dimensions.
The size argument can be defined in one of the following formats:
- A TabularModelSize object containing the integer attributes n_layers, h and d;
- A string or a Size object, internally mapping to a default configuration of TabularModelSize. The options are: "small" / Size.SMALL, "medium" / Size.MEDIUM, or "large" / Size.LARGE.
The user may specify the type of layer used by the model with the block parameter.
The available blocks are "free" (the default) and "lstm".
Optionally, the user may also provide a dropout value for the dropout layers in the model.
The model is trained using a TabularTrainer object,
which is built from the TabularModel.
The trainer has an optional parameter dp_budget, which, if provided, must be
a DpBudget object containing the (epsilon, delta)-budget
for differentially private (DP) training.
If not provided, the training will have no differential privacy guarantees.
Note that DP training is available only for single-table datasets.
To train a model, the user also needs to build a TabularDataset object
containing the preprocessed training data.
The TabularDataset is built from the raw training data
and the same TabularPreproc object used to build the model.
There are three options to instantiate a TabularDataset object:
- From the raw data, and storing the processed data in RAM.
In this case, the TabularDataset.from_data() method should be invoked.
- From the raw data, but storing the processed data on disk.
In this case, the TabularDataset.from_data() method should again be invoked, but with the on_disk parameter set to True. Moreover, the path parameter can be used to provide a directory in which to store the processed data. By default, the data is stored in a temporary directory and deleted at the end of the process. When the data is stored on disk, it is loaded one batch at a time during training. This may slightly slow down the training, but reduces memory consumption.
- From data already processed and stored on disk.
When the TabularDataset.from_data() method is used with on_disk set to True and a path is provided, the data is stored in the given directory and can be accessed again later with the TabularDataset.from_disk() method, providing the TabularPreproc and the path to the directory.
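The second and third options can be combined into a store-then-reload workflow. The following is a minimal sketch, not a verified snippet: the helper name store_and_reload is hypothetical, and the keyword names preproc and path passed to TabularDataset.from_disk() are assumed from the description above, so please check them against the API reference.

```python
from pathlib import Path


def store_and_reload(data_train, preproc, path: Path):
    """Preprocess the raw data once, store it on disk, then reopen it.

    Hypothetical helper: `data_train` is the raw training data and `preproc`
    the fitted TabularPreproc that was also used to build the model.
    """
    # Imported lazily so the sketch can be defined without the library installed.
    from aindo.rdml.synth import TabularDataset

    # Option 2: preprocess the raw data and keep the result in `path`.
    TabularDataset.from_data(
        data=data_train,
        preproc=preproc,
        on_disk=True,
        path=path,
    )

    # Option 3 (possibly in a later session): reopen the stored data without
    # preprocessing again. The keyword names used here are an assumption.
    return TabularDataset.from_disk(preproc=preproc, path=path)
```

Providing an explicit path matters here: with the default temporary directory the processed data is deleted at the end of the process and cannot be reopened later.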
Tip
When creating the TabularDataset from a very large raw dataset,
we suggest storing the TabularDataset on disk with the on_disk option,
to keep more RAM available during training.
The critical size of the raw data depends on the total memory available (and on the amount of RAM needed
to train the desired model with the desired batch size).
However, the raw data must be preprocessed before being saved to disk, and in extreme cases
this operation may require more RAM than is available.
In this case, the user should also provide a value for the chunk_size parameter.
The chunk_size should be a positive integer. If provided, the raw data is split into chunks of
the given size, and the preprocessing is performed one chunk at a time.
As soon as a chunk is preprocessed, it is saved to disk (if the on_disk option is chosen).
The smaller the value of chunk_size, the more the memory footprint of the preprocessing is reduced,
at the expense of possibly taking longer.
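The trade-off behind chunk_size can be illustrated with a generic, library-independent sketch. It is not the library's implementation: the function names, the JSON file layout and the preprocess callback are all hypothetical. The point is only that a single chunk is held in processed form at any time.

```python
import json
from pathlib import Path
from typing import Callable, Iterator


def iter_chunks(rows: list, chunk_size: int) -> Iterator[list]:
    """Yield consecutive slices of `rows` holding at most `chunk_size` items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def preprocess_in_chunks(
    rows: list,
    preprocess: Callable,
    out_dir,
    chunk_size: int,
) -> list:
    """Preprocess one chunk at a time and write each chunk to its own file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(iter_chunks(rows, chunk_size)):
        # Only this chunk exists in processed form in memory.
        processed = [preprocess(row) for row in chunk]
        path = out_dir / f"chunk_{i:05d}.json"
        # Saved as soon as it is ready, then dropped on the next iteration.
        path.write_text(json.dumps(processed))
        paths.append(path)
    return paths
```

A smaller chunk_size lowers the peak memory of the loop body but produces more files and more write operations, which is where the extra time may come from.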
The TabularDataset has another optional argument, block_size,
which is an integer fixing the maximum length of the internal representation of the input used during training.
A smaller block_size will reduce the time of a single training epoch, but will introduce approximations that
may compromise the quality of the generated synthetic data.
The given block_size should be larger than the maximal internal representation of each table in the dataset.
For this reason, this parameter is available only for multi-table datasets.
Once the trainer and the training dataset are ready,
the TabularTrainer.train() method is used to train the model.
The method requires:
- The training dataset (dataset);
- The desired number of training epochs (n_epochs), or alternatively of training steps (n_steps);
- Either the batch size (batch_size) or the available (CPU or GPU) memory in MB (memory). The latter is used to compute an optimal batch size, such that:
  - The batch size is a power of two and does not exceed 256: batch_size = 2**n, 0 <= n <= 8;
  - Each epoch consists of at least 50 steps: len(dataset) // batch_size >= 50;
  - Each training step does not require more (CPU or GPU) memory than the amount provided.
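The three constraints on the automatically chosen batch size can be sketched as a small search. This is an illustration under stated assumptions, not the library's algorithm: it assumes that "optimal" means the largest admissible value, the fits_in_memory callback is a hypothetical stand-in for the library's internal memory estimate, and the error raised when no value qualifies is a choice made for this sketch.

```python
from typing import Callable


def pick_batch_size(n_samples: int, fits_in_memory: Callable[[int], bool]) -> int:
    """Return the largest batch size that satisfies all three constraints."""
    # Constraint 1: batch_size = 2**n with 0 <= n <= 8, tried from largest to smallest.
    for n in range(8, -1, -1):
        batch_size = 2 ** n
        # Constraint 2: each epoch must consist of at least 50 steps.
        enough_steps = n_samples // batch_size >= 50
        # Constraint 3: one training step must fit in the memory provided.
        if enough_steps and fits_in_memory(batch_size):
            return batch_size
    # Assumption of this sketch: fail loudly if nothing qualifies.
    raise ValueError("no batch size satisfies all constraints")
```

For instance, with 1000 samples and no memory limit the sketch returns 16, since 1000 // 32 is only 31 steps per epoch while 1000 // 16 is 62.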
Additionally, users can provide the optional arguments:
- lr: The learning rate, whose optimal value is otherwise automatically determined.
- valid: A Validation object that configures validation during training. The validation dataset must be provided as a TabularDataset object via the dataset argument, and various functionalities can be activated with the dedicated arguments, including learning rate scheduling and early stopping. To protect the validation data with DP guarantees, a DpValid object should be provided through the dp parameter. For further information, please refer to the API reference.
- hooks: A sequence of custom training hooks crafted by the user, described in the next section.
- accumulate_grad: The number of gradient accumulation steps. By default, it is set to 1, meaning the model is updated at each step.
- dp_step: A DpStep object containing the data needed for the differentially private step. It should be provided if and only if the trainer was equipped with a DP budget, and therefore only for single-table datasets. For the available settings, please refer to the API reference.
- world_size: The number of GPUs to use for distributed training. If more than one GPU is available, distributing the training over the available GPUs can speed up the training. If 0 (the default), the training is performed on a single device, the current device of the TabularTrainer object.
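To make the accumulate_grad option more concrete, here is a generic sketch of gradient accumulation on a one-parameter least-squares model. It is independent of the library: whether the trainer averages or sums the accumulated gradients is not stated above, so the averaging used here is an assumption, and all function names are hypothetical.

```python
def grad_mse(w: float, batch: list) -> float:
    """Gradient with respect to w of the mean squared error of y ~ w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, micro_batches: list) -> float:
    """Average the gradients of several micro-batches before a single update."""
    total = 0.0
    for batch in micro_batches:
        total += grad_mse(w, batch)  # accumulate, no parameter update yet
    return total / len(micro_batches)


def effective_batch_size(batch_size: int, accumulate_grad: int) -> int:
    """Number of samples that contribute to each parameter update."""
    return batch_size * accumulate_grad
```

With micro-batches of equal size, the averaged gradient equals the gradient of one large batch, so batch_size=2 with accumulate_grad=4 behaves like a batch of 8 samples while only 2 samples are held in memory per step.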
Here is an example of training the tabular model, with a validation step at the end of each epoch:
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularDataset, TabularModel, TabularPreproc, TabularTrainer, Validation

# Load the relational data and split it into training and validation sets
data = RelationalData(data=..., schema=...)
data_train, data_valid = data.split(ratio=0.1)

# Fit the preprocessor and build the model from it
preproc = TabularPreproc.from_schema(schema=data.schema).fit(data=data)
model_tabular = TabularModel.build(preproc=preproc, size="small")

# Build both datasets with the same preprocessor used to build the model
dataset_train = TabularDataset.from_data(data=data_train, preproc=preproc)
dataset_valid = TabularDataset.from_data(data=data_valid, preproc=preproc)

# Train the model, running validation at the end of each epoch
trainer_tabular = TabularTrainer(model=model_tabular)
trainer_tabular.train(
    dataset=dataset_train,
    n_epochs=100,
    batch_size=32,
    valid=Validation(dataset=dataset_valid, each=1, trigger="epoch"),
)
Custom hooks (expert user)
The experienced user might opt to specify personalized training hooks using the hooks parameter of the
TabularTrainer.train() method.
These hooks must extend the TrainHook class, whose __init__() method takes at least two arguments
to define the frequency of the activation of the hook:
an integer each, and a trigger, that may be "epoch" or "step".
A custom hook must implement the _hook(n) method, which is invoked when the hook is triggered
by the each and trigger arguments and receives as an argument the number of current epoch or current step,
depending on the value of trigger.
A custom hook may also override the following methods:
- setup(trainer, hooks), invoked before the training begins. It takes as input the trainer and the previously defined hooks.
- hook(), called at each training step. Its default behavior is to check whether the trigger is activated and, if so, to call the _hook() method.
- _cleanup(), called at the end of the training. It should return the status of the current hook.
- cleanup(hook_status), called at the end of the training. It receives as input the status of the previous hooks and should return the status of the current hook. Its default behavior is to check the statuses of the previous hooks and to call the _cleanup() method.
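The interplay between each, trigger, hook() and _hook(n) can be sketched with a small stand-in class. This is not the library's TrainHook: the class SketchHook, the arguments of its hook() method and the fake training loop are hypothetical and only mimic the behavior described above. The setup and cleanup methods are left out for brevity.

```python
class SketchHook:
    """Stand-in for the trigger logic described above (NOT the library's TrainHook)."""

    def __init__(self, each: int, trigger: str) -> None:
        if trigger not in ("epoch", "step"):
            raise ValueError("trigger must be 'epoch' or 'step'")
        if each < 1:
            raise ValueError("each must be a positive integer")
        self.each = each
        self.trigger = trigger

    def hook(self, epoch: int, step: int, end_of_epoch: bool) -> None:
        """Called after every training step; calls _hook(n) when the trigger fires."""
        if self.trigger == "step" and step % self.each == 0:
            self._hook(step)
        elif self.trigger == "epoch" and end_of_epoch and epoch % self.each == 0:
            self._hook(epoch)

    def _hook(self, n: int) -> None:
        raise NotImplementedError


class RecordingHook(SketchHook):
    """Custom hook that records the epoch or step numbers at which it fired."""

    def __init__(self, each: int, trigger: str) -> None:
        super().__init__(each, trigger)
        self.calls = []

    def _hook(self, n: int) -> None:
        self.calls.append(n)


def run_fake_training(hooks: list, n_epochs: int, steps_per_epoch: int) -> None:
    """Hypothetical training loop that only drives the hooks."""
    step = 0
    for epoch in range(1, n_epochs + 1):
        for i in range(1, steps_per_epoch + 1):
            step += 1
            for h in hooks:
                h.hook(epoch, step, end_of_epoch=(i == steps_per_epoch))
```

In this sketch a custom hook only overrides _hook(n), while the base class decides when to call it. A real custom hook would extend TrainHook instead and be passed to TabularTrainer.train() through the hooks parameter.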
Text Model
As for the TabularModel, to instantiate and build
a TextModel instance, the user is required to provide a preproc,
which in this case is a TextPreproc, and a size,
which is a TextModelSize, a Size, or a string
representation of the latter.
For a TextModel, the user is also required to provide a block_size, corresponding to
the maximum text length that the model can process in a single forward step.
Finally, the user may provide the optional dropout parameter.
Alternatively, the user may build a TextModel from a pretrained model,
with the constructor TextModel.build_from_pretrained(),
providing a TextPreproc and a path to the pretrained model.
The optional block_size option is also available to fix the maximum text length that the model can process
during fine-tuning.
To build the training (and validation) dataset, the user must instantiate
a TextDataset object.
The options are similar to the ones for the TabularDataset,
however in this case the max_block_size parameter is not available.
To reduce the block size, it is possible to set the block_size parameter
in the TextModel.build(), or the TextModel.max_block_size attribute.
A reasonable value for the block size can be obtained from the TextDataset.max_text_len attribute
of the training dataset.
The associated trainer is a TextTrainer object,
which is built from a TextModel.
At the moment, DP training is not available for TextTrainer models,
therefore the dp_budget option is not available.
The TextTrainer.train() method has the same arguments
as the TabularTrainer.train() method,
except for the dp_step option, which is not available.
Here is an example of training the text model:
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TextDataset, TextModel, TextPreproc, TextTrainer

# Load the relational data and split it into training and validation sets
data = RelationalData(data=..., schema=...)
data_train, data_valid = data.split(ratio=0.1)

# Fit a text preprocessor for the table that contains text columns
preproc_text = TextPreproc.from_schema_table(schema=data.schema, table="listings").fit(data=data)

# Build the text model; block_size is the maximum text length per forward step
model_text = TextModel.build(
    preproc=preproc_text,
    size="small",
    block_size=1024,
)

# Build the training dataset and train the model
dataset_train = TextDataset.from_data(data=data_train, preproc=preproc_text)
trainer_text = TextTrainer(model=model_text)
trainer_text.train(
    dataset=dataset_train,
    n_epochs=100,
    batch_size=32,
)