Event generation
To train a model and generate synthetic event data, the aindo.rdml.synth.event package provides the EventModel class.
Similar to a TabularModel, an EventModel
should be built, trained, and then used for data generation.
It can also be saved and later loaded for continued use in another session.
Building the model
An EventModel is instantiated using the
EventModel.build() class method.
You must provide:
- An EventPreproc object.
- The model size, which can be a TabularModelSize, a Size, or a string equivalent of the latter.
- Optional arguments: block (either "free" [default] or "lstm") and dropout.
These arguments mirror those in TabularModel.build().
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.event import EventModel, EventPreproc
data = RelationalData(data=..., schema=...)
preproc = EventPreproc.from_schema(
schema=data.schema,
ord_cols={"visits": "date", "diagnosis": "date", ...},
final_tables=["discharge"],
).fit(data=data)
model = EventModel.build(preproc=preproc, size="small")
Training the model
To train an EventModel,
use the EventTrainer class.
Training requires an EventDataset, which is
created from the training data and the same preprocessor used for building the model.
Use the EventDataset.from_data() class method
to process the data and create the dataset.
By default, data is stored in memory.
Alternatively, to save it on disk, set on_disk=True.
If a path is provided to the optional path parameter, it is used as the location where to store the processed data.
If no path is provided, it is saved in a temporary directory.
The processed data stored on disk can be loaded in a different session by instantiating an EventDataset with the EventDataset.from_disk() class method.
As with TabularDataset, the EventDataset.from_data() method has an optional chunk_size parameter.
If given, the raw data is preprocessed one chunk at a time.
We suggest enabling chunked preprocessing when the available RAM is less than the amount required to preprocess the data all at once.
When RAM constraints are an issue, it is often wiser to store the preprocessed data on disk with the on_disk option,
to free up as much memory as possible for training.
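For instance, the following sketch stores the processed data on disk at a chosen location, preprocessing one chunk at a time, and reloads it in a later session. The path and chunk size are illustrative, and we assume EventDataset.from_disk() accepts the same path argument.
from aindo.rdml.synth.event import EventDataset

# Preprocess in chunks and store the result on disk at the given path
dataset_train = EventDataset.from_data(
    data=data_train,
    preproc=preproc,
    on_disk=True,
    path="path/to/processed",  # illustrative location
    chunk_size=10_000,  # illustrative chunk size, tune to the available RAM
)

# In a later session: load the processed data without repeating the preprocessing
dataset_train = EventDataset.from_disk(path="path/to/processed")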
With an EventTrainer and an
EventDataset ready,
you can call EventTrainer.train() to train the model.
Training behaves just like in the tabular case, with support for:
- Dynamic batch sizing based on the provided available memory.
- Validation via a Validation object.
- Differential privacy.
- Custom training hooks.
- Multi-GPU training.
Refer to the tabular training section of this guide for full details.
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.event import EventDataset, EventModel, EventPreproc, EventTrainer
from aindo.rdml.synth import Validation
data = RelationalData(data=..., schema=...)
data_train, data_valid = data.split(ratio=0.1)
preproc = EventPreproc.from_schema(
schema=data.schema,
ord_cols={"visits": "date", "diagnosis": "date", ...},
final_tables=["discharge"],
).fit(data=data)
model = EventModel.build(preproc=preproc, size="small")
dataset_train = EventDataset.from_data(data=data_train, preproc=preproc)
dataset_valid = EventDataset.from_data(data=data_valid, preproc=preproc)
trainer = EventTrainer(model=model)
trainer.train(
dataset=dataset_train,
n_epochs=100,
batch_size=32,
valid=Validation(dataset=dataset_valid, each=1, trigger="epoch"),
)
Saving and loading
Models and trainers can be saved using the EventModel.save() and
EventTrainer.save() methods.
from aindo.rdml.synth.event import EventModel
model = EventModel.build(preproc=..., size=...)
# Train the event model
...
model.save(path="path/to/ckpt")
A saved model or trainer can be later loaded with the EventModel.load() and
EventTrainer.load() methods.
from aindo.rdml.synth.event import EventModel
model = EventModel.load(path="path/to/ckpt")
data_synth = model.generate(
n_samples=1_000,
batch_size=256,
)
Tip
Similar to the tabular case, save the trainer only if you plan to resume training.
Tip
The saved EventTrainer contains the model
(EventTrainer.model attribute), so saving both is redundant.
For more details, please see the saving and loading section of the tabular case.
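If you do plan to resume training in a later session, a minimal sketch might look as follows. We assume here that training resumes by simply calling EventTrainer.train() again on the loaded trainer; the checkpoint path is illustrative.
from aindo.rdml.synth.event import EventTrainer

trainer = EventTrainer(model=model)
trainer.train(dataset=dataset_train, n_epochs=50, batch_size=32)
trainer.save(path="path/to/trainer_ckpt")  # illustrative path

# Later session: reload the trainer (which contains the model) and resume training
trainer = EventTrainer.load(path="path/to/trainer_ckpt")
trainer.train(dataset=dataset_train, n_epochs=50, batch_size=32)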
Generation of event data
A trained EventModel can be used to generate synthetic event data
using the EventModel.generate() method.
The output is a RelationalData object containing the generated data.
Generation modes
To generate new event series from scratch, specify the number of samples using the n_samples parameter.
In this case, the model generates n_samples synthetic event sequences, including rows for the root table.
Alternatively, you can guide the generation process by providing a context through the ctx parameter.
The model will then perform conditional generation, starting from the provided context.
The context can be of two types:
- Partial Root Table: The context may include some or all columns from the root table. In this case, the model will complete the remaining columns of the root table and generate the associated event sequence. These context columns must be declared in the EventPreproc. As an example, to generate time series only for male patients, use as context column the gender column from the patient table, and provide as context a constant column with the value "male" (see the sketch after this list).
- Partial Time Series: The context may include the full root table and partial sequences of events in one or more event tables. In this case, the model will continue generating events from where the sequences left off.
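The following is a minimal sketch of the first mode, generating sequences only for male patients. It assumes that the gender column of the patient table was declared as a context column in the EventPreproc, and that the context can be passed as a RelationalData object built from a partial patient table (here a pandas DataFrame); the exact accepted type may differ.
import pandas as pd
from aindo.rdml.relational import RelationalData

# Hypothetical context: a partial root table with a constant "gender" column
ctx = RelationalData(
    data={"patient": pd.DataFrame({"gender": ["male"] * 1_000})},
    schema=...,
)
# Complete the remaining root table columns and generate the event sequences
data_synth_male = model.generate(
    ctx=ctx,
    batch_size=256,
)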
Stopping criteria
By default, the model learns when to stop generating an event sequence based on the training data. If a final event is generated, the sequence terminates automatically.
You can override this behavior with the following optional parameters
of EventModel.generate():
- max_n_events: The maximum number of events in the final sequence (including those in the context, if provided).
- forbidden_events: A collection of event tables that should not be generated. This includes both final and non-final events. For example, consider two final tables, discharge and ongoing, with the latter containing the final events for the patients with ongoing treatments at the time of the data collection. If you only want to generate discharges, simply add ongoing to this list (as in the sketch below).
- force_final (bool): If True, forces the sequence to end with a final event, unless max_n_events is reached first. This is useful when some sequences in the original data do not end in a final event, while the user needs to generate complete sequences.
- force_event (bool): If True, the model continues generating non-final events until the max_n_events limit is reached.
Warning
A positive value for max_n_events is required when force_event is set to True.
Warning
force_final and force_event cannot both be set to True.
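As a sketch of these options, reusing the hypothetical discharge and ongoing tables from above, and assuming forbidden_events accepts a collection of table names:
# Generate sequences that never contain "ongoing" events and that always
# terminate with a final event (hence a discharge), unless the cap of
# 200 events is reached first
data_synth_discharged = model.generate(
    n_samples=500,
    forbidden_events=["ongoing"],
    force_final=True,
    max_n_events=200,
    batch_size=256,
)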
Additional parameters
There are three additional optional parameters, analogous to those in
TabularModel.generate():
- chunk_size: The size of each chunk for chunked generation. By default, the generation is performed in a single chunk.
- batch_size: The number of samples generated in parallel. Defaults to 0, which means all samples are generated in one batch.
- temp: A strictly positive number (default 1) controlling the randomness of generation. Higher values increase variability, while lower values make the generation more deterministic.
Tip
As in the tabular case, use chunked generation when the available RAM is not enough to pre-process the context and post-process the generated event data in a single chunk.
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.event import EventModel
data = RelationalData(data=..., schema=...)
model = EventModel.build(preproc=..., size=...)
# Train the model
...
# Generate synthetic events from scratch
data_synth = model.generate(
n_samples=200,
batch_size=512,
)
# Continue the time series in the test set,
# until a final event is reached, or the series reaches 200 events
data_synth_continue = model.generate(
ctx=data,
force_final=True,
max_n_events=200,
batch_size=512,
)
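A hedged variant of the first call above, adding chunked generation and a lower sampling temperature (both values are illustrative):
# Generate in chunks of 50 samples to bound memory usage, with a slightly
# lower temperature for more conservative, less variable sequences
data_synth_cool = model.generate(
    n_samples=200,
    chunk_size=50,
    batch_size=64,
    temp=0.8,
)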
Generation of "what-if" scenarios
One powerful use case is the generation of "what-if" scenarios.
In many cases, it may be useful to simulate multiple possible future outcomes based on a given input time series.
To support this, the EventModel offers the
EventModel.generate_sample() method.
This method requires two main inputs:
- ctx: A context (such as a partial root table or partial time series, as described above).
- n: An integer indicating the number of samples to generate.
It also accepts the same optional parameters as the
EventModel.generate() method.
These parameters allow the user to control the stopping criteria for event generation, the chunk and batch sizes,
and the sampling temperature.
The output is a GenSample object, which is a list of RelationalData instances.
Each sample represents a possible future outcome and contains the corresponding generated events.
If the input context is a partial root table, each scenario will also include the completed root table columns.
Example
Imagine you have the medical records of a single patient up to a certain point in time, represented as an (incomplete) event series. You may want to explore the likelihood of various future outcomes if the sequence continues, and how such likelihood may change according to the possible next actions taken.
To do this, you can use the partial sequences as context for a generation of n scenarios, and set the desired
stopping criterion—e.g., with force_final=True if you're specifically interested in the discharge information.
The model will produce n plausible continuations of the patient’s medical history,
each representing a different possible future.
These can then be analyzed statistically to understand the range and distribution of potential outcomes.
Now, consider a second scenario where the event series is identical except for a single modification to the last event—perhaps a different diagnosis or procedure. This gives you two versions of the patient's timeline, each diverging at one key point.
By running the same multi-sample generation on both versions, you can compare the outcomes and assess the statistical impact of that one differing event on the patient’s future medical history.
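A minimal sketch of this comparison, assuming ctx_patient and ctx_patient_alt are RelationalData objects holding the two versions of the partial time series, and that the returned GenSample can be iterated over directly:
# Generate 1000 plausible, complete continuations for each version of the timeline
scenarios = model.generate_sample(ctx=ctx_patient, n=1_000, force_final=True)
scenarios_alt = model.generate_sample(ctx=ctx_patient_alt, n=1_000, force_final=True)

# Each element is a RelationalData instance with one possible future;
# compare the distribution of a statistic of interest across the two sets
for scenario, scenario_alt in zip(scenarios, scenarios_alt):
    ...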