Relational data structure
The primary data structure of the aindo.rdml package is the
RelationalData class, which encapsulates a relational dataset
consisting of one or more tables.
A RelationalData object is a dictionary of pandas.DataFrame objects
that contain the data tables, along with a Schema object that describes:
- The types of the table columns.
- The relational structure of the dataset, including primary and foreign keys.
The RelationalData class, together with the
Schema class and other necessary classes for building one,
can be found in the aindo.rdml.relational module.
The Schema class
A Schema object is a collection of named
Table objects that also records the relations among the tables.
Each Table object contains the columns of interest of that table.
There are two primary types of columns:
- PrimaryKey's and ForeignKey's, which define the relational structure of the data.
- Feature Column's, namely columns that are not keys.
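The role of the two key types can be illustrated with plain pandas, independently of the library. The sketch below uses small tables with invented values and checks that a primary key is unique and that every foreign key value refers to an existing parent row. The helper names are hypothetical and are not part of aindo.rdml.

```python
import pandas as pd

# Toy tables with invented values, only to illustrate the two key types.
players = pd.DataFrame({"playerID": ["p1", "p2", "p3"], "pos": ["G", "F", "C"]})
season = pd.DataFrame({"playerID": ["p1", "p1", "p3"], "points": [10, 12, 7]})


def is_valid_primary_key(df: pd.DataFrame, key: str) -> bool:
    """A primary key must have no missing values and no duplicates."""
    return bool(df[key].notna().all() and df[key].is_unique)


def orphan_rows(child: pd.DataFrame, parent: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return the child rows whose foreign key has no match in the parent table."""
    return child[~child[key].isin(parent[key])]


pk_ok = is_valid_primary_key(players, "playerID")
orphans = orphan_rows(season, players, "playerID")
print(pk_ok, len(orphans))
```

Running such checks on the loaded tables before declaring the keys can surface duplicated identifiers or dangling references early.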
Feature columns
When building a Schema, each feature column is assigned
a Column type.
This type tells the various routines of the library how to treat the data in the column.
For example, a Column.CATEGORICAL will be preprocessed differently than a Column.INTEGER before being fed to the
generative model during training
(more info in the ColumnPreproc section).
It will also appear differently in the evaluation report
(more info in the Synthetic data report section).
The available Column types are:
BOOLEAN, CATEGORICAL, NUMERIC, INTEGER, DATE, TIME, DATETIME, COORDINATES, ITAFISCALCODE, and TEXT.
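When choosing a Column type for each feature column, the pandas dtypes of the loaded data can offer a first hint. The helper below is a hypothetical heuristic and is not part of aindo.rdml: it returns the name of a Column type as a plain string, covers only a few of the available types, and its guesses should always be reviewed (for instance, an integer-coded category is better declared as CATEGORICAL). The sample values are invented.

```python
import pandas as pd
from pandas.api import types as pdt


def suggest_column_type(series: pd.Series) -> str:
    """Suggest the name of a Column type from a pandas dtype (rough heuristic)."""
    if pdt.is_bool_dtype(series):
        return "BOOLEAN"
    if pdt.is_integer_dtype(series):
        return "INTEGER"
    if pdt.is_float_dtype(series):
        return "NUMERIC"
    if pdt.is_datetime64_any_dtype(series):
        return "DATETIME"
    # Strings and anything else default to CATEGORICAL in this sketch.
    return "CATEGORICAL"


# Invented sample values for three columns of the example dataset.
df = pd.DataFrame(
    {
        "pos": ["G", "F", "C"],
        "height": [75.0, 80.5, 84.0],
        "GP": [82, 60, 71],
    }
)
suggestions = {name: suggest_column_type(df[name]) for name in df.columns}
print(suggestions)
```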
Building a Schema
To illustrate how to build a Schema from scratch, let us work with (a subset of) the
BasketballMen dataset,
which consists of the following tables:
- players: The root table, with the primary key playerID.
- season: A child table of players, linked via the foreign key playerID.
- all_star: Another child table of players, connected by the foreign key playerID.
Let us load the tables with pandas and gather the pandas.DataFrame's into a dictionary:
import pandas as pd
df_players = pd.read_csv("path/to/basket/dir/players.csv")
df_season = pd.read_csv("path/to/basket/dir/season.csv")
df_all_star = pd.read_csv("path/to/basket/dir/all_star.csv")
dfs = {
    "players": df_players,
    "season": df_season,
    "all_star": df_all_star,
}
To build a Schema, users must import the Column,
PrimaryKey, ForeignKey,
Table, and Schema objects
from the aindo.rdml.relational module.
Tables and columns that are present in the data but that are not included in the
Schema will be ignored.
from aindo.rdml.relational import Column, ForeignKey, PrimaryKey, Schema, Table
schema = Schema(
    players=Table(
        playerID=PrimaryKey(),
        pos=Column.CATEGORICAL,
        height=Column.NUMERIC,
        weight=Column.NUMERIC,
        college=Column.CATEGORICAL,
        race=Column.CATEGORICAL,
        birthCity=Column.CATEGORICAL,
        birthState=Column.CATEGORICAL,
        birthCountry=Column.CATEGORICAL,
    ),
    season=Table(
        playerID=ForeignKey(parent="players"),
        year=Column.INTEGER,
        stint=Column.INTEGER,
        tmID=Column.CATEGORICAL,
        lgID=Column.CATEGORICAL,
        GP=Column.INTEGER,
        points=Column.INTEGER,
        GS=Column.INTEGER,
        assists=Column.INTEGER,
        steals=Column.INTEGER,
        minutes=Column.INTEGER,
    ),
    all_star=Table(
        playerID=ForeignKey(parent="players"),
        conference=Column.CATEGORICAL,
        league_id=Column.CATEGORICAL,
        points=Column.INTEGER,
        rebounds=Column.INTEGER,
        assists=Column.INTEGER,
        blocks=Column.INTEGER,
    ),
)
print(schema)
Out:
Schema:
players:Table
Primary key: playerID
Feature columns:
pos:<Column.CATEGORICAL: 'Categorical'>
height:<Column.NUMERIC: 'Numeric'>
weight:<Column.NUMERIC: 'Numeric'>
college:<Column.CATEGORICAL: 'Categorical'>
race:<Column.CATEGORICAL: 'Categorical'>
birthCity:<Column.CATEGORICAL: 'Categorical'>
birthState:<Column.CATEGORICAL: 'Categorical'>
birthCountry:<Column.CATEGORICAL: 'Categorical'>
Foreign keys:
season:Table
Primary key: None
Feature columns:
year:<Column.INTEGER: 'Integer'>
stint:<Column.INTEGER: 'Integer'>
tmID:<Column.CATEGORICAL: 'Categorical'>
lgID:<Column.CATEGORICAL: 'Categorical'>
GP:<Column.INTEGER: 'Integer'>
points:<Column.INTEGER: 'Integer'>
GS:<Column.INTEGER: 'Integer'>
assists:<Column.INTEGER: 'Integer'>
steals:<Column.INTEGER: 'Integer'>
minutes:<Column.INTEGER: 'Integer'>
Foreign keys:
playerID:ForeignKey(parent=players)
all_star:Table
Primary key: None
Feature columns:
conference:<Column.CATEGORICAL: 'Categorical'>
league_id:<Column.CATEGORICAL: 'Categorical'>
points:<Column.INTEGER: 'Integer'>
rebounds:<Column.INTEGER: 'Integer'>
assists:<Column.INTEGER: 'Integer'>
blocks:<Column.INTEGER: 'Integer'>
Foreign keys:
playerID:ForeignKey(parent=players)
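Since tables and columns left out of the Schema are ignored, it can be useful to list what would be dropped before going further. The sketch below works on plain column names and is independent of the library. The function name, the extra firstName column and the teams table are hypothetical, added only to have something to report; with pandas, the column names would come from list(df.columns).

```python
# Columns present in the loaded data (firstName and teams are hypothetical).
data_columns = {
    "players": ["playerID", "pos", "height", "firstName"],
    "teams": ["tmID", "name"],
}
# Columns declared in the schema, keys included.
schema_columns = {
    "players": ["playerID", "pos", "height"],
}


def ignored_items(data: dict[str, list[str]], declared: dict[str, list[str]]):
    """Return the tables and the per-table columns missing from the schema."""
    ignored_tables = sorted(set(data) - set(declared))
    ignored_cols = {
        table: [c for c in cols if c not in declared[table]]
        for table, cols in data.items()
        if table in declared
    }
    return ignored_tables, ignored_cols


tables, columns = ignored_items(data_columns, schema_columns)
print(tables, columns)
```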
The RelationalData class
A RelationalData object is defined by combining the loaded data
and a Schema object:
from aindo.rdml.relational import RelationalData, Schema
dfs = {
    "players": ...,
    "season": ...,
    "all_star": ...,
}
schema = Schema(...)
data = RelationalData(data=dfs, schema=schema)
The RelationalData.split() method splits the data
into train, test and, optionally, validation sets, while preserving the consistency of the relational data structure.
from aindo.rdml.relational import RelationalData
data = RelationalData(data=..., schema=...)
data_train_valid, data_test = data.split(ratio=0.1)
data_train, data_valid = data_train_valid.split(ratio=0.1)
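To see what preserving the relational structure means in a split, the following pandas sketch samples rows of the root table and then sends each child row to the same side as its parent. It illustrates the idea only and is not the implementation of RelationalData.split(); the function name, the toy data and the reading of ratio as the fraction of root rows held out are assumptions.

```python
import pandas as pd


def split_by_root(root, child, key, ratio, seed=0):
    """Hold out a fraction `ratio` of root rows and the child rows that reference them."""
    held_out = root.sample(frac=ratio, random_state=seed)
    kept = root.drop(held_out.index)
    # A child row follows its parent, so no foreign key is left dangling.
    child_held_out = child[child[key].isin(held_out[key])]
    child_kept = child[child[key].isin(kept[key])]
    return (kept, child_kept), (held_out, child_held_out)


# Toy tables with invented values.
players = pd.DataFrame({"playerID": [f"p{i}" for i in range(10)]})
season = pd.DataFrame(
    {"playerID": ["p0", "p0", "p3", "p7", "p9"], "points": [5, 8, 3, 9, 1]}
)

(train_p, train_s), (test_p, test_s) = split_by_root(
    players, season, "playerID", ratio=0.2
)
print(len(train_p), len(test_p))
```

Splitting each table independently would instead risk placing a season row in the training set while its player ends up in the test set.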