explabox.ingestibles

Ingestibles are your model and data, which can be turned into digestibles that explore/examine/explain/expose your data and/or model.

class explabox.ingestibles.Ingestible(data=None, model=None, splits={'test': 'test', 'train': 'train', 'validation': 'validation'})

Bases: dict

check_requirements(elements=['data', 'model'])

Check if the required elements are in the ingestibles.

Parameters:

elements (List[str], optional) – Elements to check. Defaults to ['data', 'model'].

Raises:

ValueError – The required element is not in the ingestibles.

Returns:

True if all requirements are included.

Return type:

bool
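
The contract above (return True when everything is present, raise ValueError otherwise) can be pictured with a minimal, self-contained sketch. SketchIngestible is a hypothetical stand-in, not the explabox class, and it assumes that an element counts as missing when it is absent or set to None.

```python
from typing import List, Optional


class SketchIngestible(dict):
    """Hypothetical stand-in for Ingestible; not the explabox implementation."""

    def check_requirements(self, elements: Optional[List[str]] = None) -> bool:
        # Mirror the documented default of ['data', 'model'].
        if elements is None:
            elements = ['data', 'model']
        for element in elements:
            # Assumption: an element is missing when it is absent or None.
            if self.get(element) is None:
                raise ValueError(f'Required element "{element}" is not in the ingestibles.')
        return True


ingestible = SketchIngestible(data=[1, 2, 3], model=None)
data_only_ok = ingestible.check_requirements(['data'])  # only 'data' is required here
```

Calling ingestible.check_requirements() with the default elements would raise ValueError in this sketch, because no model was provided.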

property data

get_named_split(name, validate=False)

Get split by name.

Parameters:
  • name (KT) – Name of split.

  • validate (bool, optional) – If True, raise an error when no split is found; if False, return None instead. Defaults to False.

Raises:

ValueError – Unknown split.

Returns:

Provider of split if it exists, else None.

Return type:

Optional[InstanceProvider]
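
A minimal sketch of how the validate flag changes the outcome for an unknown split name. The helper get_named_split_sketch and the plain dict of lists are hypothetical; the real method works on instancelib providers.

```python
from typing import Dict, List, Optional


def get_named_split_sketch(splits: Dict[str, List[str]], name: str,
                           validate: bool = False) -> Optional[List[str]]:
    """Hypothetical helper mirroring the documented contract of get_named_split."""
    if name in splits:
        return splits[name]
    if validate:
        # Assumption: validate=True turns a missing split into a ValueError.
        raise ValueError(f'Unknown split "{name}"')
    return None


splits = {'train': ['a', 'b'], 'test': ['c']}
found = get_named_split_sketch(splits, 'train')
missing = get_named_split_sketch(splits, 'validation')  # validate=False, so None
```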

property labels

Label provider.

property labelset

Label names.

property model

Predictive model.

property splits

Names of splits.

property test

Test data split.

property train

Train data split.

property validation

Validation data split.

explabox.ingestibles.import_data(dataset, data_cols, label_cols, label_map=None, method='infer', _to_instancelib=True, **read_kwargs)

Import data into an instancelib Environment.

Examples

Import from an online .csv file with data in the 'text' column and labels in 'category':

>>> from genbase import import_data
>>> ds = import_data('https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv',
...                  data_cols='text', label_cols='category')

Convert a pandas DataFrame to an instancelib Environment:

>>> from genbase import import_data
>>> import pandas as pd
>>> df = pd.read_csv('https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv')
>>> ds = import_data(df, data_cols='text', label_cols='category')

Download a .zip file and convert each file in the zip to an instancelib Environment:

>>> from genbase import import_data
>>> ds = import_data('https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip',
...                  data_cols='review', label_cols='rating')

Convert a huggingface dataset (sst2) to an instancelib Environment:

>>> from genbase import import_data
>>> from datasets import load_dataset
>>> ds = import_data(load_dataset('glue', 'sst2'), data_cols='sentence', label_cols='label')

Parameters:
  • dataset – Dataset to import, for example a file path or URL, a pandas DataFrame or a Hugging Face dataset.

  • data_cols (Union[KT, List[KT]]) – Name of column(s) containing data.

  • label_cols (Union[KT, List[KT]]) – Name of column(s) containing labels.

  • label_map (Optional[Union[Callable, dict]], optional) – Label renaming dictionary/function. Defaults to None.

  • method (Method, optional) – Method used to import data. Choose from 'infer', 'glob', 'pandas'. Defaults to 'infer'.

  • _to_instancelib (bool, optional) – Whether to convert the final result to instancelib. Defaults to True.

  • **read_kwargs – Optional arguments passed to reading call.

Raises:
  • ImportError – Unable to import file.

  • ValueError – Invalid type of method.

  • NotImplementedError – Import not yet implemented.

Returns:

Environment for each file or dataset provided.

Return type:

Union[il.Environment, pd.DataFrame]

explabox.ingestibles.import_model(model, environment=None, train='train', label_map=None)

Import a model from file or from a Python object.

Examples

Make a scikit-learn text classifier and train it on SST2:

>>> from genbase import import_data, import_model
>>> from datasets import load_dataset
>>> ds = import_data(load_dataset('glue', 'sst2'), data_cols='sentence', label_cols='label')
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> pipeline = Pipeline([('tfidf', TfidfVectorizer()),
...                      ('clf', MultinomialNB())])
>>> import_model(pipeline, ds, train='train')

Load a pretrained ONNX model (downloaded from https://github.com/mpbron/instancelib-onnx/blob/main/example_models/data-model.onnx):

>>> from genbase import import_model
>>> import_model('data-model.onnx', label_map={0: 'Bedrijfsnieuws', 1: 'Games', 2: 'Smartphones'})

Parameters:
  • model – Model or path to model to import.

  • environment (Optional[Environment], optional) – Environment corresponding to model (with dataset and ground-truth labels), used for importing models and/or training them.

  • train (Union[int, float, str, InstanceProvider], optional) – Train split size, name in environment or provider. Defaults to 'train'.

  • label_map (Optional[Dict[LT, LT]], optional) – Conversion of label IDs to named labels. Defaults to None.

Raises:
  • ImportError – Unable to import model or file.

  • NotImplementedError – Type of model is not yet supported.

Returns:

Instancelib wrapped model.

Return type:

AbstractClassifier

explabox.ingestibles.rename_labels(provider, mapping)

Rename labels in a labelprovider or environment.

Parameters:
  • provider (Union[il.Environment, il.LabelProvider]) – Provider to rename labels in.

  • mapping (Union[Callable, dict]) – Rename function or dictionary containing label mapping.

Returns:

Original provider with labels remapped.

Return type:

Union[il.Environment, il.LabelProvider]
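
The two accepted forms of mapping (a dictionary or a function) can be illustrated on a plain dictionary of label sets. rename_labels_sketch is a hypothetical helper, not the library function: it returns a new dictionary rather than modifying a provider, and it assumes that labels absent from a dictionary mapping are kept unchanged.

```python
from typing import Callable, Dict, Hashable, Set, Union


def rename_labels_sketch(labels: Dict[Hashable, Set],
                         mapping: Union[Callable, dict]) -> Dict[Hashable, Set]:
    """Hypothetical illustration of renaming labels with a dict or a function."""
    if callable(mapping):
        rename = mapping
    else:
        # Assumption: labels missing from the dictionary keep their old name.
        def rename(label):
            return mapping.get(label, label)
    return {key: {rename(label) for label in value} for key, value in labels.items()}


labels = {0: {0}, 1: {1}, 2: {0}}  # instance key -> set of labels
by_dict = rename_labels_sketch(labels, {0: 'negative', 1: 'positive'})
by_func = rename_labels_sketch(labels, lambda label: f'class_{label}')
```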

explabox.ingestibles.train_test_split(environment, train_size, train_name='train', test_name='test')

Split an environment into training and test data, and save it to the original environment.

Parameters:
  • environment (instancelib.Environment) – Environment containing all data (environment.dataset), including labels (environment.labels).

  • train_size (Union[int, float]) – Size of the training data, either as a proportion in [0, 1] or as a number of instances (> 1).

  • train_name (str, optional) – Name of train split. Defaults to 'train'.

  • test_name (str, optional) – Name of test split. Defaults to 'test'.

Returns:

Environment with named splits train_name (containing training data) and test_name (containing test data).

Return type:

instancelib.Environment
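
The following sketch shows one way a train_size given as a proportion or as an instance count could be turned into two sets of instance keys. split_keys_sketch, its shuffling step and its seed parameter are assumptions made for illustration; they are not taken from the explabox or instancelib implementation.

```python
import random
from typing import List, Sequence, Tuple, Union


def split_keys_sketch(keys: Sequence[int], train_size: Union[int, float],
                      seed: int = 0) -> Tuple[List[int], List[int]]:
    """Hypothetical split of instance keys into train and test keys."""
    if isinstance(train_size, float):
        # A float is read as a proportion of all instances.
        if not 0.0 <= train_size <= 1.0:
            raise ValueError('Proportion must lie in [0, 1].')
        n_train = round(train_size * len(keys))
    else:
        # An int is read as an absolute number of training instances.
        if not 0 <= train_size <= len(keys):
            raise ValueError('Count must not exceed the number of instances.')
        n_train = train_size
    shuffled = list(keys)
    random.Random(seed).shuffle(shuffled)  # assumption: split after shuffling
    return shuffled[:n_train], shuffled[n_train:]


train_keys, test_keys = split_keys_sketch(range(10), 0.8)
count_train, count_test = split_keys_sketch(range(10), 3)
```

Every key ends up in exactly one of the two returned lists, so the splits never overlap.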

Subpackages: