explabox.ingestibles
Ingestibles are your model and data, which can be turned into digestibles that explore/examine/explain/expose your data and/or model.
- class explabox.ingestibles.Ingestible(data=None, model=None, splits={'test': 'test', 'train': 'train', 'validation': 'validation'})
Bases: dict
- Parameters:
data (Environment | None) –
model (AbstractClassifier | None) –
- check_requirements(elements=['data', 'model'])
Check if the required elements are in the ingestibles.
- Parameters:
elements (List[str], optional) – Elements to check. Defaults to [‘data’, ‘model’].
- Raises:
ValueError – The required element is not in the ingestibles.
- Returns:
True if all requirements are included.
- Return type:
bool
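The contract above can be illustrated with a minimal, standalone sketch. `MiniIngestible` is a hypothetical stand-in written for this example, not the explabox class. The error message wording, and the choice to treat a `None` value the same as a missing key, are assumptions. It only mirrors the documented behaviour: a `dict` subclass whose `check_requirements` returns `True` when every requested element is present and raises `ValueError` otherwise.

```python
from typing import List, Optional


class MiniIngestible(dict):
    """Hypothetical stand-in for illustration; not the explabox implementation."""

    def __init__(self, data=None, model=None):
        super().__init__(data=data, model=model)

    def check_requirements(self, elements: Optional[List[str]] = None) -> bool:
        # Mirror the documented default of ['data', 'model'].
        if elements is None:
            elements = ["data", "model"]
        for element in elements:
            # Assumption: a missing key or a None value counts as "not in the ingestibles".
            if self.get(element) is None:
                raise ValueError(f'The required element "{element}" is not in the ingestibles.')
        return True


ingestible = MiniIngestible(data=["some", "instances"])
print(ingestible.check_requirements(["data"]))  # True
try:
    ingestible.check_requirements()  # 'model' was never provided
except ValueError as err:
    print(err)
```

Calling the check before running an analysis that needs both data and a model lets the failure surface early, with the missing element named.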
- property data
- get_named_split(name, validate=False)
Get split by name.
- Parameters:
name (KT) – Name of split.
validate (bool, optional) – If True, raise an error when no split is found; if False, return None. Defaults to False.
- Raises:
ValueError – Unknown split
- Returns:
Provider of split if it exists, else None.
- Return type:
Optional[InstanceProvider]
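As a rough illustration of the lookup semantics, the sketch below uses a hypothetical helper, `get_named_split_sketch`, over a plain dictionary, with lists standing in for `InstanceProvider` objects. It assumes that `validate=True` raises on an unknown name and that `validate=False` returns `None`, as inferred from the parameter description; it is not the library's code.

```python
from typing import Dict, List, Optional


def get_named_split_sketch(
    splits: Dict[str, List[str]], name: str, validate: bool = False
) -> Optional[List[str]]:
    """Hypothetical helper mirroring the documented `validate` behaviour."""
    if name in splits:
        return splits[name]
    if validate:
        # validate=True: an unknown split is an error.
        raise ValueError(f'Unknown split "{name}"')
    # validate=False: an unknown split yields None.
    return None


splits = {"train": ["doc1", "doc2"], "test": ["doc3"]}
print(get_named_split_sketch(splits, "train"))       # ['doc1', 'doc2']
print(get_named_split_sketch(splits, "validation"))  # None
```

Callers that can work without a given split can test the result for `None`, while callers that require the split can pass `validate=True` and let the `ValueError` propagate.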
- property labels
Label provider.
- property labelset
Label names.
- property model
Predictive model.
- property splits
Names of splits.
- property test
Test data split.
- property train
Train data split.
- property validation
Validation data split.
- explabox.ingestibles.import_data(dataset, data_cols, label_cols, label_map=None, method='infer', _to_instancelib=True, **read_kwargs)
Import data in an instancelib Environment.
Examples
Import from an online .csv file with data in the ‘text’ column and labels in ‘category’:
>>> from genbase import import_data
>>> ds = import_data('https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv', data_cols='text', label_cols='category')
Convert a pandas DataFrame to instancelib Environment:
>>> from genbase import import_data
>>> import pandas as pd
>>> df = pd.read_csv('https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv')
>>> ds = import_data(df, data_cols='text', label_cols='category')
Download a .zip file and convert each file in the zip to an instancelib Environment:
>>> from genbase import import_data
>>> ds = import_data('https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip', data_cols='review', label_cols='rating')
Convert a huggingface dataset (sst2) to an instancelib Environment:
>>> from genbase import import_data
>>> from datasets import load_dataset
>>> ds = import_data(load_dataset('glue', 'sst2'), data_cols='sentence', label_cols='label')
- Parameters:
dataset (_type_) – Dataset to import.
data_cols (Union[KT, List[KT]]) – Name of column(s) containing data.
label_cols (Union[KT, List[KT]]) – Name of column(s) containing labels.
label_map (Optional[Union[Callable, dict]], optional) – Label renaming dictionary/function. Defaults to None.
method (Method, optional) – Method used to import data. Choose from ‘infer’, ‘glob’, ‘pandas’. Defaults to ‘infer’.
_to_instancelib (bool, optional) – Whether to convert the final result to instancelib. Defaults to True.
**read_kwargs – Optional arguments passed to reading call.
- Raises:
ImportError – Unable to import file.
ValueError – Invalid type of method.
NotImplementedError – Import not yet implemented.
- Returns:
Environment for each file or dataset provided.
- Return type:
Union[il.Environment, pd.DataFrame]
- explabox.ingestibles.import_model(model, environment=None, train='train', label_map=None)
Import a model from file or from a Python object.
Examples
Make a scikit-learn text classifier and train it on SST2:
>>> from genbase import import_data, import_model
>>> from datasets import load_dataset
>>> ds = import_data(load_dataset('glue', 'sst2'), data_cols='sentence', label_cols='label')
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> pipeline = Pipeline([('tfidf', TfidfVectorizer()),
...                      ('clf', MultinomialNB())])
>>> import_model(pipeline, ds, train='train')
Load a pretrained ONNX model downloaded from https://github.com/mpbron/instancelib-onnx/blob/main/example_models/data-model.onnx:
>>> from genbase import import_model
>>> import_model('data-model.onnx', label_map={0: 'Bedrijfsnieuws', 1: 'Games', 2: 'Smartphones'})
- Parameters:
model – Model or path to model to import.
environment (Optional[Environment], optional) – Environment corresponding to model (with dataset and ground-truth labels), used for importing models and/or training them.
train (Union[int, float, str, InstanceProvider], optional) – Train split size, name in environment or provider. Defaults to ‘train’.
label_map (Optional[Dict[LT, LT]], optional) – Conversion of label IDs to named labels. Defaults to None.
- Raises:
ImportError – Unable to import model or file.
NotImplementedError – Type of model is not yet supported.
- Returns:
Instancelib wrapped model.
- Return type:
AbstractClassifier
- explabox.ingestibles.rename_labels(provider, mapping)
Rename labels in a labelprovider or environment.
- Parameters:
provider (Union[il.Environment, il.LabelProvider]) – Provider to rename labels in.
mapping (Union[Callable, dict]) – Rename function or dictionary containing label mapping.
- Returns:
Original provider with labels remapped.
- Return type:
Union[il.Environment, il.LabelProvider]
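To make the two accepted forms of `mapping` concrete, here is a minimal sketch over a plain dictionary of label sets. `rename_labels_sketch` is a hypothetical helper, not the explabox function: it returns a new dictionary rather than modifying a provider, and it assumes that labels absent from a dictionary mapping are left unchanged.

```python
from typing import Callable, Dict, FrozenSet, Union

Mapping = Union[Callable[[str], str], Dict[str, str]]


def rename_labels_sketch(
    labels: Dict[int, FrozenSet[str]], mapping: Mapping
) -> Dict[int, FrozenSet[str]]:
    """Hypothetical helper: apply a rename function or dictionary to every label."""
    if callable(mapping):
        rename = mapping
    else:
        # Assumption: labels that are not keys of the dictionary stay as they are.
        rename = lambda label: mapping.get(label, label)
    return {key: frozenset(rename(label) for label in value) for key, value in labels.items()}


labels = {0: frozenset({"pos"}), 1: frozenset({"neg"})}
by_dict = rename_labels_sketch(labels, {"pos": "positive", "neg": "negative"})
by_func = rename_labels_sketch(labels, str.upper)
print(by_dict)
print(by_func)
```

A dictionary suits a fixed, enumerable label set; a function suits systematic changes such as changing case or adding a prefix.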
- explabox.ingestibles.train_test_split(environment, train_size, train_name='train', test_name='test')
Split an environment into training and test data, and save it to the original environment.
- Parameters:
environment (instancelib.Environment) – Environment containing all data (environment.dataset), including labels (environment.labels).
train_size (Union[int, float]) – Size of training data, as a proportion [0, 1] or number of instances > 1.
train_name (str, optional) – Name of train split. Defaults to ‘train’.
test_name (str, optional) – Name of test split. Defaults to ‘test’.
- Returns:
Environment with named splits train_name (containing training data) and test_name (containing test data).
- Return type:
instancelib.Environment
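The two readings of `train_size` (proportion versus instance count) can be sketched with the standard library alone. `split_keys` is a hypothetical helper that works on plain instance keys. The shuffling, the fixed seed, the rounding rule, and the treatment of every integer as a count are assumptions made for this example and may differ from what explabox and instancelib do internally.

```python
import random
from typing import List, Sequence, Tuple, Union


def split_keys(
    keys: Sequence[int], train_size: Union[int, float], seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: resolve train_size and split a sequence of instance keys."""
    if isinstance(train_size, float):
        # A float is read as a proportion of all instances.
        if not 0.0 <= train_size <= 1.0:
            raise ValueError("A float train_size must be a proportion in [0, 1].")
        n_train = round(train_size * len(keys))
    else:
        # An int is read as an absolute number of training instances.
        if not 0 <= train_size <= len(keys):
            raise ValueError("An int train_size must not exceed the number of instances.")
        n_train = train_size
    shuffled = list(keys)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return shuffled[:n_train], shuffled[n_train:]


train, test = split_keys(range(10), 0.8)
print(len(train), len(test))  # 8 2
train, test = split_keys(range(10), 3)
print(len(train), len(test))  # 3 7
```

Every key lands in exactly one of the two lists, which matches the documented result of two named splits covering the original data.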
Subpackages: