explabox.ingestibles.data

Handling of data.

explabox.ingestibles.data.import_data(dataset, data_cols, label_cols, label_map=None, method='infer', _to_instancelib=True, **read_kwargs)

Import data into an instancelib Environment.

Examples

Import from an online .csv file with data in the ‘text’ column and labels in ‘category’:

>>> from genbase import import_data
>>> ds = import_data('https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv',
                     data_cols='text', label_cols='category')

Convert a pandas DataFrame to instancelib Environment:

>>> from genbase import import_data
>>> import pandas as pd
>>> df = pd.read_csv('https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv')
>>> ds = import_data(df, data_cols='text', label_cols='category')

Download a .zip file and convert each file in the zip to an instancelib Environment:

>>> from genbase import import_data
>>> ds = import_data('https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip',
                     data_cols='review', label_cols='rating')

Convert a huggingface dataset (sst2) to an instancelib Environment:

>>> from genbase import import_data
>>> from datasets import load_dataset
>>> ds = import_data(load_dataset('glue', 'sst2'), data_cols='sentence', label_cols='label')

Parameters:
  • dataset (_type_) – Dataset to import, such as a URL to a .csv or .zip file, a pandas DataFrame, or a huggingface dataset (see Examples).

  • data_cols (Union[KT, List[KT]]) – Name of column(s) containing data.

  • label_cols (Union[KT, List[KT]]) – Name of column(s) containing labels.

  • label_map (Optional[Union[Callable, dict]], optional) – Label renaming dictionary/function. Defaults to None.

  • method (Method, optional) – Method used to import data. Choose from ‘infer’, ‘glob’, ‘pandas’. Defaults to ‘infer’.

  • _to_instancelib (bool, optional) – Whether to convert the final result to instancelib. Defaults to True.

  • **read_kwargs – Optional arguments passed to reading call.

Raises:
  • ImportError – Unable to import file.

  • ValueError – Invalid type of method.

  • NotImplementedError – Import not yet implemented.

Returns:

Environment for each file or dataset provided.

Return type:

Union[il.Environment, pd.DataFrame]

explabox.ingestibles.data.rename_labels(provider, mapping)

Rename labels in a LabelProvider or Environment.

Parameters:
  • provider (Union[il.Environment, il.LabelProvider]) – Provider to rename labels in.

  • mapping (Union[Callable, dict]) – Rename function or dictionary containing label mapping.

Returns:

Original provider with labels remapped.

Return type:

Union[il.Environment, il.LabelProvider]
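
The mapping argument accepts either a dictionary or a function. The sketch below illustrates that dual interface in plain Python, without instancelib. The helper apply_mapping is hypothetical, and keeping labels that are missing from the dictionary unchanged is an assumption, not documented behaviour of rename_labels.

```python
from typing import Callable, Dict, Hashable, List, Union

Mapping = Union[Callable[[Hashable], Hashable], Dict[Hashable, Hashable]]


def apply_mapping(labels: List[Hashable], mapping: Mapping) -> List[Hashable]:
    """Rename each label with a dict lookup or a function call (hypothetical helper)."""
    if callable(mapping):
        return [mapping(label) for label in labels]
    # Assumption: labels absent from the dictionary are kept as-is.
    return [mapping.get(label, label) for label in labels]


labels = [0, 1, 1, 0]
by_dict = apply_mapping(labels, {0: 'negative', 1: 'positive'})
by_func = apply_mapping(labels, lambda label: f'class_{label}')
```

A dictionary suits a small, fixed label set, while a function is convenient when the new name can be derived from the old one.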

explabox.ingestibles.data.train_test_split(environment, train_size, train_name='train', test_name='test')

Split an environment into training and test data, and save the splits to the original environment.

Parameters:
  • environment (instancelib.Environment) – Environment containing all data (environment.dataset), including labels (environment.labels).

  • train_size (Union[int, float]) – Size of training data, as a proportion in [0, 1] or as a number of instances (> 1).

  • train_name (str, optional) – Name of train split. Defaults to ‘train’.

  • test_name (str, optional) – Name of test split. Defaults to ‘test’.

Returns:

Environment with named splits train_name (containing training data) and test_name (containing test data).

Return type:

instancelib.Environment
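
Because train_size can be either a proportion or an absolute count, it may help to see one way the two readings could be resolved. The following is a minimal sketch in plain Python, not the library's implementation: resolve_train_count and split_keys are hypothetical helpers, and treating every value up to and including 1 as a proportion, rounding to the nearest instance, and shuffling with a fixed seed are all assumptions.

```python
import random
from typing import List, Sequence, Tuple, Union


def resolve_train_count(n_instances: int, train_size: Union[int, float]) -> int:
    """Turn train_size into a number of training instances (hypothetical helper)."""
    if train_size < 0:
        raise ValueError('train_size must be non-negative')
    if train_size <= 1:
        # Assumption: values in [0, 1] are proportions of the full dataset.
        return int(round(n_instances * train_size))
    if train_size > n_instances:
        raise ValueError('train_size exceeds the number of instances')
    # Values above 1 are read as an absolute number of instances.
    return int(train_size)


def split_keys(keys: Sequence[int], train_size: Union[int, float],
               seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle instance keys and cut them into train and test lists."""
    shuffled = list(keys)
    random.Random(seed).shuffle(shuffled)
    n_train = resolve_train_count(len(shuffled), train_size)
    return shuffled[:n_train], shuffled[n_train:]


# 80% of 10 instances for training, the remainder for testing.
train_keys, test_keys = split_keys(range(10), train_size=0.8)
# Exactly 7 instances for training.
train_keys_abs, test_keys_abs = split_keys(range(10), train_size=7)
```

In both calls every key ends up in exactly one of the two lists, which mirrors the named train and test splits described above.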