explabox.ingestibles.data
Handling of data.
- explabox.ingestibles.data.import_data(dataset, data_cols, label_cols, label_map=None, method='infer', _to_instancelib=True, **read_kwargs)
Import data into an instancelib Environment.
Examples
Import from an online .csv file with data in the ‘text’ column and labels in ‘category’:
>>> from genbase import import_data
>>> ds = import_data('https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv', data_cols='text', label_cols='category')
Convert a pandas DataFrame to instancelib Environment:
>>> from genbase import import_data
>>> import pandas as pd
>>> df = pd.read_csv('https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv')
>>> ds = import_data(df, data_cols='text', label_cols='category')
Download a .zip file and convert each file in the zip to an instancelib Environment:
>>> from genbase import import_data
>>> ds = import_data('https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip', data_cols='review', label_cols='rating')
Convert a huggingface dataset (sst2) to an instancelib Environment:
>>> from genbase import import_data
>>> from datasets import load_dataset
>>> ds = import_data(load_dataset('glue', 'sst2'), data_cols='sentence', label_cols='label')
- Parameters:
dataset (_type_) – Dataset to import.
data_cols (Union[KT, List[KT]]) – Name of column(s) containing data.
label_cols (Union[KT, List[KT]]) – Name of column(s) containing labels.
label_map (Optional[Union[Callable, dict]], optional) – Label renaming dictionary/function. Defaults to None.
method (Method, optional) – Method used to import data. Choose from ‘infer’, ‘glob’, ‘pandas’. Defaults to ‘infer’.
_to_instancelib (bool, optional) – Whether to convert the final result to instancelib. Defaults to True.
**read_kwargs – Optional arguments passed to reading call.
- Raises:
ImportError – Unable to import file.
ValueError – Invalid type of method.
NotImplementedError – Import not yet implemented.
- Returns:
Environment for each file or dataset provided, or a pandas DataFrame when _to_instancelib is False.
- Return type:
Union[il.Environment, pd.DataFrame]
- explabox.ingestibles.data.rename_labels(provider, mapping)
Rename labels in a labelprovider or environment.
- Parameters:
provider (Union[il.Environment, il.LabelProvider]) – Provider to rename labels in.
mapping (Union[Callable, dict]) – Rename function or dictionary containing label mapping.
- Returns:
Original provider with labels remapped.
- Return type:
Union[il.Environment, il.LabelProvider]
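The mapping argument accepts either a dictionary or a function. The sketch below illustrates the difference on a plain dictionary of label sets rather than on a real instancelib provider. The helper remap_labels and the sample data are hypothetical and are not part of explabox. The sketch also assumes that labels missing from a dictionary mapping keep their original name, which may not match the library's behavior.

```python
from typing import Callable, Dict, FrozenSet, Hashable, Union

LabelMapping = Union[Callable[[Hashable], Hashable], Dict[Hashable, Hashable]]


def remap_labels(
    labels: Dict[int, FrozenSet[Hashable]], mapping: LabelMapping
) -> Dict[int, FrozenSet[Hashable]]:
    """Return a copy of `labels` with every label passed through `mapping`."""
    if callable(mapping):
        rename = mapping
    else:
        # Assumption: labels absent from the dictionary are left unchanged.
        def rename(label: Hashable) -> Hashable:
            return mapping.get(label, label)

    return {
        key: frozenset(rename(label) for label in label_set)
        for key, label_set in labels.items()
    }


# Hypothetical label store: instance key -> set of labels.
labels = {0: frozenset({0}), 1: frozenset({1}), 2: frozenset({0})}

# Dictionary mapping: explicit old-name -> new-name pairs.
by_dict = remap_labels(labels, {0: "negative", 1: "positive"})

# Function mapping: computed from the old label.
by_func = remap_labels(labels, lambda label: f"class_{label}")
```

A dictionary is the more readable choice for a small, fixed label set, while a function suits systematic renames such as adding a prefix.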
- explabox.ingestibles.data.train_test_split(environment, train_size, train_name='train', test_name='test')
Split an environment into training and test data, and save both splits to the original environment.
- Parameters:
environment (instancelib.Environment) – Environment containing all data (environment.dataset), including labels (environment.labels).
train_size (Union[int, float]) – Size of training data, as a proportion [0, 1] or number of instances > 1.
train_name (str, optional) – Name of train split. Defaults to ‘train’.
test_name (str, optional) – Name of test split. Defaults to ‘test’.
- Returns:
Environment with named splits train_name (containing training data) and test_name (containing test data).
- Return type:
instancelib.Environment
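Because train_size can be a proportion or an absolute count, the two cases can be illustrated as follows. This is a minimal sketch on a plain list of instance keys, not the library's implementation. The helper split_keys, its seed parameter, the shuffling, and the rounding rule are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple, Union


def split_keys(
    keys: Sequence[int], train_size: Union[int, float], seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split `keys` into (train, test) using a proportion or an absolute count."""
    n_total = len(keys)
    if isinstance(train_size, float):
        # Float: interpreted as a proportion of all instances.
        if not 0.0 <= train_size <= 1.0:
            raise ValueError("A float train_size must lie in [0, 1].")
        n_train = round(train_size * n_total)  # assumption: round to nearest
    else:
        # Int: interpreted as an absolute number of training instances.
        if not 0 <= train_size <= n_total:
            raise ValueError("An int train_size must not exceed the dataset size.")
        n_train = train_size

    shuffled = list(keys)
    random.Random(seed).shuffle(shuffled)  # assumption: shuffle before splitting
    return shuffled[:n_train], shuffled[n_train:]


keys = list(range(10))
train_prop, test_prop = split_keys(keys, 0.8)  # proportion: 8 train, 2 test
train_abs, test_abs = split_keys(keys, 3)      # count: 3 train, 7 test
```

In both cases every key ends up in exactly one of the two splits, which mirrors the named train and test splits described above.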