Datasets¶
The datasets are files in hdf5 format with data divided in train, test and validation. If this data was preprocessed, information about it is added.
For example.
DIM = 21
SIZE = 100000
X = np.random.rand(SIZE, DIM)
Y = np.asarray([1 if sum(row) > 0 else 0
for row in np.sin(6*X) + 0.1*np.random.randn(SIZE, 1)])
dataset_name = "test"
dataset = DataSetBuilder(
dataset_name,
validator="cross")
dataset.build_dataset(X, Y)
dataset.info()
Results in:
DATASET NAME: test
Author: Alejandro Martínez
Transforms: {}
MD5: daa92472f94c90b5bacb6ab172a73566
Description: test dataset
Dataset Mean Std Shape dType Labels
--------- -------- -------- ----------- ------- --------
train set 0.500195 0.288638 (70000, 21) float64 70000
test set 0.498693 0.288993 (20000, 21) float64 20000
valid set 0.499823 0.288698 (10000, 21) float64 10000
There are two more DataSetBuilder classes, a dataset builder for images, and another for files in csv format.
transforms = Transforms()
transforms.add(poly_features, {"degree": 2, "interaction_only": False})
dataset = DataSetBuilderFile(
name="test_ds_file",
train_folder_path="/home/ds/my_file.csv",
transforms=transforms)
dataset.build_dataset(label_column="target")
In the processing module are predefined functions for transforms, but, you can add your own functions. For more info about it, check Processing.