Welcome to mlPyp’s documentation!

A framework for machine learning pipelines.

Highlights:
  • Build datasets with train, test, and validation data and transformations applied.
  • Build datasets with metadata for reproducible experiments.
  • Easily add different machine learning algorithms from Keras, scikit-learn, etc.
  • Models with scores and predictors.
  • Convert CSV files to datasets.
  • Use transformations to manipulate data (images).
_images/pipeline.png

Installation

git clone https://github.com/elaeon/ML.git

You can install the Python dependencies with pip, but we strongly recommend installing them with conda and conda-forge.

conda config --add channels conda-forge
conda create -n new_environment --file ML/requirements.txt
source activate new_environment
pip install ML/

Quick start

First, build a dataset

from ml.ds import DataSetBuilder
import numpy as np

DIM = 21
SIZE = 100000
X = np.random.rand(SIZE, DIM)
Y = np.asarray([1 if sum(row) > 0 else 0
    for row in np.sin(6*X) + 0.1*np.random.randn(SIZE, 1)])
dataset_name = "test_dataset"
dataset = DataSetBuilder(
    dataset_name,
    validator="cross")
dataset.build_dataset(X, Y)
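
The label construction above loops over rows in Python. As a minimal sketch, the same labels can be computed with vectorized NumPy operations. The seed and the smaller SIZE are assumptions made only to keep this illustration fast and repeatable; they are not part of mlPyp.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, only for repeatability of this sketch
DIM = 21
SIZE = 1000  # smaller than the quick-start value to keep the sketch fast

X = rng.random((SIZE, DIM))

# One noise value per row; broadcasting adds it to every column of that row,
# mirroring 0.1*np.random.randn(SIZE, 1) in the quick start.
noise = 0.1 * rng.standard_normal((SIZE, 1))

# Label is 1 when the row sum of sin(6*X) + noise is positive, else 0.
Y = ((np.sin(6 * X) + noise).sum(axis=1) > 0).astype(int)

print(X.shape, Y.shape)
print(np.bincount(Y, minlength=2))  # number of samples in class 0 and class 1
```

The resulting X and Y have the same shapes and meaning as in the quick start and could be passed to build_dataset in the same way.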

Then pass it to a classification model for training. This example uses SVGPC (a Gaussian process with stochastic variational inference). Once training has finished, you can use the model to predict on new data.

from ml.clf.extended.w_gpy import SVGPC

classif = SVGPC(
    dataset=dataset,
    model_name="my_test_model",
    model_version="1")
classif.train(batch_size=128, num_steps=10)
classif.scores().print_scores(order_column="f1")
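
The scores table above is ordered by the "f1" column. Assuming this column holds the standard binary F1 score (the harmonic mean of precision and recall), the following sketch shows how that value is computed from true and predicted labels. The function f1_score_binary is a hypothetical helper written for illustration, not part of mlPyp.

```python
import numpy as np

def f1_score_binary(y_true, y_pred):
    """Standard binary F1 score: 2*TP / (2*TP + FP + FN). Hypothetical helper."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)    # predicted 1, actually 1
    fp = np.sum(~y_true & y_pred)   # predicted 1, actually 0
    fn = np.sum(y_true & ~y_pred)   # predicted 0, actually 1
    denom = 2 * tp + fp + fn
    # Return 0.0 when there are no positives at all, to avoid dividing by zero.
    return float(2 * tp / denom) if denom else 0.0

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
score = f1_score_binary(y_true, y_pred)
print(round(score, 4))
```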

To make predictions with the trained SVGPC model:

classif = SVGPC(
    model_name="my_test_model",
    model_version="1")
predictions = np.asarray(list(classif.predict(X, chunk_size=258)))
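
The call above wraps predict in list(...), which suggests that predictions arrive as an iterable processed chunk_size rows at a time. The sketch below illustrates that general pattern with plain NumPy. Both predict_in_chunks and threshold_model are hypothetical stand-ins; this is not mlPyp's implementation.

```python
import numpy as np

def predict_in_chunks(predict_fn, data, chunk_size):
    """Yield one prediction per row, calling predict_fn on at most chunk_size rows at a time."""
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        for value in predict_fn(chunk):
            yield value

def threshold_model(chunk):
    """Hypothetical stand-in for a trained model: label 1 when the row sum is positive."""
    return (chunk.sum(axis=1) > 0).astype(int)

rng = np.random.default_rng(0)           # assumed seed, only for repeatability
X_demo = rng.random((1000, 21)) - 0.5    # centered so both labels occur

# The generator is consumed by list(...) and then converted to an array,
# as in the quick-start call.
demo_predictions = np.asarray(list(predict_in_chunks(threshold_model, X_demo, chunk_size=258)))
print(demo_predictions.shape)
```

Processing in chunks keeps only chunk_size rows in flight per model call, which matters when the full input does not fit comfortably in memory.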

More models are available (see Extra models). You can also extend the base model and write your own predictors. For more information, see the Models section.

CLI

mlPyp has a CLI for managing your datasets and models. For example:

ml datasets

Returns a table of previously built datasets.

Total size: 6.75 MB
dataset            size     date
-----------------  -------  --------------------
numbers_tickets    2.27 MB  2017-01-26T22:25 UTC
numbers_tickets_d  4.48 MB  2017-01-25T17:01 UTC

Or

ml models

Returns

classif    model name      version  dataset    group
---------  ------------  ---------  ---------  -------
Boosting   numerai               1  numerai
SVGPC      test2                 1  test2      basic

You can use “--help” to view more options.

Support

If you encounter bugs, let me know.
