SISSOkit package

Submodules

SISSOkit.cross_validation module

SISSOkit.cross_validation.kfold(current_path, target_path, property_name, num_fold)[source]

Generates K-fold cross validation files for SISSO. Before using this function, the source directory must contain at least the two files required by SISSO: SISSO.in and train.dat. All arguments in SISSO.in remain the same in the CV files, except that nsample is updated to match the new train.dat.

This function generates a directory containing num_fold CV directories and a cross_validation_info.dat file recording the CV type (i.e. K-fold) and the shuffle list. Each CV directory contains a new validation.dat holding the data left out from train.dat, and a shuffle.dat listing the sample indices and counts for train.dat and validation.dat; note that the indices refer to the original train.dat, i.e. to the whole data set.

You can run SISSO directly in these CV directories if the original SISSO.in is set correctly (see the usage sketch after the argument list).

Arguments:
current_path (string):

path to the SISSO directory on which you want to do cross validation. It should contain at least the two files required by SISSO: SISSO.in and train.dat.

target_path (string):

path to the newly generated cross validation directory.

property_name (string):

the name of the property you want to predict.

num_fold (int):

K of K-fold cross validation.
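A minimal usage sketch, assuming a finished SISSO setup in a hypothetical directory band_gap/ (the paths and property name are placeholders):

    from SISSOkit import cross_validation as cv

    # band_gap/ must already contain SISSO.in and train.dat.
    # This creates band_gap_cv/ with 10 CV subdirectories plus cross_validation_info.dat.
    cv.kfold(
        current_path='band_gap/',      # existing SISSO directory (hypothetical)
        target_path='band_gap_cv/',    # new cross validation directory (hypothetical)
        property_name='band_gap',      # hypothetical property name
        num_fold=10,                   # 10-fold cross validation
    )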

SISSOkit.cross_validation.leave_out(current_path, target_path, property_name, num_iter, frac=0, num_out=0)[source]

Generates leave-N-out cross validation files for SISSO. Before using this function, the source directory must contain at least the two files required by SISSO: SISSO.in and train.dat. All arguments in SISSO.in remain the same in the CV files, except that nsample is updated to match the new train.dat.

This function generates a directory containing num_iter CV directories and a cross_validation_info.dat file recording the CV type (i.e. leave-out), the shuffle list, and the number of iterations. Each CV directory contains a new validation.dat holding the data left out from train.dat, and a shuffle.dat listing the sample indices and counts for train.dat and validation.dat; note that the indices refer to the original train.dat, i.e. to the whole data set.

You can run SISSO directly in these CV directories if the original SISSO.in is set correctly (see the usage sketch after the argument list).

Arguments:
current_path (string):

path to the SISSO directory on which you want to do cross validation. It should contain at least the two files required by SISSO: SISSO.in and train.dat.

target_path (string):

path to the newly generated cross validation directory.

property_name (string):

the name of the property you want to predict.

num_iter (int):

the number of cross validation iterations, i.e. the number of CV directories generated.

frac (float):

the fraction of samples to leave out. Pass either frac or num_out to this function, not both.

num_out (int):

the number of samples to leave out. Pass either frac or num_out to this function, not both.
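A minimal sketch of leave-N-out generation under the same hypothetical paths as above; pass either frac or num_out, not both:

    from SISSOkit import cross_validation as cv

    # 50 random iterations, leaving 10% of the samples out each time.
    cv.leave_out(
        current_path='band_gap/',        # existing SISSO directory (hypothetical)
        target_path='band_gap_cv_lo/',   # hypothetical target directory
        property_name='band_gap',
        num_iter=50,
        frac=0.1,
    )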

SISSOkit.evaluation module

class SISSOkit.evaluation.Classification(current_path)[source]

Bases: object

Basic class for evaluating the results of classification. You should instantiate it before use.

Arguments:
current_path (string):

path to the directory of the SISSO result.

Its attributes contain all the input arguments in SISSO.in and information about train.dat and SISSO.out.

property descriptors

Returns descriptors.

evaluate_expression(expression, data=None)[source]

Returns the values obtained by evaluating the given expression. If data is None, the expression is evaluated over the data in train.dat.

Arguments:
expression (string):

arithmetic expression. It should be in the same form as the descriptors in SISSO.out, i.e., every operation enclosed in parentheses, e.g. (exp((A+B))/((C*D))^2).

data (pandas.DataFrame):

the data to evaluate, with samples as index and features as columns. The feature names must correspond to the operands in expression.

Returns:

pandas.Series: values computed using expression

features_percent(descending=True)[source]

Computes the percentage of each feature among the top subs_sis 1D descriptors.

get_descriptors(path=None)[source]

Returns descriptors.

class SISSOkit.evaluation.ClassificationCV(current_path, property_name=None, drop_index=[])[source]

Bases: SISSOkit.evaluation.Classification

Basic class for evaluating the cross validation results of classification. Instantiate it to analyze CV results. Indexing an instance selects a specific result from the CV set and returns an instance of Classification.

Arguments:
current_path (string):

path to the directory of the cross validation results.

property_name (string):

specifies the property name of your CV results.

drop_index (list):

specifies which CV results you don’t want to consider.

Note

You should use code in cross_validation.py to generate the CV files, otherwise the format may be wrong.

Its attributes contain all the input arguments in SISSO.in and information about train.dat, validation.dat, and SISSO.out.

descriptor_percent(descriptor)[source]

Returns the percentage with which the given descriptor appears among the cross validation top subs_sis descriptors, together with the indices at which it appears in the descriptor space.

Arguments:
descriptor (string):

the descriptor you want to check.

property descriptors

Returns descriptors. The first index is the cross validation index.

drop(index=[])[source]

Drop some CV results.

Arguments:
index (list):

the CV indices you want to drop.

Returns:

A new instance with the results at the specified CV indices dropped.

features_percent(descending=True)[source]

Returns the percentage of each feature among the top subs_sis descriptors. There are n_cv*subs_sis descriptors in total, and the feature percentage is computed over these descriptors.

find_materials_in_validation(*idxs)[source]

Returns the samples' names and the CV results in which they appear.

Arguments:
idxs (int):

indices of the samples.

Returns:

the samples' names and the CV results in which they appear.

class SISSOkit.evaluation.Regression(current_path)[source]

Bases: object

Basic class for evaluating the results of regression. You should instantiate it before use (see the usage sketch below).

Arguments:
current_path (string):

path to the directory of the SISSO result.

Its attributes contain all the input arguments in SISSO.in and information about train.dat and SISSO.out.
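A minimal sketch of instantiating Regression and reading the documented attributes (the result directory is hypothetical):

    from SISSOkit.evaluation import Regression

    reg = Regression('band_gap/')   # directory of a finished SISSO regression run (hypothetical)
    print(reg.descriptors)          # descriptors found by SISSO
    print(reg.coefficients)         # index order: task, dimension of model, dimension of descriptor
    print(reg.baseline)             # errors of the mean-value baseline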

property baseline

Returns the baseline, i.e., the errors obtained when every sample is predicted with the mean value of the property in train.dat.

check_percentage(dimension, absolute=True)[source]

Checks the percentage of each descriptor.

Arguments:
dimension (int):

dimension of the descriptor.

absolute (bool):

whether to return the absolute percentage. If True, the numerator is the absolute value of each descriptor multiplied by its coefficient, and the denominator is the sum of the intercept and every descriptor multiplied by its coefficient. If False, the numerator is the descriptor multiplied by its coefficient, and the denominator is the property; in this case the result can be larger than 1.

Returns:

percentages, indexed by [dimension, sample].

check_predictions(dimension, multiply_coefficients=True)[source]

Checks predictions of each descriptor.

Arguments:
dimension (int):

dimension of the descriptor.

multiply_coefficients (bool):

whether the descriptors should be multiplied by their coefficients.

Returns:

Predictions of each descriptor.

property coefficients

Returns coefficients. The index order is task, dimension of model, dimension of descriptor.

property descriptors

Returns descriptors.

errors(training=True, display_task=False)[source]

Returns errors. The index order is dimension, sample if display_task is False, or task, sample, dimension if display_task is True.

Arguments:
training (bool):

determines whether the errors are computed on the training data or not.

display_task (bool):

determines whether the errors include the task index or not.

Returns:

Errors.

evaluate_expression(expression, data=None)[source]

Returns the values obtained by evaluating the given expression. If data is None, the expression is evaluated over the data in train.dat.

Arguments:

expression (string):

arithmetic expression. It should be in the same form as the descriptors in SISSO.out, i.e., every operation enclosed in parentheses, e.g. (exp((A+B))/((C*D))^2).

data (pandas.DataFrame):

the data to evaluate, with samples as index and features as columns. The feature names must correspond to the operands in expression.

Returns:

pandas.Series: values computed using expression

features_percent(descending=True)[source]

Computes the percentage of each feature among the top subs_sis 1D descriptors.

get_coefficients(path=None)[source]

Returns coefficients. The index order is task, dimension of model, dimension of descriptor.

get_descriptors(path=None)[source]

Returns descriptors.

get_intercepts(path=None)[source]

Returns intercepts. The index order is task, dimension of model.

property intercepts

Returns intercepts. The index order is task, dimension of model.

predict(data, tasks=None, dimensions=None)[source]

Returns the predictions of data using the models found by SISSO.

Arguments:
data (string or pandas.DataFrame):

the path to the data or the data itself. If it is a string, it is treated as a path and read with data=pd.read_csv(data, sep=r'\s+'), so remember to use whitespace to separate the columns. Otherwise it should be a pandas.DataFrame.

tasks (None or list):

specifies which samples should be computed with which task's model. The format is tasks=[[task_index, [sample_indices]], ...]. For example, [[1, [1,3]], [2, [2,4,5]]] means samples 1 and 3 are computed with the task 1 model, and samples 2, 4 and 5 with the task 2 model. If it is None, all samples are computed with the task 1 model.

dimensions (None or list):

specifies which model dimensions will be used. For example, [2,5] means only the 2D and 5D models are used. If it is None, all samples are computed with models of every dimension.

Returns:

Values computed using models found by SISSO.
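For illustration, a hedged sketch of predict() on new data; the result directory, file name, its columns, and the task assignment below are made up:

    import pandas as pd
    from SISSOkit.evaluation import Regression

    reg = Regression('band_gap/')                              # hypothetical result directory
    new_data = pd.read_csv('new_materials.dat', sep=r'\s+')    # whitespace-separated, like train.dat

    # Task 1 model for all samples, restricted to the 1D and 2D models.
    preds = reg.predict(new_data, dimensions=[1, 2])

    # Explicit task assignment: samples 1 and 3 use task 1, sample 2 uses task 2.
    preds_by_task = reg.predict(new_data, tasks=[[1, [1, 3]], [2, [2]]])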

predictions(training=True, display_task=False)[source]

Returns predictions. The index order is dimension, sample if display_task is False, or task, sample, dimension if display_task is True.

Arguments:
training (bool):

determines whether the predictions are computed on the training data or not.

display_task (bool):

determines whether the predictions include the task index or not.

Returns:

Predictions.

total_errors(training=True, display_task=False, display_baseline=False)[source]

Computes summary error statistics.

Arguments:
training (bool):

determines whether the errors are computed on the training data or not.

display_task (bool):

determines whether the errors include the task index or not.

display_baseline (bool):

determines whether to display the baseline.

Returns:

RMSE, MAE, 25%ile AE, 50%ile AE, 75%ile AE, 95%ile AE, MaxAE of given errors.
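A short sketch comparing the training error statistics of every model dimension against the baseline (the result directory is hypothetical):

    from SISSOkit.evaluation import Regression

    reg = Regression('band_gap/')
    # RMSE, MAE, percentile AEs and MaxAE per model dimension,
    # with the mean-value baseline included for comparison.
    print(reg.total_errors(training=True, display_baseline=True))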

class SISSOkit.evaluation.RegressionCV(current_path, property_name=None, drop_index=[])[source]

Bases: SISSOkit.evaluation.Regression

Basic class for evaluating the cross validation results of regression. Instantiate it to analyze CV results. Indexing an instance selects a specific result from the CV set and returns an instance of Regression (see the usage sketch below).

Arguments:
current_path (string):

path to the directory of the cross validation results.

property_name (string):

specifies the property name of your CV results.

drop_index (list):

specifies which CV results you don’t want to consider.

Note

You should use code in cross_validation.py to generate the CV files, otherwise the format may be wrong.

Its attributes contain all the input arguments in SISSO.in and information about train.dat, validation.dat, and SISSO.out.
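A minimal sketch of working with RegressionCV (paths and property name are hypothetical; zero-based indexing is assumed here):

    from SISSOkit.evaluation import RegressionCV

    cv_results = RegressionCV('band_gap_cv/', property_name='band_gap')

    single_run = cv_results[0]          # indexing selects one CV run and returns a Regression instance
    trimmed = cv_results.drop([3, 7])   # a new RegressionCV without CV runs 3 and 7

    # Validation error statistics aggregated over the remaining CV runs.
    print(trimmed.total_errors(training=False))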

property baseline

Returns the baseline, i.e., the errors obtained when every sample is predicted with the mean value of the property in train.dat.

check_percentage(cv_idx, dimension, absolute=True)[source]

Checks the percentage of each descriptor.

Arguments:
cv_idx (integer):

specifies which CV result you want to check.

dimension (int):

dimension of the descriptor.

absolute (bool):

whether to return the absolute percentage. If True, the numerator is the absolute value of each descriptor multiplied by its coefficient, and the denominator is the sum of the intercept and every descriptor multiplied by its coefficient. If False, the numerator is the descriptor multiplied by its coefficient, and the denominator is the property; in this case the result can be larger than 1.

Returns:

percentages, indexed by [dimension, sample].

check_predictions(cv_idx, dimension, training=True, multiply_coefficients=True)[source]

Checks predictions of each descriptor.

Arguments:
cv_idx (integer):

specifies which CV result you want to check.

dimension (int):

dimension of the descriptor.

training (bool):

whether the training data is used or not.

multiply_coefficients (bool):

whether the descriptors should be multiplied by their coefficients.

Returns:

Predictions of each descriptor.

property coefficients

Returns coefficients. The index order is CV index, task, dimension of model, dimension of descriptor.

descriptor_percent(descriptor)[source]

Returns the percentage with which the given descriptor appears among the cross validation top subs_sis descriptors, together with the indices at which it appears in the descriptor space.

Arguments:
descriptor (string):

the descriptor you want to check.

property descriptors

Returns descriptors. The first index is the cross validation index.

drop(index=[])[source]

Drop some CV results.

Arguments:
index (list):

the CV indices you want to drop.

Returns:

An instance of RegressionCV with the results at the specified CV indices dropped.

errors(training=True, display_cv=False, display_task=False)[source]

Returns errors. The index order depends on the flags:

  • dimension, sample if display_cv is False and display_task is False;

  • task, sample, dimension if display_cv is False and display_task is True;

  • CV index, sample, dimension if display_cv is True and display_task is False;

  • CV index, task, sample, dimension if display_cv is True and display_task is True.

Arguments:
training (bool):

determines whether the errors are computed on the training data or not.

display_cv (bool):

determines whether the errors include the CV index or not.

display_task (bool):

determines whether the errors include the task index or not.

Returns:

Errors.

features_percent(descending=True)[source]

Returns the percentage of each feature among the top subs_sis descriptors. There are n_cv*subs_sis descriptors in total, and the feature percentage is computed over these descriptors.

find_materials_in_validation(*idxs)[source]

Returns the samples' names and the CV results in which they appear.

Arguments:
idxs (int):

indices of the samples.

Returns:

the samples' names and the CV results in which they appear.

find_max_error()[source]

Returns the names of the samples that contribute to the MaxAE, and the CV results in which they appear.

property intercepts

Returns intercepts. The index order is CV index, task, dimension of model.

predict(data, cv_index=None, tasks=None, dimensions=None)[source]

Returns the predictions of data using the models found by SISSO.

Arguments:
data (string or pandas.DataFrame):

the path to the data or the data itself. If it is a string, it is treated as a path and read with data=pd.read_csv(data, sep=r'\s+'), so remember to use whitespace to separate the columns. Otherwise it should be a pandas.DataFrame.

cv_index (None or list):

specifies which CV results should be included. For example, [1,5] means CV 1 and CV 5 are included. If it is None, all CV results are computed.

tasks (None or list):

specifies which samples should be computed with which task's model. The format is tasks=[[task_index, [sample_indices]], ...]. For example, [[1, [1,3]], [2, [2,4,5]]] means samples 1 and 3 are computed with the task 1 model, and samples 2, 4 and 5 with the task 2 model. If it is None, all samples are computed with the task 1 model.

dimensions (None or list):

specifies which model dimensions will be used. For example, [2,5] means only the 2D and 5D models are used. If it is None, all samples are computed with models of every dimension.

Returns:

Values computed using models found by SISSO.

predictions(training=True, display_cv=False, display_task=False)[source]

Returns predictions. The index order depends on the flags:

  • dimension, sample if display_cv is False and display_task is False;

  • task, sample, dimension if display_cv is False and display_task is True;

  • CV index, sample, dimension if display_cv is True and display_task is False;

  • CV index, task, sample, dimension if display_cv is True and display_task is True.

Arguments:
training (bool):

determines whether the predictions are computed on the training data or not.

display_cv (bool):

determines whether the predictions include the CV index or not.

display_task (bool):

determines whether the predictions include the task index or not.

Returns:

Predictions.

total_errors(training=True, display_cv=False, display_task=False, display_baseline=False)[source]

Computes summary error statistics.

Arguments:
training (bool):

determines whether the errors are computed on the training data or not.

display_cv (bool):

determines whether the errors include the CV index or not.

display_task (bool):

determines whether the errors include the task index or not.

display_baseline (bool):

determines whether to display the baseline.

Returns:

RMSE, MAE, 25%ile AE, 50%ile AE, 75%ile AE, 95%ile AE, MaxAE of given errors.

SISSOkit.evaluation.compute_errors(errors)[source]

Computes summary error statistics from the given errors.

Arguments:
errors:

differences between predictions and the exact values.

Returns:

RMSE, MAE, 25%ile AE, 50%ile AE, 75%ile AE, 95%ile AE, MaxAE of given errors.

SISSOkit.evaluation.compute_using_model_reg(path=None, result=None, training=True, data=None, task_idx=None, dimension_idx=None)[source]

Uses the SISSO model with the given task_idx and dimension_idx to predict the property of data. This function is somewhat hard to use directly; see also Regression.predict().

Arguments:
path (string):

directory path of SISSO result which contains the model you want to use.

result (Regression):

instance of Regression which contains the model you want to use.

training (bool):

whether the task is training or predicting. Default is True.

data (string or pandas.DataFrame):

the path to the data or the data itself. If it is a string, it is treated as a path and read with data=pd.read_csv(data, sep=r'\s+'), so remember to use whitespace to separate the columns. Otherwise it should be a pandas.DataFrame.

task_idx (integer):

specifies which task of model you want to use.

dimension_idx (integer):

specifies which dimension of model you want to use.

Returns:

Values computed using the given model. The index order is [task, dimension, sample]; it may not include all of these indices, depending on your input.

Note

  • Specify only one of path or result to determine which model you want to use.

  • You only need to pass values to task_idx and dimension_idx when you pass your own data.

  • If you don't pass data, you get predictions for train.dat if training is True, or for validation.dat if training is False. In this case you don't need to pass task_idx and dimension_idx, because the task each sample uses is already specified, and results for all dimensions are returned.

  • If you pass data, you must specify task_idx, which means all samples are computed with the same task's model. If you don't pass dimension_idx, values computed by models of every dimension are returned; otherwise only the values computed with the dimension_idx model are returned.
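A hedged sketch of the two ways of calling this function described in the notes above (paths and file names are hypothetical):

    import pandas as pd
    from SISSOkit.evaluation import Regression, compute_using_model_reg

    # 1) No data passed: predictions for train.dat of an existing result (training=True).
    train_preds = compute_using_model_reg(path='band_gap/', training=True)

    # 2) Own data passed: task_idx is required; dimension_idx optionally restricts the model used.
    reg = Regression('band_gap/')
    new_data = pd.read_csv('new_materials.dat', sep=r'\s+')
    preds = compute_using_model_reg(result=reg, data=new_data, task_idx=1, dimension_idx=2)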

SISSOkit.evaluation.evaluate_expression(expression, data)[source]

Returns the values obtained by evaluating expression over data.

Arguments:
expression (string):

arithmetic expression. It should be in the same form as the descriptors in SISSO.out, i.e., every operation enclosed in parentheses, e.g. (exp((A+B))/((C*D))^2).

data (pandas.DataFrame):

the data to evaluate, with samples as index and features as columns. The feature names must correspond to the operands in expression.

Returns:

pandas.Series: values computed using expression
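A small worked example; the feature names A, B, C, D and their values are made up, and the expression is the one used as an example above:

    import pandas as pd
    from SISSOkit.evaluation import evaluate_expression

    data = pd.DataFrame(
        {'A': [0.1, 0.2], 'B': [1.0, 1.5], 'C': [2.0, 2.5], 'D': [0.5, 0.4]},
        index=['sample1', 'sample2'],
    )
    # Expression in SISSO.out style: every operation enclosed in parentheses.
    values = evaluate_expression('(exp((A+B))/((C*D))^2)', data)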

SISSOkit.evaluation.predict_reg(data, descriptors, coefficients, intercepts, tasks=None, dimensions=None)[source]

Returns the predictions.

Arguments:
data (string or pandas.DataFrame):

the path to the data or the data itself. If it is a string, it is treated as a path and read with data=pd.read_csv(data, sep=r'\s+'), so remember to use whitespace to separate the columns. Otherwise it should be a pandas.DataFrame.

descriptors (list):

the descriptors you want to use. The index order should be dimension of model, then dimension of descriptor.

coefficients (list):

the coefficients you want to use. The index order should be task, dimension of model, then dimension of descriptor.

intercepts (float or list):

the intercepts you want to use. The index order should be task, then dimension of model.

tasks (None or list):

specifies which samples should be computed with which task's model. The format is tasks=[[task_index, [sample_indices]], ...]. For example, [[1, [1,3]], [2, [2,4,5]]] means samples 1 and 3 are computed with the task 1 model, and samples 2, 4 and 5 with the task 2 model. If it is None, all samples are computed with the task 1 model.

dimensions (None or list):

specifies which model dimensions will be used. For example, [2,5] means only the 2D and 5D models are used. If it is None, all samples are computed with models of every dimension.

Returns:

Predictions using passed models.

Note

  • intercepts should correspond to the other arguments.

  • If you want to use just one model, descriptors should be a list of strings, coefficients a list of floats, and intercepts a float. You don't need to set tasks and dimensions in this case.

  • If you want to use several models, coefficients should be a nested list: the outer index is the task, the next index is the dimension of the model, and the innermost index is the dimension of the descriptor, holding the individual coefficients. So the index order of coefficients is task, dimension of model, then dimension of descriptor. The index order of descriptors is dimension of model, then dimension of descriptor. The index order of intercepts is task, then dimension of model. See the sketch below for an illustration of this nesting.
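To illustrate the nesting described in the notes, a hedged sketch for one task and two model dimensions; the data, descriptor strings, coefficients and intercepts are invented:

    import pandas as pd
    from SISSOkit.evaluation import predict_reg

    data = pd.DataFrame({'A': [0.1, 0.2], 'B': [1.0, 1.5]}, index=['s1', 's2'])

    # descriptors: [dimension of model][dimension of descriptor]
    descriptors = [['(A+B)'],              # 1D model: one descriptor
                   ['(A+B)', '(A*B)']]     # 2D model: two descriptors
    # coefficients: [task][dimension of model][dimension of descriptor]
    coefficients = [[[0.5], [0.3, -0.2]]]  # a single task
    # intercepts: [task][dimension of model]
    intercepts = [[1.0, 0.8]]

    preds = predict_reg(data, descriptors, coefficients, intercepts)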

SISSOkit.notebook module

SISSOkit.notebook.generate_report(path, file_path, notebook_name, file_name=None)[source]

Generates Jupyter notebook reports.

Arguments:
path (list):

paths to SISSO results. If there is only one result over the whole data set, it should be a list containing one item. If there are also cross validation results, it should be [path to result over the whole data set, path to cross validation results].

file_path (string):

path to the newly generated Jupyter notebook.

notebook_name (int or string):

notebook index or notebook name. The available templates are:

index  name
0      regression
1      regression with CV

file_name (None or string):

the name of the newly generated Jupyter notebook. If it is None, the file name is the same as the notebook template name.
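A minimal sketch, assuming a regression result plus cross validation results under hypothetical paths:

    from SISSOkit.notebook import generate_report

    generate_report(
        path=['band_gap/', 'band_gap_cv/'],   # [whole data set result, CV results] (hypothetical)
        file_path='reports/',                 # where the notebook is written (hypothetical)
        notebook_name=1,                      # 1 = regression with CV (see the table above)
        file_name='band_gap_report',          # optional custom notebook name
    )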

SISSOkit.plot module

SISSOkit.plot.abs_errors_vs_dimension(*regressions, training=True, unit=None, fontsize=20, selected_errors=None, display_baseline=False, label='', **kw)[source]

Plots the histogram of absolute errors with box plot for errors.

Arguments:
regressions (evaluation.Regression or evaluation.RegressionCV):

regression result.

training (bool):

training errors or prediction errors.

unit (string):

unit of property.

fontsize (int):

fontsize of axis name.

selected_errors (None or list):

which errors should appear in the plot. The available errors are 'RMSE', 'MAE', '25%ile AE', '50%ile AE', '75%ile AE', '95%ile AE', 'MaxAE'. If it is None, all errors appear in the plot.

display_baseline (bool):

whether to plot the baseline.

kw:

keyword arguments of the histogram. It's the same as hist() in matplotlib.
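A hedged plotting sketch (the CV result directory, property name, and unit are hypothetical):

    import matplotlib.pyplot as plt
    from SISSOkit import plot as sk_plot
    from SISSOkit.evaluation import RegressionCV

    cv_results = RegressionCV('band_gap_cv/', property_name='band_gap')
    # Prediction (validation) errors versus model dimension, with the baseline for reference.
    sk_plot.abs_errors_vs_dimension(cv_results, training=False, unit='eV',
                                    selected_errors=['RMSE', 'MAE', 'MaxAE'],
                                    display_baseline=True)
    plt.show()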

SISSOkit.plot.baselineplot(regression, unit=None, marker='x', marker_color='r', marker_y=1, marker_shape=100, fontsize=20, **kw)[source]

Plots the histogram of the baseline of the regression.

Arguments:
regression (evaluation.Regression):

regression result.

unit (string):

unit of property.

marker (char):

marker type of mean value. It’s the same as in matplotlib.

marker_color (char):

marker color of mean value. It’s the same as in matplotlib.

marker_shape (int):

marker shape of mean value. It’s the same as in matplotlib.

marker_y (float):

vertical coordinate of marker of mean value.

fontsize (int):

fontsize of axis name.

kw:

keyword arguments of histogram. It’s the same as hist() in matplotlib.

SISSOkit.plot.boxplot(regression, training=True, unit=None, fontsize=20, **kwargs)[source]

Plots the box plot of the regression.

Arguments:
regression (evaluation.Regression or evaluation.RegressionCV):

regression result.

training (bool):

training errors or prediction errors.

unit (string):

unit of property.

fontsize (int):

fontsize of axis name.

kwargs:

keyword arguments of the histogram. It's the same as hist() in matplotlib.

SISSOkit.plot.error_hist(dimension, *regressions, training=True, absolute=False, unit=None, fontsize=20, **kw)[source]

Plots the histogram of the regression errors.

Arguments:
dimension (int):

dimension of descriptor.

regressions (evaluation.Regression or evaluation.RegressionCV):

regression result.

training (bool):

training errors or prediction errors.

absolute (bool):

absolute errors or not.

unit (string):

unit of property.

fontsize (int):

fontsize of axis name.

kw:

keyword arguments of the histogram. It's the same as hist() in matplotlib.

SISSOkit.plot.errors_details(regression, training=True)[source]

Plots detailed information about the regression, including the histogram of signed errors, prediction vs. property, and the histogram of absolute errors with markers.

Arguments:
regression (evaluation.Regression or evaluation.RegressionCV):

regression result.

training (bool):

training errors or prediction errors.

SISSOkit.plot.hist_with_markers(dimension, *regressions, training=True, unit=None, fontsize=20, selected_errors=None, marker_x=0, marker=None, **kw)[source]

Plots the histogram of absolute errors with markers.

Arguments:
dimension (int):

dimension of descriptor.

regressions (evaluation.Regression or evaluation.RegressionCV):

regression result.

training (bool):

training errors or prediction errors.

unit (string):

unit of property.

fontsize (int):

fontsize of axis name.

selected_errors (None or list):

which errors should be pinpointed in the plot. The available errors are 'RMSE', 'MAE', '25%ile AE', '50%ile AE', '75%ile AE', '95%ile AE', 'MaxAE'. If it is None, all errors appear in the plot.

marker_x (float):

horizontal coordinate of the marker.

marker (None or list):

marker type. It's the same as in matplotlib. If it is None, the default types are used.

kw:

keyword arguments of the histogram. It's the same as hist() in matplotlib.

SISSOkit.plot.prediction_vs_property(dimension, *regressions, training=True, unit=None, fontsize=20, **kw)[source]

Plots the scatter plot of prediction vs. property.

Arguments:
dimension (int):

dimension of descriptor.

regressions (evaluation.Regression or evaluation.RegressionCV):

regression result.

training (bool):

training errors or prediction errors.

unit (string):

unit of property.

fontsize (int):

fontsize of axis name.

kw:

keyword arguments of the histogram. It's the same as hist() in matplotlib.

SISSOkit.utils module

SISSOkit.utils.descriptors_to_markdown(expression)[source]

Returns the markdown form of the expression.

class SISSOkit.utils.lazyproperty(func)[source]

Bases: object

Lazy property

SISSOkit.utils.models_to_markdown(regression, task, dimension, indent='')[source]

Returns the markdown form of models.

SISSOkit.utils.scientific_notation_to_markdown(value)[source]

Returns the markdown form of value in scientific notation.

SISSOkit.utils.seperate_DataFrame(dataframe, n_list)[source]

Returns the separated DataFrame.

SISSOkit.utils.seperate_list(original_list, n_list)[source]

Returns the separated list.

SISSOkit.utils.start_and_number(n_list)[source]

Returns the start index and number in each turn.

Module contents