SISSOkit package
Submodules
SISSOkit.cross_validation module
-
SISSOkit.cross_validation.
kfold
(current_path, target_path, property_name, num_fold)
Generates K-fold cross validation files for SISSO. Before you use this function, current_path must contain at least the two files SISSO requires: SISSO.in and train.dat. All arguments in SISSO.in stay the same in the CV files, except that nsample is changed to the correct number of samples in train.dat.
This function generates a directory containing num_fold CV directories and a file cross_validation_info.dat, which records the CV type (i.e. K-fold) and the shuffle list. Each CV directory contains a new validation.dat holding the data left out from train.dat, and a shuffle.dat holding the sample indices and sample counts of train.dat and validation.dat. Note that each index is the index in the original train.dat, i.e. the index in the whole data set.
You can run SISSO directly on these CV files if the original SISSO.in is set correctly.
- Arguments:
- current_path (string):
path to the SISSO directory on which you want to do cross validation. It must contain at least the two files SISSO requires: SISSO.in and train.dat.
- target_path (string):
path to newly generated cross validation directory.
- property_name (string):
the name of the property you want to predict.
- num_fold (int):
K of K-fold cross validation.
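The partitioning idea behind K-fold cross validation can be sketched as follows. This is a minimal stdlib illustration, not SISSOkit's implementation: the helper name kfold_indices, the seed argument and the 0-based indices are assumptions made for the example, and no files are written.

```python
import random


def kfold_indices(n_samples, num_fold, seed=0):
    """Split sample indices 0..n_samples-1 into num_fold shuffled folds.

    Hypothetical helper for illustration only; it returns the shuffle list
    and one (train, validation) index pair per fold.
    """
    if not 2 <= num_fold <= n_samples:
        raise ValueError("num_fold must be between 2 and n_samples")
    shuffle_list = list(range(n_samples))
    random.Random(seed).shuffle(shuffle_list)
    # Fold i takes every num_fold-th entry of the shuffled list, so the
    # fold sizes differ by at most one sample.
    folds = [shuffle_list[i::num_fold] for i in range(num_fold)]
    splits = []
    for validation in folds:
        held_out = set(validation)
        train = [i for i in shuffle_list if i not in held_out]
        # Indices always refer to the original, unsplit data set.
        splits.append((sorted(train), sorted(validation)))
    return shuffle_list, splits


shuffle_list, splits = kfold_indices(n_samples=10, num_fold=3)
for cv_idx, (train, validation) in enumerate(splits):
    print(cv_idx, len(train), len(validation))
```

Every sample lands in exactly one validation fold, which is what makes the per-fold validation errors comparable to an error over the whole data set.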
-
SISSOkit.cross_validation.
leave_out
(current_path, target_path, property_name, num_iter, frac=0, num_out=0)
Generates leave-N-out cross validation files for SISSO. Before you use this function, current_path must contain at least the two files SISSO requires: SISSO.in and train.dat. All arguments in SISSO.in stay the same in the CV files, except that nsample is changed to the correct number of samples in train.dat.
This function generates a directory containing num_iter CV directories and a file cross_validation_info.dat, which records the CV type (i.e. leave-out), the shuffle list and the number of iterations. Each CV directory contains a new validation.dat holding the data left out from train.dat, and a shuffle.dat holding the sample indices and sample counts of train.dat and validation.dat. Note that each index is the index in the original train.dat, i.e. the index in the whole data set.
You can run SISSO directly on these CV files if the original SISSO.in is set correctly.
- Arguments:
- current_path (string):
path to the SISSO directory on which you want to do cross validation. It must contain at least the two files SISSO requires: SISSO.in and train.dat.
- target_path (string):
path to newly generated cross validation directory.
- property_name (string):
the name of the property you want to predict.
- num_iter (int):
the number of cross validation files.
- frac (float):
the percentage of samples to leave out. Pass only one of frac or num_out to this function.
- num_out (int):
the number of samples to leave out. Pass only one of frac or num_out to this function.
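The relationship between frac, num_out and num_iter can be sketched as below. This is an illustration under stated assumptions, not the library's code: the helper leave_out_indices is hypothetical, frac is treated as a fraction between 0 and 1, the rounding rule is a guess, and each iteration draws its left-out samples independently.

```python
import random


def leave_out_indices(n_samples, num_iter, frac=0, num_out=0, seed=0):
    """Draw num_iter random leave-N-out splits (hypothetical helper)."""
    # Exactly one of frac / num_out may be given, mirroring the rule above.
    if bool(frac) == bool(num_out):
        raise ValueError("pass exactly one of frac or num_out")
    if frac:
        # Assumption: frac is a fraction in (0, 1), rounded to whole samples.
        num_out = round(frac * n_samples)
    if not 0 < num_out < n_samples:
        raise ValueError("number of left-out samples must be in (0, n_samples)")
    rng = random.Random(seed)
    splits = []
    for _ in range(num_iter):
        validation = sorted(rng.sample(range(n_samples), num_out))
        held_out = set(validation)
        train = [i for i in range(n_samples) if i not in held_out]
        splits.append((train, validation))
    return splits


splits = leave_out_indices(n_samples=10, num_iter=4, frac=0.2)
for train, validation in splits:
    print(len(train), validation)
```

Unlike K-fold, a sample may be left out in several iterations or in none, so the number of iterations is a free parameter.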
SISSOkit.evaluation module
-
class
SISSOkit.evaluation.
Classification
(current_path)[source]¶ Bases:
object
Basic class for evaluating the results of classification. You should instantiate it first.
- Arguments:
- current_path (string):
path to the directory of the SISSO result.
Its attributes contain all the input arguments in SISSO.in, and information about train.dat and SISSO.out.
-
property
descriptors
¶ Returns descriptors.
-
evaluate_expression
(expression, data=None)
Returns the values computed by evaluating expression over data (by default, the data in train.dat).
- Arguments:
- expression (string):
arithmetic expression. It should be in the same form as the descriptors in the file SISSO.out, i.e. every operation should be enclosed in parentheses, like (exp((A+B))/((C*D))^2).
- data (pandas.DataFrame):
the data you want to compute, with samples as index and features as columns. The feature names should correspond to the operands in expression.
- Returns:
pandas.Series: values computed using
expression
-
class
SISSOkit.evaluation.
ClassificationCV
(current_path, property_name=None, drop_index=[])[source]¶ Bases:
SISSOkit.evaluation.Classification
Basic class for evaluating the cross validation results of classification. You should instantiate it first if you want to analyze CV results. You can use an index to select a specific result from the CV results, which returns an instance of Classification.
- Arguments:
- current_path (string):
path to the directory of the cross validation results.
- property_name (string):
specifies the property name of your CV results.
- drop_index (list):
specifies which CV results you don’t want to consider.
Note
You should use the code in cross_validation.py to generate the CV files, otherwise the format may be wrong.
Its attributes contain all the input arguments in SISSO.in, and information about train.dat, validation.dat, and SISSO.out.
-
descriptor_percent
(descriptor)
Returns the percentage of appearances of the given descriptor in the cross validation top subs_sis descriptors, together with the indices at which it appears in the descriptor space.
- Arguments:
- descriptor (string):
the descriptor you want to check.
-
property
descriptors
¶ Returns descriptors. The first index is cross validation index.
-
drop
(index=[])[source]¶ Drop some CV results.
- Arguments:
- index (list):
contains the CV indices you want to drop.
- Returns:
An instance of RegressionCV with the results at the given CV indices dropped.
-
class
SISSOkit.evaluation.
Regression
(current_path)[source]¶ Bases:
object
Basic class for evaluating the results of regression. You should instantiate it first.
- Arguments:
- current_path (string):
path to the directory of the SISSO result.
Its attributes contain all the input arguments in SISSO.in, and information about train.dat and SISSO.out.
-
property
baseline
Returns the baseline, i.e. the errors obtained by predicting the property of every sample with the mean value of the property in train.dat.
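The baseline idea can be illustrated with a few lines of plain Python. This is a sketch, not the property's actual implementation: the function name and the sample values are made up, and the sign convention (prediction minus exact value) follows the description of compute_errors on this page.

```python
import math


def baseline_errors(property_values):
    """Errors of a model that always predicts the mean of the property."""
    mean = sum(property_values) / len(property_values)
    # error = prediction - exact value, with the mean as the prediction
    return [mean - value for value in property_values]


errors = baseline_errors([1.0, 2.0, 3.0, 6.0])
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
print(errors, rmse)
```

The RMSE of this baseline equals the population standard deviation of the property, so a useful model should have a clearly lower RMSE.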
-
check_predictions
(dimension, multiply_coefficients=True)[source]¶ Check predictions of each descriptor.
- Arguments:
- dimension (int):
dimension of the descriptor.
- multiply_coefficients (bool):
whether it should be multiplied by coefficients.
- Returns:
Predictions of each descriptor.
-
property
coefficients
¶ Returns coefficients. The index order is task, dimension of model, dimension descriptor.
-
property
descriptors
¶ Returns descriptors.
-
errors
(training=True, display_task=False)
Returns errors. The index order is dimension, sample if display_task is False, or task, sample, dimension if display_task is True.
- Arguments:
- training (bool):
determines whether it is training or not.
- display_task (bool):
determines whether the errors contain the task index.
- Returns:
Errors.
-
evaluate_expression
(expression, data=None)
Returns the values computed by evaluating expression over data (by default, the data in train.dat).
- Arguments:
- expression (string):
arithmetic expression. It should be in the same form as the descriptors in the file SISSO.out, i.e. every operation should be enclosed in parentheses, like (exp((A+B))/((C*D))^2).
- data (pandas.DataFrame):
the data you want to compute, with samples as index and features as columns. The feature names should correspond to the operands in expression.
- Returns:
pandas.Series: values computed using
expression
-
features_percent
(descending=True)[source]¶ Computes the percentages of each feature in top subs_sis 1D descriptors.
-
get_coefficients
(path=None)[source]¶ Returns coefficients. The index order is task, dimension of model, dimension descriptor.
-
property
intercepts
¶ Returns intercepts. The index order is task, dimension of model.
-
predict
(data, tasks=None, dimensions=None)
Returns the predictions of data using the models found by SISSO.
- Arguments:
- data (string or pandas.DataFrame):
the path of the data or the data you want to compute. If it is a string, it is treated as the path to the data, which is read with data=pd.read_csv(data,sep=r'\s+'), so remember to separate the values with whitespace. Otherwise it should be a pandas.DataFrame.
- tasks (None or list):
specifies which task model each sample should use. It should be tasks=[ [task_index, [sample_indices] ] ]. For example, [ [1, [1,3]], [2, [2,4,5]] ] means samples 1 and 3 are computed using the task 1 model, and samples 2, 4 and 5 using the task 2 model. If it is None, all samples are computed with the task 1 model.
- dimensions (None or list):
specifies which dimensions of descriptor will be used. For example, [2,5] means only the 2D and 5D models are used. If it is None, all samples are computed with the models of all dimensions.
- Returns:
Values computed using models found by SISSO.
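The nested tasks format is easier to read once it is expanded into a per-sample assignment. The helper below is hypothetical and only illustrates the format described above; it is not part of SISSOkit.

```python
def expand_tasks(tasks, sample_indices):
    """Map each sample index to the task whose model it should use.

    tasks has the form [[task_index, [sample_indices]], ...]; when it is
    None, every sample falls back to task 1, as described above.
    """
    if tasks is None:
        return {sample: 1 for sample in sample_indices}
    assignment = {}
    for task_index, samples in tasks:
        for sample in samples:
            if sample in assignment:
                raise ValueError(f"sample {sample} is assigned to two tasks")
            assignment[sample] = task_index
    return assignment


assignment = expand_tasks([[1, [1, 3]], [2, [2, 4, 5]]], range(1, 6))
print(assignment)
```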
-
predictions
(training=True, display_task=False)
Returns predictions. The index order is dimension, sample if display_task is False, or task, sample, dimension if display_task is True.
- Arguments:
- training (bool):
determines whether it is training or not.
- display_task (bool):
determines whether the predictions contain the task index.
- Returns:
Predictions.
-
total_errors
(training=True, display_task=False, display_baseline=False)[source]¶ Compute errors.
- Arguments:
- training (bool):
determines whether it is training or not.
- display_task (bool):
determines whether the errors contain the task index.
- display_baseline (bool):
determines whether to display the baseline.
- Returns:
RMSE, MAE, 25%ile AE, 50%ile AE, 75%ile AE, 95%ile AE, MaxAE of given errors.
-
class
SISSOkit.evaluation.
RegressionCV
(current_path, property_name=None, drop_index=[])[source]¶ Bases:
SISSOkit.evaluation.Regression
Basic class for evaluating the cross validation results of regression. You should instantiate it first if you want to analyze CV results. You can use an index to select a specific result from the CV results, which returns an instance of Regression.
- Arguments:
- current_path (string):
path to the directory of the cross validation results.
- property_name (string):
specifies the property name of your CV results.
- drop_index (list):
specifies which CV results you don’t want to consider.
Note
You should use the code in cross_validation.py to generate the CV files, otherwise the format may be wrong.
Its attributes contain all the input arguments in SISSO.in, and information about train.dat, validation.dat, and SISSO.out.
-
property
baseline
Returns the baseline, i.e. the errors obtained by predicting the property of every sample with the mean value of the property in train.dat.
-
check_predictions
(cv_idx, dimension, training=True, multiply_coefficients=True)[source]¶ Check predictions of each descriptor.
- Arguments:
- cv_idx (integer):
specifies which CV result you want to check.
- dimension (int):
dimension of the descriptor.
- training (bool):
whether it is training.
- multiply_coefficients (bool):
whether it should be multiplied by coefficients.
- Returns:
Predictions of each descriptor.
-
property
coefficients
¶ Returns coefficients. The index order is CV index, task, dimension of model, dimension descriptor.
-
descriptor_percent
(descriptor)
Returns the percentage of appearances of the given descriptor in the cross validation top subs_sis descriptors, together with the indices at which it appears in the descriptor space.
- Arguments:
- descriptor (string):
the descriptor you want to check.
-
property
descriptors
¶ Returns descriptors. The first index is cross validation index.
-
drop
(index=[])[source]¶ Drop some CV results.
- Arguments:
- index (list):
contains the CV indices you want to drop.
- Returns:
An instance of RegressionCV with the results at the given CV indices dropped.
-
errors
(training=True, display_cv=False, display_task=False)
Returns errors. The index order is dimension, sample if display_cv is False and display_task is False; task, sample, dimension if display_cv is False and display_task is True; CV index, sample, dimension if display_cv is True and display_task is False; or CV index, task, sample, dimension if display_cv is True and display_task is True.
- Arguments:
- training (bool):
determines whether it is training or not.
- display_task (bool):
determines whether the errors contain the task index.
- Returns:
Errors.
-
features_percent
(descending=True)
Returns the percentage of each feature in the top subs_sis descriptors. There are n_cv*subs_sis descriptors in total, and the feature percentage is computed over all of them.
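One way to read this percentage is as the share of descriptors in which a feature occurs as an operand. The sketch below follows that reading and is only an assumption about the counting rule: the function name, the regular expression used to pick out operand names, and the return format (fractions rather than values out of 100) are all illustrative.

```python
import re
from collections import Counter


def features_percent_sketch(descriptors, features, descending=True):
    """Fraction of descriptors that contain each feature as an operand."""
    counts = Counter()
    for descriptor in descriptors:
        # Identifiers in the expression; function names such as exp are
        # ignored later because they are not in the feature list.
        tokens = set(re.findall(r"[A-Za-z_]\w*", descriptor))
        for feature in features:
            if feature in tokens:
                counts[feature] += 1
    total = len(descriptors)
    percents = {feature: counts[feature] / total for feature in features}
    return sorted(percents.items(), key=lambda item: item[1], reverse=descending)


# In the CV case the list would pool the top descriptors of all CV results.
result = features_percent_sketch(
    ["(A+B)", "exp(A)", "(B*C)", "(A/C)"], ["A", "B", "C", "D"]
)
print(result)
```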
-
find_materials_in_validation
(*idxs)
Returns the names of the samples and the CV results in which they appear.
- Arguments:
- idxs (integers):
indices of the samples.
- Returns:
the names of the samples and the CV results in which they appear.
-
find_max_error
()
Returns the names of the samples that contribute to the MaxAE, and the CV results in which they appear.
-
property
intercepts
¶ Returns intercepts. The index order is CV index, task, dimension of model.
-
predict
(data, cv_index=None, tasks=None, dimensions=None)
Returns the predictions of data using the models found by SISSO.
- Arguments:
- data (string or pandas.DataFrame):
the path of the data or the data you want to compute. If it is a string, it is treated as the path to the data, which is read with data=pd.read_csv(data,sep=r'\s+'), so remember to separate the values with whitespace. Otherwise it should be a pandas.DataFrame.
- cv_index (None or list):
specifies which CV results should be included. For example, [1,5] means CV1 and CV5 will be included. If it is None, all CV results are computed.
- tasks (None or list):
specifies which task model each sample should use. It should be tasks=[ [task_index, [sample_indices] ] ]. For example, [ [1, [1,3]], [2, [2,4,5]] ] means samples 1 and 3 are computed using the task 1 model, and samples 2, 4 and 5 using the task 2 model. If it is None, all samples are computed with the task 1 model.
- dimensions (None or list):
specifies which dimensions of descriptor will be used. For example, [2,5] means only the 2D and 5D models are used. If it is None, all samples are computed with the models of all dimensions.
- Returns:
Values computed using models found by SISSO.
-
predictions
(training=True, display_cv=False, display_task=False)
Returns predictions. The index order is dimension, sample if display_cv is False and display_task is False; task, sample, dimension if display_cv is False and display_task is True; CV index, sample, dimension if display_cv is True and display_task is False; or CV index, task, sample, dimension if display_cv is True and display_task is True.
- Arguments:
- training (bool):
determines whether it is training or not.
- display_task (bool):
determines whether the predictions contain the task index.
- Returns:
Predictions.
-
total_errors
(training=True, display_cv=False, display_task=False, display_baseline=False)[source]¶ Compute errors.
- Arguments:
- training (bool):
determines whether it is training or not.
- display_cv (bool):
determines whether the errors contain the CV index.
- display_task (bool):
determines whether the errors contain the task index.
- display_baseline (bool):
determines whether to display the baseline.
- Returns:
RMSE, MAE, 25%ile AE, 50%ile AE, 75%ile AE, 95%ile AE, MaxAE of given errors.
-
SISSOkit.evaluation.
compute_errors
(errors)[source]¶ Compute errors.
- Arguments:
- errors:
difference between predictions and exact value.
- Returns:
RMSE, MAE, 25%ile AE, 50%ile AE, 75%ile AE, 95%ile AE, MaxAE of given errors.
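The listed statistics can be computed from a sequence of signed errors as sketched below. This is a stdlib illustration rather than the function's actual code; in particular, the percentile rule (linear interpolation between ranks) and the dictionary return type are assumptions.

```python
import math


def percentile(sorted_values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    position = (len(sorted_values) - 1) * q / 100
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def error_summary(errors):
    """RMSE, MAE, selected percentiles of the absolute error, and MaxAE."""
    abs_errors = sorted(abs(e) for e in errors)
    n = len(abs_errors)
    summary = {
        "RMSE": math.sqrt(sum(e * e for e in abs_errors) / n),
        "MAE": sum(abs_errors) / n,
    }
    for q in (25, 50, 75, 95):
        summary[f"{q}%ile AE"] = percentile(abs_errors, q)
    summary["MaxAE"] = abs_errors[-1]
    return summary


summary = error_summary([1.0, -2.0, 3.0, -4.0])
for name, value in summary.items():
    print(name, round(value, 4))
```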
-
SISSOkit.evaluation.
compute_using_model_reg
(path=None, result=None, training=True, data=None, task_idx=None, dimension_idx=None)
Uses the SISSO model of a specific task_idx and dimension_idx to predict the property of data. This function is somewhat hard to use; see also Regression.predict().
- Arguments:
- path (string):
directory path of SISSO result which contains the model you want to use.
- result (Regression):
instance of Regression which contains the model you want to use.
- training (bool):
whether the task is training or predicting. Default is True.
- data (string or pandas.DataFrame):
the path of the data or the data you want to compute. If it is
string
, it will be recognized as path to the data and usedata=pd.read_csv(data,sep=r'\s+')
to read the data, so remember to use space to seperate the data. Otherwise it should bepandas.DataFrame
.- task_idx (integer):
specifies which task of model you want to use.
- dimension_idx (integer):
specifies which dimension of model you want to use.
- Returns:
Values computed using the given model. The index order is [task, dimension, sample]; not all of these indices may be present, depending on your input.
Note
You should specify only one of path or result to determine which model you want to use.
You only need to pass values to task_idx and dimension_idx when you want to use specific data.
If you don’t pass a value to data, you will get the predictions for train.dat if training is True, or the predictions for validation.dat if training is False. In this case you don’t need to pass task_idx and dimension_idx, because you have already specified which task model each sample should use, and the results for all dimensions are returned.
If you pass a value to data, you have to specify task_idx, which means all samples are computed using the same task. If you don’t pass a value to dimension_idx, the values computed by the models of all dimensions are returned; otherwise only the values computed using the model of dimension dimension_idx are returned.
-
SISSOkit.evaluation.
evaluate_expression
(expression, data)
Returns the values computed by evaluating expression over data.
- Arguments:
- expression (string):
arithmetic expression. It should be in the same form as the descriptors in the file SISSO.out, i.e. every operation should be enclosed in parentheses, like (exp((A+B))/((C*D))^2).
- data (pandas.DataFrame):
the data you want to compute, with samples as index and features as columns. The feature names should correspond to the operands in expression.
- Returns:
pandas.Series: values computed using
expression
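To make the expression format concrete, the sketch below evaluates a SISSO-style descriptor string row by row. It is a simplified stand-in, not SISSOkit's evaluator: a plain dict of columns replaces the pandas.DataFrame, the whitelist of functions is an assumption, and ^ is simply rewritten to Python's ** operator.

```python
import math

# Functions the expression may call; this small whitelist is an assumption.
ALLOWED_FUNCTIONS = {
    "exp": math.exp,
    "log": math.log,
    "sqrt": math.sqrt,
    "sin": math.sin,
    "cos": math.cos,
    "abs": abs,
}


def evaluate_expression_sketch(expression, data):
    """Evaluate expression for every sample in data.

    data maps each feature name to a list of values, one per sample.
    """
    # SISSO writes powers with ^, Python with **.
    code = compile(expression.replace("^", "**"), "<expression>", "eval")
    n_samples = len(next(iter(data.values())))
    values = []
    for i in range(n_samples):
        row = {name: column[i] for name, column in data.items()}
        # Builtins are disabled; only whitelisted functions and the
        # feature values of this sample are visible to the expression.
        scope = {**ALLOWED_FUNCTIONS, **row}
        values.append(eval(code, {"__builtins__": {}}, scope))
    return values


data = {"A": [0.0, 1.0], "B": [0.0, -1.0], "C": [1.0, 2.0], "D": [2.0, 0.5]}
values = evaluate_expression_sketch("(exp((A+B))/((C*D))^2)", data)
print(values)
```

Because ** binds more tightly than / in Python, the rewritten string computes exp(A+B) divided by the square of C*D. Even with builtins disabled, eval should only be used on expressions from a trusted source such as your own SISSO.out.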
-
SISSOkit.evaluation.
predict_reg
(data, descriptors, coefficients, intercepts, tasks=None, dimensions=None)[source]¶ Returns the predictions.
- Arguments:
- data (string or pandas.DataFrame):
the path of the data or the data you want to compute. If it is a string, it is treated as the path to the data, which is read with data=pd.read_csv(data,sep=r'\s+'), so remember to separate the values with whitespace. Otherwise it should be a pandas.DataFrame.
- descriptors (list):
the descriptors you want to use. The index order should be dimension of model, then dimension of descriptor.
- coefficients (list):
the coefficients you want to use. The index order should be task, dimension of model, then dimension of descriptor.
- intercepts (float or list):
the intercepts you want to use. The index order should be task, then dimension of model.
- tasks (None or list):
specifies which task model each sample should use. It should be tasks=[ [task_index, [sample_indices] ] ]. For example, [ [1, [1,3]], [2, [2,4,5]] ] means samples 1 and 3 are computed using the task 1 model, and samples 2, 4 and 5 using the task 2 model. If it is None, all samples are computed with the task 1 model.
- dimensions (None or list):
specifies which dimensions of descriptor will be used. For example, [2,5] means only the 2D and 5D models are used. If it is None, all samples are computed with the models of all dimensions.
- Returns:
Predictions using passed models.
Note
intercepts should correspond to the other arguments.
If you just want to use one model, descriptors and coefficients should each be a list, with strings and floats as items respectively, and intercepts should be a float. You don’t need to set tasks and dimensions in this case.
If you want to use many models, coefficients should be a list indexed by task. Each item in that list should also be a list, indexed by dimension of the model, and each item of that inner list is in turn a list indexed by dimension of descriptor, containing the specific coefficients. So the index order should be task, dimension of model, then dimension of descriptor. The index order of descriptors should be dimension of model, then dimension of descriptor. The index order of intercepts should be task, then dimension of model.
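The nesting described in the note can be pictured with a small made-up example for two tasks and models of dimension 1 and 2. The numbers, the descriptor strings and the helper predict_linear are hypothetical, and positions here are 0-based Python list positions; how SISSOkit itself numbers tasks and dimensions is not shown on this page.

```python
def predict_linear(descriptor_values, coefficients, intercept):
    """prediction = intercept + sum of coefficient_i * descriptor_i."""
    return intercept + sum(c * d for c, d in zip(coefficients, descriptor_values))


# descriptors[model_dimension][descriptor_dimension]
descriptors = [["(A+B)"], ["(A+B)", "(C*D)"]]

# coefficients[task][model_dimension][descriptor_dimension]
coefficients = [
    [[2.0], [1.5, -0.5]],   # first task: 1D model, 2D model
    [[3.0], [1.0, 0.25]],   # second task: 1D model, 2D model
]

# intercepts[task][model_dimension]
intercepts = [[0.1, 0.2], [0.3, 0.4]]

# Use the 2D model of the second task on one sample whose two descriptor
# values (made up here) have already been evaluated.
task, model_dim = 1, 1
descriptor_values = [4.0, 2.0]
prediction = predict_linear(
    descriptor_values, coefficients[task][model_dim], intercepts[task][model_dim]
)
print(prediction)
```

Each model of a given dimension needs as many coefficients as it has descriptors, which is why coefficients carries one more level of nesting than intercepts.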
SISSOkit.notebook module
-
SISSOkit.notebook.
generate_report
(path, file_path, notebook_name, file_name=None)[source]¶ Generates jupyter notebook reports.
- Arguments:
- path (list):
path to SISSO results. If there is only one result over whole data set, then it should be a list containing only 1 item. If there is also cross validation results, it should be [path to result over whole data set, path to cross validation results].
- file_path (string):
path to newly generated jupyter notebook.
- notebook_name (int or string):
notebook index or notebook name.
=====  ==================
index  name
=====  ==================
0      regression
1      regression with CV
=====  ==================
- file_name (None or string):
the newly generated jupyter notebook name. If it is None, the file name is the same as notebook template name.
SISSOkit.plot module
-
SISSOkit.plot.
abs_errors_vs_dimension
(*regressions, training=True, unit=None, fontsize=20, selected_errors=None, display_baseline=False, label='', **kw)[source]¶ Plots the histogram of absolute errors with box plot for errors.
- Arguments:
- regressions (evaluation.Regression or evaluation.RegressionCV):
regression result.
- training (bool):
training errors or prediction errors.
- unit (string):
unit of property.
- fontsize (int):
fontsize of axis name.
- selected_errors (None or list):
which errors should appear in the plot. The available errors are ‘RMSE’, ‘MAE’, ‘25%ile AE’, ‘50%ile AE’, ‘75%ile AE’, ‘95%ile AE’ and ‘MaxAE’. If it is None, all the errors appear in the plot.
- display_baseline (bool):
whether to plot the baseline.
- kw:
keyword arguments of histogram. It’s the same as
hist()
in matplotlib
-
SISSOkit.plot.
baselineplot
(regression, unit=None, marker='x', marker_color='r', marker_y=1, marker_shape=100, fontsize=20, **kw)
Plots the histogram of the baseline of regression.
- Arguments:
- regression (evaluation.Regression):
regression result.
- unit (string):
unit of property.
- marker (char):
marker type of mean value. It’s the same as in matplotlib.
- marker_color (char):
marker color of mean value. It’s the same as in matplotlib.
- marker_shape (int):
marker shape of mean value. It’s the same as in matplotlib.
- marker_y (float):
vertical coordinate of marker of mean value.
- fontsize (int):
fontsize of axis name.
- kw:
keyword arguments of histogram. It’s the same as
hist()
in matplotlib.
-
SISSOkit.plot.
boxplot
(regression, training=True, unit=None, fontsize=20, **kwargs)[source]¶ Plots the boxplot of regression.
- Arguments:
- regression (evaluation.Regression or evaluation.RegressionCV):
regression result.
- training (bool):
training errors or prediction errors.
- unit (string):
unit of property.
- fontsize (int):
fontsize of axis name.
- kw:
keyword arguments of histogram. It’s the same as
hist()
in matplotlib
-
SISSOkit.plot.
error_hist
(dimension, *regressions, training=True, absolute=False, unit=None, fontsize=20, **kw)
Plots the histogram of errors of regression.
- Arguments:
- dimension (int):
dimension of descriptor.
- regressions (evaluation.Regression or evaluation.RegressionCV):
regression result.
- training (bool):
training errors or prediction errors.
- absolute (bool):
absolute errors or not.
- unit (string):
unit of property.
- fontsize (int):
fontsize of axis name.
- kw:
keyword arguments of histogram. It’s the same as
hist()
in matplotlib
-
SISSOkit.plot.
errors_details
(regression, training=True)
Plots detailed information about the regression, including a histogram of signed errors, prediction vs. property, and a histogram of absolute errors with markers.
- Arguments:
- regression (evaluation.Regression or evaluation.RegressionCV):
regression result.
- training (bool):
training errors or prediction errors.
-
SISSOkit.plot.
hist_with_markers
(dimension, *regressions, training=True, unit=None, fontsize=20, selected_errors=None, marker_x=0, marker=None, **kw)[source]¶ Plots the histogram of absolute errors with markers
- Arguments:
- dimension (int):
dimension of descriptor.
- regressions (evaluation.Regression or evaluation.RegressionCV):
regression result.
- training (bool):
training errors or prediction errors.
- unit (string):
unit of property.
- fontsize (int):
fontsize of axis name.
- selected_errors (None or list):
which errors should be marked in the plot. The available errors are ‘RMSE’, ‘MAE’, ‘25%ile AE’, ‘50%ile AE’, ‘75%ile AE’, ‘95%ile AE’ and ‘MaxAE’. If it is None, all the errors appear in the plot.
- marker_x (float):
horizontal coordinate of the markers.
- marker (None or list):
marker type. It’s the same as in matplotlib. If it is None, the default types are used.
- kw:
keyword arguments of histogram. It’s the same as
hist()
in matplotlib
-
SISSOkit.plot.
prediction_vs_property
(dimension, *regressions, training=True, unit=None, fontsize=20, **kw)[source]¶ Plots the scatter plot of prediction v.s. property.
- Arguments:
- dimension (int):
dimension of descriptor.
- regressions (evaluation.Regression or evaluation.RegressionCV):
regression result.
- training (bool):
training errors or prediction errors.
- unit (string):
unit of property.
- fontsize (int):
fontsize of axis name.
- kw:
keyword arguments of histogram. It’s the same as
hist()
in matplotlib
SISSOkit.utils module
-
SISSOkit.utils.
models_to_markdown
(regression, task, dimension, indent='')[source]¶ Returns the markdown form of models.