Feature Selectors

FractionVariableSelector

class bcselector.variable_selection.FractionVariableSelector[source]

Bases: bcselector.variable_selection._VariableSelector

Ranks all features in the dataset with the fraction cost filter method.

Methods Summary

fit(data, target_variable, costs, r[, …])

Ranks all features in the dataset with the fraction cost filter method.

score(model, scoring_function)

Scores the selected features step by step with scoring_function.

plot_scores([budget, …])

Plots scores of each iteration of the algorithm.

get_cost_results()

Getter to obtain cost-sensitive results.

get_no_cost_results()

Getter to obtain non-cost-sensitive results.

Methods Documentation

fit(data, target_variable, costs, r, j_criterion_func='cife', number_of_features=None, budget=None, stop_budget=False, **kwargs)[source]

Ranks all features in the dataset with the fraction cost filter method.

Parameters
  • data (np.ndarray or pd.DataFrame) – Matrix or data frame whose features we want to rank.

  • target_variable (np.ndarray or pd.core.series.Series) – Vector or series of the target variable. The number of rows in data must equal the length of target_variable.

  • costs (list or dict) – Costs of the features. Must have one entry per column of data. When data is an np.ndarray, provide costs as a list of floats or integers. When data is a pd.DataFrame, provide costs as a list of floats or integers, or as a dict {'col_1': cost_1, …}.

  • r (int or float) – Cost scaling parameter. The higher r is, the higher the impact of the cost on selection.

  • j_criterion_func (str) – Method of approximation of the conditional mutual information. Must be one of ['mim', 'mifs', 'mrmr', 'jmi', 'cife']. All methods can be listed by running: >>> from bcselector.information_theory.j_criterion_approximations import __all__

  • number_of_features (int) – Optional argument; constrains the number of selected features.

  • budget (int or float) – Optional argument; constrains the total cost of the selected features.

  • stop_budget (bool) – Optional argument; deprecated and scheduled for removal.

  • **kwargs – Arguments passed to the fraction_find_best_feature() function and then to j_criterion_func.

Examples

>>> from bcselector.variable_selection import FractionVariableSelector
>>> fvs = FractionVariableSelector()
>>> fvs.fit(X, y, costs, r=1, j_criterion_func='mim')
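
A fuller sketch on synthetic data. The feature matrix, target, and costs below are hypothetical, and the features are kept discrete, since mutual-information filter methods typically expect discretized data:

>>> import numpy as np
>>> from bcselector.variable_selection import FractionVariableSelector
>>> rng = np.random.default_rng(0)
>>> X = rng.integers(0, 3, size=(100, 5))  # hypothetical discretized feature matrix
>>> y = (X[:, 0] > 0).astype(int)          # hypothetical binary target
>>> costs = [1.0, 0.5, 2.0, 1.5, 0.2]      # one acquisition cost per column
>>> fvs = FractionVariableSelector()
>>> fvs.fit(X, y, costs, r=1, j_criterion_func='mim', budget=3.0)
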
score(model, scoring_function)

Scores the selected features step by step with scoring_function. In each step one more feature is added, following the ranking. You could of course do this yourself, but using score guarantees that the evaluation is performed on the same train set each time, and it is much easier than writing the loop by hand.

Parameters
  • model (sklearn.base.ClassifierMixin) – Any classifier from sklearn API.

  • scoring_function (function) – Classification metric function from sklearn. Must currently be one of ['roc_auc_score']. To request more scoring functions, open a GitHub issue.

Returns

  • total_scores (list) – List of scoring_function scores for each step. One step corresponds to one feature, in the algorithm's ranking order.

  • total_costs (list) – List of accumulated costs for each step. One step corresponds to one feature, in the algorithm's ranking order.

Examples

>>> from bcselector.variable_selection import FractionVariableSelector
>>> from sklearn.metrics import roc_auc_score
>>> from sklearn.linear_model import LogisticRegression
>>> fvs = FractionVariableSelector()
>>> fvs.fit(X, y, costs, lamb=1, j_criterion_func='mim')
>>> model = LogisticRegression()
>>> fvs.score(model, roc_auc_score)
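
The returned lists can also be captured directly; a minimal sketch, assuming the return order documented above (total_scores first, then total_costs):

>>> total_scores, total_costs = fvs.score(model, roc_auc_score)
>>> pairs = list(zip(total_costs, total_scores))  # (accumulated cost, score) pairs, one per step
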
plot_scores(budget=None, compare_no_cost_method=False, savefig=False, annotate=False, annotate_box=False, figsize=(12, 8), bbox_pos=(0.72, 0.6), plot_title=None, x_axis_title=None, y_axis_title=None, **kwargs)

Plots scores of each iteration of the algorithm.

Parameters
  • budget (int or float) – Budget to be plotted on the figure as a vertical line.

  • compare_no_cost_method (bool) – Also plot the curve of the no-cost method. Defaults to False.

  • savefig (bool) – Save the figure with scores; saving arguments are passed via **kwargs.

  • annotate (bool) – Annotate the plot with feature indexes.

  • annotate_box (bool) – Plot a box with feature data: id, name, and cost.

  • figsize (tuple) – Figure size (width, height), as in matplotlib.

  • bbox_pos (tuple) – Position of the box with feature data.

  • plot_title (str) – Title of the plot.

  • x_axis_title (str) – Label of the x axis.

  • y_axis_title (str) – Label of the y axis.

  • **kwargs – Arguments passed to matplotlib.pyplot.savefig().
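
A usage sketch, assuming fit() and score() have already been called. The fname keyword and the file name are assumptions here: fname is matplotlib.pyplot.savefig's file-name argument, forwarded through **kwargs:

>>> fvs.plot_scores(budget=3.0, compare_no_cost_method=True, annotate=True,
...                 savefig=True, fname='fraction_scores.png')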

get_cost_results()

Getter to obtain cost-sensitive results.

Returns

  • variables_selected_order (list) – Indexes of the selected features.

  • cost_variables_selected_order (list) – Costs of the selected features, in the same order as variables_selected_order.

get_no_cost_results()

Getter to obtain non-cost-sensitive results.

Returns

  • variables_selected_order (list) – Indexes of the selected features.

  • cost_variables_selected_order (list) – Costs of the selected features, in the same order as variables_selected_order.
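
Both getters return parallel lists, which makes it easy to compare the cost-sensitive ranking with the cost-free one. A minimal sketch, assuming both rankings are available after fit():

>>> order, order_costs = fvs.get_cost_results()
>>> nc_order, nc_order_costs = fvs.get_no_cost_results()
>>> order[:3], nc_order[:3]  # first features picked by each method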

DiffVariableSelector

class bcselector.variable_selection.DiffVariableSelector[source]

Bases: bcselector.variable_selection._VariableSelector

Ranks all features in the dataset with the difference cost filter method.

Methods Summary

fit(data, target_variable, costs, lamb[, …])

Ranks all features in the dataset with the difference cost filter method.

score(model, scoring_function)

Scores the selected features step by step with scoring_function.

plot_scores([budget, …])

Plots scores of each iteration of the algorithm.

get_cost_results()

Getter to obtain cost-sensitive results.

get_no_cost_results()

Getter to obtain non-cost-sensitive results.

Methods Documentation

fit(data, target_variable, costs, lamb, j_criterion_func='cife', number_of_features=None, budget=None, stop_budget=False, **kwargs)[source]

Ranks all features in the dataset with the difference cost filter method.

Parameters
  • data (np.ndarray or pd.DataFrame) – Matrix or data frame whose features we want to rank.

  • target_variable (np.ndarray or pd.core.series.Series) – Vector or series of the target variable. The number of rows in data must equal the length of target_variable.

  • costs (list or dict) – Costs of the features. Must have one entry per column of data. When data is an np.ndarray, provide costs as a list of floats or integers. When data is a pd.DataFrame, provide costs as a list of floats or integers, or as a dict {'col_1': cost_1, …}.

  • lamb (int or float) – Cost scaling parameter. The higher lamb is, the higher the impact of the cost on selection.

  • j_criterion_func (str) – Method of approximation of the conditional mutual information. Must be one of ['mim', 'mifs', 'mrmr', 'jmi', 'cife']. All methods can be listed by running: >>> from bcselector.information_theory.j_criterion_approximations import __all__

  • number_of_features (int) – Optional argument; constrains the number of selected features.

  • budget (int or float) – Optional argument; constrains the total cost of the selected features.

  • stop_budget (bool) – Optional argument; deprecated and scheduled for removal.

  • **kwargs – Arguments passed to the difference_find_best_feature() function and then to j_criterion_func.

Examples

>>> from bcselector.variable_selection import DiffVariableSelector
>>> dvs = DiffVariableSelector()
>>> dvs.fit(X, y, costs, lamb=1, j_criterion_func='mim')
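
Because lamb scales the cost term of the difference criterion, setting lamb=0 should reduce the ranking to the plain, cost-free filter; a sketch under that assumption:

>>> dvs_plain = DiffVariableSelector()
>>> dvs_plain.fit(X, y, costs, lamb=0, j_criterion_func='mim')
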
score(model, scoring_function)

Scores the selected features step by step with scoring_function. In each step one more feature is added, following the ranking. You could of course do this yourself, but using score guarantees that the evaluation is performed on the same train set each time, and it is much easier than writing the loop by hand.

Parameters
  • model (sklearn.base.ClassifierMixin) – Any classifier from sklearn API.

  • scoring_function (function) – Classification metric function from sklearn. Must currently be one of ['roc_auc_score']. To request more scoring functions, open a GitHub issue.

Returns

  • total_scores (list) – List of scoring_function scores for each step. One step corresponds to one feature, in the algorithm's ranking order.

  • total_costs (list) – List of accumulated costs for each step. One step corresponds to one feature, in the algorithm's ranking order.

Examples

>>> from bcselector.variable_selection import DiffVariableSelector
>>> from sklearn.metrics import roc_auc_score
>>> from sklearn.linear_model import LogisticRegression
>>> dvs = DiffVariableSelector()
>>> dvs.fit(X, y, costs, lamb=1, j_criterion_func='mim')
>>> model = LogisticRegression()
>>> dvs.score(model, roc_auc_score)
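
Because both selectors expose the same score interface, they are easy to compare on the same data; a sketch assuming fvs (a fitted FractionVariableSelector) and dvs were fit on identical X, y, and costs:

>>> fvs_scores, fvs_costs = fvs.score(model, roc_auc_score)
>>> dvs_scores, dvs_costs = dvs.score(model, roc_auc_score)  # compare the two curves step by step
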
plot_scores(budget=None, compare_no_cost_method=False, savefig=False, annotate=False, annotate_box=False, figsize=(12, 8), bbox_pos=(0.72, 0.6), plot_title=None, x_axis_title=None, y_axis_title=None, **kwargs)

Plots scores of each iteration of the algorithm.

Parameters
  • budget (int or float) – Budget to be plotted on the figure as a vertical line.

  • compare_no_cost_method (bool) – Also plot the curve of the no-cost method. Defaults to False.

  • savefig (bool) – Save the figure with scores; saving arguments are passed via **kwargs.

  • annotate (bool) – Annotate the plot with feature indexes.

  • annotate_box (bool) – Plot a box with feature data: id, name, and cost.

  • figsize (tuple) – Figure size (width, height), as in matplotlib.

  • bbox_pos (tuple) – Position of the box with feature data.

  • plot_title (str) – Title of the plot.

  • x_axis_title (str) – Label of the x axis.

  • y_axis_title (str) – Label of the y axis.

  • **kwargs – Arguments passed to matplotlib.pyplot.savefig().

get_cost_results()

Getter to obtain cost-sensitive results.

Returns

  • variables_selected_order (list) – Indexes of the selected features.

  • cost_variables_selected_order (list) – Costs of the selected features, in the same order as variables_selected_order.

get_no_cost_results()

Getter to obtain non-cost-sensitive results.

Returns

  • variables_selected_order (list) – Indexes of the selected features.

  • cost_variables_selected_order (list) – Costs of the selected features, in the same order as variables_selected_order.