nextstep.feature_selection module

nextstep.feature_selection.get_features_from_RFE(estimator, X, y, num_features=None, step=1, verbose=0)

A wrapper-based feature selection method that treats the selection of a feature subset as a search problem. The algorithm selects features by recursively considering smaller and smaller sets of features.

Parameters
  • estimator – an sklearn supervised learning estimator with a fit method that exposes feature importance through a coef_ or feature_importances_ attribute.

  • X (pandas DataFrame of shape (n_samples, n_features)) – the training input samples

  • y (array-like of shape (n_samples,)) – the target values

  • num_features (int or None, optional (default=None)) – number of features to select. If None, half of the features are selected.

  • step (int or float, optional (default=1)) – If greater than or equal to 1, then step corresponds to the (integer) number of features to remove at each iteration. If within (0.0, 1.0), then step corresponds to the percentage (rounded down) of features to remove at each iteration.

  • verbose (int, optional (default=0)) – controls verbosity of output

Returns

a list containing the best n features
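
The recursive elimination described above can be sketched with scikit-learn's RFE directly. This is a minimal illustration of the technique, not the nextstep implementation: the synthetic dataset and the column names f0 to f7 are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic regression data (assumed for illustration): 8 features, 3 informative.
X_arr, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# n_features_to_select=None makes scikit-learn keep half of the features;
# step=1 removes one feature per iteration.
selector = RFE(LinearRegression(), n_features_to_select=None, step=1, verbose=0)
selector.fit(X, y)

# Map the boolean support mask back to column names.
selected = X.columns[selector.get_support()].tolist()
print(selected)
```

With 8 input columns and no explicit feature count, the mask keeps 4 of them, which matches the "half" behaviour documented for num_features=None.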

nextstep.feature_selection.get_features_from_SelectKBest(score_func, X, y, num_features=10, show_plot=True)

A filter-based feature selection method in which the user specifies a scoring metric that is used to filter features. This function also displays the SelectKBest result as a plot, which makes the “elbow” point easier to identify.

Parameters
  • score_func – a function taking two arrays X and y, and returning a pair of arrays (scores, pvalues) or a single array with scores. For a list of available score_func, refer to the “See also” section on https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html.

  • X (pandas DataFrame of shape (n_samples, n_features)) – the training input samples

  • y (array-like of shape (n_samples,)) – the target values

  • num_features (int, optional (default=10)) – number of top features to select

  • show_plot (boolean, optional (default=True)) – if set to True, a plot of the scores of the k best features is displayed.

Returns

a pandas DataFrame containing the best k features, sorted by score in descending order
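
A score table of this kind can be sketched with scikit-learn's SelectKBest and pandas. The snippet below is an illustration under assumptions, not the nextstep source: the column names "feature" and "score" and the synthetic data are hypothetical, and the plot is omitted.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data (assumed for illustration).
X_arr, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

k = 3
selector = SelectKBest(score_func=f_regression, k=k).fit(X, y)

# Pair each column with its score, then keep the k highest in descending order.
scores = pd.DataFrame({"feature": X.columns, "score": selector.scores_})
top_k = (
    scores.sort_values("score", ascending=False)
    .head(k)
    .reset_index(drop=True)
)
print(top_k)
```

Plotting the full sorted score column, rather than only the top k rows, is one way to look for the "elbow" where scores stop dropping sharply.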

nextstep.feature_selection.get_selector_from_RFE(estimator, X, y, num_features=None, step=1, verbose=0)

A wrapper-based feature selection method that treats the selection of a feature subset as a search problem. The algorithm selects features by recursively considering smaller and smaller sets of features.

Parameters
  • estimator – an sklearn supervised learning estimator with a fit method that exposes feature importance through a coef_ or feature_importances_ attribute.

  • X (pandas DataFrame of shape (n_samples, n_features)) – the training input samples

  • y (array-like of shape (n_samples,)) – the target values

  • num_features (int or None, optional (default=None)) – number of features to select. If None, half of the features are selected.

  • step (int or float, optional (default=1)) – If greater than or equal to 1, then step corresponds to the (integer) number of features to remove at each iteration. If within (0.0, 1.0), then step corresponds to the percentage (rounded down) of features to remove at each iteration.

  • verbose (int, optional (default=0)) – controls verbosity of output

Returns

the fitted RFE selector
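
As a rough sketch of what a fitted selector offers, the example below fits scikit-learn's RFE with a fractional step and a tree-based estimator (which exposes feature_importances_), then inspects the result. The data, the column names, and the choice of DecisionTreeRegressor are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumed for illustration).
X_arr, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# step=0.25 is a fraction of the 8 features, so 2 features are removed
# per iteration until only n_features_to_select remain.
selector = RFE(
    DecisionTreeRegressor(random_state=0), n_features_to_select=2, step=0.25
)
selector.fit(X, y)

# Selected features have rank 1; larger ranks were eliminated earlier.
ranking = pd.Series(selector.ranking_, index=X.columns).sort_values()
print(ranking)

# The fitted selector can also reduce a DataFrame to the selected columns.
X_reduced = selector.transform(X)
print(X_reduced.shape)
```

Returning the selector rather than a list of names keeps get_support(), ranking_, and transform() available, which is what a voting step over several selectors relies on.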

nextstep.feature_selection.get_selector_from_SelectKBest(score_func, X, y, num_features=10)

A filter-based feature selection method in which the user specifies a scoring metric that is used to filter features.

Parameters
  • score_func – a function taking two arrays X and y, and returning a pair of arrays (scores, pvalues) or a single array with scores. For a list of available score_func, refer to the “See also” section on https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html.

  • X (pandas DataFrame of shape (n_samples, n_features)) – the training input samples

  • y (array-like of shape (n_samples,)) – the target values

  • num_features (int, optional (default=10)) – number of top features to select

Returns

the fitted SelectKBest selector

nextstep.feature_selection.majority_voting(X, y, num_features=10, step=1, verbose=0)

Unlike select_features_by_majority_voting, which lets users customise the base selectors, this function further abstracts the feature selection process by providing a set of pre-defined base feature selectors suitable for regression tasks. The pre-defined selectors consist of two from SelectKBest (f_regression and mutual_info_regression), two from RFE (linear regression and SVR), and two embedded methods (regression tree and LGB regression).

Parameters
  • X (pandas DataFrame of shape (n_samples, n_features)) – the training input samples

  • y (array-like of shape (n_samples,)) – the target values

  • num_features (int or None, optional (default=10)) – number of features to select for each base selector. If None, 10 features will be selected.

  • step (int or float, optional (default=1)) – for RFE; if greater than or equal to 1, then step corresponds to the (integer) number of features to remove at each iteration. If within (0.0, 1.0), then step corresponds to the percentage (rounded down) of features to remove at each iteration.

  • verbose (int, optional (default=0)) – for RFE; controls verbosity of output

Returns

a pandas DataFrame ranked by the number of times a feature has been selected by the pre-defined selectors, with boolean columns indicating whether each feature was selected by a particular selector.

nextstep.feature_selection.select_features_by_majority_voting(X, selectors_dict)

Combines the various feature selection tools and uses majority voting to decide which features to keep. Only works with selectors that have a get_support() method.

Parameters
  • X (pandas DataFrame of shape (n_samples, n_features)) – the training input samples

  • selectors_dict – a dictionary with key being the name of a feature selection method, value being a selector object

Returns

a pandas DataFrame ranked by the number of times a feature has been selected by the various selectors, with boolean columns indicating whether each feature was selected by a particular selector.
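
The voting scheme can be sketched in a few lines of pandas. This is one plausible reading of the description above, not the library's actual code: the dictionary keys, the "feature" and "total" column names, and the synthetic data are hypothetical.

```python
from functools import partial

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import (
    RFE,
    SelectKBest,
    f_regression,
    mutual_info_regression,
)
from sklearn.linear_model import LinearRegression

# Synthetic regression data (assumed for illustration).
X_arr, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Keys are selector names, values are fitted selectors exposing get_support().
selectors_dict = {
    "f_regression": SelectKBest(f_regression, k=3).fit(X, y),
    "mutual_info": SelectKBest(
        partial(mutual_info_regression, random_state=0), k=3
    ).fit(X, y),
    "rfe_linear": RFE(LinearRegression(), n_features_to_select=3).fit(X, y),
}

# One boolean column per selector, then count the votes for each feature.
votes = pd.DataFrame({"feature": X.columns})
for name, selector in selectors_dict.items():
    votes[name] = selector.get_support()
votes["total"] = votes[list(selectors_dict)].sum(axis=1)

# Rank features by the number of selectors that picked them.
votes = votes.sort_values("total", ascending=False).reset_index(drop=True)
print(votes)
```

Any object with a get_support() method that returns a boolean mask over the columns of X can take part in the vote, which is why the restriction on selector types exists.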