resampling package

Submodules

resampling.undersampling module

Module containing the Undersampling class and the command line interface.

class resampling.undersampling.Undersampling(input_dataset_path, output_dataset_path, properties=None, **kwargs)

Bases: BiobbObject

biobb_ml Undersampling
Wrapper of most of the imblearn.under_sampling methods.
Removes samples from the majority class of a given dataset, with or without replacement. If regression is specified as type, the continuous target is first binned into classes so that the undersampling model can be applied. Visit the imbalanced-learn official website for the different methods accepted in this wrapper: RandomUnderSampler, NearMiss, CondensedNearestNeighbour, TomekLinks, EditedNearestNeighbours, NeighbourhoodCleaningRule, ClusterCentroids.
Parameters:
  • input_dataset_path (str) – Path to the input dataset. File type: input. Sample file. Accepted formats: csv (edam:format_3752).

  • output_dataset_path (str) –

    Path to the output dataset. File type: output. Sample file. Accepted formats: csv (edam:format_3752).

  • properties (dict - Python dictionary object containing the tool parameters, not input/output files) –

    • method (str) - (None) Undersampling method. It’s a mandatory property. Values: random (RandomUnderSampler: Under-sample the majority classes by randomly picking samples with or without replacement), nearmiss (NearMiss: Class to perform under-sampling based on NearMiss methods), cnn (CondensedNearestNeighbour: Class to perform under-sampling based on the condensed nearest neighbour method), tomeklinks (TomekLinks: Class to perform under-sampling by removing Tomek’s links), enn (EditedNearestNeighbours: Class to perform under-sampling based on the edited nearest neighbour method), ncr (NeighbourhoodCleaningRule: Class performing under-sampling based on the neighbourhood cleaning rule), cluster (ClusterCentroids: Method that under samples the majority class by replacing a cluster of majority samples by the cluster centroid of a KMeans algorithm).

    • type (str) - (None) Type of undersampling. It’s a mandatory property. Values: regression (the undersampling will be applied on a continuous dataset), classification (the undersampling will be applied on a classified dataset).

    • target (dict) - ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { “column”: “column3” } or { “index”: 21 }. In case of multiple formats, the first one will be picked.

    • evaluate (bool) - (False) Whether or not to evaluate the dataset before and after applying the resampling.

    • evaluate_splits (int) - (3) [2~100|1] Number of folds to be applied by the Repeated Stratified K-Fold evaluation method. Must be at least 2.

    • evaluate_repeats (int) - (3) [2~100|1] Number of times Repeated Stratified K-Fold cross validator needs to be repeated.

    • n_bins (int) - (5) [1~100|1] Only for regression undersampling. The number of classes that the user wants to generate with the target data.

    • balanced_binning (bool) - (False) Only for regression undersampling. Decides whether samples are to be distributed roughly equally across all classes.

    • sampling_strategy (dict) - ({ “target”: “auto” }) Sampling information to sample the data set. Formats: { “target”: “auto” }, { “ratio”: 0.3 }, { “dict”: { 0: 300, 1: 200, 2: 100 } } or { “list”: [0, 2, 3] }. When “target”, specify the class targeted by the resampling; the number of samples in the different classes will be equalized; possible choices are: majority (resample only the majority class), not minority (resample all classes but the minority class), not majority (resample all classes but the majority class), all (resample all classes), auto (equivalent to ‘not minority’). When “ratio”, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling (ONLY IN CASE OF BINARY CLASSIFICATION). When “dict”, the keys correspond to the targeted classes, the values correspond to the desired number of samples for each targeted class. When “list”, the list contains the classes targeted by the resampling.

    • version (int) - (1) Only for NearMiss method. Version of NearMiss to use. Values: 1 (selects the majority-class samples whose average distance to the three closest minority-class instances is the smallest), 2 (selects the majority-class samples whose average distance to the three farthest minority-class instances is the smallest), 3 (selects a given number of the closest samples of the majority class for each sample of the minority class).

    • n_neighbors (int) - (1) [1~100|1] Only for NearMiss, CondensedNearestNeighbour, EditedNearestNeighbours and NeighbourhoodCleaningRule methods. Size of the neighbourhood to consider to compute the average distance to the minority point samples.

    • threshold_cleaning (float) - (0.5) [0~1|0.1] Only for NeighbourhoodCleaningRule method. Threshold used to decide whether or not to consider a class during the cleaning after applying ENN.

    • random_state_method (int) - (5) [1~1000|1] Only for RandomUnderSampler and ClusterCentroids methods. Controls the randomization of the algorithm.

    • random_state_evaluate (int) - (5) [1~1000|1] Controls the shuffling applied to the Repeated Stratified K-Fold evaluation method.

    • remove_tmp (bool) - (True) [WF property] Remove temporary files.

    • restart (bool) - (False) [WF property] Do not execute if output files exist.

    • sandbox_path (str) - (“./”) [WF property] Parent path to the sandbox directory.
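Each of the four sampling_strategy formats wraps a single value. As a minimal sketch, assuming the wrapper passes that value on to the sampling_strategy argument of the imbalanced-learn sampler (which accepts a string, a float, a dict or a list), the unwrapping could look as follows. The helper name unwrap_sampling_strategy is hypothetical and is not part of biobb_ml.

```python
def unwrap_sampling_strategy(strategy=None):
    """Return the bare value carried by a wrapper-style sampling_strategy dict.

    Hypothetical helper for illustration only; not part of biobb_ml.
    """
    expected_types = {
        "target": str,          # e.g. "auto", "majority", "not minority"
        "ratio": (int, float),  # only meaningful for binary classification
        "dict": dict,           # {class_label: desired_number_of_samples}
        "list": list,           # [class_label, ...]
    }
    if strategy is None:
        # Documented default value of the property.
        strategy = {"target": "auto"}
    if len(strategy) != 1:
        raise ValueError("expected exactly one of: target, ratio, dict, list")
    (key, value), = strategy.items()
    if key not in expected_types:
        raise ValueError(f"unknown sampling_strategy format: {key!r}")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, expected_types[key]):
        raise TypeError(f"unexpected value type for format {key!r}")
    return value
```

For example, unwrap_sampling_strategy({"ratio": 0.3}) returns 0.3, and calling it with no argument returns "auto".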

Examples

This is an example of how to use the building block from Python:

from biobb_ml.resampling.undersampling import undersampling
prop = {
    'method': 'enn',
    'type': 'regression',
    'target': {
        'column': 'target'
    },
    'evaluate': True,
    'n_bins': 10,
    'n_neighbors': 3,
    'sampling_strategy': {
        'target': 'auto'
    }
}
undersampling(input_dataset_path='/path/to/myDataset.csv',
            output_dataset_path='/path/to/newDataset.csv',
            properties=prop)
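With type set to regression, the continuous target is grouped into n_bins classes before the sampler runs. The sketch below shows one plausible reading of n_bins and balanced_binning using only the standard library: equal-width bins when balanced_binning is False, and equal-frequency (rank-based) bins when it is True. The function name bin_target and the exact binning rules are assumptions; the wrapper's actual implementation may differ.

```python
def bin_target(values, n_bins=5, balanced_binning=False):
    """Assign each continuous value a class label in 0..n_bins-1 (illustrative only)."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    n = len(values)
    if balanced_binning:
        # Equal-frequency bins: rank the samples, then split the ranks evenly.
        order = sorted(range(n), key=lambda i: values[i])
        labels = [0] * n
        for rank, index in enumerate(order):
            labels[index] = rank * n_bins // n
        return labels
    # Equal-width bins between the minimum and the maximum of the target.
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_bins or 1.0  # constant target: avoid dividing by zero
    return [min(int((v - lowest) / width), n_bins - 1) for v in values]


target = [1.0, 1.1, 1.2, 1.3, 9.0, 10.0]
print(bin_target(target, n_bins=2))                         # [0, 0, 0, 0, 1, 1]
print(bin_target(target, n_bins=2, balanced_binning=True))  # [0, 0, 0, 1, 1, 1]
```

On this skewed target, equal-width binning yields an imbalanced pair of classes, while the balanced variant spreads the samples roughly equally across them.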
Info:
check_data_params(out_log, err_log)

Checks all the input/output paths and parameters.
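The parameter list above states a few constraints that can be verified before launching the block. The following sketch is a hypothetical pre-flight check built only from those documented values and ranges; it is not the implementation of check_data_params, and the name check_properties is an assumption.

```python
# Values and ranges copied from the property descriptions of the Undersampling class.
UNDERSAMPLING_METHODS = {"random", "nearmiss", "cnn", "tomeklinks", "enn", "ncr", "cluster"}
RESAMPLING_TYPES = {"regression", "classification"}


def check_properties(properties):
    """Return a list of problems found in an undersampling properties dict."""
    problems = []
    if properties.get("method") not in UNDERSAMPLING_METHODS:
        problems.append("method is mandatory; choose one of: "
                        + ", ".join(sorted(UNDERSAMPLING_METHODS)))
    if properties.get("type") not in RESAMPLING_TYPES:
        problems.append("type is mandatory; choose regression or classification")
    # Documented range [2~100], default 3.
    if not 2 <= properties.get("evaluate_splits", 3) <= 100:
        problems.append("evaluate_splits must be between 2 and 100")
    # Documented range [1~100], default 5.
    if not 1 <= properties.get("n_bins", 5) <= 100:
        problems.append("n_bins must be between 1 and 100")
    return problems
```

An empty list means the documented constraints covered by this sketch are satisfied; it says nothing about the input and output paths.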

launch() → int

Execute the resampling.undersampling.Undersampling object.

resampling.undersampling.main()

Command line execution of this building block. Please check the command line documentation.

resampling.undersampling.undersampling(input_dataset_path: str, output_dataset_path: str, properties: dict | None = None, **kwargs) → int

Create the Undersampling class and execute its launch() method.

resampling.oversampling module

Module containing the Oversampling class and the command line interface.

class resampling.oversampling.Oversampling(input_dataset_path, output_dataset_path, properties=None, **kwargs)

Bases: BiobbObject

biobb_ml Oversampling
Wrapper of most of the imblearn.over_sampling methods.
Supplements the training data with additional samples (copies or synthetic samples, depending on the method) of some of the minority classes of a given dataset. If regression is specified as type, the continuous target is first binned into classes so that the oversampling model can be applied. Visit the imbalanced-learn official website for the different methods accepted in this wrapper: RandomOverSampler, SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN.
Parameters:
  • input_dataset_path (str) –

    Path to the input dataset. File type: input. Sample file. Accepted formats: csv (edam:format_3752).

  • output_dataset_path (str) –

    Path to the output dataset. File type: output. Sample file. Accepted formats: csv (edam:format_3752).

  • properties (dict - Python dictionary object containing the tool parameters, not input/output files) –

    • method (str) - (None) Oversampling method. It’s a mandatory property. Values: random (RandomOverSampler: Object to over-sample the minority classes by picking samples at random with replacement), smote (SMOTE: This object is an implementation of SMOTE - Synthetic Minority Over-sampling Technique), borderline (BorderlineSMOTE: This algorithm is a variant of the original SMOTE algorithm. Borderline samples will be detected and used to generate new synthetic samples), svmsmote (SVMSMOTE: Variant of SMOTE algorithm which use an SVM algorithm to detect sample to use for generating new synthetic samples), adasyn (ADASYN: Perform over-sampling using Adaptive Synthetic -ADASYN- sampling approach for imbalanced datasets).

    • type (str) - (None) Type of oversampling. It’s a mandatory property. Values: regression (the oversampling will be applied on a continuous dataset), classification (the oversampling will be applied on a classified dataset).

    • target (dict) - ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { “column”: “column3” } or { “index”: 21 }. In case of multiple formats, the first one will be picked.

    • evaluate (bool) - (False) Whether or not to evaluate the dataset before and after applying the resampling.

    • evaluate_splits (int) - (3) [2~100|1] Number of folds to be applied by the Repeated Stratified K-Fold evaluation method. Must be at least 2.

    • evaluate_repeats (int) - (3) [2~100|1] Number of times Repeated Stratified K-Fold cross validator needs to be repeated.

    • n_bins (int) - (5) [1~100|1] Only for regression oversampling. The number of classes that the user wants to generate with the target data.

    • balanced_binning (bool) - (False) Only for regression oversampling. Decides whether samples are to be distributed roughly equally across all classes.

    • sampling_strategy (dict) - ({ “target”: “auto” }) Sampling information to sample the data set. Formats: { “target”: “auto” }, { “ratio”: 0.3 }, { “dict”: { 0: 300, 1: 200, 2: 100 } } or { “list”: [0, 2, 3] }. When “target”, specify the class targeted by the resampling; the number of samples in the different classes will be equalized; possible choices are: minority (resample only the minority class), not minority (resample all classes but the minority class), not majority (resample all classes but the majority class), all (resample all classes), auto (equivalent to ‘not majority’). When “ratio”, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling (ONLY IN CASE OF BINARY CLASSIFICATION). When “dict”, the keys correspond to the targeted classes, the values correspond to the desired number of samples for each targeted class. When “list”, the list contains the classes targeted by the resampling.

    • k_neighbors (int) - (5) [1~100|1] Only for SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN. The number of nearest neighbours used to construct synthetic samples.

    • random_state_method (int) - (5) [1~1000|1] Controls the randomization of the algorithm.

    • random_state_evaluate (int) - (5) [1~1000|1] Controls the shuffling applied to the Repeated Stratified K-Fold evaluation method.

    • remove_tmp (bool) - (True) [WF property] Remove temporary files.

    • restart (bool) - (False) [WF property] Do not execute if output files exist.

    • sandbox_path (str) - (“./”) [WF property] Parent path to the sandbox directory.
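The "ratio" format is easiest to read as arithmetic. The sketch below assumes the convention described above (ratio = minority samples / majority samples after resampling) and computes the class sizes an oversampler would aim for on a binary dataset. The function name oversampled_counts is hypothetical and the code does not call imbalanced-learn.

```python
from collections import Counter


def oversampled_counts(labels, ratio):
    """Return the per-class sample counts targeted by oversampling with a float ratio."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("a float ratio only applies to binary classification")
    # most_common() sorts by count, so the majority class comes first.
    (majority, n_majority), (minority, n_minority) = counts.most_common()
    target_minority = int(n_majority * ratio)
    if target_minority < n_minority:
        raise ValueError("ratio is too low: it would require removing minority samples")
    # Oversampling leaves the majority class untouched and grows the minority class.
    return {majority: n_majority, minority: target_minority}


labels = [0] * 1000 + [1] * 100
print(oversampled_counts(labels, 0.3))  # {0: 1000, 1: 300}
```

With 1000 majority and 100 minority samples, a ratio of 0.3 therefore means adding 200 minority samples.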

Examples

This is an example of how to use the building block from Python:

from biobb_ml.resampling.oversampling import oversampling
prop = {
    'method': 'random',
    'type': 'regression',
    'target': {
        'column': 'target'
    },
    'evaluate': True,
    'n_bins': 10,
    'sampling_strategy': {
        'target': 'minority'
    }
}
oversampling(input_dataset_path='/path/to/myDataset.csv',
            output_dataset_path='/path/to/newDataset.csv',
            properties=prop)
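When evaluate is True, the properties evaluate_splits, evaluate_repeats and random_state_evaluate configure a Repeated Stratified K-Fold evaluation. The pure-Python sketch below only illustrates that idea; the wrapper's own evaluation code is not shown here and may differ. Each repeat reshuffles the samples, deals every class across the folds so that class proportions are preserved, and yields one train/test split per fold. The function name is hypothetical; the defaults mirror the documented ones.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(labels, n_splits=3, n_repeats=3, random_state=5):
    """Yield (train_indices, test_indices) pairs, n_splits * n_repeats in total."""
    if n_splits < 2:
        raise ValueError("n_splits must be at least 2")
    rng = random.Random(random_state)
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Deal the shuffled samples of this class round-robin across the folds.
            for position, index in enumerate(shuffled):
                folds[position % n_splits].append(index)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
            yield train, test


labels = [0] * 6 + [1] * 3
splits = list(repeated_stratified_kfold(labels))
print(len(splits))  # 9
```

With the default values of 3 splits and 3 repeats, the dataset is evaluated on 9 train/test splits.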
Info:
check_data_params(out_log, err_log)

Checks all the input/output paths and parameters.

launch() → int

Execute the resampling.oversampling.Oversampling object.

resampling.oversampling.main()

Command line execution of this building block. Please check the command line documentation.

resampling.oversampling.oversampling(input_dataset_path: str, output_dataset_path: str, properties: dict | None = None, **kwargs) → int

Create the Oversampling class and execute its launch() method.

resampling.resampling module

Module containing the Resampling class and the command line interface.

class resampling.resampling.Resampling(input_dataset_path, output_dataset_path, properties=None, **kwargs)

Bases: BiobbObject

biobb_ml Resampling
Wrapper of the imblearn.combine methods.
Combines over- and under-sampling methods to remove samples and supplement the dataset. If regression is specified as type, the continuous target is first binned into classes so that the resampling model can be applied. Visit the imbalanced-learn official website for the different methods accepted in this wrapper: SMOTETomek, SMOTEENN.
Parameters:
  • input_dataset_path (str) –

    Path to the input dataset. File type: input. Sample file. Accepted formats: csv (edam:format_3752).

  • output_dataset_path (str) –

    Path to the output dataset. File type: output. Sample file. Accepted formats: csv (edam:format_3752).

  • properties (dict - Python dictionary object containing the tool parameters, not input/output files) –

    • method (str) - (None) Resampling method. It’s a mandatory property. Values: smotetomek (SMOTETomek: Class to perform over-sampling using SMOTE and cleaning using Tomek links), smotenn (SMOTEENN: Class to perform over-sampling using SMOTE and cleaning using ENN).

    • type (str) - (None) Type of resampling. It’s a mandatory property. Values: regression (the resampling will be applied on a continuous dataset), classification (the resampling will be applied on a classified dataset).

    • target (dict) - ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { “column”: “column3” } or { “index”: 21 }. In case of multiple formats, the first one will be picked.

    • evaluate (bool) - (False) Whether or not to evaluate the dataset before and after applying the resampling.

    • evaluate_splits (int) - (3) [2~100|1] Number of folds to be applied by the Repeated Stratified K-Fold evaluation method. Must be at least 2.

    • evaluate_repeats (int) - (3) [2~100|1] Number of times Repeated Stratified K-Fold cross validator needs to be repeated.

    • n_bins (int) - (5) [1~100|1] Only for regression resampling. The number of classes that the user wants to generate with the target data.

    • balanced_binning (bool) - (False) Only for regression resampling. Decides whether samples are to be distributed roughly equally across all classes.

    • sampling_strategy_over (dict) - ({ “target”: “auto” }) Sampling information applied in the dataset oversampling process. Formats: { “target”: “auto” }, { “ratio”: 0.3 } or { “dict”: { 0: 300, 1: 200, 2: 100 } }. When “target”, specify the class targeted by the resampling; the number of samples in the different classes will be equalized; possible choices are: minority (resample only the minority class), not minority (resample all classes but the minority class), not majority (resample all classes but the majority class), all (resample all classes), auto (equivalent to ‘not majority’). When “ratio”, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling (ONLY IN CASE OF BINARY CLASSIFICATION). When “dict”, the keys correspond to the targeted classes and the values correspond to the desired number of samples for each targeted class.

    • sampling_strategy_under (dict) - ({ “target”: “auto” }) Sampling information applied in the dataset cleaning process. Formats: { “target”: “auto” } or { “list”: [0, 2, 3] }. When “target”, specify the class targeted by the resampling; the number of samples in the different classes will be equalized; possible choices are: majority (resample only the majority class), not minority (resample all classes but the minority class), not majority (resample all classes but the majority class), all (resample all classes), auto (equivalent to ‘not minority’). When “list”, the list contains the classes targeted by the resampling.

    • random_state_method (int) - (5) [1~1000|1] Controls the randomization of the algorithm.

    • random_state_evaluate (int) - (5) [1~1000|1] Controls the shuffling applied to the Repeated Stratified K-Fold evaluation method.

    • remove_tmp (bool) - (True) [WF property] Remove temporary files.

    • restart (bool) - (False) [WF property] Do not execute if output files exist.

    • sandbox_path (str) - (“./”) [WF property] Parent path to the sandbox directory.
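The string choices accepted under "target" differ between sampling_strategy_over and sampling_strategy_under. As an illustration, the sketch below lists which classes each choice selects for a given set of labels, following the descriptions above. The helper targeted_classes is hypothetical; it mirrors the documented wording and does not call imbalanced-learn.

```python
from collections import Counter


def targeted_classes(labels, choice, kind):
    """Return the sorted class labels selected by a "target" choice.

    kind is "over" (sampling_strategy_over) or "under" (sampling_strategy_under).
    """
    counts = Counter(labels)
    majority = max(counts, key=counts.get)
    minority = min(counts, key=counts.get)
    if choice == "auto":
        # Documented equivalences: 'not majority' for over, 'not minority' for under.
        choice = "not majority" if kind == "over" else "not minority"
    if choice == "all":
        selected = set(counts)
    elif choice == "not minority":
        selected = set(counts) - {minority}
    elif choice == "not majority":
        selected = set(counts) - {majority}
    elif choice == "minority" and kind == "over":
        selected = {minority}
    elif choice == "majority" and kind == "under":
        selected = {majority}
    else:
        raise ValueError(f"unsupported choice {choice!r} for kind {kind!r}")
    return sorted(selected)


labels = [0] * 50 + [1] * 30 + [2] * 10
print(targeted_classes(labels, "auto", "over"))   # [1, 2]
print(targeted_classes(labels, "auto", "under"))  # [0, 1]
```

Here class 0 is the majority and class 2 the minority, so the default "auto" oversamples classes 1 and 2 and cleans classes 0 and 1.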

Examples

This is an example of how to use the building block from Python:

from biobb_ml.resampling.resampling import resampling
prop = {
    'method': 'smotenn',
    'type': 'regression',
    'target': {
        'column': 'target'
    },
    'evaluate': True,
    'n_bins': 10,
    'sampling_strategy_over': {
        'dict': { '4': 1000, '5': 1000, '6': 1000, '7': 1000 }
    },
    'sampling_strategy_under': {
        'list': [0,1]
    }
}
resampling(input_dataset_path='/path/to/myDataset.csv',
            output_dataset_path='/path/to/newDataset.csv',
            properties=prop)
Info:
check_data_params(out_log, err_log)

Checks all the input/output paths and parameters.

launch() → int

Execute the resampling.resampling.Resampling object.

resampling.resampling.main()

Command line execution of this building block. Please check the command line documentation.

resampling.resampling.resampling(input_dataset_path: str, output_dataset_path: str, properties: dict | None = None, **kwargs) → int

Create the Resampling class and execute its launch() method.