regression package

Submodules

regression.linear_regression module

Module containing the LinearRegression class and the command line interface.

class regression.linear_regression.LinearRegression(input_dataset_path, output_model_path, output_test_table_path=None, output_plot_path=None, properties=None, **kwargs)[source]

Bases: BiobbObject

biobb_ml LinearRegression
Wrapper of the scikit-learn LinearRegression method.
Trains and tests a given dataset and saves the model and scaler. Visit the LinearRegression documentation page on the official scikit-learn website for further information.

Parameters:

input_dataset_path (str) –
Path to the input dataset. File type: input. Sample file. Accepted formats: csv (edam:format_3752).
output_model_path (str) –
Path to the output model file. File type: output. Sample file. Accepted formats: pkl (edam:format_3653).
output_test_table_path (str) (Optional) –
Path to the test table file. File type: output. Sample file. Accepted formats: csv (edam:format_3752).
output_plot_path (str) (Optional) –
Path to the residual plot, which shows the error between actual and predicted values. File type: output. Sample file. Accepted formats: png (edam:format_3603).
properties (dict - Python dictionary object containing the tool parameters, not input/output files) –
- independent_vars (dict) - ({}) Independent variables to train on, taken from your dataset. You can specify a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] } or { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. If multiple formats are given, the first one is used.
- target (dict) - ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. If multiple formats are given, the first one is used.
- weight (dict) - ({}) Weight variable from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. If multiple formats are given, the first one is used.
- random_state_train_test (int) - (5) [1~1000|1] Controls the shuffling applied to the data before applying the split.
- test_size (float) - (0.2) [0~1|0.05] Represents the proportion of the dataset to include in the test split. It should be between 0.0 and 1.0.
- scale (bool) - (False) Whether or not to scale the input dataset.
- remove_tmp (bool) - (True) [WF property] Remove temporary files.
- restart (bool) - (False) [WF property] Do not execute if output files exist.
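
The three independent_vars formats described above can be illustrated with a small standard-library sketch. The helper `resolve_independent_vars` is hypothetical and not part of biobb_ml. It assumes that each `[start, end]` pair in "range" includes both ends, which the text above does not state, and it rejects dictionaries that mix formats rather than guessing which one takes precedence.

```python
from typing import Dict, List, Sequence


def resolve_independent_vars(header: Sequence[str], spec: Dict[str, list]) -> List[str]:
    """Return the column names selected by an independent_vars-style dict.

    Hypothetical helper for illustration only; not part of biobb_ml.
    """
    keys = [k for k in ("columns", "indexes", "range") if k in spec]
    if len(keys) != 1:
        # The documentation says the first format is used when several are
        # given; this sketch avoids guessing the order and rejects such input.
        raise ValueError("expected exactly one of 'columns', 'indexes', 'range'")
    key = keys[0]
    if key == "columns":
        missing = [c for c in spec["columns"] if c not in header]
        if missing:
            raise KeyError(f"unknown columns: {missing}")
        return list(spec["columns"])
    if key == "indexes":
        return [header[i] for i in spec["indexes"]]
    # key == "range". ASSUMPTION: each [start, end] pair includes both ends.
    selected: List[str] = []
    for start, end in spec["range"]:
        selected.extend(header[i] for i in range(start, end + 1))
    return selected


header = ["c0", "c1", "c2", "c3", "c4", "target"]
by_name = resolve_independent_vars(header, {"columns": ["c0", "c2"]})
by_index = resolve_independent_vars(header, {"indexes": [0, 2, 4]})
by_range = resolve_independent_vars(header, {"range": [[0, 1], [3, 4]]})
```

Passing exactly one format per dictionary, as this sketch enforces, sidesteps the precedence rule altogether.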

Examples

This is an example of how to use the building block from Python:

from biobb_ml.regression.linear_regression import linear_regression
prop = {
    'independent_vars': {
        'columns': [ 'column1', 'column2', 'column3' ]
    },
    'target': {
        'column': 'target'
    },
    'test_size': 0.2
}
linear_regression(input_dataset_path='/path/to/myDataset.csv',
                  output_model_path='/path/to/newModel.pkl',
                  output_test_table_path='/path/to/newTable.csv',
                  output_plot_path='/path/to/newPlot.png',
                  properties=prop)
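
The residual plot written to output_plot_path is described as showing the error between actual and predicted values. As a rough illustration of that quantity, the sketch below computes residuals with the common convention residual = actual - predicted. The function names and the sample numbers are made up for this example, and the sign convention is an assumption, since the documentation does not specify it.

```python
import math
from typing import List, Sequence


def residuals(actual: Sequence[float], predicted: Sequence[float]) -> List[float]:
    """Per-sample error. ASSUMPTION: residual = actual - predicted."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return [a - p for a, p in zip(actual, predicted)]


def rmse(values: Sequence[float]) -> float:
    """Root mean squared error computed from a list of residuals."""
    return math.sqrt(sum(v * v for v in values) / len(values))


# Made-up numbers, only to show the shape of the computation.
actual = [3.0, 5.0, 7.5]
predicted = [2.5, 5.0, 8.0]
res = residuals(actual, predicted)
error = rmse(res)
```

Residuals scattered evenly around zero usually suggest that a linear fit is adequate, while a visible trend suggests a missing term.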

Info:

wrapped_software:
- name: scikit-learn LinearRegression
- version: >=0.24.2
- license: BSD 3-Clause
ontology:
- name: EDAM
- schema: http://edamontology.org/EDAM.owl

check_data_params(out_log, err_log)[source]: Checks all the input/output paths and parameters.

launch() → int[source]: Execute the regression.linear_regression.LinearRegression object.

regression.linear_regression.linear_regression(input_dataset_path: str, output_model_path: str, output_test_table_path: str | None = None, output_plot_path: str | None = None, properties: dict | None = None, **kwargs) → int[source]: Create a LinearRegression instance and execute its launch() method.

regression.linear_regression.main()[source]: Command line execution of this building block. Please check the command line documentation.

regression.polynomial_regression module

Module containing the PolynomialRegression class and the command line interface.

class regression.polynomial_regression.PolynomialRegression(input_dataset_path, output_model_path, output_test_table_path=None, output_plot_path=None, properties=None, **kwargs)[source]

Bases: BiobbObject

biobb_ml PolynomialRegression
Wrapper of the scikit-learn LinearRegression method with PolynomialFeatures.
Trains and tests a given dataset and saves the model and scaler. Visit the LinearRegression documentation page on the official scikit-learn website for further information.

Parameters:

input_dataset_path (str) –
Path to the input dataset. File type: input. Sample file. Accepted formats: csv (edam:format_3752).
output_model_path (str) –
Path to the output model file. File type: output. Sample file. Accepted formats: pkl (edam:format_3653).
output_test_table_path (str) (Optional) –
Path to the test table file. File type: output. Sample file. Accepted formats: csv (edam:format_3752).
output_plot_path (str) (Optional) –
Path to the residual plot, which shows the error between actual and predicted values. File type: output. Sample file. Accepted formats: png (edam:format_3603).
properties (dict - Python dictionary object containing the tool parameters, not input/output files) –
- independent_vars (dict) - ({}) Independent variables to train on, taken from your dataset. You can specify a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] } or { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. If multiple formats are given, the first one is used.
- target (dict) - ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. If multiple formats are given, the first one is used.
- weight (dict) - ({}) Weight variable from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. If multiple formats are given, the first one is used.
- random_state_train_test (int) - (5) [1~1000|1] Controls the shuffling applied to the data before applying the split.
- degree (int) - (2) [1~100|1] Polynomial degree.
- test_size (float) - (0.2) Represents the proportion of the dataset to include in the test split. It should be between 0.0 and 1.0.
- scale (bool) - (False) Whether or not to scale the input dataset.
- remove_tmp (bool) - (True) [WF property] Remove temporary files.
- restart (bool) - (False) [WF property] Do not execute if output files exist.
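
To give a feel for what the degree property controls, the sketch below expands one input row into all monomials up to a given degree using only the standard library. It does not call scikit-learn. The function name `polynomial_features` is hypothetical, and whether the wrapper includes the constant term is an assumption made here for illustration.

```python
from itertools import combinations_with_replacement
from math import prod
from typing import List, Sequence


def polynomial_features(row: Sequence[float], degree: int) -> List[float]:
    """Expand a row into every monomial of total degree 0..degree.

    Illustrative only; ASSUMES a leading constant term (degree 0) is wanted.
    """
    if degree < 0:
        raise ValueError("degree must be non-negative")
    terms: List[float] = []
    for d in range(degree + 1):
        # Each combination picks d features (with repetition) to multiply.
        for combo in combinations_with_replacement(row, d):
            terms.append(prod(combo, start=1.0))
    return terms


# For features (a, b) and degree 2 the terms are 1, a, b, a*a, a*b, b*b.
expanded = polynomial_features([2.0, 3.0], degree=2)
```

With two features, degree 2 already produces six terms, and the count grows quickly with more features or a higher degree, so large values of degree can make the fit slow and prone to overfitting.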

Examples

This is an example of how to use the building block from Python:

from biobb_ml.regression.polynomial_regression import polynomial_regression
prop = {
    'independent_vars': {
        'columns': [ 'column1', 'column2', 'column3' ]
    },
    'target': {
        'column': 'target'
    },
    'degree': 2,
    'test_size': 0.2
}
polynomial_regression(input_dataset_path='/path/to/myDataset.csv',
                      output_model_path='/path/to/newModel.pkl',
                      output_test_table_path='/path/to/newTable.csv',
                      output_plot_path='/path/to/newPlot.png',
                      properties=prop)

Info:

wrapped_software:
- name: scikit-learn LinearRegression
- version: >=0.24.2
- license: BSD 3-Clause
ontology:
- name: EDAM
- schema: http://edamontology.org/EDAM.owl

check_data_params(out_log, err_log)[source]: Checks all the input/output paths and parameters.

launch() → int[source]: Execute the regression.polynomial_regression.PolynomialRegression object.

regression.polynomial_regression.main()[source]: Command line execution of this building block. Please check the command line documentation.

regression.polynomial_regression.polynomial_regression(input_dataset_path: str, output_model_path: str, output_test_table_path: str | None = None, output_plot_path: str | None = None, properties: dict | None = None, **kwargs) → int[source]: Create a PolynomialRegression instance and execute its launch() method.

regression.random_forest_regressor module

Module containing the RandomForestRegressor class and the command line interface.

class regression.random_forest_regressor.RandomForestRegressor(input_dataset_path, output_model_path, output_test_table_path=None, output_plot_path=None, properties=None, **kwargs)[source]

Bases: BiobbObject

biobb_ml RandomForestRegressor
Wrapper of the scikit-learn RandomForestRegressor method.
Trains and tests a given dataset and saves the model and scaler. Visit the RandomForestRegressor documentation page.

Parameters:

input_dataset_path (str) –
Path to the input dataset. File type: input. Sample file. Accepted formats: csv (edam:format_3752).
output_model_path (str) –
Path to the output model file. File type: output. Sample file. Accepted formats: pkl (edam:format_3653).
output_test_table_path (str) (Optional) –
Path to the test table file. File type: output. Sample file. Accepted formats: csv (edam:format_3752).
output_plot_path (str) (Optional) –
Path to the residual plot, which shows the error between actual and predicted values. File type: output. Sample file. Accepted formats: png (edam:format_3603).
properties (dict - Python dictionary object containing the tool parameters, not input/output files) –
- independent_vars (dict) - ({}) Independent variables to train on, taken from your dataset. You can specify a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] } or { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. If multiple formats are given, the first one is used.
- target (dict) - ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. If multiple formats are given, the first one is used.
- weight (dict) - ({}) Weight variable from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. If multiple formats are given, the first one is used.
- n_estimators (int) - (10) The number of trees in the forest.
- max_depth (int) - (None) The maximum depth of the tree.
- random_state_method (int) - (5) [1~1000|1] Controls the randomness of the estimator.
- random_state_train_test (int) - (5) [1~1000|1] Controls the shuffling applied to the data before applying the split.
- test_size (float) - (0.2) Represents the proportion of the dataset to include in the test split. It should be between 0.0 and 1.0.
- scale (bool) - (False) Whether or not to scale the input dataset.
- remove_tmp (bool) - (True) [WF property] Remove temporary files.
- restart (bool) - (False) [WF property] Do not execute if output files exist.
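
The interplay of test_size and random_state_train_test can be pictured with a seeded shuffle followed by a cut. The sketch below uses only the standard library and is not how scikit-learn performs the split. The function name `seeded_split` is hypothetical, and rounding the test count up is an assumption of this sketch.

```python
import math
import random
from typing import List, Sequence, Tuple


def seeded_split(rows: Sequence, test_size: float, seed: int) -> Tuple[List, List]:
    """Shuffle rows with a fixed seed, then cut off the test portion.

    Illustrative only; ASSUMES the test count is rounded up.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0.0 and 1.0")
    shuffled = list(rows)
    # A dedicated Random instance keeps the global generator untouched.
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(test_size * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))
train_a, test_a = seeded_split(rows, test_size=0.2, seed=5)
train_b, test_b = seeded_split(rows, test_size=0.2, seed=5)
```

Because the seed is fixed, both calls return the same partition, which is what makes a run reproducible. random_state_method plays the analogous role for the randomness inside the estimator itself.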

Examples

This is an example of how to use the building block from Python:

from biobb_ml.regression.random_forest_regressor import random_forest_regressor
prop = {
    'independent_vars': {
        'columns': [ 'column1', 'column2', 'column3' ]
    },
    'target': {
        'column': 'target'
    },
    'n_estimators': 10,
    'max_depth': 5,
    'test_size': 0.2
}
random_forest_regressor(input_dataset_path='/path/to/myDataset.csv',
                        output_model_path='/path/to/newModel.pkl',
                        output_test_table_path='/path/to/newTable.csv',
                        output_plot_path='/path/to/newPlot.png',
                        properties=prop)

Info:

wrapped_software:
- name: scikit-learn RandomForestRegressor
- version: >=0.24.2
- license: BSD 3-Clause
ontology:
- name: EDAM
- schema: http://edamontology.org/EDAM.owl

check_data_params(out_log, err_log)[source]: Checks all the input/output paths and parameters.

launch() → int[source]: Execute the regression.random_forest_regressor.RandomForestRegressor object.

regression.random_forest_regressor.main()[source]: Command line execution of this building block. Please check the command line documentation.

regression.random_forest_regressor.random_forest_regressor(input_dataset_path: str, output_model_path: str, output_test_table_path: str | None = None, output_plot_path: str | None = None, properties: dict | None = None, **kwargs) → int[source]: Create a RandomForestRegressor instance and execute its launch() method.

regression.regression_predict module

Module containing the RegressionPredict class and the command line interface.

class regression.regression_predict.RegressionPredict(input_model_path, output_results_path, input_dataset_path=None, properties=None, **kwargs)[source]

Bases: BiobbObject

biobb_ml RegressionPredict
Makes predictions from an input dataset and a given regression model.
The input dataset can be provided either as a file or as a dictionary property, and the regression model must have been trained with the LinearRegression or RandomForestRegressor methods.

Parameters:

input_model_path (str) –
Path to the input model. File type: input. Sample file. Accepted formats: pkl (edam:format_3653).
input_dataset_path (str) (Optional) –
Path to the dataset to predict. File type: input. Sample file. Accepted formats: csv (edam:format_3752).
output_results_path (str) –
Path to the output results file. File type: output. Sample file. Accepted formats: csv (edam:format_3752).
properties (dict - Python dictionary object containing the tool parameters, not input/output files) –
- predictions (list) - (None) List of dictionaries with the values for which you want to predict targets. It is taken into account only if input_dataset_path is not provided. Format: [{ 'var1': 1.0, 'var2': 2.0 }, { 'var1': 4.0, 'var2': 2.7 }] for datasets with headers and [[ 1.0, 2.0 ], [ 4.0, 2.7 ]] for datasets without headers.
- remove_tmp (bool) - (True) [WF property] Remove temporary files.
- restart (bool) - (False) [WF property] Do not execute if output files exist.
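
The two predictions formats described above (a list of dictionaries for data with headers, a list of lists for data without) can be built from CSV text with the standard library. The helper `predictions_from_csv` is hypothetical and not part of biobb_ml, and it assumes every value is numeric.

```python
import csv
import io
from typing import List, Union


def predictions_from_csv(text: str, has_header: bool = True) -> List[Union[dict, list]]:
    """Convert CSV text into a predictions-style list.

    Hypothetical helper for illustration; ASSUMES all cells parse as floats.
    """
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if has_header:
        header, body = rows[0], rows[1:]
        # One dictionary per data row, keyed by column name.
        return [{name: float(v) for name, v in zip(header, row)} for row in body]
    # Without a header, keep plain positional lists.
    return [[float(v) for v in row] for row in rows]


with_header = predictions_from_csv("var1,var2\n1.0,2.0\n4.0,2.7\n")
without_header = predictions_from_csv("1.0,2.0\n4.0,2.7\n", has_header=False)

# Either result can be placed under the 'predictions' key of the properties.
prop = {"predictions": with_header}
```

The dictionary form ties each value to a column name, so it stays readable when the model was trained on named columns.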

Examples

This is an example of how to use the building block from Python:

from biobb_ml.regression.regression_predict import regression_predict
prop = {
    'predictions': [
        {
            'var1': 1.0,
            'var2': 2.0
        },
        {
            'var1': 4.0,
            'var2': 2.7
        }
    ]
}
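# Note: 'predictions' is only taken into account when input_dataset_path
# is not provided, so the dataset file is the one used in this call.
regression_predict(input_model_path='/path/to/myModel.pkl',
                   output_results_path='/path/to/newPredictedResults.csv',
                   input_dataset_path='/path/to/myDataset.csv',
                   properties=prop)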

Info:

wrapped_software:
- name: scikit-learn
- version: >=0.24.2
- license: BSD 3-Clause
ontology:
- name: EDAM
- schema: http://edamontology.org/EDAM.owl

check_data_params(out_log, err_log)[source]: Checks all the input/output paths and parameters.

launch() → int[source]: Execute the regression.regression_predict.RegressionPredict object.

regression.regression_predict.main()[source]: Command line execution of this building block. Please check the command line documentation.

regression.regression_predict.regression_predict(input_model_path: str, output_results_path: str, input_dataset_path: str | None = None, properties: dict | None = None, **kwargs) → int[source]: Create a RegressionPredict instance and execute its launch() method.