dimensionality_reduction package

Submodules

dimensionality_reduction.pls_components module

Module containing the PLSComponents class and the command line interface.

class dimensionality_reduction.pls_components.PLSComponents(input_dataset_path, output_results_path, output_plot_path=None, properties=None, **kwargs)

Bases: BiobbObject

biobb_ml PLSComponents
Wrapper of the scikit-learn PLSRegression method.
Calculates the best number of components for a Partial Least Squares (PLS) regression. Visit the PLSRegression documentation page on the official scikit-learn website for further information.
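The idea of picking a best number of components can be sketched as follows. This is a minimal illustration, not the wrapper's actual code: it assumes that "best" means the component count with the lowest cross-validation MSE, and the helper name best_n_components and the MSE values are made up for the example.

```python
from typing import Sequence


def best_n_components(cv_mse: Sequence[float]) -> int:
    """Hypothetical helper: 1-based component count with the lowest CV MSE."""
    if not cv_mse:
        raise ValueError("cv_mse must contain at least one value")
    # min() keeps the first index on ties, which favours the smaller model.
    best_index = min(range(len(cv_mse)), key=lambda i: cv_mse[i])
    return best_index + 1


# Made-up cross-validation MSE values for 1..5 components (illustration only).
example_mse = [4.2, 2.9, 1.7, 1.9, 2.4]
print(best_n_components(example_mse))  # 3
```

Preferring the smaller model on ties is a design choice of this sketch; the building block may resolve ties differently.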

Parameters:

input_dataset_path (str) –
Path to the input dataset. File type: input. Sample file. Accepted formats: csv (edam:format_3752).
output_results_path (str) –
Table with R2 and MSE for calibration and cross-validation data for the best number of components. File type: output. Sample file. Accepted formats: csv (edam:format_3752).
output_plot_path (str) (Optional) –
Path to the Mean Square Error plot. File type: output. Sample file. Accepted formats: png (edam:format_3603).
properties (dict - Python dictionary object containing the tool parameters, not input/output files) –
- features (dict) - ({}) Features or columns from your dataset you want to use for fitting. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { “columns”: [“column1”, “column2”] } or { “indexes”: [0, 2, 3, 10, 11, 17] } or { “range”: [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- target (dict) - ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { “column”: “column3” } or { “index”: 21 }. In case of multiple formats, the first one will be picked.
- optimise (bool) - (False) Whether or not to optimise the process of MSE calculation. Beware: if True, the process can take a long time depending on the max_components value.
- max_components (int) - (10) [1~1000|1] Maximum number of components to use by default for PLS queries.
- cv (int) - (10) [1~10000|1] Specify the number of folds in the cross-validation splitting strategy. Value must be between 2 and the number of samples in the dataset.
- scale (bool) - (False) Whether or not to scale the input dataset.
- remove_tmp (bool) - (True) [WF property] Remove temporary files.
- restart (bool) - (False) [WF property] Do not execute if output files exist.
- sandbox_path (str) - (“./”) [WF property] Parent path to the sandbox directory.
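The three features formats described above can be illustrated with a small sketch that maps each one to zero-based column indexes. The helper resolve_features is hypothetical and not part of biobb_ml. It assumes that the priority among formats is columns, then indexes, then range, that range bounds are inclusive, and that an empty dictionary selects every column; none of these details is confirmed by this page.

```python
from typing import Dict, List, Sequence


def resolve_features(features: Dict, header: Sequence[str]) -> List[int]:
    """Hypothetical helper: map a `features` dict to zero-based column indexes."""
    # Assumed priority when several keys are present: columns, indexes, range.
    if "columns" in features:
        return [header.index(name) for name in features["columns"]]
    if "indexes" in features:
        return list(features["indexes"])
    if "range" in features:
        selected: List[int] = []
        for start, stop in features["range"]:
            # Assumption: both bounds of each range are inclusive.
            selected.extend(range(start, stop + 1))
        return selected
    # Assumption: an empty dict means "use every column".
    return list(range(len(header)))


header = ["a", "b", "c", "d", "e"]
print(resolve_features({"columns": ["b", "d"]}, header))      # [1, 3]
print(resolve_features({"indexes": [0, 4]}, header))          # [0, 4]
print(resolve_features({"range": [[0, 1], [3, 4]]}, header))  # [0, 1, 3, 4]
```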

Examples

The following example shows how to use the building block from Python:

from biobb_ml.dimensionality_reduction.pls_components import pls_components
prop = {
    'features': {
        'columns': [ 'column1', 'column2', 'column3' ]
    },
    'target': {
        'column': 'target'
    },
    'max_components': 10,
    'cv': 10
}
pls_components(input_dataset_path='/path/to/myDataset.csv',
                output_results_path='/path/to/newTable.csv',
                output_plot_path='/path/to/newPlot.png',
                properties=prop)
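The same call also accepts the other documented formats for features and target. The dictionaries below are a sketch of such alternatives; the index values are placeholders taken from the format descriptions above, not from a real dataset, and the names prop_by_index and prop_by_range are illustrative.

```python
# Select feature columns by position and the target by column index.
prop_by_index = {
    'features': {'indexes': [0, 2, 3, 10, 11, 17]},
    'target': {'index': 21},
    'max_components': 10,
    'cv': 10,
}

# Select feature columns with index ranges and scale the input dataset.
prop_by_range = {
    'features': {'range': [[0, 20]]},
    'target': {'index': 21},
    'scale': True,
}
```

Either dictionary can be passed as the properties argument of pls_components in place of prop.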

Info:

wrapped_software:
- name: scikit-learn PLSRegression
- version: >=0.24.2
- license: BSD 3-Clause
ontology:
- name: EDAM
- schema: http://edamontology.org/EDAM.owl

check_data_params(out_log, err_log): Checks all the input/output paths and parameters.

launch() → int: Execute the dimensionality_reduction.pls_components.PLSComponents object.

warn(**kwargs)

dimensionality_reduction.pls_components.main(): Command line execution of this building block. Please check the command line documentation.

dimensionality_reduction.pls_components.pls_components(input_dataset_path: str, output_results_path: str, output_plot_path: str | None = None, properties: dict | None = None, **kwargs) → int: Create a PLSComponents object and execute its launch() method.

dimensionality_reduction.pls_regression module

Module containing the PLS_Regression class and the command line interface.

class dimensionality_reduction.pls_regression.PLS_Regression(input_dataset_path, output_results_path, output_plot_path=None, properties=None, **kwargs)

Bases: BiobbObject

biobb_ml PLS_Regression
Wrapper of the scikit-learn PLSRegression method.
Gives results for a Partial Least Squares (PLS) regression. Visit the PLSRegression documentation page on the official scikit-learn website for further information.

Parameters:

input_dataset_path (str) –
Path to the input dataset. File type: input. Sample file. Accepted formats: csv (edam:format_3752).
output_results_path (str) –
Table with R2 and MSE for calibration and cross-validation data. File type: output. Sample file. Accepted formats: csv (edam:format_3752).
output_plot_path (str) (Optional) –
Path to the R2 cross-validation plot. File type: output. Sample file. Accepted formats: png (edam:format_3603).
properties (dict - Python dictionary object containing the tool parameters, not input/output files) –
- features (dict) - ({}) Features or columns from your dataset you want to use for fitting. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { “columns”: [“column1”, “column2”] } or { “indexes”: [0, 2, 3, 10, 11, 17] } or { “range”: [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- target (dict) - ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { “column”: “column3” } or { “index”: 21 }. In case of multiple formats, the first one will be picked.
- n_components (int) - (5) [1~1000|1] Maximum number of components to use by default for PLS queries.
- cv (int) - (10) [1~10000|1] Specify the number of folds in the cross-validation splitting strategy. Value must be between 2 and the number of samples in the dataset.
- scale (bool) - (False) Whether or not to scale the input dataset.
- remove_tmp (bool) - (True) [WF property] Remove temporary files.
- restart (bool) - (False) [WF property] Do not execute if output files exist.
- sandbox_path (str) - (“./”) [WF property] Parent path to the sandbox directory.
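The output table reports R2 and MSE for calibration and cross-validation data. As a reminder of what these two metrics measure, here is a minimal sketch using their standard definitions. The helper mse_and_r2 is hypothetical; the building block presumably relies on scikit-learn for these values, which this page does not confirm.

```python
from typing import Sequence, Tuple


def mse_and_r2(y_true: Sequence[float], y_pred: Sequence[float]) -> Tuple[float, float]:
    """Hypothetical helper: mean squared error and coefficient of determination."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when y_true is constant")
    return ss_res / n, 1.0 - ss_res / ss_tot


# Predicting the mean for every sample gives an R2 of zero.
mse, r2 = mse_and_r2([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
print(mse, r2)
```

For the calibration row the predictions would come from a model fitted on all samples; for the cross-validation row they would come from models that did not see the predicted sample during fitting.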

Examples

The following example shows how to use the building block from Python:

from biobb_ml.dimensionality_reduction.pls_regression import pls_regression
prop = {
    'features': {
        'columns': [ 'column1', 'column2', 'column3' ]
    },
    'target': {
        'column': 'target'
    },
    'n_components': 12,
    'cv': 10
}
pls_regression(input_dataset_path='/path/to/myDataset.csv',
                output_results_path='/path/to/newTable.csv',
                output_plot_path='/path/to/newPlot.png',
                properties=prop)

Info:

wrapped_software:
- name: scikit-learn PLSRegression
- version: >=0.24.2
- license: BSD 3-Clause
ontology:
- name: EDAM
- schema: http://edamontology.org/EDAM.owl

check_data_params(out_log, err_log): Checks all the input/output paths and parameters.

launch() → int: Execute the dimensionality_reduction.pls_regression.PLS_Regression object.

warn(**kwargs)

dimensionality_reduction.pls_regression.main(): Command line execution of this building block. Please check the command line documentation.

dimensionality_reduction.pls_regression.pls_regression(input_dataset_path: str, output_results_path: str, output_plot_path: str | None = None, properties: dict | None = None, **kwargs) → int: Create a PLS_Regression object and execute its launch() method.

dimensionality_reduction.principal_component module

Module containing the PrincipalComponentAnalysis class and the command line interface.

class dimensionality_reduction.principal_component.PrincipalComponentAnalysis(input_dataset_path, output_results_path, output_plot_path=None, properties=None, **kwargs)

Bases: BiobbObject

biobb_ml PrincipalComponentAnalysis
Wrapper of the scikit-learn PCA method.
Analyses a given dataset through Principal Component Analysis (PCA). Visit the PCA documentation page on the official scikit-learn website for further information.

Parameters:

input_dataset_path (str) –
Path to the input dataset. File type: input. Sample file. Accepted formats: csv (edam:format_3752).
output_results_path (str) –
Path to the analysed dataset. File type: output. Sample file. Accepted formats: csv (edam:format_3752).
output_plot_path (str) (Optional) –
Path to the Principal Component plot, only if number of components is 2 or 3. File type: output. Sample file. Accepted formats: png (edam:format_3603).
properties (dict - Python dictionary object containing the tool parameters, not input/output files) –
- features (dict) - ({}) Features or columns from your dataset you want to use for fitting. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { “columns”: [“column1”, “column2”] } or { “indexes”: [0, 2, 3, 10, 11, 17] } or { “range”: [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- target (dict) - ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { “column”: “column3” } or { “index”: 21 }. In case of multiple formats, the first one will be picked.
- n_components (dict) - ({}) Dictionary containing either the number of components to keep (int) or the fraction of the variance to retain (float between 0 and 1), in which case the minimum number of principal components that retains that variance is kept. If not set ({}), all components are kept. Formats: { “value”: 2 } for integer values or { “value”: 0.3 } for float values.
- random_state_method (int) - (5) [1~1000|1] Controls the randomness of the estimator.
- scale (bool) - (False) Whether or not to scale the input dataset.
- remove_tmp (bool) - (True) [WF property] Remove temporary files.
- restart (bool) - (False) [WF property] Do not execute if output files exist.
- sandbox_path (str) - (“./”) [WF property] Parent path to the sandbox directory.
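The two meanings of the n_components value (a component count or a variance fraction) can be sketched as follows. The helper components_to_keep is hypothetical and only mirrors the description above; it assumes the per-component explained variance ratios are already known and sorted in decreasing order, and that a float selects the smallest count whose cumulative ratio exceeds it.

```python
from typing import Optional, Sequence, Union


def components_to_keep(ratios: Sequence[float],
                       value: Optional[Union[int, float]]) -> int:
    """Hypothetical helper: interpret the `value` of an n_components dict."""
    if value is None:
        # Property not set ({}): keep all components.
        return len(ratios)
    if isinstance(value, int):
        # Integer: an explicit number of components, capped at what exists.
        return min(value, len(ratios))
    # Float: smallest count whose cumulative explained variance exceeds value.
    cumulative = 0.0
    for count, ratio in enumerate(ratios, start=1):
        cumulative += ratio
        if cumulative > value:
            return count
    return len(ratios)


# Made-up explained variance ratios for four components (illustration only).
ratios = [0.6, 0.25, 0.1, 0.05]
print(components_to_keep(ratios, 2))     # 2
print(components_to_keep(ratios, 0.3))   # 1
print(components_to_keep(ratios, None))  # 4
```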

Examples

The following example shows how to use the building block from Python:

from biobb_ml.dimensionality_reduction.principal_component import principal_component
prop = {
    'features': {
        'columns': [ 'column1', 'column2', 'column3' ]
    },
    'target': {
        'column': 'target'
    },
    'n_components': {
        'value': 2
    }
}
principal_component(input_dataset_path='/path/to/myDataset.csv',
                    output_results_path='/path/to/newTable.csv',
                    output_plot_path='/path/to/newPlot.png',
                    properties=prop)

Info:

wrapped_software:
- name: scikit-learn PCA
- version: >=0.24.2
- license: BSD 3-Clause
ontology:
- name: EDAM
- schema: http://edamontology.org/EDAM.owl

check_data_params(out_log, err_log): Checks all the input/output paths and parameters.

launch() → int: Execute the dimensionality_reduction.principal_component.PrincipalComponentAnalysis object.

dimensionality_reduction.principal_component.main(): Command line execution of this building block. Please check the command line documentation.

dimensionality_reduction.principal_component.principal_component(input_dataset_path: str, output_results_path: str, output_plot_path: str | None = None, properties: dict | None = None, **kwargs) → int: Create a PrincipalComponentAnalysis object and execute its launch() method.