clustering package

Submodules

clustering.agglomerative_coefficient module

Module containing the AgglomerativeCoefficient class and the command line interface.

class clustering.agglomerative_coefficient.AgglomerativeCoefficient(input_dataset_path, output_results_path, output_plot_path=None, properties=None, **kwargs)[source]

Bases: BiobbObject

biobb_ml AgglomerativeCoefficient
Wrapper of the scikit-learn AgglomerativeClustering method.
Clusters a given dataset and calculates the best K coefficient. Visit the AgglomerativeClustering documentation page on the sklearn official website for further information.
Parameters:
  • input_dataset_path (str) – Path to the input dataset. File type: input. Sample file. Accepted formats: csv (edam:format_3752).

  • output_results_path (str) – Path to the gap values list. File type: output. Sample file. Accepted formats: csv (edam:format_3752).

  • output_plot_path (str) (Optional) – Path to the elbow method and gap statistics plot. File type: output. Sample file. Accepted formats: png (edam:format_3603).

  • properties (dict - Python dictionary object containing the tool parameters, not input/output files) –

    • predictors (dict) - ({}) Features or columns from your dataset you want to use for fitting. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { “columns”: [“column1”, “column2”] } or { “indexes”: [0, 2, 3, 10, 11, 17] } or { “range”: [[0, 20], [50, 102]] }. If multiple formats are given, the first one is picked (see the sketch after this parameter list).

    • max_clusters (int) - (6) [1~100|1] Maximum number of clusters to evaluate when searching for the best K.

    • affinity (str) - (“euclidean”) Metric used to compute the linkage. If linkage is “ward”, only “euclidean” is accepted. Values: euclidean (Compute the Euclidean distance between two 1-D arrays), l1, l2, manhattan (Compute the Manhattan distance), cosine (Compute the Cosine distance between 1-D arrays), precomputed (the flattened upper triangle of the distance matrix of the original data is used).

    • linkage (str) - (“ward”) The linkage criterion determines which distance to use between sets of observations. The algorithm merges the pair of clusters that minimizes this criterion. Values: ward (minimizes the variance of the clusters being merged), complete (uses the maximum of the distances between all observations of the two sets), average (uses the average of the distances of each observation of the two sets), single (uses the minimum of the distances between all observations of the two sets).

    • scale (bool) - (False) Whether or not to scale the input dataset.

    • remove_tmp (bool) - (True) [WF property] Remove temporary files.

    • restart (bool) - (False) [WF property] Do not execute if output files exist.
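
The following is a minimal sketch of how such a predictors specification might resolve to dataset columns. It is illustrative only: the resolve_predictors helper is hypothetical and the actual resolution logic is internal to biobb_ml (in particular, whether range ends are inclusive should be checked against the biobb_ml sources).

import pandas as pd

def resolve_predictors(df: pd.DataFrame, spec: dict) -> pd.DataFrame:
    """Hypothetical helper mirroring the documented predictors formats."""
    if 'columns' in spec:                 # { 'columns': ['column1', ...] }
        return df[spec['columns']]
    if 'indexes' in spec:                 # { 'indexes': [0, 2, 3] }
        return df.iloc[:, spec['indexes']]
    if 'range' in spec:                   # { 'range': [[0, 20], [50, 102]] }
        cols = [i for start, end in spec['range'] for i in range(start, end + 1)]
        return df.iloc[:, cols]
    return df                             # empty spec: use every column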

Examples

This is an example of how to use the building block from Python:

from biobb_ml.clustering.agglomerative_coefficient import agglomerative_coefficient
prop = {
    'predictors': {
        'columns': [ 'column1', 'column2', 'column3' ]
    },
    'max_clusters': 6,
    'affinity': 'euclidean',
    'linkage': 'ward'
}
agglomerative_coefficient(input_dataset_path='/path/to/myDataset.csv',
                        output_results_path='/path/to/newTable.csv',
                        output_plot_path='/path/to/newPlot.png',
                        properties=prop)
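
The output table can then be inspected to choose the best K. A minimal sketch, assuming pandas; the commented line uses hypothetical column names, so check the actual header of the generated CSV first:

import pandas as pd

results = pd.read_csv('/path/to/newTable.csv')
print(results.head())  # inspect the real column names first
# e.g. pick the K with the largest gap value (hypothetical column names):
# best_k = results.loc[results['gap'].idxmax(), 'clusters']
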
Info:
  • wrapped_software:
    • name: scikit-learn AgglomerativeClustering

    • version: >=0.24.2

    • license: BSD 3-Clause

check_data_params(out_log, err_log)[source]

Checks all the input/output paths and parameters.

launch() → int[source]

Execute the clustering.agglomerative_coefficient.AgglomerativeCoefficient object.

clustering.agglomerative_coefficient.agglomerative_coefficient(input_dataset_path: str, output_results_path: str, output_plot_path: str | None = None, properties: dict | None = None, **kwargs) → int[source]

Instantiate the AgglomerativeCoefficient class and execute the launch() method.

clustering.agglomerative_coefficient.main()[source]

Command line execution of this building block. Please check the command line documentation.

clustering.agglomerative_clustering module

Module containing the AgglClustering class and the command line interface.

class clustering.agglomerative_clustering.AgglClustering(input_dataset_path, output_results_path, output_plot_path=None, properties=None, **kwargs)[source]

Bases: BiobbObject

biobb_ml AgglClustering
Wrapper of the scikit-learn AgglomerativeClustering method.
Clusters a given dataset. Visit the AgglomerativeClustering documentation page on the sklearn official website for further information.
Parameters:
  • input_dataset_path (str) – Path to the input dataset. File type: input. Sample file. Accepted formats: csv (edam:format_3752).

  • output_results_path (str) – Path to the clustered dataset. File type: output. Sample file. Accepted formats: csv (edam:format_3752).

  • output_plot_path (str) (Optional) – Path to the clustering plot. File type: output. Sample file. Accepted formats: png (edam:format_3603).

  • properties (dict - Python dictionary object containing the tool parameters, not input/output files) –

    • predictors (dict) - ({}) Features or columns from your dataset you want to use for fitting. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { “columns”: [“column1”, “column2”] } or { “indexes”: [0, 2, 3, 10, 11, 17] } or { “range”: [[0, 20], [50, 102]] }. If multiple formats are given, the first one is picked.

    • clusters (int) - (3) [1~100|1] The number of clusters to form as well as the number of centroids to generate.

    • affinity (str) - (“euclidean”) Metric used to compute the linkage. If linkage is “ward”, only “euclidean” is accepted. Values: euclidean (Compute the Euclidean distance between two 1-D arrays), l1, l2, manhattan (Compute the Manhattan distance), cosine (Compute the Cosine distance between 1-D arrays), precomputed (the flattened upper triangle of the distance matrix of the original data is used).

    • linkage (str) - (“ward”) The linkage criterion determines which distance to use between sets of observations. The algorithm merges the pair of clusters that minimizes this criterion. Values: ward (minimizes the variance of the clusters being merged), complete (uses the maximum of the distances between all observations of the two sets), average (uses the average of the distances of each observation of the two sets), single (uses the minimum of the distances between all observations of the two sets).

    • plots (list) - (None) List of dictionaries with all plots you want to generate. Only 2D or 3D plots accepted. Format: [ { ‘title’: ‘Plot 1’, ‘features’: [‘feat1’, ‘feat2’] } ].

    • scale (bool) - (False) Whether or not to scale the input dataset.

    • remove_tmp (bool) - (True) [WF property] Remove temporary files.

    • restart (bool) - (False) [WF property] Do not execute if output files exist.

Examples

This is an example of how to use the building block from Python:

from biobb_ml.clustering.agglomerative_clustering import agglomerative_clustering
prop = {
    'predictors': {
        'columns': [ 'column1', 'column2', 'column3' ]
    },
    'clusters': 3,
    'affinity': 'euclidean',
    'linkage': 'ward',
    'plots': [
        {
            'title': 'Plot 1',
            'features': ['feat1', 'feat2']
        }
    ]
}
agglomerative_clustering(input_dataset_path='/path/to/myDataset.csv',
                        output_results_path='/path/to/newTable.csv',
                        output_plot_path='/path/to/newPlot.png',
                        properties=prop)
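
Under the hood this corresponds roughly to the scikit-learn call sketched below (an approximation, not the exact biobb_ml implementation; column names are placeholders). Note that in scikit-learn >= 1.2 the affinity argument of AgglomerativeClustering was renamed to metric:

import pandas as pd
from sklearn.cluster import AgglomerativeClustering

df = pd.read_csv('/path/to/myDataset.csv')
X = df[['column1', 'column2', 'column3']]
model = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward')
df['cluster'] = model.fit_predict(X)  # one label per row
df.to_csv('/path/to/newTable.csv', index=False)
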
Info:
  • wrapped_software:
    • name: scikit-learn AgglomerativeClustering

    • version: >=0.24.2

    • license: BSD 3-Clause

check_data_params(out_log, err_log)[source]

Checks all the input/output paths and parameters.

launch() → int[source]

Execute the clustering.agglomerative_clustering.AgglClustering object.

clustering.agglomerative_clustering.agglomerative_clustering(input_dataset_path: str, output_results_path: str, output_plot_path: str | None = None, properties: dict | None = None, **kwargs) → int[source]

Instantiate the AgglClustering class and execute the launch() method.

clustering.agglomerative_clustering.main()[source]

Command line execution of this building block. Please check the command line documentation.

clustering.dbscan module

Module containing the DBSCANClustering class and the command line interface.

class clustering.dbscan.DBSCANClustering(input_dataset_path, output_results_path, output_plot_path=None, properties=None, **kwargs)[source]

Bases: BiobbObject

biobb_ml DBSCANClustering
Wrapper of the scikit-learn DBSCAN method.
Clusters a given dataset. Visit the DBSCAN documentation page on the sklearn official website for further information.
Parameters:
  • input_dataset_path (str) – Path to the input dataset. File type: input. Sample file. Accepted formats: csv (edam:format_3752).

  • output_results_path (str) – Path to the clustered dataset. File type: output. Sample file. Accepted formats: csv (edam:format_3752).

  • output_plot_path (str) (Optional) – Path to the clustering plot. File type: output. Sample file. Accepted formats: png (edam:format_3603).

  • properties (dict - Python dictionary object containing the tool parameters, not input/output files) –

    • predictors (dict) - ({}) Features or columns from your dataset you want to use for fitting. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { “columns”: [“column1”, “column2”] } or { “indexes”: [0, 2, 3, 10, 11, 17] } or { “range”: [[0, 20], [50, 102]] }. If multiple formats are given, the first one is picked.

    • eps (float) - (0.5) [0~10|0.1] The maximum distance between two samples for one to be considered as in the neighborhood of the other.

    • min_samples (int) - (5) [1~100|1] The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.

    • metric (str) - (“euclidean”) The metric to use when calculating distance between instances in a feature array. Values: cityblock (Compute the City Block -Manhattan- distance), cosine (Compute the Cosine distance between 1-D arrays), euclidean (Computes the Euclidean distance between two 1-D arrays), l1, l2, manhattan (Compute the Manhattan distance), braycurtis (Compute the Bray-Curtis distance between two 1-D arrays), canberra (Compute the Canberra distance between two 1-D arrays), chebyshev (Compute the Chebyshev distance), correlation (Compute the correlation distance between two 1-D arrays), dice (Compute the Dice dissimilarity between two boolean 1-D arrays), hamming (Compute the Hamming distance between two 1-D arrays), jaccard (Compute the Jaccard-Needham dissimilarity between two boolean 1-D arrays), kulsinski (Compute the Kulsinski dissimilarity between two boolean 1-D arrays), mahalanobis (Compute the Mahalanobis distance between two 1-D arrays), minkowski (Compute the Minkowski distance between two 1-D arrays), rogerstanimoto (Compute the Rogers-Tanimoto dissimilarity between two boolean 1-D arrays), russellrao (Compute the Russell-Rao dissimilarity between two boolean 1-D arrays), seuclidean (Return the standardized Euclidean distance between two 1-D arrays), sokalmichener (Compute the Sokal-Michener dissimilarity between two boolean 1-D arrays), sokalsneath (Compute the Sokal-Sneath dissimilarity between two boolean 1-D arrays), sqeuclidean (Compute the squared Euclidean distance between two 1-D arrays), yule (Compute the Yule dissimilarity between two boolean 1-D arrays).

    • plots (list) - (None) List of dictionaries with all plots you want to generate. Only 2D or 3D plots accepted. Format: [ { ‘title’: ‘Plot 1’, ‘features’: [‘feat1’, ‘feat2’] } ].

    • scale (bool) - (False) Whether or not to scale the input dataset.

    • remove_tmp (bool) - (True) [WF property] Remove temporary files.

    • restart (bool) - (False) [WF property] Do not execute if output files exist.

Examples

This is an example of how to use the building block from Python:

from biobb_ml.clustering.dbscan import dbscan
prop = {
    'predictors': {
        'columns': [ 'column1', 'column2', 'column3' ]
    },
    'eps': 1.4,
    'min_samples': 3,
    'metric': 'euclidean',
    'plots': [
        {
            'title': 'Plot 1',
            'features': ['feat1', 'feat2']
        }
    ]
}
dbscan(input_dataset_path='/path/to/myDataset.csv',
        output_results_path='/path/to/newTable.csv',
        output_plot_path='/path/to/newPlot.png',
        properties=prop)
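
For reference, the equivalent direct scikit-learn call is sketched below (an approximation; column names are placeholders and any scaling applied by the block is omitted). DBSCAN labels noise points, i.e. points assigned to no cluster, with -1:

import pandas as pd
from sklearn.cluster import DBSCAN

df = pd.read_csv('/path/to/myDataset.csv')
X = df[['column1', 'column2', 'column3']]
df['cluster'] = DBSCAN(eps=1.4, min_samples=3, metric='euclidean').fit_predict(X)
n_noise = (df['cluster'] == -1).sum()  # points not assigned to any cluster
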
check_data_params(out_log, err_log)[source]

Checks all the input/output paths and parameters.

launch() → int[source]

Execute the clustering.dbscan.DBSCANClustering object.

clustering.dbscan.dbscan(input_dataset_path: str, output_results_path: str, output_plot_path: str | None = None, properties: dict | None = None, **kwargs) → int[source]

Instantiate the DBSCANClustering class and execute the launch() method.

clustering.dbscan.main()[source]

Command line execution of this building block. Please check the command line documentation.

clustering.k_means_coefficient module

Module containing the KMeansCoefficient class and the command line interface.

class clustering.k_means_coefficient.KMeansCoefficient(input_dataset_path, output_results_path, output_plot_path=None, properties=None, **kwargs)[source]

Bases: BiobbObject

biobb_ml KMeansCoefficient
Wrapper of the scikit-learn KMeans method.
Clusters a given dataset and calculates the best K coefficient. Visit the KMeans documentation page on the sklearn official website for further information.
Parameters:
  • input_dataset_path (str) – Path to the input dataset. File type: input. Sample file. Accepted formats: csv (edam:format_3752).

  • output_results_path (str) – Table with WCSS (elbow method), Gap and Silhouette coefficients for each cluster. File type: output. Sample file. Accepted formats: csv (edam:format_3752).

  • output_plot_path (str) (Optional) – Path to the elbow method and gap statistics plot. File type: output. Sample file. Accepted formats: png (edam:format_3603).

  • properties (dict - Python dictionary object containing the tool parameters, not input/output files) –

    • predictors (dict) - ({}) Features or columns from your dataset you want to use for fitting. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { “columns”: [“column1”, “column2”] } or { “indexes”: [0, 2, 3, 10, 11, 17] } or { “range”: [[0, 20], [50, 102]] }. If multiple formats are given, the first one is picked.

    • max_clusters (int) - (6) [1~100|1] Maximum number of clusters to evaluate in the K-means scan.

    • random_state_method (int) - (5) [1~1000|1] Determines random number generation for centroid initialization.

    • scale (bool) - (False) Whether or not to scale the input dataset.

    • remove_tmp (bool) - (True) [WF property] Remove temporary files.

    • restart (bool) - (False) [WF property] Do not execute if output files exist.

Examples

This is an example of how to use the building block from Python:

from biobb_ml.clustering.k_means_coefficient import k_means_coefficient
prop = {
    'predictors': {
        'columns': [ 'column1', 'column2', 'column3' ]
    },
    'max_clusters': 3
}
k_means_coefficient(input_dataset_path='/path/to/myDataset.csv',
                    output_results_path='/path/to/newTable.csv',
                    output_plot_path='/path/to/newPlot.png',
                    properties=prop)
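
The WCSS column of the output table corresponds to the elbow-method scan sketched below (an approximation; the Gap and Silhouette computations are omitted and the column names are placeholders):

import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv('/path/to/myDataset.csv')
X = df[['column1', 'column2', 'column3']]
wcss = [KMeans(n_clusters=k, random_state=5).fit(X).inertia_
        for k in range(1, 3 + 1)]  # k = 1 .. max_clusters
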
check_data_params(out_log, err_log)[source]

Checks all the input/output paths and parameters.

launch() → int[source]

Execute the clustering.k_means_coefficient.KMeansCoefficient object.

clustering.k_means_coefficient.k_means_coefficient(input_dataset_path: str, output_results_path: str, output_plot_path: str | None = None, properties: dict | None = None, **kwargs) → int[source]

Instantiate the KMeansCoefficient class and execute the launch() method.

clustering.k_means_coefficient.main()[source]

Command line execution of this building block. Please check the command line documentation.

clustering.k_means module

Module containing the KMeansClustering class and the command line interface.

class clustering.k_means.KMeansClustering(input_dataset_path, output_results_path, output_model_path, output_plot_path=None, properties=None, **kwargs)[source]

Bases: BiobbObject

biobb_ml KMeansClustering
Wrapper of the scikit-learn KMeans method.
Clusters a given dataset and saves the model and scaler. Visit the KMeans documentation page on the sklearn official website for further information.
Parameters:
  • input_dataset_path (str) – Path to the input dataset. File type: input. Sample file. Accepted formats: csv (edam:format_3752).

  • output_results_path (str) – Path to the clustered dataset. File type: output. Sample file. Accepted formats: csv (edam:format_3752).

  • output_model_path (str) – Path to the output model file. File type: output. Sample file. Accepted formats: pkl (edam:format_3653).

  • output_plot_path (str) (Optional) – Path to the clustering plot. File type: output. Sample file. Accepted formats: png (edam:format_3603).

  • properties (dict - Python dictionary object containing the tool parameters, not input/output files) –

    • predictors (dict) - ({}) Features or columns from your dataset you want to use for fitting. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { “columns”: [“column1”, “column2”] } or { “indexes”: [0, 2, 3, 10, 11, 17] } or { “range”: [[0, 20], [50, 102]] }. If multiple formats are given, the first one is picked.

    • clusters (int) - (3) [1~100|1] The number of clusters to form as well as the number of centroids to generate.

    • plots (list) - (None) List of dictionaries with all plots you want to generate. Only 2D or 3D plots accepted. Format: [ { ‘title’: ‘Plot 1’, ‘features’: [‘feat1’, ‘feat2’] } ].

    • random_state_method (int) - (5) [1~1000|1] Determines random number generation for centroid initialization.

    • scale (bool) - (False) Whether or not to scale the input dataset.

    • remove_tmp (bool) - (True) [WF property] Remove temporary files.

    • restart (bool) - (False) [WF property] Do not execute if output files exist.

Examples

This is an example of how to use the building block from Python:

from biobb_ml.clustering.k_means import k_means
prop = {
    'predictors': {
        'columns': [ 'column1', 'column2', 'column3' ]
    },
    'clusters': 3,
    'plots': [
        {
            'title': 'Plot 1',
            'features': ['feat1', 'feat2']
        }
    ]
}
k_means(input_dataset_path='/path/to/myDataset.csv',
        output_results_path='/path/to/newTable.csv',
        output_model_path='/path/to/newModel.pkl',
        output_plot_path='/path/to/newPlot.png',
        properties=prop)
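
The saved model file can later be reused with the companion clustering_predict building block, documented below in clustering.clustering_predict (the second dataset path here is just an illustrative placeholder):

from biobb_ml.clustering.clustering_predict import clustering_predict

clustering_predict(input_model_path='/path/to/newModel.pkl',
                   output_results_path='/path/to/newPredictedResults.csv',
                   input_dataset_path='/path/to/otherDataset.csv')
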
check_data_params(out_log, err_log)[source]

Checks all the input/output paths and parameters.

launch() → int[source]

Execute the clustering.k_means.KMeansClustering object.

clustering.k_means.k_means(input_dataset_path: str, output_results_path: str, output_model_path: str, output_plot_path: str | None = None, properties: dict | None = None, **kwargs) → int[source]

Instantiate the KMeansClustering class and execute the launch() method.

clustering.k_means.main()[source]

Command line execution of this building block. Please check the command line documentation.

clustering.spectral_coefficient module

Module containing the SpectralCoefficient class and the command line interface.

class clustering.spectral_coefficient.SpectralCoefficient(input_dataset_path, output_results_path, output_plot_path=None, properties=None, **kwargs)[source]

Bases: BiobbObject

biobb_ml SpectralCoefficient
Wrapper of the scikit-learn SpectralClustering method.
Clusters a given dataset and calculates the best K coefficient. Visit the SpectralClustering documentation page on the sklearn official website for further information.
Parameters:
  • input_dataset_path (str) – Path to the input dataset. File type: input. Sample file. Accepted formats: csv (edam:format_3752).

  • output_results_path (str) – Table with WCSS (elbow method), Gap and Silhouette coefficients for each cluster. File type: output. Sample file. Accepted formats: csv (edam:format_3752).

  • output_plot_path (str) (Optional) – Path to the elbow method and gap statistics plot. File type: output. Sample file. Accepted formats: png (edam:format_3603).

  • properties (dict - Python dictionary object containing the tool parameters, not input/output files) –

    • predictors (dict) - ({}) Features or columns from your dataset you want to use for fitting. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { “columns”: [“column1”, “column2”] } or { “indexes”: [0, 2, 3, 10, 11, 17] } or { “range”: [[0, 20], [50, 102]] }. If multiple formats are given, the first one is picked.

    • max_clusters (int) - (6) [1~100|1] Maximum number of clusters to evaluate when searching for the best K.

    • random_state_method (int) - (5) [1~1000|1] A pseudo-random number generator used to initialize the lobpcg eigenvector decomposition when eigen_solver=’amg’, and for the K-Means initialization.

    • scale (bool) - (False) Whether or not to scale the input dataset.

    • remove_tmp (bool) - (True) [WF property] Remove temporary files.

    • restart (bool) - (False) [WF property] Do not execute if output files exist.

Examples

This is an example of how to use the building block from Python:

from biobb_ml.clustering.spectral_coefficient import spectral_coefficient
prop = {
    'predictors': {
        'columns': [ 'column1', 'column2', 'column3' ]
    },
    'max_clusters': 6
}
spectral_coefficient(input_dataset_path='/path/to/myDataset.csv',
                    output_results_path='/path/to/newTable.csv',
                    output_plot_path='/path/to/newPlot.png',
                    properties=prop)
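
When scale is set to True, the predictors are standardized before fitting. A minimal sketch of that step, assuming a StandardScaler-style scaler (the exact scaler used by biobb_ml is an internal detail; column names are placeholders):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('/path/to/myDataset.csv')
X = StandardScaler().fit_transform(df[['column1', 'column2', 'column3']])
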
check_data_params(out_log, err_log)[source]

Checks all the input/output paths and parameters.

launch() → int[source]

Execute the clustering.spectral_coefficient.SpectralCoefficient object.

clustering.spectral_coefficient.main()[source]

Command line execution of this building block. Please check the command line documentation.

clustering.spectral_coefficient.spectral_coefficient(input_dataset_path: str, output_results_path: str, output_plot_path: str | None = None, properties: dict | None = None, **kwargs) → int[source]

Instantiate the SpectralCoefficient class and execute the launch() method.

clustering.spectral_clustering module

Module containing the SpecClustering class and the command line interface.

class clustering.spectral_clustering.SpecClustering(input_dataset_path, output_results_path, output_plot_path=None, properties=None, **kwargs)[source]

Bases: BiobbObject

biobb_ml SpecClustering
Wrapper of the scikit-learn SpectralClustering method.
Clusters a given dataset. Visit the SpectralClustering documentation page on the sklearn official website for further information.
Parameters:
  • input_dataset_path (str) – Path to the input dataset. File type: input. Sample file. Accepted formats: csv (edam:format_3752).

  • output_results_path (str) – Path to the clustered dataset. File type: output. Sample file. Accepted formats: csv (edam:format_3752).

  • output_plot_path (str) (Optional) – Path to the clustering plot. File type: output. Sample file. Accepted formats: png (edam:format_3603).

  • properties (dict - Python dictionary object containing the tool parameters, not input/output files) –

    • predictors (dict) - ({}) Features or columns from your dataset you want to use for fitting. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { “columns”: [“column1”, “column2”] } or { “indexes”: [0, 2, 3, 10, 11, 17] } or { “range”: [[0, 20], [50, 102]] }. If multiple formats are given, the first one is picked.

    • clusters (int) - (3) [1~100|1] The number of clusters to form as well as the number of centroids to generate.

    • affinity (str) - (“rbf”) How to construct the affinity matrix. Values: nearest_neighbors (construct the affinity matrix by computing a graph of nearest neighbors), rbf (construct the affinity matrix using a radial basis function -RBF- kernel), precomputed (interpret X as a precomputed affinity matrix), precomputed_nearest_neighbors (interpret X as a sparse graph of precomputed nearest neighbors and construct the affinity matrix by selecting the n_neighbors nearest neighbors).

    • plots (list) - (None) List of dictionaries with all plots you want to generate. Only 2D or 3D plots accepted. Format: [ { ‘title’: ‘Plot 1’, ‘features’: [‘feat1’, ‘feat2’] } ].

    • random_state_method (int) - (5) [1~1000|1] A pseudo-random number generator used to initialize the lobpcg eigenvector decomposition when eigen_solver=’amg’, and for the K-Means initialization.

    • scale (bool) - (False) Whether or not to scale the input dataset.

    • remove_tmp (bool) - (True) [WF property] Remove temporary files.

    • restart (bool) - (False) [WF property] Do not execute if output files exist.

Examples

This is an example of how to use the building block from Python:

from biobb_ml.clustering.spectral_clustering import spectral_clustering
prop = {
    'predictors': {
        'columns': [ 'column1', 'column2', 'column3' ]
    },
    'clusters': 3,
    'affinity': 'rbf',
    'plots': [
        {
            'title': 'Plot 1',
            'features': ['feat1', 'feat2']
        }
    ]
}
spectral_clustering(input_dataset_path='/path/to/myDataset.csv',
                    output_results_path='/path/to/newTable.csv',
                    output_plot_path='/path/to/newPlot.png',
                    properties=prop)
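
For reference, a sketch of the equivalent direct scikit-learn call (an approximation, not the exact biobb_ml implementation; column names are placeholders):

import pandas as pd
from sklearn.cluster import SpectralClustering

df = pd.read_csv('/path/to/myDataset.csv')
X = df[['column1', 'column2', 'column3']]
model = SpectralClustering(n_clusters=3, affinity='rbf', random_state=5)
df['cluster'] = model.fit_predict(X)
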
check_data_params(out_log, err_log)[source]

Checks all the input/output paths and parameters.

launch() → int[source]

Execute the clustering.spectral_clustering.SpecClustering object.

clustering.spectral_clustering.main()[source]

Command line execution of this building block. Please check the command line documentation.

clustering.spectral_clustering.spectral_clustering(input_dataset_path: str, output_results_path: str, output_plot_path: str | None = None, properties: dict | None = None, **kwargs) → int[source]

Instantiate the SpecClustering class and execute the launch() method.

clustering.clustering_predict module

Module containing the ClusteringPredict class and the command line interface.

class clustering.clustering_predict.ClusteringPredict(input_model_path, output_results_path, input_dataset_path=None, properties=None, **kwargs)[source]

Bases: BiobbObject

biobb_ml ClusteringPredict
Makes predictions from an input dataset (provided either as a file or as a dictionary property) and a given clustering model fitted with the KMeans method.
Parameters:
  • input_model_path (str) – Path to the input model. File type: input. Sample file. Accepted formats: pkl (edam:format_3653).

  • input_dataset_path (str) (Optional) – Path to the dataset to predict. File type: input. Sample file. Accepted formats: csv (edam:format_3752).

  • output_results_path (str) – Path to the output results file. File type: output. Sample file. Accepted formats: csv (edam:format_3752).

  • properties (dict - Python dictionary object containing the tool parameters, not input/output files) –

    • predictions (list) - (None) List of dictionaries with the values whose targets you want to predict. It is used only when input_dataset_path is not provided. Format: [{ ‘var1’: 1.0, ‘var2’: 2.0 }, { ‘var1’: 4.0, ‘var2’: 2.7 }] for datasets with headers and [[ 1.0, 2.0 ], [ 4.0, 2.7 ]] for datasets without headers.

    • remove_tmp (bool) - (True) [WF property] Remove temporary files.

    • restart (bool) - (False) [WF property] Do not execute if output files exist.

Examples

This is an example of how to use the building block from Python:

from biobb_ml.clustering.clustering_predict import clustering_predict
prop = {
    'predictions': [
        {
            'var1': 1.0,
            'var2': 2.0
        },
        {
            'var1': 4.0,
            'var2': 2.7
        }
    ]
}
clustering_predict(input_model_path='/path/to/myModel.pkl',
                    output_results_path='/path/to/newPredictedResults.csv',
                    input_dataset_path='/path/to/myDataset.csv',
                    properties=prop)
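
Since the predictions property is only honored when input_dataset_path is omitted, predicting from the inline list alone looks like this (headerless format shown):

from biobb_ml.clustering.clustering_predict import clustering_predict

prop = {'predictions': [[1.0, 2.0], [4.0, 2.7]]}  # dataset without headers
clustering_predict(input_model_path='/path/to/myModel.pkl',
                   output_results_path='/path/to/newPredictedResults.csv',
                   properties=prop)
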
check_data_params(out_log, err_log)[source]

Checks all the input/output paths and parameters.

launch() → int[source]

Execute the clustering.clustering_predict.ClusteringPredict object.

clustering.clustering_predict.clustering_predict(input_model_path: str, output_results_path: str, input_dataset_path: str | None = None, properties: dict | None = None, **kwargs) → int[source]

Instantiate the ClusteringPredict class and execute the launch() method.

clustering.clustering_predict.main()[source]

Command line execution of this building block. Please check the command line documentation.