BioBB ML Command Line Help¶
Generic usage:
biobb_command [-h] --config CONFIG --input_file(s) <input_file(s)> --output_file <output_file>
Decision_tree¶
Wrapper of the scikit-learn DecisionTreeClassifier method.
Get help¶
Command:
decision_tree -h
usage: decision_tree [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_model_path OUTPUT_MODEL_PATH [--output_test_table_path OUTPUT_TEST_TABLE_PATH] [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the scikit-learn DecisionTreeClassifier method.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_test_table_path OUTPUT_TEST_TABLE_PATH
Path to the test table file. Accepted formats: csv.
--output_plot_path OUTPUT_PLOT_PATH
Path to the statistics plot. If target is binary it shows confusion matrix, distributions of the predicted probabilities of both classes and ROC curve. If target is non-binary it shows confusion matrix. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_model_path OUTPUT_MODEL_PATH
Path to the output model file. Accepted formats: pkl.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_model_path (string): Path to the output model file. File type: output. Sample file. Accepted formats: PKL
- output_test_table_path (string): Path to the test table file. File type: output. Sample file. Accepted formats: CSV
- output_plot_path (string): Path to the statistics plot. If target is binary it shows confusion matrix, distributions of the predicted probabilities of both classes and ROC curve. If target is non-binary it shows confusion matrix. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- independent_vars (object): ({}) Independent variables you want to train from your dataset. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] }, { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- target (object): ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. In case of multiple formats, the first one will be picked.
- weight (object): ({}) Weight variable from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. In case of multiple formats, the first one will be picked.
- criterion (string): (gini) The function to measure the quality of a split.
- max_depth (integer): (4) The maximum depth of the model. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
- normalize_cm (boolean): (False) Whether or not to normalize the confusion matrix.
- random_state_method (integer): (5) Controls the randomness of the estimator.
- random_state_train_test (integer): (5) Controls the shuffling applied to the data before the split.
- test_size (number): (0.2) Proportion of the dataset to include in the test split. It should be between 0.0 and 1.0.
- scale (boolean): (False) Whether or not to scale the input dataset.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
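The precedence rule for multi-format parameters ("the first one will be picked") can be sketched in plain Python. This is an illustrative helper, not part of biobb_ml, and the inclusive treatment of "range" bounds is an assumption:

```python
def resolve_columns(spec, header):
    """Resolve an independent_vars-style spec to a list of column names.

    Formats are checked in the documented order of precedence:
    "columns" first, then "indexes", then "range". `header` is the
    list of column names from the CSV header row. Treating "range"
    bounds as inclusive is an assumption of this sketch.
    """
    if "columns" in spec:
        return list(spec["columns"])
    if "indexes" in spec:
        return [header[i] for i in spec["indexes"]]
    if "range" in spec:
        cols = []
        for start, end in spec["range"]:
            cols.extend(header[start:end + 1])
        return cols
    raise ValueError("spec must contain 'columns', 'indexes' or 'range'")

header = ["interest_rate", "credit", "march", "previous", "duration", "y"]
# "columns" wins when several formats are present
print(resolve_columns({"indexes": [0, 2], "columns": ["credit"]}, header))  # → ['credit']
```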
YAML¶
Common config file¶
properties:
criterion: entropy
independent_vars:
columns:
- interest_rate
- credit
- march
- previous
- duration
max_depth: 4
normalize_cm: false
scale: true
target:
column: y
test_size: 0.2
Command line¶
decision_tree --config config_decision_tree.yml --input_dataset_path dataset_decision_tree.csv --output_model_path ref_output_model_decision_tree.pkl --output_test_table_path ref_output_test_decision_tree.csv --output_plot_path ref_output_plot_decision_tree.png
JSON¶
Common config file¶
{
"properties": {
"independent_vars": {
"columns": [
"interest_rate",
"credit",
"march",
"previous",
"duration"
]
},
"target": {
"column": "y"
},
"criterion": "entropy",
"max_depth": 4,
"normalize_cm": false,
"test_size": 0.2,
"scale": true
}
}
Command line¶
decision_tree --config config_decision_tree.json --input_dataset_path dataset_decision_tree.csv --output_model_path ref_output_model_decision_tree.pkl --output_test_table_path ref_output_test_decision_tree.csv --output_plot_path ref_output_plot_decision_tree.png
Agglomerative_clustering¶
Wrapper of the scikit-learn AgglomerativeClustering method.
Get help¶
Command:
agglomerative_clustering -h
usage: agglomerative_clustering [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_results_path OUTPUT_RESULTS_PATH [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the scikit-learn AgglomerativeClustering method.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_plot_path OUTPUT_PLOT_PATH
Path to the clustering plot. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_results_path OUTPUT_RESULTS_PATH
Path to the clustered dataset. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_results_path (string): Path to the clustered dataset. File type: output. Sample file. Accepted formats: CSV
- output_plot_path (string): Path to the clustering plot. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- predictors (object): ({}) Features or columns from your dataset you want to use for fitting. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] }, { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- clusters (integer): (3) The number of clusters to form as well as the number of centroids to generate.
- affinity (string): (euclidean) Metric used to compute the linkage. If linkage is "ward", only "euclidean" is accepted.
- linkage (string): (ward) The linkage criterion determines which distance to use between sets of observations. The algorithm merges the pairs of clusters that minimize this criterion.
- plots (array): (None) List of dictionaries with all plots you want to generate. Only 2D or 3D plots are accepted. Format: [ { "title": "Plot 1", "features": ["feat1", "feat2"] } ].
- scale (boolean): (False) Whether or not to scale the input dataset.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
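The 2D/3D restriction on the plots parameter can be expressed as a small validator. This is an illustrative sketch, not part of biobb_ml:

```python
def validate_plots(plots):
    """Validate a `plots` array: each entry needs a 'title' and a
    'features' list of exactly 2 or 3 columns (only 2D or 3D plots
    are accepted). Hypothetical helper, not part of biobb_ml."""
    for plot in plots:
        if "title" not in plot or "features" not in plot:
            raise ValueError("each plot needs a 'title' and a 'features' list")
        if len(plot["features"]) not in (2, 3):
            raise ValueError("plot %r must have 2 or 3 features" % plot["title"])
    return True

print(validate_plots([{"title": "Plot 1", "features": ["sepal_length", "sepal_width"]}]))  # → True
```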
YAML¶
Common config file¶
properties:
clusters: 3
linkage: average
plots:
- features:
- sepal_length
- sepal_width
title: Plot 1
- features:
- petal_length
- petal_width
title: Plot 2
- features:
- sepal_length
- sepal_width
- petal_length
title: Plot 3
- features:
- petal_length
- petal_width
- sepal_width
title: Plot 4
- features:
- sepal_length
- petal_width
title: Plot 5
predictors:
columns:
- sepal_length
- sepal_width
- petal_length
- petal_width
scale: true
Command line¶
agglomerative_clustering --config config_agglomerative_clustering.yml --input_dataset_path dataset_agglomerative_clustering.csv --output_results_path ref_output_results_agglomerative_clustering.csv --output_plot_path ref_output_plot_agglomerative_clustering.png
JSON¶
Common config file¶
{
"properties": {
"predictors": {
"columns": [
"sepal_length",
"sepal_width",
"petal_length",
"petal_width"
]
},
"clusters": 3,
"linkage": "average",
"plots": [
{
"title": "Plot 1",
"features": [
"sepal_length",
"sepal_width"
]
},
{
"title": "Plot 2",
"features": [
"petal_length",
"petal_width"
]
},
{
"title": "Plot 3",
"features": [
"sepal_length",
"sepal_width",
"petal_length"
]
},
{
"title": "Plot 4",
"features": [
"petal_length",
"petal_width",
"sepal_width"
]
},
{
"title": "Plot 5",
"features": [
"sepal_length",
"petal_width"
]
}
],
"scale": true
}
}
Command line¶
agglomerative_clustering --config config_agglomerative_clustering.json --input_dataset_path dataset_agglomerative_clustering.csv --output_results_path ref_output_results_agglomerative_clustering.csv --output_plot_path ref_output_plot_agglomerative_clustering.png
Linear_regression¶
Wrapper of the scikit-learn LinearRegression method.
Get help¶
Command:
linear_regression -h
usage: linear_regression [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_model_path OUTPUT_MODEL_PATH [--output_test_table_path OUTPUT_TEST_TABLE_PATH] [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the scikit-learn LinearRegression method.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_test_table_path OUTPUT_TEST_TABLE_PATH
Path to the test table file. Accepted formats: csv.
--output_plot_path OUTPUT_PLOT_PATH
Residual plot checks the error between actual values and predicted values. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_model_path OUTPUT_MODEL_PATH
Path to the output model file. Accepted formats: pkl.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_model_path (string): Path to the output model file. File type: output. Sample file. Accepted formats: PKL
- output_test_table_path (string): Path to the test table file. File type: output. Sample file. Accepted formats: CSV
- output_plot_path (string): Residual plot checks the error between actual values and predicted values. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- independent_vars (object): ({}) Independent variables you want to train from your dataset. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] }, { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- target (object): ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. In case of multiple formats, the first one will be picked.
- weight (object): ({}) Weight variable from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. In case of multiple formats, the first one will be picked.
- random_state_train_test (integer): (5) Controls the shuffling applied to the data before the split.
- test_size (number): (0.2) Proportion of the dataset to include in the test split. It should be between 0.0 and 1.0.
- scale (boolean): (False) Whether or not to scale the input dataset.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
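The effect of test_size on a concrete dataset can be worked out directly. The sketch below assumes the rounding convention of scikit-learn's train_test_split, where the test-set size is rounded up:

```python
import math

def split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_test) for a proportional test_size.

    Assumes the rounding convention of scikit-learn's
    train_test_split: the test-set size is rounded up.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size should be between 0.0 and 1.0")
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

print(split_sizes(100, 0.2))  # → (80, 20)
```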
YAML¶
Common config file¶
properties:
independent_vars:
columns:
- size
- year
- view
scale: true
target:
column: price
test_size: 0.2
Command line¶
linear_regression --config config_linear_regression.yml --input_dataset_path dataset_linear_regression.csv --output_model_path ref_output_model_linear_regression.pkl --output_test_table_path ref_output_test_linear_regression.csv --output_plot_path ref_output_plot_linear_regression.png
JSON¶
Common config file¶
{
"properties": {
"independent_vars": {
"columns": [
"size",
"year",
"view"
]
},
"target": {
"column": "price"
},
"test_size": 0.2,
"scale": true
}
}
Command line¶
linear_regression --config config_linear_regression.json --input_dataset_path dataset_linear_regression.csv --output_model_path ref_output_model_linear_regression.pkl --output_test_table_path ref_output_test_linear_regression.csv --output_plot_path ref_output_plot_linear_regression.png
Random_forest_classifier¶
Wrapper of the scikit-learn RandomForestClassifier method.
Get help¶
Command:
random_forest_classifier -h
usage: random_forest_classifier [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_model_path OUTPUT_MODEL_PATH [--output_test_table_path OUTPUT_TEST_TABLE_PATH] [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the scikit-learn RandomForestClassifier method.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_test_table_path OUTPUT_TEST_TABLE_PATH
Path to the test table file. Accepted formats: csv.
--output_plot_path OUTPUT_PLOT_PATH
Path to the statistics plot. If target is binary it shows confusion matrix, distributions of the predicted probabilities of both classes and ROC curve. If target is non-binary it shows confusion matrix. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_model_path OUTPUT_MODEL_PATH
Path to the output model file. Accepted formats: pkl.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_model_path (string): Path to the output model file. File type: output. Sample file. Accepted formats: PKL
- output_test_table_path (string): Path to the test table file. File type: output. Sample file. Accepted formats: CSV
- output_plot_path (string): Path to the statistics plot. If target is binary it shows confusion matrix, distributions of the predicted probabilities of both classes and ROC curve. If target is non-binary it shows confusion matrix. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- independent_vars (object): ({}) Independent variables you want to train from your dataset. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] }, { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- target (object): ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. In case of multiple formats, the first one will be picked.
- weight (object): ({}) Weight variable from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. In case of multiple formats, the first one will be picked.
- n_estimators (integer): (100) The number of trees in the forest.
- bootstrap (boolean): (True) Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
- normalize_cm (boolean): (False) Whether or not to normalize the confusion matrix.
- random_state_method (integer): (5) Controls the randomness of the estimator.
- random_state_train_test (integer): (5) Controls the shuffling applied to the data before the split.
- test_size (number): (0.2) Proportion of the dataset to include in the test split. It should be between 0.0 and 1.0.
- scale (boolean): (False) Whether or not to scale the input dataset.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
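The bootstrap parameter controls whether each tree is fitted on a resample of the data. A minimal sketch of one bootstrap draw, using Python's random module as a stand-in for the estimator's internal RNG:

```python
import random

def bootstrap_sample(rows, seed=5):
    """Draw one bootstrap sample: len(rows) rows drawn with
    replacement, as each tree sees when bootstrap=True. With
    bootstrap=False every tree is built on the whole dataset.
    The seed stands in for random_state_method."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in rows]

sample = bootstrap_sample(list(range(10)))
print(len(sample))  # → 10
```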
YAML¶
Common config file¶
properties:
bootstrap: true
independent_vars:
indexes:
- 0
- 1
- 2
- 3
- 4
n_estimators: 100
normalize_cm: false
scale: true
target:
index: 5
test_size: 0.2
Command line¶
random_forest_classifier --config config_random_forest_classifier.yml --input_dataset_path dataset_random_forest_classifier.csv --output_model_path ref_output_model_random_forest_classifier.pkl --output_test_table_path ref_output_test_random_forest_classifier.csv --output_plot_path ref_output_plot_random_forest_classifier.png
JSON¶
Common config file¶
{
"properties": {
"independent_vars": {
"indexes": [
0,
1,
2,
3,
4
]
},
"target": {
"index": 5
},
"n_estimators": 100,
"bootstrap": true,
"normalize_cm": false,
"test_size": 0.2,
"scale": true
}
}
Command line¶
random_forest_classifier --config config_random_forest_classifier.json --input_dataset_path dataset_random_forest_classifier.csv --output_model_path ref_output_model_random_forest_classifier.pkl --output_test_table_path ref_output_test_random_forest_classifier.csv --output_plot_path ref_output_plot_random_forest_classifier.png
Resampling¶
Wrapper of the imblearn.combine methods.
Get help¶
Command:
resampling -h
usage: resampling [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_dataset_path OUTPUT_DATASET_PATH
Wrapper of the imblearn.combine methods.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_dataset_path OUTPUT_DATASET_PATH
Path to the output dataset. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_dataset_path (string): Path to the output dataset. File type: output. Sample file. Accepted formats: CSV
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- method (string): (None) Resampling method. This property is mandatory.
- type (string): (None) Type of oversampling. This property is mandatory.
- target (object): ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. In case of multiple formats, the first one will be picked.
- evaluate (boolean): (False) Whether or not to evaluate the dataset before and after applying the resampling.
- evaluate_splits (integer): (3) Number of folds applied by the Repeated Stratified K-Fold evaluation method. Must be at least 2.
- evaluate_repeats (integer): (3) Number of times the Repeated Stratified K-Fold cross-validator is repeated.
- n_bins (integer): (5) Only for regression resampling. The number of classes to generate from the target data.
- balanced_binning (boolean): (False) Only for regression resampling. Whether samples are distributed roughly equally across all classes.
- sampling_strategy_over (object): ({ "target": "auto" }) Sampling information applied in the dataset oversampling process. Formats: { "target": "auto" }, { "ratio": 0.3 } or { "dict": { 0: 300, 1: 200, 2: 100 } }. With "target", specify the class targeted by the resampling; the number of samples in the different classes will be equalized; possible choices are: minority (resample only the minority class), not minority (resample all classes but the minority class), not majority (resample all classes but the majority class), all (resample all classes), auto (equivalent to not majority). With "ratio", the value is the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling (only for binary classification). With "dict", the keys are the targeted classes and the values are the desired number of samples for each targeted class.
- sampling_strategy_under (object): ({ "target": "auto" }) Sampling information applied in the dataset cleaning process. Formats: { "target": "auto" } or { "list": [0, 2, 3] }. With "target", specify the class targeted by the resampling; the number of samples in the different classes will be equalized; possible choices are: majority (resample only the majority class), not minority (resample all classes but the minority class), not majority (resample all classes but the majority class), all (resample all classes), auto (equivalent to not minority). With "list", the list contains the classes targeted by the resampling.
- random_state_method (integer): (5) Controls the randomization of the algorithm.
- random_state_evaluate (integer): (5) Controls the shuffling applied to the Repeated Stratified K-Fold evaluation method.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
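For regression resampling, the continuous target must first be discretised into n_bins classes. A minimal equal-width binning sketch; the actual discretisation strategy is internal to the building block:

```python
def equal_width_bins(values, n_bins=5):
    """Discretise a continuous target into n_bins equal-width classes,
    a simplified stand-in for the internal binning step (with
    balanced_binning the bins would instead hold roughly equal
    numbers of samples)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # the maximum value is folded into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

print(equal_width_bins([0.0, 2.5, 5.0, 7.5, 10.0], n_bins=5))  # → [0, 1, 2, 3, 4]
```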
YAML¶
Common config file¶
properties:
evaluate: true
method: smotenn
n_bins: 10
sampling_strategy_over:
dict:
4: 1000
5: 1000
6: 1000
7: 1000
sampling_strategy_under:
list:
- 0
- 1
target:
column: VALUE
type: regression
Command line¶
resampling --config config_resampling.yml --input_dataset_path dataset_resampling.csv --output_dataset_path ref_output_resampling.csv
JSON¶
Common config file¶
{
"properties": {
"method": "smotenn",
"type": "regression",
"target": {
"column": "VALUE"
},
"evaluate": true,
"n_bins": 10,
"sampling_strategy_over": {
"dict": {
"4": 1000,
"5": 1000,
"6": 1000,
"7": 1000
}
},
"sampling_strategy_under": {
"list": [
0,
1
]
}
}
}
Command line¶
resampling --config config_resampling.json --input_dataset_path dataset_resampling.csv --output_dataset_path ref_output_resampling.csv
Classification_predict¶
Makes predictions from an input dataset and a given classification model.
Get help¶
Command:
classification_predict -h
usage: classification_predict [-h] [--config CONFIG] --input_model_path INPUT_MODEL_PATH --output_results_path OUTPUT_RESULTS_PATH [--input_dataset_path INPUT_DATASET_PATH]
Makes predictions from an input dataset and a given classification model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--input_dataset_path INPUT_DATASET_PATH
Path to the dataset to predict. Accepted formats: csv.
required arguments:
--input_model_path INPUT_MODEL_PATH
Path to the input model. Accepted formats: pkl.
--output_results_path OUTPUT_RESULTS_PATH
Path to the output results file. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_model_path (string): Path to the input model. File type: input. Sample file. Accepted formats: PKL
- input_dataset_path (string): Path to the dataset to predict. File type: input. Sample file. Accepted formats: CSV
- output_results_path (string): Path to the output results file. File type: output. Sample file. Accepted formats: CSV
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- predictions (array): (None) List of dictionaries with the values whose targets you want to predict. Taken into account only when input_dataset_path is not provided. Format: [ { "var1": 1.0, "var2": 2.0 }, { "var1": 4.0, "var2": 2.7 } ] for datasets with headers, or [ [1.0, 2.0], [4.0, 2.7] ] for datasets without headers.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
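The two accepted predictions formats can be reduced to a common (header, rows) shape. An illustrative sketch, not part of biobb_ml:

```python
def normalize_predictions(predictions):
    """Convert the two documented `predictions` formats into a
    (header, rows) pair: a list of dicts for datasets with headers,
    a list of lists for datasets without headers. Hypothetical
    helper, not part of biobb_ml."""
    if predictions and all(isinstance(p, dict) for p in predictions):
        header = list(predictions[0])
        return header, [[p[k] for k in header] for p in predictions]
    return None, [list(p) for p in predictions]

print(normalize_predictions([{"var1": 1.0, "var2": 2.0}, {"var1": 4.0, "var2": 2.7}]))
# → (['var1', 'var2'], [[1.0, 2.0], [4.0, 2.7]])
```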
YAML¶
Common config file¶
properties:
remove_tmp: false
Command line¶
classification_predict --config config_classification_predict.yml --input_model_path model_classification_predict.pkl --input_dataset_path input_classification_predict.csv --output_results_path ref_output_classification_predict.csv
JSON¶
Common config file¶
{
"properties": {
"remove_tmp": false
}
}
Command line¶
classification_predict --config config_classification_predict.json --input_model_path model_classification_predict.pkl --input_dataset_path input_classification_predict.csv --output_results_path ref_output_classification_predict.csv
Principal_component¶
Wrapper of the scikit-learn PCA method.
Get help¶
Command:
principal_component -h
usage: principal_component [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_results_path OUTPUT_RESULTS_PATH [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the scikit-learn PCA method.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_plot_path OUTPUT_PLOT_PATH
Path to the Principal Component plot, only if number of components is 2 or 3. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_results_path OUTPUT_RESULTS_PATH
Path to the analysed dataset. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_results_path (string): Path to the analysed dataset. File type: output. Sample file. Accepted formats: CSV
- output_plot_path (string): Path to the Principal Component plot, only if number of components is 2 or 3. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- features (object): ({}) Features or columns from your dataset you want to use for fitting. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] }, { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- target (object): ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. In case of multiple formats, the first one will be picked.
- n_components (object): ({}) Dictionary containing either the number of components to keep (int) or the fraction of variance, between 0 and 1, that the selected components must explain (float). If not set ({}), all components are kept. Formats: { "value": 2 } for integer values or { "value": 0.3 } for float values.
- random_state_method (integer): (5) Controls the randomness of the estimator.
- scale (boolean): (False) Whether or not to scale the input dataset.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
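The int-versus-float semantics of n_components can be illustrated against a list of explained-variance ratios. A sketch of the selection rule only; PCA applies it internally:

```python
def components_to_keep(explained_variance_ratio, value):
    """Interpret the n_components "value" field: an int keeps that
    many components; a float in (0, 1) keeps the smallest number of
    components whose cumulative explained variance reaches that
    fraction. Illustrative only, not the biobb_ml implementation."""
    if isinstance(value, int):
        return value
    cumulative = 0.0
    for k, ratio in enumerate(explained_variance_ratio, start=1):
        cumulative += ratio
        if cumulative >= value:
            return k
    return len(explained_variance_ratio)

print(components_to_keep([0.6, 0.25, 0.1, 0.05], 0.8))  # → 2
```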
YAML¶
Common config file¶
properties:
features:
columns:
- sepal_length
- sepal_width
- petal_length
- petal_width
n_components:
value: 2
scale: true
target:
column: target
Command line¶
principal_component --config config_principal_component.yml --input_dataset_path dataset_principal_component.csv --output_results_path ref_output_results_principal_component.csv --output_plot_path ref_output_plot_principal_component.png
JSON¶
Common config file¶
{
"properties": {
"features": {
"columns": [
"sepal_length",
"sepal_width",
"petal_length",
"petal_width"
]
},
"target": {
"column": "target"
},
"n_components": {
"value": 2
},
"scale": true
}
}
Command line¶
principal_component --config config_principal_component.json --input_dataset_path dataset_principal_component.csv --output_results_path ref_output_results_principal_component.csv --output_plot_path ref_output_plot_principal_component.png
Spectral_clustering¶
Wrapper of the scikit-learn SpectralClustering method.
Get help¶
Command:
spectral_clustering -h
usage: spectral_clustering [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_results_path OUTPUT_RESULTS_PATH [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the scikit-learn SpectralClustering method.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_plot_path OUTPUT_PLOT_PATH
Path to the clustering plot. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_results_path OUTPUT_RESULTS_PATH
Path to the clustered dataset. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_results_path (string): Path to the clustered dataset. File type: output. Sample file. Accepted formats: CSV
- output_plot_path (string): Path to the clustering plot. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- predictors (object): ({}) Features or columns from your dataset you want to use for fitting. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] }, { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- clusters (integer): (3) The number of clusters to form as well as the number of centroids to generate.
- affinity (string): (rbf) How to construct the affinity matrix.
- plots (array): (None) List of dictionaries with all plots you want to generate. Only 2D or 3D plots are accepted. Format: [ { "title": "Plot 1", "features": ["feat1", "feat2"] } ].
- random_state_method (integer): (5) A pseudo-random number generator used for the initialization of the lobpcg eigenvector decomposition when eigen_solver='amg', and by the K-Means initialization.
- scale (boolean): (False) Whether or not to scale the input dataset.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
YAML¶
Common config file¶
properties:
affinity: nearest_neighbors
clusters: 3
plots:
- features:
- sepal_length
- sepal_width
title: Plot 1
- features:
- petal_length
- petal_width
title: Plot 2
- features:
- sepal_length
- sepal_width
- petal_length
title: Plot 3
- features:
- petal_length
- petal_width
- sepal_width
title: Plot 4
- features:
- sepal_length
- petal_width
title: Plot 5
predictors:
columns:
- sepal_length
- sepal_width
- petal_length
- petal_width
scale: true
Command line¶
spectral_clustering --config config_spectral_clustering.yml --input_dataset_path dataset_spectral_clustering.csv --output_results_path ref_output_results_spectral_clustering.csv --output_plot_path ref_output_plot_spectral_clustering.png
JSON¶
Common config file¶
{
"properties": {
"predictors": {
"columns": [
"sepal_length",
"sepal_width",
"petal_length",
"petal_width"
]
},
"clusters": 3,
"affinity": "nearest_neighbors",
"plots": [
{
"title": "Plot 1",
"features": [
"sepal_length",
"sepal_width"
]
},
{
"title": "Plot 2",
"features": [
"petal_length",
"petal_width"
]
},
{
"title": "Plot 3",
"features": [
"sepal_length",
"sepal_width",
"petal_length"
]
},
{
"title": "Plot 4",
"features": [
"petal_length",
"petal_width",
"sepal_width"
]
},
{
"title": "Plot 5",
"features": [
"sepal_length",
"petal_width"
]
}
],
"scale": true
}
}
Command line¶
spectral_clustering --config config_spectral_clustering.json --input_dataset_path dataset_spectral_clustering.csv --output_results_path ref_output_results_spectral_clustering.csv --output_plot_path ref_output_plot_spectral_clustering.png
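Under the hood, this building block drives scikit-learn's SpectralClustering. The following is a minimal sketch of the equivalent Python call, assuming the input CSV has already been loaded into a pandas DataFrame; the helper name `run_spectral_clustering` is illustrative, not part of biobb_ml:

```python
import pandas as pd
from sklearn.cluster import SpectralClustering
from sklearn.preprocessing import StandardScaler

def run_spectral_clustering(df, predictor_columns, clusters=3,
                            affinity="nearest_neighbors", scale=True,
                            random_state=5):
    """Cluster the selected predictor columns and return one label per row."""
    X = df[predictor_columns].values
    if scale:
        # Mirrors the 'scale' property: standardize features before clustering
        X = StandardScaler().fit_transform(X)
    model = SpectralClustering(n_clusters=clusters, affinity=affinity,
                               random_state=random_state)
    return model.fit_predict(X)
```

The `clusters`, `affinity` and `random_state_method` properties map directly to SpectralClustering's `n_clusters`, `affinity` and `random_state` arguments.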
Random_forest_regressor¶
Wrapper of the scikit-learn RandomForestRegressor method.
Get help¶
Command:
random_forest_regressor -h
usage: random_forest_regressor [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_model_path OUTPUT_MODEL_PATH [--output_test_table_path OUTPUT_TEST_TABLE_PATH] [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the scikit-learn RandomForestRegressor method.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_test_table_path OUTPUT_TEST_TABLE_PATH
Path to the test table file. Accepted formats: csv.
--output_plot_path OUTPUT_PLOT_PATH
Residual plot checks the error between actual values and predicted values. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_model_path OUTPUT_MODEL_PATH
Path to the output model file. Accepted formats: pkl.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_model_path (string): Path to the output model file. File type: output. Sample file. Accepted formats: PKL
- output_test_table_path (string): Path to the test table file. File type: output. Sample file. Accepted formats: CSV
- output_plot_path (string): Residual plot checks the error between actual values and predicted values. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- independent_vars (object): ({}) Independent variables you want to train from your dataset. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] } or { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. If multiple formats are given, the first one is picked.
- target (object): ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. If multiple formats are given, the first one is picked.
- weight (object): ({}) Weight variable from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. If multiple formats are given, the first one is picked.
- n_estimators (integer): (10) The number of trees in the forest.
- max_depth (integer): (None) The maximum depth of the tree.
- random_state_method (integer): (5) Controls the randomness of the estimator.
- random_state_train_test (integer): (5) Controls the shuffling applied to the data before applying the split.
- test_size (number): (0.2) Represents the proportion of the dataset to include in the test split. It should be between 0.0 and 1.0.
- scale (boolean): (False) Whether or not to scale the input dataset.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
YAML¶
Common config file¶
properties:
independent_vars:
range:
- - 0
- 5
- - 7
- 12
max_depth: 5
n_estimators: 10
scale: true
target:
index: 13
test_size: 0.2
Command line¶
random_forest_regressor --config config_random_forest_regressor.yml --input_dataset_path dataset_random_forest_regressor.csv --output_model_path ref_output_model_random_forest_regressor.pkl --output_test_table_path ref_output_test_random_forest_regressor.csv --output_plot_path ref_output_plot_random_forest_regressor.png
JSON¶
Common config file¶
{
"properties": {
"independent_vars": {
"range": [
[
0,
5
],
[
7,
12
]
]
},
"target": {
"index": 13
},
"n_estimators": 10,
"max_depth": 5,
"test_size": 0.2,
"scale": true
}
}
Command line¶
random_forest_regressor --config config_random_forest_regressor.json --input_dataset_path dataset_random_forest_regressor.csv --output_model_path ref_output_model_random_forest_regressor.pkl --output_test_table_path ref_output_test_random_forest_regressor.csv --output_plot_path ref_output_plot_random_forest_regressor.png
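The workflow this building block performs, train/test split followed by a RandomForestRegressor fit, can be sketched in a few lines of scikit-learn. Column names and the helper name `train_random_forest` are illustrative, not part of the tool:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def train_random_forest(df, feature_cols, target_col, n_estimators=10,
                        max_depth=None, test_size=0.2, random_state=5):
    """Fit a random forest on a train split; return the model and test R^2."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[feature_cols], df[target_col],
        test_size=test_size, random_state=random_state)
    model = RandomForestRegressor(n_estimators=n_estimators,
                                  max_depth=max_depth,
                                  random_state=random_state)
    model.fit(X_train, y_train)
    # score() on a regressor reports R^2 on the held-out split
    return model, model.score(X_test, y_test)
```

The `random_state_method` and `random_state_train_test` properties correspond to the `random_state` arguments of the estimator and of `train_test_split`, respectively.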
Logistic_regression¶
Wrapper of the scikit-learn LogisticRegression method.
Get help¶
Command:
logistic_regression -h
usage: logistic_regression [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_model_path OUTPUT_MODEL_PATH [--output_test_table_path OUTPUT_TEST_TABLE_PATH] [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the scikit-learn LogisticRegression method.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_test_table_path OUTPUT_TEST_TABLE_PATH
Path to the test table file. Accepted formats: csv.
--output_plot_path OUTPUT_PLOT_PATH
Path to the statistics plot. If target is binary it shows confusion matrix, distributions of the predicted probabilities of both classes and ROC curve. If target is non-binary it shows confusion matrix. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_model_path OUTPUT_MODEL_PATH
Path to the output model file. Accepted formats: pkl.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_model_path (string): Path to the output model file. File type: output. Sample file. Accepted formats: PKL
- output_test_table_path (string): Path to the test table file. File type: output. Sample file. Accepted formats: CSV
- output_plot_path (string): Path to the statistics plot. If target is binary it shows confusion matrix, distributions of the predicted probabilities of both classes and ROC curve. If target is non-binary it shows confusion matrix. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- independent_vars (object): ({}) Independent variables you want to train from your dataset. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] } or { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. If multiple formats are given, the first one is picked.
- target (object): ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. If multiple formats are given, the first one is picked.
- weight (object): ({}) Weight variable from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. If multiple formats are given, the first one is picked.
- solver (string): (liblinear) Numerical optimizer to find parameters.
- c_parameter (number): (0.01) Inverse of regularization strength; must be a positive float. Smaller values specify stronger regularization.
- normalize_cm (boolean): (False) Whether or not to normalize the confusion matrix.
- random_state_method (integer): (5) Controls the randomness of the estimator.
- random_state_train_test (integer): (5) Controls the shuffling applied to the data before applying the split.
- test_size (number): (0.2) Represents the proportion of the dataset to include in the test split. It should be between 0.0 and 1.0.
- scale (boolean): (False) Whether or not to scale the input dataset.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
YAML¶
Common config file¶
properties:
c_parameter: 0.01
independent_vars:
columns:
- mean area
- mean compactness
normalize_cm: false
scale: true
solver: liblinear
target:
column: benign
test_size: 0.2
Command line¶
logistic_regression --config config_logistic_regression.yml --input_dataset_path dataset_logistic_regression.csv --output_model_path ref_output_model_logistic_regression.pkl --output_test_table_path ref_output_test_logistic_regression.csv --output_plot_path ref_output_plot_logistic_regression.png
JSON¶
Common config file¶
{
"properties": {
"independent_vars": {
"columns": [
"mean area",
"mean compactness"
]
},
"target": {
"column": "benign"
},
"solver": "liblinear",
"c_parameter": 0.01,
"normalize_cm": false,
"test_size": 0.2,
"scale": true
}
}
Command line¶
logistic_regression --config config_logistic_regression.json --input_dataset_path dataset_logistic_regression.csv --output_model_path ref_output_model_logistic_regression.pkl --output_test_table_path ref_output_test_logistic_regression.csv --output_plot_path ref_output_plot_logistic_regression.png
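A minimal sketch of the underlying scikit-learn call, assuming the CSV is already loaded into a pandas DataFrame. Note that `c_parameter` maps to scikit-learn's `C` argument (inverse regularization strength); the helper name is illustrative:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_logistic_regression(df, feature_cols, target_col,
                              solver="liblinear", c_parameter=0.01,
                              test_size=0.2, random_state=5):
    """Fit a logistic regression; return the model and test-split accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[feature_cols], df[target_col],
        test_size=test_size, random_state=random_state)
    model = LogisticRegression(solver=solver, C=c_parameter,
                               random_state=random_state)
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)
```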
Pls_components¶
Wrapper of the scikit-learn PLSRegression method.
Get help¶
Command:
pls_components -h
usage: pls_components [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_results_path OUTPUT_RESULTS_PATH [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the scikit-learn PLSRegression method.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_plot_path OUTPUT_PLOT_PATH
Path to the Mean Square Error plot. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_results_path OUTPUT_RESULTS_PATH
Table with R2 and MSE for calibration and cross-validation data for the best number of components. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_results_path (string): Table with R2 and MSE for calibration and cross-validation data for the best number of components. File type: output. Sample file. Accepted formats: CSV
- output_plot_path (string): Path to the Mean Square Error plot. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- features (object): ({}) Features or columns from your dataset you want to use for fitting. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] } or { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. If multiple formats are given, the first one is picked.
- target (object): ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. If multiple formats are given, the first one is picked.
- optimise (boolean): (False) Whether or not to optimise the MSE calculation process. Beware: if True, the process can take a long time depending on the max_components value.
- max_components (integer): (10) Maximum number of components to use by default for PLS queries.
- cv (integer): (10) Specify the number of folds in the cross-validation splitting strategy. Value must be between 2 and the number of samples in the dataset.
- scale (boolean): (False) Whether or not to scale the input dataset.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
YAML¶
Common config file¶
properties:
cv: 10
features:
range:
- - 0
- 29
max_components: 30
optimise: false
scale: true
target:
index: 30
Command line¶
pls_components --config config_pls_components.yml --input_dataset_path dataset_pls_components.csv --output_results_path ref_output_results_pls_components.csv --output_plot_path ref_output_plot_pls_components.png
JSON¶
Common config file¶
{
"properties": {
"features": {
"range": [
[
0,
29
]
]
},
"target": {
"index": 30
},
"optimise": false,
"max_components": 30,
"cv": 10,
"scale": true
}
}
Command line¶
pls_components --config config_pls_components.json --input_dataset_path dataset_pls_components.csv --output_results_path ref_output_results_pls_components.csv --output_plot_path ref_output_plot_pls_components.png
Pairwise_comparison¶
Generates a pairwise comparison from a given dataset.
Get help¶
Command:
pairwise_comparison -h
usage: pairwise_comparison [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_plot_path OUTPUT_PLOT_PATH
Generates a pairwise comparison from a given dataset.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_plot_path OUTPUT_PLOT_PATH
Path to the pairwise comparison plot. Accepted formats: png.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_plot_path (string): Path to the pairwise comparison plot. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- features (object): ({}) Independent variables or columns from your dataset you want to compare. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] } or { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. If multiple formats are given, the first one is picked.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
YAML¶
Common config file¶
properties:
features:
indexes:
- 0
- 1
- 2
- 3
Command line¶
pairwise_comparison --config config_pairwise_comparison.yml --input_dataset_path dataset_pairwise_comparison.csv --output_plot_path ref_output_plot_pairwise_comparison.png
JSON¶
Common config file¶
{
"properties": {
"features": {
"indexes": [
0,
1,
2,
3
]
}
}
}
Command line¶
pairwise_comparison --config config_pairwise_comparison.json --input_dataset_path dataset_pairwise_comparison.csv --output_plot_path ref_output_plot_pairwise_comparison.png
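A pairwise comparison of this kind is typically rendered as a scatter matrix. The sketch below shows one plausible way to produce an equivalent PNG with pandas and matplotlib; the helper name and plotting details are illustrative, not the tool's actual implementation:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render straight to file
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

def pairwise_comparison_plot(df, feature_indexes, output_plot_path):
    """Save a scatter matrix comparing the selected columns pairwise."""
    subset = df.iloc[:, feature_indexes]  # mirrors the 'indexes' format
    scatter_matrix(subset, figsize=(8, 8), diagonal="hist")
    plt.savefig(output_plot_path)
    plt.close("all")
    return output_plot_path
```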
K_neighbors¶
Wrapper of the scikit-learn KNeighborsClassifier method.
Get help¶
Command:
k_neighbors -h
usage: k_neighbors [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_model_path OUTPUT_MODEL_PATH [--output_test_table_path OUTPUT_TEST_TABLE_PATH] [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the scikit-learn KNeighborsClassifier method.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_test_table_path OUTPUT_TEST_TABLE_PATH
Path to the test table file. Accepted formats: csv.
--output_plot_path OUTPUT_PLOT_PATH
Path to the statistics plot. If target is binary it shows confusion matrix, distributions of the predicted probabilities of both classes and ROC curve. If target is non-binary it shows confusion matrix. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_model_path OUTPUT_MODEL_PATH
Path to the output model file. Accepted formats: pkl.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_model_path (string): Path to the output model file. File type: output. Sample file. Accepted formats: PKL
- output_test_table_path (string): Path to the test table file. File type: output. Sample file. Accepted formats: CSV
- output_plot_path (string): Path to the statistics plot. If target is binary it shows confusion matrix, distributions of the predicted probabilities of both classes and ROC curve. If target is non-binary it shows confusion matrix. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- independent_vars (object): ({}) Independent variables you want to train from your dataset. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] } or { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. If multiple formats are given, the first one is picked.
- target (object): ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. If multiple formats are given, the first one is picked.
- weight (object): ({}) Weight variable from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. If multiple formats are given, the first one is picked.
- metric (string): (minkowski) The distance metric to use for the tree.
- n_neighbors (integer): (6) Number of neighbors to use by default for kneighbors queries.
- normalize_cm (boolean): (False) Whether or not to normalize the confusion matrix.
- random_state_train_test (integer): (5) Controls the shuffling applied to the data before applying the split.
- test_size (number): (0.2) Represents the proportion of the dataset to include in the test split. It should be between 0.0 and 1.0.
- scale (boolean): (False) Whether or not to scale the input dataset.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
YAML¶
Common config file¶
properties:
independent_vars:
columns:
- interest_rate
- credit
- march
- previous
- duration
metric: minkowski
n_neighbors: 5
normalize_cm: false
scale: true
target:
column: y
test_size: 0.2
Command line¶
k_neighbors --config config_k_neighbors.yml --input_dataset_path dataset_k_neighbors.csv --output_model_path ref_output_model_k_neighbors.pkl --output_test_table_path ref_output_test_k_neighbors.csv --output_plot_path ref_output_plot_k_neighbors.png
JSON¶
Common config file¶
{
"properties": {
"independent_vars": {
"columns": [
"interest_rate",
"credit",
"march",
"previous",
"duration"
]
},
"target": {
"column": "y"
},
"metric": "minkowski",
"n_neighbors": 5,
"normalize_cm": false,
"test_size": 0.2,
"scale": true
}
}
Command line¶
k_neighbors --config config_k_neighbors.json --input_dataset_path dataset_k_neighbors.csv --output_model_path ref_output_model_k_neighbors.pkl --output_test_table_path ref_output_test_k_neighbors.csv --output_plot_path ref_output_plot_k_neighbors.png
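A minimal sketch of the equivalent KNeighborsClassifier workflow, including the confusion matrix that feeds the statistics plot. The helper name is illustrative:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def train_k_neighbors(df, feature_cols, target_col, n_neighbors=6,
                      metric="minkowski", test_size=0.2, random_state=5):
    """Fit a k-NN classifier; return the model and the test confusion matrix."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[feature_cols], df[target_col],
        test_size=test_size, random_state=random_state)
    model = KNeighborsClassifier(n_neighbors=n_neighbors, metric=metric)
    model.fit(X_train, y_train)
    cm = confusion_matrix(y_test, model.predict(X_test))
    return model, cm
```

Dividing `cm` by its row sums would give the normalized matrix that the `normalize_cm` property refers to.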
Support_vector_machine¶
Wrapper of the scikit-learn Support Vector Machine (svm.SVC) method.
Get help¶
Command:
support_vector_machine -h
usage: support_vector_machine [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_model_path OUTPUT_MODEL_PATH [--output_test_table_path OUTPUT_TEST_TABLE_PATH] [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the scikit-learn Support Vector Machine (svm.SVC) method.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_test_table_path OUTPUT_TEST_TABLE_PATH
Path to the test table file. Accepted formats: csv.
--output_plot_path OUTPUT_PLOT_PATH
Path to the statistics plot. If target is binary it shows confusion matrix, distributions of the predicted probabilities of both classes and ROC curve. If target is non-binary it shows confusion matrix. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_model_path OUTPUT_MODEL_PATH
Path to the output model file. Accepted formats: pkl.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_model_path (string): Path to the output model file. File type: output. Sample file. Accepted formats: PKL
- output_test_table_path (string): Path to the test table file. File type: output. Sample file. Accepted formats: CSV
- output_plot_path (string): Path to the statistics plot. If target is binary it shows confusion matrix, distributions of the predicted probabilities of both classes and ROC curve. If target is non-binary it shows confusion matrix. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- independent_vars (object): ({}) Independent variables you want to train from your dataset. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] } or { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. If multiple formats are given, the first one is picked.
- target (object): ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. If multiple formats are given, the first one is picked.
- weight (object): ({}) Weight variable from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. If multiple formats are given, the first one is picked.
- kernel (string): (rbf) Specifies the kernel type to be used in the algorithm.
- normalize_cm (boolean): (False) Whether or not to normalize the confusion matrix.
- random_state_method (integer): (5) Controls the randomness of the estimator.
- random_state_train_test (integer): (5) Controls the shuffling applied to the data before applying the split.
- test_size (number): (0.2) Represents the proportion of the dataset to include in the test split. It should be between 0.0 and 1.0.
- scale (boolean): (False) Whether or not to scale the input dataset.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
YAML¶
Common config file¶
properties:
independent_vars:
range:
- - 0
- 2
- - 4
- 5
kernel: rbf
normalize_cm: false
scale: true
target:
index: 6
test_size: 0.2
Command line¶
support_vector_machine --config config_support_vector_machine.yml --input_dataset_path dataset_support_vector_machine.csv --output_model_path ref_output_model_support_vector_machine.pkl --output_test_table_path ref_output_test_support_vector_machine.csv --output_plot_path ref_output_plot_support_vector_machine.png
JSON¶
Common config file¶
{
"properties": {
"independent_vars": {
"range": [
[
0,
2
],
[
4,
5
]
]
},
"target": {
"index": 6
},
"kernel": "rbf",
"normalize_cm": false,
"test_size": 0.2,
"scale": true
}
}
Command line¶
support_vector_machine --config config_support_vector_machine.json --input_dataset_path dataset_support_vector_machine.csv --output_model_path ref_output_model_support_vector_machine.pkl --output_test_table_path ref_output_test_support_vector_machine.csv --output_plot_path ref_output_plot_support_vector_machine.png
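The equivalent scikit-learn workflow uses the SVC class; a minimal sketch, with an illustrative helper name and assuming the CSV is already in a DataFrame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def train_svm(df, feature_cols, target_col, kernel="rbf",
              test_size=0.2, random_state=5):
    """Fit an SVC classifier; return the model and test-split accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[feature_cols], df[target_col],
        test_size=test_size, random_state=random_state)
    model = SVC(kernel=kernel, random_state=random_state)
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)
```

The `kernel` property is passed straight through to SVC; with the RBF kernel, scaling the inputs first (the `scale` property) usually matters a great deal.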
Regression_neural_network¶
Wrapper of the TensorFlow Keras Sequential method for regression.
Get help¶
Command:
regression_neural_network -h
usage: regression_neural_network [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_model_path OUTPUT_MODEL_PATH [--output_test_table_path OUTPUT_TEST_TABLE_PATH] [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the TensorFlow Keras Sequential method.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_test_table_path OUTPUT_TEST_TABLE_PATH
Path to the test table file. Accepted formats: csv.
--output_plot_path OUTPUT_PLOT_PATH
Loss, MAE and MSE plots. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_model_path OUTPUT_MODEL_PATH
Path to the output model file. Accepted formats: h5.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_model_path (string): Path to the output model file. File type: output. Sample file. Accepted formats: H5
- output_test_table_path (string): Path to the test table file. File type: output. Sample file. Accepted formats: CSV
- output_plot_path (string): Loss, MAE and MSE plots. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- features (object): ({}) Independent variables or columns from your dataset you want to train. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] } or { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. If multiple formats are given, the first one is picked.
- target (object): ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. If multiple formats are given, the first one is picked.
- weight (object): ({}) Weight variable from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. If multiple formats are given, the first one is picked.
- validation_size (number): (0.2) Represents the proportion of the dataset to include in the validation split. It should be between 0.0 and 1.0.
- test_size (number): (0.1) Represents the proportion of the dataset to include in the test split. It should be between 0.0 and 1.0.
- hidden_layers (array): (None) List of dictionaries with hidden layer values. Format: [ { "size": 50, "activation": "relu" } ].
- output_layer_activation (string): (softmax) Activation function to use in the output layer.
- optimizer (string): (Adam) Name of the optimizer instance.
- learning_rate (number): (0.02) Determines the step size at each iteration while moving toward a minimum of the loss function.
- batch_size (integer): (100) Number of samples per gradient update.
- max_epochs (integer): (100) Number of epochs to train the model. As early stopping is enabled, this is a maximum.
- random_state (integer): (5) Controls the shuffling applied to the data before applying the split.
- scale (boolean): (False) Whether or not to scale the input dataset.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
YAML¶
Common config file¶
properties:
batch_size: 32
features:
columns:
- ZN
- RM
- AGE
- LSTAT
hidden_layers:
- activation: relu
size: 10
- activation: relu
size: 8
learning_rate: 0.01
max_epochs: 150
optimizer: Adam
target:
column: MEDV
test_size: 0.2
validation_size: 0.2
Command line¶
regression_neural_network --config config_regression_neural_network.yml --input_dataset_path dataset_regression.csv --output_model_path ref_output_model_regression.h5 --output_test_table_path ref_output_test_regression.csv --output_plot_path ref_output_plot_regression.png
JSON¶
Common config file¶
{
"properties": {
"features": {
"columns": [
"ZN",
"RM",
"AGE",
"LSTAT"
]
},
"target": {
"column": "MEDV"
},
"validation_size": 0.2,
"test_size": 0.2,
"hidden_layers": [
{
"size": 10,
"activation": "relu"
},
{
"size": 8,
"activation": "relu"
}
],
"optimizer": "Adam",
"learning_rate": 0.01,
"batch_size": 32,
"max_epochs": 150
}
}
Command line¶
regression_neural_network --config config_regression_neural_network.json --input_dataset_path dataset_regression.csv --output_model_path ref_output_model_regression.h5 --output_test_table_path ref_output_test_regression.csv --output_plot_path ref_output_plot_regression.png
Polynomial_regression¶
Wrapper of the scikit-learn LinearRegression method with PolynomialFeatures.
Get help¶
Command:
polynomial_regression -h
usage: polynomial_regression [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_model_path OUTPUT_MODEL_PATH [--output_test_table_path OUTPUT_TEST_TABLE_PATH] [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the scikit-learn LinearRegression method with PolynomialFeatures.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_test_table_path OUTPUT_TEST_TABLE_PATH
Path to the test table file. Accepted formats: csv.
--output_plot_path OUTPUT_PLOT_PATH
Residual plot checks the error between actual values and predicted values. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_model_path OUTPUT_MODEL_PATH
Path to the output model file. Accepted formats: pkl.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_model_path (string): Path to the output model file. File type: output. Sample file. Accepted formats: PKL
- output_test_table_path (string): Path to the test table file. File type: output. Sample file. Accepted formats: CSV
- output_plot_path (string): Residual plot checks the error between actual values and predicted values. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- independent_vars (object): ({}) Independent variables you want to train from your dataset. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] } or { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. If multiple formats are given, the first one is picked.
- target (object): ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. If multiple formats are given, the first one is picked.
- weight (object): ({}) Weight variable from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. If multiple formats are given, the first one is picked.
- random_state_train_test (integer): (5) Controls the shuffling applied to the data before applying the split.
- degree (integer): (2) Polynomial degree.
- test_size (number): (0.2) Represents the proportion of the dataset to include in the test split. It should be between 0.0 and 1.0.
- scale (boolean): (False) Whether or not to scale the input dataset.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
YAML¶
Common config file¶
properties:
degree: 2
independent_vars:
columns:
- LSTAT
- RM
- ZN
- AGE
scale: true
target:
column: MEDV
test_size: 0.2
Command line¶
polynomial_regression --config config_polynomial_regression.yml --input_dataset_path dataset_polynomial_regression.csv --output_model_path ref_output_model_polynomial_regression.pkl --output_test_table_path ref_output_test_polynomial_regression.csv --output_plot_path ref_output_plot_polynomial_regression.png
JSON¶
Common config file¶
{
"properties": {
"independent_vars": {
"columns": [
"LSTAT",
"RM",
"ZN",
"AGE"
]
},
"target": {
"column": "MEDV"
},
"degree": 2,
"test_size": 0.2,
"scale": true
}
}
Command line¶
polynomial_regression --config config_polynomial_regression.json --input_dataset_path dataset_polynomial_regression.csv --output_model_path ref_output_model_polynomial_regression.pkl --output_test_table_path ref_output_test_polynomial_regression.csv --output_plot_path ref_output_plot_polynomial_regression.png
Scale_columns¶
Scales columns from a given dataset.
Get help¶
Command:
scale_columns -h
usage: scale_columns [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_dataset_path OUTPUT_DATASET_PATH
Scales columns from a given dataset
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_dataset_path OUTPUT_DATASET_PATH
Path to the output dataset. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_dataset_path (string): Path to the output dataset. File type: output. Sample file. Accepted formats: CSV
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- targets (object): ({}) Independent variables or columns from your dataset you want to scale. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] } or { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
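As a rough illustration of what scaling a selected column means, here is a plain-Python sketch of standardization (zero mean, unit variance). This is not the biobb_ml implementation, and the actual scaler applied by scale_columns may use a different transform (e.g. min-max):

```python
# Minimal sketch: standardize one column of values to zero mean and unit
# variance, the kind of transform applied to the columns selected by
# "targets". Population variance is used here for simplicity.

def standard_scale(values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / var ** 0.5 for v in values]

scaled = standard_scale([1.0, 2.0, 3.0])
print(scaled)  # symmetric around 0.0, e.g. middle value maps to 0.0
```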
YAML¶
Common config file¶
properties:
targets:
columns:
- VALUE
Command line¶
scale_columns --config config_scale_columns.yml --input_dataset_path dataset_scale.csv --output_dataset_path ref_output_scale.csv
JSON¶
Common config file¶
{
"properties": {
"targets": {
"columns": [
"VALUE"
]
}
}
}
Command line¶
scale_columns --config config_scale_columns.json --input_dataset_path dataset_scale.csv --output_dataset_path ref_output_scale.csv
Autoencoder_neural_network¶
Wrapper of the TensorFlow Keras LSTM method for encoding.
Get help¶
Command:
autoencoder_neural_network -h
usage: autoencoder_neural_network [-h] [--config CONFIG] --input_decode_path INPUT_DECODE_PATH [--input_predict_path INPUT_PREDICT_PATH] --output_model_path OUTPUT_MODEL_PATH [--output_test_decode_path OUTPUT_TEST_DECODE_PATH] [--output_test_predict_path OUTPUT_TEST_PREDICT_PATH]
Wrapper of the TensorFlow Keras LSTM method for encoding.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--input_predict_path INPUT_PREDICT_PATH
Path to the input predict dataset. Accepted formats: csv.
--output_test_decode_path OUTPUT_TEST_DECODE_PATH
Path to the test decode table file. Accepted formats: csv.
--output_test_predict_path OUTPUT_TEST_PREDICT_PATH
Path to the test predict table file. Accepted formats: csv.
required arguments:
--input_decode_path INPUT_DECODE_PATH
Path to the input decode dataset. Accepted formats: csv.
--output_model_path OUTPUT_MODEL_PATH
Path to the output model file. Accepted formats: h5.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_decode_path (string): Path to the input decode dataset. File type: input. Sample file. Accepted formats: CSV
- input_predict_path (string): Path to the input predict dataset. File type: input. Sample file. Accepted formats: CSV
- output_model_path (string): Path to the output model file. File type: output. Sample file. Accepted formats: H5
- output_test_decode_path (string): Path to the test decode table file. File type: output. Sample file. Accepted formats: CSV
- output_test_predict_path (string): Path to the test predict table file. File type: output. Sample file. Accepted formats: CSV
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- optimizer (string): (Adam) Name of the optimizer instance.
- learning_rate (number): (0.02) Determines the step size at each iteration while moving toward a minimum of the loss function.
- batch_size (integer): (100) Number of samples per gradient update.
- max_epochs (integer): (100) Number of epochs to train the model. As early stopping is enabled, this is a maximum.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
YAML¶
Common config file¶
properties:
batch_size: 32
learning_rate: 0.01
max_epochs: 300
optimizer: Adam
Command line¶
autoencoder_neural_network --config config_autoencoder_neural_network.yml --input_decode_path dataset_autoencoder_decode.csv --input_predict_path dataset_autoencoder_predict.csv --output_model_path ref_output_model_autoencoder.h5 --output_test_decode_path ref_output_test_decode_autoencoder.csv --output_test_predict_path ref_output_test_predict_autoencoder.csv
JSON¶
Common config file¶
{
"properties": {
"optimizer": "Adam",
"learning_rate": 0.01,
"batch_size": 32,
"max_epochs": 300
}
}
Command line¶
autoencoder_neural_network --config config_autoencoder_neural_network.json --input_decode_path dataset_autoencoder_decode.csv --input_predict_path dataset_autoencoder_predict.csv --output_model_path ref_output_model_autoencoder.h5 --output_test_decode_path ref_output_test_decode_autoencoder.csv --output_test_predict_path ref_output_test_predict_autoencoder.csv
K_means¶
Wrapper of the scikit-learn KMeans method.
Get help¶
Command:
k_means -h
usage: k_means [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_results_path OUTPUT_RESULTS_PATH --output_model_path OUTPUT_MODEL_PATH [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the scikit-learn KMeans method.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_plot_path OUTPUT_PLOT_PATH
Path to the clustering plot. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_results_path OUTPUT_RESULTS_PATH
Path to the clustered dataset. Accepted formats: csv.
--output_model_path OUTPUT_MODEL_PATH
Path to the output model file. Accepted formats: pkl.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_results_path (string): Path to the clustered dataset. File type: output. Sample file. Accepted formats: CSV
- output_model_path (string): Path to the output model file. File type: output. Sample file. Accepted formats: PKL
- output_plot_path (string): Path to the clustering plot. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- predictors (object): ({}) Features or columns from your dataset you want to use for fitting. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] } or { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- clusters (integer): (3) The number of clusters to form as well as the number of centroids to generate.
- plots (array): (None) List of dictionaries with all plots you want to generate. Only 2D or 3D plots are accepted. Format: [ { "title": "Plot 1", "features": ["feat1", "feat2"] } ].
- random_state_method (integer): (5) Determines random number generation for centroid initialization.
- scale (boolean): (False) Whether or not to scale the input dataset.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
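For intuition about what the wrapped KMeans method does with the selected predictors, one Lloyd iteration (assign each point to its nearest centroid, then move each centroid to its cluster mean) can be sketched in plain Python. This is a 1-D illustration only; the real scikit-learn method is multi-dimensional and iterates until convergence:

```python
# One Lloyd iteration of k-means: assignment step + centroid update.

def kmeans_step(points, centroids):
    clusters = [[] for _ in centroids]
    for p in points:  # assign each point to the nearest centroid
        best = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[best].append(p)
    # move each centroid to the mean of its cluster (keep it if empty)
    return [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]

print(kmeans_step([1.0, 1.2, 5.0, 5.2], [0.0, 6.0]))  # centroids move toward 1.1 and 5.1
```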
YAML¶
Common config file¶
properties:
clusters: 3
plots:
- features:
- sepal_length
- sepal_width
title: Plot 1
- features:
- petal_length
- petal_width
title: Plot 2
- features:
- sepal_length
- sepal_width
- petal_length
title: Plot 3
- features:
- petal_length
- petal_width
- sepal_width
title: Plot 4
- features:
- sepal_length
- petal_width
title: Plot 5
predictors:
columns:
- sepal_length
- sepal_width
- petal_length
- petal_width
scale: true
Command line¶
k_means --config config_k_means.yml --input_dataset_path dataset_k_means.csv --output_results_path ref_output_results_k_means.csv --output_model_path ref_output_model_k_means.pkl --output_plot_path ref_output_plot_k_means.png
JSON¶
Common config file¶
{
"properties": {
"predictors": {
"columns": [
"sepal_length",
"sepal_width",
"petal_length",
"petal_width"
]
},
"clusters": 3,
"plots": [
{
"title": "Plot 1",
"features": [
"sepal_length",
"sepal_width"
]
},
{
"title": "Plot 2",
"features": [
"petal_length",
"petal_width"
]
},
{
"title": "Plot 3",
"features": [
"sepal_length",
"sepal_width",
"petal_length"
]
},
{
"title": "Plot 4",
"features": [
"petal_length",
"petal_width",
"sepal_width"
]
},
{
"title": "Plot 5",
"features": [
"sepal_length",
"petal_width"
]
}
],
"scale": true
}
}
Command line¶
k_means --config config_k_means.json --input_dataset_path dataset_k_means.csv --output_results_path ref_output_results_k_means.csv --output_model_path ref_output_model_k_means.pkl --output_plot_path ref_output_plot_k_means.png
Map_variables¶
Maps the values of a given dataset.
Get help¶
Command:
map_variables -h
usage: map_variables [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_dataset_path OUTPUT_DATASET_PATH
Maps the values of a given dataset.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_dataset_path OUTPUT_DATASET_PATH
Path to the output dataset. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_dataset_path (string): Path to the output dataset. File type: output. Sample file. Accepted formats: CSV
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- targets (object): ({}) Independent variables or columns from your dataset you want to map. If None is given, all the columns will be taken. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] } or { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
YAML¶
Common config file¶
properties:
targets:
columns:
- target
Command line¶
map_variables --config config_map_variables.yml --input_dataset_path dataset_map_variables.csv --output_dataset_path ref_output_dataset_map_variables.csv
JSON¶
Common config file¶
{
"properties": {
"targets": {
"columns": [
"target"
]
}
}
}
Command line¶
map_variables --config config_map_variables.json --input_dataset_path dataset_map_variables.csv --output_dataset_path ref_output_dataset_map_variables.csv
Regression_predict¶
Makes predictions from an input dataset and a given regression model.
Get help¶
Command:
regression_predict -h
usage: regression_predict [-h] [--config CONFIG] --input_model_path INPUT_MODEL_PATH --output_results_path OUTPUT_RESULTS_PATH [--input_dataset_path INPUT_DATASET_PATH]
Makes predictions from an input dataset and a given regression model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--input_dataset_path INPUT_DATASET_PATH
Path to the dataset to predict. Accepted formats: csv.
required arguments:
--input_model_path INPUT_MODEL_PATH
Path to the input model. Accepted formats: pkl.
--output_results_path OUTPUT_RESULTS_PATH
Path to the output results file. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_model_path (string): Path to the input model. File type: input. Sample file. Accepted formats: PKL
- input_dataset_path (string): Path to the dataset to predict. File type: input. Sample file. Accepted formats: CSV
- output_results_path (string): Path to the output results file. File type: output. Sample file. Accepted formats: CSV
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- predictions (array): (None) List of dictionaries with the values from which you want to predict targets. It is taken into account only if input_dataset_path is not provided. Format: [{ "var1": 1.0, "var2": 2.0 }, { "var1": 4.0, "var2": 2.7 }] for datasets with headers and [[ 1.0, 2.0 ], [ 4.0, 2.7 ]] for datasets without headers.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
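The with-headers form of "predictions" is a list of dictionaries, so before the loaded model can consume it, each dictionary has to be flattened into a feature row in a fixed column order. A hypothetical helper (not part of biobb_ml) showing that step:

```python
# Flatten the list-of-dicts "predictions" property into feature rows,
# using an explicit column order so every row is aligned the same way.

def predictions_to_rows(predictions, feature_order):
    return [[row[name] for name in feature_order] for row in predictions]

preds = [{"LSTAT": 4.98, "ZN": 18.0, "RM": 6.575, "AGE": 65.2}]
print(predictions_to_rows(preds, ["LSTAT", "ZN", "RM", "AGE"]))
# [[4.98, 18.0, 6.575, 65.2]]
```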
YAML¶
Common config file¶
properties:
predictions:
- AGE: 65.2
LSTAT: 4.98
RM: 6.575
ZN: 18.0
- AGE: 78.9
LSTAT: 9.14
RM: 6.421
ZN: 0.0
Command line¶
regression_predict --config config_regression_predict.yml --input_model_path model_regression_predict.pkl --input_dataset_path input_regression_predict.csv --output_results_path ref_output_regression_predict.csv
JSON¶
Common config file¶
{
"properties": {
"predictions": [
{
"LSTAT": 4.98,
"ZN": 18.0,
"RM": 6.575,
"AGE": 65.2
},
{
"LSTAT": 9.14,
"ZN": 0.0,
"RM": 6.421,
"AGE": 78.9
}
]
}
}
Command line¶
regression_predict --config config_regression_predict.json --input_model_path model_regression_predict.pkl --input_dataset_path input_regression_predict.csv --output_results_path ref_output_regression_predict.csv
Dendrogram¶
Generates a dendrogram from a given dataset.
Get help¶
Command:
dendrogram -h
usage: dendrogram [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_plot_path OUTPUT_PLOT_PATH
Generates a dendrogram from a given dataset
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_plot_path OUTPUT_PLOT_PATH
Path to the dendrogram plot. Accepted formats: png.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_plot_path (string): Path to the dendrogram plot. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- features (object): ({}) Independent variables or columns from your dataset you want to compare. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] } or { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
YAML¶
Common config file¶
properties:
features:
columns:
- Satisfaction
- Loyalty
Command line¶
dendrogram --config config_dendrogram.yml --input_dataset_path dataset_dendrogram.csv --output_plot_path ref_output_plot_dendrogram.png
JSON¶
Common config file¶
{
"properties": {
"features": {
"columns": [
"Satisfaction",
"Loyalty"
]
}
}
}
Command line¶
dendrogram --config config_dendrogram.json --input_dataset_path dataset_dendrogram.csv --output_plot_path ref_output_plot_dendrogram.png
Agglomerative_coefficient¶
Wrapper of the scikit-learn AgglomerativeClustering method.
Get help¶
Command:
agglomerative_coefficient -h
usage: agglomerative_coefficient [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_results_path OUTPUT_RESULTS_PATH [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the scikit-learn AgglomerativeClustering method.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_plot_path OUTPUT_PLOT_PATH
Path to the elbow and gap methods plot. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_results_path OUTPUT_RESULTS_PATH
Path to the gap values list. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_results_path (string): Path to the gap values list. File type: output. Sample file. Accepted formats: CSV
- output_plot_path (string): Path to the elbow method and gap statistics plot. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- predictors (object): ({}) Features or columns from your dataset you want to use for fitting. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] } or { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- max_clusters (integer): (6) Maximum number of clusters to evaluate.
- affinity (string): (euclidean) Metric used to compute the linkage. If linkage is "ward", only "euclidean" is accepted.
- linkage (string): (ward) The linkage criterion determines which distance to use between sets of observations. The algorithm merges the pairs of clusters that minimize this criterion.
- scale (boolean): (False) Whether or not to scale the input dataset.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
YAML¶
Common config file¶
properties:
max_clusters: 10
predictors:
columns:
- sepal_length
- sepal_width
scale: true
Command line¶
agglomerative_coefficient --config config_agglomerative_coefficient.yml --input_dataset_path dataset_agglomerative_coefficient.csv --output_results_path ref_output_results_agglomerative_coefficient.csv --output_plot_path ref_output_plot_agglomerative_coefficient.png
JSON¶
Common config file¶
{
"properties": {
"predictors": {
"columns": [
"sepal_length",
"sepal_width"
]
},
"max_clusters": 10,
"scale": true
}
}
Command line¶
agglomerative_coefficient --config config_agglomerative_coefficient.json --input_dataset_path dataset_agglomerative_coefficient.csv --output_results_path ref_output_results_agglomerative_coefficient.csv --output_plot_path ref_output_plot_agglomerative_coefficient.png
Recurrent_neural_network¶
Wrapper of the TensorFlow Keras LSTM method using Recurrent Neural Networks.
Get help¶
Command:
recurrent_neural_network -h
usage: recurrent_neural_network [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_model_path OUTPUT_MODEL_PATH [--output_test_table_path OUTPUT_TEST_TABLE_PATH] [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the TensorFlow Keras LSTM method.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_test_table_path OUTPUT_TEST_TABLE_PATH
Path to the test table file. Accepted formats: csv.
--output_plot_path OUTPUT_PLOT_PATH
Loss, accuracy and MSE plots. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_model_path OUTPUT_MODEL_PATH
Path to the output model file. Accepted formats: h5.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_model_path (string): Path to the output model file. File type: output. Sample file. Accepted formats: H5
- output_test_table_path (string): Path to the test table file. File type: output. Sample file. Accepted formats: CSV
- output_plot_path (string): Loss, accuracy and MSE plots. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- target (object): ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. In case of multiple formats, the first one will be picked.
- validation_size (number): (0.2) Represents the proportion of the dataset to include in the validation split. It should be between 0.0 and 1.0.
- window_size (integer): (5) Number of steps in each window used to train the model.
- test_size (integer): (5) Represents the number of samples of the dataset to include in the test split.
- hidden_layers (array): (None) List of dictionaries with hidden layer values. Format: [ { "size": 50, "activation": "relu" } ].
- optimizer (string): (Adam) Name of the optimizer instance.
- learning_rate (number): (0.02) Determines the step size at each iteration while moving toward a minimum of the loss function.
- batch_size (integer): (100) Number of samples per gradient update.
- max_epochs (integer): (100) Number of epochs to train the model. As early stopping is enabled, this is a maximum.
- normalize_cm (boolean): (False) Whether or not to normalize the confusion matrix.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
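To see what "window_size" does, here is a plain-Python sketch of the usual windowing step that turns a 1-D series into supervised LSTM samples: each window of consecutive values becomes an input, and the value right after it becomes the target. The exact biobb preprocessing may differ in detail.

```python
# Build (window, next value) training pairs from a 1-D series.

def make_windows(series, window_size):
    X, y = [], []
    for i in range(len(series) - window_size):
        X.append(series[i:i + window_size])   # input window
        y.append(series[i + window_size])     # value to predict
    return X, y

X, y = make_windows([10, 20, 30, 40, 50], 3)
print(X)  # [[10, 20, 30], [20, 30, 40]]
print(y)  # [40, 50]
```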
YAML¶
Common config file¶
properties:
batch_size: 32
hidden_layers:
- activation: relu
size: 100
- activation: relu
size: 50
- activation: relu
size: 50
learning_rate: 0.01
max_epochs: 50
optimizer: Adam
target:
index: 1
test_size: 12
validation_size: 0.2
window_size: 5
Command line¶
recurrent_neural_network --config config_recurrent_neural_network.yml --input_dataset_path dataset_recurrent.csv --output_model_path ref_output_model_recurrent.h5 --output_test_table_path ref_output_test_recurrent.csv --output_plot_path ref_output_plot_recurrent.png
JSON¶
Common config file¶
{
"properties": {
"target": {
"index": 1
},
"window_size": 5,
"validation_size": 0.2,
"test_size": 12,
"hidden_layers": [
{
"size": 100,
"activation": "relu"
},
{
"size": 50,
"activation": "relu"
},
{
"size": 50,
"activation": "relu"
}
],
"optimizer": "Adam",
"learning_rate": 0.01,
"batch_size": 32,
"max_epochs": 50
}
}
Command line¶
recurrent_neural_network --config config_recurrent_neural_network.json --input_dataset_path dataset_recurrent.csv --output_model_path ref_output_model_recurrent.h5 --output_test_table_path ref_output_test_recurrent.csv --output_plot_path ref_output_plot_recurrent.png
Dummy_variables¶
Converts categorical variables into dummy/indicator variables (binaries).
Get help¶
Command:
dummy_variables -h
usage: dummy_variables [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_dataset_path OUTPUT_DATASET_PATH
Converts categorical variables from a given dataset into dummy/indicator variables.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_dataset_path OUTPUT_DATASET_PATH
Path to the output dataset. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_dataset_path (string): Path to the output dataset. File type: output. Sample file. Accepted formats: CSV
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- targets (object): ({}) Independent variables or columns from your dataset you want to convert into dummy variables. If None is given, all the columns will be taken. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] } or { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
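Conceptually, dummy/indicator encoding replaces one categorical column with one binary column per category (akin to pandas.get_dummies). A minimal plain-Python sketch, for illustration only:

```python
# One-hot encode a single categorical column: each distinct category
# becomes a 0/1 indicator column, in sorted category order.

def to_dummies(values):
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

print(to_dummies(["no", "yes", "no"]))  # [[1, 0], [0, 1], [1, 0]]
```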
YAML¶
Common config file¶
properties:
targets:
columns:
- view
Command line¶
dummy_variables --config config_dummy_variables.yml --input_dataset_path dataset_dummy_variables.csv --output_dataset_path ref_output_dataset_dummy_variables.csv
JSON¶
Common config file¶
{
"properties": {
"targets": {
"columns": [
"view"
]
}
}
}
Command line¶
dummy_variables --config config_dummy_variables.json --input_dataset_path dataset_dummy_variables.csv --output_dataset_path ref_output_dataset_dummy_variables.csv
Clustering_predict¶
Makes predictions from an input dataset and a given clustering model.
Get help¶
Command:
clustering_predict -h
usage: clustering_predict [-h] [--config CONFIG] --input_model_path INPUT_MODEL_PATH --output_results_path OUTPUT_RESULTS_PATH [--input_dataset_path INPUT_DATASET_PATH]
Makes predictions from an input dataset and a given clustering model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--input_dataset_path INPUT_DATASET_PATH
Path to the dataset to predict. Accepted formats: csv.
required arguments:
--input_model_path INPUT_MODEL_PATH
Path to the input model. Accepted formats: pkl.
--output_results_path OUTPUT_RESULTS_PATH
Path to the output results file. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_model_path (string): Path to the input model. File type: input. Sample file. Accepted formats: PKL
- input_dataset_path (string): Path to the dataset to predict. File type: input. Sample file. Accepted formats: CSV
- output_results_path (string): Path to the output results file. File type: output. Sample file. Accepted formats: CSV
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- predictions (array): (None) List of dictionaries with the values from which you want to predict targets. It is taken into account only if input_dataset_path is not provided. Format: [{ "var1": 1.0, "var2": 2.0 }, { "var1": 4.0, "var2": 2.7 }] for datasets with headers and [[ 1.0, 2.0 ], [ 4.0, 2.7 ]] for datasets without headers.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
YAML¶
Common config file¶
properties:
predictions:
- petal_length: 1.4
petal_width: 0.2
sepal_length: 5.1
sepal_width: 3.5
- petal_length: 5.2
petal_width: 2.3
sepal_length: 6.7
sepal_width: 3.0
- petal_length: 5.0
petal_width: 1.9
sepal_length: 6.3
sepal_width: 2.5
Command line¶
clustering_predict --config config_clustering_predict.yml --input_model_path model_clustering_predict.pkl --input_dataset_path input_clustering_predict.csv --output_results_path ref_output_results_clustering_predict.csv
JSON¶
Common config file¶
{
"properties": {
"predictions": [
{
"sepal_length": 5.1,
"sepal_width": 3.5,
"petal_length": 1.4,
"petal_width": 0.2
},
{
"sepal_length": 6.7,
"sepal_width": 3.0,
"petal_length": 5.2,
"petal_width": 2.3
},
{
"sepal_length": 6.3,
"sepal_width": 2.5,
"petal_length": 5.0,
"petal_width": 1.9
}
]
}
}
Command line¶
clustering_predict --config config_clustering_predict.json --input_model_path model_clustering_predict.pkl --input_dataset_path input_clustering_predict.csv --output_results_path ref_output_results_clustering_predict.csv
Undersampling¶
Wrapper of most of the imblearn.under_sampling methods.
Get help¶
Command:
undersampling -h
usage: undersampling [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_dataset_path OUTPUT_DATASET_PATH
Wrapper of most of the imblearn.under_sampling methods.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_dataset_path OUTPUT_DATASET_PATH
Path to the output dataset. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_dataset_path (string): Path to the output dataset. File type: output. Sample file. Accepted formats: CSV
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- method (string): (None) Undersampling method. It's a mandatory property.
- type (string): (None) Type of undersampling. It's a mandatory property.
- target (object): ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. In case of multiple formats, the first one will be picked.
- evaluate (boolean): (False) Whether or not to evaluate the dataset before and after applying the resampling.
- evaluate_splits (integer): (3) Number of folds to be applied by the Repeated Stratified K-Fold evaluation method. Must be at least 2.
- evaluate_repeats (integer): (3) Number of times the Repeated Stratified K-Fold cross validator needs to be repeated.
- n_bins (integer): (5) Only for regression undersampling. The number of classes that the user wants to generate with the target data.
- balanced_binning (boolean): (False) Only for regression undersampling. Decides whether samples are to be distributed roughly equally across all classes.
- sampling_strategy (object): ({ "target": "auto" }) Sampling information to sample the data set. Formats: { "target": "auto" }, { "ratio": 0.3 }, { "dict": { 0: 300, 1: 200, 2: 100 } } or { "list": [0, 2, 3] }. When "target", specify the class targeted by the resampling; the number of samples in the different classes will be equalized; possible choices are: majority (resample only the majority class), not minority (resample all classes but the minority class), not majority (resample all classes but the majority class), all (resample all classes), auto (equivalent to "not minority"). When "ratio", it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling (ONLY IN CASE OF BINARY CLASSIFICATION). When "dict", the keys correspond to the targeted classes and the values to the desired number of samples for each targeted class. When "list", the list contains the classes targeted by the resampling.
- version (integer): (1) Only for the NearMiss method. Version of NearMiss to use.
- n_neighbors (integer): (1) Only for the NearMiss, CondensedNearestNeighbour, EditedNearestNeighbours and NeighbourhoodCleaningRule methods. Size of the neighbourhood to consider to compute the average distance to the minority point samples.
- threshold_cleaning (number): (0.5) Only for the NeighbourhoodCleaningRule method. Threshold used to decide whether to consider a class during the cleaning after applying ENN.
- random_state_method (integer): (5) Only for the RandomUnderSampler and ClusterCentroids methods. Controls the randomization of the algorithm.
- random_state_evaluate (integer): (5) Controls the shuffling applied to the Repeated Stratified K-Fold evaluation method.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
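To make the "ratio" format of sampling_strategy concrete for the binary case, here is a small plain-Python arithmetic sketch with hypothetical class counts; undersample_counts is an illustrative helper, not part of the package:

```python
def undersample_counts(counts, ratio):
    """Given per-class counts for a binary problem, return the counts
    after undersampling the majority class so that
    minority / majority == ratio (the "ratio" sampling_strategy format)."""
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    new = dict(counts)
    # only the majority class is reduced; the minority class is untouched
    new[majority] = int(counts[minority] / ratio)
    return new

print(undersample_counts({0: 900, 1: 100}, 0.5))  # {0: 200, 1: 100}
```

With { "ratio": 0.5 }, the majority class is cut down until the minority class holds half as many samples as the majority class.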
YAML¶
Common config file¶
properties:
evaluate: true
method: enn
n_bins: 10
n_neighbors: 3
target:
column: VALUE
type: regression
Command line¶
undersampling --config config_undersampling.yml --input_dataset_path dataset_resampling.csv --output_dataset_path ref_output_undersampling.csv
JSON¶
Common config file¶
{
"properties": {
"method": "enn",
"type": "regression",
"target": {
"column": "VALUE"
},
"evaluate": true,
"n_bins": 10,
"n_neighbors": 3
}
}
Command line¶
undersampling --config config_undersampling.json --input_dataset_path dataset_resampling.csv --output_dataset_path ref_output_undersampling.csv
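For regression resampling (type: regression above), n_bins discretises the continuous target into classes before balancing. One common scheme is equal-width binning; the sketch below is an illustration of that idea with hypothetical values, not the package's exact implementation:

```python
def bin_target(values, n_bins):
    """Equal-width binning of a continuous target into n_bins classes,
    mimicking how regression resampling turns a numeric target into
    class labels before balancing them."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # clamp the maximum value into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

print(bin_target([0.0, 2.5, 5.0, 7.5, 10.0], 2))  # [0, 0, 1, 1, 1]
```

With balanced_binning enabled, bin edges would instead be chosen so that each class receives roughly the same number of samples.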
Correlation_matrix¶
Generates a correlation matrix from a given dataset.
Get help¶
Command:
correlation_matrix -h
usage: correlation_matrix [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_plot_path OUTPUT_PLOT_PATH
Generates a correlation matrix from a given dataset.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_plot_path OUTPUT_PLOT_PATH
Path to the correlation matrix plot. Accepted formats: png.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_plot_path (string): Path to the correlation matrix plot. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- features (object): ({}) Independent variables or columns from your dataset you want to compare. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] }, { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
YAML¶
Common config file¶
properties:
features:
columns:
- sepal_length
- sepal_width
- petal_length
- petal_width
Command line¶
correlation_matrix --config config_correlation_matrix.yml --input_dataset_path dataset_correlation_matrix.csv --output_plot_path ref_output_plot_correlation_matrix.png
JSON¶
Common config file¶
{
"properties": {
"features": {
"columns": [
"sepal_length",
"sepal_width",
"petal_length",
"petal_width"
]
}
}
}
Command line¶
correlation_matrix --config config_correlation_matrix.json --input_dataset_path dataset_correlation_matrix.csv --output_plot_path ref_output_plot_correlation_matrix.png
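Each cell of the correlation matrix is a pairwise Pearson correlation coefficient between two of the selected feature columns. A minimal plain-Python sketch of that statistic (illustrative only):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length columns,
    the statistic behind each cell of the correlation matrix."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# two perfectly linearly related columns correlate at 1.0
print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # 1.0
```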
Oversampling¶
Wrapper of most of the imblearn.over_sampling methods.
Get help¶
Command:
oversampling -h
usage: oversampling [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_dataset_path OUTPUT_DATASET_PATH
Wrapper of most of the imblearn.over_sampling methods.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_dataset_path OUTPUT_DATASET_PATH
Path to the output dataset. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_dataset_path (string): Path to the output dataset. File type: output. Sample file. Accepted formats: CSV
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- method (string): (None) Oversampling method. It's a mandatory property.
- type (string): (None) Type of oversampling. It's a mandatory property.
- target (object): ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. In case of multiple formats, the first one will be picked.
- evaluate (boolean): (False) Whether or not to evaluate the dataset before and after applying the resampling.
- evaluate_splits (integer): (3) Number of folds to be applied by the Repeated Stratified K-Fold evaluation method. Must be at least 2.
- evaluate_repeats (integer): (3) Number of times the Repeated Stratified K-Fold cross validator needs to be repeated.
- n_bins (integer): (5) Only for regression oversampling. The number of classes that the user wants to generate with the target data.
- balanced_binning (boolean): (False) Only for regression oversampling. Decides whether samples are to be distributed roughly equally across all classes.
- sampling_strategy (object): ({ "target": "auto" }) Sampling information to sample the data set. Formats: { "target": "auto" }, { "ratio": 0.3 }, { "dict": { 0: 300, 1: 200, 2: 100 } } or { "list": [0, 2, 3] }. When "target", specify the class targeted by the resampling; the number of samples in the different classes will be equalized; possible choices are: minority (resample only the minority class), not minority (resample all classes but the minority class), not majority (resample all classes but the majority class), all (resample all classes), auto (equivalent to "not majority"). When "ratio", it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling (ONLY IN CASE OF BINARY CLASSIFICATION). When "dict", the keys correspond to the targeted classes and the values to the desired number of samples for each targeted class. When "list", the list contains the classes targeted by the resampling.
- k_neighbors (integer): (5) Only for the SMOTE, BorderlineSMOTE, SVMSMOTE and ADASYN methods. The number of nearest neighbours used to construct synthetic samples.
- random_state_method (integer): (5) Controls the randomization of the algorithm.
- random_state_evaluate (integer): (5) Controls the shuffling applied to the Repeated Stratified K-Fold evaluation method.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
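The k_neighbors property controls how SMOTE-style methods build synthetic samples: each new sample is a linear interpolation between a minority point and one of its nearest minority neighbours. A plain-Python sketch of a single interpolation step (deterministic gap instead of the random one the real methods draw):

```python
def smote_point(x, neighbor, gap):
    """One SMOTE-style synthetic sample: a point on the segment between
    a minority sample x and one of its k_neighbors (gap in [0, 1])."""
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]

# halfway between (1, 1) and (3, 5)
print(smote_point([1.0, 1.0], [3.0, 5.0], 0.5))  # [2.0, 3.0]
```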
YAML¶
Common config file¶
properties:
evaluate: true
method: random
n_bins: 10
sampling_strategy:
target: minority
target:
column: VALUE
type: regression
Command line¶
oversampling --config config_oversampling.yml --input_dataset_path dataset_resampling.csv --output_dataset_path ref_output_oversampling.csv
JSON¶
Common config file¶
{
"properties": {
"method": "random",
"type": "regression",
"target": {
"column": "VALUE"
},
"evaluate": true,
"n_bins": 10,
"sampling_strategy": {
"target": "minority"
}
}
}
Command line¶
oversampling --config config_oversampling.json --input_dataset_path dataset_resampling.csv --output_dataset_path ref_output_oversampling.csv
Neural_network_decode¶
Wrapper of the TensorFlow Keras LSTM method for decoding.
Get help¶
Command:
neural_network_decode -h
usage: neural_network_decode [-h] [--config CONFIG] --input_decode_path INPUT_DECODE_PATH --input_model_path INPUT_MODEL_PATH --output_decode_path OUTPUT_DECODE_PATH [--output_predict_path OUTPUT_PREDICT_PATH]
Wrapper of the TensorFlow Keras LSTM method for decoding.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_predict_path OUTPUT_PREDICT_PATH
Path to the output predict file. Accepted formats: csv.
required arguments:
--input_decode_path INPUT_DECODE_PATH
Path to the input decode dataset. Accepted formats: csv.
--input_model_path INPUT_MODEL_PATH
Path to the input model. Accepted formats: h5.
--output_decode_path OUTPUT_DECODE_PATH
Path to the output decode file. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_decode_path (string): Path to the input decode dataset. File type: input. Sample file. Accepted formats: CSV
- input_model_path (string): Path to the input model. File type: input. Sample file. Accepted formats: H5
- output_decode_path (string): Path to the output decode file. File type: output. Sample file. Accepted formats: CSV
- output_predict_path (string): Path to the output predict file. File type: output. Sample file. Accepted formats: CSV
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
YAML¶
Common config file¶
properties:
remove_tmp: false
Command line¶
neural_network_decode --config config_neural_network_decode.yml --input_decode_path dataset_decoder.csv --input_model_path input_model_decoder.h5 --output_decode_path ref_output_decode_decoder.csv --output_predict_path ref_output_predict_decoder.csv
JSON¶
Common config file¶
{
"properties": {
"remove_tmp": false
}
}
Command line¶
neural_network_decode --config config_neural_network_decode.json --input_decode_path dataset_decoder.csv --input_model_path input_model_decoder.h5 --output_decode_path ref_output_decode_decoder.csv --output_predict_path ref_output_predict_decoder.csv
Pls_regression¶
Wrapper of the scikit-learn PLSRegression method.
Get help¶
Command:
pls_regression -h
usage: pls_regression [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_results_path OUTPUT_RESULTS_PATH [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the scikit-learn PLSRegression method.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_plot_path OUTPUT_PLOT_PATH
Path to the R2 cross-validation plot. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_results_path OUTPUT_RESULTS_PATH
Table with R2 and MSE for calibration and cross-validation data. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_results_path (string): Table with R2 and MSE for calibration and cross-validation data. File type: output. Sample file. Accepted formats: CSV
- output_plot_path (string): Path to the R2 cross-validation plot. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- features (object): ({}) Features or columns from your dataset you want to use for fitting. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] }, { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- target (object): ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. In case of multiple formats, the first one will be picked.
- n_components (integer): (5) Maximum number of components to use by default for PLS queries.
- cv (integer): (10) Number of folds in the cross-validation splitting strategy. Value must be between 2 and the number of samples in the dataset.
- scale (boolean): (False) Whether or not to scale the input dataset.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
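The output table reports R2 and MSE for both calibration and cross-validation data. A plain-Python sketch of the two statistics (illustrative helper, not the package's code):

```python
def r2_and_mse(y_true, y_pred):
    """R2 (coefficient of determination) and MSE (mean squared error),
    the two statistics reported per fit."""
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot, ss_res / n

# a perfect fit gives R2 = 1 and MSE = 0
print(r2_and_mse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # (1.0, 0.0)
```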
YAML¶
Common config file¶
properties:
cv: 10
features:
range:
- - 0
- 29
n_components: 12
scale: true
target:
index: 30
Command line¶
pls_regression --config config_pls_regression.yml --input_dataset_path dataset_pls_regression.csv --output_results_path ref_output_results_pls_regression.csv --output_plot_path ref_output_plot_pls_regression.png
JSON¶
Common config file¶
{
"properties": {
"features": {
"range": [
[
0,
29
]
]
},
"target": {
"index": 30
},
"n_components": 12,
"cv": 10,
"scale": true
}
}
Command line¶
pls_regression --config config_pls_regression.json --input_dataset_path dataset_pls_regression.csv --output_results_path ref_output_results_pls_regression.csv --output_plot_path ref_output_plot_pls_regression.png
K_means_coefficient¶
Wrapper of the scikit-learn KMeans method.
Get help¶
Command:
k_means_coefficient -h
usage: k_means_coefficient [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_results_path OUTPUT_RESULTS_PATH [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the scikit-learn KMeans method.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_plot_path OUTPUT_PLOT_PATH
Path to the elbow and gap methods plot. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_results_path OUTPUT_RESULTS_PATH
Table with WCSS (elbow method), Gap and Silhouette coefficients for each cluster. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_results_path (string): Table with WCSS (elbow method), Gap and Silhouette coefficients for each cluster. File type: output. Sample file. Accepted formats: CSV
- output_plot_path (string): Path to the elbow method and gap statistics plot. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- predictors (object): ({}) Features or columns from your dataset you want to use for fitting. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] }, { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- max_clusters (integer): (6) Maximum number of clusters to use by default for k-means queries.
- random_state_method (integer): (5) Determines random number generation for centroid initialization.
- scale (boolean): (False) Whether or not to scale the input dataset.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
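The WCSS column in the output table is the within-cluster sum of squares, evaluated for each candidate cluster count up to max_clusters; the elbow method looks for the point where it stops dropping sharply. A plain-Python sketch of the statistic (hypothetical points and centroids):

```python
def wcss(points, centroids, labels):
    """Within-cluster sum of squares: squared distance of each point
    to the centroid of its assigned cluster, summed over all points."""
    total = 0.0
    for p, label in zip(points, labels):
        c = centroids[label]
        total += sum((a - b) ** 2 for a, b in zip(p, c))
    return total

pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
print(wcss(pts, [(0.5, 0.0), (10.0, 0.0)], [0, 0, 1]))  # 0.5
```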
YAML¶
Common config file¶
properties:
max_clusters: 10
predictors:
columns:
- sepal_length
- sepal_width
scale: true
Command line¶
k_means_coefficient --config config_k_means_coefficient.yml --input_dataset_path dataset_k_means_coefficient.csv --output_results_path ref_output_results_k_means_coefficient.csv --output_plot_path ref_output_plot_k_means_coefficient.png
JSON¶
Common config file¶
{
"properties": {
"predictors": {
"columns": [
"sepal_length",
"sepal_width"
]
},
"max_clusters": 10,
"scale": true
}
}
Command line¶
k_means_coefficient --config config_k_means_coefficient.json --input_dataset_path dataset_k_means_coefficient.csv --output_results_path ref_output_results_k_means_coefficient.csv --output_plot_path ref_output_plot_k_means_coefficient.png
Spectral_coefficient¶
Wrapper of the scikit-learn SpectralClustering method.
Get help¶
Command:
spectral_coefficient -h
usage: spectral_coefficient [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_results_path OUTPUT_RESULTS_PATH [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the scikit-learn SpectralClustering method.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_plot_path OUTPUT_PLOT_PATH
Path to the elbow and gap methods plot. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_results_path OUTPUT_RESULTS_PATH
Table with WCSS (elbow method), Gap and Silhouette coefficients for each cluster. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_results_path (string): Table with WCSS (elbow method), Gap and Silhouette coefficients for each cluster. File type: output. Sample file. Accepted formats: CSV
- output_plot_path (string): Path to the elbow method and gap statistics plot. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- predictors (object): ({}) Features or columns from your dataset you want to use for fitting. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] }, { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- max_clusters (integer): (6) Maximum number of clusters to use by default for spectral clustering queries.
- random_state_method (integer): (5) A pseudo random number generator used for the initialization of the lobpcg eigenvectors decomposition when eigen_solver='amg' and by the K-Means initialization.
- scale (boolean): (False) Whether or not to scale the input dataset.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
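The Silhouette column in the output table scores each candidate cluster count: for one sample, the coefficient combines its mean intra-cluster distance a with its mean distance to the nearest other cluster b. A plain-Python sketch of that per-sample formula (illustrative values):

```python
def silhouette(a, b):
    """Silhouette coefficient for one sample, given its mean
    intra-cluster distance a and mean nearest-cluster distance b.
    Ranges from -1 (badly placed) to +1 (well clustered)."""
    return (b - a) / max(a, b)

# sample much closer to its own cluster than to the nearest other one
print(round(silhouette(1.0, 3.0), 3))  # 0.667
```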
YAML¶
Common config file¶
properties:
max_clusters: 10
predictors:
columns:
- sepal_length
- sepal_width
scale: true
Command line¶
spectral_coefficient --config config_spectral_coefficient.yml --input_dataset_path dataset_spectral_coefficient.csv --output_results_path ref_output_results_spectral_coefficient.csv --output_plot_path ref_output_plot_spectral_coefficient.png
JSON¶
Common config file¶
{
"properties": {
"predictors": {
"columns": [
"sepal_length",
"sepal_width"
]
},
"max_clusters": 10,
"scale": true
}
}
Command line¶
spectral_coefficient --config config_spectral_coefficient.json --input_dataset_path dataset_spectral_coefficient.csv --output_results_path ref_output_results_spectral_coefficient.csv --output_plot_path ref_output_plot_spectral_coefficient.png
Neural_network_predict¶
Makes predictions from an input dataset and a given model.
Get help¶
Command:
neural_network_predict -h
usage: neural_network_predict [-h] [--config CONFIG] --input_model_path INPUT_MODEL_PATH --output_results_path OUTPUT_RESULTS_PATH [--input_dataset_path INPUT_DATASET_PATH]
Makes predictions from an input dataset and a given model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--input_dataset_path INPUT_DATASET_PATH
Path to the dataset to predict. Accepted formats: csv.
required arguments:
--input_model_path INPUT_MODEL_PATH
Path to the input model. Accepted formats: h5.
--output_results_path OUTPUT_RESULTS_PATH
Path to the output results file. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_model_path (string): Path to the input model. File type: input. Sample file. Accepted formats: H5
- input_dataset_path (string): Path to the dataset to predict. File type: input. Sample file. Accepted formats: CSV
- output_results_path (string): Path to the output results file. File type: output. Sample file. Accepted formats: CSV
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- predictions (array): (None) List of dictionaries with all values from which you want to predict targets. It will be taken into account only if input_dataset_path is not provided. Format: [{ 'var1': 1.0, 'var2': 2.0 }, { 'var1': 4.0, 'var2': 2.7 }] for datasets with headers and [[ 1.0, 2.0 ], [ 4.0, 2.7 ]] for datasets without headers.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
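The two predictions formats map onto the same row layout: with headers, each dictionary is reordered by column name; without headers, the inner lists are taken as rows directly. A plain-Python sketch of the with-headers case (rows_from_predictions is an illustrative helper, not part of the package):

```python
def rows_from_predictions(predictions, header):
    """Convert the list-of-dicts predictions format into ordered rows,
    matching the column order of a dataset with headers."""
    return [[record[col] for col in header] for record in predictions]

records = [{"ZN": 18.0, "RM": 6.575}, {"ZN": 0.0, "RM": 6.421}]
print(rows_from_predictions(records, ["ZN", "RM"]))
# [[18.0, 6.575], [0.0, 6.421]]
```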
YAML¶
Common config file¶
properties:
predictions:
- AGE: 65.2
LSTAT: 4.98
RM: 6.575
ZN: 18.0
- AGE: 78.9
LSTAT: 9.14
RM: 6.421
ZN: 0.0
- AGE: 61.1
LSTAT: 4.03
RM: 7.185
ZN: 0.0
Command line¶
neural_network_predict --config config_neural_network_predict.yml --input_model_path input_model_predict.h5 --input_dataset_path dataset_predict.csv --output_results_path ref_output_predict.csv
JSON¶
Common config file¶
{
"properties": {
"predictions": [
{
"ZN": 18.0,
"RM": 6.575,
"AGE": 65.2,
"LSTAT": 4.98
},
{
"ZN": 0.0,
"RM": 6.421,
"AGE": 78.9,
"LSTAT": 9.14
},
{
"ZN": 0.0,
"RM": 7.185,
"AGE": 61.1,
"LSTAT": 4.03
}
]
}
}
Command line¶
neural_network_predict --config config_neural_network_predict.json --input_model_path input_model_predict.h5 --input_dataset_path dataset_predict.csv --output_results_path ref_output_predict.csv
Drop_columns¶
Drops columns from a given dataset.
Get help¶
Command:
drop_columns -h
usage: drop_columns [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_dataset_path OUTPUT_DATASET_PATH
Drops columns from a given dataset.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_dataset_path OUTPUT_DATASET_PATH
Path to the output dataset. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_dataset_path (string): Path to the output dataset. File type: output. Sample file. Accepted formats: CSV
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- targets (object): ({}) Independent variables or columns from your dataset you want to drop. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] }, { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
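The "range" column-selection format resolves to a flat list of indexes. A plain-Python sketch of that expansion, assuming both endpoints of each range are inclusive (an assumption about the format, and expand_ranges is an illustrative helper, not part of the package):

```python
def expand_ranges(ranges):
    """Expand the "range" selection format, e.g. [[0, 20], [50, 102]],
    into an explicit list of column indexes (endpoints assumed inclusive)."""
    indexes = []
    for lo, hi in ranges:
        indexes.extend(range(lo, hi + 1))
    return indexes

print(expand_ranges([[0, 2], [5, 6]]))  # [0, 1, 2, 5, 6]
```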
YAML¶
Common config file¶
properties:
targets:
columns:
- WEIGHT
- SCORE
Command line¶
drop_columns --config config_drop_columns.yml --input_dataset_path dataset_drop.csv --output_dataset_path ref_output_drop.csv
JSON¶
Common config file¶
{
"properties": {
"targets": {
"columns": [
"WEIGHT",
"SCORE"
]
}
}
}
Command line¶
drop_columns --config config_drop_columns.json --input_dataset_path dataset_drop.csv --output_dataset_path ref_output_drop.csv
Dbscan¶
Wrapper of the scikit-learn DBSCAN method.
Get help¶
Command:
dbscan -h
usage: dbscan [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_results_path OUTPUT_RESULTS_PATH [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the scikit-learn DBSCAN method.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_plot_path OUTPUT_PLOT_PATH
Path to the clustering plot. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_results_path OUTPUT_RESULTS_PATH
Path to the clustered dataset. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_results_path (string): Path to the clustered dataset. File type: output. Sample file. Accepted formats: CSV
- output_plot_path (string): Path to the clustering plot. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- predictors (object): ({}) Features or columns from your dataset you want to use for fitting. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] }, { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- eps (number): (0.5) The maximum distance between two samples for one to be considered as in the neighborhood of the other.
- min_samples (integer): (5) The number of samples (or total weight) in a neighborhood for a point to be considered a core point. This includes the point itself.
- metric (string): (euclidean) The metric to use when calculating distance between instances in a feature array.
- plots (array): (None) List of dictionaries with all plots you want to generate. Only 2D or 3D plots accepted. Format: [ { 'title': 'Plot 1', 'features': ['feat1', 'feat2'] } ].
- scale (boolean): (False) Whether or not to scale the input dataset.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
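eps and min_samples interact through DBSCAN's core-point rule: a point is a core point when at least min_samples points (itself included) lie within distance eps of it. A plain-Python sketch of that test with hypothetical points:

```python
def is_core_point(point, points, eps, min_samples):
    """DBSCAN core-point test: at least min_samples points (the point
    itself included) lie within Euclidean distance eps."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbors = sum(1 for q in points if dist(point, q) <= eps)
    return neighbors >= min_samples

pts = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.4), (5.0, 5.0)]
print(is_core_point((0.0, 0.0), pts, 0.5, 3))  # True
```

Increasing eps or lowering min_samples makes more points qualify as core points, which in turn merges or grows clusters.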
YAML¶
Common config file¶
properties:
eps: 1.4
min_samples: 3
plots:
- features:
- sepal_length
- sepal_width
title: Plot 1
- features:
- petal_length
- petal_width
title: Plot 2
- features:
- sepal_length
- sepal_width
- petal_length
title: Plot 3
- features:
- petal_length
- petal_width
- sepal_width
title: Plot 4
- features:
- sepal_length
- petal_width
title: Plot 5
predictors:
columns:
- sepal_length
- sepal_width
- petal_length
- petal_width
scale: true
Command line¶
dbscan --config config_dbscan.yml --input_dataset_path dataset_dbscan.csv --output_results_path ref_output_results_dbscan.csv --output_plot_path ref_output_plot_dbscan.png
JSON¶
Common config file¶
{
"properties": {
"predictors": {
"columns": [
"sepal_length",
"sepal_width",
"petal_length",
"petal_width"
]
},
"eps": 1.4,
"min_samples": 3,
"plots": [
{
"title": "Plot 1",
"features": [
"sepal_length",
"sepal_width"
]
},
{
"title": "Plot 2",
"features": [
"petal_length",
"petal_width"
]
},
{
"title": "Plot 3",
"features": [
"sepal_length",
"sepal_width",
"petal_length"
]
},
{
"title": "Plot 4",
"features": [
"petal_length",
"petal_width",
"sepal_width"
]
},
{
"title": "Plot 5",
"features": [
"sepal_length",
"petal_width"
]
}
],
"scale": true
}
}
Command line¶
dbscan --config config_dbscan.json --input_dataset_path dataset_dbscan.csv --output_results_path ref_output_results_dbscan.csv --output_plot_path ref_output_plot_dbscan.png
K_neighbors_coefficient¶
Wrapper of the scikit-learn KNeighborsClassifier method.
Get help¶
Command:
k_neighbors_coefficient -h
usage: k_neighbors_coefficient [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_results_path OUTPUT_RESULTS_PATH [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the scikit-learn KNeighborsClassifier method.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_plot_path OUTPUT_PLOT_PATH
Path to the accuracy plot. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_results_path OUTPUT_RESULTS_PATH
Path to the accuracy values list. Accepted formats: csv.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_results_path (string): Path to the accuracy values list. File type: output. Sample file. Accepted formats: CSV
- output_plot_path (string): Path to the accuracy plot. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- independent_vars (object): ({}) Independent variables or columns from your dataset you want to train. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] }, { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- target (object): ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. In case of multiple formats, the first one will be picked.
- metric (string): (minkowski) The distance metric to use for the tree.
- max_neighbors (integer): (6) Maximum number of neighbors to use by default for kneighbors queries.
- random_state_train_test (integer): (5) Controls the shuffling applied to the data before applying the split.
- test_size (number): (0.2) Represents the proportion of the dataset to include in the test split. It should be between 0.0 and 1.0.
- scale (boolean): (False) Whether or not to scale the input dataset.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
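The default minkowski metric generalises the familiar distances: with p=2 it is the Euclidean distance, with p=1 the Manhattan distance. A plain-Python sketch of the formula (illustrative helper, not the scikit-learn implementation):

```python
def minkowski(a, b, p=2):
    """Minkowski distance between two points; p=2 is Euclidean,
    p=1 is Manhattan."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

print(minkowski([0.0, 0.0], [3.0, 4.0]))       # 5.0 (Euclidean)
print(minkowski([0.0, 0.0], [3.0, 4.0], p=1))  # 7.0 (Manhattan)
```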
YAML¶
Common config file¶
properties:
independent_vars:
columns:
- region
- tenure
- age
- marital
- address
- income
- ed
- employ
- retire
- gender
- reside
max_neighbors: 15
metric: minkowski
scale: true
target:
column: custcat
test_size: 0.2
Command line¶
k_neighbors_coefficient --config config_k_neighbors_coefficient.yml --input_dataset_path dataset_k_neighbors_coefficient.csv --output_results_path ref_output_test_k_neighbors_coefficient.csv --output_plot_path ref_output_plot_k_neighbors_coefficient.png
JSON¶
Common config file¶
{
"properties": {
"independent_vars": {
"columns": [
"region",
"tenure",
"age",
"marital",
"address",
"income",
"ed",
"employ",
"retire",
"gender",
"reside"
]
},
"target": {
"column": "custcat"
},
"metric": "minkowski",
"max_neighbors": 15,
"test_size": 0.2,
"scale": true
}
}
Command line¶
k_neighbors_coefficient --config config_k_neighbors_coefficient.json --input_dataset_path dataset_k_neighbors_coefficient.csv --output_results_path ref_output_test_k_neighbors_coefficient.csv --output_plot_path ref_output_plot_k_neighbors_coefficient.png
Classification_neural_network¶
Wrapper of the TensorFlow Keras Sequential method for classification.
Get help¶
Command:
classification_neural_network -h
usage: classification_neural_network [-h] [--config CONFIG] --input_dataset_path INPUT_DATASET_PATH --output_model_path OUTPUT_MODEL_PATH [--output_test_table_path OUTPUT_TEST_TABLE_PATH] [--output_plot_path OUTPUT_PLOT_PATH]
Wrapper of the TensorFlow Keras Sequential method for classification.
optional arguments:
-h, --help show this help message and exit
--config CONFIG Configuration file
--output_test_table_path OUTPUT_TEST_TABLE_PATH
Path to the test table file. Accepted formats: csv.
--output_plot_path OUTPUT_PLOT_PATH
Loss, accuracy and MSE plots. Accepted formats: png.
required arguments:
--input_dataset_path INPUT_DATASET_PATH
Path to the input dataset. Accepted formats: csv.
--output_model_path OUTPUT_MODEL_PATH
Path to the output model file. Accepted formats: h5.
I / O Arguments¶
Syntax: input_argument (datatype) : Definition
Config input / output arguments for this building block:
- input_dataset_path (string): Path to the input dataset. File type: input. Sample file. Accepted formats: CSV
- output_model_path (string): Path to the output model file. File type: output. Sample file. Accepted formats: H5
- output_test_table_path (string): Path to the test table file. File type: output. Sample file. Accepted formats: CSV
- output_plot_path (string): Loss, accuracy and MSE plots. File type: output. Sample file. Accepted formats: PNG
Config¶
Syntax: input_parameter (datatype) - (default_value) Definition
Config parameters for this building block:
- features (object): ({}) Independent variables or columns from your dataset you want to train on. You can specify either a list of column names from your input dataset, a list of column indexes or a range of column indexes. Formats: { "columns": ["column1", "column2"] } or { "indexes": [0, 2, 3, 10, 11, 17] } or { "range": [[0, 20], [50, 102]] }. In case of multiple formats, the first one will be picked.
- target (object): ({}) Dependent variable you want to predict from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. In case of multiple formats, the first one will be picked.
- weight (object): ({}) Weight variable from your dataset. You can specify either a column name or a column index. Formats: { "column": "column3" } or { "index": 21 }. In case of multiple formats, the first one will be picked.
- validation_size (number): (0.2) Represents the proportion of the dataset to include in the validation split. It should be between 0.0 and 1.0.
- test_size (number): (0.1) Represents the proportion of the dataset to include in the test split. It should be between 0.0 and 1.0.
- hidden_layers (array): (None) List of dictionaries with hidden layer values. Format: [ { "size": 50, "activation": "relu" } ].
- output_layer_activation (string): (softmax) Activation function to use in the output layer.
- optimizer (string): (Adam) Name of the optimizer instance.
- learning_rate (number): (0.02) Determines the step size at each iteration while moving toward a minimum of the loss function.
- batch_size (integer): (100) Number of samples per gradient update.
- max_epochs (integer): (100) Number of epochs to train the model. Since early stopping is enabled, this is a maximum.
- normalize_cm (boolean): (False) Whether or not to normalize the confusion matrix.
- random_state (integer): (5) Controls the shuffling applied to the data before the split.
- scale (boolean): (False) Whether or not to scale the input dataset.
- remove_tmp (boolean): (True) Remove temporary files.
- restart (boolean): (False) Do not execute if output files exist.
YAML¶
Common config file¶
properties:
batch_size: 100
features:
columns:
- mean radius
- mean texture
- mean perimeter
- mean area
- mean smoothness
- mean compactness
- mean concavity
- mean concave points
- mean symmetry
- mean fractal dimension
- radius error
- texture error
- perimeter error
- area error
- smoothness error
- compactness error
- concavity error
- concave points error
- symmetry error
- fractal dimension error
- worst radius
- worst texture
- worst perimeter
- worst area
- worst smoothness
- worst compactness
- worst concavity
- worst concave points
- worst symmetry
- worst fractal dimension
hidden_layers:
- activation: relu
size: 50
- activation: relu
size: 50
learning_rate: 0.02
max_epochs: 100
optimizer: Adam
output_layer_activation: softmax
scale: true
target:
column: benign
test_size: 0.1
validation_size: 0.2
Command line¶
classification_neural_network --config config_classification_neural_network.yml --input_dataset_path dataset_classification.csv --output_model_path ref_output_model_classification.h5 --output_test_table_path ref_output_test_classification.csv --output_plot_path ref_output_plot_classification.png
JSON¶
Common config file¶
{
"properties": {
"features": {
"columns": [
"mean radius",
"mean texture",
"mean perimeter",
"mean area",
"mean smoothness",
"mean compactness",
"mean concavity",
"mean concave points",
"mean symmetry",
"mean fractal dimension",
"radius error",
"texture error",
"perimeter error",
"area error",
"smoothness error",
"compactness error",
"concavity error",
"concave points error",
"symmetry error",
"fractal dimension error",
"worst radius",
"worst texture",
"worst perimeter",
"worst area",
"worst smoothness",
"worst compactness",
"worst concavity",
"worst concave points",
"worst symmetry",
"worst fractal dimension"
]
},
"target": {
"column": "benign"
},
"validation_size": 0.2,
"test_size": 0.1,
"hidden_layers": [
{
"size": 50,
"activation": "relu"
},
{
"size": 50,
"activation": "relu"
}
],
"output_layer_activation": "softmax",
"optimizer": "Adam",
"learning_rate": 0.02,
"batch_size": 100,
"max_epochs": 100,
"scale": true
}
}
Command line¶
classification_neural_network --config config_classification_neural_network.json --input_dataset_path dataset_classification.csv --output_model_path ref_output_model_classification.h5 --output_test_table_path ref_output_test_classification.csv --output_plot_path ref_output_plot_classification.png
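As a rough illustration of what this configuration trains, here is a scikit-learn analogue using MLPClassifier in place of the Keras Sequential model the block actually wraps: two hidden layers of 50 ReLU units, Adam with learning rate 0.02, batch size 100, at most 100 epochs with early stopping on a 20% validation split, scaled inputs and a 10% test split. This is a hedged sketch under those assumptions, not the biobb_ml implementation.

```python
# Scikit-learn analogue of the classification_neural_network config above.
# MLPClassifier stands in for the Keras Sequential model the block wraps.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# The same 30 "mean / error / worst" features; target 1 corresponds to benign
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)           # scale: true
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=5)        # test_size: 0.1

clf = MLPClassifier(hidden_layer_sizes=(50, 50),  # hidden_layers: 2 x 50
                    activation="relu",
                    solver="adam",                # optimizer: Adam
                    learning_rate_init=0.02,      # learning_rate: 0.02
                    batch_size=100,               # batch_size: 100
                    max_iter=100,                 # max_epochs: 100
                    early_stopping=True,
                    validation_fraction=0.2,      # validation_size: 0.2
                    random_state=5)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print(round(acc, 3))
```

The real building block additionally saves the trained model to output_model_path (.h5) and the loss/accuracy/MSE training curves to output_plot_path.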