Pydra dataflow engine

Development Status
- 3 - Alpha
Environment
- Console
Intended Audience
- Science/Research
License
- OSI Approved :: Apache Software License
Operating System
- MacOS :: MacOS X
- POSIX :: Linux
Programming Language
- Python :: 3.7
Topic
- Scientific/Engineering

Project description

pydra-ml

Pydra-ML is a demo application that leverages Pydra together with scikit-learn to perform model comparison across a set of classifiers. The intent is to use this as an application to make Pydra more robust while allowing users to generate classification reports more easily. This application leverages Pydra's powerful splitters and combiners to scale across a set of classifiers and metrics. It also uses Pydra's caching to avoid redoing model training and evaluation when new metrics are added or when the number of iterations (n_splits) is increased.
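As a rough illustration of the split, combine, and cache idea described above, the sketch below uses only the Python standard library. It is not Pydra's API or implementation: the names cache_key, run_once, and run_all are hypothetical, the in-memory dict merely stands in for a cache directory, and the "training" step is a placeholder.

```python
import hashlib
import itertools
import json

# Hypothetical inputs mirroring two fields of the specification.
clf_info = [
    ["sklearn.naive_bayes", "GaussianNB"],
    ["sklearn.tree", "DecisionTreeClassifier", {"max_depth": 5}],
]
permute = [True, False]

_cache = {}  # stands in for an on-disk cache directory
calls = []   # records which combinations were actually computed


def cache_key(clf, permuted):
    """Hash the inputs so identical inputs map to the same cache entry."""
    payload = json.dumps({"clf_info": clf, "permute": permuted}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def run_once(clf, permuted):
    """Placeholder for training and evaluation; reuses a cached result if present."""
    key = cache_key(clf, permuted)
    if key not in _cache:
        calls.append((clf[1], permuted))
        _cache[key] = {"clf": clf[1], "permute": permuted}
    return _cache[key]


def run_all():
    # "Split": one run per (classifier, permute) pair.
    # "Combine": gather the individual results into one list.
    return [run_once(c, p) for c, p in itertools.product(clf_info, permute)]


first = run_all()
second = run_all()  # every combination is now served from the cache
```

The second call performs no new placeholder computations because every input combination hashes to an existing cache entry.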

Upcoming features:

Improve output report containing SHAP feature analysis.
Allow for comparing scikit-learn pipelines.
Test on scikit-learn-compatible classifiers.

Installation

pydraml requires Python 3.7+.

pip install pydra-ml

CLI usage

This repo installs pydraml, a CLI that allows usage without any programming.

To test the CLI, copy pydra_ml/tests/data/breast_cancer.csv and short-spec.json.sample to a folder and run:

$ pydraml -s short-spec.json.sample

This will generate a test-{metric}-{timestamp}.png file for each metric in the local folder together with a pickled results file containing all the scores from the model evaluations.

$ pydraml --help
Usage: pydraml [OPTIONS]

Options:
  -s, --specfile PATH   Specification file to use  [required]
  -p, --plugin TEXT...  Pydra plugin to use  [default: cf, n_procs=1]
  -c, --cache TEXT      Cache dir  [default:
                        /Users/satra/software/sensein/pydra-ml/cache-wf]

  --help                Show this message and exit.

With the plugin option you can use local multiprocessing:

$ pydraml -s ../short-spec.json.sample -p cf "n_procs=8"

or execution via dask:

$ pydraml -s ../short-spec.json.sample -p dask "address=tcp://192.168.1.154:8786"
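When the CLI is driven from a script, the same invocations can be assembled as an argument list. The helper below is a hypothetical convenience, not part of pydra-ml; it only uses the -s, -p, and -c options shown in the help text above and assumes the plugin settings are passed as a single key=value string, as in the examples.

```python
import shlex


def build_pydraml_command(specfile, plugin="cf", plugin_args="n_procs=1", cache=None):
    """Assemble an argument list for the pydraml CLI (hypothetical helper)."""
    # -p takes the plugin name followed by its settings string.
    cmd = ["pydraml", "-s", str(specfile), "-p", plugin, plugin_args]
    if cache is not None:
        cmd += ["-c", str(cache)]
    return cmd


cmd = build_pydraml_command("short-spec.json.sample", plugin_args="n_procs=8")
print(shlex.join(cmd))
# To actually execute it: subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting issues if the command is later handed to subprocess.run.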

Current specification

The current specification is a JSON file as shown in the example below. It needs to contain all the fields described here. For datasets with many features, you will want to generate x_indices programmatically.

filename: Absolute path to the CSV file containing data. Can contain a column named group to support GroupShuffleSplit; otherwise each sample is treated as its own group.
x_indices: Numeric (0-based) or string list of columns to use as input features
target_vars: String list of target variables (at present only one is supported)
n_splits: Number of shuffle split iterations to use
test_size: Fraction of data to use for test set in each iteration
clf_info: List of scikit-learn classifiers to use.
permute: List of booleans to indicate whether to generate a null model or not
gen_shap: Boolean indicating whether SHAP values are generated
nsamples: Number of samples to use for SHAP estimation
l1_reg: Type of regularizer to use for SHAP estimation
plot_top_n_shap: Number or proportion of top SHAP values to plot (e.g., 16 or 0.1 for top 10%). Set to 1.0 (float) to plot all features or 1 (int) to plot only the top feature.
metrics: List of scikit-learn metrics to use
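For datasets with many features, x_indices can be derived from the CSV header instead of being typed by hand. The sketch below is one possible approach using only the standard library. The function names make_spec and write_spec are hypothetical, the first row of the CSV is assumed to be the header, and excluding the group column from the features is an assumption rather than documented behaviour.

```python
import csv
import json
import os


def make_spec(csv_path, target, skip=("group",), **fields):
    """Build a spec dict whose x_indices cover every non-target column."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))  # assumes the first row holds column names
    # Assumption: the target column and the optional group column are not features.
    x_indices = [
        i for i, name in enumerate(header)
        if name != target and name not in skip
    ]
    spec = {
        "filename": os.path.abspath(csv_path),  # the spec expects an absolute path
        "x_indices": x_indices,
        "target_vars": [target],
    }
    spec.update(fields)  # remaining fields such as n_splits, clf_info, metrics
    return spec


def write_spec(spec, out_path):
    """Serialize the spec as JSON for use with pydraml -s."""
    with open(out_path, "w") as f:
        json.dump(spec, f, indent=1)
```

For example, make_spec("breast_cancer.csv", "target", n_splits=100, test_size=0.2) would fill in the first three fields and pass the remaining ones through unchanged.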

`clf_info` specification

This is a list of classifiers from scikit-learn, where each classifier is encoded as an array containing:

- module
- classifier
- (optional) classifier parameters
- (optional) gridsearch param grid

When a param grid is provided and the default classifier parameters are not changed, an empty dictionary MUST be provided as the third element.
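The positional rule above is easy to get wrong when entries are written by hand. A small helper, sketched here under the name make_clf_entry (hypothetical, not part of pydra-ml), can build entries so that the empty dictionary is always inserted when only a param grid is given.

```python
def make_clf_entry(module, classifier, params=None, param_grid=None):
    """Return one clf_info entry as a list in the documented order."""
    entry = [module, classifier]
    if param_grid is not None:
        # The parameters slot must be present, even if empty, before the grid.
        entry.append(dict(params or {}))
        entry.append(param_grid)
    elif params:
        entry.append(dict(params))
    return entry


knn = make_clf_entry(
    "sklearn.neighbors",
    "KNeighborsClassifier",
    param_grid=[{"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}],
)
```

Here knn has four elements, with an empty dictionary in the third position.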

Example specification:

{"filename": "breast_cancer.csv",
 "x_indices": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
 "target_vars": ["target"],
 "n_splits": 100,
 "test_size": 0.2,
 "clf_info": [
 ["sklearn.ensemble", "AdaBoostClassifier"],
 ["sklearn.naive_bayes", "GaussianNB"],
 ["sklearn.tree", "DecisionTreeClassifier", {"max_depth": 5}],
 ["sklearn.ensemble", "RandomForestClassifier", {"n_estimators": 100}],
 ["sklearn.ensemble", "ExtraTreesClassifier", {"n_estimators": 100, "class_weight": "balanced"}],
 ["sklearn.linear_model", "LogisticRegressionCV", {"solver": "liblinear", "penalty": "l1"}],
 ["sklearn.neural_network", "MLPClassifier", {"alpha": 1, "max_iter": 1000}],
 ["sklearn.svm", "SVC", {"probability": true},
  [{"kernel": ["rbf", "linear"], "C": [1, 10, 100, 1000]}]],
 ["sklearn.neighbors", "KNeighborsClassifier", {},
  [{"n_neighbors": [3, 5, 7, 9, 11, 13, 15, 17, 19],
    "weights": ["uniform", "distance"]}]]
 ],
 "permute": [true, false],
 "gen_shap": true,
 "nsamples": 100,
 "l1_reg": "aic",
 "plot_top_n_shap": 16,
 "metrics": ["roc_auc_score"]
 }
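Because the specification needs to contain all of the fields listed earlier, it can help to check a spec file before launching a long run. The following is a minimal sketch of such a check. The names REQUIRED_FIELDS, missing_fields, and load_spec are hypothetical, and the check only looks for the presence of keys, not the validity of their values.

```python
import json

# Field names taken from the specification described above.
REQUIRED_FIELDS = (
    "filename", "x_indices", "target_vars", "n_splits", "test_size",
    "clf_info", "permute", "gen_shap", "nsamples", "l1_reg",
    "plot_top_n_shap", "metrics",
)


def missing_fields(spec):
    """Return the required field names that are absent from the spec dict."""
    return [name for name in REQUIRED_FIELDS if name not in spec]


def load_spec(path):
    """Load a JSON spec file and fail early if any field is missing."""
    with open(path) as f:
        spec = json.load(f)
    missing = missing_fields(spec)
    if missing:
        raise ValueError(f"spec is missing fields: {', '.join(missing)}")
    return spec
```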

Output:

The workflow will output:

results-{timestamp}.pkl containing one list per model used. For example, if loaded into a variable results, the models are accessed through results[0] to results[N] (with permute: [false,true], the model trained on the true labels comes first as results[0] and the model trained on permuted labels second as results[1]). Each model contains:
- a dict, accessed through results[0][0], with model information: {'ml_wf.clf_info': ['sklearn.neural_network', 'MLPClassifier', {'alpha': 1, 'max_iter': 1000}], 'ml_wf.permute': False}
- a pydra Result object, accessed through results[0][1], with an attribute output, which itself has the attributes:
  - feature_names: taken from the columns of the data CSV. The following attributes are organized as N lists, one per bootstrapping sample:
  - output: N lists, each one with two lists for true and predicted labels.
  - score: N lists, each one containing M different metric scores.
  - shaps: N lists, each one with a list of shape (P,F), where P is the number of predictions and F the number of features, holding one SHAP value per feature. shaps is empty if gen_shap is set to false or if permute is set to true.
One figure per metric with performance distribution across splits (with or without null distribution trained on permuted labels)
shap-{timestamp} dir
- SHAP values are computed for each prediction in each split's test set (e.g., 30 bootstrapping splits with 100 predictions each will create a (30,100) array). The mean is then taken across predictions for each split (e.g., resulting in a (64,30) array for 64 features and 30 bootstrapping samples).
- For binary classification, a more accurate display of feature importance is obtained by splitting predictions into TP, TN, FP, and FN, which in turn allows for error auditing (i.e., seeing what a model pays attention to when making incorrect predictions).
  - quadrant_indexes.pkl: The TP, TN, FP, and FN indexes are saved as a dict with one key per model (permuted models without SHAP values are skipped automatically), with each key's value organized by bootstrapping split.
  - summary_values_shap_{model_name}_{prediction_type}.csv contains all SHAP values and summary statistics ranked by the mean SHAP value across bootstrapping splits. A sample_n column can be empty or NaN if that split did not contain the prediction type named in the filename (e.g., you may not have FNs or FPs in a given split with high performance).
  - summary_shap_{model_name}_{plot_top_n_shap}.png contains SHAP value summary statistics for all features (if plot_top_n_shap is set to 1.0) or only the top N most important features for better visualization.
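The layout of the results file described above can be walked with a few lines of Python. The sketch below assumes the structure exactly as documented here (pairs of an info dict and a Result object whose output.score holds N lists of M scores). The function names load_results and mean_scores are hypothetical. Unpickling the real file presumably requires pydra to be importable, and pickle files should only be loaded from trusted sources.

```python
import pickle
from statistics import mean


def load_results(path):
    """Unpickle a results-{timestamp}.pkl file."""
    with open(path, "rb") as f:
        return pickle.load(f)


def mean_scores(results):
    """Summarize each model as (classifier name, permuted?, mean score per metric)."""
    summary = []
    for info, result in results:
        scores = result.output.score  # N splits, each with M metric scores
        # Transpose to M columns, then average each metric across splits.
        per_metric = [mean(column) for column in zip(*scores)]
        summary.append(
            (info["ml_wf.clf_info"][1], info["ml_wf.permute"], per_metric)
        )
    return summary
```

A typical use would be mean_scores(load_results(path)), where path points at a results file written by the workflow, giving one summary row per model.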

Developer installation

Install repo in developer mode:

git clone https://github.com/nipype/pydra-ml.git
cd pydra-ml
pip install -e .[dev]

It is also useful to install pre-commit:

pip install pre-commit
pre-commit

Project structure

tasks.py contains the annotated Pydra tasks.
classifier.py contains the Pydra workflow.
report.py contains report generation code.

Release history

0.7.0: Feb 24, 2024
0.6.0: Oct 21, 2023
0.5.1: Aug 12, 2021
0.4.0: Dec 8, 2020
0.3.2: Oct 31, 2020
0.3.1: Jul 27, 2020
0.3.0: Jun 23, 2020
0.2.0: Jun 21, 2020
0.1.1 (this version): Jun 13, 2020
0.1.0: Jun 2, 2020
0.0.1: May 25, 2020

Download files

Source Distribution

pydra_ml-0.1.1.tar.gz (16.2 kB)

Uploaded Jun 13, 2020 Source

Built Distribution

pydra_ml-0.1.1-py3-none-any.whl (65.4 kB)

Uploaded Jun 13, 2020 Python 3

File details

Details for the file pydra_ml-0.1.1.tar.gz.

File metadata

Download URL: pydra_ml-0.1.1.tar.gz
Upload date: Jun 13, 2020
Size: 16.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for pydra_ml-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`da1b819a40bbc4eab58583b6261237c9db72ef762595e81198c48c0f6f780dbe`
MD5	`a657c5071616766cc14c7d960bfa3fcb`
BLAKE2b-256	`3679846c779e5cc47c3a9c80c7e007baa6a541fed07f60b9fff068eeaed62ff9`

File details

Details for the file pydra_ml-0.1.1-py3-none-any.whl.

File metadata

Download URL: pydra_ml-0.1.1-py3-none-any.whl
Upload date: Jun 13, 2020
Size: 65.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for pydra_ml-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`48aa4c64998ca8ad235edcb595281d920a4d38300fcaeb95192c0118babdb794`
MD5	`39602396f255bea732f09d0ee55e3419`
BLAKE2b-256	`e675f7f697860802296781166bc319a77108f2c8611935405f9e3addca502aa9`
