Deep learning similarity measure for comparing MS/MS spectra.
ms2deepscore
ms2deepscore provides a Siamese neural network that is trained to predict molecular structural similarities (Tanimoto scores) from pairs of mass spectrometry spectra.
The library provides intuitive classes to prepare data, train a Siamese model, and compute similarities between pairs of spectra.
In addition to predicting a structural similarity, MS2DeepScore can use Monte-Carlo dropout to assess the model uncertainty.
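For readers unfamiliar with the training target: the Tanimoto score of two molecular fingerprints is the number of on-bits they share divided by the number of on-bits set in either one. The sketch below is illustrative only. The fingerprints are made-up sets of bit indices, and the helper name `tanimoto` is hypothetical and not part of ms2deepscore.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bit indices."""
    fp_a, fp_b = set(fp_a), set(fp_b)
    union = fp_a | fp_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / len(union)


# Made-up fingerprints for illustration only
fingerprint_1 = {1, 4, 7, 9}
fingerprint_2 = {1, 4, 8, 9, 12}
score = tanimoto(fingerprint_1, fingerprint_2)  # 3 shared bits / 6 bits in total
```

A score of 1 means identical fingerprints, and a score of 0 means no shared bits.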
Reference
If you use MS2DeepScore for your research, please cite the following:
"MS2DeepScore - a novel deep learning similarity measure for mass fragmentation spectrum comparisons" Florian Huber, Sven van der Burg, Justin J.J. van der Hooft, Lars Ridder, bioRxiv 2021, doi: https://doi.org/10.1101/2021.04.18.440324
Setup
Requirements
Python 3.7 or higher
Installation
Simply install using pip: pip install ms2deepscore
Prepare environment
We recommend creating an Anaconda environment with
conda create --name ms2deepscore python=3.8
conda activate ms2deepscore
pip install ms2deepscore
Alternatively, simply install it in the environment of your choice with pip install ms2deepscore.
Or, to also include the full matchms functionality:
conda create --name ms2deepscore python=3.8
conda activate ms2deepscore
conda install --channel bioconda --channel conda-forge matchms
pip install ms2deepscore
Quick start: How to prepare data, train a model, and compute similarities.
See notebooks/MS2DeepScore_tutorial.ipynb for a more extensive fully-working example on test data.
There are two ways to use MS2DeepScore to compute spectral similarities. You can train a new model on a dataset of your choice. That dataset, however, should preferably contain a substantial number of spectra to learn relevant features, say > 10,000 spectra of sufficiently diverse types. The second way is much simpler: use a model that was pretrained on a large dataset.
1) Use a pretrained model to compute spectral similarities
We provide a model that was trained on > 100,000 MS/MS spectra from GNPS and can be downloaded from Zenodo. To then compute the similarities between spectra of your choice, you can run something like:
from matchms import calculate_scores
from matchms.importing import load_from_msp
from ms2deepscore import MS2DeepScore
from ms2deepscore.models import load_model
# Import data (load_from_msp yields spectra lazily, so collect them in lists)
references = list(load_from_msp("my_reference_spectra.msp"))
queries = list(load_from_msp("my_query_spectra.msp"))
# Load pretrained model
model = load_model("MS2DeepScore_allGNPSpositive_10k_500_500_200.hdf5")
similarity_measure = MS2DeepScore(model)
# Calculate scores and get matchms.Scores object
scores = calculate_scores(references, queries, similarity_measure)
If you want to calculate all-vs-all spectral similarities, e.g. to build a network, then you can run:
scores = calculate_scores(references, references, similarity_measure, is_symmetric=True)
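The point of `is_symmetric=True` is that a symmetric similarity only has to be computed for one half of the score matrix. The following is a conceptual sketch in plain Python, not the matchms implementation. `toy_similarity` and `all_vs_all` are hypothetical names, and the similarity function is a stand-in.

```python
def toy_similarity(a, b):
    """Stand-in symmetric similarity between two numbers (illustration only)."""
    return 1.0 / (1.0 + abs(a - b))


def all_vs_all(items, similarity):
    """Fill a symmetric score matrix by computing only the upper triangle."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    calls = 0
    for i in range(n):
        for j in range(i, n):
            value = similarity(items[i], items[j])
            calls += 1
            matrix[i][j] = value
            matrix[j][i] = value  # mirror instead of recomputing
    return matrix, calls


matrix, calls = all_vs_all([1.0, 2.0, 4.0, 8.0], toy_similarity)
```

For 4 items this needs 10 similarity evaluations instead of 16, and the saving approaches a factor of two as the number of spectra grows.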
To use Monte-Carlo dropout to also get an uncertainty measure with each score, run the following:
from matchms import calculate_scores
from matchms.importing import load_from_msp
from ms2deepscore import MS2DeepScoreMonteCarlo
from ms2deepscore.models import load_model
# Import data (load_from_msp yields spectra lazily, so collect them in lists)
references = list(load_from_msp("my_reference_spectra.msp"))
queries = list(load_from_msp("my_query_spectra.msp"))
# Load pretrained model
model = load_model("MS2DeepScore_allGNPSpositive_10k_500_500_200.hdf5")
similarity_measure = MS2DeepScoreMonteCarlo(model, n_ensembles=10)
# Calculate scores and get matchms.Scores object
scores = calculate_scores(references, queries, similarity_measure)
In that scenario, scores["score"] contains the similarity scores (median of the ensemble of 10x10 scores) and scores["uncertainty"] gives an uncertainty estimate (interquartile range of the ensemble of 10x10 scores).
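To make the two reported quantities concrete, here is a minimal sketch of how an ensemble of scores for one spectrum pair can be reduced to a median and an interquartile range. It is not the library's code: `summarize_ensemble` is a hypothetical helper, the percentile rule (linear interpolation) is an assumption, and the scores are made up.

```python
import statistics


def summarize_ensemble(scores):
    """Return (median, interquartile range) for a list of ensemble scores."""
    ordered = sorted(scores)

    def percentile(q):
        # Linear interpolation between the two closest ranks (q in [0, 1])
        position = q * (len(ordered) - 1)
        lower = int(position)
        upper = min(lower + 1, len(ordered) - 1)
        fraction = position - lower
        return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

    return statistics.median(ordered), percentile(0.75) - percentile(0.25)


# Made-up scores standing in for the ensemble of one spectrum pair
ensemble_scores = [0.70, 0.72, 0.74, 0.76, 0.78]
score, uncertainty = summarize_ensemble(ensemble_scores)
```

A narrow interquartile range indicates that the dropout-perturbed predictions agree with each other, while a wide one flags a less reliable score.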
2) Train your own MS2DeepScore model
Data preparation
Bin spectra using ms2deepscore.SpectrumBinner.
In this binned form we can feed spectra to the model.
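from ms2deepscore import SpectrumBinner
# spectrums: a list of matchms Spectrum objects (not defined here)
# 1000 bins between m/z 10 and 1000; peak intensities are scaled by a power of 0.5
spectrum_binner = SpectrumBinner(1000, mz_min=10.0, mz_max=1000.0, peak_scaling=0.5)
binned_spectrums = spectrum_binner.fit_transform(spectrums)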
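As a rough picture of what binning does, the sketch below maps peaks to equal-width m/z bins and scales their intensities by a power. It is a simplified stand-in, not the SpectrumBinner implementation: `bin_spectrum` is a hypothetical function, and keeping the strongest peak per bin is an assumption made for this example.

```python
def bin_spectrum(peaks, number_of_bins=1000, mz_min=10.0, mz_max=1000.0, peak_scaling=0.5):
    """Map (m/z, intensity) peaks to a sparse {bin index: scaled intensity} dict."""
    bin_width = (mz_max - mz_min) / number_of_bins
    binned = {}
    for mz, intensity in peaks:
        if not (mz_min <= mz < mz_max):
            continue  # ignore peaks outside the binned m/z range
        index = int((mz - mz_min) / bin_width)
        scaled = intensity ** peak_scaling
        # Assumption: keep the strongest peak when several fall into one bin
        binned[index] = max(binned.get(index, 0.0), scaled)
    return binned


# Made-up peaks: (m/z, relative intensity)
peaks = [(50.2, 0.04), (50.5, 0.25), (300.0, 1.0), (1200.0, 0.5)]
binned = bin_spectrum(peaks)
```

Here the two peaks near m/z 50 share a bin, the peak at m/z 1200 is dropped, and the result is a fixed-length (sparse) vector that a neural network can consume.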
Create a data generator that will generate batches of training examples. Each training example consists of a pair of binned spectra and the corresponding reference similarity score.
from ms2deepscore.data_generators import DataGeneratorAllSpectrums
# tanimoto_scores_df: the reference similarity scores (not defined here)
dimension = len(spectrum_binner.known_bins)
data_generator = DataGeneratorAllSpectrums(binned_spectrums, tanimoto_scores_df,
                                           dim=dimension)
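The shape of a training example (two binned spectra plus one reference score) can be sketched with a plain Python generator. This is only an illustration of the idea. The library's generator may select and balance pairs differently, and `pair_batches` as well as all data below are hypothetical.

```python
import random


def pair_batches(binned_spectra, reference_scores, batch_size=2, seed=0):
    """Yield batches of (spectrum_a, spectrum_b, reference score) training examples.

    reference_scores maps a frozenset of two spectrum ids to their reference score.
    """
    rng = random.Random(seed)
    ids = sorted(binned_spectra)
    pairs = [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]]
    rng.shuffle(pairs)
    for start in range(0, len(pairs), batch_size):
        yield [
            (binned_spectra[a], binned_spectra[b], reference_scores[frozenset((a, b))])
            for a, b in pairs[start:start + batch_size]
        ]


# Made-up binned spectra and reference scores
binned_spectra = {"s1": {40: 0.5}, "s2": {40: 0.4, 292: 1.0}, "s3": {7: 0.9}}
reference_scores = {
    frozenset(("s1", "s2")): 0.8,
    frozenset(("s1", "s3")): 0.1,
    frozenset(("s2", "s3")): 0.2,
}
batches = list(pair_batches(binned_spectra, reference_scores))
```

With three spectra there are three unique pairs, so a batch size of 2 yields one full batch and one batch with the remaining pair.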
Train a model
Initialize and train a SiameseModel. It consists of a dense 'base' network that produces an embedding for each of the 2 inputs. The 'head' model computes the cosine similarity between the embeddings.
from tensorflow import keras
from ms2deepscore.models import SiameseModel
model = SiameseModel(spectrum_binner, base_dims=(200, 200, 200), embedding_dim=200,
                     dropout_rate=0.2)
# learning_rate replaces the deprecated lr argument of keras optimizers
model.compile(loss='mse', optimizer=keras.optimizers.Adam(learning_rate=0.001))
model.fit(data_generator,
          validation_data=data_generator,
          epochs=2)
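The base/head structure described above can be illustrated without any deep learning framework. The sketch below uses a single dense layer with ReLU as a toy 'base' and a cosine similarity 'head'. It is not the SiameseModel code, and the function names, weights, and inputs are all made up.

```python
import math


def dense_relu(vector, weights):
    """One dense layer with ReLU; weights is a list of rows, one per output unit."""
    return [max(0.0, sum(w * x for w, x in zip(row, vector))) for row in weights]


def cosine_similarity(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def siamese_score(input_a, input_b, shared_weights):
    """Embed both inputs with the SAME weights, then compare the embeddings."""
    embedding_a = dense_relu(input_a, shared_weights)
    embedding_b = dense_relu(input_b, shared_weights)
    return cosine_similarity(embedding_a, embedding_b)


# Made-up weights (2 output units, 3 input bins) and binned inputs
shared_weights = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
score_same = siamese_score([0.5, 0.2, 0.0], [0.5, 0.2, 0.0], shared_weights)
score_diff = siamese_score([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], shared_weights)
```

Because both inputs pass through the same weights, the score does not depend on the order of the two spectra, and identical inputs always give a similarity of 1.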
Predict similarity scores
Calculate similarities for a pair of spectra:
from ms2deepscore import MS2DeepScore
similarity_measure = MS2DeepScore(model)
score = similarity_measure.pair(spectrums[0], spectrums[1])
Contributing
We welcome contributions to the development of ms2deepscore! Have a look at the contribution guidelines.
Hashes for ms2deepscore-0.2.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 33ff7853d2e9c5fd57efb57480238cf557ce27b7892ce8492eeab616b0cdaddf
MD5 | 7ca811590abbe740eeb1c5450bf9d6a2
BLAKE2b-256 | d0204f0bec692f58860eff0ef96c21cd513a06ec0c42f3a105ac5c791127fbf7