Skip to main content

Nanopore methylation/modified base calling detached from basecalling

Project description

[Oxford Nanopore Technologies]

Remora

Methylation/modified base calling separated from basecalling.

Detailed documentation for all remora commands and algorithms can be found on the TODO Remora documentation page.

Installation

To install from gitlab source for development, the following commands can be run.

git clone git@github.com:nanoporetech/remora.git
pip install -e remora/[tests]

See help for any Remora command with the -h flag.

Getting Started

Remora models are trained to perform binary or categorical prediction of modified base content of a nanopore read. Models may also be trained to perform canonical base prediction, but this feature may be removed at a later time. The rest of the documentation will focus on the modified base detection task.

The Remora training/prediction input unit (refered to as a chunk) consists of:
  1. Section of normalized signal

  2. Canonical bases attributed to the section of signal

  3. Mapping between these two

Chunks have a fixed signal length defined at data preparation time and saved as a model attribute. A fixed position within the chunk is defined as the “focus position”. This position is the center of the base of interest.

Data Preparation

Remora data preparation begins from Taiyaki mapped signal files generally produced from Megalodon containing modified base annotations. An example dataset might be pre-processed with the following commands.

megalodon \
  pcr_fast5s/ \
  --reference ref.mmi \
  --output-directory mega_res_pcr \
  --outputs mappings signal_mappings \
  --num-reads 10000 \
  --guppy-config dna_r9.4.1_450bps_fast.cfg \
  --devices 0 \
  --processes 20
# Note the --ref-mods-all-motifs option defines the modified base characteristics
megalodon \
  sssI_fast5s/ \
  --ref-mods-all-motifs m 5mC CG 0 \
  --reference ref.mmi \
  --output-directory mega_res_sssI \
  --outputs mappings signal_mappings \
  --num-reads 10000 \
  --guppy-config dna_r9.4.1_450bps_fast.cfg \
  --devices 0 \
  --processes 20

python \
  taiyaki/misc/merge_mappedsignalfiles.py \
  mapped_signal_train_data.hdf5 \
  --input mega_res_pcr/signal_mappings.hdf5 None \
  --input mega_res_sssI/signal_mappings.hdf5 None \
  --allow_mod_merge \
  --batch_format

After the construction of a training dataset, chunks must be extracted and saved in a Remora-friendly format. The following command performs this task in Remora.

remora \
  dataset prepare \
  mapped_signal_train_data.hdf5 \
  --output-remora-training-file remora_train_chunks.npz \
  --motif CG 0 \
  --mod-bases m \
  --chunk-context 50 50 \
  --kmer-context-bases 6 6 \
  --max-chunks-per-read 20 \
  --log-filename log.txt

The resulting remora_train_chunks.npz file can then be used to train a Remora model.

Model Training

Models are trained with the remora model train command. For example a model can be trained with the following command.

remora \
  model train \
  remora_train_chunks.npz \
  --model remora/models/ConvLSTM_w_ref.py \
  --device 0 \
  --output-path remora_train_results

This command will produce a final model in ONNX format for use in Bonito, Megalodon or remora infer commands.

Model Inference

For testing purposes inference within Remora is provided given Taiyaki mapped signal files as input. The below command will call the held out validation dataset from the data preparation section above.

remora \
  infer from_taiyaki_mapped_signal \
  mega_res_pcr/split_signal_mappings.split_a.hdf5 \
  remora_train_results/model_best.onnx \
  --output-path remora_infer_results_pcr.txt \
  --device 0
remora \
  infer from_taiyaki_mapped_signal \
  mega_res_sssI/split_signal_mappings.split_a.hdf5 \
  remora_train_results/model_best.onnx \
  --output-path remora_infer_results_sssI.txt \
  --device 0

Note that in order to perfrom inference on a GPU device the onnxruntime-gpu package must be installed.

API

The Remora API can be applied to make modified base calls given a basecalled read via a RemoraRead object. sig should be a float32 numpy array. seq is a string derived from sig (can be either basecalls or other downstream derived sequence; e.g. mapped reference positions). seq_to_sig_map should be an int32 numpy array of length len(seq) + 1 and elements should be indices within sig array assigned to each base in seq.

from remora.model_util import load_model
from remora.data_chunks import RemoraRead
from remora.inference import call_read_mods

model, model_metadata = load_model("remora_train_results/model_best.onnx")
read = RemoraRead(sig, seq_to_sig_map, str_seq=seq)
mod_probs, _, pos = call_read_mods(
  read,
  model,
  model_metadata,
  return_mod_probs=True,
)

mod_probs will contain the probability of each modeled modified base as found in model_metadata[“mod_long_names”]. For example, run mod_probs.argmax(axis=1) to obtain the prediction for each input unit. pos contains the position (index in input sequence) for each prediction within mod_probs.

GPU Troubleshooting

Note that standard Remora models are small enough to run quite quickly on CPU resources and this is the primary recommandation. Running Remora models on GPU compute resources is considered experimental with minimal support.

Deployment of Remora models is facilitated by the Open Neural Network Exchange (ONNX) format. The onnxruntime python package is used to run the models. In order to support running models on GPU resources the GPU compatible package must be installed (pip install onnxruntime-gpu).

Once installed the remora infer command takes a --device argument. Similarly, the API remora.model_util.load_model function takes a device argument. These arguments specify the GPU device ID to use for inference.

Once the device option is specified, Remora will attempt to load the model on the GPU resources. If this fails a RemoraError will be raised. The likely cause of this is the required CUDA and cuDNN dependency versions. See the requirements on the onnxruntime documentation page here.

To check the versions of the various dependencies see the following commands.

# check cuda version
nvcc --version
# check cuDNN version
grep -A 2 "define CUDNN_MAJOR" `whereis cudnn | cut -f2 -d" "`
# check onnxruntime version
python -c "import onnxruntime as ort; print(ort.__version__)"

These versions should match a row in the table linked above. CUDA and cuDNN versions can be downloaded from the NVIDIA website (cuDNN link; CUDA link). The cuDNN download can be specified at runtime as in the following example.

CUDA_PATH=/path/to/cuda/include/cuda.h \
  CUDNN_H_PATH=/path/to/cuda/include/cudnn.h \
  remora \
  infer [arguments]

The onnxruntime dependency can be set via the python package install command. For example pip install “onnxruntime-gpu<1.7”.

Terms and licence

This is a research release provided under the terms of the Oxford Nanopore Technologies’ Public Licence. Research releases are provided as technology demonstrators to provide early access to features or stimulate Community development of tools. Support for this software will be minimal and is only provided directly by the developers. Feature requests, improvements, and discussions are welcome and can be implemented by forking and pull requests. Much as we would like to rectify every issue, the developers may have limited resource for support of this software. Research releases may be unstable and subject to rapid change by Oxford Nanopore Technologies.

© 2021 Oxford Nanopore Technologies Ltd. Remora is distributed under the terms of the Oxford Nanopore Technologies’ Public Licence.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ont-remora-0.1.0.tar.gz (15.0 MB view hashes)

Uploaded Source

Built Distribution

ont_remora-0.1.0-cp37-cp37m-macosx_10_15_x86_64.whl (15.1 MB view hashes)

Uploaded CPython 3.7m macOS 10.15+ x86-64

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page