pairwise sequence alignment library

Author: Jeff Daily (jeff.daily@pnnl.gov)

This package contains Python bindings for parasail. Parasail is a SIMD C (C99) library containing implementations of the Smith-Waterman (local), Needleman-Wunsch (global), and semi-global pairwise sequence alignment algorithms.

Installation

Using pip

The recommended way of installing is to use the latest version available via pip.

pip install parasail

Binaries for Windows and OSX should be available via pip. Using pip on a Linux platform will first download the latest version of the parasail C library sources and then compile them automatically into a shared library. For an installation from sources, or to learn how the pip installation works on Linux, please read on.

Building from Source

The parasail python bindings are based on ctypes. Unfortunately, best practices are not firmly established for providing cross-platform and user-friendly python bindings based on ctypes. The approach with parasail-python is to install the parasail shared library as “package data” and use a relative path from the parasail/__init__.py in order to locate the shared library.

There are two approaches currently supported. First, you can compile your own parasail shared library using one of the recommended build processes described in the parasail C library README.md, then copy the parasail.dll (Windows), libparasail.so (Linux), or libparasail.dylib (OSX) shared library to parasail-python/parasail – the same folder location as parasail-python/parasail/__init__.py.

The second approach is to let the setup.py script attempt to download and compile the parasail C library for you using the configure script that comes with it. This happens as a side effect of the bdist_wheel target.

python setup.py bdist_wheel

The bdist_wheel target will first look for the shared library. If it exists, it will happily install it as package data. Otherwise, the latest parasail master branch from github will be downloaded, unzipped, configured, made, and the shared library will be copied into the appropriate location for package data installation.
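The package-data lookup described in this section can be sketched as follows. This is a minimal illustration, not the actual contents of parasail/__init__.py: the helper names library_filename and library_path are hypothetical, and only the three shared library file names come from the text above.

```python
import os
import platform

# Shared library file names listed above, keyed by platform.system() value.
_LIBRARY_NAMES = {
    "Windows": "parasail.dll",
    "Linux": "libparasail.so",
    "Darwin": "libparasail.dylib",  # OSX
}


def library_filename(system=None):
    """Return the expected shared library file name for a platform."""
    system = system or platform.system()
    try:
        return _LIBRARY_NAMES[system]
    except KeyError:
        raise OSError("unsupported platform: %s" % system)


def library_path(package_init_file, system=None):
    """Build the path of the library sitting next to a package's __init__.py."""
    package_dir = os.path.dirname(os.path.abspath(package_init_file))
    return os.path.join(package_dir, library_filename(system))


# Hypothetical location, used only to show the resulting path.
print(library_path(os.path.join("parasail-python", "parasail", "__init__.py"), "Linux"))
```

A ctypes-based package could pass such a path to ctypes.CDLL; whether parasail-python does exactly this is an assumption here.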

Quick Example

The Python interface only includes bindings for the dispatching functions, not the low-level instruction set-specific function calls. The Python interface also includes wrappers for the various PAM and BLOSUM matrices included in the distribution.

Gap open and extension penalties are specified as positive integers.

import parasail
result = parasail.sw_scan_16("asdf", "asdf", 11, 1, parasail.blosum62)
result = parasail.sw_stats_striped_8("asdf", "asdf", 11, 1, parasail.pam100)
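To make the penalty arguments concrete, the sketch below computes the cost of a single gap under an affine model. It assumes the convention described in the parasail C library documentation, where opening a gap applies only the open penalty and each further gap position applies the extension penalty. That convention is not stated in this README, and gap_cost is a hypothetical helper, not part of the parasail API.

```python
def gap_cost(length, gap_open=11, gap_extend=1):
    """Penalty (as a positive integer) for one gap of `length` positions.

    Assumed convention: the first gap position costs `gap_open` alone,
    and every further position costs `gap_extend`.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0
    return gap_open + (length - 1) * gap_extend


# With the penalties used in the example above (open=11, extend=1):
for n in (1, 2, 5):
    print(n, gap_cost(n))  # prints 1 11, then 2 12, then 5 15
```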

Standard Function Naming Convention

To make it easier to find the function you’re looking for, the function names follow a naming convention. The following will use set notation {} to indicate a selection must be made and brackets [] to indicate an optional part of the name.

  • Non-vectorized, reference implementations.

    • Required, select one of global (nw), semi-global (sg), or local (sw) alignment.

    • Optional return alignment statistics.

    • Optional return DP table or last row/col.

    • Optional use a prefix scan implementation.

    • parasail.{nw,sg,sw}[_stats][{_table,_rowcol}][_scan]

  • Non-vectorized, traceback-capable reference implementations.

    • Required, select one of global (nw), semi-global (sg), or local (sw) alignment.

    • Optional use a prefix scan implementation.

    • parasail.{nw,sg,sw}_trace[_scan]

  • Vectorized.

    • Required, select one of global (nw), semi-global (sg), or local (sw) alignment.

    • Optional return alignment statistics.

    • Optional return DP table or last row/col.

    • Required, select vectorization strategy – striped is a good place to start, but scan is often faster for global alignment.

    • Required, select solution width. ‘sat’ will attempt 8-bit solution but if overflow is detected it will then perform the 16-bit operation. Can be faster in some cases, though 16-bit is often sufficient.

    • parasail.{nw,sg,sw}[_stats][{_table,_rowcol}]{_striped,_scan,_diag}{_8,_16,_32,_64,_sat}

  • Vectorized, traceback-capable.

    • Required, select one of global (nw), semi-global (sg), or local (sw) alignment.

    • Required, select vectorization strategy – striped is a good place to start, but scan is often faster for global alignment.

    • Required, select solution width. ‘sat’ will attempt 8-bit solution but if overflow is detected it will then perform the 16-bit operation. Can be faster in some cases, though 16-bit is often sufficient.

    • parasail.{nw,sg,sw}_trace{_striped,_scan,_diag}{_8,_16,_32,_64,_sat}
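As an illustration of the vectorized naming convention, the sketch below composes function names from the required and optional parts. The helper vectorized_name is hypothetical and not part of parasail. It only builds strings, so whether a given name exists in a particular parasail build is not checked here.

```python
from itertools import product

ALGORITHMS = ("nw", "sg", "sw")
STRATEGIES = ("striped", "scan", "diag")
WIDTHS = ("8", "16", "32", "64", "sat")
EXTRAS = (None, "table", "rowcol")


def vectorized_name(algorithm, strategy, width, stats=False, extra=None, trace=False):
    """Compose a vectorized function name following the convention above.

    extra is None, "table", or "rowcol"; trace=True selects the
    traceback-capable form, which takes no stats/table/rowcol part.
    """
    width = str(width)
    if algorithm not in ALGORITHMS:
        raise ValueError("unknown algorithm: %r" % (algorithm,))
    if strategy not in STRATEGIES:
        raise ValueError("unknown strategy: %r" % (strategy,))
    if width not in WIDTHS:
        raise ValueError("unknown width: %r" % (width,))
    if extra not in EXTRAS:
        raise ValueError("unknown extra: %r" % (extra,))
    if trace and (stats or extra):
        raise ValueError("traceback-capable names take no stats/table/rowcol part")
    parts = [algorithm]
    if trace:
        parts.append("trace")
    if stats:
        parts.append("stats")
    if extra:
        parts.append(extra)
    parts += [strategy, width]
    return "_".join(parts)


print(vectorized_name("sw", "scan", 16))                   # sw_scan_16
print(vectorized_name("sw", "striped", 8, stats=True))     # sw_stats_striped_8
print(vectorized_name("nw", "striped", "sat", trace=True)) # nw_trace_striped_sat

# Every non-traceback vectorized name the convention admits.
all_names = [
    vectorized_name(a, s, w, stats=st, extra=e)
    for a, st, e, s, w in product(ALGORITHMS, (False, True), EXTRAS, STRATEGIES, WIDTHS)
]
print(len(all_names))  # 270
```

A composed name could then be looked up with getattr(parasail, name) to select a function at run time, for example from a command-line option.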

Profile Function Naming Convention

It has been noted in the literature that some performance can be gained by reusing the query sequence when using the striped [Farrar, 2007] or scan [Daily, 2015] vector strategies. There is a special subset of functions that enables this behavior. For the striped and scan vector implementations only, a query profile can be created once and reused for subsequent alignments. This can noticeably speed up applications such as database search, where one query is aligned against many sequences.

  • Profile creation

    • Optional, prepare query profile for a function that returns statistics. Stats require additional data structures to be allocated.

    • Required, select solution width. ‘sat’ will allocate profiles for both 8- and 16-bit solutions.

    • parasail.profile_create[_stats]{_8,_16,_32,_64,_sat}

  • Profile use

    • Vectorized.

      • Required, select one of global (nw), semi-global (sg), or local (sw) alignment.

      • Optional return alignment statistics.

      • Optional return DP table or last row/col.

      • Required, select vectorization strategy – striped is a good place to start, but scan is often faster for global alignment.

      • Required, select solution width. ‘sat’ will attempt 8-bit solution but if overflow is detected it will then perform the 16-bit operation. Can be faster in some cases, though 16-bit is often sufficient.

      • parasail.{nw,sg,sw}[_stats][{_table,_rowcol}]{_striped,_scan}_profile{_8,_16,_32,_64,_sat}

    • Vectorized, traceback-capable.

      • Required, select one of global (nw), semi-global (sg), or local (sw) alignment.

      • Required, select vectorization strategy – striped is a good place to start, but scan is often faster for global alignment.

      • Required, select solution width. ‘sat’ will attempt 8-bit solution but if overflow is detected it will then perform the 16-bit operation. Can be faster in some cases, though 16-bit is often sufficient.

      • parasail.{nw,sg,sw}_trace{_striped,_scan}_profile{_8,_16,_32,_64,_sat}

The following example creates a 16-bit query profile and reuses it for two alignments. The width in the profile creation function should match the width of the alignment functions that use the profile.

profile = parasail.profile_create_16("asdf", parasail.blosum62)
result1 = parasail.sw_trace_striped_profile_16(profile, "asdf", 10, 1)
result2 = parasail.nw_scan_profile_16(profile, "asdf", 10, 1)
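The idea behind a query profile can be sketched in plain Python. The version below is a deliberate simplification: it stores scores in ordinary lists rather than the SIMD-friendly layout a vectorized library would use, and it scores without gaps, only to show that the per-query work is done once and then reused for each reference. The names build_query_profile, ungapped_score, and match_mismatch are hypothetical, and the 2/-1 scores are illustrative values.

```python
def build_query_profile(query, alphabet, score):
    """Precompute score(letter, query[i]) for every letter and query position."""
    return {letter: [score(letter, q) for q in query] for letter in alphabet}


def ungapped_score(profile, reference):
    """Score a reference position by position using only profile lookups.

    The reference must not be longer than the query the profile was built from.
    """
    return sum(profile[r][i] for i, r in enumerate(reference))


def match_mismatch(a, b):
    """Illustrative scoring: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


# Build the profile once for the query ...
profile = build_query_profile("ACGT", "ACGT", match_mismatch)

# ... then reuse it for every sequence in a (tiny) database.
for ref in ("ACGT", "ACGA", "TTTT"):
    print(ref, ungapped_score(profile, ref))  # ACGT 8, ACGA 5, TTTT -1
```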

Substitution Matrices

parasail bundles a number of substitution matrices, including PAM and BLOSUM. To use them, look them up by name (useful for command-line parsing) or use them directly. For example:

print(parasail.blosum62)
matrix = parasail.Matrix("pam100")

You can also create your own matrices with simple match/mismatch values. For more complex matrices, you can start by copying a built-in matrix, or start simple and modify values as needed. For example:

# copy a built-in matrix, then modify like a numpy array
matrix = parasail.blosum62.copy()
matrix[2,4] = 200
matrix[3,:] = 100
user_matrix = parasail.matrix_create("ACGT", 2, -1)

You can also parse simple matrix files using parasail.Matrix(filename) if the file is in the following format:

#
# Any line starting with '#' is a comment.
#
# Needs a row for the alphabet.  First column is a repeat of the
# alphabet and assumed to be identical in order to the first alphabet row.
#
        A   T   G   C   S   W   R   Y   K   M   B   V   H   D   N   U
A   5  -4  -4  -4  -4   1   1  -4  -4   1  -4  -1  -1  -1  -2  -4
T  -4   5  -4  -4  -4   1  -4   1   1  -4  -1  -4  -1  -1  -2   5
G  -4  -4   5  -4   1  -4   1  -4   1  -4  -1  -1  -4  -1  -2  -4
C  -4  -4  -4   5   1  -4  -4   1  -4   1  -1  -1  -1  -4  -2  -4
S  -4  -4   1   1  -1  -4  -2  -2  -2  -2  -1  -1  -3  -3  -1  -4
W   1   1  -4  -4  -4  -1  -2  -2  -2  -2  -3  -3  -1  -1  -1   1
R   1  -4   1  -4  -2  -2  -1  -4  -2  -2  -3  -1  -3  -1  -1  -4
Y  -4   1  -4   1  -2  -2  -4  -1  -2  -2  -1  -3  -1  -3  -1   1
K  -4   1   1  -4  -2  -2  -2  -2  -1  -4  -1  -3  -3  -1  -1   1
M   1  -4  -4   1  -2  -2  -2  -2  -4  -1  -3  -1  -1  -3  -1  -4
B  -4  -1  -1  -1  -1  -3  -3  -1  -1  -3  -1  -2  -2  -2  -1  -1
V  -1  -4  -1  -1  -1  -3  -1  -3  -3  -1  -2  -1  -2  -2  -1  -4
H  -1  -1  -4  -1  -3  -1  -3  -1  -3  -1  -2  -2  -1  -2  -1  -1
D  -1  -1  -1  -4  -3  -1  -1  -3  -1  -3  -2  -2  -2  -1  -1  -1
N  -2  -2  -2  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2
U  -4   5  -4  -4  -4   1  -4   1   1  -4  -1  -4  -1  -1  -2   5

matrix_from_filename = parasail.Matrix("filename.txt")
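To clarify the file format, here is a minimal pure-Python reader for it. It is a sketch written from the format description above and is not the parser parasail itself uses; parse_matrix_text is a hypothetical name, and details such as how parasail treats malformed rows are not covered.

```python
def parse_matrix_text(text):
    """Parse the format above into (alphabet, {(row_letter, col_letter): score})."""
    alphabet = None
    row_letters = []
    scores = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        fields = line.split()
        if alphabet is None:
            alphabet = fields  # first non-comment line is the alphabet row
            continue
        row_letter, values = fields[0], fields[1:]
        if len(values) != len(alphabet):
            raise ValueError("row %r has %d scores, expected %d"
                             % (row_letter, len(values), len(alphabet)))
        row_letters.append(row_letter)
        for col_letter, value in zip(alphabet, values):
            scores[(row_letter, col_letter)] = int(value)
    if alphabet is None:
        raise ValueError("no alphabet row found")
    if row_letters != alphabet:
        raise ValueError("first column must repeat the alphabet row in order")
    return alphabet, scores


example = """\
# tiny two-letter example
    A   T
A   5  -4
T  -4   5
"""
alphabet, scores = parse_matrix_text(example)
print(alphabet)            # ['A', 'T']
print(scores[("A", "T")])  # -4
```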

SSW Library Emulation

The SSW library (https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library) performs Smith-Waterman local alignment using SSE2 instructions and a striped vector. Its result provides the primary score, a secondary score, beginning and ending locations of the alignment for both the query and reference sequences, as well as a SAM CIGAR. There are a few parasail functions that emulate this behavior, with the only exception being that parasail does not calculate a secondary score.

score_size = 1 # 0, use 8-bit align; 1, use 16-bit; 2, try both
profile = parasail.ssw_init("asdf", parasail.blosum62, score_size)
result = parasail.ssw_profile(profile, "asdf", 10, 1)
print(result.score1)
print(result.cigar)
print(result.ref_begin1)
print(result.ref_end1)
print(result.read_begin1)
print(result.read_end1)
# or skip profile creation
result = parasail.ssw("asdf", "asdf", 10, 1, parasail.blosum62)

Banded Global Alignment

There is one version of banded global alignment available. Though it is not vectorized, it might still be faster than using other parasail global alignment functions, especially for large sequences. The function signature is similar to the other parasail functions with the only exception being k, the band width.

band_size = 3
result = parasail.nw_banded("asdf", "asdf", 10, 1, band_size, parasail.blosum62)
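The reason a banded alignment can be faster for large sequences is that it fills only the dynamic programming cells near the main diagonal. The sketch below counts those cells, assuming a band that keeps cell (i, j) when |i - j| <= k. This is one common definition, and parasail's exact interpretation of the band width may differ; cells_in_band is a hypothetical helper.

```python
def cells_in_band(len_query, len_ref, k):
    """Count DP cells (i, j) with |i - j| <= k."""
    count = 0
    for i in range(len_query):
        lo = max(0, i - k)            # first column inside the band
        hi = min(len_ref - 1, i + k)  # last column inside the band
        if hi >= lo:
            count += hi - lo + 1
    return count


n = 10000
print(n * n)                   # 100000000 cells in the full table
print(cells_in_band(n, n, 3))  # 69988 cells inside the band
```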

File Input

Parasail can parse FASTA, FASTQ, and gzipped versions of such files. The function parasail.sequences_from_file will return a list-like object containing Sequence instances. A parasail Sequence behaves like an immutable string but also has extra attributes name, comment, and qual. These attributes will return an empty string if the input file did not contain these fields.
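A short sketch of reading sequences follows. It writes a small FASTA file and, if parasail can be imported, reads it back with parasail.sequences_from_file. The helpers write_fasta and summarize are hypothetical, the import is guarded only so the file-writing part runs without parasail, and the way the header line is split into name and comment is assumed to follow the usual FASTA convention.

```python
import os
import tempfile

try:
    import parasail  # optional here so the file-writing part runs without it
except (ImportError, OSError):
    parasail = None


def write_fasta(path, records):
    """Write (name, comment, sequence) tuples as FASTA records."""
    with open(path, "w") as handle:
        for name, comment, sequence in records:
            header = name if not comment else "%s %s" % (name, comment)
            handle.write(">%s\n%s\n" % (header, sequence))


def summarize(path):
    """Return (name, comment, length) per sequence, or [] without parasail."""
    if parasail is None:
        return []
    return [(seq.name, seq.comment, len(str(seq)))
            for seq in parasail.sequences_from_file(path)]


path = os.path.join(tempfile.mkdtemp(), "example.fasta")
write_fasta(path, [("seq1", "first record", "ACGTACGT"), ("seq2", "", "TTGACA")])
print(summarize(path))
```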

Tracebacks

Parasail supports accessing a SAM CIGAR string from a result. You must use a traceback-capable alignment function. Refer to the function naming conventions above for the names of the traceback-capable alignment functions.

result = parasail.sw_trace("asdf", "asdf", 10, 1, parasail.blosum62)
cigar = result.cigar
# cigars have seq, len, beg_query, and beg_ref properties
# the seq property is encoded
print(cigar.seq)
# use decode method to return a decoded cigar string
print(cigar.decode())
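Once decoded, a CIGAR string can be processed with ordinary string handling. The sketch below assumes the decoded value is a standard SAM CIGAR string (it also accepts bytes, in case decode returns bytes under Python 3) and uses the SAM rules for which operations consume query and reference positions. The helpers parse_cigar and aligned_lengths are hypothetical and not part of parasail.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
_CONSUMES_QUERY = set("MIS=X")  # per the SAM specification
_CONSUMES_REF = set("MDN=X")    # per the SAM specification


def parse_cigar(text):
    """Split a decoded CIGAR string such as '3=1X2I' into (length, op) pairs."""
    if isinstance(text, bytes):
        text = text.decode("ascii")
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(text)]
    # Rebuilding the string catches stray characters the regex skipped.
    if "".join("%d%s" % pair for pair in ops) != text:
        raise ValueError("not a valid CIGAR string: %r" % (text,))
    return ops


def aligned_lengths(ops):
    """Return how many query and reference positions the alignment covers."""
    query = sum(n for n, op in ops if op in _CONSUMES_QUERY)
    ref = sum(n for n, op in ops if op in _CONSUMES_REF)
    return query, ref


ops = parse_cigar("3=1X2I4=1D")
print(ops)                   # [(3, '='), (1, 'X'), (2, 'I'), (4, '='), (1, 'D')]
print(aligned_lengths(ops))  # (10, 9)
```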

Citing parasail

If needed, please cite the following paper.

Daily, Jeff. (2016). Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. BMC Bioinformatics, 17(1), 1-11. doi:10.1186/s12859-016-0930-z

http://dx.doi.org/10.1186/s12859-016-0930-z

License: Battelle BSD-style

Copyright (c) 2015, Battelle Memorial Institute

  1. Battelle Memorial Institute (hereinafter Battelle) hereby grants permission to any person or entity lawfully obtaining a copy of this software and associated documentation files (hereinafter “the Software”) to redistribute and use the Software in source and binary forms, with or without modification. Such person or entity may use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and may permit others to do so, subject to the following conditions:

    • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimers.

    • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

    • Other than as used herein, neither the name Battelle Memorial Institute or Battelle may be used in any form whatsoever without the express written consent of Battelle.

    • Redistributions of the software in any form, and publications based on work performed using the software should include the following citation as a reference:

    Daily, Jeff. (2016). Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. BMC Bioinformatics, 17(1), 1-11. doi:10.1186/s12859-016-0930-z

  2. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL BATTELLE OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
