fuzzysearch is useful for finding approximate subsequence matches
Project description
fuzzysearch is a Python library for fuzzy substring searches. It implements efficient ad-hoc searching for approximate sub-sequences. Matching is done using a generalized Levenshtein Distance metric, with configurable parameters.
Free software: MIT license
Documentation: http://fuzzysearch.rtfd.org.
Installation
Just install using pip:
$ pip install fuzzysearch
Features
Fuzzy sub-sequence search: Find parts of a sequence which match a given sub-sequence.
Easy to use: A single function to call which returns a list of matches.
Set a maximum Levenshtein Distance for matches, including individual limits for the number of substitutions, insertions and/or deletions allowed for near-matches.
Includes optimized implementations for specific use-cases, e.g. allowing only substitutions.
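To make the substitutions-only case concrete, here is a minimal pure-Python sketch of what such a search computes. It is not the library's implementation: the function name `find_with_substitutions` and its tuple return value are assumptions made for illustration only. When only substitutions are allowed, every match has the same length as the pattern, so a sliding window with a mismatch count is enough.

```python
def find_with_substitutions(pattern, text, max_substitutions):
    """Illustrative sketch (hypothetical helper, not part of fuzzysearch).

    Returns (start, end, dist) tuples for every window of `text` that
    differs from `pattern` in at most `max_substitutions` positions.
    """
    matches = []
    width = len(pattern)
    for start in range(len(text) - width + 1):
        window = text[start:start + width]
        # Count positions where the pattern and the window disagree.
        mismatches = sum(a != b for a, b in zip(pattern, window))
        if mismatches <= max_substitutions:
            matches.append((start, start + width, mismatches))
    return matches


print(find_with_substitutions('PATTERN', '---PATTEXN---', 1))
```

This naive version costs roughly len(text) * len(pattern) comparisons; it only shows why the restricted case is simpler than a full Levenshtein search.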
Simple Examples
Just call find_near_matches() with the sequence to search, the sub-sequence you’re looking for, and the matching parameters:
>>> from fuzzysearch import find_near_matches
# search for 'PATTERN' with a maximum Levenshtein Distance of 1
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1)]
>>> sequence = '''\
GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
GGGATAGG'''
>>> subsequence = 'TGCACTGTAGGGATAACAAT' # distance = 1
>>> find_near_matches(subsequence, sequence, max_l_dist=2)
[Match(start=3, end=24, dist=1)]
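For readers who want to see what a search bounded by a maximum Levenshtein distance involves, the following is a small sketch of the classic dynamic-programming approach to approximate substring matching. It is an illustration under stated assumptions, not how fuzzysearch is implemented: the name `near_match_ends` is hypothetical, and it reports only end positions and distances rather than Match objects.

```python
def near_match_ends(pattern, text, max_l_dist):
    """Illustrative sketch (hypothetical helper, not part of fuzzysearch).

    Returns (end, dist) pairs where some substring of `text` ending at
    index `end` (exclusive) is within `max_l_dist` edits of `pattern`.
    """
    # prev[i] is the smallest edit distance between pattern[:i] and any
    # substring of the text ending at the previous text position.
    prev = list(range(len(pattern) + 1))
    results = []
    for end, char in enumerate(text, start=1):
        # An empty pattern prefix matches the empty substring at zero cost,
        # which lets a match start anywhere in the text.
        cur = [0]
        for i, p in enumerate(pattern, start=1):
            cur.append(min(
                prev[i - 1] + (p != char),  # exact match or substitution
                prev[i] + 1,                # insertion: extra element in the text
                cur[i - 1] + 1,             # deletion: pattern element skipped
            ))
        if cur[-1] <= max_l_dist:
            results.append((end, cur[-1]))
        prev = cur
    return results


print(near_match_ends('PATTERN', '---PATERN---', max_l_dist=1))
```

The end position 9 and distance 1 agree with the first example above, where the match spans indices 3 to 9 of the searched sequence.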
Advanced Search Criteria
The search function supports four possible match criteria, which may be supplied in any combination:
maximum Levenshtein distance
maximum # of substitutions
maximum # of deletions (elements appearing in the pattern searched for, which are skipped in the matching sub-sequence)
maximum # of insertions (elements added in the matching sub-sequence which don’t appear in the pattern searched for)
Not supplying a criterion means that there is no limit for it. For this reason, one must always supply either max_l_dist or all three of the other criteria (or both); otherwise the number of allowed differences would be unbounded.
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1)]
# this will not match since max-deletions is set to zero
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1, max_deletions=0)
[]
# note that a deletion + insertion may be combined to match a substitution
>>> find_near_matches('PATTERN', '---PAT-ERN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=1)] # the Levenshtein distance is still 1
# ... but deletion + insertion may also match other, non-substitution differences
>>> find_near_matches('PATTERN', '---PATERRN---', max_deletions=1, max_insertions=1, max_substitutions=0)
[Match(start=3, end=10, dist=2)]
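The way the separate limits interact can be sketched with a small recursive check. This is a simplified illustration, not the library's algorithm: `fits_limits` is a hypothetical helper that tests a single candidate string against the pattern, and unlike find_near_matches() it requires all three per-operation limits to be given explicitly.

```python
from functools import lru_cache


def fits_limits(pattern, candidate, max_subs, max_ins, max_dels, max_l_dist=None):
    """Illustrative sketch (hypothetical helper, not part of fuzzysearch).

    Returns True if `candidate` can be aligned with `pattern` using no more
    than the given numbers of substitutions, insertions and deletions, and
    no more than `max_l_dist` edits in total.
    """
    if max_l_dist is None:
        # Assumption for this sketch: the total is bounded by the sum of the parts.
        max_l_dist = max_subs + max_ins + max_dels

    @lru_cache(maxsize=None)
    def walk(i, j, subs, ins, dels):
        # i indexes the pattern, j indexes the candidate.
        if subs + ins + dels > max_l_dist:
            return False
        if i == len(pattern) and j == len(candidate):
            return True
        if i < len(pattern) and j < len(candidate):
            if pattern[i] == candidate[j]:
                if walk(i + 1, j + 1, subs, ins, dels):
                    return True
            elif subs < max_subs and walk(i + 1, j + 1, subs + 1, ins, dels):
                return True
        # Deletion: a pattern element is skipped in the candidate.
        if i < len(pattern) and dels < max_dels and walk(i + 1, j, subs, ins, dels + 1):
            return True
        # Insertion: the candidate has an element the pattern lacks.
        if j < len(candidate) and ins < max_ins and walk(i, j + 1, subs, ins + 1, dels):
            return True
        return False

    return walk(0, 0, 0, 0, 0)


# One deletion is enough to turn 'PATTERN' into 'PATERN' ...
print(fits_limits('PATTERN', 'PATERN', max_subs=1, max_ins=1, max_dels=1, max_l_dist=1))
# ... so forbidding deletions rules the candidate out.
print(fits_limits('PATTERN', 'PATERN', max_subs=1, max_ins=1, max_dels=0, max_l_dist=1))
# A deletion plus an insertion can stand in for a forbidden substitution.
print(fits_limits('PATTERN', 'PAT-ERN', max_subs=0, max_ins=1, max_dels=1))
```

The three calls mirror the examples above: the first candidate is accepted, the second is rejected once deletions are disallowed, and the third is accepted even though substitutions are not permitted.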
History
0.3.0 (2015-02-12)
Added C extensions for several search functions as well as internal functions
Use C extensions if available, or pure-Python implementations otherwise
setup.py attempts to build C extensions, but installs without if build fails
Added --noexts setup.py option to avoid trying to build the C extensions
Greatly improved testing and coverage
0.2.2 (2014-03-27)
Added support for searching through BioPython Seq objects
Added specialized search function allowing only substitutions and insertions
Fixed several bugs
0.2.1 (2014-03-14)
Fixed major match grouping bug
0.2.0 (2014-03-13)
New utility function find_near_matches() for easier use
Additional documentation
0.1.0 (2013-11-12)
Two working implementations
Extensive test suite; all tests passing
Full support for Python 2.6-2.7 and 3.1-3.3
Bumped status from Pre-Alpha to Alpha
0.0.1 (2013-11-01)
First release on PyPI.