Skip to main content

ahocorapy - Pure python ahocorasick implementation

Project description

Build Status Test Coverage PyPi Version PyPi License PyPi Versions PyPi Wheel

ahocorapy - Fast Many-Keyword Search in Pure Python

ahocorapy is a pure python implementation of the Aho-Corasick Algorithm. Given a list of keywords one can check if at least one of the keywords exist in a given text in linear time.

Comparison:

Why another Aho-Corasick implementation?

We started working on this in the beginning of 2016. Our requirements included unicode support combined with python2.7. That was impossible with C-extension based libraries (like pyahocorasick). Pure python libraries were very slow or unusable due to memory explosion. Since then another pure python library was released py-aho-corasick. The repository also contains some discussion about different implementations. There is also acora, but it includes the note (‘current construction algorithm is not suitable for really large sets of keywords’) which really was the case the last time I tested, because RAM ran out quickly.

Differences

  • Compared to pyahocorasick our library supports unicode in python 2.7 just like py-aho-corasick. We don’t use any C-Extension so the library is not platform dependant.

  • On top of the standard Aho-Corasick longest suffix search, we also perform a shortcutting routine in the end, so that our lookup is fast while, the setup takes longer. During set up we go through the states and directly add transitions that are “offered” by the longest suffix or their longest suffixes. This leads to faster lookup times, because in the end we only have to follow simple transitions and don’t have to perform any additional suffix lookup. It also leads to a bigger memory footprint, because the number of transitions is higher, because they are all included explicitely and not implicitely hidden by suffix pointers.

  • We added a small tool that helps you visualize the resulting graph. This may help understanding the algorithm, if you’d like. See below.

Performance

I compared the two libraries mentioned above with ahocorapy. We used 50,000 keywords long list and an input text of 34,199 characters. In the text only one keyword of the list is contained. The setup process was run once per library and the search process was run 100 times. The following results are in seconds (not averaged for the lookup).

You can perform this test yourself using python tests/ahocorapy_performance_test.py. (Except for the pyahocorasick_py results. These were taken by importing the pure python version of the code of pyahocorasick. It’s not available through pypi as stated in the code.)

I also added measurements for the pure python libraries with run with pypy.

These are the results:

Library (Variant)

Setup (1x)

Search (100x)

ahocorapy

1.2s

1.76s

ahocorapy (run with pypy)

0.9s

0.09s

pyahocorasick

0.1s

0.06s

pyahocorasick (pure python variant in github repo)

0.5s

1.68s

py_aho_corasick

1.9s

13.2s

py_aho_corasick (run with pypy)

1.3s

3.71s

As expected the C-Extension shatters the pure python implementations. Even though there is probably still room for optimization in ahocorapy we are not going to get to the mark that pyahocorasick sets. ahocorapy’s lookups are faster than py_aho_corasick. When run with pypy [PyPy 5.1.2 with GCC 5.3.1 20160413] ahocorapy is almost as fast as pyahocorasick, at least when it comes to searching. The setup overhead is higher due to the suffix shortcutting mechanism used.

Basic Usage:

Installation

pip install ahocorapy

Creation of the Search Tree

from ahocorapy.keywordtree import KeywordTree
kwtree = KeywordTree(case_insensitive=True)
kwtree.add('malaga')
kwtree.add('lacrosse')
kwtree.add('mallorca')
kwtree.add('mallorca bella')
kwtree.add('orca')
kwtree.finalize()

Searching

result = kwtree.search('My favorite islands are malaga and sylt.')
print(result)

Prints :

('malaga', 24)

The search_all method returns a generator for all keywords found, or None if there is none.

results = kwtree.search_all('malheur on mallorca bellacrosse')
for result in results:
    print(result)

Prints :

('mallorca', 11)
('orca', 15)
('mallorca bella', 11)
('lacrosse', 23)

Drawing Graph

You can print the underlying graph with the Visualizer class. This feature requires a working pygraphviz library installed.

from ahocorapy_visualizer.visualizer import Visualizer
visualizer = Visualizer()
visualizer.draw('readme_example.png', kwtree)

The resulting .png of the graph looks like this:

Keyword Tree

graph for kwtree

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ahocorapy-1.4.1.tar.gz (7.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ahocorapy-1.4.1-py2.py3-none-any.whl (11.0 kB view details)

Uploaded Python 2Python 3

File details

Details for the file ahocorapy-1.4.1.tar.gz.

File metadata

  • Download URL: ahocorapy-1.4.1.tar.gz
  • Upload date:
  • Size: 7.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for ahocorapy-1.4.1.tar.gz
Algorithm Hash digest
SHA256 88f7360ebad64e508e1096914b04362ff69877b4fee07dfd9e08449a417c7b64
MD5 a1a001c959871ce93613e5bc17cc99ec
BLAKE2b-256 97870caffb8a0a2cdb9a32d8f0aa5cc1506099546c1d4c3d1abf51c4c1444ef4

See more details on using hashes here.

File details

Details for the file ahocorapy-1.4.1-py2.py3-none-any.whl.

File metadata

File hashes

Hashes for ahocorapy-1.4.1-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 020df0042229360a2beac69c9f454b050842df2feaed4053202118b09b8720f8
MD5 da209b3bd6613d8fdf8f9f2a5e70f995
BLAKE2b-256 dc072842120b199a014ee7dbd98573bd34ef324c9b4632a5d99b4d18b8421f78

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page