Text retrieval and analytics engine.

Project description

What is Caterpillar?

Caterpillar is a pure-Python text indexing and analytics library. Its features include:

  • pluggable key/value object store for storage (SQLite is currently the only implementation)

  • transaction layer for reading/writing (along with associated locking semantics)

  • supports searching indexes, with several built-in scoring algorithm implementations (including TF/IDF)

  • stores additional data structures for analytics above and beyond traditional information retrieval data structures

  • has a plugin architecture for quickly accessing the data structures and performing custom analytics

  • has 100% test coverage
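To make the TF/IDF scoring mentioned in the list above concrete, here is a minimal standard-library sketch of the general technique. It is not Caterpillar's implementation: the function name `tf_idf_scores`, the plain dictionary of token lists, and the unsmoothed `log(N / df)` weighting are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, documents):
    """Score each document against the query with a basic TF-IDF sum.

    `documents` maps a document id to its list of tokens (hypothetical layout).
    """
    n_docs = len(documents)

    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in documents.items():
        term_counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term_counts[term] == 0:
                continue  # the term does not occur in this document
            # Term frequency, normalised by document length.
            tf = term_counts[term] / float(len(tokens))
            # Inverse document frequency: rarer terms weigh more.
            idf = math.log(n_docs / float(doc_freq[term]))
            score += tf * idf
        scores[doc_id] = score
    return scores


docs = {
    "a": "the whale surfaced near the ship".split(),
    "b": "the ship sailed at dawn".split(),
    "c": "a quiet harbour at dawn".split(),
}
scores = tf_idf_scores(["whale", "ship"], docs)
# Document ids ordered from best to worst match.
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy corpus, "whale" occurs in only one document, so it contributes more to the score than "ship", which occurs in two. A production scorer would normally read these frequencies from the index rather than recount tokens for every query.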

Quick Example

A quick example of indexing a document with Caterpillar:

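import os
import tempfile

from caterpillar.processing.index import IndexWriter, IndexConfig
from caterpillar.processing.schema import TEXT, Schema, NUMERIC
from caterpillar.storage.sqlite import SqliteStorage

# Create the index inside a fresh temporary directory.
index_dir = os.path.join(tempfile.mkdtemp(), "examples")

# Read the sample document.
with open('caterpillar/test_resources/moby.txt', 'r') as f:
    data = f.read()

# Configure SQLite storage with a schema of one text field and one numeric field,
# then add the document; the writer is closed when the block exits.
config = IndexConfig(SqliteStorage, Schema(text=TEXT, some_number=NUMERIC))
with IndexWriter(index_dir, config) as writer:
    writer.add_document(text=data, some_number=1)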

Installation

pip install caterpillar

Documentation

The documentation can be found here.

Roadmap

We are porting our issues from our internal issue tracker to a more visible system. In the meantime, here is a general roadmap:

  • Move to (possibly only) Python 3 (see below).

  • Revamp schema and field design.

  • Add a memory storage implementation.

  • Revamp query design.

  • Remove the NLTK dependency (great library, but only used for tokenisation).

  • Switch index structures over to a more efficient data structure (possibly numpy arrays or similar).
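As a rough illustration of what a pluggable key/value store with both a memory and a SQLite backend could look like, here is a small sketch. Everything in it is hypothetical: the class names `MemoryStorage` and `SqliteKVStorage`, the `get`/`set`/`delete` methods, and the `round_trip` helper are assumptions for illustration and do not reflect Caterpillar's actual storage interface.

```python
import sqlite3


class MemoryStorage(object):
    """Hypothetical dict-backed store; contents are lost when the process exits."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data[key]  # raises KeyError for a missing key

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        del self._data[key]


class SqliteKVStorage(object):
    """Hypothetical store with the same methods, backed by one SQLite table."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)"
        )

    def get(self, key):
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def set(self, key, value):
        # Using the connection as a context manager commits on success
        # and rolls back if the statement raises.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value)
            )

    def delete(self, key):
        with self._conn:
            self._conn.execute("DELETE FROM kv WHERE key = ?", (key,))


def round_trip(storage):
    """Exercise any backend through the shared interface."""
    storage.set("term:whale", "3")
    value = storage.get("term:whale")
    storage.delete("term:whale")
    return value


results = [round_trip(backend) for backend in (MemoryStorage(), SqliteKVStorage())]
```

Because calling code only touches the three shared methods, either backend can be swapped in without changes elsewhere, which is the main appeal of keeping the storage layer pluggable.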

The current plan is to move to GitHub issues with HuBoard, but stay tuned.

Python Version

Currently Python 2.7+ only. Work is underway to support Python 3+. WARNING: Caterpillar might become Python 3+ only in the future. Stay tuned.

BDFLs

Contributors

Anyone who is willing! In other words, none yet, but we are more than happy to accept contributions.

Contributing

No code will be merged unless it has 100% test coverage and passes pep8. We code with a line length of 120 characters (see the [pep8] section of tox.ini) and we use py.test for testing. Tests live in a test sub-folder in each package. We generally run coverage as follows:

coverage erase; coverage run --source caterpillar -m py.test -v caterpillar; coverage report

Download files

Source Distributions

caterpillar-1.0.0.dev10.tar.gz (45.6 kB view details)

Uploaded Source

caterpillar-1.0.0.dev10.macosx-10.10-x86_64.tar.gz (89.9 kB view details)

Uploaded Source

Built Distribution

caterpillar-1.0.0.dev10-py2-none-any.whl (57.8 kB view details)

Uploaded Python 2

File details

Details for the file caterpillar-1.0.0.dev10.tar.gz.

File hashes

Hashes for caterpillar-1.0.0.dev10.tar.gz
Algorithm Hash digest
SHA256 55e65a3905c551fe197e6b74fce4119a7bdc2413c9766091d934c376aed4b9b8
MD5 e96a6fe2c98b97e29672462b20ffc2cb
BLAKE2b-256 195b3ff47a2a8a541b57cfec934f133f72f050f0005ec1c3da183cfa51775148

See more details on using hashes here.

File details

Details for the file caterpillar-1.0.0.dev10.macosx-10.10-x86_64.tar.gz.

File hashes

Hashes for caterpillar-1.0.0.dev10.macosx-10.10-x86_64.tar.gz
Algorithm Hash digest
SHA256 e661f482c0d71ee626404800a22d1fc2c96280f7391ba4e362815136d8277e3e
MD5 035d31199705e87758ffa4b90607047d
BLAKE2b-256 1d45fe536236bf6ac11efe903a21eba5f9acdd55b9de092840c1ab24078adf5c

File details

Details for the file caterpillar-1.0.0.dev10-py2-none-any.whl.

File hashes

Hashes for caterpillar-1.0.0.dev10-py2-none-any.whl
Algorithm Hash digest
SHA256 329770b4fe528bc1580a6726eb8b693986282906f35829676064140bca87ab36
MD5 28f8d9d10fa8b0bbfc95648c15658fe9
BLAKE2b-256 136340283ecff23d35219390a37033f84bcd97a5a70ab2fb0215600dd711a9e2
