Text retrieval and analytics engine.

Project description

What is Caterpillar?

Caterpillar is a pure-Python text indexing and analytics library. Its features include:

  • pluggable key/value object store for storage (SQLite is currently the only implementation)

  • transaction layer for reading/writing (along with associated locking semantics)

  • supports searching indexes with several built-in scoring algorithm implementations (including TF/IDF)

  • stores additional data structures for analytics above and beyond traditional information retrieval data structures

  • has a plugin architecture for quickly accessing the data structures and performing custom analytics

  • has 100% test coverage
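
To make the TF/IDF scoring mentioned above concrete, here is a minimal standalone sketch of the general technique. It does not use Caterpillar's API and is not its implementation; the function name `tf_idf_scores`, the whitespace tokenisation, and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(documents, query_terms):
    """Score each document against the query with a plain TF-IDF sum.

    Hypothetical helper for illustration only; not part of Caterpillar.
    """
    # Naive tokenisation: lowercase and split on whitespace (an assumption).
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)

    # Document frequency: how many documents contain each term.
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if counts[term] == 0:
                continue  # term absent from this document
            tf = counts[term] / float(len(tokens))
            idf = math.log(n_docs / float(doc_freq[term]))
            score += tf * idf
        scores.append(score)
    return scores


docs = [
    "call me ishmael",
    "the whale the whale",
    "the sea was calm",
]
scores = tf_idf_scores(docs, ["whale"])
```

Only the second document contains the query term, so it is the only one with a non-zero score. Real scoring implementations usually add smoothing and length normalisation, which this sketch leaves out.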

Quick Example

A quick example of using Caterpillar:

import os
import tempfile

from caterpillar.processing.index import IndexWriter, IndexConfig
from caterpillar.processing.schema import TEXT, Schema, NUMERIC
from caterpillar.storage.sqlite import SqliteStorage

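# Create the index in a fresh temporary directory.
index_dir = os.path.join(tempfile.mkdtemp(), "examples")

# Read the sample text shipped with the Caterpillar test resources.
with open('caterpillar/test_resources/moby.txt', 'r') as f:
    data = f.read()

# Index one document; the schema declares a text field and a numeric field.
config = IndexConfig(SqliteStorage, Schema(text=TEXT, some_number=NUMERIC))
with IndexWriter(index_dir, config) as writer:
    writer.add_document(text=data, some_number=1)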

Installation

pip install caterpillar

Documentation

The documentation can be found here.

Roadmap

We are working on porting our issues from our internal issue tracker over to a more visible system. But, for the time being, here is a general roadmap:

  • Move to (possibly only) Python 3 (see below).

  • Revamp schema and field design.

  • Add a memory storage implementation.

  • Revamp query design.

  • Remove the NLTK dependency (great library, but only used for tokenisation).

  • Switch index structures over to a more efficient data structure (possibly numpy arrays or similar).

The current plan is to move to using GitHub issues with HuBoard, but stay tuned.

Python Version

Currently Python 2.7+ only. Work is underway to support Python 3+. WARNING: Caterpillar might become Python 3+ only in the future. Stay tuned.

BDFLs

Contributors

Anyone who is willing! In other words, none yet, but we more than welcome contributions.

Contributing

No code will be merged unless it has 100% test coverage and passes pep8. We code with a line length of 120 characters (see the [pep8] section of tox.ini) and use py.test for testing. Tests live in a test sub-folder in each package. We generally run coverage as follows:

coverage erase; coverage run --source caterpillar -m py.test -v caterpillar; coverage report
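
As a rough illustration of the kind of test py.test collects, here is a minimal sketch. The `count_terms` function is a hypothetical stand-in defined inline so that the example is self-contained; in a real contribution the code under test would be imported from the relevant caterpillar package and the test module would sit in that package's test sub-folder.

```python
# Hypothetical function under test, defined inline for illustration only.
def count_terms(text):
    """Return a dict mapping each lowercased term to its frequency."""
    counts = {}
    for term in text.lower().split():
        counts[term] = counts.get(term, 0) + 1
    return counts


# py.test collects functions whose names start with "test_".
def test_count_terms_counts_repeats():
    assert count_terms("the whale The") == {"the": 2, "whale": 1}


# Cover the empty-input branch too, since 100% coverage is required.
def test_count_terms_empty_text():
    assert count_terms("") == {}
```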

Download files

Source Distributions

caterpillar-1.0.0.dev7.tar.gz (45.3 kB)

Uploaded Source

caterpillar-1.0.0.dev7.macosx-10.10-x86_64.tar.gz (87.8 kB)

Uploaded Source

Built Distribution

caterpillar-1.0.0.dev7-py2-none-any.whl (57.5 kB)

Uploaded Python 2

File details

Details for the file caterpillar-1.0.0.dev7.tar.gz.

File hashes

Hashes for caterpillar-1.0.0.dev7.tar.gz
Algorithm Hash digest
SHA256 002a32e2e6498716fbe3ab6b373308d65f04740616ae4cc53fc2bcc873851fb7
MD5 260df01edb34422d32d310f577b022cd
BLAKE2b-256 c1d060e0132b3dd9707d65f5ba5be3d54cd734160a2dfcc13c1ae2a5a3ce8b22

File details

Details for the file caterpillar-1.0.0.dev7.macosx-10.10-x86_64.tar.gz.

File hashes

Hashes for caterpillar-1.0.0.dev7.macosx-10.10-x86_64.tar.gz
Algorithm Hash digest
SHA256 7a1ac81602fc713b2527fdf034998a5c074fdf227e18664bd10118b8f420e4d7
MD5 8786f4930264c295ba155da99c06c93d
BLAKE2b-256 b4b3bc0e1f988e93ece7baedfa0520086183ac4b3ae58096591f149efa6f8dbb

File details

Details for the file caterpillar-1.0.0.dev7-py2-none-any.whl.

File hashes

Hashes for caterpillar-1.0.0.dev7-py2-none-any.whl
Algorithm Hash digest
SHA256 112a2ece42d6216c5ef6c1c45f693677f8bb2b89611231c45c0bf6fc1ae91981
MD5 3b25e62b126b68f4b819b984b10cb3b6
BLAKE2b-256 a4b20b6817c3ac41b2788d90c92c9ab4aa5df2f59d2eda259616f563fd0cfe4d
