Text retrieval and analytics engine.

Project description

What is Caterpillar?

Caterpillar is a pure Python text indexing and analytics library. Some features include:

  • pluggable key/value object store for storage (SQLite is currently the only implementation)

  • transaction layer for reading/writing (along with associated locking semantics)

  • supports searching indexes with some built-in scoring algorithm implementations (including TF/IDF)

  • stores additional data structures for analytics above and beyond traditional information retrieval data structures

  • has a plugin architecture for quickly accessing the data structures and performing custom analytics

  • has 100% test coverage
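To make the TF/IDF scoring mentioned in the list above concrete, here is a minimal, self-contained sketch in plain Python. It does not use Caterpillar and is not Caterpillar's implementation; the function name `tf_idf_scores`, the raw-count term frequency, and the `log(N / df)` weighting are assumptions chosen for illustration only.

```python
from __future__ import division  # true division on Python 2.7 as well

import math
from collections import Counter


def tf_idf_scores(documents, query_terms):
    """Score each tokenised document against the query terms.

    tf is the raw count of a term in a document; idf is log(N / df), where
    N is the number of documents and df the number containing the term.
    (Hypothetical helper, not part of Caterpillar.)
    """
    n_docs = len(documents)
    term_counts = [Counter(doc) for doc in documents]
    scores = []
    for counts in term_counts:
        score = 0.0
        for term in query_terms:
            # Document frequency: how many documents contain this term.
            df = sum(1 for c in term_counts if term in c)
            if df:
                score += counts[term] * math.log(n_docs / df)
        scores.append(score)
    return scores


docs = [
    "call me ishmael".split(),
    "the whale the white whale".split(),
    "a ship and a captain".split(),
]
scores = tf_idf_scores(docs, ["whale"])
```

Only the second document contains "whale", so it is the only one with a non-zero score. Real engines usually add smoothing and length normalisation on top of this basic form.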

Quick Example

A quick example of using Caterpillar:

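import os
import tempfile

from caterpillar.processing.index import IndexWriter, IndexConfig
from caterpillar.processing.schema import TEXT, Schema, NUMERIC
from caterpillar.storage.sqlite import SqliteStorage

# Create the index in a fresh temporary directory.
index_dir = os.path.join(tempfile.mkdtemp(), "examples")

# Read the sample text; the file is closed before indexing starts.
with open('caterpillar/test_resources/moby.txt', 'r') as f:
    data = f.read()

# Configure a SQLite-backed index with one text field and one numeric field,
# then add a single document inside the writer's context.
config = IndexConfig(SqliteStorage, Schema(text=TEXT, some_number=NUMERIC))
with IndexWriter(index_dir, config) as writer:
    writer.add_document(text=data, some_number=1)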

Installation

pip install caterpillar

Documentation

The documentation can be found here.

Roadmap

We are working on porting our issues from our internal issue tracker over to a more visible system. But, for the time being, here is a general roadmap:

  • Move to (possibly only) Python 3 (see below).

  • Revamp schema and field design.

  • Add a memory storage implementation.

  • Revamp query design.

  • Remove the NLTK dependency (great library, but only used for tokenisation).

  • Switch index structures over to a more efficient data structure (possibly numpy arrays or similar).

The current plan is to move to using GitHub issues with HuBoard, but stay tuned.

Python Version

Currently Python 2.7+ only. Work is underway to support Python 3+. WARNING: Caterpillar might become Python 3+ only in the future. Stay tuned.

BDFLs

Contributors

Anyone who is willing! In other words, none yet, but we are more than happy to accept contributions.

Contributing

No code will be merged unless it has 100% test coverage and passes pep8. We code with a line length of 120 characters (see the [pep8] section of tox.ini) and we use py.test for testing. Tests are in a test sub-folder in each package. We generally run coverage as follows:

coverage erase; coverage run --source caterpillar -m py.test -v caterpillar; coverage report

Download files

Source Distributions

caterpillar-1.0.0.dev12.tar.gz (46.0 kB view details)

Uploaded Source

caterpillar-1.0.0.dev12.linux-x86_64.tar.gz (89.8 kB view details)

Uploaded Source

Built Distribution

caterpillar-1.0.0.dev12-py2-none-any.whl (58.0 kB view details)

Uploaded Python 2

File details

Details for the file caterpillar-1.0.0.dev12.tar.gz.

File metadata

File hashes

Hashes for caterpillar-1.0.0.dev12.tar.gz
Algorithm Hash digest
SHA256 d1ef64dafa81b2bccf531967ae96609c94a32b880c98643104d58bfa8d51a53d
MD5 29d1431224f0040236a74834c636ad4a
BLAKE2b-256 192317df96c87b995f77d64c826647cca9045f00d48f96276888826a42749122

See more details on using hashes here.
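As a rough illustration of how the SHA256 digest listed above might be checked against a downloaded file, here is a minimal sketch using only the standard library. The helper names `sha256_of_file` and `matches_expected` are hypothetical, and the sketch assumes the archive has already been downloaded to a local path.

```python
import hashlib

# SHA256 digest for caterpillar-1.0.0.dev12.tar.gz, copied from the listing above.
EXPECTED_SHA256 = "d1ef64dafa81b2bccf531967ae96609c94a32b880c98643104d58bfa8d51a53d"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's digest equals the expected digest (hypothetical helper)."""
    return sha256_of_file(path) == expected
```

Calling `matches_expected` with the path of the downloaded archive returns `True` only if its contents match the published digest.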

File details

Details for the file caterpillar-1.0.0.dev12.linux-x86_64.tar.gz.

File metadata

File hashes

Hashes for caterpillar-1.0.0.dev12.linux-x86_64.tar.gz
Algorithm Hash digest
SHA256 b31f8f93bbcaa9b35bf5389b6a93709eec7bb386bf349689a628d42b86ffcf2b
MD5 ad40b358d5dcc579a6aa006e81214fcf
BLAKE2b-256 571833e3bcf14e6d06383a74bc5cb6839fa262cfb67ae71e2f444eb693e37338

File details

Details for the file caterpillar-1.0.0.dev12-py2-none-any.whl.

File metadata

File hashes

Hashes for caterpillar-1.0.0.dev12-py2-none-any.whl
Algorithm Hash digest
SHA256 81600c0d5b7a19781e25407b8367378f7a32aba8576a75b4e6fa3b3dfa277ee6
MD5 521a1c300aae96e9d4284898bdcf6866
BLAKE2b-256 703b3023ffdba27867daf5e39fd7eddd90f659faed0a7f531cffe62e7297ca5d
