
Building blocks for spaCy Matcher patterns


corpus-patterns


A preparatory utilities library for building spaCy Matcher patterns.

Create a custom tokenizer

import spacy
from corpus_patterns import set_tokenizer

nlp = spacy.blank("en")
nlp.tokenizer = set_tokenizer(nlp)

The tokenizer:

  1. Removes dashes from infixes
  2. Adds prefix/suffix rules for parentheses and brackets
  3. Adds special exceptions to treat dotted text as a single token
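
To see these rules in effect, tokenize a sample string and inspect the result. The sample text below and the resulting tokens are illustrative only; the exact output depends on the rules shipped with the library.

import spacy
from corpus_patterns import set_tokenizer

nlp = spacy.blank("en")
nlp.tokenizer = set_tokenizer(nlp)

# Print the tokens produced for a sample string; the default spaCy
# tokenizer would typically split dotted and bracketed text differently.
doc = nlp("Sec. 5(b) covers well-known exceptions.")
print([token.text for token in doc])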

Use with a modified config file:

import spacy
from corpus_patterns import set_tokenizer

@spacy.registry.tokenizers("test")  # type: ignore
def create_corpus_tokenizer():
    def create_tokenizer(nlp):
        return set_tokenizer(nlp)
    return create_tokenizer

nlp = spacy.load(
    "en_core_web_sm",
    config={"nlp": {"tokenizer": {"@tokenizers": "test"}}},
)
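
Assuming the tokenizer has been registered under the name "test" as above, the equivalent entry in a spaCy config.cfg file is:

[nlp.tokenizer]
@tokenizers = "test"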

Add .jsonl files to a directory

Each file will contain lines of spaCy Matcher patterns.

from corpus_patterns import create_rules
from pathlib import Path

create_rules(folder=Path("location-here"))  # check directory
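
The exact line format is determined by the library, but as an illustration of the general idea, a single spaCy Matcher pattern serialized as one line of a .jsonl file could look like this (a hypothetical pattern matching text such as "Sec. 5"):

[{"LOWER": "sec"}, {"TEXT": "."}, {"IS_DIGIT": true}]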

Utils

  1. annotate_fragments() - given an nlp object and some *.txt files, create a single annotation *.jsonl file
  2. extract_lines_from_txt_files() - accepts an iterator of *.txt files and yields each line (after sorting the lines and removing duplicates)
  3. split_data() - given a list of text strings, split them into two groups based on the ratio provided (defaults to 0.80) and return a dictionary containing both groups; see the sketch after this list
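
A minimal sketch of split_data() usage; the exact signature and the keys of the returned dictionary are assumptions rather than documented behaviour:

from corpus_patterns import split_data

texts = ["first sample line", "second sample line", "third sample line"]

# Assumed call: a list of strings split using the default 0.80 ratio.
# The result is expected to be a dictionary with two groups of texts,
# e.g. usable as train / validation splits.
groups = split_data(texts)
for key, items in groups.items():
    print(key, len(items))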

