
Building blocks for spaCy Matcher patterns


corpus-patterns


A preparatory utilities library for building spaCy Matcher patterns.

Create a custom tokenizer

import spacy
from corpus_patterns import set_tokenizer

nlp = spacy.blank("en")
nlp.tokenizer = set_tokenizer(nlp)

The tokenizer:

  1. Removes dashes from infixes
  2. Adds prefix/suffix rules for parentheses and brackets
  3. Adds special exceptions to treat dotted text as a single token
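
To see these rules in effect, tokenize a sample string and inspect the result. The sample text below and the resulting tokens are illustrative only; the exact output depends on the rules shipped with the library.

import spacy
from corpus_patterns import set_tokenizer

nlp = spacy.blank("en")
nlp.tokenizer = set_tokenizer(nlp)

# Print the tokens produced for a sample string; the default spaCy
# tokenizer would typically split dotted and bracketed text differently.
doc = nlp("Sec. 5(b) covers well-known exceptions.")
print([token.text for token in doc])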

Use with a modified config file:

import spacy
from corpus_patterns import set_tokenizer

@spacy.registry.tokenizers("test")  # type: ignore
def create_corpus_tokenizer():
    def create_tokenizer(nlp):
        return set_tokenizer(nlp)
    return create_tokenizer

nlp = spacy.load(
    "en_core_web_sm",
    config={"nlp": {"tokenizer": {"@tokenizers": "test"}}},
)
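
Assuming the tokenizer has been registered under the name "test" as above, the equivalent entry in a spaCy config.cfg file is:

[nlp.tokenizer]
@tokenizers = "test"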

Add .jsonl files to a directory

Each file will contain lines of spaCy Matcher patterns.

from corpus_patterns import create_rules
from pathlib import Path

create_rules(folder=Path("location-here"))  # check directory
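
The exact line format is determined by the library, but as an illustration of the general idea, a single spaCy Matcher pattern serialized as one line of a .jsonl file could look like this (a hypothetical pattern matching text such as "Sec. 5"):

[{"LOWER": "sec"}, {"TEXT": "."}, {"IS_DIGIT": true}]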

Utils

  1. annotate_fragments() - given an nlp object and some *.txt files, create a single annotation *.jsonl file
  2. extract_lines_from_txt_files() - accepts an iterator of *.txt files and yields each line (after sorting the lines and removing duplicates)
  3. split_data() - given a list of text strings, split them into two groups based on the ratio provided (defaults to 0.80) and return a dictionary containing both groups; see the sketch after this list
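
A minimal sketch of split_data() usage; the exact signature and the keys of the returned dictionary are assumptions rather than documented behaviour:

from corpus_patterns import split_data

texts = ["first sample line", "second sample line", "third sample line"]

# Assumed call: a list of strings split using the default 0.80 ratio.
# The result is expected to be a dictionary with two groups of texts,
# e.g. usable as train / validation splits.
groups = split_data(texts)
for key, items in groups.items():
    print(key, len(items))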

