
corpus-preprocess


Helps preprocess Philippine legal corpus.

[!IMPORTANT] Relies on the private corpus-assets folder being downloaded locally.

Custom tokenizer

import spacy

from preprocess import set_tokenizer

nlp = spacy.blank("en")
nlp.tokenizer = set_tokenizer(nlp)

The tokenizer:

  1. Removes dashes from infixes
  2. Adds prefix/suffix rules for parentheses/brackets
  3. Adds special exceptions to treat dotted text as a single token
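
A minimal sketch of how the resulting tokenization can be inspected; the sample sentence is an illustrative assumption, and the exact token boundaries depend on the rules shipped with the library:

import spacy

from preprocess import set_tokenizer

nlp = spacy.blank("en")
nlp.tokenizer = set_tokenizer(nlp)

# Inspect how dotted abbreviations and bracketed text are tokenized.
doc = nlp("Sec. 1 (a), Rule 65 in relation to Art. VIII, Sec. 5")
print([token.text for token in doc])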

SpanRuler from assets

Use in tandem with the custom tokenizer, and ensure only the longest spans are kept:

from pathlib import Path

import spacy
from spacy.language import Language
from spacy.util import filter_spans

from preprocess import set_patterns_from_assets, set_tokenizer

@spacy.registry.tokenizers("toktest")
def create_corpus_tokenizer():
    def create_tokenizer(nlp):
        return set_tokenizer(nlp)
    return create_tokenizer


@Language.component(name="filter_added_spans")
def filter_added_spans(doc):
    doc.spans["ruler"] = filter_spans(doc.spans["ruler"])
    return doc

nlp = spacy.blank("en", config={"nlp": {"tokenizer": {"@tokenizers": "toktest"}}})
ruler = nlp.add_pipe("span_ruler", config={"phrase_matcher_attr": "LOWER"}, validate=True) # defaults to 'ruler' key
folder = Path("corpus-assets")  # path to the local assets folder described below
patterns = set_patterns_from_assets(folder)
ruler.add_patterns(patterns)
nlp.add_pipe("filter_added_spans") # ensures only longest spans are included
nlp.to_disk("models/")  # will save entire directory which includes the pipeline
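
The saved pipeline can then be reloaded for inspection. A hedged sketch, assuming the registered "toktest" tokenizer and the "filter_added_spans" component above are importable in the loading process so spacy can resolve the saved config:

import spacy

nlp = spacy.load("models/")  # same directory as nlp.to_disk() above
doc = nlp("Sample sentence drawn from the corpus.")
for span in doc.spans["ruler"]:  # default spans key of the span_ruler
    print(span.text, span.label_, span.id_)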

Utils

  1. annotate_fragments() - given an nlp object and some *.txt files, creates a single annotation *.jsonl file
  2. extract_lines_from_txt_files() - accepts an iterator of *.txt files and yields each line (after sorting the lines and removing duplicates)
  3. split_data() - given a list of text strings, splits it into two groups and returns a dictionary containing both, based on the ratio provided (defaults to 0.80)
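
A hedged sketch of how these utilities might be chained; the call signatures and the keys of the returned dictionary are assumptions:

from pathlib import Path

from preprocess import extract_lines_from_txt_files, split_data

txt_files = Path("corpus-assets").glob("**/q.txt")
lines = list(extract_lines_from_txt_files(txt_files))  # sorted, unique lines
groups = split_data(lines)  # two groups based on the default 0.80 ratio
print({key: len(value) for key, value in groups.items()})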

Processes

Asset folder

This library presumes the existence of a local corpus-assets folder having the following structure:

- ents:
  - casenames.txt # each line is a clean case name
  - clean_statute_titles.txt # each line is a clean title
- concepts:
  - political: # main subject category
      - bill_of_rights: # sub-topic
          - patterns.json # contains matcher patterns
          - q.txt # contains lines which can be used to query the database
- metas:
  - artifacts:
    - axiom:
      - patterns.json # same
      - q.txt # same

Generate queries

The q.txt lines will be used as criteria to fetch relevant segments from the database.

The db file should have an "opinion_segments" table with full-text search (fts) enabled on the "text" column. /scripts/extract.py utilizes table.search().

See code

from pathlib import Path

from sqlite_utils import Database  # assumed: the q/where_args/columns keywords match sqlite-utils' table.search()

# create_fts_expr() and filter_unique_texts() are library helpers (not shown here).


def extract_txt_from_db(
    source_db_file: str,
    path: Path,
    max_segments: int,
    min_char_segment: int = 100,
    max_char_segment: int = 3000,
    is_unique_txt: bool = True,
):
    """An fts expression is auto-generated by `q.txt` files found in the `path`. This
    expression is used to generate strings of text that match the aggregated query."""
    db = Database(source_db_file)
    tbl = db["opinion_segments"]
    rows = tbl.search(  # type: ignore
        q=create_fts_expr(path),
        where="category='ruling' and char_count > :min_char and char_count < :max_char ",
        where_args={"min_char": min_char_segment, "max_char": max_char_segment},
        limit=max_segments,
        columns=["text", "id"],
    )
    if is_unique_txt:
        rows = filter_unique_texts(rows)
    return rows
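
A minimal sketch of the expected database layout using sqlite-utils (an assumption, as noted above); the sample row is illustrative:

from sqlite_utils import Database

db = Database("decisions.db")  # hypothetical source_db_file
tbl = db["opinion_segments"]
tbl.insert(
    {"id": 1, "text": "Sample ruling text.", "category": "ruling", "char_count": 120},
    pk="id",
)
tbl.enable_fts(["text"])  # full-text search on the "text" column, as required above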

Create matcher patterns

A SpanRuler component is built from patterns.json (with the q.txt lines used as phrase patterns). These patterns are aggregated via set_patterns_from_assets(). See code:

def set_patterns_from_assets(path: Path):
    axioms = axiom.collect_patterns(path.joinpath("meta"))
    concepts = create_concept_patterns(path.joinpath("concepts"))
    ents = extract_ents(path.joinpath("ents"))
    return axioms + concepts + ents
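
Each aggregated entry follows spaCy's SpanRuler pattern format. A hypothetical example, with an id following the category/sub-topic convention of the asset folder (the actual labels and ids come from the corpus-assets files):

{
    "label": "concept",
    "id": "political/bill_of_rights",
    "pattern": [{"LOWER": "bill"}, {"LOWER": "of"}, {"LOWER": "rights"}],
}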

Categorize queried segments via patterns found

A TextCategorizer component can be trained using the results of the span ruler; see the sample code:

from collections import Counter

from spacy.language import Language
from spacy.tokens import Doc

# create_patterns() is a library helper that loads pattern entries (not shown here).


@Language.factory(name="add_cats_from_spans")
class AddTextCatComponent:
    def __init__(self, nlp: Language, name: str, path: str):
        self.nlp = nlp
        options = list({p["id"].split("/")[0] for p in create_patterns(path)})  # type: ignore
        if len(options) == 1:
            options.append(f"not_{options[0]}")
        self.options = options

    def __call__(self, doc) -> Doc:
        default = {op: 0.0 for op in self.options}
        cats = [self.nlp.vocab.strings[s.id].split("/")[0] for s in doc.spans["sc"]]
        doc.cats = default | {k: 1.0 for k, _ in Counter(cats).items()}
        return doc

[!NOTE] If textcat is in the pipeline and only one label is found, training will error out, hence the need to add a not_<label> option. If textcat_multilabel is used, a single category is fine.
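
A hedged usage sketch, continuing from the pipeline assembled above; the path value and the sample text are assumptions, and a span ruler writing to the "sc" key read by the component is expected to have run first:

nlp.add_pipe(
    "add_cats_from_spans",
    config={"path": "corpus-assets/concepts/political"},  # hypothetical path
)
doc = nlp("Sample segment discussing the bill of rights.")
print(doc.cats)  # e.g. {"political": 1.0, "not_political": 0.0}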
