# corpus-preprocess
Utility functions to preprocess Phil. legalese in weasel-based flows:
- lexcat-proj; and
- lexcat-multi
> [!IMPORTANT]
> Relies on a private `corpus-assets` folder being cloned locally.
```yml
- corpus-assets: # folder should have the following structure:
  - data: # used as data folder in tokenization
    - single_tokens.json
    - report_publishers.json
  - ents: # collected in `setup_span_ruler.py`
    - casenames.txt # each line is a clean case
    - clean_statute_titles.txt # each line is a clean title
  - concepts: # collected in `setup_span_ruler.py`
    - political: # main subject category
      - bill_of_rights: # sub-topic
        - patterns.json # contains matcher files
        - q.txt # contains lines which can be used to query the database
  - metas: # collected in `setup_span_ruler.py`
    - artifacts:
      - axiom:
        - patterns.json # same
        - q.txt # same
```
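Given that layout, a concept pattern's id can be derived from its folder location. A minimal, hypothetical sketch of the idea (`collect_concept_patterns` is an invented name; the package's actual `create_concept_patterns()` may differ):

```python
import json
from pathlib import Path


def collect_concept_patterns(concepts_dir: Path) -> list[dict]:
    """Walk concepts/<topic>/<subtopic>/patterns.json and tag each pattern
    with an id derived from its location, e.g. 'political/bill_of_rights'."""
    patterns = []
    for pattern_file in sorted(concepts_dir.glob("*/*/patterns.json")):
        span_id = pattern_file.parent.relative_to(concepts_dir).as_posix()
        for pattern in json.loads(pattern_file.read_text()):
            pattern["id"] = span_id
            patterns.append(pattern)
    return patterns
```

The id doubles as the span's category path later on, which is what makes the `textcat_multilabel` mapping below possible.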
## Custom tokenizer / span ruler
```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc
from spacy.util import filter_spans

from .setup_span_ruler import set_patterns_from_assets
from .setup_tokenizer import customize_tokenizer
from .tokens_single import import_data_tokens
from .utils import validated_path


# limit the number of spans returned; "ruler" is the default spans key
@Language.component(name="filter_added_spans")
def filter_added_spans(doc: Doc) -> Doc:
    doc.spans["ruler"] = filter_spans(doc.spans["ruler"])
    return doc


# initialize the model and get special rules for tokenization,
# here: tokens_dir = /corpus_assets/data
rules_file = validated_path(tokens_dir)
special_rules = import_data_tokens(data_path=rules_file)
nlp = spacy.load("en_core_web_sm", exclude=("ner", "senter"))
nlp.tokenizer = customize_tokenizer(nlp, special_rules)

# prepare patterns for the span ruler, here: assets_dir = /corpus_assets
span_patterns = set_patterns_from_assets(path=validated_path(assets_dir))
ruler = nlp.add_pipe("span_ruler", config={"phrase_matcher_attr": "LOWER"})
ruler.add_patterns(span_patterns)
nlp.add_pipe("filter_added_spans")

nlp.to_disk("models/")  # saves the entire directory, which includes the pipeline
```
> [!NOTE]
> Loading the model can take a while if more patterns are included via `set_patterns_from_assets()`, e.g. 130k pattern files take about 90 seconds.
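For reference, a spaCy `SpanRuler` accepts both phrase patterns (plain strings) and token patterns, optionally carrying an `id`; the entries returned by `set_patterns_from_assets()` would be dicts of roughly this shape (the labels, phrases, and ids below are illustrative, not taken from the actual assets):

```python
# phrase pattern: a plain string, matched case-insensitively here because
# the ruler above was configured with phrase_matcher_attr="LOWER"
phrase_pattern = {
    "label": "concept",
    "pattern": "due process",
    "id": "political/bill_of_rights",
}

# token pattern: a list of per-token attribute dicts
token_pattern = {
    "label": "case_name",
    "pattern": [{"LOWER": "people"}, {"LOWER": "v."}, {"IS_TITLE": True}],
}

span_patterns = [phrase_pattern, token_pattern]
```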
## Processes

### Generate queries
The `q.txt` lines will be used as criteria to fetch relevant segments from the database. The db file should have an `opinion_segments` table with fts enabled on the `text` column. `/scripts/extract.py` utilizes `table.search()`. See code:
```python
from pathlib import Path

from sqlite_utils import Database

# create_fts_expr() and filter_unique_texts() are helpers defined
# elsewhere in this package


def extract_txt_from_db(
    source_db_file: str,
    path: Path,
    max_segments: int,
    min_char_segment: int = 100,
    max_char_segment: int = 3000,
    is_unique_txt: bool = True,
):
    """An fts expression is auto-generated by `q.txt` files found in the `path`. This
    expression is used to generate strings of text that match the aggregated query."""
    db = Database(source_db_file)
    tbl = db["opinion_segments"]
    rows = tbl.search(  # type: ignore
        q=create_fts_expr(path),  # an sqlite fts5 expression is made via q.txt files
        where="category='ruling' and char_count > :min_char and char_count < :max_char",
        where_args={"min_char": min_char_segment, "max_char": max_char_segment},
        limit=max_segments,
        columns=["text", "id"],
    )
    if is_unique_txt:
        rows = filter_unique_texts(rows)
    return rows
```
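`create_fts_expr()` is internal to this package, but the idea can be sketched: aggregate every non-empty line of the `q.txt` files under `path` into one sqlite fts5 expression. A hypothetical stand-in (`build_fts_expr` is an invented name; the real helper may quote or combine differently):

```python
from pathlib import Path


def build_fts_expr(path: Path) -> str:
    """OR together every non-empty line found in q.txt files under `path`,
    double-quoting each line so fts5 treats it as a phrase."""
    phrases = []
    for q_file in sorted(path.rglob("q.txt")):
        for line in q_file.read_text().splitlines():
            line = line.strip()
            if line:
                phrases.append(f'"{line}"')
    return " OR ".join(phrases)
```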
### Create matcher patterns
A `SpanRuler` component will be based on `patterns.json` (with `q.txt` lines as phrases). These patterns are aggregated via `set_patterns_from_assets()` but can be used individually. See code:
```python
def set_patterns_from_assets(path: Path):
    axioms = axiom.collect_patterns(path.joinpath("meta"))
    concepts = create_concept_patterns(path.joinpath("concepts"))
    ents = extract_ents(path.joinpath("ents"))
    return axioms + concepts + ents
```
### Enabling textcat_multilabel
The concept patterns from `create_concept_patterns()` can be mapped to their ids, which are their locations in `corpus-assets`. This makes it possible to create a `textcat_multilabel` component using the `span.id`, e.g.:
```python
from spacy.language import Language
from spacy.tokens import Doc

textcat_options = [concept["id"].split("/")[0] for concept in concept_patterns]


@Language.factory(name="add_cats_from_spans")
class AddTextCatComponent:
    def __init__(self, nlp: Language, name: str, options: list[str]):
        self.nlp = nlp
        self.options = options

    def __call__(self, doc: Doc) -> Doc:
        doc.cats = {op: 0.0 for op in self.options}
        for span in doc.spans["sc"]:
            if span.id:  # some spans won't have an id
                value = self.nlp.vocab.strings[span.id]
                if "/" in value:  # e.g. political/bill_of_rights
                    main_topic = value.split("/")[0]  # just political
                    if main_topic in self.options:
                        if doc.cats[main_topic] == 0.0:
                            doc.cats[main_topic] = 1.0
        return doc
```
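The category mapping in `__call__` can be exercised without a pipeline, since the core logic is just splitting the id on `/` and switching on matching options. A stdlib-only restatement of that logic:

```python
def cats_from_span_ids(span_ids: list[str], options: list[str]) -> dict[str, float]:
    """Mirror AddTextCatComponent.__call__ on plain strings: an id like
    'political/bill_of_rights' switches on the 'political' category."""
    cats = {option: 0.0 for option in options}
    for value in span_ids:
        if "/" in value:
            main_topic = value.split("/")[0]
            if main_topic in options:
                cats[main_topic] = 1.0
    return cats
```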
## Hashes for corpus_preprocess-0.0.4-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | ac3380a72899f43057e03dd204a5f88663e4f7515968fbdba992b71dd6657940 |
| MD5 | cf60495e404755d128d36d807f46d79a |
| BLAKE2b-256 | 37cd3d7c0a1b100c01f29e28a58c4960babbec638bad6136c7762417910680d1 |