
corpus-preprocess


Utility functions to preprocess Philippine legalese in weasel-based flows:

  1. lexcat-proj; and
  2. lexcat-multi

[!IMPORTANT] Requires the private corpus-assets folder and the sqlite3 database in citelaws-data to be cloned locally; a path setup sketch follows the folder tree below.

- corpus-assets: # folder structure
  - concept: # two-level nested folders, each leaf holding patterns.json + q.txt
  - artifact: # single-level folders, each holding patterns.json + q.txt
  - text: # each file is a .txt
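
The later snippets assume asset_dir and db_file point at these clones. A minimal setup sketch; the paths and the database filename are assumptions that depend on your local layout:

from pathlib import Path

# assumed sibling checkouts of the private repos; adjust to your layout
asset_dir = Path("../corpus-assets")  # holds concept/, artifact/, text/
db_file = Path("../citelaws-data/db.sqlite")  # sqlite3 database; filename is illustrative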

Language customization

Assuming familiarity with spaCy:

# replace the default tokenizer with one that handles legalese-specific tokens
nlp.tokenizer = customize_tokenizer(nlp, special_token_rules)
ruler = nlp.add_pipe(
    "span_ruler",
    config={
        "spans_key": "ruler",  # matches are stored in doc.spans["ruler"]
        "phrase_matcher_attr": "LOWER",  # case-insensitive phrase matching
        "spans_filter": {"@misc": "spacy.first_longest_spans_filter.v1"},  # keep only the longest spans
    },
)
ruler.add_patterns(patterns)  # patterns built from this library and corpus-assets

[!NOTE] Loading a model with ~130k pattern lines takes ~2 minutes.
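
Once the pipeline is built, matches land under the configured spans key. A quick inspection sketch (the sample sentence is illustrative):

doc = nlp("No person shall be deprived of life, liberty, or property without due process of law.")
for span in doc.spans["ruler"]:  # the spans_key configured above
    print(span.text, span.label_, span.id_)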

Training data

Concept spans

from spacy.tokens import DocBin

for folder in get_concepts(asset_dir.joinpath("concept")):
    bn = DocBin()
    # each line of the folder's q.txt is used as a query against the db;
    # max_segments caps the number of segments fetched per q.txt
    docs = apply_concept_q_filter(nlp, db_file, filter_path=folder, max_segments=500)
    for doc in docs:
        bn.add(doc)
    bn.to_disk(asset_dir.joinpath(f"train/{folder.stem}.spacy"))
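
To sanity-check a saved file, the DocBin can be read back with the same vocab. A sketch; the filename is illustrative:

from spacy.tokens import DocBin

bn = DocBin().from_disk(asset_dir.joinpath("train/bill_of_rights.spacy"))
docs = list(bn.get_docs(nlp.vocab))
print(len(docs), docs[0].spans)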

Each concept directory contains subtopics:

- corpus-assets: # folder structure
  - concept: # must be two-level nested
    - political: # main subject category
      - bill_of_rights: # sub-topic
        - patterns.json # matcher patterns for the span_ruler
        - q.txt # each line can be used to query the database
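
For illustration, an entry handed to ruler.add_patterns() might carry this two-level path in its id; the label and token pattern below are hypothetical:

patterns = [
    {
        "label": "concept",  # hypothetical label
        "id": "political/bill_of_rights",  # <main subject category>/<sub-topic>
        "pattern": [{"LOWER": "bill"}, {"LOWER": "of"}, {"LOWER": "rights"}],
    }
]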

Because of this structure, it's possible to train a textcat_multilabel component:

from spacy.language import Language
from spacy.tokens import Doc

# main subject categories, e.g. "political" from an id like "political/bill_of_rights"
textcat_options = [concept["id"].split("/")[0] for concept in concept_patterns]

@Language.factory(name="add_cats_from_spans")
class AddTextCatComponent:
    def __init__(self, nlp: Language, name: str, options: list[str]):
        self.nlp = nlp
        self.options = options

    def __call__(self, doc: Doc) -> Doc:
        doc.cats = {op: 0.0 for op in self.options}
        for span in doc.spans["sc"]:
            if span.id:  # some spans won't have an id
                value = self.nlp.vocab.strings[span.id]  # hash back to string, e.g. political/bill_of_rights
                if "/" in value:
                    main_topic = value.split("/")[0]  # just "political"
                    if main_topic in self.options:
                        doc.cats[main_topic] = 1.0
        return doc
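
With the factory registered, the component can be appended to a pipeline, passing options through the config. A sketch:

# expects docs that already carry spans under doc.spans["sc"]
nlp.add_pipe("add_cats_from_spans", last=True, config={"options": textcat_options})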

Non-concept spans

Although patterns from set_patterns() are included in the constructed nlp object, apply_label_filter() ensures that a set number of rows (filter_count) is fetched from the database, each containing spans labeled title, serial, etc.

for label in {"unit", "ref", "serial", "title", "axiom", "date", "juridical"}:
    bn = DocBin()
    # fetch up to filter_count segments from the db whose spans carry this label
    docs = apply_label_filter(nlp, db_file, filter_labels={label}, filter_count=1500)
    for doc in docs:
        bn.add(doc)
    bn.to_disk(asset_dir.joinpath(f"train/{label}.spacy"))
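
Assuming the per-concept and per-label files end up in the same train/ folder, they can be merged into one corpus. A sketch using DocBin.merge; the output filename is illustrative:

from spacy.tokens import DocBin

merged = DocBin()
for f in sorted(asset_dir.joinpath("train").glob("*.spacy")):
    merged.merge(DocBin().from_disk(f))
merged.to_disk(asset_dir.joinpath("all_train.spacy"))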
