
corpus-preprocess


Utility functions to preprocess Philippine legalese in weasel-based flows:

  1. lexcat-proj; and
  2. lexcat-multi

[!IMPORTANT] Requires the private corpus-assets folder and the sqlite3 database in citelaws-data to be cloned locally; a path setup sketch follows the folder tree below.

- corpus-assets: # folder structure
  - concept: # two-level nested folders, each leaf holding patterns.json + q.txt
  - artifact: # single-level folders, each holding patterns.json + q.txt
  - text: # each file is a .txt
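
The later snippets assume asset_dir and db_file point at these clones. A minimal setup sketch; the paths and the database filename are assumptions that depend on your local layout:

from pathlib import Path

# assumed sibling checkouts of the private repos; adjust to your layout
asset_dir = Path("../corpus-assets")  # holds concept/, artifact/, text/
db_file = Path("../citelaws-data/db.sqlite")  # sqlite3 database; filename is illustrative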

Language customization

Assuming familiarity with spaCy:

# replace the default tokenizer with one that handles legalese-specific tokens
nlp.tokenizer = customize_tokenizer(nlp, special_token_rules)
ruler = nlp.add_pipe(
    "span_ruler",
    config={
        "spans_key": "ruler",  # matches are stored in doc.spans["ruler"]
        "phrase_matcher_attr": "LOWER",  # case-insensitive phrase matching
        "spans_filter": {"@misc": "spacy.first_longest_spans_filter.v1"},  # keep only the longest spans
    },
)
ruler.add_patterns(patterns)  # patterns built from this library and corpus-assets

[!NOTE] Loading a model with ~130k pattern lines takes ~2 minutes.
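
Once the pipeline is built, matches land under the configured spans key. A quick inspection sketch (the sample sentence is illustrative):

doc = nlp("No person shall be deprived of life, liberty, or property without due process of law.")
for span in doc.spans["ruler"]:  # the spans_key configured above
    print(span.text, span.label_, span.id_)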

Training data

Concept spans

from spacy.tokens import DocBin

for folder in get_concepts(asset_dir.joinpath("concept")):
    bn = DocBin()
    # each line of the folder's q.txt is used as a query against the db;
    # max_segments caps the number of segments fetched per q.txt
    docs = apply_concept_q_filter(nlp, db_file, filter_path=folder, max_segments=500)
    for doc in docs:
        bn.add(doc)
    bn.to_disk(asset_dir.joinpath(f"train/{folder.stem}.spacy"))
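
To sanity-check a saved file, the DocBin can be read back with the same vocab. A sketch; the filename is illustrative:

from spacy.tokens import DocBin

bn = DocBin().from_disk(asset_dir.joinpath("train/bill_of_rights.spacy"))
docs = list(bn.get_docs(nlp.vocab))
print(len(docs), docs[0].spans)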

Each concept directory contains subtopics:

- corpus-assets: # folder structure
  - concept: # must be two-level nested
    - political: # main subject category
      - bill_of_rights: # sub-topic
        - patterns.json # matcher patterns for the span_ruler
        - q.txt # each line can be used to query the database
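
For illustration, an entry handed to ruler.add_patterns() might carry this two-level path in its id; the label and token pattern below are hypothetical:

patterns = [
    {
        "label": "concept",  # hypothetical label
        "id": "political/bill_of_rights",  # <main subject category>/<sub-topic>
        "pattern": [{"LOWER": "bill"}, {"LOWER": "of"}, {"LOWER": "rights"}],
    }
]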

Because of this structure, it's possible to train a textcat_multilabel component:

from spacy.language import Language
from spacy.tokens import Doc

# main subject categories, e.g. "political" from an id like "political/bill_of_rights"
textcat_options = [concept["id"].split("/")[0] for concept in concept_patterns]

@Language.factory(name="add_cats_from_spans")
class AddTextCatComponent:
    def __init__(self, nlp: Language, name: str, options: list[str]):
        self.nlp = nlp
        self.options = options

    def __call__(self, doc: Doc) -> Doc:
        doc.cats = {op: 0.0 for op in self.options}
        for span in doc.spans["sc"]:
            if span.id:  # some spans won't have an id
                value = self.nlp.vocab.strings[span.id]  # hash back to string, e.g. political/bill_of_rights
                if "/" in value:
                    main_topic = value.split("/")[0]  # just "political"
                    if main_topic in self.options:
                        doc.cats[main_topic] = 1.0
        return doc
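
With the factory registered, the component can be appended to a pipeline, passing options through the config. A sketch:

# expects docs that already carry spans under doc.spans["sc"]
nlp.add_pipe("add_cats_from_spans", last=True, config={"options": textcat_options})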

Non-concept spans

Although patterns from set_patterns() are included in the constructed nlp object, apply_label_filter() ensures that a set number of rows (filter_count) is fetched from the database, each containing spans labeled title, serial, etc.

for label in {"unit", "ref", "serial", "title", "axiom", "date", "juridical"}:
    bn = DocBin()
    # fetch up to filter_count segments from the db whose spans carry this label
    docs = apply_label_filter(nlp, db_file, filter_labels={label}, filter_count=1500)
    for doc in docs:
        bn.add(doc)
    bn.to_disk(asset_dir.joinpath(f"train/{label}.spacy"))
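
Assuming the per-concept and per-label files end up in the same train/ folder, they can be merged into one corpus. A sketch using DocBin.merge; the output filename is illustrative:

from spacy.tokens import DocBin

merged = DocBin()
for f in sorted(asset_dir.joinpath("train").glob("*.spacy")):
    merged.merge(DocBin().from_disk(f))
merged.to_disk(asset_dir.joinpath("all_train.spacy"))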
