# corpus-preprocess
Utility functions to preprocess Phil. legalese in weasel-based flows:
- lexcat-proj; and
- lexcat-multi
> [!IMPORTANT]
> Relies on a private `corpus-assets` folder being cloned locally.
```yml
- corpus-assets: # folder should have the following structure:
  - data: # used as data folder in tokenization
    - single_tokens.json
    - report_publishers.json
  - ents: # collected in `setup_span_ruler.py`
    - casenames.txt # each line is a clean case
    - clean_statute_titles.txt # each line is a clean title
  - concepts: # collected in `setup_span_ruler.py`
    - political: # main subject category
      - bill_of_rights: # sub-topic
        - patterns.json # contains matcher files
        - q.txt # contains lines which can be used to query the database
  - metas: # collected in `setup_span_ruler.py`
    - artifacts:
      - axiom:
        - patterns.json # same
        - q.txt # same
```
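Given that layout, a concept pattern's id can be derived from its folder location. A minimal, hypothetical sketch of the idea (`collect_concept_patterns` is an invented name; the package's actual `create_concept_patterns()` may differ):

```python
import json
from pathlib import Path


def collect_concept_patterns(concepts_dir: Path) -> list[dict]:
    """Walk concepts/<topic>/<subtopic>/patterns.json and tag each pattern
    with an id derived from its location, e.g. 'political/bill_of_rights'."""
    patterns = []
    for pattern_file in sorted(concepts_dir.glob("*/*/patterns.json")):
        span_id = pattern_file.parent.relative_to(concepts_dir).as_posix()
        for pattern in json.loads(pattern_file.read_text()):
            pattern["id"] = span_id
            patterns.append(pattern)
    return patterns
```

The id doubles as the span's category path later on, which is what makes the `textcat_multilabel` mapping below possible.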
## Custom tokenizer / span ruler
```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc
from spacy.util import filter_spans

from .setup_span_ruler import set_patterns_from_assets
from .setup_tokenizer import customize_tokenizer
from .tokens_single import import_data_tokens
from .utils import validated_path


# limit the number of spans returned; "ruler" is the default spans key
@Language.component(name="filter_added_spans")
def filter_added_spans(doc: Doc) -> Doc:
    doc.spans["ruler"] = filter_spans(doc.spans["ruler"])
    return doc


# initialize the model and get special rules for tokenization,
# here: tokens_dir = /corpus_assets/data
rules_file = validated_path(tokens_dir)
special_rules = import_data_tokens(data_path=rules_file)
nlp = spacy.load("en_core_web_sm", exclude=("ner", "senter"))
nlp.tokenizer = customize_tokenizer(nlp, special_rules)

# prepare patterns for the span ruler, here: assets_dir = /corpus_assets
span_patterns = set_patterns_from_assets(path=validated_path(assets_dir))
ruler = nlp.add_pipe("span_ruler", config={"phrase_matcher_attr": "LOWER"})
ruler.add_patterns(span_patterns)
nlp.add_pipe("filter_added_spans")

nlp.to_disk("models/")  # saves the entire directory, which includes the pipeline
```
> [!NOTE]
> Loading the model can take a while if more patterns are included via `set_patterns_from_assets()`, e.g. 130k pattern files take about 90 seconds.
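For reference, a spaCy `SpanRuler` accepts both phrase patterns (plain strings) and token patterns, optionally carrying an `id`; the entries returned by `set_patterns_from_assets()` would be dicts of roughly this shape (the labels, phrases, and ids below are illustrative, not taken from the actual assets):

```python
# phrase pattern: a plain string, matched case-insensitively here because
# the ruler above was configured with phrase_matcher_attr="LOWER"
phrase_pattern = {
    "label": "concept",
    "pattern": "due process",
    "id": "political/bill_of_rights",
}

# token pattern: a list of per-token attribute dicts
token_pattern = {
    "label": "case_name",
    "pattern": [{"LOWER": "people"}, {"LOWER": "v."}, {"IS_TITLE": True}],
}

span_patterns = [phrase_pattern, token_pattern]
```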
## Processes

### Generate queries
The `q.txt` lines will be used as criteria to fetch relevant segments from the database. The db file should have an `opinion_segments` table with fts enabled on the `text` column. `/scripts/extract.py` utilizes `table.search()`. See code:
```python
from pathlib import Path

from sqlite_utils import Database

# create_fts_expr() and filter_unique_texts() are helpers defined
# elsewhere in this package


def extract_txt_from_db(
    source_db_file: str,
    path: Path,
    max_segments: int,
    min_char_segment: int = 100,
    max_char_segment: int = 3000,
    is_unique_txt: bool = True,
):
    """An fts expression is auto-generated by `q.txt` files found in the `path`. This
    expression is used to generate strings of text that match the aggregated query."""
    db = Database(source_db_file)
    tbl = db["opinion_segments"]
    rows = tbl.search(  # type: ignore
        q=create_fts_expr(path),  # an sqlite fts5 expression is made via q.txt files
        where="category='ruling' and char_count > :min_char and char_count < :max_char",
        where_args={"min_char": min_char_segment, "max_char": max_char_segment},
        limit=max_segments,
        columns=["text", "id"],
    )
    if is_unique_txt:
        rows = filter_unique_texts(rows)
    return rows
```
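`create_fts_expr()` is internal to this package, but the idea can be sketched: aggregate every non-empty line of the `q.txt` files under `path` into one sqlite fts5 expression. A hypothetical stand-in (`build_fts_expr` is an invented name; the real helper may quote or combine differently):

```python
from pathlib import Path


def build_fts_expr(path: Path) -> str:
    """OR together every non-empty line found in q.txt files under `path`,
    double-quoting each line so fts5 treats it as a phrase."""
    phrases = []
    for q_file in sorted(path.rglob("q.txt")):
        for line in q_file.read_text().splitlines():
            line = line.strip()
            if line:
                phrases.append(f'"{line}"')
    return " OR ".join(phrases)
```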
### Create matcher patterns
A `SpanRuler` component will be based on `patterns.json` (with `q.txt` lines as phrases). These patterns are aggregated via `set_patterns_from_assets()` but can be used individually. See code:
```python
def set_patterns_from_assets(path: Path):
    axioms = axiom.collect_patterns(path.joinpath("meta"))
    concepts = create_concept_patterns(path.joinpath("concepts"))
    ents = extract_ents(path.joinpath("ents"))
    return axioms + concepts + ents
```
### Enabling textcat_multilabel
The concept patterns from `create_concept_patterns()` can be mapped to their ids, which are their locations in `corpus-assets`. This makes it possible to create a `textcat_multilabel` component using the `span.id`, e.g.:
```python
from spacy.language import Language
from spacy.tokens import Doc

textcat_options = [concept["id"].split("/")[0] for concept in concept_patterns]


@Language.factory(name="add_cats_from_spans")
class AddTextCatComponent:
    def __init__(self, nlp: Language, name: str, options: list[str]):
        self.nlp = nlp
        self.options = options

    def __call__(self, doc: Doc) -> Doc:
        doc.cats = {op: 0.0 for op in self.options}
        for span in doc.spans["sc"]:
            if span.id:  # some spans won't have an id
                value = self.nlp.vocab.strings[span.id]
                if "/" in value:  # e.g. political/bill_of_rights
                    main_topic = value.split("/")[0]  # just political
                    if main_topic in self.options:
                        if doc.cats[main_topic] == 0.0:
                            doc.cats[main_topic] = 1.0
        return doc
```
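The category mapping in `__call__` can be exercised without a pipeline, since the core logic is just splitting the id on `/` and switching on matching options. A stdlib-only restatement of that logic:

```python
def cats_from_span_ids(span_ids: list[str], options: list[str]) -> dict[str, float]:
    """Mirror AddTextCatComponent.__call__ on plain strings: an id like
    'political/bill_of_rights' switches on the 'political' category."""
    cats = {option: 0.0 for option in options}
    for value in span_ids:
        if "/" in value:
            main_topic = value.split("/")[0]
            if main_topic in options:
                cats[main_topic] = 1.0
    return cats
```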
## Hashes for corpus_preprocess-0.0.4-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | ac3380a72899f43057e03dd204a5f88663e4f7515968fbdba992b71dd6657940 |
| MD5 | cf60495e404755d128d36d807f46d79a |
| BLAKE2b-256 | 37cd3d7c0a1b100c01f29e28a58c4960babbec638bad6136c7762417910680d1 |