
Python package to convert spaCy and Stanza documents to NLP Annotation Format (NAF)


nafigator


DISCLAIMER - BETA PHASE

This package is currently in a beta phase.

to nafigate [ naf-i-geyt ]

v.intr, nafigated, nafigating

  1. To process one or more text documents through an NLP pipeline and output the results in the NLP Annotation Format.

Features

The Nafigator package allows you to store (intermediate) results and processing steps from custom-made spaCy and stanza pipelines in one format.

  • Convert text files to .naf-files that satisfy the NLP Annotation Format (NAF)

    • Supported input media types: application/pdf (.pdf), text/plain (.txt), text/html (.html)

    • Supported output format: .naf (xml)

    • Supported NLP processors: spaCy, stanza

    • Supported NAF layers: raw, text, terms, entities, deps, multiwords

  • Read .naf documents and access data as Python lists and dicts

In addition to the standard NAF layers, a ‘formats’ layer is added with text format data (font and size) to enable text classification such as header detection.

When reading .naf-files, Nafigator stores data in memory as lxml ElementTrees. The lxml package provides a Pythonic binding for the libxml2 and libxslt C libraries, so it is very fast.

The NAF format

Key features:

  • Multilayered extensible annotations;

  • Reproducible NLP pipelines;

  • NLP processor agnostic;

  • Compatible with RDF


Installation

To install the package:

pip install nafigator

To install the package from GitHub:

pip install -e git+https://github.com/wjwillemse/nafigator.git#egg=nafigator

How to run

Command line interface

To parse a pdf or a txt file, run in the root of the project:

python -m nafigator.parse

Function calls

Example:

from nafigator.parse import generate_naf

doc = generate_naf(input="../data/example.pdf",
                   engine="stanza",
                   language="en",
                   naf_version="v3.1",
                   dtd_validation=False,
                   params={'fileDesc': {'author': 'anonymous'}},
                   nlp=None)

  • input: text document to convert to naf document

  • engine: pipeline processor, i.e. ‘spacy’ or ‘stanza’

  • language: ‘en’ or ‘nl’

  • naf_version: ‘v3’ or ‘v3.1’

  • dtd_validation: True or False (default = False)

  • params: dictionary with parameters (default = {})

  • nlp: custom made pipeline object from spacy or stanza (default = None)

Get the document and processor metadata via:

doc.header

Output of doc.header of processed data/example.pdf:

{
  'fileDesc': {
    'author': 'anonymous',
    'creationtime': '2021-04-25T11:28:58UTC',
    'filename': 'data/example.pdf',
    'filetype': 'application/pdf',
    'pages': '2'},
  'public': {
    '{http://purl.org/dc/elements/1.1/}uri': 'data/example.pdf',
    '{http://purl.org/dc/elements/1.1/}format': 'application/pdf'},
...
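The header is a plain nested dict, so metadata can be read with ordinary dict access. A minimal sketch using the sample output above; note that NAF stores values as strings, and that the Dublin Core keys use lxml's Clark notation (`{namespace}tag`), so the `DC` prefix constant below is just a local convenience:

```python
# Sample of doc.header as shown above (truncated to the fields used here)
header = {
    'fileDesc': {
        'author': 'anonymous',
        'creationtime': '2021-04-25T11:28:58UTC',
        'filename': 'data/example.pdf',
        'filetype': 'application/pdf',
        'pages': '2'},
    'public': {
        '{http://purl.org/dc/elements/1.1/}uri': 'data/example.pdf',
        '{http://purl.org/dc/elements/1.1/}format': 'application/pdf'},
}

# Local helper for the Dublin Core namespace prefix (Clark notation)
DC = "{http://purl.org/dc/elements/1.1/}"

# Attribute values are strings, so cast numeric fields yourself
pages = int(header['fileDesc']['pages'])
uri = header['public'][DC + 'uri']

print(pages, uri)  # 2 data/example.pdf
```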

Get the raw layer output via:

doc.raw

Output of doc.raw of processed data/example.pdf:

The Nafigator package allows you to store NLP output from custom made spaCy and stanza  pipelines with (intermediate) results and all processing steps in one format.  Multiwords like in 'we have set that out below' are recognized (depending on your NLP  processor).

Get the text layer output via:

doc.text

Output of doc.text of processed data/example.pdf:

[
  {'text': 'The', 'page': '1', 'sent': '1', 'id': 'w1', 'length': '3', 'offset': '0'},
  {'text': 'Nafigator', 'page': '1', 'sent': '1', 'id': 'w2', 'length': '9', 'offset': '4'},
  {'text': 'package', 'page': '1', 'sent': '1', 'id': 'w3', 'length': '7', 'offset': '14'},
  {'text': 'allows', 'page': '1', 'sent': '1', 'id': 'w4', 'length': '6', 'offset': '22'},
...
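The `offset` and `length` attributes of each word index into the raw layer, so any token can be checked against (or recovered from) `doc.raw`. A small self-contained sketch using the sample values above:

```python
# Opening words of doc.raw from the example above
raw = "The Nafigator package allows you to store NLP output"

# First entries of doc.text from the example above
words = [
    {'text': 'The', 'id': 'w1', 'length': '3', 'offset': '0'},
    {'text': 'Nafigator', 'id': 'w2', 'length': '9', 'offset': '4'},
    {'text': 'package', 'id': 'w3', 'length': '7', 'offset': '14'},
    {'text': 'allows', 'id': 'w4', 'length': '6', 'offset': '22'},
]

# Each word's offset/length slice of the raw layer is the word form itself
for w in words:
    start, length = int(w['offset']), int(w['length'])
    assert raw[start:start + length] == w['text']
```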

Get the terms layer output via:

doc.terms

Output of doc.terms of processed data/example.pdf:

[
  {'id': 't1', 'lemma': 'the', 'pos': 'DET', 'type': 'open', 'morphofeat': 'Definite=Def|PronType=Art', 'targets': [{'id': 'w1'}]},
  {'id': 't2', 'lemma': 'Nafigator', 'pos': 'PROPN', 'type': 'open', 'morphofeat': 'Number=Sing', 'targets': [{'id': 'w2'}]},
  {'id': 't3', 'lemma': 'package', 'pos': 'NOUN', 'type': 'open', 'morphofeat': 'Number=Sing', 'targets': [{'id': 'w3'}]},
  {'id': 't4', 'lemma': 'allow', 'pos': 'VERB', 'type': 'open', 'morphofeat': 'Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin',
...
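Each term points back to the words it covers via its `targets` list, so the two layers can be joined on word ids. A minimal sketch using (truncated) samples of the output above:

```python
# Truncated samples of doc.text and doc.terms from the example above
words = [
    {'text': 'The', 'id': 'w1'},
    {'text': 'Nafigator', 'id': 'w2'},
    {'text': 'package', 'id': 'w3'},
]
terms = [
    {'id': 't1', 'lemma': 'the', 'pos': 'DET', 'targets': [{'id': 'w1'}]},
    {'id': 't2', 'lemma': 'Nafigator', 'pos': 'PROPN', 'targets': [{'id': 'w2'}]},
    {'id': 't3', 'lemma': 'package', 'pos': 'NOUN', 'targets': [{'id': 'w3'}]},
]

word_by_id = {w['id']: w['text'] for w in words}

# Resolve each term's span back to its surface form(s)
surface = {
    t['id']: " ".join(word_by_id[tgt['id']] for tgt in t['targets'])
    for t in terms
}

print(surface)  # {'t1': 'The', 't2': 'Nafigator', 't3': 'package'}
```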

Get the entities layer output via:

doc.entities

Output of doc.entities of processed data/example.pdf:

[
  {'id': 'e1', 'type': 'PRODUCT', 'text': 'Nafigator', 'targets': [{'id': 't2'}]},
  {'id': 'e2', 'type': 'CARDINAL', 'text': 'one', 'targets': [{'id': 't28'}]}
]
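Because the layer is a plain list of dicts, entity mentions can be filtered or grouped with standard Python. A small sketch grouping the sample output above by entity type:

```python
from collections import defaultdict

# doc.entities from the example above
entities = [
    {'id': 'e1', 'type': 'PRODUCT', 'text': 'Nafigator', 'targets': [{'id': 't2'}]},
    {'id': 'e2', 'type': 'CARDINAL', 'text': 'one', 'targets': [{'id': 't28'}]},
]

# Group entity mentions by their type
by_type = defaultdict(list)
for e in entities:
    by_type[e['type']].append(e['text'])

print(dict(by_type))  # {'PRODUCT': ['Nafigator'], 'CARDINAL': ['one']}
```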

Get the deps layer output via:

doc.deps

Output of doc.deps of processed data/example.pdf:

[
  {'from_term': 't3', 'to_term': 't1', 'from_orth': 'package', 'to_orth': 'The', 'rfunc': 'det'},
  {'from_term': 't4', 'to_term': 't3', 'from_orth': 'allows', 'to_orth': 'package', 'rfunc': 'nsubj'},
  {'from_term': 't3', 'to_term': 't2', 'from_orth': 'package', 'to_orth': 'Nafigator', 'rfunc': 'compound'},
  {'from_term': 't4', 'to_term': 't5', 'from_orth': 'allows', 'to_orth': 'you', 'rfunc': 'obj'},
...
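Each entry links a head term (`from_term`) to a dependent (`to_term`) with a relation (`rfunc`), so a head-to-dependents index is a one-pass fold over the list. A sketch using the sample output above:

```python
from collections import defaultdict

# doc.deps from the example above
deps = [
    {'from_term': 't3', 'to_term': 't1', 'from_orth': 'package', 'to_orth': 'The', 'rfunc': 'det'},
    {'from_term': 't4', 'to_term': 't3', 'from_orth': 'allows', 'to_orth': 'package', 'rfunc': 'nsubj'},
    {'from_term': 't3', 'to_term': 't2', 'from_orth': 'package', 'to_orth': 'Nafigator', 'rfunc': 'compound'},
    {'from_term': 't4', 'to_term': 't5', 'from_orth': 'allows', 'to_orth': 'you', 'rfunc': 'obj'},
]

# Collect, per head term, its dependents with their relations
dependents = defaultdict(list)
for d in deps:
    dependents[d['from_term']].append((d['rfunc'], d['to_term']))

print(dependents['t3'])  # [('det', 't1'), ('compound', 't2')]
```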

Get the multiwords layer output via:

doc.multiwords

Output of doc.multiwords:

[
  {'id': 'mw1', 'lemma': 'set_out', 'pos': 'VERB', 'type': 'phrasal', 'components': [
    {'id': 'mw1.c1', 'targets': [{'id': 't37'}]},
    {'id': 'mw1.c2', 'targets': [{'id': 't39'}]}]}
]
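A multiword's components each carry their own `targets`, so resolving a multiword to the (possibly discontiguous) terms it spans is a double loop. A sketch over the sample output above:

```python
# doc.multiwords from the example above
multiwords = [
    {'id': 'mw1', 'lemma': 'set_out', 'pos': 'VERB', 'type': 'phrasal', 'components': [
        {'id': 'mw1.c1', 'targets': [{'id': 't37'}]},
        {'id': 'mw1.c2', 'targets': [{'id': 't39'}]}]}
]

# Map each multiword to the term ids its components point at
mw_terms = {
    mw['id']: [t['id'] for c in mw['components'] for t in c['targets']]
    for mw in multiwords
}

print(mw_terms)  # {'mw1': ['t37', 't39']}
```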

Get the formats layer output via:

doc.formats

Output of doc.formats:

[
  {'length': '268', 'offset': '0', 'textboxes': [
    {'textlines': [
      {'texts': [
        {'font': 'CIDFont+F1', 'size': '12.000', 'length': '87', 'offset': '0', 'text': 'The Nafigator package allows you to store NLP output from custom made spaCy and stanza '
        }]
      },
      {'texts': [
        {'font': 'CIDFont+F1', 'size': '12.000', 'length': '77', 'offset': '88', 'text': 'pipelines with (intermediate) results and all processing steps in one format.'
...
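The font and size attributes in this layer are what make header detection possible. A minimal sketch of a size-based heuristic over flattened formats data; the 24-point title line is hypothetical, added only to illustrate the rule (the sample above contains 12-point body text only), and taking the minimum size as the body size is a crude simplification:

```python
# Flattened font/size data as it appears in the formats layer; the 24-point
# title entry is hypothetical, added to illustrate the size-based heuristic
texts = [
    {'font': 'CIDFont+F2', 'size': '24.000', 'text': 'Nafigator'},  # hypothetical title
    {'font': 'CIDFont+F1', 'size': '12.000',
     'text': 'The Nafigator package allows you to store NLP output ...'},
]

# Simple header heuristic: flag anything clearly larger than the body size
body_size = min(float(t['size']) for t in texts)
headers = [t['text'] for t in texts if float(t['size']) > 1.2 * body_size]

print(headers)  # ['Nafigator']
```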

Adding new annotation layers

To add a new annotation layer with elements, start by registering the processor that produced the new annotations:

lp = ProcessorElement(name="processorname",
                      version="1.0",
                      timestamp=None,
                      beginTimestamp=None,
                      endTimestamp=None,
                      hostname=None)

naf.add_processor_element("recommendations", lp)

Then get the layer and add subelements:

layer = naf.layer("recommendations")

data_recommendation = {'id': "recommendation1", 'subjectivity': 0.5, 'polarity': 0.25, 'span': [{'id': 't37'}, {'id': 't39'}]}

element = naf.subelement(element=layer, tag="recommendation", data=data_recommendation)

naf.add_span_element(element=element, data=data_recommendation)

Retrieve the recommendations with:

naf.recommendations

History

0.1.0 (2021-03-13)

  • First release on PyPI.
