Asyncio support for Stanford CoreNLP
aiocorenlp
High-fidelity, asyncio-capable Stanford CoreNLP library.
Heavily based on ner and nltk.
Rationale and differences from nltk
For every tag operation (in other words, every call to StanfordTagger.tag*), nltk runs a Stanford JAR (stanford-ner.jar/stanford-postagger.jar) in a newly spawned Java subprocess.
In order to pass the input text to these JARs, nltk first writes it to a tempfile and includes its path in the Java command line using the -textFile flag.
This method works well in sequential applications. Once it is scaled up with concurrency and put under stress, however, problems begin to arise:
- Python's tempfile.mkstemp doesn't work very well on Windows to begin with and starts to break down under stress.
  - Calls to tempfile.mkstemp start to fail, which in turn results in Stanford code failing (no input file to read).
  - Temporary files get leaked, resulting in negative impact on disk usage.
- Repeated calls to subprocess mean:
  - Multiple Java processes run in parallel, causing negative impact on CPU and memory usage.
  - OS-level subprocess and Java startup code has to be run every time, causing additional negative impact on CPU usage.
All of this makes user-written code unnecessarily slow and unreliable.
Patching nltk's code to use tempfile.TemporaryDirectory instead of tempfile.mkstemp seemed to resolve issue 1 (the tempfile failures), but issue 2 (the repeated subprocess calls) would require more work.
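The file-based flow described above can be sketched as follows. This is a minimal illustration, not nltk's actual code: the helper names, the input.txt file name, and the jar and model paths are placeholders chosen for the example. It writes the input text into a tempfile.TemporaryDirectory, builds a Java command line that points at it with -textFile, and relies on the directory being removed when the block exits. The command is only built here, not executed.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def input_text_file(text: str, encoding: str = "utf-8"):
    """Yield the path of a temporary file holding `text`.

    The file lives inside a TemporaryDirectory, so the directory and
    everything in it are removed when the block exits, even on error.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / "input.txt"  # placeholder file name
        path.write_text(text, encoding=encoding)
        yield path


def build_ner_command(text_path: Path, jar_path: str, model_path: str) -> list[str]:
    """Build a per-call Java command line that reads its input via -textFile."""
    return [
        "java", "-cp", jar_path,
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        "-loadClassifier", model_path,
        "-textFile", str(text_path),
    ]


# Placeholder jar and model paths; nothing is spawned in this sketch.
with input_text_file("Bill Gates founded Microsoft.") as path:
    command = build_ner_command(path, "stanford-ner.jar", "model.ser.gz")
    file_existed = path.exists()

file_removed = not path.exists()
```

Even with cleanup handled this way, every tag call would still pay for a fresh Java process, which is the second issue listed above.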
This library runs the Stanford code in a server mode and sends input text over TCP, meaning:
- Filesystem operations and temporary files/directories are avoided entirely.
- There's no need to run a Java subprocess more than once.
- The only synchronization bottleneck is offloaded to Java's ServerSocket class which is used in the Stanford code.
- CPU, memory and disk usage is greatly reduced.
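To make the TCP approach concrete, here is a minimal sketch using asyncio streams. It is not this library's implementation: fake_tagger is a stand-in for the long-running Java server (it labels every token O), and the exchange of one line per connection, with the reply read until the server closes the socket, is an assumption made for the example.

```python
import asyncio


async def fake_tagger(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Stand-in for the Java server: tags every token as 'O' (not real NER)."""
    line = (await reader.readline()).decode("utf-8")
    tagged = " ".join(f"{token}/O" for token in line.split())
    writer.write((tagged + "\n").encode("utf-8"))
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def tag_over_tcp(host: str, port: int, text: str) -> str:
    """Send one line of text and read the reply until the server closes."""
    reader, writer = await asyncio.open_connection(host, port)
    writer.write((text.strip() + "\n").encode("utf-8"))
    await writer.drain()
    reply = await reader.read()  # read until EOF
    writer.close()
    await writer.wait_closed()
    return reply.decode("utf-8").strip()


async def main() -> list[str]:
    # Port 0 asks the OS for a free port, then we look up which one we got.
    server = await asyncio.start_server(fake_tagger, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        # Several concurrent requests share one already-running server.
        return await asyncio.gather(
            tag_over_tcp("127.0.0.1", port, "Hello world"),
            tag_over_tcp("127.0.0.1", port, "Bill Gates"),
        )


results = asyncio.run(main())
```

No temporary files are created and no process is spawned per request; concurrent callers simply open more connections to the same server.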
Differences from ner
- asyncio support.
- Method name mangling is inexplicably enabled in the ner.client.NER class, making subclassing impractical.
- The ner library appears to be abandoned.
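The subclassing problem caused by name mangling can be shown with a small, self-contained example. The Client and AsyncClient classes below are hypothetical and are not taken from the ner library; they only demonstrate how Python rewrites double-underscore method names per class.

```python
class Client:
    """Hypothetical client whose helper uses a double-underscore name."""

    def __send(self, text: str) -> str:  # stored as _Client__send
        return f"sync:{text}"

    def tag(self, text: str) -> str:
        return self.__send(text)  # compiled as self._Client__send(text)


class AsyncClient(Client):
    def __send(self, text: str) -> str:  # stored as _AsyncClient__send
        return f"async:{text}"


# tag() still reaches the parent's helper, so the "override" has no effect.
result = AsyncClient().tag("hello")
```

Here result is "sync:hello": the subclass defines a differently named attribute instead of replacing the parent's method, so changing behavior would require overriding every public method that calls the helper.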
Differences from stanza
- asyncio support.
- Stanza aims to provide a wider range of uses.
Basic Usage
>>> from aiocorenlp import ner_tag
>>> await ner_tag("I complained to Microsoft about Bill Gates.")
[('O', 'I'), ('O', 'complained'), ('O', 'to'), ('ORGANIZATION', 'Microsoft'), ('O', 'about'), ('PERSON', 'Bill'), ('PERSON', 'Gates.')]
This usage doesn't require interfacing with the server and socket directly and is suitable for low frequency/one-time tagging.
Advanced Usage
To fully take advantage of this library's benefits the AsyncNerServer and AsyncPosServer classes should be used:
import asyncio

from aiocorenlp.async_ner_server import AsyncNerServer
from aiocorenlp.async_corenlp_socket import AsyncCorenlpSocket


async def main():
    server = AsyncNerServer()
    port = server.start()
    print(f"Server started on port {port}")
    socket: AsyncCorenlpSocket = server.get_socket()
    # Read lines until the user types "exit", tagging each one.
    while True:
        text = input("> ")
        if text == "exit":
            break
        print(await socket.tag(text))
    server.stop()


asyncio.run(main())
An asynchronous context manager is supported as well:
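import asyncio

from aiocorenlp.async_ner_server import AsyncNerServer


async def main():
    server: AsyncNerServer
    # No explicit stop() call: cleanup is left to the context manager.
    async with AsyncNerServer() as server:
        socket = server.get_socket()
        while True:
            text = input("> ")
            if text == "exit":
                break
            print(await socket.tag(text))


asyncio.run(main())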
Configuration
As seen above, all classes and functions this library exposes may be used without arguments (default values).
Optionally, the following arguments may be passed to AsyncNerServer (and by extension ner_tag/pos_tag):
- port: Server bind port. Leave None for random port.
- model_path: Path to language model. Leave None to let nltk find the model (supports STANFORD_MODELS environment variable).
- jar_path: Path to stanford-*.jar. Leave None to let nltk find the jar (supports STANFORD_POSTAGGER environment variable, for NER as well).
- output_format: Output format. See OutputFormat enum for values. Default is slashTags.
- encoding: Output encoding.
- java_options: Additional JVM options.
It is not possible to configure the server bind interface. This is a limitation imposed by the Stanford code.
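For readers unfamiliar with the default slashTags output format, the sketch below shows how such text could be turned into (tag, word) tuples like those in the Basic Usage output. The parse_slash_tags helper is hypothetical and is not part of this library; it assumes whitespace-separated word/TAG items, as Stanford's slashTags format produces.

```python
def parse_slash_tags(output: str) -> list[tuple[str, str]]:
    """Split 'word/TAG word/TAG ...' into a list of (tag, word) tuples."""
    pairs = []
    for item in output.split():
        # Split on the last slash so words that contain "/" stay intact.
        word, _, tag = item.rpartition("/")
        pairs.append((tag, word))
    return pairs


parsed = parse_slash_tags("Microsoft/ORGANIZATION about/O Bill/PERSON")
```

Splitting on the last slash rather than the first keeps tokens such as dates or URLs from being cut in the wrong place.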