A toolkit for Scientific Document Processing
**CHANGE LOG**

Source Code
- Rename `SingleSummarization` to `Summarization`.
- Change the format of output files from `.txt` to `.json`.

Documentation
- Move the definition of the `Pipeline` class from Usage to the Contribution Guide.
- Add a catalog for the Contribution Guide.
- Add examples for choosing devices in Usage.
SciAssist

About • Installation • Usage • Contribution
About
This is the repository of SciAssist, a toolkit to assist scientists' research. SciAssist currently supports reference string parsing; more functions are under active development by WING@NUS, Singapore. The project was built upon an open-sourced template by ashleve.
Installation
pip install SciAssist
Setup Grobid for pdf processing
After you install the package, you can set up Grobid with the CLI:
setup_grobid
This will set up Grobid. After installation, start the Grobid server with:
run_grobid
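Before running any pdf-based task, it can help to confirm the Grobid server is actually reachable. A minimal sketch, assuming Grobid's default port 8070 and its standard `/api/isalive` health endpoint (adjust `base_url` if your setup differs):

```python
from urllib.request import urlopen

def grobid_alive(base_url: str = "http://localhost:8070") -> bool:
    """Return True if the Grobid service answers its health endpoint."""
    try:
        with urlopen(f"{base_url}/api/isalive", timeout=5) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, DNS failure, ...
        return False
```

If this returns False, re-run `run_grobid` before calling any pdf-processing function.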
Usage
Here are some example usages.
Reference string parsing:
from SciAssist import ReferenceStringParsing
# Set device="cpu" if you want to use only CPU. The default device is "gpu".
# ref_parser = ReferenceStringParsing(device="cpu")
ref_parser = ReferenceStringParsing(device="gpu")
# For string
res = ref_parser.predict(
"""Calzolari, N. (1982) Towards the organization of lexical definitions on a
database structure. In E. Hajicova (Ed.), COLING '82 Abstracts, Charles
University, Prague, pp.61-64.""", type="str")
# For text
res = ref_parser.predict("test.txt", type="txt")
# For pdf
res = ref_parser.predict("test.pdf")
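If your references arrive as one pasted block rather than a file, you may want to split them into individual strings before parsing. A rough sketch; the splitting heuristic below (blank lines between entries) is an assumption about your input, not part of the SciAssist API:

```python
def split_references(block: str) -> list[str]:
    """Split a pasted reference section into one string per entry,
    assuming entries are separated by blank lines. Line breaks inside
    an entry are collapsed to single spaces."""
    entries = [" ".join(part.split()) for part in block.split("\n\n")]
    return [e for e in entries if e]
```

Each resulting string can then be passed to `ref_parser.predict(entry, type="str")` in a loop.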
Summarization for a single document:
from SciAssist import Summarization
# Set device="cpu" if you want to use only CPU. The default device is "gpu".
# pipeline = Summarization(device="cpu")
summarizer = Summarization(device="gpu")
text = """1 INTRODUCTION . Statistical learning theory studies the learning properties of machine learning algorithms , and more fundamentally , the conditions under which learning from finite data is possible .
In this context , classical learning theory focuses on the size of the hypothesis space in terms of different complexity measures , such as combinatorial dimensions , covering numbers and Rademacher/Gaussian complexities ( Shalev-Shwartz & Ben-David , 2014 ; Boucheron et al. , 2005 ) .
Another more recent approach is based on defining suitable notions of stability with respect to perturbation of the data ( Bousquet & Elisseeff , 2001 ; Kutin & Niyogi , 2002 ) .
In this view , the continuity of the process that maps data to estimators is crucial , rather than the complexity of the hypothesis space .
Different notions of stability can be considered , depending on the data perturbation and metric considered ( Kutin & Niyogi , 2002 ) .
Interestingly , the stability and complexity approaches to characterizing the learnability of problems are not at odds with each other , and can be shown to be equivalent as shown in Poggio et al . ( 2004 ) and Shalev-Shwartz et al . ( 2010 ) .
In modern machine learning overparameterized models , with a larger number of parameters than the size of the training data , have become common .
The ability of these models to generalize is well explained by classical statistical learning theory as long as some form of regularization is used in the training process ( Bühlmann & Van De Geer , 2011 ; Steinwart & Christmann , 2008 ) .
However , it was recently shown - first for deep networks ( Zhang et al. , 2017 ) , and more recently for kernel methods ( Belkin et al. , 2019 ) - that learning is possible in the absence of regularization , i.e. , when perfectly fitting/interpolating the data .
Much recent work in statistical learning theory has tried to find theoretical ground for this empirical finding .
Since learning using models that interpolate is not exclusive to deep neural networks , we study generalization in the presence of interpolation in the case of kernel methods .
We study both linear and kernel least squares problems in this paper . """
# For string
res = summarizer.predict(text, type="str")
# For text
res = summarizer.predict("bodytext.txt", type="txt")
# For pdf
res = summarizer.predict("raw.pdf")
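As noted in the change log, output files are now written as `.json`. If you want to persist a result yourself, a small helper like the following works; the exact structure of `res` depends on the SciAssist version, so this sketch simply serializes whatever `predict()` returns:

```python
import json
from pathlib import Path

def save_result(res, path: str) -> None:
    """Write a SciAssist result to a .json file. `res` is whatever
    predict() returns; its exact structure is version-dependent."""
    Path(path).write_text(
        json.dumps(res, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
```

For example, `save_result(res, "summary.json")` after any of the calls above.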
Contribution
Here is a brief overview of how to incorporate a new task into SciAssist. Generally, to add a new task, you will need to:
1. Git clone this repo and prepare the virtual environment.
2. Install Grobid Server.
3. Create a LightningModule and a LightningDataModule.
4. Train a model.
5. Provide a pipeline for users.
We provide a step-by-step contribution guide; see SciAssist's documentation.
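To give a feel for step 5, here is a hypothetical skeleton of a user-facing pipeline class in the style of the ones shown under Usage. The class name, internals, and dispatch logic are illustrative assumptions, not the actual SciAssist `Pipeline` definition:

```python
class MyTaskPipeline:
    """Illustrative pipeline skeleton: dispatches on input type, mirroring
    the ReferenceStringParsing / Summarization interfaces shown in Usage."""

    def __init__(self, device: str = "gpu"):
        self.device = device  # where the underlying model would run

    def predict(self, source, type: str = "pdf"):
        if type == "str":
            return self._run(source)
        if type == "txt":
            with open(source, encoding="utf-8") as f:
                return self._run(f.read())
        if type == "pdf":
            # A real pipeline would extract text via the Grobid server here.
            raise NotImplementedError("pdf handling needs the Grobid server")
        raise ValueError(f"unknown input type: {type}")

    def _run(self, text: str):
        # Placeholder for actual model inference (step 4's trained model).
        return {"input_chars": len(text), "device": self.device}
```

The real conventions (base classes, return types, file handling) are spelled out in the contribution guide.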
LICENSE
This toolkit is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International license.
Read LICENSE for more information.
Hashes for SciAssist-0.0.35-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 87a4b9f49603f71e470a490086070be5aa18107f67767a05b0e616df239355b1
MD5 | dcf8c45826cd433bce20c79acf506e82
BLAKE2b-256 | f8ee362ae1c4adfd5a5e26febe32c6937ad0ae8b553ec3e9e8c3d86062ea9aac