A full spaCy pipeline and models for scientific/biomedical documents.

This repository contains custom pipes and models related to using spaCy for scientific documents.

In particular, there is a custom tokenizer that adds tokenization rules on top of spaCy's rule-based tokenizer, a POS tagger and syntactic parser trained on biomedical data and an entity span detection model. Separately, there are also NER models for more specific tasks.

Installation

Installing scispacy requires two steps: installing the library and installing the models. To install the library, run:

pip install scispacy

To install a model (see the full selection of available models below), run a command like the following:

pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.0/en_core_sci_sm-0.2.0.tar.gz
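The URL above appears to follow a regular pattern: the release version, then the model name and version. As a minimal sketch, assuming other models are published under the same pattern (this page shows only the one example, so treat that as unconfirmed), a hypothetical helper `model_url` can assemble the argument for `pip install`:

```python
# Hypothetical helper: the URL pattern is inferred from the single example
# above and is an assumption, not a documented scispacy API.
RELEASE_ROOT = "https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases"


def model_url(name: str, version: str = "0.2.0") -> str:
    """Build a download URL for a model archive, assuming the pattern holds."""
    return f"{RELEASE_ROOT}/v{version}/{name}-{version}.tar.gz"


if __name__ == "__main__":
    # Pass the printed URL to `pip install`.
    print(model_url("en_core_sci_sm"))
```

If a generated URL does not resolve, fall back to the links in the "Available Models" section.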

Note: We strongly recommend that you use an isolated Python environment (such as virtualenv or conda) to install scispacy. Take a look below in the "Setting up a virtual environment" section if you need some help with this. Additionally, scispacy uses modern features of Python and as such is only available for Python 3.6 or greater.
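To confirm that the interpreter in your environment meets the stated minimum before installing, a small check like the following may help. It is only a sketch: `python_is_supported` and `MIN_PYTHON` are hypothetical names, not part of scispacy.

```python
import sys

MIN_PYTHON = (3, 6)  # minimum version stated in the note above


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given version tuple meets the stated minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    if not python_is_supported():
        raise SystemExit("scispacy needs Python 3.6 or greater")
    print("Python version OK")
```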

Setting up a virtual environment

Conda can be used to set up a virtual environment with the version of Python required for scispaCy. If you already have a Python 3.6 or 3.7 environment you want to use, you can skip to the "Installation" section.

  1. Download and install Conda.

  2. Create a Conda environment called "scispacy" with Python 3.6:

    conda create -n scispacy python=3.6
    
  3. Activate the Conda environment. You will need to activate the Conda environment in each terminal in which you want to use scispaCy.

    source activate scispacy
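Because the environment has to be activated in every terminal, it can be useful to check from Python which environment is active. The sketch below assumes Conda exports the `CONDA_DEFAULT_ENV` variable on activation; `active_conda_env` is a hypothetical helper name.

```python
import os


def active_conda_env(environ=os.environ):
    """Return the active Conda environment name, or None if none is active.

    Assumes Conda sets CONDA_DEFAULT_ENV when an environment is activated.
    """
    return environ.get("CONDA_DEFAULT_ENV") or None


if __name__ == "__main__":
    env = active_conda_env()
    if env == "scispacy":
        print("The scispacy environment is active.")
    else:
        print(f"Active Conda environment: {env!r}; expected 'scispacy'.")
```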
    

Now you can install scispacy and one of the models using the steps above.

Once you have completed the above steps and downloaded one of the models below, you can load a scispaCy model as you would any other spaCy model. For example:

import spacy
nlp = spacy.load("en_core_sci_sm")
doc = nlp("Alterations in the hypocretin receptor 2 and preprohypocretin genes produce narcolepsy in some animals.")
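The resulting `doc` behaves like any other spaCy `Doc`. As a minimal sketch, assuming `en_core_sci_sm` is installed, the entity spans detected by the pipeline can be read from `doc.ents`. The helper names `entity_spans` and `load_model` are hypothetical, and the load is guarded so that a missing install is reported rather than raising.

```python
def entity_spans(doc):
    """Return (text, start_char, end_char) for each entity span in a spaCy Doc."""
    return [(ent.text, ent.start_char, ent.end_char) for ent in doc.ents]


def load_model(name="en_core_sci_sm"):
    """Load a model, returning None if spaCy or the model is not installed."""
    try:
        import spacy
        return spacy.load(name)
    except (ImportError, OSError):
        # spaCy raises OSError when the named model cannot be found.
        return None


if __name__ == "__main__":
    nlp = load_model()
    if nlp is None:
        print("Install scispacy and en_core_sci_sm first.")
    else:
        doc = nlp("Alterations in the hypocretin receptor 2 and preprohypocretin "
                  "genes produce narcolepsy in some animals.")
        for text, start, end in entity_spans(doc):
            print(text, start, end)
```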

Available Models

To install a model, click on the link below to download the model, and then run

pip install </path/to/download>

Alternatively, you can install directly from the URL by right-clicking on the link, selecting "Copy Link Address" and running

pip install <paste the copied URL here>

| Model | Description | Install URL |
| --- | --- | --- |
| en_core_sci_sm | A full spaCy pipeline for biomedical data. | Download |
| en_core_sci_md | A full spaCy pipeline for biomedical data with a larger vocabulary and word vectors. | Download |
| en_ner_craft_md | A spaCy NER model trained on the CRAFT corpus. | Download |
| en_ner_jnlpba_md | A spaCy NER model trained on the JNLPBA corpus. | Download |
| en_ner_bc5cdr_md | A spaCy NER model trained on the BC5CDR corpus. | Download |
| en_ner_bionlp13cg_md | A spaCy NER model trained on the BIONLP13CG corpus. | Download |
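To see which of these models are already installed in the current environment, one option is to look them up as Python packages. This sketch assumes that a pip-installed spaCy model is importable under its model name; `MODELS` and `installed_models` are hypothetical names introduced for illustration.

```python
import importlib.util

# Model names copied from the table above.
MODELS = (
    "en_core_sci_sm",
    "en_core_sci_md",
    "en_ner_craft_md",
    "en_ner_jnlpba_md",
    "en_ner_bc5cdr_md",
    "en_ner_bionlp13cg_md",
)


def installed_models(names=MODELS):
    """Return the names that can be found as importable packages."""
    return [n for n in names if importlib.util.find_spec(n) is not None]


if __name__ == "__main__":
    found = installed_models()
    print("Installed models:", ", ".join(found) if found else "none")
```

Any name it reports can then be passed to `spacy.load`.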

Citing

If you use ScispaCy in your research, please cite ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing.

@inproceedings{Neumann2019ScispaCyFA,
  title={ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing},
  author={Mark Neumann and Daniel King and Iz Beltagy and Waleed Ammar},
  year={2019},
  Eprint={arXiv:1902.07669}
}

ScispaCy is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.
