Skip to main content

Concept annotation tool for Electronic Health Records

Project description

Medical oncept Annotation Tool

MedCAT can be used to extract information from Electronic Health Records (EHRs) and link it to biomedical ontologies like SNOMED-CT and UMLS. Preprint arXiv.

UPDATE

MedCAT is upgraded to v1, unforunately this introduces breaking changes with older models (MedCAT v0.4), as well as potential problems with all code that used the MedCAT package.

MedCAT v0.4 is available on the legacy branch and will still be supported until 1. July 2021 (with respect to potential bug fixes), after it will still be available but not updated anymore.

Demo

A demo application is available at MedCAT. Please note that this was trained on MedMentions and contains a small portion of UMLS.

Tutorial

A guide on how to use MedCAT is available in the tutorial folder. Read more about MedCAT on Towards Data Science.

Papers that use MedCAT

Related Projects

  • MedCATtrainer - an interface for building, improving and customising a given Named Entity Recognition and Linking (NER+L) model (MedCAT) for biomedical domain text.
  • MedCATservice - implements the MedCAT NLP application as a service behind a REST API.
  • iCAT - A docker container for CogStack/MedCAT/HuggingFace development in isolated environments.

Install using PIP (Requires Python 3.6.1+)

  1. Install MedCAT
  • For macOS: pip install --upgrade medcat
  • For Windows/Linux (see PyTorch documentation): pip install --upgrade medcat -f https://download.pytorch.org/whl/torch_stable.html
  1. Get the scispacy models:

pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_md-0.3.0.tar.gz

  1. Downlad the Vocabulary and CDB from the Models section bellow

  2. Quickstart:

from medcat.vocab import Vocab
from medcat.cdb import CDB
from medcat.cat import CAT

# Load the vocab model you downloaded
vocab = Vocab.load(vocab_path)
# Load the cdb model you downloaded
cdb = CDB.load('<path to the cdb file>') 

# Create cat - each cdb comes with a config that was used
#to train it. You can change that config in any way you want, before or after creating cat.
cat = CAT(cdb=cdb, config=cdb.config, vocab=vocab)

# Test it
text = "My simple document with kidney failure"
doc_spacy = cat(text)
# Print detected entities
print(doc_spacy.ents)

# Or to get an array of entities, this will return much more information
#and usually easier to use unless you know a lot about spaCy
doc = cat.get_entities(text)
print(doc)


# To train on one example
_ = cat(text, do_train=True)

# To train on a iterator over documents
data_iterator = <your iterator>
cat.train(data_iterator)

#Once done, save the new CDB
cat.cdb.save(<save path>)

Models

A basic trained model is made public for the vocabulary and CDB. It is trained for the ~ 35K concepts available in MedMentions.

Vocabulary Download - Built from MedMentions

CDB Download - Built from MedMentions

(Note: This is was compiled from MedMentions and does not have any data from NLM as that data is not publicaly available.)

SNOMED-CT and UMLS

If you have access to UMLS or SNOMED-CT and can provide some proof (a screenshot of the UMLS profile page is perfect, feel free to redact all information you do not want to share), contact us - we are happy to share the pre-built CDB and Vocab for those databases.

TODO

  • Switch to spaCy version 3+
  • Enable automatic download of pre-built UMLS/SNOMED databases
  • Enable spaCy serialization of documents (problem with doc._.ents)
  • Update webapp to v1 and enable UMLS and SNOMED

Acknowledgement

Entity extraction was trained on MedMentions In total it has ~ 35K entites from UMLS

The vocabulary was compiled from Wiktionary In total ~ 800K unique words

Powered By

A big thank you goes to spaCy and Hugging Face - who made life a million times easier.

Citation

@misc{kraljevic2020multidomain,
      title={Multi-domain Clinical Natural Language Processing with MedCAT: the Medical Concept Annotation Toolkit}, 
      author={Zeljko Kraljevic and Thomas Searle and Anthony Shek and Lukasz Roguski and Kawsar Noor and Daniel Bean and Aurelie Mascio and Leilei Zhu and Amos A Folarin and Angus Roberts and Rebecca Bendayan and Mark P Richardson and Robert Stewart and Anoop D Shah and Wai Keong Wong and Zina Ibrahim and James T Teo and Richard JB Dobson},
      year={2020},
      eprint={2010.01165},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

medcat-1.0.22.tar.gz (85.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

medcat-1.0.22-py3-none-any.whl (124.0 kB view details)

Uploaded Python 3

File details

Details for the file medcat-1.0.22.tar.gz.

File metadata

  • Download URL: medcat-1.0.22.tar.gz
  • Upload date:
  • Size: 85.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.8.0

File hashes

Hashes for medcat-1.0.22.tar.gz
Algorithm Hash digest
SHA256 cf9a57f2061f5f5bbc52d0a3c5dc8cb6c7d06872e9564c88bd568430c5538a2e
MD5 491bcc98e72f984fccff6006d967ca79
BLAKE2b-256 4452c047df2b6bc46ff44006f5958f10943a7d4a0472d61307187e94450cd812

See more details on using hashes here.

File details

Details for the file medcat-1.0.22-py3-none-any.whl.

File metadata

  • Download URL: medcat-1.0.22-py3-none-any.whl
  • Upload date:
  • Size: 124.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.8.0

File hashes

Hashes for medcat-1.0.22-py3-none-any.whl
Algorithm Hash digest
SHA256 30391b0b9c53671541e5f6419290060f8f9c8c4389b4111c9b973d6a2498909c
MD5 cfc914fbae37cd1ef8ba4f429d657b09
BLAKE2b-256 454b519820c8ebbe22e1e764485fb1d279f1d82e70d6b1abab058e5fcb4f2f29

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page