bert-deid

Remove identifiers from data using BERT.
Code to fine-tune BERT on a medical note de-identification task.

Install

  • (Recommended) Create a conda environment called deid:
    • conda env create -f environment.yml
  • Install the package with pip:
    • pip install bert-deid

Download

To download the model, we provide a helper script in bert-deid:

# note: MODEL_DIR environment variable used by download
export MODEL_DIR="~/bert_deid_model/"
bert_deid download

Usage (Shell)

From the command line, we can call bert_deid to apply the model to any given text:

export MODEL_DIR="~/bert_deid_model/"
bert_deid apply --text "hello dr. somayah"

Text can also be piped to bert_deid. Alternatively, the --text_dir argument allows running the package on all files in a folder:

mkdir tmp
echo "hello dr. somayah" > tmp/example1.txt
echo "No pneumothorax since 2019-01-01." > tmp/example2.txt
bert_deid apply --text_dir tmp

Deidentified files are output with the .deid extension, e.g. tmp/example1.txt would become tmp/example1.txt.deid.

Usage (Python)

The model can also be imported and used directly within Python.

from bert_deid.model import Transformer

# load in a trained model
model_type = 'bert'
model_path = '/data/models/bert-i2b2-2014'
deid_model = Transformer(model_type, model_path)

text = 'Dr. Somayah says I have had a pneumothorax since 2019-01-01.'
print(deid_model.apply(text, repl='___'))

# we can also get the original predictions
preds, lengths, offsets = deid_model.predict(text)

# print out the identified entities
for p in range(preds.shape[0]):
    start, stop = offsets[p], offsets[p] + lengths[p]

    # most likely prediction
    idxMax = preds[p].argmax()
    label = deid_model.label_set.id_to_label[idxMax]
    print(f'{text[start:stop]:15s} {label}')
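
Often we only care about the tokens the model flags as identifiers. A minimal sketch continuing from the snippet above, assuming preds holds one row of scores per token and that the label set uses 'O' for tokens outside any PHI entity (the usual BIO-style convention; we have not verified this against the package):

# hypothetical filter: skip tokens whose most likely label is 'O',
# leaving only the spans the model flags as identifiers
for p in range(preds.shape[0]):
    label = deid_model.label_set.id_to_label[preds[p].argmax()]
    if label == 'O':
        continue
    start, stop = offsets[p], offsets[p] + lengths[p]
    print(f'{text[start:stop]:15s} {label}')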

Training and evaluating a transformer model

First, you'll need a suitable dataset: right now this can be i2b2_2014, i2b2_2006, PhysioNet, or Dernoncourt-Lee. A dataset is considered suitable if it is saved in the following format:

  • a root folder dedicated to the dataset
  • train/test subfolders
  • each train/test subfolder has ann/txt subfolders
  • the txt subfolder has files with the .txt extension containing the text to be deidentified
  • the ann subfolder has files with the .gs extension containing a CSV of gold standard de-id annotations (a sample annotation file is shown after the directory tree below)

Here's an example:

i2b2_2014
├── train
│   ├── ann
│   │   ├── 100-01.gs
│   │   ├── 100-02.gs
│   │   └── 100-03.gs
│   └── txt
│       ├── 100-01.txt
│       ├── 100-02.txt
│       └── 100-03.txt
└── test
    ├── ann
    │   ├── 110-01.gs
    │   ├── 110-02.gs
    │   └── 110-03.gs
    └── txt
        ├── 110-01.txt
        ├── 110-02.txt
        └── 110-03.txt
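
For reference, a .gs annotation file is a plain CSV. The row below is illustrative only, and assumes the gold standard uses the same stand-off columns as the prediction files shown later in this README:

document_id,annotation_id,start,stop,entity,entity_type,comment
100-01,0,16,26,2069-04-07,DATE,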

With the dataset available, create the environment:

conda env create -f environment.yml

Activate the environment:

conda activate deid

Train a model (e.g. BERT):

python scripts/train_transformer.py --data_dir /data/deid-gs/i2b2_2014 --data_type i2b2_2014 --model_type bert --model_name_or_path bert-base-uncased --do_lower_case --output_dir /data/models/bert-model-i2b2-2014 --do_train --overwrite_output_dir

Note this will only use data from the train subfolder of the --data_dir arg. Once the model is trained it can be used as above.

The binary_evaluation.py script can be used to assess performance on a test set. First, we'll need to generate the predictions:

export TEST_SET_PATH='/enc_data/deid-gs/i2b2_2014/test'
export MODEL_PATH='/enc_data/models/bert-i2b2-2014'
export PRED_PATH='out/'

python scripts/output_preds.py --data_dir ${TEST_SET_PATH} --model_dir ${MODEL_PATH} --output_folder ${PRED_PATH}

This outputs the predictions to the out folder. Looking at one of the files, we can see that each prediction file is a CSV of stand-off annotations. Here are the top few lines from the 110-01.pred file:

document_id,annotation_id,start,stop,entity,entity_type,comment
110-01,4,16,20,2069,DATE,
110-01,5,20,21,-,DATE,
110-01,6,21,23,04,DATE,
110-01,7,23,24,-,DATE,
110-01,8,24,26,07,DATE,
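
Since each .pred file is a plain CSV, it is easy to load and inspect. A minimal sketch using pandas, with the out/ path taken from the steps above; merging adjacent fragments into a single span is our own illustrative post-processing, not part of the package:

import pandas as pd

# load one prediction file (path from the steps above)
df = pd.read_csv('out/110-01.pred')

# illustrative post-processing: merge annotations that touch and share
# an entity_type, so the five DATE fragments above become one span
merged = []
for _, row in df.sort_values('start').iterrows():
    if merged and merged[-1]['entity_type'] == row['entity_type'] \
            and merged[-1]['stop'] == row['start']:
        merged[-1]['stop'] = int(row['stop'])
    else:
        merged.append({'start': int(row['start']),
                       'stop': int(row['stop']),
                       'entity_type': row['entity_type']})

print(merged)
# e.g. [{'start': 16, 'stop': 26, 'entity_type': 'DATE'}]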

We can now evaluate the predictions using the ground truth:

python scripts/binary_evaluation.py --pred_path ${PRED_PATH} --text_path ${TEST_SET_PATH}/txt --ref_path ${TEST_SET_PATH}/ann

For our trained model, this returned:

  • Macro Se: 0.9818
  • Macro P+: 0.9885
  • Macro F1: 0.9840
  • Micro Se: 0.9816
  • Micro P+: 0.9892
  • Micro F1: 0.9854
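
As a sanity check on these numbers: for micro-averaged metrics (pooled over all predictions), F1 is the harmonic mean of sensitivity (Se, i.e. recall) and positive predictive value (P+, i.e. precision). Macro F1 is typically an average of per-class F1 scores, so the same identity need not hold for the macro numbers:

# micro F1 is the harmonic mean of micro Se and micro P+
se, ppv = 0.9816, 0.9892
print(round(2 * se * ppv / (se + ppv), 4))  # 0.9854, matching Micro F1 above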

We can also look at individual predictions for a given file:

export FN=110-02
python scripts/print_annotation.py -p ${PRED_PATH}/${FN}.pred -t ${TEST_SET_PATH}/txt/${FN}.txt -r ${TEST_SET_PATH}/ann/${FN}.gs

If we would like a multi-class evaluation, we need to know about any label transformations done by the model, so we call a different script:

python scripts/eval.py --model_dir ${MODEL_PATH} --pred_path ${PRED_PATH} --text_path ${TEST_SET_PATH}/txt --ref_path ${TEST_SET_PATH}/ann
