Bloatectomy: a method for the identification and removal of duplicate text in the bloated notes of electronic health records and other documents.

Bloatectomy

Bloatectomy is a method for identifying and removing duplicate text in the bloated notes of electronic health records and other documents. It takes a list of notes, a single file (.docx, .txt, .rtf, etc.), or a single string and marks the duplicated text. It writes the marked output and, optionally, the tokens.

Requirements

Python>=3.7 (required for the regular expressions to work correctly)
re (standard library)
sys (standard library)
pandas (optional, only necessary if using MIMIC III data)
docx (optional, only necessary if the input or output is a Word/.docx file)

Installation

using anaconda or miniconda

conda install -c summerkrankin bloatectomy

using pip via PyPI (make sure to install it to python3 if your default is python2)

python3 -m pip install bloatectomy

using pip via github

python3 -m pip install git+git://github.com/MIT-LCP/mimic-code TBA

manual install by cloning the repository

git clone git://github.com/MIT-LCP/mimic-code TBA
cd bloatectomy
python3 setup.py install

Examples

To run bloatectomy on a sample string with the following options:

highlighting duplicates
display raw results
output file as html
output file of numbered tokens:

from bloatectomy import bloatectomy

text = '''Assessment and Plan
61 yo male Hep C cirrhosis
Abd pain:
-other labs: PT / PTT / INR:16.6//    1.5, CK / CKMB /
ICU Care
-other labs: PT / PTT / INR:16.6//  1.5, CK / CKMB /
Assessment and Plan
'''

bloatectomy(text, style='highlight', display=True, filename='sample_txt_highlight_output', output='html', output_numbered_tokens=True)

To use the example text or load the ipynb examples, download the repository or just the bloatectomy_examples folder, then run the following from inside that folder:

cd bloatectomy_examples
from bloatectomy import bloatectomy

bloatectomy('./input/sample_text.txt',
            style='highlight', display=False,
            filename='./output/sample_txt_highlight_output',
            output='html',
            output_numbered_tokens=True,
            output_original_tokens=True)

Documentation

The paper is located at TBA

class bloatectomy(input_text,
                  path = '',
                  filename='bloatectomized_file',
                  display=False,
                  style='highlight',
                  output='html',
                  output_numbered_tokens=False,
                  output_original_tokens=False,
                  regex1=r"(.+?\.[\s\n]+)",
                  regex2=r"(?=\n\s*[A-Z1-9#-]+.*)",
                  postgres_engine=None,
                  postgres_table=None)

Parameters

input_text: file, str, list
An input document (.txt, .rtf, .docx), a string of raw text, or a list of hadm_ids to look up in a postgres MIMIC III database.

style: str, optional, default=highlight
Method for denoting a duplicate. The following are allowed: highlight, bold, remov.

filename: str, optional, default=bloatectomized_file
A string to name the output file of the bloat-ectomized document.

path: str, optional, default=''
The directory for output files.

output_numbered_tokens: bool, optional, default=False
If set to True, a .txt file with each token enumerated and marked for duplication is output as [filename]_token_numbers.txt. This is useful when debugging your own regular expression for tokenization or testing the remov option for style.

output_original_tokens: bool, optional, default=False
If set to True, a .txt file with each original token enumerated but not marked for duplication is output as [filename]_original_token_numbers.txt.

display: bool, optional, default=False
If set to True, the bloatectomized text will display in the console on completion.

regex1: str, optional, default=r"(.+?\.[\s\n]+)"
The regular expression for the first tokenization. Splits after a period (.) followed by one or more whitespace characters (space, tab, or line break). This can be replaced with any valid regular expression to change the way tokens are created.

regex2: str, optional, default=r"(?=\n\s*[A-Z1-9#-]+.*)"
The regular expression for the second tokenization. Splits before any newline character (\n) that is followed, after optional whitespace, by an uppercase letter, a digit from 1 to 9, a hash (#), or a dash (-). This can be replaced with any valid regular expression to change how sub-tokens are created.

postgres_engine: str, optional, default=None
The postgres connection. Only relevant for use with the MIMIC III dataset. When using this option, do not pass a filename; each output file is named with its hadm_id. See the jupyter notebook mimic_bloatectomy_example for the example code.

postgres_table: str, optional, default=None
The name of the postgres table containing the concatenated notes. Only relevant for use with the MIMIC III dataset. When using this option, do not pass a filename; each output file is named with its hadm_id. See the jupyter notebook mimic_bloatectomy_example for the example code.
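
To see what the default regex1 and regex2 patterns do to a note, the sketch below applies them with re.split and then flags repeated tokens. It is an illustration only, not the package's implementation: the helper names tokenize and mark_duplicates are hypothetical, and comparing tokens after collapsing whitespace is an assumption made here so that the two "-other labs" lines from the sample string count as the same text. Splitting on regex2 relies on re.split accepting patterns that match an empty string, which Python supports from 3.7 onward.

```python
import re

# Default patterns copied from the class signature in this document.
REGEX1 = r"(.+?\.[\s\n]+)"
REGEX2 = r"(?=\n\s*[A-Z1-9#-]+.*)"


def tokenize(text, regex1=REGEX1, regex2=REGEX2):
    """Split text with regex1, then split each piece again with regex2."""
    tokens = []
    for chunk in re.split(regex1, text):
        # regex2 is a zero-width lookahead, so this split needs Python 3.7+.
        for sub in re.split(regex2, chunk):
            if sub.strip():  # drop empty and whitespace-only pieces
                tokens.append(sub)
    return tokens


def mark_duplicates(tokens):
    """Pair each token with True if an equivalent token appeared earlier."""
    seen = set()
    marked = []
    for token in tokens:
        # Assumption for this sketch: whitespace differences are ignored.
        key = " ".join(token.split())
        marked.append((token, key in seen))
        seen.add(key)
    return marked


sample = (
    "Assessment and Plan\n"
    "61 yo male Hep C cirrhosis\n"
    "Abd pain:\n"
    "-other labs: PT / PTT / INR:16.6//    1.5, CK / CKMB /\n"
    "ICU Care\n"
    "-other labs: PT / PTT / INR:16.6//  1.5, CK / CKMB /\n"
    "Assessment and Plan\n"
)

# Print each token with its number and whether it repeats an earlier token.
for number, (token, is_dup) in enumerate(mark_duplicates(tokenize(sample)), start=1):
    print(number, "DUPLICATE" if is_dup else "first    ", repr(token))
```

In this sample no period is followed by whitespace, so regex1 leaves the text in one piece and regex2 does all the splitting, one token per line. Swapping in a different pattern for either argument is a quick way to check how a custom regular expression would tokenize your own notes before passing it to bloatectomy.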

Release history

0.0.12: Jun 18, 2020
0.0.11: Jun 15, 2020
0.0.10: Jun 15, 2020 (this version)
0.0.9: Jun 15, 2020
0.0.8: Jun 15, 2020
0.0.7: Jun 12, 2020
0.0.6: Jun 12, 2020
0.0.5: Jun 12, 2020
0.0.4: Jun 12, 2020
0.0.3: Jun 11, 2020
0.0.2: Jun 11, 2020
0.0.1: Jun 11, 2020

Download files

Source Distribution: bloatectomy-0.0.10.tar.gz (16.4 kB), uploaded Jun 15, 2020
Built Distribution: bloatectomy-0.0.10-py3-none-any.whl (17.1 kB), uploaded Jun 15, 2020, Python 3

File details

Details for the file bloatectomy-0.0.10.tar.gz.

File metadata

Download URL: bloatectomy-0.0.10.tar.gz
Upload date: Jun 15, 2020
Size: 16.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.6

File hashes

Hashes for bloatectomy-0.0.10.tar.gz
Algorithm	Hash digest
SHA256	`81b07697d5584b4a7cc0d41af444a918596e5dfe3eec89e88ddf0990cc22645b`
MD5	`f9f4e9bd02a704012147ced868cd9bd0`
BLAKE2b-256	`8d1e4db0cf002a9e6782c742dbc162b66d3c4ab72d531a89fb56f8749c1ae396`

File details

Details for the file bloatectomy-0.0.10-py3-none-any.whl.

File metadata

Download URL: bloatectomy-0.0.10-py3-none-any.whl
Upload date: Jun 15, 2020
Size: 17.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.6

File hashes

Hashes for bloatectomy-0.0.10-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c2f06396b8ef94a9bfdd4f2be67579b20986d2002554ac7665d1d18d4149b094`
MD5	`e52ebbf5a32612aef13e41787bc3dd2a`
BLAKE2b-256	`90755641cd577dc539b9f07aa398bd75f4d7d7be15284600debeeb662cb19a20`
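
The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using only the standard library hashlib module; the helper names sha256_of and matches_published_hash are hypothetical, and the local path you pass in is whatever location you saved the file to.

```python
import hashlib

# SHA256 digests copied from the hash tables in this document.
EXPECTED_SHA256 = {
    "bloatectomy-0.0.10.tar.gz":
        "81b07697d5584b4a7cc0d41af444a918596e5dfe3eec89e88ddf0990cc22645b",
    "bloatectomy-0.0.10-py3-none-any.whl":
        "c2f06396b8ef94a9bfdd4f2be67579b20986d2002554ac7665d1d18d4149b094",
}


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, filename):
    """Compare a local file against the digest listed for filename."""
    return sha256_of(path) == EXPECTED_SHA256[filename]
```

For example, after fetching the wheel with python3 -m pip download bloatectomy==0.0.10 --no-deps, calling matches_published_hash("bloatectomy-0.0.10-py3-none-any.whl", "bloatectomy-0.0.10-py3-none-any.whl") should return True if the file is intact.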