Bloatectomy: a method for the identification and removal of duplicate text in the bloated notes of electronic health records and other documents.

Bloatectomy

Bloatectomy takes a list of notes, a single file (.docx, .txt, .rtf, etc.), or a single string and marks the duplicate text it finds. It outputs the marked document and, optionally, the tokens.

Requirements

  • Python>=3.7.x (in order for the regular expressions to work correctly)
  • re (standard library)
  • sys (standard library)
  • pandas (optional, only necessary if using MIMIC III data)
  • docx (optional, only necessary if the input or output is a Word/docx file; installed from PyPI as python-docx)

Installation

using pip via PyPI

pip install bloatectomy

using pip via github

pip install git+https://github.com/MIT-LCP/bloatectomy

manual install by cloning the repository

git clone https://github.com/MIT-LCP/bloatectomy
cd bloatectomy
python3 setup.py install

Example

To run bloatectomy on the sample text provided in the input folder:

from bloatectomy import bloatectomy

bloatectomy(
    './input/sample_text.txt',
    style='highlight',
    display=False,
    filename='./output/sample_txt_highlight_output',
    output='html',
    output_numbered_tokens=True,
    output_original_tokens=True,
)

Documentation

class bloatectomy(input_text,
                  path = '',
                  filename='bloatectomized_file',
                  display=False,
                  style='highlight',
                  output='html',
                  output_numbered_tokens=False,
                  output_original_tokens=False,
                  regex1=r"(.+?\.[\s\n]+)",
                  regex2=r"(?=\n\s*[A-Z1-9#-]+.*)",
                  postgres_engine=None,
                  postgres_table=None)

Parameters

input_text: file, str, list
An input document (.txt, .rtf, .docx), a string of raw text, or a list of hadm_ids for a Postgres MIMIC III database.

style: str, optional, default=highlight
Method for denoting a duplicate. The following are allowed: highlight, bold, remov.
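
As an illustration of what the three styles mean, here is a minimal sketch of marking repeated tokens. It is not the library's implementation: `mark_duplicates` is a hypothetical helper, the `<mark>` and `<b>` wrappers are assumed markup, tokens are assumed to be compared after stripping surrounding whitespace, and `remov` is assumed to drop the duplicate entirely.

```python
def mark_duplicates(tokens, style="highlight"):
    """Mark every token that repeats an earlier one (hypothetical helper).

    The first occurrence of a token is always kept unmarked.
    """
    # Assumed markup for the two marking styles; 'remov' needs no wrapper.
    wrappers = {"highlight": "<mark>{}</mark>", "bold": "<b>{}</b>"}
    if style not in ("highlight", "bold", "remov"):
        raise ValueError(f"unsupported style: {style!r}")

    seen = set()
    marked = []
    for token in tokens:
        key = token.strip()  # assumption: ignore surrounding whitespace
        if key and key in seen:
            if style != "remov":
                marked.append(wrappers[style].format(token))
            continue  # with 'remov', the duplicate is simply skipped
        seen.add(key)
        marked.append(token)
    return marked


tokens = ["Patient is stable. ", "Vitals normal. ", "Patient is stable. "]
print("".join(mark_duplicates(tokens, style="highlight")))
print("".join(mark_duplicates(tokens, style="remov")))
```

A set of already-seen tokens keeps the pass linear in the number of tokens, which matters for long concatenated notes.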

filename: str, optional, default=bloatectomized_file
A string to name the output file of the bloatectomized document.

path: str, optional, default=''
The directory for output files.

output_numbered_tokens: bool, optional, default=False
If set to True, a .txt file with each token enumerated and marked for duplication is output as [filename]_token_numbers.txt. This is useful when debugging your own regular expression for tokenization or testing the remov option for style.

output_original_tokens: bool, optional, default=False
If set to True, a .txt file with each original token enumerated but not marked for duplication is output as [filename]_original_token_numbers.txt.
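
When debugging a custom tokenization regex, it can also help to preview numbered tokens directly before running the full tool. The sketch below uses only the standard library; `preview_tokens` is a hypothetical helper, and its 1-based `index: token` layout is an assumption, not the format of the files the package writes.

```python
import re


def preview_tokens(text, pattern):
    """Return one 'index: repr(token)' line per non-empty piece of text."""
    pieces = [piece for piece in re.split(pattern, text) if piece]
    # repr() makes trailing spaces and newlines in each token visible.
    return [f"{i}: {piece!r}" for i, piece in enumerate(pieces, start=1)]


for line in preview_tokens("One. Two. One. ", r"(.+?\.\s+)"):
    print(line)
```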

display: bool, optional, default=False
If set to True, the bloatectomized text will display in the console on completion.

regex1: str, optional, default=r"(.+?\.[\s\n]+)"
The regular expression for the first tokenization. By default, each token is the shortest run of text that ends in a period (.) followed by one or more whitespace characters (space, tab, or line break). This can be replaced with any valid regular expression to change the way tokens are created.
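
The behavior of the default regex1 can be checked with the standard `re` module alone. This is a sketch, not the package's internal code: `split_sentences` is a hypothetical helper, and whether the library passes extra flags or post-processes the pieces is not shown here.

```python
import re

# Default first-pass pattern, copied from the class signature.
REGEX1 = r"(.+?\.[\s\n]+)"


def split_sentences(text, pattern=REGEX1):
    """Split text on a capturing pattern and drop the empty pieces.

    Because the pattern is a single capturing group, re.split returns the
    matched sentences themselves, plus any unmatched text between them.
    """
    return [piece for piece in re.split(pattern, text) if piece]


note = "Patient is stable. Vitals normal.\nPatient is stable. No acute distress"
for token in split_sentences(note):
    print(repr(token))
```

A period that is not followed by whitespace, as in `3.5 mg`, does not end a token, and trailing text without a final period plus whitespace is returned as its own piece.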

regex2: str, optional, default=r"(?=\n\s*[A-Z1-9#-]+.*)"
The regular expression for the second tokenization. By default, it splits before any newline character (\n) that is followed, after optional whitespace, by an uppercase letter, a digit from 1 to 9, a hash (#), or a dash (-). This can be replaced with any valid regular expression to change how sub-tokens are created.
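
The default regex2 is a zero-width lookahead, so it marks split points without consuming any text. Python's `re.split` only splits on such empty matches from version 3.7 onward, which may be the reason for the Python requirement above. The sketch below uses only the standard library; `split_sections` is a hypothetical helper, not part of the package.

```python
import re

# Default second-pass pattern, copied from the class signature.
REGEX2 = r"(?=\n\s*[A-Z1-9#-]+.*)"


def split_sections(text, pattern=REGEX2):
    """Split before each newline that starts a heading-like or list-like line."""
    return [piece for piece in re.split(pattern, text) if piece]


note = (
    "Assessment: stable\n"
    "1. Continue meds\n"
    "- monitor BP\n"
    "follow up in 2 weeks\n"
    "# Plan"
)
for section in split_sections(note):
    print(repr(section))
```

The line beginning with a lowercase letter stays attached to the preceding piece, and each piece after the first keeps its leading newline because the lookahead consumes nothing.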

postgres_engine: str, optional
The postgres connection. Only relevant for use with the MIMIC III dataset. See the Jupyter notebook mimic_bloatectomy_example for the example code.

postgres_table: str, optional
The name of the postgres table containing the concatenated notes. Only relevant for use with the MIMIC III dataset. See the Jupyter notebook mimic_bloatectomy_example for the example code.
