TakeSentenceTokenizer is a tool for tokenizing and preprocessing messages.
TakeSentenceTokenizer
TakeSentenceTokenizer is a tool for preprocessing and tokenizing sentences. The package is used to:

- convert the first word of the sentence to lowercase
- convert from uppercase to lowercase
- convert a word to lowercase after punctuation
- replace words with placeholders: laugh, date, time, ddd, measures (10kg, 20m, 5gb, etc.), code, phone number, cnpj, cpf, email, money, url, number (ordinal and cardinal)
- replace abbreviations
- replace common typos
- split punctuation
- remove emoji
- remove characters that are not letters or punctuation
- add missing accentuation
- tokenize the sentence
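The placeholder step can be illustrated with a small sketch. The regexes, placeholder names, and `replace_placeholders` helper below are illustrative assumptions for a few of the categories, not the package's actual patterns or API:

```python
import re

# Hypothetical patterns, ordered so more specific ones (measures, URLs)
# run before the generic number pattern.
PATTERNS = [
    (re.compile(r"https?://\S+|www\.\S+"), "URL"),
    (re.compile(r"\b\d+(?:kg|g|m|cm|km|gb|mb|l|ml)\b", re.IGNORECASE), "MEASURE"),
    (re.compile(r"\b\d+\b"), "NUMBER"),
]

def replace_placeholders(text: str) -> str:
    """Substitute each matched span with its placeholder token."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text

print(replace_placeholders("pesa 10kg e custa 25 no site www.loja.com.br"))
# pesa MEASURE e custa NUMBER no site URL
```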
Installation
Use the package manager pip to install TakeSentenceTokenizer:
pip install TakeSentenceTokenizer
Usage
Example 1: full processing without keeping a registry of removed punctuation
Code:
from SentenceTokenizer import SentenceTokenizer
sentence = 'P/ saber disso eh c/ vc ou consigo ver pelo site www.dúvidas.com.br/minha-dúvida ??'
tokenizer = SentenceTokenizer()
processed_sentence = tokenizer.process_message(sentence)
print(processed_sentence)
Output:
para saber disso é com você ou consigo ver pelo site URL ? ?
Example 2: full processing keeping a registry of removed punctuation
Code:
from SentenceTokenizer import SentenceTokenizer
sentence = 'como assim $@???'
tokenizer = SentenceTokenizer(keep_registry_punctuation=True)
processed_sentence = tokenizer.process_message(sentence)
print(processed_sentence)
print(tokenizer.removal_registry_lst)
Output:
como assim ? ? ?
[['como assim $@ ? ? ?', {'punctuation': '$', 'position': 11}, {'punctuation': '@', 'position': 12}, {'punctuation': ' ', 'position': 13}]]
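Each registry entry records a removed character and the index it occupied in the tokenized sentence, so the stripped punctuation can be reinserted in order. The `restore` helper below is a hypothetical sketch of how the registry could be consumed, not part of the package:

```python
def restore(processed: str, entries: list) -> str:
    """Reinsert removed characters at their recorded positions, in order."""
    for entry in entries:
        pos = entry["position"]
        processed = processed[:pos] + entry["punctuation"] + processed[pos:]
    return processed

# Entries taken from the registry printed in Example 2.
registry = [
    {"punctuation": "$", "position": 11},
    {"punctuation": "@", "position": 12},
    {"punctuation": " ", "position": 13},
]
print(restore("como assim ? ? ?", registry))  # como assim $@ ? ? ?
```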
Author
Take Data&Analytics Research
License
Hashes for TakeSentenceTokenizer-1.0.1.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | 0f1713b37d15de79246aae9c20f8f65986fd07111655c532865cdd7b399946a4 |
| MD5 | b216d95683ff4887aa13b611b387093f |
| BLAKE2b-256 | 1827cb7bce94e3c770519436374c426df946b9bd39c3c1d03a39f6514bd7ef0a |
Hashes for TakeSentenceTokenizer-1.0.1-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 27539df5752be6e9a1460970eb74f68ab2d9663ae9967932dde64b4077c39c48 |
| MD5 | 90e31e202f1d9df93e5b6bccabff5115 |
| BLAKE2b-256 | 5ec6db7ba692035842bc0d2340e0ba9a07febb00b79115aed29f1e80a3d82da7 |