
Transformers-CRF

For training BERT-CRF models with the Hugging Face Transformers library.

Installation

    git clone https://bitbucket.org/avisourgente/transformers_crf.git
    cd transformers_crf
    pip install -e .

Training example

The training script is examples/run_ner.py. It follows the API of https://github.com/huggingface/transformers/blob/main/examples/pytorch/token-classification/run_ner.py. New arguments:

    --learning_rate_ner LEARNING_RATE_NER
                Custom initial learning rate for the CRF and Linear layers in AdamW (see the sketch after this list). (default: None)
    --weight_decay_ner WEIGHT_DECAY_NER
                Custom weight decay for the CRF and Linear layers in AdamW. (default: None)
    --use_crf
                Enable the CRF layer. (default: False)
    --no_constrain_crf
                Do not constrain CRF outputs to the labeling scheme. (default: False)
    --break_docs_to_max_length [BREAK_DOCS_TO_MAX_LENGTH]
                Whether to chunk documents into sentences that fit within the tokenizer's max sequence length. (default: False)
    --convert_to_iobes [CONVERT_TO_IOBES]
                Convert IOB2 input to IOBES. (default: False)
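
The --learning_rate_ner and --weight_decay_ner options follow the standard AdamW parameter-group pattern: encoder weights keep the base hyperparameters, while the CRF and Linear head get their own. A minimal sketch of that grouping (module names like crf and classifier are illustrative; the script's actual grouping may differ):

    from torch.optim import AdamW

    def build_optimizer(model, lr=5e-5, lr_ner=5e-3, wd=0.01, wd_ner=0.0):
        head_params, encoder_params = [], []
        for name, param in model.named_parameters():
            # Route CRF and classification-head parameters to their own group.
            if name.startswith(("crf.", "classifier.")):
                head_params.append(param)
            else:
                encoder_params.append(param)
        return AdamW([
            {"params": encoder_params, "lr": lr, "weight_decay": wd},
            {"params": head_params, "lr": lr_ner, "weight_decay": wd_ner},
        ])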

Example:

    python run_ner.py \
      --model_name_or_path neuralmind/bert-base-portuguese-cased \
      --dataset_name eduagarcia/portuguese_benchmark \
      --dataset_config_name harem-default \
      --per_device_train_batch_size 16 \
      --per_device_eval_batch_size 16 \
      --num_train_epochs 15 \
      --learning_rate 5e-5 \
      --do_train \
      --do_eval \
      --do_predict \
      --evaluation_strategy steps \
      --eval_steps 500 \
      --output_dir /workspace/models/test-transformers-crf \
      --max_seq_length 128 \
      --break_docs_to_max_length \
      --overwrite_output_dir \
      --learning_rate_ner 5e-3 \
      --convert_to_iobes \
      --use_crf
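
The example passes --convert_to_iobes, which rewrites IOB2 tags into the richer IOBES scheme (single-token entities become S-*, entity-final tokens become E-*). A minimal, illustrative sketch of that mapping (not the repo's own implementation):

    def iob2_to_iobes(tags):
        """Convert one sentence's IOB2 tags to IOBES."""
        iobes = []
        for i, tag in enumerate(tags):
            if tag == "O":
                iobes.append(tag)
                continue
            prefix, label = tag.split("-", 1)
            # A span ends here unless the next tag continues it with I-<label>.
            ends = i + 1 == len(tags) or tags[i + 1] != f"I-{label}"
            if prefix == "B":
                iobes.append(f"S-{label}" if ends else tag)
            else:  # prefix == "I"
                iobes.append(f"E-{label}" if ends else tag)
        return iobes

    print(iob2_to_iobes(["B-PER", "I-PER", "O", "B-LOC"]))
    # ['B-PER', 'E-PER', 'O', 'S-LOC']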

Usage example

    import torch
    from transformers_crf import CRFTokenizer, AutoModelForEmbedderCRFTokenClassification

    model_path = "./model"

    device = torch.device("cpu")
    tokenizer = CRFTokenizer.from_pretrained(model_path)
    model = AutoModelForEmbedderCRFTokenClassification.from_pretrained(model_path).to(device)

    # Batch of pre-tokenized sentences (the example model is Portuguese).
    tokens = [["Esse", "é", "um", "exemplo"], ["Esse", "é", "um", "segundo", "exemplo"]]
    batch = tokenizer(tokens, max_length=512).to(device)
    output = model(**batch, reorder=True)
    predicts_id = output.predicts.detach().cpu().numpy()
    # Map predicted label ids back to tag strings, trimming padding to each
    # sentence's original length.
    preds = [
        [model.config.id2label[p] for p in pred_seq][:len(token_seq)]
        for pred_seq, token_seq in zip(predicts_id, tokens)
    ]
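
Conceptually, a BERT-CRF token classifier scores whole label sequences rather than independent per-token labels: the encoder and a Linear layer produce per-token emission scores, and a CRF layer adds learned transition scores and performs Viterbi decoding. A rough sketch of that architecture using the third-party pytorch-crf package (this repo's classes may be implemented differently):

    import torch.nn as nn
    from torchcrf import CRF  # pip install pytorch-crf
    from transformers import AutoModel

    class BertCRFSketch(nn.Module):
        def __init__(self, model_name, num_labels):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(model_name)
            self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)
            self.crf = CRF(num_labels, batch_first=True)

        def forward(self, input_ids, attention_mask, labels=None):
            hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
            emissions = self.classifier(hidden)  # per-token label scores
            mask = attention_mask.bool()
            if labels is not None:
                # Training: negative log-likelihood of the gold tag sequence
                return -self.crf(emissions, labels, mask=mask)
            # Inference: Viterbi decoding of the most likely tag sequences
            return self.crf.decode(emissions, mask=mask)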

Uploading the package to PyPI

    python setup.py bdist_wheel
    python -m twine upload dist/*
