unbabel-comet

High-quality Machine Translation Evaluation

Version 1.1.2 is out 🥳! What's new?

Update: comet-compare now supports comparing multiple systems. Thanks to @SamuelLarkin!

Bugfix: Broken link for wmt21-comet-qe-da (#78)

Bugfix: protobuf dependency (#82)

New models added from the Cometinho paper, which won the Best Paper Award at EAMT 2022 🥳!

Quick Installation

Simple installation from PyPI

pip install --upgrade pip  # ensures that pip is current
pip install unbabel-comet

To install this specific version instead:

pip install unbabel-comet==1.1.2 --use-feature=2020-resolver

To develop locally, install Poetry and run the following commands:

git clone https://github.com/Unbabel/COMET
cd COMET
poetry install

Alternatively, for development, you can run the CLI tools directly, e.g.:

PYTHONPATH=. ./comet/cli/score.py

Scoring MT outputs:

CLI Usage:

Test examples:

echo -e "Dem Feuer konnte Einhalt geboten werden\nSchulen und Kindergärten wurden eröffnet." >> src.de
echo -e "The fire could be stopped\nSchools and kindergartens were open" >> hyp1.en
echo -e "The fire could have been stopped\nSchools and pre-school were open" >> hyp2.en
echo -e "They were able to control the fire.\nSchools and kindergartens opened" >> ref.en

Basic scoring command:

comet-score -s src.de -t hyp1.en -r ref.en

You can set --gpus 0 to run on CPU.

Scoring multiple systems:

comet-score -s src.de -t hyp1.en hyp2.en -r ref.en

WMT test sets via SacreBLEU:

comet-score -d wmt20:en-de -t PATH/TO/TRANSLATIONS

The default setting of comet-score prints the score for each segment individually. If you are only interested in the score for the whole dataset (computed as the average of the segment scores), you can use the --quiet flag.

comet-score -s src.de -t hyp1.en -r ref.en --quiet
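To make the relationship between the two outputs concrete, here is a minimal Python sketch of the system-level score as the plain average of the segment scores, as described above. The function name `system_score` and the example scores are hypothetical; this is an illustration, not COMET's own code.

```python
from statistics import fmean
from typing import Sequence


def system_score(segment_scores: Sequence[float]) -> float:
    """Average segment-level scores into one system-level score."""
    if not segment_scores:
        raise ValueError("need at least one segment score")
    return fmean(segment_scores)


# Hypothetical segment scores, for illustration only.
example_scores = [0.25, 0.75, 0.5]
print(system_score(example_scores))  # 0.5
```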

You can select another model/metric with the --model flag. For reference-free (QE-as-a-metric) models, you don't need to pass a reference.

comet-score -s src.de -t hyp1.en --model wmt21-comet-qe-mqm

Following the work on Uncertainty-Aware MT Evaluation, you can use the --mc_dropout flag to get a variance/uncertainty value for each segment score. A high value means that the metric is less confident in that prediction.

comet-score -s src.de -t hyp1.en -r ref.en --mc_dropout 30
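The idea behind MC dropout can be sketched in a few lines: keep dropout active at inference time, score the same segment several times, and report the mean and variance of those runs. In this sketch the stochastic scorers are hypothetical stand-ins (a fixed score plus Gaussian noise) rather than a real COMET model, and population variance is used; the exact statistics reported by `--mc_dropout` may be computed differently.

```python
import random
from statistics import fmean, pvariance
from typing import Callable, Tuple


def mc_dropout_estimate(
    stochastic_score: Callable[[], float], n_runs: int = 30
) -> Tuple[float, float]:
    """Call a stochastic scorer n_runs times and return (mean, variance).

    With MC dropout, each call would be one forward pass with dropout
    left on, so repeated calls on the same segment give different scores.
    """
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    samples = [stochastic_score() for _ in range(n_runs)]
    return fmean(samples), pvariance(samples)


# Hypothetical stand-ins for a dropout-enabled metric: same underlying
# score, different amounts of run-to-run noise.
rng = random.Random(0)


def uncertain_segment() -> float:
    return 0.6 + rng.gauss(0.0, 0.1)


def confident_segment() -> float:
    return 0.6 + rng.gauss(0.0, 0.01)


print(mc_dropout_estimate(uncertain_segment))  # larger variance
print(mc_dropout_estimate(confident_segment))  # smaller variance
```

The `30` mirrors the `--mc_dropout 30` example above, read here as the number of stochastic runs per segment.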

When comparing multiple MT systems, we encourage you to run the comet-compare command to get statistical significance with a paired t-test and bootstrap resampling (Koehn, 2004).

comet-compare -s src.de -t hyp1.en hyp2.en -r ref.en
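For intuition about what the two significance tests compute, the sketch below implements a paired t statistic and paired bootstrap resampling over the segment-level scores of two systems evaluated on the same segments. It is a simplified illustration under our own assumptions (hypothetical function names, 1000 resamples, a fixed random seed), not the implementation used by `comet-compare`.

```python
import math
import random
from statistics import fmean, stdev
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Paired t statistic of the per-segment differences a[i] - b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two aligned score lists with at least 2 segments")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    return fmean(diffs) / (sd / math.sqrt(len(diffs)))


def paired_bootstrap(
    a: Sequence[float], b: Sequence[float], n_resamples: int = 1000, seed: int = 0
) -> float:
    """Fraction of bootstrap resamples in which system A outscores system B.

    Every resample draws segment indices with replacement and compares
    both systems on the same drawn segments (hence "paired").
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty, aligned score lists")
    rng = random.Random(seed)
    n = len(a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if fmean(a[i] for i in idx) > fmean(b[i] for i in idx):
            wins += 1
    return wins / n_resamples


# Hypothetical segment scores for two systems on the same six segments.
sys_a = [0.80, 0.70, 0.90, 0.60, 0.85, 0.75]
sys_b = [0.60, 0.65, 0.70, 0.55, 0.80, 0.70]
print(paired_t_statistic(sys_a, sys_b))
print(paired_bootstrap(sys_a, sys_b))  # 1.0: A wins every resample
```

Here system A scores higher on every segment, so it wins every resample; a win rate near 0.5 would instead suggest that the difference between the systems is not reliable.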

New: Minimum Bayes Risk Decoding:

Inspired by the work of Amrhein et al. (2022), we have developed a command to perform Minimum Bayes Risk (MBR) decoding. This command receives a text file with source sentences and a text file containing all the MT samples, and writes the best sample according to COMET to an output file.

comet-mbr -s [SOURCE].txt -t [MT_SAMPLES].txt --num_sample [X] -o [OUTPUT_FILE].txt
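Conceptually, MBR decoding picks the sample with the highest expected utility when every other sample is treated as a pseudo-reference. The sketch below illustrates this selection rule under several assumptions: a toy unigram-F1 utility stands in for COMET (which would also use the source sentence), each candidate is excluded from its own pseudo-references, and the samples file is assumed to store `num_sample` consecutive lines per source sentence. All function names are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence


def group_samples(lines: Sequence[str], num_samples: int) -> List[List[str]]:
    """Split a flat list of samples into groups of num_samples.

    ASSUMPTION: the samples for each source sentence are stored on
    consecutive lines of the samples file.
    """
    if num_samples < 1 or len(lines) % num_samples != 0:
        raise ValueError("line count must be a multiple of num_samples")
    return [list(lines[i:i + num_samples]) for i in range(0, len(lines), num_samples)]


def unigram_f1(hyp: str, ref: str) -> float:
    """Toy utility (stand-in for COMET): unigram F1 between two strings."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    return 2 * overlap / (sum(h.values()) + sum(r.values()))


def mbr_select(candidates: Sequence[str], utility: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest mean utility against the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i: int) -> float:
        # Every other sample acts as a pseudo-reference for candidate i.
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], ref) for ref in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


samples = [
    "the fire could be stopped",
    "the fire was stopped",
    "the fire could have been stopped",
    "schools were open",
]
print(mbr_select(samples, unigram_f1))  # the fire could be stopped
```

Under these assumptions, running `mbr_select` on each group returned by `group_samples` would yield one output line per source sentence.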

Multi-GPU Inference:

COMET is optimized for single-GPU use: it takes advantage of length batching and embedding caching. With multiple GPUs, the data is spread across devices, so we typically get fewer cache hits, and the length-batching sampler is replaced by a DistributedSampler. Because of that, in our experiments, using 1 GPU is faster than using 2 GPUs, especially when scoring multiple systems for the same source and reference.

Nonetheless, if your data does not have repetitions and you have more than 1 GPU available, you can run multi-GPU inference with the following command:

comet-score -s src.de -t hyp1.en -r ref.en --gpus 2 --quiet

Warning: With multiple GPUs, segment-level scores will be out of order, so multi-GPU inference is only useful for system-level scoring.
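Length batching itself is simple to sketch: sort the examples by length so each batch needs little padding, and keep the original indices so scores can be written back in input order. The helper names and the character-length criterion below are assumptions for illustration; COMET's actual sampler may sort by token length and differ in other details.

```python
from typing import Callable, List, Sequence


def length_batches(texts: Sequence[str], batch_size: int) -> List[List[int]]:
    """Group example indices into batches of similar length.

    Returns original indices (not texts) so results can be restored
    to input order after scoring.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def score_in_input_order(
    texts: Sequence[str],
    batch_size: int,
    score_batch: Callable[[List[str]], List[float]],
) -> List[float]:
    """Score length-sorted batches, then write scores back by index."""
    scores = [0.0] * len(texts)
    for batch in length_batches(texts, batch_size):
        batch_scores = score_batch([texts[i] for i in batch])
        for i, s in zip(batch, batch_scores):
            scores[i] = s
    return scores


def fake_scorer(batch: List[str]) -> List[float]:
    """Hypothetical scorer: the score is just the text length."""
    return [float(len(t)) for t in batch]


print(score_in_input_order(["ccc", "a", "bb"], 2, fake_scorer))  # [3.0, 1.0, 2.0]
```

The index bookkeeping is what keeps segment scores aligned with the input; when a distributed sampler splits the data across GPUs, each process only sees its own share of the examples, which is consistent with the warning above.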

Changing Embedding Cache Size:

You can change the cache size of COMET using the following env variable:

export COMET_EMBEDDINGS_CACHE="2048"

By default, the COMET cache size is 1024.
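As an illustration of why a larger cache can help (for example, when several systems are scored against the same sources and references), here is a minimal sketch of an LRU embedding cache whose size is read from `COMET_EMBEDDINGS_CACHE`, falling back to 1024. The `embed` function is a hypothetical stand-in for the encoder, and COMET's internal caching may be implemented differently.

```python
import os
from functools import lru_cache

# Cache size read from the environment variable described above,
# falling back to the documented default of 1024 entries.
CACHE_SIZE = int(os.environ.get("COMET_EMBEDDINGS_CACHE", "1024"))


@lru_cache(maxsize=CACHE_SIZE)
def embed(sentence: str) -> tuple:
    """Hypothetical stand-in for an expensive encoder forward pass."""
    return tuple(float(ord(ch)) for ch in sentence)


# Two systems scored against the same reference: the second lookup of
# the reference is served from the cache instead of being recomputed.
embed("Schools and kindergartens opened")
embed("Schools and kindergartens opened")
print(embed.cache_info())  # hits=1, misses=1
```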

Scoring within Python:

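from comet import download_model, load_from_checkpoint

# Download the default reference-based model and load it from the checkpoint.
model_path = download_model("wmt20-comet-da")
model = load_from_checkpoint(model_path)
# Each segment needs a source ("src"), a translation ("mt") and a reference ("ref").
data = [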
    {
        "src": "Dem Feuer konnte Einhalt geboten werden",
        "mt": "The fire could be stopped",
        "ref": "They were able to control the fire."
    },
    {
        "src": "Schulen und Kindergärten wurden eröffnet.",
        "mt": "Schools and kindergartens were open",
        "ref": "Schools and kindergartens opened"
    }
]
# predict returns the segment-level scores and the system-level score; use gpus=0 to run on CPU.
seg_scores, sys_score = model.predict(data, batch_size=8, gpus=1)
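When sources, translations, and references live in line-aligned text files (like the test examples above), a small helper can build the `data` list. This is a hypothetical convenience sketch, not part of the COMET API; for reference-free models it assumes the dictionaries simply omit the `"ref"` key.

```python
from typing import Dict, List, Optional, Sequence


def build_comet_input(
    src: Sequence[str],
    mt: Sequence[str],
    ref: Optional[Sequence[str]] = None,
) -> List[Dict[str, str]]:
    """Turn line-aligned segments into a list of {"src", "mt", "ref"} dicts.

    Pass ref=None for reference-free (QE) models; the "ref" key is then
    omitted (an assumption based on the CLI not requiring a reference).
    """
    if len(src) != len(mt) or (ref is not None and len(ref) != len(src)):
        raise ValueError("src, mt and ref must have the same number of lines")
    if ref is None:
        return [{"src": s, "mt": m} for s, m in zip(src, mt)]
    return [{"src": s, "mt": m, "ref": r} for s, m, r in zip(src, mt, ref)]


def read_lines(path: str) -> List[str]:
    """Read a UTF-8 text file with one segment per line."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]
```

With the test files created earlier, `build_comet_input(read_lines("src.de"), read_lines("hyp1.en"), read_lines("ref.en"))` returns a list that can be passed to `model.predict` in the same way as `data` above.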

Languages Covered:

All the above-mentioned models are built on top of XLM-R, which covers the following languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Burmese (Zawgyi), Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.

Thus, results for language pairs containing uncovered languages are unreliable!

COMET Models:

We recommend the following models to evaluate your translations:

wmt20-comet-da: DEFAULT reference-based regression model built on top of XLM-R (large) and trained on Direct Assessments from WMT17 to WMT19. Same as wmt-large-da-estimator-1719 from previous versions.
wmt21-comet-qe-mqm: Reference-FREE regression model built on top of XLM-R (large), trained on Direct Assessments and fine-tuned on MQM.
eamt22-cometinho-da: Lightweight reference-based regression model distilled from an ensemble of COMET models similar to wmt20-comet-da.

The default model was developed to participate in the WMT20 Metrics shared task (Mathur et al., 2020) and was among the best metrics that year. Also, in a large-scale study performed by Microsoft Research, this metric ranked 1st in terms of system-level decision accuracy (Kocmi et al., 2021).

Our recommended QE system was developed for the WMT21 Metrics shared task and was the best performing QE as a Metric that year (Freitag et al. 2021).

Note: Score ranges can differ substantially between models. To better understand COMET scores, please take a look at our FAQs.

For more information about the available COMET models, read our metric descriptions.

Train your own Metric:

Instead of using pretrained models, you can train your own model with the following command:

comet-train --cfg configs/models/{your_model_config}.yaml

You can then use your own metric to score:

comet-score -s src.de -t hyp1.en -r ref.en --model PATH/TO/CHECKPOINT

Note: Please contact ricardo.rei@unbabel.com if you wish to host your own metric among the available COMET metrics!

unittest:

To run the toolkit tests, run the following commands:

coverage run --source=comet -m unittest discover
coverage report -m

Publications

If you use COMET, please cite our work! Also, don't forget to say which model you used to evaluate your systems.

Release history

2.2.7 (Sep 1, 2025)
2.2.6 (Apr 7, 2025)
2.2.5 (Mar 26, 2025)
2.2.4 (Dec 5, 2024)
2.2.3 (Nov 27, 2024)
2.2.2 (Mar 13, 2024)
2.2.1 (Jan 8, 2024)
2.2.0 (Oct 23, 2023)
2.1.1 (Oct 13, 2023)
2.1.0 (Sep 21, 2023)
2.0.2 (Aug 3, 2023)
2.0.1 (Apr 5, 2023)
2.0.0 (Mar 13, 2023)
1.1.3 (Oct 4, 2022)
1.1.2 (Jun 6, 2022), this version
1.1.1 (Jun 1, 2022)
1.1.0 (Apr 2, 2022)
1.0.1 (Nov 19, 2021)
1.0.0 (Nov 19, 2021)
1.0.0rc9, pre-release (Oct 21, 2021)
1.0.0rc8, pre-release (Oct 18, 2021)
1.0.0rc7, pre-release (Oct 18, 2021)
1.0.0rc6, pre-release (Sep 28, 2021)
1.0.0rc5, pre-release (Sep 4, 2021)
1.0.0rc4, pre-release (Aug 16, 2021)
1.0.0rc3, pre-release (Aug 15, 2021)
1.0.0rc2, pre-release (Aug 10, 2021)
1.0.0rc1, pre-release (Jul 27, 2021)
0.1.0 (Mar 11, 2021)
0.0.7 (Feb 9, 2021)
0.0.6.post2 (Nov 25, 2020)
0.0.6.post1 (Nov 24, 2020)
0.0.6 (Nov 21, 2020)
0.0.4 (Oct 8, 2020)
0.0.3 (Sep 22, 2020)
0.0.2 (Sep 22, 2020)
0.0.1, yanked (Sep 22, 2020). Reason: missing MANIFEST with reqs

Download files

Source Distribution

unbabel-comet-1.1.2.tar.gz (42.8 kB view details)

Uploaded Jun 6, 2022 Source

Built Distribution

unbabel_comet-1.1.2-py3-none-any.whl (64.1 kB view details)

Uploaded Jun 6, 2022 Python 3

File details

Details for the file unbabel-comet-1.1.2.tar.gz.

File metadata

Download URL: unbabel-comet-1.1.2.tar.gz
Upload date: Jun 6, 2022
Size: 42.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.8.13

File hashes

Hashes for unbabel-comet-1.1.2.tar.gz
Algorithm	Hash digest
SHA256	`7bd8be888f8d2b96c4667389743bcd024c91650853af734bbd71a5a398089f59`
MD5	`75a754614bbc05cadc726eee6903261b`
BLAKE2b-256	`a18f50c5d0a9eac9233ce2b9dd21e54aaaa884006a5b08d1430c3928b3d8be19`

File details

Details for the file unbabel_comet-1.1.2-py3-none-any.whl.

File metadata

Download URL: unbabel_comet-1.1.2-py3-none-any.whl
Upload date: Jun 6, 2022
Size: 64.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.8.13

File hashes

Hashes for unbabel_comet-1.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`72d455b9a3c8c58630638bff6df64b089cbdf08c7803210374fdd58857a9cf70`
MD5	`8cf3e776965433d80ac50a2eea377d14`
BLAKE2b-256	`b79853435e1d2ee01f3c519031dd093c402c82239f399ea44ca977e5b3180d4b`
