Metrics for evaluating Automated Audio Captioning systems, designed for PyTorch.

Audio Captioning metrics (aac-metrics)

Why use this package?

- Easy installation and download
- Same results as the caption-evaluation-tools and fense repositories
- Provides the following metrics:
  - BLEU [1]
  - ROUGE-L [2]
  - METEOR [3]
  - CIDEr-D [4]
  - SPICE [5]
  - SPIDEr [6]
  - SPIDEr-max [7]
  - SBERT [8]
  - FluencyError [8]
  - FENSE [8]
  - SPIDErErr

Installation

Install the pip package:

pip install aac-metrics

Download the external code and models needed for METEOR, SPICE, PTBTokenizer and FENSE:

aac-metrics-download

Notes:

The external code for SPICE, METEOR and PTBTokenizer is stored in $HOME/.cache/aac-metrics.
The weights of the FENSE fluency error detector and the SBERT model are stored by default in $HOME/.cache/torch/hub/fense_data and $HOME/.cache/torch/sentence_transformers, respectively.
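
To check whether the download step has populated these locations, a small standard-library sketch such as the following may help. The function names and the dictionary keys are hypothetical and are not part of aac-metrics; only the three paths come from the notes above, and they are the defaults, so a custom cache location would not be detected.

```python
from pathlib import Path


def expected_cache_dirs(home=None):
    """Return the default cache directories listed in the notes above.

    The keys are illustrative labels, not names used by aac-metrics.
    """
    home = Path.home() if home is None else Path(home)
    return {
        "java_tools": home / ".cache" / "aac-metrics",
        "fense": home / ".cache" / "torch" / "hub" / "fense_data",
        "sbert": home / ".cache" / "torch" / "sentence_transformers",
    }


def missing_cache_dirs(home=None):
    """List the labels whose directory does not exist yet."""
    return [
        name
        for name, path in expected_cache_dirs(home).items()
        if not path.is_dir()
    ]


if __name__ == "__main__":
    missing = missing_cache_dirs()
    if missing:
        print("Not found (run aac-metrics-download?):", ", ".join(missing))
    else:
        print("All default cache directories are present.")
```

An empty list only shows that the directories exist; it does not verify that their contents are complete.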

Usage

Evaluate default AAC metrics

The full evaluation process to compute AAC metrics can be done with the aac_metrics.aac_evaluate function.

from aac_metrics import aac_evaluate

candidates: list[str] = ["a man is speaking"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"]]

corpus_scores, _ = aac_evaluate(candidates, mult_references)
print(corpus_scores)
# dict containing the score of each aac metric: "bleu_1", "bleu_2", "bleu_3", "bleu_4", "rouge_l", "meteor", "cider_d", "spice", "spider"
# {"bleu_1": tensor(0.7), "bleu_2": ..., ...}

Evaluate a specific metric

A specific metric can be evaluated using the aac_metrics.functional.<metric_name>.<metric_name> function or the aac_metrics.classes.<metric_name>.<metric_name> class. Unlike aac_evaluate, these functions do not tokenize the sentences with PTBTokenizer, but you can do it manually with the preprocess_mono_sents and preprocess_mult_sents functions.

from aac_metrics.functional import cider_d
from aac_metrics.utils.tokenization import preprocess_mono_sents, preprocess_mult_sents

candidates: list[str] = ["a man is speaking"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"]]

candidates = preprocess_mono_sents(candidates)
mult_references = preprocess_mult_sents(mult_references)

corpus_scores, sents_scores = cider_d(candidates, mult_references)
print(corpus_scores)
# {"cider_d": tensor(0.1)}
print(sents_scores)
# {"cider_d": tensor([0.9, ...])}

Each metric also exists as a Python class, like aac_metrics.classes.cider_d.CIDErD.

Metrics

Default AAC metrics

Metric	Python Class	Origin	Range	Short description
BLEU [1]	`BLEU`	machine translation	[0, 1]	Precision of n-grams
ROUGE-L [2]	`ROUGEL`	text summarization	[0, 1]	FScore of the longest common subsequence
METEOR [3]	`METEOR`	machine translation	[0, 1]	Cosine-similarity of frequencies with synonyms matching
CIDEr-D [4]	`CIDErD`	image captioning	[0, 10]	Cosine-similarity of TF-IDF computed on n-grams
SPICE [5]	`SPICE`	image captioning	[0, 1]	FScore of semantic graph
SPIDEr [6]	`SPIDEr`	image captioning	[0, 5.5]	Mean of CIDEr-D and SPICE
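
Two rows of this table can be illustrated with a few lines of plain Python. The sketch below is not the package's implementation: clipped_ngram_precision and spider_from are hypothetical helper names, the whitespace tokenization stands in for PTBTokenizer, and full BLEU additionally combines several n-gram orders and applies a brevity penalty, which are omitted here.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(candidate, references, n=1):
    """Clipped n-gram precision, the building block of BLEU."""
    cand_counts = ngram_counts(candidate.split(), n)
    if not cand_counts:
        return 0.0
    # For each n-gram, keep its maximum count over all references.
    max_ref_counts = Counter()
    for reference in references:
        max_ref_counts |= ngram_counts(reference.split(), n)
    # A candidate n-gram is credited at most as often as it occurs in a reference.
    matched = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


def spider_from(cider_d, spice):
    """SPIDEr as the mean of CIDEr-D and SPICE, as described in the table."""
    return (cider_d + spice) / 2


if __name__ == "__main__":
    print(clipped_ngram_precision("the the the", ["the cat"]))
    # Upper bound of SPIDEr given the ranges [0, 10] and [0, 1].
    print(spider_from(10.0, 1.0))
```

The second call shows where the [0, 5.5] range of SPIDEr comes from: it is the mean of the upper bounds of CIDEr-D and SPICE.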

Other metrics

Metric	Python Class	Origin	Range	Short description
SPIDEr-max [7]	`SPIDErMax`	audio captioning	[0, 5.5]	Max of SPIDEr scores for multiple candidates
SBERT [8]	`SBERT`	audio captioning	[-1, 1]	Cosine-similarity of Sentence-BERT embeddings
FluencyError [8]	`FluencyError`	audio captioning	[0, 1]	Use pretrained model to detect fluency errors in sentences
FENSE [8]	`FENSE`	audio captioning	[-1, 1]	Combines `SBERT` and `FluencyError`
SPIDErErr	`SPIDErErr`	audio captioning	[0, 5.5]	Combines `SPIDEr` and `FluencyError`
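
As a rough illustration of how a similarity score and a fluency error detector can be combined, the sketch below scales a score down when the detector's error probability exceeds a threshold, in the spirit of FENSE [8]. The function name apply_fluency_penalty is hypothetical, and the threshold and penalty values of 0.9 are assumptions that this document does not state; check the package's own defaults before relying on them.

```python
def apply_fluency_penalty(score, error_prob, threshold=0.9, penalty=0.9):
    """Penalize a sentence score when a fluency error is detected.

    score:      SBERT similarity (FENSE-like) or SPIDEr (SPIDErErr-like).
    error_prob: probability of a fluency error from a pretrained detector.
    threshold, penalty: assumed values, not taken from this document.
    """
    has_error = error_prob > threshold
    if has_error:
        # Only a fraction (1 - penalty) of the original score is kept.
        return score * (1.0 - penalty)
    return score


if __name__ == "__main__":
    print(apply_fluency_penalty(0.8, error_prob=0.10))  # fluent sentence, score unchanged
    print(apply_fluency_penalty(0.8, error_prob=0.95))  # detected error, score scaled down
```

Because the penalty only scales the input score, the result stays within the range of the underlying metric, which matches the ranges listed in the table.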

SPIDEr-max metric

SPIDEr-max [7] is a metric based on SPIDEr that takes into account multiple candidates for the same audio. It computes the maximum of the SPIDEr scores over the candidates to balance the high sensitivity of SPIDEr to the frequency of the words generated by the model.

SPIDEr-max: why?

The SPIDEr metric used in audio captioning is highly sensitive to the frequencies of the words used.

Here are two examples, each with the 5 candidates generated by the beam search algorithm, their corresponding SPIDEr scores and the associated references:

Beam search candidates	SPIDEr
heavy rain is falling on a roof	0.562
heavy rain is falling on a tin roof	0.930
a heavy rain is falling on a roof	0.594
a heavy rain is falling on the ground	0.335
a heavy rain is falling on the roof	0.594

References
heavy rain falls loudly onto a structure with a thin roof
heavy rainfall falling onto a thin structure with a thin roof
it is raining hard and the rain hits a tin roof
rain that is pouring down very hard outside
the hard rain is noisy as it hits a tin roof

(Candidates and references for the Clotho development-testing file named "rain.wav")

Beam search candidates	SPIDEr
a woman speaks and a sheep bleats	0.190
a woman speaks and a goat bleats	1.259
a man speaks and a sheep bleats	0.344
an adult male speaks and a sheep bleats	0.231
an adult male is speaking and a sheep bleats	0.189

References
a man speaking and laughing followed by a goat bleat
a man is speaking in high tone while a goat is bleating one time
a man speaks followed by a goat bleat
a person speaks and a goat bleats
a man is talking and snickering followed by a goat bleating

(Candidates and references for an AudioCaps testing file with the id "jid4t-FzUn0")

Even with very similar candidates, the SPIDEr scores vary drastically. To address this issue, we proposed the SPIDEr-max metric, which takes the maximum value over several candidates for the same audio. SPIDEr-max demonstrates that SPIDEr can exceed state-of-the-art scores on AudioCaps and Clotho, and even human scores on AudioCaps [7].
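
The selection step can be reproduced by hand from the two tables above. This is a minimal sketch that reuses the already computed SPIDEr scores rather than calling the package; spider_max_sketch is a hypothetical name, and averaging the per-audio maxima into a corpus score is an assumption made for illustration.

```python
from statistics import mean

# SPIDEr scores of the beam search candidates, copied from the two tables above.
spider_scores_per_audio = {
    "rain.wav": [0.562, 0.930, 0.594, 0.335, 0.594],
    "jid4t-FzUn0": [0.190, 1.259, 0.344, 0.231, 0.189],
}


def spider_max_sketch(scores_per_audio):
    """Keep the best candidate score per audio, then average over audios."""
    per_audio = {name: max(scores) for name, scores in scores_per_audio.items()}
    corpus = mean(list(per_audio.values()))
    return corpus, per_audio


corpus_score, per_audio_scores = spider_max_sketch(spider_scores_per_audio)
print(per_audio_scores)
print(corpus_score)
```

For both files, the retained candidate is the only one that names the specific sound source found in most references ("tin roof", "goat").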

SPIDEr-max: usage

The usage is very similar to that of the other captioning metrics; the main difference is that it takes a list of multiple candidates per audio as input.

from aac_metrics.functional import spider_max
from aac_metrics.utils.tokenization import preprocess_mult_sents

mult_candidates: list[list[str]] = [["a man is speaking", "maybe someone speaking"]]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"]]

mult_candidates = preprocess_mult_sents(mult_candidates)
mult_references = preprocess_mult_sents(mult_references)

corpus_scores, sents_scores = spider_max(mult_candidates, mult_references)
print(corpus_scores)
# {"spider": tensor(0.1), ...}
print(sents_scores)
# {"spider": tensor([0.9, ...]), ...}

Requirements

Python packages

The pip requirements are automatically installed when using pip install on this repository.

torch >= 1.10.1
numpy >= 1.21.2
pyyaml >= 6.0
tqdm >= 4.64.0
sentence-transformers >= 2.2.2
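
If an existing environment needs to be checked against these minimum versions, something like the following standard-library sketch could be used. The helper names are hypothetical, and the version parsing is deliberately simplified (it keeps only the leading numeric components), so it is not a substitute for pip's own resolver.

```python
import re
from importlib import metadata

# Minimum versions copied from the list above.
MINIMUM_VERSIONS = {
    "torch": "1.10.1",
    "numpy": "1.21.2",
    "pyyaml": "6.0",
    "tqdm": "4.64.0",
    "sentence-transformers": "2.2.2",
}


def version_tuple(version):
    """Turn '2.0.1+cu118' into (2, 0, 1); suffixes are ignored (simplification)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad so that '6.0' and '6.0.0' compare as equal.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_requirements(minimums=MINIMUM_VERSIONS):
    """Return a status string for each required distribution."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if version_tuple(installed) >= version_tuple(minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    for name, status in check_requirements().items():
        print(f"{name}: {status}")
```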

External requirements

java >= 1.8 is required to compute METEOR and SPICE and to use the PTBTokenizer. Most of the functions that rely on it accept a java_path argument to specify the path to the java executable.
The unzip command is required to extract the SPICE zipped files.
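
Before computing METEOR or SPICE, it can be useful to confirm that both external tools are reachable. The sketch below only looks them up on the PATH with the standard library; find_external_tools is a hypothetical helper, and it does not check the Java version.

```python
import shutil


def find_external_tools(tools=("java", "unzip")):
    """Map each tool name to its resolved path, or None if it is not on the PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_external_tools().items():
        print(f"{tool}: {path if path is not None else 'not found on PATH'}")
```

When java is installed outside the PATH, its location can instead be given through the java_path argument mentioned above.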

Additional notes

CIDEr or CIDEr-D?

The CIDEr metric differs from CIDEr-D because it applies a stemmer to each word before computing the n-grams of the sentences. In AAC, only CIDEr-D is reported and used for SPIDEr in caption-evaluation-tools, but some papers call it "CIDEr".

Do the metrics work on multi-GPU?

No. Most of these metrics use numpy or external java programs to run, which prevents multi-GPU testing for now.

Is torchmetrics needed for this package?

No. However, if torchmetrics is installed, all metric classes will inherit from the base class torchmetrics.Metric. It is not required because most of the metrics do not use PyTorch tensors to compute scores, and numpy arrays and strings cannot be added to the states of torchmetrics.Metric.

References

BLEU

[1] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL ’02. Philadelphia, Pennsylvania: Association for Computational Linguistics, 2001, p. 311. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1073083.1073135

ROUGE-L

[2] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, pp. 74–81. [Online]. Available: https://aclanthology.org/W04-1013

METEOR

[3] M. Denkowski and A. Lavie, “Meteor Universal: Language Specific Translation Evaluation for Any Target Language,” in Proceedings of the Ninth Workshop on Statistical Machine Translation. Baltimore, Maryland, USA: Association for Computational Linguistics, 2014, pp. 376–380. [Online]. Available: http://aclweb.org/anthology/W14-3348

CIDEr

[4] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based Image Description Evaluation,” arXiv:1411.5726 [cs], Jun. 2015, arXiv: 1411.5726. [Online]. Available: http://arxiv.org/abs/1411.5726

SPICE

[5] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE: Semantic Propositional Image Caption Evaluation,” arXiv:1607.08822 [cs], Jul. 2016, arXiv: 1607.08822. [Online]. Available: http://arxiv.org/abs/1607.08822

SPIDEr

[6] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Improved Image Captioning via Policy Gradient optimization of SPIDEr,” 2017 IEEE International Conference on Computer Vision (ICCV), pp. 873–881, Oct. 2017, arXiv: 1612.00370. [Online]. Available: http://arxiv.org/abs/1612.00370

SPIDEr-max

[7] E. Labbé, T. Pellegrini, and J. Pinquier, “Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates,” Nov. 2022. [Online]. Available: https://hal.archives-ouvertes.fr/hal-03810396

FENSE

[8] Z. Zhou, Z. Zhang, X. Xu, Z. Xie, M. Wu, and K. Q. Zhu, “Can Audio Captions Be Evaluated with Image Caption Metrics?” arXiv, 2022. [Online]. Available: http://arxiv.org/abs/2110.04684

Citation

If you use SPIDEr-max, you can cite the following paper using BibTeX:

@inproceedings{labbe:hal-03810396,
    TITLE = {{Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates}},
    AUTHOR = {Labb{\'e}, Etienne and Pellegrini, Thomas and Pinquier, Julien},
    URL = {https://hal.archives-ouvertes.fr/hal-03810396},
    BOOKTITLE = {{Workshop DCASE}},
    ADDRESS = {Nancy, France},
    YEAR = {2022},
    MONTH = Nov,
    KEYWORDS = {audio captioning ; evaluation metric ; beam search ; multiple candidates},
    PDF = {https://hal.archives-ouvertes.fr/hal-03810396/file/Labbe_DCASE2022.pdf},
    HAL_ID = {hal-03810396},
    HAL_VERSION = {v1},
}

Contact

Maintainer:

Etienne Labbé "Labbeti": labbeti.pub@gmail.com

aac-metrics 0.3.0
