Sentence embedding evaluation for German

sentence-embedding-evaluation-german

This library is inspired by SentEval but focuses on German-language downstream tasks.

Downstream tasks

| task | type | properties | #train | #test | target | info |
|---|---|---|---|---|---|---|
| TOXIC | 👿 toxic comments | facebook comments | 3244 | 944 | binary {0,1} | GermEval 2021, comments subtask 1 |
| ENGAGE | 🤗 engaging comments | facebook comments | 3244 | 944 | binary {0,1} | GermEval 2021, comments subtask 2 |
| FCLAIM | ☝️ fact-claiming comments | facebook comments | 3244 | 944 | binary {0,1} | GermEval 2021, comments subtask 3 |
| VMWE | verbal idioms | newspaper | 6652 | 1447 | binary (figuratively, literally) | GermEval 2021, verbal idioms |
| OL19-A | 👿 offensive language | tweets | 3980 | 3031 | binary {0,1} | GermEval 2018 |
| OL19-B | 👿 offensive language, fine-grained | tweets | 3980 | 3031 | 4 catg. (profanity, insult, abuse, oth.) | GermEval 2018 |
| OL19-C | 👿 explicit vs. implicit offense | tweets | 1921 | 930 | binary (explicit, implicit) | GermEval 2018 |
| OL18-A | 👿 offensive language | tweets | 5009 | 3398 | binary {0,1} | GermEval 2018 |
| OL18-B | 👿 offensive language, fine-grained | tweets | 5009 | 3398 | 4 catg. (profanity, insult, abuse, oth.) | GermEval 2018 |
| ABSD-1 | 🛤️ relevance classification | 'Deutsche Bahn' customer feedback, lang:de-DE | 19432 | 2555 | binary | GermEval 2017 |
| ABSD-2 | 🛤️ sentiment analysis | 'Deutsche Bahn' customer feedback, lang:de-DE | 19432 | 2555 | 3 catg. (pos., neg., neutral) | GermEval 2017 |
| ABSD-3 | 🛤️ aspect categories | 'Deutsche Bahn' customer feedback, lang:de-DE | 19432 | 2555 | 20 catg. | GermEval 2017 |
| MIO-S | sentiment analysis | 'Der Standard' newspaper article web comments, lang:de-AT | 1799 | 1800 | 3 catg. | One Million Posts Corpus |
| MIO-O | off-topic comments | 'Der Standard' newspaper article web comments, lang:de-AT | 1799 | 1800 | binary | One Million Posts Corpus |
| MIO-I | inappropriate comments | 'Der Standard' newspaper article web comments, lang:de-AT | 1799 | 1800 | binary | One Million Posts Corpus |
| MIO-D | discriminating comments | 'Der Standard' newspaper article web comments, lang:de-AT | 1799 | 1800 | binary | One Million Posts Corpus |
| MIO-F | feedback comments | 'Der Standard' newspaper article web comments, lang:de-AT | 3019 | 3019 | binary | One Million Posts Corpus |
| MIO-P | personal story comments | 'Der Standard' newspaper article web comments, lang:de-AT | 4668 | 4668 | binary | One Million Posts Corpus |
| MIO-A | argumentative comments | 'Der Standard' newspaper article web comments, lang:de-AT | 1799 | 1800 | binary | One Million Posts Corpus |
| SBCH-L | Swiss German detection | 'chatmania' app comments, lang:gsw | 748 | 748 | binary | SB-CH Corpus |
| SBCH-S | sentiment analysis | 'chatmania' app comments, only comments labelled as Swiss German are included, lang:gsw | 394 | 394 | 3 catg. | SB-CH Corpus |
| ARCHI | Swiss German dialect classification | lang:gsw | 18809 | 4743 | 4 catg. | ArchiMob |
| LSDC | Lower Saxon dialect classification | lang:nds | 74140 | 8602 | 14 catg. | LSDC |
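
Several tasks in the table above carry a language tag in their properties column (lang:de-DE, lang:de-AT, lang:gsw, lang:nds). When only one language variety is of interest, the task list can be assembled from those tags. The following is a minimal sketch; the names `TASK_LANG` and `tasks_for` are hypothetical helpers for illustration and are not part of the library.

```python
from typing import Dict, List

# Language tags copied from the "properties" column of the task table.
# Tasks without an explicit tag (e.g. TOXIC, VMWE, OL19-A) are left out.
TASK_LANG: Dict[str, str] = {
    "ABSD-1": "de-DE", "ABSD-2": "de-DE", "ABSD-3": "de-DE",
    "MIO-S": "de-AT", "MIO-O": "de-AT", "MIO-I": "de-AT", "MIO-D": "de-AT",
    "MIO-F": "de-AT", "MIO-P": "de-AT", "MIO-A": "de-AT",
    "SBCH-L": "gsw", "SBCH-S": "gsw", "ARCHI": "gsw",
    "LSDC": "nds",
}


def tasks_for(lang: str) -> List[str]:
    """Return all task names tagged with the given language code."""
    return [task for task, tag in TASK_LANG.items() if tag == lang]


# Example: select only the Swiss German tasks
swiss_german_tasks = tasks_for("gsw")
print(swiss_german_tasks)  # ['SBCH-L', 'SBCH-S', 'ARCHI']
```

The resulting list has the same form as the `downstream_tasks` list passed to the evaluation in the usage example.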

Download datasets

bash download-datasets.sh

Usage example

from typing import List
import sentence_embedding_evaluation_german as seeg
import torch

# (1) Instantiate your Embedding model
emb_dim = 512
vocab_sz = 128
emb = torch.randn((vocab_sz, emb_dim), requires_grad=False)
emb = torch.nn.Embedding.from_pretrained(emb)
assert not emb.weight.requires_grad  # the embedding weights stay frozen

# (2) Specify the preprocessing
def preprocesser(batch: List[str], params: dict = None) -> torch.Tensor:
    """Specify your embedding or pretrained encoder here.

    Parameters:
    -----------
    batch : List[str]
        A list of sentences as strings
    params : dict
        The params dictionary

    Returns:
    --------
    torch.Tensor
        One embedding vector per sentence, shape (len(batch), emb_dim)
    """
    features = []
    for sent in batch:
        # Toy tokenizer: map each character to an ID in [0, vocab_sz)
        ids = torch.tensor([ord(c) % vocab_sz for c in sent], dtype=torch.long)
        if ids.numel() == 0:
            # Empty sentence: fall back to a zero vector
            features.append(torch.zeros(emb_dim))
            continue
        # Mean-pool the character embeddings into one sentence vector
        features.append(emb(ids).mean(dim=0))
    return torch.stack(features, dim=0)

# (3) Training settings
params = {
    'datafolder': '../datasets',
    'batch_size': 128, 
    'num_epochs': 20,
    # 'early_stopping': True,
    # 'split_ratio': 0.2,  # if early_stopping=True
    # 'patience': 5,  # if early_stopping=True
}

# (4) Specify downstream tasks
downstream_tasks = ['FCLAIM', 'VMWE', 'OL19-C', 'ABSD-2', 'MIO-P', 'ARCHI', 'LSDC']

# (5) Run experiments
results = seeg.evaluate(downstream_tasks, preprocesser, **params)
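
Because an evaluation over several tasks can run for a while, it may help to check the preprocessing function on a few sentences first. The sketch below is not part of the library: `check_preprocesser` and `dummy_preprocesser` are hypothetical names used for illustration. It only assumes that the function returns one equally sized vector per input sentence, as the docstring in step (2) describes, and it relies on `len()` and iteration so that nested lists and 2-D tensors are both accepted.

```python
from typing import Callable, List, Sequence


def check_preprocesser(fn: Callable[[List[str]], Sequence],
                       samples: List[str]) -> int:
    """Call `fn` on `samples` and return the embedding dimension.

    Raises ValueError if the output does not contain exactly one vector
    per sentence or if the vectors differ in length.
    """
    out = fn(samples)
    if len(out) != len(samples):
        raise ValueError(f"expected {len(samples)} vectors, got {len(out)}")
    dims = {len(vec) for vec in out}
    if len(dims) != 1:
        raise ValueError(f"inconsistent embedding sizes: {sorted(dims)}")
    return dims.pop()


# Stand-in encoder for this demo only: two features per sentence
def dummy_preprocesser(batch: List[str], params: dict = None) -> List[List[float]]:
    return [[float(len(s)), float(s.count(" "))] for s in batch]


# Include an empty string, since real datasets may contain empty comments
samples = ["Das ist ein Test.", "Grüezi mitenand!", ""]
dim = check_preprocesser(dummy_preprocesser, samples)
print(dim)  # 2
```

Passing the actual `preprocesser` instead of the stand-in should return its embedding dimension, or raise early if some input (such as an empty string) is not handled.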

Appendix

Installation

The sentence-embedding-evaluation-german git repo is available as a PyPI package. Install it either from PyPI or directly from GitHub:

pip install sentence-embedding-evaluation-german
pip install git+ssh://git@github.com/ulf1/sentence-embedding-evaluation-german.git

Install a virtual environment

python3.7 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt --no-cache-dir
pip install -r requirements-dev.txt --no-cache-dir
pip install -r requirements-demo.txt --no-cache-dir

(If your git repo is stored in a folder whose path contains whitespace, don't use the subfolder .venv. Create the virtual environment at an absolute path without whitespace instead.)

Python commands

  • Jupyter for the examples: jupyter lab
  • Check syntax: flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')

Publish

pandoc README.md --from markdown --to rst -s -o README.rst
python setup.py sdist 
twine upload -r pypi dist/*

Clean up

find . -type f -name "*.pyc" | xargs rm
find . -type d -name "__pycache__" | xargs rm -r
rm -r .pytest_cache
rm -r .venv
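
On systems without `find` and `xargs`, a similar clean-up can be sketched with the Python standard library. This is an illustrative alternative, not a script shipped with the repo; the function name `clean` is hypothetical. Unlike the shell commands above, it removes `.pytest_cache` folders at any depth and deliberately leaves `.venv` in place, so it does not delete the environment it may be running from.

```python
import shutil
from pathlib import Path


def clean(root: str = ".") -> int:
    """Delete *.pyc files and __pycache__/.pytest_cache folders under `root`.

    Returns the number of removed files and folders.
    """
    base = Path(root)
    removed = 0
    # Collect matches first so deletion does not disturb the directory walk
    for pyc in list(base.rglob("*.pyc")):
        pyc.unlink()
        removed += 1
    for name in ("__pycache__", ".pytest_cache"):
        for folder in list(base.rglob(name)):
            # Skip entries that vanished together with a removed parent
            if folder.is_dir():
                shutil.rmtree(folder)
                removed += 1
    return removed
```

Calling `clean(".")` from the repo root would then cover the first three shell commands; the `.venv` folder still has to be removed by hand.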

Support

Please open an issue for support.

Contributing

Please contribute using GitHub Flow. Create a branch, add commits, and open a pull request.
