Extracts data from German Wiktionary dump files.

wiktionary-de-parser

A Python module to extract data from German Wiktionary XML files (for Python 3.11+).

Features

  • Extracts a word's inflection (flexion) table, IPA transcriptions, language, grammatical gender (genus), lemma, basic part-of-speech information, and syllables.
  • Yields one record per entry rather than per page: a page can contain several entries, because a word can have several meanings.

Installation

pip install wiktionary-de-parser

Or with Poetry:

poetry add wiktionary-de-parser

Usage

from bz2 import BZ2File
from wiktionary_de_parser import Parser

bzfile_path = '/tmp/dewiktionary-latest-pages-articles-multistream.xml.bz2'
bz_file = BZ2File(bzfile_path)

for record in Parser(bz_file):
    if record.lang_code != 'de':
        continue
    # do stuff with 'record'

Note: This example loads a compressed dump of the German Wiktionary, as published on the Wikimedia dumps site.
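
The loop above can be wrapped in a small helper. The following is a minimal sketch and not part of the package: `collect_ipa` is a hypothetical name, and it relies only on the `lang_code`, `lemma` and `ipa` attributes that appear in the example output. It accepts any iterable of records, so the stand-in objects used here let it run without a dump file.

```python
from collections import defaultdict
from types import SimpleNamespace


def collect_ipa(records, lang_code="de"):
    """Map each lemma to the distinct IPA transcriptions across its entries."""
    result = defaultdict(list)
    for record in records:
        if record.lang_code != lang_code:
            continue
        # Assumption: 'ipa' may be missing or None for some entries, so guard it.
        for transcription in getattr(record, "ipa", None) or []:
            if transcription not in result[record.lemma]:
                result[record.lemma].append(transcription)
    return dict(result)


# Stand-in records for illustration; real ones would come from Parser(bz_file).
sample = [
    SimpleNamespace(lemma="Abend", lang_code="de", ipa=["ˈaːbn̩t", "ˈaːbm̩t"]),
    SimpleNamespace(lemma="Abend", lang_code="de", ipa=["ˈaːbn̩t"]),
    SimpleNamespace(lemma="evening", lang_code="en", ipa=["ˈiːvnɪŋ"]),
]
ipa_by_lemma = collect_ipa(sample)
```

With a real dump, the same helper would presumably be called as `collect_ipa(Parser(bz_file))`. Because the parser yields one record per entry, transcriptions from several entries of the same page are merged under one lemma.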

Output

Example output for the page "Abend":

Record(lemma='Abend',
       inflected=False,
       syllables=['Abend'],
       ipa=['ˈaːbn̩t', 'ˈaːbm̩t'],
       rhymes=['aːbn̩t'],
       pos={'Substantiv': []},
       lang='Deutsch',
       lang_code='de',
       flexion={'Akkusativ Plural': 'Abende',
                'Akkusativ Singular': 'Abend',
                'Dativ Plural': 'Abenden',
                'Dativ Singular': 'Abend',
                'Genitiv Plural': 'Abende',
                'Genitiv Singular': 'Abends',
                'Genus': 'm',
                'Nominativ Plural': 'Abende',
                'Nominativ Singular': 'Abend'},
       page_id=5719,
       index=0,
       title='Abend',
       wikitext=None)

Record(lemma='Abend',
       inflected=False,
       syllables=['Abend'],
       ipa=['ˈaːbn̩t'],
       rhymes=['aːbn̩t'],
       pos={'Substantiv': ['Nachname']},
       lang='Deutsch',
       lang_code='de',
       flexion=None,
       page_id=5719,
       index=1,
       title='Abend',
       wikitext=None)

Record(lemma='Abend',
       inflected=False,
       syllables=['Abend'],
       ipa=['ˈaːbn̩t', 'ˈaːbm̩t'],
       rhymes=['aːbn̩t'],
       pos={'Substantiv': ['Toponym']},
       lang='Deutsch',
       lang_code='de',
       flexion=None,
       page_id=5719,
       index=2,
       title='Abend',
       wikitext=None)
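
As the records above show, `flexion` is a dictionary for some entries and `None` for others, so code that reads it should handle both cases. The following is a minimal sketch under that assumption: `noun_forms` is a hypothetical helper, and the stand-in objects copy only the fields it uses.

```python
from types import SimpleNamespace


def noun_forms(record):
    """Return (genus, nominative plural), or None if there is no flexion table."""
    flexion = record.flexion
    if not flexion:
        # Entries such as surnames or toponyms can carry flexion=None.
        return None
    return flexion.get("Genus"), flexion.get("Nominativ Plural")


# Stand-ins mirroring the first two records above (only the fields used here).
with_table = SimpleNamespace(
    lemma="Abend",
    flexion={"Genus": "m", "Nominativ Singular": "Abend", "Nominativ Plural": "Abende"},
)
without_table = SimpleNamespace(lemma="Abend", flexion=None)

print(noun_forms(with_table))     # ('m', 'Abende')
print(noun_forms(without_table))  # None
```

Using `dict.get` keeps the helper from raising when a table lacks one of the keys; which keys are present for other parts of speech is not covered by the example output.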

Development

This project uses Poetry.

  1. Install Poetry.
  2. Clone this repository.
  3. Run poetry install inside the project folder to install the dependencies.
  4. Adapt wiktionary_de_parser/run.py to your needs.
  5. Run poetry run python wiktionary_de_parser/run.py to run the parser, or poetry run pytest to run the tests.

License

MIT © Gregor Weichbrodt
