Extracts data from German Wiktionary dump files.
wiktionary-de-parser
A Python module to extract data from German Wiktionary XML files (for Python 3.11+).
Features
- Extracts flexion tables, IPA transcriptions, language, genus, lemma, basic part-of-speech information, and syllables of a word.
- Yields one record per entry rather than per page (a page can contain multiple entries, since a word can have several meanings).
Installation
pip install wiktionary-de-parser
Or with Poetry:
poetry add wiktionary-de-parser
Usage
from bz2 import BZ2File
from wiktionary_de_parser import Parser

bzfile_path = '/tmp/dewiktionary-latest-pages-articles-multistream.xml.bz2'
bz_file = BZ2File(bzfile_path)

for record in Parser(bz_file):
    if record.lang_code != 'de':
        continue
    # do stuff with 'record'
Note: This example loads a compressed Wiktionary dump file that was downloaded in advance.
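As a minimal sketch of what "do stuff with 'record'" could look like, the helper below tallies part-of-speech labels over German records. It relies only on the `lang_code` and `pos` attributes shown in this README. The function name `count_german_pos` and the stand-in records are hypothetical and used so the snippet runs without a dump file.

```python
from collections import Counter
from types import SimpleNamespace


def count_german_pos(records):
    """Tally part-of-speech labels over all records whose lang_code is 'de'."""
    counts = Counter()
    for record in records:
        if record.lang_code != 'de':
            continue
        # pos maps each part-of-speech label to a list of sub-labels
        counts.update((record.pos or {}).keys())
    return counts


# Stand-in records for illustration only (not real parser output).
sample = [
    SimpleNamespace(lemma='Abend', lang_code='de', pos={'Substantiv': []}),
    SimpleNamespace(lemma='Abend', lang_code='de', pos={'Substantiv': ['Nachname']}),
    SimpleNamespace(lemma='evening', lang_code='en', pos={'Substantiv': []}),
]
print(count_german_pos(sample))
```

With a real dump, the same function should accept the parser directly, for example `count_german_pos(Parser(bz_file))`, since it only iterates over the records once.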
Output
Example output for the page "Abend":
Record(lemma='Abend',
       inflected=False,
       syllables=['Abend'],
       ipa=['ˈaːbn̩t', 'ˈaːbm̩t'],
       rhymes=['aːbn̩t'],
       pos={'Substantiv': []},
       lang='Deutsch',
       lang_code='de',
       flexion={'Akkusativ Plural': 'Abende',
                'Akkusativ Singular': 'Abend',
                'Dativ Plural': 'Abenden',
                'Dativ Singular': 'Abend',
                'Genitiv Plural': 'Abende',
                'Genitiv Singular': 'Abends',
                'Genus': 'm',
                'Nominativ Plural': 'Abende',
                'Nominativ Singular': 'Abend'},
       page_id=5719,
       index=0,
       title='Abend',
       wikitext=None)
Record(lemma='Abend',
       inflected=False,
       syllables=['Abend'],
       ipa=['ˈaːbn̩t'],
       rhymes=['aːbn̩t'],
       pos={'Substantiv': ['Nachname']},
       lang='Deutsch',
       lang_code='de',
       flexion=None,
       page_id=5719,
       index=1,
       title='Abend',
       wikitext=None)
Record(lemma='Abend',
       inflected=False,
       syllables=['Abend'],
       ipa=['ˈaːbn̩t', 'ˈaːbm̩t'],
       rhymes=['aːbn̩t'],
       pos={'Substantiv': ['Toponym']},
       lang='Deutsch',
       lang_code='de',
       flexion=None,
       page_id=5719,
       index=2,
       title='Abend',
       wikitext=None)
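The records above show that `flexion` is either a dictionary keyed by case and number or `None`. The following sketch, which assumes only the `lemma` and `flexion` attributes seen in that output, pulls the genus and nominative forms out of a record and skips entries without a flexion table. The helper name `noun_summary` and the stand-in objects are hypothetical.

```python
from types import SimpleNamespace


def noun_summary(record):
    """Return genus and nominative forms, or None if the record has no flexion table."""
    flexion = record.flexion
    if not flexion:
        return None
    return {
        'lemma': record.lemma,
        'genus': flexion.get('Genus'),
        'singular': flexion.get('Nominativ Singular'),
        'plural': flexion.get('Nominativ Plural'),
    }


# Stand-ins mirroring the first two "Abend" records shown above.
abend = SimpleNamespace(
    lemma='Abend',
    flexion={
        'Genus': 'm',
        'Nominativ Singular': 'Abend',
        'Nominativ Plural': 'Abende',
    },
)
surname = SimpleNamespace(lemma='Abend', flexion=None)

print(noun_summary(abend))
print(noun_summary(surname))
```

Using `dict.get` keeps the helper tolerant of entries whose tables lack one of these keys, which seems likely for words that have no plural.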
Development
This project uses Poetry.
- Install Poetry.
- Clone this repository.
- Run `poetry install` inside of the project folder to install dependencies.
- Change `wiktionary_de_parser/run.py` to your needs.
- Run `poetry run python wiktionary_de_parser/run.py` to run the parser, or `poetry run pytest` to run tests.
License
MIT © Gregor Weichbrodt
Download files
File details
Details for the file wiktionary_de_parser-0.10.1.tar.gz.
File metadata
- Download URL: wiktionary_de_parser-0.10.1.tar.gz
- Upload date:
- Size: 14.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.1 CPython/3.11.3 Darwin/23.2.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ed4c9ac9e147680889f05c764177abfaafa7a8855629e8e0d211d33b6b5cac5f |
| MD5 | 1a89ed1886f83b608f00d5999cc61838 |
| BLAKE2b-256 | 012f10c3b752311472b95e83fb97612b3925ec9e8430afb999776647035e89d6 |
File details
Details for the file wiktionary_de_parser-0.10.1-py3-none-any.whl.
File metadata
- Download URL: wiktionary_de_parser-0.10.1-py3-none-any.whl
- Upload date:
- Size: 18.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.1 CPython/3.11.3 Darwin/23.2.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0fbd7bb78fee46262442ab0fe8efbe5fb02f6360ba807b11d1a93aedefc5103e |
| MD5 | 53a506bf5041a00d9b701f1f534f55af |
| BLAKE2b-256 | e8f9656d1ac5c0ac899eb876077e88395059d5906de97f02335ce0bd32bed70a |