Extracts data from German Wiktionary dump files.
wiktionary-de-parser
A Python module to extract data from German Wiktionary XML files (for Python 3.11+).
Features
- Extracts flexion tables, IPA transcriptions, language, genus, lemma, basic part-of-speech information, and syllables of a word.
- Yields results per entry, not per page: a page can contain multiple entries, since a word can have several meanings.
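Because results are yielded per entry, the same page name can show up several times in the parsed stream. A minimal sketch of how one might count entries per page name, assuming each parsed entry exposes a `name` attribute; the helper `entries_per_page` and the sample names are hypothetical and only illustrate the idea:

```python
from collections import Counter


def entries_per_page(names):
    """Count how many parsed entries share each page name."""
    return Counter(names)


# Illustrative names, as they might be read from `parsed.name`.
counts = entries_per_page(["Abend", "Abend", "Abend", "Morgen"])

# Page names that produced more than one entry.
multi_entry_pages = [name for name, n in counts.items() if n > 1]
```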
Installation
pip install wiktionary-de-parser
Or with Poetry:
poetry add wiktionary-de-parser
Usage
The following example downloads the latest German Wiktionary dump file and parses all German entries.
from wiktionary_de_parser import WiktionaryParser
from wiktionary_de_parser.dump_processor import WiktionaryDump
# Specify the directory where the dump file should be stored.
dump = WiktionaryDump(dump_dir_path="directory-of-dump-file")
# This will download "dewiktionary-latest-pages-articles-multistream.xml.bz2" to
# the directory specified in `dump_dir_path`.
dump.download_dump()
# Alternatively you can also specify a different dump file to download.
dump = WiktionaryDump(
    dump_dir_path="directory-of-dump-file",
    dump_download_url="url-to-dump-file.xml.bz2",
)
dump.download_dump()
# If you already have the dump file, you can also specify the path to the file.
dump = WiktionaryDump(dump_file_path="path-to-dump-file.xml.bz2")
dump.download_dump()
# Next, we can parse the dump file.
parser = WiktionaryParser()
for page in dump.pages():
    # Skip redirects
    if page.redirect_to:
        continue
    for entry in parser.entries_from_page(page):
        parsed = parser.parse_entry(entry)
        # Ignore non-German entries
        if parsed.language.lang_code != "de":
            continue
        # Do something with "parsed"
        ...
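As one possible way to fill in the "do something" step, the sketch below collects the grammatical gender of each noun into a dictionary. It relies only on the attribute names shown in the Output section (`name`, `pos`, `flexion`). The helper `collect_genus` is hypothetical, and `SimpleNamespace` objects stand in for parsed entries so the snippet runs without a dump file.

```python
from types import SimpleNamespace


def collect_genus(entries):
    """Map each noun's name to the "Genus" value of its flexion table.

    `entries` is any iterable of objects with `name`, `pos`, and `flexion`
    attributes (assumed to match the fields shown in the Output section).
    """
    genus_by_name = {}
    for entry in entries:
        # Skip entries that have no part-of-speech data or are not nouns.
        if not entry.pos or "Substantiv" not in entry.pos:
            continue
        # flexion can be None, so guard before looking up "Genus".
        if entry.flexion and "Genus" in entry.flexion:
            # Keep the first genus seen for a given name.
            genus_by_name.setdefault(entry.name, entry.flexion["Genus"])
    return genus_by_name


# Stand-in objects for illustration only.
sample_entries = [
    SimpleNamespace(name="Abend", pos={"Substantiv": []}, flexion={"Genus": "m"}),
    SimpleNamespace(name="Abend", pos={"Substantiv": ["Nachname"]}, flexion=None),
]
genus = collect_genus(sample_entries)
```

In the loop above, the same function could be fed a generator of `parsed` objects instead of the stand-ins.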
Output
All entries for "Abend":
ParsedWiktionaryPageEntry(
    name="Abend",
    flexion={
        "Genus": "m",
        "Nominativ Singular": "Abend",
        "Nominativ Plural": "Abende",
        "Genitiv Singular": "Abends",
        "Genitiv Plural": "Abende",
        "Dativ Singular": "Abend",
        "Dativ Plural": "Abenden",
        "Akkusativ Singular": "Abend",
        "Akkusativ Plural": "Abende",
    },
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": []},
    rhymes=["aːbn̩t"],
    syllables=["Abend"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    flexion=None,
    ipa=["ˈaːbn̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Nachname"]},
    rhymes=["aːbn̩t"],
    syllables=["Abend"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    flexion=None,
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Toponym"]},
    rhymes=["aːbn̩t"],
    syllables=["Abend"],
)
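The three entries above differ mainly in `pos` and `flexion`. A minimal sketch of how such entries might be told apart, assuming `pos` maps a part of speech to a list of sub-tags as in the output shown. `describe_entry` is a hypothetical helper, and `SimpleNamespace` objects stand in for `ParsedWiktionaryPageEntry`:

```python
from types import SimpleNamespace


def describe_entry(entry):
    """Return a short label such as "Abend: Substantiv (Toponym)"."""
    parts = []
    for part_of_speech, tags in (entry.pos or {}).items():
        # Sub-tags like "Nachname" or "Toponym" refine the part of speech.
        if tags:
            parts.append(f"{part_of_speech} ({', '.join(tags)})")
        else:
            parts.append(part_of_speech)
    label = f"{entry.name}: {', '.join(parts)}"
    # Flag entries that come without a flexion table (flexion=None).
    if entry.flexion is None:
        label += " [no flexion table]"
    return label


# Stand-ins mirroring the first and third entries shown above.
common_noun = SimpleNamespace(name="Abend", pos={"Substantiv": []}, flexion={"Genus": "m"})
toponym = SimpleNamespace(name="Abend", pos={"Substantiv": ["Toponym"]}, flexion=None)
labels = [describe_entry(common_noun), describe_entry(toponym)]
```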
Development
This project uses Poetry.
- Install Poetry.
- Clone this repository.
- Run `poetry install` inside the project folder to install dependencies.
- There is a `notebook.ipynb` to test the parser.
- Run `poetry run pytest` to run tests.
License
MIT © Gregor Weichbrodt
File details
Details for the file wiktionary_de_parser-0.11.1.tar.gz.
File metadata
- Download URL: wiktionary_de_parser-0.11.1.tar.gz
- Upload date:
- Size: 16.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.1 CPython/3.11.3 Darwin/23.2.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6737e019a22ae9333dca23ca6ec3e23cdbf0116179e46828cdf0b38fd6a2ee87 |
| MD5 | a9e2277c0ec7b3ef942f9b4c05972796 |
| BLAKE2b-256 | ed89dbef5cb9a1d8867ff84bf122ca4358fb0bb3ed61f19140ad1487adb5d591 |
File details
Details for the file wiktionary_de_parser-0.11.1-py3-none-any.whl.
File metadata
- Download URL: wiktionary_de_parser-0.11.1-py3-none-any.whl
- Upload date:
- Size: 20.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.1 CPython/3.11.3 Darwin/23.2.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 26eb4fe865d3345ac8e0c5bea8686af743ceba7d66501aae36a6d7796c7c6c1e |
| MD5 | b5fb5f22fe73d969cedff699fff075cd |
| BLAKE2b-256 | 190a349f97f9715430e576ce167dcde592009d679e91cede7cd6d58d92e50272 |
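A downloaded file can be checked against the SHA256 digests listed above. The following is a minimal sketch using only Python's standard `hashlib`; the helper names `sha256_of_file` and `matches_published_digest` are hypothetical and not part of this package.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published one, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest("wiktionary_de_parser-0.11.1.tar.gz", "<SHA256 from the table>")` would be expected to return `True` for an unmodified download.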