Extracts data from German Wiktionary dump files.

wiktionary-de-parser

A Python module to extract data from German Wiktionary XML files (for Python 3.11+).

Features

  • Extracts inflection (flexion) tables, IPA transcriptions, language, grammatical gender (genus), lemma, basic part-of-speech information, and syllables of a word.
  • Yields one result per entry, not per page: a page can contain multiple entries, because a word can have several meanings.

Installation

pip install wiktionary-de-parser

Or with Poetry:

poetry add wiktionary-de-parser

Usage

The following example downloads the latest German Wiktionary dump file and parses all German entries.

from wiktionary_de_parser import WiktionaryParser
from wiktionary_de_parser.dump_processor import WiktionaryDump

# Specify the directory where the dump file should be stored.
dump = WiktionaryDump(dump_dir_path="directory-of-dump-file")

# This will download "dewiktionary-latest-pages-articles-multistream.xml.bz2" to
# the directory specified in `dump_dir_path`.
dump.download_dump()

# Alternatively you can also specify a different dump file to download.
dump = WiktionaryDump(
    dump_dir_path="directory-of-dump-file",
    dump_download_url="url-to-dump-file.xml.bz2",
)
dump.download_dump()

# If you already have the dump file, you can also specify the path to the file.
dump = WiktionaryDump(dump_file_path="path-to-dump-file.xml.bz2")
dump.download_dump()

# Next, we can parse the dump file.
parser = WiktionaryParser()
for page in dump.pages():
    # Skip redirects
    if page.redirect_to:
        continue

    for entry in parser.entries_from_page(page):
        parsed = parser.parse_entry(entry)

        # Ignore non-German entries
        if parsed.language.lang_code != "de":
            continue

        # Do something with "parsed"
        ...
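
The loop above can also be wrapped in a small generator so that calling code only sees German entries. This is a minimal sketch, not part of the package: `iter_german_entries` is a hypothetical helper name, and it relies only on the methods and attributes used in the example above (`pages()`, `redirect_to`, `entries_from_page()`, `parse_entry()`, `language.lang_code`).

```python
from typing import Any, Iterator


def iter_german_entries(dump: Any, parser: Any, lang_code: str = "de") -> Iterator[Any]:
    """Yield parsed entries whose language code equals `lang_code`.

    Hypothetical helper: `dump` and `parser` are assumed to behave like the
    WiktionaryDump and WiktionaryParser objects in the usage example.
    """
    for page in dump.pages():
        # Skip redirects, as in the usage example.
        if page.redirect_to:
            continue
        for entry in parser.entries_from_page(page):
            parsed = parser.parse_entry(entry)
            # Keep only entries in the requested language.
            if parsed.language.lang_code == lang_code:
                yield parsed
```

With the objects from the example, this would be used as `for parsed in iter_german_entries(dump, WiktionaryParser()): ...`. Because it is a generator, entries are produced one at a time while the dump is read.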

Output

All entries for "Abend":

ParsedWiktionaryPageEntry(
    name="Abend",
    flexion={
        "Genus": "m",
        "Nominativ Singular": "Abend",
        "Nominativ Plural": "Abende",
        "Genitiv Singular": "Abends",
        "Genitiv Plural": "Abende",
        "Dativ Singular": "Abend",
        "Dativ Plural": "Abenden",
        "Akkusativ Singular": "Abend",
        "Akkusativ Plural": "Abende",
    },
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": []},
    rhymes=["aːbn̩t"],
    syllables=["Abend"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    flexion=None,
    ipa=["ˈaːbn̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Nachname"]},
    rhymes=["aːbn̩t"],
    syllables=["Abend"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    flexion=None,
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Toponym"]},
    rhymes=["aːbn̩t"],
    syllables=["Abend"],
)
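
For export to CSV or JSON it can help to flatten each entry into a plain dictionary. The following sketch reads only the fields visible in the output above. `entry_to_row` is a hypothetical helper, and the choice of columns is an assumption made for illustration. Note that `flexion` is `None` for entries without an inflection table, as two of the three "Abend" entries show.

```python
from typing import Any, Dict


def entry_to_row(entry: Any) -> Dict[str, Any]:
    """Flatten one parsed entry into a plain dict (hypothetical helper).

    Accesses only attributes that appear in the sample output:
    name, flexion, ipa, lemma.lemma and pos.
    """
    # flexion is None when the entry has no inflection table.
    flexion = entry.flexion or {}
    return {
        "name": entry.name,
        "lemma": entry.lemma.lemma,
        "genus": flexion.get("Genus"),
        "plural": flexion.get("Nominativ Plural"),
        # The `or` guards are defensive; whether these fields can be
        # None in practice is an assumption.
        "ipa": list(entry.ipa or []),
        "pos": sorted(entry.pos or {}),
    }
```

For the first "Abend" entry this yields `genus` "m" and `plural` "Abende"; for the surname and toponym entries both are `None`.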

Development

This project uses Poetry.

  1. Install Poetry.
  2. Clone this repository.
  3. Run poetry install inside the project folder to install the dependencies.
  4. Use notebook.ipynb to try out the parser.
  5. Run poetry run pytest to run the tests.

License

MIT © Gregor Weichbrodt
