Extracts data from German Wiktionary dump files. Allows you to add your own extraction methods 🚀

wiktionary-de-parser

This is a Python module to extract data from German Wiktionary XML files (for Python 3.7+). It allows you to add your own extraction methods.

Installation

pip install wiktionary-de-parser

Features

  • Extracts flexion tables, genus, IPA, language, lemma, part of speech (basic), syllables, raw Wikitext
  • Allows you to add your own extraction methods (pass them as an argument)
  • Yields one record per section, not per page (a word can have multiple meanings, and therefore multiple sections on a Wiktionary page)
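
Because records are yielded per section, a page with several sections produces several records that share a title. The following is a minimal sketch of regrouping them per page. It assumes each record is a dict with a 'title' key, as in the example output for "Abend"; group_by_title is a hypothetical helper and not part of the library, and the sample records are hand-written stand-ins rather than real dump data.

```python
from collections import defaultdict


def group_by_title(records):
    """Collect section-level records under their page title (hypothetical helper)."""
    pages = defaultdict(list)
    for record in records:
        pages[record['title']].append(record)
    return dict(pages)


# Hand-written stand-ins for parser output; with a real dump,
# pass Parser(bz_file) instead of this list.
sample = [
    {'title': 'Bank', 'lemma': 'Bank'},
    {'title': 'Bank', 'lemma': 'Bank'},
    {'title': 'Abend', 'lemma': 'Abend'},
]
grouped = group_by_title(sample)
```

Note that this holds every record in memory, which may not be practical for a full dump.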

Usage

from bz2 import BZ2File
from wiktionary_de_parser import Parser

bzfile_path = '/tmp/dewiktionary-latest-pages-articles-multistream.xml.bz2'
bz_file = BZ2File(bzfile_path)

for record in Parser(bz_file):
    # Skip records without a language code and records that are not German
    if 'lang_code' not in record or record['lang_code'] != 'de':
        continue
    # do stuff with 'record'

Note: This example loads a compressed Wiktionary dump file (.xml.bz2) as published by the Wikimedia dump service.
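
The language filter from the loop above can also be wrapped in a small generator so it is reusable. This is only a sketch: records_for_language is a hypothetical helper, and it assumes records behave like dicts with an optional 'lang_code' key. It accepts any iterable, so it can be tried without a dump file.

```python
def records_for_language(records, lang_code='de'):
    """Yield only records whose 'lang_code' matches (hypothetical helper)."""
    for record in records:
        # .get() covers records that have no 'lang_code' key at all
        if record.get('lang_code') == lang_code:
            yield record


# Hand-written stand-ins; with a real dump, pass Parser(bz_file) instead.
sample = [
    {'title': 'Abend', 'lang_code': 'de'},
    {'title': 'evening', 'lang_code': 'en'},
    {'title': 'Fragment'},  # no language code detected
]
german_titles = [r['title'] for r in records_for_language(sample)]
```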

Adding new extraction methods

An extraction method takes the following arguments:

  • title (string): The title of the current Wiktionary page
  • text (string): The Wikitext of the current word entry/section
  • current_record (Dict): A dictionary with all values of the current iteration (e.g. current_record['lang_code'])

It must return a Dict with the results, or False if the record could not be processed successfully.

# Create a new extraction method
def my_method(title, text, current_record):
    # do stuff; as a placeholder, use the length of the section's Wikitext
    my_data = len(text)
    return {'my_field': my_data} if my_data else False

# Pass a list with all extraction methods to the class constructor:
for record in Parser(bz_file, custom_methods=[my_method]):
    print(record['my_field'])
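
As a more concrete illustration, an extraction method could collect the targets of wiki links in a section. This is a sketch under assumptions: extract_links and the 'links' field are made up for this example, and the regular expression only handles plain [[target]] and [[target|label]] links.

```python
import re

# Captures the target of [[target]] and [[target|label]] links.
WIKILINK = re.compile(r'\[\[([^\]|]+)')


def extract_links(title, text, current_record):
    """Hypothetical extraction method: collect wiki link targets from the Wikitext."""
    links = WIKILINK.findall(text)
    # Follow the documented contract: a dict on success, False otherwise
    return {'links': links} if links else False


sample_text = 'Der [[Abend]] folgt auf den [[Nachmittag|späten Nachmittag]].'
result = extract_links('Abend', sample_text, {'lang_code': 'de'})
```

It would be passed to the parser in the same way as my_method, i.e. via custom_methods=[extract_links].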

Output

Example output for the word "Abend":

{'flexion': {'Akkusativ Plural': 'Abende',
             'Akkusativ Singular': 'Abend',
             'Dativ Plural': 'Abenden',
             'Dativ Singular': 'Abend',
             'Genitiv Plural': 'Abende',
             'Genitiv Singular': 'Abends',
             'Genus': 'm',
             'Nominativ Plural': 'Abende',
             'Nominativ Singular': 'Abend'},
 'inflected': False,
 'ipa': ['ˈaːbn̩t', 'ˈaːbm̩t'],
 'lang': 'Deutsch',
 'lang_code': 'de',
 'lemma': 'Abend',
 'pos': {'Substantiv': []},
 'rhymes': ['aːbn̩t'],
 'syllables': ['Abend'],
 'title': 'Abend'}
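
Given a record shaped like the one above, individual fields can be read with ordinary dictionary access. The sketch below condenses a noun record into a few fields. noun_summary is a hypothetical helper, and it assumes that 'flexion' and 'pos' may be missing for some records.

```python
def noun_summary(record):
    """Return lemma, genus and nominative plural for a noun record, else None."""
    flexion = record.get('flexion')
    if not flexion or 'Substantiv' not in record.get('pos', {}):
        return None
    return {
        'lemma': record['lemma'],
        'genus': flexion.get('Genus'),
        'plural': flexion.get('Nominativ Plural'),
    }


# Trimmed copy of the example output for "Abend"
abend = {
    'flexion': {'Genus': 'm',
                'Nominativ Plural': 'Abende',
                'Nominativ Singular': 'Abend'},
    'lang_code': 'de',
    'lemma': 'Abend',
    'pos': {'Substantiv': []},
}
summary = noun_summary(abend)
```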

Development

This project uses Poetry.

  1. Install Poetry.
  2. Clone this repository.
  3. Run poetry install inside of the project folder to install dependencies.
  4. Change wiktionary_de_parser/run.py to your needs.
  5. Run poetry run python wiktionary_de_parser/run.py to run the parser, or poetry run pytest to run the tests.

License

MIT © Gregor Weichbrodt
