
Extracts data from German Wiktionary dump files. Allows you to add your own extraction methods 🚀


wiktionary_de_parser

wiktionary_de_parser is a Python module (Python 3.7+) that extracts data from German Wiktionary XML dump files. You can also add your own extraction methods.

Installation

pip3 install wiktionary_de_parser

Features

  • comes with preset extraction methods for:
    • flexion tables, genus, IPA, language, lemma, part of speech (basic), syllables, raw Wikitext
  • allows you to add your own extraction methods (pass them as an argument)
  • data values are normalized and cleaned of obsolete Wikitext markup
  • yields per section, not per page (a word can have multiple meanings, which is why some Wiktionary pages have multiple 'sections')
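Because records are yielded per section, several records can share the same page title. The following is a minimal sketch of grouping them back by title. The helper name group_by_title and the hand-written sample records are assumptions for illustration and are not part of the module; grouping a whole dump this way keeps every record in memory.

```python
from collections import defaultdict


def group_by_title(records):
    """Group section-level records by the page title they came from."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.get("title")].append(record)
    return dict(grouped)


# Hypothetical hand-written records standing in for parser output.
records = [
    {"title": "Bank", "pos": {"Substantiv": []}},
    {"title": "Bank", "pos": {"Substantiv": []}},
    {"title": "laufen", "pos": {"Verb": []}},
]

# Count how many sections each page contributed.
sections_per_page = {
    title: len(items) for title, items in group_by_title(records).items()
}
```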

Usage

Import wiktionary_de_parser like this:

from bz2 import BZ2File
from wiktionary_de_parser import Parser

bzfile_path = '/tmp/dewiktionary-latest-pages-articles-multistream.xml.bz2'
bz = BZ2File(bzfile_path)

for record in Parser(bz):
    # Skip records that are not German-language entries
    if record.get('lang_code') != 'de':
        continue
    # do stuff with 'record'

Note: This example uses BZ2File to read a compressed Wiktionary dump file directly. Dump files can be downloaded from the Wikimedia dumps site.
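The filter in the loop can also be moved into a small generator. The sketch below assumes records look like the sample data in this README, with a 'lang_code' string and a 'pos' dictionary keyed by part of speech. The names is_german_noun and german_nouns are hypothetical helpers, not part of the module.

```python
def is_german_noun(record):
    """Return True for German-language records tagged as 'Substantiv'."""
    return (
        record.get("lang_code") == "de"
        and "Substantiv" in record.get("pos", {})
    )


def german_nouns(records):
    """Lazily yield only the German noun records from any iterable of records."""
    for record in records:
        if is_german_noun(record):
            yield record


# Hypothetical records standing in for parser output.
sample = [
    {"title": "Trittbrettfahrer", "lang_code": "de", "pos": {"Substantiv": []}},
    {"title": "laufen", "lang_code": "de", "pos": {"Verb": []}},
    {"title": "house", "lang_code": "en", "pos": {"Substantiv": []}},
    {"title": "broken"},  # records without 'lang_code' are skipped
]
noun_titles = [record["title"] for record in german_nouns(sample)]
```

With the real parser, german_nouns(Parser(bz)) would be used in place of the sample list, so records are still processed one at a time.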

Adding new extraction methods

All extraction methods must return a dict and accept the following arguments:

  • title (string): The title of the current Wiktionary page
  • text (string): The Wikitext of the current word entry/section
  • current_record (dict): A dictionary with all values of the current iteration (e.g. current_record['lang_code'])

# Create a new extraction method
def my_method(title, text, current_record):
    # do stuff; as a placeholder, store the length of the section's Wikitext
    my_data = len(text)
    return {'my_field': my_data}

# Pass a list with all extraction methods to the class constructor:
for record in Parser(bz, custom_methods=[my_method]):
    print(record['my_field'])
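As a more concrete illustration, here is a sketch of an extraction method that collects translations from the Wikitext. It assumes translations appear as {{Ü|language code|word}} templates, as in the sample data in this README; other translation templates or extra template parameters are not handled. The function name extract_translations and the 'translations' field are hypothetical.

```python
import re

# Matches simple templates such as {{Ü|en|free rider}}.
# Assumption: exactly two parameters (language code, word).
TRANSLATION_RE = re.compile(r"\{\{Ü\|([^|}]+)\|([^|}]*)\}\}")


def extract_translations(title, text, current_record):
    """Return a dict mapping language codes to lists of translated words."""
    translations = {}
    for lang, word in TRANSLATION_RE.findall(text):
        word = word.strip()
        if word:  # skip empty placeholders such as {{Ü|es|}}
            translations.setdefault(lang, []).append(word)
    return {"translations": translations}


# Excerpt of the Wikitext from the sample record.
wikitext = (
    "*{{en}}: [1] {{Ü|en|free rider}}\n"
    "*{{fi}}: [1] {{Ü|fi|siipeilijä}}, {{Ü|fi|vapaamatkustaja}}\n"
    "*{{es}}: [1] {{Ü|es|}}\n"
)
result = extract_translations("Trittbrettfahrer", wikitext, {})
```

The method follows the contract described above: it accepts title, text and current_record and returns a dict, so it could be passed in the custom_methods list.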

Sample data:

{'flexion': {'Akkusativ Plural': 'Trittbrettfahrer',
             'Akkusativ Singular': 'Trittbrettfahrer',
             'Dativ Plural': 'Trittbrettfahrern',
             'Dativ Singular': 'Trittbrettfahrer',
             'Genitiv Plural': 'Trittbrettfahrer',
             'Genitiv Singular': 'Trittbrettfahrers',
             'Genus': 'm',
             'Nominativ Plural': 'Trittbrettfahrer',
             'Nominativ Singular': 'Trittbrettfahrer'},
 'inflected': False,
 'ipa': ['ˈtʁɪtbʁɛtˌfaːʁɐ'],
 'lang': 'Deutsch',
 'lang_code': 'de',
 'lemma': 'Trittbrettfahrer',
 'pos': {'Substantiv': []},
 'syllables': ['Tritt', 'brett', 'fah', 'rer'],
 'title': 'Trittbrettfahrer',
 'wikitext': '=== {{Wortart|Substantiv|Deutsch}}, {{m}} ===\n'
             '\n'
             '{{Deutsch Substantiv Übersicht\n'
             '|Genus=m\n'
             '|Nominativ Singular=Trittbrettfahrer\n'
             '|Nominativ Plural=Trittbrettfahrer\n'
             '|Genitiv Singular=Trittbrettfahrers\n'
             '|Genitiv Plural=Trittbrettfahrer\n'
             '|Dativ Singular=Trittbrettfahrer\n'
             '|Dativ Plural=Trittbrettfahrern\n'
             '|Akkusativ Singular=Trittbrettfahrer\n'
             '|Akkusativ Plural=Trittbrettfahrer\n'
             '}}\n'
             '\n'
             '{{Worttrennung}}\n'
             ':Tritt·brett·fah·rer, {{Pl.}} Tritt·brett·fah·rer\n'
             '\n'
             '{{Aussprache}}\n'
             ':{{IPA}} {{Lautschrift|ˈtʁɪtbʁɛtˌfaːʁɐ}}\n'
             ':{{Hörbeispiele}} {{Audio|}}\n'
             '\n'
             '{{Bedeutungen}}\n'
             ':[1] Person, die ohne [[Anstrengung]] an Vorteilen teilhaben '
             'will\n'
             '\n'
             '{{Herkunft}}\n'
             ':[[Determinativkompositum]] aus den Substantiven '
             "''[[Trittbrett]]'' und ''[[Fahrer]]''\n"
             '\n'
             '{{Weibliche Wortformen}}\n'
             ':[1] [[Trittbrettfahrerin]]\n'
             '\n'
             '{{Beispiele}}\n'
             ':[1] „Bleibt schließlich noch das Problem der '
             "''Trittbrettfahrer,'' die sich ohne Versicherung aus "
             'Nachlässigkeit in das soziale Netz abgleiten '
             'lassen.“<ref>{{Internetquelle|url=http://books.google.se/books?id=VjLq84xNpfMC&pg=PA446&dq=trittbrettfahrer&hl=de&sa=X&ei=8AztU4aVJYq_ygOd1oKIDA&ved=0CEEQ6AEwBjgK#v=onepage&q=trittbrettfahrer&f=false|titel=Öffentliche '
             'Finanzen in der Demokratie: Eine Einführung, Charles B. '
             'Blankart|zugriff=2014-08-14}}</ref>\n'
             '\n'
             '{{Wortbildungen}}\n'
             ':[1] [[Trittbrettfahrer-Problem]]\n'
             '\n'
             '==== {{Übersetzungen}} ====\n'
             '{{Ü-Tabelle|Ü-links=\n'
             '*{{en}}: [1] {{Ü|en|free rider}}\n'
             '*{{fi}}: [1] {{Ü|fi|siipeilijä}}, {{Ü|fi|vapaamatkustaja}}\n'
             '*{{fr}}: [1] {{Ü|fr|profiteur}}\n'
             '|Ü-rechts=\n'
             '*{{it}}: [1] {{Ü|it|scroccone}} {{m}}\n'
             '*{{es}}: [1] {{Ü|es|}}\n'
             '}}\n'
             '\n'
             '{{Referenzen}}\n'
             ':[1] {{Wikipedia|Trittbrettfahrer}}\n'
             ':[*] {{Ref-DWDS|Trittbrettfahrer}}\n'
             ':[*] {{Ref-Canoo|Trittbrettfahrer}}\n'
             ':[1] {{Ref-UniLeipzig|Trittbrettfahrer}}\n'
             ':[1] {{Ref-FreeDictionary|Trittbrettfahrer}}\n'
             '\n'
             '{{Quellen}}'}
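A record like the one above is a plain dictionary, so it can be processed with ordinary Python. The sketch below collects the distinct word forms from the 'flexion' table, assuming that every key starting with 'Genus' holds a grammatical gender rather than a word form. The helper name inflected_forms is hypothetical.

```python
def inflected_forms(record):
    """Return the sorted, distinct word forms from a record's 'flexion' table."""
    flexion = record.get("flexion", {})
    forms = {
        value
        for key, value in flexion.items()
        if not key.startswith("Genus")  # assumption: 'Genus' keys are not word forms
    }
    return sorted(forms)


# Shortened copy of the sample record.
record = {
    "title": "Trittbrettfahrer",
    "flexion": {
        "Genus": "m",
        "Nominativ Singular": "Trittbrettfahrer",
        "Nominativ Plural": "Trittbrettfahrer",
        "Genitiv Singular": "Trittbrettfahrers",
        "Dativ Plural": "Trittbrettfahrern",
    },
}
forms = inflected_forms(record)
```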

License

MIT © Gregor Weichbrodt
