Extracts data from German Wiktionary dump files. Allows you to add your own extraction methods 🚀

wiktionary_de_parser

wiktionary_de_parser is a Python module to extract data from German Wiktionary XML files. It allows you to add your own extraction methods.

Requirements

  • Python 3.7 (other 3.x versions might work, but are untested)

Features

  • comes with preset extraction methods for:
    • flexion tables, genus, IPA, language, lemma, part of speech, syllables, raw Wikitext
  • allows you to add your own extraction methods (pass them as argument)
  • data values are normalized and stripped of extraneous Wikitext markup
  • yields per section, not per page (a word can have multiple meanings, which is why some Wiktionary pages have multiple 'sections')
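
Because records arrive per section, a page with several sections produces several records that share a title. The sketch below shows one way to regroup them by page. It assumes each record carries a 'title' key, as in the sample data further down; the helper name group_by_title and the sample records are hypothetical and not part of the module.

```python
from collections import defaultdict


def group_by_title(records):
    """Collect section-level records under their page title."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record['title']].append(record)
    return dict(grouped)


# Hypothetical records standing in for what the parser might yield
sample_records = [
    {'title': 'Band', 'language': 'Deutsch'},
    {'title': 'Band', 'language': 'Englisch'},
    {'title': 'Trittbrettfahrer', 'language': 'Deutsch'},
]

sections = group_by_title(sample_records)
for title, records in sections.items():
    print(title, len(records))
```

Note that grouping a full dump this way keeps every record in memory, so filtering first may be preferable.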

Usage

  1. Install via pip3 install wiktionary_de_parser.
  2. Import wiktionary_de_parser like this:
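from bz2file import BZ2File
from wiktionary_de_parser import Parser

# Path to the compressed Wiktionary dump (adjust to your system)
bzfile_path = 'C:/Users/Gregor/Downloads/dewiktionary-latest-pages-articles-multistream.xml.bz2'
bz = BZ2File(bzfile_path)

# The parser yields one record per section
for record in Parser(bz):
    # Skip sections that are not German-language entries
    if 'language' not in record or record['language'] != 'Deutsch':
        continue
    # do stuff with 'record'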

Note: this example uses BZ2File to read a compressed Wiktionary dump file. Dump files are available from the Wikimedia dumps site.
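
The language filter from the loop above can also be wrapped in a small generator so that it is reusable. This is only a sketch: german_records is a hypothetical helper, and it works on any iterable of dictionaries, so the sample list here stands in for Parser(bz).

```python
def german_records(records):
    """Yield only the records whose 'language' field is 'Deutsch'."""
    for record in records:
        # .get() returns None when the key is missing, so such records are skipped
        if record.get('language') == 'Deutsch':
            yield record


# Hypothetical records; in practice you would pass Parser(bz) instead
sample = [
    {'title': 'Haus', 'language': 'Deutsch'},
    {'title': 'house', 'language': 'Englisch'},
    {'title': 'Unbekannt'},  # no 'language' key at all
]

titles = [record['title'] for record in german_records(sample)]
print(titles)
```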

Adding new extraction methods

All extraction methods must return a dict and accept the following arguments:

  • title (string): The title of the current Wiktionary page
  • text (string): The Wikitext of the current word entry/section
  • current_record (Dict): A dictionary with all values of the current iteration (e.g. current_record['language'])

# Create a new extraction method
def my_method(title, text, current_record):
    # do stuff; my_data is a placeholder for whatever you extract
    my_data = None
    return {'my_field': my_data}

# Pass a list with all extraction methods to the class constructor:
for record in Parser(bz, custom_methods=[my_method]):
    print(record['my_field'])
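
As a more concrete illustration, here is a sketch of an extraction method that collects translations from {{Ü|…}} templates in the Wikitext. The function name extract_translations, the 'translations' field and the regular expression are assumptions made for this example; the pattern only covers the simple template form seen in the sample data and may miss other variants.

```python
import re

# Matches {{Ü|<language code>|<word>}}, optionally with further parameters
TRANSLATION_RE = re.compile(r'\{\{Ü\|([^|}]+)\|([^|}]*)(?:\|[^}]*)?\}\}')


def extract_translations(title, text, current_record):
    """Return translations found in the section, grouped by language code."""
    translations = {}
    for lang, word in TRANSLATION_RE.findall(text):
        word = word.strip()
        if word:  # skip empty placeholders such as {{Ü|es|}}
            translations.setdefault(lang, []).append(word)
    return {'translations': translations}


# Excerpt modelled on the translation table in the sample data
sample_text = (
    '*{{en}}: [1] {{Ü|en|free rider}}\n'
    '*{{fi}}: [1] {{Ü|fi|siipeilijä}}, {{Ü|fi|vapaamatkustaja}}\n'
    '*{{es}}: [1] {{Ü|es|}}\n'
)

result = extract_translations('Trittbrettfahrer', sample_text, {})
print(result)
```

The method could then be passed to the parser in the same way as my_method, via custom_methods=[extract_translations].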

Sample data:

{'flexion': {'Akkusativ Plural': 'Trittbrettfahrer',
             'Akkusativ Singular': 'Trittbrettfahrer',
             'Dativ Plural': 'Trittbrettfahrern',
             'Dativ Singular': 'Trittbrettfahrer',
             'Genitiv Plural': 'Trittbrettfahrer',
             'Genitiv Singular': 'Trittbrettfahrers',
             'Genus': 'm',
             'Nominativ Plural': 'Trittbrettfahrer',
             'Nominativ Singular': 'Trittbrettfahrer'},
 'inflected': False,
 'ipa': 'ˈtʁɪtbʁɛtˌfaːʁɐ',
 'language': 'Deutsch',
 'lemma': 'Trittbrettfahrer',
 'pos': {'Substantiv': []},
 'syllables': ['Tritt', 'brett', 'fah', 'rer'],
 'title': 'Trittbrettfahrer',
 'wikitext': '=== {{Wortart|Substantiv|Deutsch}}, {{m}} ===\n'
             '\n'
             '{{Deutsch Substantiv Übersicht\n'
             '|Genus=m\n'
             '|Nominativ Singular=Trittbrettfahrer\n'
             '|Nominativ Plural=Trittbrettfahrer\n'
             '|Genitiv Singular=Trittbrettfahrers\n'
             '|Genitiv Plural=Trittbrettfahrer\n'
             '|Dativ Singular=Trittbrettfahrer\n'
             '|Dativ Plural=Trittbrettfahrern\n'
             '|Akkusativ Singular=Trittbrettfahrer\n'
             '|Akkusativ Plural=Trittbrettfahrer\n'
             '}}\n'
             '\n'
             '{{Worttrennung}}\n'
             ':Tritt·brett·fah·rer, {{Pl.}} Tritt·brett·fah·rer\n'
             '\n'
             '{{Aussprache}}\n'
             ':{{IPA}} {{Lautschrift|ˈtʁɪtbʁɛtˌfaːʁɐ}}\n'
             ':{{Hörbeispiele}} {{Audio|}}\n'
             '\n'
             '{{Bedeutungen}}\n'
             ':[1] Person, die ohne [[Anstrengung]] an Vorteilen teilhaben '
             'will\n'
             '\n'
             '{{Herkunft}}\n'
             ':[[Determinativkompositum]] aus den Substantiven '
             "''[[Trittbrett]]'' und ''[[Fahrer]]''\n"
             '\n'
             '{{Weibliche Wortformen}}\n'
             ':[1] [[Trittbrettfahrerin]]\n'
             '\n'
             '{{Beispiele}}\n'
             ':[1] „Bleibt schließlich noch das Problem der '
             "''Trittbrettfahrer,'' die sich ohne Versicherung aus "
             'Nachlässigkeit in das soziale Netz abgleiten '
             'lassen.“<ref>{{Internetquelle|url=http://books.google.se/books?id=VjLq84xNpfMC&pg=PA446&dq=trittbrettfahrer&hl=de&sa=X&ei=8AztU4aVJYq_ygOd1oKIDA&ved=0CEEQ6AEwBjgK#v=onepage&q=trittbrettfahrer&f=false|titel=Öffentliche '
             'Finanzen in der Demokratie: Eine Einführung, Charles B. '
             'Blankart|zugriff=2014-08-14}}</ref>\n'
             '\n'
             '{{Wortbildungen}}\n'
             ':[1] [[Trittbrettfahrer-Problem]]\n'
             '\n'
             '==== {{Übersetzungen}} ====\n'
             '{{Ü-Tabelle|Ü-links=\n'
             '*{{en}}: [1] {{Ü|en|free rider}}\n'
             '*{{fi}}: [1] {{Ü|fi|siipeilijä}}, {{Ü|fi|vapaamatkustaja}}\n'
             '*{{fr}}: [1] {{Ü|fr|profiteur}}\n'
             '|Ü-rechts=\n'
             '*{{it}}: [1] {{Ü|it|scroccone}} {{m}}\n'
             '*{{es}}: [1] {{Ü|es|}}\n'
             '}}\n'
             '\n'
             '{{Referenzen}}\n'
             ':[1] {{Wikipedia|Trittbrettfahrer}}\n'
             ':[*] {{Ref-DWDS|Trittbrettfahrer}}\n'
             ':[*] {{Ref-Canoo|Trittbrettfahrer}}\n'
             ':[1] {{Ref-UniLeipzig|Trittbrettfahrer}}\n'
             ':[1] {{Ref-FreeDictionary|Trittbrettfahrer}}\n'
             '\n'
             '{{Quellen}}'}
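
Fields of such a record can be read like any other dictionary. The following sketch uses a shortened copy of the sample record above; the helper names distinct_forms and hyphenate are hypothetical and only illustrate how the 'flexion' and 'syllables' fields might be used.

```python
# Shortened copy of the sample record shown above
record = {
    'flexion': {
        'Akkusativ Plural': 'Trittbrettfahrer',
        'Akkusativ Singular': 'Trittbrettfahrer',
        'Dativ Plural': 'Trittbrettfahrern',
        'Dativ Singular': 'Trittbrettfahrer',
        'Genitiv Plural': 'Trittbrettfahrer',
        'Genitiv Singular': 'Trittbrettfahrers',
        'Genus': 'm',
        'Nominativ Plural': 'Trittbrettfahrer',
        'Nominativ Singular': 'Trittbrettfahrer',
    },
    'lemma': 'Trittbrettfahrer',
    'syllables': ['Tritt', 'brett', 'fah', 'rer'],
}


def distinct_forms(record):
    """Return the sorted set of inflected forms, ignoring the 'Genus' entry."""
    flexion = record.get('flexion', {})
    return sorted({form for key, form in flexion.items() if key != 'Genus'})


def hyphenate(record, separator='·'):
    """Join the syllables into a hyphenation string."""
    return separator.join(record.get('syllables', []))


forms = distinct_forms(record)
hyphenation = hyphenate(record)
print(forms)
print(hyphenation)
```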

Vendor packages

License

MIT © Gregor Weichbrodt

Download files

Source Distribution

wiktionary_de_parser-0.7.3.tar.gz (12.4 kB)

Uploaded Source

Built Distribution

wiktionary_de_parser-0.7.3-py3-none-any.whl (12.2 kB)

Uploaded Python 3
