tatoebatools

A library for downloading and reading data from Tatoeba

Project description

By allowing you to easily download and parse monolingual data files, tatoebatools helps you to integrate Tatoeba into your codebase more quickly.

Installation

This library requires Python 3.6 or later.

pip3 install tatoebatools

Basic Usage

Use the high-level ParallelCorpus class to automatically download and iterate over all sentence/translation pairs from a source language to a target language.

>>> from tatoebatools import ParallelCorpus
>>> for sentence, translation in ParallelCorpus("cmn", "eng"):
...     print((sentence.text, translation.text))
...
('那里有八块小圆石。', 'There were eight pebbles there.')
('这个椅子坐着不舒服。', 'This chair is uncomfortable.')
('我会在这里等着到他回来的。', 'Until he comes back, I will wait here.')
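
The pairs yielded by the loop above can be saved for later use. The following is a minimal sketch, not part of tatoebatools: the write_pairs_tsv helper and the Sentence stand-in are hypothetical names, and the only thing assumed about the yielded objects is the .text attribute shown in the example.

```python
import csv
import io
from collections import namedtuple

# Hypothetical stand-in for the objects yielded by ParallelCorpus;
# only the .text attribute used in the example above is assumed.
Sentence = namedtuple("Sentence", ["text"])


def write_pairs_tsv(pairs, out_file):
    """Write (sentence, translation) pairs as tab-separated rows.

    Returns the number of rows written.
    """
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    count = 0
    for sentence, translation in pairs:
        writer.writerow([sentence.text, translation.text])
        count += 1
    return count


# Stand-in data copied from the output above. With the library installed,
# ParallelCorpus("cmn", "eng") could be passed in place of this list.
sample_pairs = [
    (Sentence("这个椅子坐着不舒服。"), Sentence("This chair is uncomfortable.")),
]
buffer = io.StringIO()
rows_written = write_pairs_tsv(sample_pairs, buffer)
```

Writing to any file-like object keeps the helper easy to test; passing an open file instead of the in-memory buffer would store the pairs on disk.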

Advanced Usage

The data files are handled by the tatoeba object.

>>> from tatoebatools import tatoeba

Use the all_tables attribute to list the tables you can have access to:

>>> tatoeba.all_tables
['jpn_indices', 'links', ... , 'user_languages', 'user_lists']

Each table has its own set of attributes:

Table	Attributes
sentences_detailed	sentence_id, lang, text, username, date_added, date_last_modified
sentences_base	sentence_id, base_of_the_sentence
sentences_CC0	sentence_id, lang, text, date_last_modified
links	sentence_id, translation_id
tags	sentence_id, tag_name
sentences_in_lists	list_id, sentence_id
jpn_indices	sentence_id, meaning_id, text
sentences_with_audio	sentence_id, username, license, attribution_url
user_languages	lang, skill_level, username, details
transcriptions	sentence_id, lang, script_name, username, transcription
user_lists	list_id, username, date_created, date_last_modified, list_name, editable_by

Find out more about the Tatoeba data files and their fields on the Tatoeba downloads page.
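
When deciding which table to read, it can help to look attributes up programmatically. The sketch below simply transcribes the table above into a dictionary; TABLE_ATTRIBUTES and tables_with are hypothetical names for illustration and are not provided by tatoebatools.

```python
# Attribute names transcribed from the table above (not read from the library).
TABLE_ATTRIBUTES = {
    "sentences_detailed": ["sentence_id", "lang", "text", "username",
                           "date_added", "date_last_modified"],
    "sentences_base": ["sentence_id", "base_of_the_sentence"],
    "sentences_CC0": ["sentence_id", "lang", "text", "date_last_modified"],
    "links": ["sentence_id", "translation_id"],
    "tags": ["sentence_id", "tag_name"],
    "sentences_in_lists": ["list_id", "sentence_id"],
    "jpn_indices": ["sentence_id", "meaning_id", "text"],
    "sentences_with_audio": ["sentence_id", "username", "license",
                             "attribution_url"],
    "user_languages": ["lang", "skill_level", "username", "details"],
    "transcriptions": ["sentence_id", "lang", "script_name", "username",
                       "transcription"],
    "user_lists": ["list_id", "username", "date_created",
                   "date_last_modified", "list_name", "editable_by"],
}


def tables_with(attribute):
    """Return the names of the tables that list the given attribute."""
    return [name for name, attrs in TABLE_ATTRIBUTES.items() if attribute in attrs]


# Tables whose rows carry a username.
username_tables = tables_with("username")
```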

You can call all_languages to list the languages supported by Tatoeba:

>>> tatoeba.all_languages
['abk', 'acm', 'ady', ... , 'zsm', 'zul', 'zza']

Iterating over a table

To read a table, just call its iterator. The downloading of data files will be automatically handled in the background.

Set the scope argument to 'added' to read only the rows that did not exist in the previous version of an updated file. Set it to 'removed' to iterate over the rows that no longer exist.
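
The meaning of the two scope values can be illustrated with plain set differences. This is only a conceptual sketch of what 'added' and 'removed' describe, not how tatoebatools computes them; split_by_scope is a hypothetical helper and the rows are made-up illustrative data.

```python
def split_by_scope(previous_rows, current_rows):
    """Compare two versions of a table and group the differing rows by scope."""
    previous, current = set(previous_rows), set(current_rows)
    return {
        "added": current - previous,    # rows only in the new version
        "removed": previous - current,  # rows only in the old version
    }


# Made-up rows in the form (sentence_id, lang, text), for illustration only.
old_version = {(1, "eng", "Hello."), (2, "eng", "Goodbye.")}
new_version = {(2, "eng", "Goodbye."), (3, "eng", "Thanks.")}

changes = split_by_scope(old_version, new_version)
```

Here changes["added"] holds the row with id 3 and changes["removed"] holds the row with id 1; the unchanged row with id 2 appears in neither set.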

Examples

List all sentences in English:

>>> english_texts = [s.text for s in tatoeba.sentences_detailed("eng")]

List all German sentences that were added by the latest update:

>>> new_german_texts = [s.text for s in tatoeba.sentences_detailed("deu", scope="added")]

List all links between French and Italian sentences:

>>> links = [(lk.sentence_id, lk.translation_id) for lk in tatoeba.links("fra", "ita")]

List the usernames of all native French speakers:

>>> native_french = [x.username for x in tatoeba.user_languages("fra") if x.skill_level == 5]
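
The tables can also be combined. As a rough sketch, links can be joined with sentence texts to obtain translation pairs; the namedtuples below are hypothetical stand-ins that carry only the attribute names listed in the attributes table, the ids and texts are made up, and pair_texts is not a tatoebatools function.

```python
from collections import namedtuple

# Hypothetical stand-ins for rows of the sentences_detailed and links tables.
Sentence = namedtuple("Sentence", ["sentence_id", "text"])
Link = namedtuple("Link", ["sentence_id", "translation_id"])


def pair_texts(source_sentences, target_sentences, links):
    """Resolve link ids to (source text, target text) pairs.

    Links whose ids are missing from either side are skipped.
    """
    source_text = {s.sentence_id: s.text for s in source_sentences}
    target_text = {s.sentence_id: s.text for s in target_sentences}
    return [
        (source_text[lk.sentence_id], target_text[lk.translation_id])
        for lk in links
        if lk.sentence_id in source_text and lk.translation_id in target_text
    ]


# Made-up rows for illustration only.
french = [Sentence(10, "Bonjour."), Sentence(11, "Merci.")]
italian = [Sentence(20, "Buongiorno."), Sentence(21, "Grazie.")]
fra_ita_links = [Link(10, 20), Link(11, 21), Link(12, 22)]

text_pairs = pair_texts(french, italian, fra_ita_links)
```

With the library installed, the three arguments could presumably be fed from tatoeba.sentences_detailed("fra"), tatoeba.sentences_detailed("ita"), and tatoeba.links("fra", "ita"), as in the examples above. Building the two dictionaries first keeps each lookup constant-time, which matters for tables of this size.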

Project details

Release history

Version	Release date
0.2.3	Dec 10, 2023
0.2.2	Jun 22, 2023
0.2.1	Jun 25, 2022
0.2.0	Sep 17, 2021
0.1.1	Mar 25, 2021
0.1.0	Dec 1, 2020
0.0.9	Nov 7, 2020
0.0.8	Oct 29, 2020
0.0.7 (this version)	Oct 10, 2020
0.0.6	Oct 4, 2020
0.0.5	Sep 29, 2020
0.0.4	Sep 22, 2020
0.0.2	Aug 25, 2020

Download files

Source Distribution

tatoebatools-0.0.7.tar.gz (26.7 kB)

Uploaded Oct 10, 2020 Source

Built Distribution

tatoebatools-0.0.7-py3-none-any.whl (40.2 kB)

Uploaded Oct 10, 2020 Python 3

Hashes for tatoebatools-0.0.7.tar.gz
Algorithm	Hash digest
SHA256	`79966f5f850d6dbf923ff7bc12c2fd4eadca6c205c03f0b42ec5c625bb883330`
MD5	`e0850e19e349ff3aac36b43f1126df79`
BLAKE2b-256	`f7c9b1ee63c1eed35d22bd20a57e7a074326fefc6db2feb541847f053895e271`

Hashes for tatoebatools-0.0.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c1e279f8a93b9fcc59f572bc66edaa1fd4c69597669d1df170123bd064ded92e`
MD5	`4ce06379a4a2e6a4023c3156922a56b6`
BLAKE2b-256	`2dd91e5423bbf6659cd4134d9a17b0e3e780d82bf16775e052e9cfdc8a069847`