Transliterations to/from Indian languages

Indicate: Transliterate Indic Languages with TensorFlow and LLMs

Indicate provides high-quality transliteration between Indic languages and English using both traditional TensorFlow models and state-of-the-art LLMs (Large Language Models).

🚀 Features

  • 🧠 Dual Backend Support: Choose between TensorFlow models or LLM-based transliteration
  • 🌍 Multi-Language: 12+ Indic languages (Hindi, Tamil, Telugu, Bengali, etc.)
  • 🔄 Bidirectional: Supports both Indic→English and English→Indic transliteration
  • 🛡️ Production Ready: Safe file handling, atomic writes, backup support
  • 📊 Structured Output: Rich JSON format with metadata and error handling
  • ⚡ Batch Processing: Efficient processing of large files with progress tracking

🎯 Supported Languages

Hindi • Tamil • Telugu • Bengali • Gujarati • Kannada • Malayalam • Punjabi • Marathi • Odia • Urdu • Sanskrit ↔ English

Install

We strongly recommend installing indicate inside a Python virtual environment (see the venv documentation).

Requirements: Python 3.11 or 3.12 (TensorFlow does not yet support Python 3.13)

pip install indicate

🔧 Quick Setup

For LLM-based transliteration (recommended):

pip install indicate

# Set your API key (choose one):
export OPENAI_API_KEY=your-key
export ANTHROPIC_API_KEY=your-key  
export GOOGLE_API_KEY=your-key

For TensorFlow-only usage:

pip install indicate
# No API key needed - uses pre-trained models
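
Before calling the LLM backend from a script, it can help to check which of the API keys above is actually set. The following is a minimal sketch, not part of indicate: the helper name `detect_provider` and the provider labels are illustrative assumptions, and only the three environment variable names come from the setup instructions.

```python
import os
from typing import Mapping, Optional

# Environment variable names from the setup instructions above.
# The labels on the right are illustrative, not identifiers defined by indicate.
API_KEY_VARS = {
    "OPENAI_API_KEY": "openai",
    "ANTHROPIC_API_KEY": "anthropic",
    "GOOGLE_API_KEY": "google",
}


def detect_provider(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the label of the first provider whose key is set and non-empty."""
    for var, provider in API_KEY_VARS.items():
        if env.get(var, "").strip():
            return provider
    return None


if __name__ == "__main__":
    provider = detect_provider()
    if provider is None:
        print("No LLM API key found; only the TensorFlow backend is usable.")
    else:
        print(f"Found an API key for: {provider}")
```

Passing the environment as a mapping keeps the check easy to test without touching the real process environment.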

🎯 Usage

🧠 LLM-Based Transliteration (New!)

The LLM backend is generally more accurate than the TensorFlow backend and covers every language listed under Supported Languages:

# Simple transliteration (auto-detects Hindi)
indicate llm "राजशेखर चिंतालपति"
# Output: Rajashekar Chintalapati

# Specify languages explicitly  
indicate llm "முருகன்" --source tamil --target english
# Output: Murugan

# Between Indic languages
indicate llm "नमस्ते" --source hindi --target tamil  
# Output: நமஸ்தே

# Safe batch processing with structured JSON output
indicate llm --input names.txt --output results.json --format json --batch --backup

# Dry run to preview changes
indicate llm --input large_file.txt --dry-run

Python API:

from indicate import IndicLLMTransliterator

# Initialize for any language pair
transliterator = IndicLLMTransliterator('hindi', 'english')
result = transliterator.transliterate('राजशेखर चिंतालपति')
print(result)  # Output: Rajashekar Chintalapati

# Batch processing
texts = ["राजेश", "गौरव", "प्रिया"]
results = transliterator.transliterate_batch(texts)
print(results)  # ['Rajesh', 'Gaurav', 'Priya']
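
Because the LLM backend is billed per API call, input lists with many repeated names can be deduplicated before they are sent. The sketch below is a hypothetical helper, not part of indicate. It accepts any callable that maps one string to one string, for example `transliterator.transliterate` from the snippet above, and calls it once per distinct input.

```python
from typing import Callable, Dict, Iterable, List


def transliterate_unique(
    texts: Iterable[str],
    transliterate: Callable[[str], str],
) -> List[str]:
    """Call `transliterate` once per distinct input; return outputs in input order."""
    cache: Dict[str, str] = {}
    results: List[str] = []
    for text in texts:
        if text not in cache:
            # First time this string is seen: make the (possibly paid) call.
            cache[text] = transliterate(text)
        results.append(cache[text])
    return results
```

Assuming the `transliterator` object created above, usage would look like `transliterate_unique(texts, transliterator.transliterate)`.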

🤖 TensorFlow Backend (Traditional)

# Hindi to English using TensorFlow model
indicate hindi2english "राजशेखर चिंतालपति"
# Output: rajashekar chintalapati

# From file
indicate hindi2english --input hindi.txt --output english.txt

# Batch processing
indicate hindi2english --input large_file.txt --batch

Python API:

from indicate import hindi2english
result = hindi2english("हिंदी")
print(result)  # Output: hindi

📊 JSON Output Format

With --format json, the LLM backend writes structured output that carries run-level metadata and one record per input line:

{
  "metadata": {
    "source_language": "hindi",
    "target_language": "english", 
    "timestamp": "2024-12-09T12:00:00Z",
    "total_lines": 3,
    "successful_lines": 3,
    "failed_lines": 0,
    "encoding": "utf-8"
  },
  "results": [
    {
      "line_number": 1,
      "input_text": "राजेश कुमार",
      "output_text": "Rajesh Kumar", 
      "source_lang": "hindi",
      "target_lang": "english",
      "confidence": "high",
      "processing_time": 1.2,
      "timestamp": "2024-12-09T12:00:01Z"
    }
  ]
}
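
A file in this shape can be consumed with the standard library alone. The sketch below embeds the example document above as a string so it runs on its own. The `summarize` helper is an illustrative assumption, and it relies only on the field names shown in the example.

```python
import json

# The example document from above, embedded so the sketch is self-contained.
SAMPLE = """
{
  "metadata": {
    "source_language": "hindi",
    "target_language": "english",
    "timestamp": "2024-12-09T12:00:00Z",
    "total_lines": 3,
    "successful_lines": 3,
    "failed_lines": 0,
    "encoding": "utf-8"
  },
  "results": [
    {
      "line_number": 1,
      "input_text": "राजेश कुमार",
      "output_text": "Rajesh Kumar",
      "source_lang": "hindi",
      "target_lang": "english",
      "confidence": "high",
      "processing_time": 1.2,
      "timestamp": "2024-12-09T12:00:01Z"
    }
  ]
}
"""


def summarize(report: dict) -> dict:
    """Reduce a results document to its language pair, success rate, and text pairs."""
    meta = report["metadata"]
    total = meta["total_lines"]
    return {
        "pair": f'{meta["source_language"]}->{meta["target_language"]}',
        "success_rate": meta["successful_lines"] / total if total else 0.0,
        "pairs": [(r["input_text"], r["output_text"]) for r in report["results"]],
    }


if __name__ == "__main__":
    print(summarize(json.loads(SAMPLE)))
```

For a real run, replace `json.loads(SAMPLE)` with `json.load(f)` on a file opened with `encoding="utf-8"`, matching the `encoding` field in the metadata.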

🛡️ Safety Features

  • 🔒 Input/Output Validation: Prevents accidental file overwrites
  • ⚛️ Atomic Writing: Safe file operations using temporary files
  • 💾 Automatic Backups: Optional timestamped backups of existing files
  • 🔄 Resume Support: Resume interrupted batch operations
  • 👁️ Dry Run Mode: Preview operations before execution
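
To illustrate what atomic writing with a timestamped backup means in practice, here is a minimal sketch using only the standard library. It is not indicate's implementation: the function name `atomic_write` and the backup naming scheme are assumptions made for this example.

```python
import os
import shutil
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def atomic_write(path: str, text: str, backup: bool = False) -> Optional[Path]:
    """Write `text` to `path` via a temp file; return the backup path if one was made."""
    target = Path(path)
    backup_path = None
    if backup and target.exists():
        # Hypothetical naming scheme: <name>.<timestamp>.bak next to the original.
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        backup_path = target.with_name(f"{target.name}.{stamp}.bak")
        shutil.copy2(target, backup_path)

    # The temp file lives in the target's directory so the final rename
    # never crosses a filesystem boundary.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(target.parent), prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace swaps the file in one step, so readers see either the
        # old content or the new content, never a partial write.
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return backup_path
```

If the process dies before `os.replace`, the existing output file is untouched and only a stray `.tmp` file may remain.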

🎛️ Advanced Usage

# Show few-shot examples being used
indicate llm --show-examples --source bengali --target english

# Resume interrupted batch job
indicate llm --input large_file.txt --output results.txt --resume

# Use specific LLM provider/model
indicate llm "text" --provider anthropic --model claude-3-opus

# Process JSON from previous results
indicate llm --input results.json --source english --target hindi

🔄 Backend Comparison

| Feature | TensorFlow Backend | LLM Backend |
|---|---|---|
| Languages | Hindi ↔ English only | 12+ Indic languages ↔ English + Inter-Indic |
| Setup | No API key needed | Requires LLM API key |
| Speed | Very fast (local) | Moderate (API calls) |
| Accuracy | Good for common words | Excellent for all types |
| Cost | Free | Pay per API call |
| Offline | ✅ Works offline | ❌ Requires internet |
Batch Processing ✅ with safety features

🧪 Testing Locally

  1. Clone and install:

    git clone https://github.com/in-rolls/indicate.git
    cd indicate
    uv sync  # or pip install -e .
    
  2. Run tests:

    # All tests
    python -m pytest
    
    # Specific tests
    python -m pytest tests/test_llm_indic.py
    python -m pytest tests/test_file_safety.py
    
  3. Test both backends:

    # TensorFlow backend
    indicate hindi2english "हिंदी"
    
    # LLM backend (set API key first)
    export OPENAI_API_KEY=your-key
    indicate llm "हिंदी"
    

Data

The datasets used to train the model:

Evaluation

The model was evaluated on the test split of the Google Dakshina dataset, where 73.64% of its predictions were exact matches. For comparison, Indic-trans produced 63.12% exact matches on the same dataset.

The edit-distance metrics on the test set are shown below. A distance of 0.0 means an exact match; the larger the distance, the more the predicted text differs from the reference text:

[Figure: Edit distance metrics of the model on the Google Dakshina test dataset]
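
The two metrics above can be reproduced for any list of predictions and references. The sketch below uses plain Levenshtein distance; the README does not state which edit-distance variant or normalization was used, so treat that choice as an assumption. The sample strings in the `__main__` block are made up for illustration and are not evaluation data.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions from a to b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def exact_match_rate(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of predictions that equal their reference exactly."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


if __name__ == "__main__":
    # Made-up strings, for illustration only.
    preds = ["rajesh", "gorav"]
    refs = ["rajesh", "gaurav"]
    print(exact_match_rate(preds, refs))
    print([edit_distance(p, r) for p, r in zip(preds, refs)])
```

A distance of 0 corresponds to an exact match, so the exact-match rate is the share of pairs whose distance is 0.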

Authors

Rajashekar Chintalapati and Gaurav Sood

Contributor Code of Conduct

The project welcomes contributions from everyone! In fact, it depends on it. To maintain this welcoming atmosphere, and to collaborate in a fun and productive way, we expect contributors to the project to abide by the Contributor Code of Conduct.

License

The package is released under the MIT License.
