Yet Another Tokenizer for Thai

AttaCut


TL;DR: a 3-layer dilated CNN on character and syllable features

Installation

# only for beta version
$ pip install attacut

Usage

Command-Line Interface

$ attacut-cli -h
AttaCut: Fast and Reasonably Accurate Tokenizer for Thai

Usage:
  attacut-cli <src> [--dest=<dest>] [--model=<model>]
  attacut-cli (-h | --help)

Options:
  -h --help         Show this screen.
  --model=<model>   Model to be used [default: attacut-sc].
  --dest=<dest>     If not specified, it'll be <src>-tokenized-by-<model>.txt
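
The usage string above can also be assembled from a script, for example when tokenizing many files in a loop. The following is a minimal sketch, assuming `attacut-cli` is installed and on the `PATH`; the helper names `build_attacut_command` and `run_attacut` are hypothetical and not part of AttaCut.

```python
import subprocess
from typing import List, Optional


def build_attacut_command(
    src: str, dest: Optional[str] = None, model: Optional[str] = None
) -> List[str]:
    """Build an argv list that follows the documented usage string.

    Hypothetical helper, not part of AttaCut. Options left as None are
    omitted so that the CLI falls back to its own defaults.
    """
    cmd = ["attacut-cli", src]
    if dest is not None:
        cmd.append(f"--dest={dest}")
    if model is not None:
        cmd.append(f"--model={model}")
    return cmd


def run_attacut(
    src: str, dest: Optional[str] = None, model: Optional[str] = None
) -> None:
    """Run the CLI; assumes `attacut-cli` is installed and on the PATH."""
    subprocess.run(build_attacut_command(src, dest, model), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file names contain spaces.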

Higher-Level Interface

a.k.a. importing AttaCut as a module

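from attacut import Tokenizer

# Thai text to segment (example input)
txt = "ทดสอบการตัดคำภาษาไทย"

# load the model once, then reuse the tokenizer
atta = Tokenizer(model="attacut-sc")
words = atta.tokenize(txt)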
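
When tokenizing a whole file, it is usually cheaper to create the tokenizer once and reuse it for every line. The following is a minimal sketch under stated assumptions: the tokenizer object is assumed to expose a `tokenize(text)` method that returns a list of strings, and `tokenize_lines` and `demo` are hypothetical helpers, not part of AttaCut.

```python
from typing import Iterable, List


def tokenize_lines(tokenizer, lines: Iterable[str], sep: str = "|") -> List[str]:
    """Tokenize each non-empty line and join its tokens with `sep`.

    `tokenizer` is any object with a `tokenize(text)` method returning a
    list of strings. That an AttaCut `Tokenizer` behaves this way is an
    assumption of this sketch.
    """
    out = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines instead of sending them to the model
        out.append(sep.join(tokenizer.tokenize(text)))
    return out


def demo() -> None:
    """Requires `pip install attacut`; nothing here runs on import."""
    from attacut import Tokenizer

    atta = Tokenizer(model="attacut-sc")  # load the model once
    print(tokenize_lines(atta, ["ทดสอบการตัดคำภาษาไทย"]))
```

Because `tokenize_lines` only relies on the `tokenize` method, it can be exercised with a stand-in tokenizer in tests, without loading the model.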

Development

Please refer to DEVELOPMENT.md

Related Resources

Acknowledgements

