Natural-Language-Toolkit for bahasa Malaysia, powered by Deep Learning Tensorflow.

Project description

Malaya is a natural language toolkit library for Bahasa Malaysia, powered by deep learning with TensorFlow.

Documentation

Proper documentation is available at https://malaya.readthedocs.io/

Installing from PyPI

CPU version

$ pip install malaya

GPU version

$ pip install malaya-gpu

Only Python 3.6.0 and above and Tensorflow 1.15.0 and above are supported.

We recommend using virtualenv for development. All examples were tested on TensorFlow versions 1.15.4 and 2.4.1.
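The version requirements above can be checked before installing. The following is a minimal sketch, not part of Malaya: `parse_version` and `is_supported` are hypothetical helper names introduced here, and the TensorFlow version string is assumed to come from `tensorflow.__version__` in a real environment.

```python
import sys
from itertools import takewhile

# Minimum versions stated in this README.
MIN_PYTHON = (3, 6, 0)
MIN_TENSORFLOW = (1, 15, 0)


def parse_version(text):
    """Turn a string such as '2.4.1' or '2.4.0rc1' into a 3-tuple of ints."""
    parts = []
    for piece in text.split(".")[:3]:
        # Keep only the leading digits, so '0rc1' becomes 0.
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_supported(python_version, tensorflow_version):
    """Return True when both versions meet the stated minimums."""
    python_ok = tuple(python_version[:3]) >= MIN_PYTHON
    tensorflow_ok = parse_version(tensorflow_version) >= MIN_TENSORFLOW
    return python_ok and tensorflow_ok


if __name__ == "__main__":
    # In a real environment, pass tensorflow.__version__ instead of a literal.
    print(is_supported(sys.version_info, "2.4.1"))
```

Comparing tuples of integers avoids the string-comparison pitfall where "1.9.0" would sort after "1.15.0".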

Features

  • Augmentation, augment any text using a dictionary of synonyms, Wordvector or Transformer-Bahasa.

  • Constituency Parsing, breaking a text into sub-phrases using finetuned Transformer-Bahasa.

  • Dependency Parsing, extracting a dependency parse of a sentence using finetuned Transformer-Bahasa.

  • Emotion Analysis, detect and recognize 6 different emotions in texts using finetuned Transformer-Bahasa.

  • Entities Recognition, locate and classify named entities mentioned in text using finetuned Transformer-Bahasa.

  • Generator, generate any text given a context using T5-Bahasa, GPT2-Bahasa or Transformer-Bahasa.

  • Keyword Extraction, provide RAKE, TextRank and an Attention Mechanism hybrid with Transformer-Bahasa.

  • Language Detection, using fastText and a sparse deep learning model to classify Malay (formal and social media), Indonesian (formal and social media), Rojak language and Manglish.

  • Normalizer, using local Malaysian NLP research hybrid with Transformer-Bahasa to normalize any bahasa texts.

  • Num2Word, convert numbers to their cardinal or ordinal representation.

  • Paraphrase, provide Abstractive Paraphrase using T5-Bahasa and Transformer-Bahasa.

  • Part-of-Speech Recognition, grammatical tagging that marks up each word in a text with its part of speech using finetuned Transformer-Bahasa.

  • Relevancy Analysis, detect and recognize relevancy of texts using finetuned Transformer-Bahasa.

  • Sentiment Analysis, detect and recognize polarity of texts using finetuned Transformer-Bahasa.

  • Text Similarity, provide an interface for lexical similarity and deep semantic similarity using finetuned Transformer-Bahasa.

  • Spell Correction, using local Malaysian NLP research hybrid with Transformer-Bahasa to auto-correct any bahasa words.

  • Stemmer, using a state-of-the-art BPE LSTM Seq2Seq model with attention to do Bahasa stemming.

  • Subjectivity Analysis, detect and recognize self-opinion polarity of texts using finetuned Transformer-Bahasa.

  • Kesalahan Tatabahasa (grammatical errors), fix grammatical errors using TransformerTag-Bahasa.

  • Summarization, provide Abstractive T5-Bahasa and an Extractive interface using Transformer-Bahasa, skip-thought and Doc2Vec.

  • Topic Modelling, provide Transformer-Bahasa, LDA2Vec, LDA, NMF and LSA interface for easy topic modelling with topics visualization.

  • Toxicity Analysis, detect and recognize 27 different toxicity patterns of texts using finetuned Transformer-Bahasa.

  • Transformer, provide an easy interface to load Malaya's pretrained language models.

  • Translation, provide Neural Machine Translation using Transformer for EN to MS and MS to EN.

  • Word2Num, convert from cardinal or ordinal representation to numbers.

  • Word2Vec, provide pretrained bahasa wikipedia and bahasa news Word2Vec, with easy interface and visualization.

  • Zero-shot classification, provide Zero-shot classification interface using Transformer-Bahasa to recognize texts without any labeled training data.

  • Hybrid 8-bit Quantization, provide hybrid 8-bit quantization for all models to reduce inference time by up to 2x and model size by up to 4x.

  • Longer Sequences Transformer, provide BigBird + Pegasus for longer Abstractive Summarization, Neural Machine Translation and Relevancy Analysis sequences.

  • Distilled Transformer, provide distilled transformer models for Abstractive Summarization.
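The "up to 4x" model size figure in the quantization feature follows from storing each weight in 8 bits instead of a 32-bit float. The sketch below illustrates generic symmetric 8-bit quantization using only the standard library. It is an assumption-laden illustration of the idea, not Malaya's hybrid quantization; `quantize_int8` and `dequantize` are hypothetical names.

```python
from array import array


def quantize_int8(weights):
    """Map floats to signed 8-bit integers with one shared scale factor."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # The largest magnitude maps to 127; fall back to 1.0 for all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = array("b", (round(w / scale) for w in weights))
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit representation."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
float32_storage = array("f", weights)  # 4 bytes per weight
quantized, scale = quantize_int8(weights)  # 1 byte per weight

float_bytes = float32_storage.itemsize * len(float32_storage)
int8_bytes = quantized.itemsize * len(quantized)
size_ratio = float_bytes / int8_bytes
```

Each restored weight differs from the original by at most half of `scale`, which is the accuracy cost paid for the smaller representation.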

Pretrained Models

Malaya has also released Bahasa pretrained models; see Malaya/pretrained-model.

Alternatively, use the Hugging Face 🤗 Transformers library, https://huggingface.co/models?filter=ms

References

If you use our software for research, please cite:

@misc{Malaya,
  author = {Husein, Zolkepli},
  title = {Malaya},
  note = {Natural-Language-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huseinzol05/malaya}}
}

Acknowledgement

Thanks to KeyReply for sponsoring a private cloud to train Malaya models; without it, this library would collapse entirely.

Also, thanks to TensorFlow Research Cloud for free TPU access.

Contributing

Thank you for contributing to this library; it really helps a lot. Feel free to contact me with suggestions or to contribute in other forms. We accept everything, not just code!

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

malaya-4.2.4-py3-none-any.whl (2.2 MB view details)

Uploaded Python 3

File details

Details for the file malaya-4.2.4-py3-none-any.whl.

File metadata

  • Download URL: malaya-4.2.4-py3-none-any.whl
  • Upload date:
  • Size: 2.2 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.7

File hashes

Hashes for malaya-4.2.4-py3-none-any.whl
Algorithm Hash digest
SHA256 2930878e35d3c6de26714b360f3cd1676376afd04ccfe4d03cb3ed505563d384
MD5 f6085a32912df311c7c24cc081c30549
BLAKE2b-256 d4c98ca4d2b03a3e6982e050b8f4076c7b893d857978da5358cfd8cf8b152c02
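
A downloaded wheel can be checked against the SHA256 digest listed above. This is a minimal sketch using Python's standard `hashlib`; `sha256_of_file` and `matches_expected` are hypothetical helper names, and the local file path is an assumption about where the wheel was saved.

```python
import hashlib

# SHA256 digest listed for malaya-4.2.4-py3-none-any.whl in this document.
EXPECTED_SHA256 = "2930878e35d3c6de26714b360f3cd1676376afd04ccfe4d03cb3ed505563d384"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large wheels are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 digest equals the expected one."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_expected("malaya-4.2.4-py3-none-any.whl")` would return `True` only if the file in the current directory is byte-for-byte identical to the published one.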