
End to End Language Model Pipeline built for training speed


lm: The Language Model Pipeline


Few frameworks focus on sequence-to-sequence neural network models; the most notable are those built by Google and Facebook. This repository focuses on seq2seq and language modeling (next-token prediction) using an opinionated end-to-end setup. The project's objective is a production pipeline that runs end to end and contains all the professional steps required to achieve state-of-the-art language models.

It leverages:

  • Mesh TensorFlow to train on 8, 32, 256, or 512 TPUs
  • jsonnet configuration files
  • Docker/Kubeflow for orchestrating the various experiments
  • absl for process management, flags, and unit testing
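As a concrete illustration of the absl piece, here is a minimal sketch of how absl flags can define a pipeline's command-line surface. The flag names below are illustrative only, chosen to mirror the lm encode options shown later in this page; they are not lm's actual flag definitions.

```python
# Hedged sketch: defining a CLI with absl flags.
# Flag names are illustrative, not lm's actual implementation.
from absl import flags

FLAGS = flags.FLAGS
flags.DEFINE_string("encoder", "gpt2", "Tokenizer used to encode input text.")
flags.DEFINE_integer("nproc", 0, "Worker processes; 0 picks automatically.")

# Parse a simulated command line (argv[0] is the program name).
FLAGS(["lm", "--encoder=gpt2", "--nproc=2"])
print(FLAGS.encoder, FLAGS.nproc)
```

absl also validates flag types at parse time, so a malformed value such as --nproc=abc fails fast instead of propagating into the pipeline.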

It uses and supports ONLY:

  • TensorFlow (1.15.2+)
  • TensorFlow Mesh
  • TPUs (GPU clusters maybe in the future)
  • Docker / Kubeflow setup

Useful Commands

TL;DR

# install the library and create gpt2 encoded binaries
pip3 install lm
export INPUT=/tmp/some/path/
export OUTPUT=/tmp/some/path/
lm hashsort ${INPUT} /tmp/lm.index.uniq.txt
lm encode /tmp/lm.index.uniq.txt ${OUTPUT}
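The hashsort step above emits a lm.index.uniq.txt file, which suggests it deduplicates the input before encoding. The snippet below sketches one way content-hash deduplication can work; it is an assumption-laden illustration of the general technique, not lm's actual hashsort code.

```python
# Hedged sketch of content-hash line deduplication, the kind of step a
# command like `lm hashsort` might perform (not lm's actual implementation).
import hashlib

def unique_lines(lines):
    """Yield each line the first time its content hash is seen."""
    seen = set()
    for line in lines:
        digest = hashlib.sha256(line.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield line

print(list(unique_lines(["a", "b", "a"])))  # → ['a', 'b']
```

Hashing before comparing keeps memory proportional to the number of distinct lines rather than their total length, which matters at corpus scale.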

lm encode

Turns text files into gpt2 encoded .tfrecords.

mkdir -p /tmp/datasets/tfrecords/

ENCODE_INPUT=/tmp/datasets/txt
ENCODE_OUTPUT=/tmp/datasets/tfrecords/
NAME=my_dataset

# short
lm encode ${ENCODE_INPUT} ${ENCODE_OUTPUT} 

# expanded 
lm encode \
    --name $NAME \
    --encoder gpt2 \
    --size 200MiB \
    --nproc 0 \
    --compress zlib \
    ${ENCODE_INPUT} \
    ${ENCODE_OUTPUT} 

Add --size 300 to put 300MiB (300 * 2^20 bytes) of uncompressed input text into each tfrecord file.
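The arithmetic behind that note is simple but worth making explicit, since "MiB" means a binary megabyte (2^20 bytes, not 10^6). A quick sketch, using a hypothetical helper name:

```python
# Sketch: how a MiB-denominated --size value translates to bytes per shard.
# `shard_bytes` is a hypothetical helper for illustration, not lm's API.
def shard_bytes(size_mib: int) -> int:
    """Return the uncompressed byte budget for one tfrecord shard."""
    return size_mib * 2 ** 20

print(shard_bytes(300))  # → 314572800 bytes per shard
```

So --size 300 yields shards of roughly 314.6 MB of uncompressed text each, before any zlib compression is applied.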

Set --nproc 1 to disable multiprocessing (useful for debugging).
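The nproc convention above (0 = automatic, 1 = serial) follows a common pattern, sketched below with the standard library. This is an illustration of that pattern under the stated assumption that 0 means one worker per CPU core, not lm's actual implementation.

```python
# Hedged sketch of the 0-means-auto worker convention described above
# (illustrative only, not lm's actual code).
import multiprocessing

def resolve_workers(nproc: int) -> int:
    """Map an --nproc style flag to a worker count: 0 = one per CPU core."""
    if nproc == 0:
        return multiprocessing.cpu_count()
    return nproc
```

With --nproc 1 the resolved count is 1, so work runs serially in the main process, which keeps stack traces readable while debugging.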

License

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

Sponsor the Project

Buy me a triple espresso



Download files

Download the file for your platform.

Source Distribution

lm-0.2.0a0.tar.gz (48.1 kB)

Built Distribution

lm-0.2.0a0-py2.py3-none-any.whl (56.3 kB)
