End-to-End Language Model Pipeline built for training speed
lm: The Language Model Pipeline
There are few frameworks that focus on sequence-to-sequence neural network models; the most notable are the ones built by Google and Facebook. This repository focuses on seq2seq and language modeling (next-token prediction) using an opinionated end-to-end setup. The project's objective is a production pipeline that runs end to end and contains all the steps required to train state-of-the-art language models.
It leverages:
- Mesh TensorFlow to train on 8, 32, 256, or 512 TPUs
- jsonnet configuration files
- Docker/Kubeflow for orchestrating the various experiments
- absl for process management, flags, and unit tests
It uses and supports ONLY:
- TensorFlow (1.15.2+)
- TensorFlow Mesh
- TPUs (possibly GPU clusters in the future)
- Docker / Kubeflow setup
Useful Commands
TL;DR
# install the library and create gpt2 encoded binaries
pip3 install lm
export INPUT=/tmp/some/path/
export OUTPUT=/tmp/some/path/
lm hashsort ${INPUT} /tmp/lm.index.uniq.txt
lm encode /tmp/lm.index.uniq.txt ${OUTPUT}
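The hashsort step produces a deduplicated index of the input text. A minimal sketch of the underlying idea, assuming deduplication works by hashing each line's content and keeping the first occurrence (an assumption for illustration; the real `lm hashsort` command may use a different scheme):

```python
import hashlib

def hashsort_lines(lines):
    """Deduplicate lines by content hash, then sort for stable output.

    Toy illustration of hash-based dedup; not the actual `lm hashsort`
    implementation.
    """
    seen = {}
    for line in lines:
        digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
        seen.setdefault(digest, line)  # keep first occurrence of each hash
    return sorted(seen.values())

print(hashsort_lines(["b", "a", "b", "c", "a"]))  # prints ['a', 'b', 'c']
```

Hashing before comparing avoids holding every full line in a comparison-heavy structure, which matters when the corpus is far larger than memory.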
lm encode
Turns text files into GPT-2 encoded .tfrecords files.
mkdir -p /tmp/datasets/tfrecords/
ENCODE_INPUT=/tmp/datasets/txt
ENCODE_OUTPUT=/tmp/datasets/tfrecords/
NAME=my_dataset
# short
lm encode ${ENCODE_INPUT} ${ENCODE_OUTPUT}
# expanded
lm encode \
--name $NAME \
--encoder gpt2 \
--size 200MiB \
--nproc 0 \
--compress zlib \
${ENCODE_INPUT} \
${ENCODE_OUTPUT}
Add
--size 300
to pack 300 MiB (300 * 2^20 bytes) of uncompressed input text into each tfrecord file.
Set --nproc 1
to disable multiprocessing (useful for debugging).
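To illustrate the --size arithmetic, a quick sketch (illustrative only; the actual packing logic of `lm encode` may differ) of how many tfrecord shards a corpus splits into at a given shard size:

```python
import math

MiB = 2 ** 20  # 1 MiB = 2^20 bytes

def num_shards(corpus_bytes, shard_mib):
    """Number of tfrecord files produced when packing `corpus_bytes` of
    uncompressed text into shards of `shard_mib` MiB each (hypothetical
    helper for illustration)."""
    return math.ceil(corpus_bytes / (shard_mib * MiB))

# A 1 GiB corpus with --size 300 splits into ceil(1024 / 300) shards.
print(num_shards(1024 * MiB, 300))  # prints 4
```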
License
- Free software: Apache Software License 2.0
Credits
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
Sponsor the Project
☕ Buy me a triple espresso