Setup for training TensorFlow models on SLURM clusters.

Project description

scoach

A setup for training TensorFlow models on SLURM clusters

How does it work?

  • Inputs needed (see the examples directory):
    • A .json file with the training parameters
    • A .json file with the model definition
    • A .py file with the training code
  • A CLI app is provided for interacting with scoach:
    • Run scoach init to set up your configuration file (see config_example.yaml for a sample).
    • On the login machine of the SLURM cluster, run scoach start. This starts a daemon that launches jobs as they are requested.
    • On any machine, run scoach run submit to submit jobs. This uploads the Python script to MinIO and submits the configurations to the database.
  • The daemon process consumes the new runs, uses Jinja2 to render the training script, and submits it to the cluster.
  • The training script then runs on the cluster using Dask workers, whose number grows as needed.
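The daemon step described above (pick up new runs, render a job script, hand it to the cluster) can be sketched roughly as follows. This is only an illustration under stated assumptions, not scoach's actual code: the run dictionary layout (name, script, params, status), the template contents, and the function names are all hypothetical, and the standard-library string.Template stands in for Jinja2 so that the sketch has no third-party dependencies.

```python
import json
import shlex
from string import Template

# Hypothetical batch-script template. scoach renders its script with Jinja2;
# string.Template is used here only as a dependency-free stand-in.
JOB_TEMPLATE = Template(
    "#!/bin/bash\n"
    "#SBATCH --job-name=${job_name}\n"
    "#SBATCH --output=${job_name}.out\n"
    "python ${script} --params ${params_json}\n"
)


def render_job_script(run):
    """Render a batch script for one run (assumed keys: name, script, params)."""
    return JOB_TEMPLATE.substitute(
        job_name=run["name"],
        # Quote values so spaces or quotes in them survive the shell.
        script=shlex.quote(run["script"]),
        params_json=shlex.quote(json.dumps(run["params"], sort_keys=True)),
    )


def consume_pending(runs):
    """Yield (run name, rendered script) for every run still marked pending."""
    for run in runs:
        if run["status"] == "pending":
            yield run["name"], render_job_script(run)
```

A real daemon would read the pending runs from the database, write each rendered script to a file, and pass that file to sbatch. Those steps are left out here because they depend on the cluster and on the configuration created by scoach init.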

To do

  • Add a --local option to scoach start for launching runs locally
  • Add support for uploading and managing datasets
  • Avoid uploading duplicate Python scripts

Download files

Source Distribution

scoach-0.1.7a0.tar.gz (25.5 kB)

Uploaded Source

Built Distribution

scoach-0.1.7a0-py3-none-any.whl (41.4 kB)

Uploaded Python 3

File details

Details for the file scoach-0.1.7a0.tar.gz.

File metadata

  • Download URL: scoach-0.1.7a0.tar.gz
  • Upload date:
  • Size: 25.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.11 CPython/3.9.5 Linux/5.10.0-8-amd64

File hashes

Hashes for scoach-0.1.7a0.tar.gz
Algorithm    Hash digest
SHA256       176ddc627de08c55515185b6f5667b61bd5cee05a022aed3e133e22b89a4ddab
MD5          979219c45eec109df7e2e8e047f9931e
BLAKE2b-256  e8de7951bd7592bc18d138edf08eb0d369e0815d850f2c3c4f38718a6f2a1e14

File details

Details for the file scoach-0.1.7a0-py3-none-any.whl.

File metadata

  • Download URL: scoach-0.1.7a0-py3-none-any.whl
  • Upload date:
  • Size: 41.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.11 CPython/3.9.5 Linux/5.10.0-8-amd64

File hashes

Hashes for scoach-0.1.7a0-py3-none-any.whl
Algorithm    Hash digest
SHA256       ff96a8f0895d56a862966493799d3b82152ec0c55173ad4e4af7713989ea5c34
MD5          3bfa0bc020ecaf128165d22542bd51c6
BLAKE2b-256  92d79a3cc7ca3dc8ec9dfd14cf52c5b9a8ae9e360181cbb21fbfb6d9ea8f9cc1
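
A downloaded file can be checked against the SHA256 digests listed for each distribution. The following is a minimal sketch using only the standard-library hashlib module; the helper names sha256_of and matches_digest are illustrative and are not part of scoach.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published one, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, calling matches_digest with the path of a downloaded scoach-0.1.7a0.tar.gz and the SHA256 value listed for that file should return True for an intact download.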
