
gpu_tester


gpu_tester finds all your bad GPUs.

Works on Slurm.

Features:

  • runs a forward pass on each GPU
  • checks for GPUs returning incorrect results
  • checks for GPUs failing due to ECC errors

Roadmap:

  • sanity-check forward speed
  • sanity-check broadcast speed

Install

Create a venv:

python3 -m venv .env
source .env/bin/activate
pip install -U pip

Then:

pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu116
pip install gpu_tester

Python examples

Check out the example below to call this as a library.
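
Here is a minimal sketch of a library call, assuming the package exposes the gpu_tester function described in the API section below (the exact import path is an assumption):

# Minimal sketch: run the default simple_forward test on one node.
# Assumption: gpu_tester is importable from the gpu_tester package,
# with the arguments documented in the API section below.
from gpu_tester import gpu_tester

gpu_tester(
    cluster="slurm",
    partition="compute-od-gpu",
    nodes=1,
    gpu_per_node=8,
    test_kind="simple_forward",
    job_timeout=150,
)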

Output

Output looks like this:

job succeeded
0 have incorrect results, 1 have gpu errors and 319 succeeded
incorrect results:
[]
gpu errors:
[['gpu_error', 'compute-od-gpu-st-p4d-24xlarge-156', '3']]

Recommended testing strategy

Pair based strategy

The easiest way to quickly spot broken nodes is to use the pair-based strategy. It runs many jobs in parallel and finds which nodes can talk to each other. Here is one example:

gpu_tester --nodes 2 --parallel-tests 50 --job_comment laion --partition "gpu" --test_kind "ddp" --job_timeout 45 --exclude 'gpu-st-p4d-24xlarge-[66]'

All at once strategy

Once you have validated that this works, you may want to try the DDP strategy over all nodes, e.g.:

gpu_tester --nodes 100 --parallel-tests 1 --job_comment laion --partition "gpu" --test_kind "ddp" --job_timeout 300 --exclude 'gpu-st-p4d-24xlarge-[66]'

Simple forward

If you only want to validate the forward functionality of the GPUs and not the communication between them, you may use:

gpu_tester --nodes 100 --parallel-tests 1 --job_comment laion --partition "gpu" --test_kind "simple_forward" --job_timeout 50 --exclude 'gpu-st-p4d-24xlarge-[66]'

API

This module exposes a single function, gpu_tester, which takes the same arguments as the command-line tool (see the example after the list):

  • cluster the cluster (default slurm)
  • job_name the Slurm job name (default gpu_tester)
  • partition the Slurm partition (default compute-od-gpu)
  • gpu_per_node number of GPUs per node (default 8)
  • nodes number of GPU nodes (default 1)
  • output_folder the output folder (default None, which means a results folder inside the current folder)
  • job_timeout job timeout in seconds (default 150)
  • job_comment optional comment arg given to Slurm (default None)
  • job_account optional account arg given to Slurm (default None)
  • test_kind simple_forward or ddp; simple_forward is a quick forward test, ddp uses PyTorch DDP to check the GPU interconnect (default simple_forward)
  • parallel_tests number of tests to run in parallel; recommended to use with nodes == 2 to test pair by pair (default 1)
  • nodelist node whitelist, for example 'gpu-st-p4d-24xlarge-[66-67]' (default None)
  • exclude node blacklist, for example 'gpu-st-p4d-24xlarge-[66-67]' (default None)
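
As a sketch, the pair-based strategy shown above could also be run from Python with these arguments (the import path is again an assumption):

# Sketch of the pair-based strategy as a library call.
# Assumption: gpu_tester is importable from the gpu_tester package.
from gpu_tester import gpu_tester

gpu_tester(
    partition="gpu",
    nodes=2,
    parallel_tests=50,
    test_kind="ddp",
    job_timeout=45,
    job_comment="laion",
    exclude="gpu-st-p4d-24xlarge-[66]",
)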

For development

Either locally, or in Gitpod (run export PIP_USER=false there).

Setup a virtualenv:

python3 -m venv .env
source .env/bin/activate
pip install -e .

To run tests, first install the test dependencies:

pip install -r requirements-test.txt

then run:

make lint
make test

You can use make black to reformat the code.

Use python -m pytest -x -s -v tests -k "dummy" to run a specific test.

