# gpu_tester

gpu_tester finds all your bad GPUs. Works on slurm.
## Features

- does a forward pass on each GPU
- checks for GPUs returning incorrect results
- checks for GPUs failing due to ECC errors
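At its core, the incorrect-results check amounts to running the same deterministic computation on every GPU and flagging any worker whose output differs from a reference value. Here is a minimal CPU-only sketch of that comparison logic; `fake_forward`, `check_workers`, and the worker names are illustrative stand-ins, not the tool's actual API (which runs a real torch forward pass per GPU):

```python
# Sketch of the incorrect-results check: every worker runs the same
# deterministic computation; any worker whose output differs from the
# reference is flagged. All names here are illustrative.

def fake_forward(seed: int) -> float:
    # Stand-in for a deterministic GPU forward pass.
    x = float(seed)
    for _ in range(10):
        x = (x * 3.0 + 1.0) % 97.0
    return x

def check_workers(outputs: dict) -> list:
    """Return the names of workers whose result differs from the reference."""
    reference = fake_forward(42)
    return [name for name, value in outputs.items() if value != reference]

outputs = {"gpu-0": fake_forward(42), "gpu-1": fake_forward(42), "gpu-2": 0.0}
bad = check_workers(outputs)  # ["gpu-2"]
```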
## Roadmap

- sanity check forward speed
- sanity check broadcast speed
## Install

Create a venv:

```bash
python3 -m venv .env
source .env/bin/activate
pip install -U pip
```

Then:

```bash
pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu116
pip install gpu_tester
```
## Python examples

Check out these examples to call this as a lib:
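The example files themselves are not reproduced here, but since the module exposes a single `gpu_tester` function taking the same arguments as the command line tool (see the API section), a library call can be sketched as follows. The argument values are illustrative, and the sketch assumes the package is installed and that the function accepts the documented flag names as keyword arguments:

```python
def run_pair_test():
    """Launch a pair-based DDP test via the library API.

    Sketch only: assumes `pip install gpu_tester` has been run and that
    the exposed gpu_tester() function accepts the documented arguments.
    """
    from gpu_tester import gpu_tester

    gpu_tester(
        cluster="slurm",
        nodes=2,
        parallel_tests=50,  # test nodes pair by pair
        partition="gpu",
        test_kind="ddp",
        job_timeout=45,
    )
```

Call `run_pair_test()` from your own script once the package is available.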
## Output

Output looks like this:

```
job succeeded
0 have incorrect results, 1 have gpu errors and 319 succeeded
incorrect results:
[]
gpu errors:
[['gpu_error', 'compute-od-gpu-st-p4d-24xlarge-156', '3']]
```
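Each entry in the `gpu errors` list is an `[error_kind, hostname, gpu_id]` triple, so post-processing it is straightforward. A small sketch that groups errors by node (the sample data is copied from the output above):

```python
# Group reported errors by node; the sample data mirrors the output above.
from collections import defaultdict

gpu_errors = [["gpu_error", "compute-od-gpu-st-p4d-24xlarge-156", "3"]]

errors_by_node = defaultdict(list)
for kind, hostname, gpu_id in gpu_errors:
    errors_by_node[hostname].append((kind, int(gpu_id)))

bad_nodes = sorted(errors_by_node)  # node names to feed back into --exclude
```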
## Recommended testing strategy

### Pair-based strategy

The easiest way to quickly spot broken nodes is the pair-based strategy. It runs many jobs in parallel and finds which nodes can talk together. Here is one example:

```bash
gpu_tester --nodes 2 --parallel-tests 50 --job_comment laion --partition "gpu" --test_kind "ddp" --job_timeout 45 --exclude 'gpu-st-p4d-24xlarge-[66]'
```
### All-at-once strategy

Once you have validated that this works, you may want to try the DDP strategy over all nodes, e.g.:

```bash
gpu_tester --nodes 100 --parallel-tests 1 --job_comment laion --partition "gpu" --test_kind "ddp" --job_timeout 300 --exclude 'gpu-st-p4d-24xlarge-[66]'
```
### Simple forward

If you want to validate only the forward functionality of the GPUs and not the communication, you may use:

```bash
gpu_tester --nodes 100 --parallel-tests 1 --job_comment laion --partition "gpu" --test_kind "simple_forward" --job_timeout 50 --exclude 'gpu-st-p4d-24xlarge-[66]'
```
## API

This module exposes a single function `gpu_tester`, which takes the same arguments as the command line tool:

- **cluster** the cluster. (default *slurm*)
- **job_name** slurm job name. (default *gpu_tester*)
- **partition** slurm partition. (default *compute-od-gpu*)
- **gpu_per_node** number of gpus per node. (default *8*)
- **nodes** number of gpu nodes. (default *1*)
- **output_folder** the output folder. (default *None*, which means current folder / results)
- **job_timeout** job timeout. (default *150* seconds)
- **job_comment** optional comment arg given to slurm. (default *None*)
- **job_account** optional account arg given to slurm. (default *None*)
- **test_kind** *simple_forward* or *ddp*. simple_forward is a quick forward test; ddp uses pytorch DDP to check the gpu interconnect. (default *simple_forward*)
- **parallel_tests** number of tests to run in parallel. Recommended to use with `nodes == 2` to test pair by pair. (default *1*)
- **nodelist** node whitelist, example `'gpu-st-p4d-24xlarge-[66-67]'`. (default *None*)
- **exclude** node blacklist, example `'gpu-st-p4d-24xlarge-[66-67]'`. (default *None*)
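Collected as keyword arguments, the documented defaults look like this. This is only a restatement of the list above as a dict; it assumes the function's parameter names match the documented flag names exactly:

```python
# Documented default values of gpu_tester()'s arguments, collected as
# keyword arguments. Names are taken from the docs above, not introspected.
default_args = {
    "cluster": "slurm",
    "job_name": "gpu_tester",
    "partition": "compute-od-gpu",
    "gpu_per_node": 8,
    "nodes": 1,
    "output_folder": None,   # None means <current folder>/results
    "job_timeout": 150,      # seconds
    "job_comment": None,
    "job_account": None,
    "test_kind": "simple_forward",
    "parallel_tests": 1,
    "nodelist": None,
    "exclude": None,
}
```

Overriding a few keys then mirrors the CLI invocations, e.g. `{**default_args, "nodes": 2, "parallel_tests": 50, "test_kind": "ddp"}` for the pair-based strategy.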
## For development

Either locally, or in gitpod (do `export PIP_USER=false` there).

Setup a virtualenv:

```bash
python3 -m venv .env
source .env/bin/activate
pip install -e .
```
To run tests:

```bash
pip install -r requirements-test.txt
```

Then:

```bash
make lint
make test
```

You can use `make black` to reformat the code.

Use `python -m pytest -x -s -v tests -k "dummy"` to run a specific test.