Skip to main content

Bagua is a deep learning training acceleration framework for PyTorch. It provides a one-stop training acceleration solution, including faster distributed training compared to PyTorch DDP, faster dataloader, kernel fusion, and more.

Project description


tutorials Documentation Status Downloads Docker Pulls Docker Cloud Build Status GitHub license

Bagua is a deep learning training acceleration framework for PyTorch developed by AI platform@Kuaishou Technology and DS3 Lab@ETH Zürich. Bagua currently supports:

  • Advanced Distributed Training Algorithms: Users can extend the training on a single GPU to multi-GPUs (may across multiple machines) by simply adding a few lines of code (optionally in elastic mode). One prominent feature of Bagua is to provide a flexible system abstraction that supports state-of-the-art system relaxation techniques of distributed training. So far, Bagua has integrated communication primitives including
  • Cached Dataset: When data loading is slow or data preprocessing is tedious, they could become a major bottleneck of the whole training process. Bagua provides cached dataset to speedup this process by caching data samples in memory, so that reading these samples after the first time becomes much faster.
  • TCP Communication Acceleration (Bagua-Net): Bagua-Net is a low level communication acceleration feature provided by Bagua. It can greatly improve the throughput of AllReduce on TCP network. You can enable Bagua-Net optimization on any distributed training job that uses NCCL to do GPU communication (this includes PyTorch-DDP, Horovod, DeepSpeed, and more).
  • Performance Autotuning: Bagua can automatically tune system parameters to achieve the highest throughput.
  • Generic Fused Optimizer: Bagua provides generic fused optimizer which improve the performance of optimizers by fusing the optimizer .step() operation on multiple layers. It can be applied to arbitrary PyTorch optimizer, in contrast to NVIDIA Apex's approach, where only some specific optimizers are implemented.
  • Load Balanced Data Loader: When the computation complexity of samples in training data are different, for example in NLP and speech tasks, where each sample have different lengths, distributed training throughput can be greatly improved by using Bagua's load balanced data loader, which distributes samples in a way that each worker's workload are similar.

Its effectiveness has been evaluated in various scenarios, including VGG and ResNet on ImageNet, BERT Large and many industrial applications at Kuaishou.

Links

Performance

The performance of different systems and algorithms on VGG16 with 128 GPUs under different network bandwidth.



Epoch time of BERT-Large Finetune under different network conditions for different systems.

For more comprehensive and up to date results, refer to Bagua benchmark page.

Installation

Wheels (precompiled binary packages) are available for Linux (x86_64). Package names are different depending on your CUDA Toolkit version (CUDA Toolkit version is shown in nvcc --version).

CUDA Toolkit version Installation command
>= v10.2 pip install bagua-cuda102
>= v11.1 pip install bagua-cuda111
>= v11.3 pip install bagua-cuda113

Add --pre to pip install commands to install pre-release (development) versions. See Bagua tutorials for quick start guide and more installation options.

Quick Start on AWS

Thanks to the Amazon Machine Images (AMI), we can provide users an easy way to deploy and run Bagua on AWS EC2 clusters with flexible size of machines and a wide range of GPU types. Users can find our pre-installed Bagua image on EC2 by a unique AMI-ID that we publish here.

Bagua version AMI ID Region
0.6.3 ami-0e719d0e3e42b397e us-east-1

To manage the EC2 cluster more efficiently, we use Starcluster as a toolkit to manipulate the cluster. In the config file of Starcluster, there are a few configurations that need to be set up by users, including AWS credentials, cluster settings, etc. More information regarding the Starcluster configuration can be found in this tutorial. Note that AMI is a regional resource, so you need to specify the AMI ID and its corresponding EC2 region at the same time.

For example, we create a EC2 cluster with 4 machines (p3.16xlarge), each of which has 8 V100 GPUs. The cluster is based on the Bagua AMI we pre-installed in us-east-1 region. Then the config file of Starcluster would be:

# region of EC2 instances, here we choose us_east_1
AWS_REGION_NAME = us-east-1
AWS_REGION_HOST = ec2.us-east-1.amazonaws.com
# AMI ID of Bagua
NODE_IMAGE_ID = ami-0e719d0e3e42b397e
# number of instances
CLUSTER_SIZE = 4
# instance type
NODE_INSTANCE_TYPE = p3.16xlarge

With above setup, we created two identical clusters to benchmark a synthesized image classification task over Bagua and Horovod, respectively. Here is the screen recording video of this experiment.

Cite Bagua

% System Overview
@misc{gan2021bagua,
  title={BAGUA: Scaling up Distributed Learning with System Relaxations}, 
  author={Shaoduo Gan and Xiangru Lian and Rui Wang and Jianbin Chang and Chengjun Liu and Hongmei Shi and Shengzhuo Zhang and Xianghong Li and Tengxu Sun and Jiawei Jiang and Binhang Yuan and Sen Yang and Ji Liu and Ce Zhang},
  year={2021},
  eprint={2107.01499},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

% Theory on System Relaxation Techniques
@book{liu2020distributed,
  title={Distributed Learning Systems with First-Order Methods: An Introduction},
  author={Liu, J. and Zhang, C.},
  isbn={9781680837018},
  series={Foundations and trends in databases},
  url={https://books.google.com/books?id=vzQmzgEACAAJ},
  year={2020},
  publisher={now publishers}
}

Discussion

Feel free to join our Telegram Group for discussion!

You can also scan the following QR code to join our WeChat group :)

Contributors

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

bagua_cuda113-0.9.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.8 MB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ x86-64

bagua_cuda113-0.9.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.8 MB view details)

Uploaded CPython 3.8manylinux: glibc 2.17+ x86-64

bagua_cuda113-0.9.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.8 MB view details)

Uploaded CPython 3.7mmanylinux: glibc 2.17+ x86-64

File details

Details for the file bagua_cuda113-0.9.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for bagua_cuda113-0.9.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 a3b5c8fc660b14daf1093bd2c97a66f051e3aedcdf02aa9c782a85b72f875d9d
MD5 e35e923153228334296f36745323486b
BLAKE2b-256 705794357fcace00b54e1b8b808115d31364435826189078f6e6ee4fd7eef381

See more details on using hashes here.

File details

Details for the file bagua_cuda113-0.9.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for bagua_cuda113-0.9.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 2196238ad2860acf749733a439c110c2d0d80f6265cfc2f8ccbf5bb75e73ab73
MD5 a0b35985893e0780decd0f2ca87b4e16
BLAKE2b-256 133907bb555975074fab486e78b5e600edf50ecf11e708da3802e40e110ffaef

See more details on using hashes here.

File details

Details for the file bagua_cuda113-0.9.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for bagua_cuda113-0.9.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 c818e2065444c26658b92d279dd922584111e167df17986d975a5f4883ffbe8d
MD5 5348705c31b48d4f794da434e2faabed
BLAKE2b-256 06e7389de5306539f51cfa66ec04620fe4d4891b755155c1b958299ff0397882

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page