
Triton inference service for LLaMA

Project description

 

Introduction

LLMDeploy provides a Triton inference service for serving LLaMA-family large language models with a FasterTransformer backend.

Installation

Below are quick steps for installation:

conda create -n open-mmlab python=3.8
conda activate open-mmlab
git clone https://github.com/open-mmlab/llmdeploy.git
cd llmdeploy
pip install -e .
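
As a quick sanity check, you can confirm that the editable install is visible to the environment. This assumes the package installs a top-level llmdeploy module matching the repository name; adjust if the module name differs.

# optional sanity check: the package should now be importable
python3 -c "import llmdeploy"
pip show llmdeploy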

Quick Start

Build

Pull the docker image openmmlab/llmdeploy:base and build the llmdeploy libraries inside a container launched from it:
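
The launch command for that container is not shown in the project docs; the sketch below is an assumption, mounting the cloned repository into the container and enabling GPU access (the flags and mount paths are illustrative only):

# pull the base image and start an interactive container with the repo mounted
docker pull openmmlab/llmdeploy:base
docker run --gpus all -it --rm \
    -v $(pwd):/workspace/llmdeploy -w /workspace/llmdeploy \
    openmmlab/llmdeploy:base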

mkdir build && cd build
../generate.sh
make -j$(nproc) && make install
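
After make install, the FasterTransformer backend libraries should land under build/install relative to the repository root, which is the directory later passed to --lib-dir when starting the service. A quick check, run from the repository root:

# the backend directory referenced by --lib-dir in the serving steps below
ls build/install/backends/fastertransformer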

Serving LLaMA

Weights for the LLaMA models can be obtained by filling out this form.

Run one of the following commands to serve a LLaMA model on an NVIDIA GPU server (a parameterized sketch of the shared pattern follows the list):

7B
python3 llmdeploy/serve/fastertransformer/deploy.py llama-7B /path/to/llama-7b llama \
    --tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
13B
python3 llmdeploy/serve/fastertransformer/deploy.py llama-13B /path/to/llama-13b llama \
    --tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
33B
python3 llmdeploy/serve/fastertransformer/deploy.py llama-33B /path/to/llama-33b llama \
    --tokenizer_path /path/to/tokenizer/model --tp 4
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
65B
python3 llmdeploy/serve/fastertransformer/deploy.py llama-65B /path/to/llama-65b llama \
    --tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
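
The four variants above differ only in the model name, the weight path, and the --tp value, which sets the tensor-parallel degree, i.e. the number of GPUs used. A parameterized sketch of the shared pattern, shown here with the 13B values and placeholder paths (the variable names are illustrative, not part of the tooling):

MODEL=llama-13B
WEIGHTS=/path/to/llama-13b
TP=2    # 2 for 13B, 4 for 33B, 8 for 65B; the 7B command above omits --tp
python3 llmdeploy/serve/fastertransformer/deploy.py ${MODEL} ${WEIGHTS} llama \
    --tokenizer_path /path/to/tokenizer/model --tp ${TP}
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer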

Serving Vicuna

7B
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
  --base-model-path /path/to/llama-7b \
  --target-model-path /path/to/vicuna-7b \
  --delta-path lmsys/vicuna-7b-delta-v1.1

python3 llmdeploy/serve/fastertransformer/deploy.py vicuna-7B /path/to/vicuna-7b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
13B
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
  --base-model-path /path/to/llama-13b \
  --target-model-path /path/to/vicuna-13b \
  --delta-path lmsys/vicuna-13b-delta-v1.1

python3 llmdeploy/serve/fastertransformer/deploy.py vicuna-13B /path/to/vicuna-13b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
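
fastchat.model.apply_delta merges the published Vicuna delta weights into the base LLaMA weights and writes a standard Hugging Face checkpoint to the target path, which is presumably why deploy.py is invoked with the hf format here instead of llama. A quick, hypothetical check before deploying:

# sanity check (assumption): the merged directory should hold a Hugging Face-style
# checkpoint (config.json, tokenizer files, *.bin weight shards)
ls /path/to/vicuna-7b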

Inference with Command Line Interface

python3 llmdeploy/serve/client.py {server_ip_address}:33337 1
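
For example, with a hypothetical server address (the trailing argument is assumed to be a session id):

# hypothetical address; replace with the host running the serving container
python3 llmdeploy/serve/client.py 192.168.0.10:33337 1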

Inference with Web UI

python3 llmdeploy/webui/app.py {server_ip_address}:33337 model_name
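
For example, with a hypothetical address and model name (model_name is assumed to match the name given to deploy.py, such as llama-7B or vicuna-13B):

# hypothetical address and model name
python3 llmdeploy/webui/app.py 192.168.0.10:33337 llama-7B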

User Guide

Quantization

In fp16 mode, kv_cache int8 quantization can be enabled so that a single card can serve more users. First run the quantization script; the resulting quantization parameters are stored in the weight directory produced by deploy.py. Then adjust config.ini as shown in the fragment below:

  • set use_context_fmha to 0, which turns context FMHA off
  • set quant_policy to 4 (it defaults to 0, meaning quantization is not enabled)
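
For reference, the relevant config.ini fragment after these changes would look like the sketch below (only the two keys discussed above are shown; the file's exact layout and remaining keys are omitted):

use_context_fmha = 0
quant_policy = 4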

Contributing

We appreciate all contributions to LLMDeploy. Please refer to CONTRIBUTING.md for the contributing guidelines.

Acknowledgement

License

This project is released under the Apache 2.0 license.


Built Distribution

llmdeploy-0.0.1-py3-none-any.whl (24.4 kB, uploaded for Python 3)
