
Triton inference service for LLaMA

Project description

 

Introduction

LLMDeploy provides a Triton inference service for serving LLaMA-family large language models with a FasterTransformer backend.

Installation

Below are quick steps for installation:

conda create -n open-mmlab python=3.8
conda activate open-mmlab
git clone https://github.com/open-mmlab/llmdeploy.git
cd llmdeploy
pip install -e .
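
As a quick sanity check, you can confirm that the editable install is visible to the environment. This assumes the package installs a top-level llmdeploy module matching the repository name; adjust if the module name differs.

# optional sanity check: the package should now be importable
python3 -c "import llmdeploy"
pip show llmdeploy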

Quick Start

Build

Pull the docker image openmmlab/llmdeploy:base and build the llmdeploy libraries inside a container launched from it:
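
The launch command for that container is not shown in the project docs; the sketch below is an assumption, mounting the cloned repository into the container and enabling GPU access (the flags and mount paths are illustrative only):

# pull the base image and start an interactive container with the repo mounted
docker pull openmmlab/llmdeploy:base
docker run --gpus all -it --rm \
    -v $(pwd):/workspace/llmdeploy -w /workspace/llmdeploy \
    openmmlab/llmdeploy:base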

mkdir build && cd build
../generate.sh
make -j$(nproc) && make install
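
After make install, the FasterTransformer backend libraries should land under build/install relative to the repository root, which is the directory later passed to --lib-dir when starting the service. A quick check, run from the repository root:

# the backend directory referenced by --lib-dir in the serving steps below
ls build/install/backends/fastertransformer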

Serving LLaMA

Weights for the LLaMA models can be obtained by filling out this form.

Run one of the following commands to serve a LLaMA model on an NVIDIA GPU server (a parameterized sketch of the shared pattern follows the list):

7B
python3 llmdeploy/serve/fastertransformer/deploy.py llama-7B /path/to/llama-7b llama \
    --tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
13B
python3 llmdeploy/serve/fastertransformer/deploy.py llama-13B /path/to/llama-13b llama \
    --tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
33B
python3 llmdeploy/serve/fastertransformer/deploy.py llama-33B /path/to/llama-33b llama \
    --tokenizer_path /path/to/tokenizer/model --tp 4
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
65B
python3 llmdeploy/serve/fastertransformer/deploy.py llama-65B /path/to/llama-65b llama \
    --tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
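
The four variants above differ only in the model name, the weight path, and the --tp value, which sets the tensor-parallel degree, i.e. the number of GPUs used. A parameterized sketch of the shared pattern, shown here with the 13B values and placeholder paths (the variable names are illustrative, not part of the tooling):

MODEL=llama-13B
WEIGHTS=/path/to/llama-13b
TP=2    # 2 for 13B, 4 for 33B, 8 for 65B; the 7B command above omits --tp
python3 llmdeploy/serve/fastertransformer/deploy.py ${MODEL} ${WEIGHTS} llama \
    --tokenizer_path /path/to/tokenizer/model --tp ${TP}
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer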

Serving Vicuna

7B
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
  --base-model-path /path/to/llama-7b \
  --target-model-path /path/to/vicuna-7b \
  --delta-path lmsys/vicuna-7b-delta-v1.1

python3 llmdeploy/serve/fastertransformer/deploy.py vicuna-7B /path/to/vicuna-7b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
13B
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
  --base-model-path /path/to/llama-13b \
  --target-model-path /path/to/vicuna-13b \
  --delta-path lmsys/vicuna-13b-delta-v1.1

python3 llmdeploy/serve/fastertransformer/deploy.py vicuna-13B /path/to/vicuna-13b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
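
fastchat.model.apply_delta merges the published Vicuna delta weights into the base LLaMA weights and writes a standard Hugging Face checkpoint to the target path, which is presumably why deploy.py is invoked with the hf format here instead of llama. A quick, hypothetical check before deploying:

# sanity check (assumption): the merged directory should hold a Hugging Face-style
# checkpoint (config.json, tokenizer files, *.bin weight shards)
ls /path/to/vicuna-7b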

Inference with Command Line Interface

python3 llmdeploy/serve/client.py {server_ip_address}:33337 1
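
For example, with a hypothetical server address (the trailing argument is assumed to be a session id):

# hypothetical address; replace with the host running the serving container
python3 llmdeploy/serve/client.py 192.168.0.10:33337 1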

Inference with Web UI

python3 llmdeploy/webui/app.py {server_ip_address}:33337 model_name
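
For example, with a hypothetical address and model name (model_name is assumed to match the name given to deploy.py, such as llama-7B or vicuna-13B):

# hypothetical address and model name
python3 llmdeploy/webui/app.py 192.168.0.10:33337 llama-7B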

User Guide

Quantization

In fp16 mode, kv_cache int8 quantization can be enabled so that a single card can serve more users. First run the quantization script; the resulting quantization parameters are stored in the weight directory produced by deploy.py. Then adjust config.ini as shown in the fragment below:

  • set use_context_fmha to 0, which turns context FMHA off
  • set quant_policy to 4 (it defaults to 0, meaning quantization is not enabled)
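
For reference, the relevant config.ini fragment after these changes would look like the sketch below (only the two keys discussed above are shown; the file's exact layout and remaining keys are omitted):

use_context_fmha = 0
quant_policy = 4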

Contributing

We appreciate all contributions to LLMDeploy. Please refer to CONTRIBUTING.md for the contributing guidelines.

Acknowledgement

License

This project is released under the Apache 2.0 license.


Built Distribution

llmdeploy-0.0.1-py3-none-any.whl (24.4 kB, uploaded for Python 3)
