
Unified flow interface for data augmentation for LLM training.


🌊 uniflow


uniflow is a unified interface for solving the data augmentation problem in LLM training.

Built by CambioML.

Quick Install

pip3 install uniflow

See the Installation section below for full details.

Overview

For all the flows, you must first import the Client interface and the specific uniflow configuration that you want to use.

from uniflow.client import Client
from uniflow.config import UNIFLOW_CONFIG

Then you can create a Client object to run a particular flow.

client = Client(UNIFLOW_CONFIG)

Here is a table of the pre-defined configurations you can use, with their corresponding Flows, Model Configs, and LLMs:

| Config | Flow | Model Config | LLM |
| --- | --- | --- | --- |
| Config | ModelFlow | ModelConfig | gpt-3.5-turbo-1106 |
| OpenAIConfig | ModelFlow | OpenAIModelConfig | gpt-3.5-turbo-1106 |
| OpenAIJsonConfig | OpenAIJsonModelFlow | ModelConfig | gpt-3.5-turbo-1106 |
| OpenAIFewShotConfig | FewShotModelFlow | OpenAIModelConfig | gpt-3.5-turbo-1106 |
| HuggingfaceConfig | HuggingFaceModelFlow | HuggingfaceModelConfig | mistralai/Mistral-7B-Instruct-v0.1 |
| LMQGConfig | LMQGModelFlow | LMQGModelConfig | lmqg/t5-base-squad-qg-ae |

You can also run your own custom flows by passing custom configurations and/or parameters to the Client object. See the Custom Configuration section for more details.

Every flow takes a list of dictionaries as input. Each dictionary in the input list should have the data to prompt in a format similar to one of the following:

  1. A simple list of context dicts, which can be used with a few-shot model:

    data = [
        {"context": """It was a sunny day and the sky color is blue."""},
        {"context": """My name is bobby and I am a talented software engineer working on AI/ML."""},
    ]
    

    This will be appended to the few_shot_template in the configuration. See the few_shot_model.ipynb notebook for a working example.

  2. A more complex list of dictionaries with a single instruction and an examples list:

    data = [{
        "instruction": """Generate one question and its corresponding answer based on the context. Following the format of the examples below to include context, question, and answer in the response.""",
        "examples": [
            {
                "context": """The quick brown fox jumps over the lazy dog.""",
                "question": """What is the color of the fox?""",
                "answer": """brown.""",
            },
            {
                "context": """The quick brown fox jumps over the lazy black dog.""",
                "question": """What is the color of the dog?""",
                "answer": """black.""",
            },
            {
                "context": """It was a sunny day and the sky color is blue.""",
                "question": """""",
                "answer": """""",
            },
        ],
    },
    ...]
    

Next, use the client object to run the flow on the input data.

output = client.run(data)

For example, if you want to run the FewShotModelFlow flow with the OpenAIFewShotConfig configuration, you can do the following:

from uniflow.client import Client
from uniflow.config import OpenAIFewShotConfig

data = [
    {"context": """It was a sunny day and the sky color is blue."""},
    {"context": """My name is bobby and I am a talented software engineer working on AI/ML."""},
]

config = OpenAIFewShotConfig()
client = Client(config)

output = client.run(data)

output[0]['output'][0]

You will see output similar to the following:

{'response': [{'context': 'It was a sunny day and the sky color is blue.',
   'question': 'what color is the sky on a sunny day?',
   'answer': 'blue.'}],
 'error': 'Failed to deserialize 0 examples'}

Flows

The Flow determines which uniflow Model is run. All the different flows are built on the Flow class, which converts the input data into nodes. The Model determines how the input data is serialized and deserialized.

Here is a table of the different Flows and their corresponding Models:

| Flow | Model |
| --- | --- |
| ModelFlow | JsonModel |
| FewShotModelFlow | FewShotModel |
| OpenAIJsonModelFlow | OpenAIJsonModel |
| HuggingFaceModelFlow | Model |
| LMQGModelFlow | Model |

Model

Serialization

This model serializes the input data into a single string, with the format:

KEY0: VALUE0\n
KEY1: VALUE1\n
...
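
As a rough illustration, this serialization could be sketched as follows (the serialize function name is hypothetical, not uniflow's actual API):

def serialize(example: dict) -> str:
    # Flatten each key-value pair into a "KEY: VALUE" line.
    return "".join(f"{key}: {value}\n" for key, value in example.items())

serialize({"context": "It was a sunny day and the sky color is blue."})
# -> 'context: It was a sunny day and the sky color is blue.\n'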

Deserialization

It returns the raw data in the following object:

{
    RESPONSE: data,
    ERROR: "Failed to deserialize 0 examples",
}

JsonModel

Serialization

This model first appends the input data to the few_shot_template in the configuration. It then serializes the data using json.dumps.
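
A minimal sketch of this step might look like the following (variable names are hypothetical; the template structure follows the Few Shot Template section below):

import json

# Hypothetical sketch of the JsonModel serialization described above.
few_shot_template = {
    "instruction": "Generate one question and its corresponding answer based on the context.",
    "examples": [],
}
# Append the input data to the template, then serialize with json.dumps.
few_shot_template["examples"].append(
    {"context": "It was a sunny day and the sky color is blue."}
)
serialized = json.dumps(few_shot_template)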

Deserialization

It iterates through the data and parses the output into an output_list, which it returns in the following object:

{
    RESPONSE: output_list,
    ERROR: "Failed to deserialize 0 examples",
}

FewShotModel

Serialization

The FewShotModel serializes the input data in a few more steps. First, it adds the input data to the examples list within the few_shot_template. Then it serializes the data into a single string, with the format:

KEY0: VALUE0\n
KEY1: VALUE1\n
...
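
A minimal sketch of these two steps, assuming a template shaped like the default one in the Few Shot Template section (the function name is hypothetical, not uniflow's actual API):

def serialize_few_shot(template: dict, new_example: dict) -> str:
    # Step 1: add the input data to the examples list within the template.
    examples = template["examples"] + [new_example]
    # Step 2: flatten everything into a single KEY: VALUE string.
    lines = [f"instruction: {template['instruction']}"]
    for example in examples:
        lines.extend(f"{key}: {value}" for key, value in example.items())
    return "\n".join(lines) + "\n"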

Deserialization

It first runs the output data through a data-filtering function to parse it into an output_list, which it returns in the following object:

{
    RESPONSE: output_list,
    ERROR: "Failed to deserialize 0 examples",
}

OpenAIJsonModel

Serialization

Same as the Model.

Deserialization

Same as the Model.

Model Config

The Model Config is a configuration that is passed to the base Config object and determines which LLM model is used. It also has parameters that are specific to the LLM model.

ModelConfig

The base config is called ModelConfig and has the following parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_name | str | gpt-3.5-turbo-1106 | Name of the LLM used |

OpenAIModelConfig

The OpenAIModelConfig inherits from the ModelConfig and has the following additional parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| num_calls | int | 1 | The number of calls to make to the OpenAI API |
| temperature | float | 1.5 | The temperature to use for the OpenAI API |
| response_format | Dict[str, str] | {"type": "text"} | The response format to use for the OpenAI API |

OpenAIJsonModelConfig

The OpenAIJsonModelConfig inherits from the ModelConfig and has the following additional parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| num_calls | int | 1 | The number of calls to make to the OpenAI API |
| temperature | float | 1.5 | The temperature to use for the OpenAI API |
| response_format | Dict[str, str] | {"type": "json_object"} | The response format to use for the OpenAI API |
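
For example, JSON-mode output could be configured like this (the import paths for OpenAIJsonConfig and OpenAIJsonModelConfig are assumptions here, following the same pattern as the other configs):

from uniflow.client import Client
from uniflow.config import OpenAIJsonConfig  # import path assumed
from uniflow.model.config import OpenAIJsonModelConfig  # import path assumed

config = OpenAIJsonConfig(
    model_config=OpenAIJsonModelConfig(
        num_calls=1,
        temperature=0.7,
    ),
)
client = Client(config)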

HuggingfaceModelConfig

The HuggingfaceModelConfig inherits from the ModelConfig, but overrides the model_name parameter to use the mistralai/Mistral-7B-Instruct-v0.1 model by default.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_name | str | mistralai/Mistral-7B-Instruct-v0.1 | Name of the LLM used |

LMQGModelConfig

The LMQGModelConfig inherits from the ModelConfig, but overrides the model_name parameter to use the lmqg/t5-base-squad-qg-ae model by default.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_name | str | lmqg/t5-base-squad-qg-ae | Name of the LLM used |

Few Shot Template

The OpenAIFewShotConfig is the only configuration that has a default few_shot_template. The default template is the following:

{
    "instruction": """Generate one question and its corresponding answer based on the context.
Following the format of the examples below. The response must include original context, question, and answer.""",
    "examples": [
        {
            "context": """The quick brown fox jumps over the lazy dog.""",
            "question": """What is the color of the fox?""",
            "answer": """brown.""",
        },
        {
            "context": """The quick brown fox jumps over the lazy black dog.""",
            "question": """What is the color of the dog?""",
            "answer": """black.""",
        },
    ],
}

Custom Configuration

You can also configure the flows by passing custom configurations or arguments to the Client object.

Every configuration inherits from the Config configuration, which has the following parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| flow_name | str | The name of the flow to run |
| few_shot_template | Dict[str, str] | The template to use for the FewShotModelFlow |
| num_threads | int | The number of threads to use for the flow |
| model_config | ModelConfig | The configuration to pass to the model |

You can further configure the model_config by passing in one of the Model Configs with custom parameters.

Here is an example of how to pass in a custom configuration to the Client object:

from uniflow.client import Client
from uniflow.config import OpenAIFewShotConfig
from uniflow.model.config import OpenAIModelConfig

data = [
    {"context": """It was a sunny day and the sky color is blue."""},
    {"context": """My name is bobby and I am a talented software engineer working on AI/ML."""},
]

config = OpenAIFewShotConfig(
    num_threads=2,
    model_config=OpenAIModelConfig(
        model_name="gpt-4",
        num_calls=2,
        temperature=0.5,
    ),
)
client = Client(config)
output = client.run(data)

As you can see, we pass a customized OpenAIModelConfig into the OpenAIFewShotConfig configuration to suit our needs.

Examples

For more examples, see the example folder.

Installation

To get started with uniflow, you can install it using pip in a conda environment.

First, create a conda environment on your terminal using:

conda create -n uniflow python=3.10 -y
conda activate uniflow  # some OSes require `source activate uniflow`

Next, install a compatible PyTorch build for your system.

  • If you are on a GPU instance, install PyTorch matching your CUDA version. You can find your CUDA version via nvcc -V.
    pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121  # cu121 means cuda 12.1
    
  • If you are on a CPU instance,
    pip3 install torch
    

Then, install uniflow:

pip3 install uniflow

Finally, if you are running the HuggingFaceModelFlow, you will also need to install the transformers, accelerate, bitsandbytes, and scipy libraries:

pip3 install transformers accelerate bitsandbytes scipy

Congrats, you have finished the installation!

Dev Setup

If you are interested in contributing, here is the preliminary development setup.

API keys

If you are running one of the OpenAI flows, you will have to set up your OpenAI API key.

To do so, create a .env file in your root uniflow folder. Then add the following line to the .env file:

OPENAI_API_KEY=YOUR_API_KEY
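
To verify the key is picked up, you can load the .env file yourself, for example with the python-dotenv package (an assumption for this check; uniflow may load the file differently):

import os
from dotenv import load_dotenv  # pip3 install python-dotenv

load_dotenv()  # reads OPENAI_API_KEY from the .env file into the environment
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"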

Backend Dev Setup

conda create -n uniflow python=3.10
conda activate uniflow
cd uniflow
pip3 install poetry
poetry install --no-root

EC2 Dev Setup

If you are on EC2, you can launch a GPU instance with the following config:

  • EC2 g4dn.xlarge (if you want to run a pretrained LLM with 7B parameters)
  • Deep Learning AMI PyTorch GPU 2.0.1 (Ubuntu 20.04)
  • EBS: at least 100G
