Skip to main content

Unified flow interface for synthetic data generation and many more

Project description

🌊 uniflow

License Apache-2.0 pypi_status Commit activity Slack

uniflow is a unified interface for synthetic data generation. You can generate and augment synthetic data from raw text or other unstructured data using one of or multiple uniflow flows, including flows to augment structured data, generate structured data from unstructured text, and generate structured data from unstructured text (self instructed).

Built by CambioML.

Quick Install

pip3 install uniflow

See more details at the full installation.

Overview

For all the flows, you must first import the Client interface and the uniflow constants.

from uniflow.client import Client
from uniflow.flow.constants import (OUTPUT_NAME, QAPAIR_DF_KEY, ...)

Then you can create a Client object to run the a particular flow.

client = Client(YOUR_FLOW_KEY)

Here is a table of the different flows and their corresponding keys, and input file types.

Flow Key Input File Type
Augment Structured Data flow_data_gen .csv
Generate Structured Data from Unstructured Text flow_data_gen_text .txt, .html
Generate and Augment Structured Data from Unstructured Text flow_text_plus_data_gen .txt, .html
Generate Structured Data from Unstructured Text (Self Instructed) flow_self_instructed_gen .html

Every flow takes a list of input dictionaries. Each dictionary has its own input file, with the INPUT_FILE key as shown below:

from uniflow.flow.constants import INPUT_FILE
input_dict = {INPUT_FILE: input_file}

The input_file is the full path to the input data file.

You can have multiple dictionaries in the input list, each with a different structured data file.

input_list = [input_dict1, input_dict2,...]

Next, you can use the client object to run the flow on the input list.

output_list = client.run(input_list)

The output list will have the same number of dictionaries as the input list, with each dictionary containing the corresponding generated QA pairs.

All of the flows will have the output dictionary with the following listed at the OUTPUT_NAME key. Within the output dictionary, you have the following keys and corresponding values

Key Description
QAPAIR_DF_KEY The output QA dataframe
OUTPUT_FILE The output file path
ERROR_LIST (optional) List of any errors

Here's an example of how to access the output QA dataframe from the first output dictionary in the output list.

output_dict1 = output_list[0]
output_dict1[OUTPUT_NAME][0][QAPAIR_DF_KEY] #this will print the output QA dataframe

Examples

For more examples, check out the QA Generation and Self-Instructed folders.

Flows

uniflow lets you easily generate synthetic data from raw text (including .txt, .html, .pdf, etc.). Here are the flows for common applications:

Augment Structured Data

Given existing structured data (e.g. sample Question-Answer (QA) pairs), augment more QA pairs using the Client("flow-data-gen") interface.

Example

Check out this example and this notebook to get started.

Generate Structured Data from Unstructured Text

Generate structured data (e.g. Question-Answer pairs) from unstructured text using the Client("flow-data-gen-text") interface.

Example

Check out this example and this notebook to get started.

Generate and Augment Structured Data from Unstructured Text

Using the Client("flow_text_plus_data_gen") interface, you can run the previous two flows in sequence to generate structured data from unstructured text, and then augment more data from the structured data.

Example

Check out this example and this notebook to get started.

Generate Structured Data from Unstructured Text (Self Instructed)

Generate data from unstructured text using the Client("flow_self_instructed_gen") interface. This flow generates question answer pairs from unstructured .html files.

Example

Check out this example and this notebook to get started.

Installation

To get started with uniflow, you can install it using pip in a conda environment.

First, create a conda environment on your terminal using:

conda create -n uniflow python=3.10 -y
conda activate uniflow  # some OS requires `source activate uniflow`

Then install flow and the compatible pytorch based on your OS:

pip3 install uniflow
pip3 install torch

Finally, if you are on a GPU, install pytorch based on your cuda version. You can find your CUDA version via nvcc -V.

pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121  # cu121 means cuda 12.1

Congrats you have finished the installation!

Dev Setup

If you are interested in contributing to us, here are the preliminary development setups.

API keys

If you are running one of the following flows, you will have to set up your OpenAI API key.

  • flow_data_gen
  • flow_text_plus_data_gen

To do so, create a .env file in your root uniflow folder. Then add the following line to the .env file:

OPENAI_API_KEY=YOUR_API_KEY

Backend Dev Setup

conda create -n uniflow python=3.10
conda activate uniflow
cd uniflow
pip3 install poetry --no-root

EC2 Dev Setup

If you are on EC2, you can launch a GPU instance with the following config:

  • EC2 g4dn.xlarge (if you want to run a pretrained LLM with 7B parameters)
  • Deep Learning AMI PyTorch GPU 2.0.1 (Ubuntu 20.04) Alt text
  • EBS: at least 100G Alt text

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

uniflow-0.0.5.tar.gz (21.5 kB view hashes)

Uploaded Source

Built Distribution

uniflow-0.0.5-py3-none-any.whl (29.8 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page