Unified flow interface for synthetic data generation and many more

These details have not been verified by PyPI

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

🌊 uniflow

uniflow is a unified interface for synthetic data generation. You can generate and augment synthetic data from raw text or other unstructured data using one or more uniflow flows. The available flows augment structured data, generate structured data from unstructured text, and generate structured data from unstructured text in a self-instructed way.

Built by CambioML.

Quick Install

pip3 install uniflow

See the Installation section below for full details.

Overview

For all the flows, you must first import the Client interface and the uniflow constants.

from uniflow.client import Client
from uniflow.flow.constants import (OUTPUT_NAME, QAPAIR_DF_KEY, ...)

Then you can create a Client object to run a particular flow.

client = Client(YOUR_FLOW_KEY)

Here is a table of the different flows with their corresponding keys and input file types.

Flow	Key	Input File Type
Augment Structured Data	flow_data_gen	.csv
Generate Structured Data from Unstructured Text	flow_data_gen_text	.txt, .html
Generate and Augment Structured Data from Unstructured Text	flow_text_plus_data_gen	.txt, .html
Generate Structured Data from Unstructured Text (Self Instructed)	flow_self_instructed_gen	.html
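The table above can also be written as a lookup that checks an input file before a flow is created. This is a minimal sketch and not part of uniflow: `FLOW_INPUT_TYPES` and `check_input_file` are hypothetical names, and the keys and extensions are copied from the table.

```python
from pathlib import Path

# Flow keys and accepted input extensions, copied from the table above.
# This mapping is an illustration; uniflow itself does not provide it.
FLOW_INPUT_TYPES = {
    "flow_data_gen": {".csv"},
    "flow_data_gen_text": {".txt", ".html"},
    "flow_text_plus_data_gen": {".txt", ".html"},
    "flow_self_instructed_gen": {".html"},
}


def check_input_file(flow_key: str, input_file: str) -> None:
    """Raise ValueError if the flow key is unknown or the extension does not fit."""
    if flow_key not in FLOW_INPUT_TYPES:
        raise ValueError(f"Unknown flow key: {flow_key!r}")
    suffix = Path(input_file).suffix.lower()
    allowed = FLOW_INPUT_TYPES[flow_key]
    if suffix not in allowed:
        raise ValueError(
            f"{flow_key} expects one of {sorted(allowed)}, got {suffix!r}"
        )
```

Calling `check_input_file("flow_data_gen", "qa.csv")` returns silently, while passing an .html file to the same flow raises a `ValueError`.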

Every flow takes a list of input dictionaries. Each dictionary has its own input file, with the INPUT_FILE key as shown below:

from uniflow.flow.constants import INPUT_FILE
input_dict = {INPUT_FILE: input_file}

The input_file is the full path to the input data file.

You can have multiple dictionaries in the input list, each with a different input file.

input_list = [input_dict1, input_dict2,...]
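When there are many input files, the list can be built programmatically. The helper below is a sketch under stated assumptions: `build_input_list` is a hypothetical name, not a uniflow function, and it takes the dictionary key as an argument so that it runs without uniflow installed. It resolves each path to a full path, as the flows expect, and fails early if a file is missing.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def build_input_list(paths: Iterable[str], input_key: str) -> List[Dict[str, str]]:
    """Return one input dictionary per file, keyed by `input_key`.

    Pass uniflow's INPUT_FILE constant as `input_key`.
    """
    input_list = []
    for path in paths:
        # Resolve to an absolute path, since input_file must be a full path.
        full_path = Path(path).expanduser().resolve()
        if not full_path.is_file():
            raise FileNotFoundError(f"Input file not found: {full_path}")
        input_list.append({input_key: str(full_path)})
    return input_list
```

With the import shown earlier, a call would look like `input_list = build_input_list(["a.html", "b.html"], INPUT_FILE)`.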

Next, you can use the client object to run the flow on the input list.

output_list = client.run(input_list)

The output list will have the same number of dictionaries as the input list, with each dictionary containing the corresponding generated QA pairs.

Every flow stores its results in the output dictionary under the OUTPUT_NAME key. Each result has the following keys and corresponding values:

Key	Description
`QAPAIR_DF_KEY`	The output QA dataframe
`OUTPUT_FILE`	The output file path
`ERROR_LIST` (optional)	List of any errors

Here's an example of how to access the output QA dataframe from the first output dictionary in the output list.

output_dict1 = output_list[0]
output_dict1[OUTPUT_NAME][0][QAPAIR_DF_KEY]  # evaluates to the output QA dataframe (a notebook cell displays it)
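The same access pattern extends to the whole output list. The following sketch gathers every QA dataframe and any reported errors. `collect_results` is a hypothetical helper, not part of uniflow, and it takes the key names as arguments; it assumes that the optional `ERROR_LIST` entry sits next to `QAPAIR_DF_KEY`, as the table above suggests.

```python
from typing import Any, Dict, List, Tuple


def collect_results(
    output_list: List[Dict[str, Any]],
    output_name: str,
    qapair_df_key: str,
    error_list_key: str,
) -> Tuple[List[Any], List[Any]]:
    """Gather the QA dataframes and any reported errors from every output."""
    dataframes, errors = [], []
    for output_dict in output_list:
        # Mirrors the access pattern above: the first entry under OUTPUT_NAME.
        result = output_dict[output_name][0]
        dataframes.append(result[qapair_df_key])
        # The error list is optional, so fall back to an empty list.
        errors.extend(result.get(error_list_key, []))
    return dataframes, errors
```

In practice you would pass the uniflow constants, for example `collect_results(output_list, OUTPUT_NAME, QAPAIR_DF_KEY, ERROR_LIST)`, assuming `ERROR_LIST` can be imported alongside the others.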

Examples

For more examples, check out the QA Generation and Self-Instructed folders.

Flows

uniflow lets you easily generate synthetic data from raw text (including .txt, .html, .pdf, etc.). Here are the flows for common applications:

Augment Structured Data

Given existing structured data (e.g. sample Question-Answer (QA) pairs), generate additional QA pairs using the Client("flow_data_gen") interface.

Example

Check out this example and this notebook to get started.

Generate Structured Data from Unstructured Text

Generate structured data (e.g. Question-Answer pairs) from unstructured text using the Client("flow_data_gen_text") interface.

Example

Check out this example and this notebook to get started.

Generate and Augment Structured Data from Unstructured Text

Using the Client("flow_text_plus_data_gen") interface, you can run the previous two flows in sequence: first generate structured data from unstructured text, then augment that structured data with more examples.

Example

Check out this example and this notebook to get started.

Generate Structured Data from Unstructured Text (Self Instructed)

Generate data from unstructured text using the Client("flow_self_instructed_gen") interface. This flow generates question-answer pairs from unstructured .html files.

Example

Check out this example and this notebook to get started.

Installation

To get started with uniflow, you can install it using pip in a conda environment.

First, create a conda environment in your terminal using:

conda create -n uniflow python=3.10 -y
conda activate uniflow  # some operating systems require `source activate uniflow`

Then install uniflow and a compatible PyTorch build for your OS:

pip3 install uniflow
pip3 install torch

Finally, if you are on a GPU machine, install the PyTorch build that matches your CUDA version. You can find your CUDA version via nvcc -V.

pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121  # cu121 means cuda 12.1
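The `cu121` suffix can be derived from the version that `nvcc -V` reports. The sketch below is illustrative only: `cuda_tag` and `nightly_index_url` are hypothetical helpers, and the regular expression assumes the usual `release <major>.<minor>` wording in the nvcc output. Not every CUDA version necessarily has a published wheel index, so confirm that the resulting URL exists before using it.

```python
import re
from typing import Optional

# Base of the index URL used in the pip command above.
NIGHTLY_INDEX = "https://download.pytorch.org/whl/nightly"


def cuda_tag(nvcc_output: str) -> Optional[str]:
    """Extract a tag such as 'cu121' from the text printed by `nvcc -V`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    major, minor = match.groups()
    return f"cu{major}{minor}"


def nightly_index_url(tag: str) -> str:
    """Build the --index-url value for a given CUDA tag."""
    return f"{NIGHTLY_INDEX}/{tag}"
```

For a line such as `Cuda compilation tools, release 12.1, V12.1.105`, `cuda_tag` returns `cu121`, and `nightly_index_url("cu121")` reproduces the URL in the command above.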

Congrats, you have finished the installation!

Dev Setup

If you are interested in contributing, here are the preliminary development setup steps.

API keys

If you are running one of the following flows, you will have to set up your OpenAI API key.

flow_data_gen
flow_text_plus_data_gen

To do so, create a .env file in your root uniflow folder. Then add the following line to the .env file:

OPENAI_API_KEY=YOUR_API_KEY
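Before running one of these flows, it can help to confirm that the key is actually present. The sketch below uses only the standard library and does not describe how uniflow itself loads the .env file. `read_env_file` and `has_openai_key` are hypothetical helpers, and the parser handles only plain KEY=VALUE lines.

```python
import os
from pathlib import Path
from typing import Dict


def read_env_file(path: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def has_openai_key(env_path: str = ".env") -> bool:
    """Return True if OPENAI_API_KEY is set in the environment or the .env file."""
    if os.environ.get("OPENAI_API_KEY"):
        return True
    if not Path(env_path).is_file():
        return False
    return bool(read_env_file(env_path).get("OPENAI_API_KEY"))
```

Running `has_openai_key()` from the root uniflow folder returns `False` when neither the environment nor the .env file defines the key.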

Backend Dev Setup

conda create -n uniflow python=3.10
conda activate uniflow
cd uniflow
pip3 install poetry
poetry install --no-root  # --no-root is a poetry option, not a pip option

EC2 Dev Setup

If you are on EC2, you can launch a GPU instance with the following config:

EC2 g4dn.xlarge (if you want to run a pretrained LLM with 7B parameters)
Deep Learning AMI PyTorch GPU 2.0.1 (Ubuntu 20.04)
EBS: at least 100G

Project details

Release history

Version	Release date
0.0.31	Mar 29, 2024
0.0.30	Mar 18, 2024
0.0.29	Mar 11, 2024
0.0.28	Mar 11, 2024
0.0.27	Mar 7, 2024
0.0.26	Feb 27, 2024
0.0.25	Feb 24, 2024
0.0.24	Feb 20, 2024
0.0.23	Feb 15, 2024
0.0.22	Feb 4, 2024
0.0.21	Jan 26, 2024
0.0.20	Jan 25, 2024
0.0.19	Jan 25, 2024
0.0.18	Jan 22, 2024
0.0.17	Jan 19, 2024
0.0.16	Jan 17, 2024
0.0.15	Jan 15, 2024
0.0.14	Jan 9, 2024
0.0.13	Jan 8, 2024
0.0.12	Jan 6, 2024
0.0.11	Dec 31, 2023
0.0.10	Dec 27, 2023
0.0.9	Dec 23, 2023
0.0.8	Dec 11, 2023
0.0.7	Nov 7, 2023
0.0.6	Nov 6, 2023
0.0.5 (this version)	Nov 3, 2023
0.0.4	Oct 20, 2023
0.0.3	Oct 20, 2023
0.0.2	Oct 20, 2023
0.0.1	Oct 13, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

uniflow-0.0.5.tar.gz (21.5 kB)

Uploaded Nov 3, 2023 Source

Built Distribution

uniflow-0.0.5-py3-none-any.whl (29.8 kB)

Uploaded Nov 3, 2023 Python 3

Hashes for uniflow-0.0.5.tar.gz
Algorithm	Hash digest
SHA256	`b9950038e4d8e2e61e778ee1ae4e770c9ffb7cba0af254ab6d1bdc7a74bdb634`
MD5	`6f016652530881659698ee3081dfe57c`
BLAKE2b-256	`f5a43fe463e0441d8331650451aab79b9e05dd7ab4df309c555ad18fd46139fa`

Hashes for uniflow-0.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`407c764ad60aa72b3a19fcfc0d737aeae219e48d34b969f11bea286b0ae770a3`
MD5	`e4e74b0fd0b38b5bae1b7806e5aafc48`
BLAKE2b-256	`069401083bb46d663e83fb715063cf4ac5c52485e038e799268f927db73a5874`