JAI - Trust your data
jai-sdk
JAI SDK - Trust your data
Installation
The source code is currently hosted on GitHub at: https://github.com/jquant/jai-sdk
Install jai-sdk using pip:
pip install jai-sdk
Examples
Instantiate base class
from jai import Jai
j = Jai(AUTH_KEY)
Setting up your databases
All data should be in pandas.DataFrame or pandas.Series format
Application using the NLP FastText model
### fasttext implementation
# save this if you want to work in the same database later
name = 'text_data'
### Insert data and train the FastText model
# data can be a list of texts, pandas Series or DataFrame.
# if data is a list, then the ids will be set with range(len(data_list))
# if data is a pandas type, then the ids will be the index values.
# heads-up: index values must not contain duplicates.
j.setup(name, data, db_type='FastText')
# wait for the training to finish
j.wait_setup(name, 10)
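Since ids must be unique, it can help to check them before calling setup. The following is a minimal sketch, not part of the SDK: `find_duplicate_ids` and `example_ids` are hypothetical names introduced here for illustration. It accepts any iterable of ids, so with pandas data you could pass `data.index`.

```python
from collections import Counter


def find_duplicate_ids(ids):
    """Return the ids that occur more than once, in first-seen order."""
    counts = Counter(ids)
    return [i for i, n in counts.items() if n > 1]


# Hypothetical example ids; with pandas data you could pass data.index instead.
example_ids = [10, 11, 12, 11, 13, 10]
duplicates = find_duplicate_ids(example_ids)
if duplicates:
    print(f"duplicate ids found: {duplicates}")
```

If the list comes back empty, the ids are safe to use as they are. pandas users can get the same yes/no answer from `data.index.is_unique`.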
Application using the NLP BERT model
### BERT implementation
# generate a random name for identification of the base; it can be a user input
name = j.generate_name(20, prefix='sdk_', suffix='_text')
# this time we choose db_type="Text", applying the pre-trained BERT model
j.setup(name, data, db_type='Text', batch_size=1024)
j.wait_setup(name, 10)
Checking database
Here are some methods to check your databases.
The name of your database should appear in:
>>> j.names
['jai_database', 'jai_unsupervised', 'jai_supervised']
or you can check if a given database name is valid:
>>> j.is_valid(name)
True
You can also check the types for each of your databases with:
>>> j.info
            db_name       db_type
0      jai_database          Text
1  jai_unsupervised  Unsupervised
2    jai_supervised    Supervised
If you want to check which ids are in your database:
>>> j.ids(name)
['1000 items from 0 to 999']
Similarity
After you're done setting up your database, you can perform similarity searches:
- Using the indexes of the input data
# Find the 5 most similar values for ids 0 and 1
results = j.similar(name, [0, 1], top_k=5)
# Find the 20 most similar values for every id from [0, 99]
ids = list(range(100))
results = j.similar(name, ids, top_k=20)
# Find the 100 most similar values for every input value
results = j.similar(name, data.index, top_k=100, batch_size=1024)
- Using new data to be processed. All data should be in pandas.DataFrame or pandas.Series format.
# Find the 100 most similar values for every new_data
results = j.similar(name, new_data, top_k=100, batch_size=1024)
The output is a list of dictionaries, one per queried value. Each dictionary has a "query_id", the id of the value whose similar items were requested, and "results", a list of top_k dictionaries. Each of those holds an "id" and the "distance" between "query_id" and that "id".
[
{
'query_id': 0,
'results':
[
{'id': 0, 'distance': 0.0},
{'id': 3836, 'distance': 2.298321008682251},
{'id': 9193, 'distance': 2.545339584350586},
{'id': 832, 'distance': 2.5819168090820312},
{'id': 6162, 'distance': 2.638622283935547},
...
]
},
...,
{
'query_id': 9,
'results':
[
{'id': 9, 'distance': 0.0},
{'id': 54, 'distance': 5.262974262237549},
{'id': 101, 'distance': 5.634262561798096},
...
]
},
...
]
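The nested output is often easier to work with as flat rows. The sketch below assumes only the structure shown above; `flatten_results` and its `drop_self` option are hypothetical helpers, not SDK functions, and the sample distances are shortened for illustration. Each query id appears as its own nearest match at distance 0.0, so the helper can skip that entry.

```python
def flatten_results(results, drop_self=True):
    """Turn the nested output into (query_id, id, distance) tuples."""
    rows = []
    for entry in results:
        query_id = entry["query_id"]
        for match in entry["results"]:
            # Optionally skip the trivial match of an item with itself.
            if drop_self and match["id"] == query_id:
                continue
            rows.append((query_id, match["id"], match["distance"]))
    return rows


# Small sample in the shape shown above (illustrative values only).
sample = [
    {"query_id": 0, "results": [{"id": 0, "distance": 0.0},
                                {"id": 3836, "distance": 2.3}]},
    {"query_id": 9, "results": [{"id": 9, "distance": 0.0},
                                {"id": 54, "distance": 5.26}]},
]

for row in flatten_results(sample):
    print(row)
```

The resulting list of tuples can be passed directly to `pandas.DataFrame(rows, columns=["query_id", "id", "distance"])` if a tabular view is preferred.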
Removing data
After you're done with the model setup, you can delete your raw data:
# Delete the raw input data, as it won't be needed anymore
j.delete_raw_data(name)
If you no longer need the model or anything else related to your database:
j.delete_database(name)