Open-source tool for exploring, labeling, and monitoring data for NLP projects.

Argilla

Open-source platform for data-centric NLP

Data Labeling for MLOps & Feedback Loops

https://user-images.githubusercontent.com/1107111/223220683-fbfa63da-367c-4cfa-bda5-66f47413b6b0.mp4


🆕 🔥 Train custom transformers models with no-code: Argilla + AutoTrain

🆕 🔥 Deploy Argilla on Spaces

🆕 🔥 Since 1.2.0, Argilla supports vector search for finding the records most similar to a given one. This feature combines vector (semantic) search with more traditional keyword- and filter-based search. Learn more in this deep-dive guide.


Documentation | Key Features | Quickstart | Principles | Migration from Rubrix | FAQ

Key Features

Advanced NLP labeling

Monitoring

Team workspaces

  • Bring different users and roles into the NLP data and model lifecycles
  • Organize data collection, review and monitoring into different workspaces
  • Manage workspace access for different users

Quickstart

👋 Welcome! If you have just discovered Argilla, this is the best place to get started. Argilla is composed of:

  • Argilla Client: a powerful Python library for reading and writing data into Argilla, using all the libraries you love (transformers, spaCy, datasets, and any other).

  • Argilla Server and UI: the API and UI for data annotation and curation.

To get started you need to:

  1. Launch the Argilla Server and UI.

  2. Pick a tutorial and start rocking with Argilla using Jupyter Notebooks or Google Colab.

The Quickstart docs page walks you through both steps.

🚒 If you find issues, get direct support from the team and other community members on the Slack Community.

Principles

  • Open: Argilla is free, open-source, and 100% compatible with major NLP libraries (Hugging Face transformers, spaCy, Stanford Stanza, Flair, etc.). In fact, you can use and combine your preferred libraries without implementing any specific interface.

  • End-to-end: Most annotation tools treat data collection as a one-off activity at the beginning of each project. In real-world projects, data collection is a key activity of the iterative process of ML model development. Once a model goes into production, you want to monitor and analyze its predictions, and collect more data to improve your model over time. Argilla is designed to close this gap, enabling you to iterate as much as you need.

  • User and Developer Experience: The key to sustainable NLP solutions is to make it easier for everyone to contribute to projects. Domain experts should feel comfortable interpreting and annotating data. Data scientists should feel free to experiment and iterate. Engineers should feel in control of data pipelines. Argilla optimizes the experience for these core users to make your teams more productive.

  • Beyond hand-labeling: Classical hand labeling workflows are costly and inefficient, but having humans-in-the-loop is essential. Easily combine hand-labeling with active learning, bulk-labeling, zero-shot models, and weak-supervision in novel data annotation workflows.

Contribute

We love contributors and have launched a collaboration with JustDiggit to hand out our very own bunds, to help the re-greening of sub-Saharan Africa. To help our community with the creation of contributions, we have created our developer and contributor docs. Additionally, you can always schedule a meeting with our Developer Advocacy team so they can get you up to speed.

FAQ

What is Argilla?

Argilla is an open-source MLOps tool for building and managing data for Natural Language Processing.

What can I use Argilla for?

Argilla is useful if you want to:

  • create a dataset for training a model.

  • evaluate and improve an existing model.

  • monitor an existing model to improve it over time and gather more training data.

What do I need to start using Argilla?

You need to have a running instance of Elasticsearch and install the Argilla Python library. The library is used to read and write data into Argilla.

How can I "upload" data into Argilla?

Currently, the only way to upload data into Argilla is by using the Python library.

This is based on the assumption that there's rarely a perfectly prepared dataset in the format expected by the data annotation tool.

Argilla is designed to enable fast iteration for users that are closer to data and models, namely data scientists and NLP/ML/Data engineers.

If you are familiar with libraries like Weights & Biases or MLflow, you'll find Argilla's log and load methods intuitive.

That said, Argilla gives you different shortcuts and utils to make loading data into Argilla a breeze, such as the ability to read datasets directly from the Hugging Face Hub.

In summary, the recommended process for uploading data into Argilla is as follows:

  1. Install the Argilla Python library,

  2. Open a Jupyter Notebook,

  3. Make sure you have an Argilla server instance up and running,

  4. Read your source dataset using Pandas, Hugging Face datasets, or any other library,

  5. Do any data preparation, pre-processing, or pre-annotation with a pretrained model, and

  6. Transform your dataset rows/records into Argilla records and log them into a dataset using rg.log. If your dataset is already loaded as a Hugging Face dataset, check the read_datasets method to make this process even simpler.
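The steps above can be sketched roughly as follows. This is a minimal illustration, not the only way to do it: the sample rows, the helper names prepare_rows and log_rows, and the dataset name are hypothetical, and the server URL and API key are placeholders you would replace with your own. The argilla import is done inside log_rows so the preparation step can be run and inspected without a server.

```python
from typing import Any, Dict, List

# Hypothetical source rows (step 4). In practice these would come from
# Pandas, Hugging Face datasets, or any other library.
ROWS: List[Dict[str, Any]] = [
    {"id": 1, "text": "The delivery was fast and the staff were friendly.", "source": "reviews"},
    {"id": 2, "text": "   ", "source": "reviews"},
    {"id": 3, "text": "I never received my order.", "source": "support"},
]


def prepare_rows(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Step 5: strip whitespace, drop empty texts, keep provenance as metadata."""
    prepared = []
    for row in rows:
        text = row["text"].strip()
        if not text:
            continue  # skip rows with nothing to annotate
        prepared.append(
            {"text": text, "metadata": {"source": row["source"], "row_id": row["id"]}}
        )
    return prepared


def log_rows(prepared: List[Dict[str, Any]], dataset_name: str, api_url: str, api_key: str) -> None:
    """Step 6: turn prepared rows into Argilla records and log them.

    Requires the argilla package and a running Argilla server.
    """
    import argilla as rg  # imported lazily so prepare_rows works without it

    rg.init(api_url=api_url, api_key=api_key)
    records = [
        rg.TextClassificationRecord(text=item["text"], metadata=item["metadata"])
        for item in prepared
    ]
    rg.log(records, name=dataset_name)


prepared = prepare_rows(ROWS)
print(f"{len(prepared)} rows ready to log")

# Placeholder values; uncomment once a server is running:
# log_rows(prepared, "my_first_dataset", api_url="http://localhost:6900", api_key="<your-api-key>")
```

Keeping the preparation step separate from the logging step makes it easy to check the cleaned rows in a notebook before anything is written to the server.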

How can I train a model?

The training datasets created with Argilla are model agnostic.

You can choose one of many amazing frameworks to train your model, like transformers, spaCy, Flair, or scikit-learn.

Check out our deep dives and our tutorials on how Argilla integrates with these frameworks.

If you want to train a Hugging Face transformer or spaCy NER model, we provide a neat shortcut to prepare your dataset for training.

Can Argilla share the Elasticsearch Instance/cluster?

Yes, you can use the same Elasticsearch instance/cluster for Argilla and other applications. You only need to perform some configuration; check the Advanced installation guide in the docs.

How to solve an exceeded flood-stage watermark in Elasticsearch?

By default, Elasticsearch is quite conservative regarding the disk space it is allowed to use.

If less than 5% of your disk is free, Elasticsearch can enforce a read-only block on every index, and as a consequence, Argilla stops working.

To solve this, you can simply increase the watermark by executing the following command in your terminal:

curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'{"persistent": {"cluster.routing.allocation.disk.watermark.flood_stage":"99%"}}'
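If you prefer to change the setting from Python, for example from the same notebook you use with Argilla, the curl command above can be reproduced with the standard library. This is only a sketch: the helper name build_watermark_request is hypothetical, and the host is assumed to be the same localhost:9200 used in the curl command. The request is built but not sent, so you can inspect it first.

```python
import json
import urllib.request


def build_watermark_request(
    flood_stage: str = "99%", host: str = "http://localhost:9200"
) -> urllib.request.Request:
    """Build the same PUT request as the curl command, without sending it."""
    body = {
        "persistent": {
            "cluster.routing.allocation.disk.watermark.flood_stage": flood_stage
        }
    }
    return urllib.request.Request(
        url=f"{host}/_cluster/settings?pretty",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_watermark_request()
print(request.get_method(), request.full_url)

# To apply the setting against a running Elasticsearch instance:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```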
