
Optimus is the missing framework for cleaning and pre-processing data in a distributed fashion.


Optimus


Get started 🏃

Try Optimus

To launch a live notebook server and try Optimus on Binder or Colab, click one of the following badges:

Binder Colab

Installation (pip):

In your terminal, type pip install pyoptimus

Requirements

  • Python 3.7 or 3.8

Examples

The 10 minutes to Optimus notebook covers the basics you need to start working.

The Examples section has notebooks on specific topics: data cleaning, data munging, profiling, data enrichment, and creating ML and DL models.

Also check the Cheat Sheet.

Start Optimus

Start Optimus using "pandas", "dask", "cudf" or "dask_cudf".

from optimus import Optimus
op = Optimus("pandas")
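The engine name is a plain string, so it can be chosen at runtime. The following is a minimal sketch, not part of Optimus: pick_engine and is_installed are hypothetical helpers that return the first engine name from the list above whose Python package can be found. It assumes each engine name matches an importable package name, and it only checks that the package is installed, not that a GPU or cluster is actually usable.

```python
from importlib.util import find_spec

# Engine names taken from the text above, ordered from most to least specialised.
ENGINE_PREFERENCE = ["dask_cudf", "cudf", "dask", "pandas"]


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return find_spec(module_name) is not None


def pick_engine(preference=ENGINE_PREFERENCE, installed=is_installed):
    """Hypothetical helper: return the first engine whose package is installed."""
    for engine in preference:
        if installed(engine):
            return engine
    raise RuntimeError("none of the requested engines is installed")


# With Optimus installed, the chosen name would be passed on like this:
# from optimus import Optimus
# op = Optimus(pick_engine())
```

The installed argument is injectable so the selection logic can be exercised without any of the engines being present.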

Loading data

Optimus can load data in CSV, JSON, Parquet, Avro and Excel formats from a local file or a URL.

# csv
df = op.load.csv("../examples/data/foo.csv")

# json
df = op.load.json("../examples/data/foo.json")

# using a url
df = op.load.json("https://raw.githubusercontent.com/hi-primus/optimus/develop-21.9/examples/data/foo.json")

# parquet
df = op.load.parquet("../examples/data/foo.parquet")

# ...or anything else
df = op.load.file("../examples/data/titanic3.xls")
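When the source format is only known at runtime, the loader can be picked from the file extension. This is a rough sketch under stated assumptions: loader_name_for and LOADER_BY_EXTENSION are hypothetical names, not part of Optimus, and the mapping only covers the loaders shown above, falling back to the generic file loader for everything else.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions mapped to the loader names shown above; anything else
# falls back to the generic "file" loader.
LOADER_BY_EXTENSION = {".csv": "csv", ".json": "json", ".parquet": "parquet"}


def loader_name_for(source):
    """Hypothetical helper: choose a loader name from a local path or URL."""
    # For URLs, look only at the path component so query strings are ignored.
    parsed = urlparse(source)
    path = parsed.path if parsed.scheme in ("http", "https") else source
    extension = PurePosixPath(path).suffix.lower()
    return LOADER_BY_EXTENSION.get(extension, "file")


# With an Optimus instance `op`, the name would select the loader:
# source = "../examples/data/foo.csv"
# df = getattr(op.load, loader_name_for(source))(source)
```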

Also, you can load data from oracle, redshift, mysql and postgres.

Saving Data

# csv
df.save.csv("data/foo.csv")

# json
df.save.json("data/foo.json")

# parquet
df.save.parquet("data/foo.parquet")
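The save examples write into a data/ directory. Whether Optimus creates a missing output directory is not stated here, so the sketch below assumes it might not and creates the directory first. ensure_parent_dir is a hypothetical helper built only on the standard library.

```python
from pathlib import Path


def ensure_parent_dir(path):
    """Hypothetical helper: create the parent directory of `path` if missing.

    Returns the path as a string so it can be passed straight to a save call.
    """
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return str(target)


# With an Optimus dataframe `df`:
# df.save.csv(ensure_parent_dir("data/foo.csv"))
```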

You can also save data to oracle, redshift, mysql and postgres.

Create dataframes

You can also create a dataframe from scratch:

df = op.create.dataframe({
    'A': ['a', 'b', 'c', 'd'],
    'B': [1, 3, 5, 7],
    'C': [2, 4, 6, None],
    'D': ['1980/04/10', '1980/04/10', '1980/04/10', '1980/04/10']
})

Use display to show your data together with extra information such as column number, column data type and marked white spaces.

display(df)

Cleaning and Processing

Optimus was created to make data cleaning a breeze. The API is designed to be easy for newcomers and familiar to people coming from Pandas. Optimus extends the standard DataFrame functionality by adding .rows and .cols accessors.

For example, you can load data from a URL, transform it and apply some predefined cleaning functions:

new_df = df\
    .rows.sort("rank", "desc")\
    .cols.lower(["names", "function"])\
    .cols.date_format("date arrival", "yyyy/MM/dd", "dd-MM-YYYY")\
    .cols.years_between("date arrival", "dd-MM-YYYY", output_cols="from arrival")\
    .cols.normalize_chars("names")\
    .cols.remove_special_chars("names")\
    .rows.drop(df["rank"]>8)\
    .cols.rename("*", str.lower)\
    .cols.trim("*")\
    .cols.unnest("japanese name", output_cols="other names")\
    .cols.unnest("last position seen", separator=",", output_cols="pos")\
    .cols.drop(["last position seen", "japanese name", "date arrival", "cybertronian", "nulltype"])
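The chain above works because each call on .rows or .cols returns a dataframe, so the next accessor can be applied to the result. The toy sketch below illustrates that pattern in plain Python. It is not how Optimus is implemented: ToyFrame, _Cols and _Rows are hypothetical classes, and only sort, lower and drop are mimicked, in simplified form.

```python
class ToyFrame:
    """Toy stand-in for a dataframe: a dict mapping column names to lists."""

    def __init__(self, data):
        self.data = {name: list(values) for name, values in data.items()}

    @property
    def cols(self):
        # Column-wise operations live on a separate accessor object.
        return _Cols(self)

    @property
    def rows(self):
        # Row-wise operations live on a separate accessor object.
        return _Rows(self)


class _Cols:
    def __init__(self, frame):
        self._frame = frame

    def lower(self, name):
        """Return a new frame with one string column lower-cased."""
        data = dict(self._frame.data)
        data[name] = [value.lower() for value in data[name]]
        return ToyFrame(data)

    def drop(self, name):
        """Return a new frame without the named column."""
        return ToyFrame({k: v for k, v in self._frame.data.items() if k != name})


class _Rows:
    def __init__(self, frame):
        self._frame = frame

    def sort(self, name, order="asc"):
        """Return a new frame with all rows reordered by one column."""
        values = self._frame.data[name]
        index = sorted(range(len(values)), key=values.__getitem__,
                       reverse=(order == "desc"))
        return ToyFrame({k: [v[i] for i in index]
                         for k, v in self._frame.data.items()})


df = ToyFrame({"names": ["Optimus", "Bumblebee", "Ironhide"], "rank": [10, 7, 8]})
new_df = df.rows.sort("rank", "desc").cols.lower("names")
print(new_df.data)
```

Every method builds a new ToyFrame instead of mutating the original, which is what keeps df unchanged while new_df carries the transformed data.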

Need help? 🛠️

Feedback

Feedback is what drives Optimus' future, so please take a couple of minutes to help shape the Optimus roadmap: http://bit.ly/optimus_survey

Also, if you have a suggestion or feature request, use https://github.com/hi-primus/optimus/issues

Troubleshooting

If you have issues, see our Troubleshooting Guide.

Contributing to Optimus 💡

Contributions go far beyond pull requests and commits. We are very happy to receive any kind of contribution, including:

  • Documentation updates, enhancements, designs, or bugfixes.
  • Spelling or grammar fixes.
  • README.md corrections or redesigns.
  • Adding unit or functional tests.
  • Triaging GitHub issues -- especially determining whether an issue still persists or is reproducible.
  • Blogging, speaking about, or creating tutorials about Optimus and its many features.
  • Helping others on our official chats.

Backers and Sponsors

Become a backer or a sponsor and get your image on our README on GitHub with a link to your site.

