
A Python library for building different types of copulas and using them for sampling.

Project description

An open source project from the Data to AI Lab at MIT.


Copulas

Overview

Copulas is a Python library for modeling multivariate distributions and sampling from them using copula functions. Given a table containing numerical data, we can use Copulas to learn the distribution and later on generate new synthetic rows following the same statistical properties.

Some of the features provided by this library include:

  • A variety of distributions for modeling univariate data.
  • Multiple Archimedean copulas for modeling bivariate data.
  • Gaussian and Vine copulas for modeling multivariate data.
  • Automatic selection of univariate distributions and bivariate copulas.
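
To make the idea of automatic selection concrete, here is a minimal standard-library sketch. It is not the Copulas implementation: the candidate registry, the helper names and the choice of the Kolmogorov-Smirnov distance as the selection criterion are all assumptions made for illustration.

```python
import random
from statistics import NormalDist


def ks_distance(sorted_data, cdf):
    """Kolmogorov-Smirnov distance between the empirical CDF and a model CDF."""
    n = len(sorted_data)
    distance = 0.0
    for i, x in enumerate(sorted_data):
        model = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        distance = max(distance, abs(model - i / n), abs((i + 1) / n - model))
    return distance


def fit_gaussian(data):
    """Fit a Gaussian by sample mean and standard deviation; return its CDF."""
    return NormalDist.from_samples(data).cdf


def fit_uniform(data):
    """Fit a uniform distribution on [min, max]; return its CDF."""
    low, high = min(data), max(data)
    return lambda x: min(1.0, max(0.0, (x - low) / (high - low)))


# Hypothetical candidate registry, for illustration only.
CANDIDATES = {"gaussian": fit_gaussian, "uniform": fit_uniform}


def select_distribution(data):
    """Fit every candidate and return the name with the smallest KS distance."""
    sorted_data = sorted(data)
    scores = {
        name: ks_distance(sorted_data, fit(data))
        for name, fit in CANDIDATES.items()
    }
    return min(scores, key=scores.get), scores


rng = random.Random(0)
gaussian_sample = [rng.gauss(10.0, 2.0) for _ in range(5000)]
uniform_sample = [rng.uniform(0.0, 1.0) for _ in range(5000)]

best_for_gaussian, _ = select_distribution(gaussian_sample)
best_for_uniform, _ = select_distribution(uniform_sample)
print(best_for_gaussian, best_for_uniform)
```

The same pattern extends to more candidates by adding entries to the registry; the selection rule itself stays unchanged.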

Supported Distributions

Univariate

  • Gaussian
  • Student T
  • Beta
  • Gamma
  • Gaussian KDE
  • Truncated Gaussian

Archimedean Copulas (Bivariate)

  • Clayton
  • Frank
  • Gumbel
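
As an illustration of what sampling from an Archimedean copula involves, the sketch below draws pairs from a Clayton copula by conditional inversion, using only the standard library. It is a textbook construction written for this example, not the library's code; the function names are hypothetical and only the case theta > 0 is handled.

```python
import random


def sample_clayton(theta, n_samples, rng):
    """Draw (u, v) pairs from a Clayton copula by conditional inversion."""
    if theta <= 0:
        raise ValueError("this sketch only covers theta > 0")
    pairs = []
    for _ in range(n_samples):
        # 1 - random() lies in (0, 1], which avoids raising 0.0 to a negative power.
        u = 1.0 - rng.random()
        w = 1.0 - rng.random()
        # Solve dC(u, v)/du = w for v, where
        # C(u, v) = (u^-theta + v^-theta - 1)^(-1/theta).
        inner = (w ** (-theta / (1.0 + theta)) - 1.0) * u ** (-theta) + 1.0
        v = inner ** (-1.0 / theta)
        pairs.append((u, v))
    return pairs


def kendall_tau(pairs):
    """Plain O(n^2) Kendall rank correlation of a list of (u, v) pairs."""
    concordant = discordant = 0
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            sign = (pairs[i][0] - pairs[j][0]) * (pairs[i][1] - pairs[j][1])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    total = len(pairs) * (len(pairs) - 1) / 2
    return (concordant - discordant) / total


theta = 2.0
pairs = sample_clayton(theta, 1000, random.Random(0))
tau = kendall_tau(pairs)
# For a Clayton copula, Kendall's tau equals theta / (theta + 2).
expected_tau = theta / (theta + 2.0)
print(tau, expected_tau)
```

Comparing the empirical Kendall's tau with theta / (theta + 2) is a quick sanity check that the sampled pairs carry the intended dependence.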

Multivariate

  • Gaussian
  • D-Vine
  • C-Vine
  • R-Vine
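
The Gaussian copula idea can be sketched for two columns with the standard library alone. This is a simplified illustration, not the library's GaussianMultivariate: it assumes empirical marginals (ranks and empirical quantiles) instead of fitted univariate distributions, and all helper names are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def normal_scores(column):
    """Map a column to standard-normal scores through its empirical ranks."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) keeps the probability strictly inside (0, 1).
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def empirical_quantile(sorted_column, u):
    """Return the observed value at probability level u."""
    index = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[index]


def fit_and_sample(x, y, n_samples, rng):
    """Estimate the copula correlation, then sample synthetic (x, y) rows."""
    rho = pearson(normal_scores(x), normal_scores(y))
    sorted_x, sorted_y = sorted(x), sorted(y)
    rows = []
    for _ in range(n_samples):
        # Two standard normals with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        # Back to uniforms, then to the data scale through the marginals.
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        rows.append((empirical_quantile(sorted_x, u1),
                     empirical_quantile(sorted_y, u2)))
    return rho, rows


# Toy data: a skewed column and a second column that depends on it.
rng = random.Random(0)
x = [rng.expovariate(1.0) for _ in range(2000)]
y = [value + rng.gauss(0.0, 0.3) for value in x]

rho, rows = fit_and_sample(x, y, 500, rng)
print(rho, rows[:3])
```

Because the marginals here are empirical, every sampled value is one that already appears in the input columns; fitting parametric univariate distributions instead would let the model produce unseen values.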

Install

Requirements

Copulas has been developed and tested on Python 3.5, 3.6, 3.7 and 3.8.

Although not strictly required, using a virtualenv is highly recommended to avoid interfering with other software installed on the system where Copulas runs.

Install with pip

The easiest and recommended way to install Copulas is using pip:

pip install copulas

This will pull and install the latest stable release from PyPI.

If you want to install from source or contribute to the project please read the Contributing Guide.

Quickstart

In this short quickstart, we show how to model a multivariate dataset and then generate synthetic data that resembles it.

import warnings
warnings.filterwarnings('ignore')

from copulas.datasets import sample_trivariate_xyz
from copulas.multivariate import GaussianMultivariate
from copulas.visualization import compare_3d

# Load a dataset with 3 columns that are not independent
real_data = sample_trivariate_xyz()

# Fit a gaussian copula to the data
copula = GaussianMultivariate()
copula.fit(real_data)

# Sample synthetic data
synthetic_data = copula.sample(len(real_data))

# Plot the real and the synthetic data to compare
compare_3d(real_data, synthetic_data)

The output will be a figure with two plots, one showing the real data and the other showing the synthetic data that you just generated.

What's next?

For more details about Copulas and all its possibilities and features, please check the documentation site.

There you can also learn how to contribute to Copulas and help us develop new features or cool ideas.

Credits

Copulas is an open source project from the Data to AI Lab at MIT, built and maintained over the years by its team of contributors.

Related Projects

SDV

SDV, short for Synthetic Data Vault, is the end-user library for synthesizing data, in development under the HDI Project. SDV allows you to easily model and sample relational datasets using Copulas through a simple API. Other features include anonymization of Personally Identifiable Information (PII) and preserving relational integrity on sampled records.

CTGAN

CTGAN is a GAN-based model for synthesizing tabular data. It is also developed by MIT's Data to AI Lab and is under active development.

History

0.3.3 (2020-09-18)

General Improvements

  • Use corr instead of cov in the GaussianMultivariate - Issue #195 by @rollervan
  • Add arguments to GaussianKDE - Issue #181 by @rollervan

New Features

  • Log Laplace Distribution - Issue #188 by @rollervan

0.3.2 (2020-08-08)

General Improvements

  • Support Python 3.8 - Issue #185 by @csala
  • Support scipy <1.3 - Issue #180 by @csala

New Features

  • Add Uniform Univariate - Issue #179 by @rollervan

0.3.1 (2020-07-09)

General Improvements

  • Raise numpy version upper bound to 2 - Issue #178 by @csala

New Features

  • Add Student T Univariate - Issue #172 by @gbonomib

Bug Fixes

  • Error in Quickstarts: Unknown projection '3d' - Issue #174 by @csala

0.3.0 (2020-03-27)

Important revamp of the internal implementation of the project, the testing infrastructure and the documentation by Kevin Alex Zhang @k15z, Carles Sala @csala and Kalyan Veeramachaneni @kveerama

Enhancements

  • Reimplementation of the existing Univariate distributions.
  • Addition of new Beta and Gamma Univariates.
  • New Univariate API with automatic selection of the optimal distribution.
  • Several improvements and fixes on the Bivariate and Multivariate Copulas implementation.
  • New visualization module with simple plotting patterns to visualize probability distributions.
  • New datasets module with toy datasets sampling functions.
  • New testing infrastructure with end-to-end, numerical and large scale testing.
  • Improved tutorials and documentation.

0.2.5 (2020-01-17)

General Improvements

  • Convert import_object to get_instance - Issue #114 by @JDTheRipperPC

0.2.4 (2019-12-23)

New Features

  • Allow creating copula classes directly - Issue #117 by @csala

General Improvements

  • Remove select_copula from Bivariate - Issue #118 by @csala

  • Rename TruncNorm to TruncGaussian and make it non standard - Issue #102 by @csala @JDTheRipperPC

Bugs fixed

  • Error on Frank and Gumbel sampling - Issue #112 by @csala

0.2.3 (2019-09-17)

New Features

  • Add support to Python 3.7 - Issue #53 by @JDTheRipperPC

General Improvements

  • Document RELEASE workflow - Issue #105 by @JDTheRipperPC

  • Improve serialization of univariate distributions - Issue #99 by @ManuelAlvarezC and @JDTheRipperPC

Bugs fixed

  • The method 'select_copula' of Bivariate returns the wrong CopulaType - Issue #101 by @JDTheRipperPC

0.2.2 (2019-07-31)

New Features

  • truncnorm distribution and a generic wrapper for scipy.stats.rv_continuous distributions - Issue #27 by @amontanez, @csala and @ManuelAlvarezC
  • Independence bivariate copulas - Issue #46 by @aliciasun, @csala and @ManuelAlvarezC
  • Option to select seed on random number generator - Issue #63 by @echo66 and @ManuelAlvarezC
  • Option on Vine copulas to select number of rows to sample - Issue #77 by @ManuelAlvarezC
  • Make copulas accept both scalars and arrays as arguments - Issues #85 and #90 by @ManuelAlvarezC

General Improvements

  • Ability to properly handle constant data - Issues #57 and #82 by @csala and @ManuelAlvarezC
  • Tests for analytics properties of copulas - Issue #61 by @ManuelAlvarezC
  • Improved documentation - Issue #96 by @ManuelAlvarezC

Bugs fixed

  • Fix bug on Vine copulas, that made it crash during the bivariate copula selection - Issue #64 by @echo66 and @ManuelAlvarezC

0.2.1 - Vine serialization

  • Add serialization to Vine copulas.
  • Add distribution as argument for the Gaussian Copula.
  • Improve Bivariate Copulas code structure to remove code duplication.
  • Fix bug in Vine Copulas sampling: 'Edge' object has no attribute 'index'
  • Improve code documentation.
  • Improve code style and linting tools configuration.

0.2.0 - Unified API

  • New API for stats methods.
  • Standardize input and output to numpy.ndarray.
  • Increase unittest coverage to 90%.
  • Add methods to load/save copulas.
  • Improve Gaussian copula sampling accuracy.

0.1.1 - Minor Improvements

  • Different Copula types separated in subclasses
  • Extensive Unit Testing
  • More pythonic names in the public API.
  • Stop using third party elements that will be deprecated soon.
  • Add methods to sample new data on bivariate copulas.
  • New KDE Univariate copula
  • Improved examples with additional demo data.

0.1.0 - First Release

  • First release on PyPI.
