A domain-general, Bayesian method for analyzing high-dimensional data tables

CrossCat

CrossCat is a domain-general, Bayesian method for analyzing high-dimensional data tables. CrossCat estimates the full joint distribution over the variables in the table from the data, via approximate inference in a hierarchical, nonparametric Bayesian model, and provides efficient samplers for every conditional distribution. CrossCat combines strengths of nonparametric mixture modeling and Bayesian network structure learning: it can model any joint distribution given enough data by positing latent variables, but also discovers independencies between the observable variables.

A range of exploratory analysis and predictive modeling tasks can be addressed via CrossCat, including detecting predictive relationships between variables, finding multiple overlapping clusterings, imputing missing values, and simultaneously selecting features and classifying rows. Research on CrossCat has shown that it is suitable for analysis of real-world tables of up to 10 million cells, including hospital cost and quality measures, voting records, handwritten digits, and state-level unemployment time series.
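
To make the "multiple clusterings" idea concrete, here is a toy sketch of the kind of latent structure a cross-categorization involves: columns are partitioned into views, and within each view the rows are partitioned into clusters, so different groups of columns can cluster the same rows differently. It draws both partitions from a Chinese restaurant process prior. The function names and the `alpha_cols` and `alpha_rows` parameters are illustrative assumptions, not part of the CrossCat API, and this samples only from a prior; it does no inference on data.

```python
import random


def crp_partition(n_items, alpha, rng):
    """Sample block labels for n_items from a Chinese restaurant process."""
    assignments = []
    counts = []  # counts[k] = number of items already in block k
    for _ in range(n_items):
        # Join existing block k with weight counts[k], or open a new
        # block with weight alpha.
        weights = counts + [alpha]
        r = rng.random() * sum(weights)
        acc = 0.0
        k = 0
        for k, w in enumerate(weights):
            acc += w
            if r < acc:
                break
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments


def sample_cross_categorization(n_rows, n_cols, alpha_cols=1.0,
                                alpha_rows=1.0, seed=0):
    """Toy prior draw: a view per column, and a row clustering per view."""
    rng = random.Random(seed)
    col_to_view = crp_partition(n_cols, alpha_cols, rng)
    n_views = max(col_to_view) + 1
    row_clusters = [crp_partition(n_rows, alpha_rows, rng)
                    for _ in range(n_views)]
    return col_to_view, row_clusters


if __name__ == "__main__":
    views, clusters = sample_cross_categorization(n_rows=6, n_cols=5)
    print("column -> view:", views)
    for v, labels in enumerate(clusters):
        print("view", v, "row clusters:", labels)
```

Columns that share a view share one row clustering, while columns in different views are modeled independently of each other in this sketch, which is how independencies between variables can coexist with several distinct clusterings of the rows.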

Installation

Local (Ubuntu)

You can install CrossCat using pip (no need to clone from git):

$ pip install crosscat
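
After installing, a quick way to confirm the package is visible is to try importing it with the same Python interpreter that pip installed into. The helper below is a minimal sketch; `is_importable` is a hypothetical name introduced here, and the only assumption about CrossCat is that its top-level module is named `crosscat`, as in the pip command above.

```python
def is_importable(module_name):
    """Return True if module_name can be imported by this interpreter."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Assumes the top-level module is named "crosscat".
    if is_importable("crosscat"):
        print("crosscat is importable")
    else:
        print("crosscat was not found for this interpreter")
```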

To install from source instead, the following steps have been found to work on a bare Ubuntu Server 14.04 system:

$ sudo apt-get install build-essential cython libboost-all-dev python
$ sudo apt-get install python-setuptools python-numpy
$ git clone https://github.com/probcomp/crosscat.git

$ cd crosscat
$ python setup.py build
$ python setup.py install  # or python setup.py develop

CrossCat can also be installed in a local Python virtual environment:

$ cd crosscat
$ virtualenv --system-site-packages /path/to/venv
$ . /path/to/venv/bin/activate
$ python setup.py build
$ python setup.py install  # or python setup.py develop

A similar process has been found to work on OS X.

Tests

To run the automatic tests:

$ ./check.sh

Documentation

Note: The VM is only meant to provide an out-of-the-box usable system setup. Its resources are limited and large jobs will fail due to memory errors. To run larger jobs, increase the VM resources or install directly to your system.

Python Client

C++ backend

Example

dha_example.py (github) is a basic example of analysis using CrossCat. For a first test, run the following from the directory above the top-level crosscat directory:

$ python crosscat/examples/dha_example.py crosscat/www/data/dha.csv --num_chains 2 --num_transitions 2

Note: the default argument values take a considerable amount of time to run and are best suited to a cluster.
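
Since run time grows with table size, it can help to count the cells in a CSV table before choosing how many chains and transitions to run. The sketch below uses only the standard library; `count_cells` is a hypothetical helper, and it assumes the first line of the file is a header row, which may not hold for every table.

```python
import csv
import io


def count_cells(csv_file):
    """Return (n_rows, n_cols, n_cells) for an open CSV file.

    Assumes the first line is a header; blank lines are skipped.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    n_rows = sum(1 for row in reader if row)
    n_cols = len(header)
    return n_rows, n_cols, n_rows * n_cols


if __name__ == "__main__":
    # Tiny in-memory table standing in for a real file.
    sample = io.StringIO(u"a,b,c\n1,2,3\n4,5,6\n")
    print(count_cells(sample))
```

For a file on disk, open it first and pass the file object, for example `with open(path) as f: count_cells(f)`.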

License

Apache License, Version 2.0
