Clustering based on density with variable density clusters

HDBSCAN

HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise. Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection.

In practice this means that HDBSCAN returns a good clustering straight away with little or no parameter tuning – and the primary parameter, minimum cluster size, is intuitive and easy to select.

HDBSCAN is ideal for exploratory data analysis; it’s a fast and robust algorithm that you can trust to return meaningful clusters (if there are any).

Based on the papers:

McInnes L, Healy J. Accelerated Hierarchical Density Based Clustering In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42. 2017 [pdf]

R. Campello, D. Moulavi, and J. Sander, Density-Based Clustering Based on Hierarchical Density Estimates In: Advances in Knowledge Discovery and Data Mining, Springer, pp 160-172. 2013

Documentation, including tutorials, is available on ReadTheDocs at http://hdbscan.readthedocs.io/en/latest/.

Notebooks comparing HDBSCAN to other clustering algorithms, explaining how HDBSCAN works, and comparing performance with other Python clustering implementations are available.

How to use HDBSCAN

The hdbscan package inherits from sklearn classes, and thus drops in neatly next to other sklearn clusterers with an identical calling API. Similarly it supports input in a variety of formats: an array (or pandas dataframe, or sparse matrix) of shape (num_samples x num_features); an array (or sparse matrix) giving a distance matrix between samples.

import hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
cluster_labels = clusterer.fit_predict(data)

Performance

Significant effort has been put into making the hdbscan implementation as fast as possible. It is orders of magnitude faster than the reference implementation in Java, and is currently faster than highly optimized single linkage implementations in C and C++. Version 0.7 performance can be seen in this notebook. In particular, performance on low dimensional data is better than sklearn’s DBSCAN, and via support for caching with joblib, re-clustering with different parameters can be almost free.

Additional functionality

The hdbscan package comes equipped with visualization tools to help you understand your clustering results. After fitting data the clusterer object has attributes for:

  • The condensed cluster hierarchy

  • The robust single linkage cluster hierarchy

  • The reachability distance minimal spanning tree

All of these come equipped with methods for plotting and converting to Pandas or NetworkX for further analysis. See the notebook on how HDBSCAN works for examples and further details.

The clusterer objects also have an attribute providing cluster membership strengths, which gives an optional soft clustering at no further compute expense. Finally, each cluster also receives a persistence score giving the stability of the cluster over the range of distance scales present in the data. This provides a measure of the relative strength of clusters.

Outlier Detection

The HDBSCAN clusterer objects also support the GLOSH outlier detection algorithm. After fitting the clusterer to data, the outlier scores can be accessed via the outlier_scores_ attribute. The result is a vector of score values, one for each data point that was fit. Higher scores represent more outlier-like objects. Selecting outliers via upper quantiles is often a good approach.

Based on the paper:

R.J.G.B. Campello, D. Moulavi, A. Zimek and J. Sander Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection, ACM Trans. on Knowledge Discovery from Data, Vol 10, 1 (July 2015), 1-51.

Robust single linkage

The hdbscan package also provides support for the robust single linkage clustering algorithm of Chaudhuri and Dasgupta. As with the HDBSCAN implementation, this is a high performance version of the algorithm, outperforming scipy’s standard single linkage implementation. The robust single linkage hierarchy is available as an attribute of the robust single linkage clusterer, again with the ability to plot or export the hierarchy, and to extract flat clusterings at a given cut level and gamma value.

Example usage:

import hdbscan
from sklearn.datasets import make_blobs

# make_blobs returns (samples, labels); keep only the samples
data, _ = make_blobs(1000)

clusterer = hdbscan.RobustSingleLinkage(cut=0.125, k=7)
cluster_labels = clusterer.fit_predict(data)
hierarchy = clusterer.cluster_hierarchy_
# extract an alternative flat clustering from the fitted hierarchy
alt_labels = hierarchy.get_clusters(0.100, 5)
hierarchy.plot()

Based on the paper:

K. Chaudhuri and S. Dasgupta. “Rates of convergence for the cluster tree.” In Advances in Neural Information Processing Systems, 2010.

Installing

Easiest install, if you have Anaconda (thanks to conda-forge which is awesome!):

conda install -c conda-forge hdbscan

PyPI install, presuming you have sklearn and all its requirements (numpy and scipy) installed:

pip install hdbscan

Binary wheels for a number of platforms are available thanks to the work of Ryan Helinski <rlhelinski@gmail.com>.

If pip is having difficulties pulling the dependencies then we’d suggest installing the dependencies manually using anaconda followed by pulling hdbscan from pip:

conda install cython
conda install numpy scipy
conda install scikit-learn
pip install hdbscan

For a manual install get this package:

wget https://github.com/scikit-learn-contrib/hdbscan/archive/master.zip
unzip master.zip
rm master.zip
cd hdbscan-master

Install the requirements

sudo pip install -r requirements.txt

or

conda install scikit-learn cython

Install the package

python setup.py install

Running the Tests

The package tests can be run after installation using the command:

nosetests -s hdbscan

or, if nose is installed but nosetests is not in your PATH variable:

python -m nose -s hdbscan

If one or more of the tests fail, please report a bug at https://github.com/scikit-learn-contrib/hdbscan/issues/new

Python Version

The hdbscan library supports both Python 2 and Python 3. However, we recommend Python 3 as the better option if it is available to you.

Help and Support

For simple issues you can consult the FAQ in the documentation. If your issue is not suitably resolved there, please check the issues on GitHub. Finally, if no solution is available there, feel free to open an issue; the authors will attempt to respond in a reasonably timely fashion.

Contributing

We welcome contributions in any form! Assistance with documentation, particularly expanding tutorials, is always welcome. To contribute, please fork the project, make your changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.

Citing

If you have used this codebase in a scientific publication and wish to cite it, please use the Journal of Open Source Software article.

L. McInnes, J. Healy, S. Astels, hdbscan: Hierarchical density based clustering In: Journal of Open Source Software, The Open Journal, volume 2, number 11. 2017

To reference the high performance algorithm developed in this library, please cite our paper in the ICDMW 2017 proceedings.

McInnes L, Healy J. Accelerated Hierarchical Density Based Clustering In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42. 2017

Licensing

The hdbscan package is 3-clause BSD licensed. Enjoy.
