Single-Cell Analysis in Python.

Getting started | Features | Installation | References

Scanpy – Single-Cell Analysis in Python

Efficient tools for analyzing and simulating large-scale single-cell data that aim at an understanding of dynamic biological processes from snapshots of transcriptome or proteome. The draft `Wolf, Angerer & Theis (2017) <>`__ explains conceptual ideas of the package. Any comments are appreciated!

Getting started

Download or clone the repository – green button on top of the page – and ``cd`` into its root directory. With Python 3.5 or 3.6 (preferably Miniconda_) installed, type::

    pip install -e .

Aside from enabling ``import scanpy as sc`` anywhere on your system, you can also work with the top-level command ``scanpy`` on the command-line (more info `here <Installation_>`__).

Then go through the use cases compiled in scanpy_usage_, in particular the recent additions:

- We reproduce most of the `Guided Clustering tutorial`_ of Seurat_ [Macosco15]_.
- Analyzing 68 000 cells from [Zheng17]_, we find that Scanpy is about 5 to 16 times faster and more memory-efficient than the `Cell Ranger`_ R kit for secondary analysis.
- We reproduce the results of the Diffusion Pseudotime (DPT) paper of [Haghverdi16]_. Note that DPT has recently been very `favorably discussed`_ by the authors of Monocle_.

Let us give an Overview_ of the top-level user functions, followed by a few words on Scanpy's `Basic Features`_ and more `details <Visualization_>`__.


Scanpy user functions are grouped into the following modules:

- Machine Learning and statistics tools. Abbreviation ``sc.tl``.
- Preprocessing. Abbreviation ``sc.pp``.
- Plotting. Abbreviation ``sc.pl``.

`pp.* <sc.preprocessing_>`__
    Filtering of highly-variable genes, batch-effect correction, per-cell (UMI) normalization.
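
To make the idea of per-cell normalization concrete, here is a minimal pure-Python sketch that rescales the counts of every cell to the same total. The function name ``normalize_per_cell_sketch`` and its ``target_sum`` parameter are hypothetical and chosen for illustration only; they are not Scanpy's API, and the preprocessing functions in Scanpy may differ in their defaults and details.

```python
def normalize_per_cell_sketch(counts, target_sum=None):
    """Rescale each row (cell) of `counts` to the same total count.

    `counts` is a list of rows, one row per cell, one entry per gene.
    If `target_sum` is None, the median of the per-cell totals is used.
    Cells with a total of zero are left unscaled.
    """
    totals = [sum(row) for row in counts]
    if target_sum is None:
        ordered = sorted(totals)
        mid = len(ordered) // 2
        if len(ordered) % 2:
            target_sum = ordered[mid]
        else:
            target_sum = (ordered[mid - 1] + ordered[mid]) / 2
    normalized = []
    for row, total in zip(counts, totals):
        # Hypothetical convention: empty cells keep a scale factor of 1.
        scale = target_sum / total if total > 0 else 1.0
        normalized.append([value * scale for value in row])
    return normalized


counts = [[1, 3], [10, 30], [0, 0]]
normalized = normalize_per_cell_sketch(counts, target_sum=8)
print(normalized)
```

After the call, the first two cells have the same total even though their raw sequencing depth differs by a factor of ten.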


`tl.pca <pca_>`__
    PCA [Pedregosa11]_.
`tl.diffmap <diffmap_>`__
    Diffusion Maps [Coifman05]_ [Haghverdi15]_ [Wolf17]_.
`tl.tsne <tsne_>`__
    t-SNE [Maaten08]_ [Amir13]_ [Pedregosa11]_.
`tl.draw_graph <draw_graph_>`__
    Force-directed graph drawing [Csardi06]_ [Weinreb17]_.

Branching trajectories and pseudotime, clustering, differential expression

`tl.dpt <dpt_>`__
    Infer progression of cells, identify *branching* subgroups [Haghverdi16]_ [Wolf17]_.
`tl.louvain <louvain_>`__
    Cluster cells into subgroups [Blondel08]_ [Traag17]_.
`tl.rank_genes_groups <rank_genes_groups_>`__
    Rank genes according to differential expression [Wolf17]_.


`tl.sim <sim_>`__
    Simulate dynamic gene expression data [Wittmann09]_ [Wolf17]_.

Basic Features

The typical workflow consists of subsequent calls of data analysis tools
of the form::

    sc.tl.tool(adata, **params)

where ``adata`` is an ``AnnData`` object and ``params`` is a dictionary that stores optional parameters. Each of these calls adds annotation to an expression matrix *X*, which stores *n* *d*-dimensional gene expression measurements. By default, Scanpy tools operate *in place* and return ``None``. If you want to copy the ``AnnData`` object, pass the ``copy`` argument::

    adata_copy = sc.tl.tool(adata, copy=True, **params)
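
The calling convention can be illustrated without Scanpy itself. The following sketch uses a plain dictionary in place of an ``AnnData`` object and a made-up function ``toy_tool``; both are assumptions for illustration and not part of the package. It only shows the contract: annotate in place and return ``None`` by default, or return an annotated copy when ``copy=True``.

```python
import copy as copy_module


def toy_tool(data, copy=False, **params):
    """Hypothetical tool that follows the in-place / copy convention.

    Writes an annotation into `data` (a plain dict standing in for AnnData).
    Returns None when operating in place, or the annotated copy if copy=True.
    """
    target = copy_module.deepcopy(data) if copy else data
    target["toy_result"] = {"n_values": len(target["X"]), "params": dict(params)}
    return target if copy else None


data = {"X": [[1.0, 2.0], [3.0, 4.0]]}

# In-place call: returns None and annotates `data` itself.
result = toy_tool(data, scale=2)

# Copying call: the original stays untouched, the copy carries the annotation.
fresh = {"X": [[1.0, 2.0], [3.0, 4.0]]}
annotated = toy_tool(fresh, copy=True, scale=2)
```

Returning ``None`` from the in-place variant makes it hard to confuse the two modes by accident, since chaining on the result of an in-place call fails immediately.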

Reading and writing data files and AnnData objects

One usually calls::

    adata = sc.read(filename)

to initialize an AnnData object, possibly adds further annotation, e.g. by::

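    import numpy as np

    # read all columns as strings; convert numerical columns explicitly
    annotation = np.genfromtxt(filename_annotation, dtype=str)
    adata.smp['cell_groups'] = annotation[:, 2]  # categorical annotation of type str
    adata.smp['time'] = annotation[:, 3].astype(float)  # numerical annotation of type float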

and uses::

    sc.write(filename, adata)

to save ``adata`` to a file. Reading supports files with the extensions *h5*, *xlsx*, *mtx*, *txt*, *csv* and others. Writing supports *h5*, *csv* and *txt*. Instead of providing a filename, you can provide a *filekey*, i.e., any string that does *not* end in a valid file extension.
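
The distinction between a filename and a filekey can be sketched as a simple check on the extension. The helper ``is_filekey`` and the set ``KNOWN_EXTENSIONS`` below are hypothetical; the set contains only the extensions named in the text, whereas the list Scanpy actually accepts is longer.

```python
import os

# Extensions named in the text above; the real list is longer ("and others").
KNOWN_EXTENSIONS = {"h5", "xlsx", "mtx", "txt", "csv"}


def is_filekey(name):
    """Return True if `name` does not end in one of the known extensions."""
    extension = os.path.splitext(name)[1].lstrip(".").lower()
    return extension not in KNOWN_EXTENSIONS


for candidate in ["data/pbmc.h5", "pbmc_run1", "counts.CSV"]:
    kind = "filekey" if is_filekey(candidate) else "filename"
    print(candidate, "->", kind)
```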

AnnData objects

An ``AnnData`` instance stores an array-like data matrix as ``adata.X``, dict-like sample annotation as ``adata.smp``, dict-like variable annotation as ``adata.var`` and additional unstructured dict-like annotation as ``adata.add``. While ``adata.add`` is a conventional dictionary, ``adata.smp`` and ``adata.var`` are instances of a low-level Pandas dataframe-like class.

Values can be retrieved and appended via ``adata.smp[key]`` and ``adata.var[key]``. Sample and variable names can be accessed via ``adata.smp_names`` and ``adata.var_names``, respectively. AnnData objects can be sliced like Pandas dataframes, for example, ``adata = adata[:, list_of_gene_names]``. The AnnData class is similar to R's ExpressionSet [Huber15]_; the latter, however, is not implemented for sparse data.
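
The layout described above can be mimicked with a few lines of plain Python. The class ``ToyAnnData`` below is a hypothetical stand-in written for illustration; it is not the real ``AnnData`` class, uses nested lists instead of arrays, and supports only the column selection ``data[:, list_of_gene_names]``.

```python
class ToyAnnData:
    """Minimal stand-in for the container described above (not the real class)."""

    def __init__(self, X, smp_names, var_names):
        self.X = [list(row) for row in X]  # n samples x d variables
        self.smp_names = list(smp_names)
        self.var_names = list(var_names)
        self.smp = {}  # per-sample annotation, key -> list of length n
        self.var = {}  # per-variable annotation, key -> list of length d
        self.add = {}  # unstructured annotation

    def __getitem__(self, index):
        """Support `data[:, list_of_gene_names]`-style column selection only."""
        rows, names = index
        if rows != slice(None):
            raise NotImplementedError("this sketch only supports ':' for rows")
        cols = [self.var_names.index(name) for name in names]
        subset = ToyAnnData(
            [[row[c] for c in cols] for row in self.X],
            self.smp_names,
            [self.var_names[c] for c in cols],
        )
        # Sample annotation is kept whole, variable annotation is subset.
        subset.smp = {key: list(values) for key, values in self.smp.items()}
        subset.var = {key: [values[c] for c in cols] for key, values in self.var.items()}
        subset.add = dict(self.add)
        return subset


data = ToyAnnData([[1, 2, 3], [4, 5, 6]], ["cell_a", "cell_b"], ["g1", "g2", "g3"])
data.smp["time"] = [0.0, 1.5]
data.var["is_marker"] = [True, False, True]
subset = data[:, ["g3", "g1"]]
print(subset.X, subset.var_names)
```

The point of the sketch is that slicing by variable names carries the matching annotation along, so the data matrix and its annotation cannot drift apart.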


For each tool, there is an associated plotting function::

    sc.pl.tool(adata)

that retrieves and plots the elements of ``adata`` that were previously written by ``sc.tl.tool(adata)``. Scanpy's plotting module can be viewed as similar to Seaborn_: an extension of matplotlib_ that allows visualizing operations on AnnData objects with one-line commands. Detailed configuration has to be done via matplotlib functions, which is easy as Scanpy's plotting functions accept and return a ``matplotlib.axes.Axes`` object.

`[source] <scanpy/tools/>`__ Computes the PCA representation ``X_pca`` of data, principal components and variance decomposition. Uses the implementation of the ``scikit-learn`` package [Pedregosa11]_.


`[source] <scanpy/tools/>`__ Computes the tSNE representation ``X_tsne`` of data.

The algorithm was introduced by [Maaten08]_ and proposed for single-cell data by [Amir13]_. By default, Scanpy uses the implementation of the ``scikit-learn`` package [Pedregosa11]_. You can achieve a substantial speedup if you install the Multicore-tSNE package by [Ulyanov16]_, which will be automatically detected by Scanpy.


`[source] <scanpy/tools/>`__ Computes the diffusion maps representation ``X_diffmap`` of data.

Diffusion maps [Coifman05]_ have been proposed for visualizing single-cell data by [Haghverdi15]_. The tool uses the adapted Gaussian kernel suggested by [Haghverdi16]_. The Scanpy implementation is due to [Wolf17]_.
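
As a rough intuition for the starting point of diffusion maps, the sketch below builds a row-normalized Gaussian kernel matrix, which can be read as the transition matrix of a random walk on the data. It is a deliberate simplification: it uses a plain Gaussian kernel with a fixed width ``sigma`` rather than the adapted kernel of [Haghverdi16]_, it stops before the eigendecomposition that yields the actual diffusion-map coordinates, and the function name is hypothetical.

```python
import math


def gaussian_transition_matrix(points, sigma=1.0):
    """Row-normalized Gaussian kernel matrix for a list of points.

    Entry [i][j] can be read as the probability of stepping from point i
    to point j in one step of a random walk on the data.
    """
    transition = []
    for p in points:
        row = []
        for q in points:
            sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
            row.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
        row_sum = sum(row)  # always > 0 because the diagonal entry is 1
        transition.append([value / row_sum for value in row])
    return transition


points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
T = gaussian_transition_matrix(points, sigma=1.0)
for row in T:
    print([round(value, 4) for value in row])
```

Nearby points receive high transition probabilities and distant points almost none, which is what lets the leading eigenvectors of such a matrix capture the large-scale geometry of the data.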


`[source] <scanpy/tools/>`__ Force-directed graph drawing is a long-established algorithm for visualizing graphs, see `Force-directed graph drawing`_. It has been suggested for visualizing single-cell data by [Weinreb17]_.

Here, the Fruchterman & Reingold [Fruchterman91]_ algorithm is used by default, but many other layouts are available. We use the igraph implementation [Csardi06]_.

Discrete clustering of subgroups, continuous progression through subgroups, differential expression


`[source] <scanpy/tools/>`__ Reconstruct the progression of a biological process from snapshot data and detect branching subgroups. Diffusion Pseudotime analysis has been introduced by [Haghverdi16]_ and implemented for Scanpy by [Wolf17]_.

The functionality of diffmap and dpt is comparable to that of the R package destiny_ of [Angerer16]_, but both run faster and scale to much higher cell numbers.

*Examples:* See this `use case <>`__.

`[source] <scanpy/tools/>`__ Cluster cells using the Louvain algorithm [Blondel08]_ in the implementation of [Traag17]_.

The Louvain algorithm has been proposed for single-cell analysis by [Levine15]_.
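
The Louvain algorithm of [Blondel08]_ searches for a partition of a graph that maximizes modularity. The following sketch computes that quantity for a given partition of an undirected, unweighted graph without self-loops; it does not perform the optimization itself. The function name and the example graph are made up for illustration, and the implementation of [Traag17]_ may optimize other quality functions as well.

```python
def modularity(edges, communities):
    """Modularity Q of a partition of an undirected, unweighted graph.

    `edges` is a list of (u, v) pairs without self-loops and
    `communities` maps each node to a community label.
    Q = sum over communities of  L_c / m - (d_c / (2 m)) ** 2,
    with m edges in total, L_c edges inside community c and
    d_c the summed degree of the nodes in c.
    """
    m = len(edges)
    degree = {}
    internal = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if communities[u] == communities[v]:
            label = communities[u]
            internal[label] = internal.get(label, 0) + 1
    total_degree = {}
    for node, deg in degree.items():
        label = communities[node]
        total_degree[label] = total_degree.get(label, 0) + deg
    return sum(
        internal.get(label, 0) / m - (total / (2.0 * m)) ** 2
        for label, total in total_degree.items()
    )


# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
single = {node: "a" for node in range(6)}
q_split = modularity(edges, split)
q_single = modularity(edges, single)
print(q_split, q_single)
```

Splitting the graph along the single bridging edge yields a higher modularity than putting all nodes into one community, which is the kind of partition a modularity-based clustering favors.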

*Examples:* See this `use case <>`__.


`[source] <scanpy/tools/>`__ Rank genes by differential expression.

*Examples:* See this `use case <>`__.
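
To give an idea of what ranking genes by differential expression can look like, the sketch below scores each gene by a Welch-type t-like statistic of one group against all remaining cells and sorts genes by that score. This is an assumption made for illustration: the function ``rank_genes_sketch`` is hypothetical, and the scoring used by Scanpy's tool may differ.

```python
import statistics


def rank_genes_sketch(expression, groups, group, gene_names):
    """Rank genes by a t-like score of `group` versus all other cells.

    `expression` holds one row per cell and one column per gene,
    `groups` holds one label per cell. Each side of the comparison
    needs at least two cells so that a variance can be computed.
    """
    in_group = [row for row, g in zip(expression, groups) if g == group]
    rest = [row for row, g in zip(expression, groups) if g != group]
    scores = []
    for j, name in enumerate(gene_names):
        a = [row[j] for row in in_group]
        b = [row[j] for row in rest]
        var_term = statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
        diff = statistics.mean(a) - statistics.mean(b)
        score = diff / var_term ** 0.5 if var_term > 0 else 0.0
        scores.append((score, name))
    # Highest score first: genes most strongly up-regulated in `group`.
    return [name for score, name in sorted(scores, key=lambda item: -item[0])]


expression = [
    [5.0, 1.0, 0.0],
    [6.0, 1.2, 0.1],
    [1.0, 1.1, 3.0],
    [1.5, 0.9, 3.5],
]
groups = ["a", "a", "b", "b"]
gene_names = ["g_up", "g_flat", "g_down"]
ranking = rank_genes_sketch(expression, groups, "a", gene_names)
print(ranking)
```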



`[source] <scanpy/tools/>`__ Sample from a stochastic differential equation model built from literature-curated boolean gene regulatory networks, as suggested by [Wittmann09]_. The Scanpy implementation is due to [Wolf17]_.

The tool is comparable to the Matlab tool *Odefy* of [Krumsiek10]_.

*Examples:* See this `use case <>`__.
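
Sampling from a stochastic differential equation can be sketched with the Euler-Maruyama scheme. The model below is a toy two-gene mutual-inhibition system with Hill-type terms, chosen purely for illustration; it is not one of the literature-curated networks mentioned above, and the function name, parameters and initial values are all hypothetical.

```python
import random


def simulate_toggle_switch(n_steps=500, dt=0.02, noise=0.05, seed=0):
    """Euler-Maruyama integration of a toy two-gene mutual-inhibition model.

    dx = (1 / (1 + y**4) - x) dt + noise dW_x
    dy = (1 / (1 + x**4) - y) dt + noise dW_y
    """
    rng = random.Random(seed)
    x, y = 0.6, 0.4
    trajectory = [(x, y)]
    sqrt_dt = dt ** 0.5
    for _ in range(n_steps):
        # Deterministic drift plus a Gaussian increment scaled by sqrt(dt).
        dx = (1.0 / (1.0 + y ** 4) - x) * dt + noise * sqrt_dt * rng.gauss(0.0, 1.0)
        dy = (1.0 / (1.0 + x ** 4) - y) * dt + noise * sqrt_dt * rng.gauss(0.0, 1.0)
        # Clip at zero so that expression levels stay non-negative.
        x, y = max(x + dx, 0.0), max(y + dy, 0.0)
        trajectory.append((x, y))
    return trajectory


trajectory = simulate_toggle_switch()
print(len(trajectory), trajectory[-1])
```

Each entry of ``trajectory`` is one simulated expression state; sampling states along many such trajectories gives the kind of snapshot data that the analysis tools above take as input.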


If you use Windows or Mac OS X and do not have a current Python distribution (Python 3.5 or 3.6), download and install Miniconda_ (see below). If you use Linux, use your package manager to obtain a current Python distribution.

Then, download or clone the repository – green button on top of the page – and ``cd`` into its root directory. To install with symbolic links (stay up to date with your cloned version after you update with ``git pull``) call::

    pip install -e .

and work with the top-level command ``scanpy`` or::

    import scanpy.api as sc

in any directory.

Installing Miniconda

After downloading Miniconda_, in a unix shell (Linux, Mac), run::

    chmod +x

and accept all suggestions. Either reopen a new terminal or ``source ~/.bashrc`` on Linux/ ``source ~/.bash_profile`` on Mac. The whole process takes just a couple of minutes.

The package is registered_ in the `Python Packaging Index`_, but
versioning has not started yet. In the future, installation will also be
possible without reference to GitHub via ``pip install scanpy``.

