cellcutter is a Python module for creating thumbnail images of cells given a multi-channel TIFF image and segmentation mask

License: MIT

Create cell thumbnails

Module for creating thumbnail images of cells given a multi-channel TIFF image and segmentation mask.

Installation

pip install cellcutter

Usage

usage: cut_cells [-h] [-p P] [-z] [-f] [--window-size WINDOW_SIZE] [--mask-cells]
                 [--channels [CHANNELS ...]] [--cache-size CACHE_SIZE]
                 [--chunk-size CHUNK_SIZE | --cells-per-chunk CELLS_PER_CHUNK]
                 IMAGE SEGMENTATION_MASK CELL_DATA DESTINATION.zarr

Cut out thumbnail images of all cells. Thumbnails will be stored as Zarr array
(https://zarr.readthedocs.io/en/stable/index.html) with dimensions [#channels, #cells,
window_size, window_size]. The chunking shape greatly influences performance
https://zarr.readthedocs.io/en/stable/tutorial.html#chunk-optimizations.

positional arguments:
  IMAGE                 Path to image in TIFF format, potentially with multiple channels.
                        Thumbnails will be created from each channel.
  SEGMENTATION_MASK     Path to segmentation mask image in TIFF format. Used to automatically
                        choose window size and find cell outlines.
  CELL_DATA             Path to CSV file with a row for each cell. Must contain columns CellID
                        (must correspond to the cell IDs in the segmentation mask), Y_centroid,
                        and X_centroid (the coordinates of cell centroids). Only cells represented
                        in the given CSV file will be used, even if additional cells are present
                        in the segmentation mask. Cells are written to the Zarr array in the same
                        order as they appear in the CSV file.
  DESTINATION.zarr      Path to a new directory where cell thumbnails will be stored in Zarr
                        format. Must end in '.zarr' and must not already exist. These restrictions
                        can be lifted by using the '-f/--force' flag.

optional arguments:
  -h, --help            show this help message and exit
  -p P                  Number of processes run in parallel.
  -z                    Store thumbnails in a single zip file instead of a directory.
  -f, --force           Overwrite existing destination directory and don't enforce '.zarr'
                        extension.
  --window-size WINDOW_SIZE
                        Size of the cell thumbnail in pixels. Defaults to size of largest cell.
  --mask-cells          Fill every pixel not occupied by the target cell with zeros.
  --channels [CHANNELS ...]
                        Indices of channels (1-based) to include in the output e.g., --channels 1
                        3 5. Channels are included in the file in the given order. If not
                        specified, by default all channels are included. This option must be
                        *after* all positional arguments.
  --cache-size CACHE_SIZE
                        Cache size for reading image tiles in MB. For best performance the cache
                        size should be larger than the size of the image. (Default: 10240 MB = 10
                        GB)
  --chunk-size CHUNK_SIZE
                        Desired uncompressed chunk size in MB. (See
                        https://zarr.readthedocs.io/en/stable/tutorial.html#chunk-optimizations)
                        Since the other chunk dimensions are fixed as [#channels, #cells,
                        window_size, window_size], this argument determines the number of cells
                        per chunk. (Default: 32 MB)
  --cells-per-chunk CELLS_PER_CHUNK
                        Desired number of cells stored per Zarr array chunk. By default this is
                        determined automatically using the chunk size parameter. Setting this
                        option overrides the chunk size parameter.
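
Before running cut_cells, it can help to confirm that the cell data CSV has the three columns required above. The following is a minimal sketch using only the Python standard library. The helper name `missing_columns` and the sample table are hypothetical and not part of cellcutter.

```python
import csv
import io

# Column names required by cut_cells according to the usage text above
REQUIRED_COLUMNS = ("CellID", "Y_centroid", "X_centroid")

def missing_columns(csv_file):
    """Return the required column names absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return [name for name in REQUIRED_COLUMNS if name not in header]

# Hypothetical one-cell table standing in for a real quantification CSV
sample = io.StringIO("CellID,Y_centroid,X_centroid,Area\n1,10.5,20.0,50\n")
print(missing_columns(sample))  # prints []
```

For a file on disk, the same helper can be called on an open file handle, for example `missing_columns(open(path, newline=""))`.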

Example

Example data are available for download at mcmicro.org.

cut_cells exemplar-001/registration/exemplar-001.ome.tif \
  exemplar-001/segmentation/unmicst-exemplar-001/cellMask.tif \
  exemplar-001/quantification/unmicst-exemplar-001_cellMask.csv \
  cellMaskThumbnails.zarr

Reading the zarr array output

import zarr
from matplotlib import pyplot as plt

# Open the thumbnail array read-only
x = zarr.open("cellMaskThumbnails.zarr", mode="r")
x
# <zarr.core.Array (12, 9522, 46, 46) uint16 read-only>

# Show the first channel of the first cell
plt.imshow(x[0, 0, ...])

plt.figure(figsize=(10, 10))
for i in range(64):
    ax = plt.subplot(8, 8, i + 1)
    ax.axis("off")
    ax.imshow(x[0, i, ...])
plt.tight_layout()

Performance

Zarr arrays are chunked, meaning that they are split up into small pieces of equal size, and each chunk is stored in a separate file. Choice of the chunk size affects performance significantly.

Performance will also vary quite a bit depending on the access pattern. Slicing the array so that only data from a single chunk needs to be read from disk will be fast while array slices that cross many chunks will be slow.

An overview of chunking performance considerations is available in the Zarr tutorial (https://zarr.readthedocs.io/en/stable/tutorial.html#chunk-optimizations).

By default, cellcutter creates Zarr arrays with chunks of the size [channels in TIFF, x cells, thumbnail width, thumbnail height], meaning for a given cell, all channels and the entire thumbnail image are stored in the same chunk. The number of cells x per chunk is calculated internally such that each chunk has a total uncompressed size of about 32 MB.
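
The relationship between chunk size and cells per chunk can be sketched as simple arithmetic. This is an illustration, not cellcutter's internal code: the function name `cells_per_chunk` is hypothetical, and it assumes that 1 MB means 2**20 bytes and that the result is rounded down. Under those assumptions it reproduces the 660 cells per chunk reported for the example array in this section (12 channels, 46 x 46 pixel thumbnails, uint16 pixels).

```python
def cells_per_chunk(n_channels, window_size, bytes_per_pixel, chunk_size_mb=32):
    """Estimate how many cell thumbnails fit in one uncompressed chunk."""
    # One cell occupies all channels of a window_size x window_size thumbnail
    bytes_per_cell = n_channels * window_size * window_size * bytes_per_pixel
    # Assumption: 1 MB = 2**20 bytes, rounded down, with at least one cell per chunk
    return max(1, (chunk_size_mb * 2**20) // bytes_per_cell)

# 12 channels, 46 x 46 thumbnails, 2 bytes per uint16 pixel
print(cells_per_chunk(12, 46, 2))  # prints 660
```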

The default chunk size works well for access patterns that request all channels and the entire thumbnail for a given range of cells. Ideally, the cells should be contiguous along the second dimension of the array.

import zarr
from numpy.random import default_rng

z = zarr.open("cellMaskThumbnails.zarr", mode="r")
z.shape
# (12, 9522, 46, 46)
z.chunks
# (12, 660, 46, 46)

The chunks property gives the size of each chunk in the array. In this example, all 12 channels, 660 cells, and the complete thumbnail are stored in a single chunk.

By default, the number of cells per chunk is determined automatically. It can be set directly with the --cells-per-chunk argument to cut_cells or indirectly with --chunk-size.

Ideally, the number of cells per chunk should roughly match the number of cells requested in a typical array access operation.

Access patterns

100 Random cells

rng = default_rng()

def rand_slice(n=100):
    # Draw n distinct cell indices at random
    return rng.choice(z.shape[1], size=n, replace=False)

%%timeit
_ = z.get_orthogonal_selection((slice(None), rand_slice()))
# 101 ms ± 2.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

100 Contiguous cells

%%timeit
_ = z[:, 1000:1100, ...]
# 5.81 ms ± 45.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Accessing 100 random cells from the Zarr array takes around 100 ms, whereas accessing 100 contiguous cells (indices 1000 to 1099) takes only around 6 ms, a roughly 17-fold difference. This is because random cells are likely to be distributed across many separate chunks, and each of these chunks must be read into memory in full even if only a single cell is requested from it. Given that this particular array is split into 15 chunks in total, the speed difference suggests that every request for 100 random cells results in all chunks being read from disk.

In contrast, contiguous cells are stored together in one or several neighboring chunks minimizing the amount of data that has to be read from disk.
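
The difference can be illustrated by counting how many chunks each access pattern touches. This is a small sketch using only the standard library and the shape and chunking of the example array (9522 cells, 660 cells per chunk). The helper name `chunks_touched` is hypothetical, and the sketch reads no data; it only does the index arithmetic.

```python
import random

def chunks_touched(cell_indices, cells_per_chunk):
    """Return the set of chunk numbers that hold the requested cells."""
    return {i // cells_per_chunk for i in cell_indices}

n_cells, chunk_cells = 9522, 660  # cell count and cells per chunk of the example array

contiguous = range(1000, 1100)
scattered = random.Random(0).sample(range(n_cells), 100)

print(len(chunks_touched(contiguous, chunk_cells)))      # prints 1
print(len(chunks_touched(scattered, chunk_cells)))       # typically 14 or 15
print(len(chunks_touched(range(n_cells), chunk_cells)))  # prints 15
```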

Fast access to random cells

If access to random cells is required, for example for training a machine learning model, the performance penalty of requesting random cells can be avoided. Instead of requesting random slices of the array, randomize the cell order before the Zarr array is created. Because the cell order is then already random, a contiguous slice of cells can serve as a random batch during training. The simplest way to randomize cell order is to shuffle the rows of the CSV file that is passed to cut_cells, for example with pandas df.sample(frac=1).
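
A minimal sketch of the shuffling step with pandas is shown here. The four-row table is a hypothetical stand-in for a real quantification CSV, and the fixed `random_state` is an assumption made only so the shuffle is repeatable.

```python
import io
import pandas as pd

# Hypothetical stand-in for the quantification CSV passed to cut_cells
original = io.StringIO(
    "CellID,Y_centroid,X_centroid\n"
    "1,10.5,20.0\n"
    "2,30.0,41.5\n"
    "3,55.2,12.3\n"
    "4,70.1,64.8\n"
)
df = pd.read_csv(original)

# frac=1 returns every row in random order; random_state makes it repeatable
shuffled = df.sample(frac=1, random_state=0)

# index=False keeps the pandas row index out of the CSV text
shuffled_csv = shuffled.to_csv(index=False)
print(shuffled_csv)
```

In practice the shuffled table would be written to a file, for example with `shuffled.to_csv("shuffled.csv", index=False)` (a hypothetical file name), and that file passed as CELL_DATA. Because thumbnails are written in CSV row order, row i of the shuffled file corresponds to index i along the cell dimension of the array.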

A training loop using cell thumbnails created with this method could look something like this:

import pandas as pd
from timeit import default_timer
csv = pd.read_csv("exemplar-001/quantification/unmicst-exemplar-001_cellMask.csv")
# Validation batch size as a fraction of the training batch size
P = 0.2

# batch sizes
batch_size_train = 500
batch_size_valid = round(batch_size_train * P)

# training loop
for i, s in enumerate(
    range(0, len(csv) - batch_size_train - batch_size_valid, batch_size_train + batch_size_valid)
):
    # construct training and validation slices
    train_slice = (s, s + batch_size_train)
    valid_slice = (train_slice[1], train_slice[1] + batch_size_valid)
    # get training and validation thumbnails
    start_time = default_timer()
    x_train = z[:, train_slice[0]:train_slice[1], ...]
    x_valid = z[:, valid_slice[0]:valid_slice[1], ...]
    end_time = default_timer()
    print(
        f"Iteration {i} training using cells {train_slice}",
        f"validating using cells {valid_slice}",
        f"loading images took {round(end_time - start_time, 3)}s"
    )
    # Do training
Iteration 0 training using cells (0, 500) validating using cells (500, 600) loading images took 0.018s
Iteration 1 training using cells (600, 1100) validating using cells (1100, 1200) loading images took 0.03s
Iteration 2 training using cells (1200, 1700) validating using cells (1700, 1800) loading images took 0.024s
Iteration 3 training using cells (1800, 2300) validating using cells (2300, 2400) loading images took 0.023s
Iteration 4 training using cells (2400, 2900) validating using cells (2900, 3000) loading images took 0.023s
Iteration 5 training using cells (3000, 3500) validating using cells (3500, 3600) loading images took 0.023s
Iteration 6 training using cells (3600, 4100) validating using cells (4100, 4200) loading images took 0.023s
Iteration 7 training using cells (4200, 4700) validating using cells (4700, 4800) loading images took 0.023s
Iteration 8 training using cells (4800, 5300) validating using cells (5300, 5400) loading images took 0.023s
Iteration 9 training using cells (5400, 5900) validating using cells (5900, 6000) loading images took 0.027s
Iteration 10 training using cells (6000, 6500) validating using cells (6500, 6600) loading images took 0.02s
Iteration 11 training using cells (6600, 7100) validating using cells (7100, 7200) loading images took 0.019s
Iteration 12 training using cells (7200, 7700) validating using cells (7700, 7800) loading images took 0.023s
Iteration 13 training using cells (7800, 8300) validating using cells (8300, 8400) loading images took 0.023s
Iteration 14 training using cells (8400, 8900) validating using cells (8900, 9000) loading images took 0.025s
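
Each slice loaded in the loop above has the layout [#channels, #cells, window_size, window_size]. Many machine learning frameworks expect the sample axis first, so a batch may need its axes reordered before training. The following is a minimal sketch with NumPy; the zero-filled array is a stand-in for a real slice of the example array, and whether the channel axis belongs second or last depends on the framework.

```python
import numpy as np

# Stand-in for z[:, 0:500, ...]: (channels, cells, height, width)
x_train = np.zeros((12, 500, 46, 46), dtype=np.uint16)

# Move the cell axis to the front: (cells, channels, height, width)
x_train_cells_first = np.moveaxis(x_train, 1, 0)
print(x_train_cells_first.shape)  # prints (500, 12, 46, 46)
```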

Funding

This work is supported by the following:

  • NCI grants U54-CA225088, U2C-CA233262, and U2C-CA233280
  • NIH grant 1U54CA225088: Systems Pharmacology of Therapeutic and Adverse Responses to Immune Checkpoint and Small Molecule Drugs
