Minimal task scheduling abstraction


Dask provides multi-core execution on larger-than-memory datasets using blocked algorithms and task scheduling. It maps high-level NumPy and list operations on large datasets onto graphs of many operations on small in-memory datasets. It then executes these graphs in parallel on a single machine. Dask lets us use traditional NumPy and list programming while operating on inconveniently large data in a small amount of space.

  • dask is a specification to describe task dependency graphs.

  • dask.array is a drop-in NumPy replacement (for a subset of NumPy) that encodes blocked algorithms in dask dependency graphs.

  • dask.bag encodes blocked algorithms on Python lists of arbitrary Python objects.

  • dask.async is a shared-memory asynchronous scheduler that efficiently executes dask dependency graphs on multiple cores.

Dask does not currently have a distributed memory scheduler.

See full documentation at http://dask.pydata.org or read developer-focused blogposts about dask’s development.

Install

Dask is easily installable through your favorite Python package manager:

conda install dask

or

pip install dask[array]

or

pip install dask[bag]

or

pip install dask[complete]

Dask Graphs

Consider the following simple program:

def inc(i):
    return i + 1

def add(a, b):
    return a + b

x = 1
y = inc(x)
z = add(y, 10)

We encode this as a dictionary in the following way:

d = {'x': 1,
     'y': (inc, 'x'),
     'z': (add, 'y', 10)}

While less aesthetically pleasing, this dictionary may now be analyzed, optimized, and computed on by other Python code, not just the Python interpreter.

[Figure: A simple dask dictionary]
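To make "computed on by other Python code" concrete, here is a minimal sketch of a recursive evaluator for the dictionary above. The helpers `simple_get` and `_evaluate` are hypothetical names introduced for illustration; this is not dask's own implementation, and it assumes that a task is a tuple whose first element is callable and that any argument equal to a key refers to that key's value.

```python
def inc(i):
    return i + 1

def add(a, b):
    return a + b

d = {'x': 1,
     'y': (inc, 'x'),
     'z': (add, 'y', 10)}

def _evaluate(graph, value):
    # A task is a tuple whose first element is callable:
    # evaluate its arguments, then call the function.
    if isinstance(value, tuple) and value and callable(value[0]):
        func, args = value[0], value[1:]
        return func(*[_evaluate(graph, arg) for arg in args])
    # A hashable value that is a key in the graph refers to that entry.
    try:
        if value in graph:
            return _evaluate(graph, graph[value])
    except TypeError:
        pass  # unhashable values (e.g. lists) cannot be keys
    # Anything else is literal data.
    return value

def simple_get(graph, key):
    """Hypothetical helper: compute the value stored under `key`."""
    return _evaluate(graph, graph[key])

result = simple_get(d, 'z')
print(result)  # 12
```

This sketch recomputes shared dependencies each time they are needed; a practical scheduler would cache intermediate results and run independent tasks concurrently.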

Dask Arrays

The dask.array module creates these graphs from NumPy-like operations:

>>> import dask.array as da
>>> x = da.random.random((4, 4), blockshape=(2, 2))
>>> x.T[0, 3].dask
{('x', 0, 0): (np.random.random, (2, 2)),
 ('x', 0, 1): (np.random.random, (2, 2)),
 ('x', 1, 0): (np.random.random, (2, 2)),
 ('x', 1, 1): (np.random.random, (2, 2)),
 ('y', 0, 0): (np.transpose, ('x', 0, 0)),
 ('y', 0, 1): (np.transpose, ('x', 1, 0)),
 ('y', 1, 0): (np.transpose, ('x', 0, 1)),
 ('y', 1, 1): (np.transpose, ('x', 1, 1)),
 ('z',): (getitem, ('y', 0, 1), (0, 1))}
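The 'y' entries in the graph above follow a simple pattern: output block (i, j) is the transpose of input block (j, i). The sketch below generates that part of the graph in pure Python, with nested lists standing in for NumPy blocks. The names `transpose_block` and `blocked_transpose_graph` are hypothetical and are not part of dask.array.

```python
def transpose_block(block):
    # Transpose one small in-memory block stored as a list of rows.
    return [list(row) for row in zip(*block)]

def blocked_transpose_graph(name, out_name, nblocks):
    """Hypothetical helper: build the task graph for a blocked transpose.

    `nblocks` is the (rows, columns) count of blocks in the input array.
    Output block (i, j) depends on input block (j, i).
    """
    block_rows, block_cols = nblocks
    return {(out_name, i, j): (transpose_block, (name, j, i))
            for i in range(block_cols)
            for j in range(block_rows)}

graph = blocked_transpose_graph('x', 'y', (2, 2))
for key in sorted(graph):
    print(key, '<-', graph[key][1])
```

Each value is a task tuple whose argument, such as ('x', 1, 0), is itself a key of the graph rather than a task, because its first element is not callable.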

Finally, a scheduler executes these graphs to achieve the intended result. The dask.async module contains a shared memory scheduler that efficiently leverages multiple cores.
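As a rough illustration of what a shared-memory scheduler does, the sketch below runs a graph on a thread pool from the standard library, submitting each task as soon as its dependencies are available. The function `threaded_get` and its helpers are hypothetical; this is not the dask.async implementation. It assumes flat tasks (no tasks nested inside arguments) and treats every non-task value as literal data.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def inc(i):
    return i + 1

def add(a, b):
    return a + b

def is_task(value):
    # A task is a tuple whose first element is callable.
    return isinstance(value, tuple) and len(value) > 0 and callable(value[0])

def is_key(graph, value):
    try:
        return value in graph
    except TypeError:  # unhashable argument such as a list
        return False

def threaded_get(graph, key, max_workers=4):
    """Hypothetical helper: compute `key` by running ready tasks in threads."""
    # Plain data needs no computation.
    results = {k: v for k, v in graph.items() if not is_task(v)}
    pending = set(graph) - set(results)
    running = {}  # maps each in-flight future to its key
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while key not in results:
            # Submit every pending task whose dependencies are all computed.
            for k in list(pending):
                func, args = graph[k][0], graph[k][1:]
                deps = [a for a in args if is_key(graph, a)]
                if all(dep in results for dep in deps):
                    resolved = [results[a] if is_key(graph, a) else a
                                for a in args]
                    running[pool.submit(func, *resolved)] = k
                    pending.discard(k)
            if not running:
                raise ValueError("key is missing or the graph has a cycle")
            # Block until at least one running task finishes.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                results[running.pop(future)] = future.result()
    return results[key]

d = {'x': 1,
     'y': (inc, 'x'),
     'z': (add, 'y', 10)}

result = threaded_get(d, 'z')
print(result)  # 12
```

Threads only run in parallel when the task functions release Python's global interpreter lock, as many NumPy operations do; pure-Python functions such as `inc` and `add` gain nothing from the pool and serve here only to show the scheduling order.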

Dependencies

dask.core supports Python 2.6+ and Python 3.3+ with a common codebase. It is pure Python and requires nothing beyond the standard library, which makes it a lightweight dependency.

dask.array depends on numpy.

dask.bag depends on toolz and dill.

LICENSE

New BSD. See License File.


Source Distribution

dask-0.4.0.tar.gz (102.4 kB)

File hashes

Hashes for dask-0.4.0.tar.gz
Algorithm Hash digest
SHA256 a869c81fd967d0c5ca4a3e30079b99cf9860830eca2a35a683a7f779d3600b95
MD5 40569863d28316305d8abc1f8d97af1b
BLAKE2b-256 fe829588164f7de3b494ec9205119aae939edb65d70e96a9231655c8d41e88e8
