Deploy Dask Clusters

Dask works well at many scales ranging from a single machine to clusters of many machines. This section describes the many ways to deploy and run Dask, including the following:

Figure: An overview of cluster management with Dask distributed.

Single Machine

If you import Dask, set up a computation, and call compute, Dask uses the local threaded scheduler by default.

import dask.dataframe as dd
df = dd.read_csv(...)
df.x.sum().compute()  # This uses threads in your local process by default
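The scheduler can also be named explicitly. The following is a minimal sketch rather than a recommended setup: it uses dask.delayed in place of a dataframe so that it runs without any input files, and the function name square and the variable names are illustrative only.

```python
import dask
from dask import delayed


@delayed
def square(x):
    # Each call becomes one task in the Dask graph
    return x * x


# Build a lazy computation: the sum of squares of 0..4
total = delayed(sum)([square(i) for i in range(5)])

# No scheduler given: delayed objects use the local threaded scheduler
result_default = total.compute()

# Name the scheduler per call; "synchronous" runs in the calling thread,
# which is convenient when stepping through code with a debugger
result_sync = total.compute(scheduler="synchronous")

# Or set the scheduler for a block of code through dask.config
with dask.config.set(scheduler="threads"):
    result_threads = total.compute()

print(result_default, result_sync, result_threads)
```

All three calls compute the same value; only the place where the tasks run differs.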

Alternatively, you can set up a fully-featured Dask cluster on your local machine. This gives you access to multi-process computation and diagnostic dashboards.

from dask.distributed import LocalCluster
cluster = LocalCluster()
client = cluster.get_client()

# Dask works as normal and leverages the infrastructure defined above
df.x.sum().compute()
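A local cluster can also be sized explicitly and shut down cleanly when the work is done. The sketch below rests on a few assumptions: the worker counts are arbitrary, processes=False keeps the workers in the current process so the snippet runs without an `if __name__ == "__main__":` guard (process-based workers in a script need that guard), dashboard_address=":0" asks for any free port, and Client(cluster) is used as the long-standing way to connect a client to a cluster. The helper name sum_on_local_cluster is hypothetical.

```python
from dask.distributed import Client, LocalCluster


def sum_on_local_cluster(values):
    """Start a small local cluster, add up `values` on it, then shut it down."""
    with LocalCluster(
        n_workers=2,             # illustrative size, not a recommendation
        threads_per_worker=1,
        processes=False,         # in-process workers; see the note above
        dashboard_address=":0",  # let the dashboard pick any free port
    ) as cluster, Client(cluster) as client:
        # The dashboard URL is where the diagnostic pages are served
        print("Dashboard:", cluster.dashboard_link)
        future = client.submit(sum, values)  # run sum(values) on a worker
        return future.result()


local_total = sum_on_local_cluster([1, 2, 3])
print(local_total)
```

Using the cluster and client as context managers closes the workers and the scheduler even if the computation raises.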

The LocalCluster cluster manager defined above is easy to use and works well on a single machine. It follows the same interface as all other Dask cluster managers, and so it’s easy to swap out when you’re ready to scale up.

# You can swap out LocalCluster for other cluster types

from dask.distributed import LocalCluster
from dask_kubernetes import KubeCluster

# cluster = LocalCluster()
cluster = KubeCluster()  # example, you can swap out for Kubernetes

client = cluster.get_client()
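Because cluster managers share an interface, code can be written against a cluster object without knowing which kind it is. The following sketch illustrates that idea with a hypothetical helper, total_doubled_on, that only calls scale on the cluster it receives. It is exercised here with an in-process LocalCluster; passing a different cluster manager is assumed to work the same way but is not shown.

```python
from dask.distributed import Client, LocalCluster


def total_doubled_on(cluster, values):
    """Double each value on the given cluster and return the sum.

    `cluster` can be any cluster manager object; only the shared
    interface (scale, plus connecting a Client) is used here.
    """
    cluster.scale(2)  # request two workers from whichever manager this is
    with Client(cluster) as client:
        client.wait_for_workers(2)  # block until both workers have joined
        futures = client.map(lambda v: v * 2, values)
        return sum(client.gather(futures))


# Illustrative settings: in-process workers and any free dashboard port
with LocalCluster(n_workers=1, processes=False, dashboard_address=":0") as cluster:
    doubled_total = total_doubled_on(cluster, [1, 2, 3])

print(doubled_total)
```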

The following resources explain how to set up Dask on a variety of local and distributed hardware.

Single Machine

Dask runs perfectly well on a single machine with or without a distributed scheduler. Once you start using Dask for real workloads, though, the distributed scheduler makes both scaling and debugging noticeably easier.

  • Default Scheduler

    The no-setup default. Uses local threads or processes for larger-than-memory processing.

  • dask.distributed

    The newer, more sophisticated scheduler, running on a single machine. This provides more advanced features while still requiring almost no setup.

High Performance Computing

See High Performance Computers for more details.

  • Dask-Jobqueue

    Provides cluster managers for PBS, SLURM, LSF, SGE and other resource managers.

  • Dask-MPI

    Deploy Dask from within an existing MPI environment.

  • Dask Gateway for Jobqueue

    Multi-tenant, secure clusters. Once configured, users can launch clusters without direct access to the underlying HPC backend.

Kubernetes

See Kubernetes for more details.

Cloud

See Cloud for more details.

  • Dask-Yarn

    Deploy Dask on YARN clusters, such as are found in traditional Hadoop installations.

  • Dask Cloud Provider

    Constructs and manages ephemeral Dask clusters on AWS, DigitalOcean, Google Cloud, Azure, and Hetzner.

  • Coiled

    Commercial Dask deployment option, which handles the creation and management of Dask clusters on cloud computing environments (AWS and GCP).

Managed Solutions

  • Coiled handles the creation and management of Dask clusters on cloud computing environments (AWS and GCP).

  • Domino Data Lab lets users create Dask clusters in a hosted platform.

  • Saturn Cloud lets users create Dask clusters in a hosted platform or within their own AWS accounts.

Advanced Understanding

There are additional concepts to understand if you want to improve your deployment.