Distributed computing¶
divvy¶
Certain functions in the ngs_toolkit
toolkit can make use of distributed
computing. To achieve this for a variety of computing configurations
it uses the divvy library.
Divvy provides an abstract way of submitting a job to various job managers by shipping job templates for each configuration.
When divvy
starts, a configuration is chosen (the compute_configuration
attribute) and that template gets filled with the attributes of the job -
the code to be executed, the resouce requirements and others
(e.g. “cores”, “mem”, “time” attributes).
To see all supported compute configurations run:
divvy list
For more information on how to configure divvy
, see its documentation:
http://divvy.databio.org/
To let ngs_toolkit
know which divvy
configuration to use by default,
modify the following section in the ngs_toolkit
configuration file:
preferences:
# The next item is the default computing configuration to use from divvy.
# Run "divvy list" to see all options.
# See more here: http://code.databio.org/divvy/
computing_configuration: 'slurm'
This will make ngs_toolkit
send jobs to a slurm cluster if wanted.
All functions that allow running a task in a distributed manner have a
distributed
keyword argument.
In addition, they also accept additional keyword arguments (kwargs in the
function signature) where additional options can be passed.
These options must match fields available to format of the currently selected
compute_configuration
.
Sending jobs and collecting output¶
Performing a taks in a distributed manner can therefore be as simple as calling
the desired function with distributed=True
. Jobs will be sent to the
job manager of the chosen computing configuration.
However, since the jobs are often run individually for a sample/group of samples,
functions called with distributed=True
may not return the same output as
distributed=False
.
For that reason, for all such functions, there is a reciprocal function of
identical name as the first prefixed with collect
.
from ngs_toolkit.demo import generate_project
an = generate_project(sample_input_files=True)
an.measure_coverage(distributed=True)
coverage = collect_coverage()
Implementing automatic collection of job outputs in part of future plans.
Example¶
The ngs_toolkit.atacseq.ATACSeqAnalysis.measure_coverage()
function has
distributed
and kwargs
options.
This provides code portability and allows customization of various aspects of the jobs:
from ngs_toolkit.demo import generate_project
an = generate_project(sample_input_files=True)
# in serial
cov1 = an.measure_coverage()
# as slurm jobs (because the config computing_configuration is set to 'slurm')
an.measure_coverage(distributed=True)
cov2 = collect_coverage()
# confirm the output is the same
assert (cov2 == cov1).all().all()
# as slurm jobs to a particular queue and more memory
an.measure_coverage(distributed=True, partition="longq", mem=24000)
# here 'partition' and 'mem' are attributes of the slurm divvy template
# and not magic attributes