Recipes¶
ngs_toolkit provides scripts to perform routine tasks on NGS data - they are called recipes.
Recipes are distributed with ngs_toolkit and can be seen in the github repository.
To make it convenient to run the scripts on data from a project, recipes can be run with the command projectmanager recipe <recipe_name> <project_config.yaml>
.
ngs_analysis¶
This recipe will perform general NGS analysis on 3 data types: ATAC-seq, ChIP-seq and RNA-seq.
For ATAC and ChIP-seq, quantification and annotation of genomic regions will be performed.
Standard analysis appropriate for each data type will proceed with cross-sample normalization, unsupervised analysis and supervised analysis if a comparison_table
is provided.
This recipe uses variables provided in the project configuration file project_name
, sample_attributes
and group_attributes
.
Here are the command-line arguments to use it in a stand-alone script:
usage: python -m ngs_toolkit.recipe.ngs_analysis [-h] [-n NAME] [-o RESULTS_DIR]
[-t {ATAC-seq,RNA-seq,ChIP-seq}] [-q] [-a ALPHA]
[-f ABS_FOLD_CHANGE]
config_file
positional arguments:
config_file YAML project configuration file.
optional arguments:
-h, --help show this help message and exit
-n NAME, --analysis-name NAME
Name of analysis. Will be the prefix of output_files.
By default it will be the name of the Project given in
the YAML configuration.
-o RESULTS_DIR, --results-output RESULTS_DIR
Directory for analysis output files. Default is
'results' under the project roort directory.
-t {ATAC-seq,RNA-seq,ChIP-seq}, --data-type {ATAC-seq,RNA-seq,ChIP-seq}
Data type to restrict analysis to. Default is to run
separate analysis for each data type.
-q, --pass-qc Whether only samples with a 'pass_qc' value of '1' in
the annotation sheet should be used.
-a ALPHA, --alpha ALPHA
Alpha value of confidence for supervised analysis.
-f ABS_FOLD_CHANGE, --fold-change ABS_FOLD_CHANGE
Absolute log2 fold change value for supervised
analysis.
call_peaks¶
This recipe will call peaks for samples in a fashion described in a comparison table.
It is capable of parallelizing work in jobs if a SLURM cluster is available.
Here are the command-line arguments to use it in a stand-alone script:
usage: python -m ngs_toolkit.recipe.call_peaks [-h] [-c COMPARISON_TABLE] [-t] [-j]
[-o RESULTS_DIR]
config_file
Call peaks recipe.
positional arguments:
config_file YAML project configuration file.
optional arguments:
-h, --help show this help message and exit
-c COMPARISON_TABLE, --comparison-table COMPARISON_TABLE
Comparison table to use for peak calling. If not
provided will use a filenamed `comparison_table.csv`
in the same directory of the given YAML Project
configuration file.
-t, --only-toggle Whether only comparisons with 'toggle' value of '1' in
the should be performed.
-j, --as-jobs Whether jobs should be created for each sample, or it
should run in serial mode.
-o RESULTS_DIR, --results-output RESULTS_DIR
Directory for analysis output files. Default is
'results' under the project roort directory.
region_set_frip¶
This recipe will perform fraction of reads in peaks (FRiP) for ATAC-seq or ChIP-seq samples based on a set of regions discovered across all samples in a given project or in an external gold region set.
If the external region set is not given, a region set derived from all samples already exists (e.g. from running the ngs_analysis recipe) the same one will be used, otherwise it will be produced.
Here are the command-line arguments to use it in a stand-alone script:
usage: python -m ngs_toolkit.recipe.region_set_frip [-h] [-n NAME] [-r REGION_SET] [-q] [-j]
[-o RESULTS_DIR]
config_file
Region set FRiP recipe.
positional arguments:
config_file YAML project configuration file.
optional arguments:
-h, --help show this help message and exit
-n NAME, --analysis-name NAME
Name of analysis. Will be the prefix of output_files.
By default it will be the name of the Project given in
the YAML configuration.
-r REGION_SET, --region-set REGION_SET
BED file with region set derived from several samples
or Oracle region set. If unset, will try to get the
`sites` attribute of an existing analysis object if
existing, otherwise will create a region set from the
peaks of all samples.
-q, --pass-qc Whether only samples with a 'pass_qc' value of '1' in
the annotation sheet should be used.
-j, --as-jobs Whether jobs should be created for each sample, or it
should run in serial mode.
-o RESULTS_DIR, --results-output RESULTS_DIR
Directory for analysis output files. Default is
'results' under the project roort directory.
merge_signal¶
This recipe will merge signal from various ATAC-seq or ChIP-seq samples given a set of attributes to group samples by.
It produces merged BAM and bigWig files for all signal in the samples but is also capable of producing this for nucleosomal/nucleosomal free signal based on fragment length distribution if data is paired-end sequenced. This signal may optionally be normalized for each group. It is also capable of parallelizing work in jobs if a SLURM cluster is available.
Here are the command-line arguments to use it in a stand-alone script:
usage: python -m ngs_toolkit.recipe.merge_signal [-h] [-a ATTRIBUTES] [-q] [-j] [-n] [--nucleosome]
[--overwrite] [-o OUTPUT_DIR]
config_file
Merge signal recipe.
positional arguments:
config_file YAML project configuration file.
optional arguments:
-h, --help show this help message and exit
-a ATTRIBUTES, --attributes ATTRIBUTES
Attributes to merge samples by. By default will use
values in the project config `sample_attributes`.
-q, --pass-qc Whether only samples with a 'pass_qc' value of '1' in
the annotation sheet should be used.
-j, --as-jobs Whether jobs should be created for each sample, or it
should run in serial mode.
-n, --normalize Whether tracks should be normalized to total sequenced
depth.
--nucleosome Whether to produce nucleosome/nucleosome-free signal
files.
--overwrite Whether to overwrite existing files.
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
Directory for output files. Default is 'merged' under
the project roort directory.
deseq2¶
Run differential testing of sample groups using DESeq2
usage: python -m ngs_toolkit.recipe.deseq2 [-h] [--output_prefix OUTPUT_PREFIX] [--formula FORMULA]
[--alpha ALPHA] [-d] [--overwrite]
work_dir
Perform differential expression using DESeq2 by comparing sample groups using
a formula.
positional arguments:
work_dir Working directory. Should contain required files for
DESeq2.
optional arguments:
-h, --help show this help message and exit
--output_prefix OUTPUT_PREFIX
Prefix for output files.
--formula FORMULA R-style formula for differential expression. Default =
'~ sample_group'.
--alpha ALPHA Significance level to call differential expression.
All results will be output anyway.
-d, --dry-run Don't actually do anything.
--overwrite Don't overwrite any existing directory or file.
enrichr¶
Get enrichment of gene sets using the Enrichr API
usage: python -m ngs_toolkit.recipe.enrichr [-h] [-a MAX_ATTEMPTS] [--no-overwrite]
input_file output_file
A helper script to run enrichment analysis using the Enrichr API on a gene
set.
positional arguments:
input_file Input file with gene names.
output_file Output CSV file with results.
optional arguments:
-h, --help show this help message and exit
-a MAX_ATTEMPTS, --max-attempts MAX_ATTEMPTS
Maximum attempts to retry the API before giving up.
--no-overwrite Whether results should not be overwritten if existing.
lola¶
Get enrichment of region sets in public region databases using LOLA
usage: python -m ngs_toolkit.recipe.lola [-h] [--no-overwrite] [-c CPUS]
bed_file universe_file output_folder genome
A helper script to run Location Overlap Analysis (LOLA) of a single region set
in various sets of region-based annotations.
positional arguments:
bed_file BED file with query set regions.
universe_file BED file with universe where the query set came from.
output_folder Output directory for produced files.
genome Genome assembly of the region set.
optional arguments:
-h, --help show this help message and exit
--no-overwrite Don't overwrite existing output files.
-c CPUS, --cpus CPUS Number of CPUS/threads to use for analysis.