Recipes

ngs_toolkit provides scripts, called recipes, that perform routine tasks on NGS data.

Recipes are distributed with ngs_toolkit and can be seen in the GitHub repository.

To make it convenient to run the scripts on data from a project, recipes can be run with the command projectmanager recipe <recipe_name> <project_config.yaml>.
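
The command above can also be assembled programmatically, for instance to run several recipes from a driver script. The following is a minimal sketch: the helper name `build_recipe_command` and the configuration path are hypothetical, and only the `projectmanager recipe <recipe_name> <project_config.yaml>` form documented above is assumed.

```python
import shlex


def build_recipe_command(recipe_name, project_config):
    """Return the documented `projectmanager recipe` call as an argv list."""
    return ["projectmanager", "recipe", recipe_name, project_config]


# "metadata/project_config.yaml" is a placeholder path, not a required location.
command = build_recipe_command("ngs_analysis", "metadata/project_config.yaml")

# Show the shell-quoted command line without executing anything.
print(shlex.join(command))

# To actually run the recipe, one could pass the list to subprocess:
#     subprocess.run(command, check=True)
```

Keeping the command as a list rather than a single string avoids quoting problems if the configuration path contains spaces.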

ngs_analysis

This recipe will perform general NGS analysis on three data types: ATAC-seq, ChIP-seq and RNA-seq. For ATAC-seq and ChIP-seq, quantification and annotation of genomic regions will be performed. The standard analysis for each data type then proceeds with cross-sample normalization and unsupervised analysis, followed by supervised analysis if a comparison_table is provided.

This recipe uses the following variables provided in the project configuration file: project_name, sample_attributes and group_attributes.

Here are the command-line arguments to use it in a stand-alone script:

usage: python ngs_analysis_recipe [-h] [-n NAME] [-o RESULTS_DIR]
                           [-t {ATAC-seq,RNA-seq,ChIP-seq}] [-q] [-a ALPHA]
                           [-f ABS_FOLD_CHANGE]
                           config_file

positional arguments:
  config_file           YAML project configuration file.

optional arguments:
  -h, --help            show this help message and exit
  -n NAME, --analysis-name NAME
                        Name of analysis. Will be the prefix of output_files.
                        By default it will be the name of the Project given in
                        the YAML configuration.
  -o RESULTS_DIR, --results-output RESULTS_DIR
                        Directory for analysis output files. Default is
                        'results' under the project root directory.
  -t {ATAC-seq,RNA-seq,ChIP-seq}, --data-type {ATAC-seq,RNA-seq,ChIP-seq}
                        Data type to restrict analysis to. Default is to run
                        separate analysis for each data type.
  -q, --pass-qc         Whether only samples with a 'pass_qc' value of '1' in
                        the annotation sheet should be used.
  -a ALPHA, --alpha ALPHA
                        Alpha value of confidence for supervised analysis.
  -f ABS_FOLD_CHANGE, --fold-change ABS_FOLD_CHANGE
                        Absolute log2 fold change value for supervised
                        analysis.
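
To illustrate what the `--alpha` and `--fold-change` options control, here is a minimal sketch of how such thresholds are commonly applied to a table of differential results. It is not the recipe's own code: the field names (`padj`, `log2_fold_change`), the use of adjusted p-values, the strict inequalities and the example cutoffs 0.05 and 1.0 are all assumptions made for illustration.

```python
def is_differential(row, alpha=0.05, abs_fold_change=1.0):
    """True when a feature passes both the significance and effect-size cutoffs.

    The defaults here are illustrative values, not the recipe's defaults.
    """
    significant = row["padj"] < alpha
    large_effect = abs(row["log2_fold_change"]) > abs_fold_change
    return significant and large_effect


# Made-up results for four features.
results = [
    {"feature": "region_1", "log2_fold_change": 2.3, "padj": 0.001},
    {"feature": "region_2", "log2_fold_change": -0.4, "padj": 0.0001},
    {"feature": "region_3", "log2_fold_change": -1.8, "padj": 0.2},
    {"feature": "region_4", "log2_fold_change": -1.5, "padj": 0.01},
]

selected = [row["feature"] for row in results if is_differential(row)]
print(selected)  # ['region_1', 'region_4']
```

Because the fold change is compared in absolute value, both up-regulated and down-regulated features can pass the filter.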

call_peaks

This recipe will call peaks for samples as described in a comparison table.

It is capable of parallelizing work in jobs if a SLURM cluster is available.

Here are the command-line arguments to use it in a stand-alone script:

usage: call_peaks [-h] [-c COMPARISON_TABLE] [-t] [-j]
                           [-o RESULTS_DIR]
                           config_file

Call peaks recipe.

positional arguments:
  config_file           YAML project configuration file.

optional arguments:
  -h, --help            show this help message and exit
  -c COMPARISON_TABLE, --comparison-table COMPARISON_TABLE
                        Comparison table to use for peak calling. If not
                        provided will use a file named `comparison_table.csv`
                        in the same directory of the given YAML Project
                        configuration file.
  -t, --only-toggle     Whether only comparisons with 'toggle' value of '1' in
                        the comparison table should be performed.
  -j, --as-jobs         Whether jobs should be created for each sample, or it
                        should run in serial mode.
  -o RESULTS_DIR, --results-output RESULTS_DIR
                        Directory for analysis output files. Default is
                        'results' under the project root directory.
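
The effect of `--only-toggle` can be pictured as a row filter on the comparison table. The sketch below uses only the standard library and a made-up table: apart from `toggle`, the column names (`comparison_name`, `sample_name`, `comparison_side`) and the helper `select_comparisons` are assumptions for illustration, not a statement of the recipe's internals.

```python
import csv
import io

# Made-up comparison table; column names other than `toggle` are assumptions.
TABLE = """comparison_name,sample_name,comparison_side,toggle
KO_vs_WT,KO_rep1,1,1
KO_vs_WT,WT_rep1,0,1
drug_vs_ctrl,drug_rep1,1,0
drug_vs_ctrl,ctrl_rep1,0,0
"""


def select_comparisons(handle, only_toggle=False):
    """Return comparison names, optionally keeping only rows with toggle == '1'."""
    rows = list(csv.DictReader(handle))
    if only_toggle:
        rows = [row for row in rows if row["toggle"] == "1"]
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(row["comparison_name"] for row in rows))


print(select_comparisons(io.StringIO(TABLE)))
# ['KO_vs_WT', 'drug_vs_ctrl']
print(select_comparisons(io.StringIO(TABLE), only_toggle=True))
# ['KO_vs_WT']
```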

region_set_frip

This recipe will compute the fraction of reads in peaks (FRiP) for ATAC-seq or ChIP-seq samples, based either on a set of regions discovered across all samples in a given project or on an external gold-standard region set.

If an external region set is not given and a region set derived from all samples already exists (e.g. from running the ngs_analysis recipe), that region set will be used; otherwise one will be produced.

Here are the command-line arguments to use it in a stand-alone script:

usage: region_set_frip [-h] [-n NAME] [-r REGION_SET] [-q] [-j]
                           [-o RESULTS_DIR]
                           config_file

Region set FRiP recipe.

positional arguments:
  config_file           YAML project configuration file.

optional arguments:
  -h, --help            show this help message and exit
  -n NAME, --analysis-name NAME
                        Name of analysis. Will be the prefix of output_files.
                        By default it will be the name of the Project given in
                        the YAML configuration.
  -r REGION_SET, --region-set REGION_SET
                        BED file with region set derived from several samples
                        or Oracle region set. If unset, will try to get the
                        `sites` attribute of an existing analysis object,
                        otherwise will create a region set from the
                        peaks of all samples.
  -q, --pass-qc         Whether only samples with a 'pass_qc' value of '1' in
                        the annotation sheet should be used.
  -j, --as-jobs         Whether jobs should be created for each sample, or it
                        should run in serial mode.
  -o RESULTS_DIR, --results-output RESULTS_DIR
                        Directory for analysis output files. Default is
                        'results' under the project root directory.
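
As a reminder of what the FRiP metric measures, the sketch below computes it on toy data: the number of reads overlapping at least one region, divided by the total number of reads. Reads and regions are represented as hypothetical `(chromosome, start, end)` tuples with half-open coordinates; a real run would count reads from alignment files, and how the recipe does that is not described here.

```python
def overlaps(read, region):
    """True if two half-open intervals on the same chromosome share a base."""
    return read[0] == region[0] and read[1] < region[2] and region[1] < read[2]


def frip(reads, regions):
    """Fraction of reads overlapping at least one region (0.0 if no reads)."""
    if not reads:
        return 0.0
    in_peaks = sum(
        any(overlaps(read, region) for region in regions) for read in reads
    )
    return in_peaks / len(reads)


# Toy region set and reads, invented for this example.
regions = [("chr1", 100, 200), ("chr2", 500, 800)]
reads = [
    ("chr1", 90, 140),   # overlaps the chr1 region
    ("chr1", 250, 300),  # outside any region
    ("chr2", 790, 840),  # overlaps the chr2 region
    ("chr2", 100, 150),  # outside any region
]

print(frip(reads, regions))  # 0.5
```

The nested loop is quadratic and only suitable for illustration; dedicated interval tools are the usual choice for real data.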

merge_signal

This recipe will merge signal from various ATAC-seq or ChIP-seq samples given a set of attributes to group samples by.

It produces merged BAM and bigWig files for all signal in the samples and, if the data is paired-end sequenced, can also produce them separately for nucleosomal and nucleosome-free signal based on the fragment length distribution. The signal may optionally be normalized for each group. The recipe can also parallelize work as jobs if a SLURM cluster is available.

Here are the command-line arguments to use it in a stand-alone script:

usage: merge_signal [-h] [-a ATTRIBUTES] [-q] [-j] [-n] [--nucleosome]
                    [--overwrite] [-o OUTPUT_DIR]
                    config_file

Merge signal recipe.

positional arguments:
  config_file           YAML project configuration file.

optional arguments:
  -h, --help            show this help message and exit
  -a ATTRIBUTES, --attributes ATTRIBUTES
                        Attributes to merge samples by. By default will use
                        values in the project config `sample_attributes`.
  -q, --pass-qc         Whether only samples with a 'pass_qc' value of '1' in
                        the annotation sheet should be used.
  -j, --as-jobs         Whether jobs should be created for each sample, or it
                        should run in serial mode.
  -n, --normalize       Whether tracks should be normalized to total sequenced
                        depth.
  --nucleosome          Whether to produce nucleosome/nucleosome-free signal
                        files.
  --overwrite           Whether to overwrite existing files.
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Directory for output files. Default is 'merged' under
                        the project root directory.
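
The grouping step behind `--attributes` and `--pass-qc` can be sketched as follows: samples sharing the same values for the chosen attributes end up in the same group, and each group would then be merged into one set of files. The sample records, the attribute names `cell_type` and `treatment`, and the helper `group_samples` are hypothetical; this is an illustration of the idea, not the recipe's implementation.

```python
from collections import defaultdict


def group_samples(samples, attributes, pass_qc=False):
    """Group sample names by the tuple of their values for `attributes`."""
    groups = defaultdict(list)
    for sample in samples:
        # Optionally skip samples not marked with pass_qc == '1'.
        if pass_qc and str(sample.get("pass_qc")) != "1":
            continue
        key = tuple(sample[attribute] for attribute in attributes)
        groups[key].append(sample["sample_name"])
    return dict(groups)


# Made-up sample annotation.
samples = [
    {"sample_name": "S1", "cell_type": "T", "treatment": "drug", "pass_qc": "1"},
    {"sample_name": "S2", "cell_type": "T", "treatment": "drug", "pass_qc": "0"},
    {"sample_name": "S3", "cell_type": "T", "treatment": "ctrl", "pass_qc": "1"},
    {"sample_name": "S4", "cell_type": "B", "treatment": "ctrl", "pass_qc": "1"},
]

merged_groups = group_samples(samples, ["cell_type", "treatment"], pass_qc=True)
print(merged_groups)
# {('T', 'drug'): ['S1'], ('T', 'ctrl'): ['S3'], ('B', 'ctrl'): ['S4']}
```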