Recipes

ngs_toolkit provides scripts, called recipes, that perform routine tasks on NGS data.

Recipes are distributed with ngs_toolkit and can be browsed in the GitHub repository.

To make it convenient to run the scripts on data from a project, recipes can also be run with the command projectmanager recipe <recipe_name> <project_config.yaml>.
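
As a minimal sketch, the command above could be assembled and launched from Python as shown here. The helper name build_recipe_command and the configuration path are hypothetical, and the example assumes the recipe is referred to by its short name (ngs_analysis); only the projectmanager recipe <recipe_name> <project_config.yaml> form comes from the text above.

```python
import subprocess


def build_recipe_command(recipe_name, project_config):
    """Return the argument list for `projectmanager recipe` (hypothetical helper)."""
    return ["projectmanager", "recipe", recipe_name, project_config]


# Hypothetical recipe choice and configuration path, for illustration only.
cmd = build_recipe_command("ngs_analysis", "metadata/project_config.yaml")
print(" ".join(cmd))

# Uncomment to actually run the recipe (requires ngs_toolkit to be installed):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if the configuration path contains spaces.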

ngs_toolkit.recipes.ngs_analysis

Perform full end-to-end analysis of ATAC-seq, ChIP-seq or RNA-seq data.

Produces quantification matrices, normalizes them, and performs unsupervised and supervised analysis as well as enrichment analysis of differential features, all accompanied by visualizations.

Supervised analysis is only performed if the PEP configuration file contains a comparison table field.

In addition, this recipe uses the following variables from the project configuration file: project_name, sample_attributes and group_attributes.

usage: python -m ngs_toolkit.recipes.ngs_analysis [-h] [-n NAME]
                                                  [-o RESULTS_DIR]
                                                  [-t {ATAC-seq,RNA-seq,ChIP-seq}]
                                                  [-q] [-a ALPHA]
                                                  [-f ABS_FOLD_CHANGE]
                                                  config_file

Positional Arguments

config_file

YAML project configuration file.

Named Arguments

-n, --analysis-name

Name of analysis. Will be the prefix of output_files. By default it will be the name of the Project given in the YAML configuration.

-o, --results-output

Directory for analysis output files. Default is ‘results’ under the project root directory.

Default: “results”

-t, --data-type

Possible choices: ATAC-seq, RNA-seq, ChIP-seq

Data type to restrict analysis to. Default is to run separate analysis for each data type.

-q, --pass-qc

Whether only samples with a ‘pass_qc’ value of ‘1’ in the annotation sheet should be used.

Default: False

-a, --alpha

Alpha value of confidence for supervised analysis.

Default: 0.05

-f, --fold-change

Absolute log2 fold change value for supervised analysis.

Default: 0
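
To make the roles of --alpha and --fold-change concrete, here is a minimal sketch of how two such thresholds are commonly combined. It is not the recipe's implementation: the function name, the column names (padj, log2FoldChange), the example values and the use of strict inequalities are all assumptions made for illustration.

```python
def is_differential(padj, log2_fold_change, alpha=0.05, abs_fold_change=0.0):
    """Assumed rule: significant if padj < alpha and |log2FC| > threshold."""
    return padj < alpha and abs(log2_fold_change) > abs_fold_change


# Hypothetical results for three features.
features = [
    {"name": "peak_1", "padj": 0.001, "log2FoldChange": 2.5},
    {"name": "peak_2", "padj": 0.20, "log2FoldChange": 3.0},
    {"name": "peak_3", "padj": 0.01, "log2FoldChange": -0.4},
]


def select(features, alpha=0.05, abs_fold_change=0.0):
    """Return the names of features passing both thresholds."""
    return [
        f["name"]
        for f in features
        if is_differential(f["padj"], f["log2FoldChange"], alpha, abs_fold_change)
    ]


default_hits = select(features)                      # defaults: -a 0.05 -f 0
strict_hits = select(features, abs_fold_change=1.0)  # as if passing -f 1
print(default_hits, strict_hits)
```

Under these assumptions, raising the fold-change threshold drops features that are significant but change only slightly (peak_3 here).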

ngs_toolkit.recipes.call_peaks

Call peaks for ChIP-seq samples given a comparison table mapping foreground-background relationships between samples.

usage: python -m ngs_toolkit.recipes.call_peaks [-h] [-c COMPARISON_TABLE]
                                                [-t] [-qc] [-j]
                                                [-o RESULTS_DIR]
                                                config_file

Positional Arguments

config_file

YAML project configuration file.

Named Arguments

-c, --comparison-table

Comparison table to use for peak calling. If not provided, a file named comparison_table.csv in the same directory as the given YAML project configuration file will be used.

-t, --only-toggle

Whether only comparisons with ‘toggle’ value of ‘1’ or ‘True’ should be performed.

Default: False

-qc, --pass-qc

Whether only samples with a ‘pass_qc’ attribute should be included. Default is False.

Default: False

-j, --as-jobs

Whether jobs should be created for each sample, or it should run in serial mode.

Default: False

-o, --results-output

Directory for analysis output files. Default is ‘results’ under the project root directory.

Default: “results”
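
The effect of --only-toggle can be pictured with a small sketch that keeps only the rows of a comparison table whose ‘toggle’ value is ‘1’ or ‘True’. The table contents and every column name other than toggle are hypothetical, and this is only an illustration of the filtering idea, not the recipe's own code.

```python
import csv
import io

# Hypothetical comparison table; only the 'toggle' column is named in the text.
TABLE = """comparison_name,sample_name,comparison_side,toggle
KO_vs_WT,chip_ko,1,1
KO_vs_WT,input_ko,0,1
Drug_vs_Ctrl,chip_drug,1,0
Drug_vs_Ctrl,input_drug,0,0
"""


def toggled_rows(handle):
    """Keep rows whose 'toggle' field is '1' or 'True'."""
    return [row for row in csv.DictReader(handle) if row["toggle"] in ("1", "True")]


rows = toggled_rows(io.StringIO(TABLE))
comparisons = sorted({row["comparison_name"] for row in rows})
print(comparisons)
```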

ngs_toolkit.recipes.coverage

A helper script to calculate the read coverage of a BAM file in regions from a BED file. Ensures the same order and number of lines as input BED file.

usage: python -m ngs_toolkit.recipes.coverage [-h] [--no-overwrite]
                                              bed_file bam_file output_bed

Positional Arguments

bed_file

Input BED file with regions to quantify.

bam_file

Input BAM file with reads.

output_bed

Output BED file with counts for each region.

Named Arguments

--no-overwrite

Whether results should not be overwritten if existing.

Default: True
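
The guarantee about line order and number can be checked after the fact with a sketch like the following. It assumes the output BED keeps chromosome, start and end in its first three tab-separated columns, followed by the count; that layout and the example regions are assumptions, not something stated above.

```python
def read_regions(lines):
    """Parse (chrom, start, end) from BED lines, skipping blanks and comments."""
    regions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        chrom, start, end = line.split("\t")[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def same_regions(input_lines, output_lines):
    """True if both files list the same regions in the same order."""
    return read_regions(input_lines) == read_regions(output_lines)


# Hypothetical input regions and an output with one count column appended.
input_bed = ["chr1\t100\t200\n", "chr1\t500\t800\n", "chr2\t50\t150\n"]
output_bed = ["chr1\t100\t200\t12\n", "chr1\t500\t800\t0\n", "chr2\t50\t150\t7\n"]

print(same_regions(input_bed, output_bed))
```

With real files, the same functions can be given open file handles instead of lists.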

ngs_toolkit.recipes.deseq2

Perform differential expression using DESeq2 by comparing sample groups using a formula.

usage: python -m ngs_toolkit.recipes.deseq2 [-h]
                                            [--output_prefix OUTPUT_PREFIX]
                                            [--formula FORMULA]
                                            [--alpha ALPHA] [-d] [--overwrite]
                                            [--no-save-inputs]
                                            work_dir

Positional Arguments

work_dir

Working directory. Should contain required files for DESeq2.

Named Arguments

--output_prefix

Prefix for output files.

Default: “differential_analysis”

--formula

R-style formula for differential expression. Default = ‘~ sample_group’.

Default: “~ sample_group”

--alpha

Significance level to call differential expression. All results will be output anyway.

Default: 0.05

-d, --dry-run

Don’t actually do anything.

Default: False

--overwrite

Whether to overwrite any existing directory or file.

Default: False

--no-save-inputs

Don’t write inputs to disk.

Default: True
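
Because the formula contains spaces and a tilde, it is convenient to build the invocation as an argument list rather than a shell string. The sketch below does this using only the flags listed in the usage above; the helper name, the working directory and the example formula with a batch term are hypothetical.

```python
import shlex
import sys


def build_deseq2_command(work_dir, formula="~ sample_group", alpha=0.05,
                         output_prefix="differential_analysis", dry_run=False):
    """Assemble the argument list for the deseq2 recipe (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "ngs_toolkit.recipes.deseq2",
        "--output_prefix", output_prefix,
        "--formula", formula,  # kept as a single argument, spaces included
        "--alpha", str(alpha),
    ]
    if dry_run:
        cmd.append("--dry-run")
    cmd.append(work_dir)
    return cmd


# Hypothetical working directory and formula, for illustration only.
cmd = build_deseq2_command(
    "results/deseq2_job", formula="~ batch + sample_group", alpha=0.01, dry_run=True
)
print(shlex.join(cmd))

# To run it: import subprocess; subprocess.run(cmd, check=True)
```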

ngs_toolkit.recipes.enrichr

A helper script to run enrichment analysis using the Enrichr API on a gene set.

usage: python -m ngs_toolkit.recipes.enrichr [-h] [-a MAX_ATTEMPTS]
                                             [--no-overwrite]
                                             input_file output_file

Positional Arguments

input_file

Input file with gene names.

output_file

Output CSV file with results.

Named Arguments

-a, --max-attempts

Maximum attempts to retry the API before giving up.

Default: 5

--no-overwrite

Whether results should not be overwritten if existing.

Default: True
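
The idea behind --max-attempts is a bounded retry loop. The following is a generic sketch of that pattern, not the recipe's actual code: the function names, the delay parameter and the simulated failing request are all made up for illustration, and no real API is contacted.

```python
import time


def call_with_retries(func, max_attempts=5, delay_seconds=0.0):
    """Call func until it succeeds or max_attempts calls have failed."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


# Simulated request that fails twice before succeeding.
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = call_with_retries(flaky_request, max_attempts=5)
print(result, calls["count"])
```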

ngs_toolkit.recipes.generate_project

A helper script to generate synthetic data for a project in PEP format.

usage: python -m ngs_toolkit.recipes.generate_project [-h]
                                                      [--output-dir OUTPUT_DIR]
                                                      [--project-name PROJECT_NAME]
                                                      [--organism ORGANISM]
                                                      [--genome-assembly GENOME_ASSEMBLY]
                                                      [--data-type DATA_TYPE]
                                                      [--n-factors N_FACTORS]
                                                      [--only-metadata ONLY_METADATA]
                                                      [--sample-input-files SAMPLE_INPUT_FILES]
                                                      [--debug]

Named Arguments

--output-dir

--project-name

Default: “test_project”

--organism

Default: “human”

--genome-assembly

Default: “hg38”

--data-type

Default: “ATAC-seq”

--n-factors

Default: 1

--only-metadata

Default: False

--sample-input-files

Default: False

--debug

Default: False

ngs_toolkit.recipes.lola

A helper script to run Location Overlap Analysis (LOLA) of a single region set in various sets of region-based annotations.

usage: python -m ngs_toolkit.recipes.lola [-h] [--no-overwrite] [-c CPUS]
                                          bed_file universe_file output_folder
                                          genome

Positional Arguments

bed_file

BED file with query set regions.

universe_file

BED file with universe where the query set came from.

output_folder

Output directory for produced files.

genome

Genome assembly of the region set.

Named Arguments

--no-overwrite

Don’t overwrite existing output files.

Default: True

-c, --cpus

Number of CPUs/threads to use for analysis.

ngs_toolkit.recipes.merge_signal

Merge signal from various ATAC-seq or ChIP-seq samples given a set of attributes to group samples by.

It produces merged BAM and bigWig files for all signal in the samples, and can also produce them separately for nucleosomal and nucleosome-free signal, based on the fragment length distribution, if the data is paired-end sequenced. This signal may optionally be normalized for each group. The recipe can also parallelize work as jobs.

usage: python -m ngs_toolkit.recipes.merge_signal [-h] [-a ATTRIBUTES] [-q]
                                                  [-j] [--cpus CPUS]
                                                  [--normalize] [--nucleosome]
                                                  [--overwrite]
                                                  [-o OUTPUT_DIR] [-d]
                                                  config_file

Positional Arguments

config_file

YAML project configuration file.

Named Arguments

-a, --attributes

Attributes to merge samples by. A comma-delimited string with no spaces. By default will use values in the project config group_attributes.

-q, --pass-qc

Whether only samples with a ‘pass_qc’ value of ‘1’ in the annotation sheet should be used.

Default: False

-j, --as-jobs

Whether jobs should be created for each sample, or it should run in serial mode.

Default: False

--cpus

CPUs/threads to use per job if --as-jobs is on.

Default: 8

--normalize

Whether tracks should be normalized to total sequenced depth.

Default: False

--nucleosome

Whether to produce nucleosome/nucleosome-free signal files.

Default: False

--overwrite

Whether to overwrite existing files.

Default: False

-o, --output-dir

Directory for output files. Default is ‘merged’ under the project root directory.

Default: “merged”

-d, --dry-run

Whether to do everything except running commands.

Default: False
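
Grouping by a comma-delimited --attributes value can be sketched as follows: samples sharing the same combination of attribute values end up in one group, and each group would correspond to one merged output. The sample records and attribute names are hypothetical, and this only illustrates the grouping idea, not how the recipe is implemented.

```python
from collections import defaultdict


def group_samples(samples, attributes):
    """Group sample names by the values of a comma-delimited attribute string."""
    attribute_names = attributes.split(",")
    groups = defaultdict(list)
    for sample in samples:
        key = tuple(sample[name] for name in attribute_names)
        groups[key].append(sample["sample_name"])
    return dict(groups)


# Hypothetical sample annotation.
samples = [
    {"sample_name": "s1", "cell_line": "K562", "treatment": "DMSO"},
    {"sample_name": "s2", "cell_line": "K562", "treatment": "DMSO"},
    {"sample_name": "s3", "cell_line": "K562", "treatment": "drug"},
    {"sample_name": "s4", "cell_line": "HAP1", "treatment": "drug"},
]

groups = group_samples(samples, "cell_line,treatment")
for key, names in groups.items():
    print("_".join(key), names)
```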

ngs_toolkit.recipes.region_enrichment

A helper script to run enrichment analysis of a single region set against a set of region-based annotations.

usage: python -m ngs_toolkit.recipes.region_enrichment [-h]
                                                       [--output-file OUTPUT_FILE]
                                                       [--overwrite]
                                                       bed_file pep

Positional Arguments

bed_file

BED file with regions.

pep

The analysis’ PEP config file.

Named Arguments

--output-file

Output file.

Default: “region_type_enrichment.csv”

--overwrite

Whether to overwrite any existing directory or file.

Default: False

ngs_toolkit.recipes.region_set_frip

Compute fraction of reads in peaks (FRiP) based on a consensus set of regions derived from several samples.

usage: python -m ngs_toolkit.recipes.region_set_frip [-h] [-d DATA_TYPE]
                                                     [-n NAME] [-r REGION_SET]
                                                     [-q] [-j] [-o OUTPUT_DIR]
                                                     [-s]
                                                     config_file

Positional Arguments

config_file

YAML project configuration file.

Named Arguments

-d, --data-type

Data types to perform analysis on. Will be done separately for each.

-n, --analysis-name

Name of analysis. Will be the prefix of output_files. By default it will be the name of the Project given in the YAML configuration.

-r, --region-set

BED file with a region set derived from several samples, or an Oracle region set. If unset, the recipe will try to use the sites attribute of an existing analysis object; if there is none, it will create a region set from the peaks of all samples.

-q, --pass-qc

Whether only samples with a ‘pass_qc’ value of ‘1’ in the annotation sheet should be used.

Default: False

-j, --as-jobs

Whether jobs should be created for each sample, or it should run in serial mode.

Default: False

-o, --results-output

Directory for analysis output files. Default is ‘results’ under the project root directory.

Default: “results”

-s, --strict

Whether to raise an error if files cannot be created.

Default: False
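
For reference, FRiP itself is a simple ratio: the number of reads overlapping the region set divided by the total number of reads. The sketch below shows only that arithmetic, with made-up counts; how the recipe counts reads is not described here and is not reproduced.

```python
def frip(reads_in_regions, total_reads):
    """Fraction of reads in peaks: reads overlapping regions / all reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= reads_in_regions <= total_reads:
        raise ValueError("reads_in_regions must be between 0 and total_reads")
    return reads_in_regions / total_reads


# Hypothetical counts for one sample.
value = frip(reads_in_regions=2_500_000, total_reads=10_000_000)
print(f"FRiP = {value:.2f}")
```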