Concepts¶
A few notes on the way some of the library and its objects were designed to be used.
Analysis objects¶
The Analysis
object and its data-type specific dependents are central to the usage of ngs-toolkit
. These objects hold attributes and functions relevant to the analysis, such as Sample
objects (and their attributes), Dataframes with numerical values, and others.
Leveraging on the PEP format¶
One easy and recommended way to instantiate Analysis
objects is with a PEP Project
file.
This has several advantages:
Usage of the language-agnostic PEP format to store a project description and interoperability with other tools (see https://github.com/pepkit for other tools);
Initialization of project-specific variables into the
Analysis
object that are derived from the PEP. Examples: analysis samples, genome(s), sample and sample group attributes, sample comparison table.
The example below shows how this works:
>>> from ngs_toolkit import Analysis
>>> an = Analysis(from_pep="my_project/metadata/project_config.yaml")
[INFO] > Setting project's 'sample_attributes' as the analysis 'sample_attributes'.
[INFO] > Setting project's 'group_attributes' as the analysis 'group_attributes'.
[INFO] > Setting project's 'comparison_table' as the analysis 'comparison_table'.
[INFO] > Setting analysis organism as 'mouse'.
[INFO] > Setting analysis genome as 'mm10'.
>>> print(an)
Analysis 'my_project' with 12 samples of organism 'mouse' (mm10).
Note
The verbosity of ngs-toolkit
can be controlled
See the section on logging to control the verbosity of ngs-toolkit
.
Reasonable defaults with full customization¶
Functions in the Analysis
object are aware of these attributes and will
use them by default, making calling the functions very simple (other overiding
arguments can be passed though).
In the example below, we will generate a consensus peak set for ATAC-seq
analyss using the get_consensus_sites
function. This will demonstrate
several things that “come for free”:
>>> from ngs_toolkit import ATACSeqAnalysis
>>> an = ATACSeqAnalysis(from_pep="my_project/metadata/project_config.yaml")
[INFO] > Setting project's 'sample_attributes' as the analysis 'sample_attributes'.
[INFO] > Setting project's 'group_attributes' as the analysis 'group_attributes'.
[INFO] > Setting project's 'comparison_table' as the analysis 'comparison_table'.
[INFO] > Subsetting samples for samples of type 'ATAC-seq'.
[INFO] > Subsetting comparison_table for comparisons of type 'ATAC-seq'.
[INFO] > Setting analysis organism as 'mouse'.
[INFO] > Setting analysis genome as 'mm10'.
>>> an.get_consensus_sites()
even though the PEP project includes samples from several data types (ATAC-, ChIP- and RNA-seq), the current analysis will only consider ATAC-seq samples.
the necessary files with peak calls for each sample are not specified -
ngs-toolkit
knows where to find them;a BED file with ENCODE blacklisted regions will not be given, but these regions will be filtered out -
ngs-toolkit
will download this and use it. No static files are distributed with the package.related to the above, the correct blacklist file is downloaded because the genome assembly for the project is infered from the samples - even though it is not directly specified.
Workflow¶
Most functions of the Analysis
object will take some input (usually a
dataframe), apply some transformation and assign the result to a
variable of the same Analysis object.
To see what variable has been assigned within a given function check the relevant function in the API, specifically the Variables value. Some functions will assign attributes that are used almost ubiquitily. See the common attributes section for some examples.
High-level functions will also often assign their outputs to the object
itself. To see which attribute holds it, note the Attributes
section of
the respective function documentation.
Assignment allows the exchange of information between analysis steps without
the user always providing all required inputs, which would make using such a
toolkit quite verbose.
The example below illustrates this:
>>> from ngs_toolkit import ATACSeqAnalysis
>>> an = ATACSeqAnalysis(from_pep="my_project/metadata/project_config.yaml")
>>> print(an)
'ATAC-seq' analysis 'test-project_ATAC-seq_mm10_1_100_1' with 2 samples of organism 'mouse' (mm10).
>>> an.get_consensus_sites()
>>> an.measure_coverage()
>>> print(an.matrix_raw.head())
S1_a1 S2_a2
region
chr1:42447241-42447627 955 2211
chr1:44445678-44446750 1939 2122
chr1:44743959-44744926 1264 1443
chr1:90513210-90513978 1262 1354
chr1:93565764-93567191 911 892
>>> an.normalize()
>>> print(an.matrix_norm.head())
region S1_a1 S2_a2
chr1:42447241-42447627 12.681954 13.822151
chr1:44445678-44446750 13.703582 13.762881
chr1:44743959-44744926 13.086324 13.206576
chr1:90513210-90513978 13.084040 13.114743
chr1:93565764-93567191 12.613915 12.512715
All three get_consensus_sites
, measure_coverage
and normalize
build
on the output of each other, but the user doesn’t have to specify the input to
any. Changing either the name of the attribute that stores either output or the
location of files outputed is nonetheless easy.
Many functions also have a save
argument which will save the result as a
CSV
file.
Common attributes¶
To allow a uniform usage across different data types and analysis types,
a few but important attributes of the Analysis
object and its derivatives
have naming conventions:
data_type
: The type of data of the analysis. Matches the object type.matrix_raw
: A dataframe of raw, unnormalized values of shape (features, samples)matrix_norm
: A dataframe of normalized values of shape (features, samples)quantity
: The name of the units of the values measured. E.g. “expression” for RNA-seq or “accessibility” for ATAC-seqvar_unit_name
: The name of the variables measured. E.g. “gene” for RNA-seq or “region” for ATAC-seq or ChIP-seqnorm_method
: The method used to normalize thematrix_norm
dataframethresholds
: A dictionary with keys “log_fold_change” and “p_value” storing thresholds used in the analysis
Comparison table¶
ngs-toolkit
has functions to perform supervised differntial comparisons
between groups of samples. The sample groupings are specified in a CSV file called comparison_table
.
An example of a typical “case vs control” comparison table is given below:
comparison_name |
comparison_side |
sample_name |
sample_group |
---|---|---|---|
KOA_vs_WT |
1 |
ATAC-seq_KOA_r1 |
KO_A |
KOA_vs_WT |
1 |
ATAC-seq_KOA_r2 |
KO_A |
KOA_vs_WT |
0 |
ATAC-seq_WT_r1 |
WT |
KOA_vs_WT |
0 |
ATAC-seq_WT_r2 |
WT |
KOB_vs_WT |
1 |
ATAC-seq_KOB_r1 |
KO_B |
KOB_vs_WT |
1 |
ATAC-seq_KOB_r2 |
KO_B |
KOB_vs_WT |
0 |
ATAC-seq_WT_r1 |
WT |
KOB_vs_WT |
0 |
ATAC-seq_WT_r2 |
WT |
Each row is reserved for a given sample. Samples of the same group (typically replicates) should have the same value of “sample_group” and same “comparison_side”. The group of interest (comparison foreground) should have a value of 1 as “comparison_side” and the background a value of 0. Finally, the comparison will be labeled with the value of “comparison_name”, which should be constant for all samples in both foreground and background groups.
For an all-vs-all group comparison, I recommend labeling all background sample groups as a new group in the following manner:
comparison_name |
comparison_side |
sample_name |
sample_group |
---|---|---|---|
celltypeA |
1 |
ATAC-seq_celltypeA_r1 |
ct_A |
celltypeA |
1 |
ATAC-seq_celltypeA_r2 |
ct_A |
celltypeA |
0 |
ATAC-seq_celltypeB_r1 |
ct_A_background |
celltypeA |
0 |
ATAC-seq_celltypeB_r2 |
ct_A_background |
celltypeA |
0 |
ATAC-seq_celltypeC_r1 |
ct_A_background |
celltypeA |
0 |
ATAC-seq_celltypeC_r2 |
ct_A_background |
celltypeB |
1 |
ATAC-seq_celltypeB_r1 |
ct_B |
celltypeB |
1 |
ATAC-seq_celltypeB_r2 |
ct_B |
celltypeB |
0 |
ATAC-seq_celltypeA_r1 |
ct_B_background |
celltypeB |
0 |
ATAC-seq_celltypeA_r2 |
ct_B_background |
celltypeB |
0 |
ATAC-seq_celltypeC_r1 |
ct_B_background |
celltypeB |
0 |
ATAC-seq_celltypeC_r2 |
ct_B_background |
Additional useful columns are data_type (to subset comparisons based on type of NGS data), comparison_type to specify the type of comparison to perform (e.g. one of ‘differential’ or ‘peaks’) and toggle for subsetting comparisons to perform.
Note
Hyphens and other symbols in comparison_table
Since differential comparisons are perfomed using DESeq2, R is used (throught the Python-R interface library rpy2). ngs_toolkit will create the required tables by DESeq2 which includes names of samples and comparisons as dataframe columns. Unfortunately due to the way R handles column names, these get changed.
In the future this will be accounted for but for now avoid using hyphens and any other symbols as values for sample names or groups.
Low-level functions - utils
¶
Functions from Analysis objects are generally pretty high level functions, often performing several tasks by calling other more general-purpose functions. However, one of the concepts I really wanted to have is that the user retains as much control as they wish.
They may choose to use the high level functions which generally provide sensible defaults, or retain more control and build their analysis pipeline from the lower level helper functions.
One example: calling ATACSeqAnalysis.normalize()
will by default run 3-4
other functions to return a quantile normalized, GC-corrected, log-transformed
output - a fairly complex normalization procedure but made simple by providing
sensible defaults.
A user may easily change the procedure by choosing one of the ~4 types of normalization using keyword arguments or implement an alternative method which can be plugged in to the next step of the analysis.
In the future the low level functions will be moved to ngs_toolkit.utils and the data type-specific modules will have only classes and functions specific to those data which are usually more high-level.