API¶
The great flexibility of ngs_toolkit
comes from the ability to compose workflows using the API.
It provides a rich but abstract Analysis
object and implements various modules building on it depending on the data type.
In addition, the general
module contains several analysis-independent methods and the utils
module provides low-level functions of general use.
ngs_toolkit.analysis¶
-
class
ngs_toolkit.analysis.
Analysis
(name=None, from_pep=False, from_pickle=False, root_dir=None, data_dir='data', results_dir='results', prj=None, samples=None, subset_to_data_type=True, **kwargs)[source]¶ Generic class holding functions and data from a typical NGS analysis.
Other modules implement classes inheriting from this that in general contain data type-specific functions (e.g.
ATACSeqAnalysis
has aget_consensus_sites()
function to generate a peak consensus map).Objects of this type can be used to store data (e.g. dataframes), variables (e.g. paths to files or configurations) and can easily be filled with existing data using
load_data()
for cross-environment portability, or serialized (saved to a file as a pickle object) for rapid loading in the same environment. See theto_pickle()
,from_pickle()
andupdate()
functions for this.- Parameters
name (
str
, optional) – Name of the analysis.Defaults to “analysis”.
from_pep (
str
, optional) – PEP configuration file to initialize analysis from. The analysis will adopt as much attributes from the PEP as possible but keyword arguments passed at initialization will still have priority.Defaults to
None
(no PEP used).from_pickle (
str
, optional) – Pickle file of an existing serialized analysis object from which the analysis should be loaded.Defaults to
None
(will not load from pickle).root_dir (
str
, optional) – Base directory for the project.Defaults to current directory or to what is specified in PEP if
from_pep
.data_dir (
str
, optional) – Directory containing processed data (e.g. by looper) that will be input to the analysis. This is in principle not required.Defaults to “data”.
results_dir (
str
, optional) – Directory to contain outputs produced by the analysis.Defaults to “results”.
prj (
peppy.Project
, optional) – Apeppy.Project
object that this analysis is tied to.Defaults to
None
.samples (
list
, optional) – List ofpeppy.Sample
objects that this analysis is tied to.Defaults to
None
.subset_to_data_type (
bool
, optional) – Whether to keep only samples that match the data type of the analysis.Defaults to
True
.kwargs (
dict
, optional) – Additional keyword arguments will simply be stored as object attributes.
-
from_pep
(pep_config, **kwargs)[source]¶ Create a peppy.Project from a PEP configuration file and associate is with the analysis.
- Parameters
pep_config (
str
) – PEP configuration file.- Variables
prj (
peppy.Project
) – Project object from given PEP configuration file.
-
update
(pickle_file=None)[source]¶ Update all of the object”s attributes with the attributes from a serialized object (ie object stored in a file) object.
- Parameters
pickle_file (
str
, optional) – Pickle file to load.Defaults to the analysis’
pickle_file
.
-
set_organism_genome
()[source]¶ Attempt to derive the analysis’ organism and genome assembly by inspecting the same attributes of its samples.
-
set_project_attributes
(overwrite=True, subset_to_data_type=True)[source]¶ Set Analysis object attributes
samples
,sample_attributes
andgroup_atrributes
to the values in the associated Project object if existing.- Parameters
- Variables
samples (
list
) – List of peppy.Samples if contained in the PEP configuration.sample_attributes (
list
) – Sample attributes if specified in the PEP configuration.group_attributes (
list
) – Groups attributes if specified in the PEP configuration.comparison_table (
pandas.DataFrame
) – Comparison table if specified in the PEP configuration.
-
set_samples_input_files
(overwrite=True)[source]¶ Add input file values to sample objects dependent on data type. These are specified in the
ngs_toolkit
configuration file undersample_input_files:<data type>:<attribute>
.
-
from_pickle
(pickle_file=None)[source]¶ Load object from pickle file.
- Parameters
pickle_file (
str
, optional) – Pickle file to load.Default is the object’s attribute
pickle_file
.- Returns
The analysis serialized in the pickle file.
- Return type
Analysis
-
get_sample_annotation
(attributes=None, samples=None)[source]¶ Get dataframe annotation of sample attributes.
- Variables
- Returns
Dataframe with requested attributes (columns) for each sample (rows).
- Return type
-
load_data
(output_map=None, only_these_keys=None, prefix='{results_dir}/{name}', permissive=True)[source]¶ Load the output files of the major functions of the Analysis.
- Parameters
output_map (
dict
) – Dictionary with {attribute_name: (file_path, kwargs)} to load the files. The kwargs in the tuple will be passed topandas.read_csv()
.Default is the required to read the keys in
only_these_keys
.only_these_keys (
list
, optional) – Iterable of analysis attributes to load up. Possible attributes:“matrix_raw”
“matrix_norm”
“matrix_features”
“differential_results”
“differential_enrichment”
Default is all of the above.
prefix (
str
, optional) – String prefix of files to load. Variables in curly braces will be formated with attributes of analysis.Default is “{results_dir}/{name}”.
permissive (
bool
, optional) – Whether an error should be ignored if reading a file causes IOError.Default is
True
.
- Variables
<various> (
pandas.DataFrame
) – Dataframes holding the respective data, available as attributes described in theonly_these_keys
parameter.- Raises
IOError – If not permissive and a file is not found.
-
record_output_file
(file_name, name='analysis', dump_yaml=True, output_yaml='{root_dir}/{name}.analysis_record.yaml')[source]¶ Record an analysis output.
Will also write all records to a YAML file and call Analysis.generate_report if specified in general configuration.
- Parameters
file_name (
str
) – Filename of output to report.name (
str
, optional) – Name of the output to report.Defaults to “analysis”.
dump_yaml (
bool
, optional) – Whether to dump records to yaml file.Defaults to
True
.output_yaml (
str
, optional) – YAML file to dump records to. Will be formated with Analysis variables.Defaults to “{root_dir}/{name}.analysis_record.yaml”.
- Variables
output_files (
list
) – Appends a tuple of (name
,file_name
) tooutput_files
.
-
generate_report
(output_html='{root_dir}/{name}.analysis_report.html', template=None, pip_versions=True)[source]¶ Record an analysis output.
- Parameters
output_html (
str
) – Filename of output to report.Defaults to “{root_dir}/{name}.analysis_report.html”.
template (
None
, optional) – Name of the output to report.Default is the HTML template distributed with ngs-toolkit.
pip_versions (
bool
, optional) – Whether the versions of Python packages should be included in the report by using pip freeze.Default is
True
.
-
set_matrix
(matrix_name, csv_file, prefix='{results_dir}/{name}', **kwargs)[source]¶ Set an existing CSV file as the value of the analysis’ matrix.
- Parameters
matrix_name (
str
) – The attribute name of the matrix.Options are “matrix_raw” and “matrix_norm”.
csv_file (
str
) – Path to valid CSV file to be used as matrix. Assumes header and index column. Customize additional overwriding options to read CSV by passing kwargs.prefix (
str
, optional) – String prefix of paths to save files. Variables in curly braces will be formated with attributes of analysis.Defaults to “{results_dir}/{name}”.
**kwargs (
dict
) – Additional keyword arguments will be passed topandas.read_csv
.
- Variables
matrix_name (
pandas.DataFrame
) – An attribute named matrix_name holding the respecive matrix.
-
get_resources
(steps=['blacklist', 'tss', 'genomic_context'], organism=None, genome_assembly=None, output_dir=None, overwrite=False)[source]¶ Get genome-centric resources used by several
ngs_toolkit
analysis functions.- Parameters
steps (
list
, optional) –What kind of annotations to get. Options are:
“genome”: Genome sequence (2bit format)
“blacklist”: Locations of blacklisted regions for genome
“tss”: Locations of gene”s TSSs
“genomic_context”: Genomic context of genome
“chromosome_sizes”: Sizes of chromosomes
Defaults to [“blacklist”, “tss”, “genomic_context”].
organism (
str
, optional) – Organism to get for. Currently supported are “human” and “mouse”.Defaults to analysis’ own organism.
genome_assembly (
str
, optional) – Genome assembly to get resources for. Currently supported are “hg19”, “hg38” and “mm10”.Defaults to the genome assembly of the analysis.
output_dir (
str
, optional) – Directory to save results to.Defaults to the value of
preferences:root_reference_dir
in the configuration, if that is not set, to a directory called “reference” in the analysis root directory.overwrite (
bool
, optional) – Whether existing files should be overwritten by new ones. Otherwise they will be kept and no action is made.Defaults to
False
.
- Returns
Dictionary with keys same as the options as steps, containing paths to the requested files.
The values of the ‘genome’ step are also a dictionary with keys “2bit” and “fasta” for each file type respectively.
- Return type
-
normalize_rpm
(matrix='matrix_raw', samples=None, mult_factor=1000000.0, log_transform=True, pseudocount=1, save=True, assign=True)[source]¶ Normalization of matrix of (n_features, n_samples) by total in each sample.
- Parameters
matrix (
str
, optional) – Attribute name of matrix to normalize.Defaults to “matrix_raw”.
samples (
list
, optional) – Iterable ofpeppy.Sample
objects to restrict matrix to.Defaults to all samples in matrix.
mult_factor (
float
, optional) – A constant to multiply values for.Defaults to 1e6.
log_transform (
bool
, optional) – Whether to log transform values or not.Defaults to
True
.pseudocount ({
int
,float
}, optional) – A constant to add to values.Defaults to 1.
save (
bool
, optional) – Whether to write normalized DataFrame to disk.Defaults to
True
.assign (
bool
, optional) – Whether to assign the normalized DataFrame tomatrix_norm
.Defaults to
True
.
- Variables
matrix_norm (
pandas.DataFrame
) – Ifassign
, apandas.DataFrame
normalized with respective method.norm_method (
str
) – Ifassign
, it is the name of method used to normalize: “rpm”.
- Returns
Normalized dataframe.
- Return type
-
normalize_quantiles
(matrix='matrix_raw', samples=None, implementation='Python', log_transform=True, pseudocount=1, save=True, assign=True)[source]¶ Quantile normalization of matrix of (n_features, n_samples).
- Parameters
matrix (
str
) – Attribute name of matrix to normalize.Defaults to “matrix_raw”.
samples (
list
, optional) – Iterable ofpeppy.Sample
objects to restrict matrix to.Defaults to all in matrix.
implementation (
str
, optional) – One ofPython
orR
. Dictates which implementation is to be used. The R implementation comes from the preprocessCore package, and the Python one is from https://github.com/ShawnLYU/Quantile_Normalize. They give very similar results.Default is “Python”.
log_transform (
bool
, optional) – Whether to log transform values or not.Default is
True
.pseudocount (float, optional) – A constant to add before log transformation.
Default is 1.
save (
bool
, optional) – Whether to write normalized DataFrame to disk.Default is
True
.assign (
bool
, optional) – Whether to assign the normalized DataFrame to an attribute matrix_norm.Default is
True
.
- Variables
matrix_norm (
pandas.DataFrame
) – Ifassign
, a pandas DataFrame normalized with respective method.norm_method (
str
) – Ifassign
, it is the name of method used to normalize: “quantile”.
- Returns
Normalized dataframe.
- Return type
-
normalize_median
(matrix='matrix_raw', samples=None, function=<function nanmedian>, fillna=True, save=True, assign=True)[source]¶ Normalization of matrices of (n_features, n_samples) by subtracting the median from each sample/feature. Most appopriate for CNV data.
- Parameters
matrix (
str
, optional) – Attribute name of dictionary of matrices to normalize.Defaults to “matrix_raw”.
samples (
list
, optional) – Samples to restrict analysis to.Defaults to all samples in Analysis object.
function (function, optional) – An alternative function to calculate across samples. Data will be subtracted by this.
Defaults to
numpy.nanmedian
.fillna (
bool
, optional) – Whether to fill NaN with zero.Defaults to
True
.save (
bool
, optional) – Whether results should be saved to disc.Defaults to
True
.assign (
bool
, optional) – Whether to assign the normalized DataFrame to an attribute matrix_norm.Default is
True
.
- Variables
matrix_norm (
pandas.DataFrame
) – Ifassign
, a pandas DataFrame normalized with respective method.norm_method (
str
) – Ifassign
, it is the name of method used to normalize: “median”.
- Returns
Normalized dataframe.
- Return type
-
normalize_pca
(pc, matrix='matrix_raw', samples=None, save=True, assign=True, **kwargs)[source]¶ Normalization of a matrix by subtracting the contribution of Principal Component pc from each sample/feature.
- Parameters
pc (
int
) – Principal Component to remove. 1-based.matrix (
str
, optional) – Attribute name of dictionary of matrices to normalize.Defaults to “matrix_raw”.
samples (
list
, optional) – Samples to restrict analysis to.Defaults to all samples.
save (
bool
, optional) – Whether results should be saved to disc.Defaults to
True
.assign (
bool
, optional) – Whether to assign the normalized DataFrame to an attribute matrix_norm.Default is
True
.**kwargs (
dict
, optional) – Additional keyword arguments will be passed tongs_toolkit.general.subtract_principal_component
.
- Variables
matrix_norm (
pandas.DataFrame
) – Ifassign
, a pandas DataFrame normalized with respective method.norm_method (
str
) – Ifassign
, it is the name of method used to normalize: “pca”.
- Returns
Normalized dataframe.
- Return type
-
normalize_vst
(matrix='matrix_raw', samples=None, save=True, assign=True, **kwargs)[source]¶ Normalization of a matrix using Variance Stabilization Transformation (VST) method from DESeq2.
- Parameters
matrix (
str
, optional) – Attribute name of dictionary of matrices to normalize.Defaults to “matrix_raw”.
samples (
list
, optional) – Samples to restrict analysis to.Defaults to all samples.
save (
bool
, optional) – Whether results should be saved to disc.Defaults to
True
.assign (
bool
, optional) – Whether to assign the normalized DataFrame to an attribute matrix_norm.Default is
True
.**kwargs (
dict
) – Additional keywork arguments will be passed to DESeq2::varianceStabilizingTransformation.
- Variables
matrix_norm (
pandas.DataFrame
) – Ifassign
, a DataFrame normalized with VST method.norm_method (
str
) – Ifassign
, it is the name of method used to normalize: “vst”.
- Returns
Normalized dataframe.
- Return type
-
normalize
(method='quantile', matrix='matrix_raw', samples=None, save=True, assign=True, **kwargs)[source]¶ Normalization of matrix of (n_features, n_samples).
- Parameters
method (
str
, optional) –- Normalization method to apply. One of:
rpm
: Reads per million normalization (RPM).vst
: Variance stabilization transformation (usesDESeq2
R package).quantile
: Quantile normalization and log2 transformation.cqn
: Conditional quantile normalization (usescqn
R package).Only available for ATAC-seq.
median
: Substraction of median per feature.Only useful for CNV.
pca
: Subtraction of Principal Component from matrix.Requires which PC to subtract.
pc
must be passed as kwarg.
Defaults to “quantile”.
matrix (
str
, optional) – Attribute name of matrix to normalize.Defaults to “matrix_raw”.
samples (
list
, optional) – Iterable ofpeppy.Sample
objects to restrict matrix to.Default is all samples in matrix.
save (
bool
, optional) – Whether to write normalized DataFrame to disk.Defaults to
True
.assign (
bool
, optional) – Whether to assign the normalized DataFrame to an attribute matrix_norm.Default is
True
.**kwargs (
dict
) – Additional keyword arguments will be passed to the respective normalization function.
- Variables
matrix_norm (
pandas.DataFrame
) – Ifassign
, a pandas DataFrame normalized with respective method.norm_method (
str
) – Ifassign
, it is themethod
used to normalize.
- Returns
Normalized dataframe.
- Return type
-
remove_factor_from_matrix
(factor, method='combat', covariates=None, matrix='matrix_norm', samples=None, save=True, assign=True, make_positive=True)[source]¶ Remove an annotated factor from a matrix using Combat.
Requires the Python port of “Combat” to be installed. Install for example the following fork:
pip install git+https://github.com/afrendeiro/combat.git
- Parameters
factor (
str
) – The name of the factor to remove from matrix.method (
str
) – The method to use to remove the factor.Default is “combat”.
covariates (
list
) – Covariates to consider when removing factor. These will be kept in the data.matrix ({str, pandas.DataFrame}) – The name of the attribute with the matrix or a DataFrame.
Defaults to “matrix_norm”.
samples (
list
) – Iterable ofpeppy.Sample
objects to restrict matrix to.Default is not to subset matrix.
save (
bool
, optional) – Whether to write normalized DataFrame to disk.Defaults to
True
.assign (
bool
) – Whether to assign the result to “matrix_norm”.Defaults to
True
.make_positive (
bool
) – Whether to make resulting matrix non-negative. Not implemented yet.Defaults to
True
.
- Returns
Requested matrix (dataframe).
- Return type
-
get_matrix
(matrix, samples=None)[source]¶ Get a matrix that is an attribute of self subsetted for the requested samples.
- Parameters
matrix ({str, pandas.DataFrame}) – The name of the attribute with the matrix or a DataFrame already.
samples (
list
) – Iterable ofpeppy.Sample
objects to restrict matrix to.Default is not to subset matrix.
- Returns
Requested matrix (dataframe).
- Return type
-
get_matrix_stats
(matrix='matrix_raw', samples=None, save=True, output_prefix='stats_per_feature', assign=True)[source]¶ Gets a matrix of feature-wise (ie for every gene or region) statistics such across samples such as mean, variance, deviation, dispersion and amplitude.
- Parameters
matrix (
str
) – Attribute name of matrix to normalize.Defaults to “matrix_raw”.
samples (
list
[peppy.Sample
]) – Subset of samples to use.Defaults to all in analysis.
save (
bool
, optional) – Whether to write the annotated DataFrame to disk.Default is
True
.output_prefix (
str
, optional) – Prefix to add to output file when save is True.Default is “matrix_features”.
assign (
bool
, optional) – Whether to assign the annoatated DataFrame to “matrix_features”.Default is
True
.
- Returns
Statistics for each feature.
- Return type
- Variables
stats (
pandas.DataFrame
) – A DataFrame with statistics for each feature.
-
annotate_features
(samples=None, matrix='matrix_norm', feature_tables=None, permissive=True, save=True, assign=True, output_prefix='matrix_features')[source]¶ Annotates analysis features (regions/genes) by aggregating annotations per feature (genomic context, chromatin state, gene annotations and statistics) if present and relevant depending on the data type of the Analysis.
The numeric matrix to be used is specified in matrix. If any two two annotation dataframes have equally named columns (e.g. chrom, start, end), the value of the first is kept.
- Parameters
samples (
list
) – Iterable ofpeppy.Sample
objects to restrict matrix to. Calculated metrics will be restricted to these samples.Defaults to all in analysis (the matrix will not be subsetted).
matrix (
str
) – Attribute name of matrix to annotate.Defaults to “matrix_norm”.
feature_tables (
list
) – Attribute names of dataframes used to annotate the numeric dataframe.Default is [“gene_annotation”, “region_annotation”, “chrom_state_annotation”, “support”, “stats”] for ATAC-seq and ChIP-seq and [“stats”] for all others.
permissive (
bool
) – Whether DataFrames that do not exist should be simply skipped or an error will be thrown.Defaults to
True
.save (
bool
, optional) – Whether to write the annotated DataFrame to disk.Default is
True
.assign (
bool
, optional) – Whether to assign the annoatated DataFrame to “matrix_features”.Default is
True
.output_prefix (
str
, optional) – Prefix to add to output file whensave
isTrue
.Default is “matrix_features”.
- Raises
AttributeError – If not permissive a required DataFrame does not exist as an object attribute.
- Variables
matrix_features (
pandas.DataFrame
) – A pandas DataFrame containing annotations of the region features.
-
annotate_samples
(matrix='matrix_norm', attributes=None, numerical_attributes=None, save=False, assign=False)[source]¶ Annotate matrix
(n_features, n_samples)
with sample metadata (creates MultiIndex on columns). Numerical attributes can be pass as a iterable tonumerical_attributes
to be converted.- Parameters
matrix (
str
, optional) – Attribute name of matrix to annotate.Defaults to “matrix_norm”.
attributes (
list
, optional) – Desired attributes to be annotated.Defaults to all attributes in the original sample annotation sheet of the analysis’ Project.
numerical_attributes (
list
, optional) – Attributes which are numeric even though they might be so in the samples” attributes. Will attempt to convert values to numeric.save (
bool
, optional) – Whether to write normalized DataFrame to disk.Default is
True
.assign (
bool
, optional) – Whether to assign the normalized DataFrame to “matrix_norm”.Default is
True
.
- Returns
Annotated dataframe with requested sample attributes.
- Return type
- Variables
matrix_norm (
pandas.DataFrame
) – A pandas DataFrame with MultiIndex column index containing the sample’s attributes specified.
-
annotate_matrix
(**kwargs)[source]¶ Convinience function to create dataframes annotated with feature and samples attributes.
Simply calls
Analysis.annotate_features()
andAnalysis.annotate_samples()
.- Parameters
kwargs (
dict
) – Additional keyword arguments are passed to the above mentioned functions.
-
get_level_colors
(index=None, matrix='matrix_norm', levels=None, pallete='tab20', uniform_cmap='plasma', diverging_cmap='RdYlBu_r', nan_color=(0.662745, 0.662745, 0.662745, 1.0), as_dataframe=False)[source]¶ Get tuples of floats representing a colour for a sample in a given variable in a dataframe”s index (particularly useful with MultiIndex dataframes).
If given, will use the provieded
index
argument, otherwise, the columns and its levels of an attribute of self namedmatrix
.levels
can be passed to subset the levels of the index.Will try to guess if each variable is categorical or numerical and return either colours from a colour
pallete
or acmap
, respectively with null values set tonan_color
(a 4-value tuple of floats).- Parameters
index (pandas.Index, optional) – Pandas Index to use.
Default is to use the column Index of the provided
matrix
.matrix (
str
, optional) – Name of analysis attribute containing a dataframe with pandas.MultiIndex columns to use.Default is to use the provided
index
.levels (
list
, optional) – Levels of multiindex to restrict to.Defaults to all in index under use.
pallete (
str
, optional) – Name of matplotlib color palete to use with categorical levels. See matplotlib.org/examples/color/colormaps_reference.html.Defaults to “tab20”.
{uniform_cmap, diverging_cmap} (
str
, optional) – Name of matplotlib color paletes to use with numerical levels. Uniform will be used if values in level are non-negative, while diverging if including negative. See matplotlib.org/examples/color/colormaps_reference.html.Defaults to “plasma” and “RdYlBu_r”, respectively.
nan_color (tuple, optional) – Color for missing (i.e. NA) values.
Defaults to
(0.662745, 0.662745, 0.662745, 0.5)
==grey
.as_dataframe (
bool
, optional) – Whether a dataframe should be returned.Defaults to
False
.
- Returns
Matrix of shape (level, sample) with rgb values of each of the variable. If as_dataframe, this will be a pandas.DataFrame otherwise, list of lists.
- Return type
{list, pandas.DataFrame}
-
unsupervised_analysis
(steps=['correlation', 'manifold', 'pca', 'pca_association'], matrix='matrix_norm', samples=None, attributes_to_plot=None, output_dir='{results_dir}/unsupervised_analysis_{data_type}', output_prefix='all_{var_unit_name}s', standardize_matrix=True, manifold_algorithms=['MDS', 'Isomap', 'LocallyLinearEmbedding', 'SpectralEmbedding', 'TSNE'], maniford_kwargs={}, display_corr_values=False, plot_max_pcs=4, save_additional=False, prettier_sample_names=True, rasterized=False, dpi=300, **kwargs)[source]¶ General unsupervised analysis of a matrix.
Apply unsupervised clustering, manifold learning and dimensionality reduction methods on numeric matrix. Colours and labels samples by their attributes as given in attributes_to_plot.
- This analysis has 4 possible steps:
- “correlation”:
Pairwise sample correlation with 2 distance metrics plotted as heatmap.
- “manifold”:
Manifold learning of latent spaces for projection of samples. See here available algorithms: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.manifold
- “pca”:
For PCA analysis, if test_pc_association is True, will compute association of PCs with sample attributes given in attributes_to_plot. For numeric attributes, the Pearson correlation will be computed and for categoriacal, a pairwise Kruskal-Wallis H-test (ANOVA).
- “pca_association”:
For PCA analysis, if test_pc_association is True, will compute association of PCs with sample attributes given in attributes_to_plot. For numeric attributes, the Pearson correlation will be computed and for categoriacal, a pairwise Kruskal-Wallis H-test (ANOVA).
- Parameters
steps (
list
, optional) – List of step keywords to be performed as described above.Defaults to all available.
matrix (
str
, optional) – Name of analysis attribute contatining the numeric dataframe to perform analysis on. Must have a pandas.MultiIndex as column index.Defaults to “matrix_norm”.
samples (
list
, optional) – List of sample objects to restrict analysis to.Defaults to all in analysis.
attributes_to_plot (
list
, optional) – List of attributes shared between sample groups should be plotted.Defaults to attributes in analysis.group_attributes.
output_dir (
str
, optional) – Directory for generated files and plots.Defaults to “{results_dir}/unsupervised_analysis_{data_type}”.
output_prefix (
str
, optional) – Prefix for output files.Defaults to “all_regions” if data_type is ATAC-seq and “all_genes” if data_type is RNA-seq.
standardize_matrix (
bool
, optional) – Whether to standardize variables in matrix by removing the mean and scaling to unit variance. It is not applied to the “correlation” step.Default is
True
.manifold_algorithms (
list
, optional) – List of manifold algorithms to use. See available algorithms here: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.manifoldDefaults to [‘MDS’, ‘Isomap’, ‘LocallyLinearEmbedding’, ‘SpectralEmbedding’, ‘TSNE’],
maniford_kwargs (
dict
, optional) – Dictionary of keyword arguments to pass to the algorithms inmanifold_algorithms
. Should be of the form {“algorithm_name”: {“key”: value}}display_corr_values (
bool
, optional) – Whether values in heatmap of sample correlations should be displayed overlaid on top of colours.Defaults to
False
.save_additional (
bool
, optional) – Whether additional results such as PCA projection, loadings should be saved.Defaults to
False
.prettier_sample_names (
bool
, optional) – Whether it should attempt to prettify sample names by removing the data type from plots.Defaults to
True
.rasterized (
bool
, optional) – Whether elements with many objects should be rasterized.Defaults to
False
.dpi (
int
, optional) – Definition of rasterized image in dots per inch (dpi).Defaults to 300.
**kwargs (optional) – kwargs are passed to
get_level_colors()
andplot_projection()
.
-
differential_analysis
(comparison_table=None, samples=None, covariates=None, filter_support=False, output_dir='{results_dir}/differential_analysis_{data_type}', output_prefix='differential_analysis', overwrite=True, distributed=False, deseq_kwargs=None, **kwargs)[source]¶ Perform differential regions/genes across samples that are associated with a certain trait. Currently the only implementation is with DESeq2. This implies the rpy2 library and the respective R library are installed.
Requires the R package “DESeq2” to be installed:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("DESeq2")
For other implementations of differential analysis see ngs_toolkit.general.least_squares_fit and ngs_toolkit.general.differential_from_bivariate_fit.
- Parameters
comparison_table (
pandas.DataFrame
) – A dataframe with “comparison_name”, “comparison_side” and “sample_name”, “sample_group” columns.Defaults to the analysis’ own “comparison_table” attribute.
samples (
list
, optional) – Samples to limit analysis to.Defaults to all samples in analysis object.
covariates (
list
, optional) – Additional variables to take into account in the model fitting.Defaults to None.
filter_support (
bool
, optional) – Whether features not supported in a given comparison should be removed (i.e. regions with no peaks in any sample in a comparison are not tested). Applies only to ATAC-/ChIP-seq data.Default is
True
.output_dir (
str
, optional) – Output directory for analysis. Variables in curly braces will be formated with attributes from analysis.Defaults to “{results_dir}/differential_analysis_{data_type}”.
output_prefix (
str
, optional) – Prefix for output files.Defaults to “differential_analysis”.
overwrite (
bool
, optional) – Whether results should be overwritten in case they already exist.Defaults to
True
.distributed (
bool
, optional) – Whether analysis should be distributed in a computing cluster for each comparison. Additional configuration can be passed inkwargs
.Defaults to
False
.deseq_kwargs (
dict
, optional) – Additional keyword arguments to be passed to the DESeq function of DESeq2.kwargs (
dict
, optional) – Additional keyword arguments are passed tosubmit_job()
and then to the chosen divvy submission template according to computing_configuration. Pass for example cores=4, mem=8000, partition=”longq”, time=”08:00:00”.
- Returns
Results for all comparisons. Will be
None
if distributed is True.- Return type
- Variables
differential_results (
pandas.DataFrame
) – Pandas dataframe with results.
-
collect_differential_analysis
(comparison_table=None, input_dir='{results_dir}/differential_analysis_{data_type}', input_prefix='differential_analysis', output_dir='{results_dir}/differential_analysis_{data_type}', output_prefix='differential_analysis', permissive=True, save=True, assign=True, overwrite=False)[source]¶ Collect results from DESeq2 differential analysis. Particularly useful when running
differential_analysis
in distributed mode.- Parameters
comparison_table (
pandas.DataFrame
) – A dataframe with “comparison_name”, “comparison_side” and “sample_name”, “sample_group” columns.Defaults to the analysis’s own “comparison_table” attribute.
input_dir, output_dir (
str
, optional) – In-/Output directory of files. Values within curly brackets “{data_type}”, will be formated with attributes from analysis.Defaults to “{results_dir}/differential_analysis_{data_type}”.
input_prefix, output_prefix (
str
, optional) – Prefix of the in-/output files.Defaults for both is “differential_analysis”.
permissive (
bool
, optional) – Whether non-existing files should be skipped or an error be thrown.Defaults to
True
.save (
bool
, optional) – Whether to save results to disk.Defaults to
True
.assign (
bool
, optional) – Whether to add results to a differential_results attribute.Defaults to
True
.overwrite (
bool
, optional) – Whether results should be overwritten in case they already exist.Defaults to
False
.
- Returns
Results for all comparisons. Will be
None
ifoverwrite
isFalse
and a results file already exists.- Return type
- Variables
differential_results (
pandas.DataFrame
) – Pandas dataframe with results.
-
plot_differential
(steps=['distributions', 'counts', 'scatter', 'volcano', 'ma', 'stats_heatmap', 'correlation', 'heatmap'], results=None, comparison_table=None, samples=None, matrix='matrix_norm', only_comparison_samples=False, alpha=0.05, corrected_p_value=True, fold_change=None, diff_based_on_rank=False, max_rank=1000, ranking_variable='pvalue', respect_stat_thresholds=True, output_dir='{results_dir}/differential_analysis_{data_type}', output_prefix='differential_analysis', plot_each_comparison=True, mean_column='baseMean', log_fold_change_column='log2FoldChange', p_value_column='pvalue', adjusted_p_value_column='padj', comparison_column='comparison_name', rasterized=True, robust=False, feature_labels=False, group_colours=True, group_attributes=None, **kwargs)[source]¶ Plot differential features (eg chromatin region, genes) discovered with supervised group comparisons by
ngs_toolkit.general.differential_analysis
. This will plot number and direction of discovered features, scatter, MA and volcano plots for each comparison and joint heatmaps of log fold changes, normalized values or Z-scores of individual samples or groups in the differential features.- Parameters
steps (
list
, optional) –- Types of plots to make:
“distributions”: Distribution of p-values and fold-changes
“counts” - Count of differential features per comparison given certain thresholds.
“scatter” - Scatter plots (group 1 vs group 2).
“volcano” - Volcano plots (log fold change vs -log p-value)
“ma” - MA plots (log mean vs log fold change)
“stats_heatmap” - Heatmap of p-values and fold-changes for comparisons.
“correlation” - Correlation of samples or sample groups in differential features.
“heatmap” - Heatmaps of samples or sample groups in differential features.
Defaults to all of the above.
results (
pandas.DataFrame
, optional) – Data frame with differential analysis results. Seengs_toolkit.general.differential_analysis
for more information.comparison_table (
pandas.DataFrame
, optional) – Comparison table. If provided, group-wise plots will be produced.Defaults to the analysis’ “comparison_table” attribute.
samples (
list
, optional) – List of sample objects to restrict analysis to.Defaults to all samples in analysis.
matrix (
str
, optional) – Matrix of quantification to use for plotting feature values across samples/groups.Defaults to “matrix_norm”.
only_comparison_samples (
bool
, optional) – Whether to use only samples present in the comparison_table and results table.Defaults to
False
.alpha (float, optional) – Significance level to consider a feature differential.
Defaults to 0.05.
corrected_p_value (
bool
, optional) – Whether to use a corrected p-valueto consider a feature differential.Defaults to
True
.fold_change (float, optional) – Effect size (log2 fold change) to consider a feature differential. Considers absolute values.
Default is no log2 fold change threshold.
diff_based_on_rank (
bool
, optional) – Whether a feature should be considered differential based on its rank. Use in combination with max_rank, ranking_variable and respect_stat_thresholds.Defaults to
False
.max_rank (
int
, optional) – Rank to use when using diff_based_on_rank.Defaults to 1000.
ranking_variable (
str
, optional) – Which variable to use for ranking when using diff_based_on_rank.Defaults to “pvalue”.
respect_stat_thresholds (
bool
, optional) – Whether the statistical thresholds from alpha and fold_change should still be respected when using diff_based_on_rank.Defaults to
True
.output_dir (
str
, optional) – Directory to create output files.Defaults to “{results_dir}/differential_analysis_{data_type}”
output_prefix (
str
, optional) – Prefix to use when creating output files.Defaults to “differential_analysis”.
plot_each_comparison (
bool
, optional) – Whether each comparison should be plotted in scatter, MA and volcano plots. Useful to turn off with many comparisons.Defaults to
True
.mean_column (
str
, optional) – Column in results data frame containing values for mean values across samples.Defaults to “baseMean”.
log_fold_change_column (
str
, optional) – Column in results data frame containing values for log2FoldChange values across samples.Defaults to “log2FoldChange”.
p_value_column (
str
, optional) – Column in results data frame containing values for p-values across samples.Defaults to “pvalue”.
adjusted_p_value_column (
str
, optional) – Column in results data frame containing values for adjusted p-values across samples.Defaults to “padj”.
comparison_column (
str
, optional) – Column in results data frame containing the name of the comparison.Defaults to “comparison_name”.
rasterized (
bool
, optional) – Whether plots with many objects should be rasterized.Defaults to
True
.robust (
bool
, optional) – Whether heatmap color scale ranges should be robust (using quantiles) rather than extreme values. Useful for noisy/extreme data.Defaults to
False
.feature_labels (
bool
, optional) – Whether features (regions/genes) should be labeled in heatmaps.Defaults to
False
.group_colours (
bool
, optional) – Whether groups of samples should be coloured in heatmaps.Defaults to
True
.group_attributes (
list
, optional) – Which variables to colour if group_colours ifTrue
.Defaults to all of analysis.group_attributes.
**kwargs (
dict
, optional) – Additional keyword arguments will be passed to Analysis.get_level_colors.
-
differential_overlap
(differential=None, output_dir='{results_dir}/differential_analysis_{data_type}', output_prefix='differential_analysis')[source]¶ Visualize intersection of sets of differential regions/genes.
- Parameters
differential (
pandas.DataFrame
, optional) – DataFrame containing result of comparisons filtered for features considered as differential.Defaults to the
differential_results
attribute, subset by the object’sthresholds
.output_dir (
str
, optional) – Directory to create output files.Defaults to “{results_dir}/differential_analysis_{data_type}”.
output_prefix (
str
, optional) – Prefix to use when creating output files.Defaults to “differential_analysis”.
-
differential_enrichment
(differential=None, output_dir='{results_dir}/differential_analysis_{data_type}/enrichments', output_prefix='differential_analysis', genome=None, steps=['region', 'lola', 'meme', 'homer', 'enrichr'], directional=True, max_diff=1000, sort_var='pvalue', distributed=False, overwrite=False)[source]¶ Perform various types of enrichment analysis given a dataframe of the results from differential analysis. Performs enrichment of gene sets (RNA-seq and ATAC-seq), genomic regions, chromatin states Location Overlap Analysis (LOLA) and TF motif enrichment (over-representation and de-novo search) (ATAC-seq only).
- Parameters
differential (
pandas.DataFrame
) – Data frame with differential results as produced bydifferential_analysis
, but filtered by some threshold for the relevant (significant regions). Must contain a “comparison_name” column.Defaults to
analysis.differential_results
.output_dir (
str
, optional) – Directory to create output files.Defaults to “{results_dir}/differential_analysis_{data_type}”.
output_prefix (
str
, optional) – Prefix to use when creating output files.Defaults to “differential_analysis”.
genome (
str
, optional) – Genome assembly of the analysis.Defaults to Analysis’s
genome
attribute.steps (
list
, optional) – Steps of the analysis to perform.Defaults to all possible: [“region”, lola”, “meme”, “homer”, “enrichr”].
directional (
bool
, optional) – Whether enrichments should be performed in a direction-dependent way (up-regulated and down-regulated features separately). This requires a column named “log2FoldChange” to exist.Defaults to
True
.max_diff (
int
, optional) – Number of maximum features to perform enrichment for ranked by variable in max_diff.Defaults to 1000.
sort_var (
str
, optional) – Variable to sort for when setting max_diff.Defaults to “pvalue”.
distributed (
bool
, optional) – Whether work should be submitted as jobs in a computing cluster.Defaults to
False
.overwrite (
bool
, optional) – Whether output files should be overwritten when distributed isTrue
.Defaults to
False
.
- Variables
enrichment_results (
dict
) – Dictionary with keys as in steps and values with pandas.DataFrame of enrichment results.
-
collect_differential_enrichment
(steps=['region', 'lola', 'motif', 'homer', 'homer_consensus', 'enrichr'], directional=True, permissive=True, output_dir='{results_dir}/differential_analysis_{data_type}/enrichments', input_prefix='differential_analysis', output_prefix='differential_analysis', differential=None)[source]¶ Collect the results of enrichment analysis ran after a differential analysis.
- Parameters
steps (
list
, optional) – Steps of the enrichment analysis to collect results for.Defaults to [“region”, “lola”, “meme”, “homer”, “enrichr”].
directional (
bool
, optional) – Whether enrichments were made in a direction-dependent way (up-regulated and down-regulated features separately). This implies a column named “direction” exists”.Defaults to
True
.differential (
pandas.DataFrame
, optional) – Data frame with differential results to select which comparisons to collect enrichments for. Usually produced byngs_toolkit.general.differential_analysis
.Defaults to analysis’s
differential_results
attributes.output_dir (
str
, optional) – Directory to create output files.Defaults to “{results_dir}/differential_analysis_{data_type}”.
input_prefix, output_prefix (
str
, optional) – File prefix of input/output files.Defaults to “differential_analysis”.
permissive (
bool
, optional) – Whether to skip non-existing files, giving a warning.Defaults to
True
.
- Variables
enrichment_results (
dict
) – Dictionary with keys as insteps
and values with pandas.DataFrame of enrichment results.
-
plot_differential_enrichment
(steps=['region', 'lola', 'motif', 'great', 'enrichr'], plot_types=['barplots', 'scatter', 'correlation', 'heatmap'], enrichment_type=None, enrichment_table=None, direction_dependent=True, output_dir='{results_dir}/differential_analysis_{data_type}/enrichments', comp_variable='comparison_name', output_prefix='differential_analysis', rasterized=True, clustermap_metric='correlation', top_n=5, z_score=0, cmap=None)[source]¶ Make plots illustrating enrichment of features for various comparisons.
Input can be the dictionary under analysis.enrichment_results or a single dataframe of enrichment terms across several comparisons for a given type of enrichment. In the later case both enrichment_table and enrichment_type must be given.
- Parameters
steps (
list
, optional) – Types of the enrichment analysis to plot. Options are [“region”, “lola”, “motif”, “great”, “enrichr”].Defaults to all keys present in analysis.enrichment_results.
plot_types (
list
, optional) – Types of plots to do for each enrichment type. One of [“barplot”, “scatter”, “correlation”, “heatmap”].Defaults to all of the above.
enrichment_type (
str
, optional) – Type of enrichment if run for a single type of enrichment. In this case enrichment_table must be given. One of {“region”, “lola”, “motif”, “great”, “enrichr”}.Default (
None
) is to run all keys present in analysis.enrichment_results.enrichment_table (
pandas.DataFrame
, optional) – Data frame with enrichment results as produced bydifferential_enrichment
orcollect_differential_enrichment
. If given, enrichment_type must be given too.Default (
None
) is the dataframes in all values present in analysis.enrichment_results.direction_dependent (
bool
, optional) – Whether enrichments were made in a direction-dependent way (up-regulated and down-regulated features separately). This implies a column named “direction” exists”.Defaults to
True
.output_dir (
str
, optional) – Directory to create output files.Defaults to “{results_dir}/differential_analysis_{data_type}/enrichments”.
comp_variable (
str
, optional) – Column defining which comparison enrichment terms belong to.Defaults to “comparison_name”.
output_prefix (
str
, optional) – Prefix to use when creating output files.Defaults to “differential_analysis”.
rasterized (
bool
, optional) – Whether or not to rasterize heatmaps for efficient plotting.Defaults to
True
.clustermap_metric (
str
, optional) – Distance metric to use for clustermap clustering, See https://docs.scipy.org/doc/scipy/reference/spatial.distance.html for valid values.Default to “correlation” (Pearson’s).
top_n (
int
, optional) – Top terms to use to display in plots.Defaults to 5.
z_score ({bool, int}, optional) – Which dimention/axis to perform Z-score transformation for. Pass
False
to skip plotting Z-score heatmaps. Numpy/Pandas conventions are used: 0 is row-wise (in this case across comparisons) and 1 is column-wise (across terms).Defaults to 0.
cmap (
str
, optional) – Colormap to use in heatmaps.Defaults to
None
.
ngs_toolkit.atacseq¶
-
class
ngs_toolkit.atacseq.
ATACSeqAnalysis
(name=None, from_pep=False, from_pickle=False, root_dir=None, data_dir='data', results_dir='results', prj=None, samples=None, **kwargs)[source]¶ Class to model analysis of ATAC-seq data. Inherits from the
Analysis
class.- Parameters
name (
str
, optional) – Name of the analysis.Defaults to “analysis”.
from_pep (
str
, optional) – PEP configuration file to initialize analysis from. The analysis will adopt as much attributes from the PEP as possible but keyword arguments passed at initialization will still have priority.Defaults to
None
(no PEP used).from_pickle (
str
, optional) – Pickle file of an existing serialized analysis object from which the analysis should be loaded.Defaults to
None
(will not load from pickle).root_dir (
str
, optional) – Base directory for the project.Defaults to current directory or to what is specified in PEP if
from_pep
.data_dir (
str
, optional) – Directory containing processed data (e.g. by looper) that will be input to the analysis. This is in principle not required.Defaults to “data”.
results_dir (
str
, optional) – Directory to contain outputs produced by the analysis.Defaults to “results”.
prj (
peppy.Project
, optional) – Apeppy.Project
object that this analysis is tied to.Defaults to
None
.samples (
list
, optional) – List ofpeppy.Sample
objects that this analysis is tied to.Defaults to
None
.kwargs (
dict
, optional) – Additional keyword arguments will be passed to parent classAnalysis
.
Examples
>>> from ngs_toolkit.atacseq import ATACSeqAnalysis
This is an example of the beginning of an ATAC-seq analysis:
>>> pep = "metadata/project_config.yaml" >>> a = ATACSeqAnalysis(from_pep=pep) >>> # Get consensus peak set from all samples >>> a.get_consensus_sites(a.samples) >>> # Annotate regions >>> a.get_peak_gene_annotation() >>> a.get_peak_genomic_location() >>> # Get coverage values for each peak in each sample of ATAC-seq >>> a.measure_coverage() >>> # Normalize jointly (quantile normalization + GC correction) >>> a.normalize(method="gc_content") >>> # Annotate quantified peaks with previously calculated metrics and features >>> a.annotate_features() >>> # Annotate with sample metadata >>> a.annotate_samples() >>> # Save object >>> a.to_pickle()
-
load_data
(output_map=None, only_these_keys=None, prefix='{results_dir}/{name}', permissive=True)[source]¶ Load the output files of the major functions of the Analysis.
- Parameters
output_map (
dict
) – Dictionary with “attribute name”: “path prefix” to load the files.only_these_keys (
list
, optional) – Iterable of analysis attributes to load up. Possible attributes:“matrix_raw”
“matrix_norm”
“matrix_features”
“sites”
“support”
“nuc”
“coverage_gc_corrected”
“gene_annotation”
“region_annotation”
“region_annotation_b”
“chrom_state_annotation”
“chrom_state_annotation_b”
“stats”
“differential_results”
Default is all of the above.
prefix (
str
, optional) – String prefix of files to load. Variables in curly braces will be formated with attributes of analysis.Defaults to “{results_dir}/{name}”.
bool (permissive, optional) – Whether an error should be ignored if reading a file causes IOError.
Default is
True
.
- Variables
sites (
pybedtools.bedtool.BedTool
) – Sets a sites variable.pandas.DataFrame – Dataframes holding the respective data, available as attributes described in the only_these_keys parameter.
- Raises
IOError – If not permissive and a file is not found
-
get_consensus_sites
(samples=None, region_type='summits', extension=250, blacklist_bed=None, filter_chroms=None, permissive=False, save=True, assign=True, **kwargs)[source]¶ Get consensus (union) of enriched sites (peaks) across samples. There are two modes possible, defined by the value of
region_type
:peaks: simple union of all sites;
summits: peak summits are extended by
extension
and a union is made.
- Parameters
samples (
list
) – Iterable ofpeppy.Sample
objects to restrict to. Must have apeaks
attribute set.Defaults to all samples in the analysis (
samples
attribute).region_type (
str
) – The type of region to use to create the consensus region set - one of “summits” or “peaks”. If “summits”, peak summits will be extended byextension
before union. If “peaks”, sample peaks will be used with no modification prior to union.Default is “summits”.
extension (
int
) – Amount to extend peaks summits by in both directions.Default is 250.
blacklist_bed ({
False
,str
}) – EitherFalse
or a path to a BED file with genomic positions to exclude from consensus peak set.Default is to use a blacklist file for the analysis
genome
.filter_chroms ({
list
,str
}) – A list of chromosomes to filter out or a string with a pattern to match to exclude chromosomes. Uses Pandas string methodspandas.Series.str.match
. Pass for example “.*_.*|chrM” to filter out chromosomes with a “_” character and a “chrM” chromosome.Default is not to filter anything.
permissive (
bool
) – Whether Samples that whichregion_type
attribute file does not exist should be simply skipped or an error thrown.Default is
True
.**kwargs – Not used. Provided for compatibility with
ngs_toolkit.ChIPSeqAnalysis
class.
- Raises
ValueError – If not
permissive
and either thepeaks
orsummits
file of a sample is not readable, or ifpermissive
but none of the samples has an existing file.AttributeError – If analysis does not have
organism
andgenome
attributes.
- Variables
sites (
pybedtools.bedtool.BedTool
) – Sets asites
variable with the consensus peak set.- Returns
sites – The consensus peak set.
- Return type
-
set_consensus_sites
(bed_file, overwrite=True)[source]¶ Set consensus (union) sites across samples given a BED file.
-
calculate_peak_support
(samples=None, region_type='summits', permissive=False, comparison_table=None, peak_dir=None)[source]¶ Count number of called peaks per sample in the consensus region set. In addition calculate a measure of peak support (or ubiquitouness) by observing the ratio of samples containing a peak overlapping each region.
- Parameters
samples (
list
) – Iterable ofpeppy.Sample
objects to restrict to. Must have apeaks
attribute set.Defaults to all samples in the analysis (
samples
attribute).region_type (
str
) – The type of region to use to create the consensus region set. One of “summits” or “peaks”. If summits, peak summits will be extended byextension
before union. Otherwise sample peaks will be used with no modification.Default is “summits”.
permissive (
bool
) – Whether Samples that which region_type attribute file does not exist should be simply skipped or an error thrown.comparison_table (
pandas.DataFrame
) – Not used. Provided for compatibility with ChIPSeqAnalysis class.peak_dir (
str
) – Not used. Provided for compatibility with ChIPSeqAnalysis class.
- Raises
IOError – If not
permissive
and either thepeaks
orsummits
file of a sample is not readable. Or ifpermissive
but none of the samples has an existing file.- Variables
support (
pandas.DataFrame
) – A dataframe with counts of peaks overlapping each feature of consensus set.
-
get_supported_peaks
(samples=None, **kwargs)[source]¶ Get mask of sites with 0 support in the given samples. Requires support matrix produced by ngs_toolkit.atacseq.ATACSeqAnalysis.calculate_peak_support.
- Parameters
samples (
list
) – Iterable ofpeppy.Sample
objects to restrict to.**kwargs – Not used. Provided for compatibility with ChIPSeqAnalysis class.
- Returns
Boolean Pandas Series with sites with at least one of the given samples having a peak called.
- Return type
pd.Series
-
measure_coverage
(samples=None, sites=None, save=True, assign=True, peak_set_name='peak_set', output_file='{results_dir}/{name}.matrix_raw.csv', permissive=False, distributed=False, overwrite=True, **kwargs)[source]¶ Measure read coverage (counts) of each sample in each region in consensus sites. Uses parallel computing using the
parmap
library. However, for many samples (hundreds), parallelization in a computing cluster is possible with the distributed option.- Parameters
samples (
list
) – Iterable ofpeppy.Sample
objects to restrict to. Must have aaligned_filtered_bam
attribute set.Defaults to all samples in the analysis (
samples
attribute).sites ({
pybedtools.bedtool.BedTool
,pandas.DataFrame
,str
}) – Sites in the genome to quantify, usually apybedtools.bedtool.BedTool
fromngs_toolkit.atacseq.ATACSeqAnalysis.get_consensus_sites
. If a DataFrame, will try to convert to BED format assuming first three columns are chr,start,end. If a string assumes a path to a BED file.Defaults to
sites
attribute of analysis object.save (
bool
) – Whether to save to disk the coverage matrix with filenameoutput_file
.Default is
True
.assign (
bool
) – Whether to assign the matrix to an attribute namedcoverage
.Default is
True
.peak_set_name (
bool
) – Suffix to files containing coverage ofdistributed
is True.Defaults to “peak_set”.
output_file (
str
) – A path to a CSV file with coverage output.Default is “{results_dir}/{name}.raw_coverage.csv”.
permissive (
bool
) – Whether Samples for whichregion_type
attribute file does not exist should be simply skipped or an error thrown.Default is
False
.distributed (
bool
) – Whether it should be run as jobs for each sample separately in parallel. Currently only implemented for a SLURM cluster.Default is
False
.overwrite (
bool
) – Whether to overwrite existing files ifdistributed
is True.Default is
True
.**kwargs (
dict
) – Additional keyword arguments will be passed to ngs_toolkit.utils.submit_job if distributed is True, and on to a divvy submission template. Pass for example: computing_configuration=”slurm”, jobname=”job”, cores=2, mem=8000, partition=”longq”.
- Raises
IOError – If not
permissive
and the ‘aligned_filtered_bam’ file attribute of a sample is not readable. Or ifpermissive
but none of the samples has an existing file.- Variables
matrix_raw (
pandas.DataFrame
) – The dataframe of raw coverage values (counts) of shape (n_features, m_samples).- Returns
Pandas DataFrame with read counts of shape (n_sites, m_samples).
- Return type
-
collect_coverage
(samples=None, save=True, assign=True, output_file=None, permissive=False, peak_set_name='peak_set', fast_and_unsafe=False)[source]¶ Collect read coverage (counts) of each sample in each region in consensus sites from existing files. Useful after runnning analysis.measure_coverage() in distributed mode.
- Parameters
samples (
list
) – Iterable ofpeppy.Sample
objects to restrict to. If not provided (None is passed) if will default to all samples in the analysis (samples
attribute).save (
bool
) – Whether to save to disk the coverage matrix with filenameoutput_file
.assign (
bool
) – Whether to assign the matrix to an attribute of self namedcoverage
.output_file (
str
) – A path to a CSV file with coverage output.Default is “{results_dir}/{name}.raw_coverage.csv”.
permissive (
bool
) – Whether Samples without an existing coverage file does not exist should be simply skipped or an error thrown.peak_set_name (
bool
) – Suffix to files containing coverage. Defaults to “peak_set”.fast_and_unsafe (
bool
) – Whether to use a faster but unsafer method to concatenate the data. If the order of all rows in all samples is the same then the result should be the same. The default, slower method assures that all rows are matched and is therefore slower.Defaults to
False
.
- Raises
IOError – If not
permissive
and the coverage file of a sample is not readable or is empty. Or ifpermissive
but none of the samples has an existing file or are empty.- Variables
matrix_raw (
pandas.DataFrame
) – The dataframe of raw coverage values (counts) of shape (n_features, m_samples).- Returns
Pandas DataFrame with read counts of shape (n_sites, m_samples).
- Return type
-
get_peak_gccontent_length
(bed_file=None, fasta_file=None)[source]¶ Get length and GC content of features in region set.
- bed_file
str
A BED file with regions to calculate GC content on. Must be a 3-column BED! If not provided the calculation will be for the analysis sites attribute.
- genome
str
Genome assembly.
- fasta_file
str
Fasta file of genome. Preferably indexed. If not given, will try to download.
- Variables
nuc – DataFrame with nucleotide content and length of each region.
nuc (
pandas.DataFrame
) – Dataframe with length and GC-content of each feature.
- Returns
DataFrame with nucleotide content and length of each region.
- Return type
- bed_file
-
normalize_cqn
(matrix='matrix_raw', samples=None, save=True, assign=True)[source]¶ Conditional quantile normalization (CQN) of a matrix. It uses GC content and length of regulatory elements as covariates.
Requires the R package “cqn” to be installed:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("cqn")
- Parameters
matrix (
str
) – Attribute name of matrix to normalize.Defaults to “matrix_raw”.
samples (
list
) – Iterable ofpeppy.Sample
objects to restrict matrix to.Defaults to all samples in analysis.
save (
bool
) – Whether to write normalized DataFrame to disk.Default is
True
.assign (
bool
) – Whether to assign the normalized DataFrame to an attribute matrix_norm.Default is
True
.
- Variables
matrix_norm (
pandas.DataFrame
) – If assign, the dataframe with normalized values.norm_method (
str
) – Ifassign
, it is the name of method used to normalize: “cqn”.
-
get_peak_gene_annotation
(tss_file=None, max_dist=100000, save=True, output_prefix='', assign=True)[source]¶ Annotates peaks with closest gene. The annotation reference can either be given in the tss_file parameter but if ommited, it will be fetched if analysis has genome and organism attributes. A dataframe with each feature’s distance to the nearest gene is also saved.
- Parameters
tss_file (
str
, optional) – A valid BED file where the name field (4th column) identifies the gene and the strand column (6th column). Other fields will not be used.Default is to get gene position annotations.
max_dist (
int
, optional) – Maximum absolute distance allowed to perform associations. Regions with no genes within the range will have NaN values.Default is 100000.
save (
bool
, optional) – Whether to write the annotated DataFrame to disk.Default is
True
.output_prefix (
str
, optional) – Prefix to add to output file when save is True.Default is “” (empty string).
assign (
bool
, optional) – Whether to assign the DataFrames to Attributes.Default is
True
.
- Variables
gene_annotation (
pandas.DataFrame
) – A pandas DataFrame containing the genome annotations of the region features. If a feature overlaps more than one gene, the two gene values will be concatenated with a comma.closest_tss_distances (
pandas.DataFrame
) – A pandas DataFrame containing unique region->gene associations. In contrast to gene_annotation dataframe, this contains one row per region->gene assignment.
- Returns
A dataframe with genes annotated for the peak set.
- Return type
-
get_peak_genomic_location
(genomic_context_file=None, save=True, output_prefix='', assign=True)[source]¶ Annotates a consensus peak set (
sites
attribute of analysis) with their genomic context. The genomic context is mostly gene-centric, which includes overlap with gene promoters, UTRs, exons, introns and remaining intergenic space.If no reference genomic annotation file is given (genomic_context_file kwarg), it will use the ngs_toolkit.general.get_genomic_context function to get such data. For more customization of the annotations, use that function directly and pass the output file to this function.
- Parameters
genomic_context_file (
str
) – A 4 column BED file (chrom, start, end, feature), where feature is a string with the type of region. If not provided will be get with the get_genomic_context function.save (
bool
, optional) – Whether to write the annotated DataFrame to disk.Default is
True
.output_prefix (
str
, optional) – Prefix to add to output file when save is True.Default is “” (empty string).
assign (
bool
, optional) – Whether to assign the DataFrames to Attributes.Default is
True
.
- Variables
region_annotation_b (region_annotation,) – A DataFrame with the genome annotations of the region features or genome background.
region_annotation_b_mapping (region_annotation_mapping,) – A DataFrame with one row for each chromatin state-region mapping or genome background.
- Returns
The genomic context annotation for the peak set.
- Return type
-
get_peak_chromatin_state
(chrom_state_file, frac=0.2, save=True, output_prefix='', assign=True)[source]¶ Annotates a consensus peak set (
sites
attribute of analysis) with their chromatin state context. This would be given, for example by a chromatin state segmentation file from projects such as Roadmap Epigenomics.See examples of such files for various cell types/assemblies in this website (the “dense.bed.gz” files are optimal).
- Parameters
chrom_state_file (
str
) – A 4 column BED file (chrom, start, end, feature), where feature is a string with the type of region. Additional columns are ignored.frac (float) – Minimal fraction of region to overlap with a feature.
Defaults to 0.2.
save (
bool
, optional) – Whether to write the annotated DataFrame to disk.Default is
True
.output_prefix (
str
, optional) – Prefix to add to output file when save is True.Default is “” (empty string).
assign (
bool
, optional) – Whether to assign the DataFrames to Attributes.Default is
True
.
- Returns
The chromatin state annotation for the peak set.
- Return type
- Variables
chrom_state_annotation_b (chrom_state_annotation,) – A DataFrame with the chromatin state annotations of the region features or of the genome background.
chrom_state_annotation_b_mapping (chrom_state_annotation_mapping,) – A DataFrame with one row for each chromatin state-region mapping or for the genome background.
-
get_sex_chrom_ratio
(matrix='matrix_norm', sex_chroms=['chrX', 'chrY'], output_dir='{results_dir}', output_prefix='sex_chrom_ratio', plot=True)[source]¶ Get ratio of signal between sex chromosomes. Useful to quickly assign sex to samples.
- Parameters
matrix (
pandas.DataFrame
, optional) – Matrix to use. Defaults to matrix_norm.sex_chroms (
list
, optional) – Names of the two sex chromosomes to use.output_dir (
str
, optional) – Directory to write output to.output_prefix (
str
, optional) – String to prefix output with.plot (
bool
, optional) – Whether to produce illustrative plots.
- Returns
Ratio of sex chromosomes defined as sex_chroms[1] - sex_chroms[0].
- Return type
pd.Series
-
get_gene_level_matrix
(matrix='matrix_norm', reduce_func=<function mean>, assign=True, save=True, output_file='{results_dir}/{name}.gene_coverage.csv')[source]¶ Get gene-level measurements of coverage.
Requires a ‘gene_annotation’ or ‘closest_tss_distances’ attribute to be set containing a mapping between the index of matrix and genes (produced from get_peak_gene_annotation).
- Parameters
matrix (
str
, optional) – Quantification matrix to use (e.g. ‘matrix_raw’ or ‘matrix_norm’)Default is “matrix_norm”.
reduce_func (func) – Function to apply to reduce values.
Default is mean.
assign (
bool
) – Whether to assign the matrix to an attribute of self named matrix_gene.Default is
True
.save (
bool
) – Whether to save to disk the coverage matrix with filename output_file.Default is
True
.output_file (
str
) – Path to save a CSV file with coverage output if save is True.Default is self.results_dir/self.name + “.raw_coverage.csv”.
- Returns
Coverage values reduced per gene.
- Return type
- Variables
matrix_gene (
pandas.DataFrame
) – Coverage values reduced per gene.
-
get_gene_level_changes
(differential_results=None, reduce_func=<function mean>)[source]¶ Redcuce changes in regulatory elements to gene-level by aggregating across regulatory elements. Requires a ‘gene_annotation’ attribute to be set containing a mapping between the index of matrix and genes (produced from get_peak_gene_annotation).
- Parameters
differential_results (
pandas.DataFrame
) – Matrix with differential results to use. Default is a ‘differential_results’ attribute of self.reduce_func (func) – Function to apply to reduce values. Default is mean
- Returns
Changes in chromatin accessibility (log2FoldChanges) reduced per gene.
- Return type
-
plot_peak_characteristics
(samples=None, by_attribute=None, genome_space=3000000000.0, output_dir='{results_dir}/peak_characteristics', output_prefix='{name}')[source]¶ Several diagnostic plots on the analysis’ consensus peak set and the sample’s signal on them.
Provides plots with samples grouped by_attribute if given (a string or a list of strings).
- Parameters
samples (
list
, optional) – List of samples to restrict analysis to.by_attribute ({str, list}, optional) – Attribute or list of sample attributes to groupby samples by when plotting. This is done in addition to the plots with individual values per sample.
genome_space (
int
) – Length of genome.Defaults to 3e9 basepairs (human genome).
output_dir (
str
) – Directory to output files. Will be formated with variables from Analysis.Defaults to “peak_characteristics” under the Analysis “results_dir”.
output_prefix (
str
) – Prefix to add to output files.Defaults to the Analysis’ name.
-
plot_raw_coverage
(samples=None, by_attribute=None)[source]¶ Diagnostic plots on the Sample’s signal. Provides plots with Samples grouped by_attribute if given (a string or a list of strings).
-
region_context_enrichment
(regions, steps=['genomic_region', 'chromatin_state'], background='region_set', prefix='region_type_enrichment', output_dir='{results_dir}')[source]¶ Characterize a subset of the regions (e.g. differential regions) in terms of their genomic context.
- Parameters
regions ({
list
,pandas.DataFrame
,pandas.Index
}) – Subset of regions of interest to analysis. Must be a subset of the universe (i.e.sites
attribute).steps (
list
, optional) – Steps of enrichment to perform. Defaults to all available: ‘genomic_region’ and ‘chromatin_state’.background (
str
, optional) – Which set to consider as backgroud. Options are:“region_set”: the consensus region_set of the analysis
“genome”: a randomized set of size as region_set across the genome
prefix (
str
, optional) – Prefix for saved files.Default is “region_type_enrichment”.
output_dir (
str
, optional) – Directory to write results to. Can be formatted with Analysis attributes.Default is “{results_dir}”.
- Returns
Enrichment results
- Return type
-
characterize_regions_function
(differential, output_dir, prefix, universe_file=None, run=True, genome=None, steps=['region', 'lola', 'meme', 'homer', 'enrichr'])[source]¶ Performs a range of functional enrichments of a set of regions given in
differential
(a dataframe which is typically a subset of an annotated coverage dataframe). Will extract regions, their underlying sequence, associated genes, perform enrichment of genomic regions, chromatin states against a background, motif enrichment, location overlap analysis (LOLA), and gene set enrichment (using the Enrichr API).- This requires several programs and R libraries:
MEME suite (AME)
HOMER suite (findMotifsGenome.pl)
LOLA (R library)
Additionally, some genome-specific databases are needed to run these programs.
- Parameters
differential (
pandas.DataFrame
) – Results of differential analysis for a given comparison of interest.output_dir (
str
) – Directory to output results to.prefix (
str
) – Prefix to use for output files.universe_file (
str
, optional) – Path to BED file with set of universe regions where differential were selected from.Default is
sites
attribute of Analysis.run (
bool
, optional) – Whether to run enrichment commands now or to simply prepare the input files for it. Default isTrue
.genome (
str
, optional) – Genome assembly of analysis. Default is genome genome assembly of analysis (genome
attribute).steps (
list
, optional) – Which steps of the analysis to perform. Default is all: [‘region’, ‘lola’, ‘meme’, ‘homer’, ‘enrichr’].
ngs_toolkit.chipseq¶
-
class
ngs_toolkit.chipseq.
ChIPSeqAnalysis
(name=None, from_pep=False, from_pickle=False, root_dir=None, data_dir='data', results_dir='results', prj=None, samples=None, **kwargs)[source]¶ Class to model analysis of ChIP-seq data. Inherits from the
ATACSeqAnalysis
class.- Parameters
name (
str
, optional) – Name of the analysis.Defaults to “analysis”.
from_pep (
str
, optional) – PEP configuration file to initialize analysis from. The analysis will adopt as much attributes from the PEP as possible but keyword arguments passed at initialization will still have priority.Defaults to
None
(no PEP used).from_pickle (
str
, optional) – Pickle file of an existing serialized analysis object from which the analysis should be loaded.Defaults to
None
(will not load from pickle).root_dir (
str
, optional) – Base directory for the project.Defaults to current directory or to what is specified in PEP if
from_pep
.data_dir (
str
, optional) – Directory containing processed data (e.g. by looper) that will be input to the analysis. This is in principle not required.Defaults to “data”.
results_dir (
str
, optional) – Directory to contain outputs produced by the analysis.Defaults to “results”.
prj (
peppy.Project
, optional) – Apeppy.Project
object that this analysis is tied to.Defaults to
None
.samples (
list
, optional) – List ofpeppy.Sample
objects that this analysis is tied to.Defaults to
None
.kwargs (
dict
, optional) – Additional keyword arguments will be passed to parent classATACSeqAnalysis
.
-
call_peaks_from_comparisons
(comparison_table=None, output_dir='{results_dir}/chipseq_peaks', permissive=True, overwrite=True, distributed=True)[source]¶ Call peaks for ChIP-seq samples using an annotation of which samples belong in each comparison and which samples represent signal or background.
- Parameters
comparison_table (
pandas.DataFrame
) – Comparison table with the following required columns: “comparison_name”, “sample_name”, “comparison_side”, “sample_group”.Defaults to analysis’ own comparison_table.
output_dir (
str
) – Parent directory where peaks will be created.Will be created if does not exist.
permissive (
bool
) – If incomplete/incoherent comparisons should be skipped or an error should be thrown.Default is
True
.overwrite (
bool
) – If incomplete/incoherent comparisons should be skipped or an error should be thrown.Default is
True
.distributed (
bool
) – Whether peak calling should be run in serial or in distributed mode as jobs.Default is
True
.
- Raises
ValueError – If not permissive and incomplete/incoherent comparisons are detected.
-
filter_peaks
(comparison_table=None, filter_bed=None, peaks_dir='{results_dir}/chipseq_peaks')[source]¶ Filter peak calls for various comparisons for entries that do not overlap another BED file.
- Parameters
comparison_table (
pandas.DataFrame
, optional) – Comparison table with the following required columns: “comparison_name”, “sample_name”, “comparison_side”, “sample_group”.Defaults to analysis’ own comparison_table.
filter_bed (
str
) – BED file with entries to filter out from the BED files of each comparison.Defaults to the set of Blacklisted regions from the analysis’ genome. In that case it will be fetched if not present.
peaks_dir (
str
) – Parent directory where peak calls for each comparison exist. Will be created if does not exist.Defaults to “{results_dir}/chipseq_peaks”.
- Raises
AttributeError – If filter_bed is not given and failes to be retrieved.
-
summarize_peaks_from_comparisons
(comparison_table=None, output_dir='{results_dir}/chipseq_peaks', filtered=True, permissive=True)[source]¶ Call peaks for ChIP-seq samples using an annotation of which samples belong in each comparison and which samples represent signal or background.
- Parameters
comparison_table (
pandas.DataFrame
, optional) – Comparison table with the following required columns: “comparison_name”, “sample_name”, “comparison_side”, “sample_group”.Defaults to analysis’ own comparison_table.
output_dir (
str
) – Parent directory where peaks will be created. Will be created if does not exist.permissive (
bool
) – If incomplete/incoherent comparisons should be skipped or an error should be thrown.
- Raises
ValueError – Will be raised if not permissive and incomplete/incoherent comparisons are detected.
-
get_consensus_sites
(samples=None, region_type='summits', extension=250, blacklist_bed=None, filter_chroms=True, permissive=False, save=True, assign=True, **kwargs)[source]¶ Get consensus (union) of enriched sites (peaks) across all comparisons. There are two modes possible, defined by the value of
region_type
:peaks: simple union of all sites;
summits: peak summits are extended by
extension
and a union is made.
For ChIP-seq, the
comparison_table
keyword argument or acomparison_table
attribute set is required. Peaks/summits will be aggregated for the peaks called in each sample comparison.- Parameters
samples (
list
) – Iterable ofpeppy.Sample
objects to restrict to. Must have apeaks
attribute set.Defaults to all samples in the analysis (
samples
attribute).region_type (
str
) – The type of region to use to create the consensus region set - one of “summits” or “peaks”. If “summits”, peak summits will be extended byextension
before union. If “peaks”, sample peaks will be used with no modification prior to union.Default is “summits”.
extension (
int
) – Amount to extend peaks summits by in both directions.Default is 250.
blacklist_bed ({
False
,str
}) – EitherFalse
or a path to a BED file with genomic positions to exclude from consensus peak set.Default is to use a blacklist file for the analysis
genome
.filter_chroms ({
list
,str
}) – A list of chromosomes to filter out or a string with a pattern to match to exclude chromosomes. Uses Pandas string methodspandas.Series.str.match
. Pass for example ‘.*_.*|chrM’ to filter out chromosomes with a “_” character and a “chrM” chromosome.Default is not to filter anything.
permissive (
bool
) – Whether Samples that whichregion_type
attribute file does not exist should be simply skipped or an error thrown.comparison_table (
pandas.DataFrame
, optional) – DataFrame with signal/background combinations used to call peaks. Part of kwargs.Defaults to analysis own
comparison_table
.peak_dir (
str
, optional) – Path to peaks output directory. Part of kwargs.Defaults to “{analysis.results_dir}/chipseq_peaks”.
- Variables
sites (
pybedtools.BedTool
) – Bedtool with consensus sites.
-
calculate_peak_support
(samples=None, region_type='summits', permissive=False, comparison_table=None, peak_dir='{results_dir}/chipseq_peaks')[source]¶ Calculate a measure of support for each region in peak set (i.e. ratio of samples containing a peak overlapping region in union set of peaks).
- Parameters
comparison_table (
pandas.DataFrame
, optional) – DataFrame with signal/background combinations used to call peaksDefaults to analysis’ own comparison_table.
peak_dir (
str
, optional) – Path to peaks output directory. Defaults to {analysis.results_dir}/chipseq_peakssamples (
list
) – Not used. Provided for compatibility with ATACSeqAnalysis class.region_type (
str
) – Not used. Provided for compatibility with ATACSeqAnalysis class.permissive (
bool
) – Not used. Provided for compatibility with ATACSeqAnalysis class.
- Variables
support (
pandas.DataFrame
) – DataFrame with signal/background combinations used to call peaks
-
get_supported_peaks
(samples=None, **kwargs)[source]¶ Get mask of sites with 0 support in the given samples. Requires support matrix produced by ngs_toolkit.atacseq.ATACSeqAnalysis.calculate_peak_support.
- Parameters
- Returns
Boolean Pandas Series with sites with at least one of the given samples having a peak called.
- Return type
pd.Series
-
normalize_by_background
(comparison_table=None, reduction_func=<function mean>, comparison_func=<ufunc 'subtract'>, by_group=False, matrix='matrix_norm', samples=None)[source]¶ Normalize values in matrix by background samples in a comparison-specific way as specified in comparison_table.
The background samples will be pooled by the reduction_func and their values wil be removed from the signal samples using the comparison_func.
- Parameters
comparison_table (
pandas.DataFrame
) – Table with comparisons from which peaks were called.Defaults to analysis’ comparison_table.
reduction_func (func) – Function to reduce the region to gene values to.
Defaults to
numpy.mean
.comparison_func (func) – Function to use for normalization of signal samples against background samples. You can also try for example
numpy.divide
.Defaults to
numpy.subtract
.by_group (
bool
) – Whether output should be by group (True
) or for each sample (False
).Default is
False
.matrix ({
pandas.DataFrame
,str
, optional}) – Name of attribute or pandas DataFrame to use.Defaults to “matrix_norm”.
samples (
list
, optional) – Subset of samples to consider.Defaults to all samples in analysis.
- Returns
Dataframe with values normalized by background samples.
- Return type
ngs_toolkit.cnv¶
-
class
ngs_toolkit.cnv.
CNVAnalysis
(name=None, from_pep=False, from_pickle=False, root_dir=None, data_dir='data', results_dir='results', prj=None, samples=None, **kwargs)[source]¶ Class to model analysis of CNV data. Inherits from the
Analysis
class.- Parameters
name (
str
, optional) – Name of the analysis.Defaults to “analysis”.
from_pep (
str
, optional) – PEP configuration file to initialize analysis from. The analysis will adopt as much attributes from the PEP as possible but keyword arguments passed at initialization will still have priority.Defaults to
None
(no PEP used).from_pickle (
str
, optional) – Pickle file of an existing serialized analysis object from which the analysis should be loaded.Defaults to
None
(will not load from pickle).root_dir (
str
, optional) – Base directory for the project.Defaults to current directory or to what is specified in PEP if
from_pep
.data_dir (
str
, optional) – Directory containing processed data (e.g. by looper) that will be input to the analysis. This is in principle not required.Defaults to “data”.
results_dir (
str
, optional) – Directory to contain outputs produced by the analysis.Defaults to “results”.
prj (
peppy.Project
, optional) – Apeppy.Project
object that this analysis is tied to.Defaults to
None
.samples (
list
, optional) – List ofpeppy.Sample
objects that this analysis is tied to.Defaults to
None
.kwargs (
dict
, optional) – Additional keyword arguments will be passed to parent classAnalysis
.
Examples
>>> from ngs_toolkit.cnv import CNVAnalysis
This is an example of a CNV analysis:
>>> pep = "metadata/project_config.yaml" >>> a = CNVAnalysis(from_pep=pep) >>> # Get consensus peak set from all samples >>> a.get_cnv_data() >>> # Normalize >>> a.normalize(method="median") >>> # Segmentation >>> a.segment_genome() >>> # General plots >>> a.plot_all_data() >>> a.plot_segmentation_stats() >>> # Unsupervised analysis >>> a.unsupervised_analysis() >>> # Save object >>> a.to_pickle()
-
load_data
(output_map=None, only_these_keys=None, resolutions=None, prefix='{results_dir}/{name}', permissive=True)[source]¶ Load the output files of the major functions of the Analysis.
- Parameters
output_map (
dict
) – Dictionary with {attribute_name: (file_path, kwargs)} to load the files. The kwargs in the tuple will be passed topandas.read_csv()
.Defaults to the required to read the keys in
only_these_keys
.only_these_keys (
list
, optional) – Iterable of analysis attributes to load up. Possible attributes:“matrix_raw”
“matrix_norm”
“matrix_features”
“differential_results”
Defaults to all of the above.
resolutions (
list
) – List of resolution strings to get data for.Defaults to value of
resolutions
attribute of Analysis.prefix (
str
, optional) – String prefix of files to load. Variables in curly braces will be formated with attributes of analysis.Defaults to “{results_dir}/{name}”.
permissive (
bool
, optional) – Whether an error should be ignored if reading a file causes IOError.Default is
True
.
- Variables
pandas.DataFrame – Dataframes holding the respective data, available as attributes described in the only_these_keys parameter.
- Raises
IOError – If not permissive and a file is not found
-
get_cnv_data
(resolutions=None, samples=None, save=True, assign=True, permissive=False)[source]¶ Load CNV data from ATAC-seq CNV pipeline and create CNV matrix at various resolutions.
- Parameters
resolutions (
list
, optional) – Resolutions of analysis. Defaults to resolutions in Analysis object.samples (
list
, optional) – Samples to restrict analysis to. Defaults to samples in Analysis object.save (
bool
, optional) – Whether results should be saved to disc. Defaults to Trueassign (
bool
, optional) – Whether results should be assigned to an attribute in the Analsyis object. Defaults to Truepermissive (
bool
, optional) – Whether missing files should be allowed. Defaults to False
- Returns
Dictionary with CNV matrices for each resolution.
- Return type
- Raises
IOError – If not permissive and input files can’t be read.
- Variables
matrix (
dict
) – Sets a matrix dictionary with CNV matrices for each resolution.
-
normalize
(method='median', matrix='matrix_raw', samples=None, save=True, assign=True, **kwargs)[source]¶ Normalization of dictionary of matrices with (n_features, n_samples).
- Parameters
resolutions (
list
, optional) – Resolutions of analysis. Defaults to resolutions in Analysis object.method (
str
) – Normalization method to apply.Defaults to “median”.
matrix (
str
, optional) – Attribute name of dictionary of matrices to normalize. Defaults to matrix_raw.samples (
list
) – Iterable of peppy.Sample objects to restrict matrix to. Default is all in analysis.save (
bool
, optional) – Whether results should be saved to disc. Defaults to Trueassign (
bool
, optional) – Whether results should be assigned to an attribute in the Analsyis object. Defaults to Truekwargs (
dict
, optional) – Additional kwargs are passed to the respective normalization method.
- Returns
Dictionary with normalized CNV matrices for each resolution.
- Return type
- Variables
matrix_norm (
dict
) – Sets a matrix_norm dictionary with CNV matrices for each resolution.
-
plot_all_data
(matrix='matrix_norm', resolutions=None, samples=None, output_dir=None, output_prefix='{analysis_name}.all_data', robust=True, vmin=None, vmax=None, rasterized=True, dpi=300, sample_labels=True)[source]¶ Visualize CNV data genome-wide using heatmaps. Will be done independently for each specified resolution.
- Parameters
matrix (
str
, optional) – Attribute name of dictionary of matrices to normalize. Defaults to matrix_norm.resolutions (
list
, optional) – Resolutions of analysis. Defaults to resolutions in Analysis object.samples (
list
, optional) – Samples to restrict analysis to. Defaults to samples in Analysis object.output_dir (
str
, optional) – Output directory. Defaults to Analysis results directory.output_prefix (
str
, optional) – Prefix to add to plots. Defaults to “{analysis_name}.all_data”robust (
bool
, optional) – Whether to scale the color scale robustly (to quantiles rather than extremes). Defaults to Truevmin (float, optional) – Minimum value of color scale.
vmax (float, optional) – Maximum value of color scale. Defaults to None
rasterized (
bool
, optional) – Whether to rasterize main heatmap. Defaults to Truedpi (
int
, optional) – DPI resolution of rasterized image. Defaults to 300sample_labels (
bool
, optional) – Whether to label samples with their name. Defaults to True
-
plot_stats_per_chromosome
(matrix='matrix_norm', resolutions=None, samples=None, output_dir='{results_dir}', output_prefix='{analysis_name}.all_data', robust=True, rasterized=True, dpi=300, sample_labels=True)[source]¶ Visualize mean and variation of CNV data for each chromosome using heatmaps. Will be done independently for each specified resolution. Will also be done once for all chromosomes and another time without sex chromosomes.
- Parameters
matrix (
str
, optional) – Attribute name of dictionary of matrices to normalize. Defaults to matrix_norm.resolutions (
list
, optional) – Resolutions of analysis. Defaults to resolutions in Analysis object.samples (
list
, optional) – Samples to restrict analysis to. Defaults to samples in Analysis object.output_dir (
str
, optional) – Output directory. Defaults to Analysis results directory.output_prefix (
str
, optional) – Prefix to add to plots. Defaults to “{analysis_name}.all_data”robust (
bool
, optional) – Whether to scale the color scale robustly (to quantiles rather than extremes). Defaults to Truerasterized (
bool
, optional) – Whether to rasterize main heatmap. Defaults to Truedpi (
int
, optional) – DPI resolution of rasterized image. Defaults to 300sample_labels (
bool
, optional) – Whether to label samples with their name. Defaults to True
-
segment_genome
(matrix='matrix_norm', resolutions=None, samples=None, save=True, assign=True)[source]¶ Segment CNV data to create calls of significant deviations. Will be done independently for each specified resolution.
- Requires the R package “DNAcopy” to be installed:
>>> source('http://bioconductor.org/biocLite.R') >>> biocLite('DNAcopy')
- Parameters
matrix (
str
, optional) – Attribute name of dictionary of matrices to segment. Defaults to matrix_norm.resolutions (
list
, optional) – Resolutions of analysis. Defaults to resolutions in Analysis object.samples (
list
, optional) – Samples to restrict analysis to. Defaults to samples in Analysis object.save (
bool
, optional) – Whether results should be saved to disc. Defaults to Trueassign (
bool
, optional) – Whether results should be assigned to an attribute in the Analsyis object. Defaults to True
- Returns
Dictionary with segmentation for each resolution.
- Return type
- Variables
segmentation (
dict
) – Dictionary with CNV matrices for each resolution.
-
annotate_with_chrom_bands
(segmentation=None, resolutions=None, save=True, assign=True)[source]¶ Annotate segmentation with chromosome bands and overlaping genes. Will be done independently for each specified resolution.
- Parameters
segmentation (
str
, optional) – Attribute name of dictionary of segmentation results. Defaults to segmentation.resolutions (
list
, optional) – Resolutions of analysis. Defaults to resolutions in Analysis object.samples (
list
, optional) – Samples to restrict analysis to. Defaults to samples in Analysis object.save (
bool
, optional) – Whether results should be saved to disc. Defaults to Trueassign (
bool
, optional) – Whether results should be assigned to an attribute in the Analsyis object. Defaults to True
- Returns
Dictionary with annotated segmentation for each resolution.
- Return type
- Variables
segmentation_annot (
dict
) – Dictionary with CNV matrices for each resolution.
-
plot_segmentation_stats
(segmentation=None, resolutions=None, per_sample=False, output_dir='{results_dir}/segmentation', output_prefix='{resolution}.segmentation_metrics')[source]¶ Visualize distribution of statistics of CNV data segmentation. Will be done independently for each specified resolution.
- Parameters
segmentation (
str
, optional) – Dictionary of segmentation results. Defaults to segmentation.resolutions (
list
, optional) – Resolutions of analysis. Defaults to resolutions in Analysis object.per_sample (
bool
, optional) – Whether plots should be made for each sample too. Defaults to Falseoutput_dir (
str
, optional) – Output directory.output_prefix (
str
, optional) – Prefix to add to plots. Defaults to “{resolution}.segmentation_metrics”
-
ngs_toolkit.cnv.
all_to_igv
(matrix, output_prefix, **kwargs)[source]¶ Convert dictionary of DataFrame with CNV data in several resolutions to IGV format.
- Parameters
matrix (
pandas.DataFrame
) – DataFrame with CNV data to convert.output_prefix (str) – Prefix to add to plots.
**kwargs (
dict
, optional) – Additional parameters will be passed to ngs_toolkit.cnv.to_igv
- Returns
Dictionary of CNV data in IGV format for each resolution.
- Return type
-
ngs_toolkit.cnv.
to_igv
(matrix, output_file=None, save=True, view_limits=(-2, 2))[source]¶ Convert DataFrame with CNV data to IGV format.
- Parameters
matrix (
pandas.DataFrame
) – DataFrame with CNV data to convert.output_file (str, optional) – Output file.
Required if save is True.
save (
bool
, optional) – Whether results should be saved to disc.Defaults to
True
.view_limits (tuple, optional) – Extreme values (min, max) of color scale used to visualize in IGV.
Defaults to (-2, 2).
- Returns
CNV data in IGV format.
- Return type
- Raises
ValueError: – If save is True but output_file is None.
ngs_toolkit.rnaseq¶
-
class
ngs_toolkit.rnaseq.
RNASeqAnalysis
(name=None, from_pep=False, from_pickle=False, root_dir=None, data_dir='data', results_dir='results', prj=None, samples=None, **kwargs)[source]¶ Class to model analysis of RNA-seq data. Inherits from the
Analysis
class.- Parameters
name (
str
, optional) – Name of the analysis.Defaults to “analysis”.
from_pep (
str
, optional) – PEP configuration file to initialize analysis from. The analysis will adopt as much attributes from the PEP as possible but keyword arguments passed at initialization will still have priority.Defaults to
None
(no PEP used).from_pickle (
str
, optional) – Pickle file of an existing serialized analysis object from which the analysis should be loaded.Defaults to
None
(will not load from pickle).root_dir (
str
, optional) – Base directory for the project.Defaults to current directory or to what is specified in PEP if
from_pep
.data_dir (
str
, optional) – Directory containing processed data (e.g. by looper) that will be input to the analysis. This is in principle not required.Defaults to “data”.
results_dir (
str
, optional) – Directory to contain outputs produced by the analysis.Defaults to “results”.
prj (
peppy.Project
, optional) – Apeppy.Project
object that this analysis is tied to.Defaults to
None
.samples (
list
, optional) – List ofpeppy.Sample
objects that this analysis is tied to.Defaults to
None
.kwargs (
dict
, optional) – Additional keyword arguments will be passed to parent classAnalysis
.
-
collect_bitseq_output
(samples=None, permissive=True, expression_type='counts')[source]¶ Collect gene expression (read counts, transcript-level) output from Bitseq into expression matrix for samples.
-
collect_esat_output
(samples=None, permissive=True)[source]¶ Collect gene expression (read counts, gene-level) output from ESAT into expression matrix for samples.
-
get_gene_expression
(expression_type='counts', expression_level='gene', reduction_func=<built-in function max>, quantification_prog='bitseq', samples=None, save=True, assign=True, output_file=None, permissive=False, species=None, ensembl_version=None)[source]¶ Collect gene expression (read counts per transcript or gene) for all samples.
If expression_level is “gene”, then, transcripts will be reduced per gene ID using reduction_func (defaults to max) and features will be named with gene symbols.
- Parameters
expression_type (
str
, optional) – Type of expression quantification to get. One of “counts” or “rpkm”.Defaults to “counts”.
expression_level (
str
, optional) – Type of expression quantification to get. One of “transcript” or “gene”.Defaults to “gene”.
reduction_func (func, optional) – Function to reduce gene expression between transcript and gene if expression_level is “gene”.
Defaults to max.
quantification_prog (
str
, optional) – Name of program used to produce the quantification of gene expression. One of “bitseq”, “htseq” or “esat”.Defaults to “bitseq”.
samples (list[peppy.Sample], optional) – Subset of samples to get expression for.
Defaults to all in analysis.
save (
bool
, optional) – Whether to save output as CSV.Default is
None
.assign (
bool
, optional) – Whether to assign output to matrix_raw.Default is
None
.output_file (
str
, optional) – Path of resulting file if save is True.Defaults to “{results_dir}/{name}.matrix_raw.csv”.
permissive (
bool
, optional) – Whether to skip samples with non-existing gene expression quantification.Default is False.
species (
str
, optional) – Ensembl species name (e.g. “hsapiens”, “mmusculus”)Defaults to analysis’ organism.
ensembl_version (
str
, optional) – Ensembl version of annotation to use (e.g. “grch38”, “grcm38”)Defaults to analysis’ genome.
- Variables
matrix_raw (
pandas.DataFrame
) – DataFrame with gene expression.
-
plot_expression_characteristics
(matrix_raw=None, matrix_norm=None, samples=None, output_dir='{results_dir}/quality_control', output_prefix='quality_control')[source]¶ Plot general characteristics of the gene expression distributions within and across samples.
- matrix_raw{str, pandas.DataFrame}, optional
Name of analysis attribute with raw expression values or pandas dataframe.
Defaults to analysis’ matrix_raw.
- matrix_norm{str, pandas.DataFrame}, optional
Name of analysis attribute with normalized expression values or pandas dataframe.
Defaults to analysis’ matrix_norm.
- samples
list
, optional List of samples to include.
Defaults to all samples in analysis
- output_dir
str
, optional Directory for output files.
Defaults to “{results_dir}/quality_control”
- output_prefix
str
, optional Prefix for output files.
Defaults to “quality_control”
-
ngs_toolkit.rnaseq.
knockout_plot
(analysis=None, knockout_genes=None, matrix='matrix_norm', samples=None, comparison_results=None, output_dir=None, output_prefix='knockout_expression', square=True, rasterized=True)[source]¶ Plot expression of knocked-out genes in all samples.
- Parameters
analysis (
RNASeqAnalysis
, optional) – Analysis object.Not required if matrix is given.
knockout_genes (
list
, optional) – List of perturbed genes to plot.Defaults to the set of knockout attributes in the analysis’ samples if analysis is given. Otherwise must be given.
matrix (str, optional) – Matrix with expression values to use.
Defaults to “matrix_norm”
samples ([type], optional) – [description]
Defaults to
None
.comparison_results ([type], optional) – [description]
Defaults to
None
.output_dir ([type], optional) – [description]
Defaults to
None
.output_prefix (str, optional) – Prefix for output files.
Defaults to “knockout_expression”
square (bool, optional) – Whether heatmap cells should have inforced aspect.
Defaults to
True
.rasterized (bool, optional) – Whether heatmap cells should be rasterized.
Defaults to
True
.
ngs_toolkit.demo¶
A module dedicated to the generation of Analysis, Projects and their data.
-
ngs_toolkit.demo.data_generator.
generate_count_matrix
(n_factors=1, n_replicates=4, n_features=1000, intercept_mean=4, intercept_std=2, coefficient_stds=0.4, size_factors=None, size_factors_std=0.1, dispersion_function=None)[source]¶ Generate count matrix for groups of samples by sampling from a negative binomial distribution.
-
ngs_toolkit.demo.data_generator.
generate_data
(n_factors=1, n_replicates=4, n_features=1000, coefficient_stds=0.4, data_type='ATAC-seq', genome_assembly='hg38', **kwargs)[source]¶ Creates real-looking data
- Parameters
n_factors (
int
, optional) – Number of factors influencing variance between groups. For each factor there will be two groups of samples.Defaults to 1.
n_replicates (
int
, optional) – Number of replicates per group.Defaults to 4.
n_features (
int
, optional) – Number of features (i.e. genes, regions) in matrix.Defaults to 1000.
coefficient_stds ({
int
,list
}, optional) – Standard deviation of the coefficients between groups. If a list, must match the number ofn_factors
.Defaults to 1.
data_type (
bool
, optional) – Data type of the project. Must be one of thengs_toolkit
classes.Default is “ATAC-seq”
genome_assembly (
bool
, optional) – Genome assembly of the project.Default is “hg38”
**kwargs (
dict
) – Additional keyword arguments will be passed tongs_toolkit.demo.data_generator.generate_count_matrix()
.
- Returns
A tuple of
pandas.DataFrame
objects with numeric and categorical data respectively.- Return type
-
ngs_toolkit.demo.data_generator.
generate_project
(output_dir=None, project_name='test_project', organism='human', genome_assembly='hg38', data_type='ATAC-seq', n_factors=1, only_metadata=False, sample_input_files=False, initialize=True, **kwargs)[source]¶ Creates a real-looking PEP-based project with respective input files and quantification matrix.
- Parameters
output_dir (
str
, optional) – Directory to write files to.Defaults to a temporary location in the user’s
${TMPDIR}
.project_name (
bool
, optional) – Name for the project.Default is “test_project”.
organism (
bool
, optional) – Organism of the project.Default is “human”
genome_assembly (
bool
, optional) – Genome assembly of the project.Default is “hg38”
data_type (
bool
, optional) – Data type of the project. Must be one of thengs_toolkit
classes.Default is “ATAC-seq”
only_metadata (obj:bool, optional) – Whether to only generate metadata for the project or input files in addition.
Default is
False
.sample_input_files (obj:bool, optional) – Whether the input files for the respective data type should be produced.
This would be BAM and peak files for ATAC-seq or BAM files for RNA-seq.
Default is
True
.initialize (obj:bool, optional) – Whether the project should be initialized into an Analysis object for the respective
data_type
or simply return the path to a PEP configuration file.Default is
True
.**kwargs (
dict
) – Additional keyword arguments will be passed tongs_toolkit.demo.data_generator.generate_data()
.
- Returns
The Analysis object for the project or a path to its PEP configuration file.
- Return type
-
ngs_toolkit.demo.data_generator.
generate_projects
(output_path=None, project_prefix_name='demo-project', data_types=['ATAC-seq', 'RNA-seq'], organisms=['human', 'mouse'], genome_assemblies=['hg38', 'mm10'], n_factors=[1, 2, 5], n_features=[100, 1000, 10000], n_replicates=[1, 3, 5], **kwargs)[source]¶ Create a list of Projects given ranges of parameters, which will be passed to
ngs_toolkit.demo.data_generator.generate_project()
.
-
ngs_toolkit.demo.data_generator.
generate_bam_file
(count_vector, output_bam, genome_assembly='hg38', chrom_sizes_file=None, index=True)[source]¶ Generate BAM file containing reads matching the counts in a vector of features
-
ngs_toolkit.demo.data_generator.
generate_peak_file
(peak_set, output_peak, genome_assembly='hg38', summits=False)[source]¶ Generate peak files containing regions from a fraction of a given set of features
-
ngs_toolkit.demo.data_generator.
generate_sample_input_files
(analysis, matrix)[source]¶ Generate input files (BAM, peaks) for a sample depending on its data type.
-
ngs_toolkit.demo.data_generator.
initialize_analysis_of_data_type
(data_type, pep_config, *args, **kwargs)[source]¶ Initialize an Analysis object from a PEP config with the appropriate
data_type
.
-
ngs_toolkit.demo.data_generator.
get_random_genomic_locations
(n_regions, width_mean=500, width_std=400, min_width=300, genome_assembly='hg38')[source]¶ Get n_regions` number of random genomic locations respecting the boundaries of the
genome_assembly
ngs_toolkit.general¶
-
ngs_toolkit.general.
get_genome_reference
(organism, genome_assembly=None, output_dir=None, genome_provider='UCSC', file_format='2bit', dry_run=False, overwrite=True)[source]¶ Get genome FASTA/2bit file. Saves results to disk and returns path to file.
- Parameters
organism (
str
) – Organism to get annotation for. Currently supported: “human” and “mouse”.output_dir (
str
, optional) – Directory to write output to. Defaults to current directorygenome_provider (
str
, optional) – Which genome provider to use. One of ‘UCSC’ or ‘Ensembl’.file_format (
str
, optional) – File format to get. One of ‘fasta’ or ‘2bit’.dry_run (
bool
, optional) – Whether to not download and just return path to file.overwrite (
bool
, optional) – Whether existing files should be overwritten by new ones. Otherwise they will be kept and no action is made. Defaults toTrue
.
- Returns
Path to genome FASTA/2bit file, but if dry_run tuple of URL of reference genome and path to file.
- Return type
{str, tuple}
- Raises
ValueError – If arguments are not in possible options or if desired combination is not available.
-
ngs_toolkit.general.
get_blacklist_annotations
(organism, genome_assembly=None, output_dir=None, overwrite=True)[source]¶ Get annotations of blacklisted genomic regions for a given organism/genome assembly. Saves results to disk and returns a path to a BED file.
- Parameters
organism (
str
) – Organism to get annotation for. Currently supported: “human” and “mouse”.genome_assembly (
str
, optional) – Ensembl assembly/version to use. Default for “human” is “hg38/grch38” and for “mouse” is “mm10/grcm38”.output_dir (
str
, optional) – Directory to write output to. Defaults to “reference” in current directory.overwrite (
bool
, optional) – Whether existing files should be overwritten by new ones. Otherwise they will be kept and no action is made. Defaults toTrue
.
- Returns
Path to blacklist BED file
- Return type
-
ngs_toolkit.general.
get_tss_annotations
(organism, genome_assembly=None, save=True, output_dir=None, chr_prefix=True, gene_types=['protein_coding', 'processed_transcript', 'lincRNA', 'antisense'], overwrite=True)[source]¶ Get annotations of TSS for a given organism/genome assembly. This is a simple approach using Biomart’s API querying the Ensembl database. Saves results to disk and returns a dataframe.
- Parameters
organism (
str
) – Organism to get annotation for. Currently supported: “human” and “mouse”.genome_assembly (
str
, optional) – Ensembl assembly/version to use. Default for “human” is “grch38” and for “mouse” is “grcm38”.save (
bool
, optional) – Whether to save to disk underoutput_dir
. Defaults toTrue
.output_dir (
str
, optional) – Directory to write output to. Defaults to “reference” in current directory.chr_prefix (
bool
, optional) – Whether chromosome names should have the “chr” prefix. Defaults to Truegene_types (
list
, optional) – Subset of transcript biotypes to keep. See here the available biotypes https://www.ensembl.org/Help/Faq?id=468 Defaults to ‘protein_coding’, ‘processed_transcript’, ‘lincRNA’, ‘antisense’.overwrite (
bool
, optional) – Whether existing files should be overwritten by new ones. Otherwise they will be kept and no action is made. Defaults toTrue
.
- Returns
DataFrame with genome annotations
- Return type
-
ngs_toolkit.general.
get_genomic_context
(organism, genome_assembly=None, save=True, output_dir=None, chr_prefix=True, region_subset=['promoter', 'exon', '5utr', '3utr', 'intron', 'genebody', 'intergenic'], gene_types=['protein_coding', 'processed_transcript', 'lincRNA', 'antisense'], promoter_width=3000, overwrite=True)[source]¶ Get annotations of TSS for a given organism/genome assembly. This is a simple approach using Biomart’s API querying the Ensembl database. Saves results to disk and returns a dataframe.
The API call to BioMart can take a bit, so the function should take ~4 min for a human genome.
- Parameters
organism (
str
) – Organism to get annotation for. Currently supported: “human” and “mouse”.genome_assembly (
str
, optional) – Ensembl assembly/version to use. Default for “human” is “grch38” and for “mouse” is “grcm38”.save (
bool
, optional) – Whether to save to disk underoutput_dir
. Defaults toTrue
.output_dir (
str
, optional) – Directory to write output to. Defaults to “reference” in current directory.chr_prefix (
bool
, optional) – Whether chromosome names should have the “chr” prefix. Defaults to Truegene_types (
list
, optional) – Subset of transcript biotypes to keep. See here the available biotypes https://www.ensembl.org/Help/Faq?id=468 Defaults to ‘protein_coding’, ‘processed_transcript’, ‘lincRNA’, ‘antisense’.overwrite (
bool
, optional) – Whether existing files should be overwritten by new ones. Otherwise they will be kept and no action is made. Defaults toTrue
.
- Returns
DataFrame with genome annotations
- Return type
-
ngs_toolkit.general.
get_chromosome_sizes
(organism, genome_assembly=None, output_dir=None, overwrite=True)[source]¶ Get a file with the sizes of chromosomes in a given organism/genome assembly. Saves results to disk and returns a path to a text file.
- Parameters
organism (
str
) – Organism to get chromosome sizes for. Currently supported: “human” and “mouse”.genome_assembly (
str
, optional) – Ensembl assembly/version to use. Default for “human” is “hg38/grch38” and for “mouse” is “mm10/grcm38”.output_dir (
str
, optional) – Directory to write output to.Defaults to “reference” in current directory.
overwrite (
bool
, optional) – Whether existing files should be overwritten by new ones. Otherwise they will be kept and no action is made.Defaults to
True
.
- Returns
Path to text file with chromosome sizes.
- Return type
-
ngs_toolkit.general.
deseq_analysis
(count_matrix, experiment_matrix, comparison_table, formula, output_dir, output_prefix, overwrite=True, alpha=0.05, independent_filtering=False, create_subdirectories=True, save_inputs=True, **kwargs)[source]¶ Perform differential comparison analysis with DESeq2.
Note
Do not include hyphens (“-“) in any of the samples or groups names! R freaks out with this.
# TODO: fix hyphens in names issue
- Parameters
count_matrix (
pandas.DataFrame
) – Data frame of shape (samples, variables) with raw read counts.experiment_matrix (
pandas.DataFrame
) – Data frame with columns “sample_name” and any other variables used in the formula.comparison_table (
pandas.DataFrame
) – Data frame with columns “comparison_name”, “sample_group” and sample_name”.formula (
str
) – Formula to test in R/patsy notation. Usually something like “~ batch + group”.output_dir (
str
) – Output directory for produced files.output_prefix (
str
) – Prefix to add to produced files.overwrite (
bool
, optional) – Whether files existing should be overwritten. Defaults toTrue
.alpha (number, optional) – Significance level to reject null hypothesis. This in practice has no effect as results for all features will be returned. Defaults to 0.05.
create_subdirectories (
bool
) – Whether to create subdirectories for the result of each comparison.**kwargs (
dict
) – Additional keyword arguments to be passed to the DESeq function of DESeq2.
- Returns
Data frame with results, statistics for each feature.
- Return type
-
ngs_toolkit.general.
least_squares_fit
(matrix, design_matrix, test_model, null_model='~ 1', standardize_data=True, multiple_correction_method='fdr_bh')[source]¶ Fit a least squares model with only categorical predictors. Computes p-values by comparing the log likelihood ratio of the chosen model to a null_model.
- Parameters
matrix (
pandas.DataFrame
) – A Data frame of shape (samples, variables).design_matrix (
pandas.DataFrame
) – A Data frame of shape (samples, variables) with all the variables in test_model.test_model (
str
) – Model design to test in R/patsy notation.null_model (
str
, optional) – Null model design in R/patsy notation. Defaults to “~ 1”.standardize_data (
bool
, optional) – Whether data should be standardized prior to fitting. Defaults toTrue
.multiple_correction_method (
str
, optional) – Method to use for multiple test correction. See statsmodels.sandbox.stats.multicomp.multipletests. Defaults to “fdr_bh”.
- Returns
pandas.DataFrame
– Statistics of model fitting and comparison between models for each feature.Example:
matrix = np.random.random(10000000).reshape(100, 100000)
P = np.concatenate([[0] * 50, [1] * 50]) # dependent variable
Q = np.concatenate([[0] * 25, [1] * 25] + [[0] * 25, [1] * 25]) # covariate
design_matrix = pd.DataFrame([P, Q], index=[“P”, “Q”]).T
matrix = matrix.T * ((1 + design_matrix.sum(axis=1)) * 4).values
matrix = pd.DataFrame(matrix.T)
test_model = “~ Q + P”
null_model = “~ Q”
res = least_squares_fit(matrix, design_matrix, test_model, null_model)
res.head()
-
ngs_toolkit.general.
differential_from_bivariate_fit
(comparison_table, matrix, output_dir, output_prefix, n_bins=250, multiple_correction_method='fdr_bh', plot=True, palette='colorblind', make_values_positive=False)[source]¶ Perform differential analysis using a bivariate gaussian fit on the relationship between mean and fold-change for each comparison.
- Parameters
comparison_table (
pandas.DataFrame
) – Dataframe with ‘comparison_name’, ‘comparison_side’ and ‘sample_name’, ‘sample_group’ columns.matrix (
pandas.DataFrame
) – Matrix of n_features, n_samples with normalized, log-transformed values to perform analysis on.output_dir (
str
) – Output directoryoutput_prefix (
str
) – Prefix for outputs.n_bins (
int
) – Number of bins of mean values along which to standardize fold-changes.multiple_correction_method (
str
) – Multiple correction method from statsmodels.sandbox.stats.multicomp.multipletests.plot (
bool
) – Whether to generate plots.palette (
str
) – Color palette to use. This can be any matplotlib palette and is passed to sns.color_palette.make_values_positive (
bool
) – Whether to transform matrix to have minimum value 0. Default False.
- Returns
Results of fitting and comparison between groups for each feature.
- Return type
-
ngs_toolkit.general.
lola
(bed_files, universe_file, output_folder, genome, output_prefixes=None, cpus=8)[source]¶ Perform location overlap analysis (LOLA).
If bed_files is a list with more than one element, use
output_prefixes
to pass a list of prefixes to label the output files for each input BED file.Files will be created in
output_folder
mimicking the output that the R function LOLA::writeCombinedEnrichment writes.Requires the R package “LOLA” to be installed:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("LOLA")
- Parameters
bed_files ({str, list}) – A string path to a BED file or a list of paths.
universe_file (
str
) – A path to a BED file representing the universe from where the BED file(s) come from.output_folder (
str
) – Output folder for resulting files.genome (
str
, optional) – Genome assembly from which the BED files come from. This is used to get the LOLA databases from thengs_toolkit._CONFIG
parameters.output_prefixes (
list
, optional) – A list of strings with prefixes to be used in casebed_files
is a list.cpus (
int
, optional) – Number of CPUs/threads to use.Defaults to 8.
-
ngs_toolkit.general.
homer_combine_motifs
(comparison_dirs, output_dir, region_prefix='differential_analysis', reduce_threshold=0.6, match_threshold=10, info_value=0.6, p_value_threshold=1e-25, fold_enrichment=None, cpus=8, run=True, distributed=True, genome='hg38', motif_database=None)[source]¶ Create consensus of de novo discovered motifs from HOMER
- Parameters
comparison_dirs (
list
) – Iterable of comparison directories where homer was run. Should contain a “homerMotifs.all.motifs” file.output_dir (
str
) – Output directory.p_value_threshold (number, optional) – Threshold for inclusion of a motif in the consensus set. Defaults to 1e-5
cpus (number, optional) – Number of available CPUS/threads for multithread processing. Defaults to 8
run (
bool
, optional) – Whether to run enrichment of each comparison in the consensus motifs. Default is Truedistributed (
bool
, optional) – Whether to run enrichment as a cluster job. Default is Truegenome (
str
) – Genome assembly of the data. Default is ‘hg38’.motif_database (
str
) – Motif database to restrict motif matching too.
- Returns
If run is False, returns path to consensus motif file. Otherwise None.
- Return type
{str,None}
-
ngs_toolkit.general.
enrichr
(dataframe, gene_set_libraries=None, kind='genes', max_attempts=5)[source]¶ Use Enrichr on a list of genes (currently only genes supported through the API).
- Parameters
dataframe (
str
) – DataFrame with column “gene_name”.gene_set_libraries (
list
, optional) – Gene set libraries to use. Defaults to values in initial configuration file. To see them, do:ngs_toolkit._CONFIG['resources']['enrichr']['gene_set_libraries']
kind (
str
, optional) – Type of input. Right now, only “genes” is supported. Defaults to “genes”max_attempts (
int
, optional) – Number of times to try a call to Enrichr API. Defaults to 5
- Returns
Results of enrichment analysis
- Return type
- Raises
Exception – If max_attempts is exceeded
-
ngs_toolkit.general.
run_enrichment_jobs
(results_dir, genome, background_bed, steps=['lola', 'meme', 'homer', 'enrichr'], overwrite=True, pep_config=None)[source]¶ Submit parallel enrichment jobs for a specific analysis.
- Parameters
param results_dir: – Directory with files prepared by ngs_toolkit.general.run_enrichment_jobs
param genome: – Genome assembly of the analysis.
background_bed (
str
) – BED file to use as background for LOLA analysis. Typically the analysis’ own consensus region set.steps (
list
, optional) – Steps of the analysis to perform. Defaults to [“region”, lola”, “meme”, “homer”, “enrichr”].param overwrite: bool, optional – Whether output should be overwritten. In this case no jobs will be submitted for jobs with existing output files. Defaults to True
param pep_config:
str
, optional – Pickle file of the analysis. Only required for “region” enrichment.
-
ngs_toolkit.general.
project_to_geo
(project, output_dir='geo_submission', samples=None, distributed=False, dry_run=False, **kwargs)[source]¶ Prepare raw sequencing files for submission to GEO. Files will be copied or generated in a new directory
output_dir
. It will get the raw BAM file(s) of each sample, and in case of ATAC-seq/ChIP-seq samples, the bigWig and peak files. If multiple BAM files exist for each sample, all will be copied and sequencially named with the “fileN” suffix, where “N” is the file number.For each copied file a md5sum will be calculated.
A pandas DataFrame with info on the sample’s files and md5sums will be returned.
- Variables
project (
peppy.Project
) – Apeppy.Project
object to process.output_dir (
str
, optional) –Directory to create output. Will be created/overwriten if existing.
Defaults to “geo_submission”.
samples (
list
, optional) –List of
peppy.Sample
objects in project to restrict to.Defaults to all samples in project.
distributed (
bool
, optional) –Whether processing should be distributed as jobs in a computing cluster for each sample. Currently available implementation supports a SLURM cluster only.
Defaults is
False
.dry_run (
bool
, optional) –Whether copy/execution/submisison commands should be not be run, to test.
Default is
False
.**kwargs (
dict
) –Additional keyword arguments will be passed to
ngs_toolkit.utils.submit_job()
ifdistributed
isTrue
, and on to adivvy
submission template. Pass for example:computing_configuration=”slurm”
jobname=”job”
cores=2
mem=8000
partition=”longq”
- Returns
Annotation of samples and their BAM, BigWig, narrowPeak files and respective md5sums.
- Return type
-
ngs_toolkit.general.
rename_sample_files
(annotation_mapping, old_sample_name_column='old_sample_name', new_sample_name_column='new_sample_name', tmp_prefix='rename_sample_files', results_dir='results_pipeline', dry_run=False)[source]¶ Rename existing directories with pipeline outputs for samples based on mapping of old/new sample names.
All files within the directory with the old sample name will be renamed recursively. Old and new sample names can overlap - this procedure will handle these cases correctly by a 2-step process with temporary sample names with prefix
tmp_prefix
.Caution
NEEDS TESTING! This function has not been used in a while and needs more testing.
- Variables
annotation_mapping (
pandas.DataFrame
) – DataFrame with mapping of old (column “previous_sample_name”) vs new (“new_sample_name”) sample names.old_sample_name_column (
str
, optional) –Name of column with old sample names.
Defaults to “old_sample_name”.
new_sample_name_column (
str
, optional) –Name of column with new sample names.
Defaults to “new_sample_name”.
tmp_prefix (
str
, optional) –Prefix for temporary files to avoid overlap between old and new names.
Defaults to “rename_sample_files”.
results_dir (
str
, optional) –Pipeline output directory containing sample output directories.
Defaults to “results_pipeline”.
dry_run (
bool
, optional) –Whether to print commands instead of running them.
Defaults to
False
.
-
ngs_toolkit.general.
query_biomart
(attributes=None, species='hsapiens', ensembl_version='grch38', max_api_retries=5)[source]¶ Query Biomart (https://www.ensembl.org/biomart/martview/).
Query Biomart for gene attributes. Returns pandas dataframe with query results. If a certain field contains commas, it will attemp to return dataframe but it might fail.
- Parameters
attributes (
list
, optional) – List of ensembl atrributes to query.Defaults to [“ensembl_gene_id”, “external_gene_name”, “hgnc_id”, “hgnc_symbol”].
species (
str
, optional) – Ensembl string of species to query. Must be vertebrate.Defaults to “hsapiens”.
ensembl_version (
str
, optional) – Ensembl version to query. Currently “grch37”, “grch38” and “grcm38” are tested.Defaults to “grch37”.
max_api_retries (
int
, optional) – How many times to try .Defaults to “grch37”.
- Returns
Dataframe with required attributes for each entry.
- Return type
-
ngs_toolkit.general.
subtract_principal_component
(x, pc=1, norm=False, plot=True, plot_name='PCA_based_batch_correction.svg', max_pcs_to_plot=6)[source]¶ Given a matrix (n_samples, n_variables), remove pc (1-based) from matrix.
-
ngs_toolkit.general.
subtract_principal_component_by_attribute
(df, attributes, pc=1)[source]¶ Given a matrix (n_samples, n_variables), remove pc (1-based) from matrix.
-
ngs_toolkit.general.
fix_batch_effect_limma
(matrix, batch_variable='batch', covariates=None)[source]¶ Fix batch effect in matrix using limma.
Requires the R package “limma” to be installed:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("limma")
- Parameters
matrix (
pandas.DataFrame
) – DataFrame with MultiIndex for potential covariate annotationsformula (
str
, optional) – Model formula to regress out Defaults to “~batch”
- Returns
Regressed out matrix
- Return type
ngs_toolkit.graphics¶
-
ngs_toolkit.graphics.
barmap
(x, figsize=None, square=False, row_colors=None, z_score=None, ylims=None)[source]¶ Plot a heatmap-style grid with barplots.
- Parameters
x (
pandas.DataFrame
) – DataFrame with numerical values to plot. If DataFrame, indexes will be used as labels.figsize (
tuple
) – Size in inches (width, height) of figure to produce.square (
bool
) – Whether resulting figure should be square.row_colors (
list
) – Iterable of colors to use for each row.z_score (
int
) – Whether input matrix x should be Z-score transformed row-wise (0) or column-wise (1).
- Returns
Figure object
- Return type
matplotlib.pyplot.Figure
- Raises
ValueError – If length of
row_colors
does not match size of provided Y axis from matrixx
.
-
ngs_toolkit.graphics.
radar_plot
(data, subplot_var='patient_id', group_var='timepoint', radial_vars=['NaiveBcell', 'SwitchedBcell', 'UnswitchedBcell'], cmap='inferno', scale_to_max=True)[source]¶ Heavy inspiration from here: https://matplotlib.org/examples/api/radar_chart.html
-
ngs_toolkit.graphics.
plot_projection
(df, color_dataframe, dims, output_file, attributes_to_plot, plot_max_dims=8, rasterized=False, plot_group_centroids=True, axis_ticklabels=True, axis_ticklabels_name='PC', axis_lines=True, legends=False, always_legend=False)[source]¶ Plot a low dimentionality projection of samples.
- Parameters
df (
pandas.DataFrame
) – Dataframe with sample projections.color_dataframe (
pandas.DataFrame
) – Dataframe of RGB tuples for sample i in attribute j.dims (
int
) – Number of dimentions to plotoutput_file (
str
) – Path to figure output fileattributes_to_plot (
list
) – List of levels in df.index to plotplot_max_dims (number, optional) – Maximum number of dimentions to plot. Defaults to 8.
plot_group_centroids (
bool
, optional) – Whether centroids of each sample group should be plotted alongside samples. Will be square shaped. Defaults to True.axis_ticklabels (
bool
, optional) – Whether axis ticks and tick labels should be plotted. Defaults to False.axis_lines (
bool
, optional) – Whether (0, 0) dashed lines should be plotted. Defaults to True.legends (
bool
, optional) – Whether legends for group colours should be plotted. Defaults to False.always_legend (
bool
, optional) – Whether legends for group colours should be plotted in every figure panel. If False, will plot just on first/last figure panel. Defaults to False.
-
ngs_toolkit.graphics.
plot_region_context_enrichment
(enr, output_dir='results', output_prefix='region_type_enrichment', across_attribute=None, pvalue=0.05, top_n=5)[source]¶ Plot results of ATACSeqAnalysis.region_context_enrichment.
- Parameters
enr (
pandas.DataFrame
) – Results of region_context_enrichment.output_dir (
str
, optional) – Directory to save plots to. Defaults to “results”.optional output_prefix (
str
, optional) – Prefix to use when saveing plots. Defaults to “region_type_enrichment”across_attribute (
str
, optional) – Column in enrichment matrix to plot results across e.g. “comparison_name” when results matrix contains the result of various comparisons. Defaults to None (not used).pvalue (float, optional) – Value at which to plot a line marking the significant level. Defaults to 0.05.
top_n (
int
, optional) – Number of features to label in volcano plot. Defaults to 5.
-
ngs_toolkit.graphics.
plot_comparison_correlations
(diff, output_dir, output_prefix='comparison_correlations')[source]¶ Plot pairwise log fold changes for various comparisons.
- Parameters
diff (
pandas.DataFrame
) – Dataframe with differential resultsoutput_dir (
str
) – Output directory for plots.output_prefix (
str
, optional) – Prefix for plots. Defaults to “comparison_correlations”
ngs_toolkit.utils¶
-
ngs_toolkit.utils.
is_running_inside_ipython
()[source]¶ Check whether code is running inside an IPython session.
-
ngs_toolkit.utils.
get_timestamp
(fmt='%Y-%m-%d-%H:%M:%S')[source]¶ Get current timestamp in
fmt
format.
-
ngs_toolkit.utils.
remove_timestamp_if_existing
(file)[source]¶ Remove timestamp from path if matching timestamp pattern exists.
-
ngs_toolkit.utils.
get_this_file_or_timestamped
(file, permissive=True)[source]¶ Get a path to an existing timestamped file based on an non-timestamped path.
- Parameters
- Raises
IndexError – If not permissive and can’t find timestamped file.
-
ngs_toolkit.utils.
is_analysis_descendent
(exclude_functions=None)[source]¶ Check whether any call in the traceback comes from a function part of a
ngs_toolkit.Analysis()
object.
-
ngs_toolkit.utils.
record_analysis_output
(file_name, **kwargs)[source]¶ Register a file that is an output of an Analysis. The file will be associated with the function that produced it and saved in the attribute
output_files
.- Parameters
file_name (
str
) – File name of analysis output to record.**kwargs (
bool
) – Keyword arguments passed tongs_toolkit.utils.is_analysis_descendent()
.
-
ngs_toolkit.utils.
submit_job
(code, job_file, log_file=None, computing_configuration=None, dry_run=False, limited_number=False, total_job_lim=500, refresh_time=10, in_between_time=5, **kwargs)[source]¶ Submit a job to be run. Uses divvy to allow running on a local computer or distributed computing resources.
- Parameters
code (
str
) – String of command(s) to be run.job_file (
str
) – File to write jobcode
to.log_file (
str
) – Log file to write job output to.Defaults to
job_file
with “.log” ending.computing_configuration (
str
) – Name ofdivvy
computing configuration to use.Defaults to ‘default’ which is to run job in localhost.
dry_run (
bool
) – Whether not to actually run job.Defaults to
False
.limited_number (
bool
) – Whether to restrict jobs to a maximum number. Currently only possible if using “slurm”.Defaults to
False
.total_job_lim (
int
) – Maximum number of jobs to restrict to.Defaults to 500.
refresh_time (
int
) – Time in between checking number of jobs in seconds.Defaults to 10.
in_between_time (
int
) – Time in between job submission in seconds.Defaults to 5.
**kwargs (
dict
) – Additional keyword arguments will be passed to the chosen submission template according to computing_configuration. Pass for example: jobname=”job”, cores=2, mem=8000, partition=”longq”.
-
ngs_toolkit.utils.
chunks
(l, n)[source]¶ Partition iterable l in chunks of size n.
- Parameters
l (iterable) – Iterable (e.g. list or numpy array).
n (
int
) – Size of chunks to generate.
-
ngs_toolkit.utils.
sorted_nicely
(l)[source]¶ Sort an iterable in the way that humans expect.
- Parameters
l (iterable) – Iterable to be sorted
- Returns
Sorted iterable
- Return type
iterable
-
ngs_toolkit.utils.
standard_score
(x)[source]¶ Compute a standard score, defined as (x - min(x)) / (max(x) - min(x)).
- Parameters
x (
numpy.ndarray
) – Numeric array.- Returns
Transformed array.
- Return type
-
ngs_toolkit.utils.
z_score
(x)[source]¶ Compute a Z-score, defined as (x - mean(x)) / std(x).
- Parameters
x (
numpy.ndarray
) – Numeric array.- Returns
Transformed array.
- Return type
-
ngs_toolkit.utils.
logit
(x)[source]¶ Compute the logit of x, defined as log(x / (1 - x)).
- Parameters
x (
numpy.ndarray
) – Numeric array.- Returns
Transformed array.
- Return type
-
ngs_toolkit.utils.
count_dataframe_values
(x)[source]¶ Count number of non-null values in a dataframe.
- Parameters
x (
pandas.DataFrame
) – Pandas DataFrame- Returns
Number of non-null values.
- Return type
-
ngs_toolkit.utils.
location_index_to_bed
(index)[source]¶ Get a pandas DataFrame with columns “chrom”, “start”, “end” from an pandas Index of strings in form “chrom:start-end”.
- Parameters
index ({
list
,pandas.Index
,pandas.Series)
,pandas.DataFrame)
}) – Index strings of the form “chrom:start-end”.- Returns
Pandas dataframe.
- Return type
-
ngs_toolkit.utils.
bed_to_index
(df)[source]¶ Get an index of the form chrom:start-end from a a dataframe with such columns.
- Parameters
df ({
pandas.DataFrame
,pybedtools.bedtool.BedTool
,str
}) – DataFrame with columns “chrom”, “start” and “end”.- Returns
Pandas index.
- Return type
-
ngs_toolkit.utils.
bedtool_to_index
(bedtool)[source]¶ Convert bedtool or path to a bed file to list of region identifiers
-
ngs_toolkit.utils.
to_bed_index
(sites)[source]¶ Convert bedtool, BED file or dataframe to list of region identifiers
-
ngs_toolkit.utils.
timedelta_to_years
(x)[source]¶ Convert a timedelta to years.
- Parameters
x (
pandas.Timedelta
) – A Timedelta object.- Returns
Years.
- Return type
-
ngs_toolkit.utils.
signed_max
(x, f=0.66, axis=0)[source]¶ Return maximum or minimum of array
x
depending on the sign of the majority of values. If there isn’t a clear majority (at leastf
fraction in one side), return mean of values. If given a pandas DataFrame or 2D numpy array, will apply this across rows (columns-wise, axis=0) or across columns (row-wise, axis=1). Will returnnumpy.nan
for non-numeric values.- Parameters
x ({
numpy.ndarray
,pandas.DataFrame
,pandas.Series
}) – Input values to reducef (
float
) – Threshold fraction of majority agreement.Default is 0.66.
axis (
int
) – Whether to apply across rows (0, column-wise) or across columns (1, row-wise).Default is 0.
- Returns
Pandas Series with values reduced to the signed maximum.
- Return type
-
ngs_toolkit.utils.
log_pvalues
(x, f=0.1)[source]¶ Calculate -log10(p-value) of array.
Replaces infinite values with:
max(x) + max(x) * f
that is, fraction
f
more than the maximum non-infinite -log10(p-value).- Parameters
x (
pandas.Series
) – Series with numeric valuesf (
float
) – Fraction to augment the maximum value by ifx
contains infinite values.Defaults to 0.1.
- Returns
Transformed values.
- Return type
-
ngs_toolkit.utils.
r2pandas_df
(r_df)[source]¶ Make
pandas.DataFrame
from aR
dataframe given byrpy
.
-
ngs_toolkit.utils.
recarray2pandas_df
(recarray)[source]¶ Make
pandas.DataFrame
fromnumpy.recarray
.
-
ngs_toolkit.utils.
collect_md5_sums
(df)[source]¶ Given a dataframe with columns with paths to md5sum files ending in ‘_md5sum’, replace the paths to the md5sum files with the actual checksums.
Useful to use in combination with
project_to_geo()
.- Parameters
df (
pandas.DataFrame
) – A dataframe with columns ending in ‘_md5sum’.- Returns
DataFrame with md5sum columns replaced with the actual md5sums.
- Return type
-
ngs_toolkit.utils.
decompress_file
(file, output_file=None)[source]¶ Decompress a gzip-compressed file out-of-memory. Output default is same as
file
without “.gz” ending.
-
ngs_toolkit.utils.
compress_file
(file, output_file=None)[source]¶ Compress a gzip-compressed file out-of-memory. Output default is same as
file
but with “.gz” ending.
-
ngs_toolkit.utils.
download_file
(url, output_file, chunk_size=1024)[source]¶ Download a file and write to disk in chunks (not in memory).
-
ngs_toolkit.utils.
download_gzip_file
(url, output_file, **kwargs)[source]¶ Download a gzip compressed file and write uncompressed file to disk in chunks (not in memory).
- Parameters
url (
str
) – URL to download from.output_file (
str
) – Path to file as output.**kwargs (
dict
) – Additional keyword arguments are passed tongs_toolkit.utils.download_file()
.
-
ngs_toolkit.utils.
deseq_results_to_bed_file
(deseq_result_file, bed_file, sort=True, ascending=False, normalize=False, significant_only=False, alpha=0.05, abs_fold_change=1.0)[source]¶ Write BED file with fold changes from DESeq2 as score value.
-
ngs_toolkit.utils.
homer_peaks_to_bed
(homer_peaks, output_bed)[source]¶ Convert HOMER peak calls to BED format. The fifth column (score) is the -log10(p-value) of the peak.
-
ngs_toolkit.utils.
macs2_call_chipseq_peak
(signal_samples, control_samples, output_dir, name, distributed=True)[source]¶ Call ChIP-seq peaks with MACS2 in a slurm job.
- Parameters
signal_samples (
list
) – Signal Sample objects.control_samples (
list
) – Background Sample objects.output_dir (
list
) – Parent directory where MACS2 outputs will be stored.name (
str
) – Name of the MACS2 comparison being performed.distributed (
bool
) – Whether to submit a SLURM job or to return a string with the runnable.
-
ngs_toolkit.utils.
homer_call_chipseq_peak_job
(signal_samples, control_samples, output_dir, name, distributed=True)[source]¶ Call ChIP-seq peaks with MACS2 in a slurm job.
-
ngs_toolkit.utils.
bed_to_fasta
(input_bed, output_fasta, genome_file)[source]¶ Retrieves DNA sequence underlying specific region. Names of FASTA entries will be of form
chr:start-end
.- Parameters
- Raises
ValueError – If genome_file format cannot be guessed or is not supported.
-
ngs_toolkit.utils.
read_bed_file_three_columns
(input_bed: str) → pandas.core.frame.DataFrame[source]¶ Read BED file into dataframe, make ‘name’ field from location.
-
ngs_toolkit.utils.
bed_to_fasta_through_2bit
(input_bed, output_fasta, genome_2bit)[source]¶ Retrieves DNA sequence underlying specific region. Requires the twoBitToFa command from UCSC kent tools. Names of FASTA entries will be of form “chr:start-end”.
-
ngs_toolkit.utils.
bed_to_fasta_through_fasta
(input_bed, output_fasta, genome_fasta)[source]¶ Retrieves DNA sequence underlying specific region. Uses
bedtools getfasta
(internally through pybedtools.BedTool.sequence). Names of FASTA entries will be of form “chr:start-end”.
-
ngs_toolkit.utils.
count_reads_in_intervals
(bam, intervals, permissive=True)[source]¶ Count total number of reads in a iterable holding strings representing genomic intervals of the form
"chrom:start-end"
.Please make sure both
intervals
andbam
file are zero- or one-indexed.
-
ngs_toolkit.utils.
normalize_quantiles_r
(array)[source]¶ Quantile normalization with a R implementation.
Requires the R package “preprocessCore” to be installed:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("preprocessCore")
- Parameters
array (
numpy.ndarray
) – Numeric array to normalize.- Returns
Normalized numeric array.
- Return type
-
ngs_toolkit.utils.
normalize_quantiles_p
(df_input)[source]¶ Quantile normalization with a ure Python implementation. Code from https://github.com/ShawnLYU/Quantile_Normalize.
- Parameters
df_input (
pandas.DataFrame
) – Dataframe to normalize of shape (features, samples).- Returns
Normalized numeric array.
- Return type
-
ngs_toolkit.utils.
cqn
(matrix, gc_content, lengths)[source]¶ Conditional quantile normalization (CQN) with the
cqn
R library. It uses GC content and length of regulatory elements as covariates.Requires the R package “cqn” to be installed:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("cqn")
- Parameters
matrix (
pandas.DataFrame
) – DataFrame to normalize.gc_content (
pandas.Series
) – Series with GC content of each feature inmatrix
.lengths (
pandas.Series
) – Series with length of each feature inmatrix
.
- Returns
Normalized DataFrame
- Return type
-
ngs_toolkit.utils.
count_bam_file_length
(bam_file: str) → int[source]¶ Get length of BAM indexed file
ngs_toolkit.parsers¶
-
ngs_toolkit.parsers.
parse_ame
(ame_output)[source]¶ Parse results of MEME-AME motif enrichment.
-
ngs_toolkit.parsers.
parse_homer
(homer_dir)[source]¶ Parse results of HOMER findMotifs.pl de novo motif enrichment.
-
ngs_toolkit.parsers.
parse_great_enrichment
(input_tsv)[source]¶ Parse output from GREAT enrichment (http://great.stanford.edu).
- Parameters
input_tsv (
str
) – TSV file exported from GREAT through the option “All data as .tsv” in “Global Controls”.- Returns
Pandas dataframe with enrichment results.
- Return type
ngs_toolkit¶
-
ngs_toolkit.
setup_logger
(level='INFO', logfile=None)[source]¶ Set up a logger for the library.
- Parameters
level (
str
, optional) – Level of logging to display. See possible levels here: https://docs.python.org/2/library/logging.html#levelsDefaults to “INFO”.
logfile (
str
, optional) – File to write log to.Defaults to “~/.ngs_toolkit.log.txt”.
- Returns
A logger called “ngs_toolkit”.
- Return type
-
ngs_toolkit.
setup_config
(custom_yaml_config=None)[source]¶ Set up global library configuration.
It reads ngs_toolkit’s package data to load a default configuration, tries to update it by reading a file in
~/.ngs_toolkit.config.yaml
if present, and lastly, updates it by reading a possible passed yaml filecustom_yaml_config
. Non-exisiting fields will maintain the previous values, so that the user needs only to specify the section(s) as needed.- Parameters
custom_yaml_config (
str
, optional) – Path to YAML file with configuration. To see the structure of the YAML file, see https://github.com/afrendeiro/toolkit/blob/master/ngs_toolkit/config/default.yamlDefaults to
None
.- Returns
Dictionary with configurations.
- Return type