Network topology tests

PyGNA2

PyGNA2 is based on PyGNA. It uses the same statistical tests, but improves the way results are reported.

PyGNA is a tool that can perform several kinds of statistical tests on gene networks and gene sets.

Hypothesis

I have a gene network, consisting of pairs of co-expressed genes, e.g.:

A --- B

C --- D

Independently, I have single-cell data indicating genes are expressed in specific cell types:

Gene Cell type
A Cortex
B Cortex
C Stele
D Stele

Notably, the co-expressed genes share cell types. Is this a significant observation?

Hypothesis: Pairs of co-expressed genes are more likely to share cell type than random pairs of genes.

This is a hypothesis about the topology of the gene set with respect to the gene network.
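
As a rough illustration of the hypothesis, the toy example above can be checked by hand. This is only a sketch: the function name `shared_fraction` is made up for illustration, and comparing against all possible gene pairs is an assumed baseline, not the test PyGNA performs.

```python
from itertools import combinations

# Toy data copied from the example above.
edges = [("A", "B"), ("C", "D")]
cell_type = {"A": "Cortex", "B": "Cortex", "C": "Stele", "D": "Stele"}

def shared_fraction(pairs, cell_type):
    """Fraction of gene pairs whose two genes have the same cell type."""
    shared = sum(1 for a, b in pairs if cell_type[a] == cell_type[b])
    return shared / len(pairs)

# Every unordered pair of genes, used here as the "random pair" baseline.
all_pairs = list(combinations(sorted(cell_type), 2))

observed = shared_fraction(edges, cell_type)      # co-expressed pairs
baseline = shared_fraction(all_pairs, cell_type)  # any pair of genes
print(observed, baseline)
```

In this toy case every co-expressed pair shares a cell type, while only a third of all possible pairs do. With four genes that is not evidence of anything, which is why a proper statistical test is needed.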

Testing topology

Input files

  1. Network file network.tsv, a tab-delimited file with two columns. Each row is a pair of co-expressed genes. For example:
Glyma.01G075300	Glyma.01G068000
Glyma.01G076100	Glyma.01G074800
Glyma.02G148200	Glyma.02G024200
Glyma.02G148500	Glyma.01G163500
Glyma.02G149600	Glyma.02G024200
Glyma.02G149600	Glyma.02G148200
...

Note: The input file can contain additional columns (such as a Cytoscape edges file), but they will be ignored.
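
To make the layout concrete, here is a minimal sketch of reading such a file while ignoring extra columns. It is not how PyGNA itself parses the network; `read_edges` is a hypothetical helper, and the third column in the sample is invented to stand in for an extra field.

```python
import csv
import io

def read_edges(handle):
    """Yield (gene_a, gene_b) from the first two tab-separated columns."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield row[0], row[1]

# Stand-in for open("network.tsv"); the third column is a made-up extra field.
sample = io.StringIO(
    "Glyma.01G075300\tGlyma.01G068000\t0.91\n"
    "Glyma.01G076100\tGlyma.01G074800\t0.87\n"
)
edges = list(read_edges(sample))
print(edges)
```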

  2. Gene set file genesets.gmt, a tab-delimited file with one row per gene set. Each row is a gene set name, followed by a description, followed by a tab-separated list of genes. This format is called GMT for "Gene Matrix Transposed". For example:
cortex	"genes expressed in cortex"	Glyma.01G075300	Glyma.09G115800	Glyma.12G116800	...
stele	"genes expressed in stele"	Glyma.01G075300	Glyma.09G115800	Glyma.12G116800	...
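
The GMT layout described above is simple enough to parse by hand. The following is a sketch for illustration only; `read_gmt` is a hypothetical helper rather than part of PyGNA, and the sample row is shortened to two genes.

```python
import io

def read_gmt(handle):
    """Map each gene set name to (description, list of genes)."""
    genesets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # need a name, a description and at least one gene
        name, description, genes = fields[0], fields[1], fields[2:]
        genesets[name] = (description, [g for g in genes if g])
    return genesets

# Stand-in for open("genesets.gmt"), with a shortened example row.
sample = io.StringIO(
    'cortex\t"genes expressed in cortex"\tGlyma.01G075300\tGlyma.09G115800\n'
)
genesets = read_gmt(sample)
print(genesets["cortex"])
```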

Install PyGNA2

Create and activate a conda environment containing PyGNA, then install PyGNA2 into it with pip:

conda create -n pygna2 -c stracquadaniolab -c bioconda -c conda-forge pygna \
    "numpy<1.20"
conda activate pygna2
pip install pygna2

Perform Geneset Network Topology tests

The tests in PyGNA are divided into two groups: Geneset Network Topology (GNT) tests, which analyse a single gene set, and Geneset Network Association (GNA) tests, which analyse a pair of gene sets. The GNT tests are suitable for our hypothesis. They include:

  • Total Degree test
  • Internal Degree test
  • Module test
  • Shortest Path test
  • Random Walk with Restart test

The shortest path and random walk tests require matrices to be pre-computed from the network. They can be generated by:

pygna build-distance-matrix network.tsv network_sp.hdf5
pygna build-rwr-diffusion network.tsv --output-file network_rwr.hdf5

Total Degree test

mkdir topology_total_degree/
pygna test-topology-total-degree network.tsv genesets.gmt \
  topology_total_degree.csv \
  --number-of-permutations 1000 --cores 4 \
  --results-figure topology_total_degree.pdf \
  --diagnostic-null-folder topology_total_degree/

The Topology Total Degree statistic (TTD) is the average degree (number of edges) of genes in the gene set. The Total Degree test asks whether the TTD of the gene set is higher than expected by chance for a random gene set of the same size drawn from the network.
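
The statistic and a permutation-style p-value can be sketched in a few lines. This is an illustration under assumptions, not PyGNA's implementation: the helper names, the toy network, and the "(hits + 1) / (permutations + 1)" p-value formula are all chosen here for the example.

```python
import random

def build_adjacency(edges):
    """Undirected adjacency sets built from a list of gene pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def total_degree(adj, geneset):
    """Average degree of the gene set members that appear in the network."""
    genes = [g for g in geneset if g in adj]
    return sum(len(adj[g]) for g in genes) / len(genes)

def empirical_p(adj, geneset, statistic, n_perm=1000, seed=0):
    """Share of random same-size gene sets scoring at least the observed value."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    size = len([g for g in geneset if g in adj])
    observed = statistic(adj, geneset)
    hits = sum(
        statistic(adj, rng.sample(nodes, size)) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)  # assumed smoothing, avoids p = 0

# Made-up toy network: A is a hub, the rest form a short chain.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E"), ("E", "F")]
adj = build_adjacency(edges)
ttd = total_degree(adj, {"A", "D"})
p = empirical_p(adj, {"A", "D"}, total_degree, n_perm=200)
print(ttd, p)
```

Here A has degree 3 and D has degree 2, so the TTD of the gene set {A, D} is 2.5.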

From the documentation:

It computes a p-value for the largest connected component of the geneset being bigger than the one expected by chance for a geneset of the same size.

From the paper:

While TTD could be helpful to have an idea of how relevant and well characterized the nodes in the geneset are, we do not expect this statistic to be informative on the strength of interaction within a geneset.

Internal Degree test

mkdir topology_internal_degree/
pygna test-topology-internal-degree network.tsv genesets.gmt \
  topology_internal_degree.csv \
  --number-of-permutations 1000 --cores 4 \
  --results-figure topology_internal_degree.pdf \
  --diagnostic-null-folder topology_internal_degree/

An internal edge of a gene in the gene set is an edge that connects it to another gene in the same gene set.

The internal degree of a gene is its number of internal edges.

The internal fraction of a gene is its internal degree divided by its degree, giving a value between 0 and 1.

The Topology Internal Degree statistic (TID) is the average internal fraction of genes in the gene set (between 0 and 1). The Internal Degree test asks whether the TID of the gene set is higher than expected by chance for a random gene set of the same size. In principle, highly connected gene sets should have TID close to 1.
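
Following the definitions above, the TID of a gene set can be computed as below. This is a sketch based on the wording in this README; the helper names and the toy network are hypothetical, and PyGNA's own formula may aggregate the degrees differently.

```python
def build_adjacency(edges):
    """Undirected adjacency sets built from a list of gene pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def internal_degree_fraction(adj, geneset):
    """Average of (internal degree / degree) over gene set members in the network."""
    members = set(geneset) & set(adj)
    fractions = [len(adj[g] & members) / len(adj[g]) for g in members]
    return sum(fractions) / len(fractions)

# Made-up toy network: a chain A - B - C - D.
adj = build_adjacency([("A", "B"), ("B", "C"), ("C", "D")])
tid = internal_degree_fraction(adj, {"A", "B"})
print(tid)
```

For the gene set {A, B}, gene A has one edge and it is internal (fraction 1), while B has two edges of which one is internal (fraction 0.5), so the TID is 0.75.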

From the documentation:

test-topology-internal-degree performs the analysis of internal degree. It computes a p-value for the ratio of internal degree of the geneset being bigger than the one expected by chance for a geneset of the same size.

From the paper:

In practice, the internal degree statistic captures the amount of direct interactions between genes in a geneset, and thus a geneset showing a network effect should have TID values close to 1. However, the main limitation of this model lies in the fact that it only captures direct interactions, whereas biological networks are usually characterized by medium and long range interactions.

Module test

mkdir topology_module/
pygna test-topology-module network.tsv genesets.gmt \
  topology_module.csv \
  --number-of-permutations 1000 --cores 4 \
  --results-figure topology_module.pdf \
  --diagnostic-null-folder topology_module/

The gene network is made up of components. Two genes are in the same component if there is a path along edges connecting them. Two genes in different components do not have a path between them. The size of a component is the number of genes it contains. In the context of gene networks, components may be called modules.

As a subset of the network, the gene set also has components/modules. A highly connected gene set will have a few large components, while a disconnected gene set will have many small components. The Topology Module statistic (TM) is the size of the largest component in the gene set. The Module test tests whether TM of the gene set is larger than expected by chance.
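
The TM statistic described above can be sketched with a breadth-first search restricted to the gene set. The helper names and the toy network are hypothetical, and genes missing from the network are simply ignored here, which is an assumption.

```python
from collections import deque

def build_adjacency(edges):
    """Undirected adjacency sets built from a list of gene pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def largest_component_size(adj, geneset):
    """Size of the largest component when only edges inside the gene set count."""
    members = set(geneset) & set(adj)
    seen = set()
    best = 0
    for start in members:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adj[node] & members:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        best = max(best, size)
    return best

# Made-up toy network: a chain A - B - C - D plus a separate edge E - F.
adj = build_adjacency([("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")])
tm = largest_component_size(adj, {"A", "B", "C", "E"})
print(tm)
```

Within the gene set {A, B, C, E}, the genes A, B and C form one component and E is isolated, so TM is 3.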

From the documentation:

It computes a p-value for the largest connected component of the geneset being bigger than the one expected by chance for a geneset of the same size.

Shortest Path test

mkdir topology_sp/
pygna test-topology-sp network.tsv genesets.gmt \
  network_sp.hdf5 topology_sp.csv \
  --number-of-permutations 1000 --cores 4 \
  --results-figure topology_sp.pdf \
  --diagnostic-null-folder topology_sp/

From the documentation:

test-topology-sp performs geneset network topology shortest path analysis. It computes a p-value for the average shortest path length of the geneset being smaller than expected by chance for a geneset of the same size.

From the paper:

A main concern regarding direct interaction methods is that they could fail in presence of missing links, which is a well-known problem in biological networks analysis, where experimental screens are often not sensitive enough to detect all existing gene/protein interactions.

A shortest path interaction model allows to overcome this limitation by explicitly taking into account the distance between nodes.
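
One way to read "average shortest path length of the geneset" is the mean distance over all pairs of genes in the set, measured along the whole network. The sketch below implements that reading only; the exact statistic used by PyGNA may be defined differently, and the helper names, the toy network, and the choice to skip unreachable pairs are assumptions.

```python
from collections import deque
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets built from a list of gene pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def bfs_distances(adj, source):
    """Number of edges on the shortest path from source to each reachable gene."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def mean_pairwise_distance(adj, geneset):
    """Mean shortest-path length over connected pairs of gene set members."""
    members = sorted(set(geneset) & set(adj))
    dist = {g: bfs_distances(adj, g) for g in members}
    # Pairs with no connecting path are skipped (an assumption of this sketch).
    lengths = [dist[a][b] for a, b in combinations(members, 2) if b in dist[a]]
    return sum(lengths) / len(lengths) if lengths else float("inf")

# Made-up toy network: a chain A - B - C - D.
adj = build_adjacency([("A", "B"), ("B", "C"), ("C", "D")])
sp = mean_pairwise_distance(adj, {"A", "C", "D"})
print(sp)
```

Paths may pass through genes outside the gene set: A and C are two steps apart via B even though B is not in the set. The three pairwise distances are 2, 3 and 1, giving a mean of 2.0.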

Random Walk with Restart test

mkdir topology_rwr/
pygna test-topology-rwr network.tsv genesets.gmt \
  network_rwr.hdf5 topology_rwr.csv \
  --number-of-permutations 1000 --cores 4 \
  --results-figure topology_rwr.pdf \
  --diagnostic-null-folder topology_rwr/

According to the paper, this is the most robust of the tests, although it is also the hardest to interpret.

From the documentation:

test-topology-rwr performs the analysis of random walk probabilities. Given the Random Walk with Restart matrix, it compares the probability of walking between the genes in the geneset compared to those of walking between the nodes of a geneset with the same size

From the paper:

modelling gene interactions using shortest path provides a simple analytical framework to include local and global awareness of the connectivity. However, this approach is also sensitive to missing links and small-world effects, which is common in biological networks and could lead to false positives [19]. Propagation models provide an analytical model to overcome these limitations, and have been shown to be robust for biological network analysis [20]. While its interpretation is not necessarily straightforward, the RWR model is more robust than the shortest path model, because it effectively adjusts interaction effects for network structure; it rewards nodes connected with many shortest paths, and penalizes those that are connected only by path going through high degree nodes.
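
To give some intuition for the random walk model, here is a minimal sketch of a random walk with restart from a single gene, computed by repeated propagation. It is not PyGNA's implementation: the function names, the restart probability of 0.25, the iteration count, and the toy network are all assumptions made for illustration.

```python
def build_adjacency(edges):
    """Undirected adjacency sets built from a list of gene pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def rwr_scores(adj, seed, restart=0.25, n_iter=200):
    """Approximate steady-state visiting probabilities of a walk restarting at seed."""
    prob = {node: 0.0 for node in adj}
    prob[seed] = 1.0
    for _ in range(n_iter):
        nxt = {node: 0.0 for node in adj}
        for node, mass in prob.items():
            # With probability (1 - restart), step to a uniformly chosen neighbour.
            share = (1.0 - restart) * mass / len(adj[node])
            for neighbour in adj[node]:
                nxt[neighbour] += share
        # With probability restart, jump back to the seed gene.
        nxt[seed] += restart
        prob = nxt
    return prob

# Made-up toy network: a chain A - B - C - D, walking from A.
adj = build_adjacency([("A", "B"), ("B", "C"), ("C", "D")])
scores = rwr_scores(adj, "A")
print(scores)
```

Genes closer to the seed receive higher probability, and because the walker's probability is split among all neighbours at each step, paths through high-degree genes contribute less, which matches the behaviour described in the quote above.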
