Start by loading pwcuratr
and a collection of packages that facilitates data manipulation and plotting.
library(pwcuratr)
library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
#> ✔ ggplot2 3.3.0 ✔ purrr 0.3.3
#> ✔ tibble 2.1.3 ✔ dplyr 0.8.4
#> ✔ tidyr 1.0.2 ✔ stringr 1.4.0
#> ✔ readr 1.3.1 ✔ forcats 0.5.0
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
library(pander)
The first step to curating a pathway cluster is to define a set of ‘seed genes’ that are of interest. This set could, e.g., be defined via an extensive literature review. The package pwcuratr
comes with a set of genes associated with the serotonin neurotransmitter pathways. The list of genes can be loaded via
seed_genes <- read_csv(
system.file("extdata/serotonin_seed_genes.csv", package="pwcuratr"),
col_types = c(ensembl_gene_id = col_character())
)$ensembl_gene_id
print(seed_genes)
#> [1] "ENSG00000129167" "ENSG00000132437" "ENSG00000178394" "ENSG00000135312"
#> [5] "ENSG00000179546" "ENSG00000102468" "ENSG00000135914" "ENSG00000147246"
#> [9] "ENSG00000166736" "ENSG00000149305" "ENSG00000178084" "ENSG00000186090"
#> [13] "ENSG00000164270" "ENSG00000157219" "ENSG00000158748" "ENSG00000148680"
#> [17] "ENSG00000189221" "ENSG00000036565" "ENSG00000165646" "ENSG00000108576"
#> [21] "ENSG00000164638"
Next, external data is used to define a plausible functional neighborhood for genes that are functionally related to any of the seed genes. We start by querying a snapshot of the reactome.org pathway database for any pathways containing at least one of the seed genes.
tbl_reactome_pathways <- tibble(
reactome_pathway_id = query_reactome_pathways(seed_genes),
n_participating_genes = map_int(
reactome_pathway_id,
~length(query_participating_genes(.))
)
) %>%
arrange(n_participating_genes)
pander(tbl_reactome_pathways)
reactome_pathway_id | n_participating_genes |
---|---|
R-HSA-5579012 | 1 |
R-HSA-141333 | 2 |
R-HSA-379398 | 2 |
R-HSA-380612 | 2 |
R-HSA-380615 | 3 |
R-HSA-379397 | 3 |
R-HSA-209905 | 4 |
R-HSA-140179 | 4 |
R-HSA-209931 | 5 |
R-HSA-379401 | 5 |
R-HSA-112311 | 11 |
R-HSA-390666 | 12 |
R-HSA-83936 | 12 |
R-HSA-209776 | 18 |
R-HSA-181429 | 18 |
R-HSA-181430 | 18 |
R-HSA-442660 | 20 |
R-HSA-212676 | 23 |
R-HSA-5579029 | 42 |
R-HSA-375280 | 43 |
R-HSA-425397 | 44 |
R-HSA-112310 | 51 |
R-HSA-425366 | 100 |
R-HSA-5668914 | 111 |
R-HSA-211945 | 112 |
R-HSA-6785807 | 123 |
R-HSA-112314 | 220 |
R-HSA-416476 | 223 |
R-HSA-211859 | 233 |
R-HSA-425407 | 269 |
R-HSA-112315 | 287 |
R-HSA-373076 | 347 |
R-HSA-71291 | 412 |
R-HSA-112316 | 438 |
R-HSA-418594 | 458 |
R-HSA-449147 | 512 |
R-HSA-500792 | 513 |
R-HSA-418555 | 656 |
R-HSA-382551 | 806 |
R-HSA-1280215 | 1024 |
R-HSA-388396 | 1290 |
R-HSA-1643685 | 1304 |
R-HSA-372790 | 1370 |
R-HSA-1430728 | 2291 |
R-HSA-168256 | 2734 |
R-HSA-162582 | 3105 |
A naive approach to defining a function neighborhood for the seed genes would be to take the union of all genes who are participating in any of the identified reactome pathways. However, reactome.org pathways are defined on very different levels of granularity and it will not always be sensible to include all participating genes from very large pathways. A first step to address this ‘specificity problem’ is to hand-curate the list of pathways to consider for further processing. Here, a completely arbitrary cut-off in terms of the pathway size of 250 participating genes is chosen. To avoid loosing any of the seed genes, they are added manually to the initial list of candidate genes.
candidate_genes <- c(
query_participating_genes(
tbl_reactome_pathways %>%
filter(n_participating_genes <= 250) %>%
pull(reactome_pathway_id)
),
seed_genes
) %>%
unique
length(candidate_genes)
#> [1] 1059
To refine the candidate set of genes even further, predicted functional gene-gene interactions based on the 2018 release of the interactions derived in
Wu, G., Feng, X., & Stein, L. (2010). A human functional protein interaction network and its application to cancer data analysis. Genome biology, 11(5), R53.
are used. The set of interactions is scored between 0 (low confidence) and 1 (high confidence) for functional interaction between the respective genes.
tbl_reactome_interactions %>%
ggplot() +
aes(score) +
geom_histogram(binwidth = .01) +
ggtitle("Confidence score distribution for functional interactions") +
theme_bw()
Clearly, the majority of included interactions has a confidence score of 0.9 or greater and we filter the set of interactions with this threshold. This results in a gene interaction network that can be used to refine our notion of functional neighborhood. We want to filter the set of candidate genes derived above such that only genes that are connected to a seed gene via at most \(k\) edges are retained.
tibble(
maxedgedistance = 0:10,
n_genes_pruned = map_int(
maxedgedistance,
~length(prune(
candidate_genes,
seed_genes,
minscore = .9,
maxedgedistance = .
))
)
) %>%
ggplot() +
aes(maxedgedistance, n_genes_pruned) +
geom_line() +
geom_point() +
scale_y_continuous(
"number of retained genes",
limits = c(0L, NA_integer_)
) +
scale_x_continuous("maximal edge distance", breaks = 0:10) +
ggtitle("Number of retained candiate genes by choice of maximal edge-distance (k)") +
theme_bw() +
theme(
panel.grid.minor = element_blank()
)
The final choice of \(k\) can only be justified by manually inspecting the resulting gene-networks. As an example, consider the choice \(k=2\). We first prune the set of candidate genes and then create an igraph object to represent the network information. This allows us to use the wealth of graph-based techniques implemented in igraph
package. For instance, one may be interested in the number of connected components of the pruned gene network.
candidate_genes <- prune(
candidate_genes,
seed_genes,
minscore = .9,
maxedgedistance = 2
)
gr <- as_igraph(candidate_genes, minscore = .9)
igraph::components(gr)$csize
#> [1] 360 5 2 1 1 1 1
There are 7 components with almost all candidate genes connected in the first components and two small non-trivial components. All components of size 1 correspond to seed genes for which no functional neighbors in the selected reactome.org pathways where found.
plot_graph(
gr,
seed_genes,
title = "Serotonin Pathway Cluster",
layout = "stress",
layout_args = list(bbox = 3)
)
Based on this visual representation, the neighborhood parameters could be tweaked or the set of initial reactome pathways could be altered. In the presence of non-connected components it might be worthwhile considering a split of the pathway cluster such that each individual network is fully connected.
For instance, one could consider the HTR subsystem separately by restricting the initial set of seed genes. Since the HTR3 subsystem is not connected, we can filter these seed genes out as well.
seed_genes_htr <- tbl_ensembl %>%
select(external_gene_name, ensembl_gene_id) %>%
distinct() %>%
filter(
ensembl_gene_id %in% seed_genes,
str_detect(external_gene_name, "HTR[^3]")
) %>%
pull(ensembl_gene_id) %>%
unique()
tbl_reactome_pathways_htr <- tibble(
reactome_pathway_id = query_reactome_pathways(seed_genes_htr),
n_participating_genes = map_int(
reactome_pathway_id,
~length(query_participating_genes(.))
)
) %>%
arrange(n_participating_genes)
candidate_genes_htr <- c(
query_participating_genes(
tbl_reactome_pathways_htr %>%
filter(n_participating_genes <= 250) %>%
pull(reactome_pathway_id)
),
seed_genes_htr
) %>%
unique %>%
prune(
seed_genes_htr,
minscore = .9,
maxedgedistance = 2
)
gr <- as_igraph(candidate_genes_htr, minscore = .9)
igraph::components(gr)$csize
#> [1] 216