pvclust Node Values In ggtree

I recently was using the R package pvclust to test the ‘robusticity’ of clusters in a microbiome-related clustering analysis. While pvclust provides it’s own plots via plot() on a pvclust object, this plots the dendrogram in base R. For readability and customisability reasons I prefer using the packages ggplot2 and ggtree for making my figures. However, I was having a hard time to extract the node uncertainty values from the pvclust object and integrate them into a generic R phylo object for plotting the dendrogram in ggtree.

Fortunately, a bit of googling (a couple of days…) showed me someone else had already solved the problem of transferring additional hclust information into a phylo object but in a different context. The fastbap package has the function as.phylo.hclust.node.attributes()which essentially does what I needed to do - i.e. when calculating the node numbers from each merge event, also store with the same node number the corresponding attribute, or in this case,pvclust` AU value.

I then modified this function slightly to make it more consistent with how pvclust will display the values in the base R plot (rounding and converting to a ‘percentage’). Note that this outputs in the phylo object the metadata node.label, not as node.attributes as in the original fastbaps function.

So in summary:

## 1. Make pvclust object e.g.
hclust_boot <- pvclust::pvclust(otu_matrix,
                 method.hclust = selected_method,
                 method.dist = "euclidean",
                 nboot = 1000,
                 parallel = T)

## 2. Set modified fastbaps function
as.phylo.pvclust.node.attributes <- function(x, attribute)
{
  N <- dim(x$merge)[1]
  edge <- matrix(0L, 2*N, 2)
  edge.length <- numeric(2*N)
  ## `node' gives the number of the node for the i-th row of x$merge
  node <- integer(N)
  node[N] <- N + 2L
  node.attributes <- rep(NA, N)
  cur.nod <- N + 3L
  j <- 1L
  for (i in N:1) {
    edge[j:(j + 1), 1] <- node[i]
    for (l in 1:2) {
      k <- j + l - 1L
      y <- x$merge[i, l]
      if (y > 0) {
        edge[k, 2] <- node[y] <- cur.nod
        cur.nod <- cur.nod + 1L
        edge.length[k] <- x$height[i] - x$height[y]
        node.attributes[edge[k, 1] - (N + 1)] <- attribute[i]
      } else {
        edge[k, 2] <- -y
        edge.length[k] <- x$height[i]
        node.attributes[edge[k, 1] -  (N + 1)] <- attribute[i]
      }
    }
    j <- j + 2L
  }

  if (is.null(x$labels))
    x$labels <- as.character(1:(N + 1))
  
  ## MODIFICATION: clean up node.attributes so they are in same format in 
  ## pvclust plots
  node.attributes <- as.character(round(node.attributes * 100, 0))
  node.attributes[1] <- NA
  
  obj <- list(edge = edge, edge.length = edge.length / 2,
              tip.label = x$labels, Nnode = N, node.label = node.attributes)
  class(obj) <- "phylo"
  stats::reorder(obj)
}

## 3. Use the modified fastbaps function by accessing the hclust object in first 
## position, and the corresponding au values from the edges list entry.
hclust_boot_phylo <- as.phylo.pvclust.node.attributes(hclust_boot$hclust, 
                                                     hclust_boot$edges$au)

To display the values on the tree with ggtree you can then run

ggtree(hclust_boot_phylo  aes(x, y)) +
    geom_text2(aes(subset = !isTip, label = label)) 

as described in the ggtree FAQ under the section “bootstrap values from newick format”.

James A. Fellows Yates
James A. Fellows Yates
Bioinformatician and Biomolecular Archaeologist

My research interests focus on biomolecular archaeology, ancient DNA, metagenomics and bioinformatics

Related