HIERARCHICAL CLUSTERING IN R (层级聚类及其在R中实现)


HIERARCHICAL CLUSTERING IN R: THE ESSENTIALS (层级聚类)

link1-lesson
link2-heatmap-extend

Agglomerative Hierarchical Clustering

丛集聚类
分裂聚类

Steps to agglomerative hierarchical clustering

Data structure and preparation

# Load the data
data("USArrests")

# Standardize the data
## The R function scale() can be used for standardization
df <- scale(USArrests)

# Show the first 6 rows
head(df, nrow = 6)
A matrix: 6 × 4 of type dbl
MurderAssaultUrbanPopRape
Alabama1.242564080.7828393-0.5209066-0.003416473
Alaska0.507862481.1068225-1.2117642 2.484202941
Arizona0.071633411.4788032 0.9989801 1.042878388
Arkansas0.232349380.2308680-1.0735927-0.184916602
California0.278268231.2628144 1.7589234 2.067820292
Colorado0.025714560.3988593 0.8608085 1.864967207

Similarity measures

# Compute the dissimilarity matrix
# df = the standardized data
res.dist <- dist(df, method = "euclidean")  # Euclidean and manhattan
# view distance
as.matrix(res.dist)[1:6, 1:6]
A matrix: 6 × 6 of type dbl
AlabamaAlaskaArizonaArkansasCaliforniaColorado
Alabama0.0000002.7037542.2935201.2898103.2631102.651067
Alaska2.7037540.0000002.7006432.8260393.0125412.326519
Arizona2.2935202.7006430.0000002.7177581.3104841.365031
Arkansas1.2898102.8260392.7177580.0000003.7636412.831051
California3.2631103.0125411.3104843.7636410.0000001.287619
Colorado2.6510672.3265191.3650312.8310511.2876190.000000

Linkage

res.hc <- hclust(d = res.dist, method = "ward.D2")

Verify the cluster tree

One way to measure how well the cluster tree generated by the hclust() function reflects your data is to compute the correlation between the cophenetic distances and the original distance data generated by the dist() function.
If the clustering is valid, the linking of objects in the cluster tree should have a strong correlation with the distances between objects in the original distance matrix.
评估聚类效果的方法之一,就是评估 hclust() 算出来的距离 和 用dist()算的原始数据的距离,二者之间的相关性好不好。

The closer the value of the correlation coefficient is to 1, the more accurately the clustering solution reflects your data. Values above 0.75 are felt to be good.
The “average” linkage method appears to produce high values of this statistic.
相关性越接近 1,说明聚类的效果越能反映真实情况.

# Compute cophentic distance
res.coph <- cophenetic(res.hc)

# Correlation between cophenetic distance and
# the original distance
cor(res.dist, res.coph)

0.697526563237039

# Execute the hclust() function again using the average linkage method
res.hc2 <- hclust(res.dist, method = "average")

cor(res.dist, cophenetic(res.hc2))

0.718038237932047

Dendrogram(树状图)

Dendrogram can be produced in R using the base function plot(res.hc), where res.hc is the output of hclust()

plot(res.hc, cex = 0.8)

png

fviz_dend() ( in factoextra R package ) to produce a beautiful dendrogram.

# cex: label size
suppressPackageStartupMessages(library("factoextra"))
fviz_dend(res.hc, title='', cex = 0.7)

png

Cut the dendrogram into different groups

You can cut the hierarchical tree at a given height in order to partition your data into clusters.
The R base function cutree() can be used to cut a tree, generated by the hclust() function, into several groups either by specifying the desired number of groups or the cut height.

# Cut tree into 4 groups
grp <- cutree(res.hc, k = 4)
head(grp, n = 4)
Alabama
1
Alaska
2
Arizona
2
Arkansas
3
# Number of members in each cluster
table(grp)
grp
 1  2  3  4 
 7 12 19 12 
# Get the names for the members of cluster 1
rownames(df)[grp == 1]
  1. 'Alabama'
  2. 'Georgia'
  3. 'Louisiana'
  4. 'Mississippi'
  5. 'North Carolina'
  6. 'South Carolina'
  7. 'Tennessee'

fviz_dend advanced handbook

fviz_dend(res.hc, title='', cex=0.7, k=4, type='rectangle', rect=T ,rect_border = "jco", rect_fill = TRUE, horiz = TRUE) # type='scatter'

png

Cluster R package

The R package cluster makes it easy to perform cluster analysis in R. It provides the function agnes() and diana() for computing agglomerative and divisive clustering, respectively. These functions perform all the necessary steps for you. You don’t need to execute the scale(), dist() and hclust() function separately.

library("cluster")
# Agglomerative Nesting (Hierarchical Clustering)
res.agnes <- agnes(x = USArrests, # data matrix
                   stand = TRUE, # Standardize the data
                   metric = "euclidean", # metric for distance matrix
                   method = "ward" # Linkage method
                   )

# DIvisive ANAlysis Clustering
res.diana <- diana(x = USArrests, # data matrix
                   stand = TRUE, # standardize the data
                   metric = "euclidean" # metric for distance matrix
                   )
Warning message:
"package 'cluster' was built under R version 4.0.5"
fviz_dend(res.diana, cex = 0.7, k = 4, horiz = T, type = 'circular')

png

Application of hierarchical clustering to gene expression data analysis

In gene expression data analysis, clustering is generaly used as one of the first step to explore the data.
We are interested in whether there are groups of genes or groups of samples that have similar gene expression patterns.

Several clustering distance measures have been described for assessing the similarity or the dissimilarity between items, in order to decide which items have to be grouped together or not.
These measures can be used to cluster genes or samples that are similar.

For most common clustering softwares, the default distance measure is the Euclidean distance.
The most popular methods for gene expression data are to use log2(expression + 0.25), correlation distance and complete linkage clustering agglomerative-clustering.

Note that, when the data are scaled, the Euclidean distance of the z-scores is the same as correlation distance.

Pearson’s correlation is quite sensitive to outliers. When clustering genes, it is important to be aware of the possible impact of outliers.
An alternative option is to use Spearman’s correlation instead of Pearson’s correlation.

Comparing Cluster Dendrograms in R

Data preparation

df <- scale(USArrests)

# Subset containing 10 rows
set.seed(123)
ss <- sample(1:50, 10)
df <- df[ss,]

Dendrograms comparison

suppressPackageStartupMessages(library(dendextend))

# Compute distance matrix
res.dist <- dist(df, method = "euclidean")

# Compute 2 hierarchical clusterings
hc1 <- hclust(res.dist, method = "average")
hc2 <- hclust(res.dist, method = "ward.D2")

# Create two dendrograms
dend1 <- as.dendrogram (hc1)
dend2 <- as.dendrogram (hc2)

# Create a list to hold dendrograms
dend_list <- dendlist(dend1, dend2)
  1. Visual comparison of two dendrograms
# Align and plot two dendrograms side by side
dendlist(dend1, dend2) %>%
  untangle(method = "step1side") %>% # Find the best alignment layout
  tanglegram()                       # Draw the two dendrograms

png

Heatmap in R: Static and Interactive Visualization

R Packages/functions for drawing heatmaps

  • heatmap() [R base function, stats package]: Draws a simple heatmap
  • heatmap.2() [gplots R package]: Draws an enhanced heatmap compared to the R base function.
  • pheatmap() [pheatmap R package]: Draws pretty heatmaps and provides more control to change the appearance of heatmaps.
  • d3heatmap() [d3heatmap R package]: Draws an interactive/clickable heatmap
  • Heatmap() [ComplexHeatmap R/Bioconductor package]: Draws, annotates and arranges complex heatmaps (very useful for genomic data analysis)

data preparation

# start by standardizing the data to make variables comparable
df <- scale(mtcars)

R base heatmap: heatmap()

heatmap(df, scale = "row")

png

It’s possible to specify a color palette using the argument col, which can be defined as follow:

  • use custom colors
col<- colorRampPalette(c("red", "white", "blue"))(256)
  • Or, using RColorBrewer color palette:
library("RColorBrewer")
col <- colorRampPalette(brewer.pal(10, "RdYlBu"))(256)

Additionally, you can use the argument RowSideColors and ColSideColors to annotate rows and columns, respectively.

# Use RColorBrewer color palette names
library("RColorBrewer")
col <- colorRampPalette(brewer.pal(10, "RdYlBu"))(256)
heatmap(df, scale = "none", col =  col, 
        RowSideColors = rep(c("blue", "pink"), each = 16),
        ColSideColors = c(rep("purple", 5), rep("orange", 6)))

png

Enhanced heat maps: heatmap.2()

# install.packages("gplots")
suppressPackageStartupMessages(library("gplots"))
heatmap.2(df, scale = "none", col = bluered(100), 
          trace = "none", density.info = "none")

png

Pretty heat maps: pheatmap()

library("pheatmap")
pheatmap(df, cutree_rows = 4, cutree_cols = 2)

png

Interactive heat maps: d3heatmap()

Complex heatmap

simple heatmap

suppressPackageStartupMessages(library(ComplexHeatmap))
Heatmap(df, 
        name = "mtcars", #title of legend
        column_title = "Variables", row_title = "Samples",
        row_title_gp = gpar(fontsize = 14),
        row_names_gp = gpar(fontsize = 10), # Text size for row names
        column_names_gp = gpar(fontsize = 14)
        )

png

To specify a custom colors, you must use the the colorRamp2() function [circlize package], as follow

suppressPackageStartupMessages(library(circlize))
mycols <- colorRamp2(breaks = c(-2, 0, 2), 
                    colors = c("green", "white", "red"))
Heatmap(df, name = "mtcars", col = mycols)

png

We can also customize the appearance of dendograms using the function color_branches() [dendextend package]

library(dendextend)
row_dend <- hclust(dist(df))
col_dend <- hclust(dist(t(df)))
Heatmap(df, name='mtcars',
        row_names_gp = gpar(fontsize = 11),
        cluster_rows = color_branches(row_dend,k=4),
        cluster_columns = color_branches(col_dend,k=2)
       )

png

Splitting heatmap by rows

You can split the heatmap using either the k-means algorithm or a grouping variable

  • by k-means
# Divide into 2 groups
set.seed(2)
Heatmap(df, name = "mtcars", k = 2)

png

  • grouping variables
# split by a vector specifying rowgroups
Heatmap(df, name = "mtcars", split = mtcars$cyl,
        row_names_gp = gpar(fontsize = 7))

png

# Split by combining multiple variables
Heatmap(df, name ="mtcars", 
        split = data.frame(cyl = mtcars$cyl, am = mtcars$am),
        row_names_gp = gpar(fontsize = 7))

png

Heatmap annotation

For the example below, we’ll transpose our data to have the observations in columns and the variables in rows.

df <- t(df)

Simple annotation

A vector, containing discrete or continuous values, is used to annotate rows or columns.
We’ll use the qualitative variables cyl (levels = “4”, “5” and “8”) and am (levels = “0” and “1”), and the continuous variable mpg to annotate columns.

table(mtcars['am'])
 0  1 
19 13 
# Define colors for each levels of qualitative variables
col = list(cyl = c("4"="green","6"="gray","8"="darkred"),
          am = c("0"="yellow","1"="orange"),
          mpg = circlize::colorRamp2(c(10,34),
                                    c("lightblue", "purple")))
ha <- HeatmapAnnotation(
cyl = mtcars$cyl, am = mtcars$am, mpg = mtcars$mpg,
col = col)
Heatmap(df, name='mtcars',
       top_annotation = ha)

png

*complex annotation

# Define some graphics to display the distribution of columns
.hist = anno_histogram(df, gp = gpar(fill = "lightblue"))
.density = anno_density(df, type = "line", gp = gpar(col = "blue"))
ha_mix_top = HeatmapAnnotation(
  hist = .hist, density = .density,
  height = unit(3.8, "cm")
  )
# Define some graphics to display the distribution of rows
.violin = anno_density(df, type = "violin", 
                       gp = gpar(fill = "lightblue"), which = "row")
.boxplot = anno_boxplot(df, which = "row")
ha_mix_right = HeatmapAnnotation(violin = .violin, bxplt = .boxplot,
                              which = "row", width = unit(4, "cm"))
# Combine annotation with heatmap
Heatmap(df, name = "mtcars", 
        column_names_gp = gpar(fontsize = 8),
        top_annotation = ha_mix_top) + ha_mix_right

png

*Combining multiple heatmaps

# Heatmap 1
ht1 = Heatmap(df, name = "ht1", km = 2,
              column_names_gp = gpar(fontsize = 9))
# Heatmap 2
ht2 = Heatmap(df, name = "ht2", 
        col = circlize::colorRamp2(c(-2, 0, 2), c("green", "white", "red")),
        column_names_gp = gpar(fontsize = 9))
# Combine the two heatmaps
ht1 + ht2

png

Application to gene expression matrix

In gene expression data, rows are genes and columns are samples.

More information about genes can be attached after the expression heatmap such as gene length and type of genes.

expr <- readRDS(paste0(system.file(package = "ComplexHeatmap"),
                      "/extdata/gene_expression.rds"))
head(expr,1)
A data.frame: 1 × 27
s1_cell01s2_cell02s3_cell03s4_cell01s5_cell02s6_cell03s7_cell01s8_cell02s9_cell03s10_cell01...s18_cell03s19_cell01s20_cell02s21_cell03s22_cell01s23_cell02s24_cell03lengthtypechr
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>...<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><chr><chr>
gene111.1163212.6598611.380510.8434412.2940211.0782911.0833512.8315711.2491610.45512...11.4132410.6966912.8330511.1208210.2530612.6391111.20286124882protein_codingchr1
table(expr$chr)
 chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19  chr2 chr20 
    9    10     6     7     3     8     9     3     5     5     5    14     3 
chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chrM  chrX 
    1     7     8     9     9    12     5     4     6     7 
mat <- as.matrix(expr[,grep("cell",colnames(expr))])
type <- gsub('s\\d+_','',colnames(mat))
ha = HeatmapAnnotation(
  df = data.frame(type = type),
    annotation_height = unit(4, "mm")
  )

Heatmap(mat, name='expression' , km = 5, top_annotation = ha,
        show_column_names = F, show_row_names = F) + 
Heatmap(expr$length, name = 'length', width = unit(5, "mm"),
        col = circlize::colorRamp2(c(0,100000),c('white','orange'))) +
Heatmap(expr$type, name = 'type', width = unit(5, "mm")) +
Heatmap(expr$chr, name = 'chr', width = unit(5, "mm"),
       col = circlize::rand_color(length(unique(expr$chr))))

png

set.seed(123)
mydata <- matrix(rnorm(200), 20, 10)
mydata[1:10, seq(1, 10, 2)] = mydata[1:10, seq(1, 10, 2)] + 3
mydata[11:20, seq(2, 10, 2)] = mydata[11:20, seq(2, 10, 2)] + 2
mydata[15:20, seq(2, 10, 2)] = mydata[15:20, seq(2, 10, 2)] + 4
colnames(mydata) = paste("Sple", 1:10, sep = "")
rownames(mydata) = paste("Gene", 1:20, sep = "")
head(mydata[, 1:4], 4)
A matrix: 4 × 4 of type dbl
Sple1Sple2Sple3Sple4
Gene12.439524-1.06782372.305293 0.3796395
Gene22.769823-0.21797492.792083-0.5023235
Gene34.558708-1.02600441.734604-0.3332074
Gene43.070508-0.72889125.168956-1.0185754
library("pheatmap")
pheatmap(mydata, scale = "row")
Warning message:
"package 'pheatmap' was built under R version 4.0.3"

png

R 中聚类方法

  • ‘ward’
  • ‘ward.D’
  • ‘ward.D2’
  • ‘single’
  • ‘complete’
  • ‘average’
  • ‘mcquitty’
  • ‘median’
  • ‘centroid’

文章作者: 梁绍波
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 梁绍波 !
评论
  目录