发布日期: 2023-04-28

更新日期: 2023-04-28

HIERARCHICAL CLUSTERING IN R: THE ESSENTIALS (层级聚类)

link1-lesson
link2-heatmap-extend

Agglomerative Hierarchical Clustering

丛集聚类
分裂聚类

Steps to agglomerative hierarchical clustering

Data structure and preparation

# Load the data
data("USArrests")

# Standardize the data
## The R function scale() can be used for standardization
df <- scale(USArrests)

# Show the first 6 rows
head(df, nrow = 6)

A matrix: 6 × 4 of type dbl
	Murder	Assault	UrbanPop	Rape
Alabama	1.24256408	0.7828393	-0.5209066	-0.003416473
Alaska	0.50786248	1.1068225	-1.2117642	2.484202941
Arizona	0.07163341	1.4788032	0.9989801	1.042878388
Arkansas	0.23234938	0.2308680	-1.0735927	-0.184916602
California	0.27826823	1.2628144	1.7589234	2.067820292
Colorado	0.02571456	0.3988593	0.8608085	1.864967207

Similarity measures

# Compute the dissimilarity matrix
# df = the standardized data
res.dist <- dist(df, method = "euclidean")  # Euclidean and manhattan

# view distance
as.matrix(res.dist)[1:6, 1:6]

A matrix: 6 × 6 of type dbl
	Alabama	Alaska	Arizona	Arkansas	California	Colorado
Alabama	0.000000	2.703754	2.293520	1.289810	3.263110	2.651067
Alaska	2.703754	0.000000	2.700643	2.826039	3.012541	2.326519
Arizona	2.293520	2.700643	0.000000	2.717758	1.310484	1.365031
Arkansas	1.289810	2.826039	2.717758	0.000000	3.763641	2.831051
California	3.263110	3.012541	1.310484	3.763641	0.000000	1.287619
Colorado	2.651067	2.326519	1.365031	2.831051	1.287619	0.000000

Linkage

res.hc <- hclust(d = res.dist, method = "ward.D2")

Verify the cluster tree

One way to measure how well the cluster tree generated by the hclust() function reflects your data is to compute the correlation between the cophenetic distances and the original distance data generated by the dist() function.
If the clustering is valid, the linking of objects in the cluster tree should have a strong correlation with the distances between objects in the original distance matrix.
评估聚类效果的方法之一，就是评估 hclust() 算出来的距离和用dist()算的原始数据的距离，二者之间的相关性好不好。

The closer the value of the correlation coefficient is to 1, the more accurately the clustering solution reflects your data. Values above 0.75 are felt to be good.
The “average” linkage method appears to produce high values of this statistic.
相关性越接近 1，说明聚类的效果越能反映真实情况.

# Compute cophentic distance
res.coph <- cophenetic(res.hc)

# Correlation between cophenetic distance and
# the original distance
cor(res.dist, res.coph)

0.697526563237039

# Execute the hclust() function again using the average linkage method
res.hc2 <- hclust(res.dist, method = "average")

cor(res.dist, cophenetic(res.hc2))

0.718038237932047

Dendrogram(树状图)

Dendrogram can be produced in R using the base function plot(res.hc), where res.hc is the output of hclust()

plot(res.hc, cex = 0.8)

png

fviz_dend() ( in factoextra R package ) to produce a beautiful dendrogram.

# cex: label size
suppressPackageStartupMessages(library("factoextra"))
fviz_dend(res.hc, title='', cex = 0.7)

png

Cut the dendrogram into different groups

You can cut the hierarchical tree at a given height in order to partition your data into clusters.
The R base function cutree() can be used to cut a tree, generated by the hclust() function, into several groups either by specifying the desired number of groups or the cut height.

# Cut tree into 4 groups
grp <- cutree(res.hc, k = 4)
head(grp, n = 4)

Alabama: 1
Alaska: 2
Arizona: 2
Arkansas: 3

# Number of members in each cluster
table(grp)

grp
 1  2  3  4 
 7 12 19 12

# Get the names for the members of cluster 1
rownames(df)[grp == 1]

'Alabama'
'Georgia'
'Louisiana'
'Mississippi'
'North Carolina'
'South Carolina'
'Tennessee'

fviz_dend advanced handbook

fviz_dend(res.hc, title='', cex=0.7, k=4, type='rectangle', rect=T ,rect_border = "jco", rect_fill = TRUE, horiz = TRUE) # type='scatter'

png

Cluster R package

The R package cluster makes it easy to perform cluster analysis in R. It provides the function agnes() and diana() for computing agglomerative and divisive clustering, respectively. These functions perform all the necessary steps for you. You don’t need to execute the scale(), dist() and hclust() function separately.

library("cluster")
# Agglomerative Nesting (Hierarchical Clustering)
res.agnes <- agnes(x = USArrests, # data matrix
                   stand = TRUE, # Standardize the data
                   metric = "euclidean", # metric for distance matrix
                   method = "ward" # Linkage method
                   )

# DIvisive ANAlysis Clustering
res.diana <- diana(x = USArrests, # data matrix
                   stand = TRUE, # standardize the data
                   metric = "euclidean" # metric for distance matrix
                   )

Warning message:
"package 'cluster' was built under R version 4.0.5"

fviz_dend(res.diana, cex = 0.7, k = 4, horiz = T, type = 'circular')

png

Application of hierarchical clustering to gene expression data analysis

In gene expression data analysis, clustering is generaly used as one of the first step to explore the data.
We are interested in whether there are groups of genes or groups of samples that have similar gene expression patterns.

Several clustering distance measures have been described for assessing the similarity or the dissimilarity between items, in order to decide which items have to be grouped together or not.
These measures can be used to cluster genes or samples that are similar.

For most common clustering softwares, the default distance measure is the Euclidean distance.
The most popular methods for gene expression data are to use log2(expression + 0.25), correlation distance and complete linkage clustering agglomerative-clustering.

Note that, when the data are scaled, the Euclidean distance of the z-scores is the same as correlation distance.

Pearson’s correlation is quite sensitive to outliers. When clustering genes, it is important to be aware of the possible impact of outliers.
An alternative option is to use Spearman’s correlation instead of Pearson’s correlation.

Comparing Cluster Dendrograms in R

Data preparation

df <- scale(USArrests)

# Subset containing 10 rows
set.seed(123)
ss <- sample(1:50, 10)
df <- df[ss,]

Dendrograms comparison

suppressPackageStartupMessages(library(dendextend))

# Compute distance matrix
res.dist <- dist(df, method = "euclidean")

# Compute 2 hierarchical clusterings
hc1 <- hclust(res.dist, method = "average")
hc2 <- hclust(res.dist, method = "ward.D2")

# Create two dendrograms
dend1 <- as.dendrogram (hc1)
dend2 <- as.dendrogram (hc2)

# Create a list to hold dendrograms
dend_list <- dendlist(dend1, dend2)

Visual comparison of two dendrograms

# Align and plot two dendrograms side by side
dendlist(dend1, dend2) %>%
  untangle(method = "step1side") %>% # Find the best alignment layout
  tanglegram()                       # Draw the two dendrograms

png

Heatmap in R: Static and Interactive Visualization

R Packages/functions for drawing heatmaps

heatmap() [R base function, stats package]: Draws a simple heatmap
heatmap.2() [gplots R package]: Draws an enhanced heatmap compared to the R base function.
pheatmap() [pheatmap R package]: Draws pretty heatmaps and provides more control to change the appearance of heatmaps.
d3heatmap() [d3heatmap R package]: Draws an interactive/clickable heatmap
Heatmap() [ComplexHeatmap R/Bioconductor package]: Draws, annotates and arranges complex heatmaps (very useful for genomic data analysis)

data preparation

# start by standardizing the data to make variables comparable
df <- scale(mtcars)

R base heatmap: heatmap()

heatmap(df, scale = "row")

png

It’s possible to specify a color palette using the argument col, which can be defined as follow:

use custom colors

col<- colorRampPalette(c("red", "white", "blue"))(256)

Or, using RColorBrewer color palette:

library("RColorBrewer")
col <- colorRampPalette(brewer.pal(10, "RdYlBu"))(256)

Additionally, you can use the argument RowSideColors and ColSideColors to annotate rows and columns, respectively.

# Use RColorBrewer color palette names
library("RColorBrewer")
col <- colorRampPalette(brewer.pal(10, "RdYlBu"))(256)
heatmap(df, scale = "none", col =  col, 
        RowSideColors = rep(c("blue", "pink"), each = 16),
        ColSideColors = c(rep("purple", 5), rep("orange", 6)))

png

Enhanced heat maps: heatmap.2()

# install.packages("gplots")
suppressPackageStartupMessages(library("gplots"))
heatmap.2(df, scale = "none", col = bluered(100), 
          trace = "none", density.info = "none")

png

Pretty heat maps: pheatmap()

library("pheatmap")
pheatmap(df, cutree_rows = 4, cutree_cols = 2)

png

Interactive heat maps: d3heatmap()

Complex heatmap

simple heatmap

suppressPackageStartupMessages(library(ComplexHeatmap))
Heatmap(df, 
        name = "mtcars", #title of legend
        column_title = "Variables", row_title = "Samples",
        row_title_gp = gpar(fontsize = 14),
        row_names_gp = gpar(fontsize = 10), # Text size for row names
        column_names_gp = gpar(fontsize = 14)
        )

png

To specify a custom colors, you must use the the colorRamp2() function [circlize package], as follow

suppressPackageStartupMessages(library(circlize))
mycols <- colorRamp2(breaks = c(-2, 0, 2), 
                    colors = c("green", "white", "red"))
Heatmap(df, name = "mtcars", col = mycols)

png

We can also customize the appearance of dendograms using the function color_branches() [dendextend package]

library(dendextend)
row_dend <- hclust(dist(df))
col_dend <- hclust(dist(t(df)))
Heatmap(df, name='mtcars',
        row_names_gp = gpar(fontsize = 11),
        cluster_rows = color_branches(row_dend,k=4),
        cluster_columns = color_branches(col_dend,k=2)
       )

png

Splitting heatmap by rows

You can split the heatmap using either the k-means algorithm or a grouping variable

by k-means

# Divide into 2 groups
set.seed(2)
Heatmap(df, name = "mtcars", k = 2)

png

grouping variables

# split by a vector specifying rowgroups
Heatmap(df, name = "mtcars", split = mtcars$cyl,
        row_names_gp = gpar(fontsize = 7))

png

# Split by combining multiple variables
Heatmap(df, name ="mtcars", 
        split = data.frame(cyl = mtcars$cyl, am = mtcars$am),
        row_names_gp = gpar(fontsize = 7))

png

Heatmap annotation

For the example below, we’ll transpose our data to have the observations in columns and the variables in rows.

df <- t(df)

Simple annotation

A vector, containing discrete or continuous values, is used to annotate rows or columns.
We’ll use the qualitative variables cyl (levels = “4”, “5” and “8”) and am (levels = “0” and “1”), and the continuous variable mpg to annotate columns.

table(mtcars['am'])

 0  1 
19 13

# Define colors for each levels of qualitative variables
col = list(cyl = c("4"="green","6"="gray","8"="darkred"),
          am = c("0"="yellow","1"="orange"),
          mpg = circlize::colorRamp2(c(10,34),
                                    c("lightblue", "purple")))
ha <- HeatmapAnnotation(
cyl = mtcars$cyl, am = mtcars$am, mpg = mtcars$mpg,
col = col)

Heatmap(df, name='mtcars',
       top_annotation = ha)

png

*complex annotation

# Define some graphics to display the distribution of columns
.hist = anno_histogram(df, gp = gpar(fill = "lightblue"))
.density = anno_density(df, type = "line", gp = gpar(col = "blue"))
ha_mix_top = HeatmapAnnotation(
  hist = .hist, density = .density,
  height = unit(3.8, "cm")
  )
# Define some graphics to display the distribution of rows
.violin = anno_density(df, type = "violin", 
                       gp = gpar(fill = "lightblue"), which = "row")
.boxplot = anno_boxplot(df, which = "row")
ha_mix_right = HeatmapAnnotation(violin = .violin, bxplt = .boxplot,
                              which = "row", width = unit(4, "cm"))
# Combine annotation with heatmap
Heatmap(df, name = "mtcars", 
        column_names_gp = gpar(fontsize = 8),
        top_annotation = ha_mix_top) + ha_mix_right

png

*Combining multiple heatmaps

# Heatmap 1
ht1 = Heatmap(df, name = "ht1", km = 2,
              column_names_gp = gpar(fontsize = 9))
# Heatmap 2
ht2 = Heatmap(df, name = "ht2", 
        col = circlize::colorRamp2(c(-2, 0, 2), c("green", "white", "red")),
        column_names_gp = gpar(fontsize = 9))
# Combine the two heatmaps
ht1 + ht2

png

Application to gene expression matrix

In gene expression data, rows are genes and columns are samples.

More information about genes can be attached after the expression heatmap such as gene length and type of genes.

expr <- readRDS(paste0(system.file(package = "ComplexHeatmap"),
                      "/extdata/gene_expression.rds"))

head(expr,1)

A data.frame: 1 × 27
	s1_cell01	s2_cell02	s3_cell03	s4_cell01	s5_cell02	s6_cell03	s7_cell01	s8_cell02	s9_cell03	s10_cell01	...	s18_cell03	s19_cell01	s20_cell02	s21_cell03	s22_cell01	s23_cell02	s24_cell03	length	type	chr
	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	...	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<chr>	<chr>
gene1	11.11632	12.65986	11.3805	10.84344	12.29402	11.07829	11.08335	12.83157	11.24916	10.45512	...	11.41324	10.69669	12.83305	11.12082	10.25306	12.63911	11.20286	124882	protein_coding	chr1

table(expr$chr)

 chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19  chr2 chr20 
    9    10     6     7     3     8     9     3     5     5     5    14     3 
chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chrM  chrX 
    1     7     8     9     9    12     5     4     6     7

mat <- as.matrix(expr[,grep("cell",colnames(expr))])
type <- gsub('s\\d+_','',colnames(mat))
ha = HeatmapAnnotation(
  df = data.frame(type = type),
    annotation_height = unit(4, "mm")
  )

Heatmap(mat, name='expression' , km = 5, top_annotation = ha,
        show_column_names = F, show_row_names = F) + 
Heatmap(expr$length, name = 'length', width = unit(5, "mm"),
        col = circlize::colorRamp2(c(0,100000),c('white','orange'))) +
Heatmap(expr$type, name = 'type', width = unit(5, "mm")) +
Heatmap(expr$chr, name = 'chr', width = unit(5, "mm"),
       col = circlize::rand_color(length(unique(expr$chr))))

png

set.seed(123)
mydata <- matrix(rnorm(200), 20, 10)
mydata[1:10, seq(1, 10, 2)] = mydata[1:10, seq(1, 10, 2)] + 3
mydata[11:20, seq(2, 10, 2)] = mydata[11:20, seq(2, 10, 2)] + 2
mydata[15:20, seq(2, 10, 2)] = mydata[15:20, seq(2, 10, 2)] + 4
colnames(mydata) = paste("Sple", 1:10, sep = "")
rownames(mydata) = paste("Gene", 1:20, sep = "")
head(mydata[, 1:4], 4)

A matrix: 4 × 4 of type dbl
	Sple1	Sple2	Sple3	Sple4
Gene1	2.439524	-1.0678237	2.305293	0.3796395
Gene2	2.769823	-0.2179749	2.792083	-0.5023235
Gene3	4.558708	-1.0260044	1.734604	-0.3332074
Gene4	3.070508	-0.7288912	5.168956	-1.0185754

library("pheatmap")
pheatmap(mydata, scale = "row")

Warning message:
"package 'pheatmap' was built under R version 4.0.3"

png

R 中聚类方法

‘ward’
‘ward.D’
‘ward.D2’
‘single’
‘complete’
‘average’
‘mcquitty’
‘median’
‘centroid’

梁绍波

https://liangshaw.github.io/2023/04/28/Hierarchical_clustering_in_R/

本博客所有文章除特別声明外，均采用 CC BY 4.0 许可协议。转载请注明来源梁绍波 !

hierarchical clustering R

服务器硬盘存储拓展

2023-05-08 梁绍波

server hardware

蛋白质结构预测与可视化

2023-04-21 bioinformatics

structure alphafold

HIERARCHICAL CLUSTERING IN R (层级聚类及其在R中实现)

HIERARCHICAL CLUSTERING IN R: THE ESSENTIALS (层级聚类)

Agglomerative Hierarchical Clustering

Steps to agglomerative hierarchical clustering

Data structure and preparation

Similarity measures

Linkage

Verify the cluster tree

Dendrogram(树状图)

Cut the dendrogram into different groups

Cluster R package

Application of hierarchical clustering to gene expression data analysis

Comparing Cluster Dendrograms in R

Data preparation

Dendrograms comparison

Heatmap in R: Static and Interactive Visualization

R Packages/functions for drawing heatmaps

data preparation

R base heatmap: heatmap()

Enhanced heat maps: heatmap.2()

Pretty heat maps: pheatmap()

Interactive heat maps: d3heatmap()

Complex heatmap

simple heatmap

Splitting heatmap by rows

Heatmap annotation

Simple annotation

*complex annotation

*Combining multiple heatmaps

Application to gene expression matrix

R 中聚类方法