Analysis of high dimensional data: a customized workflow

Genomic data are often high-dimensional. For example, a typical transcriptome data set has fewer than 100 samples but over ten thousand genes (variables). This article introduces a three-part workflow for analyzing high-dimensional data: t-SNE for dimension reduction, DBSCAN for sample clustering, and Plotly for visualization. The data used here is the ExpO (Expression Project for Oncology) microarray data set, which includes gene expression measurements of 19,361 genes in 2,158 tumor samples.



t-SNE (t-Distributed Stochastic Neighbor Embedding) is a popular dimension reduction method that maps high-dimensional data to a 2D or 3D space. It is implemented by the Rtsne {Rtsne} R function. t-SNE uses the overall KL (Kullback–Leibler) divergence to measure how well the similarities between samples in the low-dimensional space reproduce those in the original space, so a better t-SNE run has a relatively lower KL divergence. Each t-SNE run in this analysis involves the following sub-steps:

  • Run the Barnes-Hut approximation to reduce computational time;
  • Calculate the Euclidean distance between every pair of samples;
  • Fit the distances from each sample to its nearest neighbors to a Cauchy distribution;
  • Adjust the sample positions in the low-dimensional space over 1000 iterations to lower their overall KL divergence.
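To make the last two sub-steps more concrete, here is a minimal Python sketch (not the Rtsne implementation, and without the Barnes-Hut approximation) of how pairwise similarities can be derived from a Cauchy kernel and compared by KL divergence. The function names and the toy coordinates are assumptions made for illustration only.

```python
import math

def cauchy_similarities(points):
    """Normalized pairwise similarities using the Cauchy kernel 1 / (1 + d^2)."""
    weights = {}
    for i, p in enumerate(points):
        for j, q in enumerate(points):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(p, q))
                weights[(i, j)] = 1.0 / (1.0 + d2)
    total = sum(weights.values())
    # Divide by the total so that all similarities sum to 1.
    return {pair: w / total for pair, w in weights.items()}

def kl_divergence(p, q):
    """KL(P || Q) = sum of p_ij * log(p_ij / q_ij) over all sample pairs."""
    return sum(p_ij * math.log(p_ij / q[pair])
               for pair, p_ij in p.items() if p_ij > 0)

# Two hypothetical 2D layouts of the same three samples.
layout_a = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
layout_b = [(0.0, 0.0), (3.0, 0.0), (5.0, 5.0)]
sim_a = cauchy_similarities(layout_a)
sim_b = cauchy_similarities(layout_b)

print(kl_divergence(sim_a, sim_a))  # identical layouts: no divergence
print(kl_divergence(sim_a, sim_b))  # different layouts: positive divergence
```

In a real t-SNE run the first distribution comes from the high-dimensional data, and the sample positions are moved step by step to shrink this divergence.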

Repeated t-SNE runs on the same data set do not give identical results because the initial state depends on the random seed, so the overall KL divergence can be used to select an optimal run (the lower the better). The figure below shows the distribution of the overall KL divergence from 1000 t-SNE runs. Because the KL divergence of the worst run is only ~4% higher than that of the best run, we can say that t-SNE performs quite consistently on the ExpO data.
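The selection step can be sketched in a few lines of Python. This assumes the final KL divergence of each run has already been collected into a dictionary keyed by random seed; the function name and the three values below are made up for illustration and are not ExpO results.

```python
def select_best_run(kl_by_seed):
    """Return the seed with the lowest KL divergence and the worst-vs-best gap in percent."""
    best_seed = min(kl_by_seed, key=kl_by_seed.get)
    best = kl_by_seed[best_seed]
    worst = max(kl_by_seed.values())
    spread_pct = 100.0 * (worst - best) / best
    return best_seed, spread_pct

# Hypothetical KL divergences from three runs with different seeds.
kl_by_seed = {1: 2.50, 2: 2.55, 3: 2.60}
best_seed, spread_pct = select_best_run(kl_by_seed)
print(best_seed, round(spread_pct, 1))
```

A small gap between the worst and best runs, as reported above for ExpO, suggests that the choice of seed matters little for the final embedding.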



DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised clustering method based on the density of samples in a multi-dimensional space. It groups samples that are packed closely together, forming a high-density region of the space, into the same cluster. This analysis uses the dbscan {dbscan} R implementation of DBSCAN and takes as input the result of the t-SNE run with the lowest KL divergence.
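For readers who want to see the density idea in code, here is a compact Python sketch of the textbook DBSCAN procedure. It is not the dbscan R package, it uses a brute-force neighbor search that only suits small inputs, and the names `eps`, `min_pts`, and the toy points are assumptions chosen for this illustration.

```python
import math

NOISE = 0  # label used in this sketch for samples that join no cluster

def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j in range(len(points))
            if math.dist(points[i], points[j]) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1           # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
    return labels

# Two tight groups of four points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # → [1, 1, 1, 1, 2, 2, 2, 2, 0]
```

The isolated point is left as noise, which mirrors how DBSCAN can leave some samples unclustered.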

Each DBSCAN run has 2 key parameters: epsilon, the size of the epsilon neighborhood, and minCore, the minimal number of core samples needed to form a cluster. Both parameters can be chosen arbitrarily, but here we run another iterative process to identify their optimal combination based on the silhouette. The silhouette measures how well each sample fits its assigned cluster. For each clustered sample, the silhouette value compares its distance to its own cluster with its distance to the nearest other cluster, and ranges from -1 to 1 (the higher the better). In this analysis, we use the average silhouette value of all samples to select the optimal parameter combination, within the range of 1E-6 to 1E+6 for epsilon and 2 to 100 for minCore. The figure below plots the average silhouette value for each minCore value and shows that the clustering of ExpO is optimal when minCore = 78, which assigns about 89% of all samples to 8 clusters.
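The silhouette calculation can be illustrated with a short Python sketch using the standard definition s = (b - a) / max(a, b), where a is a sample's mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. The sketch assumes that unclustered (noise) samples have already been removed and that at least two clusters exist; the function name and the toy data are hypothetical.

```python
import math

def silhouette_values(points, labels):
    """Silhouette value of every sample, given one cluster label per sample."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            values.append(0.0)  # common convention for a single-sample cluster
            continue
        a = mean_dist(i, own)
        b = min(mean_dist(i, members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_values(points, [1, 1, 2, 2])  # labels follow the two groups
bad = silhouette_values(points, [1, 2, 1, 2])   # labels cut across the groups

avg_good = sum(good) / len(good)
avg_bad = sum(bad) / len(bad)
print(round(avg_good, 2), round(avg_bad, 2))
```

In a parameter search like the one described above, this average would be computed once per (epsilon, minCore) combination, and the combination with the highest average would be kept.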



Now we can have fun visually inspecting the clustering results with an online graphing tool called Plotly. Plotly makes online graphs with interactive features such as zooming, figure downloading, and hover text. An example Plotly graph can be found here (click the “Load an example” button to load the results of a t-SNE run from ExpO data).


In summary, the t-SNE/DBSCAN/Plotly combination makes a practical workflow for unsupervised sample clustering of high-dimensional data. The whole workflow is implemented by the RoCA project.


About Jim Zhe Zhang

Principal Bioinformatics Scientist, PhD, Department of Biomedical and Medical Informatics, The Children's Hospital of Philadelphia