Genomic data are often high-dimensional. For example, a typical transcriptome data set has fewer than 100 samples but over ten thousand genes/variables. This article introduces a three-part workflow for analyzing high-dimensional data: *t-SNE* for dimension reduction, *DBSCAN* for sample classification, and *plotly* for visualization. The data set used here is the ExpO (Expression Project for Oncology) microarray data set, which includes gene expression measurements of 19,361 genes in 2,158 tumor samples.

# t-SNE

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a popular dimension-reduction method that maps high-dimensional data to a 2D or 3D space. This method is implemented by the *Rtsne {Rtsne}* R function. The quality of a t-SNE embedding is measured by the overall KL (Kullback–Leibler) divergence between the sample similarities in the original and the low-dimensional space; an optimal t-SNE run should have a relatively low KL divergence. Each t-SNE run of this analysis involves the following sub-steps:

- Run the *Barnes-Hut* approximation to reduce computational time;
- Calculate the Euclidean distance between each pair of samples;
- Fit the distance of each sample to its nearest neighbors to a *Cauchy* distribution;
- Adjust sample positions in the low-dimensional space over 1,000 iterations to lower the overall KL divergence.
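The steps above can be sketched with a single *Rtsne* call. The `expr` matrix below is a random placeholder for the ExpO expression matrix (rows are samples, columns are genes), not the real data:

```r
library(Rtsne)

set.seed(1)                                  # each run depends on the random seed
expr <- matrix(rnorm(200 * 50), nrow = 200)  # placeholder for the ExpO matrix

tsne <- Rtsne(
  expr,
  dims     = 2,     # map samples to a 2D space
  theta    = 0.5,   # Barnes-Hut approximation (theta = 0 runs exact t-SNE)
  max_iter = 1000   # iterations that lower the overall KL divergence
)

head(tsne$Y)              # 2D coordinates, one row per sample
tail(tsne$itercosts, 1)   # overall KL divergence of this run
```

`itercosts` records the total KL divergence at 50-iteration intervals, so its last element is the overall KL divergence of the finished run.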

Multiple t-SNE runs on the same data set will not produce identical results because the initial state depends on the random seed, so the overall KL divergence can be used to select an optimal run (the lower the better). The figure below shows the distribution of the overall KL divergence from 1,000 t-SNE runs. Because the worst run is only ~4% higher than the best run, we can confidently say that t-SNE performs quite consistently on the ExpO data.
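This selection step can be sketched as a loop over seeds that keeps the run with the lowest KL divergence; the sketch uses a random placeholder matrix `expr` and only 5 runs instead of the 1,000 used in the text:

```r
library(Rtsne)

expr <- matrix(rnorm(200 * 50), nrow = 200)  # placeholder data

# Repeat t-SNE with different seeds and keep the run with the
# lowest overall KL divergence.
runs <- lapply(1:5, function(seed) {
  set.seed(seed)
  Rtsne(expr, dims = 2, max_iter = 1000)
})
kl   <- sapply(runs, function(r) tail(r$itercosts, 1))
best <- runs[[which.min(kl)]]

range(kl)   # spread between the best and the worst run
```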

# DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised clustering method based on the density of samples in a multi-dimensional space: samples packed together into a high-density region of that space are assigned to the same cluster. This analysis uses the *dbscan {dbscan}* R implementation of DBSCAN, with the result of the t-SNE run with the lowest KL divergence as input.
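A minimal *dbscan* call looks like this; the toy 2D coordinates stand in for the selected t-SNE embedding, and the parameter values are illustrative, not the optimized ones:

```r
library(dbscan)

# Toy 2D coordinates standing in for the selected t-SNE embedding:
# two dense blobs plus a few scattered points.
set.seed(1)
xy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2),
  matrix(runif(10, min = -2, max = 7),   ncol = 2)
)

# epsilon and minCore in the text correspond to dbscan's eps and minPts.
cl <- dbscan(xy, eps = 1, minPts = 5)

table(cl$cluster)   # cluster 0 collects the unclustered "noise" samples
```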

Each DBSCAN run has two key parameters: epsilon – the size of the epsilon neighborhood – and minCore – the minimal number of core samples needed to form a cluster. Both parameters can be chosen arbitrarily, but we are going to run another iterative process to identify their optimal combination based on the silhouette. The silhouette is a measurement of clustering cohesion: for each clustered sample, its silhouette value compares its distance to its own cluster with its distance to the other clusters, and ranges between -1 and 1 (the higher the better). In this analysis, we use the average silhouette value of all samples to select the optimal parameter combination, within the range of 1E-6 to 1E+6 for epsilon and 2 to 100 for minCore. The figure below plots the average silhouette value for each minCore value and shows that the clustering of ExpO is optimal when minCore = 78, which assigns about 89% of all samples to 8 clusters.
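The parameter search can be sketched as a grid search scored by the average silhouette (computed here with *silhouette {cluster}*). The toy embedding and the much smaller parameter grid are placeholders for the real t-SNE result and the epsilon/minCore ranges given above:

```r
library(dbscan)
library(cluster)

# Toy 2D embedding (placeholder for the t-SNE result).
set.seed(1)
xy <- rbind(matrix(rnorm(100, 0, 0.3), ncol = 2),
            matrix(rnorm(100, 5, 0.3), ncol = 2))

# Grid-search epsilon / minCore combinations and score each one by the
# average silhouette of the clustered (non-noise) samples.
grid <- expand.grid(eps = c(0.5, 1, 2), minPts = c(5, 10, 20))
grid$avg_sil <- apply(grid, 1, function(p) {
  cl   <- dbscan(xy, eps = p["eps"], minPts = p["minPts"])$cluster
  keep <- cl > 0                              # drop noise samples
  if (length(unique(cl[keep])) < 2) return(NA)
  mean(silhouette(cl[keep], dist(xy[keep, ]))[, "sil_width"])
})

grid[which.max(grid$avg_sil), ]   # best parameter combination
```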

# plotly

Now we can have fun visually inspecting the clustering results using an online graphing tool called Plotly. Plotly makes online graphs with interactive features such as zooming, figure downloading, and hover-over text. An example Plotly graph can be found here (click the “Load an example” button to load the results of a t-SNE run from the ExpO data).
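An interactive scatter plot of the clustered embedding can be sketched with the *plotly* R package; the data frame below is a hypothetical stand-in for the combined t-SNE coordinates and DBSCAN cluster labels:

```r
library(plotly)

# Hypothetical data frame combining t-SNE coordinates and cluster labels.
df <- data.frame(
  x       = rnorm(20),
  y       = rnorm(20),
  cluster = factor(rep(1:2, each = 10)),
  sample  = paste0("sample_", 1:20)
)

# Interactive scatter plot: one color per cluster, sample name on hover.
p <- plot_ly(df, x = ~x, y = ~y, color = ~cluster, text = ~sample,
             type = "scatter", mode = "markers")
p   # renders the interactive figure in the viewer / browser
```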

In summary, the t-SNE/DBSCAN/Plotly combination makes a practical workflow for unsupervised clustering of samples in high-dimensional data. The whole workflow is implemented by the RoCA project.