## t-SNE

### What Is t-SNE?

t-SNE (`tsne`

) is an algorithm for dimensionality reduction that is well-suited to visualizing high-dimensional data. The name stands for *t*-distributed Stochastic Neighbor Embedding. The idea is to embed high-dimensional points in low dimensions in a way that respects similarities between points. Nearby points in the high-dimensional space correspond to nearby embedded low-dimensional points, and distant points in high-dimensional space correspond to distant embedded low-dimensional points. (Generally, it is impossible to match distances exactly between high-dimensional and low-dimensional spaces.)

The `tsne`

function creates a set of low-dimensional points from high-dimensional data. Typically, you visualize the low-dimensional points to see natural clusters in the original high-dimensional data.

The algorithm takes the following general steps to embed the data in low dimensions.

Calculate the pairwise distances between the high-dimensional points.

Create a standard deviation

*σ*for each high-dimensional point_{i}*i*so that the*perplexity*of each point is at a predetermined level. For the definition of perplexity, see Compute Distances, Gaussian Variances, and Similarities.Calculate the

*similarity matrix*. This is the joint probability distribution of X, defined by Equation 1.Create an initial set of low-dimensional points.

Iteratively update the low-dimensional points to minimize the Kullback-Leibler divergence between a Gaussian distribution in the high-dimensional space and a

*t*distribution in the low-dimensional space. This optimization procedure is the most time-consuming part of the algorithm.

See van der Maaten and Hinton [1].

### t-SNE Algorithm

The basic t-SNE algorithm performs the following steps.

#### Prepare Data

`tsne`

first removes each row of the input data X that contains any `NaN`

values. Then, if the `Standardize`

name-value pair is `true`

, `tsne`

centers X by subtracting the mean of each column, and scales X by dividing its columns by their standard deviations.

The original authors van der Maaten and Hinton [1] recommend reducing the original data X to a lower-dimensional version using Principal Component Analysis (PCA). You can set the `tsne`

`NumPCAComponents`

name-value pair to the number of dimensions you like, perhaps 50. To exercise more control over this step, preprocess the data using the `pca`

function.

#### Compute Distances, Gaussian Variances, and Similarities

After the preprocessing, `tsne`

calculates the distance *d*(*x _{i}*,

*x*) between each pair of points

_{j}*x*and

_{i}*x*in X. You can choose various distance metrics using the

_{j}`Distance`

name-value pair. By default, `tsne`

uses the standard Euclidean metric. `tsne`

uses the square of the distance metric in its subsequent calculations.Then for each row *i* of X, `tsne`

calculates a standard deviation *σ _{i}* so that the

*perplexity*of row

*i*is equal to the

`Perplexity`

name-value pair. The perplexity is defined in terms of a model Gaussian distribution as follows. As van der Maaten and Hinton [1] describe, “The similarity of data point *x*to data point

_{j}*x*is the conditional probability, $${p}_{j|i}$$, that

_{i}*x*would pick

_{i}*x*as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at

_{j}*x*. For nearby data points, $${p}_{j|i}$$ is relatively high, whereas for widely separated data points, $${p}_{j|i}$$ will be almost infinitesimal (for reasonable values of the variance of the Gaussian,

_{i}*σ*).”

_{i}Define the conditional probability of *j* given *i* as

$$\begin{array}{l}{p}_{j|i}=\frac{\mathrm{exp}\left(-d{\left({x}_{i},{x}_{j}\right)}^{2}/\left(2{\sigma}_{i}^{2}\right)\right)}{{\displaystyle \sum _{k\ne i}\mathrm{exp}\left(-d{\left({x}_{i},{x}_{k}\right)}^{2}/\left(2{\sigma}_{i}^{2}\right)\right)}}\\ {p}_{i|i}=0.\end{array}$$

Then define the joint probability *p _{ij}* by symmetrizing the conditional probabilities:

$${p}_{ij}=\frac{{p}_{j|i}+{p}_{i|j}}{2N},$$ | (1) |

where *N* is the number of rows of X.

The distributions still do not have their standard deviations *σ _{i}* defined in terms of the

`Perplexity`

name-value pair. Let *P*represents the conditional probability distribution over all other data points given data point

_{i}*x*. The perplexity of the distribution is

_{i}$$\text{perplexity}\left({P}_{i}\right)={2}^{H\left({P}_{i}\right)},$$

where *H*(*P _{i}*) is the Shannon entropy of

*P*:

_{i}$$H\left({P}_{i}\right)=-{\displaystyle \sum _{j}{p}_{j|i}{\mathrm{log}}_{2}\left({p}_{j|i}\right)}.$$

The perplexity measures the effective number of neighbors of point *i*. `tsne`

performs a binary search over the *σ _{i}* to achieve a fixed perplexity for each point

*i*.

#### Initialize the Embedding and Divergence

To embed the points in X into a low-dimensional space, `tsne`

performs an optimization. `tsne`

attempts to minimize the Kullback-Leibler divergence between the model Gaussian distribution of the points in X and a Student *t* distribution of points Y in the low-dimensional space.

The minimization procedure begins with an initial set of points Y. `tsne`

create the points by default as random Gaussian-distributed points. You can also create these points yourself and include them in the `'InitialY'`

name-value pair for `tsne`

. `tsne`

then calculates the similarities between each pair of points in Y.

The probability model *q _{ij}* of the distribution of the distances between points

*y*and

_{i}*y*is

_{j}$$\begin{array}{l}{q}_{ij}=\frac{{\left(1+{\Vert {y}_{i}-{y}_{j}\Vert}^{2}\right)}^{-1}}{{\displaystyle \sum _{k}{\displaystyle \sum _{l\ne k}{\left(1+{\Vert {y}_{k}-{y}_{l}\Vert}^{2}\right)}^{-1}}}}\\ {q}_{ii}=0.\end{array}$$

Using this definition and the model of distances in X given by Equation 1, the Kullback-Leibler divergence between the joint distribution *P* and *Q* is

$$KL(P||Q)={\displaystyle \sum _{j}{\displaystyle \sum _{i\ne j}{p}_{ij}\mathrm{log}\frac{{p}_{ij}}{{q}_{ij}}}.}$$

For consequences of this definition, see Helpful Nonlinear Distortion.

#### Gradient Descent of Kullback-Leibler Divergence

To minimize the Kullback-Leibler divergence, the `'exact'`

algorithm uses a modified gradient descent procedure. The gradient with respect to the points in Y of the divergence is

$$\frac{\partial KL(P||Q)}{\partial {y}_{i}}=4{\displaystyle \sum _{j\ne i}Z\left({p}_{ij}-{q}_{ij}\right){q}_{ij}\left({y}_{i}-{y}_{j}\right)},$$

where the normalization term

$$Z={\displaystyle \sum _{k}{\displaystyle \sum _{l\ne k}{\left(1+{\Vert {y}_{k}-{y}_{l}\Vert}^{2}\right)}^{-1}}}.$$

The modified gradient descent algorithm uses a few tuning parameters to attempt to reach a good local minimum.

`'Exaggeration'`

— During the first 99 gradient descent steps,`tsne`

multiplies the probabilities*p*from Equation 1 by the exaggeration value. This step tends to create more space between clusters in the output Y._{ij}`'LearnRate'`

—`tsne`

uses adaptive learning to improve the convergence of the gradient descent iterations. The descent algorithm has iterative steps that are a linear combination of the previous step in the descent and the current gradient.`'LearnRate'`

is a multiplier of the current gradient for the linear combination. For details, see Jacobs [3].

### Barnes-Hut Variation of t-SNE

To speed the t-SNE algorithm and to cut down on its memory usage, `tsne`

offers an approximate optimization scheme. The Barnes-Hut algorithm groups nearby points together to lower the complexity and memory usage of the t-SNE optimization step. The Barnes-Hut algorithm is an approximate optimizer, not an exact optimizer. There is a nonnegative tuning parameter `Theta`

that effects a tradeoff between speed and accuracy. Larger values of `'Theta'`

give faster but less accurate optimization results. The algorithm is relatively insensitive to `'Theta'`

values in the range (0.2,0.8).

The Barnes-Hut algorithm groups nearby points in the low-dimensional space, and performs an approximate gradient descent based on these groups. The idea, originally used in astrophysics, is that the gradient is similar for nearby points, so the computations can be simplified.

See van der Maaten [2].

### Characteristics of t-SNE

#### Cannot Use Embedding to Classify New Data

Because t-SNE often separates data clusters well, it can seem that t-SNE can classify new data points. However, t-SNE cannot classify new points. The t-SNE embedding is a nonlinear map that is data-dependent. To embed a new point in the low-dimensional space, you cannot use the previous embedding as a map. Instead, run the entire algorithm again.

#### Performance Depends on Data Sizes and Algorithm

t-SNE can take a good deal of time to process data. If you have *N* data points in *D* dimensions that you want to map to *Y* dimensions, then

Exact t-SNE takes of order

*D***N*^{2}operations.Barnes-Hut t-SNE takes of order

*D***N*log(*N*)*exp(dimension(*Y*)) operations.

So for large data sets, where *N* is greater than 1000 or so, and where the embedding dimension *Y* is 2 or 3, the Barnes-Hut algorithm can be faster than the exact algorithm.

#### Helpful Nonlinear Distortion

T-SNE maps high-dimensional distances to distorted low-dimensional analogues. Because of the fatter tail of the Student *t* distribution in the low-dimensional space, `tsne`

often moves close points closer together, and moves far points farther apart than in the high-dimensional space, as illustrated in the following figure. The figure shows both Gaussian and Student *t* distributions at the points where the densities are at 0.25 and 0.025. The Gaussian density relates to high-dimensional distances, and the *t* density relates to low-dimensional distances. The *t* density corresponds to close points being closer, and far points being farther, compared to the Gaussian density.

t = linspace(0,5); y1 = normpdf(t,0,1); y2 = tpdf(t,1); plot(t,y1,'k',t,y2,'r') hold on x1 = fzero(@(x)normpdf(x,0,1)-0.25,[0,2]); x2 = fzero(@(x)tpdf(x,1)-0.25,[0,2]); z1 = fzero(@(x)normpdf(x,0,1)-0.025,[0,5]); z2 = fzero(@(x)tpdf(x,1)-0.025,[0,5]); plot([0,x1],[0.25,0.25],'k-.') plot([0,z2],[0.025,0.025],'k-.') plot([x1,x1],[0,0.25],'g-',[x2,x2],[0,0.25],'g-') plot([z1,z1],[0,0.025],'g-',[z2,z2],[0,0.025],'g-') text(1.1,.25,'Close points are closer in low-D') text(2.4,.05,'Far points are farther in low-D') legend('Gaussian(0,1)','Student t (df = 1)') xlabel('x') ylabel('Density') title('Density of Gaussian(0,1) and Student t (df = 1)') hold off

This distortion is helpful when it applies. It does not apply in cases such as when the Gaussian variance is high, which lowers the Gaussian peak and flattens the distribution. In such a case, `tsne`

can move close points farther apart than in the original space. To achieve a helpful distortion,

Set the

`'Verbose'`

name-value pair to`2`

.Adjust the

`'Perplexity'`

name-value pair so the reported range of variances is not too far from`1`

, and the mean variance is near`1`

.

If you can achieve this range of variances, then the diagram applies, and the `tsne`

distortion is helpful.

For effective ways to tune `tsne`

, see Wattenberg, Viégas and Johnson [4].

## References

[1] van der Maaten, Laurens, and Geoffrey Hinton. "Visualizing Data using *t*-SNE." *J. Machine Learning Research* 9, 2008, pp. 2579–2605.

[2] van der Maaten, Laurens. *Barnes-Hut-SNE*. arXiv:1301.3342 [cs.LG], 2013.

[3] Jacobs, Robert A. "Increased rates of convergence through learning rate adaptation." *Neural Networks* 1.4, 1988, pp. 295–307.

[4] Wattenberg, Martin, Fernanda Viégas, and Ian Johnson. "How to Use t-SNE Effectively." *Distill*, 2016. Available at `How to Use t-SNE Effectively`

.