1. SNE and t-SNE

Stochastic Neighbor Embedding (SNE) is a nonlinear dimensionality reduction method developed by Geoffrey Hinton and Sam Roweis. It assumes a distribution over the given data points $X_i \in \mathbb{R}^p, i=1,2,\ldots,n$ and models their connectivity. Here n is the number of samples and p is the dimension of the original data. By making low-dimensional data points $Y_i \in \mathbb{R}^q , i=1,2,\ldots,n$ preserve the same connectivity as the original data points, we obtain a new data set with far fewer dimensions that still preserves the connectivity of the original data. t-SNE is an algorithm that adds the t-distribution to SNE.

Let’s look at the process in detail. To understand SNE and t-SNE, we need to focus on three parts of the algorithm.

First, we define a measure of connectivity for the original data.

In the original paper, the following measure is suggested:

$p_{i \mid j} = \frac{\exp( -d_{ij}^2)}{\sum_{k \neq i} \exp( -d_{ik}^2)}$

where $d_{ij}$ is a metric between the i-th and j-th data points. For example, $d_{ij}^2 = \parallel x_i - x_{j} \parallel_2^2/2\sigma_i^2$ works.

In practice, there are several noteworthy points.

  • The measure $p_{i \mid j}$ is asymmetric. That is, $p_{i \mid j}$ and $p_{j \mid i}$ are different, so we need to symmetrize it, for example as $p_{ij} = \frac{1}{c}(p_{i \mid j }+ p_{j \mid i})$, to get a consistent result. Here c is a scaling constant; the authors suggest $c = 2n$, where n is the sample size.

  • If we use $d_{ij} = \parallel x_{i} - x_{j} \parallel_2$, then $p_{i \mid j}$ becomes the probability that “$x_i$ follows the same distribution as $x_j$, given that all of the points follow Gaussian distributions and $x_i$ belongs to one of them.”

    Let’s assume $X_i \sim N(\mu_i ,\sigma_i^2)$ and $\mu_i = \mu_k$ for some $k \neq i$.

    Then, $Pr(\mu_i = \mu_j \mid x_i) = \frac{Pr(x_i \mid \mu_i=\mu_j)Pr(\mu_i = \mu_j)}{\sum_{k\neq i}Pr(x_i\mid \mu_i=\mu_k)Pr(\mu_i=\mu_k)} = \frac{\exp( - \parallel x_i - x_j \parallel_2^2/2\sigma_i^2)}{\sum_{k \neq i} \exp( - \parallel x_i - x_k \parallel_2^2/2\sigma_i^2)}$

    given that $Pr(\mu_i = \mu_j)$ is the same for all $j$, which means that, before seeing any data, we regard the probability of $\mu_i=\mu_j$ as equal across all $j$.

  • Similarly, if we let $d_{ij} = \parallel x_i - x_j \parallel_1^{1/2}$, we obtain a similar result, but now assuming that the data points follow a Laplace distribution.
  • Using this measure, we can construct a similarity (affinity) matrix $P = (p_{ij})_{i,j=1}^{n}$; a short computational sketch is given below.
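
To make the construction of $P$ concrete, here is a minimal NumPy sketch (not the reference implementation) of the Gaussian conditional affinities and their symmetrization. The per-point bandwidths `sigma` are assumed to be given; in practice t-SNE tunes each $\sigma_i$ by binary search to match a user-chosen perplexity.

```python
import numpy as np

def conditional_affinities(X, sigma):
    """Gaussian conditional affinities exp(-||x_i - x_k||^2 / (2 sigma_i^2)),
    normalized over k != i, following the formula above."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * X @ X.T      # squared Euclidean distances
    W = np.exp(-D2 / (2 * sigma[:, None] ** 2))       # bandwidth sigma_i per row
    np.fill_diagonal(W, 0.0)                          # exclude the k = i term
    return W / W.sum(axis=1, keepdims=True)

def symmetrize(P_cond):
    """Symmetrized joint affinities p_ij = (p_{i|j} + p_{j|i}) / (2n)."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2 * n)
```

In the full algorithm, `symmetrize(conditional_affinities(X, sigma))` gives the matrix $P$ used in the later steps.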

Secondly, we set up the corresponding similarity structure of the reduced data Y.

The original SNE paper suggests a form similar to p. However, Laurens van der Maaten suggested using the t-distribution with one degree of freedom.

Original SNE

$q_{i \mid j} =\frac{\exp( -\parallel y_i - y_j \parallel_2^2/2\sigma_{i}^2)}{\sum_{k \neq i} \exp( -\parallel y_i -y_k \parallel_2^2/2\sigma_{i}^2)}$

t-SNE

$q_{ij} = \frac{(1+\parallel y_i - y_j \parallel^2)^{-1}}{\sum_{k} \sum_{l\neq k}(1+\parallel y_k - y_l \parallel^2)^{-1}}$

For the original SNE, the conditional similarities can be symmetrized in the same way as p, i.e. $q_{ij} =\frac{1}{c}(q_{i \mid j} + q_{j \mid i })$; the t-SNE similarities $q_{ij}$ above are already symmetric.
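
As a small sketch of the t-SNE similarities (assuming the embedding `Y` is an n × q NumPy array), the Cauchy kernel is normalized over all pairs, so the resulting matrix is already symmetric:

```python
import numpy as np

def tsne_joint_q(Y):
    """Student-t (1 degree of freedom) joint similarities q_ij for an embedding Y."""
    sq = np.sum(Y ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * Y @ Y.T   # ||y_i - y_j||^2
    W = 1.0 / (1.0 + D2)                           # (1 + ||y_i - y_j||^2)^{-1}
    np.fill_diagonal(W, 0.0)                       # exclude the i = j terms
    return W / W.sum()                             # normalize over all pairs
```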

Either way, we assume another distribution on a space of much lower dimension than the original data. The aim is to construct a new data set that has the same connectivity but follows a different, less complicated distribution.

It is said that the Cauchy distribution is much more robust to outlying points in high-dimensional data.

[Figure: density curves of the t-distribution for several degrees of freedom $\nu$]

The figure above shows the shape of the t-distribution with $\nu$ degrees of freedom. It converges to the standard normal distribution as $\nu$ goes to infinity. Observe that the probability in the extreme regions grows as $\nu$ becomes small. That is, by using the t-distribution, we place less confidence in the distributional assumption and allow more flexibility. Distributions with this kind of shape are called heavy-tailed distributions.
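
A quick numerical check of the heavy tails (a sketch using SciPy; the threshold 4 is an arbitrary choice): the probability of landing far out in the tail is orders of magnitude larger under a t-distribution with few degrees of freedom than under the normal.

```python
from scipy.stats import norm, t

# Two-sided tail probability P(|T| > 4): sizeable for small nu, tiny for the normal.
for nu in (1, 3, 10):
    print(f"t(nu={nu}): P(|T| > 4) = {2 * t.sf(4, df=nu):.5f}")
print(f"normal:   P(|Z| > 4) = {2 * norm.sf(4):.2e}")
```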

Finally, let’s see how we can find Y under the above structure. In the original paper, the Kullback–Leibler divergence is used to measure the difference between the two distributions.

$\min_{Y} KL(P\mid \mid Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$

  • The KL divergence is $E_{X \sim p}\big[\log p(X) - \log q(X)\big]$, i.e., the expected difference of the log-probabilities when the measure p is the true one.
  • Note that the KL divergence is also asymmetric. But in this case, P plays the role of the true distribution, so we do not need to symmetrize it.

Now we use gradient-based methods to find Y. The simplest choice is gradient descent.

$Y^{+} = Y - t \nabla_Y KL(P \mid \mid Q) $

Original SNE : $\frac{\partial}{\partial y_i} KL(P \mid \mid Q) = c \sum_{j \neq i}(p_{ij} -q_{ij})(y_i-y_j)$

t-SNE : $\frac{\partial}{\partial y_i} KL(P \mid \mid Q) = c \sum_{j \neq i}(p_{ij} -q_{ij})(1+ \parallel y_i-y_j \parallel^2)^{-1}(y_i-y_j)$
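
The t-SNE update can be written compactly in NumPy. This is a bare-bones sketch of one gradient-descent step (the constant c is 4 in the t-SNE paper; the learning rate `lr` is an arbitrary choice here, and real implementations add momentum and early exaggeration):

```python
import numpy as np

def tsne_gradient_step(P, Y, lr=100.0, c=4.0):
    """One plain gradient step Y <- Y - lr * dKL/dY using the t-SNE gradient above."""
    sq = np.sum(Y ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * Y @ Y.T
    W = 1.0 / (1.0 + D2)                       # (1 + ||y_i - y_j||^2)^{-1}
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                            # current low-dimensional similarities
    A = (P - Q) * W                            # (p_ij - q_ij)(1 + ||y_i - y_j||^2)^{-1}
    grad = c * (A.sum(axis=1)[:, None] * Y - A @ Y)   # sum_j A_ij (y_i - y_j)
    return Y - lr * grad
```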

So we can finally find the reduced data set Y.

2. UMAP

3. MDS

Multidimensional Scaling (MDS) is a method for finding reduced axes that preserve the pairwise distance information among the data. It seems similar to SNE, but its structure and the method used to find the embedding are quite different.

The ordinary MDS is also called Principal Coordinate Analysis (PCoA). As the name suggests, it has a structure similar to PCA. The objective function of PCoA is as follows. \(\min_{X \in \mathbb{R}^{n \times q}} \left(\frac{\sum_{i,j}(b_{ij}-x_i^tx_j)^2}{\sum_{i,j}b_{ij}^2}\right)^{1/2}\)

  • The vector $x_i$ is q-dimensional and represents the features of the i-th sample, i.e., the i-th row of $X$. Meanwhile, $b_{ij}$ indicates the closeness or similarity between the i-th and j-th elements of the original data: it is small when they are distant and large when they are close.
  • $\left(\sum_{i,j} a_{i,j}^2\right)^{1/2} = (tr(A^tA))^{1/2}$ is a matrix norm, called the Schatten-2 norm, the Hilbert–Schmidt norm, or the Frobenius norm.
  • Therefore, the cost function is $\parallel B-XX^t \parallel_F$, up to the constant normalization $\parallel B \parallel_F$. By the Eckart–Young–Mirsky theorem, the minimizer of the cost function is the rank-q low-rank approximation of $B$; a small computational sketch follows this list.
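
Here is a minimal sketch of PCoA under the assumption that `D` holds pairwise Euclidean distances: double-center the squared distances to recover $B$, then keep the top-q eigenpairs as the low-rank approximation.

```python
import numpy as np

def pcoa(D, q=2):
    """Classical MDS / PCoA: low-rank approximation of the double-centered matrix B."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # inner-product matrix of the centered data
    vals, vecs = np.linalg.eigh(B)            # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:q]          # top-q eigenpairs
    lam = np.clip(vals[idx], 0.0, None)       # guard against tiny negative eigenvalues
    return vecs[:, idx] * np.sqrt(lam)        # X (n x q), so that X X^t approximates B
```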

The classical MDS has the following traits.

  • As a result, $X$ is an $n \times q$ reduced matrix whose inner-product matrix $XX^t$ resembles the inner-product matrix $B$ derived from the distances of the original data (assuming the data are centered).
  • Note that PCA is a low-rank approximation of the covariance matrix of the original data, whereas PCoA is a low-rank approximation of (a transformation of) the distance matrix of the original data.
  • For an $n \times p$ data matrix, the covariance matrix is $p \times p$ but the distance matrix is $n \times n$. Because the spectral decomposition of an $m \times m$ matrix has roughly $O(m^{2.3})$ time complexity and we usually have $n \gg p$, PCoA needs much more computational resources.

So there is another way to approach the MDS problem, Metric Multidimensional Scaling (mMDS). Let’s look at its objective function. \(Stress_{D}(X) = \left( \sum_{i\neq j} (d_{ij}- \parallel x_i - x_j\parallel)^2 \right)^{1/2}\)

  • It is a more direct approach in that it aims to match the distances of the original data $d_{ij}$ with the distances of the reduced data $\parallel x_i - x_j \parallel$. The way of measuring distance can be chosen according to the user’s preference.
  • The optimization strategies for mMDS are somewhat specialized and go under the name of stress majorization. Most of them are iterative methods, including gradient descent and SMACOF (a usage sketch follows this list).
  • The main differences between MDS and mMDS lie in 1. the structure of the target function, 2. the flexibility in the choice of distance measure, and 3. the optimization methods.
  • Because the distance matrix of the result is directly fitted to the distance matrix of the original data, mMDS can be considered a more intuitive method.
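
As a usage sketch (assuming scikit-learn is installed), here is metric MDS on a precomputed distance matrix; scikit-learn’s `MDS` minimizes the stress with SMACOF. The toy data are hypothetical.

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical high-dimensional data and its pairwise Euclidean distance matrix.
rng = np.random.default_rng(0)
X_high = rng.normal(size=(100, 20))
D = np.linalg.norm(X_high[:, None, :] - X_high[None, :, :], axis=-1)

# Metric MDS: minimize the stress over 2-D coordinates via SMACOF.
mds = MDS(n_components=2, metric=True, dissimilarity="precomputed", random_state=0)
X_low = mds.fit_transform(D)
print(X_low.shape, mds.stress_)
```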

On the other hand, we sometimes need to find an adequate method to measure distance, because the usual way of calculating similarity might not work.

  • Just think about the case where the features contain nominal variables. There is no common method to measure the distance between nominal data.

  • Consider situations where differences aren’t linear, such as the distribution of wealth on Earth: it is strikingly exponential, given that the wealthiest 1% of individuals possess half of the world’s riches.

  • The variant designed for such cases is called Non-metric Multidimensional Scaling (NMDS). Instead of fitting the raw dissimilarities, it fits a monotone (rank-preserving) transformation $f$ of them, as in the objective below.

\[\underset{X \in \mathbb{R}^{n\times q},\, f \in F}{\min} \left( \sum_{i \neq j} \big(f(d_{ij}) - \parallel x_i - x_j \parallel\big)^2 \right)^{1/2}\]

where $F$ denotes the set of monotone increasing functions.
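
The same interface covers the non-metric case (again a sketch, assuming scikit-learn): with `metric=False`, SMACOF fits a monotone transformation of the dissimilarities, so only their rank order matters.

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical dissimilarities, built the same way as in the metric MDS sketch.
rng = np.random.default_rng(0)
X_high = rng.normal(size=(100, 20))
D = np.linalg.norm(X_high[:, None, :] - X_high[None, :, :], axis=-1)

# Non-metric MDS: only the rank order of the entries of D is used, which suits
# ordinal or strongly non-linear "distances".
nmds = MDS(n_components=2, metric=False, dissimilarity="precomputed", random_state=0)
X_low_nm = nmds.fit_transform(D)
```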
