Sufficient Dimension Reduction - 1. Introduction and Sliced Inverse Regression
1. Sufficient Dimension Reduction
Introduction
Sufficient Dimension Reduction (SDR) is a dimensionality reduction technique that combines the concepts of a sufficient statistic and dimension reduction.
Unlike general dimension reduction techniques such as principal component analysis or cluster analysis, which are unsupervised, SDR is a supervised method. It aims to find the smallest part of the information in the independent variables needed to explain the dependent variable. Its theoretical structure is as follows.
Sufficient Dimension Reduction
Finding $\beta$ such that $Y \perp X \mid \beta^t X \iff F(Y \mid X) = F(Y \mid \beta^t X)$
This can be understood as follows.
As we know, a sufficient statistic is a function $T(X)$ of the data such that $f(X \mid T(X))$ does not depend on the parameter $\theta$. That is, $f(X \mid T(X); \theta)$ is the same for every possible $\theta$.
Because the theory comes from the frequentist tradition, $\theta$ is treated as a constant. From the Bayesian point of view, however, $\theta$ can be treated as another random variable, so we can rewrite the definition of a sufficient statistic as $f(Z \mid T(Z), \theta) = f(Z \mid T(Z))$.
Now, let's apply this concept to the regression problem, which is to find $f(Y \mid X)$. In general, we model the conditional expectation as a linear combination, $E(Y \mid X) = X\beta$.
In this setting, we can view $Y$ and $X$ as playing the roles of $Z$ and $\theta$, respectively. The sufficiency condition for the regression problem then becomes $f(Y \mid X, T(X)) = f(Y \mid T(X))$, which can be written as $Y \perp X \mid T(X)$ using the notion of conditional independence. In SDR, to make the problem tractable, we take $T(X)$ to be a simple linear combination of $X$, namely $T(X) = \beta^t X$.
To sum up, by using the notion of a sufficient statistic in this more general way, the SDR problem can be understood as a combination of sufficiency and regression. With the problem set up exactly as above, we can imagine various methodologies.
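As a concrete illustration, here is a minimal simulation sketch (the model $Y = \sin(\beta^t X) + \epsilon$ and all variable names are my own choices, not from the text) in which $Y$ depends on a five-dimensional $X$ only through the single linear combination $\beta^t X$, so that $Y \perp X \mid \beta^t X$ and span($\beta$) is a one-dimensional SDR subspace.

```python
# A toy model in which Y depends on X only through a single index b^t X.
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 5
X = rng.standard_normal((n, p))
b = np.array([1.0, -1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)   # true direction (unit length)

index = X @ b                                            # b^t X
Y = np.sin(index) + 0.1 * rng.standard_normal(n)         # Y | X is a function of b^t X plus noise

# Given b^t X, the remaining coordinates of X carry no further information about Y,
# so reducing X to the single scalar b^t X loses nothing for explaining Y.
print(index[:3], Y[:3])
```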
Some readers may be confused by conditional independence. Conditional independence is the situation in which two random variables $X_1$ and $X_2$ are independent given $X_3$. The precise definition is below.
Conditional Independence
For a probability space $(\Omega, \mathcal{F}, P)$ and sub-$\sigma$-fields $\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3$ of $\mathcal{F}$,
$\mathcal{G}_1$ and $\mathcal{G}_2$ are said to be independent given $\mathcal{G}_3$, denoted $\mathcal{G}_1 \perp \mathcal{G}_2 \mid \mathcal{G}_3$,
if, for every $A \in \mathcal{G}_1$ and $B \in \mathcal{G}_2$, $P(A \cap B \mid \mathcal{G}_3) = P(A \mid \mathcal{G}_3) P(B \mid \mathcal{G}_3) \quad a.s.\ P$.
Students are often unsure whether the two variables can be exchanged here, because the notation seems to say that $X_3$ affects only $X_2$: that is, $X_1 \perp X_2 \mid X_3$ reads as if $X_1$ and $X_2 \mid X_3$ were independent. But the conditioning does not attach to just one of the two variables, so we can exchange them with confidence, i.e. $X_1 \perp X_2 \mid X_3 \iff X_2 \perp X_1 \mid X_3$.
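As a small numerical illustration (a hypothetical Gaussian example of my own), the snippet below builds $X_1$ and $X_2$ that are clearly dependent marginally but conditionally independent given $X_3$; for jointly Gaussian variables this shows up as a near-zero partial correlation, and the relation is plainly symmetric in $X_1$ and $X_2$.

```python
# X1 and X2 are marginally dependent but conditionally independent given X3.
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
X3 = rng.standard_normal(n)
X1 = X3 + rng.standard_normal(n)
X2 = X3 + rng.standard_normal(n)

def residual(a, b):
    """Residual of a after a simple linear regression on b."""
    slope, intercept = np.polyfit(b, a, 1)
    return a - (slope * b + intercept)

print(round(np.corrcoef(X1, X2)[0, 1], 3))                              # marginal correlation ~ 0.5
print(round(np.corrcoef(residual(X1, X3), residual(X2, X3))[0, 1], 3))  # partial correlation ~ 0.0
```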
Properties of the SDR subspace
To make the SDR problem analytically tractable, one works with a subspace: the sufficient dimension reduction (SDR) subspace. Its definition is below.
SDR subspace
span($\beta$) is called an SDR subspace if $Y \perp X \mid \beta^t X$; it is denoted $S(\beta)$.
Using it, we can establish several properties of SDR.
- An SDR subspace is not unique.
Let $S_1$ be an SDR subspace and let $S_2$ be a subspace containing $S_1$. Then $S_2$ is an SDR subspace too.
pf)
Let $\beta_1, \beta_2$ be bases corresponding to $S_1, S_2$.
Because $\sigma(X, \beta_2^t X) = \sigma(X)$, we have $Y \perp (X, \beta_2^t X) \mid \beta_1^t X$.
Using the properties of conditional independence, $Y \perp X \mid (\beta_2^t X, \beta_1^t X)$.
Since $\beta_2^t X$ determines $\beta_1^t X$ (because $S_1 \subset S_2$), we conclude that $Y \perp X \mid \beta_2^t X$.
Moreover, since obviously $Y \perp X \mid X$, the full predictor space is itself an SDR subspace.
- Central Subspace
Therefore, finding just any SDR subspace is not enough; we have to find the smallest one. Before doing so, we need to define such a subspace precisely.
Central Subspace
Let $U$ be the class of all SDR subspaces. Then $S_{Y \mid X} = \cap \{S : S \in U\}$ is the smallest SDR subspace, and we call it the central subspace. The dimension of $S_{Y \mid X}$ is called the structural dimension.
In general, every SDR method targets the central subspace.
- Linear transformation law of Central Subspace
Because the basis vectors of an SDR subspace enter only through linear combinations of $X$, the central subspace transforms simply under linear maps. For a nonsingular matrix $A$, the following formula holds.
$S_{Y \mid X} = A^t S_{Y \mid AX + b}$
A detailed proof is omitted. We often standardize the predictor as $Z = \Sigma^{-1/2}(X-\mu)$; we can then carry out the calculation on $Z$ and transform the result back using the formula above.
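Here is a minimal sketch (the helper names are my own) of how the law is typically used in practice: estimate directions on the standardized scale $Z = AX + b$ with $A = \Sigma^{-1/2}$ and $b = -\Sigma^{-1/2}\mu$, then map them back to the $X$ scale via $S_{Y \mid X} = A^t S_{Y \mid Z}$.

```python
# Standardize X, work on the Z scale, and transform estimated directions back.
import numpy as np

def inv_sqrt(Sigma):
    """Sigma^{-1/2} of a symmetric positive definite matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(Sigma)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def standardize(X):
    """Return Z = Sigma^{-1/2}(X - mu) together with A = Sigma^{-1/2}."""
    mu = X.mean(axis=0)
    A = inv_sqrt(np.cov(X, rowvar=False))
    return (X - mu) @ A, A

def to_x_scale(directions_Z, A):
    """Map direction vectors found on the Z scale back to the X scale (S_{Y|X} = A^t S_{Y|Z})."""
    return A.T @ directions_Z   # A = Sigma^{-1/2} is symmetric, so A^t = A
```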
- Function transformation of Central Subspace
Moreover, the following relation holds for any measurable function $g$.
$S_{g(Y) \mid X} \subset S_{Y \mid X}$
This means that if we find an SDR subspace for $Y$, we can use the result for $g(Y)$ as well. For example, a GLM relates a transformation of the response to the linear combination $X^t\beta$, which suggests that SDR can be used when estimating GLMs.
pf)
This follows easily from the standard result in mathematical statistics that if $X$ and $Y$ are independent, then $f(X)$ and $g(Y)$ are independent for any measurable functions $f, g$. Hence $Y \perp X \mid \beta^t X \Rightarrow g(Y) \perp X \mid \beta^t X$, so $\beta$ also spans an SDR subspace for $g(Y)$.
2. Sliced Inverse Regression
Introduction
Sliced Inverse Regression (SIR) is one of the main methods for SDR. It was first proposed by K.-C. Li in 1991.
Inverse regression is literally a regression of $X$ on $Y$. In the usual model $Y = X\beta + \epsilon$, however, $Y$ fluctuates even at a fixed value of $X$. By slicing the range of $Y$ into intervals and averaging within slices, this fluctuation is smoothed out and the inverse regression curve $E(X \mid Y)$ can be estimated stably. This is what is called sliced inverse regression.
To use it, we need the Linear Conditional Mean (LCM) assumption: $E(X \mid \beta^t X)$ is a linear function of $\beta^t X$. This assumption cannot be checked directly because it involves the target parameter $\beta$, so in practice we rely on the following lemma.
Lemma
If an integrable $X$ has an elliptically contoured distribution, then $E(X \mid B^t X)$ is linear in $B^t X$ for any matrix $B$; conversely, if $E(X \mid v^t X)$ is linear in $v^t X$ for every $v \in R^p$, then $X$ has an elliptically contoured distribution.
A detailed proof is omitted. The lemma says that we can check linearity by checking whether the distribution is elliptically contoured. The definition of an elliptically contoured distribution is below.
Elliptical Contour
A random vector $X$ is said to be elliptically contoured, or to have an elliptical distribution, if its pdf has the form $f_X(x) = h((x-\mu)^t A (x-\mu))$ for some function $h$ and positive definite matrix $A$.
It is a generalization of the multivariate normal distribution. As the form shows, the density depends on $x$ only through the quadratic form. As the name suggests, every pairwise scatter plot of its components has elliptical contours. Examples of elliptical distributions include the multivariate normal, multivariate t, symmetric multivariate stable, symmetric multivariate Laplace, and multivariate logistic distributions.
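As a quick empirical check (a sketch using a simulated multivariate normal $X$ and an arbitrary direction $v$ of my own choosing), the least-squares slope of each coordinate of $X$ on $v^t X$ should match the theoretical linear-conditional-mean slope $\Sigma v / (v^t \Sigma v)$, reflecting that $E(X \mid v^t X)$ is linear for elliptical $X$.

```python
# Check that E(X | v^t X) is linear when X is multivariate normal (an elliptical distribution).
import numpy as np

rng = np.random.default_rng(2)
p, n = 3, 100_000
M = rng.standard_normal((p, p))
Sigma = M @ M.T                                            # an arbitrary covariance matrix
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

v = np.array([1.0, 2.0, -1.0])
t = X @ v                                                  # v^t X

slope_theory = Sigma @ v / (v @ Sigma @ v)                 # slope of E(X | v^t X) under normality
slope_hat = np.array([np.polyfit(t, X[:, j], 1)[0] for j in range(p)])

print(np.round(slope_theory, 3))
print(np.round(slope_hat, 3))                              # should agree closely
```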
With this assumption in hand, the following theorem holds.
Theorem
If $X$ is square-integrable, $\Sigma = Var(X)$ is non-singular, and the LCM assumption holds, then $\Sigma^{-1}(E(X \mid Y) - E(X)) \in S_{Y \mid X}$.
pf)
Lemma: If LCM holds, then $E(X-E(X) \mid \beta^t X) = P_{\beta}^t(\Sigma)(X-E(X))$
pf of lemma)
$E(X \mid \beta^t X) = c + D \beta^tX$ by LCM
$E(E(X \mid \beta^t X)) = c + D \beta^t E(X)$ (taking expectations of both sides)
$\mu = c + D \beta^t \mu$
Substituting this back into the first line, we get
$E(X - \mu \mid \beta^t X) = D \beta^t (X-\mu)$
$\Rightarrow E(X-\mu \mid \beta^t X) (X-\mu)^t \beta = D \beta^t (X-\mu)(X-\mu)^t \beta$
$\Rightarrow E(E(X-\mu \mid \beta^t X) (X-\mu)^t \beta) = E(D \beta^t (X-\mu)(X-\mu)^t \beta))$
$\Rightarrow E((X-\mu) (X-\mu)^t)\beta = D \beta^t E((X-\mu)(X-\mu)^t) \beta$
$\Rightarrow \Sigma \beta = D \beta^t \Sigma \beta$
$\Rightarrow D = \Sigma \beta (\beta^t \Sigma \beta)^{-1}$
then $E(X - \mu \mid \beta^t X) = D \beta^t (X-\mu)= P_\beta ^t(\Sigma)(X -\mu)$
where $P_{\beta}(\Sigma) = \beta (\beta^t \Sigma \beta)^{-1} \beta^t \Sigma$ is the orthogonal projection matrix onto span($\beta$) with respect to the inner product $\left< x,y \right> = x^t \Sigma y$.
Returning to the theorem: without loss of generality, we can assume $E(X) = 0$.
$E(X \mid Y) = E(E(X \mid \beta^tX,Y)\mid Y)$
$=E(E(X \mid \beta^tX) \mid Y) \qquad {Y \perp X \mid \beta^t X}$
$= E(P_{\beta}^t(\Sigma)X\mid Y)$ {using above lemma}
$= P_{\beta}^t(\Sigma) E(X \mid Y)$
$=\Sigma P_{\beta}(\Sigma)\Sigma^{-1} E(X \mid Y) \quad {P_{\beta}^t (\Sigma) = \Sigma P_{\beta}(\Sigma) \Sigma^{-1}}$
$\Rightarrow \Sigma^{-1}E(X\mid Y) = P_{\beta}(\Sigma)\Sigma^{-1} E(X\mid Y)$
Therefore, $\Sigma^{-1}E(X \mid Y) \in Span(P_{\beta}(\Sigma)) = Span(\beta) = S_{Y \mid X}$ (taking $\beta$ to be a basis of the central subspace).
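The proof relies on a few facts about $P_\beta(\Sigma)$; the sketch below (with an arbitrary simulated $\Sigma$ and $\beta$ of my own) verifies them numerically: $P_\beta(\Sigma)$ is idempotent, it fixes span($\beta$), and $P_\beta^t(\Sigma) = \Sigma P_\beta(\Sigma) \Sigma^{-1}$.

```python
# Numerical sanity checks of the projection-matrix identities used in the proof.
import numpy as np

rng = np.random.default_rng(3)
p, d = 5, 2
M = rng.standard_normal((p, p))
Sigma = M @ M.T                                  # a positive definite "covariance"
beta = rng.standard_normal((p, d))               # basis of a d-dimensional subspace

P = beta @ np.linalg.inv(beta.T @ Sigma @ beta) @ beta.T @ Sigma   # P_beta(Sigma)

print(np.allclose(P @ P, P))                                       # idempotent
print(np.allclose(P @ beta, beta))                                 # fixes span(beta)
print(np.allclose(P.T, Sigma @ P @ np.linalg.inv(Sigma)))          # P^t = Sigma P Sigma^{-1}
```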
Using this theorem, we can recover at least part of the central subspace: every vector $\Sigma^{-1}(E(X \mid Y) - E(X))$ lies in it.
Accordingly, $\mbox{span}(\Sigma^{-1} cov[E(X \mid Y)]\Sigma^{-1}) \subset S_{Y \mid X}$. We define $cov(E(X\mid Y))$ as $\Lambda_{SIR}$ and $\mbox{span}(\Sigma^{-1}\Lambda_{SIR}\Sigma^{-1})$ as $S_{SIR}$. Notice that in this sense SIR is unbiased: $S_{SIR} \subseteq S_{Y \mid X}$.
The subspace can be computed by solving a GEV (generalized eigenvalue problem), which finds eigenvalues and eigenvectors with respect to a given inner product. The detailed procedure is given in the next section.
Estimation of SIR Coefficients
Let $(X_i,Y_i)$ be a sample from $(X,Y)$, where $X\in R^p$ and $Y\in R$.
Let $J_1, \dots, J_h$ be slice intervals of $Y$ and define $g(Y) = \sum_{i=1}^h i\, I(Y \in J_i)$. Then $g(y)$ returns one of $1,2,\dots,h$ depending on which slice $y$ falls into.
We know that $\Sigma^{-1}(E(X -E(X)\mid g(Y))) \in S_{Y\mid X}$, so $E(X-E(X) \mid g(Y)) \in \Sigma S_{Y \mid X}$.
Then, let $Z= \Sigma^{-1/2}(X-E(X))$ and $\Lambda = Cov[E(Z \mid g(Y))]$
Suppose $rank(\Lambda) = r$, and let $v_1, v_2, \dots, v_r$ be the eigenvectors of $\Lambda$ corresponding to its nonzero eigenvalues.
Then $S_{SIR} = span(\Sigma^{-1/2} \Lambda \Sigma^{-1/2})$, and it is spanned by the back-transformed directions $u_i = \Sigma^{-1/2}v_i$.
As mentioned above, this can be solved as a generalized eigenvalue problem.
$GEV(A,\Sigma)$
$\underset{x}{\max} x^tAx$
$\mbox{subject to } x^t\Sigma x =1$
It can be transformed into the following standard eigenvalue problem.
$GEV(\Sigma^{-1/2}A\Sigma^{-1/2},I)$
$\underset{x}{\max} x^t\Sigma^{-1/2}A\Sigma^{-1/2}x$
$\mbox{subject to } x^tx=1$
Equivalently, the goal is to find the vectors $v$ satisfying the generalized eigenvalue equation $\Lambda_{SIR} v = \lambda \Sigma v$.
Then we can project $X$ onto $S_{SIR}$ via $u_i^t(X-E(X))$.
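A minimal sketch of how this might be computed with SciPy: `scipy.linalg.eigh(A, B)` solves the symmetric-definite generalized problem $Av = \lambda Bv$ directly (the function name and arguments below are placeholders for estimates of $\Lambda_{SIR}$ and $\Sigma$).

```python
# Solve Lambda_SIR v = lambda Sigma v and keep the leading directions.
import numpy as np
from scipy.linalg import eigh

def gev_directions(Lambda_sir, Sigma, d):
    """Top-d generalized eigenpairs of (Lambda_sir, Sigma), normalized so that v^t Sigma v = 1."""
    eigvals, eigvecs = eigh(Lambda_sir, Sigma)   # eigenvalues returned in ascending order
    order = np.argsort(eigvals)[::-1][:d]        # take the d largest
    return eigvals[order], eigvecs[:, order]
```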
Algorithm. Sliced Inverse Regression
- Compute the sample mean and sample variance and standardize the sample:
$\hat{\mu} = E_n(X), \hat{\Sigma} = var_n(X) , Z_i = \hat{\Sigma}^{-1/2}(X_i-\hat{\mu})$
- Approximate $E(Z \mid Y \in J_i)$ by
$E_n(Z \mid Y \in J_i) = \frac{E_n(Z I(Y \in J_i))}{E_n(I(Y\in J_i))}, \quad i=1,\dots,h$
- Approximate $Cov(E(Z\mid g(Y)))$ by
$\hat{\Lambda} = \sum_{i=1}^h E_n(I(Y\in J_i))E_n(Z \mid Y \in J_i)E_n(Z \mid Y \in J_i)^T$
- Let $\hat v_1,\hat v_2 ,\dots,\hat v_r$ be the eigenvectors of $\hat{\Lambda}$ and let $\hat{\beta}_k = \hat{\Sigma}^{-1/2}\hat{v}_k$, $k=1,2,\dots,r$.
The desired predictors are $\hat{\beta}_k^t(X_1-\hat{\mu}),\hat{\beta}_k^t(X_2-\hat{\mu}),\dots,\hat{\beta}_k^t(X_n-\hat{\mu})$.
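Below is a compact implementation sketch of the algorithm (the function and variable names are my own, and equal-frequency slices are used, which is one common choice rather than the only one).

```python
# Sliced Inverse Regression following the steps above.
import numpy as np

def sir(X, Y, n_slices=10, d=1):
    """Return d estimated SDR directions (columns) on the original X scale."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)

    # Step 1: standardize with Sigma^{-1/2} obtained from the eigendecomposition.
    vals, vecs = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    Z = (X - mu) @ Sigma_inv_sqrt

    # Step 2: equal-frequency slices J_1, ..., J_h of Y.
    edges = np.quantile(Y, np.linspace(0, 1, n_slices + 1))
    labels = np.clip(np.searchsorted(edges, Y, side="right") - 1, 0, n_slices - 1)

    # Step 3: Lambda_hat = sum_i p_i m_i m_i^t, with p_i the slice proportion
    # and m_i the slice mean of Z (an estimate of Cov[E(Z | g(Y))]).
    Lambda = np.zeros((p, p))
    for i in range(n_slices):
        in_slice = labels == i
        if in_slice.any():
            m = Z[in_slice].mean(axis=0)
            Lambda += in_slice.mean() * np.outer(m, m)

    # Step 4: leading eigenvectors of Lambda_hat, mapped back to the X scale.
    eigvals, eigvecs = np.linalg.eigh(Lambda)
    V = eigvecs[:, ::-1][:, :d]
    return Sigma_inv_sqrt @ V                    # beta_k = Sigma_hat^{-1/2} v_k
```

On the single-index simulation sketched in the introduction, `sir(X, Y, d=1)` should return a vector close to the true direction up to sign and scale, and the fitted predictors are then obtained as $\hat{\beta}_k^t(X_i - \hat{\mu})$.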
We have now seen the concept of SDR and the SIR method. However, even though SIR is good for estimating the SDR subspace, it has a serious problem.
Namely, if the dependence of $Y$ on $X$ is symmetric about 0 (for example, $Y = (\beta^t X)^2 + \epsilon$ with $X$ symmetric about 0), then $E(X \mid Y) = 0$, so SIR delivers no information about the SDR subspace. This is called the U-shape problem.
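A small demonstration of the failure (a hypothetical model of my own, $Y = X_1^2 + \epsilon$ with standard normal $X$): the within-slice means of $X$ are all close to zero, so the estimated $\hat{\Lambda}$ carries essentially no information about the true direction.

```python
# U-shape failure: E(X | Y) ~ 0 when Y depends on X symmetrically about 0.
import numpy as np

rng = np.random.default_rng(4)
n, p, h = 5000, 5, 10
X = rng.standard_normal((n, p))
Y = X[:, 0] ** 2 + 0.1 * rng.standard_normal(n)          # symmetric in the relevant coordinate

# Equal-frequency slices of Y, as in the SIR algorithm above.
edges = np.quantile(Y, np.linspace(0, 1, h + 1))
labels = np.clip(np.searchsorted(edges, Y, side="right") - 1, 0, h - 1)

# The slice means of X are all near 0, even for the first (truly relevant) coordinate,
# so the SIR matrix is nearly zero and the direction cannot be recovered.
slice_means = np.array([X[labels == i].mean(axis=0) for i in range(h)])
print(np.round(slice_means[:, 0], 2))
```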
This problem can be overcome with other methods such as SAVE, CR, and DR. Before approaching them, we will look at another way to estimate the SDR subspace in the next post.
Debnath, L. & Mikusinski, P. (2005). Introduction to Hilbert Space. London, UK: Elsevier Academic Press.
Li, B. (2018). Sufficient Dimension Reduction. Boca Raton, FL: CRC Press.