Sufficient Dimension Reduction - 2. Advanced Algorithms of SIR
1. Parametric Inverse Regression (PIR)
A. Limit of Sliced Inverse Regression
In the previous method, Sliced Inverse Regression, we sliced Y in order to estimate $E(X \mid Y)$ more accurately.
If we do not slice at all, we get no information, because the dimension of Y is 1 while the dimension of X is p. That is, without slicing we would estimate $E(X \mid Y)$ as $\frac{1}{N_y}\sum E(X\mid Y) P_y \approx E(E(X\mid Y)) = E(X)$, so $E(X \mid Y)$ collapses to a single point estimate and tells us nothing about the connection between Y and X.
On the contrary, if we slice too finely, e.g. with as many slices as samples, the estimate of $E(X \mid Y = y_1)$ becomes extremely unstable. It may carry a lot of information, but most of it is noise lying outside the underlying trend.
Therefore, we have to choose an appropriate number of slices to get a good result, and that number can be hard to determine.
To overcome this limitation, we can apply more general mathematical tools to estimate $E(X \mid Y)$. One of them is Parametric Inverse Regression (PIR).
B. Structure of PIR
Consider a set of functions $\{f_i\}$, $i=1,2,\dots,m$, of y. We estimate $E(X\mid Y)$ using them. That is,
$E(X \mid Y) = \alpha_0 + \alpha_1 f_1(y)+ \alpha_2 f_2(y)+ \dots + \alpha_m f_m(y)$
What is this for? What is the point of this construction?
Estimating X linearly from y alone is often hopeless. As mentioned above, X has dimension p while y has dimension 1, so y is too simple to capture X through a single linear term.
Therefore, we have to expand the complexity of y via a non-linear expansion, and one of the easiest ways is to use $\{f_i\}$. This is why we use the structure above.
One example of $\{f_i\}$ is $\{y, y^2, y^3, y^4, \dots\}$. If you recall linear regression, you may remember that we apply polynomial regression for a better fit: $y = \alpha_0 + \alpha_1 x_1 +\alpha_2 x_2 + \alpha_3 x_1^2 + \alpha_4 x_2^2 + \alpha_5 x_1x_2$. The two methods follow the same logic for better estimation.
Another possible choice is $\{I(y \in J_1), I(y \in J_2), \dots, I(y \in J_h)\}$: indicator functions of regions in the range of y. This is also a non-linear expansion, and in fact it coincides with SIR. Therefore, we can see SIR as a special case of PIR. Moreover, by setting $m=1$ and $f_1(y)=y$, we can connect PIR with OLS too.
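To make this concrete, here is a minimal sketch in numpy of how the feature matrix $F(Y)$ could be built for the two choices above; the helper names `polynomial_basis` and `indicator_basis` are mine, not from the text, and with the indicator basis PIR reduces to the slicing used in SIR.

```python
import numpy as np

def polynomial_basis(y, m=3):
    """F(y) = (y, y^2, ..., y^m): polynomial expansion of the response."""
    y = np.asarray(y).reshape(-1, 1)
    return np.hstack([y ** k for k in range(1, m + 1)])      # n x m matrix

def indicator_basis(y, n_slices=5):
    """F(y) = (I(y in J_1), ..., I(y in J_h)): slice indicators, i.e. the SIR case."""
    y = np.asarray(y)
    edges = np.quantile(y, np.linspace(0, 1, n_slices + 1))  # equal-frequency slices
    idx = np.clip(np.digitize(y, edges[1:-1]), 0, n_slices - 1)
    return np.eye(n_slices)[idx]                             # n x h one-hot matrix
```

Setting m=1 in `polynomial_basis` corresponds to the OLS connection mentioned above.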
The PIR estimator is as follows.
Theorem
$span(\Sigma_{XX}^{-1} Cov(X,F(Y))) \subset S_{Y \mid X}$ with the LCM assumption.
As in SIR, we can find the eigenvectors of this matrix and project the original data onto the SDR space. Now let's prove the theorem.
proof
From the structure of PIR, the objective function is $E(\parallel X - \alpha_0 - B F(Y) \parallel^2)$, whose general solution is $B = Cov(X,F(Y))[Cov(F(Y))]^{-1}$ and $\alpha_0 = E(X) - B E(F(Y))$, by the following lemma.
Lemma. $E(UV^t)E(VV^t)^{-1}= \underset{B}{\arg \min}\ E(\parallel U-BV\parallel^2)$
pf)
Set $B^\ast = E(UV^t)E(VV^t)^{-1}$
$E(\parallel U-BV\parallel^2) = E(\parallel U- B^\ast V + B^\ast V -BV \parallel^2)$
$= E(\parallel U-B^\ast V \parallel^2) + 2tr(E((U-B^\ast V)(B^\ast V - BV)^t)) + E(\parallel B^\ast V - B V\parallel^2)$
The middle term is zero:
$E((U-B^\ast V)(B^\ast V - BV)^t) = E((U-E(UV^t)E(VV^t)^{-1}V)V^t)(B^\ast - B)^t$
$= (E(UV^t)-E(UV^t)E(VV^t)^{-1}E(VV^t))(B^\ast - B)^t = 0$
Therefore $E(\parallel U-BV\parallel^2) = E(\parallel U-B^\ast V \parallel^2) + E(\parallel B^\ast V - BV\parallel^2) \geq E(\parallel U - B^\ast V \parallel^2)$
If $B \neq B^\ast$, then $E(\parallel B^\ast V - BV \parallel^2) > 0$, so $E(\parallel U - BV \parallel^2)$ is strictly larger than $E(\parallel U-B^\ast V \parallel^2)$.
Therefore the minimizer is unique: $B = B^\ast$.
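As a quick numerical sanity check of this lemma (a sketch, not part of the proof; the dimensions and sample size are arbitrary), the moment formula $B^\ast = E(UV^t)E(VV^t)^{-1}$ estimated from data should agree with numpy's least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 2000, 4, 3
V = rng.normal(size=(n, q))                                      # rows are V_i^t
U = V @ rng.normal(size=(q, p)) + 0.1 * rng.normal(size=(n, p))  # rows are U_i^t

# Moment-based solution: B* = E(UV^t) E(VV^t)^{-1}
B_star = (U.T @ V / n) @ np.linalg.inv(V.T @ V / n)

# Direct least squares for min_B E||U - BV||^2 (in row form: U ~ V B^t)
B_ls = np.linalg.lstsq(V, U, rcond=None)[0].T

print(np.allclose(B_star, B_ls))                                 # True up to numerical error
```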
Without loss of generality, we can assume $E(X) = 0$ and $E(F(Y)) = 0$.
Then $Cov(X,F(Y)) = E(X F(Y)^t)$.
We can derive $E(XF(Y)^t) = E(E(XF(Y)^t \mid Y)) = E(E(X \mid Y) F(Y)^t)$, using the tower property of conditional expectation and the fact that $F(Y)$ is a function of Y.
By using the SDR condition $Y \perp X \mid \beta^t X$, we can expand the expression above:
$E(E(X \mid Y) F(Y)^t) = E(E(X \mid \beta^t X , Y) F(Y)^t) = E(E(E(X \mid \beta^t X)\mid Y) F(Y)^t)$
Under the LCM assumption, $E(X\mid \beta^t X) = P_\beta (\Sigma_{XX})^t X$, so substituting this gives
$ E(E(E(X \mid \beta^t X)\mid Y) F(Y)^t) = P_{\beta}(\Sigma_{XX})^{t} E(E(X\mid Y) F(Y)^t) = P_{\beta}(\Sigma_{XX})^t E(X F(Y)^t) = P_{\beta}(\Sigma_{XX})^tCov(X,F(Y))$
As in SIR, we know that $P_\beta(\Sigma_{XX})^t = \Sigma_{XX} P_{\beta}(\Sigma_{XX}) \Sigma_{XX}^{-1}$.
Hence $Cov(X,F(Y)) = \Sigma_{XX} P_{\beta}(\Sigma_{XX})\Sigma_{XX}^{-1} Cov(X,F(Y))$, so $span(Cov(X,F(Y))) \subset \Sigma_{XX}\,span(\beta) = \Sigma_{XX}S_{Y \mid X}$.
Therefore, $span(\Sigma_{XX}^{-1} Cov(X,F(Y))) \subset S_{Y \mid X}$.
C. Estimating PIR coefficient
Now we have $span(\Sigma_{XX}^{-1} Cov(X,F(Y))) \subset S_{Y \mid X}$.
To find eigenvectors spanning $span(\Sigma_{XX}^{-1} Cov(X,F(Y)))$, we can approach the problem more efficiently.
Let's denote $\Sigma_{XF} = Cov(X,F(Y))$ and $\Sigma_{FF} = Cov(F(Y),F(Y))$; then $span(\Sigma_{XX}^{-1}Cov(X,F(Y))) = span(\Sigma_{XX}^{-1}\Sigma_{XF})$.
For any non-singular matrix A, $span(\Sigma_{XX}^{-1}\Sigma_{XF})$ is the same space as $span(\Sigma_{XX}^{-1}\Sigma_{XF} A)$.
We consider two choices of A:
1) $A = \Sigma_{FF}^{-1}$
Then it becomes $span(\Sigma_{XX}^{-1}\Sigma_{XF}\Sigma_{FF}^{-1}) = span(\Sigma_{XX}^{-1} B)$,
where $B$ is the coefficient matrix of the inverse regression $E(X \mid Y) = \alpha_0 + \alpha_1 f_1(y)+ \alpha_2 f_2(y)+ \dots + \alpha_m f_m(y)$.
Therefore, just by fitting this regression, we can find the PIR directions (see the sketch after choice 2 below).
2) $A = \Sigma_{FF}^{-1}\Sigma_{FX} \Sigma_{XX}^{-1}$
This may seem a bit more complex, but it turns the problem into the following:
$span(\Sigma_{XX}^{-1}\Sigma_{XF} \Sigma_{FF}^{-1}\Sigma_{FX} \Sigma_{XX}^{-1})$
which is the same as solving $GEV(\Sigma_{XF} \Sigma_{FF}^{-1} \Sigma_{FX},\Sigma_{XX})$.
In general, regression coefficients are computed through a matrix inverse, which itself requires a decomposition.
So posing the problem directly as a GEV problem can be a more direct way to solve it.
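Below is a minimal sketch of both choices under these formulas (the function names are mine and a cubic polynomial basis is used purely for illustration): choice 1) reads the directions off $\Sigma_{XX}^{-1}B$ after regressing X on $F(Y)$, and choice 2) solves $GEV(\Sigma_{XF}\Sigma_{FF}^{-1}\Sigma_{FX}, \Sigma_{XX})$ with scipy's symmetric generalized eigensolver.

```python
import numpy as np
from scipy.linalg import eigh

def _centered_moments(X, y, m=3):
    """Centered data and sample moments Sigma_XX, Sigma_XF, Sigma_FF for a polynomial basis."""
    y = np.asarray(y)
    n = len(y)
    Xc = X - X.mean(axis=0)
    F = np.hstack([y.reshape(-1, 1) ** k for k in range(1, m + 1)])
    F = F - F.mean(axis=0)
    return Xc.T @ Xc / n, Xc.T @ F / n, F.T @ F / n

def pir_directions_regression(X, y, m=3, d=1):
    """Choice 1): directions from Sigma_XX^{-1} B, B being the inverse-regression coefficient."""
    Sxx, Sxf, Sff = _centered_moments(X, y, m)
    B = Sxf @ np.linalg.inv(Sff)                     # B = Cov(X, F) Cov(F)^{-1}
    M = np.linalg.solve(Sxx, B)                      # Sigma_XX^{-1} B  (p x m)
    U, _, _ = np.linalg.svd(M, full_matrices=False)  # its column space estimates the SDR space
    return U[:, :d]

def pir_directions_gev(X, y, m=3, d=1):
    """Choice 2): leading generalized eigenvectors of Sigma_XF Sigma_FF^{-1} Sigma_FX w.r.t. Sigma_XX."""
    Sxx, Sxf, Sff = _centered_moments(X, y, m)
    Lam = Sxf @ np.linalg.solve(Sff, Sxf.T)          # Sigma_XF Sigma_FF^{-1} Sigma_FX
    w, V = eigh(Lam, Sxx)                            # solves Lam v = lambda Sigma_XX v, ascending
    return V[:, ::-1][:, :d]
```

Both functions target the same space $span(\Sigma_{XX}^{-1}\Sigma_{XF})$; they only differ in how the basis is extracted.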
2. Kernel Inverse Regression
In parametric inverse regression, we expanded the complexity of y by mapping it into a higher-dimensional feature space in order to estimate $E(X\mid Y)$ more accurately.
In kernel inverse regression (KIR), we approach the same problem with another traditional method: kernel smoothing.
Kernel smoothing can be used as a non-parametric predictor of the following form:
$\hat{f}(x) = \frac{E_n(Y\, K(X,x))}{E_n(K(X,x))}$, where $K(X,x)$ is a similarity (kernel) function between X and x, and Y is the response corresponding to X.
Intuitively, it is a similarity-weighted average: nearby points are weighted heavily and distant points are weighted less.
Usually we use the Gaussian kernel $k_b(u) = b^{-1} \exp(-u^2/b)$. The connection between SIR and KIR is similar to the connection between a histogram and a kernel density estimator.
Recall the SIR algorithm: we estimated $\hat{E}_n(Z \mid Y \in J_i) = \frac{E_n(Z\, I(Y \in J_i))}{E_n(I(Y\in J_i))}$.
Using a similar form, KIR estimates $\hat{E}_n(X \mid Y = y) = \frac{E_n(X \cdot k_b(Y-y))}{E_n(k_b(Y-y))}$.
As we can see, this estimate uses the full information in y rather than a coarse, sliced version of it.
With it, we can solve the problem $GEV(Var_n[\hat{E}_n(X\mid Y)], Var_n(X))$.
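A minimal KIR sketch under these formulas (the function name is mine), assuming a Gaussian kernel with a fixed bandwidth b (bandwidth selection is not discussed here): smooth $\hat{E}_n(X\mid Y=y_i)$ at every observed $y_i$, then solve the GEV problem against $Var_n(X)$.

```python
import numpy as np
from scipy.linalg import eigh

def kir_directions(X, y, b=0.5, d=1):
    """Kernel inverse regression: kernel-smoothed E(X | Y = y_i), then GEV against Var_n(X)."""
    y = np.asarray(y)
    n = len(y)
    Xc = X - X.mean(axis=0)
    Sigma_XX = Xc.T @ Xc / n
    K = np.exp(-((y[:, None] - y[None, :]) ** 2) / b)   # Gaussian kernel weights k_b(y_i - y_j)
    W = K / K.sum(axis=1, keepdims=True)                # row-normalize to get weighted averages
    M = W @ Xc                                          # row i estimates E(X | Y = y_i)
    Lam = M.T @ M / n                                   # Var_n of the smoothed inverse regression
    w, V = eigh(Lam, Sigma_XX)                          # GEV(Var_n[E_n(X|Y)], Var_n(X))
    return V[:, ::-1][:, :d]
```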
In this posting, we have seen advanced approaches to the SIR problem. The purpose of PIR and KIR is to estimate $E(X \mid Y)$ better. However, even when $E(X\mid Y)$ can be estimated perfectly, there are cases in which we still cannot find the whole SDR subspace from it.
3. Sliced Average Variance Estimate
A. Derivation of SAVE subspace
Sliced Average Variance Estimate (SAVE) was developed to overcome this limit of SIR. As I mentioned in the previous posting, the SIR algorithm cannot work when Y depends on $\beta^t X$ symmetrically, so that $E(X \mid Y)$ stays at $E(X)$ and carries no directional information.
SAVE overcomes this problem with the additional information in $E(XX^t \mid Y)$, or equivalently $Var(X \mid Y)$. Because of this, SIR is called a first-order method and SAVE a second-order method.
SAVE needs another assumption: constant conditional variance with respect to $\beta$, denoted $CCV(\beta)$, which means that $Var(X \mid \beta^t X)$ is a non-random constant matrix. This holds, for example, when X follows a multivariate normal distribution.
With the CCV and LCM assumptions, we can establish the following lemma.
Lemma
If $X \in R^p$ satisfies $LCM(\beta)$ and $CCV(\beta)$, then $Var(X \mid \beta^t X) = \Sigma Q_{\beta}(\Sigma)$ where $Q_{\beta}(\Sigma) = I - P_{\beta}(\Sigma)$
Notice that in this case $Q_{\beta}(\Sigma)$ is the projection onto the complement of $span(\beta)$ with respect to the $\Sigma$ inner product; indeed, $Q_{\beta}(\Sigma)\beta = \beta - \beta(\beta^t\Sigma\beta)^{-}\beta^t\Sigma\beta = 0$.
This gives the intuition that $Var(X \mid \beta^t X)$ captures the part of the space complementary to the SDR space.
pf)
$Var(X) = E(Var(X \mid \beta^t X ))+Var(E(X \mid \beta^t X ))$
$= Var(X\mid \beta^t X ) + Var(P_{\beta}(\Sigma)^tX)$, using the SIR result $E(X \mid \beta^t X) = P_{\beta}(\Sigma)^t X$ and the CCV assumption.
$= Var(X\mid \beta^tX) + P_{\beta}(\Sigma)^t\Sigma P_{\beta}(\Sigma)$
$=Var(X \mid \beta^t X) + \Sigma P_{\beta}(\Sigma)$
(The simplification of the last term in each line follows from a direct calculation after substituting $P_{\beta}(\Sigma) = \beta (\beta^t \Sigma \beta)^{-} \beta^t\Sigma$.)
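For example, the step $P_{\beta}(\Sigma)^t \Sigma P_{\beta}(\Sigma) = \Sigma P_{\beta}(\Sigma)$ works out as
$P_{\beta}(\Sigma)^t \Sigma P_{\beta}(\Sigma) = \Sigma\beta (\beta^t \Sigma \beta)^{-}\beta^t \,\Sigma\, \beta (\beta^t \Sigma \beta)^{-} \beta^t\Sigma = \Sigma\beta (\beta^t \Sigma \beta)^{-} \beta^t\Sigma = \Sigma P_{\beta}(\Sigma)$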
Rearranging gives $Var(X\mid \beta^t X) = \Sigma (I-P_{\beta}(\Sigma)) = \Sigma Q_{\beta}(\Sigma)$, which is the desired result.
With the above lemma, we can derive the following theorem.
Theorem
Suppose $\beta$ is a basis matrix for $S_{Y \mid X}$ and X satisfies $LCM(\beta)$ and $CCV(\beta)$. Then $span(\Sigma - Var(X \mid Y)) \subset \Sigma S_{Y \mid X}$.
This theorem is what defines the SAVE subspace for SDR.
It can also be cast as the GEV problem $GEV(\Sigma - Var(X\mid Y), \Sigma)$.
pf)
$Var(X \mid Y) = E(Var(X \mid \beta^t X ,Y)\mid Y) + Var(E(X \mid \beta^t X , Y )\mid Y)$
$=E(Var(X\mid \beta^t X) \mid Y) + Var(E(X\mid \beta^t X) \mid Y)$, using the SDR property $Y \perp X \mid \beta^t X$
$= E(\Sigma (I-P_{\beta}(\Sigma))\mid Y) +Var(P_{\beta}(\Sigma)^t X \mid Y)$
$= \Sigma-\Sigma P_{\beta}(\Sigma) + P_{\beta}^t(\Sigma) Var(X \mid Y) P_{\beta}(\Sigma)$
Then $\Sigma - Var(X \mid Y) = \Sigma P_{\beta}(\Sigma) \Sigma^{-1} (\Sigma - Var(X\mid Y))P_{\beta}(\Sigma)$, using $P_{\beta}(\Sigma)^t = \Sigma P_{\beta}(\Sigma) \Sigma^{-1}$.
This means $span(\Sigma - Var(X \mid Y)) \subset span(\Sigma P_{\beta}(\Sigma)\Sigma^{-1}) = \Sigma\, span(\beta) = \Sigma S_{Y\mid X}$.
Transformation to an easier problem
Although $GEV(\Sigma - Var(X\mid Y), \Sigma)$ is already a fairly direct formulation, it can be transformed into an easier one.
First, standardize $Z = \Sigma^{-1/2} (X-\mu)$; then $span(\Sigma - Var(X\mid Y)) \subset \Sigma S_{Y \mid X}$ becomes $span(I-Var(Z \mid Y)) \subset S_{Y \mid Z}$.
Let $\Lambda_{SAVE} = E[(I- Var(Z\mid Y))^2]$; then $span(I-Var(Z \mid Y)) \subset S_{Y \mid Z}$ a.s. $\iff span(\Lambda_{SAVE}) \subset S_{Y \mid Z}$ by the following lemma.
Lemma. If U is a random matrix whose entries are square integrable, and S is a subspace of $R^p$, then $span(U) \subset S \quad a.s.P \iff span(E(U U^t)) \subset S$
Then we can recover $\Sigma^{-1/2} span(\Lambda_{SAVE}) \subset S_{Y \mid X}$.
We can use this result to implement the SAVE algorithm.
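A minimal SAVE sketch following this recipe (the function name is mine; slicing Y as in SIR is one common way to estimate $Var(Z\mid Y)$, and the number of slices is a tuning parameter): standardize, average the squared matrices $I - Var(Z\mid Y\in J_s)$ over slices, take the leading eigenvectors of $\Lambda_{SAVE}$, and map back by $\Sigma^{-1/2}$.

```python
import numpy as np
from scipy.linalg import sqrtm

def save_directions(X, y, n_slices=5, d=1):
    """SAVE: eigenvectors of E[(I - Var(Z|Y))^2] on the standardized scale, mapped back by Sigma^{-1/2}."""
    n, p = X.shape
    Sigma = np.cov(X, rowvar=False)
    Sigma_inv_sqrt = np.linalg.inv(sqrtm(Sigma)).real
    Z = (X - X.mean(axis=0)) @ Sigma_inv_sqrt            # standardized predictors
    edges = np.quantile(y, np.linspace(0, 1, n_slices + 1))
    idx = np.clip(np.digitize(y, edges[1:-1]), 0, n_slices - 1)
    Lam = np.zeros((p, p))
    for s in range(n_slices):
        Zs = Z[idx == s]
        if len(Zs) < 2:
            continue
        A = np.eye(p) - np.cov(Zs, rowvar=False)         # I - Var(Z | Y in slice s)
        Lam += (len(Zs) / n) * (A @ A)                   # slice-weighted average of squares
    w, V = np.linalg.eigh(Lam)                           # eigenvectors of Lambda_SAVE (ascending)
    return Sigma_inv_sqrt @ V[:, ::-1][:, :d]            # back to the X scale: Sigma^{-1/2} v
```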
B. More about SAVE
Connection between SIR and SAVE
In addition to sharing the same target, they have another important relation: even if the LCM and CCV assumptions do not hold, $S_{SIR} \subset S_{SAVE}$. This can be proved as follows.
pf)
$S_{SIR} = \Sigma^{-1/2} span(Cov(E(Z\mid Y))), \quad S_{SAVE} = \Sigma^{-1/2} span(E[(I-Var(Z\mid Y))^2])$
$Var(Z) = E(Var(Z \mid Y)) + Var(E(Z \mid Y))$
$\Rightarrow Var(E(Z\mid Y)) = I-E(Var(Z\mid Y)) = E(I-var(Z\mid Y))$
Claim: $span(E(U)) \subset span(E(UU^t))$.
This is equivalent to $span(E(UU^t))^{\perp} \subset span(E(U))^{\perp}$.
If $b \in span(E(UU^t))^{\perp}$, then $b^t E(UU^t) b = E(\parallel U^t b\parallel^2) = 0$, so $U^t b = 0$ a.s. $P$, i.e. $b \in span(U)^{\perp}$ a.s. This gives $b^t E(U) = E(b^t U) = 0$.
$\therefore b \in span(E(U))^{\perp}$
Applying the claim with $U = I-Var(Z\mid Y)$ (which is symmetric), $span(Var(E(Z \mid Y))) = span(E(I-Var(Z\mid Y))) \subset span(E[(I-Var(Z\mid Y))^2])$,
which implies $S_{SIR} \subset S_{SAVE}$
This result shows why SAVE can find basis directions of the SDR subspace that cannot be found with SIR.
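To illustrate, here is a small simulation sketch (the model $Y = (\beta^t X)^2 + \varepsilon$ with $\beta = e_1$ and standard normal X is chosen only as an example of the symmetric case described earlier): the first-order SIR candidate matrix carries essentially no signal, while the SAVE candidate matrix clearly recovers the direction.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, h = 5000, 5, 5
X = rng.normal(size=(n, p))                      # Sigma = I, so X is already standardized
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=n)      # Y depends on beta^t X symmetrically, beta = e_1

edges = np.quantile(y, np.linspace(0, 1, h + 1))
idx = np.clip(np.digitize(y, edges[1:-1]), 0, h - 1)

M_sir = np.zeros((p, p))                         # SIR candidate:  Cov(E(Z|Y)) from slice means
M_save = np.zeros((p, p))                        # SAVE candidate: E[(I - Var(Z|Y))^2] from slices
for s in range(h):
    Xs = X[idx == s]
    w = len(Xs) / n
    m = Xs.mean(axis=0) - X.mean(axis=0)
    A = np.eye(p) - np.cov(Xs, rowvar=False)
    M_sir += w * np.outer(m, m)
    M_save += w * (A @ A)

print("top SIR  eigenvalue:", np.linalg.eigvalsh(M_sir)[-1])   # near zero: no first-order signal
print("top SAVE eigenvalue:", np.linalg.eigvalsh(M_save)[-1])  # clearly positive
v_save = np.linalg.eigh(M_save)[1][:, -1]
print("|SAVE direction . e_1|:", abs(v_save[0]))               # close to 1
```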
Exhaustiveness condition for SAVE
There is a condition under which the SAVE subspace is exhaustive for the SDR subspace. Because the SAVE subspace is always a subset of the SDR subspace, exhaustiveness of SAVE means Fisher consistency of SAVE.
Theorem
Suppose $LCM(\beta)$, $CCV(\beta)$, $E(Z)=0$, and $Var(Z)=I$ hold. If for each $v \in S_{Y\mid Z}$ at least one of the random variables $E(v^t Z \mid Y)$ and $Var(v^t Z \mid Y)$ is non-degenerate, then $span(E[(I - Var(Z \mid Y))^2]) = S_{Y\mid Z}$.
In this theorem, "degenerate" means almost surely constant. So it states that if, for every direction $v$ in the SDR space, at least one of $E(v^t Z \mid Y)$ and $Var(v^t Z \mid Y)$ actually varies with Y, then the SAVE subspace is Fisher consistent.
This is quite a mild condition, so we can expect $S_{SAVE} = S_{Y\mid X}$ in most cases.
Debnath, L. & Mikusinski, P. (2005). Introduction to Hilbert Space. London, UK: Elsevier Academic Press.
Li, B. (2018). Sufficient Dimension Reduction. Boca Raton, FL: CRC Press.