The random matrix is literally the matrix whose entries are random variables. It has various application including high dimensional statistics, information theory and quantum dynamics. Moreover, according to [Jinho Baik et al (2016) Combinatorics and Random Matrix Theory], it is also connected with combinatorics.

Therefore, I believe that if I study the random matrix theory, I can get new intuition about my current fields of interest such as dimensional reduction and quantum computing. Maybe it could help to study combinatoric optimization too.

Text books about Random Matrix Theory written by Jinho Baik target for experts on mathematics or physics. So I will read them after I become familiar with the theory. I’ll use the text book [Terence Tao (2012) Topics in Random Matrix Theory]. It is more appropriate for me in that it handles the random matrix theory in statistical background.

The first chapter of the book contains probability theory in purer fields. So it can help me to get familiar with purer concepts of probability theory such as “almost sure” or “Probabilistic concepts”. Let’s start it.

Ch1. Probability Theory

A. Probabilistic Concepts

In the construction level, the probability theory could be viewed as theory about measure space whose total measure is one.

We can construct the probability space as triplet \((\Omega,\mathcal{B},P)\).

Sample Space \(\Omega\): A set containing all the possible outcomes of random situation where one is concerning.

Sigma Algebra \(B\) of \(\Omega\): A set of subsets of the sample space. There are several rules in sigma algebra (omitted)

Event: The element of the sigma algebra

Probability Measure \(P\): A function assigning real numbers between 0 and 1 to each event.

Remark that the elements of the sigma algebra is called as measurable sets at text book of Real Analysis but in the context of probability theory, it is specially called as events. An element of the sample space is usually denoted as \(\omega\).

For example, suppose there is a coin and we are studying the pattern of head and tail of the coin. Then the sample space is \(\{ t,h\}\). And sigma algebra containing every subsets is \(\{\phi,\{t\},\{h\},\{t,h\}\}\). Then the function \(f\) mapping each element of sigma algebra to real value is probability measure.

We can extend the probability space \((\Omega, \mathcal{B}, P)\) into another probability space \((\Omega',\mathcal{B}',P')\) if there is a surjective mapping \(\pi : \Omega \rightarrow \Omega'\) which is measurable i.e. \(\forall E \in \mathcal{B}', \pi^{-1}(E) \subset \mathcal{B}\) and preserves probability i.e. \(\forall E \in \mathcal{B}', P(\pi^{-1}(E))=P(E)\). The author of the book states that there is a important dogma: “probability theory is only ‘allowed’ to study concepts and perform operations which are preserved with respect to extension of the underlying sample space.” In this manner, the probability use different notations with usual set theory.

\(E \lor F:\) The union of events \(E\) and \(F\). The event that at least one of \(E\) and \(F\) holds

\(E \land F :\) The intersection of events \(E\) and \(F\). The event that \(E\) and \(F\) both hold.

\(\overline{E} :\) The event that \(E\) does not hold or the event that \(E\) fails.

\(E \subset F :\) E is contained at F.

This notation is similar with logical operator of boolean algebra.

To construct analytic structure of the probability theory, we need special boundary theory.

\(X=O_k(Y) :\) If \(\mid X \mid \leq c_kY\) for some scalar \(c\) depending on some parameter \(k\).

\(X = \Omega_k(Y)\) : If \(\mid X \mid \geq c Y\) for some scalar \(c\) depending on some parameter \(k\).

\(X = o_k(Y):\) If \(\mid X \mid \leq c_k Y\) for some \(c\) that goes to zero as \(k\) goes to infinity.

In deterministic theory, each proposition is one of two: True or False. However, because things in probability theory are handled in random situation, it is hard to say that some properties holds 100% true or false. Therefore, we use true and false in probabilistic sense as follow.

Logical statement in probabilistic sense.

  1. Sure: A proposition holds surely if the event of the proposition equals with total sample space.
  2. Almost Sure: A proposition holds almost surely if the probability of the event of the proposition is one. That is, the complement of the event is measure zero.
  3. Overwhelming probability : A proposition holds with overwhelming probability if for every fixed \(\epsilon >0,\) the proposition holds with the probability Overwhelming probability : A proposition holds with overwhelming probability if for every fixed \(\epsilon >0,\) the proposition holds with the probability \(1-O_\epsilon(n^{-\epsilon})\). (i.e. the probability that the event does not occur is below \(c_\epsilon n^{-\epsilon}\))
  4. High probability : A proposition holds with high probability if for some \(\epsilon >0,\) the proposition holds with the probability \(1-O(n^{-\epsilon})\). (i.e. the probability that the event does not occur is below \(c n^{-\epsilon}\))
  5. Asymptotically almost surely : A proposition holds asymptotically almost surely if it holds with the probability \(1- o(1)\) (i.e. the probability that the event does not occur goes to zero as \(n\) goes to infinity)

We can see that except (5), the higher statement is stronger than lower one.

2. Random Variable

Even though the structure of probability theory is quite realistic and theoretically solid at the same time, because the event, the basic component of probability theory, can only be at two states, the theory is lack of flexibility. To supplement such lack of flexibility, we can use function mapping events into more general concepts. This is called as a random variable.

Definition [Random Variable]

Let \(R = (R,\mathcal{R})\) be a measurable space. A Random Variable is a measurable mapping from the sample space to \(R\), i.e. \(X: \Omega \rightarrow R\) such that \(X^{-1}(S)\) is measurable set for \(S \subset \mathcal{R}\).

The most usual target space \(R\) is a real line \(\mathbb{R}\), but we can use much more various target space. The subclass of the random variable could be defined by the type of the target space.

Subclass of the random variable

  1. Discrete Random Variables : If the target space \(R\) is countable and \(\mathcal{R}\) is \(2^R\) which means set of every possible subsets of a countable set. Particularly, if the target space is \(R = \{0,1\}\), we call the random variable as Boolean and if the target space consists with just one element, \(R = \{c\}\), then we call it as deterministic.
  2. Real-valued Random Variable : If the target space \(R\) is the real line \(\mathbb{R}\) and \(\mathcal{R}\) is Borel sigma algebra.
  3. Complex-valued Random Variable : If the target space \(R\) is the complex plane \(\mathbb{C}\) and \(\mathcal{R}\) is Borel sigma algebra
  4. Function of random variable: For a measurable function \(f\), a function onto random variable \(f(X)\) is definitely random variable by definition of measurable function and random variable. \(i.e. X^{(-1)}\circ f^{-1}(S)\) is definitely measurable for measurable set of domain of function.
  5. Joint Random Variable : By joining several random variable and making set as \((X_\alpha)_{\alpha \in A}\). Then the target space become cartesian product of each target space \(\bigotimes_{\alpha \in A} R_{\alpha}\). There is no restriction on it that we can design various joining random variable such as \(Z = (C,D,X) , R_Z = R_C \otimes R_D \otimes R_X = \{\mbox{head},\mbox{tail}\} \otimes \{1,2,3,4,5,6\}\otimes \mathbb{R}\).
  6. Vector-valued Random Variable : It the target space \(R\) is the finite dimensional vector space \(\mathbb{R}^d\) or \(\mathbb{C}^d\).
  7. Matrix-valued Random Variable : If the target space \(R\) is a matrix space \(M_{p \times q}(\mathbb{R})\) or \(M_{p \times q}(\mathbb{C})\).

Each random variable \(X\) has its on distribution \(\mu_X\) defined as \(\mu_{X} (S) := P(X \in S)\). The important fact is every random variable has its own distribution and every kind of distribution can be connected with random variable. Therefore, we can regard two concepts has unique connection that if we know all about the random variable, it means we know the distribution of it and vice versa.

If the random variable is discrete, then the distribution could be expressed as \(\mu_X(S) = \sum_{x \in S} p_x, \mbox{ (where } p_x := P(X=x))\). There are famous discrete random variables.

Discrete Random Variable

  1. Dirac distribution \(\delta_a\) : For \(x=a,p_x = 1\) and for \(x \neq a , p_x =0\).
  2. Discrete Uniform Distribution : The target space \(R\) is countable and \(p_x = 1/ \mid R \mid\) for all \(x \in R\).
  3. (Unsigned) Bernoulli Distributions : The target space \(R\) is \(\{0,1\}\) and \(p_1 = p, p_0 =1-p\)
  4. Signed Bernoulli Distributions : The target space \(R\) is \(\{-1,1\}\) and \(p_+ = p_- =1/2\)
  5. Lazy signed Bernoulli Distribution : The target space \(R\) is \(\{-1,0,1\}\) and \(p_+ = p_- =\mu/2\) and \(p_0 = 1-\mu\)
  6. Geometric Distribution : The target space \(R\) is \(\{0,1,2,...\}\) and \(p_k = (1-p)^kp\).
  7. Poisson Distribution : The target space \(R\) is \(\{0,1,2,...\}\) and \(p_k = \frac{\lambda^k e^{-\lambda}}{k!}\).

A random variable is continuous random variable if \(P(X=x) =0\) for all \(x \in R\).

Assume that the target space \(R\) has a reference measure \(dm\) such as Lebesgue measure on Euclidean space. By Radon-Nikodym Theorem, there always exists a non-negative absolutely integrable function \(f \in L^1(R,dm)\) with \(\int_R f dm =1\) such that \(\mu_X(S)=\int_S f dm\).

That is, every distribution function could be rewritten by specific measurable function defined on the target space. Such \(f\) is called as probability density function.

For the real line, the distribution function could be expressed as Cumulative Distribution Function like \(F_X(x):= P(X \leq x) = \mu_X((\infty, x])\) and derivative of it is the probability density function.

Continuous Random Variable

  1. Uniform Distributions : \(f(x) := \frac{1}{m(E)}I_E(x)\).
  2. Real Normal Distribution : \(f(x) := \frac{1}{\sqrt{2 \pi \sigma^2}} \exp(-(x-\mu)^2/2\sigma^2)\)
  3. Complex Normal Distribution : \(f(x) := \frac{1}{\pi \sigma^2} \exp(-\mid x-\mu \mid^2/\sigma^2)\)

These are all well-known distributions. In statistics, we also handle other useful continuous distributions such as Exponential distribution, Chi-square Distribution, Gamma Distribution, Beta Distribution etc. Notice that the complex version normal distribution has mean as complex and its shape differs with real normal distribution.

3. Expectation

For each random variable, we are interested at the expectation or mean value of the random variable. For the nonnegative random variable \(X\), the expectation is defined as \(EX := \int_{0}^\infty x d \mu_X(x)\).


The term could be rewritten as \(EX = \int_{0}^{\infty} P(X \geq \lambda)d\lambda\) by Foubini - Tonelli Theorem which state that \(\int_{A \otimes B} \mid f(x,y) \mid d (x\otimes y) = \int_B\int_A \mid f(x,y) \mid dxdy = \int_A\int_B \mid f(x,y) \mid dydx\). If the function is nonnegative, then we can use the theorem.

\[x= \int_0^{x} dx = \int_0^{\infty}I(\lambda\leq x) d\lambda\] \[EX = \int_0^\infty [\int_0^{\infty}I(\lambda\leq x)dt] d\mu = \int_0^{\infty}\int_{0}^{\infty} I(\lambda\leq x) d\mu d\lambda = \int _{0}^{\infty} Pr(X\geq \lambda) d \lambda\]

We can summarize the property of expectation as follow.

Absolutely Integrable: The random variable is said to be absolutely integrable if \(E \mid X \mid < \infty\). Because \(\mid\int f(x)d\mu\mid \leq \int \mid f(x)\mid d \mu\), \(\mid EX \mid \leq E\mid X \mid < \infty\) holds, so absolutely integrable is stronger conditions then just integrable.

Expectation of Random Vector : For multi-valued random variable \(X = \{X_{ij}\},E(X) = \{E(X_{ij}\}\).

Linearity : If each random variable \(X_i\) is absolutely integrable, then for finite scalars \(c_i, \sum_{i=1}^k c_i X_i\) is also absolutely integrable and \(E(\sum_{i=1}^k c_i X_i) = \sum_{i=1}^k c_iE(X_i)\). By using the Fubini-Tonelli theorem, we can expand the k into infinity.

Markov Inequality : It holds that \(Pr(\mid X\mid \geq \lambda) \leq \frac{E(\mid X \mid)}{\lambda}\). It states the relation between distribution of \(X\) and its expectation

Expectation of function : For a function of the random variable \(f(X)\), \(Ef(X) = \int f(X) d\mu\).

In statistics, there are important expectation values of \(f(X)\) called as Moments

Variance : \(Var(X) := E\mid X - E(X) \mid ^2\)

k-th moments : \(E\mid X \mid ^k\)

Exponential moments : \(Ee^{tX}\)

Fourier Moments : \(Ee^{itX}\) or \(E e^{it \cdot X}\) for vector valued random variable.

Resolvents : \(E\frac{1}{X-z}\) for complex \(z\).

Negative Moments : \(E\mid X \mid ^{-k}\)

**Zeroth Moment **: \(E \mid X \mid ^0 = P(X \neq 0)\)

Variance is another important statistical component of variable. By applying them in Markov’s inequality, we can derive \(E( \mid X -E(X) \mid \geq \lambda) \leq \frac{Var(X)}{\lambda^2}\) called as Chebyshev’s inequality.

By differentiate the exponential moments k times, we can derive k-th moment. So it also called as Moment Generating Function(MGF).

4. Tail Behavior of Distribution

T ail behavior of Distribution is important topic of analytic statistics. Because the value of probability on total sample space have to be one, measure of some subset of the sample space have to be zero or at least closed to zero. For example, the value of pdf of normal distribution converges to zero as going to tail.

But the important fact is that some distribution such as uniform distribution on \([0,1]\), have the region of measure zero, but other distribution such as normal distribution, don’t have the measure zero region. It is also connected to the boundary of the random variable in that the uniform distribution is completely bounded by [0,1] but the normal distribution is unbounded.

Therefore, the speed of the distribution converging to zero is one of the important considerations at mathematical statistics.

The canonical classification of the speed is follow.

1. Surely Bounded : The random variable is said to be surely bounded if \(\exists M>0\) such that \(\mid X \mid \leq M\) surely.

2. Almost Surely Bounded : The random variable is said to be almost surely bounded if \(\exists M>0\) such that \(\mid X \mid \leq M\) almost surely.

3. sub-Gaussian : The random variable is said to be sub-Gaussian, if there exists \(C,c> 0\) such that \(P(\mid X \mid \geq \lambda) \leq C exp(-c\lambda^2) = O(exp(\lambda^2)) , \forall \lambda >0\).

4. sub-Exponential : The random variable is said to have sub-exponential tail, if there exists \(C,c>0,a\geq1\) such that \(P(\mid X \mid \geq \lambda) \leq C \exp (- c \lambda^a)=O(\exp ( \lambda^a), \forall \lambda >0\).

5. Finite k-th Moment: The random variable is said to have finite k-th moment for some \(k \geq 1\) if there exists \(C\) such that \(E \mid X \mid ^k \leq C\)

6. Absolutely Integrable: The random variable is said to be absolutely integrable if \(E \mid X \mid < \infty\).

7. Almost surely finite: The random variable is said to be almost surely finite if \(\mid X \mid < \infty\) almost surely.

8. Heavy-tailed Distribution: The random variable is said to be heavy tailed distribution if \(\mid X \mid <\infty\) but \(E \mid X \mid > \infty\).

Similar with Logical statement in probabilistic sense, the upper statement implies the lower statement except 8.


pf)

1)->2) If P holds surely, it also definitely holds almost surely.

2)->3) If X is almost surely bounded, then \(Pr( \mid X \mid >M) =0\). So for \(\forall \lambda >M\), \(\exists C,c >0\) such that \(0 \leq C\exp(-c \lambda^2)\). For small finite \(\lambda\), we can also find \(C,c>0\) such that \(1 \leq C\exp(-c \lambda^2)\) to naturally make the inequality holds.

3)->4) If \(X\) is sub-Gaussian, the factor \(c\) of sub exponential tail become 2.

4)->5)

\[E \mid X \mid ^k = \int_0^{\infty} Pr(\mid X \mid ^k \geq \lambda) d\lambda = \int_{0}^{\infty} Pr( \mid X \mid \geq \lambda^{1/k}) d\lambda\] \[Pr( \mid X \mid \geq \lambda^{1/k}) \leq C \exp ( - c \lambda^{\frac{a}{k}}) \leq C \exp(-c \lambda)\]

For second inequality, we can find k which holds the inequality.

\[E \mid X \mid ^k \leq \int_{0}^{\infty} Pr(\mid X \mid \geq \lambda^{\frac{1}{k}}) d\lambda \leq \int_{0}^{\infty} C exp(-c \lambda) d \lambda = \frac{C}{c} < \infty\]

5)->6)

Actually, the \(E\mid X \mid^p\) is Lp norm using probability measure. There is a famous fact that if measurable space does not contain sets of finite but arbitrarily large measure, \(L^p \subset L^q\) for \(q<p\). It directly implies that if \(E \mid X \mid ^p <\infty\) implies \(E \mid X \mid ^q <\infty\). Therefore, if one of \(k\geq 1\) holds, then \(E \mid X \mid < \infty\) holds

6)->7)

We can use the Markov inequality \(Pr( \mid X \mid \geq \lambda) \leq \frac{E\mid X \mid}{\lambda} < \infty\). It means that as \(\lambda\) goes to infinity, it shrinks to zero. Therefore, it approximately almost surely bounded at \(\lambda\) and almost surely bounded at infinity.


If we have finite k-th moment i.e.\(E \mid X \mid ^k \leq C\), then by using the Markov inequality, we can control the tail behavior of the random variable as \(P (\mid X \mid \geq \lambda) \leq C \lambda^{-k}\).

Note that there could be a distribution called as heavy tailed distribution which is almost surely finite but the expectation is not finite. It states that the tail behavior of such distribution do not converge easily. The most famous heavy tailed distribution is Cauchy Distribution \((f(x) = \frac{1}{ \pi} \frac{\gamma}{(x- x_0)^2 + \gamma^2})\).

5. Joint Behavior of random variables

As mentioned above, if we know the distribution function of some random variable, then we can know all the behavior and information single random variable has. However, to know such information of random families \(\{X_\alpha\}_{\alpha \in A}\), we have to know more than just distribution of each variable. To get it, we have to know joint behavior of the family.

Definition

Joint Distribution : If \(X_\alpha\) is a random variable defined by a function \(X_\alpha : \Omega \rightarrow R_\alpha\), then the family \(\{X_\alpha\}_{\alpha \in A}\) of random variables is defined by a function \(\{X_\alpha\}_{\alpha \in A} : \Omega \rightarrow \underset{\alpha \in A}{\bigotimes} R_{\alpha}\).

Therefore, to know the probabilistic behavior of multiple random variable, one have to know the relation or connection among the random variables.

The simplistic and most tractable connections among random variable are independence i.e. there is no connection among them and each random variable moves independently.

Definition

Joint Independence : A family \(\{X_\alpha\}_{\alpha \in A}\) of random variables is said to be jointly independent if the distribution of the family is the product measure of the distribution of the individual \(X_\alpha\).

Pairwise Independence : A pair \(\{X_\alpha ,X_\beta\}\) of random variables is said to be pairwise independent if they are jointly independent.

There is a important property of independence.

Union and Intersection probability : If \(E_1,E_2,...E_k\) are jointly independent events, then \(Pr(\land_{i=1}^k E_i) = \prod_{i=1}^kPr(E_i)\) and \(Pr( \lor_{i=1}^{k} E_i) = 1-\prod_{i=1}^{k}(1-Pr(E_i))\).

Expectation : If \(X_1,X_2,...X_k\) are jointly independent absolutely integrable random variables, then \(E\prod_{i=1}^k X_i = \prod_{i=1}^k EX_i\)

Variance : If \(X_1,X_2,...X_k\) are pairwise independent absolutely square integrable random variables, then \(Var(\sum_{i=1}^k c_i X_i) = \sum_{i=1}^k \mid c_i \mid ^2 Var(X_i)\)

The concepts in the probabilistic theory always have to be handled at probabilistic situation. Therefore, it is hard to use the deterministic techniques or property at the probability space.

For example, a simple regression problem \(Y \sim X\) confronts the difference that we want to see the expectation of Y of some fixed explainable variable \(X\). However, because the \(X\) is also probabilistic, it is hard to perfectly construct the concepts in the full-probabilistic world.

To overcome this difficulty, the statistics have eliminated portion of randomness using the concepts of Conditioning. This is done by assumptions that some portion of random variable are realized. By doing this, it makes the partial randomness of the probability space into deterministic concepts. In this manner, the conditioning is also called as freezing.

Definition

Conditioning on an event : By conditioning on the some event E, we’ll replacing the sample space \(\Omega\) with the subset \(\Omega_{E}\) where event \(E\) holds i.e. \(\Omega_{E} = \{\omega \in \Omega \mid \mbox{ the event }E \mbox{ holds at } \omega\}\), and replacing the underlying probability measure \(P\) by the conditioning probability measure \(P(\cdot\mid E) := P(\cdot \land E) / P(E)\). That is, the probability space \((\Omega , B ,P)\) is replaced as conditioned probability space \((\Omega_{E},B_E,P(\cdot \mid E))\).

The conditional distribution \(\mu_{( X \mid E)}\) is defined at conditioned probability space. If two events \(E_1\) and \(E_2\) are independent, then \(P(E_1 \mid E_2) = P(E_1)P(E_2)/P(E_2) =P(E_1)\). Likewise, \(E(X \mid Y) =E(X)\).

Because the conditioning means we’ll handle the random variable deterministic, we can observe the equation \(E(E(XY \mid Y))=E(Y E(X\mid Y))\)

If we have the information of complementary conditioned probability space \((\Omega_{\bar{E}}, B_{\bar{E}},P(\cdot \mid \bar{E}))\), we can reverse the procedure i.e. unconditioning and restore the original space as follow.

Unconditioning the event :

Decomposition of probability : \(P(F) = P(F \mid E) P(E) + P(F \mid \bar{E})P(\bar{E})\)

Decomposition of distribution : \(\mu_X = \mu_{X \mid E} P(E) + \mu_{X \mid \bar E} P(\bar E)\)

Decomposition of Expectation : \(EX = E(X \mid E) P(E)+P(X \mid \bar{E})P(\bar{E})\)

Disintegration : \(P(F) = E(P(F\mid Y))\)

Iterative Expecation : \(E(X) = E(E(X \mid Y))\)

By using the concepts, we can derive more general equation \(E(f(X)g(Y))= E(E(f(X)g(Y) \mid Y)) = E(E(f(X)\mid Y) g(Y))\).

We can regard the equation as stepwise expectation in that it first fix the \(Y\), calculate the expectation over \(X\) and then calculate the expectation over \(Y\).

6. Convergence of Probability Space

In the analysis, the convergence of the sequence is defined that scalar sequence \(x_n\) converges into \(x\) if \(\forall \epsilon>0, \exists N \in \mathbb{N} \mbox{ such that }\forall n \geq N, d(x_n,x) < \epsilon\). This definition is defined at metric space because it uses metric \(d(\cdot,\cdot)\).

Because the main tool of the probability space is random variable, we are interested in the convergence of random variable. But because the random variable is defined in the probabilistic concepts, we have to use little different definition of the convergence as follow.

Definition Suppose that each random variable has the \(\sigma\)-compact metric space as target space.(A space is \(\sigma\)-compact if it is countable union of compact sets)

1. Almost Sure Convergence : \(X_n\) converges almost surely to \(X\), if for almost every \(\omega \in \Omega, X_n(\omega)\) converges to \(X(\omega)\), or equivalently \(\forall \epsilon>0 , P(\underset{n \rightarrow \infty}{\lim \sup} d(X_n,X) <\epsilon) =1\)

2. Convergence in Probability : \(X_n\) converges in probability to \(X\) if \(\forall \epsilon>0, \underset{n \rightarrow \infty}{\lim \inf} P(d (X_n,x)\leq \epsilon)=1\) or equivalently if \(d(X_n,X)\leq \epsilon\) holds asymptotically almost surely for every \(\epsilon >0\).

3. Convergence in Distribution : \(X_n\) converges in distribution to \(X\) if for every bounded continuous function \(F, \underset{n \rightarrow \infty}{\lim} EF(X_n) =EF(X)\)

4. Tight sequence of distribution : \(X_n\) has a tight sequence of distribution if \(\forall \epsilon>0, \exists \mbox{ compact set }K \subset R,\exists N \in \mathbb{N} \mbox{ such that } \forall n \geq N,P(X_n \in K) \geq 1-\epsilon\)

Because the \(EF(X_n) = \int F d\mu_n , EF(X)=\int F d \mu\), the convergence in distribution means the convergence of distribution function i.e. \(\mu_n \rightarrow \mu\). Like before, the upper one is stronger one and implies next one.