$\def\ket#1{\mid #1 \rangle}\def\bra#1{\langle #1 \mid}$
1. Reproducing Kernel Hilbert Space
In linear regression, we can improve the predictive accuracy of a linear model by mapping the original features into a higher-dimensional space. For example, starting from $y = \beta_0 + x\beta_1$, we can reduce the mean squared error further by adding higher-order terms such as $x^2, x^3, \cos x, \sin x, \ldots$. If we keep adding terms, we can shrink the training error to almost zero for most regression problems. This is the most basic way to build a non-linear model.
We can connect this situation with the Taylor expansion. Many functions in the real world can be analyzed with a Taylor expansion, which means we can write a function $f$ as $f(x) \approx \sum_{k=0}^{\infty} \frac{f^{(k)}(0)}{k!}x^{k}$. Because each coefficient $\frac{f^{(k)}(0)}{k!}$ is a constant, we can estimate it with the least squares estimator. If we find constants that make $y=f(x)$, then we can reduce the cost function to almost zero.
We can develop these ideas further using Hilbert spaces and bases. In the space of square-integrable functions, denoted $L^2$, where an inner product is defined, any function $f \in L^2$ can be represented with basis functions $B_i(x)$ as $f(x) = \sum_{i=1}^{\infty} \alpha_i B_i(x)$. Therefore, by mapping the original feature $x$ into $B_i(x)$ and finding the constants $\alpha_i$, we can find the perfect predictive function. This idea is used in many statistical algorithms, including Linear Regression, Principal Component Analysis and the Support Vector Machine.
The problem is that, as we expand the data into a higher-dimensional space, the computational cost grows rapidly with the dimension of the expanded space. For the linear regression $y = B_x\alpha$, the coefficients $\alpha_i$ can be estimated with the least squares estimator $(B_x^tB_x)^{-1}B_x^ty$. However, to get this estimate, we have to carry out the matrix multiplication and the matrix inversion, which are expensive.
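As a quick illustration of basis expansion with the least squares estimator (a toy sketch of my own, with an arbitrary target function and noise level), the snippet below expands a scalar feature into polynomial terms and solves the normal equations $(B_x^tB_x)^{-1}B_x^ty$; the training error shrinks as more terms are added.

```python
import numpy as np

# Toy data: a non-linear target that a purely linear model cannot fit well.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
y = np.cos(x) + 0.5 * x**2 + rng.normal(scale=0.05, size=100)

def design_matrix(x, degree):
    # Map each scalar x into the expanded basis (1, x, x^2, ..., x^degree).
    return np.vander(x, degree + 1, increasing=True)

for degree in (1, 3, 6):
    B = design_matrix(x, degree)
    # Least squares estimator (B^t B)^{-1} B^t y, computed via a linear solve.
    alpha = np.linalg.solve(B.T @ B, B.T @ y)
    mse = np.mean((y - B @ alpha) ** 2)
    print(f"degree {degree}: training MSE = {mse:.4f}")
```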
We can reduce this cost by using the concept of a Reproducing Kernel Hilbert Space (RKHS). A more detailed explanation of the RKHS is in this posting (https://hgmin1159.github.io/dimension/nonlineardr/).
The important concepts for using the RKHS are as follows.
For a Reproducing Kernel Hilbert Space $H(\Omega)$ and its elements $f$ and $g$, we have the following constructions.
- Positive Definite Kernel: A bivariate symmetric function $k(x,y) : \Omega \times \Omega \rightarrow \mathbb{R}$ is positive definite if, for any finite subset $S \subset \Omega$, the matrix $\{k(x_i,x_j)\}_{ij}, \quad (x_i ,x_j \in S)$ is a positive definite matrix.
- Basis
  - The positive definite kernel with one argument fixed, $k_j(x) = k(x,x_j)$, is a basis function of the RKHS.
  - $f$ can be represented as $f = \sum_{i=1}^{n}\alpha_{fi} k_i, \quad (\alpha_{fi} \in \mathbb{R})$, and the real-valued vector $(\alpha_{f1},\alpha_{f2},\ldots,\alpha_{fn})$ is called the coordinate representation of $f$ and denoted $[f]$.
- Gram Matrix: The matrix of kernel values over the basis points, $G = \{k(x_i,x_j)\}_{i,j}, \quad (i=1,2,\ldots,n,\ j=1,2,\ldots,n)$, is called the Gram matrix.
- Inner Product: The inner product $\langle f,g \rangle$ can be defined as $\langle f, g \rangle = \sum_{i,j} [f]_i[g]_j k(x_i,x_j) = [f]^t G[g]$.
- **Reproducing Property:** $\langle f, k_j\rangle = \sum_{i,l} [f]_i[k_j]_l k(x_i,x_l) = \sum_i [f]_ik(x_i,x_j) = f(x_j)$, because $[k_j] = (0,0,\ldots,1,\ldots,0)$ with the $1$ in the $j$-th position.
- Kernel Trick: For certain vector-valued functions $\Psi = (\psi_1,\psi_2,\ldots,\psi_p)$, there is a kernel $k(x,y)$ such that $\Psi(x)^t \Psi(y) = k(x,y)$.
Two famous kernels used with the kernel trick are the Gaussian kernel $k(x,y) = \exp \left( -\frac{\parallel x-y \parallel^2}{2\sigma^2}\right)$ and the polynomial kernel $k(x,y) = (x^ty+c)^d$.
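To make the kernel trick concrete, here is a small check (an illustrative sketch of my own, with arbitrary inputs and parameters): for the degree-2 polynomial kernel on two-dimensional inputs, the explicit feature map $\Psi$ gives exactly the same inner product as the kernel evaluated directly, while the Gaussian kernel is evaluated without ever writing down its infinite-dimensional feature map.

```python
import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    # Polynomial kernel k(x, y) = (x^t y + c)^d.
    return (x @ y + c) ** d

def poly_features(x, c=1.0):
    # Explicit feature map Psi for the degree-2 polynomial kernel on 2-d inputs.
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

def gauss_kernel(x, y, sigma=1.0):
    # Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

x = np.array([0.3, -1.2])
y = np.array([1.5, 0.7])
print(poly_features(x) @ poly_features(y))  # inner product Psi(x)^t Psi(y)
print(poly_kernel(x, y))                    # same value from the kernel alone
print(gauss_kernel(x, y))                   # no explicit feature map needed
```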
To use the structure of the RKHS, the only thing we need to know is the Gram matrix. Therefore, if we can get the Gram matrix, we can use it to build various non-linear statistical models.
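The sketch below (again my own illustration, using the classical Gaussian kernel and an arbitrary regularization constant) shows that the Gram matrix really is all we need: kernel ridge regression estimates the coordinate representation $[f]$ from $G$ alone, and prediction only requires kernel values against the training points.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=60)

def gram(A, B, sigma=1.0):
    # Gram matrix G_ij = k(a_i, b_j) for the Gaussian kernel.
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / (2 * sigma**2))

G = gram(X, X)
lam = 1e-3
# Coordinate representation [f] = (G + lambda I)^{-1} y  (kernel ridge regression).
alpha = np.linalg.solve(G + lam * np.eye(len(X)), y)

# Predicting at new points only needs kernel values against the training points.
X_new = np.array([[0.5], [2.0]])
print(gram(X_new, X) @ alpha)   # approximately sin(0.5) and sin(2.0)
```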
2. Quantum Kernel Estimator
On a quantum computer, there is an algorithm to estimate the inner product between two quantum states. Therefore, by using this algorithm, we can estimate the Gram matrix of a quantum kernel and construct quantum versions of various non-linear statistical models. Let's see the detailed estimation procedure.
We'll assume that each unitary $U$ encodes a data vector into a quantum state. That is, $U_{0}\ket{0} = \ket{x}, \ U_1 \ket{0} = \ket{y}$.
Let's trace the procedure of the swap test step by step.
- $(I \otimes U_0 \otimes U_1) \ket{000} = \ket{0xy}$
- $(H \otimes I \otimes I) \ket{0xy} = \frac{1}{\sqrt{2}}(\ket{0}+\ket{1})\ket{xy}$
- $CSWAP\left[\frac{1}{\sqrt{2}}(\ket{0}+\ket{1})\ket{xy}\right] = \frac{1}{\sqrt{2}}(\ket{0xy}+\ket{1yx})$
- $(H \otimes I \otimes I)\,\frac{1}{\sqrt{2}}(\ket{0xy}+\ket{1yx}) = \frac{1}{2}(\ket{0}(\ket{xy}+\ket{yx})+\ket{1}(\ket{xy}-\ket{yx}))$
- Measurement of the ancilla qubit
  - $Pr(0) = \frac{1}{4}\parallel \ket{xy}+\ket{yx} \parallel_2^2 = \frac{1}{2}(1+\mid\langle x \mid y \rangle \mid^2)$
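To make the procedure concrete, here is a minimal NumPy simulation of the swap-test circuit (a sketch of my own for illustration, not code from the cited papers): it prepares $\ket{0xy}$, applies the Hadamard and controlled-SWAP gates as matrices, and checks that the probability of measuring the ancilla in $\ket{0}$ equals $\frac{1}{2}(1+\mid\langle x\mid y\rangle\mid^2)$.

```python
import numpy as np

def swap_test_probability(x, y):
    # Classical simulation of the swap test on two (normalized) state vectors.
    x = x / np.linalg.norm(x)
    y = y / np.linalg.norm(y)
    d = len(x)
    ket0 = np.array([1.0, 0.0])                       # ancilla |0>
    H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
    I2d = np.eye(d * d)

    # SWAP on the two data registers: |i, j> -> |j, i>.
    SWAP = np.zeros((d * d, d * d))
    for i in range(d):
        for j in range(d):
            SWAP[j * d + i, i * d + j] = 1.0

    # Controlled-SWAP: identity when the ancilla is |0>, SWAP when it is |1>.
    P0 = np.outer(ket0, ket0)
    P1 = np.eye(2) - P0
    CSWAP = np.kron(P0, I2d) + np.kron(P1, SWAP)

    psi = np.kron(ket0, np.kron(x, y))                # |0, x, y>
    psi = np.kron(H, I2d) @ psi                       # Hadamard on the ancilla
    psi = CSWAP @ psi
    psi = np.kron(H, I2d) @ psi                       # second Hadamard
    return np.sum(np.abs(psi[: d * d]) ** 2)          # Pr(ancilla = 0)

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), rng.normal(size=4)
pr0 = swap_test_probability(x, y)
x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)
print(2 * pr0 - 1, (x @ y) ** 2)                      # both equal |<x|y>|^2
```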
By applying this circuit, we can build a general inner product estimator by using the encoding gates as kernel feature maps. That is, we can estimate the quantum kernel $k(x,y) = \mid \langle \phi(x),\phi(y) \rangle \mid^2$.
There are several more advanced versions of the kernel estimator, but we will not cover them here.
3. Algorithms using Quantum Kernel
There are several strategies for using the quantum kernel. We'll see three strategies for using the quantum feature space. I named them the Hybrid Kernel Model, the Analytic Quantum Circuit with Kernel Encoding and the Variational Quantum Circuit with Kernel Encoding. The most representative algorithms are the Quantum Kernel Estimator, the Distance-based Classifier and the Parameterized Quantum Circuit, respectively.
Hybrid Kernel Model
The first strategy is to use the quantum computer only to get the Gram matrix and to build the models with classical computers.
As mentioned above, the Gram matrix is the only thing we need in order to use the reproducing kernel Hilbert space. Therefore, by estimating the Gram matrix on a quantum computer and building a classical model from this quantum Gram matrix, we can make use of the quantum feature space.
The procedure is simple.
- Constructing a Quantum Kernel Estimator circuit.
- Evaluating kernel values by iteratively inputting data vectors and constructing the Gram matrix.
- Building classical models using the Gram matrix.
There are several classical models that use the Gram matrix, including the Support Vector Machine, Kernel Principal Component Analysis, K-Nearest Neighbors and Generalized Sliced Inverse Regression. Therefore, by following the above procedure, we can build quantum versions of various data science models.
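As a concrete sketch of the hybrid workflow (my own illustration, not from the cited papers), the snippet below stands in for the quantum kernel with a classically computed fidelity of angle-encoded states and feeds the resulting Gram matrices to scikit-learn's SVC with a precomputed kernel. On real hardware, each Gram matrix entry would instead be estimated with the swap-test circuit from Section 2.

```python
import numpy as np
from sklearn.svm import SVC

def quantum_style_kernel(A, B):
    # Stand-in for a quantum-estimated kernel: fidelity of angle-encoded
    # product states, |<phi(a)|phi(b)>|^2 = prod_j cos^2((a_j - b_j) / 2).
    diff = A[:, None, :] - B[None, :, :]
    return np.prod(np.cos(diff / 2) ** 2, axis=2)

rng = np.random.default_rng(2)
X_train = rng.uniform(-np.pi, np.pi, size=(80, 2))
y_train = (np.sin(X_train[:, 0]) * np.sin(X_train[:, 1]) > 0).astype(int)
X_test = rng.uniform(-np.pi, np.pi, size=(20, 2))

G_train = quantum_style_kernel(X_train, X_train)   # n_train x n_train Gram matrix
G_test = quantum_style_kernel(X_test, X_train)     # kernel values against the training set

clf = SVC(kernel="precomputed").fit(G_train, y_train)   # classical SVM on the Gram matrix
print(clf.predict(G_test))
```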
There are several advantages to this approach. The first is that we can use the quantum feature space. Even though it is still under research, the complex quantum feature space shows peculiar characteristics compared with the Gaussian or polynomial kernel spaces because it is essentially periodic. More detailed research can be found in [Vojtěch Havlíček et al. (2019), Supervised learning with quantum-enhanced feature spaces].
The second is that we can obtain the output of the machine learning model with a classical computer. That is, the quantum computer is used only in the training phase. Once the model is trained, we can get the results using the classical computer alone. In the current quantum environment, using a quantum computer is neither easy nor cheap. Therefore, this strategy reduces the cost of the quantum model because it uses the quantum computer only in the training phase.
Analytic Quantum Circuit with Kernel Encoding
The second strategy is to set up an encoding operator that maps the raw data into the quantum feature space. That is, we use an encoding gate $E_n(x) \ket{0}=\ket{\phi (x)}$ and build the distance estimator directly into the operator $U$. By doing this, we can construct the most intuitive algorithms and control the whole procedure. One famous example is the quantum distance-based classifier. There are various papers about it, including [Carsten Blank et al. (2020), Quantum classifier with tailored quantum kernel].
If we can derive the exact procedure for building the classifier, we can make the most intuitive quantum model directly. But there are only a few such algorithms for supervised or unsupervised models, and they are still under study.
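To convey the idea of a distance-based classifier in the quantum feature space (a simplified classical simulation of my own, not the circuit from the cited paper), the sketch below assigns a test point to the class whose encoded training states have the highest average fidelity with the encoded test state.

```python
import numpy as np

def fidelity_kernel(a, b):
    # Fidelity of angle-encoded product states (classically simulated):
    # |<phi(a)|phi(b)>|^2 = prod_j cos^2((a_j - b_j) / 2).
    return np.prod(np.cos((a - b) / 2) ** 2)

def distance_based_classify(x_new, X, y):
    # Assign the class whose training states are, on average, the closest
    # (most similar) to |phi(x_new)> in the quantum feature space.
    scores = {}
    for label in np.unique(y):
        members = X[y == label]
        scores[label] = np.mean([fidelity_kernel(x_new, m) for m in members])
    return max(scores, key=scores.get)

X = np.array([[0.2, 0.1], [0.3, -0.1], [2.5, 2.8], [2.9, 2.6]])
y = np.array([0, 0, 1, 1])
print(distance_based_classify(np.array([0.0, 0.3]), X, y))   # expected: 0
print(distance_based_classify(np.array([2.7, 2.7]), X, y))   # expected: 1
```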
Variational Quantum Circuit with Kernel Encoding
But there is an easier way to build a quantum model. If we use a parameterized quantum circuit, we can build quite a good model. One limitation of parameterized quantum circuits is that they cannot explore a non-linear space on their own. However, if we use the quantum kernel, we can explore the non-linear space by mapping the data into it with the quantum kernel and then finding the most efficient linear mapping with the parameterized gates.
Therefore, the parameterized quantum circuit and the quantum kernel make quite a good combination. There is a good example of a PQC with kernel encoding in [Tak Hur et al. (2021), Quantum convolutional neural network for classical data classification].
The paper shows ideal results on a quantum simulator. It introduces various ansatz and encoding methods, and most of them achieved high accuracy, around 94~98%.
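As a minimal illustration of this variational strategy (a toy single-qubit sketch of my own, far simpler than the QCNN in the paper), the code below encodes a feature with a rotation $R_y(x)$, applies one trainable rotation $R_y(\theta)$, uses the $Z$ expectation value as the model output, and trains $\theta$ with the parameter-shift rule.

```python
import numpy as np

def ry(theta):
    # Single-qubit R_y rotation gate.
    return np.array([[np.cos(theta / 2), -np.sin(theta / 2)],
                     [np.sin(theta / 2),  np.cos(theta / 2)]])

def circuit_output(x, theta):
    # Kernel-style encoding R_y(x) followed by a trainable rotation R_y(theta);
    # the model output is the Z expectation value of the resulting state.
    state = ry(theta) @ ry(x) @ np.array([1.0, 0.0])
    return state[0] ** 2 - state[1] ** 2              # <Z> = |amp_0|^2 - |amp_1|^2

# Toy one-feature classification problem with labels in {-1, +1}.
rng = np.random.default_rng(3)
X = rng.uniform(-np.pi, np.pi, 40)
y = np.sign(np.cos(X + 0.8))

theta, lr = 0.0, 0.2
for _ in range(100):
    preds = np.array([circuit_output(x, theta) for x in X])
    # Parameter-shift rule: the gradient of <Z> uses two shifted circuit runs.
    grads = np.array([(circuit_output(x, theta + np.pi / 2)
                       - circuit_output(x, theta - np.pi / 2)) / 2 for x in X])
    theta -= lr * np.mean(2 * (preds - y) * grads)    # gradient step on the squared loss

final_loss = np.mean((np.array([circuit_output(x, theta) for x in X]) - y) ** 2)
print(f"trained theta = {theta:.3f}, final squared loss = {final_loss:.3f}")
```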
As we have seen, the quantum kernel shows very bright potential. Not only can it be implemented easily on a quantum computer, it might also give us a new point of view on data science and data spaces.
Vojtěch Havlíček et al. (2019), Supervised learning with quantum-enhanced feature spaces
Carsten Blank et al. (2020), Quantum classifier with tailored quantum kernel
Tak Hur et al. (2021), Quantum convolutional neural network for classical data classification