THE ESSENCE OF DEEP LEARNING

Currently, we rely on deep neural networks with enormous numbers of parameters and highly complicated architectures to obtain strong performance on datasets. Most of them achieve excellent results on specific tasks, but they remain black boxes: we do not know the details of their internal processes. Starting from a single sample, I introduce a novel theory of deep learning networks. I prove that the essence of a deep learning network is a large collection of matrix multiplications. In particular, each sample has a matrix that maps it to its result, and a deep learning network memorizes some of these corresponding matrices. Training a deep learning network is the search for a set of parameters that can generate as many matrices matching the samples as possible. If a model can memorize the matrices of many samples, it will perform well on the test set; this is called generalization. Otherwise, the model is said to be overfitting or underfitting. I present empirical results on Cifar-10 Krizhevsky [2012] supporting this theory.


Introduction
Deep learning networks achieve excellent performance on many tasks, and more and more people are getting into the research on and improvement of deep learning networks. From this work, the following questions have emerged.
1. How does deep learning work? We all know that deep learning is a combination of many simple nonlinear functions, and this nonlinearity makes it hard to study what happens inside a network. Another difficulty is that the dimensionality of the input data usually goes beyond what people can grasp. Although a variety of approaches have been adopted, such as visualizing the intermediate feature maps, the dimensions are large and the feature maps are too abstract, so people can only offer subjective interpretations that are difficult to make convincing Zeiler and Fergus [2013], Karpathy et al. [2015], Bau et al. [2017], Alvarez-Melis and Jaakkola [2018].
2. Generalization: this is the ultimate goal of everyone who works on deep learning. We want a model that performs well on a variety of datasets and can eventually be applied in real life with the same effect. In many experiments a phenomenon called overfitting appears. What is the nature of overfitting? Why do some models perform well on the test set while others perform badly? Many people do their best to find models with better generalization Zhang et al. [2016], Wang et al. [2019].
3. What is the theory behind deep learning? This is the problem for everyone engaged in research on deep learning. Is there really no way to explain deep learning? People have made efforts in this area for a long time and produced all kinds of derivations, but these remain far from practice Sun et al. [2021], Mittelstadt et al. [2018], Lipton [2016], Du et al. [2018].
This paper argues that deep learning is by no means an illusory and groundless thing. Next, it gives an explanation of all the above problems and a mathematical representation of the nature of deep learning.

What is the essence of deep learning
First of all, I will give an intuitive explanation. We all know that deep learning involves a huge number of parameters and many nonlinear functions, so it is difficult to study. This paper argues that deep learning is essentially the memorization of matching matrices for the corresponding samples in the dataset: the network stores countless matrices, each of which maps an input sample to its corresponding result or feature maps.
Firstly, consider a model that has converged. When a converged model receives an input sample, its corresponding deep learning network can be viewed as one large matrix acting on the input sample, formed from linear combinations of that sample, because all the states of the ReLU units are fixed. The proof is given below.
I denote scalars as $a$, vectors as $\mathbf{a}$, matrices as $\mathbf{A}$, sets as $\mathcal{A}$, and equality by definition as $\equiv$. Given a training dataset $S \equiv \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$.

The theory of fully connected neural networks
When the model is fixed, the whole process of deep learning can be written as:
$$\hat{y}_i = \mathbf{x}_i^T \mathbf{W}^1 \mathbf{\Lambda}^1 \mathbf{W}^2 \mathbf{\Lambda}^2 \cdots \mathbf{W}^L \equiv \mathbf{x}_i^T \mathbf{W}_i \qquad \text{(Theorem 1)}$$
where $\mathbf{W}^l$ is the weight matrix of the $l$th layer and $\mathbf{\Lambda}^l$ is a diagonal 0-1 matrix encoding the ReLU states of the $l$th layer, which are fixed for the sample $\mathbf{x}_i$. Since every factor is then a fixed matrix, the whole network collapses into a single sample-specific matrix $\mathbf{W}_i$.

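The collapse claimed above can be checked numerically. The following is a minimal sketch, assuming a small two-layer ReLU network with random weights (the layer sizes and weights are illustrative and not taken from the paper): for a fixed input, we record the ReLU states, build the diagonal matrix $\mathbf{\Lambda}$, and verify that $\mathbf{W}^1\mathbf{\Lambda}^1\mathbf{W}^2$ reproduces the network's output on that sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and random weights (not from the paper).
W1 = rng.standard_normal((8, 16))   # layer 1: 8 -> 16
W2 = rng.standard_normal((16, 10))  # layer 2: 16 -> 10
x = rng.standard_normal(8)          # one fixed sample

# Forward pass with ReLU.
h = x @ W1
a = np.maximum(h, 0.0)
y = a @ W2

# For this fixed x the ReLU states are fixed: Lambda is diagonal 0/1.
Lam1 = np.diag((h > 0).astype(float))

# The whole network collapses into one sample-specific matrix W_i.
W_i = W1 @ Lam1 @ W2
assert np.allclose(y, x @ W_i)  # y_hat = x^T W_i
print("effective matrix shape:", W_i.shape)
```

A different sample generally flips some ReLU states, so it gets a different $\mathbf{W}_i$; this is exactly the sample-specific matrix the theory refers to.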
The theory of convolutional neural networks
$\mathbf{x}_i \in \mathbb{R}^{m \times n}$. Firstly, we reshape $\mathbf{x}_i$ into $\bar{\mathbf{x}}_i \in \mathbb{R}^{mn \times 1}$; the convolutional kernels are $3 \times 3$ and the stride is 1.
The whole process of convolution can be written with the following matrices:
$\bar{\mathbf{x}}_i^T \in \mathbb{R}^{1 \times mn}$: the transpose of $\bar{\mathbf{x}}_i$.
$\mathbf{E}$: a matrix copying $\bar{\mathbf{x}}_i$ along the row direction; its size depends on the number of convolution positions, which is $(m-2)(n-2)$ for a $3 \times 3$ kernel with stride 1.
$\mathbf{K} \in \mathbb{R}^{9(m-2)(n-2) \times 1}$: $\mathbf{k}$ is the original kernel, and $\mathbf{K}$ is the column vector obtained by copying $\mathbf{k}$ along the row direction.
According to the above construction, the whole convolution process in a network can be written as $\bar{\mathbf{x}}_i^T \mathbf{W}$, where $\mathbf{W}$ is a single matrix representing all the corresponding convolutions and ReLU operations (Theorem 2). It can also be calculated explicitly.
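The standard way to realize "convolution as one matrix multiplication" is the im2col construction. The sketch below is a minimal single-channel version under my own illustrative shapes; it is not the paper's exact $\mathbf{E}$/$\mathbf{K}$ construction, but it demonstrates the same equivalence: each $3 \times 3$ patch becomes one row of a matrix, and the convolution becomes one product with the flattened kernel.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 7                       # illustrative image size
x = rng.standard_normal((m, n))   # one input sample
k = rng.standard_normal((3, 3))   # one 3x3 kernel, stride 1

# im2col: every 3x3 patch of x becomes one row.
rows = []
for i in range(m - 2):
    for j in range(n - 2):
        rows.append(x[i:i + 3, j:j + 3].ravel())
patches = np.stack(rows)              # ((m-2)(n-2), 9)

# Convolution collapses into a single matrix-vector product.
out_matmul = patches @ k.ravel()      # ((m-2)(n-2),)

# Direct convolution for comparison.
out_direct = np.array([
    np.sum(x[i:i + 3, j:j + 3] * k)
    for i in range(m - 2) for j in range(n - 2)
])
assert np.allclose(out_matmul, out_direct)
```

Stacking such products layer by layer, with the fixed ReLU masks in between, again yields one overall matrix per sample, which is the content of Theorem 2.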

The essence of fully connected and fully convolutional neural networks
Based on Theorem 1 and Theorem 2 above, we can conclude that both kinds of networks simply generate corresponding matrices for different samples. There must be similarities between these matrices; otherwise it would be hard for the networks to fulfill their tasks with one shared set of parameters. The similarities between the matrices are shown in the next section.
The reasons behind the feasibility of neural networks

Testing Theorem 2
We know that the best solution for the whole dataset is a set of matrices that matches as many samples as possible. It is hard for us to find the corresponding matrices directly, but we can let deep learning networks generate them.
The first experiment was conducted on Cifar-10. Since it is hard to find the corresponding matrices by hand, I trained a Resnet-20-W (a ResNet He et al. [2015] with altered structure) on the dataset with a batch size of 128, and compared the matrices $\mathbf{W}$ generated for different samples. We find that there is some similarity between the $\mathbf{W}$, but the differences do not seem to depend on the classes. So I contend that Resnet-20-W generates similar weights according to the dataset as a whole.
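The per-sample matrix of a piecewise-linear network is its input-output Jacobian, so the similarity claim can be probed without re-deriving the $\mathbf{E}$/$\mathbf{K}$ construction. Below is a minimal sketch, assuming a small bias-free stand-in ReLU network rather than the Resnet-20-W from the experiment (which is not reproduced here): it computes each sample's effective matrix with torch.autograd.functional.jacobian and compares two samples by cosine similarity.

```python
import torch
import torch.nn as nn
from torch.autograd.functional import jacobian

torch.manual_seed(0)

# Small bias-free stand-in for Resnet-20-W (illustrative sizes only).
net = nn.Sequential(
    nn.Linear(32, 64, bias=False), nn.ReLU(),
    nn.Linear(64, 10, bias=False),
)

def effective_matrix(x):
    # For a fixed x the ReLU states are fixed, so the Jacobian at x
    # is exactly the sample-specific matrix W_i of the theory.
    return jacobian(net, x)  # shape (10, 32)

x1, x2 = torch.randn(32), torch.randn(32)
W1, W2 = effective_matrix(x1), effective_matrix(x2)

# Sanity check: the network really is one matrix for each sample.
assert torch.allclose(net(x1), W1 @ x1, atol=1e-5)

# Compare two samples' matrices, as in the Cifar-10 experiment.
cos = torch.nn.functional.cosine_similarity(W1.flatten(), W2.flatten(), dim=0)
print(f"cosine similarity between the two samples' matrices: {cos.item():.3f}")
```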

Rewriting Theorem 2
According to equation 13, if the input and output shapes are large, we will need to do a huge amount of matrix calculation. Following the process in Section 2.3, we can therefore unfold the input image's channels along the row direction. The whole process can then be written as
$$\mathbf{x} \otimes \mathbf{W} = \mathbf{Z}_1 \mathbf{S} \mathbf{E} \bar{\mathbf{x}}^T \mathbf{T} \mathbf{k} \mathbf{Z}_2 \mathbf{\Lambda} \qquad \text{(Theorem 3)}$$
where $\mathbf{S}$ is responsible for summing the values of the different channels after convolution. Theorem 3 can be used to study the intermediate process when we have a pre-trained model; it can also be used in tasks with high dimensions. The results show that Theorem 3 is useful for studying the middle layers and complex tasks. Besides, we find that it is easier for a deep learning network to find a better solution for the whole dataset: according to Figures 5 and 6, there are some similarities between the matrices of different samples.
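A minimal sketch of the channel-unfolding idea, under my own illustrative construction (the paper's $\mathbf{Z}$, $\mathbf{S}$, $\mathbf{T}$ matrices are not reproduced): patches from all channels are concatenated into the rows of one matrix, so a multi-channel convolution, including the summation over channels that $\mathbf{S}$ performs, is still a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(2)
c, m, n = 3, 6, 6                     # channels and illustrative image size
x = rng.standard_normal((c, m, n))    # multi-channel input
k = rng.standard_normal((c, 3, 3))    # one 3x3 kernel per channel

# Unfold: concatenate each channel's 3x3 patch into one row, so the
# summation over channels happens inside the single matmul.
rows = [
    x[:, i:i + 3, j:j + 3].ravel()            # length 9c
    for i in range(m - 2) for j in range(n - 2)
]
patches = np.stack(rows)                      # ((m-2)(n-2), 9c)
out_matmul = patches @ k.ravel()

# Direct multi-channel convolution (sum over channels) for comparison.
out_direct = np.array([
    np.sum(x[:, i:i + 3, j:j + 3] * k)
    for i in range(m - 2) for j in range(n - 2)
])
assert np.allclose(out_matmul, out_direct)
```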
In equation 13, $\mathbf{Z}_1\mathbf{E}$ is a unit matrix, and $\mathbf{T}\mathbf{k}\mathbf{Z}_2\mathbf{\Lambda}$ can be written as a single matrix $\mathbf{W}$. So the residual process can be written simply as
$$\bar{\mathbf{x}}_i^{l+2} = \bar{\mathbf{x}}_i^{l}\left(\mathbf{W}^{l}\mathbf{W}^{l+1} + \mathbf{I}\right),$$
where $\bar{\mathbf{x}}_i^{l}$ and $\bar{\mathbf{x}}_i^{l+2}$ are the inputs of the $l$th and $(l+2)$th layers for the sample $\mathbf{x}_i$, and $\mathbf{W}^{l}$, $\mathbf{W}^{l+1}$ are the corresponding matrices of the $l$th and $(l+1)$th layers. So all the middle processes can be written as $\bar{\mathbf{x}}_i^{L} = \bar{\mathbf{x}}_i^{l}\mathbf{W}_i$, where $\bar{\mathbf{x}}_i^{L}$ is the output before the last linear operation and $\mathbf{W}_i$ is responsible for simplifying the input data (including pooling).
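A quick numerical check of the residual collapse, again under illustrative random weights (with the ReLU masks already absorbed into $\mathbf{W}^{l}$ and $\mathbf{W}^{l+1}$ as above): the skip connection just adds an identity matrix to the product of the block's two matrices.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16                                  # illustrative feature width
x = rng.standard_normal((1, d))         # row-vector input of layer l
Wl = rng.standard_normal((d, d))        # matrix of layer l (ReLU mask absorbed)
Wl1 = rng.standard_normal((d, d))       # matrix of layer l+1

# Residual block: x_{l+2} = x_l + (x_l W^l) W^{l+1}
out_block = x + (x @ Wl) @ Wl1

# Collapsed form: x_{l+2} = x_l (W^l W^{l+1} + I)
out_collapsed = x @ (Wl @ Wl1 + np.eye(d))
assert np.allclose(out_block, out_collapsed)
```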
The last process of Cifar-10 on ResNet is $\hat{\mathbf{y}}_i = \bar{\mathbf{x}}_i^{L}\mathbf{W}^{L}$. All data share the same $\mathbf{W}^{L}$.
So all the processes before the last linear function are just trying to simplify the input data, and $\mathbf{W}^{L}$ is the corresponding matrix that can fit all the simplified data. Thus ResNet can find the relationships among all samples and make them computable by the same matrix. Besides, compared with fully convolutional neural networks, the matrices $\mathbf{W}$ of ResNet are more varied, which helps preserve information about the original inputs.

Rethinking neural networks
Most fully connected or fully convolutional neural networks simply generate corresponding matching matrices before the last operations, and these matrices fulfill the tasks. But it is too difficult to meet the requirements of all samples, so we reach a bottleneck. We therefore introduce structures that simplify the data and make it easier to handle, such as ResNet, to achieve state-of-the-art results. Many people are engaged in creating new models, but there is a crucial question: whether such work, aimed at achieving great results on particular datasets, is meaningful, and whether the models have an upper bound.
We take a ten-class classification problem on ResNet as an example. The real problem that deep learning networks are solving is as follows.
$\bar{\mathbf{x}}_1^T, \bar{\mathbf{x}}_2^T, \bar{\mathbf{x}}_3^T, \cdots, \bar{\mathbf{x}}_n^T$ are the input data. $\mathbf{W}_1, \mathbf{W}_2, \mathbf{W}_3, \cdots, \mathbf{W}_n$ are the operations before the last linear operation; they are responsible for decomposing the data and making it computable by one matrix:
$$
\begin{pmatrix}
\bar{\mathbf{x}}_1^T \mathbf{W}_1 \\
\bar{\mathbf{x}}_2^T \mathbf{W}_2 \\
\vdots \\
\bar{\mathbf{x}}_n^T \mathbf{W}_n
\end{pmatrix}
\begin{pmatrix}
\mathbf{k}_1 & \mathbf{k}_2 & \cdots & \mathbf{k}_{10}
\end{pmatrix}
=
\begin{pmatrix}
\hat{y}_{1,1} & \hat{y}_{1,2} & \cdots & \hat{y}_{1,10} \\
\hat{y}_{2,1} & \hat{y}_{2,2} & \cdots & \hat{y}_{2,10} \\
\vdots & \vdots & & \vdots \\
\hat{y}_{n,1} & \hat{y}_{n,2} & \cdots & \hat{y}_{n,10}
\end{pmatrix}
$$
$\bar{\mathbf{x}}_i^T \mathbf{W}_i \in \mathbb{R}^{1 \times m}$ are the simplified inputs of the different samples for the last linear operation, and $\mathbf{k}_j \in \mathbb{R}^{m \times 1}$ is the corresponding vector for each of the ten classes. Only one class is right for each sample, so the value of the corresponding class needs to exceed the other values. Most of the time $n \gg m$, so we need to find a group of parameters that matches as many samples as possible. This is a system of multivariate first-order inequalities, but the system has no strict targets, only corresponding states (if the target is right, the corresponding value is larger than the other values). So we can call it a system of multivariate first-order state equations (MFSE). When we know the labels, we can write the system as $\hat{y}_{i,c_i} > \hat{y}_{i,j}$ for all $j \neq c_i$, where $c_i$ is the true class of sample $i$.
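A minimal sketch of the MFSE view, with made-up random data standing in for the simplified inputs (nothing here is taken from the Cifar-10 experiment): given the simplified inputs stacked as rows and the shared class vectors $\mathbf{k}_1, \ldots, \mathbf{k}_{10}$ as columns, we count how many of the state inequalities $\hat{y}_{i,c_i} > \hat{y}_{i,j}$ are satisfied.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, classes = 1000, 64, 10          # n >> m, as in the text

X = rng.standard_normal((n, m))       # simplified inputs x_i^T W_i (made up)
K = rng.standard_normal((m, classes)) # shared class vectors k_1..k_10
c = rng.integers(0, classes, size=n)  # true labels

Y = X @ K                             # all scores y_hat_{i,j} in one product

# MFSE states: the true class's score must exceed every other score.
satisfied = (Y.argmax(axis=1) == c)
print(f"samples whose state inequalities hold: {satisfied.sum()}/{n}")
```

Training can then be read as searching for parameters that maximize the number of satisfied states rather than hitting exact target values.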