How Convolutional Neural Networks Defy the Curse of Dimensionality: Deep Learning Explained

The required number of learning samples grows with the number of parameters that have to be estimated. In a deep convolutional neural network, while the total number of parameters (weights and biases) grows with the number of synapses between neurons, the number of independent parameters can be many orders of magnitude smaller than the number of neurons. Furthermore, the early layers detect features, many samples of which appear in every sample sentence or picture. The deeper layers (with far fewer neurons) detect thoughts or objects that rarely appear in the learning samples.

Index Terms-convolutional network, deep learning, machine learning, neural networks, pattern recognition

I. THE CONVOLUTIONAL NEURAL NETWORK

The theory behind the accuracy of pattern recognizers has a long history [1][2][3][4]. For a given number of features, performance improves as you increase the number of learning samples (the data from which the machine learns). Performance increases, and then plateaus. The number of samples required to reach the plateau grows very rapidly (e.g., exponentially) with the number of features (the dimensionality of the pattern vector) and/or with the number of parameters or weights that need to be estimated. It takes a lot of points (sample vectors) to fill a high-dimensional space.
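To make the exponential growth concrete, here is a minimal sketch (my illustration, not from the text): counting the grid points needed to tile a feature space at a fixed resolution of 10 bins per axis. The function name is my own.

```python
def samples_to_fill(dims, bins_per_axis=10):
    """Grid points needed to tile a dims-dimensional space at a
    fixed resolution; grows exponentially with dims."""
    return bins_per_axis ** dims

# 1-D needs 10 points, 2-D needs 100, but 10-D needs 10 billion.
for d in (1, 2, 3, 10):
    print(d, samples_to_fill(d))
```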
The problem is that to increase performance we need more features and/or weights, and hence many more data samples. Consider a layered, non-recurrent neural network such as the one shown in Figure 1. For pattern classification, each output activation might represent the likelihood that the network's input is a member of a specific class. For a more general problem, using reinforcement learning, each output activation would represent the quality of a particular action. Since more features provide better performance, the number of inputs to each layer should be as large as the available data samples can support. This would imply a very large number of neurons, and hence a much larger number of weights; an enormous number, if the network is deep. We know from statistics that a large number of independent parameters requires an enormously large number of data samples. No problem. Design a network with many weights, but few independent weights. Thus, we can have our cake and eat it too. We will show that the convolutional neural network does this because its weights are not independent.

But first, let me briefly remind you (in one paragraph) how a deep neural network optimizes its performance. (See the free online book by Michael Nielsen, www.NeuralNetworksandDeepLearning.com, for a more complete treatment.) An artificial neuron (Threshold Logic Unit, Perceptron, or ADALINE) fires, activates, or recognizes something when a weighted sum of its input features exceeds a threshold. For a neural network, we replace the threshold with a differentiable nonlinearity. For pattern recognition with a squared-error loss, ½(y − a)², we note that the derivative of the loss is the error. If we measure the output errors and differentiate the loss with respect to the weights, the chain rule of differential calculus allows us to back-propagate the errors and hence train the hidden layers. Thus, least mean squares (LMS) [5] and back-propagation fit like hand in glove.
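The one-paragraph summary above can be sketched in a few lines of code. This is an illustration of mine, not the paper's implementation: a single sigmoid neuron and one LMS-style steepest-descent step on the squared-error loss ½(y − a)²; the function names and numbers are assumptions for the example.

```python
import math

def sigmoid(z):
    """Differentiable nonlinearity that replaces the hard threshold."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(weights, bias, x):
    """Weighted sum of input features, passed through the nonlinearity."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

def lms_step(weights, bias, x, target, lr=0.5):
    """One gradient step on the loss 0.5*(target - a)**2.
    The derivative of the loss w.r.t. the activation is the error
    (a - target); the chain rule carries it back to the weights."""
    a = neuron(weights, bias, x)
    delta = (a - target) * a * (1.0 - a)   # dLoss/dz via the chain rule
    new_w = [w - lr * delta * xi for w, xi in zip(weights, x)]
    new_b = bias - lr * delta
    return new_w, new_b
```

A single step reduces the loss on the training sample, which is all that steepest descent promises.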
Section II serves as an appendix. It provides mathematical details and generalizes to more general cost functions and to reinforcement learning.

Now return to Figure 1. Considering only the heavy lines, it represents a simple example of the first three layers of a multi-layer or deep neural network.

[Figure 1. Enormous parameter reduction.]

Ignore the light lines. They are there only to remind you that the output layer should be fully connected, so that final decisions or actions depend on the network's entire input; for example, all pixels in a sample picture to be described, or all acoustic time samples of a sample sentence to be understood. Let m be the dimensionality of the input. For example, it could be 3 times the number of pixels in a color picture, or it could be the number of acoustic time samples in a spoken sentence. Let m1 = m, and let m2 = k2m be the number of neurons in the 2nd layer (the first hidden layer). Let k be the number of inputs to each neuron in the second layer, with k << m. Each unit in a convolutional hidden layer looks for local features by applying a localized linear filter to the input. If we start by examining 4x4 or 4x5 greyscale local regions, k is 16 or 20.
The diagram here is one-dimensional, so the inputs may be time samples of an acoustic speech waveform. In Figure 1, k2 = 1, m = m1 = m2 = 4, and k = 3. Since, in reality, m >> k, we ignore edge effects and assume k is constant. For an arbitrary numerical example, let us assume the input is m-dimensional with m = 100,000. If the first two layers were fully connected, there would be m1m2 = k2m², or k2 times 10 billion weights. The use of only k local weights brings this number down to km2 = kk2m, or k2 times 2 million weights. Now here comes the kicker. If we are looking for only one feature (a vertical edge, for example), each hidden unit in the second layer would use the same linear filter, and there would be only k weights that need to be determined. Of course, we do not choose the feature; the network does. But obviously we must look for several different features. Actually, k2 of them.
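The counting argument can be checked directly. Here is a sketch using the example numbers above (m = 100,000 inputs, filters with k = 20 local weights, and k2 = k feature maps, the choice made below); the variable names are mine.

```python
# Weight counts for three wiring schemes between the input layer and
# the first hidden layer of the example network.
m, k = 100_000, 20
k2 = k                        # number of distinct features (feature maps)
m2 = k2 * m                   # neurons in the first hidden layer

fully_connected = m * m2      # m1*m2 = k2*m**2: every input to every unit
local_unshared = k * m2       # k*k2*m: local k-input filters, no sharing
shared = k2 * k               # one k-weight filter per feature map

print(fully_connected, local_unshared, shared)
```

Weight sharing takes the count from hundreds of billions down to a few hundred, which is the entire point of the convolutional layer.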
With k2 sets of k weights, the total number of distinct weights that need to be determined from the learning samples is only k2k. For simplicity, we choose k2 = k. If k = 20, then by using widely shared local weights we have increased the number of neurons in the first two layers from 2m = 200,000 to (k2 + 1)m = 2,100,000, while decreasing the number of weights that need to be learned from m² = 10,000,000,000 to k² = 400. That is the savings in the first convolutional layer.

The pooling layer reduces the number of neurons in the layer by a factor k', so that m3 = m2/k', where k' can be a multiple of k2. In the figure, k2 is 1 instead of k, and k' = 2. The most economically efficient pooling is max pooling: simply choose the largest of the k' activations in the previous layer. For example, we might select the best feature amongst the k2 candidates. However, the feature (e.g., a vertical edge) is not needed at each and every pixel. Similarly, the phoneme is not needed at each and every acoustic time sample. Layer 3 decimates the sample rate by a factor of k'/k2 < k. In our example, the reduction factor k' must be less than kk2 = k² = 400. The numbers used here are extreme to make a point.

Note that even if we continue to train a deep convolutional network until all the training samples are correctly classified, we are not likely to arrive at the best possible set of first-layer weights or the best set of initial features for generalizing to new data. Apparently, this does not matter too much. We arrive at an excellent set of last-layer weights and a solution that can often generalize to unseen data. When at a saddle point, with many dimensions to choose from, we are almost certain to find a fast way down. We do not reach a global minimum, but with a large deep network there are many excellent local minima for the expected cost.
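Max pooling as described above is easy to state in code. This sketch (my own, one-dimensional for simplicity) keeps the largest of every k' consecutive activations, so the layer shrinks by the factor k'.

```python
def max_pool(activations, kprime):
    """Downsample by keeping the largest of every kprime consecutive
    activations, so the pooled layer has len(activations)/kprime units."""
    return [max(activations[i:i + kprime])
            for i in range(0, len(activations), kprime)]

# kprime = 2 halves the layer while keeping the strongest responses.
print(max_pool([0.1, 0.9, 0.3, 0.2, 0.8, 0.4], 2))   # [0.9, 0.3, 0.8]
```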
One problem is that a neural network identified a rabbit as a wolf because there was snow on the ground and the network had never seen a wolf without snow or a rabbit with snow. Like a person, the machine has to keep learning.

In a typical deep network, there can be several convolution layers between each pooling layer. The differentiable activation function that replaces the threshold is no longer "S"-shaped, but is the Rectified Linear Unit (ReLU), a = f(z) = max(0, z). Its derivative (for back-propagation) is always either 0 or 1, reducing computational complexity, and it is made to order for reinforcement learning. Since we choose the highest-quality action, we can ignore negative scores (z < 0) and even eliminate neurons that usually produce zero activation.

It is interesting to note that Dr. Fukushima [6][7] was using the ReLU long before it became popular. In April 2021, Dr. Kunihiko Fukushima of Japan received the $250,000 Bower Award and Prize for Achievement in Science from the Franklin Institute "for his pioneering research that applied principles of neuroscience to engineering through his invention of the first deep convolutional neural network, 'Neocognitron' - a key contribution to the development of artificial intelligence."

Performance is enhanced by adding feedback (recurrence) every few layers [11]. Since this allows data to remain in the system longer, it was named LSTM, for Long Short-Term Memory. Deep learning, made possible and practical by back-propagation and the convolutional architecture, starts from raw data and progresses to increasing levels of abstraction. The human mind also does this, though not necessarily going through the same steps. For example, for print reading or written-language translation, the layers we may go through are: pixels → edges → strokes → characters → words → meaning.
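The ReLU and its derivative, as defined above, take only a couple of lines (a minimal sketch; the function names are mine):

```python
def relu(z):
    """Rectified Linear Unit: a = f(z) = max(0, z)."""
    return max(0.0, z)

def relu_grad(z):
    """Derivative used in back-propagation: 0 for z < 0, else 1."""
    return 0.0 if z < 0 else 1.0
```

The cheap 0-or-1 derivative is what makes the ReLU so economical during back-propagation.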
With LSTM, word recognition can depend not only on the current letters, but also on the previous word.
In 2003, Andrew Ng combined reinforcement learning concepts with neural networks and aerodynamics to design an autonomous helicopter in his Ph.D. dissertation at U.C. Berkeley, and he built one at Stanford in 2006. The neural network AlphaZero had sparse reinforcement rewards (only at the game's end). It learned to become the world champion at both chess and Go by playing millions of games against itself.

II. AN OVERVIEW OF DEEP LEARNING ALGORITHMS
Learning can be generalized to other cost functions, and then, by use of reinforcement learning, to more general problems. In error correction for pattern recognition, we wish to minimize the expected loss due to errors. So we define a desired output and a scalar cost or loss for an error, given an input vector x.
The decision, and hence the cost (given the true state), depends only on the output activations: C = C(a). The derivative of this function with respect to the unknown parameters is what we back-propagate and call the "error". We define the cost for errors by averaging over a set of training samples. This average is the quantity to be minimized by steepest descent.

For reinforcement learning, we define desired actions and a reinforcement reward for a desired action given the state of nature. The action depends on the output activations, and the reward depends on the action and the state of nature, where the state may depend on the previous actions. The state of nature is not given. This is learning without a teacher, and is a compound decision problem [12]. We only see the past rewards. We define the Return as the accumulated reward, and we attempt to maximize the expected Return. We choose the action which maximizes the quality of the action, Q(a), where Q is the average value of the discounted future returns. C is replaced by -Q.

The fact that the parameter correction is a small positive constant (the learning rate, η) times the derivative of the quality with respect to the parameters is intuitively satisfying. We choose the action with the highest estimated quality, and then increase or decrease the parameters in a way that increases the estimate of the quality of the chosen action. This works because whenever we receive a reward, it increases the estimate of the quality of the decisions made to get that reward.

It turns out that the problem of weights and/or activations continuing to grow until they saturate (or the problem of having too many weights driven to zero) appears to be helped by LSTM. In fact, such automatic regularization was the reason it was introduced.
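The update rule described above (parameter correction equals a small learning rate times the derivative of the quality with respect to each parameter) can be sketched as follows. This is my illustration, not the paper's algorithm: the derivative is estimated by finite differences, and the quality function is a toy stand-in for the network's Q estimate.

```python
def ascend_quality(params, quality, eta=0.01, h=1e-5):
    """One update: params_i += eta * dQ/dparams_i, with the derivative
    estimated by a forward finite difference of step h."""
    grads = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += h
        grads.append((quality(bumped) - quality(params)) / h)
    return [p + eta * g for p, g in zip(params, grads)]

# Toy quality function (an assumption for the demo): peaks at (1, -2).
q = lambda p: -((p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2)

p = [0.0, 0.0]
for _ in range(500):
    p = ascend_quality(p, q)
```

Repeated small corrections in the direction of increasing estimated quality drive the parameters toward a (local) maximum of Q.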