Neural Layer Bypassing Network

This research introduces and evaluates the Neural Layer Bypassing Network (NLBN), a new neural network architecture designed to improve the speed and efficiency of forward propagation in deep learning. This architecture adds 1 extra (fully connected) layer after every layer in the main network; this new layer determines whether finishing the rest of the forward propagation is required to predict the output for the given input. To test the effectiveness of the NLBN, I implemented this architecture with 3 different image classification models trained on 3 different datasets: the MNIST Handwritten Digits Dataset, the Horses or Humans Dataset, and the Colorectal Histology Dataset. After training 1 standard convolutional neural network (CNN) and 1 NLBN per dataset (both of equivalent architectures), I performed 5 trials per dataset to compare the performance of the two architectures. For the NLBN, I also collected data on the accuracy, prediction time, and speed of the network with respect to the percentage of the model that the inputs are passed through. This architecture was found to increase the speed of forward propagation by 6% - 25%, while the accuracy tended to decrease by 0% - 4%; the results vary with the dataset and the structure of the model, but the increase in speed was normally at least twice the decrease in accuracy. Beyond its performance during predictions, the NLBN takes roughly 40% longer to train and requires more memory due to its complexity. However, the architecture could be made more efficient if integrated into TensorFlow libraries. Overall, by autonomously skipping neural network layers, this architecture can serve as a foundation for neural networks that teach themselves to become more efficient for applications that require fast, accurate, and less computationally intensive predictions.


Introduction
Neural networks have significantly improved and dominated deep learning in the past decades.
They can reach high levels of precision and recall and, in some cases, even outperform humans. One main reason neural networks perform so well is their large and complex architectures. Though these architectures increase the accuracy of neural networks, they also take more time for forward propagation, so they output results more slowly than other deep learning architectures. This raises issues in applications of neural networks that require both speed and accuracy. For instance, this issue is present in self-driving cars, where a minuscule delay can create a life-or-death situation for passengers.
Some have attempted to reduce the forward propagation time for deep neural networks.
However, most solve this issue by reducing the number of layers or the number of neurons per layer, which can improve prediction time but significantly decreases the accuracy and capability of neural networks due to potential underfitting. There are also architectures that speed up deep learning applications without sacrificing accuracy, but these solutions generally rely on fundamentally different designs, such as the Transformer introduced in 2017. 1 In this research, I propose a new architecture that uses the fundamental building blocks of standard neural networks, such as convolutional and pooling layers in the case of CNNs, to forward propagate and perform tasks. However, I add checkpoints after each layer to determine whether forward propagation should continue. This decision depends on the probability, or confidence, of the model that a certain output should be produced after passing the input through a variable number of layers in the network.

Transformers
Transformers are architectures that use self-attention to learn relationships across elements of input sequences, allowing long-term dependencies and scalability. 2 They were initially used in Natural Language Processing but have been adapted, through data manipulation, to work in Computer Vision as well. 3 Transformers direct attention and processing power to the specific parts of the input that matter most for the final output. This makes them more efficient, as the network focuses on particular portions of the input rather than the entire input.
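As background, the self-attention mechanism described above can be sketched in a few lines. The following is a minimal NumPy implementation of scaled dot-product self-attention (it is illustrative background, not part of the NLBN itself, and all array shapes are arbitrary assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    # x: (seq_len, d_model); wq/wk/wv: (d_model, d_k) projection matrices.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])  # pairwise relevance of elements
    weights = softmax(scores)                # each row sums to 1
    return weights @ v                       # weighted mix of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                  # 5 sequence elements, 8 features
w = [rng.normal(size=(8, 4)) for _ in range(3)]
out = self_attention(x, *w)
print(out.shape)  # (5, 4)
```

Each output row is a mixture of the value vectors, weighted by how relevant every other element is to that position; this is the sense in which attention "focuses" on particular portions of the input.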

Pipelining
Machine learning pipelining is a technique that optimizes the flow of input data to a neural network, reducing the processing time and memory required for the model to perform forward propagation. 4 This includes preprocessing and compressing data before the neural network uses it. Some examples are Processor Pipelining and Instruction Pipelining.
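As a toy illustration of the idea (not any specific library's API), a preprocessing pipeline can be expressed as chained generators so that each stage streams records to the next instead of materializing the whole dataset:

```python
def load(paths):
    # Stage 1: yield raw records one at a time (stub: the "paths" are the data).
    for p in paths:
        yield p

def preprocess(records):
    # Stage 2: normalize each pixel value into [0, 1] as it streams through.
    for r in records:
        yield r / 255.0

def batch(stream, size):
    # Stage 3: group preprocessed records into fixed-size batches.
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf

pipeline = batch(preprocess(load([0, 51, 102, 153, 204, 255])), size=2)
print(list(pipeline))  # [[0.0, 0.2], [0.4, 0.6], [0.8, 1.0]]
```

Because each stage pulls items lazily, memory stays bounded regardless of dataset size, which is the core benefit pipelining offers a model performing forward propagation.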
Architecture
I am proposing a new architecture (the Neural Layer Bypassing Network, or NLBN) that retains the standard neural network structure but adds a new layer (the rejection layer or rejection model) after each layer of the main model to determine the probability that the semi-processed input corresponds to a certain output. If this probability is above a trainable or fixed threshold, forward propagation is stopped and the next input is taken; otherwise, forward propagation continues, and this process repeats at the following layers. Figure 1 illustrates this structure; in its legend, "CONV Layer" denotes a convolutional layer for feature extraction.
In this manner, the NLBN is not expected to forward propagate each input through the entire model. If certain inputs can be classified with fewer layers, the rejection models (or rejection layers) identify this, and only the required number of layers are used. This makes predictions faster because inputs are only partially passed through the NLBN, resulting in fewer computations. However, a possible drawback is an increase in the training time and memory required due to the extra rejection layers. Additionally, the recall or precision of the network may slightly decrease depending on the number of layers and labels, since inputs can be rejected at any stage of the network, potentially increasing the false negatives or false positives from predictions.
This approach to speed up neural networks is mostly applicable for networks used in real-time classification problems or utilities that require less power consumption and faster predictions.
This includes public chatbots with millions of simultaneous users and smart security cameras that alert owners when they detect an unidentified individual on their property.
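The rejection-layer decision described above can be sketched framework-agnostically. In this minimal sketch, the main model's layers and the rejection heads are stand-in functions with hand-set confidences, and the 0.9 threshold is an assumption (the paper's thresholds were fixed per model):

```python
import numpy as np

THRESHOLD = 0.9  # fixed confidence threshold (could instead be trained)

def predict_with_early_exit(x, layers, rejection_heads, threshold=THRESHOLD):
    """Pass x through `layers`, stopping as soon as a rejection head is confident.

    layers: list of callables forming the main model.
    rejection_heads: list of callables mapping an activation to class probabilities.
    Returns (predicted_class, layers_used).
    """
    for depth, (layer, head) in enumerate(zip(layers, rejection_heads), start=1):
        x = layer(x)
        probs = head(x)                      # probability of each output class
        if probs.max() >= threshold:         # confident enough: bypass the rest
            return int(probs.argmax()), depth
    return int(probs.argmax()), len(layers)  # fell through the whole model

# Toy demo: two "layers" and heads with hand-set confidences.
layers = [lambda x: x + 1, lambda x: x * 2]
heads = [lambda x: np.array([0.95, 0.05]),   # confident at layer 1 -> early exit
         lambda x: np.array([0.40, 0.60])]
label, used = predict_with_early_exit(np.zeros(3), layers, heads)
print(label, used)  # 0 1
```

Here the first rejection head is already above the threshold, so the second layer is never evaluated; that skipped work is the source of the speedup.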

Procedure
In my experiment, I will be training a standard CNN and my NLBN architecture on 3 datasets: the MNIST Handwritten Digits Dataset, the Horses or Humans Dataset, and the Colorectal Histology Dataset.
I will also be collecting and analyzing specific data regarding the performance of the NLBN architecture, which is the primary focus of this procedure section.

Variables
Independent Variable:
1. Percentage (%) of the model that the inputs are passed through
I chose this as the independent variable because the NLBN is hypothesized to work by reducing the percentage of the model that the inputs are passed through. By changing this variable, I will be able to accurately determine the effectiveness of this architecture.

Datasets
MNIST Handwritten Digits Dataset: 5 Contains 60,000 images with a resolution of 28x28. The dataset contains images of handwritten digits from 0 to 9 with an even distribution.
Horses or Humans Dataset: 6 Contains 1000+ images with a resolution of 300x300 in RGB. The dataset contains images of horses and humans with a 50:50 distribution.
Colorectal Histology Dataset: 7 Contains 5,000 images with a resolution of 150x150 in RGB. The dataset contains images of colorectal cancer cells with an even distribution.

Experiment
To determine the effectiveness of the NLBN architecture I have proposed, I will be using Google Colab to analyze the performance of my NLBN and CNN models. These networks were created using the Keras library. 8
Steps to create and use the NLBN architecture:
1. Import the necessary packages and libraries.
2. Upload and preprocess the training and testing data (in this case images).
3. Create a list of TensorFlow / Keras layers called 'layers' that would succeed one another in a model.
4. Create a list of TensorFlow / Keras models (Broken Models - 'bModels') that input the layers from the previous index and output the layers in the same index.
   a. The first index (0) will input the images from the training data.
5. Train the last model in the 'bModels' list; this will automatically train the preceding models.
6. Create a list of rejection layers (Rejection Models - 'rModels') that input the layers in the same index of the 'bModels' and output the probability of an image being a certain class.
7. Train each element of 'rModels'.
8. Create the final list of models that are used for predictions (Structured Models - 'sModels'). Use the layers in the 'layers' list created earlier and load the weights of the respective layers in the 'bModels' list into these layers.
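The steps above can be sketched framework-agnostically. In this minimal stand-in, plain NumPy functions replace trained Keras models, and the names 'layers', 'bModels', 'rModels', and 'sModels' mirror the lists in the steps (all sizes and weights are arbitrary, untrained placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Step 3: 'layers' - successive linear + ReLU maps (stand-ins for Keras layers).
W = [rng.normal(size=(8, 6)), rng.normal(size=(6, 4))]
layers = [lambda x, w=w: np.maximum(x @ w, 0.0) for w in W]

# Step 4: 'bModels' - bModels[i] feeds the output of bModels[i-1] through layer i,
# so training the last one (step 5) would train the whole chain.
def bModel(i, x):
    for layer in layers[: i + 1]:
        x = layer(x)
    return x

# Step 6: 'rModels' - one rejection head per depth, giving class probabilities.
R = [rng.normal(size=(6, 2)), rng.normal(size=(4, 2))]
rModels = [lambda a, r=r: softmax(a @ r) for r in R]

# Step 8: 'sModels' - prediction models reusing bModels' weights (shared W here),
# each ending in its rejection head.
sModels = [lambda x, i=i: rModels[i](bModel(i, x)) for i in range(len(layers))]

x = rng.normal(size=8)
probs_depth1 = sModels[0](x)      # class confidences after one layer
probs_depth2 = sModels[1](x)      # class confidences after the full (toy) model
print(probs_depth1.sum().round(6), probs_depth2.sum().round(6))  # 1.0 1.0
```

At prediction time, sModels are evaluated in order, and propagation stops at the first depth whose rejection head exceeds the confidence threshold; the weight sharing in step 8 is what lets the deeper sModels reuse the already-trained layers.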

Data
From my experiment in the Google Colab Jupyter notebooks, I collected the following data.

Analysis
From the data above, I will be plotting three different graphs with the different dependent variables (accuracy, speed, and percentage of inputs passed) against the independent variable (percentage of the model that the inputs pass through). Each graph uses data from Tables 3-5. Hence, unlike the tables, a separate graph is not made for each dataset. From this data and Graph 1, it can be seen that as the percentage of the model that inputs pass through increases, the accuracy of the model increases until it cannot increase further.
For this reason, the accuracy tends to increase much more in the early stages of the model than in the later stages. The curves approach the maximum accuracy asymptotically. This signifies that the NLBN architecture does not need to pass inputs through 100% of the model to make accurate predictions.
The results of the percentage of inputs passed through a certain percentage of the model is given in Graph 2.
In Graph 2, there is a strong correlation between the percentage of inputs and the percentage of the model these inputs pass through in the NLBN. There are no major outliers, and there is a constant downwards trend of the percentage of inputs, as expected. This is because the layers of the model can only use a percentage of the input less than or equal to that of the previous layer.
For the MNIST Handwritten Digits dataset, the percentage of inputs decreases by over 80% in the first 10% of the model. Next, the percentage of inputs decreases by 15 percentage points (from 18% to 3%). Overall, this analysis of Graph 2 for the three datasets shows how the NLBN architecture does not pass all the inputs through every layer but stops forward propagation when appropriate.
As the percentage of the model increases, the percentage of inputs passed through reduces.
Moreover, the decrease in percentage of inputs tends to be the highest in the early stages of the network. This is because the accuracy of the model while using only the initial layers is still high enough to accurately classify most inputs. As a result there is an exponential-decay-like trend in the data.
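This exponential-decay-like trend can be quantified: given the fraction of inputs still propagating at each layer, the expected per-input cost is simply the sum of those fractions. The sketch below uses hypothetical survival fractions loosely echoing the MNIST figures above, not the measured data:

```python
# Fraction of inputs still propagating when each layer runs (layer 1 sees all).
# Hypothetical values shaped like the MNIST trend: a steep drop after layer 1.
survival = [1.00, 0.18, 0.03, 0.03, 0.02]

# Expected number of layer evaluations per input under early exit.
expected_layers = sum(survival)

# A plain CNN always runs every layer.
full_layers = len(survival)

# Idealized speedup; it ignores the extra cost of the rejection heads and the
# unequal cost of layers, which is why measured speedups are far smaller.
speedup = full_layers / expected_layers
print(round(expected_layers, 2), round(speedup, 2))  # 1.26 3.97
```

The gap between this idealized figure and the measured 6% - 25% speedup highlights how much of the potential gain is spent evaluating the rejection heads themselves.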
To analyze the changes in speed as the percentage of the model increases, Graph 3 will be used. In Graph 3, there is a visible, negative correlation between the speed of forward propagation and the percentage of the model that the inputs are passed through. However, there are certain outliers in the data that deviate from this negative trend.
In the case of the MNIST Handwritten Digits dataset, there appears to be a consistent downwards trend in the first 40% of the model, where the speed decreases from 100% of the initial speed to 94%. However, in the next 10% of the model (the next layer), the speed increases back to 100% of the initial speed. After this, the speed consistently decreases to 90%.
The sudden increase in the speed at the 50% mark of the model is an outlier. However, it could be attributed to the complexity of the network and the efficiency of the CPU/TPU. It is possible that going through certain layers with certain parameters to reduce the size of the input and then reaching a fully connected layer requires fewer computations than passing a large input from the earlier stages into a fully connected layer. Hence, the speed increases. Apart from this, the speed does decrease as the percentage of the model used increases. This is intuitively correct as it generally takes more time to pass inputs through a greater number of layers.
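This explanation can be checked with a rough multiply count. All layer sizes below are hypothetical, not the actual model's; the point is only that a cheap size-reducing layer can more than pay for itself before a fully connected layer:

```python
# Multiplies in a dense layer ~= input_size * units.
units = 128

# Exiting early: a large activation (e.g. 28x28x16) goes straight into the head.
early = 28 * 28 * 16
cost_early_exit = early * units

# Continuing through a 2x2 pooling layer first (14x14x16), then the dense head.
# Pooling itself needs no multiplies, so exiting *later* can cost less overall.
pooled = 14 * 14 * 16
cost_after_pooling = pooled * units

print(cost_early_exit, cost_after_pooling)  # 1605632 401408
```

Under these assumptions, passing through one more (pooling) layer quarters the dense-layer work, which is consistent with the observed bump in speed at that depth.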
For the Horses or Humans dataset, there is a trend similar to that of the previous dataset. There is an overall negative correlation, which is in accordance with the theory of the NLBN architecture. However, the speed of forward propagation at the 20%, 30%, and 40% marks increases in comparison to the speed just before these points. The cause of these outlier points may be similar to that for the MNIST Handwritten Digits dataset: the number and size of the computations required for these specific layers are smaller than for other layers, so the speed increases. Besides these data points, there is a clear trend between the independent and dependent variables. As the percentage of the model that the inputs pass through increases, the speed of the model decreases. In this case, it decreases from 100% of the initial speed to 83%.
Finally, there is a stronger trend for the Colorectal Histology dataset. There are no major outliers, and the speed steadily decreases from 100% of the initial speed to 94%, which is, again, in agreement with the theory behind the NLBN architecture.
From this analysis of Graph 3, it is shown that there is a negative correlation between the percentage of the model that inputs pass through and the speed of forward propagation.
There are certain outliers, but it is possible for them to be due to the specific structure of the NLBN and the resulting computations. All in all, the NLBN architecture can increase the speed of forward propagation by reducing the number of layers (the percentage of the model) that the inputs pass through.

Results
From Tables 1 and 2 and the analyses of Graphs 1-3, the NLBN stops forward propagation early for most inputs; this is shown in Graph 2 and its analysis. Hence, as seen in Graph 4, the NLBN makes faster predictions in comparison to the CNN (an increase in speed of 6.8% - 22.2%).

Conclusion
The NLBN architecture reduces the time taken for forward propagation per image or training example. This results in an increase in the propagation speed as the models tend to exit the forward propagation of inputs at earlier layers, reducing the number of computations.
The accuracy tends to decrease, due to the use of rejection layers to exit forward propagation midway which could possibly reduce recall and precision. This is because labels may be predicted inaccurately in rejection layers with high confidence, thus increasing the number of false positives and false negatives predicted by the NLBN model.
Overall, the NLBN tends to increase the forward propagation speed, and will most likely decrease accuracy by a percentage smaller than the percentage increase in speed. Furthermore, the confidence thresholds for the rejection layers were fixed, rather than trained. If they were trained, it is likely that both the accuracy and forward propagation speed would increase.
Additionally, the datasets used in this research are considered relatively easy to train on while attaining high accuracy. Hence, for these datasets, the accuracy may not depend significantly on the complexity of the model. Moreover, these datasets have an even class distribution, which is not always the case. In real-world scenarios, especially those requiring real-time applications, there tend to be many more true negatives for the model to identify. For the NLBN, true negatives are expected to be easier to predict, so the architecture would pass them through fewer layers, making it faster in practice.
This NLBN model architecture can be further improved through more investigation and can be deployed in more applicable use cases, such as in smart security cameras or in autonomous vehicles.

Potential & Future Applications
This research ventures into the beginnings of deep learning architectures that do not follow the traditional method of forward propagating through all the layers in a model. The NLBN lays the groundwork for other potential neural network architectures that may learn to skip or propagate through layers to improve performance in terms of accuracy, CPU load, and speed. In a more intuitive sense, the next step is to use deep learning to teach deep learning models to become more effective and efficient. This idea is analogous to humans examining their own thought processes and learning how to think better in the future.
The NLBN architecture requires more memory and more time to train because of its extra rejection layers. However, the impact of this downside depends on the resources, circumstances, and limitations of developers for a given application.