Computational Complexity of Gradient Descent Algorithm

Abstract: Information is growing exponentially, and the world increasingly mines knowledge from Big Data. Machine Learning uses labelled data for automated learning and data analysis. Linear Regression is a statistical method for predictive analysis, and Gradient Descent is an iterative process that minimizes a cost function, typically the mean squared error, by following its gradients. This work presents an insight into three variants of the Gradient Descent algorithm, namely Batch Gradient Descent, Stochastic Gradient Descent and Mini-Batch Gradient Descent, implemented on a Linear Regression dataset, and determines their computational complexity along with the effect of factors such as learning rate, batch size and number of iterations on the efficiency of each algorithm.


I. INTRODUCTION
Machine Learning is a field comprising algorithms that identify patterns in data in order to predict future data or to make crucial decisions under uncertainty. Knowledge extraction, pattern recognition, image detection and face recognition are a few tasks that machine learning addresses, and these capabilities make it a core component of artificial intelligence (AI). A well-designed model should adapt to changes in its environment and determine the necessary actions by predicting their effects.
Gradient descent is one of the most popular optimization algorithms and by far the most common way to optimize neural networks. It was introduced by the French mathematician Augustin-Louis Cauchy in 1847. It is based on the idea that, instead of immediately guessing the best solution to a problem, one iteratively takes steps in the direction in which the solution improves. Changing the parameter values involves differential calculus, specifically computing the derivative of the cost function. The derivative gives the slope of the function at a specific point; in other words, it specifies how a small change in the input scales into the corresponding change in the output. The derivative is therefore useful for minimizing the cost function, because it tells us how to change the parameters to make a small improvement towards the function's minimum.
The Gradient Descent algorithm can be implemented in three ways, namely Batch Gradient Descent, Stochastic Gradient Descent and Mini-Batch Gradient Descent. This paper aims to deduce and compare the computational efficiencies of the three algorithms by implementing them on a Linear Regression dataset while varying parameters such as batch size, number of iterations and learning rate, and observing their effect on the efficiency of each algorithm.
Gradient descent is widely used in its raw form in linear and logistic regression analysis of data, and the mini-batch variant is the most widely used approach to regression and classification problems. Optimizations of this algorithm give rise to the foundational optimizers used in Artificial Neural Networks, such as Adam, RMSProp and AdaGrad, which work on the same principle as gradient descent but are better optimized in terms of time complexity.

II. LINEAR REGRESSION
In statistics, linear regression is a linear approach to modelling the relationship between a scalar response (dependent variable) and one or more explanatory (independent) variables. The case with one independent variable is termed simple linear regression; for more than one independent variable, the process is termed multiple linear regression. This term is distinct from multivariate linear regression, where multiple dependent variables are predicted rather than a single scalar variable. A general equation of a simple linear regression is

Y = A + B·X (eq. 1)

where Y represents a one-dimensional output vector, X represents a one-dimensional input vector, A is the intercept and B is the slope of the fitted line. In regression problems like these, the main aim is to determine the optimum values of (A, B) that make the input and output vectors fit eq. 1 as closely as possible. Fig. 1 shows the regression analysis, which involves scatter-plotting the data points with X on the x-axis and Y on the y-axis and then identifying the type of regression; in this case the relationship appears to be linear.
The values of A and B are determined by various algorithms, the most widely used being the Normal Equation method and the Gradient Descent algorithm.
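As a point of comparison for the gradient descent methods studied below, the Normal Equation mentioned above solves for the parameters in closed form. The following is a minimal NumPy sketch (not the paper's code; the data and variable names are illustrative):

```python
import numpy as np

# Illustrative data exactly following Y = A + B*X with A = 2, B = 3
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 2.0 + 3.0 * X

# Design matrix with a column of ones so the intercept A is learned too
Xd = np.column_stack([np.ones_like(X), X])

# Normal Equation: theta = (X^T X)^(-1) X^T Y
theta = np.linalg.inv(Xd.T @ Xd) @ Xd.T @ Y
A, B = theta
print(A, B)  # recovers A = 2.0, B = 3.0
```

Unlike gradient descent, this gives the answer in one step, but the matrix inversion becomes expensive for a large number of features.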

III. GRADIENT DESCENT
Before analysing the Gradient descent algorithm, it is important to know some equations and terms.
Eq. 1 can be re-written as

h_θ(X) = Xθ (eq. 2)

which is called the prediction function, where θ is an n×1 matrix representing the n parameters (A and B in eq. 1). X is an m×n matrix with m data points and n features. The aim of linear regression is to iteratively refine the values of the θ matrix (A and B in eq. 1) until they reach optimal values that fit the given data in eq. 2.
Simply put, a cost function is a measure of how wrong the model is in terms of its ability to estimate the relationship between X and Y. It is typically expressed as the squared difference (distance) between the predicted value and the actual value, summed over the m samples:

J(θ) = (1/2m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i))² (eq. 3)

This summation gives the net cost for the current parameters θ.
Gradient descent is a way to minimize an objective function J(θ), parameterized by a model's parameters θ ∈ Rⁿ, by updating the parameters in the direction opposite to the gradient of the objective function ∇θJ(θ) with respect to the parameters:

θ := θ − α·∇θJ(θ) (eq. 4)

The learning rate α determines the size of the steps we take to reach a (local) minimum.
In other words, the downhill direction of the slope of the surface created by the objective function is followed until a valley is reached. At this point the model has optimized the weights so that they minimize the cost function. This process is integral to machine learning because it greatly expedites training: gradient descent enables the learning process to make corrective updates to the learned estimates, moving the model towards an optimal combination of parameters.
The learning rate α determines how fast or slow the parameter updates are during gradient descent. If the learning rate is too high, the local minimum seen in Fig. 2 may be skipped completely due to a very large update; if it is too small, a large number of updates (iterations) are needed to reach the minimum, which costs a lot of time.
Hence, selecting the optimum learning rate is important when performing gradient descent on data. Taking the first-order partial derivative of the cost function J(θ) and substituting it into eq. 4 gives the finalized gradient descent update shown in eq. 5:

θ_j := θ_j − α·(1/m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i))·x_j^(i) (eq. 5)

The general gradient descent equation for multivariate problems is the same as eq. 5, with more than two parameters to be updated in the θ matrix. This step is repeated until the cost function attains a minimum value. Since the cost depends directly on θ, the cost must be recalculated after every parameter update to check whether it has reached a minimal point. The number of iterations needed to converge to the minimum is of great interest, as it determines the efficiency of the algorithm. It depends on the type of data fed into the model; for a given dataset, however, parameters like batch size and learning rate influence the convergence of the cost to its minimum value.
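The effect of the learning rate described above can be illustrated on a toy objective. The following sketch (not from the paper; the function and values are illustrative) minimizes J(θ) = θ², whose derivative is 2θ, with two different values of α:

```python
def descend(alpha, steps=50, theta=5.0):
    """Run `steps` gradient descent updates on J(theta) = theta**2."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta  # theta := theta - alpha * dJ/dtheta
    return theta

print(descend(0.1))   # small alpha: converges towards the minimum at 0
print(descend(1.1))   # too-large alpha: each step overshoots and diverges
```

With α = 0.1 each update shrinks θ by a factor 0.8, while with α = 1.1 the factor is −1.2, so the iterate oscillates with growing magnitude, exactly the overshooting behaviour described above.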

IV. TYPES
Variants of gradient descent are defined by how much of the data is used to calculate the derivative of the cost function. Depending on the quantity of data used per update, the time complexity and accuracy of the algorithms differ from one another.

A. Batch Gradient Descent
It is the primary type of gradient descent, in which the complete dataset is used to compute the gradient of the cost function. As the gradient must be calculated on the entire dataset just to perform one update, batch gradient descent can be very slow and is intractable for datasets that do not fit in memory. After initializing the parameters with arbitrary values, the gradient of the cost function is calculated using eq. 5.

B. Stochastic Gradient Descent
Batch Gradient Descent turns out to be a slower algorithm, as all the training examples are used for each update. So, for faster computation, stochastic gradient descent is preferred. As a first step of the algorithm, the whole training set is randomized. Then only one training example is used per update to compute the gradient of the cost function. Because it uses one training example per update, this algorithm is faster for larger datasets.
The primary reason for shuffling the dataset is to induce randomness. In SGD, since the parameter values are updated for each training sample individually, a lot of noise appears in the cost function updates, as seen in Fig. 4; to decrease this, shuffling the dataset is essential.
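The shuffling step described above must keep each input paired with its output. One common way to do this (an illustrative sketch, not the paper's code) is to shuffle a single index permutation and apply it to both arrays:

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.arange(10.0)
Y = 2 * X  # illustrative targets paired with X

# One shared permutation keeps every (x, y) pair aligned after shuffling
perm = rng.permutation(len(X))
X_shuf, Y_shuf = X[perm], Y[perm]

assert np.all(Y_shuf == 2 * X_shuf)  # pairing is preserved
```

Shuffling X and Y with two independent permutations would silently destroy the dataset, so using a single permutation (or shuffling indices) is the safe idiom.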

Fig. 4 Contour Plot: SGD Cost Function
In SGD one might sacrifice some accuracy, but the computation of results is faster. After initializing the parameters with arbitrary values, the gradient of the cost function is calculated one sample at a time:

θ_j := θ_j − α·(h_θ(x^(i)) − y^(i))·x_j^(i), for i = 1, …, m (eq. 6)

where m is the number of training examples. SGD never actually converges the way batch gradient descent does; instead it ends up wandering around a region close to the global minimum, as seen in Fig. 4.

The SGD algorithm, unlike Batch Gradient Descent, does not depend on the total number of training examples per update; however, the learning rate influences its behaviour tremendously. The noise in the updates can also be regularised by selecting an appropriate α. It is observed that SGD reaches the neighbourhood of the global minimum in fewer iterations, owing to the multiple parameter updates performed within a single iteration.

C. Mini-Batch Gradient Descent
The mini-batch algorithm is the most favoured and widely used variant, producing precise and fast results using batches of k training examples. The whole training set is shuffled and divided into batches of k examples, which are then used one batch at a time to update the parameters. Common mini-batch sizes range between 16 and 256 but can vary across applications.
After initializing the parameters with arbitrary values, the gradient of the cost function is calculated batch by batch:

θ_j := θ_j − α·(1/k) · Σ_{i∈batch} (h_θ(x^(i)) − y^(i))·x_j^(i), for each of the b batches (eq. 7)

where b = m/k is the number of batches and m is the total number of training examples.
From eq. 7 it is inferred that the parameter updates happen over multiple batches of a fixed number of training examples; because of these multiple batches, some noise in the cost function gradient is expected. However, compared with the stochastic gradient approach, the reduction in noise is significant, as seen in Fig. 5, due to the batched training.
Mini-batch gradient descent seeks to strike a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. It is the most common implementation of gradient descent used in the field of deep learning. It reduces the variance of the parameter updates, which can lead to more stable convergence, and it makes use of highly optimized matrix operations, which makes computing the gradient very efficient.

B. Technological Stack Required
The experiment was conducted on Google Colab Platform. Colaboratory is a free Jupyter notebook environment provided by Google where free GPUs and TPUs can be used to train ML models.
The dataset contained 700 training examples of simple linear regression data created manually for this experiment; hence no pre-processing was required, and the data was mostly noise-free.
The algorithms were written and trained in Python. Additional modules such as NumPy, Random, sklearn.metrics and Matplotlib were imported for vector multiplication, RMS error computation and graph plotting.

C. Data Visualisation and Analysis
The chosen dataset contained 700 data points with one input parameter X and one output parameter Y; a sample of the dataset is shown in Fig. 6. From the scatter plot it was clearly deducible that the data was in accordance with linear regression. Small deviations are common in large datasets; nevertheless, the graph provided clear evidence that the data was suitable for training a linear regression model.

D. Control Parameters
The parameters kept under observation were the number of training samples m, the learning rate α, the batch size k, the estimated cost J, and the number of basic steps t required to complete training.
These parameters were varied across all the three algorithms to realise their effect on performance and efficiency.

E. Model Training, Evaluation and Observation
The dataset was trained on all the three models namely, Batch Gradient Descent, Stochastic Gradient Descent and Mini-Batch Gradient Descent. Test Parameters were varied, and the observations were recorded for further analysis.
The python function used to compute cost (J) is as follows:

Fig. 7: Cost Function
This function is called after every iteration to check whether the cost has reached a minimum or has saturated to a value as the parameters are updated.
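The cost function of Fig. 7 is not reproduced in the text; a minimal NumPy sketch of a squared-error cost for the prediction Y = a + b·X, with illustrative data, might look like the following (variable names are assumptions, not the paper's code):

```python
import numpy as np

def cost(a, b, X, Y):
    """Mean squared error cost J for the prediction Y_hat = a + b*X."""
    m = len(X)
    predictions = a + b * X
    return np.sum((predictions - Y) ** 2) / (2 * m)

X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.0, 4.0, 6.0])
print(cost(0.0, 2.0, X, Y))  # 0.0 for a perfect fit (Y = 2*X exactly)
```

A cost of zero indicates a perfect fit; during training this value is recomputed after every parameter update to detect convergence, as described above.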
a. Batch Gradient Descent
The Python program for implementing Batch Gradient Descent follows the prediction equation Y = a + b·X, in which a and b are the variables to be found through BGD. The following data was tabulated after running batch gradient descent with varying numbers of training samples m and learning rates α. Table 2 shows the number of iterations (epochs) taken to reach the minimum cost for the corresponding learning rates and training-sample counts. Looking at Table 2, the number of iterations taken to reach the minimal cost is very high even for a small training set of 16 samples. The loss of the model is evaluated by computing the root-mean-square loss. Since the aim of this paper is to determine the computational complexity of the algorithm, the basic step is taken to be the update of the parameters a and b. This step is performed twice every iteration (once for a and once for b); therefore, the total number of basic steps performed, i.e. the complexity, is

t = 2·I

where I is the number of iterations run to reach the minimal cost.
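The batch gradient descent program itself is not reproduced above. A self-contained sketch consistent with eq. 5, using illustrative data and hyperparameters (not the paper's actual values), could be:

```python
import numpy as np

def batch_gd(X, Y, alpha=0.5, iterations=2000):
    """Fit Y = a + b*X by batch gradient descent (eq. 5 style updates)."""
    m = len(X)
    a, b = 0.0, 0.0
    for _ in range(iterations):
        error = (a + b * X) - Y              # uses all m samples per update
        a -= alpha * np.sum(error) / m       # basic step 1: update a
        b -= alpha * np.sum(error * X) / m   # basic step 2: update b
    return a, b

# Illustrative noise-free data with a = 1, b = 2
X = np.linspace(0.0, 1.0, 50)
Y = 1.0 + 2.0 * X
a, b = batch_gd(X, Y)
```

Note that each of the two basic steps traverses all m samples, which is the hidden cost revisited in the complexity discussion later in this section.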

b. Stochastic Gradient Descent
The Python program for implementing Stochastic Gradient Descent follows the same prediction equation Y = a + b·X, in which a and b are the variables to be found through SGD. The following data was tabulated after running stochastic gradient descent with varying numbers of training samples m and learning rates α. Table 5 shows the number of iterations (epochs) taken to reach the minimum cost for the corresponding learning rates and training-sample counts. The second variable b eventually converges to a constant value of 1.00x as the number of data points grows, whereas variable a converges when m reaches 700. Looking at the iterations in Table 5, the number of iterations taken to reach the minimal cost is high for smaller datasets but becomes very low for larger training sets. The loss of the model is evaluated by computing the root-mean-square loss. The basic step is again the update of the parameters a and b; it is performed 2·m times every iteration (once for a and once for b, for each of the m samples). Therefore, the total number of basic steps performed, i.e. the complexity, is

t = 2·m·I

where I is the number of iterations run to reach the minimal cost. The number of steps taken is tabulated in Table 7.
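The stochastic gradient descent program is likewise not reproduced above. A sketch consistent with eq. 6, shuffling the data each epoch and updating on one sample at a time (illustrative data and hyperparameters, not the paper's), could be:

```python
import numpy as np

def stochastic_gd(X, Y, alpha=0.05, epochs=200, seed=0):
    """Fit Y = a + b*X by SGD: shuffle each epoch, one sample per update."""
    rng = np.random.default_rng(seed)
    m = len(X)
    a, b = 0.0, 0.0
    for _ in range(epochs):
        for i in rng.permutation(m):          # eq. 6: per-sample updates
            error = (a + b * X[i]) - Y[i]
            a -= alpha * error
            b -= alpha * error * X[i]
    return a, b

# Illustrative noise-free data with a = 1, b = 2
X = np.linspace(0.0, 1.0, 50)
Y = 1.0 + 2.0 * X
a, b = stochastic_gd(X, Y)
```

Each epoch performs m cheap updates instead of one expensive one; on noisy data the iterates would wander around the minimum as described above, while on this noise-free sketch they settle close to it.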

c. Mini-Batch Gradient Descent
The Python program for implementing Mini-Batch Gradient Descent follows the same prediction equation Y = a + b·X, in which a and b are the variables to be found through MBGD. The following data was tabulated after running mini-batch gradient descent with varying numbers of training samples m, learning rates α and batch sizes k. Table 8 shows the number of iterations (epochs) taken to reach the minimum cost for the corresponding learning rates and training-sample counts. The second variable b eventually converges to a constant value of 1.00x as the number of data points grows, whereas variable a converges when m reaches 700. The values of a, b and the cost are very similar to those obtained with Stochastic Gradient Descent; however, the iterations (Table 8) taken to reach the minimal cost are high for smaller datasets and gradually lower for larger training sets. The loss of the model is evaluated by computing the root-mean-square loss. The basic step is again the update of the parameters a and b; it is performed 2·(m/k) times every iteration (once for a and once for b, for each batch).
Here k represents the batch size, and (m/k) represents the number of batches present in the dataset.
Since an update happens for each batch within one iteration, the total number of basic steps performed, i.e. the complexity, is

t = 2·(m/k)·I

where I is the number of iterations run to reach the minimal cost. The number of steps taken is tabulated in Table 10.
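The mini-batch program is not reproduced above either. A sketch consistent with eq. 7, shuffling each epoch and updating once per batch of k samples (illustrative data and hyperparameters, not the paper's), could be:

```python
import numpy as np

def minibatch_gd(X, Y, alpha=0.3, epochs=200, k=16, seed=0):
    """Fit Y = a + b*X by mini-batch gradient descent (eq. 7 style)."""
    rng = np.random.default_rng(seed)
    m = len(X)
    a, b = 0.0, 0.0
    for _ in range(epochs):
        perm = rng.permutation(m)
        for start in range(0, m, k):              # m/k batches per epoch
            idx = perm[start:start + k]
            error = (a + b * X[idx]) - Y[idx]
            a -= alpha * np.sum(error) / len(idx)
            b -= alpha * np.sum(error * X[idx]) / len(idx)
    return a, b

# Illustrative noise-free data with a = 1, b = 2
X = np.linspace(0.0, 1.0, 50)
Y = 1.0 + 2.0 * X
a, b = minibatch_gd(X, Y)
```

Averaging the gradient over k samples smooths the per-update noise relative to SGD while still giving m/k updates per epoch, which is the balance described above.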

F. Graphical Analysis of the Model
The linear regression model trained with the three different algorithms gave similar outputs for the various sets of inputs. Taking the test parameters into consideration, the number of iterations taken by SGD to reach the minimal value is the least of the three algorithms. Thus, SGD has the advantage of fewer iterations over the other two algorithms:

SGD < MBGD < BGD
Next is the comparison of the RMS loss computed for all the algorithms: the loss is least for Batch Gradient Descent and highest for SGD. Thus, BGD has the advantage of lower loss in the computation of the model:

BGD < MBGD < SGD
The last and most important comparison of the algorithms concerns the number of basic steps performed. At first sight, BGD takes very few steps to complete computation whereas SGD takes the most, as seen in the observation tables. But those tables are incomplete. Although the above is true in theory, in practical implementations MBGD and SGD take less computation time to train the model, whereas BGD takes the most. The reason is that in Batch Gradient Descent, although a and b are updated only once per iteration, each update is very expensive: it requires a traversal through all the data points in the training sample followed by a final summation. Considering the traversal and summation to be significant as well, a revised time complexity for BGD is

t = 2·m·I

This changes Table 4 accordingly; the revised table, together with Table 7, correctly identifies the number of basic steps as well as the complexity of the algorithm.
It should be noted that there is tough competition between MBGD and SGD with regard to time complexity. Although SGD theoretically has the lowest time complexity, in practice MBGD outperforms SGD on many datasets, offering lower time complexity.
From the above analysis, we can conclude that Stochastic Gradient Descent and Mini-Batch Gradient Descent take less time to train the model and reach the minimum than Batch Gradient Descent, which clearly takes much longer, as seen in fig. 18 and fig. 19.

B. Technological Stack Requirement
The technological stack for this part of the experiment was the same as before: the Google Colab platform with Python, using the NumPy, Random, sklearn.metrics and Matplotlib modules.
C. Data Visualisation
The data used in this part of our analysis, for Logistic Regression, has two feature columns and a column of targets consisting of 1 or 0, indicating whether the parameter has passed a certain threshold. A sample of the dataset is presented in the accompanying table, and plotting these points yields the scatter plot shown.
The red points represent samples with 1 in the corresponding target column, while the blue points represent samples with 0. The control parameters were varied across all three algorithms to realize their effect on performance and efficiency.

E. Model Training, Evaluation and Observation
The dataset was trained on all three models, namely Batch Gradient Descent, Stochastic Gradient Descent and Mini-Batch Gradient Descent, across three notebooks, with subtle variations to suit the shapes of the data frames present in the algorithms. Test parameters were varied, and the observations were recorded for further analysis.
The Python functions used to compute the gradient and the cost (J) are as follows:

Fig. 16: Gradient Descent and Update Functions
These functions are called after every iteration to check whether the cost has reached a minimum or has saturated to a value as the parameters are updated.
The following functions provide the mathematical models used to calculate the quantities we have considered, i.e. the gradient and the loss.

Fig 17: Gradient Descent Functions
Here the loss function represents the cost that the model incurs over the course of the entire training; this is the quantity we are trying to minimise. The gradient is then used to calculate the direction and extent of the parameter movement needed to minimise the cost.
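The loss and gradient functions of Figs. 16 and 17 are not reproduced in the text. A minimal sketch of a logistic-regression log loss and its gradient (an assumption about their form, with illustrative data; not the paper's code) might be:

```python
import numpy as np

def predict(X, weights):
    """Sigmoid of the linear combination X @ weights."""
    return 1.0 / (1.0 + np.exp(-(X @ weights)))

def log_loss(X, targets, weights):
    """Average cross-entropy (log) loss over the training set."""
    p = predict(X, weights)
    eps = 1e-12  # guard against log(0)
    return -np.mean(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))

def gradient(X, targets, weights):
    """Gradient of the log loss with respect to the weights."""
    return X.T @ (predict(X, weights) - targets) / len(targets)

# Illustrative two-sample, two-feature check at zero weights
X = np.array([[1.0, 0.0], [1.0, 1.0]])
t = np.array([0.0, 1.0])
w = np.zeros(2)
print(log_loss(X, t, w))  # log(2) ≈ 0.693: every prediction is 0.5
```

At zero weights every prediction is 0.5, so the loss equals log 2, a handy sanity check when implementing these functions.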

a. Batch Gradient Descent
The Python program for implementing Batch Gradient Descent is as follows. The learning rate is set as a variable and determines the leap the gradient takes every iteration.
Y = X·Weights is the prediction equation taken, in which Weights represents the list of variables to be determined through BGD. The following data was tabulated after running batch gradient descent with a varying number of epochs and a learning rate of 0.0001 (Table 14: test parameter vs iterations). A constant learning rate was chosen so that the cost associated with these iteration counts gives an accurate picture of how many iterations are required for minimal loss. After a certain number of iterations the cost becomes minimal for this set, and the weights converge to values that are extremely close to each other and produce the same cost. The loss of the model is evaluated by computing the log loss. The basic step is the update of the weights; this step is performed twice every iteration (once for each weight). Therefore, the total number of basic steps performed, i.e. the complexity, is

t = 2·I

where I is the number of iterations run to reach the minimal cost.
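The batch gradient descent training loop for the logistic model is not reproduced above. A sketch under the same assumptions as the loss and gradient functions (sigmoid predictions, full-batch gradient; illustrative separable data, not the paper's) could be:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_bgd(X, targets, learning_rate=0.1, epochs=2000):
    """Logistic regression by batch gradient descent on the log loss."""
    weights = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ weights)
        grad = X.T @ (p - targets) / len(targets)  # full-batch gradient
        weights -= learning_rate * grad            # one update per epoch
    return weights

# Illustrative 1-D separable data with a bias column
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
targets = np.array([0.0, 0.0, 1.0, 1.0])
weights = train_bgd(X, targets)
```

After training, the predicted probabilities fall on the correct side of 0.5 for every sample, which is the behaviour the tabulated costs reflect.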
b. Stochastic Gradient Descent
The Python program for implementing Stochastic Gradient Descent is shown in Fig. 19 (training function). Here, 'features' represents the training set and 'targets' represents the actual outputs. The basic operations are the updates of 'loss' and 'grad', which provide the cost of the model and the gradient output respectively. The learning rate is set as a variable and determines the leap the gradient takes every iteration.
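The training function of Fig. 19 is not reproduced in the text. A sketch of per-sample (stochastic) updates for the logistic model, under the same assumptions as before (sigmoid predictions; illustrative data and hyperparameters), could be:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sgd(features, targets, learning_rate=0.1, epochs=500, seed=0):
    """Logistic regression by SGD: shuffle each epoch, one sample per update."""
    rng = np.random.default_rng(seed)
    weights = np.zeros(features.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(targets)):       # per-sample updates
            p = sigmoid(features[i] @ weights)
            grad = (p - targets[i]) * features[i]
            weights -= learning_rate * grad
    return weights

# Illustrative 1-D separable data with a bias column
features = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
targets = np.array([0.0, 0.0, 1.0, 1.0])
weights = train_sgd(features, targets)
```

The per-sample updates make the cost jump around between epochs, matching the faster but noisier behaviour discussed below.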
Y = X·Weights is the prediction equation taken, in which Weights represents the list of variables to be determined through SGD. The following data was tabulated after running stochastic gradient descent with a varying number of epochs and a learning rate of 0.0001. Table 17 shows the number of iterations (epochs) taken to reach the minimum cost for the corresponding learning rates and training samples.
As can be noted, the number of iterations remains the same as for BGD; however, as we will see, fewer epochs are required to reach the minimum cost. After a certain number of iterations the cost becomes minimal for this set, and the weights converge to values that are extremely close to each other and produce the same cost.
Moreover, comparing with the BGD algorithm, notice how fast both the weights and the cost of SGD jump; this is in line with SGD's ability to produce faster analysis over the same number of epochs as BGD.
The difference in cost between BGD and SGD is attributed to the shuffling of data required in each epoch of SGD in order to access random samples in every iteration. The loss of the model is evaluated by computing the log loss. The basic step is the update of the weights; this step is performed twice every iteration (once for each weight). Therefore, the total number of basic steps performed, i.e. the complexity, is

t = 2·I

where I is the number of iterations run to reach the minimal cost. The number of steps taken is tabulated in Table 18.

c. Mini-Batch Gradient Descent
The Python program for implementing Mini-Batch Gradient Descent is shown in Fig. 20. Here, 'features' represents the training set and 'targets' represents the actual outputs. The basic operations are the updates of 'loss' and 'grad', which provide the cost of the model and the gradient output respectively. The learning rate is set as a variable and determines the leap the gradient takes every iteration.
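The mini-batch program of Fig. 20 is not reproduced in the text. A sketch of batched updates for the logistic model, under the same assumptions as the earlier sketches (sigmoid predictions; illustrative data, batch size and hyperparameters), could be:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_minibatch(features, targets, learning_rate=0.1, epochs=500, k=2, seed=0):
    """Logistic regression by mini-batch gradient descent on the log loss."""
    rng = np.random.default_rng(seed)
    m = len(targets)
    weights = np.zeros(features.shape[1])
    for _ in range(epochs):
        perm = rng.permutation(m)
        for start in range(0, m, k):                  # m/k batches per epoch
            idx = perm[start:start + k]
            p = sigmoid(features[idx] @ weights)
            grad = features[idx].T @ (p - targets[idx]) / len(idx)
            weights -= learning_rate * grad
    return weights

# Illustrative 1-D separable data with a bias column
features = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
targets = np.array([0.0, 0.0, 1.0, 1.0])
weights = train_minibatch(features, targets)
```

Averaging the gradient over each batch of k samples damps the noise of pure SGD while keeping multiple updates per epoch, which is why this variant sits between the other two.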
Y = X·Weights is the prediction equation taken, in which Weights represents the list of variables to be determined through MBGD. The following data was tabulated after running mini-batch gradient descent with a varying number of epochs and variable learning rates. Table 19 shows the number of iterations (epochs) taken to reach the minimum cost for the corresponding learning rates and training samples. The values of the weights and the cost are very similar to those obtained with Stochastic Gradient Descent; however, the iterations taken to reach the minimal cost are high for smaller datasets and gradually lower for larger training sets.
The loss of the model is evaluated by computing the log loss. The basic step is the update of the weights; this step is performed 2·(m/k) times every iteration (once for each weight, for each batch).
Here k represents the batch size, and (m/k) represents the number of batches present in the dataset.
Since an update happens for each batch within one iteration, the total number of basic steps performed, i.e. the complexity, is

t = 2·(m/k)·I

where I is the number of iterations run to reach the minimal cost.

F. Graphical Analysis of the Model
The logistic regression model trained with the three different algorithms gave similar outputs for the various sets of inputs. Taking the test parameters into consideration, the number of iterations taken by SGD to reach the minimal value is the least of the three algorithms. Notice the cost reduction in SGD compared with the other algorithms: the least cost for an equal number of epochs comes from SGD.
Thus, SGD has an advantage of lesser iterations over the other two algorithms.

SGD < MBGD < BGD
Next is the comparison of the log loss computed for all the algorithms: the loss is least for Batch Gradient Descent and highest for SGD. Thus, BGD has the advantage of lower loss in the computation of the model:

BGD < MBGD < SGD

The last and most important comparison of the algorithms concerns the number of basic steps performed. At first sight, BGD takes very few steps to complete computation whereas SGD takes the most, as seen in the observation tables. But those tables are incomplete. Although the above is true in theory, in practical implementations MBGD and SGD take less computation time to train the model, whereas BGD takes the most. The reason is that in Batch Gradient Descent, although the weights are updated only once per iteration, each update is very expensive: it requires a traversal through all the data points in the training sample followed by a final summation, which must also be counted as significant. It should be noted that there is tough competition between MBGD and SGD with regard to time complexity. Although SGD theoretically has the lowest time complexity, in practice MBGD outperforms SGD on many datasets, offering lower time complexity.
From the above analysis, we can conclude that Stochastic Gradient Descent and Mini-Batch Gradient Descent take less time to train the model and reach the minimum than Batch Gradient Descent, which clearly takes much longer, as seen in fig. 18, fig. 19 and fig. 14. The Batch Gradient Descent algorithm performs well on small datasets with minimal loss in the model computation; however, in practical situations, owing to the size of real-world datasets, BGD fails to perform well because of its high time complexity.
In practical applications, Stochastic Gradient Descent is used only for datasets of about 1000-2000 data points; it is not suitable for larger numbers of data points because of the noise it generates during parameter updates, which leads to a high loss in the computational results.
Hence, Mini-Batch Gradient Descent is the most widely used algorithm for any kind of regression analysis, be it linear or logistic, and it forms the basis for the construction of Artificial Neural Networks. Further optimizations help remove the noise present in MBGD and further improve its computational complexity; well-known optimizers such as RMSProp, Adam and AdaGrad are all based on the Mini-Batch Gradient Descent algorithm.