    Why Gradient Descent Became Stochastic


    In this article, we are going to discuss not only how but also why gradient descent and stochastic gradient descent are used.

    We already know about linear regression, and recently I wrote about it in the context of vectors and projections.

    Now, we will try to understand gradient descent with the help of a linear regression problem.

    But before that, I just want to briefly recall what we already know about linear regression and the math behind it, so that anyone starting out finds it easy to follow.

    If you already know the basic math behind linear regression, then you can directly start from the section titled Why Do We Need Gradient Descent?


    Let’s say we started our machine learning journey, and the first thing we did was to implement a linear regression model in Python.

    We implemented it successfully and got the best values for the slope and intercept.

    Now we have a question: What’s actually happening behind this algorithm?

    We want to understand the math behind it.


    Linear Regression Recap

    For that, let’s consider this data.

    Image by Author

    Now, we want to understand the math behind the algorithm.

    Image by Author

    We come across these formulas for the slope and intercept.

    \[ \beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]

    \[ \beta_0 = \bar{y} - \beta_1\bar{x} \]

    Now, by using these formulas we calculate the slope and intercept of the Simple Linear Regression equation:

    \[ \hat{y} = \beta_0 + \beta_1 x \]

    The dataset is:

    \[ x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8] \]
    \[ y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190] \]

    Compute the mean of x:

    \[ \bar{x} = \frac{1.2+1.4+1.6+2.1+2.3+3.0+3.1+3.3+3.3+3.8}{10} = \frac{25.1}{10} = 2.51 \]

    Compute the mean of y:

    \[ \bar{y} = \frac{39344+46206+37732+43526+39892+56643+60151+54446+64446+57190}{10} = \frac{499576}{10} = 49957.6 \]

    Now compute the numerator of the slope formula. After substitution and calculation:

    \[ \sum(x_i-\bar{x})(y_i-\bar{y}) = 67555.54 \]

    Now compute the denominator. After calculation:

    \[ \sum(x_i-\bar{x})^2 = 7.489 \]

    Now compute the slope:

    \[ \beta_1 = \frac{67555.54}{7.489} \approx 9020.64 \]

    Now compute the intercept, using the unrounded slope 9020.6356 to avoid rounding error:

    \[ \beta_0 = 49957.6-(9020.6356)(2.51) \approx 27315.80 \]

    Therefore:

    \[ \beta_0 \approx 27315.80, \qquad \beta_1 \approx 9020.64 \]

    Final regression equation:

    \[ \hat{y} = 27315.80+9020.64x \]
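
    The slope and intercept formulas can be checked with a few lines of Python. This is a minimal sketch that assumes NumPy is available and uses the ten data points listed above; the function name `fit_simple_regression` is illustrative only.

```python
import numpy as np

# Dataset from the worked example above.
x = np.array([1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8])
y = np.array([39344, 46206, 37732, 43526, 39892,
              56643, 60151, 54446, 64446, 57190], dtype=float)


def fit_simple_regression(x, y):
    """Return (intercept, slope) from the closed-form formulas."""
    x_mean, y_mean = x.mean(), y.mean()
    s_xy = np.sum((x - x_mean) * (y - y_mean))  # numerator of the slope formula
    s_xx = np.sum((x - x_mean) ** 2)            # denominator of the slope formula
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope


beta_0, beta_1 = fit_simple_regression(x, y)
print(f"intercept = {beta_0:.2f}, slope = {beta_1:.2f}")
```

    Printing the two sums `s_xy` and `s_xx` inside the function is an easy way to compare each intermediate step with a hand calculation.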


    We got the values using the formulas, but we are not satisfied and want to go deeper.

    Now our goal is to learn how we got these formulas.

    To understand that, we will now see a 3D bowl curve. We get that bowl curve when we plot all the possible combinations of \(\beta_0\), \(\beta_1\) and the mean squared error (MSE).

    Image by Author

    Now, by looking at the curve, we understand that we need the mean squared error to be as low as possible, and it reaches its minimum at the bottom of the bowl, where the gradient becomes zero.

    We already know that to find the slope of any curve, we need differentiation.

    Next, we differentiate the loss function, since the bowl curve is its 3D representation. Here the loss depends on two variables, \(\beta_0\) and \(\beta_1\).

    So, we take the partial derivative with respect to each one, set both derivatives to zero, and solve to get the formulas for the slope and intercept.

    Deriving the Formulas for Slope and Intercept

    Start with the Mean Squared Error (MSE) loss function:

    \[ MSE(\beta_0,\beta_1) = \frac{1}{n} \sum_{i=1}^{n} (y_i-(\beta_0+\beta_1x_i))^2 \]

    Rearrange the inner expression:

    \[ = \frac{1}{n} \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i)^2 \]

    Now take the partial derivative with respect to \( \beta_0 \):

    \[ \frac{\partial MSE}{\partial \beta_0} = \frac{\partial}{\partial \beta_0} \left( \frac{1}{n} \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i)^2 \right) \]

    Take the constant outside:

    \[ = \frac{1}{n} \frac{\partial}{\partial \beta_0} \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i)^2 \]

    Move the derivative inside the summation:

    \[ = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \beta_0} (y_i-\beta_0-\beta_1x_i)^2 \]

    Apply the chain rule:

    \[ = \frac{1}{n} \sum_{i=1}^{n} 2(y_i-\beta_0-\beta_1x_i) \cdot \frac{\partial}{\partial \beta_0} (y_i-\beta_0-\beta_1x_i) \]

    Apply the derivative rules:

    \[ \frac{d}{d\beta_0}(y_i)=0 \]
    \[ \frac{d}{d\beta_0}(-\beta_0)=-1 \]
    \[ \frac{d}{d\beta_0}(-\beta_1x_i)=0 \]

    So the inner derivative becomes:

    \[ \frac{\partial}{\partial \beta_0} (y_i-\beta_0-\beta_1x_i) = -1 \]

    Substitute back:

    \[ \frac{\partial MSE}{\partial \beta_0} = \frac{1}{n} \sum_{i=1}^{n} 2(y_i-\beta_0-\beta_1x_i)(-1) \]

    Simplify:

    \[ = -\frac{2}{n} \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i) \]

    Set the derivative equal to zero:

    \[ -\frac{2}{n} \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i) = 0 \]

    Multiply both sides by \( -\frac{n}{2} \):

    \[ \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i) = 0 \]

    Expand:

    \[ \sum_{i=1}^{n}y_i - n\beta_0 - \beta_1\sum_{i=1}^{n}x_i = 0 \]

    Rearrange:

    \[ n\beta_0 = \sum_{i=1}^{n}y_i - \beta_1\sum_{i=1}^{n}x_i \]

    Divide by \( n \):

    \[ \beta_0 = \frac{1}{n}\sum_{i=1}^{n}y_i - \beta_1 \frac{1}{n}\sum_{i=1}^{n}x_i \]

    Using means:

    \[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i \]
    \[ \bar{y} = \frac{1}{n}\sum_{i=1}^{n}y_i \]

    Final intercept formula:

    \[ \beta_0 = \bar{y} - \beta_1\bar{x} \]

    Now take the partial derivative with respect to \( \beta_1 \):

    \[ \frac{\partial MSE}{\partial \beta_1} = \frac{\partial}{\partial \beta_1} \left( \frac{1}{n} \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i)^2 \right) \]

    Take the constant outside:

    \[ = \frac{1}{n} \frac{\partial}{\partial \beta_1} \sum_{i=1}^{n} (y_i-\beta_0-\beta_1x_i)^2 \]

    Move the derivative inside the summation:

    \[ = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \beta_1} (y_i-\beta_0-\beta_1x_i)^2 \]

    Apply the chain rule:

    \[ = \frac{1}{n} \sum_{i=1}^{n} 2(y_i-\beta_0-\beta_1x_i) \cdot \frac{\partial}{\partial \beta_1} (y_i-\beta_0-\beta_1x_i) \]

    Apply the derivative rules:

    \[ \frac{d}{d\beta_1}(y_i)=0 \]
    \[ \frac{d}{d\beta_1}(-\beta_0)=0 \]
    \[ \frac{d}{d\beta_1}(-\beta_1x_i)=-x_i \]

    So the inner derivative becomes:

    \[ \frac{\partial}{\partial \beta_1} (y_i-\beta_0-\beta_1x_i) = -x_i \]

    Substitute back:

    \[ \frac{\partial MSE}{\partial \beta_1} = \frac{1}{n} \sum_{i=1}^{n} 2(y_i-\beta_0-\beta_1x_i)(-x_i) \]

    Simplify:

    \[ = -\frac{2}{n} \sum_{i=1}^{n} x_i(y_i-\beta_0-\beta_1x_i) \]

    Set the derivative equal to zero:

    \[ -\frac{2}{n} \sum_{i=1}^{n} x_i(y_i-\beta_0-\beta_1x_i) = 0 \]

    Multiply both sides by \( -\frac{n}{2} \):

    \[ \sum_{i=1}^{n} x_i(y_i-\beta_0-\beta_1x_i) = 0 \]

    Expand:

    \[ \sum_{i=1}^{n}x_iy_i - \beta_0\sum_{i=1}^{n}x_i - \beta_1\sum_{i=1}^{n}x_i^2 = 0 \]

    Substitute:

    \[ \beta_0 = \bar{y} - \beta_1\bar{x} \]

    into the equation:

    \[ \sum_{i=1}^{n}x_iy_i - (\bar{y}-\beta_1\bar{x}) \sum_{i=1}^{n}x_i - \beta_1\sum_{i=1}^{n}x_i^2 = 0 \]

    Expand:

    \[ \sum_{i=1}^{n}x_iy_i - \bar{y}\sum_{i=1}^{n}x_i + \beta_1\bar{x}\sum_{i=1}^{n}x_i - \beta_1\sum_{i=1}^{n}x_i^2 = 0 \]

    Since:

    \[ \sum_{i=1}^{n}x_i=n\bar{x} \]

    Substitute:

    \[ \sum_{i=1}^{n}x_iy_i - n\bar{x}\bar{y} + \beta_1n\bar{x}^2 - \beta_1\sum_{i=1}^{n}x_i^2 = 0 \]

    Group the \( \beta_1 \) terms:

    \[ \beta_1 \left(n\bar{x}^2-\sum_{i=1}^{n}x_i^2\right) = n\bar{x}\bar{y} - \sum_{i=1}^{n}x_iy_i \]

    Multiply both sides by -1:

    \[ \beta_1 \left(\sum_{i=1}^{n}x_i^2-n\bar{x}^2\right) = \sum_{i=1}^{n}x_iy_i - n\bar{x}\bar{y} \]

    Final slope formula:

    \[ \beta_1 = \frac{\sum_{i=1}^{n}x_iy_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n}x_i^2 - n\bar{x}^2} \]

    Equivalent covariance form:

    \[ \beta_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} \]

    Finally, substitute the computed value of \( \beta_1 \) into the intercept equation:

    \[ \beta_0 = \bar{y} - \beta_1\bar{x} \]

    Thus, the final regression equation becomes:

    \[ \hat{y} = \beta_0 + \beta_1x \]
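
    The two partial derivatives derived above can be sanity-checked numerically. The sketch below assumes NumPy and the ten data points used earlier; the helper names are illustrative. It compares the analytic derivatives with finite differences at an arbitrary point, then confirms that both derivatives are (numerically) zero at the closed-form solution.

```python
import numpy as np

x = np.array([1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8])
y = np.array([39344, 46206, 37732, 43526, 39892,
              56643, 60151, 54446, 64446, 57190], dtype=float)
n = len(x)


def mse(b0, b1):
    """Mean squared error of the line b0 + b1 * x."""
    return np.mean((y - b0 - b1 * x) ** 2)


def analytic_gradient(b0, b1):
    """Partial derivatives of the MSE with respect to b0 and b1."""
    residual = y - b0 - b1 * x
    d_b0 = -2.0 / n * np.sum(residual)
    d_b1 = -2.0 / n * np.sum(x * residual)
    return np.array([d_b0, d_b1])


def numeric_gradient(b0, b1, h=1e-3):
    """Central finite differences, used only as an independent check."""
    d_b0 = (mse(b0 + h, b1) - mse(b0 - h, b1)) / (2 * h)
    d_b1 = (mse(b0, b1 + h) - mse(b0, b1 - h)) / (2 * h)
    return np.array([d_b0, d_b1])


# Closed-form solution obtained by setting both derivatives to zero.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

print("analytic gradient at (2, 5):", analytic_gradient(2.0, 5.0))
print("numeric gradient at (2, 5): ", numeric_gradient(2.0, 5.0))
print("gradient at closed-form solution:", analytic_gradient(intercept, slope))
```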

    Now, we learned how we got the formulas for the slope and intercept.

    But one thing we need to consider here is that we derived these formulas for a case where we only have one feature, and even for one feature, we can see how complex the math was.

    What if we have more than one feature, as most real-world datasets do?

    The math becomes more complex, and this is where we use the matrix form to represent the equations. Using matrix notation, we can derive the normal equation, which generalizes to any number of features.


    Deriving the Normal Equation

    In Simple Linear Regression, we derived one intercept and one slope:

    \[ \hat{y} = \beta_0+\beta_1x \]

    However, real-world problems usually contain multiple features.

    For example:

    years of experience
    education level
    age

    In such cases, Linear Regression becomes:

    \[ \hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \cdots + \beta_px_p \]

    where:

    \( \beta_0 \) is the intercept and
    \( \beta_1,\beta_2,\beta_3,\dots,\beta_p \) are slopes for different features

    As the number of features increases, solving separate equations for every parameter becomes difficult.

    To solve this easily, Linear Regression is rewritten using matrix notation.

    Suppose we have \( n \) observations and \( p \) features.

    First define the target vector:

    \[ Y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} \]

    Now define the feature matrix.

    The first column contains only 1s to represent the intercept term.

    \[
    X = \begin{bmatrix}
    1 & x_{11} & x_{12} & \cdots & x_{1p} \\
    1 & x_{21} & x_{22} & \cdots & x_{2p} \\
    1 & x_{31} & x_{32} & \cdots & x_{3p} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    1 & x_{n1} & x_{n2} & \cdots & x_{np}
    \end{bmatrix}
    \]

    Now define the parameter vector:

    \[ \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} \]

    Using matrix multiplication:

    \[
    X\beta = \begin{bmatrix}
    1 & x_{11} & x_{12} & \cdots & x_{1p} \\
    1 & x_{21} & x_{22} & \cdots & x_{2p} \\
    1 & x_{31} & x_{32} & \cdots & x_{3p} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    1 & x_{n1} & x_{n2} & \cdots & x_{np}
    \end{bmatrix}
    \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}
    \]

    Performing the multiplication:

    \[
    = \begin{bmatrix}
    \beta_0+\beta_1x_{11}+\beta_2x_{12}+\cdots+\beta_px_{1p} \\
    \beta_0+\beta_1x_{21}+\beta_2x_{22}+\cdots+\beta_px_{2p} \\
    \beta_0+\beta_1x_{31}+\beta_2x_{32}+\cdots+\beta_px_{3p} \\
    \vdots \\
    \beta_0+\beta_1x_{n1}+\beta_2x_{n2}+\cdots+\beta_px_{np}
    \end{bmatrix}
    \]

    This gives the prediction vector:

    \[ \hat{Y}=X\beta \]

    Now define the residual vector.

    Residuals are the differences between actual and predicted values.

    \[ Y-\hat{Y} \]

    Substituting:

    \[ Y-X\beta \]

    The Mean Squared Error (MSE) becomes:

    \[ MSE = \frac{1}{n} (Y-X\beta)^T(Y-X\beta) \]

    The transpose is required because \( (Y-X\beta) \) is a column vector.

    Multiplying by its transpose converts the expression into a scalar sum of squared residuals.
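
    To see that the matrix expression is the same quantity as the familiar average of squared residuals, the short sketch below evaluates both on the ten data points from the recap. NumPy is assumed, and the parameter values are arbitrary ones chosen only for the comparison.

```python
import numpy as np

x = np.array([1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8])
y = np.array([39344, 46206, 37732, 43526, 39892,
              56643, 60151, 54446, 64446, 57190], dtype=float)

# Feature matrix with a leading column of ones, target as a column vector.
X = np.column_stack([np.ones_like(x), x])   # shape (10, 2)
Y = y.reshape(-1, 1)                        # shape (10, 1)

# Arbitrary parameter values, used only to compare the two expressions.
beta = np.array([[2.0], [5.0]])

residual = Y - X @ beta                                # column vector of residuals
mse_matrix = (residual.T @ residual).item() / len(Y)   # (1/n) r^T r is a 1x1 matrix
mse_elementwise = np.mean((y - (2.0 + 5.0 * x)) ** 2)  # average of squared residuals

print(mse_matrix, mse_elementwise)
```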

    Now expand the expression.

    \[ MSE = \frac{1}{n} (Y-X\beta)^T(Y-X\beta) \]
    \[ = \frac{1}{n} \left( Y^TY - Y^TX\beta - (X\beta)^TY + (X\beta)^TX\beta \right) \]

    Using the transpose property:

    \[ (X\beta)^T = \beta^TX^T \]

    Substitute into the equation:

    \[ MSE = \frac{1}{n} \left( Y^TY - Y^TX\beta - \beta^TX^TY + \beta^TX^TX\beta \right) \]

    Notice that \( Y^TX\beta \) is a scalar.

    Scalars are equal to their transpose.

    Therefore:

    \[ Y^TX\beta = \beta^TX^TY \]

    So the middle two terms combine:

    \[ MSE = \frac{1}{n} \left( Y^TY - 2\beta^TX^TY + \beta^TX^TX\beta \right) \]

    To minimize MSE, take the derivative with respect to \( \beta \).

    The derivative of \( Y^TY \) is zero because it does not contain \( \beta \).

    The derivative of \( -2\beta^TX^TY \) becomes:

    \[ -2X^TY \]

    The derivative of \( \beta^TX^TX\beta \) becomes:

    \[ 2X^TX\beta \]

    Therefore:

    \[ \frac{\partial MSE}{\partial \beta} = \frac{1}{n} \left( -2X^TY + 2X^TX\beta \right) \]

    Simplify:

    \[ = \frac{-2}{n}X^TY + \frac{2}{n}X^TX\beta \]

    Set the derivative equal to zero for minimization:

    \[ \frac{-2}{n}X^TY + \frac{2}{n}X^TX\beta = 0 \]

    Multiply both sides by \( \frac{n}{2} \):

    \[ -X^TY + X^TX\beta = 0 \]

    Rearrange:

    \[ X^TX\beta = X^TY \]

    Now, assuming \( X^TX \) is invertible, multiply both sides on the left by \( (X^TX)^{-1} \):

    \[ (X^TX)^{-1}X^TX\beta = (X^TX)^{-1}X^TY \]

    Using the identity matrix property:

    \[ (X^TX)^{-1}(X^TX)=I \]

    we get:

    \[ I\beta = (X^TX)^{-1}X^TY \]

    Since \( I\beta=\beta \), the final Normal Equation becomes:

    \[ \beta = (X^TX)^{-1}X^TY \]

    This equation simultaneously computes the intercept and all the slopes, that is, the optimal parameters that minimize the Mean Squared Error.

    In general, the normal equation is derived by minimizing the RSS (Residual Sum of Squares). However, since MSE is simply RSS divided by the number of observations, minimizing MSE also produces the same normal equation.
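
    As a rough illustration of how the normal equation handles several features at once, here is a minimal Python sketch. It assumes NumPy; the function name `normal_equation` and the synthetic three-feature dataset are hypothetical and exist only for this example. It solves the linear system \( X^TX\beta = X^TY \) directly instead of forming the inverse explicitly, which gives the same \( \beta \) when \( X^TX \) is invertible.

```python
import numpy as np


def normal_equation(features, targets):
    """Solve (X^T X) beta = X^T Y, after adding an intercept column of ones."""
    n = features.shape[0]
    X = np.column_stack([np.ones(n), features])
    # Solving the system avoids computing the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ targets)


# Hypothetical noise-free data: 50 observations, 3 features.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 3))
true_beta = np.array([4.0, 1.5, -2.0, 0.5])   # intercept followed by three slopes
targets = true_beta[0] + features @ true_beta[1:]

beta_hat = normal_equation(features, targets)
print(beta_hat)   # recovers the intercept and all three slopes in one step
```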


    Now we have the normal equation. Let’s solve for the slope and intercept once again using this equation.

    Solving for Slope and Intercept Using the Normal Equation

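    The matrix form of Linear Regression is:

    \[ \beta=(X^TX)^{-1}X^TY \]

    Construct the feature matrix.

    The first column contains 1s for the intercept term.

    \[ X = \begin{bmatrix} 1 & 1.2 \\ 1 & 1.4 \\ 1 & 1.6 \\ 1 & 2.1 \\ 1 & 2.3 \\ 1 & 3.0 \\ 1 & 3.1 \\ 1 & 3.3 \\ 1 & 3.3 \\ 1 & 3.8 \end{bmatrix} \]

    Construct the target vector:

    \[ Y = \begin{bmatrix} 39344 \\ 46206 \\ 37732 \\ 43526 \\ 39892 \\ 56643 \\ 60151 \\ 54446 \\ 64446 \\ 57190 \end{bmatrix} \]

    Parameter vector:

    \[ \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \]

    Now compute the transpose:

    \[
    X^T = \begin{bmatrix}
    1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
    1.2 & 1.4 & 1.6 & 2.1 & 2.3 & 3.0 & 3.1 & 3.3 & 3.3 & 3.8
    \end{bmatrix}
    \]

    Compute \( X^TX \). Its entries are \( n \), \( \sum x_i \) and \( \sum x_i^2 \):

    \[ X^TX = \begin{bmatrix} 10 & 25.1 \\ 25.1 & 70.49 \end{bmatrix} \]

    Now compute the inverse. The determinant is \( (10)(70.49)-(25.1)^2 = 74.89 \), so:

    \[ (X^TX)^{-1} = \frac{1}{74.89} \begin{bmatrix} 70.49 & -25.1 \\ -25.1 & 10 \end{bmatrix} \approx \begin{bmatrix} 0.9412 & -0.3352 \\ -0.3352 & 0.1335 \end{bmatrix} \]

    Now compute \( X^TY \). Its entries are \( \sum y_i \) and \( \sum x_iy_i \):

    \[ X^TY = \begin{bmatrix} 499576 \\ 1321491.3 \end{bmatrix} \]

    Substitute into the Normal Equation. We keep the exact factor \( \frac{1}{74.89} \) here, because the entries of \( X^TY \) are large and an inverse rounded to four decimals would shift the result noticeably:

    \[ \beta = \frac{1}{74.89} \begin{bmatrix} 70.49 & -25.1 \\ -25.1 & 10 \end{bmatrix} \begin{bmatrix} 499576 \\ 1321491.3 \end{bmatrix} = \frac{1}{74.89} \begin{bmatrix} 2045680.61 \\ 675555.4 \end{bmatrix} \]

    After multiplication:

    \[ \beta \approx \begin{bmatrix} 27315.80 \\ 9020.64 \end{bmatrix} \]

    Therefore:

    \[ \beta_0 \approx 27315.80, \qquad \beta_1 \approx 9020.64 \]

    Final regression equation:

    \[ \hat{y} = 27315.80+9020.64x \]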
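
    The same steps can be reproduced in Python. This is a small sketch assuming NumPy and the ten data points from this example; it prints each intermediate matrix so the values can be compared with a hand calculation.

```python
import numpy as np

x = np.array([1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8])
y = np.array([39344, 46206, 37732, 43526, 39892,
              56643, 60151, 54446, 64446, 57190], dtype=float)

X = np.column_stack([np.ones_like(x), x])   # first column of ones for the intercept

XtX = X.T @ X                 # 2x2 matrix: [[n, sum x], [sum x, sum x^2]]
XtY = X.T @ y                 # 2-vector:  [sum y, sum x*y]
XtX_inv = np.linalg.inv(XtX)  # explicit inverse, only to mirror the hand calculation
beta = XtX_inv @ XtY          # [intercept, slope]

print("X^T X =\n", XtX)
print("X^T Y =", XtY)
print("(X^T X)^-1 =\n", XtX_inv)
print("beta =", beta)
```

    In practice `np.linalg.solve(XtX, XtY)` is usually preferred over the explicit inverse, but the inverse is kept here so that every printed matrix corresponds to one step of the derivation.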


    Why Do We Need Gradient Descent?

    Now, after getting the normal equation for linear regression, we might think that we can solve for the optimal parameters even when we have many features.

    But one thing we need to observe here is that this method works well only for small or medium-sized problems. With many features or a very large dataset, solving the normal equation becomes computationally expensive.

    Let’s look at the normal equation:

    \[ \beta = (X^TX)^{-1}X^Ty \]

    From the equation, we can observe the inverse calculation. With \( p \) features, \( X^TX \) is a \( (p+1)\times(p+1) \) matrix: forming it takes on the order of \( np^2 \) operations and inverting it takes on the order of \( p^3 \). This is where solving for the parameters using the normal equation becomes computationally expensive.

    This works well for small datasets, but in the real world, we often have thousands of features and millions of data points.

    In such cases, solving the normal equation becomes slow and requires a lot of memory and computational power.

    This is where gradient descent is used, because instead of directly solving for the solution, we gradually move toward the optimal solution step by step.

    Now, to understand how gradient descent works, let’s look at the math behind it.


    The Math Behind Gradient Descent

    When we were deriving the normal equation, we arrived at this equation for the derivative of the MSE (here \( y \) denotes the same target vector written as \( Y \) earlier):

    \[ \frac{\partial MSE}{\partial \beta} = \frac{2}{n}X^T(X\beta-y) \]

    This equation represents the gradient (slope) of the bowl-shaped loss curve.

    We made it equal to zero and then solved further to get the normal equation, which is used to find the optimal solution.

    But in gradient descent, we stop at this equation and initialize some random values for \( \beta \). Using these values, we calculate the gradient (slope) and gradually move toward the minimum loss step by step.

    Let’s assume we initialize \( \beta_0 = 2 \) and \( \beta_1 = 5 \):

    \[ \beta^{(0)} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = \begin{bmatrix} 2 \\ 5 \end{bmatrix} \]

    Next, we calculate the slope of the bowl curve by substituting these values into the gradient equation.

    Written in expanded form, the gradient equation is:

    \[ \frac{\partial MSE}{\partial \beta} = \frac{-2}{n}X^Ty + \frac{2}{n}X^TX\beta \]

    The initialized parameter values are just the starting values from where Gradient Descent begins searching for the minimum loss.

    Now let’s construct the feature matrix.

    Since we have one feature, the matrix \( X \) becomes:

    \[ X = \begin{bmatrix} 1 & 1.2 \\ 1 & 1.4 \\ 1 & 1.6 \\ 1 & 2.1 \\ 1 & 2.3 \\ 1 & 3.0 \\ 1 & 3.1 \\ 1 & 3.3 \\ 1 & 3.3 \\ 1 & 3.8 \end{bmatrix} \]

    The first column contains ones for the intercept term.

    Now calculate \( X^T \):

    \[
    X^T = \begin{bmatrix}
    1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
    1.2 & 1.4 & 1.6 & 2.1 & 2.3 & 3.0 & 3.1 & 3.3 & 3.3 & 3.8
    \end{bmatrix}
    \]

    Now calculate \( X^TX \):

    \[ X^TX = \begin{bmatrix} 10 & 25.1 \\ 25.1 & 70.49 \end{bmatrix} \]

    Next, let the target vector be:

    \[ y = \begin{bmatrix} 39344 \\ 46206 \\ 37732 \\ 43526 \\ 39892 \\ 56643 \\ 60151 \\ 54446 \\ 64446 \\ 57190 \end{bmatrix} \]

    Now calculate \( X^Ty \):

    \[ X^Ty = \begin{bmatrix} 499576 \\ 1321491.3 \end{bmatrix} \]

    Our dataset contains \( n=10 \) observations.

    Now substitute all the values into the gradient equation:

    \[ \frac{\partial MSE}{\partial \beta} = \frac{-2}{10} \begin{bmatrix} 499576 \\ 1321491.3 \end{bmatrix} + \frac{2}{10} \begin{bmatrix} 10 & 25.1 \\ 25.1 & 70.49 \end{bmatrix} \begin{bmatrix} 2 \\ 5 \end{bmatrix} \]

    First, calculate the matrix multiplication:

    \[ \begin{bmatrix} 10 & 25.1 \\ 25.1 & 70.49 \end{bmatrix} \begin{bmatrix} 2 \\ 5 \end{bmatrix} = \begin{bmatrix} (10)(2)+(25.1)(5) \\ (25.1)(2)+(70.49)(5) \end{bmatrix} = \begin{bmatrix} 20+125.5 \\ 50.2+352.45 \end{bmatrix} = \begin{bmatrix} 145.5 \\ 402.65 \end{bmatrix} \]

    Now multiply by \( \frac{2}{10} \):

    \[ \frac{2}{10} \begin{bmatrix} 145.5 \\ 402.65 \end{bmatrix} = \begin{bmatrix} 29.1 \\ 80.53 \end{bmatrix} \]

    Next, calculate:

    \[ \frac{-2}{10} \begin{bmatrix} 499576 \\ 1321491.3 \end{bmatrix} = \begin{bmatrix} -99915.2 \\ -264298.26 \end{bmatrix} \]

    Now substitute everything back:

    \[ \frac{\partial MSE}{\partial \beta} = \begin{bmatrix} -99915.2 \\ -264298.26 \end{bmatrix} + \begin{bmatrix} 29.1 \\ 80.53 \end{bmatrix} \]

    Finally:

    \[ \frac{\partial MSE}{\partial \beta} = \begin{bmatrix} -99886.1 \\ -264217.73 \end{bmatrix} \]

    This gradient represents the slope of the bowl-shaped MSE loss curve at the current parameter values.

    Here, \( -99886.1 \) represents the slope with respect to \( \beta_0 \), and \( -264217.73 \) represents the slope with respect to \( \beta_1 \).

    Since both values are negative, the loss decreases as \( \beta_0 \) and \( \beta_1 \) increase, so Gradient Descent moves toward the right (it increases both parameters) to reduce the loss.
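
    The gradient at the starting point can also be computed directly in Python. This is a minimal sketch assuming NumPy and the ten data points of the example; the function name `mse_gradient` is illustrative. It evaluates the expanded form of the gradient equation and checks it against the factored form \( \frac{2}{n}X^T(X\beta-y) \).

```python
import numpy as np

x = np.array([1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8])
y = np.array([39344, 46206, 37732, 43526, 39892,
              56643, 60151, 54446, 64446, 57190], dtype=float)
X = np.column_stack([np.ones_like(x), x])


def mse_gradient(X, y, beta):
    """Expanded gradient of the MSE: (-2/n) X^T y + (2/n) X^T X beta."""
    n = len(y)
    return (-2.0 / n) * (X.T @ y) + (2.0 / n) * (X.T @ X @ beta)


beta_start = np.array([2.0, 5.0])   # initial values used in the text
gradient = mse_gradient(X, y, beta_start)

# The factored form (2/n) X^T (X beta - y) gives the same vector.
gradient_factored = (2.0 / len(y)) * (X.T @ (X @ beta_start - y))

print(gradient)
print(gradient_factored)
```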

    Now, instead of directly solving for the optimal parameters mathematically, Gradient Descent gradually updates the parameter values step by step until it reaches the minimum point of the bowl-shaped loss curve.

    This update is performed using the Gradient Descent update equation:

    \[ \beta := \beta - \alpha\frac{\partial MSE}{\partial \beta} \]

    where \( \alpha \) is called the learning rate and controls how large each update step should be.

    The update equation can be understood step by step.

    \( \beta \) represents the current parameter values.

    \( \frac{\partial MSE}{\partial \beta} \) represents the slope (gradient) of the bowl-shaped loss curve at the current point.

    The gradient tells us the direction in which the loss increases the fastest.

    Therefore, to reduce the loss, we move in the opposite direction of the gradient. This is why the update equation subtracts the gradient, scaled by \( \alpha \).

    If the gradient is positive, Gradient Descent moves toward the left.

    If the gradient is negative, Gradient Descent moves toward the right.

    By repeatedly calculating gradients and updating parameters, Gradient Descent gradually moves toward the minimum point of the bowl-shaped loss curve.

    After updating the parameters, the entire process is repeated until the loss stops decreasing and the model reaches the optimal parameters.

    We can observe here that there is no inverse calculation involved.
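
    Put together, the update rule becomes a short loop. The sketch below assumes NumPy and the same ten data points; the function name, the learning rate of 0.01 and the 20,000 iterations are choices made for this illustration rather than recommended settings. It compares the result with a least-squares solver to show that the loop ends up at the same parameters without any matrix inverse.

```python
import numpy as np

x = np.array([1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8])
y = np.array([39344, 46206, 37732, 43526, 39892,
              56643, 60151, 54446, 64446, 57190], dtype=float)
X = np.column_stack([np.ones_like(x), x])


def gradient_descent(X, y, beta_start, learning_rate, n_iterations):
    """Batch gradient descent on the MSE; returns final beta and loss history."""
    beta = beta_start.astype(float).copy()
    n = len(y)
    losses = []
    for _ in range(n_iterations):
        residual = X @ beta - y                  # predictions minus targets
        losses.append(np.mean(residual ** 2))    # current MSE
        gradient = (2.0 / n) * (X.T @ residual)  # (2/n) X^T (X beta - y)
        beta = beta - learning_rate * gradient   # step against the gradient
    return beta, losses


beta_gd, losses = gradient_descent(
    X, y, np.array([2.0, 5.0]), learning_rate=0.01, n_iterations=20000
)
beta_exact = np.linalg.lstsq(X, y, rcond=None)[0]   # reference solution

print("gradient descent:", beta_gd)
print("least squares:   ", beta_exact)
```

    On this small dataset the loop needs many iterations because the loss surface is much steeper in one direction than the other; the loss drops quickly at first and then improves slowly.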


    Learning Rate

    One important thing we need to understand here is the learning rate.

    Let’s assume:

    \[ \alpha = 0.01 \]

    and the gradient calculated at \( \beta_0 = 2 \), \( \beta_1 = 5 \) is:

    \[ \frac{\partial MSE}{\partial \beta} = \begin{bmatrix} -99886.1 \\ -264217.73 \end{bmatrix} \]

    Now substitute these values into the update equation:

    \[ \beta = \begin{bmatrix} 2 \\ 5 \end{bmatrix} - 0.01 \begin{bmatrix} -99886.1 \\ -264217.73 \end{bmatrix} \]

    First, multiply the learning rate with the gradient:

    \[ 0.01 \begin{bmatrix} -99886.1 \\ -264217.73 \end{bmatrix} = \begin{bmatrix} -998.861 \\ -2642.1773 \end{bmatrix} \]

    Now substitute back:

    \[ \beta = \begin{bmatrix} 2 \\ 5 \end{bmatrix} - \begin{bmatrix} -998.861 \\ -2642.1773 \end{bmatrix} = \begin{bmatrix} 2+998.861 \\ 5+2642.1773 \end{bmatrix} \]

    Finally:

    \[ \beta = \begin{bmatrix} 1000.861 \\ 2647.1773 \end{bmatrix} \]

    After one iteration of Gradient Descent, \( \beta_0 \) changed from:

    \[ 2 \rightarrow 1000.861 \]

    and \( \beta_1 \) changed from:

    \[ 5 \rightarrow 2647.1773 \]

    These updated parameter values move us closer to the minimum point of the bowl-shaped MSE loss curve.

    Now using these updated values, the entire process is repeated again:

    \[ \text{Predictions} \rightarrow \text{Residuals} \rightarrow \text{Loss} \rightarrow \text{Gradient} \rightarrow \text{Parameter Update} \]

    This iterative process continues until the loss stops decreasing and the model reaches the optimal parameters.

    Now let’s understand why choosing the learning rate is very important.

    If the learning rate is very small, for example \( \alpha = 0.000001 \), then the updates become extremely small.

    As a result, learning is very slow, and Gradient Descent may require thousands of iterations or more to reach the minimum point.

    On the other hand, if the learning rate is very large, for example \( \alpha = 10 \), then the updates become extremely large.

    As a result, Gradient Descent may overshoot the minimum point repeatedly and fail to reach the solution.

    Therefore, choosing a proper learning rate is very important for efficient optimization.
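
    The effect of the learning rate is easy to observe by running the same update loop with the three values mentioned above. This is a minimal sketch assuming NumPy and the ten data points of the example; the helper name and the choice of 20 iterations are illustrative (the count is kept small so the diverging run stays within floating-point range).

```python
import numpy as np

x = np.array([1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8])
y = np.array([39344, 46206, 37732, 43526, 39892,
              56643, 60151, 54446, 64446, 57190], dtype=float)
X = np.column_stack([np.ones_like(x), x])


def loss_after_training(learning_rate, n_iterations=20):
    """Run a few batch gradient descent steps from (2, 5) and return the MSE."""
    beta = np.array([2.0, 5.0])
    for _ in range(n_iterations):
        gradient = (2.0 / len(y)) * (X.T @ (X @ beta - y))
        beta = beta - learning_rate * gradient
    return np.mean((X @ beta - y) ** 2)


initial_loss = np.mean((X @ np.array([2.0, 5.0]) - y) ** 2)
final_losses = {lr: loss_after_training(lr) for lr in (1e-6, 0.01, 10.0)}

print(f"initial loss: {initial_loss:.3e}")
for lr, loss in final_losses.items():
    print(f"alpha = {lr:g}: loss after 20 steps = {loss:.3e}")
```

    With the tiny learning rate the loss barely moves, with 0.01 it drops substantially, and with 10 every step overshoots so the loss grows instead of shrinking.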

    GIF by Author

    Stochastic Gradient Descent

    Now we have an idea about what gradient descent actually is.

    In this method, we can observe that we used the entire dataset to calculate the gradients before updating the parameters.

    This process can become slow for very large datasets, and this approach is called batch gradient descent because it uses the entire dataset for every update step.

    Now imagine a dataset containing millions of data points.

    For every single update step, Gradient Descent would need to process the entire dataset, calculate the loss, calculate the gradients, and only then update the parameters.

    This repeated computation becomes computationally expensive and time-consuming.

    This is where Stochastic Gradient Descent (SGD) comes into the picture.

    Instead of calculating gradients using the entire dataset, SGD randomly selects only one observation at a time and immediately updates the parameters.

    The update equation still remains the same:

    [
    beta:=beta-alphafrac{partial MSE}{partial beta}
    ]

    The only difference is that the gradient is now calculated using a single observation instead of the entire dataset.

    We can understand this by using one data point from our dataset.

    The parameter values are:

    [
    beta^{(0)}=
    begin{bmatrix}
    2 \
    5
    end{bmatrix}
    ]

    and the learning rate is:

    [
    alpha = 0.01
    ]

    Now let’s say SGD randomly selected the following training example from our dataset:

    [
    (x,y)=(3.0,56643)
    ]

    For this single observation:

    [
    X=
    begin{bmatrix}
    1 & 3.0
    end{bmatrix}
    ]

    and

    [
    y=
    begin{bmatrix}
    56643
    end{bmatrix}
    ]

    Now calculate:

    [
    X^T=
    begin{bmatrix}
    1 \
    3.0
    end{bmatrix}
    ]

    Next calculate:

    [
    X^TX
    ]
    [
    =
    begin{bmatrix}
    1 \
    3.0
    end{bmatrix}
    begin{bmatrix}
    1 & 3.0
    end{bmatrix}
    ]
    [
    =
    begin{bmatrix}
    1 & 3.0 \
    3.0 & 9.0
    end{bmatrix}
    ]

    Now calculate:

    [
    X^Ty
    ]
    [
    =
    begin{bmatrix}
    1 \
    3.0
    end{bmatrix}
    begin{bmatrix}
    56643
    end{bmatrix}
    ]
    [
    =
    begin{bmatrix}
    56643 \
    169929
    end{bmatrix}
    ]

    Since SGD is using only one observation:

    [
    n=1
    ]

    Now substitute everything into the gradient equation:

    [
    frac{partial MSE}{partial beta}
    =
    frac{-2}{n}X^Ty
    +
    frac{2}{n}X^TXbeta
    ]

    Substituting:

    [
    =
    frac{-2}{1}
    begin{bmatrix}
    56643 \
    169929
    end{bmatrix}
    +
    frac{2}{1}
    begin{bmatrix}
    1 & 3.0 \
    3.0 & 9.0
    end{bmatrix}
    begin{bmatrix}
    2 \
    5
    end{bmatrix}
    ]

    First calculate the matrix multiplication:

    [
    begin{bmatrix}
    1 & 3.0 \
    3.0 & 9.0
    end{bmatrix}
    begin{bmatrix}
    2 \
    5
    end{bmatrix}
    ]
    [
    =
    begin{bmatrix}
    (1)(2)+(3.0)(5) \
    (3.0)(2)+(9.0)(5)
    end{bmatrix}
    ]
    [
    =
    begin{bmatrix}
    2+15 \
    6+45
    end{bmatrix}
    ]
    [
    =
    begin{bmatrix}
    17 \
    51
    end{bmatrix}
    ]

    Now multiply by:

    [
    frac{2}{1}
    ]
    [
    =
    begin{bmatrix}
    34 \
    102
    end{bmatrix}
    ]

    Now calculate:

    [
    frac{-2}{1}
    begin{bmatrix}
    56643 \
    169929
    end{bmatrix}
    =
    begin{bmatrix}
    -113286 \
    -339858
    end{bmatrix}
    ]

    Now substitute everything back:

    [
    frac{partial MSE}{partial beta}
    =
    begin{bmatrix}
    -113286 \
    -339858
    end{bmatrix}
    +
    begin{bmatrix}
    34 \
    102
    end{bmatrix}
    ]

    Finally:

    [
    frac{partial MSE}{partial beta}
    =
    begin{bmatrix}
    -113252 \
    -339756
    end{bmatrix}
    ]

    This gradient represents the slope of the bowl-shaped loss curve for this single training example.

    Now update the parameters using:

    [
    beta:=beta-alphafrac{partial MSE}{partial beta}
    ]

    Substituting the values:

    \[
    \beta =
    \begin{bmatrix}
    2 \\
    5
    \end{bmatrix}
    -
    0.01
    \begin{bmatrix}
    -113252 \\
    -339756
    \end{bmatrix}
    \]

    First multiply the learning rate:

    \[
    =
    \begin{bmatrix}
    2 \\
    5
    \end{bmatrix}
    -
    \begin{bmatrix}
    -1132.52 \\
    -3397.56
    \end{bmatrix}
    \]

    Now subtract:

    \[
    =
    \begin{bmatrix}
    2+1132.52 \\
    5+3397.56
    \end{bmatrix}
    \]

    Finally:

    \[
    \beta =
    \begin{bmatrix}
    1134.52 \\
    3402.56
    \end{bmatrix}
    \]
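
    The update step can be checked the same way. In this sketch the helper name `sgd_update` is hypothetical; the starting parameters, the gradient, and the learning rate of 0.01 are the values used in the calculation above.

```python
# One SGD parameter update: beta := beta - alpha * gradient.
# The helper name is an assumption made for this illustration.

def sgd_update(beta, gradient, learning_rate):
    """Return the parameters after a single gradient step."""
    return [b - learning_rate * g for b, g in zip(beta, gradient)]


beta = [2.0, 5.0]                    # current intercept and slope
gradient = [-113252.0, -339756.0]    # gradient for the single observation
learning_rate = 0.01                 # alpha in the update rule

new_beta = sgd_update(beta, gradient, learning_rate)
print([round(b, 2) for b in new_beta])   # [1134.52, 3402.56]
```

    Because the gradient is negative in both components, subtracting it moves both parameters upward, toward values that predict a larger output.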

    After processing just one observation, the parameters are updated immediately.

    Now SGD randomly selects another observation from the dataset and repeats the same process again.

    Unlike batch gradient descent, which waits to process the entire dataset before updating the parameters, SGD updates the parameters after every single training example.

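    Because each update needs only one observation, SGD starts improving the parameters right away and, on large datasets, typically gets close to the solution with far less computation than batch gradient descent.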

    We can observe how simple the calculation becomes when using just one observation.

    SGD keeps updating the parameters using different training examples until the loss reaches its minimum or stops changing significantly.

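    The trade-off is that the path toward the minimum becomes noisy and zig-zags, because each update is based on a single example rather than the whole dataset.

    Even so, the low cost of each update makes SGD highly useful for modern machine learning and deep learning problems involving very large datasets.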
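
    Putting the pieces together, the whole procedure might look like the sketch below. Several things here are assumptions made for illustration: the function name, the fixed number of passes (used in place of a "loss stopped changing" check), and the tiny synthetic dataset generated from y = 1 + 2x, which is not the Salary dataset.

```python
import random


def sgd_linear_regression(xs, ys, learning_rate=0.01, epochs=500, seed=0):
    """Fit y ~ b0 + b1 * x, updating after every single example."""
    rng = random.Random(seed)      # seeded so the run is repeatable
    b0, b1 = 0.0, 0.0              # starting intercept and slope
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)       # visit the examples in random order
        for i in indices:
            error = ys[i] - (b0 + b1 * xs[i])   # target minus prediction
            # Single-observation gradient: -2 * x^T * (y - x beta)
            grad_b0 = -2.0 * error
            grad_b1 = -2.0 * error * xs[i]
            b0 -= learning_rate * grad_b0
            b1 -= learning_rate * grad_b1
    return b0, b1


# Synthetic toy data from y = 1 + 2x (an assumption, NOT the Salary dataset).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x for x in xs]

b0, b1 = sgd_linear_regression(xs, ys)
print(round(b0, 3), round(b1, 3))
```

    On this noise-free toy data the estimates should settle close to an intercept of 1 and a slope of 2. On real data, the noisy updates usually keep the parameters hovering around the minimum rather than landing on it exactly.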


    Conclusion

    Now we have an idea of both gradient descent and stochastic gradient descent.

    First, we derived the normal equation, and then we learned that computing the matrix inverse becomes computationally expensive and memory-intensive for large datasets.

    To solve this problem, we used gradient descent, which is not limited to linear regression but is also used in many machine learning and deep learning algorithms.

    Next, we learned that the first form of gradient descent we used, batch gradient descent, can itself become slow for very large datasets because it processes the entire dataset before every parameter update.

    This led us to stochastic gradient descent (SGD), which updates the parameters using one training example at a time and works faster than batch gradient descent for large datasets.

    We also have another variation of gradient descent called mini-batch gradient descent, in which we use a small batch of training examples from the dataset, such as 32 or 64 rows, before updating the parameters.

    This makes each update cheaper than in batch gradient descent and less noisy than in stochastic gradient descent.
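
    A mini-batch version differs from SGD only in how many rows feed each update. The sketch below is an illustration under stated assumptions: the function name is made up, the data is a synthetic set generated from y = 1 + 2x rather than the Salary dataset, and the batch size is 2 only because the toy dataset has six rows (in practice a value such as 32 or 64 would be more typical).

```python
import random


def minibatch_gradient_descent(xs, ys, batch_size=2, learning_rate=0.01,
                               epochs=1000, seed=0):
    """Fit y ~ b0 + b1 * x, updating once per mini-batch."""
    rng = random.Random(seed)          # seeded so the run is repeatable
    b0, b1 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)           # new random batches every pass
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            n = len(batch)             # the last batch may be smaller
            grad_b0 = 0.0
            grad_b1 = 0.0
            for i in batch:
                error = ys[i] - (b0 + b1 * xs[i])
                # Average the per-example gradients over the batch
                grad_b0 += -2.0 * error / n
                grad_b1 += -2.0 * error * xs[i] / n
            b0 -= learning_rate * grad_b0
            b1 -= learning_rate * grad_b1
    return b0, b1


# Synthetic toy data from y = 1 + 2x (an assumption, NOT the Salary dataset).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0 + 2.0 * x for x in xs]

b0, b1 = minibatch_gradient_descent(xs, ys)
print(round(b0, 3), round(b1, 3))
```

    Setting `batch_size=1` gives SGD, and setting it to the dataset size gives batch gradient descent, so all three methods are the same loop with a different batch size.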


    Even though linear regression has a closed-form solution, we often prefer to use gradient descent when working with large datasets containing millions of observations because the normal equation becomes computationally expensive and impractical.

    In deep learning, however, closed-form solutions usually do not exist, which makes optimization algorithms like gradient descent even more important.


    Dataset License

    The dataset used in this blog is the Salary dataset.

    It is publicly available on Kaggle and is licensed under the Creative Commons Zero (CC0 Public Domain) license. This means it can be freely used, modified, and shared for both non-commercial and commercial purposes without restriction.


    I hope you now have a better understanding of what gradient descent and stochastic gradient descent actually are.

    If you’d like to read more of my writing, you can also find it on Medium and LinkedIn.

    I recently wrote a detailed breakdown of Lasso Regression from a geometric and intuitive perspective.

    You can read it here.

    Thanks for reading!
