In the previous posts, I used gradient descent to model linear, quadratic and sinusoidal data. In the first post, the linear and quadratic models could be fit, but the sinusoidal data could not be fit. In the second post, we saw how neural networks kind of fit the data, but not very well.

This time, I will add the stochasticity by introducing mini-batches. My hope is that I will be able to fit the sinusoidal data that I could not fit in the first post. I will discuss this example at the end, because it is the most interesting

Models with neural networks

Here are animations of a neural network trying to fit using stochastic gradient descent.

The end results look similar to the end results without using mini-batches. The big difference is that the network converges much faster.

Fitting sinusoidal data using a sinusoidal model

Here is a representative example of my first attempts using SGD on sinusoidal data. Note that in all of the animations in this section, the same dataset and initial parameters were used, so the comparisons are more rigorous.

To my dismay, we seem to have the same problem as with normal gradient descent. The system gets stuck in some local minimum where the amplitude is small. It still continues to ‘learn’ though.

At this point, I experimented a little (e.g. with some ‘regularisation’), which I plan to describe in a separate post (spoiler alert - they didn’t work). While planning this blogpost, I re-watched the animation above and thought that maybe the learning rate is too big. I presumably tried playing with the learning rate already but for the sake of completeness, I thought it would be good to produce a series of animations to show you that varying the learning rate does not help.

The learning rate in the animation above was 0.1. Below is an animation for a learning rate of 0.01.

! That was unexpected! It managed to find good parameters, but then jumped to some inferior local minimum. I managed to achieve something similar to this using the regularisation mentioned above (and which I will describe in a later post), but I was not expecting to see this by changing the learning rate. Clearly my memory is off and I had not experimented with the learning rate. I thought I would re-run the calculations to see if the same behaviour would occur again. (Note that the initial parameters are the same in all these animations, but there is still randomness from how the mini-batches are selected.)

So we get similar behaviour. For some time it looks like we are getting close to a good model but then it jumps away to some other set of parameters.

Looks like I should make the learning rate smaller, and see if that prevents jumping away from the correct model. The next animation is for a learning rate of 0.001.

!!! Wow! Given that I had failed after several hours of trying, this was basically magic to me. The model gently and slides its way into position, increases its amplitude, then stays there. Fantastic!

Now a big question arises. Were the learning rates in the first post of this series too high, and that was the reason for the struggles with sinusoidal models? There’s only one way to find out, and that’s by doing the experiment. Below is the resulting animation.

!!! All this time, it was as simple as changing the learning rate. How did I miss this?! What is noteworthy is how slow the learning is in gradient descent as compared to stochastic gradient descent.


There are two big lessons I learnt from this.

  • The first is to somehow take good notes of what I have tried and to be systematic. In my previous job teaching maths to STEM Foundation Year students, my colleague who taught Laboratory Skills was trying to explain the purpose of a lab-book to students: it should be a record of what you did, why you did it, what you observed, etc. so that somebody else (in particular, future-you) can read it and re-live your experience. It looks like I have only now learnt this lesson my colleague was trying to teach. Better late than never, I suppose.

  • The second is that the main benefit of stochastic gradient descent seems to be in efficiency/speed. I have read in places that it can help with preventing local minimums, but I am still unsure of this latter point.

The code

The code for this project is in this GitHub repository. I encourage you to play around and see what you can learn. If there is anything you do not understand in the code, please ask.