Introduction

I will recap my investigations into fitting sinusoidal data using sinusoidal models with SGD.

Regularisation

Based on the examples I had tried, the amplitude would always tend to zero. Hence, I thought it would be worth adding a regularisation term that penalises small amplitudes.

The loss function was loss = mse(y_est, y). After adding the regularisation, it became loss = mse(y_est, y) - parameters_est[0]. Why did I choose this regularisation? Because parameters_est[0] is the amplitude, so subtracting it from the loss rewards larger amplitudes and pushes back against the collapse towards zero.
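To make this concrete, here is a minimal sketch of what the regularised loss might look like (PyTorch is an assumption, as is the model form a * sin(b * x + c) + d, which is inferred from the parameter descriptions further down):

```python
import torch

def model(x, params):
    # Assumed model: a * sin(b * x + c) + d, with params = [a, b, c, d]
    a, b, c, d = params
    return a * torch.sin(b * x + c) + d

def loss_fn(x, y, parameters_est):
    y_est = model(x, parameters_est)
    mse = torch.mean((y_est - y) ** 2)
    # Subtracting the amplitude (parameters_est[0]) rewards larger amplitudes,
    # counteracting the tendency to collapse towards zero.
    return mse - parameters_est[0]
```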

Below is the first result of introducing this regularisation.

As you can see, there are moments where the model gets close to the data. This got my hopes up and made me feel like I was onto something.

I tried various other things to see if I could make it better: changing the weight of the regularisation term, and adding other regularisation terms (because in these experiments the frequency now seemed to have a tendency to keep increasing). I can’t remember whether I tried anything else. Suffice it to say, I made no progress.
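For example, a weighted version of the loss, reusing the model sketch above, might look something like this (the weights and the exact form of the frequency penalty are illustrative assumptions, not the exact terms used):

```python
amp_weight = 0.1    # weight on the amplitude reward
freq_weight = 0.01  # penalise large frequencies, which kept drifting upwards

def weighted_loss(x, y, parameters_est):
    y_est = model(x, parameters_est)
    mse = torch.mean((y_est - y) ** 2)
    # Reward amplitude, penalise large frequency coefficients.
    return mse - amp_weight * parameters_est[0] + freq_weight * parameters_est[1] ** 2
```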

Below is an animation of an experiment which involved changing the weight of the regularisation term. I include it only because I thought it was particularly funky and visually interesting.

Visualising the loss function

After failing to get regularisation to work, I decided to try visualising the loss function, to find out exactly where the local minima were and, hopefully, to better understand why things were not working.

The process I followed was:

To begin, I set c and d to 0 and varied a and b, where a is the amplitude and b is the coefficient of x (i.e. the frequency multiplied by 2*pi). The reason for fixing c and d is that it was the amplitude and the frequency that were giving the most trouble.
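Concretely, the sweep amounts to evaluating the loss on a grid over a and b, along the lines of this sketch (the data-generating parameters and grid ranges are assumptions):

```python
import numpy as np

x = np.linspace(0.0, 10.0, 200)
y = np.sin(x)  # assumed data: amplitude 1, frequency 1, no phase or offset

def mse_loss(a, b, c=0.0, d=0.0):
    y_est = a * np.sin(b * x + c) + d
    return np.mean((y_est - y) ** 2)

amplitudes = np.linspace(0.0, 2.0, 50)     # varied from chart to chart
frequencies = np.linspace(0.0, 3.0, 300)   # the x-axis of each chart
losses = np.array([[mse_loss(a, b) for b in frequencies] for a in amplitudes])
# losses[i, j] is the loss at amplitude amplitudes[i] and frequency frequencies[j]
```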

The first animation below shows a sequence of charts. Each individual chart shows how the loss varies with frequency, and from chart to chart the amplitude is changing.

As can be seen from this, there are many local minima, so the model might get stuck in the wrong one. Eyeballing the chart, it looks like if the initial frequency is below 0.5 or above 1.7, gradient descent will push the frequency away from the optimal value of 1. It is now clear why there should be a tendency for the frequency to increase, as we saw in the SGD examples in Part III.

The next animation is the opposite. For each individual chart, we see how the loss varies with amplitude, and from chart to chart we are modifying the frequency.

Fantastic! This one I feel I can understand. For the majority of frequencies, the optimal amplitude is zero, and the amplitude will simply slide its way to that value. Only for a narrow range of frequencies is the optimal amplitude non-zero.

To summarise, based on these two animations, here is what I would predict: if the initial frequency is too far from the true value (roughly below 0.5 or above 1.7), gradient descent will push the frequency into the wrong local minimum and the amplitude will slide to zero; only if the initial frequency starts close enough to the true value should the model converge to the data.

As I am writing this up and thinking things through, I am starting to wonder about my conclusion in Part III about the sinusoidal model. In Part III, I concluded that the issue all along was having an inappropriate learning rate, but the two animations above suggest there is more to it. Did I just get lucky and stumble upon starting parameters that fit the criteria I described above, and is that why I got the sinusoidal model to fit? There’s only one way to find out: more experimentation!

Investigating parameter initialisation

The steps for the investigation are as follows:

I start by trying a large value for the frequency, 5.
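Here is a minimal sketch of what one of these runs might look like (again assuming PyTorch; the data, learning rate, step count and other initial values are illustrative, since the original code isn't shown here):

```python
import torch

x = torch.linspace(0.0, 10.0, 200)
y = torch.sin(x)  # assumed target: amplitude 1, frequency 1

def fit(b0, steps=5000, lr=0.01):
    """Fit a * sin(b * x + c) + d with SGD, starting from frequency b0."""
    params = torch.tensor([0.5, b0, 0.0, 0.0], requires_grad=True)
    optimiser = torch.optim.SGD([params], lr=lr)
    for _ in range(steps):
        a, b, c, d = params
        y_est = a * torch.sin(b * x + c) + d
        loss = torch.mean((y_est - y) ** 2)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
    return params.detach()

print(fit(5.0))  # start with a large initial frequency
```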

So, as predicted, the frequency gets stuck at a sub-optimal value and the amplitude tends to zero. It looks like I did just get lucky in Part III.

Frequencies of 2 and 1.5 behave similarly:

Frequency of 1.2:

We get the model converging to the data! Though this is to be expected, it is still satisfying to see it actually work. With a bit of manual experimentation, I found the cut-off between these two behaviours to be roughly 1.46.
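One way to locate that cut-off, reusing the fit sketch above, is simply to scan initial frequencies and see where the behaviour flips (the tolerance is arbitrary):

```python
for b0 in [1.40, 1.44, 1.46, 1.48, 1.52]:
    a_final, b_final, _, _ = fit(b0)
    # Treat "close to amplitude 1 and frequency 1" as having converged to the data.
    converged = abs(float(a_final) - 1.0) < 0.05 and abs(float(b_final) - 1.0) < 0.05
    print(f"initial frequency {b0:.2f}: {'converges' if converged else 'gets stuck'}")
```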

How about lower frequencies? A frequency of 0.6 converges to the correct model:

And a frequency of 0.5 converges to a different minimum:

Again, this is consistent with the frequency vs loss charts above, where you can see there are local minima to the left of the global minimum.

Conclusion

This has been a bit of a topsy-turvy learning experience. I am still surprised at how much I learnt from this basic example. And having struggled with this simple example, I better appreciate how impressive it is to get complicated neural networks to learn.