Other posts in series
Introduction
Recently, FastAI launched a new version of their course, using their revamped FastAI version 2 packages. As I had only completed the first two lessons of the previous version of the course, I thought I’d restart. It is good I did re-start, because they have made big changes to how the course is organised and also to the syntax (e.g. in the old version, there was significant discussion about setting the learning rate using their ‘learning rate finder’, but so far, there has been no mention of this and it looks like that has been automated). Towards the end of the first lesson, Jeremy showcases the different domains in which one can use Deep Learning. One of those was NLP. I have not yet done any NLP, so I thought this is a good chance to play around a little, without having to delve into any of the technical details.
In this particular example, the NLP consists of sentiment analysis of movie reviews. Looking at the code, they use AWD LSTM architecture. A quick google search reveals that this is a fairly new algorithm, combining various ideas and tools into one. I am not yet at the stage to understand this though, but hopefully it will be discussed later in the course.
Experimentation
The input to the model is a string containing a movie review. The output is a probability/score between 0 and 1 indicating how much the model thinks the review is positive. My initial thought was to to see if the model had learnt about numbers, by seeing how it rates reviews of the form ‘x out of y’.
‘x out of 5’
The results of this experimentation:
0 out of 5: 0.5570
1 out of 5: 0.4760
2 out of 5: 0.3281
3 out of 5: 0.4038
4 out of 5: 0.4073
5 out of 5: 0.8569
This was surprising to me:
- The model thinks that the review ‘0 out of 5’ is likely a positive review
- The model is roughly undecided about ‘1 out of 5’
- The model thinks 2,3 or 4 out of 5 is negative
- The model thinksthat ‘5 out of 5’ is positive.
As with most of these ML models, this must be a reflection of the underlying dataset. I imagine very few people actually give a movie 0 or 1 out of 5, so the model doesn’t really know what to do with it, whereas ‘2 out of 5’ probably is more common and basically says the movie is bad, and ‘3 out of 5’ and ‘4 out of 5’ are more average-y ratings.
‘x out of 10’
The results of this:
0 out of 10: 0.5402
1 out of 10: 0.4478
2 out of 10: 0.2972
3 out of 10: 0.3741
4 out of 10: 0.3798
5 out of 10: 0.8475
6 out of 10: 0.9897
7 out of 10: 0.99996
8 out of 10: 0.99991
9 out of 10: 0.9998
10 out of 10:0.99996
Again the model is confused with the worst reviews, presumably because they just don’t occur in the training data. Then the model seems to deal with 2,3 and 4 out of 10 sensibly. But then it has big jump and thinks the rest are definitely positive, with anything above 7 out of 10 being considered positive with high confidence. Again, the explanation for these unusual evaluations of the reviews must be that they are uncommon in the dataset.
Variations on ‘x out of 5’
I tried a few little variations on ‘x out of 5’ to see if it would make any impact: adding an exclamation mark at the end, adding the word ‘stars’ at the end, and adding the word ‘Only’ at the start.
x | ‘x out of 5’ | ‘x out of 5!’ | ‘x out of 5 stars’ | ‘Only x out of 5’ |
---|---|---|---|---|
0 | 0.5570 | 0.7436 | 0.7247 | 0.0240 |
1 | 0.4760 | 0.6756 | 0.6538 | 0.0247 |
2 | 0.3281 | 0.4901 | 0.4884 | 0.0196 |
3 | 0.4038 | 0.5670 | 0.5634 | 0.0189 |
4 | 0.4073 | 0.5634 | 0.5589 | 0.0179 |
5 | 0.8569 | 0.9051 | 0.8969 | 0.5464 |
- Adding an exclamation mark makes the model think the reviews are more positive, which in my opinion is sensible.
- Adding the word ‘stars’ seems to have very similar effect to adding an exclamation mark. I do not know if this is just a coincidence - I do not have intuition for this. I guess it must mean that if the word ‘star’ appeared in the training data, then it mostly meant it was a positive review.
- Adding the word ‘only’ dramatically reduced the positivity score, which makes sense.
Miscellaneous statements
Lastly, I tried a variety of miscellaneous statements:
I liked some bits and I disliked other bits
0.9345
I disliked some bits and I liked other bits
0.8857
I liked most of it, but not all of it
0.9906
I disliked most of it, but not all of it
0.9782
I hated most of it, but not all of it
0.9486
I disliked the movie
0.7621
I hated the movie
0.6873
I cannot believe how bad the movie is
0.5535
The model is comically bad at judging whether the statement is positive. I started by trying a neutral statement, but the model thought it was great. I then tried the favourable statement ‘I liked most of it…’ and the model thought it was positive (sensible), but still thought the statement was positive when I switched liked and disliked. Using the word ‘hated’ instead reduced it a bit.
Maybe the model hasn’t even learnt the word ‘disliked’ and ‘hated’. So I just tried the simple statements ‘I dislike/hated the movie’, and the model does reduce its score, but still thinks these are overall positive. I end with a particularly strong negative review, which the model perceives as slightly positive.
Conclusions
It is nice to play around with a model without any expectation to understand the inner-intricacies or working. My main takeaways are:
- The model has learnt some sensible patterns, e.g. adding ‘only’ indicates a negative review, adding an exclamation mark indicates a positive review, etc.
- The model fails spectacularly badly on several examples, which do not feel contrived to me. This makes me question the reliability of sentiment analysis. Maybe the examples I created are not representative of real-world examples.
- If I were a decision-maker in some business, I would definitely be using manual analysis until I see better evidence.