Introduction

In the last post, I described the cleaning that I did. In particular, I had to manually correct over 140 titles in my data that did not occur in the IMDb dataset. So I should be ready to join, right? No, of course not; why would things be so simple?

The main problem

Many movie titles occur multiple times in the IMDb database, and as far as I know there is no automated way to resolve this. To measure the extent of the problem, I added a column counting how many matches each title has in the IMDb data:

def num_rows_in_imdb(movie_title, print_rows = False):
    """
    returns the number of rows in the imdb dataset that have primaryTitle equal to movie_title.
    if print_rows is True, then also print the rows
    """
    temp = imdb[imdb.primaryTitle == movie_title]
    num_rows = temp.shape[0]
    if print_rows:
        print(temp)
    return num_rows

df["rows_in_imdb"] = np.nan
df['rows_in_imdb'] = df.movie_title.map(num_rows_in_imdb)
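
To see how widespread the problem was, I looked at the distribution of match counts. Here is a minimal sketch of that check, reusing the column created above (the title in the last line is just an arbitrary illustration):

# distribution: how many of my titles have 0, 1, 2, ... matches in imdb?
print(df.rows_in_imdb.value_counts().sort_index())

# inspect a single title by hand to see what the duplicates look like
num_rows_in_imdb('Titanic', print_rows=True)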

Doing this revealed that most of the entries had multiple occurrences in the IMDb database! This was not good. I looked into a couple of examples by searching on the IMDb website, and it looked like many of the repetitions were matches with TV episodes. So I repeated the above process, this time keeping only the movie-like titles from the IMDb dataset:

imdb = pd.read_csv('title.basics.tsv',
                   sep="\t",
                   na_values = '\\N',
                   usecols = ['tconst', 'primaryTitle', 'startYear', 'titleType'])

# keep only film-like title types; this drops TV episodes
imdb = imdb[imdb.titleType.isin(['short', 'movie', 'tvMovie'])]
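
With the filtered imdb dataframe in place, the match counts can be recomputed in exactly the same way. A sketch, reusing the helper from above:

# recompute the match counts against the movies-only imdb dataframe
df['rows_in_imdb'] = df.movie_title.map(num_rows_in_imdb)

# how many of my titles now match exactly one row?
print((df.rows_in_imdb == 1).sum())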

Removing the TV episodes reduced the number of matches, but the problem remained large: only 351 of my entries have exactly one match in the IMDb data. That is a low number indeed.

Conclusion

I have decided to leave the dataset in the mess that it is and move on to the analysis stage. I will look for patterns within my dataframe first, and then look at those 351 entries that can be joined cleanly with the IMDb dataset to see whether there are any patterns there.
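
For what it's worth, pulling out that joinable subset later on should be straightforward. A sketch, assuming the column names used in this post:

# the 351 entries with exactly one match can be joined unambiguously
clean = df[df.rows_in_imdb == 1]
joined = clean.merge(imdb, left_on='movie_title', right_on='primaryTitle', how='left')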