Fixing bad indexing

One of the columns is an index showing how many movies I have watched since I started the record. This was mostly accurate, except that, for reasons unknown to me, the latest records cycled through the indices 670 to 679 a few times: instead of the index going from 679 to 680, it would jump back to 670. So I wrote some code to fix this.
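
Since that column is just a running count, the simplest fix is to reassign the whole sequence. A minimal sketch, assuming the rows are already in watch order and calling the column movie_index (a hypothetical name):

# Rows are in watch order, so a fresh 1..n sequence repairs the
# repeated 670-679 runs ('movie_index' is an illustrative column name)
df['movie_index'] = range(1, len(df) + 1)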

Creating a datetime column

One of the columns has the year, and another has the date formatted as a string “DD Mon” (e.g. 1 Jan). I wanted to combine these to create a column with datetime type and “yyyy-mm-dd” format. I first tidied the month column: some months had multiple spellings (e.g. Jun and June), and sometimes other data had ended up there (e.g. #recc), so I fixed those. Then I tidied the date column by padding the day numbers so they were always two characters long. That tidying looked roughly like the sketch below (the month replacements shown are illustrative, not my exact mappings).

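# Normalise month spellings to three-letter abbreviations
month_map = {'June': 'Jun', 'July': 'Jul', 'Sept': 'Sep'}
df['month'] = df.month.replace(month_map)

# Pad single-digit days to two characters, e.g. '1' -> '01'
df['date'] = df.date.astype(str).str.zfill(2)

Once it was all tidy, I created the new column:
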
# Build a 'YYYY Mon DD' string per row and parse it; anything that
# still fails to parse becomes NaT rather than raising an error
df['datetime'] = pd.to_datetime(df.year.astype(str)+' '+df.month+' '+df.date,
                                errors='coerce',
                                format='%Y %b %d')

Tidying the source column

There is a column called “source”, which records where I watched the movie, e.g. Netflix, cinema, TV, etc. Tidying this required a lot of manual work, because there were many different sources that I wanted to handle in different ways. For example, sometimes the source was the name of a friend (because I watched the movie at their house), so I wanted to replace the source with “friend” and move the friend’s name to the “other” column. In retrospect, I could have automated parts of this, as in the sketch below. But what is done is done, I suppose.
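
For instance, the friend case could have been handled in bulk, along these lines (a sketch, assuming a hand-maintained set of friends' names):

# Hypothetical set of names that appear in the source column
friends = {'Alice', 'Bob'}

# Move the name into 'other', then standardise the source
is_friend = df.source.isin(friends)
df.loc[is_friend, 'other'] = df.loc[is_friend, 'source']
df.loc[is_friend, 'source'] = 'friend'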

Tidying the other column

There is a column called “other” for miscellaneous information. This required no cleaning, which is nice!

Tidying the title column

I anticipate joining this table with an IMDb dataset, so I can get information like ratings, director, and movie length that might be fun to include in my analysis. I downloaded the title.basics.tsv dataset, read it into a dataframe keeping only the columns tconst, primaryTitle and startYear, and converted all the primaryTitles to lowercase.
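
The loading step looked roughly like this (assuming the file is already downloaded and unzipped):

# Keep only the columns needed for matching, reading everything as strings
imdb = pd.read_csv('title.basics.tsv', sep='\t',
                   usecols=['tconst', 'primaryTitle', 'startYear'],
                   dtype=str)
imdb['primaryTitle'] = imdb.primaryTitle.str.lower()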

Next I created a column in my dataframe to say whether each title in my records occurs in the IMDb dataset.

# A set makes each membership test O(1), so this scales to the full IMDb list
imdb_titles = set(imdb.primaryTitle.values)
df['title_in_imdb'] = df.movie_title.map(lambda x: x.lower() in imdb_titles)

A little check, something like the following, reveals that 549 of the entries are fine, but 147 are not:
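
# Count matched vs unmatched titles
df.title_in_imdb.value_counts()

So I bit the bullet and went through another batch of manual corrections. I anticipated there being various reasons why a title might not match, so I tried to make the process as flexible as possible. I created the following function to make it easy to update a specific row, then looped through the entries that were not in the IMDb dataset.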

def update_row(index):
    # Prompt for a corrected title; leave it unchanged on empty input
    new_title = input("title: ")
    if new_title != "":
        df.loc[index, 'movie_title'] = new_title
    # Optionally move or add information in the 'other' column
    new_other = input("other: ")
    if new_other != "":
        df.loc[index, 'other'] = new_other

for index, row in df.iterrows():
    if row['title_in_imdb']:
        continue
    # Show the row's title plus a context column (positional, via iloc)
    print(f'{index}. {row.iloc[1]};    {row.iloc[-3]}')
    update_row(index)
    # Save after every correction, so progress survives an interrupted session
    df.to_csv('third_tidy.csv', sep="\t")
    print("\n")

Conclusion

This was a considerable effort, with a lot of manual grinding. I am curious to know what would happen in a commercial context with millions of entries and a similar hodge-podge of bad data. I presume there must be more efficient ways of tidying, but maybe you really do just have to get your hands dirty if the data is messy enough.