Fixing bad indexing

One of the columns is an index showing how many movies I have watched since I started the record. This was mostly accurate, except that, for reasons unknown to me, the latest records cycled through the indices 670 to 679 a few times: instead of the index going from 679 to 680, it would jump back to 670. So I wrote some code to fix this.
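
Since that column is just a running count, the simplest fix is to reassign the whole sequence. A minimal sketch, assuming the rows are already in watch order and calling the column movie_index (a hypothetical name):

# Rows are in watch order, so a fresh 1..n sequence repairs the
# repeated 670-679 runs ('movie_index' is an illustrative column name)
df['movie_index'] = range(1, len(df) + 1)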

Creating a datetime column

One of the columns has the year, and another has the date formatted as a string “DD Mon” (e.g. 1 Jan). I wanted to combine these to create a column with datetime type and “yyyy-mm-dd” format. I first tidied the month column: some months had multiple spellings (e.g. Jun and June), and sometimes other data had ended up there (e.g. #recc), so I fixed those. Then I tidied the date column by padding the day numbers so they were always two characters long. That tidying looked roughly like the sketch below (the month replacements shown are illustrative, not my exact mappings).

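# Normalise month spellings to three-letter abbreviations
month_map = {'June': 'Jun', 'July': 'Jul', 'Sept': 'Sep'}
df['month'] = df.month.replace(month_map)

# Pad single-digit days to two characters, e.g. '1' -> '01'
df['date'] = df.date.astype(str).str.zfill(2)

Once it was all tidy, I created the new column:
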
# Build a 'YYYY Mon DD' string per row and parse it; anything that
# still fails to parse becomes NaT rather than raising an error
df['datetime'] = pd.to_datetime(df.year.astype(str)+' '+df.month+' '+df.date,
                                errors='coerce',
                                format='%Y %b %d')

Tidying the source column

There is a column called “source”, which records where I watched the movie, e.g. Netflix, cinema, TV, etc. Tidying this required a lot of manual work, because there were many different sources that I wanted to handle in different ways. For example, sometimes the source was the name of a friend (because I watched the movie at their house), so I wanted to replace the source with “friend” and move the friend’s name to the “other” column. In retrospect, I could have automated parts of this, as in the sketch below. But what is done is done, I suppose.
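
For instance, the friend case could have been handled in bulk, along these lines (a sketch, assuming a hand-maintained set of friends' names):

# Hypothetical set of names that appear in the source column
friends = {'Alice', 'Bob'}

# Move the name into 'other', then standardise the source
is_friend = df.source.isin(friends)
df.loc[is_friend, 'other'] = df.loc[is_friend, 'source']
df.loc[is_friend, 'source'] = 'friend'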

Tidying the other column

There is a column called “other” for miscellaneous information. This required no cleaning, which is nice!

Tidying the title column

I anticipate joining this table with an IMDb dataset, so I can get information like ratings, director, and movie length that might be fun to include in my analysis. I downloaded the title.basics.tsv dataset, read it into a dataframe keeping only the columns tconst, primaryTitle and startYear, and converted all the primaryTitles to lowercase.
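
The loading step looked roughly like this (assuming the file is already downloaded and unzipped):

# Keep only the columns needed for matching, reading everything as strings
imdb = pd.read_csv('title.basics.tsv', sep='\t',
                   usecols=['tconst', 'primaryTitle', 'startYear'],
                   dtype=str)
imdb['primaryTitle'] = imdb.primaryTitle.str.lower()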

Next I created a column in my dataframe to say whether each title in my records occurs in the IMDb dataset.

# A set makes each membership test O(1), so this scales to the full IMDb list
imdb_titles = set(imdb.primaryTitle.values)
df['title_in_imdb'] = df.movie_title.map(lambda x: x.lower() in imdb_titles)

A little check, something like the following, reveals that 549 of the entries are fine, but 147 are not:
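
# Count matched vs unmatched titles
df.title_in_imdb.value_counts()

So I bit the bullet and went through another batch of manual corrections. I anticipated there being various reasons why a title might not match, so I tried to make the process as flexible as possible. I created the following function to make it easy to update a specific row, then looped through the entries that were not in the IMDb dataset.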

def update_row(index):
    # Prompt for a corrected title; leave it unchanged on empty input
    new_title = input("title: ")
    if new_title != "":
        df.loc[index, 'movie_title'] = new_title
    # Optionally move or add information in the 'other' column
    new_other = input("other: ")
    if new_other != "":
        df.loc[index, 'other'] = new_other

for index, row in df.iterrows():
    if row['title_in_imdb']:
        continue
    # Show the row's title plus a context column (positional, via iloc)
    print(f'{index}. {row.iloc[1]};    {row.iloc[-3]}')
    update_row(index)
    # Save after every correction, so progress survives an interrupted session
    df.to_csv('third_tidy.csv', sep="\t")
    print("\n")

Conclusion

This was a considerable effort, with a lot of manual grinding. I am curious to know what would happen in a commercial context with millions of entries and a similar hodge-podge of bad data. I presume there must be more efficient ways of tidying, but maybe you really do just have to get your hands dirty if the data is messy enough.