Analysing the movies I've watched, Part II, Data cleaning
I briefly describe the tidying that I did.
Other posts in series
- Analysing the movies I’ve watched, Part V, Data visualisation II
- Analysing the movies I’ve watched, Part IV, Data visualisation
- Analysing the movies I’ve watched, Part III, Joining the tables
Fixing bad indexing
One of the columns is an index showing how many movies I have watched since I started the record. This was mostly accurate, except that, for reasons unknown to me, the latest records cycled through the indices 670 to 679 a few times: instead of the index going from 679 to 680, it would jump back to 670. So, I wrote some code to fix this.
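The fix can be simple. Here is a minimal sketch, assuming the rows are already in watch order and calling the count column idx (the real column name may differ):

# Since the column is just a running count of movies watched, the
# simplest repair is to regenerate it from the row order
# ('idx' is an assumed column name)
df['idx'] = range(1, len(df) + 1)

# Sanity check: the count should now strictly increase
assert df['idx'].is_monotonic_increasing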
Creating datetime column
One of the columns has the year, and another has the date formatted as a string “DD Mon” (e.g. “1 Jan”). I wanted to combine these into a single column with datetime type and “yyyy-mm-dd” format. I first tidied the month column: some months had multiple spellings (e.g. “Jun” and “June”), and sometimes other data had ended up there (e.g. “#recc”), so I fixed those. Then I tidied the date column, padding the digits so the day was always two characters long. Once it was all tidy, I created the new column:
import pandas as pd

# Combine the year, month and day strings; anything unparseable becomes NaT
df['datetime'] = pd.to_datetime(df.year.astype(str) + ' ' + df.month + ' ' + df.date,
                                errors='coerce',
                                format='%Y %b %d')
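For reference, the month and day tidying mentioned above might have gone something like this sketch (the replacement mapping here is illustrative, not my full list, and stray values like “#recc” were handled case by case):

# Collapse month spellings to the three-letter form that '%b' expects
# (the mapping shown here is an example, not the full list)
df['month'] = df['month'].replace({'June': 'Jun', 'July': 'Jul', 'Sept': 'Sep'})

# Zero-pad the day so it is always two characters long, e.g. '1' -> '01'
df['date'] = df['date'].astype(str).str.zfill(2)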
Tidying the source column
There is a column called “source”, which records where I watched the movie, e.g. Netflix, cinema, TV, etc. Tidying this required a lot of manual work, because there were various sources that I wanted to handle in different ways. For example, sometimes the source was the name of a friend (e.g. I watched the movie at their house), in which case I wanted to replace the source with “friend” and move the friend’s name to the “other” column. In retrospect, I could have made the process a bit more automated. But what is done is done, I suppose.
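For illustration, one such fix might have looked like this sketch (the friends list and the exact column values are made up):

# Hypothetical example: replace friend-name sources with 'friend'
# and move the name into the 'other' column
friends = ['Alice', 'Bob']  # assumed names, collected by eyeballing the data
mask = df['source'].isin(friends)
df.loc[mask, 'other'] = df.loc[mask, 'source']
df.loc[mask, 'source'] = 'friend'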
Tidying the other column
There is a column called “other” that holds miscellaneous information. This required no cleaning, which is nice!
Tidying the title column
I anticipate joining this table with an IMDb dataset, so I can get information like ratings, director and movie length that might be fun to include in my analysis. I downloaded the title.basics.tsv dataset, then read it into a dataframe, keeping only the columns tconst, primaryTitle and startYear. I converted all the primaryTitle values to lowercase.
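Loading that might have looked something like the following sketch (the column names come from the IMDb files, which are tab-separated and use “\N” as their missing-value marker):

import pandas as pd

# Keep only the columns needed for the join; '\\N' is IMDb's missing-value marker
imdb = pd.read_csv('title.basics.tsv', sep='\t',
                   usecols=['tconst', 'primaryTitle', 'startYear'],
                   na_values='\\N', low_memory=False)

# Lowercase the titles so matching is case-insensitive
imdb['primaryTitle'] = imdb['primaryTitle'].str.lower()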
Next I created a column in my dataframe to say whether each title in my records occurs in the IMDb dataset.
# Set membership is fast, so collect the (already lowercased) IMDb titles once
imdb_titles = set(imdb.primaryTitle.values)
df['title_in_imdb'] = df.movie_title.map(lambda x: x.lower() in imdb_titles)
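A quick way to do the check is to count the values in the new column (my guess at the exact call):

# Count how many titles matched (True) versus did not (False)
df['title_in_imdb'].value_counts()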
The check reveals that 549 of the entries are fine, but 147 are not. So, I bite the bullet and go into another batch of manual corrections. I anticipate there being various reasons why a title would not match, so I try to make the process as flexible as possible. I create the following function to make it easy to update a specific row, then loop through the values which were not in the IMDb database.
def update_row(index):
    # Prompt for a corrected title and write it back
    new_title = input("title: ")
    df.loc[index, 'movie_title'] = new_title
    # Optionally record extra information in the 'other' column
    new_other = input("other: ")
    if new_other != "":
        df.loc[index, 'other'] = new_other
    return

for index, row in df.iterrows():
    if row.iloc[-1]:  # 'title_in_imdb' is the last column; skip rows that matched
        continue
    print(f'{index}. {row.iloc[1]}; {row.iloc[-3]}')  # show identifying details, accessed by position
    update_row(index)
    df.to_csv('third_tidy.csv', sep="\t")  # save after each fix so no work is lost
    print("")  # blank lines to separate entries
    print("")
Conclusion
This was a considerable effort, with a lot of manual grinding. I am curious to know what would happen in a commercial context with millions of entries and a similar hodge-podge of bad data. I presume there must be more efficient ways of tidying, but maybe you really do just have to get your hands dirty if the data is messy enough.