Analysing the movies I’ve watched, Part I, Data collection

Introduction

Since 2008, I have been maintaining a record of all the movies I have watched. It started because I borrowed a CD stack (see here if you are too young to not know what a CD stack is) of movies from a friend, and the easiest way of keeping track of which movies I had watched was to create a list. Once I finished the movies on the CD stack, I thought I may as well continue adding to the list, and it has been going ever since. Now I figured this could be a fun data set to analyse.

The dataset

The list started out as a notepad .txt file, then I moved it into a Google spreadsheet, and it now resides in Dynalist. The format of the entries in the list has changed over time.

The earliest entries in the list have the format movie_title, i, where i is an index to keep track of how many films are in the list.
In Easter 2009, I started including the date, so the standard format is movie_title, date, i. The date is the day of the month and the month.
In September 2009, I started including the source / location of the movie, e.g. a cinema, or a DVD, or a friend, etc. The standard format is movie_title, date, source, i. This has been the format ever since.
I keep track of movies I would recommend. This is recorded by including #recc between movie_title and whatever the second item of the entry is.
I keep track of year by having an entry like this: ---2020 above, 2019 below---

The first task was to get this dataset into a pandas dataframe. It should not be too difficult - I have a few formats that can be easily identified, and I can create a new row as appropriate.

Collecting the data

What I was not conscious of was how many entries did not fit into the formats described above, roughly 150 or so. To my frustration, there is no simple pattern to them either so I had to manually sort these out. To illustrate, here are just a small sample of the exceptions:

Having ?? for the date, or just leaving out the date altogether
Including extra details, e.g. whether I had watched the whole movie or not, or a description of the movie to help me remember what the movie is, or, inclusion of which city I was in. These extra details were not in consistent locations within the entry.
Sometimes the movie title contained a comma in them, which messed things up because I was using splitting each entry by a comma.

The result is that the process took much longer than I anticipated. I can better understand now how data processing can be the biggest part of a data scientist’s role! One thing I am curious about is whether there are better alternatives to doing manual work. If there were millions of entries, what could I do?

The code

Here is the code. Many details will not make sense without knowing exactly how the data is stored in dynalist. Hopefully the flow of the process is clear.

# read text file from dynalist
with open("Regular.txt", 'r') as f:
    lines = f.readlines()


# extract relevant rows from lines, by finding the start and end indices
# store these rows into variable movies, separated by commas
i = 0
while '\tmovies' not in lines[i]:
    i += 1
i += 2 
while '\t\t\t' in lines[i]:
    i += 1
start = i
while '\t\t' in lines[i]:
    i += 1
end = i

movies = [re.split('\s*,\s*',line[2:-1]) for line in lines[start:end]]


# add the year to end of each entry in movies
year = 2020
for i in range(len(movies)):
    if '---' not in movies[i][0]:
        movies[i].append(year)
    elif '---' in movies[i][0]:
        year -= 1

# delete year separator entries, those with ---
for i in range(len(movies)-1,-1,-1):
    if '---' in movies[i][0]:
        del(movies[i])


# define date regex pattern
date_pattern = r"\d{1,2} \w+"


# define function to determine the format of the row
def check_format(row):
    dates = [ True if re.match(date_pattern, item) else False for item in row[:-1]]
    recc = [item == '#recc' for item in row]
    digits = [item.isdigit() for item in row[:-1]]
    length = len(row)
     
    # first format is movie_title, date, source, my_index, year
    if length == 5:
        if all([dates[1], digits[3]]):
            return 1
 
    # second format is movie_title, date, my_index, year
    if length == 4:
        if all([dates[1], digits[2]]):
            return 2
        
    # third format is movie_title, recc, date, source, my_index, year
    if length == 6:
        if all([recc[1], dates[2], digits[4]]):
            return 3
        
    # fourth format is movie_title, my_index, year
    if length == 3:
        if digits[1]:
            return 4
    
    # fifth format is movie_title, source, my_index, year
    if length == 4:
        if digits[2]:
            return 5
    
    # sixth format is movie_title, recc, source, my_index, year
    if length == 5:
        if all([recc[1], digits[3]]):
            return 6
    
    return None


# define function to obtain manual inputs from me, for those entries that do not match
# the standard format
def get_input(col, row):
    raw = input(col+": ")
    if raw == "":
        return ""
    elif raw.isdigit():
        return row[int(raw)]
    else:
        return raw


# create list of column names
# create variable df_rows that will contain data that will be turned into frame
columns = ['my_index','movie_title','year','date','source','recommended','other']
df_rows = []

# loop through entries in movies, and create appropriate row data
for i, row in enumerate(movies):
    df_row = {col: np.nan for col in columns }
    df_row.update(year = row[-1])
    
    row_format = check_format(row)
    
    # first format is movie_title, date, source, my_index, year
    if row_format == 1:
        df_row.update(movie_title = row[0])
        df_row.update(date = row[1])
        df_row.update(source = row[2].lower())
        df_row.update(my_index = int(row[3]))
        df_row.update(recommended = False) 
        
    # second format is movie_title, date, my_index, year
    elif row_format == 2:
        df_row.update(movie_title = row[0])
        df_row.update(date = row[1])
        df_row.update(my_index = int(row[2]))
        df_row.update(recommended = False) 
    
    # third format is movie_title, recc, date, source, my_index, year
    elif row_format == 3:        
        df_row.update(movie_title = row[0])
        df_row.update(recommended = True)
        df_row.update(date = row[2])
        df_row.update(source = row[3].lower())
        df_row.update(my_index = int(row[4]))
        
    # fourth format is movie_title, my_index, year
    elif row_format == 4:
        df_row.update(movie_title = row[0])
        df_row.update(my_index = int(row[1]))
        df_row.update(recommended = False) 
    
    # fifth format is movie_title, source, my_index, year
    elif row_format == 5:
        df_row.update(movie_title = row[0])
        df_row.update(source = row[1].lower())
        df_row.update(my_index = int(row[2]))
        df_row.update(recommended = False) 
    
    # sixth format is movie_title, recc, source, my_index, year
    elif row_format == 6:
        df_row.update(movie_title = row[0])
        df_row.update(recommended = True)        
        df_row.update(source = row[2].lower())
        df_row.update(my_index = int(row[3]))
        
    elif (row_format is None):
        for i,item in enumerate(row[:-1]):
            print(f'{i}: {item}')
        print("")
        df_row.update(movie_title = get_input('movie_title', row))
        df_row.update(date = get_input('date', row))
        df_row.update(source = get_input('source', row).lower())
        df_row.update(my_index = int(get_input('my_index', row)))
        df_row.update(recommended = bool(get_input('recommended', row)))
        df_row.update(other = get_input('other', row))
        print("")
    
    df_rows.append(df_row)


# create dataframe from data, and save as csv file
df = pd.DataFrame(df_rows)
df.to_csv('mymovies1.csv')

Analysing the movies I've watched, Part I, Data collection

Other posts in series

Introduction

The dataset

Collecting the data

The code