Introduction

Since 2008, I have been maintaining a record of all the movies I have watched. It started because I borrowed a CD stack (see here if you are too young to not know what a CD stack is) of movies from a friend, and the easiest way of keeping track of which movies I had watched was to create a list. Once I finished the movies on the CD stack, I thought I may as well continue adding to the list, and it has been going ever since. Now I figured this could be a fun data set to analyse.

The dataset

The list started out as a notepad .txt file, then I moved it into a Google spreadsheet, and it now resides in Dynalist. The format of the entries in the list has changed over time.

The first task was to get this dataset into a pandas dataframe. It should not be too difficult - I have a few formats that can be easily identified, and I can create a new row as appropriate.

Collecting the data

What I was not conscious of was how many entries did not fit into the formats described above, roughly 150 or so. To my frustration, there is no simple pattern to them either so I had to manually sort these out. To illustrate, here are just a small sample of the exceptions:

The result is that the process took much longer than I anticipated. I can better understand now how data processing can be the biggest part of a data scientist’s role! One thing I am curious about is whether there are better alternatives to doing manual work. If there were millions of entries, what could I do?

The code

Here is the code. Many details will not make sense without knowing exactly how the data is stored in dynalist. Hopefully the flow of the process is clear.

# read text file from dynalist
with open("Regular.txt", 'r') as f:
    lines = f.readlines()


# extract relevant rows from lines, by finding the start and end indices
# store these rows into variable movies, separated by commas
i = 0
while '\tmovies' not in lines[i]:
    i += 1
i += 2 
while '\t\t\t' in lines[i]:
    i += 1
start = i
while '\t\t' in lines[i]:
    i += 1
end = i

movies = [re.split('\s*,\s*',line[2:-1]) for line in lines[start:end]]


# add the year to end of each entry in movies
year = 2020
for i in range(len(movies)):
    if '---' not in movies[i][0]:
        movies[i].append(year)
    elif '---' in movies[i][0]:
        year -= 1

# delete year separator entries, those with ---
for i in range(len(movies)-1,-1,-1):
    if '---' in movies[i][0]:
        del(movies[i])


# define date regex pattern
date_pattern = r"\d{1,2} \w+"


# define function to determine the format of the row
def check_format(row):
    dates = [ True if re.match(date_pattern, item) else False for item in row[:-1]]
    recc = [item == '#recc' for item in row]
    digits = [item.isdigit() for item in row[:-1]]
    length = len(row)
     
    # first format is movie_title, date, source, my_index, year
    if length == 5:
        if all([dates[1], digits[3]]):
            return 1
 
    # second format is movie_title, date, my_index, year
    if length == 4:
        if all([dates[1], digits[2]]):
            return 2
        
    # third format is movie_title, recc, date, source, my_index, year
    if length == 6:
        if all([recc[1], dates[2], digits[4]]):
            return 3
        
    # fourth format is movie_title, my_index, year
    if length == 3:
        if digits[1]:
            return 4
    
    # fifth format is movie_title, source, my_index, year
    if length == 4:
        if digits[2]:
            return 5
    
    # sixth format is movie_title, recc, source, my_index, year
    if length == 5:
        if all([recc[1], digits[3]]):
            return 6
    
    return None


# define function to obtain manual inputs from me, for those entries that do not match
# the standard format
def get_input(col, row):
    raw = input(col+": ")
    if raw == "":
        return ""
    elif raw.isdigit():
        return row[int(raw)]
    else:
        return raw


# create list of column names
# create variable df_rows that will contain data that will be turned into frame
columns = ['my_index','movie_title','year','date','source','recommended','other']
df_rows = []

# loop through entries in movies, and create appropriate row data
for i, row in enumerate(movies):
    df_row = {col: np.nan for col in columns }
    df_row.update(year = row[-1])
    
    row_format = check_format(row)
    
    # first format is movie_title, date, source, my_index, year
    if row_format == 1:
        df_row.update(movie_title = row[0])
        df_row.update(date = row[1])
        df_row.update(source = row[2].lower())
        df_row.update(my_index = int(row[3]))
        df_row.update(recommended = False) 
        
    # second format is movie_title, date, my_index, year
    elif row_format == 2:
        df_row.update(movie_title = row[0])
        df_row.update(date = row[1])
        df_row.update(my_index = int(row[2]))
        df_row.update(recommended = False) 
    
    # third format is movie_title, recc, date, source, my_index, year
    elif row_format == 3:        
        df_row.update(movie_title = row[0])
        df_row.update(recommended = True)
        df_row.update(date = row[2])
        df_row.update(source = row[3].lower())
        df_row.update(my_index = int(row[4]))
        
    # fourth format is movie_title, my_index, year
    elif row_format == 4:
        df_row.update(movie_title = row[0])
        df_row.update(my_index = int(row[1]))
        df_row.update(recommended = False) 
    
    # fifth format is movie_title, source, my_index, year
    elif row_format == 5:
        df_row.update(movie_title = row[0])
        df_row.update(source = row[1].lower())
        df_row.update(my_index = int(row[2]))
        df_row.update(recommended = False) 
    
    # sixth format is movie_title, recc, source, my_index, year
    elif row_format == 6:
        df_row.update(movie_title = row[0])
        df_row.update(recommended = True)        
        df_row.update(source = row[2].lower())
        df_row.update(my_index = int(row[3]))
        
    elif (row_format is None):
        for i,item in enumerate(row[:-1]):
            print(f'{i}: {item}')
        print("")
        df_row.update(movie_title = get_input('movie_title', row))
        df_row.update(date = get_input('date', row))
        df_row.update(source = get_input('source', row).lower())
        df_row.update(my_index = int(get_input('my_index', row)))
        df_row.update(recommended = bool(get_input('recommended', row)))
        df_row.update(other = get_input('other', row))
        print("")
    
    df_rows.append(df_row)


# create dataframe from data, and save as csv file
df = pd.DataFrame(df_rows)
df.to_csv('mymovies1.csv')