Stop and Search, Part I, Data Collection
In light of the current prominence of BlackLivesMatter, I decided to investigate crime in relation to race. Here I describe how I collected the data I will be analysing.
Other posts in series
Introduction
In both traditional and social media, the issue of racial discrimination within the police is a hot topic. I decided to investigate this issue and better understand the statistics that go around.
I googled ‘crime data’ and one of the top results was data.police.uk, which seems like a reliable source of data for crime in the UK. With regard to race, the only data available on this website is about ‘stop-and-search’ (as opposed to prison data, for example).
Stop and search
In the UK, a police officer has the legal authority to stop and search you if they have ‘reasonable grounds’ to suspect you’re involved in a crime, e.g. carrying an illegal item. This UK Government website provides a short and clear summary of the rules, this Scottish Government website also provides clear summary of the rules but with more detail on what counts as reasonable and how a search should be conducted, and finally here is the actual legislation, which is predictably written in unclear legalese.
Downloading the data
I will only describe the final and clean code used to obtain the information I wanted, after several attempts necessary to get everything correct.
First, I downloaded a JSON file listing the name and ‘id’ of each police force, stored it in a pandas dataframe, and saved it as a csv file. The id is just a shortened version of their name and is used in all the other data sources. The code to do this is:
forces_response = requests.get('https://data.police.uk/api/forces')
forces_json = forces_response.json()
force_df = pd.DataFrame({'id':[], 'name': []})
for entry in forces_json:
force_df.loc[force_df.shape[0]] = [entry['id'], entry['name']]
force_df.to_csv('force.csv')
Next I downloaded a JSON file describing for which months and for which forces stop-and-search data was available:
availability_response = requests.get('https://data.police.uk/api/crimes-street-dates')
availability_json = availability_response.json()
availability_df = pd.DataFrame({'month':[], 'id': []})
for entry in availability_json:
date = pd.to_datetime(entry['date'], format='%Y-%m').to_period('M')
for id in entry['stop-and-search']:
availability_df.loc[availability_df.shape[0]] = [date, id]
I then loop through this information and download the stop-and-search data, saving the data onto my laptop.
for i in range(availability_df.shape[0]):
force = availability_df.iloc[i].id
month = availability_df.iloc[i].month.strftime('%Y-%m')
response = requests.get(f"https://data.police.uk/api/stops-force?force={force}&date={month}")
if response.status_code == 200:
data = response.json()
with open(f'{month}_{force}.json', 'w') as f:
json.dump(data, f)
I add a column to the availability dataframe to track which pieces of data were actually successfully downloaded or not. I do this by trying to open each file, and recording a fail if an error occurs while trying to open it. (While writing this paragraph, I realise I could have done this at the same time as the previous step.)
availability_df['downloaded'] = True
for i in range(availability_df.shape[0]):
force = availability_df.iloc[i].id
month = availability_df.iloc[i].month.strftime('%Y-%m')
try:
file = open(f'{month}_{force}.json', 'r')
file.close()
except:
availability_df.iloc[i,2] = False
print(f'{month}_{force}')
availability_df.to_csv('availability.csv')
Lastly, I combine all of the data into one mega pandas dataframe, keeping only those columns that I think will be relevant to my investigations.
cols = ['age_range', 'outcome', 'self_defined_ethnicity',
'gender', 'officer_defined_ethnicity', 'type',
'location.latitude', 'location.longitude', 'force', 'month']
sas_df = pd.DataFrame({col:[] for col in cols})
for i in range(availability_df.shape[0]):
if availability_df.iloc[i,2]:
force = availability_df.iloc[i].id
month = availability_df.iloc[i].month
month_str = month.strftime('%Y-%m')
file = open(f'{month_str}_{force}.json', 'r')
data = json.load(file)
new = pd.json_normalize(data)
new['force'] = force
new['month'] = month
sas_df = sas_df.append(new, ignore_index=True)[cols]
sas_df.to_csv('sas.csv')
A chart
It would be sad for this post to have no charts whatsoever, so I quickly created one which just counts the number of stops-and-searches, grouped by ethnicity.
One might say, ‘Look, white people are stopped more than black people, so the police are not racist.’ This is obviously simplistic. The aim of the project is to dig deeper into the data and see what patterns I can find.