The cleaning
I cleaned each column, one by one. Note that I call the original frame `sas` and created a copy, `sas_clean`, in which I did the cleaning.
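For context, the setup might look something like this (a minimal sketch; the post doesn't show how the data was loaded, so the filename is a placeholder):

import pandas as pd

# Load the raw stop-and-search data (filename is hypothetical)
sas = pd.read_csv('stop_and_search.csv')

# Work on a copy so the original frame stays untouched
sas_clean = sas.copy()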
- To see a list of all the columns, I ran `sas.columns`.
- To investigate the distribution of values in a column (before and after cleaning), I would use `sas_clean.column.value_counts(dropna = False)` (see the example after this list).
- Some columns required no cleaning:
  - `age_range`
  - `gender`
  - `location.latitude` and `location.longitude` (except I renamed these columns)
  - `force`
  - `month`
- The other columns did require some cleaning:
  - `self_defined_ethnicity` and `officer_defined_ethnicity`
  - `type`
  - `outcome`
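For instance, running these two inspection commands against the `gender` column (one of the columns listed above) would look like:

# List all the columns in the original frame
print(sas.columns)

# Distribution of values in a single column, including missing values
print(sas_clean.gender.value_counts(dropna = False))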
Ethnicity
`officer_defined_ethnicity` was mostly clean. The distribution of values was:
White 608092
Black 259504
Asian 139531
NaN 91601
Other 31845
Mixed 2563
The only change I made was to combine Mixed with Other.
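This step isn't shown explicitly in the post, but it could be as simple as the following sketch:

# Fold the small 'Mixed' category into 'Other'
sas_clean['officer_defined_ethnicity'] = sas_clean['officer_defined_ethnicity'].replace('Mixed', 'Other')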
`self_defined_ethnicity` was less clean, and the distribution of values was:
White - English/Welsh/Scottish/Northern Irish/British 475167
Other ethnic group - Not stated 154802
White - Any other White background 85322
Black/African/Caribbean/Black British - Any other Black/African/Caribbean background 78340
Black/African/Caribbean/Black British - African 66072
Black/African/Caribbean/Black British - Caribbean 49736
Asian/Asian British - Any other Asian background 44517
NaN 41149
Asian/Asian British - Pakistani 33907
Asian/Asian British - Bangladeshi 24128
Other ethnic group - Any other ethnic group 16449
Mixed/Multiple ethnic groups - Any other Mixed/Multiple ethnic background 15560
Asian/Asian British - Indian 14929
Mixed/Multiple ethnic groups - White and Black Caribbean 14063
White - Irish 7843
Mixed/Multiple ethnic groups - White and Black African 4246
Mixed/Multiple ethnic groups - White and Asian 3598
White - Gypsy or Irish Traveller 1689
Asian/Asian British - Chinese 1476
Other ethnic group - Arab 143
I decided to group these according to the same categories used in `officer_defined_ethnicity`. This was done using `.replace`:
import numpy as np

def simplify_eth(ethnicity):
    # NaN values arrive as floats; treat 'Not stated' as missing too
    if isinstance(ethnicity, float) or 'Not stated' in ethnicity:
        return np.nan
    # Checks are case-sensitive, so e.g. 'White - Any other White background'
    # (lowercase 'other') correctly falls through to 'White'
    elif 'Other' in ethnicity or 'Mixed' in ethnicity:
        return 'Other'
    elif 'Asian' in ethnicity:
        return 'Asian'
    elif 'Black' in ethnicity:
        return 'Black'
    elif 'White' in ethnicity:
        return 'White'

# Map each original label to its simplified category
ethnicities = {eth: simplify_eth(eth) for eth in sas.self_defined_ethnicity.unique()}
sas_clean = sas_clean.replace(to_replace = ethnicities)
Finally, I wanted to create a column `ethnicity` that combines these two columns. I started by renaming the two ethnicity columns, creating the new column, and filling it in wherever there is no disagreement between the officer-defined and self-defined ethnicity.
sas_clean.rename(columns = {'self_defined_ethnicity': 'self',
                            'officer_defined_ethnicity': 'officer'}, inplace = True)
sas_clean['ethnicity'] = np.nan

# if officer and self agree, set ethnicity to either
indices = (sas_clean.officer == sas_clean.self)
sas_clean.loc[indices, 'ethnicity'] = sas_clean.officer[indices]

# if officer is null, set ethnicity to self, and vice versa
indices = (sas_clean.officer.isnull())
sas_clean.loc[indices, 'ethnicity'] = sas_clean.self[indices]
indices = (sas_clean.self.isnull())
sas_clean.loc[indices, 'ethnicity'] = sas_clean.officer[indices]
I created a column `conflicted` to list all the cases where the two stated ethnicities differ:
sas_clean['conflicted'] = np.nan
indices = (sas_clean.officer != sas_clean.self) & (sas_clean.officer.notna()) & (sas_clean.self.notna())
sas_clean.loc[indices, 'conflicted'] = sas_clean.officer[indices] + '_' + sas_clean.self[indices]
sas_clean.conflicted.value_counts()
The output was:
Black_Other 18774
White_Other 12423
Asian_Other 6240
Other_Asian 5319
Other_White 4243
Black_White 2924
Asian_White 2394
White_Asian 2027
Black_Asian 1990
White_Black 1935
Other_Black 1764
Asian_Black 1577
To decide how to deal with this, I went back to the original `self_defined_ethnicity` to determine what the appropriate label ought to be.
for i in sas_clean.conflicted.unique():
    print(i)
    indices = (sas_clean.conflicted == i)
    print(sas.loc[indices, 'self_defined_ethnicity'].value_counts())
    print()
A sample of the output is:
Black_Other
Mixed/Multiple ethnic groups - White and Black Caribbean 9158
Mixed/Multiple ethnic groups - Any other Mixed/Multiple ethnic background 4891
Mixed/Multiple ethnic groups - White and Black African 2689
Other ethnic group - Any other ethnic group 1835
Mixed/Multiple ethnic groups - White and Asian 194

White_Black
Black/African/Caribbean/Black British - Any other Black/African/Caribbean background 815
Black/African/Caribbean/Black British - African 633
Black/African/Caribbean/Black British - Caribbean 487

Other_Black
Black/African/Caribbean/Black British - African 819
Black/African/Caribbean/Black British - Any other Black/African/Caribbean background 750
Black/African/Caribbean/Black British - Caribbean 195
Deciding how to deal with these cases was the trickiest part of the cleaning. First, this is a sensitive issue, and it feels wrong for me to decide how people should be labelled. Second, there is clearly no ‘right’ answer here, and I had to use my judgement. In the end, for most cases, I chose the self-defined ethnicity. However, the two big exceptions were when the officer identified the person as Black or White but the person identified themselves as mixed; there were around 30,000 such cases. If I added them to the ‘Other’ category, this would grossly skew the numbers and misrepresent the situation, so I decided to assign these to Black and White (respectively). Different people will make different judgements on this, and I suppose this is one way our own biases can creep into the data analysis.
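The post doesn't show the code for this resolution step, so here is one possible sketch: default every conflict to the self-defined label (the part of `conflicted` after the underscore), then override the two exceptional cases with the officer's label.

# Hypothetical implementation of the resolution rule described above.
# Default: take the self-defined label, i.e. the part after '_'
resolutions = {c: c.split('_')[1] for c in sas_clean.conflicted.dropna().unique()}
# Exceptions: officer said Black/White, person self-identified as mixed
# ('Other' after grouping) -- keep the officer's label
resolutions['Black_Other'] = 'Black'
resolutions['White_Other'] = 'White'

indices = sas_clean.conflicted.notna()
sas_clean.loc[indices, 'ethnicity'] = sas_clean.conflicted[indices].map(resolutions)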
In the end, the distribution of ethnicities in this new column is:
White 645261
Black 269286
Asian 143468
NaN 40383
Other 34738
Type
There are three types of stop-and-search:
Person search 861870
Person and Vehicle search 246976
Vehicle search 24290
Theoretically, a vehicle search does not involve any people, and thus should not have any ethnicity attached to it. However, a quick query shows this is not the case:
sas_clean.loc[(sas_clean.type == 'Vehicle search'), 'ethnicity'].value_counts(dropna = False)
NaN 20762
White 1796
Black 843
Asian 619
Other 270
This gives some indication of how much inherent noise there is in the data. Given the numbers are relatively small, I was comfortable discarding these entries, and so removed all Vehicle search rows:
sas_clean = sas_clean[sas_clean.type != 'Vehicle search']
Outcome
Lastly, I cleaned the `outcome` column. The distribution of values was:
A no further action disposal 675585
Arrest 126538
False 115566
Community resolution 50906
Suspect arrested 30616
Khat or Cannabis warning 26428
NaN 22090
Summons / charged by post 16366
Penalty Notice for Disorder 13617
Offender given drugs possession warning 12717
Local resolution 4709
Caution (simple or conditional) 4520
Suspect summonsed to court 2941
Offender given penalty notice 2802
Article found - Detailed outcome unavailable 2651
Offender cautioned 778
Suspected psychoactive substances seized - No further action 16
I decided to replace these with a numerical value, where 0 represents that the stop-and-search discovered nothing inappropriate, 1 represents a minor infringement with minimal action, and 2 represents a major infringement with significant action.
replacements = {'A no further action disposal': 0,
                'Arrest': 2,
                'False': 0,
                'Community resolution': 1,
                'Suspect arrested': 2,
                'Khat or Cannabis warning': 1,
                'Summons / charged by post': 2,
                'Penalty Notice for Disorder': 2,
                'Offender given drugs possession warning': 1,
                'Local resolution': 1,
                'Caution (simple or conditional)': 1,
                'Suspect summonsed to court': 2,
                'Offender given penalty notice': 2,
                'Article found - Detailed outcome unavailable': 1,
                'Offender cautioned': 1,
                'Suspected psychoactive substances seized - No further action': 1}
sas_clean['outcome'] = sas_clean.outcome.replace(to_replace = replacements)
sas_clean.outcome.value_counts(dropna = False)
The final distribution of values for outcomes is as follows:
0.0 791151
2.0 192880
1.0 102725
NaN 22090
It is surprising that the majority of stop-and-searches amount to nothing. It makes me wonder what the reasons for this are, and whether there is a more efficient means of detecting actual crimes with fewer false positives.
Conclusion and thoughts
That is the end of the cleaning; tomorrow I will try to illustrate the patterns in the data with appropriate charts.
The two main lessons:
- Data is inherently noisy, and one should not treat data as objective truth. (Though it is the closest thing we have!)
- A data scientist has significant power to adjust the story, by grouping and cleaning the data differently. It seems that good practice is to be open about how you processed the data and to check how different choices affect the final results.