The cleaning

I cleaned each column, one by one. Note I call the original frame sas and created a copy sas_clean in which I would do the cleaning.

  • To see a list of all the columns, I ran the code sas.columns.
  • To investigate the distribution of values in a column (before and after cleaning), I would use the code sas_clean.column.value_counts(dropna = False)
  • Some columns required no cleaning:
    • age_range
    • gender
    • location.latitude and location.longitude (except I renamed thse columns)
    • force
    • month
  • The other columns did require some cleaning:
    • self_defined_ethnicity and officer_defined_ethnicity
    • type
    • outcome

Ethnicity

officer_defined_ethnicity was mostly clean. The distribution of values were:

White    608092
Black    259504
Asian    139531
NaN       91601
Other     31845
Mixed      2563

The only change I made was to combine mixed with other.



self_defined_ethnicity was less clean, and the distribution of values were:

White - English/Welsh/Scottish/Northern Irish/British                                   475167
Other ethnic group - Not stated                                                         154802
White - Any other White background                                                       85322
Black/African/Caribbean/Black British - Any other Black/African/Caribbean background     78340
Black/African/Caribbean/Black British - African                                          66072
Black/African/Caribbean/Black British - Caribbean                                        49736
Asian/Asian British - Any other Asian background                                         44517
NaN                                                                                      41149
Asian/Asian British - Pakistani                                                          33907
Asian/Asian British - Bangladeshi                                                        24128
Other ethnic group - Any other ethnic group                                              16449
Mixed/Multiple ethnic groups - Any other Mixed/Multiple ethnic background                15560
Asian/Asian British - Indian                                                             14929
Mixed/Multiple ethnic groups - White and Black Caribbean                                 14063
White - Irish                                                                             7843
Mixed/Multiple ethnic groups - White and Black African                                    4246
Mixed/Multiple ethnic groups - White and Asian                                            3598
White - Gypsy or Irish Traveller                                                          1689
Asian/Asian British - Chinese                                                             1476
Other ethnic group - Arab                                                                  143

I decided to group these up according to the same categories used in officer_defined_ethnicity. This was done using .replace:

def simplify_eth(ethnicity):
    if isinstance(ethnicity, float) or 'Not stated' in ethnicity:
        return np.nan
    elif 'Other' in ethnicity or 'Mixed' in ethnicity:
        return 'Other'
    elif 'Asian' in ethnicity:
        return 'Asian'
    elif 'Black' in ethnicity:
        return 'Black'
    elif 'White' in ethnicity:
        return 'White'

ethnicities = {eth: simplify_eth(eth) for eth in sas.self_defined_ethnicity.unique()}
sas_clean = sas_clean.replace(to_replace = ethnicities)



Finally, I wanted to create a column ethnicity that combines these two columns. I started by renaming the other two columns, creating the new column, and filling it in with values where there is no disagreement between the officer defined and self defined ethnicity.

sas_clean.rename(columns = {'self_defined_ethnicity': 'self',
                            'officer_defined_ethnicity': 'officer'}, inplace = True)
sas_clean['ethnicity'] = np.nan

# if officer and self agree, set ethnicity to either.
indices = (sas_clean.officer == sas_clean.self)
sas_clean.loc[indices, 'ethnicity'] = sas_clean.officer[indices]

# if officer is null, set ethnicity to self, and vice versa
indices = (sas_clean.officer.isnull())
sas_clean.loc[indices, 'ethnicity'] = sas_clean.self[indices]

indices = (sas_clean.self.isnull())
sas_clean.loc[indices, 'ethnicity'] = sas_clean.officer[indices]



I created a column conflicted to list all the cases where the stated ethnicity differs:

sas_clean['conflicted'] = np.nan
indices = (sas_clean.officer != sas_clean.self) & (sas_clean.officer.notna()) & (sas_clean.self.notna())
sas_clean.loc[indices, 'conflicted'] = sas_clean.officer[indices] + '_' + sas_clean.self[indices]
sas_clean.conflicted.value_counts()

The output was:

Black_Other      18774
White_Other      12423
Asian_Other       6240
Other_Asian       5319
Other_White       4243
Black_White       2924
Asian_White       2394
White_Asian       2027
Black_Asian       1990
White_Black       1935
Other_Black       1764
Asian_Black       1577



To decide how to deal with this, I went back into the original self_defined_ethnicity to determine what the appropriate label ought to be.

for i in sas_clean.conflicted.unique():
    print(i)
    indices = (sas_clean.conflicted == i)
    print(sas.loc[indices, 'self_defined_ethnicity'].value_counts())
    print()

A sample of the output is:

Black_Other
Mixed/Multiple ethnic groups - White and Black Caribbean                     9158
Mixed/Multiple ethnic groups - Any other Mixed/Multiple ethnic background    4891
Mixed/Multiple ethnic groups - White and Black African                       2689
Other ethnic group - Any other ethnic group                                  1835
Mixed/Multiple ethnic groups - White and Asian                                194

White_Black
Black/African/Caribbean/Black British - Any other Black/African/Caribbean background    815
Black/African/Caribbean/Black British - African                                         633
Black/African/Caribbean/Black British - Caribbean                                       487

Other_Black
Black/African/Caribbean/Black British - African                                         819
Black/African/Caribbean/Black British - Any other Black/African/Caribbean background    750
Black/African/Caribbean/Black British - Caribbean                                       195

Deciding how to deal with these cases was the trickiest part of the cleaning. First, this is a sensitive issue and it feels wrong for me to decide how people should be labelled. Second, there is clearly no ‘right’ answer here, and I have to use my judgement. In the end, for most cases, I chose the self_defined_ethnicity. However, the two big exceptions were when the officer identified the person as Black or White but the person identified themselves as mixed. There were 30000 such cases. If I added them to the ‘Other’ category, this would grossly skew the numbers and misrepresent the situation, so I decided to assign these Black and White (respectively). Different people will make different judgements on this, and I suppose this is one way our own biases can creep into the data analysis.

In the end, the distribution of ethnicities in this new column is:

White    645261
Black    269286
Asian    143468
NaN       40383
Other     34738

Type

There are 3 types of stop-and-search:

Person search                861870
Person and Vehicle search    246976
Vehicle search                24290

Theoretically, a vehicle search does not involve any people, and thus should not have any ethnicity attached to it. However, a quick query shows this is not the case:

sas_clean.loc[(sas_clean.type == 'Vehicle search'), 'ethnicity'].value_counts(dropna = False)

NaN      20762
White     1796
Black      843
Asian      619
Other      270

This gives some indication of how much inherent noise there is in the data. Given the numbers are relatively small, I did not worry about ignoring these entries and so just removed all Vehicle search entries.

sas_clean = sas_clean[sas_clean.type != 'Vehicle search']

Outcome

Last, I cleaned the outcome column. The distribution of values were:

A no further action disposal                                    675585
Arrest                                                          126538
False                                                           115566
Community resolution                                             50906
Suspect arrested                                                 30616
Khat or Cannabis warning                                         26428
NaN                                                              22090
Summons / charged by post                                        16366
Penalty Notice for Disorder                                      13617
Offender given drugs possession warning                          12717
Local resolution                                                  4709
Caution (simple or conditional)                                   4520
Suspect summonsed to court                                        2941
Offender given penalty notice                                     2802
Article found - Detailed outcome unavailable                      2651
Offender cautioned                                                 778
Suspected psychoactive substances seized - No further action        16

I decided to replace these with a numerical value, where 0 represents that the stop-and-search discovered nothing inappropriate, 1 represents a minor infringement with minimal action and 2 represents a major infringement with significant action.

replacements = {'A no further action disposal': 0,
               'Arrest': 2,
               'False': 0,
               'Community resolution': 1,
               'Suspect arrested': 2,
               'Khat or Cannabis warning': 1,
               'Summons / charged by post': 2,
               'Penalty Notice for Disorder': 2,
               'Offender given drugs possession warning': 1,
               'Local resolution': 1,
               'Caution (simple or conditional)': 1,
                'Suspect summonsed to court': 2,
                'Offender given penalty notice': 2,
                'Article found - Detailed outcome unavailable': 1,
                'Offender cautioned': 1,
                'Suspected psychoactive substances seized - No further action': 1
               }

sas_clean['outcome'] = sas_clean.outcome.replace(to_replace = replacements)
sas_clean.outcome.value_counts(dropna = False)

The final distribution of values for outcomes is as follows:

0.0    791151
2.0    192880
1.0    102725
NaN     22090

It is surprising the the majority of stop-and-searches amount to nothing. It makes me wonder what the reasons for this are, and if there is a more efficient means of detecting the actual crimes with fewer false positives.

Conclusion and thoughts

That is end of the cleaning and tomorrow I will try to illustrate the patterns in the data with appropriate charts.

The two main lessons:

  • Data is inherently noisy, and one should not treat data as objective truth. (Though, it is the closest we have got!)
  • A data scientist has significant power to adjust the story, by grouping and cleaning the data differently. It seems that good practice is to be open about how you processed the data and to check how different choices affect the final results.