Chapter 3 Data

3.1 Sources

After deciding on the topic of the project, our team searched together for the most suitable data. To identify crime types and patterns, we need historical crime data, and data released by a government or other official body is likely to be more reliable. We therefore use the arrest records from NYC Open Data (https://data.cityofnewyork.us/Public-Safety/NYPD-Arrest-Data-Year-to-Date-/uip8-fykc), a free public data source published by New York City agencies and other organizations. The data is collected by the New York Police Department (NYPD) and includes the type of crime, the location, and the demographics of perpetrators. The Office of Management Analysis and Planning extracts and reviews the data every quarter. The data was last updated on Oct 19, 2022.

The dataset is categorized under public safety and contains 140,564 observations (about 141,000) with 19 columns in structured form. Each row represents an arrest effected in New York City by the NYPD. The columns give basic information about the arrest, including the date, location, and level of offense, as well as descriptions of perpetrators such as sex, age group, and race. These data are sufficient for our research goals.

To fully explore arrest patterns, a single dataset may not be sufficient. We therefore also looked at the historic arrest data from the same platform, NYC Open Data (https://data.cityofnewyork.us/Public-Safety/NYPD-Arrests-Data-Historic-/8h9b-rp9u). This dataset has the same format as the year-to-date data, which makes the two easy to compare. However, we found that the historic dataset was last updated on June 9, 2022, a date already covered by our primary dataset. Therefore, instead of using both datasets, we use only the primary one in this project.

The dataset can be exported from NYC Open Data in CSV format for offline use. We then import the CSV file into R from a local path and manipulate it there.
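A minimal sketch of this import step in base R, assuming the exported file has been saved locally. The helper `read_arrests()` and the file name in the last comment are placeholders rather than names used in the project, and the tiny stand-in file only imitates the shape of the real export.

```r
# Hypothetical helper: read an exported arrest CSV from a local path.
read_arrests <- function(path) {
  # Treat empty cells as missing so they show up as NA later on.
  arrests <- read.csv(path, stringsAsFactors = FALSE, na.strings = c("", "NA"))
  # Lower-case the column names so they match the names used in this chapter.
  names(arrests) <- tolower(names(arrests))
  arrests
}

# Tiny stand-in file so the sketch runs without the real export.
sample_path <- tempfile(fileext = ".csv")
writeLines(c(
  "ARREST_KEY,ARREST_DATE,PD_CD,ARREST_BORO",
  "238492853,01/01/2022,258,K",
  "238496466,01/01/2022,,M"
), sample_path)

arrests <- read_arrests(sample_path)
str(arrests)

# With the real export, the call would look like (file name assumed):
# arrests <- read_arrests("NYPD_Arrest_Data_Year_to_Date.csv")
```

Setting `na.strings` explicitly is a design choice for this sketch: it makes empty cells in character columns count as missing, which matters for the missing value analysis in section 3.3.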

3.2 Cleaning / transformation

First, we import the Year-to-Date NYPD Arrest Data from NYC Open Data, which covers January 1, 2022, to September 30, 2022. Each row is an arrest effected in NYC by the NYPD, and the columns include information about the type of crime, the location and time of enforcement, and suspect demographics. A glimpse of the dataset shows that it is well stored and well structured.

## Rows: 140,564
## Columns: 19
## $ arrest_key        <int> 238492853, 238496466, 238498340, 238513835, 23851387…
## $ arrest_date       <dttm> 2022-01-01, 2022-01-01, 2022-01-01, 2022-01-01, 202…
## $ pd_cd             <int> 258, 244, 792, 109, 112, 339, 114, 101, 105, 665, 77…
## $ pd_desc           <chr> "CRIMINAL MISCHIEF 4TH, GRAFFIT", "BURGLARY,UNCLASSI…
## $ ky_cd             <int> 351, 107, 118, 106, 126, 341, 344, 344, 106, 126, 12…
## $ ofns_desc         <chr> "CRIMINAL MISCHIEF & RELATED OF", "BURGLARY", "DANGE…
## $ law_code          <chr> "PL 1456002", "PL 1402000", "PL 265031B", "PL 120050…
## $ law_cat_cd        <chr> "M", "F", "F", "F", "F", "M", "M", "M", "F", "F", "F…
## $ arrest_boro       <chr> "K", "M", "Q", "B", "K", "K", "Q", "Q", "K", "M", "M…
## $ arrest_precinct   <int> 72, 23, 114, 47, 71, 60, 114, 103, 73, 26, 23, 25, 1…
## $ jurisdiction_code <int> 1, 0, 0, 2, 97, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 1, 0, …
## $ age_group         <chr> "18-24", "25-44", "18-24", "25-44", "25-44", "25-44"…
## $ perp_sex          <chr> "M", "F", "M", "M", "M", "M", "M", "F", "M", "M", "M…
## $ perp_race         <chr> "WHITE", "WHITE", "BLACK", "BLACK", "BLACK HISPANIC"…
## $ x_coord_cd        <int> 984074, 997744, 1010642, 1026480, 1000099, 985372, 1…
## $ y_coord_cd        <int> 178984, 228061, 218253, 262584, 178227, 147958, 2143…
## $ latitude          <dbl> 40.65795, 40.79264, 40.76569, 40.88731, 40.65586, 40…
## $ longitude         <dbl> -74.00063, -73.95127, -73.90472, -73.84727, -73.9428…
## $ geocoded_column   <chr> "POINT (-74.000634 40.657949)", "POINT (-73.951265 4…

##   arrest_key arrest_date pd_cd                        pd_desc ky_cd
## 1  238492853  2022-01-01   258 CRIMINAL MISCHIEF 4TH, GRAFFIT   351
## 2  238496466  2022-01-01   244  BURGLARY,UNCLASSIFIED,UNKNOWN   107
## 3  238498340  2022-01-01   792     CRIMINAL POSSESSION WEAPON   118
## 4  238513835  2022-01-01   109       ASSAULT 2,1,UNCLASSIFIED   106
## 5  238513876  2022-01-01   112  MENACING 1ST DEGREE (VICT NOT   126
## 6  238513883  2022-01-01   339 LARCENY,PETIT FROM OPEN AREAS,   341
##                        ofns_desc   law_code law_cat_cd arrest_boro
## 1 CRIMINAL MISCHIEF & RELATED OF PL 1456002          M           K
## 2                       BURGLARY PL 1402000          F           M
## 3              DANGEROUS WEAPONS PL 265031B          F           Q
## 4                 FELONY ASSAULT PL 1200501          F           B
## 5        MISCELLANEOUS PENAL LAW PL 1201800          F           K
## 6                  PETIT LARCENY PL 1552500          M           K
##   arrest_precinct jurisdiction_code age_group perp_sex      perp_race
## 1              72                 1     18-24        M          WHITE
## 2              23                 0     25-44        F          WHITE
## 3             114                 0     18-24        M          BLACK
## 4              47                 2     25-44        M          BLACK
## 5              71                97     25-44        M BLACK HISPANIC
## 6              60                 0     25-44        M          BLACK
##   x_coord_cd y_coord_cd latitude longitude
## 1     984074     178984 40.65795 -74.00063
## 2     997744     228061 40.79264 -73.95127
## 3    1010642     218253 40.76569 -73.90472
## 4    1026480     262584 40.88731 -73.84727
## 5    1000099     178227 40.65586 -73.94288
## 6     985372     147958 40.57279 -73.99596
##                              geocoded_column
## 1               POINT (-74.000634 40.657949)
## 2               POINT (-73.951265 40.792642)
## 3               POINT (-73.904725 40.765692)
## 4 POINT (-73.8472717577564 40.8873136344706)
## 5               POINT (-73.942878 40.655857)
## 6           POINT (-73.99596126 40.57278637)

To transform the dataset into a tidy form for exploratory analysis, we drop the column ‘geocoded_column’, since it only repeats the longitude (x coordinate) and latitude (y coordinate) that are already stored in their own columns.
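The redundancy can be checked before the column is dropped. The sketch below is illustrative only: it rebuilds the first two rows of the output above in a small data frame (`arrests` is an assumed object name, with coordinates written at the precision of the POINT text), parses the `POINT (longitude latitude)` string, and compares it with the numeric columns.

```r
# Two rows copied from the output above; `arrests` is an assumed object name.
arrests <- data.frame(
  latitude  = c(40.657949, 40.792642),
  longitude = c(-74.000634, -73.951265),
  geocoded_column = c("POINT (-74.000634 40.657949)",
                      "POINT (-73.951265 40.792642)"),
  stringsAsFactors = FALSE
)

# Strip "POINT (" and ")" and split into longitude (x) and latitude (y).
coords <- strsplit(gsub("POINT \\(|\\)", "", arrests$geocoded_column), " ")
parsed_lon <- as.numeric(sapply(coords, `[`, 1))
parsed_lat <- as.numeric(sapply(coords, `[`, 2))

# TRUE when the parsed values repeat the numeric columns.
redundant <- isTRUE(all.equal(parsed_lon, arrests$longitude)) &&
  isTRUE(all.equal(parsed_lat, arrests$latitude))

# Drop the redundant text column.
arrests$geocoded_column <- NULL
names(arrests)
```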

## age_group
##   <18 18-24 25-44 45-64   65+ 
##  4895 24560 81067 27959  2083

In addition, we notice that the age groups are not of equal width. The groups are ‘<18’, ‘18-24’, ‘25-44’, ‘45-64’, and ‘65+’: the ‘18-24’ group spans only 7 years of age, while ‘25-44’ and ‘45-64’ each span 20 years, and the two outer groups are open-ended. Even so, the ‘18-24’ group accounts for 24,560 arrests, nearly as many as the ‘45-64’ group (27,959), which spans almost three times as many years. This number is worth paying attention to and exploring explicitly in plots, so we keep the age categories unchanged. Our dataset is now in a tidy form for exploratory data analysis in R.
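One way to compare groups of unequal width is to look at arrests per single year of age. The sketch below uses the counts from the table above; the widths count both endpoints (7 years for ‘18-24’, 20 years for ‘25-44’ and ‘45-64’), and the open-ended groups are left out because they have no fixed width. This is an illustrative calculation, not part of the original analysis.

```r
# Arrest counts from the age_group table above.
counts <- c("<18" = 4895, "18-24" = 24560, "25-44" = 81067,
            "45-64" = 27959, "65+" = 2083)

# Years of age covered by each closed group (inclusive); the open-ended
# groups "<18" and "65+" have no fixed width, so they are left as NA.
width <- c("<18" = NA, "18-24" = 7, "25-44" = 20, "45-64" = 20, "65+" = NA)

# Average number of arrests per single year of age, rounded.
per_year <- round(counts / width)
per_year
```

On these counts, ‘18-24’ comes to roughly 3,500 arrests per year of age, against about 4,050 for ‘25-44’ and about 1,400 for ‘45-64’.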

3.3 Missing value analysis

3.3.1 Why analyse missing values?

In this section, we examine the distribution of missing values in the original dataset. Missing pattern analysis gives an overview of how the values are missing and helps us decide how to deal with them: whether to delete them directly or to impute them from other non-missing values.

3.3.2 Missing pattern

We start by analysing the missing pattern of the whole dataset.

The bar chart at the bottom left shows the total number of missing values in each column. The chart on the right shows how missing values co-occur across columns. From the chart above, we find that among the 19 columns in the original dataset, only 3 columns (pd_cd, pd_desc, ky_cd) have missing values.

- pd_cd: Three-digit internal classification code.
- pd_desc: Description of the internal classification code.
- ky_cd: Three-digit internal classification code (a more general category than the PD code).

Given the relation among pd_cd, pd_desc and ky_cd, the consistency of the missing values is easy to understand. Since the key code is a more general category, if the key code of an arrest is unknown, the PD code is also unknown. Similarly, if the PD code is unknown, there is no description for it.
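The per-column counts and the co-occurrence pattern described here can be computed in base R as well as read from a chart. The following sketch runs on a toy data frame with made-up values (`arrests` is an assumed object name), so the numbers it prints say nothing about the real dataset.

```r
# Toy data standing in for the real arrests table (values are made up).
arrests <- data.frame(
  arrest_key = 1:5,
  pd_cd      = c(258, NA, 792, NA, 112),
  pd_desc    = c("CRIMINAL MISCHIEF 4TH, GRAFFIT", NA,
                 "CRIMINAL POSSESSION WEAPON", NA,
                 "MENACING 1ST DEGREE (VICT NOT"),
  ky_cd      = c(351, NA, 118, NA, 126),
  stringsAsFactors = FALSE
)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(arrests))

# Missing pattern per row: the names of the missing columns joined by "+",
# or an empty string when the row is complete.
pattern <- apply(is.na(arrests), 1, function(is_missing) {
  paste(names(is_missing)[is_missing], collapse = "+")
})

# How often each non-empty pattern occurs.
pattern_counts <- table(pattern[pattern != ""])

missing_per_column
pattern_counts
```

If the three codes are always missing together, `pattern_counts` contains a single pattern that names all three columns.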

3.3.3 Missing values by borough

When faceting on borough, we were surprised to find that the majority of missing values come from Queens. This is probably due to data filing errors in this borough or to the peculiarities of some crimes. We wondered whether these arrests belong to the same incident or are otherwise related.

We can propose a hypothesis from the charts above: arrest data loss occurs throughout the year, with no peak or obvious pattern. The missing values are therefore unlikely to be due to specific incidents or periods; a more likely cause is a relatively poor system for arrest categorization in Queens.
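The two breakdowns behind this hypothesis, missing rows by borough and by month, can be tabulated directly. The sketch below uses made-up rows and an assumed object name `arrests`; it also assumes that the borough code "Q" stands for Queens, and uses a missing `ky_cd` as the flag for a missing classification.

```r
# Toy rows standing in for the real data (made-up values).
arrests <- data.frame(
  arrest_date = as.Date(c("2022-01-05", "2022-03-17", "2022-03-20",
                          "2022-06-02", "2022-09-11")),
  arrest_boro = c("Q", "Q", "K", "M", "Q"),
  ky_cd       = c(NA, NA, 341, 106, NA),
  stringsAsFactors = FALSE
)

# Flag rows whose classification code is missing.
arrests$missing_code <- is.na(arrests$ky_cd)

# Missing rows per borough.
by_boro <- table(arrests$arrest_boro[arrests$missing_code])

# Missing rows per calendar month, to look for peaks over the year.
by_month <- table(format(arrests$arrest_date[arrests$missing_code], "%Y-%m"))

by_boro
by_month
```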

3.3.4 Dealing with missing values

Since we cannot infer the classification of an arrest from the known data, and the rows with missing values account for only a small part of the total (0.27%), we decided to delete these rows directly. We do not expect this to affect our follow-up analysis.
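A minimal sketch of this deletion step in base R, on a toy data frame with made-up values (`arrests` and `arrests_clean` are assumed names). It computes the share of incomplete rows first, so the figure can be checked before any rows are removed.

```r
# Toy data standing in for the real arrests table (made-up values).
arrests <- data.frame(
  arrest_key = 1:8,
  pd_cd = c(258, 244, NA, 109, 112, 339, 114, 101),
  ky_cd = c(351, 107, NA, 106, 126, 341, 344, 344)
)

# Share of rows with at least one missing value, in percent.
incomplete <- !complete.cases(arrests)
pct_missing <- 100 * mean(incomplete)

# Keep only the complete rows.
arrests_clean <- arrests[!incomplete, ]

pct_missing
nrow(arrests_clean)
```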