Data

Data Description

Primary Dataset

  • Our primary dataset is the Fatal Police Shooting Data collected by The Washington Post.
    • The data relies primarily on news accounts, social media postings, and police reports.
    • The data is collected mainly to analyze the circumstances of fatal shootings and the overall demographics of the victims.
    • The dataset and detailed description can be found here.

Secondary Datasets

  • TidyCensus Data
    • It contains geographic information for the US and is used for the spatial merge with the primary dataset and for geographic visualization.
    • It is retrieved with the TidyCensus package.
  • ACS 2015-2019 SAIPE (Small Area Income and Poverty Estimates) Data
    • It contains income and poverty information for people of different age groups in each county and is used to analyze the impact of economic status on each shooting case.
    • The dataset and detailed description can be found here.
  • State Gun Ownership Data
    • It contains gun ownership information for each state and is used to analyze the impact of state gun ownership on each shooting case.
    • The dataset and detailed description can be found here.

Variable descriptions for each dataset are given in the sections below. Because some datasets contain many variables, only the most important ones are listed.

Fatal Police Shooting Data

Variable Description:

  • id –> case IDs
  • name –> name of each victim
  • date –> date on which each shooting occurred
  • manner_of_death –> how each victim died
  • armed –> weapon(s) each victim was armed with
  • age –> age of each victim
  • gender –> gender of each victim
  • race –> race of each victim
  • city –> city where each shooting case occurred
  • state –> letter code of the state where each shooting case occurred
  • signs_of_mental_illness –> whether each victim showed signs of mental illness
  • threat_level –> police-defined threat level of each victim
  • flee –> whether each victim fled or not
  • body_camera –> whether the police officer in each case carried a body camera
  • longitude –> longitude where each shooting case occurred
  • latitude –> latitude where each shooting case occurred

TidyCensus Data

Variable Description:

  • NAME –> name of each county
  • geometry –> geographic shape of each county, stored as a POLYGON geometry

ACS SAIPE Data

Variable Description:

  • Year –> year when each row’s information was collected
  • State –> number code of each state
  • County.ID –> number code of each county
  • All.Ages.in.Poverty.Count –> count of people of all ages living under the poverty line for each county
  • All.Ages.in.Poverty.Percent –> percentage of people of all ages living under the poverty line for each county
  • Median.Household.Income.in.Dollars –> median household income in dollars for each county
  • state –> letter code of each state
  • County –> name of each county

State Gun Ownership Data

Variable Description:

  • State –> name of each state
  • gunOwnership –> proportion of people who own gun(s) in each state
  • totalGuns –> total number of guns in each state
  • state –> letter code of each state (manually added for joining with the primary dataset)

Load and Cleaning Process

Load and Cleaning Process File

  • Our load_and_clean_data.R file can be found here

Detailed Load and Cleaning Process

  • Because our primary dataset includes exact location information (longitude and latitude), we converted the data frame (df) to a spatial (shp) object with its CRS set to 4326 (WGS 84), so that it can be merged with the geographic data provided by TidyCensus.
  • We selected the 7 most useful of the 45 columns in the ACS 2015-2019 SAIPE dataset.
  • Because the original county name column contains both the county name and the state code, while the county column of our primary dataset contains only the county name, we removed the state codes so that the dataset is easier to join with the primary data.
  • Because our primary dataset does not include state names, while the state gun ownership dataset uses only state names, we manually added a column with the letter code of each state so that the two can be merged.
  • Because our model focuses on whether the victims were armed with a gun, while the armed column contains 96 types of weapons, we recoded the armed column into a new column named armed_with_gun, which contains only Yes or No to indicate whether each victim was armed with a real gun.
  • After all tables were joined, we removed the rows that contain NA values.
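
The non-spatial steps above can be sketched in base R as follows. This is a minimal illustration on made-up toy tables, not the contents of load_and_clean_data.R. The " (WA)" county suffix format and the rule that only armed values starting with "gun" count as a real gun are assumptions made for the example, as are all data values. The state letter codes are looked up with R's built-in state.name and state.abb vectors rather than added by hand as described above.

```r
# Toy stand-ins for the real tables; every value here is made up.
shootings <- data.frame(
  id    = 1:4,
  armed = c("gun", "knife", "toy weapon", "gun and knife"),
  state = c("WA", "OR", "WA", "TX"),
  stringsAsFactors = FALSE
)
guns <- data.frame(
  State        = c("Washington", "Oregon", "Texas"),
  gunOwnership = c(0.40, 0.50, NA),
  stringsAsFactors = FALSE
)
saipe <- data.frame(
  County = c("King County (WA)", "Multnomah County (OR)"),
  stringsAsFactors = FALSE
)

# Strip a trailing state code such as " (WA)" from the county name.
# The suffix format is an assumption; adapt the pattern to the real column.
saipe$County <- sub(" *\\([A-Z]{2}\\)$", "", saipe$County)

# Add a letter-code column to the gun ownership table by looking up
# each state name in R's built-in state.name / state.abb vectors.
guns$state <- state.abb[match(guns$State, state.name)]

# Collapse the many armed categories into a Yes/No indicator.
# Assumption: values starting with "gun" mean a real gun.
shootings$armed_with_gun <- ifelse(grepl("^gun", shootings$armed), "Yes", "No")

# Left-join the gun ownership table onto the shootings by letter code.
merged <- merge(shootings, guns, by = "state", all.x = TRUE)

# Drop every row that still contains an NA after the join.
cleaned <- merged[complete.cases(merged), ]
```

The spatial conversion is left out so that the sketch runs without extra packages. With the sf package, one common way to do it is sf::st_as_sf(df, coords = c("longitude", "latitude"), crs = 4326). Note that state.name covers only the 50 states, so a row for the District of Columbia would still need a manual code.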