10/15 Post 1

2021-10-15

Source 1: Washington Post’s database

https://github.com/washingtonpost/data-police-shootings

The data records “only those shootings in which a police officer, in the line of duty, shoots and kills a civilian” under circumstances that “most closely parallel the 2014 killing of Michael Brown in Ferguson, Mo.” There are 16 factors (columns) and 4,295 rows of data, covering incidents from 2015 to 2019 with the exact date of each shooting. The columns include the victim’s name, the manner of death, whether the victim was armed, the victim’s race and gender, the city and state where the incident happened, the victim’s threat level, the coordinates of the incident, and whether those coordinates are accurate enough for map plotting.

Because the FBI and CDC have acknowledged under-reporting of these incidents, the Washington Post gathered additional data from “local news reports, law enforcement websites and social media, and by monitoring independent databases such as Killed by Police and Fatal Encounters”. By doing so, it has documented more than twice as many fatal shootings by police each year as are recorded on average in official sources.

Yes, the data is already rectangular. The most useful columns are likely whether the victim was armed, the geographical location, and the threat level. These would allow us to make some tentative judgment about whether the police overreacted and to gain insight into how the policing system works. However, we need a better understanding of what “threat level” actually says about the victim, since its definition is documented on a separate Washington Post page.
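As a concrete starting point, a minimal pandas sketch like the one below could load the file and keep only those columns. The CSV filename and the column names (“armed”, “threat_level”, “longitude”, “latitude”, “is_geocoding_exact”, and so on) are our assumptions based on the repository and should be checked against its README.

```python
import pandas as pd

# Filename and column names are assumed from the repository linked above;
# verify them against the repo's README before running.
shootings = pd.read_csv("fatal-police-shootings-data.csv", parse_dates=["date"])

# Keep only the columns we expect to use (names assumed).
cols = ["date", "armed", "race", "gender", "city", "state",
        "threat_level", "longitude", "latitude", "is_geocoding_exact"]
shootings = shootings[[c for c in cols if c in shootings.columns]]

print(shootings.shape)
print(shootings["threat_level"].value_counts(dropna=False))
```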

Acknowledging the high sensitivity of the issue and the data, we hope to find out to what extent racial discrimination is a problem within the police system, and whether some of the data, provided it is accurate, runs counter to information we previously accepted as true.

The geographical locations should be an interesting feature if we put them to use, since they would let us map how frequent police shootings are in a given area on the real, physical ground. Some records are not that accurate: the margin of error on where an incident actually occurred can exceed 80-100 meters. Second, geographical information alone would not be enough to make sense of the incidents. Combining it with variables tied to the same locations that add meaning and context will be crucial. Currently, the idea is to bring in other geographic data indicating, for example, the local poverty rate or the type and availability of infrastructure.
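A rough sketch of that idea, under the same filename and column-name assumptions as above, might filter out imprecise coordinates, count incidents by state, and leave a placeholder for merging in external data such as poverty rates (the poverty file and its columns below are purely hypothetical).

```python
import pandas as pd

# Same assumed filename and column names as in the sketch above.
shootings = pd.read_csv("fatal-police-shootings-data.csv")

# Keep rows with usable coordinates; the exactness flag is assumed to be boolean.
precise = shootings.dropna(subset=["longitude", "latitude"])
if "is_geocoding_exact" in precise.columns:
    precise = precise[precise["is_geocoding_exact"] == True]

# Incident counts per state as a first, crude geographic summary.
by_state = precise.groupby("state").size().rename("shootings").reset_index()
print(by_state.sort_values("shootings", ascending=False).head(10))

# Hypothetical merge with external state-level data such as poverty rates;
# "poverty.csv" and its columns are placeholders, not part of this source.
# poverty = pd.read_csv("poverty.csv")        # columns: state, poverty_rate (assumed)
# combined = by_state.merge(poverty, on="state")
```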

Source 2: UCI

https://archive.ics.uci.edu/ml/datasets/clickstream+data+for+online+shopping

There are 14 factors (columns), including product information (color, price, etc.) and time and location information (year, month, day, and country). There are also some distinctive variables, such as a categorical factor indicating whether the price is higher than average, and the page number of the website on which the product is shown.

The dataset contains information on clickstreams from an online store offering clothing for pregnant women. Data are from five months of 2008 and include, among others, product category, location of the photo on the page, country of origin of the IP address, and product price in US dollars.

Yes, we can load the CSV file directly, but before analyzing the data we only need to keep certain columns, depending on our questions and topics; for example, the model-photography factor may not be needed when we are looking at the relationship between selling time and price.
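A minimal loading sketch, assuming the downloaded file is the semicolon-separated “e-shop clothing 2008.csv” described on the UCI page and that the column labels match the ones listed there (both assumptions worth verifying):

```python
import pandas as pd

# Filename, semicolon delimiter, and column labels are assumptions based on the
# UCI page; verify them against the downloaded file.
clicks = pd.read_csv("e-shop clothing 2008.csv", sep=";")
print(clicks.columns.tolist())

# Keep only the columns relevant to the selling-time/price question and drop the
# rest (for example, the model-photography factor).
keep = [c for c in ["year", "month", "day", "country", "price", "page"]
        if c in clicks.columns]
clicks = clicks[keep]
print(clicks.head())
```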

We are interested in the relationship between selling time and price, which can indicate price seasonality and help us find the best time to sell the clothes. How important a product’s page placement is would also be a great topic, since we want to check whether the first page is always the best page for products and whether it would be worth paying to put products on the first page.

Some important factors that we want to examine are not directly listed in the dataset; for example, the actual sales volume of a given product is missing, so we will need to derive such factors by manipulating the dataset before we build the regression model.
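One possible workaround, sketched below under the same filename and column-label assumptions, is to treat monthly click counts per product as a rough stand-in for sales and to summarize price by month for the seasonality question.

```python
import pandas as pd

# Same filename and column-label assumptions as in the loading sketch above.
clicks = pd.read_csv("e-shop clothing 2008.csv", sep=";")

# Average price by month as a crude look at price seasonality.
print(clicks.groupby("month")["price"].mean())

# Click counts per product per month as a stand-in for the missing sales figures;
# the product-identifier column name is an assumption.
product_col = "page 2 (clothing model)"
if product_col in clicks.columns:
    proxy_sales = (clicks.groupby(["month", product_col])
                         .size()
                         .rename("clicks")
                         .reset_index())
    print(proxy_sales.head())
```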

Source 3: CORGIS

https://think.cs.vt.edu/corgis/csv/billionaires/

There are 22 columns and 2,615 rows in this dataset, including the billionaire’s personal and background information (such as name, rank and year on the Forbes list, age, gender, citizenship, country, location GDP, wealth type, worth in billions, and where his or her money comes from) and information about the company he or she founded (such as its name, founding year, relationship, sector, type, and whether it is for profit).

The data were collected from the Forbes World’s Billionaires lists from 1996 to 2014, and researchers and scholars at the Peterson Institute for International Economics have added further factual and background variables, such as where the billionaires’ money came from (whether it was inherited, earned in an industry, or gained through politics).

We need to handle missing values in certain columns, such as age, location GDP, and gender, and decide whether to ignore certain variables or rows. Sometimes a value is missing because the entry covers a whole family’s wealth, which is hard to fit into the usual classification.
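A small cleaning sketch along these lines could show where values are missing and drop the rows we decide to ignore; the CSV filename and the column labels (for example “demographics.age”, “location.gdp”, “demographics.gender”) are assumptions based on the CORGIS page and need to be checked against the actual header.

```python
import pandas as pd

# Filename and column labels are assumptions based on the CORGIS page.
billionaires = pd.read_csv("billionaires.csv")

# Count blanks per column to decide which variables or rows to drop.
print(billionaires.isna().sum().sort_values(ascending=False))

# Some "missing" values may be coded as 0 instead of blank (e.g. age, location GDP);
# this is an assumption worth verifying against the documentation.
for col in ["demographics.age", "location.gdp"]:
    if col in billionaires.columns:
        print(col, "zero entries:", (billionaires[col] == 0).sum())

# Example: drop rows with a missing gender value (column name assumed).
if "demographics.gender" in billionaires.columns:
    cleaned = billionaires.dropna(subset=["demographics.gender"])
    print("rows before/after:", len(billionaires), len(cleaned))
```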

Which factors or variables are most related to a billionaire’s ranking or appearance on the list? For example, are there gender differences? Does the industry of the business or company matter? Does inherited money have an effect? We could also compare individual billionaires, especially those whose names appear many times, by analyzing their information (for instance, how their net worth changes), and compare different regions: are certain countries more likely to produce billionaires?

Most of the variables are categorical. We may have to combine or encode some of them to create numerical variables if necessary.
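One way to do that is one-hot encoding with pandas, as in the hedged sketch below; the specific columns named here are placeholders for whichever categorical variables we end up keeping.

```python
import pandas as pd

billionaires = pd.read_csv("billionaires.csv")   # filename assumed, as above

# One-hot encode a couple of categorical columns (names are assumptions) so they
# can enter a regression or other numerical model.
cat_cols = [c for c in ["demographics.gender", "wealth.type"]
            if c in billionaires.columns]
encoded = pd.get_dummies(billionaires, columns=cat_cols, drop_first=True)

print(encoded.shape)
```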
