
US Homicide Report Data, 1980-2014 [Kaggle Competition]

2020-01-15

Data Understanding

The Homicide Reports, 1980-2014 dataset is very large: in total, there are 638,454 instances of reported homicides. The dataset is part of the Murder Accountability Project, a nonprofit group organized in 2015 and dedicated to educating Americans on the importance of accurately accounting for unsolved homicides within the United States. The actual data come from the FBI's Uniform Crime Report "Return A," which summarizes total homicides committed and cleared through arrest for the years 1980 through 2014. To start off, here is the structure and summary of the data:

Structure:

https://cloud.githubusercontent.com/assets/28104954/25489041/4b9b721e-2b2e-11e7-9871-eff08e2e33ec.png

Summary:

In the dataset, there are 24 different variables. There is an ID field, which we won't use because it isn't needed. There are an agency code, agency name, and agency type, all categorical variables used to identify the agency that recorded and handled the homicide case. There are also variables for City, State, Year, and Month, likewise categorical, which give the location and time of the homicide report. There are variables for all of the information about the victim, such as victim sex, victim age, victim race, and victim ethnicity. These variables will be very informative for finding trends in who is targeted for homicide. The dataset also has the same variables for the perpetrator, which will help us identify the type of people who usually commit homicide. The variables describing both the victim and the perpetrator are all categorical. An interesting thing to note about the perpetrator data is that it is only available for the instances of homicide that were solved.

We also have a relationship variable (categorical), a weapon variable (categorical), a victim count variable (numeric), and a perpetrator count variable (numeric). There are sixteen different weapons categorized in the weapon variable. This can be a very interesting variable because the weapon used for murder may give clues as to whether the murder was premeditated, which can potentially be used to determine whether there is a chance of another homicide. Another important note is that victim/perpetrator count is the number of additional victims/perpetrators, not the total number of them (it starts at 0, not 1). The dataset has an asymmetric binary crime solved variable, coded 1 or 2, where 1 means the crime was solved and 2 means the crime went unsolved.
This is our only variable that can truly be targeted with a supervised approach, but since we are trying to analyze more than just whether the crime was solved, we will use both supervised and unsupervised techniques on the data. The dataset gives us plenty of information to analyze trends in the homicide reports and potentially come up with predictions about future homicides/homicide reports.

Data Preparation

Data Cleaning:

The dataset came straight from the Kaggle website as a comma-separated values (CSV) file. There were a few issues with the initial data. A big issue, which kept us from opening the file in R, was instances with extra commas, both inside and outside of quotation marks. Many of the agency name records were entered in the wrong format: a lot of them were wrapped in quotation marks, and many of those had commas inside the quotations. A CSV file lists the variable names at the top, separated by commas, and each line below is split into fields at the commas, so stray extra commas were obviously a problem. This also caused errors when loading the dataset into Weka. Weka reported a line number with each error message, so we opened the CSV file in Notepad++ and reformatted each line that had an error. After doing this, we were able to open the file in Weka.
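The stray-comma problem described above can also be caught programmatically. Below is a minimal sketch (the file contents are invented for illustration) using Python's csv module, which parses commas inside quoted fields correctly and lets us flag any line whose field count does not match the header:

```python
import csv
import io

# Toy CSV with the kind of damage described above: an unquoted comma in
# the agency-name field pushes the later fields out of position.
raw = (
    "Record ID,Agency Name,State\n"
    '1,"Anchorage, Police Dept",Alaska\n'  # quoted comma: parses fine
    "2,Juneau, Police Dept,Alaska\n"       # stray comma: one field too many
)

reader = csv.reader(io.StringIO(raw))
header = next(reader)

# Collect the line numbers whose field count disagrees with the header,
# mirroring the line-by-line error messages Weka reported.
bad_lines = [lineno for lineno, row in enumerate(reader, start=2)
             if len(row) != len(header)]

print(bad_lines)  # only line 3 needs manual reformatting
```

Each flagged line can then be fixed by hand in an editor such as Notepad++, exactly as described above.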

Another issue we found in the dataset was victim age.

https://cloud.githubusercontent.com/assets/28104954/25488457/adc21814-2b2c-11e7-9852-3ba35a3fba11.png

This graph shows a clear error in the data. There were almost a thousand incidents that listed the victim's age as 998, which is obviously not possible. We also see unnatural spikes at ages 99 and 0. After digging deeper, we came to the assumption that 0, 99, and 998 were used to fill in the field when the value was missing. Given the number of instances in the dataset, we decided we could simply remove the instances with these values in victim age. Also, oftentimes when the victim age was 0, 99, or 998, there were many other missing or misclassified values in the same instance. After removing these instances, the victim age distribution looks normal (see below).

https://cloud.githubusercontent.com/assets/28104954/25488763/824bfff0-2b2d-11e7-82c4-e9a918d219ed.png

Similar to victim age, perpetrator age was also misclassified: missing values were assigned the number zero. For this variable, we decided not to remove all of the instances with zero values, because logically the FBI does not know the age of the perpetrator if the crime was not solved, so all of the unsolved crimes would be deleted if we deleted the records with no perpetrator information. Instead, we decided to simply ignore perpetrator age for unsolved crimes.
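The two cleaning rules above, dropping records whose victim age is a sentinel value while keeping unsolved records and ignoring their meaningless perpetrator age of zero, can be sketched like this (the records are hypothetical):

```python
# Hypothetical records: (victim_age, perpetrator_age, crime_solved)
records = [
    (34, 28, "Yes"),
    (998, 0, "No"),   # 998 sentinel for victim age: drop the record
    (99, 45, "Yes"),  # 99 sentinel: drop
    (0, 31, "Yes"),   # 0 sentinel: drop
    (52, 0, "No"),    # unsolved crime: perpetrator age 0 means unknown
]

VICTIM_AGE_SENTINELS = {0, 99, 998}

cleaned = []
for v_age, p_age, solved in records:
    if v_age in VICTIM_AGE_SENTINELS:
        continue  # remove instances with sentinel victim ages entirely
    # keep unsolved crimes, but blank out the meaningless perpetrator age
    perp_age = p_age if solved == "Yes" else None
    cleaned.append((v_age, perp_age, solved))

print(cleaned)  # [(34, 28, 'Yes'), (52, None, 'No')]
```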

Data Reduction: The first thing we did was delete the Record ID variable, as it is not needed for our analysis. We also noticed that the variable labeled City was actually representing counties, so we renamed it to County. We could not figure out what the Incident variable was, and upon searching Kaggle and other resources, no one else was really able to figure out exactly what it represented, so we removed the Incident variable completely.

Data Transformation: There was also an issue with the weapon variable. It is a categorical variable with multiple different firearm types: Firearm, Gun, Handgun, Rifle, and Shotgun. We decided to keep the specific ones (Handgun, Rifle, and Shotgun) and combine Gun and Firearm into a new category called "Unspecified Firearm". This lets us still use the type of gun as a piece of information while reducing confusion on the topic.
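The merge can be expressed as a small lookup table; this sketch assumes the category labels match the dataset's spelling:

```python
# Map the two vague firearm labels into one category; everything else
# (including the specific Handgun, Rifle, and Shotgun labels) passes through.
FIREARM_MERGE = {"Firearm": "Unspecified Firearm", "Gun": "Unspecified Firearm"}

weapons = ["Handgun", "Gun", "Knife", "Firearm", "Shotgun"]
recoded = [FIREARM_MERGE.get(w, w) for w in weapons]

print(recoded)
# ['Handgun', 'Unspecified Firearm', 'Knife', 'Unspecified Firearm', 'Shotgun']
```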

Descriptive Analytics

To explain our data, it is important to understand the variables. The first thing to look at is the ratio between crimes that have been solved and crimes that have not: 70.2% of the crimes have been solved, which leaves 29.8% unsolved.

https://cloud.githubusercontent.com/assets/28104954/25488785/922591b6-2b2d-11e7-9a9a-e42159789a03.png

Next it was important to recognize the number of crimes against females vs. males. This would help us analyze whether there was one sex to focus on more than the other. It was found that males were homicide victims significantly more often than females.

https://cloud.githubusercontent.com/assets/28104954/25488790/97a0a630-2b2d-11e7-9171-a5d08a084cd3.png

The next attribute that we wanted to look at was victims by race. Similar to sex, finding whether a certain race has more victims than others could point us toward figuring out why. The chart shows a significant difference between black and white victims compared to any other race. While we see this as a problem, it could also be explained by more of the population belonging to these two races compared to the others.

https://cloud.githubusercontent.com/assets/28104954/25488802/a0194d1c-2b2d-11e7-8669-d5e24b9b8871.png

We also found that the relationship between a perpetrator and their victim could help us gain more information about the crimes. While we see a huge peak for the relationship "Stranger", this is because those cases are unsolved. Looking at all the other categories, we find that an acquaintance of the victim is significantly more likely to be the perpetrator.

https://cloud.githubusercontent.com/assets/28104954/25488804/a01bf24c-2b2d-11e7-82f5-6ebf198373d7.png

Next we wanted to look at the incidents per state and find which states had the most homicides. Overall, we found that California, New York, and Texas had the highest counts compared to all other states.

https://cloud.githubusercontent.com/assets/28104954/25488800/a019528a-2b2d-11e7-9d45-8a33440ffcc9.png

This can then be broken down further to look at cities within the US. When we do this, we find that Texas does not have cities with as many crimes as California and New York do. Looking closely, this could be because of higher populations per city, as well as more gang crime.

https://cloud.githubusercontent.com/assets/28104954/25488803/a01aa25c-2b2d-11e7-80da-1dd277329f9c.png

Another useful attribute is the year of the crime. Looking at all crimes from 1980-2014, we found a huge peak in 1993. When conducting research, we could find no definitive explanation of why the number of homicides was so high in 1993. Another aspect to note about this graph is that the number of records has been decreasing since the peak in 1993.

https://cloud.githubusercontent.com/assets/28104954/25488801/a01a1d3c-2b2d-11e7-803d-c35eea4bdade.png

One of the most informative variables when it comes to crime is the weapon used. Overall, there is a significantly larger number of crimes committed with a handgun than with any other weapon.

https://cloud.githubusercontent.com/assets/28104954/25488805/a022c78e-2b2d-11e7-9991-276c5ce93c0f.png

While it looked like handguns had significantly more unsolved crimes, we found it beneficial to look at each weapon by its percentage of unsolved vs. solved cases. This lets us evaluate each weapon individually. When we did this, we found that "Unspecified Firearm" and "Strangulation" go unsolved more often than handguns.

https://cloud.githubusercontent.com/assets/28104954/25488808/a02ff9f4-2b2d-11e7-832a-c09726754ca0.png
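The per-weapon breakdown is a grouped tally of unsolved cases over total cases; this sketch uses invented case records purely to show the calculation:

```python
from collections import defaultdict

# Hypothetical (weapon, solved) case records
cases = [
    ("Handgun", True), ("Handgun", True), ("Handgun", False),
    ("Strangulation", True), ("Strangulation", False), ("Strangulation", False),
    ("Unspecified Firearm", True), ("Unspecified Firearm", False),
]

tallies = defaultdict(lambda: [0, 0])  # weapon -> [unsolved, total]
for weapon, solved in cases:
    tallies[weapon][1] += 1
    if not solved:
        tallies[weapon][0] += 1

pct_unsolved = {w: round(100 * unsolved / total, 1)
                for w, (unsolved, total) in tallies.items()}

print(pct_unsolved)
# {'Handgun': 33.3, 'Strangulation': 66.7, 'Unspecified Firearm': 50.0}
```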

Next we wanted to compare the race of the victim versus the race of the perpetrator. If a correlation is found, this information could be beneficial to law enforcement when looking for the perpetrators of future crimes. We found that perpetrators are more likely to commit crimes against victims of their own race.

https://cloud.githubusercontent.com/assets/28104954/25488807/a02fec5c-2b2d-11e7-846e-ba7fb4881b7f.png

Expanded Modeling and Predictive Analytics

Understanding which attributes are the most informative is a crucial step when analyzing data. Especially when considering homicides, knowing which variables of the crime tell us the most about what happened is practically essential for any preventative measures to be taken by law enforcement. A bit of preprocessing was required in order to produce an effective and informative model. Certain aspects of the data had to be ignored when creating these models, as they did not pertain to our business problem. As previously stated, the ID field and the outlier victim ages that were due to either unknown values or faulty reporting were removed. On top of that, we removed all agency information, all perpetrator information, county, relationship, perpetrator count, and victim count, as these attributes are, for the most part, only relevant when looking at solved crimes. Focusing on the unsolved crimes and taking measures to prevent them has been our main business problem, and the attribute that tells us the most about whether or not a crime will be solved is the weapon used.

We first found the variables with the highest information gain, using all variables besides relationship and perpetrator information. Below, we see that agency code and name are the top two most informative. This shows that certain agencies are better at solving crimes than others. Along the same lines, City and State also show which places have better rates of solving homicides.

https://cloud.githubusercontent.com/assets/28104954/25488806/a02f6c3c-2b2d-11e7-893a-3cb1c03126a7.png
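Information gain, the measure behind this ranking, is the reduction in label entropy after splitting on an attribute. A self-contained sketch on toy data:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_idx, labels):
    """Entropy reduction from splitting the rows on one categorical attribute."""
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attr_idx], []).append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder

# Toy rows: (weapon, month); label: was the crime solved?
rows = [("Handgun", "Jan"), ("Knife", "Jan"), ("Handgun", "Feb"), ("Knife", "Feb")]
solved = ["No", "Yes", "No", "Yes"]

print(info_gain(rows, 0, solved))  # weapon predicts the label perfectly -> 1.0
print(info_gain(rows, 1, solved))  # month carries no information -> 0.0
```

Ranking every attribute by this score is what produces a list like the one pictured above.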

We also wanted to use only variables describing the homicide itself. After cleaning the data, we ran a J48 tree in Weka using the attributes state, year, month, crime type, victim sex, victim age, victim race, victim ethnicity, and weapon used.

https://cloud.githubusercontent.com/assets/28104954/25488810/a031bb90-2b2d-11e7-94f5-0b996a9914f4.png

The results showed weapon to be the most informative attribute. Crimes that were more personal in nature, i.e. more physical, such as murder with a blunt object, had a higher solved-to-unsolved rate than firearms, which include handguns, rifles, and shotguns. This is consistent with our expectations, as a crime with more physical contact has a higher chance of DNA or fingerprints being left behind, among other things. With firearms, the distance between perpetrator and victim can be greater, which consequently leaves less evidence at the scene of the crime. Gun violence has long been an issue in our country, and as the data show, it has remained the leading cause of homicides for decades. Although gun violence peaked in the early-to-mid 1990s, it has held the top spot in the 21st century as well. Another highly informative attribute in determining successful homicide investigations is the victim's age. We typically see that cases involving children aged roughly 13 and under have a high rate of being solved. This is probably because they are murdered by family members or by an abductor who was captured.

One of our initial assumptions was that certain months of the year would show a spike in homicides. We broke it down state by state and year by year, and found no conclusive evidence to support our claim. Another assumption was that modern technology would increase the solved-to-unsolved rate, but that too has remained consistent over time, which was very surprising to us. The image on the left represents all homicides from 1988-1999, with blue indicating solved and red indicating unsolved.

https://cloud.githubusercontent.com/assets/28104954/25488809/a03113d4-2b2d-11e7-9da9-c640f77685d8.png

https://cloud.githubusercontent.com/assets/28104954/25488811/a03802ca-2b2d-11e7-9dec-5c71d846e5b8.png

The percentage of unsolved crimes between 1988-1999 is 29%. The graph on the right covers 2000-2014, and again we see a similar percentage of unsolved crimes, at 30%. Obviously having zero unsolved crimes would be ideal, but minimizing the amount is a top priority, and 30% is alarmingly high.

Going back to handgun violence, we find that California has experienced the most homicides from a firearm. Shockingly, the ratio of homicides with firearms compared to all other methods has also remained constant over the years, despite advances in forensic technology. The graphs below depict our findings.

Next we wanted to see whether there were instances that could be grouped together, and what the characteristics of those groupings (or clusters) were. For this model we only looked at homicides that occurred after the year 2000. To accomplish this, we ran a SimpleKMeans cluster in Weka. Our results were not very conclusive, with 45% of instances clustered incorrectly, but they give us a rough look at the typical groups of homicides seen throughout the dataset.

https://cloud.githubusercontent.com/assets/28104954/25488814/a041f992-2b2d-11e7-8e0c-c1869757f658.png
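k-means alternates between assigning each instance to its nearest centroid and recomputing the centroids, which is the same loop behind Weka's SimpleKMeans. A minimal one-dimensional sketch of that loop, on made-up victim ages:

```python
def kmeans_1d(values, k, iters=20):
    """Minimal 1-D k-means: assign points to the nearest centroid,
    then move each centroid to the mean of its cluster."""
    # Spread the initial centroids across the sorted values
    centroids = sorted(values)[:: max(1, len(values) // k)][:k]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical victim ages: a young group and an older group
ages = [15, 17, 19, 45, 50, 52]
centroids, clusters = kmeans_1d(ages, 2)

print(sorted(centroids))  # one young-victim and one older-victim cluster
```

Real runs, like ours in Weka, use many attributes at once, but the assignment/update cycle is the same.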

After that, we were looking for a model that could consider numerous attributes at the same time to estimate the probability of a crime being solved. For this, we chose to build a Naive Bayes model. With this model, we were able to correctly classify whether or not the crime was solved 81.32% of the time.

https://cloud.githubusercontent.com/assets/28104954/25488813/a040bcb2-2b2d-11e7-8903-f8fb0252b69c.png

https://cloud.githubusercontent.com/assets/28104954/25488818/a051e550-2b2d-11e7-93a8-d8313e00f84d.png

https://cloud.githubusercontent.com/assets/28104954/25488815/a044ad72-2b2d-11e7-94a1-fb4752853fc2.png
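Naive Bayes scores each class as its prior probability times a product of per-attribute conditional probabilities. The sketch below implements the categorical case with Laplace smoothing on toy data; it illustrates the idea rather than reproducing Weka's implementation:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class frequencies and per-attribute value frequencies."""
    label_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # attr index -> (label, value) -> count
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[i][(label, v)] += 1
    return label_counts, value_counts, len(labels)

def predict(model, row):
    """Pick the class with the highest prior * product of smoothed likelihoods."""
    label_counts, value_counts, n = model
    best, best_p = None, -1.0
    for label, lc in label_counts.items():
        p = lc / n  # class prior
        for i, v in enumerate(row):
            distinct = len({val for (_, val) in value_counts[i]})
            # Laplace smoothing avoids zero probabilities for unseen pairs
            p *= (value_counts[i][(label, v)] + 1) / (lc + distinct)
        if p > best_p:
            best, best_p = label, p
    return best

# Toy training data: (weapon, victim sex) -> crime solved?
rows = [("Handgun", "Male"), ("Knife", "Female"),
        ("Handgun", "Male"), ("Knife", "Male")]
solved = ["No", "Yes", "No", "Yes"]

model = train_nb(rows, solved)
print(predict(model, ("Handgun", "Male")))  # -> No
```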

It is clear that gun violence is a top national concern and needs to be dealt with if effective change is ever to be seen. Using these models as a starting point, law enforcement would likely turn their focus to gun control and finding new ways to control the distribution of firearms, as well as potentially increasing the number of police officers patrolling neighborhoods with abnormally high amounts of firearm-related homicides.

Enhanced Models and Prescriptive Analytics

When enhancing our modeling, we knew that there were many routes we could take with our dataset. Because of all the possibilities, it is hard to focus on just one, since there are many causes and relationships that can be addressed with this dataset. In our previous project we decided to focus on serial killer activity in California; this turned out to be a dead end, and we switched our focus. For our enhanced models we looked further into race and what role it plays in the crimes involved. Similar to weapon, we wanted to analyze whether crimes against a certain race had a higher solve rate. We also decided to separate by year and see if there were any significant changes. The unsolved crime rate was decreasing through the years for every race except black victims.

https://cloud.githubusercontent.com/assets/28104954/25488816/a049c118-2b2d-11e7-8e39-fdece49c0e84.png

https://cloud.githubusercontent.com/assets/28104954/25488819/a0588086-2b2d-11e7-8a7f-0c320f0ee0d8.png

https://cloud.githubusercontent.com/assets/28104954/25488831/a2ba40e4-2b2d-11e7-83fd-6589f15facb1.png

From here we wanted to see whether there were specific agencies with an increasing unsolved crime rate for black victims. Narrowing our analysis to agencies can help us see whether certain agencies are performing more poorly than their state overall. Predicting from our descriptive analysis, we assumed the numbers would be highest at California, New York, and Texas agencies. Our analysis proved us wrong: we found that New York, Illinois, Michigan, the District of Columbia, and Maryland all had agencies with more unsolved crimes against black victims.

https://cloud.githubusercontent.com/assets/28104954/25488820/a0595c22-2b2d-11e7-9f70-b36817832e70.png

From here we decided to filter out all the agencies except the top five. Focusing on these five allows us to analyze the issues at these agencies further and gain more information. Once we did this, we decided to focus on the year. Similar to what we found earlier, there was a strong peak in 1993, but only for New York.

https://cloud.githubusercontent.com/assets/28104954/25488821/a05b9258-2b2d-11e7-8f6a-74477d46a99b.png

Within these agencies it is important to look not only at the number of unsolved crimes, but also at the percentage of crimes that are unsolved. This changed our analysis: although New York, Illinois, Michigan, the District of Columbia, and Maryland had the highest counts of unsolved crimes, they did not also have the highest percentages of unsolved crimes. There were many other agencies whose percentage of unsolved crimes was over 60%. This is important to note, and it is worth looking further into what these agencies are doing wrong and why they are not solving as many crimes as others. This could be because of low resources, poor funding, or even corruption.

https://cloud.githubusercontent.com/assets/28104954/25488823/a0673414-2b2d-11e7-8825-bd897775a571.png
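The distinction drawn above, a high count of unsolved crimes versus a high unsolved percentage, shows up even on tiny invented tallies:

```python
# Hypothetical agency tallies: agency -> (unsolved, total)
agencies = {
    "NYPD": (3000, 12000),      # many unsolved cases, but a 25% rate
    "Smalltown PD": (70, 100),  # few unsolved cases, but a 70% rate
    "Midville PD": (500, 1000),
}

# Ranking by raw count and filtering by rate give different answers
by_count = max(agencies, key=lambda a: agencies[a][0])
over_60_pct = [a for a, (unsolved, total) in agencies.items()
               if unsolved / total > 0.60]

print(by_count)      # NYPD leads on raw unsolved count...
print(over_60_pct)   # ...but only Smalltown PD exceeds a 60% unsolved rate
```

This is why both views of the agency data are reported above.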

Another important thing to look at is the type of agency and how well it solves homicides. We modeled a graph for each type of agency and its solved-to-unsolved percentage (below).

https://cloud.githubusercontent.com/assets/28104954/25488822/a060f3ec-2b2d-11e7-9416-806ce13cde07.png

This clearly shows that municipal police, county police, and special police have the worst solve rates. This is important because all agencies should be able to solve crimes at roughly the same rate; the disconnect between agencies suggests there is an issue with the system.

To further enhance our models, we used a few different boosting techniques. The first technique we tried was AdaBoostM1, which gave us almost identical results to our earlier Naive Bayes model. Here are the results:

https://cloud.githubusercontent.com/assets/28104954/25488825/a070df32-2b2d-11e7-9815-944914bb9761.png
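The core of AdaBoostM1 is its instance-reweighting rule: after each weak learner, misclassified instances gain weight and correctly classified ones lose weight, so the next learner concentrates on the hard cases. A sketch of one such round (the predictions are invented):

```python
from math import exp, log

def adaboost_round(weights, predictions, truth):
    """One AdaBoost-style round: weighted error, the learner's vote
    weight alpha, and renormalized instance weights for the next round."""
    err = sum(w for w, p, t in zip(weights, predictions, truth) if p != t)
    alpha = 0.5 * log((1 - err) / err)  # assumes 0 < err < 0.5
    reweighted = [w * exp(alpha if p != t else -alpha)
                  for w, p, t in zip(weights, predictions, truth)]
    total = sum(reweighted)
    return alpha, [w / total for w in reweighted]

weights = [0.25] * 4                # start with uniform instance weights
preds = ["Yes", "Yes", "No", "No"]  # one weak learner's guesses
truth = ["Yes", "No", "No", "No"]   # it misses instance 1

alpha, new_weights = adaboost_round(weights, preds, truth)
print(round(new_weights[1], 3))  # the missed instance now carries half the weight
```

The boosted model's final prediction is a vote among the weak learners, each weighted by its alpha.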

We were able to generate a slightly better model using LogitBoost, but the improvement is not enough to truly differentiate it.

https://cloud.githubusercontent.com/assets/28104954/25488826/a072f9a2-2b2d-11e7-9fa3-f4d3f86ccec9.png

Next, we tried stacking. The accuracy of the stacking model was significantly lower than the others, so our best boosted model for prediction was LogitBoost.

https://cloud.githubusercontent.com/assets/28104954/25488824/a06c6a2e-2b2d-11e7-9925-59b64464e8f6.png

