Patrick Meaney

Web App Software Engineer

Predictive Analysis of Crime based on Economic Factors

Predictive & Descriptive Statistical Analysis

Predicting rates of crime based on economic indicators (R — statistical computing)

As part of my graduate statistics course, "Advanced Statistical Methods", I used R to data-engineer a dataset from several public datasets, then compared various predictive models on their accuracy.

After settling on a model, I fit it on training data and test real data against it. A similar methodology, on a larger scale, could be used to make predictions of business outcomes. (In fact, one project I would like to work on is a D3 web dashboard for business metrics statistically derived with R or Python.)

Here is the model I settled on (note: each data point is one zipcode):

[Crime] per capita prediction = [% of Population Below Poverty Level] * [Median Household Income] * [Median Rent price]

(Where Crime = Assault, Robbery, Burglary, or Theft, and * = interaction effects)
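If the * follows R's formula convention (an assumption, since the script itself is not reproduced on this page), the product expands into three main effects, three pairwise interactions, and one three-way interaction. The symbols below are introduced here only for illustration: P is the % of population below poverty level, I the median household income, R the median rent, and y-hat the predicted per-capita crime rate for zipcode z.

```latex
\hat{y}_z = \beta_0 + \beta_1 P_z + \beta_2 I_z + \beta_3 R_z
          + \beta_4 P_z I_z + \beta_5 P_z R_z + \beta_6 I_z R_z
          + \beta_7 P_z I_z R_z
```

Under that reading, each crime type gets its own set of eight coefficients.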

Trained Models

Here are three examples of trained models tested with actual data. Each is a multiple linear regression model trained on a randomly selected 80% of the dataset and tested on the remaining 20%. Each time I test the model, a new training set is randomly selected and the model is refit.
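The random 80/20 split can be sketched as follows. This is a minimal Python illustration, not the project's R script; the function name split_train_test and the seed argument are made up for the example.

```python
import random

def split_train_test(rows, train_fraction=0.8, seed=None):
    """Randomly partition rows into (train, test) lists with no overlap."""
    rng = random.Random(seed)   # seed=None gives a different split on every call
    shuffled = list(rows)       # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# 36 row indices stand in for the 36 zipcode rows of the dataset.
train, test = split_train_test(range(36), train_fraction=0.8, seed=1)
print(len(train), len(test))  # prints: 29 7
```

With only 36 rows, the test set holds about 7 zipcodes, so results can shift noticeably from one random split to the next, which is one reason to show several runs.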

Test 1 of Assault, Robbery, Burglary and Theft models

Test 2 of Assault, Robbery, Burglary and Theft models

Test 3 of Assault, Robbery, Burglary and Theft models

The Explanation: How does this project work?

The project is a multilinear regression model that seeks to answer the question this summary addresses:

"To what degree can a neighborhood's (i.e. zipcode's) crime rates be predicted with the following economic factors?"

  • Zipcode-based Economic factor 1: % of Population Below Poverty Level
  • Zipcode-based Economic factor 2: Median Household Income
  • Zipcode-based Economic factor 3: Median Rent price of dwellings

The answer is YES, quite strongly: depending on crime type, the economic factors explain roughly 70-80% of the variability in crime rates for zipcodes in Austin, TX.

I aggregated data from, and data-engineered, the following datasets:

Data table

After some data hygiene & data engineering (such as merging similar categories, e.g. different types of theft), the result is cleansed data, aggregated into the following table.

Data table used in the model.

# | zipcode | SqMtrs | SqKm | PopulationBelowPovertyLevel | MedianHouseholdIncome | Unemployment | MedianRent | population | assault | burglary | homicide | rape | robbery | theft | assault_perCapita | burglary_perCapita | homicide_perCapita | rape_perCapita | robbery_perCapita | theft_perCapita | totalCrimes | totalCrimesPerCapita
1 | 78617 | 179810277 | 179.810277 | 18 | 43957 | 15 | 1041 | 19553 | 14 | 48 | 0 | 10 | 12 | 190 | 0.000716002659438449 | 0.00245486626093183 | 0 | 0.000511430471027464 | 0.000613716565232957 | 0.00971717894952181 | 274 | 0.0140131949061525
2 | 78701 | 4218139 | 4.218139 | 20 | 68152 | 9 | 1590 | 5642 | 84 | 83 | 1 | 34 | 58 | 1776 | 0.0148883374689826 | 0.0147110953562566 | 0.000177242112725984 | 0.00602623183268345 | 0.0102800425381071 | 0.314781992201347 | 2036 | 0.360864941510103
3 | 78702 | 12944540 | 12.94454 | 33 | 34734 | 11 | 766 | 21032 | 120 | 374 | 2 | 22 | 75 | 1603 | 0.00570559147965006 | 0.0177824267782427 | 9.5093191327501e-05 | 0.00104602510460251 | 0.00356599467478129 | 0.0762171928489922 | 2196 | 0.104412324077596
4 | 78703 | 14436159 | 14.436159 | 10 | 92606 | 4 | 1183 | 19873 | 9 | 76 | 0 | 3 | 7 | 565 | 0.000452875761082876 | 0.00382428420469984 | 0 | 0.000150958587027625 | 0.000352236703064459 | 0.0284305338902028 | 660 | 0.0332108891460776
5 | 78704 | 22503945 | 22.503945 | 21 | 50248 | 7 | 940 | 41709 | 96 | 318 | 2 | 22 | 34 | 2243 | 0.00230166151190391 | 0.00762425375818169 | 4.7951281497998e-05 | 0.000527464096477978 | 0.000815171785465967 | 0.0537773622000048 | 2715 | 0.0650938646335323
6 | 78705 | 5686651 | 5.686651 | 66 | 11917 | 14 | 1088 | 30532 | 31 | 133 | 0 | 32 | 18 | 924 | 0.00101532818026988 | 0.00435608541857723 | 0 | 0.00104808070221407 | 0.000589545394995415 | 0.0302633302764313 | 1138 | 0.0372723699724879
7 | 78717 | 33415021 | 33.415021 | 3 | 93305 | 5 | 1018 | 23514 | 10 | 56 | 0 | 1 | 0 | 190 | 0.000425278557455133 | 0.00238155992174875 | 0 | 4.25278557455133e-05 | 0 | 0.00808029259164753 | 257 | 0.0109296589265969
8 | 78721 | 9588238 | 9.588238 | 32 | 32131 | 16 | 870 | 11178 | 67 | 146 | 0 | 9 | 23 | 394 | 0.00599391662193595 | 0.0130613705492933 | 0 | 0.000805152979066023 | 0.00205761316872428 | 0.0352478081946681 | 639 | 0.0571658615136876
9 | 78722 | 3919536 | 3.919536 | 19 | 44917 | 8 | 930 | 5973 | 12 | 57 | 0 | 3 | 10 | 265 | 0.00200904068307383 | 0.0095429432446007 | 0 | 0.000502260170768458 | 0.00167420056922819 | 0.0443663150845471 | 347 | 0.0580947597522183
10 | 78723 | 17979293 | 17.979293 | 29 | 41869 | 10 | 817 | 30999 | 149 | 330 | 4 | 42 | 51 | 1507 | 0.00480660666473112 | 0.0106455046936998 | 0.000129036420529694 | 0.00135488241556179 | 0.00164521436175361 | 0.0486144714345624 | 2083 | 0.0671957159908384
11 | 78724 | 63858033 | 63.858033 | 38 | 35711 | 8 | 962 | 21677 | 48 | 127 | 0 | 18 | 17 | 327 | 0.00221432855099875 | 0.0058587442911842 | 0 | 0.000830373206624533 | 0.000784241361812059 | 0.015085113253679 | 537 | 0.0247728006642986
12 | 78726 | 28204228 | 28.204228 | 9 | 66096 | 4 | 1050 | 12482 | 6 | 16 | 0 | 4 | 2 | 176 | 0.000480692196763339 | 0.00128184585803557 | 0 | 0.000320461464508893 | 0.000160230732254446 | 0.0141003044383913 | 204 | 0.0163435346899535
13 | 78727 | 22190828 | 22.190828 | 11 | 65687 | 6 | 1050 | 28461 | 22 | 78 | 3 | 9 | 5 | 415 | 0.000772987597062647 | 0.00274059238958575 | 0.000105407399599452 | 0.000316222198798356 | 0.00017567899933242 | 0.0145813569445908 | 532 | 0.0186922455289695
14 | 78728 | 21004795 | 21.004795 | 14 | 47405 | 6 | 901 | 20939 | 0 | 2 | 0 | 0 | 0 | 6 | 0 | 9.5515545154974e-05 | 0 | 0 | 0 | 0.000286546635464922 | 8 | 0.000382062180619896
15 | 78729 | 23786144 | 23.786144 | 8 | 57358 | 7 | 1008 | 28228 | 22 | 78 | 0 | 4 | 8 | 357 | 0.000779368003400879 | 0.00276321383023948 | 0 | 0.000141703273345614 | 0.000283406546691229 | 0.0126470171460961 | 469 | 0.0166147087997733
16 | 78730 | 37840281 | 37.840281 | 4 | 119573 | 3 | 1106 | 8312 | 3 | 8 | 0 | 2 | 0 | 38 | 0.000360923965351299 | 0.000962463907603465 | 0 | 0.000240615976900866 | 0 | 0.00457170356111646 | 51 | 0.00613570741097209
17 | 78731 | 22412156 | 22.412156 | 9 | 78265 | 4 | 1016 | 24716 | 12 | 83 | 0 | 7 | 4 | 421 | 0.000485515455575336 | 0.00335814856772941 | 0 | 0.000283217349085613 | 0.000161838485191779 | 0.0170335005664347 | 527 | 0.0213222204240168
18 | 78732 | 34465142 | 34.465142 | 3 | 127726 | 5 | 1688 | 14307 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0.000209687565527364 | 3 | 0.000209687565527364
19 | 78735 | 53215463 | 53.215463 | 6 | 74571 | 4 | 1122 | 16882 | 11 | 50 | 0 | 2 | 2 | 215 | 0.000651581566165146 | 0.00296173439165976 | 0 | 0.00011846937566639 | 0.00011846937566639 | 0.0127354578841369 | 280 | 0.0165857125932946
20 | 78739 | 29628564 | 29.628564 | 1 | 126525 | 4 | 2000 | 16911 | 1 | 13 | 0 | 1 | 0 | 118 | 5.91331086275206e-05 | 0.000768730412157767 | 0 | 5.91331086275206e-05 | 0 | 0.00697770681804743 | 133 | 0.00786470344746023
21 | 78741 | 19660250 | 19.66025 | 40 | 30183 | 9 | 835 | 48246 | 225 | 598 | 1 | 54 | 130 | 2386 | 0.00466359905484392 | 0.0123948099324296 | 2.07271069104174e-05 | 0.00111926377316254 | 0.00269452389835427 | 0.0494548770882563 | 3394 | 0.0703478008539568
22 | 78742 | 14852580 | 14.85258 | 37 | 34076 | 11 | 639 | 901 | 4 | 11 | 0 | 2 | 0 | 29 | 0.00443951165371809 | 0.0122086570477248 | 0 | 0.00221975582685905 | 0 | 0.0321864594894562 | 46 | 0.051054384017758
23 | 78744 | 55439912 | 55.439912 | 26 | 41056 | 9 | 946 | 43452 | 135 | 407 | 4 | 35 | 51 | 1563 | 0.00310687655343828 | 0.00936665746110651 | 9.20556015833563e-05 | 0.000805486513854368 | 0.00117370892018779 | 0.0359707263186965 | 2195 | 0.0505155113688668
24 | 78745 | 34600328 | 34.600328 | 16 | 49243 | 7 | 990 | 54917 | 120 | 435 | 1 | 34 | 37 | 1950 | 0.00218511572008668 | 0.0079210444853142 | 1.8209297667389e-05 | 0.000619116120691225 | 0.000673744013693392 | 0.0355081304514085 | 2577 | 0.0469253600888614
25 | 78746 | 58326014 | 58.326014 | 5 | 125327 | 4 | 1221 | 26939 | 6 | 36 | 0 | 2 | 5 | 741 | 0.000222725416682134 | 0.0013363525000928 | 0 | 7.42418055607112e-05 | 0.000185604513901778 | 0.0275065889602435 | 790 | 0.0293255131964809
26 | 78748 | 32834705 | 32.834705 | 9 | 65889 | 6 | 1095 | 40290 | 48 | 203 | 2 | 13 | 14 | 1270 | 0.00119136262099777 | 0.00503847108463639 | 4.96401092082403e-05 | 0.000322660709853562 | 0.000347480764457682 | 0.0315214693472326 | 1550 | 0.0384710846363862
27 | 78749 | 26078902 | 26.078902 | 6 | 80956 | 4 | 1150 | 34391 | 26 | 86 | 2 | 6 | 9 | 689 | 0.000756011747259457 | 0.00250065424093513 | 5.8154749789189e-05 | 0.000174464249367567 | 0.000261696374051351 | 0.0200343113023756 | 818 | 0.0237852926637783
28 | 78750 | 34718029 | 34.718029 | 7 | 75958 | 6 | 1012 | 27119 | 16 | 72 | 2 | 4 | 7 | 308 | 0.000589992256351635 | 0.00265496515358236 | 7.37490320439544e-05 | 0.000147498064087909 | 0.00025812161215384 | 0.011357350934769 | 409 | 0.0150816770529887
29 | 78751 | 6211942 | 6.211942 | 26 | 38624 | 9 | 865 | 14340 | 22 | 100 | 0 | 8 | 11 | 785 | 0.00153417015341702 | 0.00697350069735007 | 0 | 0.000557880055788006 | 0.000767085076708508 | 0.054741980474198 | 926 | 0.0645746164574616
30 | 78752 | 8657144 | 8.657144 | 32 | 33271 | 9 | 752 | 17170 | 101 | 162 | 0 | 23 | 49 | 986 | 0.00588235294117647 | 0.00943506115317414 | 0 | 0.00133954571927781 | 0.00285381479324403 | 0.0574257425742574 | 1321 | 0.0769365171811299
31 | 78753 | 28417686 | 28.417686 | 26 | 39593 | 9 | 826 | 52384 | 184 | 440 | 4 | 58 | 111 | 3138 | 0.00351252290775809 | 0.00839951130116066 | 7.63591936469151e-05 | 0.00110720830788027 | 0.00211896762370189 | 0.0599037874160049 | 3935 | 0.0751183567501527
32 | 78754 | 34419093 | 34.419093 | 11 | 53274 | 9 | 969 | 13570 | 21 | 94 | 0 | 1 | 2 | 339 | 0.00154753131908622 | 0.00692704495210022 | 0 | 7.36919675755343e-05 | 0.000147383935151069 | 0.0249815770081061 | 457 | 0.0336772291820192
33 | 78756 | 4329297 | 4.329297 | 9 | 59685 | 4 | 888 | 8060 | 8 | 76 | 0 | 1 | 3 | 267 | 0.000992555831265509 | 0.00942928039702233 | 0 | 0.000124069478908189 | 0.000372208436724566 | 0.0331265508684863 | 355 | 0.0440446650124069
34 | 78757 | 12732626 | 12.732626 | 16 | 55156 | 6 | 895 | 22718 | 21 | 189 | 2 | 13 | 15 | 858 | 0.000924377145875517 | 0.00831939431287966 | 8.80359186548112e-05 | 0.000572233471256273 | 0.000660269389911084 | 0.037767409102914 | 1098 | 0.0483317193414913
35 | 78758 | 24052435 | 24.052435 | 24 | 41792 | 10 | 898 | 45105 | 155 | 412 | 0 | 36 | 79 | 2144 | 0.00343642611683849 | 0.00913424232346747 | 0 | 0.000798137678749584 | 0.00175146879503381 | 0.0475335328677532 | 2826 | 0.0626538077818424
36 | 78759 | 36065378 | 36.065378 | 7 | 65672 | 7 | 962 | 39850 | 31 | 156 | 1 | 9 | 15 | 1345 | 0.000777917189460477 | 0.00391468005018821 | 2.50941028858218e-05 | 0.000225846925972396 | 0.000376411543287328 | 0.0337515683814304 | 1557 | 0.0390715181932246

From here, with our .csv file of data, we are ready to put the multilinear regression model to work.

The process of setting up the multilinear regression model on the above data table:

I collected datasets for zipcodes in Austin showing the economic indicators mentioned above for each zipcode (the median income, median rent, and percentage of residents in poverty for 36 Austin, Texas zipcodes). I then cross-referenced these zipcodes with a dataset of about 40,000 crimes that occurred in Austin in a particular year. I excluded all zipcodes except those for which I had both economic and crime data. I then counted crimes per category (6 categories total, after simplifying category names: Assault, Burglary, Robbery, Theft, Homicide, and Rape), per zipcode. With these two datasets showing economic & crime data for particular zipcodes, we are able to produce an equation showing the degree of confidence with which we can predict rates of crime based on the chosen economic indicators.
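The counting step described above can be sketched in Python as follows. This is only an illustration of the idea, not the project's R script: the raw category labels in CATEGORY_MAP, the function name crimes_per_capita, the zipcode 78799, and the population figure in the example are all made up.

```python
from collections import Counter, defaultdict

# Hypothetical raw labels mapped onto the simplified category names.
CATEGORY_MAP = {
    "THEFT": "theft",
    "THEFT OF BICYCLE": "theft",
    "AUTO THEFT": "theft",
    "BURGLARY OF RESIDENCE": "burglary",
    "AGG ASSAULT": "assault",
    "ROBBERY BY THREAT": "robbery",
}

def crimes_per_capita(crime_records, population_by_zip):
    """Count crimes per (zipcode, category) and divide by zipcode population.

    crime_records: iterable of (zipcode, raw_category) pairs.
    population_by_zip: dict of zipcode -> population; records for zipcodes
    missing from this dict are skipped.
    """
    counts = defaultdict(Counter)
    for zipcode, raw_category in crime_records:
        category = CATEGORY_MAP.get(raw_category)
        if category is None or zipcode not in population_by_zip:
            continue  # unknown category, or no economic data for this zipcode
        counts[zipcode][category] += 1
    return {
        zipcode: {cat: n / population_by_zip[zipcode] for cat, n in by_cat.items()}
        for zipcode, by_cat in counts.items()
    }

records = [
    ("78701", "THEFT"),
    ("78701", "AUTO THEFT"),
    ("78701", "AGG ASSAULT"),
    ("78799", "THEFT"),  # dropped: no population entry for this zipcode
]
rates = crimes_per_capita(records, {"78701": 1000})
print(rates)
```

Dividing by population is what makes zipcodes of very different sizes comparable in the regression.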

The results:

According to the coefficient of determination (R²) for the various models, about 70%-80% of the variability in three of these crimes (Assault, Burglary, and Robbery), for the zip codes examined, can be explained by the variability in the selected economic indicators. That's a pretty strong connection: these three types of crime can be explained relatively well by the three economic indicators (income, rent, and % of population below poverty)! Based on this study, the other types of crime examined (homicide and rape) cannot be reliably predicted from economic indicators. This suggests they are driven less by the economic health of a neighborhood and more by other unknown or non-analyzed factors.
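For reference, the coefficient of determination compares the model's squared errors with the squared deviations of the data from its own mean. A minimal Python sketch with made-up numbers (not the project's data):

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total
    return 1 - ss_res / ss_tot

# Toy values: the predictions miss each observation by 0.5.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
print(r_squared(actual, predicted))
```

Here the residual sum of squares is 1 and the total sum of squares is 20, so 95% of the variability is explained. A value of 0.7 to 0.8 reads the same way: that share of the spread in crime rates is accounted for by the predictors.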

The ultimate purpose of the project was to create a statistical model from the "training" dataset (i.e. a model that explains the data) and then to run the "test" dataset through that trained model, to check how well the model can predict crime rates from the economic factors studied.
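The train-then-test loop can be sketched in plain Python as below. Everything here is illustrative: the data is synthetic and noise-free, the predictors are in arbitrary rescaled units, and the helper names (design_row, fit_ols, predict) are invented. The fit solves the normal equations directly, which is a teaching shortcut; R's lm uses a QR decomposition, which is numerically safer.

```python
def design_row(poverty, income, rent):
    """Intercept, main effects, and all interaction products for one zipcode."""
    return [1.0, poverty, income, rent,
            poverty * income, poverty * rent, income * rent,
            poverty * income * rent]

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def fit_ols(x_rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(x_rows[0])
    xtx = [[sum(r[i] * r[j] for r in x_rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x_rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict(coefs, x_row):
    """Predicted response for one design row."""
    return sum(c * v for c, v in zip(coefs, x_row))

# Synthetic "truth" so the recovered coefficients can be checked.
true_coefs = [0.02, 0.05, -0.01, 0.004, -0.03, 0.01, 0.002, 0.005]
train_inputs = [(p, i, r) for p in (0.1, 0.2, 0.3)
                for i in (0.4, 0.9) for r in (0.8, 1.2)]
test_inputs = [(0.15, 0.6, 1.0), (0.25, 0.7, 0.9)]

train_x = [design_row(*t) for t in train_inputs]
train_y = [predict(true_coefs, row) for row in train_x]
coefs = fit_ols(train_x, train_y)  # "training"

for t in test_inputs:              # "testing" on rows the fit never saw
    row = design_row(*t)
    print(t, predict(coefs, row), predict(true_coefs, row))
```

Because the synthetic data has no noise, the held-out predictions match the true values almost exactly. With real crime data the gap between the two columns is what the test step measures.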

In the statistical charts (the colorful lines at the top of this page), we can visualize how well the model, fit on the training dataset, predicts crime from the economic factor inputs. The green lines show the boundaries of a 95% confidence interval (we can be "95% confident" that this interval contains the true mean crime rate for zipcodes with those economic inputs; the test group stands in for a fresh sample from that population), with the red line being the predicted average. The black line represents the actual test data used. As you can see, the green lines capture the majority of the data points from the test data, showing that this model (using economic data to explain crimes) does quite a good job of predicting some specific crimes (assault, robbery, and especially burglary).

I did the programming for our three-person team; my two teammates provided guidance, feedback, and ideation. The final script can be viewed here.

Why do I enjoy statistical analysis?

I enjoy statistical analysis because it is a creative, fun, puzzle-solving way of thinking which allows us to explain the world through data in meaningful, measurable ways.

Throughout this particular statistics course ("Advanced Statistical Methods"), I noticed my thought processes changing, especially while working on my statistics project. I think I really changed between the day I walked into the class and the day it ended. I realized then that I had begun to think more analytically and from a more data-driven, input-output, and equation-based perspective. Crafting equations from data is solving a puzzle, one where you can produce real and fascinating answers from apparently unrelated datasets! It's quite fun once you get into it!

In addition to changing the way you think, and having fun analyzing a real life puzzle... practicing statistics via scripting will improve your programming abilities! Learning R really helped me become a better programmer because before I could start playing with data, I spent plenty of time learning R's data types and data structures, which was useful when it came time to produce an analysis project and is applicable to programming.

I really like statistics in general for what multivariate analysis contributes when examining almost any kind of idea. It is quite handy for empirically analyzing all sorts of things: business decisions, scientific & engineering projects, economic theories, or policy arguments. Being able to formulate an equation or model based on data is a skill in its own right. One of my favorite data visualizations is the "choropleth map", a map that shades each region by a data value, much like a heatmap.

Here's a choropleth map of Assault frequency for zipcodes in Austin, TX (data is annual assaults reported in 2014).