The Art of Software

Category Archives: Data Analysis

Derivation of Bias-Variance Decomposition

13 Thursday Sep 2012

Posted by craig in Data Analysis, Exercises, Math

Tags

statistics

On page 24 in The Elements of Statistical Learning (ESL) by Hastie et al, the Bias-Variance decomposition is shown, but not derived. It turns out the derivation is quite easy, but also a bit tedious. I am presenting the derivation here using notation similar to ESL. I hope that this saves someone some time.

I’d also like to credit these notes, which provided me the trick necessary to derive this, but which unfortunately did not provide the gory details.

To recap the notation used in ESL, we have x_0 as the point at which we want to evaluate our estimate of the function f, while f(x_0) and \hat{y}_0 denote the true value of the function and our estimate, respectively. From here on, however, we’ll drop the subscripts.

Recall the definitions of Variance and Bias Squared:

\text{Variance} = E[(\hat{y} - E[\hat{y}])^2]
= E[\hat{y}^2 - 2E[\hat{y}]\hat{y} + E[\hat{y}]^2]
\text{Bias}^2 = (E[\hat{y}] - f(x))^2
= E[\hat{y}]^2 - 2E[\hat{y}]f(x) + f(x)^2

Now we have mean-squared error:

\text{MSE} = E[(f(x)-\hat{y})^2]
= E[(f(x)- E[\hat{y}] + E[\hat{y}] - \hat{y})^2]
= E[(f(x)- E[\hat{y}] + E[\hat{y}] - \hat{y})(f(x)- E[\hat{y}] + E[\hat{y}] - \hat{y})]
= E[\underline{f(x)^2} - \underline{f(x)E[\hat{y}]} + f(x)E[\hat{y}] - f(x)\hat{y}
- \underline{E[\hat{y}]f(x)} + \underline{E[\hat{y}]^2} - E[\hat{y}]^2 + E[\hat{y}]\hat{y}
+ E[\hat{y}]f(x) - E[\hat{y}]^2 + \underline{E[\hat{y}]^2} - \underline{E[\hat{y}]\hat{y}}
-\hat{y}f(x) + \hat{y}E[\hat{y}] - \underline{\hat{y}E[\hat{y}]} + \underline{\hat{y}^2}]
= E[\hat{y}^2-2E[\hat{y}]\hat{y} + E[\hat{y}]^2]
+ E[E[\hat{y}]^2 -2E[\hat{y}]f(x) + f(x)^2]
+ E[f(x)E[\hat{y}] - f(x)\hat{y} - E[\hat{y}]^2 + E[\hat{y}]\hat{y}
+ E[\hat{y}]f(x) - E[\hat{y}]^2 - \hat{y}f(x) + \hat{y}E[\hat{y}]]
= \text{Variance} + \text{Bias}^2
+ f(x)E[\hat{y}] - f(x)E[\hat{y}] - E[\hat{y}]^2 + E[\hat{y}]^2
+ E[\hat{y}]f(x) - E[\hat{y}]^2 - E[\hat{y}]f(x) + E[\hat{y}]^2
= \text{Variance} + \text{Bias}^2

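The big trick required to get the result is to simultaneously add and subtract E[\hat{y}] inside the squared error. After that we only have the tedium of expanding the product and then, guided by the above definitions of bias and variance, regrouping the terms using the linearity of expectation, i.e. E[aX] = aE[X] and E[X+Y] = E[X]+E[Y]. We also use the fact that E[E[X]]=E[X]. In particular, f(x) and E[\hat{y}] are constants, so E[f(x)\hat{y}] = f(x)E[\hat{y}] and E[E[\hat{y}]\hat{y}] = E[\hat{y}]^2, which is why the leftover cross terms cancel.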

Note that the underlined terms in the third step are regrouped as the first two lines of the fourth step; they are the terms that make up the variance and bias squared.
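The identity can also be checked numerically. The following is a minimal sketch rather than part of the derivation: the estimator (a shrunken sample mean), the function name bias_variance_check, and all parameter values are assumptions made up for illustration. Expectations are replaced by averages over simulated estimates, and because the same algebra applies to averages, the two sides agree up to floating-point error.

```python
import random


def bias_variance_check(f_x=2.0, shrink=0.8, noise_sd=1.0,
                        n_obs=5, n_trials=10000, seed=0):
    """Simulate a deliberately biased estimator of f(x) and return
    (mse, variance, bias_squared) computed from the simulated estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        # Each trial draws n_obs noisy observations of f(x) ...
        sample = [f_x + rng.gauss(0.0, noise_sd) for _ in range(n_obs)]
        # ... and shrinks their mean toward zero, which introduces bias.
        estimates.append(shrink * sum(sample) / n_obs)

    mean_est = sum(estimates) / n_trials                  # stands in for E[y_hat]
    mse = sum((f_x - y) ** 2 for y in estimates) / n_trials
    variance = sum((y - mean_est) ** 2 for y in estimates) / n_trials
    bias_sq = (mean_est - f_x) ** 2
    return mse, variance, bias_sq


mse, variance, bias_sq = bias_variance_check()
print(mse, variance + bias_sq)
```

The two printed numbers should match to many decimal places, whatever seed or parameters are chosen.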

Effect of Electoral College on State Electoral Power

15 Wednesday Aug 2012

Posted by craig in Data Analysis

Tags

political analysis, presidential election

With the 2012 Presidential Election approaching and with Electoral College politics on full display, I wondered, “how does the Electoral College affect the overall electoral power of each state versus an allocation of votes based solely on population?”

To begin to answer this question we must first understand how electoral votes are allocated. Article II of the U.S. Constitution states that “each state shall appoint, in such manner as the Legislature thereof may direct, a number of electors, equal to the whole number of Senators and Representatives to which the State may be entitled in the Congress.” In addition, Amendment XXIII treats Washington D.C. as a state for the purpose of electing the President, and in effect the Amendment provides it with three votes. In total there are 538 electoral votes in play (435 House members + 100 senators + 3 for D.C.).

Due to the method of allocating electoral votes, the per capita voting power of the less populous states is enhanced at the expense of the more populous ones. To illustrate this point we will consider how votes would be allocated based solely on a state’s population in the cases of California and Wyoming.

California is the most populous state in the country, and based upon the 2010 census, it contains 12.07% of the country’s population. Thus, if electoral votes were allocated based solely upon population it would control 12.07%, or 64.92, of them. Instead it receives 55, or 10.22% of the total votes. California’s voting power is 0.85 times what it would be if votes were allocated based on population alone.

At the other extreme is the least populous state, Wyoming, which contains 0.18% of the country’s population, but which controls 3 votes, or 0.56% of the 538 total votes. If its votes were allocated based solely on population it would control just 0.98 votes. Thus Wyoming’s voting power is 3.05 times what it would be if its votes were allocated based solely on population.
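The two calculations above can be reproduced with a few lines of Python. This is only a sketch: the function name voting_power is made up for illustration, and the 2010 census resident populations used below (308,745,538 for the 50 states plus D.C., 37,253,956 for California, 563,626 for Wyoming) are assumptions that are not quoted in this post, although they do reproduce its rounded figures.

```python
TOTAL_ELECTORAL_VOTES = 538
# Assumed 2010 census resident population of the 50 states plus D.C.
US_POPULATION_2010 = 308745538


def voting_power(electoral_votes, state_population,
                 total_population=US_POPULATION_2010,
                 total_votes=TOTAL_ELECTORAL_VOTES):
    """Ratio of a state's actual electoral votes to the votes it would
    receive if all votes were allocated by population alone."""
    votes_by_population = state_population / total_population * total_votes
    return electoral_votes / votes_by_population


# Assumed 2010 census populations for the two states discussed above.
print(round(voting_power(55, 37253956), 2))  # California, about 0.85
print(round(voting_power(3, 563626), 2))     # Wyoming, about 3.05
```

A value below 1 means the state has less power than its population alone would give it, and a value above 1 means it has more.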

Overall, larger states like California experience a diminution of power, while smaller states like Wyoming experience a growth. In fact, 18 states lose some power, while the rest, including Washington D.C., gain.

The map and table below show electoral voting power per capita for all 50 states and the District of Columbia. The histogram at the bottom shows the distribution of states over several per capita voting power ranges. Finally, the last plot shows each state ranked by its electoral power.

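The data show that the least populated states benefit from a tremendous increase in electoral power, up to about threefold (3.05 for Wyoming and 2.86 for Washington, D.C.), while the largest states suffer only marginal losses of at most about 15% (0.85 for California).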

My raw data can be found here, CSVs etc.

Map Showing Per Capita Voting Power Per State

Alabama 1.08 Alaska 2.42
Arizona 0.99 Arkansas 1.18
California 0.85 Colorado 1.03
Connecticut 1.12 Delaware 1.92
Florida 0.89 Georgia 0.95
Hawaii 1.69 Idaho 1.46
Illinois 0.89 Indiana 0.97
Iowa 1.13 Kansas 1.21
Kentucky 1.06 Louisiana 1.01
Maine 1.73 Maryland 0.99
Massachusetts 0.96 Michigan 0.93
Minnesota 1.08 Mississippi 1.16
Missouri 0.96 Montana 1.74
Nebraska 1.57 Nevada 1.28
New Hampshire 1.74 New Jersey 0.91
New Mexico 1.39 New York 0.86
North Carolina 0.90 North Dakota 2.56
Ohio 0.90 Oklahoma 1.07
Oregon 1.05 Pennsylvania 0.90
Rhode Island 2.18 South Carolina 1.12
South Dakota 2.11 Tennessee 0.99
Texas 0.87 Utah 1.25
Vermont 2.75 Virginia 0.93
Washington 1.02 Washington, D.C. 2.86
West Virginia 1.55 Wisconsin 1.01
Wyoming 3.05

Electoral Power Ranked by State

SQL Query to Generate Data for Histogram/Frequency Plot

17 Thursday May 2012

Posted by craig in Code Snippets, Data Analysis, Notes

Tags

mysql, sql

Say we have two MySQL relations, User and Checkin, where each User can have zero or more Checkins. Furthermore, the User table has two attributes: id and name, while the Checkin table has three attributes: id, user_id, and date.

We would like to generate a plot of the number of users versus the number of checkins. That is, how many users have checked in one time, two times, etc.?

We arrive at our answer by first calculating the number of checkins per user, which can be generated using the following query:

 SELECT user_id, count(user_id) AS cicount FROM Checkin GROUP BY user_id 

This gives us a table with two attributes: user_id, and cicount, where cicount means “checkin count”. Let’s call this table “PerUserCount”, which we can query to get the counts we really want.

 SELECT cicount, count(user_id) AS nusers FROM PerUserCount GROUP BY cicount 

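Of course we can’t run the second query as written, because PerUserCount is not an actual table. Instead we use the first query as a derived table (a subquery in the FROM clause). The complete query is: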

SELECT cicount, count(user_id) AS nusers
FROM (SELECT user_id, count(user_id) AS cicount
      FROM Checkin
      GROUP BY user_id) AS PerUserCount
GROUP BY cicount
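To try the query without a MySQL server, here is a small sketch that runs it against an in-memory SQLite database using Python's built-in sqlite3 module. SQLite stands in for MySQL here, which is an assumption of the example rather than the setup described above. The sample rows and the helper names build_demo_db and checkin_histogram are made up for illustration, and an ORDER BY has been added so the output order is predictable.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with made-up User and Checkin rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE "User" (id INTEGER PRIMARY KEY, name TEXT)')
    conn.execute("CREATE TABLE Checkin "
                 "(id INTEGER PRIMARY KEY, user_id INTEGER, date TEXT)")
    conn.executemany('INSERT INTO "User" (id, name) VALUES (?, ?)',
                     [(1, "ann"), (2, "bob"), (3, "cat"), (4, "dan")])
    # ann checks in twice, bob and cat once each, dan never.
    conn.executemany("INSERT INTO Checkin (user_id, date) VALUES (?, ?)",
                     [(1, "2012-05-01"), (1, "2012-05-02"),
                      (2, "2012-05-01"), (3, "2012-05-03")])
    conn.commit()
    return conn


def checkin_histogram(conn):
    """Return (checkin count, number of users) pairs using a derived table."""
    query = """
        SELECT cicount, COUNT(user_id) AS nusers
        FROM (SELECT user_id, COUNT(user_id) AS cicount
              FROM Checkin
              GROUP BY user_id) AS PerUserCount
        GROUP BY cicount
        ORDER BY cicount
    """
    return conn.execute(query).fetchall()


print(checkin_histogram(build_demo_db()))  # [(1, 2), (2, 1)]
```

Note that dan, who has no checkins, does not appear in the result: the query only sees users with at least one row in Checkin, so there is no bucket for zero checkins.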

This Stack Overflow post was helpful in deriving this solution.

ScapeToad Cartogram Tutorial (formerly Cartogram Crash Course)

08 Wednesday Feb 2012

Posted by craig in Code Snippets, Data Analysis, Notes

Tags

cartogram, gis, scape toad

This post provides a tutorial on how to create a cartogram using ScapeToad v1.1. In addition it describes how to work with a few common GIS file formats. Upon completion you will have created a cartogram that shows the per state population of the United States and learned a bit about the DBase and shape file formats. Along the way some simple Python programming will be required. All of the data files used for this tutorial as well as the Python script can be found on GitHub here.

However, before we start it might be useful to get an idea of how cartograms help to visualize geographic information. Mark Newman’s pages are particularly good for understanding the importance of this data visualization method. Have a look at the 2008 U.S. Presidential Election Results, and also at World Mapper.

To begin the tutorial we will need a shape file that describes the state by state geometry of the United States. This can be downloaded at the Census Bureau’s website. Click the above link, then select “States (and equivalent)”, click “submit”, and then from the 2010 box, select the “all in one national file” option. Clicking on the download button will give you a zip file with the relevant information in it. Explore the other options in order to see what additional shape files are available.

Now that you have the zip file downloaded, unpack it. Assuming the zip file was named “tl_2010_us_state10.zip” you should have a single directory with five files in it. Each of the five files has the same base name as the directory itself, but each has its own file extension. For our purposes here we care about the shape file and the DBase file, which have extensions “shp” and “dbf” respectively.

The shape file itself contains geometric information, and can be thought of as a list of geometric entities, where each item corresponds to a particular state’s geometry. Wikipedia has a write-up worth reading. The detailed technical specification for the file format is here. Arc Explorer and Shape Viewer are two free (as in beer) programs for viewing shape files.

The DBase file is a table of properties where, by convention, each row in the table contains the attributes of the item in the shape file with the same index. For example, the 10th shape in the shape file is presumed to have attributes given by the 10th row in the DBase file.
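Since the pairing between shapes and rows is purely positional, a cheap sanity check is to confirm that both files describe the same number of records. The sketch below uses only the Python standard library and rests on two assumptions not stated in this post: that the unpacked directory includes the shape index file (extension “shx”), and that the byte layouts follow the published DBase and shape file formats (record count stored in bytes 4 to 7 of the DBase header, and a 100-byte index header followed by one 8-byte entry per shape). The function names are made up for illustration.

```python
import os
import struct


def dbf_record_count(dbf_path):
    """Read the number of rows from a DBase file header."""
    with open(dbf_path, "rb") as fh:
        header = fh.read(8)
    # Bytes 4-7 hold the record count as a little-endian unsigned 32-bit int.
    return struct.unpack("<I", header[4:8])[0]


def shx_record_count(shx_path):
    """Infer the number of shapes from the size of the shape index file."""
    # A 100-byte header is followed by one 8-byte entry per shape.
    return (os.path.getsize(shx_path) - 100) // 8


def counts_match(base_path):
    """True if base_path + '.dbf' and base_path + '.shx' agree on the count."""
    return (dbf_record_count(base_path + ".dbf")
            == shx_record_count(base_path + ".shx"))
```

For example, counts_match("tl_2010_us_state10") run in the unpacked directory should return True. Matching counts do not prove the rows are in the right order, only that none are missing or extra.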

In order to create a cartogram with Scape Toad we will have to supply an appropriate DBase file. In this case our DBase file will contain two columns. The first will be the state’s two letter postal abbreviation and the second will be its population. Scape Toad will ignore the first column, but will allow us to create a cartogram using the data in the second column.

Note that DBase files can be opened with Excel for viewing, and that there is also a Python library for manipulating them.

At this point you should have the following software installed.

  • Python (to create DBase files) (optional if you got the dbf files from GitHub)
  • dbfpy (to create DBase files) (optional if you got the dbf files from GitHub)
  • Scape Toad (to view shape files and create cartograms)
  • Shape Viewer (to view shape files – slightly better UI than Scape Toad)
  • Excel  (to view DBase files) (optional)

Next we’ll create a DBase file that contains the U.S. population data using the following Python script.

#!/usr/bin/env python

from dbfpy import dbf

# State populations keyed by two letter postal abbreviation.
POP = {
    "CA" : 37691912, "TX" : 25145561, "NY" : 19465197, "FL" : 19057542,
    "IL" : 12869257, "PA" : 12742886, "OH" : 11544951, "MI" : 9876187,
    "GA" : 9815210, "NC" : 9656401, "NJ" : 8821155, "VA" : 8096604,
    "WA" : 6830038, "MA" : 6587536, "IN" : 6516922, "AZ" : 6482505,
    "TN" : 6403353, "MO" : 6010688, "MD" : 5828289, "WI" : 5711767,
    "MN" : 5344861, "CO" : 5116769, "AL" : 4802740, "SC" : 4679230,
    "LA" : 4574836, "KY" : 4369356, "OR" : 3871859, "OK" : 3791508,
    "PR" : 3706690, "CT" : 3580709, "IA" : 3062309, "MS" : 2978512,
    "AR" : 2937979, "KS" : 2871238, "UT" : 2817222, "NV" : 2723322,
    "NM" : 2082224, "WV" : 1855364, "NE" : 1842641, "ID" : 1584985,
    "HI" : 1374810, "ME" : 1328188, "NH" : 1318194, "RI" : 1051302,
    "MT" : 998199, "DE" : 907135, "SD" : 824082, "AK" : 722718,
    "ND" : 683932, "VT" : 626431, "DC" : 617996, "WY" : 568158,
    }

# The backup dbf. We'll need it because we need to
# preserve the state by state order of the rows
# in the new file.
olddb = dbf.Dbf("tl_2010_us_state10-orig.dbf")

# Our new DB file.
newdb = dbf.Dbf("tl_2010_us_state10.dbf", new=True)
newdb.addField(
    ("STATE", "C", 15),
    ("POPULATION", "N", 25, 0),
    )

for oldrec in olddb:
    # STUSPS10 is the key for the two letter state abbreviation
    # in the old file.
    abbrev = oldrec['STUSPS10']

    # Create a new record in our new db file
    # and assign the columns
    newrec = newdb.newRecord()
    newrec['STATE'] = abbrev

    if abbrev in POP:
        newrec['POPULATION'] = POP[abbrev]
    else:
        # Print a message if we cannot find the population
        # for a given record.
        print("BAD POP KEY: %s" % abbrev)
        newrec['POPULATION'] = 0

    newrec.store()

olddb.close()
newdb.close()

Run the script in the directory where the shape and DBase files are located. Before running it, however, rename the file “tl_2010_us_state10.dbf” to “tl_2010_us_state10-orig.dbf”. The script reads the renamed original to determine the order in which to write records, and it writes the new file under the original name, because the DBase file used with a particular shape file must have the same base name as the shape file itself. Edit the script to account for any differences in file names.

Alternatively, you can skip running the script and download the appropriate DBase file from my GitHub page.

At this point, if you have Excel you might also want to open both the original and new DBase files and see for yourself what is in them.

Now we can fire-up Scape Toad. When it comes up, click the “add layer” button in the tool bar. Navigate to the shape file and select it. If the shape file came in correctly you should see something like this on your screen.

Scape Toad Screenshot

Note that the DBF file you created must have the exact same base name as the shape file and that they must both be in the same directory. Otherwise we won’t be able to create a cartogram.
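If the cartogram step does not offer POPULATION as an option, a mismatched file name is a likely culprit. The following sketch checks the naming rule with the Python standard library; the function name has_matching_dbf is made up for illustration, and the check assumes a lowercase “dbf” extension.

```python
import os


def has_matching_dbf(shp_path):
    """True if a DBase file with the same base name sits next to the shape file."""
    base, _ext = os.path.splitext(shp_path)
    return os.path.isfile(base + ".dbf")


print(has_matching_dbf("tl_2010_us_state10.shp"))
```

Run it from the directory that holds the shape file; it prints True only when “tl_2010_us_state10.dbf” is in that same directory.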

Next click the “Create cartogram” icon in the toolbar. Click “next”, “next”, and then ensure that POPULATION is selected in the drop down menu. Click “next” again. And again. And then “compute”. Now wait…

After the computation is finished you should see a cartogram that looks something like this on your screen.

Scape Toad Cartogram Screenshot

Unfortunately Scape Toad has no zoom feature, so to get a close up look at the cartogram you’ll want to export it as a shape file and bring it up in Shape Viewer. Unfortunately there you will lose the legend and will be left with just the distorted shapes. C’est la vie.

If you have gotten this far then congratulations! You have succeeded in creating a simple cartogram that shows how the population of the United States is spread across its geography.

NFL Margin of Victory Graph Through Week 6

21 Friday Oct 2011

Posted by craig in Data Analysis

Tags

graphs, nfl

NFL Margin of Victory Graph Through Week 6

Arrows point from winners to losers. Numbers indicate margin of victory.
