The Concept

We are going to use the data from GovTrack to find which members of Congress co-sponsored bills with each other. The goal is to draw a graph in which the nodes are the members of Congress and the edges represent either the individual bills or an aggregate of the co-sponsored bills. We can draw this in SVG and make it interactive. It is then interesting to look at the voting records of the different members and see who is aligned and who is not. We might even allow the viewer to select a bill by clicking on an edge, or to click on an edge that stands for all the shared bills and get a popup from which to select the one of interest. We would then highlight the members who voted for and against that bill, using color to indicate the category.
Getting the Data See
Reading the Data

This is mostly exploratory, so we'll discard it later if it turns out to be irrelevant. We start by looking at the bills.index.xml file.

    library(XML)
    doc = xmlParse("~/Data/GovTrack/bills.index.xml")

We look at the attributes of each bill.

    table(unlist(xmlApply(xmlRoot(doc), function(x) names(xmlAttrs(x)))))

       last-action         number official-title         status          title
               531            531            531            531            531
              type
               531

This tells us that all the elements have the same attributes. Let's get the last-action and status of each bill.

    bill = as.data.frame(t(xmlSApply(xmlRoot(doc), function(x) xmlAttrs(x)[c("last-action", "status")])))
    bill$"last-action" = as.POSIXct(strptime(bill$"last-action", "%Y-%m-%d"))

Let's move to the directory of bills.

    bills = list.files("~/Data/GovTrack/bills/", full.names = TRUE)
    length(bills)

For each of these, let's get the sponsor and co-sponsor ids. We'll resolve these later!

    getBillSponsors = function(x) {
        # Accept either a file name or an already parsed document.
        if(is.character(x))
            dd = xmlParse(x)
        else
            dd = x
        unlist(getNodeSet(dd, "/*/sponsor/@id|/*/cosponsors/cosponsor/@id"))
    }

Now we can apply this to each bill.

    bb = lapply(bills, getBillSponsors)
    names(bb) = bills

We end up with a collection of character vectors. (This takes about 15 seconds on my MacBook Pro.) Let's look at the distribution of the number of co-sponsors.

    n = sapply(bb, length)
    hist(n - 1)

Which bill has more than 400 sponsors (i.e. sponsor and co-sponsors)?

    xmlValue(getNodeSet(xmlParse(names(bb)[which.max(n)]),
                        "//title[@type = 'official' and @as='introduced']")[[1]])

    sapply(names(bb)[ n > 300 ],
           function(x)
               xmlValue(getNodeSet(xmlParse(x), "//title[@type = 'official' and @as='introduced']")[[1]]))

Let's resolve these sponsor/co-sponsor ids.

    people = xmlParse("~/Data/GovTrack/people.xml")
    info = t(xmlSApply(xmlRoot(people), function(x) {
        xmlAttrs(x)[c("id", "lastname", "firstname", "gender", "state")]
    }))

Now let's check that all the sponsors/co-sponsors are in the people dataset:

    all(unlist(bb) %in% info[,"id"])

Let's create a graph. We'll look only at the Senate to make things smaller.
    senate = bb[ grep("bills//s.*xml", names(bb), value = TRUE) ]
    senators = unique(unlist(senate))

(Alternatively, we could look at the type attribute on each bill, i.e. xmlGetAttr(xmlRoot(bill), "type").) We can build a graphNEL (nodes and edges) object or, alternatively, an adjacency matrix. The NEL is done something like...
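The edges list used in the next section can be built from senate. What follows is a minimal sketch under stated assumptions, not the original code: it assumes that the first id in each bill's vector is the main sponsor (the XPath in getBillSponsors() returns nodes in document order), and that edges should hold, for each senator, the ids of the co-sponsors of the bills that senator sponsored, named by bill. The helper buildEdges and the toy data are hypothetical.

```r
# Hypothetical helper: for each id, collect the co-sponsors of the bills
# that id sponsored, naming each entry with the bill it came from.
buildEdges = function(bills, ids) {
    # ASSUMPTION: the first id of each bill is the main sponsor.
    sponsorOf = vapply(bills, function(b) unname(b[1]), character(1))
    edges = lapply(ids, function(id) {
        mine = bills[which(sponsorOf == id)]
        co = lapply(mine, function(b) unname(b[-1]))
        ans = as.character(unlist(co, use.names = FALSE))
        names(ans) = rep(names(mine), lengths(co))
        ans
    })
    names(edges) = ids
    edges
}

# Toy stand-in for senate: bill file name -> sponsor followed by co-sponsors.
toySenate = list("s1.xml" = c("A", "B", "C"),
                 "s2.xml" = c("A", "B"),
                 "s3.xml" = c("B", "A"))
toySenators = unique(unlist(toySenate))
toyEdges = buildEdges(toySenate, toySenators)
```

With the real data this would be called as edges = buildEdges(senate, senators). Note that a bill with no co-sponsors contributes no entries, so it is not counted when tallying bills by the names of each element.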
NEL graphs

Let's check the number of bills each of the senators sponsored, i.e. as the main sponsor.

    sapply(edges, function(x) length(unique(names(x))))

Now we have to reorganize the edges in the following way. For each node we need a list with an edges component. We also want a weights component. The edges should have only one entry for each node. The weights should give the number of bills behind each of those entries.

    xedges = lapply(edges, function(x) {
        tb = table(x)                    # number of bills shared with each co-sponsor
        i = match(names(tb), senators)   # position of each co-sponsor in the node vector
        list(edges = structure(i, names = senators[i]), weights = tb)
    })

    library(graph)
    # The edges run from sponsor to co-sponsor, so the graph is directed.
    g = new("graphNEL", nodes = senators, edgeL = xedges, edgemode = "directed")
The adjacency matrix is constructed as

Let's check that this makes sense by comparing the total of the entries with the number of cells.

    sum(coSponsoredWith)
    prod(dim(coSponsoredWith))

    library(graph)
    g = new("graphAM", adjMat = coSponsoredWith, edgemode = "directed")

Now we can lay out the graph; this will take a very long time.

    library(Rgraphviz)
    ll = layoutGraph(g)
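The code that fills coSponsoredWith is not shown above. One possible construction is sketched here, under the same assumption as before that the first id in each bill's vector is the main sponsor: cell [i, j] counts the bills sponsored by i and co-sponsored by j, which matches the directed edge mode passed to graphAM. The helper buildAdjacency and the toy data are hypothetical, and the original matrix may have been defined differently.

```r
# Hypothetical helper: square count matrix with one row and column per id.
buildAdjacency = function(bills, ids) {
    m = matrix(0L, length(ids), length(ids), dimnames = list(ids, ids))
    for (b in bills) {
        sponsor = unname(b[1])          # ASSUMPTION: first id is the sponsor
        co = setdiff(b[-1], sponsor)    # distinct co-sponsors, sponsor excluded
        if (length(co) == 0)
            next
        m[sponsor, co] = m[sponsor, co] + 1L
    }
    m
}

# Toy stand-in for senate: bill file name -> sponsor followed by co-sponsors.
toySenate = list("s1.xml" = c("A", "B", "C"),
                 "s2.xml" = c("A", "B"),
                 "s3.xml" = c("B", "A"))
toySenators = unique(unlist(toySenate))
toyCoSponsoredWith = buildAdjacency(toySenate, toySenators)
```

With the real data this would be coSponsoredWith = buildAdjacency(senate, senators). The row and column names are the senator ids, so the matrix can be indexed by id when inspecting individual pairs.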