Background

These data come from Analytical Graphics and a post on Slashdot. See http://adn.agi.com/SatelliteDatabase/SatelliteDatabase.kmz. The data are in KML format and are readily displayed via Google Earth. But we want to look at the information in this file in other ways. So first we have to read the information.
Extracting the data and EDA

    library(XML)

    # doc.kml is the KML document stored inside the KMZ (zip) archive
    doc = xmlParse("doc.kml")
    folders = getNodeSet(doc, "//x:Folder", "x")
    active = getNodeSet(doc, "//x:Folder[./x:name/text() = 'Active Satellites']", 'x')[[1]]
    ff = getNodeSet(active, "./x:Folder", 'x')

Let's find out at what altitude the satellites are orbiting the earth. We can do this by fetching the coordinates element of each satellite and taking the third comma-separated value.

    coords = xpathSApply(doc, "//x:coordinates", xmlValue, namespaces = 'x')
    alt = as.numeric(sapply(strsplit(coords, ","), `[`, 3))
    # These should be meters above sea-level.
    # Some are negative (just one, actually)
    summary(alt)

What about the age of these things

Let's get the information about each satellite, i.e. the HTML content within the description element.

    desc = xpathSApply(doc, "//x:Folder/x:Placemark/x:description", xmlValue, namespaces = 'x')

Now that we have the text of each description node, we write a function to parse the HTML content within it and to get the name-value pairs.

    getDescription =
    function(txt)
    {
        # txt is a string of HTML, not a file name
        top = htmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)
        # first child of the root, which holds the name-value markup
        b = xmlRoot(top)[[1]]
        els = xmlSApply(b, xmlValue)
        els = els[names(els) != "br"]
        i = which(names(els) == "text")
        # each text node is a value; the element just before it gives the field name
        structure(els[i], names = els[i-1])
    }

Now we can run this on the contents of each description node.

    info = lapply(desc, getDescription)

So what fields do these have in common?

    table(unlist(sapply(info, names)))

                      Active                   Apogee              Inclination
                       11701                    11701                    11701
    International Designator              Launch Date              Launch Site
                       11701                    11701                    11701
                        Mass                  Mission        Orbit Description
                        2214                     2913                        8
                       Owner                  Perigee                   Period
                       11701                    11701                    11701
            Satellite Number
                       11701

So it is Mass, Mission and "Orbit Description" that are not in all of them.

Who owns all of these?

    sort(table(sapply(info, `[`, 'Owner')), decreasing = TRUE)

Is CIS "Commonwealth of Independent States", i.e. the countries from the former Soviet Union?

How many are active?

    table(sapply(info, `[`, 'Active'))

    False  True
    10871   830

So only about 7% (830 of 11701).

What about the launch dates for these satellites?
    # We don't care about the time of day (at the moment)
    launch = strptime(sapply(info, `[`, 'Launch Date'), "%m/%d/%Y %T")

Now we can look at a histogram of these,

    hist(launch, "months", freq = TRUE)

What about the launch dates of the active satellites?

    # Note: this replaces the 'active' folder node defined earlier with a logical vector
    active = as.logical(gsub("^ ", "", toupper(sapply(info, `[`, 'Active'))))
    hist(launch[active], "months")

There are some quite old ones still working. We might superimpose these on the histogram of the entire launch date distribution. This doesn't work as is: the second call draws a new plot rather than adding to the first.

    hist(launch, "months")
    hist(launch[active], "months")

Is there a particular time of the day that the satellites are launched? Does it vary across launch sites?
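One way to start on these two questions is to pull the hour out of the parsed launch times and tabulate it by site. The following is only a sketch on invented data: the date strings, the site labels `SiteA` and `SiteB`, and the choice of UTC as the time zone are all assumptions, not values from the KML file.

```r
# Invented stand-ins for the 'Launch Date' and 'Launch Site' fields.
launchText <- c("10/04/1957 19:28:34", "02/01/1958 03:48:00",
                "04/12/1999 03:48:10", "07/23/1999 04:31:00",
                "11/20/1998 19:40:00")
site <- c("SiteA", "SiteB", "SiteA", "SiteB", "SiteA")

# Same format string as in the text; tz = "UTC" is an assumption.
launch <- strptime(launchText, "%m/%d/%Y %T", tz = "UTC")

# strptime() returns a POSIXlt object, so the hour is directly available.
hour <- launch$hour

# Cross-tabulate site by hour of day, keeping all 24 hours as columns.
byHour <- table(site, hour = factor(hour, levels = 0:23))

# A one-number summary per site.
medianHour <- tapply(hour, site, median)
```

With the real data, `launchText` and `site` would come from `sapply(info, `[`, 'Launch Date')` and `sapply(info, `[`, 'Launch Site')`.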
Creating a data frame

In this section, we'll construct a data frame from the HTML-based content in the description and from other elements for each satellite. We can either use the code above to collect the fields for each satellite and then unravel them into columns of our data frame, or we can pre-allocate the data frame and have a function that assigns each record to the corresponding row as it processes it. We'll start with the former since we have much of the code already in place.

We start with the info object and find all the fields which are in every record:

    fields = table(unlist(sapply(info, names)))
    varNames = names(fields)[fields == length(info)]

Next, we loop over each of these and extract the corresponding element of each record:

    x = lapply(varNames, function(id) sapply(info, `[`, id))
    names(x) = gsub(" ", ".", varNames)

We then convert each field to a suitable type, stripping the leading space and the units where necessary:

    d = data.frame(Active = as.logical(toupper(gsub("^ ", "", x[["Active"]]))),
                   Apogee = as.numeric(gsub(" km$", "", x[["Apogee"]])),
                   Inclination = as.numeric(gsub(" deg$", "", x[["Inclination"]])),
                   International.Designator = x[["International.Designator"]],
                   Launch.Date = strptime(x[["Launch.Date"]], "%m/%d/%Y %T"),
                   Launch.Site = factor(x[["Launch.Site"]]),
                   Owner = factor(x[["Owner"]]),
                   Perigee = as.numeric(gsub(" km$", "", x[["Perigee"]])),
                   Period = as.numeric(gsub(" minutes$", "", x[["Period"]])),
                   Satellite.Number = x[["Satellite.Number"]],
                   stringsAsFactors = FALSE
                  )

Now we check that we have what we expect:

    dim(d)
    names(d)
    sapply(d, class)

We also add the altitude:

    coords = xpathSApply(doc, "//x:coordinates", xmlValue, namespaces = 'x')
    d$alt = as.numeric(sapply(strsplit(coords, ","), `[`, 3))

The KML file has names for each satellite, in addition to the identifier (e.g. id = "24652"). We might be able to use these as row names, but unfortunately they are not unique.

    sat.name = xpathSApply(doc, "//x:Folder/x:Placemark/x:name", xmlValue, namespaces = 'x')

The styleUrl element also provides some information about the satellite.
So let's take a look at the possible values:

    style = xpathSApply(doc, "//x:Folder/x:Placemark/x:styleUrl", xmlValue, namespaces = 'x')
    table(style)

        #ActiveGEO     #ActiveHEO     #ActiveLEO     #DebrisGEO     #DebrisHEO
               353             62            415             46            502
        #DebrisLEO   #InactiveGEO   #InactiveHEO   #InactiveLEO #RocketBodyGEO
              5710            663            694           2297             27
    #RocketBodyHEO #RocketBodyLEO
               219            713

So we can add this as a variable in our data frame, removing the # at the beginning of each value:

    d$type = factor(gsub("#", "", style))
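The second strategy mentioned at the start of this section, pre-allocating the data frame and filling it one record at a time, is not shown above. A minimal sketch of how it might look follows. It assumes records shaped like the elements of `info` (named character vectors), uses invented values, covers only four of the fields, and introduces a hypothetical helper `convertRecord()`.

```r
# Invented records in the same shape as the elements of 'info'.
info <- list(
  c(Active = " True",  Apogee = "35800 km", Perigee = "35770 km", Owner = "US"),
  c(Active = " False", Apogee = "820 km",   Perigee = "790 km",   Owner = "CIS",
    Mass = "1200 kg"),
  c(Active = " False", Apogee = "1500 km",  Perigee = "600 km",   Owner = "PRC")
)

# Pre-allocate one row per record, with a column of the right type per field.
n <- length(info)
d <- data.frame(Active = logical(n), Apogee = numeric(n), Perigee = numeric(n),
                Owner = character(n), stringsAsFactors = FALSE)

# Hypothetical helper: convert one record to typed values, in column order.
# Fields that are not shared by every record (e.g. Mass) are ignored here.
convertRecord <- function(rec) {
  list(Active  = as.logical(toupper(trimws(rec[["Active"]]))),
       Apogee  = as.numeric(sub(" km$", "", rec[["Apogee"]])),
       Perigee = as.numeric(sub(" km$", "", rec[["Perigee"]])),
       Owner   = rec[["Owner"]])
}

# Assign each converted record to its row.
for (i in seq_len(n)) {
  d[i, ] <- convertRecord(info[[i]])
}

# Convert to a factor only once all the values are known.
d$Owner <- factor(d$Owner)
```

Filling row by row keeps memory use predictable but is usually slower in R than building whole columns at once, which is one reason to prefer the column-wise approach when the records are already in memory.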
Additional Exercises

Modify the Google Earth KML/KMZ file to include a time control so that users can interactively specify the start time for which to include satellites. Since all the satellites are p in the

Find out when the satellites became inactive. Get auxiliary data from other sources to find this out. Then add this information to the KML file in the time animation exercise above so that users can specify a window of interest and not just the time origin.
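As a possible starting point for both exercises, KML lets a Placemark carry a TimeSpan element with a begin and an optional end time. The sketch below only builds that fragment as a string; it does not modify the file. The helper name `makeTimeSpan()` is hypothetical, the dates are invented, and treating the times as UTC is an assumption.

```r
# Hypothetical helper: build a KML <TimeSpan> fragment from a start time and
# an optional end time, both given as POSIXct values.
makeTimeSpan <- function(begin, end = NULL) {
  # KML accepts dateTime values of the form yyyy-mm-ddThh:mm:ssZ (UTC).
  fmt <- function(t) format(t, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
  parts <- sprintf("<begin>%s</begin>", fmt(begin))
  if (!is.null(end))
    parts <- c(parts, sprintf("<end>%s</end>", fmt(end)))
  paste0("<TimeSpan>", paste(parts, collapse = ""), "</TimeSpan>")
}

launched <- as.POSIXct("1999-04-12 03:48:10", tz = "UTC")  # invented launch time
retired  <- as.POSIXct("2005-06-30 00:00:00", tz = "UTC")  # invented end time

openEnded <- makeTimeSpan(launched)           # first exercise: start time only
windowed  <- makeTimeSpan(launched, retired)  # second exercise: start and end
```

The resulting string would still have to be inserted into each Placemark, for example by parsing it and adding it as a child node with the XML package.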