The next Cologne R user group meeting is scheduled for this Friday, 6 March 2015, and we have an exciting agenda with two talks, followed by networking drinks:
MS Office and Excel are the 'de-facto' standards in many industries. Using R with Excel offers an opportunity to combine the statistical power of R with a familiar user interface. R.net offers a user-friendly interface to Excel; R functions work just like Excel functions and are basically hidden away.
In addition to the analysis of numerical data, R is increasingly attractive for processing text as well. Cornelius will give a very brief overview of common text mining techniques and their corresponding R implementations, with a focus on useful applications in the social sciences. Techniques will include corpus creation and management (package tm), latent semantic analysis (package lsa), and topic models (package topicmodels), as well as sentiment analysis (experimental package syuzhet). Simple but useful routines, such as automatically inferring the language of a text (package textcat) or the gender of a first name (package genderize), will also be briefly pointed out.
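As a small taster (a sketch of mine, not from Cornelius's slides), the tm and textcat packages can be used like this; the example sentences are made up:

library(tm)      # corpus creation and management
library(textcat) # language identification

texts <- c("R macht wirklich Spass.", "Text mining in R is straightforward.")

docs <- VCorpus(VectorSource(texts))               # build a small corpus
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
dtm <- DocumentTermMatrix(docs)                    # term frequencies for further analysis
inspect(dtm)

textcat(texts)                                     # guess the language of each text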
Drinks and Networking
The event will be followed by drinks and schnitzel at the Lux.
For further details visit our KölnRUG Meetup site. Please sign up if you would like to come along. Notes from past meetings are available here.
Given all the attention the internet has given to the colors of this dress, I thought it would be interesting to look at the capabilities for extracting colors in R.
R has a number of packages for importing images in various file formats, including PNG, JPG, TIFF, and BMP. (The readbitmap package works with all of these.) In each case, the result is a 3-dimensional R array containing a 2-D image layer for each of the color channels (for example red, green and blue for color images). You can then manipulate the array as ordinary data to extract color information. For example, Derek Jones used the readPNG function to extract data from published heatmaps when the source data has been lost.
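For instance, a quick sketch using the R logo shipped with the png package (readbitmap's read.bitmap() works the same way for the other formats):

library(png)

img <- readPNG(system.file("img", "Rlogo.png", package = "png"))
dim(img)     # height x width x channels (red, green, blue, alpha)
img[1, 1, ]  # channel values of the top-left pixel, scaled to [0, 1]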
Photographs typically contain thousands or even millions of unique colors, but a very human question is: what are the major colors in the image? In other words, what is the image's palette? This is a difficult question to answer, but Russell Dinnage used R's k-means clustering capabilities to extract the 3 (or 4 or 6 — you decide) most prominent colors from an image, without including almost-identical shades of the same color and filtering out low-saturation background colors (like gray shadows). Without any supervision, his script can easily extract 6 colors from the tail of this beautiful peacock spider. In fact, his script generates five representative palettes:
So I guess it was black and blue after all! (Plus a heavy dose of white in the background)
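A minimal sketch of the underlying idea (not Russell's actual script, which also filters out low-saturation colors and near-identical shades) is to run k-means over the pixel colours and keep the cluster centres as the palette:

library(png)

img <- readPNG("peacock_spider.png")             # hypothetical image file
pixels <- data.frame(r = as.vector(img[, , 1]),
                     g = as.vector(img[, , 2]),
                     b = as.vector(img[, , 3]))
km <- kmeans(pixels, centers = 6)
palette <- rgb(km$centers[, 1], km$centers[, 2], km$centers[, 3])
barplot(rep(1, 6), col = palette, axes = FALSE)  # eyeball the extracted palette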
Christophe Cariou used a similar palette-extraction process in R and applied it to every cover of Wired magazine since 1993. For each cover he extracted the 4 major colors, and then represented them all on this beautiful scatter diagram arranged on the color wheel:
Hello world. It’s been a long time since I posted anything here on my blog. I’ve been busy getting my Master’s degree in statistical computing and I haven’t had much free time to blog. But I’ve been writing R code as much as ever. Now, with graduation approaching, I’m job hunting and I thought it would be good to put together a few things to show potential employers. In the spirit of open source, I want to share these “portfolio” scripts with anyone that’s interested.
As I’m sure many statisticians do, I keep a folder of “stock code”, or template scripts that do different things. This folder is always growing and the scripts are always improving, but there are a few in there that I think are worth sharing. Some of these are templates for common analyses, while others are just reminders of how to use a couple of commands to accomplish a practical task.
This post is of the latter type. I’m going to discuss fetching data from a URL.
Why might one need to fetch data from a URL?
You want to share your code with someone who isn’t familiar with R and you want to avoid the inevitable explanation of how to change the file path at the beginning of the file. (“Make sure you only use forward slashes!”)
The data at the URL is constantly changing and you want your analysis to use the latest each time you run it.
You want the code to just work when it’s run from another machine with another directory tree.
You want to post a completely repeatable analysis on your blog and you don’t want it to begin with “go to www.blahblahblah.com, download this data, and load it into R”.
Whatever your reason may be, it’s a neat trick, but it’s not one I use so often that I can just rattle off the code for it from memory. So here’s my template. I hope it can help someone else.
Caveat!!!
This is only for data that is in tabular form already. This is not for web scraping (e.g. extracting a table of data from a Wikipedia page); there are entire packages devoted to that. This is for the simplest of all cases, where there is a .csv file or a .txt file (or similar) at a URL and you want to read it into R directly from that URL without the intermediate step of saving it somewhere on your computer.
Using data.table’s fread()
I love the data.table package. I use it every day, for almost every project I do. It’s an extension of the data.frame object class in R that makes many improvements. One of those improvements is in the function fread(). It’s data.table’s answer to base R’s read.csv(). It does many things better, but here I’m only going to address its ability to read data right from the web. As a primer, its typical use on a data file residing on your computer would look something like this:
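Something like the following (the path is hypothetical):

library(data.table)

mydata <- fread("C:/path/to/mydata.csv")
head(mydata)

Reading a file hosted on the web looks exactly the same: you just hand fread() the URL. For the Oxford data file discussed below:

mydata <- fread("http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat")
head(mydata)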
Now if you actually navigate to that link in your browser, you won’t see anything, but a download dialog should pop up. If you navigate to the parent directory of that address, http://www.stats.ox.ac.uk/pub/datasets/csb, you will see some text, and further down the page you will see several links to data files. Each of these links launches a download dialog when clicked. To grab the URL of the data file to pass to fread(), right click the link and select “Copy link address”. Other data files online might appear in the browser instead of launching a download dialog, like this one a professor of mine had us use for an assignment. fread() handles these URLs just the same.
fread() makes smart decisions about how to read the data in (it detects column names and classes and so on), but the command also has several arguments for specifying such things explicitly, which you can use at your own discretion. I find fread('filename') almost always just works, but sometimes there are reasons to be more explicit when reading data in.
Using RStudio
If you’re not familiar with RStudio, you are a true R novice. If you know what it is, but don’t use it, skip ahead.
In RStudio, you can click “Tools” -> “Import Dataset” -> “From Web URL” and a dialog will pop up asking you for a URL. Paste a URL into the dialog box (let’s just use the same one as before: http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat) and click “OK”. A nice little window pops up and allows you to specify how the data should be read and what name the object should be given in R. When you click “Import”, the data is read in and some code appears in the console. What this interface does is download the data to a temporary file in a temporary folder and then read it in. The downloaded data file persists on your hard drive as long as your R session lasts, but disappears as soon as your R session ends.
This is handy, but if you wanted to repeat the process, you would have to click through the menu again and supply the data URL again. This isn’t exactly “repeatable” in the Stack Overflow sense of the word.
Using RCurl’s getURL()
The RCurl package provides bindings to the cURL library. This is a C library for web connections. The cURL library does way more than we need for this task and frankly, I don’t understand a lot of it. I saved RCurl for last because I usually try fread() first, and then if I get some sort of error, I resort to RCurl. Take for example the data set at this link: https://sakai.unc.edu/access/content/group/3d1eb92e-7848-4f55-90c3-7c72a54e7e43/public/data/bycatch.csv (also posted by a professor for an assignment of mine). If you try to fread() it, no dice. I have no idea what that error message means, but here’s how to get that data set in anyway.
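Roughly, the RCurl route looks like this:

library(RCurl)

myfile <- getURL("https://sakai.unc.edu/access/content/group/3d1eb92e-7848-4f55-90c3-7c72a54e7e43/public/data/bycatch.csv",
                 ssl.verifyhost = FALSE, ssl.verifypeer = FALSE)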
What are the arguments ssl.verifyhost=F and ssl.verifypeer=F doing? To be quite honest, I don’t really know. But if I’m having trouble reading from a URL, I try specifying these arguments, and changing one or both to FALSE almost always circumvents whatever error I’m getting.
This grabs the content residing at the specified URL, but doesn’t return a data.frame object. It has simply put the URL’s content into a string.
class(myfile)
[1] "character"
So how to get this into a data.frame object? We’ll use textConnection() to open a “connection” with the string, much like you would open a connection with a file on your hard drive in order to read it. Then we’ll have read.csv() (or you could use read.table() or fread() or similar) to read the string object like a text file and create a data.frame object.
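Something along these lines:

mydat <- read.csv(textConnection(myfile), header = TRUE)
head(mydat)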
   Season  Area Gear.Type  Time Tows Bycatch
1 1989-90 North    Bottom   Day   48       0
2 1989-90 North    Bottom Night    6       0
3 1989-90 North Mid-Water Night    1       0
4 1989-90 South    Bottom   Day  139       0
5 1989-90 South Mid-Water   Day    6       0
6 1989-90 South    Bottom Night    6       0
And there you have it. The data from the URL is now in a data.frame and ready to go.
Aside: read.csv() is just a version of read.table() with argument defaults such as sep = "," that make sense for reading .csv files.
A Use Case
Let’s pretend I want to automate something having to do with weather in Chicago. Maybe it’s a knitr document that I have scheduled to re-knit every night on my server. Every time the script re-runs, it should somehow take into account recent weather in Chicago. Weather Underground offers historic (and an hour ago counts as “historic”) hourly weather data for many different locations. Many of these locations are airports, which, for obvious reasons, have several meteorological sensors on site. On the Weather Underground page you can select a location and a date and see hourly weather for that calendar day. At the bottom of the page, you can click “Comma Delimited File” to see the data in comma delimited format - perfect for reading into R.
I see that the four letter airport code for Chicago is “KMDW” and after clicking through a few of these URLs, I see the stuff after “DailyHistory.html” doesn’t change. So if I know the date, I can construct the URL where the hourly Chicago airport weather for that date can be found in .csv format.
First, I define the beginning and end of the URL, which never change.
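Something like this (the exact Weather Underground URL structure is an assumption here; format=1 is what requests the comma-delimited view):

url_start <- "http://www.wunderground.com/history/airport/KMDW/"
url_end <- "/DailyHistory.html?format=1"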
There is opportunity here to generalize this for many locations if one simply maps the four letter codes to other locations of interest using switch() or similar.
Then I ask the system for today's date and from it produce a string having format year/month/day.
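For example:

datestring <- format(Sys.Date(), "%Y/%m/%d")   # e.g. "2015/03/17"
url2fetch <- paste(url_start, datestring, url_end, sep = "")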
Finally I grab the content of the webpage at that URL using the RCurl method described above. I choose getURL() instead of fread() for good reason; I’ll need to do some find-and-replace to clean up some html artifacts in the data and that is more efficient to do on one big string rather than on a bunch of individual values in a data.frame.
url_content<-getURL(url2fetch)
Now I have the content of the page in a string and I want to read that string into a data.frame object, but every line of the data ends with an html newline (“<br />”) and a text newline (“\n”). read.csv() will recognize the “\n” as a signal to start a new row of the data.frame, but the “<br />” isn’t recognized and will be appended to the value in the last column of every row. So let’s take care of this before read.csv() ever gets involved. I’ll do a simple find-and-replace where I find “<br />” and replace it with an empty string (""), aka nothing. This is the regex way of find-and-delete.
url_content<-gsub('<br />', '', url_content)
Finally I can “read” the data into a data.frame object with the help of read.csv() and textConnection().
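For example (weather_data is just my name for the result):

weather_data <- read.csv(textConnection(url_content))
head(weather_data)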
In a previous post, I visualized crime density in Boston using R’s ggmap package. In this post, I use ggmap to visualize vehicle accidents in New York City. R code and data are included in this post.
The data comes from NYC Open Data. My data cut ranges from 2012 to 2015. The data tracks the type of vehicle, the name of the street on which the accident occurs, as well as the longitude and latitude coordinates of the accident. Both coordinates are saved as a single character variable called “LOCATION”.
Below, I load the ggmap and gridExtra packages, load the data, drop all accidents without location coordinates, and parse the LOCATION variable to get the longitude and latitude coordinates. I also parse the date variable to create a year variable and use that variable to create two data sets: one with all vehicle accidents in 2013 and another with all vehicle accidents in 2014.
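A sketch of those steps (the file name and the LOCATION/DATE layouts are assumptions about the NYC Open Data export):

library(ggmap)
library(gridExtra)

accidents <- read.csv("NYPD_Motor_Vehicle_Collisions.csv", stringsAsFactors = FALSE)
accidents <- accidents[accidents$LOCATION != "", ]             # drop rows without coordinates

coords <- gsub("[()]", "", accidents$LOCATION)                 # "(40.71, -74.00)" -> "40.71, -74.00"
accidents$lat <- as.numeric(sapply(strsplit(coords, ","), `[`, 1))
accidents$lon <- as.numeric(sapply(strsplit(coords, ","), `[`, 2))

accidents$year <- as.numeric(substr(accidents$DATE, 7, 10))    # assumes MM/DD/YYYY dates
acc2013 <- accidents[accidents$year == 2013, ]
acc2014 <- accidents[accidents$year == 2014, ]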
Next, I use get_map() to query Google Maps and get a map of NYC. I add a two-dimensional density layer to this map using stat_density2d(). I do this for both 2013 and 2014 data and use gridExtra’s grid.arrange() to place the maps side by side.
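A sketch of that step (the zoom level and aesthetics are my choices, not necessarily the originals):

nyc <- get_map("New York City", zoom = 11)   # queries Google Maps

p2013 <- ggmap(nyc) +
  stat_density2d(data = acc2013, aes(x = lon, y = lat, fill = ..level.., alpha = ..level..),
                 geom = "polygon") +
  ggtitle("2013")

p2014 <- ggmap(nyc) +
  stat_density2d(data = acc2014, aes(x = lon, y = lat, fill = ..level.., alpha = ..level..),
                 geom = "polygon") +
  ggtitle("2014")

grid.arrange(p2013, p2014, ncol = 2)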
Next, I plot 2013 accident densities by borough. I write a function boro() that drops all observations with a missing street name and subsets the 2013 data based on borough. I stack them together while adding borough names, then assign each borough a color in vector col_vals. I use stat_density2d’s group parameter to plot each borough’s density layers separately with different colors.
I noticed that the @rOpenSci folks had an interface to ip-api.com on their ToDo list so I whipped up a small R package to fill said gap.
Their IP Geolocation API will take an IPv4 or IPv6 address or an FQDN and kick back an ASN, lat/lon, address and more. The ipapi package exposes one function – geolocate() – which takes in a character vector of any mixture of IPv4/6 addresses and domains and returns a data.table of results. Since ip-api.com has a restriction of 250 requests per minute, the package also tries to help ensure you don’t get your own IP address banned (there’s a form on their site you can fill in to get it unbanned if you do happen to hit the limit). Overall, there’s nothing fancy in the package, but it gets the job done.
I notified the rOpenSci folks about it, so hopefully it’ll be one less thing on that particular to-do list.
You can see it in action in combination with the super-spiffy leaflet htmlwidget:
library(leaflet)
library(ipapi)
library(maps)

# get top 500 domains
sites <- read.csv("http://moz.com/top500/domains/csv", stringsAsFactors = FALSE)

# make reproducible
set.seed(1492)

# pick out a random 50
sites <- sample(sites$URL, 50)
sites <- gsub("/", "", sites)

locations <- geolocate(sites)

# take a quick look
dplyr::glimpse(locations)

## Observations: 50
## Variables:
## $ as          (fctr) AS2635 Automattic, Inc, AS15169 Google Inc., AS3561...
## $ city        (fctr) San Francisco, Mountain View, Chesterfield, Mountai...
## $ country     (fctr) United States, United States, United States, United...
## $ countryCode (fctr) US, US, US, US, US, US, JP, US, US, IT, US, US, US,...
## $ isp         (fctr) Automattic, Google, Savvis, Google, Level 3 Communi...
## $ lat         (dbl) 37.7484, 37.4192, 38.6631, 37.4192, 38.0000, 33.7516...
## $ lon         (dbl) -122.4156, -122.0574, -90.5771, -122.0574, -97.0000,...
## $ org         (fctr) Automattic, Google, Savvis, Google, AddThis, Peer 1...
## $ query       (fctr) 192.0.80.242, 74.125.227.239, 206.132.6.134, 74.125...
## $ region      (fctr) CA, CA, MO, CA, , GA, 13, MA, TX, , MA, TX, CA, , ,...
## $ regionName  (fctr) California, California, Missouri, California, , Geo...
## $ status      (fctr) success, success, success, success, success, succes...
## $ timezone    (fctr) America/Los_Angeles, America/Los_Angeles, America/C...
## $ zip         (fctr) 94110, 94043, 63017, 94043, , 30303, , 02142, 78218...

# all i want is the world!
world <- map("world", fill = TRUE, plot = FALSE)

# kick out a widget
leaflet(data = world) %>%
  addTiles() %>%
  addCircleMarkers(locations$lon, locations$lat,
                   color = '#ff0000', popup = sites)
50 Random Top Sites
I’ve been seeing an uptick in static US “lower 48” maps with “meh” projections this year, possibly caused by a flood of new folks resolving to learn R but using pretty old documentation or tutorials. I’ve also been seeing an uptick in folks needing to geocode US city/state to lat/lon. I thought I’d tackle both in a quick post to show how to (simply) use a decent projection for lower 48 US maps and then how to use a very basic package I wrote – localgeo – to avoid having to use an external API/service for basic city/state geocoding.
Albers All The Way
I could just plot an Albers projected map, but it’s more fun to add some data. We’ll start with some setup libraries and then read in some recent earthquake data, then filter it for our map display:
library(ggplot2)
library(dplyr)
library(readr) # devtools::install_github("hadley/readr")

# Earthquakes ---------------------------------------------------------------

# get quake data ------------------------------------------------------------
quakes <- read_csv("http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_month.csv")

# filter all but lower 48 US ------------------------------------------------
quakes %>%
  filter(latitude >= 24.396308, latitude <= 49.384358,
         longitude >= -124.848974, longitude <= -66.885444) -> quakes

# bin by .5 ------------------------------------------------------------------
quakes$Magnitude <- as.numeric(as.character(cut(quakes$mag,
                                                breaks = c(2.5, 3, 3.5, 4, 4.5, 5),
                                                labels = c(2.5, 3, 3.5, 4, 4.5),
                                                include.lowest = TRUE)))
Many of my mapping posts use quite a few R geo libraries, but this one just needs ggplot2. We extract the US map data, turn it into something ggplot can work with, then plot our quakes on the map:
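A sketch of those steps (the Albers parallels 29.5 and 45.5 are the usual choice for the lower 48; map_data() needs the maps package installed):

us <- map_data("state")

gg <- ggplot()
gg <- gg + geom_map(data = us, map = us,
                    aes(x = long, y = lat, map_id = region),
                    color = "#7f7f7f", fill = "white", size = 0.25)
gg <- gg + geom_point(data = quakes,
                      aes(x = longitude, y = latitude, size = Magnitude),
                      color = "#cb181d", alpha = 1/3)
gg <- gg + coord_map("albers", lat0 = 29.5, lat1 = 45.5)
gg <- gg + theme_bw()
gg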
There are many APIs with corresponding R packages/functions to perform geocoding (one really spiffy recent one is geocodeHERE). While Nokia’s service is less restrictive than Google’s, most of these sites are going to have some kind of restriction on the number of calls per second/minute/day. You could always install the Data Science Toolkit locally (note: it was down as of the original posting of this blog) and perform the geocoding locally, but it does take some effort (and space/memory) to setup and get going.
If you have relatively clean data and only need city/state resolution, you can use a package I made – localgeo as an alternative. I took a US Gov census shapefile and extracted city, state, lat, lon into a data frame and put a lightweight function shim over it (it’s doing nothing more than dplyr::left_join). It won’t handle nuances like “St. Paul, MN” == “Saint Paul, MN” and, for now, it requires you to do the city/state splitting, but I’ll be tweaking it over the year to be a bit more forgiving.
We can give this a go and map the greenest cities in the US in 2014 as crowned by, er, Nerd Wallet. I went for “small data file with city/state in it”, so if you know of a better source I’ll gladly use it instead. Nerd Wallet used DataWrapper, so getting the actual data was easy and here’s a small example of how to get the file, perform the local geocoding and use an Albers projection for plotting the points. The code below assumes you’re still in the R session that used some of the library calls earlier in the post.
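A rough sketch of what that can look like (the GitHub repo location, the CSV file name, the column names, and the geo_code() function and its arguments are all assumptions on my part; see the linked code for the real version):

library(devtools)
# install_github("hrbrmstr/localgeo")   # assumed repo location
library(localgeo)

green <- read.csv("greenest_cities.csv", stringsAsFactors = FALSE)  # assumed local copy of the Nerd Wallet data
parts <- strsplit(green$City, ", ")                                 # split "City, ST"
green$city  <- sapply(parts, `[`, 1)
green$state <- sapply(parts, `[`, 2)
green_geo <- geo_code(green$city, green$state)                      # hypothetical localgeo call

gg <- ggplot()
gg <- gg + geom_map(data = us, map = us,
                    aes(x = long, y = lat, map_id = region),
                    color = "#7f7f7f", fill = "white", size = 0.25)
gg <- gg + geom_point(data = green_geo, aes(x = lon, y = lat),
                      color = "#2ca25f", size = 2)
gg <- gg + coord_map("albers", lat0 = 29.5, lat1 = 45.5)
gg <- gg + theme_bw()
gg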
Let me reinforce that the localgeo package will most assuredly fail to geocode some city/state combinations. I’m looking for a more comprehensive shapefile to ensure I have the complete list of cities and I’ll be adding some code to help make the lookups more forgiving. It may at least help when you bump into an API limit and need to crank out something in a hurry.
Machine learning is a branch of computer science that studies the design of algorithms that can learn. Typical machine learning tasks are concept learning, function learning or “predictive modeling”, clustering and finding predictive patterns. These tasks are learned from available data that was observed through experience or instruction, for example. The hope is that incorporating this experience into its tasks will eventually improve the learning. The ultimate goal is to improve the learning in such a way that it becomes automatic, so that humans no longer need to intervene.
Machine learning has close ties with Knowledge Discovery, Data Mining, Artificial Intelligence and Statistics. Typical applications of machine learning can be classified into scientific knowledge discovery and more commercial applications, ranging from the “Robot Scientist” to anti-spam filtering and recommender systems.
This small tutorial is meant to introduce you to the basics of machine learning in R: it will show you how to use R to work with the well-known machine learning algorithm called “KNN” or k-nearest neighbors.
Using R For k-Nearest Neighbors (KNN)
The KNN or k-nearest neighbors algorithm is one of the simplest machine learning algorithms and is an example of instance-based learning, where new data are classified based on stored, labeled instances. More specifically, the distance between the stored data and the new instance is calculated by means of some kind of similarity measure. This similarity measure is typically expressed by a distance measure such as the Euclidean distance, cosine similarity or the Manhattan distance. In other words, the similarity to the data that was already in the system is calculated for any new data point that you input into the system. Then, you use this similarity value to perform predictive modeling. Predictive modeling is either classification, assigning a label or a class to the new instance, or regression, assigning a value to the new instance. Whether you classify or assign a value to the new instance depends of course on how you compose your model with KNN.
The k-nearest neighbor algorithm adds to this basic algorithm that after the distance of the new point to all stored data points has been calculated, the distance values are sorted and the k-nearest neighbors are determined. The labels of these neighbors are gathered and a majority vote or weighted vote is used for classification or regression purposes. In other words, the higher the score for a certain data point that was already stored, the more likely that the new instance will receive the same classification as that of the neighbor. In the case of regression, the value that will be assigned to the new data point is the mean of its k nearest neighbors.
Step One. Get Your Data
Machine learning typically starts from observed data. You can take your own data set or browse through other sources to find one.
Built-in Datasets of R
This tutorial makes use of the Iris data set, which is well-known in the area of machine learning. This dataset is built into R, so you can take a look at this dataset by typing the following into your console:
iris
UC Irvine Machine Learning Repository
If you want to download the data set instead of using the one that is built into R, you can go to the UC Irvine Machine Learning Repository and look up the Iris data set.
Tip: not only check out the data folder of the Iris data set, but also take a look at the data description page!
Then, load in the data set with the following command:
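For example:

iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"),
                 header = FALSE)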
The command reads the .csv or “Comma Separated Value” file from the website. The header argument has been set to FALSE, which means that the Iris data set from this source does not give you the attribute names of the data.
Instead of the attribute names, you might see strange column names such as “V1” or “V2”. Those are assigned automatically. To simplify working with the data set, it is a good idea to set the column names yourself: you can do this through the function names(), which gets or sets the names of an object. Concatenate the names of the attributes as you would like them to appear. For the Iris data set, you can use the following R command:
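Along these lines:

names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")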
Now that you have loaded the Iris data set into RStudio, you should try to get a thorough understanding of what your data is about. Just looking or reading about your data is certainly not enough to get started!
Initial Overview Of The Data Set
First, you can already try to get an idea of your data by making some graphs, such as histograms or boxplots. In this case, however, scatter plots can give you a great idea of what you're dealing with: it can be interesting to see how much one variable is affected by another. In other words, you want to see if there is any correlation between two variables.
You can make scatterplots with the ggvis package, for example. You first need to load the ggvis package:
library(ggvis)
iris %>% ggvis(~Sepal.Length, ~Sepal.Width, fill = ~Species) %>% layer_points()
You see that there is a high correlation between the sepal length and the sepal width of the Setosa iris flowers, while the correlation is somewhat less high for the Virginica and Versicolor flowers.
The scatter plot that maps the petal length and the petal width tells a similar story:
iris %>% ggvis(~Petal.Length, ~Petal.Width, fill = ~Species) %>% layer_points()
You see that this graph indicates a positive correlation between the petal length and the petal width for all different species that are included into the Iris data set.
After a general visualized overview of the data, you can also view the data set by entering
iris
However, as you will see from the result of this command, this really isn't the best way to inspect your data set thoroughly: the data set takes up a lot of space in the console, which will impede you from forming a clear idea about your data. It is therefore a better idea to inspect the data set by executing
head(iris)
or
str(iris)
Tip: try both of these commands out to see the difference!
Note that the last command will help you to clearly distinguish the data type num and the three levels of the Species attribute, which is a factor. This is very convenient, since many R machine learning classifiers require that the target feature is coded as a factor.
Remember: factor variables represent categorical variables in R. They can thus take on a limited number of different values.
A quick look at the Species attribute through the table() function tells you that the division of the species of flowers is 50-50-50:
table(iris$Species)
If you want to check the percentual division of the Species attribute, you can ask for a table of proportions:
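For example:

round(prop.table(table(iris$Species)) * 100, digits = 1)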
Note that the round function rounds the values of its first argument, prop.table(table(iris$Species))*100, to the specified number of digits, which here is one digit after the decimal point. You can easily adjust this by changing the value of the digits argument.
Profound Understanding Of Your Data
Let's not remain on this high-level overview of the data! R gives you the opportunity to go more in-depth with the summary() function. This will give you the minimum value, first quantile, median, mean, third quantile and maximum value of the data set Iris for numeric data types. For the class variable, the count of factors will be returned:
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.30 Min. :2.00 Min. :1.00 Min. :0.1
## 1st Qu.:5.10 1st Qu.:2.80 1st Qu.:1.60 1st Qu.:0.3
## Median :5.80 Median :3.00 Median :4.35 Median :1.3
## Mean :5.84 Mean :3.05 Mean :3.76 Mean :1.2
## 3rd Qu.:6.40 3rd Qu.:3.30 3rd Qu.:5.10 3rd Qu.:1.8
## Max. :7.90 Max. :4.40 Max. :6.90 Max. :2.5
## Species
## Iris-setosa :50
## Iris-versicolor:50
## Iris-virginica :50
##
##
##
You can also refine your summary overview by adding specific attributes to the command that was presented above:
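For example:

summary(iris[c("Petal.Width", "Sepal.Width")])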
As you can see, the c() function is added to the original command: the columns petal width and sepal width are concatenated and a summary is then asked of just these two columns of the Iris data set.
Step Three. Where to Go Now.
After you have acquired a good understanding of your data, you have to decide on the use cases that would be relevant for your data set. In other words, you think about what your data set might teach you or what you think you can learn from your data. From there on, you can think about what kind of algorithms you would be able to apply to your data set in order to get the results that you think you can obtain.
Tip: keep in mind that the more familiar you are with your data, the easier it will be to assess the use cases for your specific data set. The same also holds for finding the appropriate machine learning algorithm.
For this tutorial, the Iris data set will be used for classification, which is an example of predictive modeling. The last attribute of the data set, Species, will be the target variable or the variable that we want to predict in this example.
Note that you can also take one of the numerical attributes as the target variable if you want to use KNN to do regression.
Step Four. Prepare Your Workspace
Many of the algorithms used in machine learning are not incorporated by default into R. You will most probably need to download the packages that you want to use when you want to get started with machine learning.
Tip: got an idea of which learning algorithm you may use, but not of which package you want or need? You can find a pretty complete overview of all the packages that are used in R right here.
To illustrate the KNN algorithm, this tutorial works with the package class. You can type in
library(class)
If you don't have this package yet, you can quickly and easily do so by typing
install.packages("<package name>")
if you're not sure if you have this package, you can run the following command to find out!
any(grepl("<name of your package>", installed.packages()))
Step Five. Prepare Your Data
Normalization
As a part of your data preparation, you might need to normalize your data so that it's consistent. For this introductory tutorial, just remember that normalization makes it easier for the KNN algorithm to learn. There are two types of normalization:
example normalization is the adjustment of each example individually,
feature normalization indicates that you adjust each feature in the same way across all examples.
So when do you need to normalize your dataset? In short: when you suspect that the data is not consistent. You can easily see this when you go through the results of the summary() function. Look at the minimum and maximum values of all the (numerical) attributes. If you see that one attribute has a wide range of values, you will need to normalize your dataset, because this means that the distance will be dominated by this feature. For example, if your dataset has just two attributes, X and Y, and X has values that range from 1 to 1000, while Y has values that only go from 1 to 100, then Y's influence on the distance function will usually be overpowered by X's influence. When you normalize, you actually adjust the range of all features, so that distances between variables with larger ranges will not be over-emphasised.
Tip: go back to the result of summary(iris) and try to figure out if normalization is necessary.
The Iris data set doesn't need to be normalized: the Sepal.Length attribute has values that go from 4.3 to 7.9 and Sepal.Width contains values from 2 to 4.4, while Petal.Length's values range from 1 to 6.9 and Petal.Width goes from 0.1 to 2.5. All values of all attributes are contained within the range of 0.1 and 7.9, which you can consider acceptable.
Nevertheless, it's still a good idea to study normalization and its effect, especially if you're new to machine learning. You can perform feature normalization, for example, by first making your own normalize function:
normalize <- function(x) {
num <- x - min(x)
denom <- max(x) - min(x)
return (num/denom)
}
You can then use this function in another command, where you put the results of the normalization in a data frame through as.data.frame() after the function lapply() returns a list of the same length as the data set that you give in. Each element of that list is the result of applying the normalize function to the corresponding column of the data set that served as input.
For the Iris dataset, you would apply normalize to the four numerical attributes of the Iris data set (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) and put the results in a data frame:
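For example:

iris_norm <- as.data.frame(lapply(iris[1:4], normalize))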
To more thoroughly illustrate the effect of normalization on the data set, compare the following result to the summary of the Iris data set that was given in step two:
summary(iris_norm)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.222 1st Qu.:0.333 1st Qu.:0.102 1st Qu.:0.0833
## Median :0.417 Median :0.417 Median :0.568 Median :0.5000
## Mean :0.429 Mean :0.439 Mean :0.468 Mean :0.4578
## 3rd Qu.:0.583 3rd Qu.:0.542 3rd Qu.:0.695 3rd Qu.:0.7083
## Max. :1.000 Max. :1.000 Max. :1.000 Max. :1.0000
Training And Test Sets
In order to assess your model's performance later, you will need to divide the data set into two parts: a training set and a test set. The first is used to train the system, while the second is used to evaluate the learned or trained system. In practice, the division of your data set into a test and a training set is disjoint: the most common splitting choice is to take 2/3 of your original data set as the training set, while the 1/3 that remains will compose the test set.
One last look at the data set teaches you that if you performed the division of both sets on the data set as is, you would get a training set with all species of “Setosa” and “Versicolor”, but none of “Virginica”. The model would therefore classify all unknown instances as either “Setosa” or “Versicolor”, as it would not be aware of the presence of a third species of flowers in the data. In short, you would get incorrect predictions for the test set.
You thus need to make sure that all three classes of species are present in the training model. What's more, the amount of instances of all three species needs to be present at more or less the same ratio as in your original data set.
To make your training and test sets, you first set a seed. This is a number used to initialize R's random number generator. The major advantage of setting a seed is that you get the same sequence of random numbers whenever you supply the same seed to the random number generator.
set.seed(1234)
Then, you want to make sure that your Iris data set is shuffled and that you have the same ratio between species in your training and test sets. You use the sample() function to take a sample with a size that is set as the number of rows of the Iris data set, or 150. You sample with replacement: you choose from a vector of 2 elements and assign either 1 or 2 to the 150 rows of the Iris data set. The assignment of the elements is subject to probability weights of 0.67 and 0.33.
ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.67, 0.33))
Note that the replace argument is set to TRUE: this means that after a 1 or a 2 is assigned to a certain row, the vector of 2 elements is reset to its original state. In other words, for each of the next rows in your data set, you can again assign either a 1 or a 2. Because you sample with replacement, the probability of choosing a 1 or a 2 does not depend on what has already been drawn; it stays fixed at the probability weights you specify.
Remember that you want your training set to be 2/3 of your original data set: that is why you assign “1” with a probability of 0.67 and “2” with a probability of 0.33 to the 150 sample rows.
You can then use the sample that is stored in the variable ind to define your training and test sets:
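Something along these lines (iris.training and iris.test are the object names I'll keep using below):

iris.training <- iris[ind == 1, 1:4]
iris.test <- iris[ind == 2, 1:4]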
Note that, in addition to the 2/3 and 1/3 proportions specified above, you don't take into account all attributes to form the training and test sets. Specifically, you only take Sepal.Length, Sepal.Width, Petal.Length and Petal.Width. This is because you actually want to predict the fifth attribute, Species: it is your target variable. However, you do want to include it into the KNN algorithm, otherwise there will never be any prediction for it. You therefore need to store the class labels in factor vectors and divide them over the training and test sets.
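For example:

iris.trainLabels <- iris[ind == 1, 5]
iris.testLabels <- iris[ind == 2, 5]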
After all these preparation steps, you have made sure that all your known (training) data is stored. No actual model or learning was performed up until this moment. Now, you want to find the k nearest neighbors of your training set.
An easy way to do these two steps is by using the knn() function, which uses the Euclidean distance measure in order to find the k nearest neighbours of your new, unknown instance. Here, the k parameter is one that you set yourself. As mentioned before, new instances are classified by looking at the majority vote or weighted vote. In case of classification, the data point with the highest score wins the battle and the unknown instance receives the label of that winning data point. If there is an equal amount of winners, the classification happens randomly.
Note the k parameter is often an odd number to avoid ties in the voting scores.
To build your classifier, you need to take the knn() function and simply add some arguments to it, just like in this example:
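For instance, with k set to 3 (the value of k is your choice; the class package was loaded earlier):

iris_pred <- knn(train = iris.training, test = iris.test, cl = iris.trainLabels, k = 3)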
You store into iris_pred the result of the knn() function, which takes as arguments the training set, the test set, the training labels and the number of neighbours you want to find with this algorithm. The result of this function is a factor vector with the predicted classes for each row of the test data.
Note that you don't want to insert the test labels: these will be used to see if your model is good at predicting the actual classes of your instances!
You can retrieve the result of the knn() function by typing in the following command:
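Simply print the object:

iris_pred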
The result of this command is the factor vector with the predicted classes for each row of the test data.
Step Seven. Evaluation of Your Model
An essential next step in machine learning is the evaluation of your model's performance. In other words, you want to analyze the degree of correctness of the model's predictions. For a more abstract view, you can just compare the results of iris_pred to the test labels that you had defined earlier:
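One simple way is to put them side by side in a data frame (the name is arbitrary):

comparison <- data.frame(Predicted = iris_pred, Observed = iris.testLabels)
comparison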
You see that the model makes reasonably accurate predictions, with the exception of one wrong classification in row 29, where “Versicolor” was predicted while the test label is “Virginica”.
This is already some indication of your model's performance, but you might want to go even deeper into your analysis. For this purpose, you can import the package gmodels:
install.packages("package name")
If you have already installed this package, you can simply enter
library(gmodels)
Then you can make a cross tabulation or a contingency table. This type of table is often used to understand the relationship between two variables. In this case, you want to understand how the classes of your test data, stored in iris.testLabels relate to your model that is stored in iris_pred:
CrossTable(x = iris.testLabels, y = iris_pred, prop.chisq=FALSE)
Note that the last argument prop.chisq indicates whether or not the chi-square contribution of each cell is included. The chi-square statistic is the sum of the contributions from each of the individual cells and is used to decide whether the difference between the observed and the expected values is significant.
From this table, you can derive the number of correct and incorrect predictions: one instance from the testing set was labeled Versicolor by the model, while it was actually a flower of species Virginica. You can see this in the first row of the “Virginica” species in the iris.testLabels column. In all other cases, correct predictions were made. You can conclude that the model's performance is good enough and that you don't need to improve the model!
Move On To Big Data
This tutorial was mainly concerned with performing the basic machine learning algorithm KNN with the help of R. The Iris data set that was used was small and easy to survey, but you can do so much more! If you have experimented enough with the basics presented in this tutorial and other machine learning algorithms, you might find it interesting to go further into R and data analysis. DataCamp can help you take this step.
This week, in Istanbul, for the second training on data science, we've been discussing classification and regression models, but also visualisation, including maps. And we did have a brief introduction to the leaflet package.
To see what can be done with that package, we will once more use John Snow's cholera dataset, discussed in previous posts (one with a visualisation on a Google Maps background, and a second one on an OpenStreetMap background).
Once the leaflet package is installed, we can use it at the RStudio console (which is what we will do here), within R Markdown documents, or within Shiny applications. But because of restrictions on this blog (the rules of hypotheses.org), there will only be copies of my screen here. If you run the code in RStudio, you will get interactive maps in the viewer window.
First step. To load a map, centered initially in London, use
m = leaflet()%>% addTiles()
m %>% fitBounds(-.141, 51.511, -.133, 51.516)
In the viewer window of RStudio, it is just like on OpenStreetMap, e.g. we can zoom-in, or zoom-out (with the standard + and – in the top left corner)
And we can add additional material, such as the location of the deaths from cholera (since we now have the same coordinate representation system here)
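Here I assume the death locations from the earlier posts are already available as longitude/latitude vectors, for example read from a local file (the file and column names are hypothetical):

deaths <- read.csv("snow_cholera_deaths.csv")   # hypothetical file with lon/lat columns
lng <- deaths$lon
lat <- deaths$lat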
rd=.5
op=.8
clr="blue"
m = leaflet() %>% addTiles()
m %>% addCircles(lng,lat, radius = rd,opacity=op,col=clr)
Copies of my RStudio screen are nice, but exploring the map interactively is just awesome. I will try to find a way to load that map on my blog, but it might be difficult (so far, it is possible to visualise it on http://rpubs.com/freakonometrics/).
The genomation package is a toolkit for annotation and visualization of various genomic data. The package is currently in the development version of Bioconductor. It allows you to analyze high-throughput data, including bisulfite sequencing data. Here, we will visualize the distribution of CpG methylation around promoters and their locations within gene structures on human chromosome 3.
Heatmap and plot of meta-profiles of CpG methylation around promoters
In this example we use data from Reduced Representation Bisulfite Sequencing (RRBS) and Whole-genome Bisulfite Sequencing (WGBS) techniques and H1 and IMR90 cell types derived from the ENCODE and the Roadmap Epigenomics Project databases.
We download the datasets and convert them to GRanges objects using rtracklayer and genomation functions. We also use a RefSeq BED file for annotation and for extraction of promoter regions using the readTranscriptFeatures function.
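A sketch of that step (the file names are placeholders for the downloaded ENCODE/Roadmap files):

library(genomation)

# methylation calls as GRanges objects (file names are placeholders)
h1.wgbs <- readGeneric("H1_WGBS_chr3.bed", keep.all.metadata = TRUE)
imr90.wgbs <- readGeneric("IMR90_WGBS_chr3.bed", keep.all.metadata = TRUE)

# RefSeq gene model: promoters, exons, introns and TSSs
transcripts <- readTranscriptFeatures("refseq.hg19.chr3.bed")
promoters <- transcripts$promoters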
Since we have read the files, we can now build base-pair resolution matrices of scores (methylation values) for each experiment. The returned list of matrices can be used to draw heatmaps or meta-profiles of the methylation ratio around promoters.
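Roughly (the name of the metadata column holding the methylation values and the bin number are assumptions here):

sml <- ScoreMatrixList(targets = list(H1 = h1.wgbs, IMR90 = imr90.wgbs),
                       windows = promoters, bin.num = 50,
                       weight.col = "meth", strand.aware = TRUE)

multiHeatMatrix(sml, xcoords = c(-1000, 1000))   # heatmaps of methylation around promoters
plotMeta(sml, xcoords = c(-1000, 1000))          # meta-profiles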
genomation facilitates visualization of given locations of features aggregated by exons, introns, promoters and TSSs. To find the distribution of covered CpGs within these gene structures, we will use the transcript features we previously obtained. Here is the breakdown of the code:
Count overlap statistics between our CpGs from WGBS and RRBS H1 cell type and gene structures
Calculate percentage of CpGs overlapping with annotation
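In outline (a sketch, not the full code from the post), those steps look like this:

annot <- annotateWithGeneParts(target = h1.wgbs, feature = transcripts)
getTargetAnnotationStats(annot, percentage = TRUE)   # % of CpGs per gene structure
plotTargetAnnotation(annot, main = "H1 WGBS CpGs")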
Today I will walk through an analysis of San Francisco Zip Code Demographics using my new R package choroplethrZip. This package creates choropleth maps of US Zip Codes and connects to the US Census Bureau. A choropleth is a map that shows boundaries of regions (such as zip codes) and colors those regions according to some metric (such as population).
Zip codes are a common geographic unit for businesses to work with, but rendering them is difficult. Official zip codes are maintained by the US Postal Service, but they exist solely to facilitate mail delivery. The USPS does not release a map of them, they change frequently and, in some cases, are not even polygons. The most authoritative map I could find of US Zip codes was the Census Bureau’s Map of Zip Code Tabulated Areas (ZCTAs). Despite shipping with only a simplified version of this map (60MB instead of 500MB), choroplethrZip is still too large for CRAN. It is instead hosted on github, and you can install it from an R console like this:
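For example (the GitHub repository name below is my best recollection):

# install.packages("devtools")   # if you don't have it yet
library(devtools)
install_github("arilamstein/choroplethrZip")
library(choroplethrZip)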
The package vignettes (1, 2) explain basic usage. In this article I’d like to demonstrate a more in depth example: showing racial and financial characteristics of each zip code in San Francisco. The data I use comes from the 2013 American Community Survey (ACS) which is run by the US Census Bureau. If you are new to the ACS you might want to view my vignette on Mapping US Census Data.
Example: Race and Ethnicity
One table that deals with race and ethnicity is B03002 - Hispanic or Latino Origin by Race. Many people will be surprised by the large number of categories. This is because the US Census Bureau has a complex framework for categorizing race and ethnicity. Since my purpose here is to demonstrate technology, I will simplify the data by dealing with only a handful of the values: Total Hispanic or Latino, White (not Hispanic), Black (not Hispanic) and Asian (not Hispanic).
The R code for getting this data into a data.frame can be viewed here, and the code for generating the graphs in this post can be viewed here. Here is a boxplot of the ethnic breakdown of the 27 ZCTAs in San Francisco.
This boxplot shows that there is wide variation in the racial and ethnic breakdown of San Francisco ZCTAs. For example, the percentage of White residents in each ZCTA ranges from 7% to 80%. The percentages of Black and Hispanic residents have tighter ranges, but also contain outliers. Also, while Asian Americans make up only 5% of the total US population, the median ZCTA in SF is 30% Asian.
Viewing this data with choropleth maps allows us to associate locations with these values.
The range of this dataset - from $15,960 to $144,400 - is striking. Equally striking is the outlier at the top. We can use a continuous scale choropleth to highlight the outlier. We can also use a four color choropleth to show the locations of the quartiles.
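As a very rough sketch of what those two calls can look like (df_income, the ZCTA vector sf_zips, and the zip_zoom argument are assumptions on my part, based on choroplethr's conventions; see the linked code for the real version):

# continuous scale (num_colors = 1) to highlight the outlier
zip_choropleth(df_income, zip_zoom = sf_zips, num_colors = 1,
               title = "San Francisco ZCTA Per Capita Income")

# quartiles (num_colors = 4)
zip_choropleth(df_income, zip_zoom = sf_zips, num_colors = 4,
               title = "San Francisco ZCTA Per Capita Income")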
The outlier for income is zip 94105, which is where a large number of tech companies are located. The zips in the southern part of the city tend to have a low income.
Further Exploration: Interactive Analysis with Shiny
After viewing this analysis readers might wish to do a similar analysis for the city where they live. To facilitate this I have created an interactive web application. The app begins by showing a choropleth map of a random statistic of the Zips in a random Metropolitan Statistical Area (MSA). You can choose another statistic, zoom in, or select another MSA.
In the event that the application does not load (for example, if I reach my monthly quota at my hosting company) then you can run the app from source, which is available here.
Conclusion
I hope that you have enjoyed this exploration of Zip code level demographics with choroplethrZip. I also hope that it encourages more people to use R for demographic statistics.
Ari Lamstein is a software engineer and data analyst in San Francisco. He blogs at Just an R Blog.
One of the things I’ve been trying to explore with my #f1datajunkie projects are ways of representing information that work both in a glanceable way as well as repaying deeper reading. I’ve also been looking at various ways of using text labels rather than markers to provide additional information around particular data points.
For example, in a race battlemap, with lap number on the horizontal x-axis and gap time on the vertical y-axis, I use a text label to indicate which driver is ahead (or behind) a particular target driver.
In the revised version of this chart type shown in F1 Malaysia, 2015 – Rosberg’s View of the Race, an additional numerical label along the x-axis indicates the race position of the target driver at the end of each lap.
What these charts are intended to do is help the eye see particular structural shapes within the data – for example whether a particular driver is being attacked from behind in the example of a battlemap, or whether they are catching the car ahead (perhaps with intervening cars in the way – although more needs to be done on the chart with respect to this for examples where there are several intervening cars; currently, only a single intervening car immediately ahead on track is shown.)
Two closer readings of the chart are then possible. Firstly, by looking at the y-value we can see the actual time a car is ahead (and here the dashed guide line at +/- 1s helps indicate in a glanceable way the DRS activation line; I’m also pondering how to show an indication of pit loss time to indicate what effect a pit stop might have on the current situation). Secondly, we can read off the labels of the drivers involved in a battle to get a more detailed picture of the race situation.
The latest type of chart I’ve been looking at are session utilisation maps, which in their simplest form look something like the following:
The charts show how each driver made use of a practice session or qualifying – drivers are listed on the vertical y-axis and the time into the session each lap was recorded at is identified along the horizontal x-axis.
This chart makes it easy to see how many stints, and of what length, were completed by each driver and at what point in the session. Other information might be inferred – for example, significant gaps in which no cars are recording times may indicate poor weather conditions or red flags. However, no information is provided about the times recorded for each lap.
We can, however, use colour to identify “purple” laps (fastest lap time recorded so far in the session) and “green” laps (a driver’s fastest laptime so far in the session that isn’t a purple time), as well as laps on which a driver pitted:
One thing to note about laptimes is that they come in various flavours, such as outlaps, when a driver starts the lap from the pitlane; inlaps, or laps on which a driver comes into the pits at the end of the lap; and flying laps when a driver is properly going for it. There are also those laps on which a driver may be trying out various new lines, slowing down to give themselves space for a flying lap, and so on.
Assuming that inlaps and outlaps are not the best indicators of pace, we can use a blend of symbols and text labels on the chart to identify inlaps and outlaps, as well as showing laptimes for “racing” laps, also using colour to highlight purple and green laps:
The chart is produced using ggplot, and a layered approach in which chart elements are added to the chart in separate layers.
#The base chart with the dataset used to create the original chart
#In this case, the dataset included here is redundant
g = ggplot(f12015test)
#Layer showing in-laps (laps on which a driver pitted) and out-laps
#Use a subset of the dataset to place markers for outlaps and inlaps
g = g + geom_point(data=f12015test[f12015test['outlap'] | f12015test['pit'],],aes(x=cuml, y=name, color=factor(colourx)), pch=1)
#Further annotation to explicitly identify pit laps (in-laps)
g = g + geom_point(data=f12015test[f12015test['pit']==TRUE,],aes(x=cuml, y=name),pch='.')
#Layer showing full laps with rounded laptimes and green/purple lap highlights
#In this case, use the laptime value as a text label, rather than a symbol marker
g = g + geom_text(data=f12015test[!f12015test['outlap'] & !f12015test['pit'],],aes(x=cuml, y=name, label=round(stime,1), color=factor(colourx)), size=2, angle=45)
#Force the colour scale to be one we want
g = g + scale_colour_manual(values=c('darkgrey','darkgreen','purple'))
This version of the chart has the advantage of being glanceable when it comes to identifying session utilisation (number, duration and timing of stints) and when purple and green laptimes were recorded, while also repaying closer reading when it comes to inspecting the actual laptimes recorded during each stint.
To reduce clutter on the chart, laptimes are rounded to 1 decimal place (tenths of a second) rather than using the full lap time, which is recorded down to thousandths of a second.
Session utilisation charts are described more fully in a forthcoming chapter of the Wrangling F1 Data With R Leanpub book. Buying a copy of the book gains you access to future updates of the book. A draft version of the chapter can be found here.
In February the WSJ graphics team put together a series of interactive visualisations on the impact of vaccination that blew up on twitter and facebook, and were roundly lauded as great-looking and effective dataviz. Some of these had enough data available to look particularly good, such as for the measles vaccine:
Credit to the WSJ and creators: Tynan DeBold and Dov Friedman
How hard would it be to recreate an R version?
Base R version
Quite recently Mick Watson, a computational biologist based here in Edinburgh, put together a base R version of this figure using heatmap.2 from the gplots package.
If you’re interested in the code for this, I suggest you check out his blog post where he walks the reader through creating the figure, beginning from heatmap defaults.
However, it didn’t take long for someone to pipe up asking for a ggplot2 version (3 minutes in fact…) and that’s my preference too, so I decided to have a go at putting one together.
ggplot2 version
Thankfully the hard work of tracking down the data had already been done for me, to get at it follow these steps:
It’s a bit crude, but what I’m doing is summing the weekly incidence rates and leaving NAs if there’s no data for a whole year. This seems to match what’s been done in the WSJ article, though a more interpretable method could be something like average weekly incidence, as used by Robert Allison in his SAS version.
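Roughly, the aggregation looks like this (assuming a long-format data frame weekly with columns state, year and value):

library(dplyr)

yearly <- weekly %>%
  group_by(state, year) %>%
  summarise(incidence = if (all(is.na(value))) NA_real_ else sum(value, na.rm = TRUE))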
After trying to match colours via the OS X utility “digital colour meter” without much success, I instead grabbed the colours and breaks from the original plot’s javascript to make them as close as possible.
In full, the actual ggplot2 command took a fair bit of tweaking:
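At its core it is a geom_tile() heatmap along these lines (cols stands in for the colour vector grabbed from the WSJ javascript, and yearly is the aggregated data from above):

library(ggplot2)

gg <- ggplot(yearly, aes(x = year, y = state, fill = incidence)) +
  geom_tile(colour = "white", size = 0.25) +
  scale_fill_gradientn(colours = cols, na.value = "grey95") +
  geom_vline(xintercept = 1963, colour = "black") +   # year the measles vaccine was introduced
  labs(x = NULL, y = NULL, title = "Measles incidence by state") +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 6))
gg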
I’m pretty happy with the outcome but there are a few differences: the ordering is out (someone pointed out the original is ordered by two letter code rather than full state name) and the fonts are off (as far as I can tell they use “Whitney ScreenSmart” among others).
Obviously the original is an interactive chart which works great with this data. It turns out it was built with the highcharts library, which actually has R bindings via the rCharts package, so in theory the original chart could be entirely recreated in R! However, for now at least, that’ll be left as an exercise for the reader…
At our most recent R user group meeting we were delighted to have presentations from Mark Lawson and Steve Hoang, both bioinformaticians at Hemoshear. All of the code used in both demos is in our Meetup’s GitHub repo.
Making heatmaps in R
Steve started with an overview of making heatmaps in R. Using the iris dataset, Steve demonstrated making heatmaps of the continuous iris data using the heatmap.2 function from the gplots package, the aheatmap function from NMF, and the hard way using ggplot2. The “best in class” method used aheatmap to draw an annotated heatmap plotting z-scores of columns and annotated rows instead of raw values, using the Pearson correlation instead of Euclidean distance as the distance metric.
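A small sketch of that "best in class" call on the iris measurements (the exact arguments used in the demo may differ):

library(NMF)

aheatmap(as.matrix(iris[, 1:4]),
         scale = "column",          # z-scores of columns rather than raw values
         distfun = "correlation",   # correlation-based (Pearson) distance
         annRow = iris["Species"])  # annotated rows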
Mark wrapped up with a gentle introduction to the caret package for classification and regression training. This demonstration used the caret package to split data into training and testing sets, and run repeated cross-validation to train random forest and penalized logistic regression models for classifying Fisher’s iris data.
First, get a look at the data with the featurePlot function in the caret package:
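For example:

library(caret)

featurePlot(x = iris[, 1:4], y = iris$Species,
            plot = "pairs", auto.key = list(columns = 3))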
Next, after splitting the data into training and testing sets and using the caret package to automate training and testing both random forest and partial least squares models using repeated 10-fold cross-validation (see the code), it turns out random forest outperforms PLS in this case, and performs fairly well overall:
                       setosa  versicolor  virginica
Sensitivity              1.00        1.00       0.00
Specificity              1.00        0.50       1.00
Pos Pred Value           1.00        0.50        NaN
Neg Pred Value           1.00        1.00       0.67
Prevalence               0.33        0.33       0.33
Detection Rate           0.33        0.33       0.00
Detection Prevalence     0.33        0.67       0.00
Balanced Accuracy        1.00        0.75       0.50
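For reference, here is a minimal sketch of the workflow just described: a train/test split plus repeated 10-fold cross-validation for random forest and PLS. The partition proportion, number of repeats and seed are assumptions, not the presenters’ settings.

library(caret)
set.seed(42)

# split Fisher's iris data into training and test sets
in_train <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
training <- iris[in_train, ]
testing  <- iris[-in_train, ]

# repeated 10-fold cross-validation
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

fit_rf  <- train(Species ~ ., data = training, method = "rf",  trControl = ctrl)
fit_pls <- train(Species ~ ., data = training, method = "pls",
                 preProcess = c("center", "scale"), trControl = ctrl)

# evaluate on the held-out test set
confusionMatrix(predict(fit_rf,  newdata = testing), testing$Species)
confusionMatrix(predict(fit_pls, newdata = testing), testing$Species)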
A big thanks to Mark and Steve at Hemoshear for putting this together!
To leave a comment for the author, please follow the link and comment on his blog: Getting Genetics Done.
Last week, Mick Watson posted a terrific article on using R to recreate the visualizations in this WSJ article on the impact of vaccination. Someone beat me to the obvious joke.
Why then would I even bother to write this post? Well, because I did things a little differently; diversity of opinion and illustration of alternative approaches are good. And because on the internet, it’s quite acceptable to appropriate great ideas from other people when you lack any inspiration yourself. And because I devoted much of Friday morning to it.
Here then is my “exploration of what Mick did already, only using ggplot2 like Ben did already.”
You know what: since we’re close to the 60th anniversary of Salk’s polio vaccine field trial results, let’s do poliomyelitis instead of measles. And let’s normalise cases per 100 000 population, since Mick and Ben did not. There, I claim novelty.
1. Getting the disease data
Follow Mick’s instructions, substituting POLIOMYELITIS for MEASLES. The result is a CSV file. Except: the first 2 lines are not CSV and the header row contains an extra blank field (empty quotes), not found in the data rows. The simplest way to deal with this was to use read.csv(), which auto-creates the extra column (62), calls it “X” and fills it with NA. You can then ignore it. Here’s a function to read the CSV file and aggregate cases by year.
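A sketch of such a function (the file name is a placeholder, and the column layout of YEAR, WEEK and one column per state is an assumption):

library(reshape2)

read_tycho <- function(csvfile) {
  d   <- read.csv(csvfile, skip = 2, header = TRUE, stringsAsFactors = FALSE)
  d$X <- NULL                                              # drop the auto-created empty column
  d.m <- melt(d, id.vars = c("YEAR", "WEEK"), variable.name = "state")
  d.m$value <- suppressWarnings(as.numeric(d.m$value))     # non-numeric entries become NA
  aggregate(value ~ state + YEAR, data = d.m, FUN = sum)   # cases per state per year
}

polio <- read_tycho("POLIOMYELITIS_Cases.csv")             # placeholder file name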
head(polio)
state YEAR value
1 NEBRASKA 1921 1
2 NEW.JERSEY 1921 1
3 NEW.YORK 1921 40
4 NEW.YORK 1922 2
5 MASSACHUSETTS 1923 1
6 NEW.YORK 1923 8
2. Getting the population data
I discovered US state population estimates for the years 1900–1990 at the US Census Bureau. The URLs are HTTPS, but omitting the “s” works fine. The data are plain text, which is good, but although the data are somewhat structured (delimited), the files themselves vary a lot. Some contain only estimates; others also contain census counts. For earlier decades the numbers are thousands with a comma (so “1,200” = 1 200 000); later files use millions with no comma. The decade years are split over several lines, with different numbers of lines before and in between.
To make a long story short, any function to read these files requires many parameters to take all this into account and it looks like this:
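I won’t reproduce the real thing here, but a stripped-down sketch gives the flavour; every argument below is an assumption that has to be tuned per decade (which line the data start on, how many lines to read, which columns to keep, whether counts are in thousands):

read_census <- function(url, skip, nrows, keep.cols, thousands = TRUE) {
  txt <- readLines(url)[(skip + 1):(skip + nrows)]
  d   <- read.table(text = txt, header = FALSE, fill = TRUE,
                    stringsAsFactors = FALSE)[, keep.cols]
  # strip thousands separators and rescale where counts are reported in thousands
  d[, -1] <- lapply(d[, -1, drop = FALSE], function(x)
    as.numeric(gsub(",", "", x)) * if (thousands) 1000 else 1)
  d
}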
So now we can create a list of data frames, one per decade, then use plyr::join_all to join on state and get a big data frame of 51 states x 91 years with population estimates.
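Assuming decades is that list of per-decade data frames, each with a state column plus one column per year, the join is a one-liner:

library(plyr)

popn.df <- join_all(decades, by = "state")
dim(popn.df)   # expect 51 rows by 92 columns (state plus 91 years)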
3. Joining the datasets
Next step: join the disease and population data. Although we specified states in the original data download, it includes things that are not states like “UPSTATE.NEW.YORK”, “DISTRICT.OF.COLUMBIA” or “PUERTO.RICO”. So let’s restrict ourselves to the 50 states helpfully supplied as variables in R. First we create a data frame containing state names and abbreviations, then match the abbreviations to the polio data.
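One way to do that, using the state.name and state.abb vectors built into R (a sketch; the Tycho state columns are upper case with dots for spaces):

states <- data.frame(state = toupper(gsub(" ", ".", state.name)),
                     abb   = state.abb,
                     stringsAsFactors = FALSE)

polio     <- polio[polio$state %in% states$state, ]          # drop non-states
polio$abb <- states$abb[match(polio$state, states$state)]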
Now we can melt the population data, join to the polio data on state abbreviation and calculate cases per 100 000 people.
popn.m <- melt(popn.df)                                    # reshape2::melt
colnames(popn.m) <- c("abb", "YEAR", "pop")
popn.m$YEAR <- as.numeric(as.character(popn.m$YEAR))
polio.pop <- join(polio, popn.m, by = c("YEAR", "abb"))    # plyr::join
polio.pop$cases <- (100000 / polio.pop$pop) * polio.pop$value
head(polio.pop)
state YEAR value abb pop cases
1 NEBRASKA 1921 1 NE 1309000 0.07639419
2 NEW.JERSEY 1921 1 NJ 3297000 0.03033060
3 NEW.YORK 1921 40 NY 10416000 0.38402458
4 NEW.YORK 1922 2 NY 10589000 0.01888752
5 MASSACHUSETTS 1923 1 MA 4057000 0.02464876
6 NEW.YORK 1923 8 NY 10752000 0.07440476
Success! Let’s get plotting.
4. Plotting
We should really indicate where data are missing but for the purposes of this post, I’ll just drop incomplete rows using na.omit().
Technically my first attempt is an abuse of geom_dotplot, but I think it generates quite a nice effect (assuming you’re not colour-blind). Note that years have to be factorised here.
Finally, let’s try the colour palette generated by Mick. From his post, it’s clear that the WSJ fiddled with bin sizes and break points to generate more yellow/orange/red for pre-vaccine years. I haven’t bothered with that here, so things look a little different.
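A sketch of that final plot, assuming mick.cols holds the palette from Mick’s post:

library(ggplot2)

ggplot(na.omit(polio.pop), aes(x = YEAR, y = state, fill = cases)) +
  geom_tile() +
  scale_fill_gradientn(colours = mick.cols, name = "Cases per\n100,000") +
  labs(x = "", y = "") +
  theme_minimal()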
Summary
Not too much work for some quite attractive output, thanks to great R packages; Hadley, love your work.
As ever, the main challenge is getting the raw data into shape. At some point I’ll wrap all this up as Rmarkdown and send it off to RPubs.
In Part I of this series, I explained how R hashed environments are superior to vectors or lists when you’re in need of an associative array. In Part II I explained the three main operations you want to perform on hash tables and how you should implement them for optimal performance. This time around I want to introduce a package I’ve created that “peeks behind the curtain” of R environments, which I’ll use later to evaluate more performance characteristics.
Under Envestigation
The envestigate package exposes the hash table and its statistics that back the R environment. I consider this an experimental package, so I don’t intend to push it to CRAN any time soon. However, it properly uses the R C API and was developed the proper way, so it should work on any recent version of R across all platforms.
library(devtools) # Make sure you've already installed this
install_github('jeffreyhorner/envestigate')
library(envestigate)
Hash Table Implementation
Hash tables are commonly implemented as an array of buckets, with each bucket containing a linked list. A hash function computes an index into the array of buckets so that the correct value can be found. A perfect hash function ‘hashes’ each key to a unique bucket, but hash functions are rarely perfect: when two keys hash to the same bucket a collision occurs, and collisions are resolved by traversing the bucket’s linked list until the correct value is found.
In R, environment hash tables are a vector of lists where each element of the list contains a binding from symbol to value. The envestigate::hash_table function transforms that structure into a list of character vectors, where each element contains only the symbol name as a string and ignores the value, thus exposing the internal structure so we can explore it:
# e - an environment whose hash table has 5 buckets
e <- new.env(size=5L)
e[['Tukey']] <- 1 # Values aren't important
e[['Fisher']] <- 1
e[['Bayes']] <- 1
e[['Pearson']] <- 1
ht <- envestigate::hash_table(e)
# List the bucket contents
ht$buckets
So we see that the internal hash table of 5 buckets has two empty buckets (buckets 2 and 3) and three non-empty buckets, with bucket 1 containing two elements.
hash_table function
Let’s explore the hash_table object a little further. The envestigate package provides a simple print function for hash table objects:
size - Number of buckets in the hash table, e.g. the size of the hash table.
num_entries - Total number of entries (or elements or variables) in the hash table.
num_collisions - Total number of collisions, counted by summing (list length - 1) over each non-empty bucket.
num_empty_buckets - Total number of buckets that contain zero elements.
num_non_empty_buckets - Total number of buckets that contain at least one element.
load_factor - Ratio of non-empty buckets to size.
max_bucket_length - Length of bucket with the most elements.
And there’s more. The following items are also part of the object but are typically too large for a simple print summary. They are:
buckets - List of the buckets and their associated lists, which we saw above.
empty_buckets - Vector of indexes for the empty buckets.
non_empty_buckets - Vector of indexes for non-empty buckets.
bucket_counts - Named integer vector where the names are bucket indexes and the elements are the counts for each bucket.
bucket_size_dist - Data frame histogram of the number of buckets for each unique bucket size.
A note on the load factor: it is typically defined in the literature as the total number of elements (or entries) in the hash table divided by the size of the hash table. However, computing that value requires a complete scan of the table, which can have performance consequences, for instance when the table needs to be resized. Using the ratio of non-empty buckets to size is a cheaper-to-compute approximation, and this is how R determines when to resize a hash table.
Computing Hash Values
envestigate also exposes the hash function that R uses to compute the hash value for strings, the values that map the strings to their buckets. Designing hash functions is a well-researched topic and beyond the scope of this note, so I’ll just point to R’s implementation here, authored by Peter J. Weinberger.
The function envestigate::hash_value maps character vectors to integer values:
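For example (output omitted; the exact integers depend on R’s internal hash function):

envestigate::hash_value(c('Tukey', 'Fisher', 'Bayes', 'Pearson'))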
Those hash values are pretty meaningless at this point; as you can tell, none of them are in the 1 to 5 range, the size of our hash table. But they are a consistent, reproducible, one-to-one mapping from unique string to unique value, and we can use that result to compute the bucket index using the %% (modulo) operator:
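A sketch of that mapping (bucket indexes in R are 1-based, hence the + 1):

hv  <- envestigate::hash_value(c('Tukey', 'Fisher', 'Bayes', 'Pearson'))
idx <- (hv %% ht$size) + 1
idx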
Now we can compute on R’s internal environment hash tables just like R does. For instance, looking up a variable in an environment is nearly equivalent to this operation for finding the bucket for the string 'Tukey':
index <- hash_index('Tukey',ht)
bucket <- ht$buckets[[ index ]]
bucket
## [1] "Bayes" "Tukey"
for (i in 1:length(bucket)){
  if ('Tukey' == bucket[i]){
    cat('Found Tukey in bucket', index, 'and list element', i, '!\n')
  }
}
## Found Tukey in bucket 1 and list element 2 !
So envestigate away, and share what you come up with!
To leave a comment for the author, please follow the link and comment on his blog: Jeffrey Horner.
The tragic events that took place in Nepal have flooded the web with visualizations of the earthquakes recorded by the USGS. These normally show the earthquakes with a colour scale that depends on the age of the event and a marker size that depends on magnitude. I remembered that some time ago I tested ways of downloading and visualizing USGS data in the same way in R, so I decided to take those tests back up, clean them up and publish them. I hope this will not offend anyone; I do not want to disrespect the tragedy, just share my work.
The USGS provides CSV files for seismic events recorded in several time frames: the past hour, past day, past week and past 30 days. For each of these, several levels of significance are available: users can download all the events in the time frame, or limit their request to events with magnitude above 1.0, 2.5 or 4.5, or to significant events only. The files have standard names, so they are always accessible, and they are updated every 15 minutes with new data. For example, here we are downloading all the data for the last month, so the file name is all_month.csv. If we wanted only the earthquakes of the last day with a magnitude above 4.5, we would use the file name 4.5_day.csv. Links to all of the CSV files provided by the USGS are available here: http://earthquake.usgs.gov/earthquakes/feed/v1.0/csv.php
For this experiment we need the following packages: sp, plotrix and raster. In R we can easily import the data by simply calling the read.table function and reading the CSV file directly from the server:
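Something like this (the feed URL follows the standard naming pattern mentioned above):

library(sp)
library(plotrix)
library(raster)

URL <- "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv"
Earthquake_30Days <- read.table(URL, sep = ",", header = TRUE, quote = "\"")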
This will download all the seismic events in the past 30 days. Now we can transform the data, which are stored in a data.frame, into a spatial object using the following two lines:
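A sketch of those two lines (the longitude and latitude column names come from the USGS file):

coordinates(Earthquake_30Days) <- ~ longitude + latitude
proj4string(Earthquake_30Days) <- CRS("+proj=longlat +datum=WGS84")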
The first line transforms the object Earthquake_30Days into a SpatialPointsDataFrame. The second gives it its proper projection, which is a geographical (longitude/latitude) projection like the one used by Google Maps. At this point I want to download the borders of all the countries in the world so that I can plot the seismic events with some geographical reference:
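A sketch of those three lines, using the world borders shapefile from thematicmapping.org (the exact URL is an assumption):

download.file("http://thematicmapping.org/downloads/TM_WORLD_BORDERS_SIMPL-0.3.zip",
              destfile = "TM_WORLD_BORDERS_SIMPL-0.3.zip")
unzip("TM_WORLD_BORDERS_SIMPL-0.3.zip")
borders <- shapefile("TM_WORLD_BORDERS_SIMPL-0.3.shp")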
These three lines download the border shapefile from the web, unzip it into the working directory and load it. In this example I will visualize the earthquakes using the same technique the USGS uses, with a colour that varies with the age of the event and a size that depends on magnitude. So the first thing to do is take care of the time. If we check the format of the time variable in the USGS file, we see that it is a bit uncommon:
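The time stamps look like "YYYY-MM-DDTHH:MM:SS.mmmZ", so one way to convert them is (a sketch):

Earthquake_30Days$time <- as.POSIXct(Earthquake_30Days$time,
                                     format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")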
Now we can start the tricky part. To plot the events with a custom colour scale and a custom size scale, we first need to create them; we also need to create the thresholds for the legend. For the colour scale we can do all of that with the following lines:
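A sketch of those steps, using color.scale from plotrix (the alpha level and the number of breaks are guesses):

age  <- as.numeric(difftime(Sys.time(), Earthquake_30Days$time, units = "days"))
cols <- color.scale(age, cs1 = c(1, 0), cs2 = 0, cs3 = c(0, 1), alpha = 0.5)

colors.DF    <- data.frame(age = age, col = cols, stringsAsFactors = FALSE)
colors.DF    <- colors.DF[order(colors.DF$age), ]
colors.DF$ID <- 1:nrow(colors.DF)
breaks       <- seq(1, nrow(colors.DF), length.out = 5)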
In the first line I calculate the age of the event as the difference between the system time and the event time-stamp. In the second line I create the colour scale with the function in plotrix, from red to blue and with a certain transparency. Then I need to create the thresholds for the legend. I first create a data.frame with age and colours, then I order it by age and insert an ID column. At this point I can create the thresholds by simply using the seq function. I do the same thing with the size thresholds:
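And similarly for the sizes, with the magnitude divided by 5 (again a sketch):

Earthquake_30Days$size <- Earthquake_30Days$mag / 5
breaks.size <- seq(min(Earthquake_30Days$size, na.rm = TRUE),
                   max(Earthquake_30Days$size, na.rm = TRUE), length.out = 5)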
# Title and Legend
title("Earthquakes in the last 30 days", cex.main=3)

legend.pos <- list(x=-28.52392, y=-20.59119)
rect(xleft=legend.pos$x-5, ybottom=legend.pos$y-30, xright=legend.pos$x+30, ytop=legend.pos$y+10, col="white", border=NA)

legendg(legend.pos, legend=c(round(colors.DF[colors.DF$ID %in% round(breaks,0),1],2)),
        fill=paste(colors.DF[colors.DF$ID %in% round(breaks,0),2]),
        bty="n", bg=c("white"), y.intersp=0.75, title="Age", cex=0.8)

text(x=legend.pos$x+5, y=legend.pos$y+5, "Legend:")

legend(x=legend.pos$x+15, y=legend.pos$y, legend=breaks.size[2:5]*5,
       pch=points(rep(legend.pos$x+15,4),
                  c(legend.pos$y-6, legend.pos$y-9, legend.pos$y-12, legend.pos$y-15),
                  pch=16, cex=breaks.size[2:5]),
       cex=0.8, bty="n", bg=c("white"), y.intersp=1.1, title="Magnitude")
dev.off()
I divided the magnitude by 5 so that the bubbles are not too big. The position of the legends depends on the image: if you decrease the area plotted on the map, their location will change, and you can use geographical coordinates to adjust it. The result is the following image:
Data frame df_zip_demographics contains eight demographic statistics about each ZIP Code Tabulated Area (ZCTA) in the US. Data comes from the 2013 5-year American Community Survey (ACS).
Function ?get_zip_demographics will return a data.frame with those same statistics from an arbitrary ACS.
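For example, something along these lines should pull the statistics for a different survey (the endyear and span argument names follow the choroplethr conventions and are an assumption here):

library(choroplethrZip)

# demographics from the 2012 5-year ACS (hypothetical example)
df_zip_demographics_2012 <- get_zip_demographics(endyear = 2012, span = 5)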
Here is a program which will create national maps of the data:
library(choroplethrZip)   # provides df_zip_demographics and zip_choropleth
data(df_zip_demographics)

# for each column in the data.frame
for (i in 2:ncol(df_zip_demographics))
{
# set the value and title
df_zip_demographics$value = df_zip_demographics[,i]
title = paste0("2013 ZCTA Demographics:\n",
               colnames(df_zip_demographics[i]))
# print the map
choro = zip_choropleth(df_zip_demographics, title=title)
print(choro)
}
Note that national zip maps can take a few minutes to render. Here is the output.
The list of speakers was impressive (a couple of the presentations approached the sublime), the venue was bright and upscale, the food was good, and some of the best talks ran way over the time limit but somehow the clock slowed down to sync to the schedule.
But the best part of the conference was the vibe! It was a sweet brew of competency, cooperation and fun. The crowd, clearly out to enjoy themselves, provided whatever lift the speakers needed to be at the top of their game. For example, when Stefan Karpinski's PC just "up and died" near the very end of the second day, as he was about to start his Julia-to-R demo, the crowd hung in there with him and Stefan managed an engaging, ad lib, no-visuals 20-minute talk. It was also uncanny how the talks seemed to be arranged in just the right order. Mike Dewar, a data scientist with the New York Times, gave the opening presentation, which featured some really imaginative and impressive data visualizations that wowed the audience. But Bryan Lewis stole back the thunder, and the applause, later in the morning when, as part of his presentation on htmlwidgets, he reproduced Dewar's finale viz with mushroom data.
Bryan has posted his slides on his site here along with a promise to post all of the code soon.
The slides from all of the presentations have yet to be posted on the NY R Conference website. So, all I can do here today is to provide an opportunity sample drawn from postings I have managed to find scattered about the web. Here are Winston Chang's talk on Dashboarding with Shiny, Jared Lander's talk on Making R Go Faster and Bigger, Wes McKinney's talk on Data Frames, my talk on Reproducibility with the checkpoint package, and Max Richman's talk on R for Survey Analysis.
For the rest of the presentations, we will have to wait for the slides to become available on the conference site. There is a lot to look forward to: Vivian Peng's presentation on Storytelling and Data Visualization will be worth multiple viewings and you will not want to miss Hilary Parker's hilarious "grand slam" talk on Reproducible Analysis in Production featuring explainR and complainR. But for sure, look for Andrew Gelman's talk: But When You Call Me A Bayesian I Know I'm Not the Only One. Gelman delivered what was possibly the best technical talk ever, but we will have to wait for the conference video to reassess that.
Was Gelman's talk really the best ever, or was it just the magic of his delivery and the mood of the audience that made it seem so? Either way, I'm glad I was there.
To leave a comment for the author, please follow the link and comment on his blog: Revolutions.