Search Results for “maps” – R-bloggers

Interactive visualizations with R – a minireview


(This article was first published on Open Data Science, and kindly contributed to R-bloggers)

Interactive visualization allows deeper exploration of data than static plots. Javascript libraries such as d3 have made possible wonderful new ways to show data. Luckily the R community has been active in developing R interfaces to some popular javascript libraries to enable R users to create interactive visualizations without knowing any javascript.

In this post I have reviewed some of the most common interactive visualization packages in R with simple example plots along with some comments and experiences. Here are the packages included:

  • ggplot2 – one of the best static visualization packages in R
  • ggvis – interactive plots from the makers of ggplot2
  • rCharts – R interface to multiple javascript charting libraries
  • plotly – convert ggplot2 figures to interactive plots easily
  • googleVis – use Google Chart Tools from R

You can either jump straight to the example visualizations or read my comments first. The R markdown source code for this blog post, with embedded visualizations, can be found on GitHub. I have probably missed some important features and documentation, and clear mistakes are also possible. Please point those out in the comments and I’ll fix them. It is also important to note that I am a heavy ggplot2 user, and hence my comments may be biased!

Other libraries for creating interactive visualizations from R also exist, such as clickme, RIGHT, ggobi, iplots, gg2v, rVega, cranvas and r2d3. Some of these are no longer under active development. I might include some of them in the comparison in the future. The d3Network package is also worth checking out if you need cool interactive network visualizations.

Technical features

All four interactive packages use javascript for the visualizations and are capable of producing most of the standard plot types. The syntaxes vary somewhat: ggvis uses the pipe operator %>% (familiar to dplyr users) in place of the + in ggplot2, while rCharts wraps several javascript libraries and its syntax varies between the different chart types.
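As a quick illustration of the syntax difference (a minimal sketch; both calls reappear in the examples later in this post), here is the same histogram in ggplot2 and in ggvis:

# ggplot2: layers are added with +
library("ggplot2")
ggplot(mtcars, aes(x = mpg)) + geom_histogram(binwidth = 1)
# ggvis: the data is piped through %>%
library("ggvis")
mtcars %>% ggvis(x = ~mpg) %>% layer_histograms(width = 1)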

All of the packages other than googleVis are clearly in an early development phase, which shows in their limited features and documentation. As an experienced ggplot2 user I often found it hard to adapt to the much narrower range of features in ggvis. For example, faceting is a very important feature that will hopefully be implemented soon.

Documentation-wise, ggvis and googleVis seem to be the most advanced. rCharts especially suffers from combining multiple plot types (named rather uninformatively rPlot, nPlot and so on) with practically no documentation, so producing anything other than what is provided in the existing examples was very hard.

googleVis sets itself apart by requiring the data in a different format than the other packages. In Hadley Wickham’s terms, it assumes the data is in the messy format, in contrast to the other packages, which assume tidy data. This makes it somewhat hard to use, at least when one is used to working with tidy data frames. See the examples below for more details.

Plotly is an interesting alternative to the other packages in that it simply takes a ggplot2 object as input and transforms it into an interactive chart that can then be embedded into websites. Using the service requires authentication, which is a clear limitation. By default all plots are made publicly visible to anyone, but there apparently is a way to produce private plots as well, with a limit on their number in the free account.

ggvis is currently the only one of these packages that cannot produce map visualizations, but I assume this feature will be added in the future. plotly can use maps created with ggplot2, but not yet with the handy ggmap extension.

Sharing the visualizations

Interactive visualizations are typically meant to be shared with a larger audience. Common ways to share interactive visualizations from R are as standalone HTML files, embedded in R markdown documents, and embedded in Shiny applications. All the studied packages can produce standalone HTML files, though possibly with some loss of interactivity.

R markdown documents are a very nice way of sharing reproducible analyses, using the knitr and rmarkdown packages. Outputs from all the studied visualization packages can be embedded in .Rmd documents, though I had some problems (see the Issues section below). All packages are also compatible with Shiny documents and applications, and have good tutorials for this.

Issues

I encountered several problems when preparing this blog post. Specifically, I had issues embedding the plots into R markdown documents. This is made more complicated by the various available ways of turning .Rmd files into HTML: the manual knit() and knit2html() functions, the Knit HTML button in RStudio, and a Jekyll-powered blog with its own requirements. Here I have listed the most important issues, with solutions when found. Some things are still unsolved; I hope someone can help me with those!

  • ggvis showed up nicely with Knit HTML, as it creates a standalone file with the necessary javascript libraries included. However, this was not the case with my blog setup. My solution was to include the set of scripts (taken from the source of this page) in the header of all my blog posts (see here). I am not sure whether this is an optimal solution.
  • rCharts: Embedding rCharts in R markdown did not quite work as shown e.g. here either. With the Knit HTML button the line that worked was rchars.object$print(include_assets=TRUE), whereas with the blog the line was rchars.object$show('iframesrc', cdn=TRUE).
  • plotly: Embedding plotly charts into R markdown documents did not work as shown here, but adding session="knit" to the ggplotly() call solved the issue (thanks to Scott Chamberlain and Marianne Corvellec for help!). Note that in this post I embedded existing plotly charts manually.
  • There are still two charts that do not show up in this post. I have quite limited understanding of how knitr, Jekyll and the javascript tools work together, and could not get these to work. Perhaps the scripts somehow conflict with each other?

I also noticed some minor issues:

  • googleVis was missing axis labels by default
  • rCharts is missing legend titles and behaves strangely in the scatter plot: the legend shows partially incorrect information, and the plot area is too tight

Summary

In general, being able to produce valid interactive HTML charts from R markdown without knowing any javascript is great! All of the packages produce sensible output, but there are also a lot of differences. I love ggplot2, and hence I also like ggvis, as it pays attention to graphical details following the grammar of graphics principles. However, the package is still missing a lot of important features, such as faceting. In many cases rCharts can do what ggvis can not (yet), and so it is a good alternative. However, the missing documentation makes it hard to create customized plots. Plotly has a really nice idea and implementation, but the requirement for authentication and the limited number of private plots reduce its usability a lot. Google’s Motion Charts are cool and useful, but otherwise the input data format, which differs from the other packages, makes using googleVis too hard in practice.

Example visualizations

Here I have made example plots with the interactive tools: histograms, scatter plots and line plots. Source code is available on GitHub. First we need to install and load the necessary R packages:

## Install necessary packages
install.packages("devtools")
library("devtools")
install.packages("ggplot2")  # needed for the ggplot examples below
install.packages("ggvis")
install.packages("googleVis")
install_github("ramnathv/rCharts")
install_github("ropensci/plotly")
install.packages("dplyr")
install.packages("tidyr")
install.packages("knitr")
# Load packages
library("ggplot2")
library("ggvis")
library("googleVis")
library("rCharts")
library("plotly")
library("dplyr")
library("tidyr")
library("knitr")
# Define image sizes
img.width <- 450
img.height <- 300
options(RCHART_HEIGHT = img.height, RCHART_WIDTH = img.width)
opts_chunk$set(fig.width=6, fig.height=4)

Plotly needs some setting up (using the credentials from here).

# Plotly requires authentication
py <- plotly("RgraphingAPI", "ektgzomjbx")

Prepare the mtcars data set a bit.

# Use mtcars data
data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$am <- factor(mtcars$am)
# Compute mean mpg per cyl and am
mtcars.mean <- mtcars %>% group_by(cyl, am) %>% 
  summarise(mpg_mean=mean(mpg)) %>% 
  select(cyl, am, mpg_mean) %>% ungroup()

Histograms

ggplot

hist.ggplot <- ggplot(mtcars, aes(x=mpg)) + geom_histogram(binwidth=1)
hist.ggplot


ggvis

hist.ggvis <- mtcars %>% ggvis(x = ~mpg) %>% layer_histograms(width=1) %>% 
  set_options(width = img.width, height = img.height)
hist.ggvis

], "axes": [ { "type": "x", "scale": "x", "orient": "bottom", "layer": "back", "grid": true, "title": "mpg" }, { "type": "y", "scale": "y", "orient": "left", "layer": "back", "grid": true, "title": "count" } ], "padding": null, "ggvis_opts": { "keep_aspect": false, "resizable": true, "padding": {

}, "duration": 250, "renderer": "svg", "hover_duration": 0, "width": 450, "height": 300 }, "handlers": null } ; ggvis.getPlot("plot_id128334887").parseSpec(plot_id128334887_spec);

rCharts

# rCharts histogram needs manual binning and counting!
hist.rcharts <- rPlot(x="bin(mpg,1)", y="count(id)", data=mtcars, type="bar")
# Use this with 'Knit HTML' button
# hist.rcharts$print(include_assets=TRUE)
# Use this with jekyll blog
hist.rcharts$show('iframesrc', cdn=TRUE)

Does not show up…

plotly

# This works, but is not evaluated now. Instead the iframe is embedded manually.
py$ggplotly(hist.ggplot, session="knitr")

googleVis

# Number of bins chosen automatically, which is sometimes bad
gvis.options <- list(hAxis="{title:'mpg'}",
                     width=img.width, height=img.height)
hist.gvis <- gvisHistogram(data=mtcars["mpg"], option=gvis.options)
print(hist.gvis)

[Embedded googleVis histogram. Data: mtcars["mpg"] • Chart ID: HistogramID56d271b392d5 • googleVis-0.5.6]

Scatter plots

ggplot

scatter.ggplot <- ggplot(mtcars, aes(x=wt, y=mpg, colour=cyl)) + geom_point()
scatter.ggplot


ggvis

scatter.ggvis <- mtcars %>% ggvis(x = ~wt, y = ~mpg, fill = ~cyl) %>% 
  layer_points() %>% set_options(width = img.width, height = img.height)
scatter.ggvis

} }, "values": "\"domain\"\n\"4\"\n\"6\"\n\"8\"" }, { "name": "scale/x", "format": { "type": "csv", "parse": { "domain": "number" } }, "values": "\"domain\"\n1.31745\n5.61955" }, { "name": "scale/y", "format": { "type": "csv", "parse": { "domain": "number" } }, "values": "\"domain\"\n9.225\n35.075" } ], "scales": [ { "name": "fill", "type": "ordinal", "domain": { "data": "scale/fill", "field": "data.domain" }, "points": true, "sort": false, "range": "category10" }, { "name": "x", "domain": { "data": "scale/x", "field": "data.domain" }, "zero": false, "nice": false, "clamp": false, "range": "width" }, { "name": "y", "domain": { "data": "scale/y", "field": "data.domain" }, "zero": false, "nice": false, "clamp": false, "range": "height" } ], "marks": [ { "type": "symbol", "properties": { "update": { "size": { "value": 50 }, "x": { "scale": "x", "field": "data.wt" }, "y": { "scale": "y", "field": "data.mpg" }, "fill": { "scale": "fill", "field": "data.cyl" } }, "ggvis": { "data": { "value": "mtcars0" } } }, "from": { "data": "mtcars0" } } ], "width": 450, "height": 300, "legends": [ { "orient": "right", "fill": "fill", "title": "cyl" } ], "axes": [ { "type": "x", "scale": "x", "orient": "bottom", "layer": "back", "grid": true, "title": "wt" }, { "type": "y", "scale": "y", "orient": "left", "layer": "back", "grid": true, "title": "mpg" } ], "padding": null, "ggvis_opts": { "keep_aspect": false, "resizable": true, "padding": {

}, "duration": 250, "renderer": "svg", "hover_duration": 0, "width": 450, "height": 300 }, "handlers": null } ; ggvis.getPlot("plot_id223898430").parseSpec(plot_id223898430_spec);

rCharts

scatter.rcharts <- rPlot(mpg ~ wt, data = mtcars, color = 'cyl', type = 'point')
# WTF, legend shows 4-7, while the levels are 4,6,8???
# very tight limits, parts of points missing on the edge
# Use this with 'Knit HTML' button
# scatter.rcharts$print(include_assets=TRUE)
# Use this with jekyll blog
scatter.rcharts$show('iframesrc', cdn=TRUE)

plotly

# This works, but is not evaluated now. Instead the iframe is embedded manually.
py$ggplotly(scatter.ggplot, session="knitr")

googleVis

# Spread data to show the wanted scatter plot (unique id required for unique rows)
mtcars$id <- as.character(1:nrow(mtcars))
mtcars.temp <- tidyr::spread(mtcars[c("wt", "mpg", "cyl", "id")], key=cyl, value=mpg)
gvis.options <- list(hAxis="{title:'wt'}", vAxis="{title:'mpg'}",
                     width=img.width, height=img.height)
scatter.gvis <- gvisScatterChart(select(mtcars.temp, -id), options=gvis.options)
print(scatter.gvis)

[Embedded googleVis scatter chart. Data: select(mtcars.temp, -id) • Chart ID: ScatterChartID56d215377749 • googleVis-0.5.6]

Line plots

ggplot

line.ggplot <- ggplot(mtcars.mean, aes(x=cyl, y=mpg_mean, colour=am)) + 
  geom_line(aes(group=am))
line.ggplot


ggvis

line.ggvis <- mtcars.mean %>% ggvis(x = ~cyl, y = ~mpg_mean, stroke = ~am) %>% 
  layer_lines() %>% set_options(width = img.width, height = img.height)
line.ggvis

} }, "values": "\"domain\"\n\"0\"\n\"1\"" }, { "name": "scale/x", "format": { "type": "csv", "parse": {

} }, "values": "\"domain\"\n\"4\"\n\"6\"\n\"8\"" }, { "name": "scale/y", "format": { "type": "csv", "parse": { "domain": "number" } }, "values": "\"domain\"\n14.39875\n28.72625" } ], "scales": [ { "name": "stroke", "type": "ordinal", "domain": { "data": "scale/stroke", "field": "data.domain" }, "points": true, "sort": false, "range": "category10" }, { "name": "x", "type": "ordinal", "domain": { "data": "scale/x", "field": "data.domain" }, "points": true, "sort": false, "range": "width", "padding": 0.5 }, { "name": "y", "domain": { "data": "scale/y", "field": "data.domain" }, "zero": false, "nice": false, "clamp": false, "range": "height" } ], "marks": [ { "type": "group", "from": { "data": "mtcars.mean0/group_by1/arrange2" }, "marks": [ { "type": "line", "properties": { "update": { "x": { "scale": "x", "field": "data.cyl" }, "y": { "scale": "y", "field": "data.mpg_mean" }, "stroke": { "scale": "stroke", "field": "data.am" } }, "ggvis": { "data": { "value": "mtcars.mean0/group_by1/arrange2" } } } } ] } ], "width": 450, "height": 300, "legends": [ { "orient": "right", "stroke": "stroke", "title": "am" } ], "axes": [ { "type": "x", "scale": "x", "orient": "bottom", "layer": "back", "grid": true, "title": "cyl" }, { "type": "y", "scale": "y", "orient": "left", "layer": "back", "grid": true, "title": "mpg_mean" } ], "padding": null, "ggvis_opts": { "keep_aspect": false, "resizable": true, "padding": {

}, "duration": 250, "renderer": "svg", "hover_duration": 0, "width": 450, "height": 300 }, "handlers": null } ; ggvis.getPlot("plot_id291228275").parseSpec(plot_id291228275_spec);

Does not show up…

rCharts

line.rcharts <- hPlot(x="cyl", y="mpg_mean", group="am", data=mtcars.mean, type="line")
# Use this with 'Knit HTML' button
# line.rcharts$print(include_assets=TRUE)
# Use this with jekyll blog
line.rcharts$show('iframesrc', cdn=TRUE)

plotly

# This works, but is not evaluated now. Instead the iframe is embedded manually.
py$ggplotly(line.ggplot, session="knitr")

googleVis

# Spread data to show the wanted line plot
mtcars.mean.temp <- tidyr::spread(mtcars.mean, key=am, value=mpg_mean)
gvis.options <- list(hAxis="{title:'cyl'}", vAxis="{title:'mpg_mean'}",
                     width=img.width, height=img.height)
line.gvis <- gvisLineChart(xvar="cyl", yvar=c("0", "1"), data=mtcars.mean.temp, 
                           options=gvis.options)
print(line.gvis)

[Embedded googleVis line chart. Chart ID: LineChartID56d21b449ffd • googleVis-0.5.6]

Session info

sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-apple-darwin13.1.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] tidyr_0.1          dplyr_0.3.0.2      plotly_0.5.10     
##  [4] ggplot2_1.0.0      RJSONIO_1.3-0      RCurl_1.95-4.3    
##  [7] bitops_1.0-6       rCharts_0.4.5      googleVis_0.5.6   
## [10] ggvis_0.4.0.9000   SnowballC_0.5.1    wordcloud_2.5     
## [13] RColorBrewer_1.0-5 tm_0.6             NLP_0.1-5         
## [16] knitr_1.8         
## 
## loaded via a namespace (and not attached):
##  [1] assertthat_0.1   colorspace_1.2-4 DBI_0.3.1        digest_0.6.4    
##  [5] evaluate_0.5.5   formatR_1.0      grid_3.1.1       gtable_0.1.2    
##  [9] htmltools_0.2.6  httpuv_1.3.2     jsonlite_0.9.13  labeling_0.3    
## [13] lattice_0.20-29  lazyeval_0.1.9   magrittr_1.0.1   MASS_7.3-35     
## [17] mime_0.2         munsell_0.4.2    parallel_3.1.1   plyr_1.8.1      
## [21] proto_0.3-10     R6_2.0           Rcpp_0.11.3      reshape2_1.4    
## [25] scales_0.2.4     shiny_0.10.2.1   slam_0.1-32      stringr_0.6.2   
## [29] tools_3.1.1      whisker_0.3-2    xtable_1.7-4     yaml_2.1.13



Creative Commons license

To leave a comment for the author, please follow the link and comment on his blog: Open Data Science.


Do You Know What You Show at Your Map?


(This article was first published on Misanthrope's Thoughts, and kindly contributed to R-bloggers)
As access to GIS and mapping becomes easier every year, more and more people and companies create maps. Unfortunately, they often just do not know what they are actually showing on their maps. This issue is mentioned over and over again.

Here is an example that I discovered recently: the Cyberthreat Real-Time Map by the Kaspersky antivirus company. Here is how it looks:


Amongst other information they show an Infection rank for each country... based on total threats detected... You may have already guessed what the failure is, but let me explain it anyway.

See, the №1 infected country is Russia, which is the home country of Kaspersky and where this antivirus is quite popular. So we can conclude that the ranking that is supposed to demonstrate the severity of virus activity merely demonstrates the number of Kaspersky software installations across the globe.

Let's test this hypothesis. I don't have data about the number of Kaspersky installations per country, but it is safe to assume that this number is proportional to the population of the given country. Also, it is easier to get infection rankings for countries from the map than the number of threats detected. If I had total threat counts per country I would compare them to the population. Having infection rankings, it is more rational to compare them to population rankings instead. So I picked 27 random countries and compared their infection and population rankings. The result is shown in the plot below:

Infection rank vs. Population rank
The linear model is fairly close to Infection rank = Population rank. It is clear that the phenomenon presented as an Infection rank just reflects total software installations per country and not the severity of the 'cyberthreat'. In order to get an actual Infection rank, the number of detected threats has to be normalised by the number of software installations.
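For anyone who wants to reproduce the idea, here is a minimal R sketch with made-up rankings (the actual 27-country sample behind the plot is not reproduced here): regress the infection rank on the population rank and compare the fit to the identity line.

# Hypothetical rankings, for illustration only
ranks <- data.frame(
  population_rank = c(1, 3, 5, 8, 12, 15, 19, 23, 27),
  infection_rank  = c(2, 3, 6, 7, 11, 16, 18, 24, 26)
)
fit <- lm(infection_rank ~ population_rank, data = ranks)
summary(fit)  # a slope near 1 and an intercept near 0 support the hypothesis
plot(ranks$population_rank, ranks$infection_rank,
     xlab = "Population rank", ylab = "Infection rank")
abline(fit)            # fitted linear model
abline(0, 1, lty = 2)  # identity line: Infection rank = Population rank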

To leave a comment for the author, please follow the link and comment on his blog: Misanthrope's Thoughts.


Some Applications of Item Response Theory in R


(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)
The typical introduction to item response theory (IRT) positions the technique as a form of curve fitting. We believe that a latent continuous variable is responsible for the observed dichotomous or polytomous responses to a set of items (e.g., multiple choice questions on an exam or rating scales from a survey). Literally, once I know your latent score, I can predict your observed responses to all the items. Our task is to estimate that function with one, two or three parameters after determining that the latent trait is unidimensional. In the process of measuring individuals, we gather information about the items. Those one, two or three parameters are assessments of each item's difficulty, discriminability and sensitivity to noise or guessing.

All this has been translated into R by William Revelle, and as a measurement task, our work is done. We have an estimate of each individual's latent position on an underlying continuum defined as whatever determines the item responses. Along the way, we discover which items require more of the latent trait in order to achieve a favorable response (e.g., the difficulty of answering correctly or the extremity of the item and/or the response). We can measure ability with achievement items, political ideology with an opinion survey, and brand perceptions with a list of satisfaction ratings.
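As a concrete illustration, here is a minimal sketch of fitting a one-parameter IRT model in R; it uses the ltm package and its bundled LSAT data rather than anything from this post, so treat the package choice as an assumption, not the author's workflow.

library(ltm)             # one of several IRT packages for R
data(LSAT)               # 1000 respondents, 5 dichotomous test items
fit <- rasch(LSAT)       # one-parameter (difficulty) model
coef(fit)                # item difficulty estimates
plot(fit, type = "ICC")  # item characteristic curves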

To be clear, these scales are meant to differentiate among individuals. For example, the R statistical programming language has an underlying structure that orders the learning process so that the more complex concepts are mastered after the simpler material. In this case, learning is shaped by the difficulty of the subject matter with the more demanding content reusing or building onto what has already been learned. When the constraints are sufficient, individuals and their mastery can be arrayed on a common scale. At one end of the continuum are complex concepts that only the more advanced students master. The easier stuff falls toward the bottom of the scale with topics that almost everyone knows. When you take an R programming achievement test, your score tells me how well you performed relative to others who answered similar questions (see normed-referenced testing).

The same reasoning applies to IRT analysis of political ideology (e.g., the R package basicspace). Opinions tend to follow a predictable path from liberal to conservative so that only a limited number of all possible configurations are actually observed. As shown below, legislative voting follows such a pattern, with Senators (dark line) and Representatives (light line) separated along the liberal-to-conservative dimension based on their votes in the 113th Congress. Although not shown, all the specific votes can also be placed on this same scale, so that Pryor, Landrieu, Baucus and Hagan (in blue) are located toward the right because their votes on various bills and resolutions agreed more often with Republicans (in red). As with achievement testing, an order is imposed on the likely responses so that the response space in p dimensions (where p equals the number of behaviors, items or votes) is reduced to a one-dimensional seriation of both votes and voters on the same scale.

My last example comes from marketing research, where brand perceptions tend to be organized as a pattern of strengths and weaknesses defined by the product category. In a previous post, I showed how preference for Subway fast food restaurants is associated with a specific ordering of product and service attribute ratings. Many believe that Subway offers fresh and healthy food. Fewer like the taste or feel it is filling. Fewer still are happy with the ordering or preparation, and even more dislike the menu and the seating arrangements. These perceptions have an order, so that if you are satisfied with the menu then you are likely to be satisfied with the taste and the freshness/healthiness of the food. Just as issues can be ordered from liberal to conservative, brand perceptions reflect the strengths and weaknesses promised by the brand's positioning. Subway promises fresh and healthy food, not something prepackaged and waiting under the heat lamp for easy bagging. The mean levels of our satisfaction ratings will be consistent with those brand priorities.

We can look at the same data from another perspective. Heatmaps summarize the triangular pattern observed in data matrices that can be modeled by IRT. In a second post analyzing the Subway data, I described the following heatmap showing the results from the 8-item checklist of features associated with the brand. Each row is a different respondent with the blue indicating that the item was checked and red telling us that the item was not checked. As one moves down the heatmap, the overall perceptions become more positive as additional attributes are endorsed. Positive brand perceptions are incremental, but the increments are not more of the same. Tasty and filling gets added to healthy and fresh. That is, greater satisfaction with Subway is reflected in the willingness to endorse additional components of the brand promise. The heatmap is triangular so that those who are happy with the menu are likely to be at least as satisfied with all the attributes to the right.
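The triangular pattern is easy to simulate. Below is a short sketch with simulated Guttman-style responses (not the Subway survey data) that produces the kind of heatmap described above.

set.seed(1)
ability    <- sort(rnorm(100))                # respondents ordered by latent score
difficulty <- seq(-2, 2, length.out = 8)      # eight items, from easy to hard
prob <- plogis(outer(ability, difficulty, "-"))            # response probabilities
resp <- matrix(rbinom(length(prob), 1, prob), nrow = 100)  # simulated 0/1 responses
image(t(resp), col = c("red", "blue"),        # blue = endorsed, red = not endorsed
      xlab = "items (easy to hard)", ylab = "respondents (low to high)")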

To leave a comment for the author, please follow the link and comment on his blog: Engaging Market Research.


Mapping Seattle Crime


(This article was first published on SHARP SIGHT LABS » r-bloggers, and kindly contributed to R-bloggers)

seattle_crime_map_2010-2014_ggplot2_590x670

Last week I published a data visualization of San Francisco crime.

This week, I’m mapping Seattle crime data.

The map above is moderately complicated to create, so I’ll start this tutorial with a simpler case: the dot distribution map.

Seattle crime map, simplified version

First, we’ll start by loading the data.

Note that I already “cleaned” this dataset (mostly removing extraneous variables, data prior to 2010, etc.).

library(ggmap)
library(dplyr)
library(ggplot2)

#########################
# GET SEATTLE CRIME DATA
#########################

download.file("http://www.sharpsightlabs.com/wp-content/uploads/2015/01/seattle_crime_2010_to_2014_REDUCED.txt.zip", destfile="seattle_crime_2010_to_2014_REDUCED.txt.zip")

#------------------------------
# Unzip the Seattle crime data file
#------------------------------
unzip("seattle_crime_2010_to_2014_REDUCED.txt.zip")

#------------------------------------
# Read crime data into an R dataframe
#------------------------------------
df.seattle_crime <- read.csv("seattle_crime_2010_to_2014_REDUCED.txt")

 

Get map of Seattle using ggmap package

Next, we’ll get a map of Seattle using qmap().

qmap() is a function from the ggmap package. Basically, it pings Google Maps and creates a map that you can use for a geospatial context layer. (It can also retrieve related maps made by Stamen, CloudMade, or OpenStreetMap.)


################
# SEATTLE GGMAP
################

map.seattle_city <- qmap("seattle", zoom = 11, source="stamen", maptype="toner",darken = c(.3,"#BBBBBB"))
map.seattle_city

 
Here, we’re using qmap() as follows:

We’re calling it with “seattle” as the first argument. That does exactly what you think it does. It tells qmap() that we want a map of Seattle. The qmap() function understands city names, so you can ask for “chicago,” “san francisco,” etc. Play with it a little!

We’re also setting a “zoom” parameter. Again, play with that number and see what happens. Currently, we’re setting zoom to 11. To be clear, you can use zoom to zoom in or zoom out on the specified location. In this case, we’re zooming in on the center of Seattle, and if we zoom in too much, we’ll omit parts of the city. For our purposes, a zoom of 11 is ideal.

The maptype= parameter has been set to “toner”. The “toner” maptype is basically a black and white map. (Note that there are other maptypes, such as “satellite,” and “hybrid.” Try those out and see what happens.)

On top of that, you’ll note that I’m using a parameter called “darken.” Effectively, I’m using darken to add a color on top of the map (the hexadecimal color “#BBBBBB”). I’ve done this to subtly change the map color from pure black and white to shades of grey.
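For example, here are a couple of variations on the same call to experiment with (a zoomed-in view of Seattle, and a different city with the default map type):

# Zoom in closer on Seattle with the same toner basemap
qmap("seattle", zoom = 13, source = "stamen", maptype = "toner")
# A different city, default map type
qmap("chicago", zoom = 11)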

Next, we’ll plot.

Make basic dot distribution map


##########################
# CREATE BASIC MAP
#  - dot distribution map
##########################
map.seattle_city +
  geom_point(data=df.seattle_crime, aes(x=Longitude, y=Latitude))

seattle_crime_basic-dot-distribution-map_2010-2014_ggplot2_500x409
 
This map is a little ugly, but it’s instructive to examine what we’re doing in the code.

Notice that the syntax is almost the same as the syntax for the basic scatterplot. In some sense, this is a scatterplot.

As proof, let’s create a scatterplot using the same dataset. Simply replace the map.seattle_city code with ggplot().

#####################
# CREATE SCATTERPLOT
#####################
ggplot() +
  geom_point(data=df.seattle_crime, aes(x=Longitude, y=Latitude))

seattle_crime_basic-scatterplot_2010-2014_ggplot2_275x409
 
This is the exact same data and the same variable mapping. We’ve just removed the map.seattle_city context layer. Now, it’s just a basic scatterplot.

That’s part of the reason I wanted to write up this tutorial. I’ve emphasized earlier that you should master the basic charts like the scatterplot. One reason I emphasize the basics is because the basic charts serve as foundations for more complicated charts.

In this case, the scatterplot is the foundation for the dot distribution map.

Ok. Now, let’s go back to our map. You might have noticed that the data is really “dense.” All of the points are on top of each other. We call this “overplotting.” We’re going to modify our point geoms to deal with this overplotting.

Adjust point transparency to deal with overplotting


#############################
# ADD TRANSPARENCY and COLOR
#############################

map.seattle_city +
  geom_point(data=df.seattle_crime, aes(x=Longitude, y=Latitude), color="dark green", alpha=.03, size=1.1)

seattle_crime_basic-dot-distribution-map_GREEN_2010-2014_ggplot2_500x409
 
Notice that we made some modifications within geom_point().

We added color to make it a little more interesting.

But more importantly, we modified two parameters: alpha= and size=.

The size= parameter obviously modifies the size of the point.

alpha modifies the transparency. In this case, we’re making the points highly transparent so we can better see areas of Seattle with high levels of crime. We’re manipulating alpha levels to deal with overplotting.

To be clear, there are other solutions for dealing with overplotting. This isn’t necessarily the best solution, but early in learning data science, this will be one of the simplest to implement.
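For example, one common alternative (a sketch, not part of the original tutorial) is to bin the points into 2D tiles and map the count of incidents to the fill colour:

# Bin crime locations into tiles; brighter tiles mean more incidents
map.seattle_city +
  stat_bin2d(data = df.seattle_crime,
             aes(x = Longitude, y = Latitude),
             bins = 60, alpha = 0.6)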

Wrapping up

The above tutorial shows you how to make a basic dot distribution map using R’s ggplot2 and ggmap.

Note a few things:

  1. We’re building on foundational techniques. In this case, we’ve made a dot distribution map, which is just a modified scatterplot.
  2. We built this plot iteratively. We started with the base map, then added points, and then modified those points.

It bears repeating that you should master the basics like the scatterplot, line, histogram, and bar chart. Also practice designing data visualizations iteratively. When you can do these things, you’ll be able to progress to more sophisticated visualization techniques.

Finally, if you want to replicate the map at the beginning of the post, here’s the code:


#################################
# TILED version 
#  tile border mapped to density
#################################
map.seattle_city +
  stat_density2d(data=df.seattle_crime, aes(x=Longitude
                                            , y=Latitude
                                            ,color=..density..
                                            ,size=ifelse(..density..<=1,0,..density..)
                                            ,alpha=..density..)
                 ,geom="tile",contour=F) +
  scale_color_continuous(low="orange", high="red", guide = "none") +
  scale_size_continuous(range = c(0, 3), guide = "none") +
  scale_alpha(range = c(0,.5), guide="none") +
  ggtitle("Seattle Crime") +
  theme(plot.title = element_text(family="Trebuchet MS", size=36, face="bold", hjust=0, color="#777777")) 

seattle_crime_map_2010-2014_ggplot2_590x670
 
If you look carefully, you’ll notice that the code has quite a few similarities to the basic dot distribution map. (Again: master the basics, and you’ll start to understand what’s going on here.)

The post Mapping Seattle Crime appeared first on SHARP SIGHT LABS.

To leave a comment for the author, please follow the link and comment on his blog: SHARP SIGHT LABS » r-bloggers.


Building a choropleth map of Italy using mapIT


(This article was first published on MilanoR, and kindly contributed to R-bloggers)

In the R environment, several different packages for drawing maps are available. I have lost count by now; surely, sp and ggmap deserve consideration. Despite the great availability of R functions dedicated to this topic, in the past, when I needed to draw a very basic map of Italy with regions marked in different colours (namely a choropleth map), I had some difficulties.

My expectation was that building a choropleth map of Italy using R would be an extremely trivial procedure, but my experience was different. In fact, if the aim is to represent a map of the United States, most of the available functions are very easy to use. However, to draw a map of Italy, the procedure becomes a bit complicated compared to the simplicity of the chart (a good tutorial – in Italian – can be found here).

I wasn’t the only R user to have this problem. Some time ago, in the community Statistica@Ning, Lorenzo di Blasio proposed a good solution using ggplot2. Summarizing the code proposed by Lorenzo, I assembled a first function capable of creating a map in an easy and fast way. Finally, Nicola Sturaro of the MilanoR group strongly improved and completed the code and created a new package: mapIT.

Currently, the package mapIT is hosted in a repository on GitHub. To install the package, you can use the devtools package:

library(devtools)
install_github("nicolasturaro/mapIT")

In my first use of mapIT, I had to map the number of wineries considered in a study of Italian wine evaluations. I needed to visualize, for each region, the number of wineries whose wines were reviewed. The data are shown in the following code; for each Italian region (first column) the number of wineries (second column) is reported.

wine <- data.frame(
    Region = c("Abruzzo","Basilicata","Calabria","Campania",
               "Emilia-Romagna","Friuli-Venezia Giulia","Lazio",
               "Liguria","Lombardia","Marche","Molise","Piemonte",
               "Puglia","Sardegna","Sicilia","Toscana",
               "Trentino-Alto Adige","Umbria","Valle d'Aosta","Veneto"),
    Wineries = c(22,8,9,35,24,74,19,8,41,29,5,191,22,14,40,173,57,29,6,92)
 )

The names of regions can be written either in lowercase or in uppercase. Spaces and other non-alphabetical characters will be ignored, so you can write ‘Trentino-Alto Adige’, ‘Trentino Alto Adige’ or ‘TrentinoAltoAdige’ interchangeably. For regions with a bilingual denomination, only the Italian wording is accepted.
To build the map, the package mapIT provides the namesake function mapIT(). The first argument to pass to the function is the numeric variable (Wineries) and the second one is the variable specifying the Italian region (Region). A third argument can be used to specify the data frame from which to extract the variables.
Further, there are some additional arguments useful for modifying the graphic style. In the following example I used guide.label, which specifies the title label for the legend.

library(mapIT)
mapIT(Wineries, Region, data=wine, guide.label="Number of\nwineries")

mapIT - choropleth map of Italy in blue

Easy, right? It was enough to load the package and launch a single short line of code!
The chart can be customized in several ways. The main argument for altering the graphic details is graphPar, consisting of a long list of arguments (for details, see the function’s help page).
One of the first things we will surely want to do is alter the colours. To do so, you must specify, in the graphPar list, the colours for the minimum value (low) and for the maximum value (high):

gp <- list(low="#fff0f0", high="red3")

For convenience I saved the list in the object gp. Note that colours can be specified using either hexadecimal codes or the R keywords for colours.

mapIT(Wineries, Region, data=wine,
      guide.label="Number of\nwineries", graphPar=gp)

mapIT - choropleth map of Italy in red

You can play with colours to find your preferred arrangement. To identify the hexadecimal code for a colour, a fast solution is to use a web application such as an RGB color picker.
The low and high values of graphPar can also be used to convert the chart to black and white. In this case, to make the chart a bit more pleasant, it is possible to use the themes of ggplot2. In the examples below, the first map (left panel) was built using the theme theme_bw, while the second map (right panel) was built using the theme theme_grey.

library(ggplot2)

# Theme: black and white
gp <- list(low="white", high="gray20", theme=theme_bw())
mapIT(Wineries, Region, data=wine,
      guide.label="Number of\nwineries", graphPar=gp)

# Theme: grey
gp <- list(low="white", high="gray20", theme=theme_grey())
mapIT(Wineries, Region, data=wine,
      guide.label="Number of\nwineries", graphPar=gp)

mapIT - choropleth map of Italy in black and white

There are still various features to implement and, in the future, some things may change. If you have ideas to improve mapIT, or you find a malfunction, you can open an issue on GitHub.

To leave a comment for the author, please follow the link and comment on his blog: MilanoR.


Rediscovering Formula One Race Battlemaps


(This article was first published on OUseful.Info, the blog... » Rstats, and kindly contributed to R-bloggers)

A couple of days ago, I posted a recipe on the F1DataJunkie blog that described how to calculate track position from laptime data.

Using that information, as well as additional derived columns such as the identity of, and time to, the cars immediately ahead of and behind a particular selected driver, both in terms of track position and race position, I revisited a chart type I first started exploring several years ago – race battle charts.
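To give a flavour of what those derived columns might look like, here is a sketch using dplyr; the column names (code, lap, stime for the lap time in seconds) are assumptions for illustration, not the actual schema used in the recipe.

library(dplyr)
# Accumulated race time, track position at the end of each lap,
# and the gap to the car immediately ahead on track
lapTimes <- lapTimes %>%
  arrange(code, lap) %>%
  group_by(code) %>%
  mutate(acctime = cumsum(stime)) %>%
  ungroup() %>%
  arrange(lap, acctime) %>%
  group_by(lap) %>%
  mutate(trackpos = row_number(),
         gap_ahead = acctime - lag(acctime)) %>%
  ungroup()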

The main idea behind the battlemaps is that they can help us search for stories amidst the runners.

dirattr=function(attr,dir='ahead') paste(attr,dir,sep='')

#We shall find it convenient later on to split out the initial data selection
battlemap_df_driverCode=function(driverCode){
  lapTimes[lapTimes['code']==driverCode,]
}

battlemap_core_chart=function(df,g,dir='ahead'){
  car_X=dirattr('car_',dir)
  code_X=dirattr('code_',dir)
  factor_X=paste('factor(position_',dir,'<position)',sep='')
  code_race_X=dirattr('code_race',dir)
  if (dir=='ahead') diff_X='diff' else diff_X='chasediff'

  if (dir=='ahead') drs=1000 else drs=-1000
  g=g+geom_hline(aes_string(yintercept=drs),linetype=5,col='grey')

  #Plot the offlap cars that aren't directly being raced
  g=g+geom_text(data=df[df[dirattr('code_',dir)]!=df[dirattr('code_race',dir)],],
                aes_string(x='lap',
                           y=car_X,
                           label=code_X,
                           col=factor_X),
                angle=45,size=2)
  #Plot the cars being raced directly
  g=g+geom_text(data=df,
                aes_string(x='lap',
                           y=diff_X,
                           label=code_race_X),
                angle=45,size=2)
  g=g+scale_color_discrete(labels=c("Behind","Ahead"))
  g+guides(col=guide_legend(title="Intervening car"))
}

battle_WEB=battlemap_df_driverCode("WEB")
g=battlemap_core_chart(battle_WEB,ggplot(),'ahead')
battlemap_core_chart(battle_WEB,g,dir='behind')

In this first sketch, from the 2012 Australian Grand Prix, I show the battlemap for Mark Webber:

battlemaps-unnamed-chunk-12-1

We see how at the start of the race Webber kept pace with Alonso, albeit around about a second behind, at the same time as he drew away from Massa. In the last third of the race, he was closely battling with Hamilton whilst drawing away from Alonso. Coloured labels are used to highlight cars on a different lap (either ahead (aqua) or behind (orange)) that are in a track position between the selected driver and the car one place ahead or behind in terms of race position (the black labels). The y-axis is the time delta in milliseconds between the selected car and cars ahead (y > 0) or behind (y < 0). A dashed line at the +/- one second mark identifies cars within DRS range.

As well as charting the battles in the vicinity of a particular driver, we can also chart the battle in the context of a particular race position. We can reuse the chart elements and simply need to redefine the filtered dataset we are charting.

For example, if we filter the dataset to just get the data for the car in third position at the end of each lap, we can then generate a battle map of this data.
battlemap_df_position=function(position){
lapTimes[lapTimes['position']==position,]
}

battleForThird=battlemap_df_position(3)

g=battlemap_core_chart(battleForThird,ggplot(),dir='behind')+xlab(NULL)+theme_bw()
g=battlemap_core_chart(battleForThird,g,'ahead')+guides(col=FALSE)
g

battlemaps-postionbattles-1

For more details, see the original version of the battlemap chapter. For updates to the chapter, I recommend that you invest in a copy of the Wrangling F1 Data With R book if you haven’t already done so :-)


To leave a comment for the author, please follow the link and comment on his blog: OUseful.Info, the blog... » Rstats.


The leaflet package for online mapping in R


(This article was first published on Robin Lovelace - R, and kindly contributed to R-bloggers)

It has been possible for some years to launch a web map from within R. A number of packages for doing this are available, including:

  • RgoogleMaps, an interface to the Google Maps api
  • leafletR, an early package for creating Leaflet maps with R
  • rCharts, which can be used as a basis for webmaps

In this tutorial we use the new RStudio-supported leaflet R package. We use this package, an R interface to the JavaScript mapping library of the same name, because:

  • leaflet is supported by RStudio, who have a strong track record of creating amazing R packages
  • leaflet appears to provide the simplest, fastest way to host interactive maps online in R, requiring only 2 lines of code for one web map! (as you’ll see below)
  • leaflet is shiny. Shiny in the literal sense of the word (a new and fresh approach to web mapping in R) but also in the sense that it works well with the R package shiny.

The best tutorial resource I have found on leaflet is this vignette by Joe Cheng and Yihui Xie. Building on their excellent description, this article explains some of the fundamentals of the package.

Installing leaflet

Because leaflet is new, it’s not yet on CRAN. Even when it does appear, installing from GitHub may be a good idea to ensure you have access to the latest features and bug fixes. Here’s how:

# Install the leaflet package from GitHub (install_github() comes from devtools)
if(!require(leaflet)) devtools::install_github("rstudio/leaflet")

A first web map with leaflet

To create an interactive web map with leaflet is incredibly easy. Having installed the package try this single line of code:

# Plot a default web map (brackets display the result)
(m <- leaflet() %>% addTiles())
# The two lines below just embed a static screenshot of the map in this post
library(png)   # for readPNG()
library(grid)  # for grid.raster()
img <- readPNG("~/repos/Creating-maps-in-R/figure//shiny_world.png")
grid.raster(img)

Basic leaflet map in R

Adding basic features with %>%

Adding basic features to your webmap is easy. The %>% ‘pipe’ operator, used extensively in dplyr and developed for the magrittr package, means we can finally escape from dozens of nested bracketed commands. (If you use RStudio, I suggest trying the new shortcut Ctrl+Shift+M to produce this wonderful operator.) leaflet settings and functionality can thus be added sequentially, without requiring any additional brackets!

To add a location to the map m, for example, we can simply pipe m into the function setView():

m %>% setView(lng = -1.5, lat = 53.4, zoom = 10) # set centre and extent of map

This way we can gradually add elements to our map, one-by-one:

(m2 <- m %>%
  setView(-1.5, 53.4, 10) %>% # map location
  addMarkers(-1.4, 53.5) %>% # add a marker
  addPopups(-1.6, 53.3, popup = "Hello Sheffield!") %>% # popup
  # add some circles:
  addCircles(color = "black", runif(90, -2, -1), runif(90, 53, 54), runif(90, 10, 500)))

Adding data

In the previous example, we added some random data that we created locally, inside the function call. How do we use real, large datasets in leaflet? The package provides 3 options:

  • Data from base R: lat/long matrix or data.frame
  • Data from sp such as SpatialPoints and SpatialPolygons
  • Data from maps

Let’s try adding a bicycle route, one that I followed with some friends to move from Sheffield to my current home of Leeds. First download some data:

url = "https://github.com/Robinlovelace/sdvwR/raw/master/data/gps-trace.gpx"
download.file(url, destfile = "shef2leeds.gpx", method = "wget")

Now we can load this as a SpatialLinesDataFrame and display in leaflet:

library(rgdal)
shef2leeds <- readOGR("shef2leeds.gpx", layer = "tracks")
m2 %>%
  setView(-1.5, 53.4, 9) %>% # map location
  addPolylines(data = shef2leeds, color = "red", weight = 4)

Note in the above example that we had to use the argument data = to refer to our spatial object: it cannot simply be inserted without specifying what it is. The data argument can also be placed inside the initial leaflet() function call.
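For example, the same polyline layer can be drawn with the data supplied up front to leaflet():

# Equivalent map with the data passed to leaflet() rather than addPolylines()
leaflet(data = shef2leeds) %>%
  addTiles() %>%
  setView(-1.5, 53.4, 9) %>%
  addPolylines(color = "red", weight = 4)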

That was quite a painless process that would require many more lines of code if you were to use JavaScript directly. But not as painless as the bicycle trip itself, which involved even fewer lines of code: 0! This can be seen in the following video.

Shiny integration

leaflet is developed by the same team who develop shiny so the two are well integrated. Below is a very simple example, modified slightly from the package’s vignette:

library(shiny)
shinyApp(
  ui = fluidPage(leafletOutput('myMap')),
  server = function(input, output) {
    
    # download and load data
    url = "https://github.com/Robinlovelace/sdvwR/raw/master/data/gps-trace.gpx"
    download.file(url, destfile = "shef2leeds.gpx", method = "wget")
    library(rgdal)
    shef2leeds <- readOGR("shef2leeds.gpx", layer = "tracks")
    
    map = leaflet() %>% addTiles() %>% setView(-1.5, 53.4, 9) %>% 
      addPolylines(data = shef2leeds, color = "red", weight = 4)
    output$myMap = renderLeaflet(map)
  }
)

Applications

Clearly leaflet is a powerful and flexible R package. If I were to offer one critique, it would be that I could find no easy way to allow the user to query the data objects loaded. plotly, for example, provides a description of any visual object the user clicks on. The datashine commuter visualisation, for example, allows users to click on any point, resulting in a burst of lines emanating from it. This would also be possible in leaflet/shiny, but the best implementation is not immediately clear, to me at least!

The wider context of this article is the pressing need for better transport planning decision making, to enable a transition away from fossil fuels. To this end, the ‘propensity to cycle’ project, funded by the UK’s Department for Transport, seeks to create an interactive tool to identify where new bicycle paths are most needed. There are clearly many other uses for R’s leaflet package: what will you use it for? Let me know at @robinlovelace.

To leave a comment for the author, please follow the link and comment on his blog: Robin Lovelace - R.


KEGG enrichment analysis with latest online data using clusterProfiler


(This article was first published on YGC » R, and kindly contributed to R-bloggers)

KEGG.db has not been updated since 2012. The data is now pretty old, but many Bioconductor packages still use it for KEGG annotation and enrichment analysis.

As pointed out in 'Are there too many biological databases', there is a problem that many out-of-date biological databases never go offline. The same issue exists for web servers and software that use out-of-date data. For example, the WEGO web server stopped updating its GO annotation data in 2009, yet WEGO is still online with many people using it. The biological story may change completely when a recently updated dataset is used. We should keep an eye on this issue.

The enrichKEGG function is now reloaded with a new parameter, use.KEGG.db. This parameter is set to FALSE by default, so enrichKEGG will download the latest KEGG data for the enrichment analysis. If use.KEGG.db is explicitly set to TRUE, enrichKEGG will use KEGG.db, which is still supported but not recommended.
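
For example (a minimal sketch, assuming de is a vector of gene IDs as in Example 1 below):

kk.online  <- enrichKEGG(de, organism="hsa")                    # downloads the latest KEGG data (default)
kk.offline <- enrichKEGG(de, organism="hsa", use.KEGG.db=TRUE)  # falls back to the outdated KEGG.db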

With this new feature, the number of supported species is unlimited, as long as KEGG annotations are available in the KEGG database. You can access the full list of species supported by KEGG via: http://www.genome.jp/kegg/catalog/org_list.html

The organism parameter in enrichKEGG should now be the abbreviation of the scientific name, for example 'hsa' for human and 'mmu' for mouse. It accepts any species listed at http://www.genome.jp/kegg/catalog/org_list.html.

In the current release version of clusterProfiler (in Bioconductor 3.0), enrichKEGG supports about 20 species, and the organism parameter accepts the common name of the species, for instance "human" and "mouse". For these previously supported species the common name is still accepted, so your existing scripts will keep working with the new version of clusterProfiler. For other species the common name is not supported, since I don't want to maintain such a long mapping list: many species have no common name available, and it may also introduce unexpected bugs.
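
In other words, for a previously supported species both spellings should give the same result, while newly supported species must use the KEGG code (a hedged illustration, reusing the gene list from Example 1):

kk1 <- enrichKEGG(de, organism="hsa")    # KEGG organism code, works for every species
kk2 <- enrichKEGG(de, organism="human")  # common name, kept only for previously supported species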

Example 1: Using online KEGG annotation

library(DOSE)
data(geneList)
de <- names(geneList)[geneList > 1]
 
library(clusterProfiler)
kk <- enrichKEGG(de, organism="hsa", pvalueCutoff=0.05, pAdjustMethod="BH", 
                 qvalueCutoff=0.1, readable=TRUE)
head(summary(kk))
> head(summary(kk))
               ID                            Description GeneRatio  BgRatio
hsa04110 hsa04110                             Cell cycle    31/247 124/6861
hsa03030 hsa03030                        DNA replication     9/247  36/6861
hsa04060 hsa04060 Cytokine-cytokine receptor interaction    25/247 265/6861
hsa04114 hsa04114                         Oocyte meiosis    14/247 113/6861
hsa04115 hsa04115                  p53 signaling pathway    10/247  68/6861
hsa04062 hsa04062            Chemokine signaling pathway    18/247 189/6861
               pvalue     p.adjust       qvalue
hsa04110 2.280256e-18 9.349050e-17 5.280593e-17
hsa03030 3.527197e-06 7.230753e-05 4.084123e-05
hsa04060 8.404037e-06 1.148552e-04 6.487326e-05
hsa04114 4.827484e-05 4.948171e-04 2.794859e-04
hsa04115 1.406946e-04 9.801620e-04 5.536217e-04
hsa04062 1.434383e-04 9.801620e-04 5.536217e-04
                                                                                                                                                                              geneID
hsa04110 CDC45/CDC20/CCNB2/CCNA2/CDK1/MAD2L1/TTK/CHEK1/CCNB1/MCM5/PTTG1/MCM2/CDC25A/CDC6/PLK1/BUB1B/ESPL1/CCNE1/ORC6/ORC1/CCNE2/MCM6/MCM4/DBF4/SKP2/CDC25B/BUB1/MYC/PCNA/E2F1/CDKN2A
hsa03030                                                                                                                            MCM5/MCM2/MCM6/MCM4/FEN1/RFC4/PCNA/RNASEH2A/DNA2
hsa04060                           CXCL10/CXCL13/CXCL11/CXCL9/CCL18/IL1R2/CCL8/CXCL3/CCL20/IL12RB2/CXCL8/TNFRSF11A/CCL5/CXCR6/IL2RA/CCR1/CCL2/IL2RG/CCL4/CCR8/CCR7/GDF5/IL24/LTB/IL7
hsa04114                                                                                          CDC20/CCNB2/CDK1/MAD2L1/CALML5/AURKA/CCNB1/PTTG1/PLK1/ESPL1/CCNE1/CCNE2/BUB1/FBXO5
hsa04115                                                                                                               CCNB2/RRM2/CDK1/CHEK1/CCNB1/GTSE1/CCNE1/CCNE2/SERPINB5/CDKN2A
hsa04062                                                                       CXCL10/CXCL13/CXCL11/CXCL9/CCL18/CCL8/CXCL3/CCL20/CXCL8/CCL5/CXCR6/CCR1/STAT1/CCL2/CCL4/HCK/CCR8/CCR7
         Count
hsa04110    31
hsa03030     9
hsa04060    25
hsa04114    14
hsa04115    10
hsa04062    18
>

In KEGG.db, only 5894 human genes are annotated. With the current online data, the number of annotated genes increases to 6861 as shown above, and of course the p-values change.

Users should pay attention to another issue: the readable parameter is only available for species that have an annotation db. For example, for human we use org.Hs.eg.db to map gene IDs to symbols.

Example 2: enrichment analysis of species which are not previously supported

Here, I use a gene list of Streptococcus pneumoniae D39 to demonstrate using enrichKEGG with a species that was not previously supported.

> gene
 [1] "SPD_0065" "SPD_0071" "SPD_0293" "SPD_0295" "SPD_0296" "SPD_0297"
 [7] "SPD_0327" "SPD_0426" "SPD_0427" "SPD_0428" "SPD_0559" "SPD_0560"
[13] "SPD_0561" "SPD_0562" "SPD_0580" "SPD_0789" "SPD_1046" "SPD_1047"
[19] "SPD_1048" "SPD_1050" "SPD_1051" "SPD_1052" "SPD_1053" "SPD_1057"
[25] "SPD_1326" "SPD_1432" "SPD_1534" "SPD_1582" "SPD_1612" "SPD_1613"
[31] "SPD_1633" "SPD_1634" "SPD_1648" "SPD_1678" "SPD_1919"
> spdKEGG = enrichKEGG(gene, organism="spd")
> summary(spdKEGG)
               ID                                 Description GeneRatio BgRatio
spd00052 spd00052                        Galactose metabolism     35/35  35/752
spd02060 spd02060             Phosphotransferase system (PTS)     12/35  47/752
spd01100 spd01100                          Metabolic pathways     28/35 341/752
spd00520 spd00520 Amino sugar and nucleotide sugar metabolism      9/35  43/752
               pvalue     p.adjust       qvalue
spd00052 4.961477e-61 2.480739e-60 5.222608e-61
spd02060 2.470177e-07 6.175443e-07 1.300093e-07
spd01100 1.958319e-05 3.263866e-05 6.871296e-06
spd00520 6.534975e-05 8.168718e-05 1.719730e-05
                                                                                                                                                                                                                                                                                                                             geneID
spd00052 SPD_0065/SPD_0071/SPD_0293/SPD_0295/SPD_0296/SPD_0297/SPD_0327/SPD_0426/SPD_0427/SPD_0428/SPD_0559/SPD_0560/SPD_0561/SPD_0562/SPD_0580/SPD_0789/SPD_1046/SPD_1047/SPD_1048/SPD_1050/SPD_1051/SPD_1052/SPD_1053/SPD_1057/SPD_1326/SPD_1432/SPD_1534/SPD_1582/SPD_1612/SPD_1613/SPD_1633/SPD_1634/SPD_1648/SPD_1678/SPD_1919
spd02060                                                                                                                                                                                                                SPD_0293/SPD_0295/SPD_0296/SPD_0297/SPD_0426/SPD_0428/SPD_0559/SPD_0560/SPD_0561/SPD_1047/SPD_1048/SPD_1057
spd01100                                                                SPD_0071/SPD_0426/SPD_0427/SPD_0428/SPD_0559/SPD_0560/SPD_0561/SPD_0562/SPD_0580/SPD_0789/SPD_1046/SPD_1047/SPD_1048/SPD_1050/SPD_1051/SPD_1052/SPD_1053/SPD_1057/SPD_1326/SPD_1432/SPD_1534/SPD_1582/SPD_1612/SPD_1613/SPD_1633/SPD_1634/SPD_1648/SPD_1919
spd00520                                                                                                                                                                                                                                           SPD_0580/SPD_1326/SPD_1432/SPD_1612/SPD_1613/SPD_1633/SPD_1634/SPD_1648/SPD_1919
         Count
spd00052    35
spd02060    12
spd01100    28
spd00520     9

Summary

To summarize, clusterProfiler supports downloading the latest KEGG annotation for enrichment analysis, and it supports all species that have KEGG annotation available in the KEGG database.

To install the devel version of clusterProfiler, start R and enter the following command:

install.packages(c("DOSE", "clusterProfiler"),
                repos="http://www.bioconductor.org/packages/devel/bioc")


To leave a comment for the author, please follow the link and comment on his blog: YGC » R.


Paris’s history, captured in its streets


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The following image by Mathieu Rajerison has been doing the rounds of French media recently. It shows the streets of Paris, color-coded by their compass direction. It's been featured in an article in Telerama magazine, and even on French TV channel LCI (skip ahead to 8:20 in the linked video, which also features an interview with Mathieu).

Paris streets

Mathieu used the R language and OpenStreetMap data to construct the image, which colorizes each street according to the compass direction it points. Orthogonal streets are colored the same, so regular grids appear as swathes of uniform color. A planned city like Chicago would appear as a largely monochrome grid, but Paris exhibits much more variation. (You can see many other cities in this DataPointed.net article.) As this article in the French edition of Slate explains, the very history of Paris itself is encapsulated in the colored segments. You can easily spot Napoleon's planned boulevards as they cut through the older medieval neighborhoods, and agglomerated villages like Montmartre appear as rainbow-hued nuggets.

Mathieu explains the process of creating the chart in a blog post written in English. He uses the maptools package to import the OpenStreetMap shapefile and to extract the orientations of the streets. A simple R function is used to select colors for the streets, and then the entire map is sampled to a grid with the spatstat package, before finally being exported as a TIFF by the raster package. The entire chart is created with just 31 lines of R code, which you can find at the link below.

Data and GIS tips: Streets of Paris Colored by Orientation
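
As a rough illustration of the idea behind the colouring (not Mathieu's actual code), the orientation of a street segment can be reduced to an angle modulo 90 degrees, so that orthogonal streets share a colour:

# hypothetical sketch: bearing of a segment, folded so orthogonal streets match
street_hue <- function(x1, y1, x2, y2) {
  ang <- atan2(y2 - y1, x2 - x1) * 180 / pi  # angle in degrees
  (ang %% 90) / 90                           # value in [0,1), identical for orthogonal segments
}
col <- hsv(h = street_hue(0, 0, 1, 1))       # pick a colour from the folded angle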

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


Turning R into a GIS – Mapping the weather in Germany


(This article was first published on Big Data Doctor » R, and kindly contributed to R-bloggers)

Nothing in the visualization world has gotten more attention than map-based insights, or in other words, simply plotting different KPIs on a map to allow for a playful discovery experience. I must admit, maps are cool: an awesome tool to "show off" and to visually gain some insights.

But let’s be also clear about the limitations of map based charts:

  • You can compare locations based on a KPI, but you cannot quantify the difference between them
  • Color is difficult to understand and often leads to misinterpretation (e.g. what's the meaning of red? More sales? Or worse results?).
  • Color gradients are also pretty challenging for the human eye.
  • Zooming in/out results in insights detailing down/aggregation, but it’s difficult to establish a quantification between different granularity levels.

Anyway, R can be really useful for creating high-quality maps… There are awesome packages like rMaps, where you have a set of controls available to make your maps interactive, RgoogleMaps, maptools, etc.

In this post I’m going to plot weather KPIs for over 8K different postal codes (Postleitzahl or PLZ) in Germany. I’m going to shade the different areas according to their values, as you would expect :)

We are going to follow these steps to visualize the temperature, the humidity and the snow fall for the entire German country:

  1. Preparation of the required assets (PLZ coordinates, polygon lines, weather API key, etc)
  2. Querying the weather API for each PLZ to retrieve the weather values
  3. Map creation and PLZ data frame merging with the obtained weather information
  4. Map display for the weather metrics and high-resolution picture saving

1- Assets preparation

We need to prepare a few assets… Everything freely accessible and just a mouse click away… Amazing, isn’t it?

  • The list of all PLZ with city name and the lat/long coordinates of a centroid (you can download this data from geonames)
  • The shapefiles for the PLZ to know how to draw them on a map (kindly made available out of the OpenStreetMaps at suche-postleitzahl.org)
  • A key for the weather API (you need to register at openweathermap.org, takes literally a second and they are not going to bother you with newsletters)

2-Downloading the weather data

Basically, it’s just a JSON call we perform for each PLZ, passing the lat/long coordinates to the OpenWeatherMap API endpoint. Each weather entry is then stored as a one-row data frame that we keep appending to the data frame holding all entries:

library(jsonlite)
#load the plz info you download from the geonames resource
plz.ort.de<-read.csv(file = "../plzgeo.csv")
weather.de<-NULL
for (i in 1:nrow(plz.ort.de))
{
  url<-paste0('http://api.openweathermap.org/data/2.5/weather?lat=',plz.ort.de[i,]$lat, '&lon=',plz.ort.de[i,]$lon,'&units=metric&APPID=PUT_YOUR_KEY_HERE')
  weather.entry<-jsonlite::fromJSON(url,simplifyMatrix = F,simplifyDataFrame = F,flatten = T)
  temperature<-weather.entry$main$temp
  humidity<-weather.entry$main$humidity
  wind.speed<-weather.entry$wind$speed
  wind.deg<-weather.entry$wind$deg
  snow<-weather.entry$snow$`3h`
  if (is.null(wind.speed)){ wind.speed<-NA}
  if (is.null(wind.deg)){ wind.deg<-NA}
  if (is.null(snow)){ snow<-NA}
  if (is.null(humidity)){ humidity<-NA}
  if (is.null(temperature)){ temperature<-NA}
  weather.de<-rbind(data.frame(plz=plz.ort.de[i,]$plz,temperature,humidity,wind.speed,wind.deg,snow),weather.de)  
#you might want to let the loop sleep briefly to give the API a breather, e.g. Sys.sleep(0.2)
}

3-Map creation and PLZ-weather data frames merging

We use rgdal for the required spatial transformations. In this case, we use EPSG:4839 for the German geography (see spTransform).

library(ggplot2)
library(rgdal)           # required for readOGR and spTransform
library(RColorBrewer)
 
setwd("[your_path]/plz-gebiete.shp/")
# read shapefile
map <- readOGR(dsn=".", layer="plz-gebiete")
map <- spTransform(map,CRS=CRS("+init=epsg:4839"))
map$region<-substr(map$plz, 1,1)
map.data <- data.frame(id=rownames(map@data), map@data)
map.data$cplz<- as.character(map.data$plz)
 
weather.de$cplz<- as.character(weather.de$plz)
#normalization to force all PLZs having 5 digits
weather.de$cplz<- ifelse(nchar(weather.de$cplz)<5, paste0("0",weather.de$cplz), weather.de$cplz)
map.data<-merge(map.data,weather.de,by=c("cplz"),all=T)
map.df   <- fortify(map)
map.df   <- merge(map.df,map.data,by="id", all=F)

4-Map display for the weather metrics and high-resolution picture saving

We just rely on the standard ggplot functionality to plot the weather metric we’d like to. To make it more readable, I facetted by region.

temperature<-ggplot(map.df, aes(x=long, y=lat, group=group))+
  geom_polygon(aes(fill=temperature))+
  facet_wrap(~region,scales = 'free') +
  geom_path(colour="lightgray", size=0.5)+
  scale_fill_gradient2(low ="blue", mid = "white", high = "green", 
                       midpoint = 0, space = "Lab", na.value = "lightgray", guide = "legend")+  theme(axis.text=element_blank())+
  theme(axis.text=element_text(size=12)) +
  theme(axis.title=element_text(size=14,face="bold")) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme(panel.background = element_rect(fill = 'white')) +
  theme(panel.grid.major = element_line( color="snow2")) 
 
ggsave("../plz-temperature-300.png",  width=22.5, height=18.25, dpi=300)

bavaria-temperature

5-(Bonus) Underlying map tiles

You probably feel like having a map as reference to see city names, roads, rivers and all that stuff in each PLZ. For that we can use ggmap, a really cool package for spatial visualization with Google Maps and OpenStreetMap.

library(plyr)
library(ggmap)
# reading the shapes
area <- readOGR(dsn=".", layer="plz-gebiete")
# using the normalized version with "0" for the later join
weather.de$plz<-weather.de$cplz
area.df <- data.frame(id=rownames(area@data), area@data)
# from factor to character
area.df$plz<-as.character(area.df$plz)
# merging weather and geographical information
area.extend<-merge(area.df, weather.de, by=c("plz"),all=F)
# building the fortified polygon data frame and joining it with the weather data
area.points <- fortify(area)
d <- join(area.points, area.extend, by="id")
# region extraction
d$region<-substr(d$plz, 1,1)
bavaria<-subset(d, region=="8")
# google map tiles request... location is where you want your map centered at
google.map <- get_map(location="Friedberg", zoom =8, maptype = "terrain", color = "bw", scale=4)
ggmap(google.map) +
  geom_polygon(data=bavaria, aes(x=long, y=lat, group=group, fill=temperature), colour=NA, alpha=0.5) +
  scale_fill_gradient2(low ="blue", mid = "yellow", high = "green", 
                       midpoint = 0, space = "Lab", na.value = "lightgray", guide = "legend")+  theme(axis.text=element_blank())+  
  labs(fill="") +
  theme_nothing(legend=TRUE)
 
ggsave("../plz-temperature-Bavaria.png",  width=22.5, height=18.25, dpi=300)

The results speak for themselves!
Temperature in Germany
temperature-Germany
Temperature in the area around Munich only
temperature-munich
Snow across Germany
snow-Germany

To leave a comment for the author, please follow the link and comment on his blog: Big Data Doctor » R.


MazamaSpatialUtils Package


(This article was first published on Working With Data » R, and kindly contributed to R-bloggers)

Mazama Science has just released its first package on CRAN — MazamaSpatialUtils. Here is the description:

A suite of conversion scripts to create internally standardized spatial polygons dataframes. Utility scripts use these datasets to return values such as country, state, timezone, watershed, etc. associated with a set of longitude/latitude pairs. (They also make cool maps.)

In this post we discuss the reasons for creating this package and provide examples of its use.

At Mazama Science we often work with data that is geo-located:

  • biological and chemical samples from streams
  • seismic sensor data
  • pollution monitoring data
  • output from gridded atmospheric models
  • forest management and geomorphology data
  • national and state demographic and economic data

Using the sp package, all of these types of data can be plotted on maps or combined with other geospatial data to ask more detailed, more interesting questions. For an introduction to spatial data in R see Working With Geospatial Data or Working With Geospatial Data (and ggplot2).

The long term goal of the MazamaSpatialUtils package is to make it easier for us to work with GIS shapefile data we discover on the web as we create a library of interesting spatial datasets for use in R. This first release of the package addresses three specific issues:

  1. creating a scalable system for working with spatial data
  2. normalizing identifiers in spatial data
  3. finding spatial information based on a set of locations

Creating a scalable system

Shapefiles with high resolution features are by nature quite large. Working with the tz_world timezones dataset we see that a single timezone polygon, ‘America/Godthab’, takes up 4.1 Mb of RAM because of the highly detailed outline of the Greenland coast.

America_Godthab

Spatial datasets are large and their conversion from shapefile to SpatialPolygonsDataFrame can be time consuming. In addition, there is little uniformity to the dataframe data found in these datasets. The MazamaSpatialUtils package addresses these issues in two ways:

  1. It provides a package state variable called SpatialDataDir which is used internally as the location for all spatial datasets.
  2. It defines a systematic process for converting shapefile data into .RData files with SpatialPolygonsDataFrames objects.

Spatial Data Directory

Users will want to maintain a directory where their .RData versions of spatial data reside. The package provides a setSpatialDataDir() function which sets a package state variable storing the location. Internally, getSpatialDataDir() is used whenever data need to be accessed. (Hat tip to Hadley Wickham’s description of Environments and package state.)

Systematic Process

The package comes with several convert~() functions that download, convert and normalize shapefile datasets available on the web. Version 0.1 of the package has five such scripts that walk through the same basic steps with minor differences depending on the needs of the source data. With these as examples, users should be able to create their own convert~() functions to process other spatial data. Once converted and normalized, each dataset will benefit from other package utility functions that depend on the consistent availability and naming of certain columns in the @data slot of each SpatialPolygonsDataFrame.
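
As a rough idea of what such a user-written converter might look like, here is a hypothetical sketch (not one of the package's own convert~() functions; the URL, layer name and source column are placeholders):

convertMyDataset <- function() {
  dataDir <- getSpatialDataDir()
  # download and unzip a shapefile into the spatial data directory (URL is a placeholder)
  # download.file("http://example.org/my_shapefile.zip", file.path(dataDir, "my_shapefile.zip"))
  # unzip(file.path(dataDir, "my_shapefile.zip"), exdir = dataDir)
  spdf <- rgdal::readOGR(dsn = dataDir, layer = "my_shapefile")
  # normalize identifiers to the package standard
  spdf@data$countryCode <- spdf@data$ISO2   # assuming the source column is called ISO2
  save(spdf, file = file.path(dataDir, "MyDataset.RData"))
}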

Normalizing identifiers

The great thing about working with the shapefile format is that it is the de facto standard format for spatial data. We LOVE standards! Many shapefiles, but not all, also use the ISO 3166-1 alpha-2 character encoding for identifying countries, hereafter called the countryCode. However, there seems to be no agreement at all about what to call this encoding. We have seen 'ISO', 'ISO2', 'country', 'CC' and many more. Even the ISOcodes package calls this column of identifiers 'Alpha_2' in one dataframe and 'Country' in another.

Of course there are many other spatial datasets that do not include a column with the countryCode. Sometimes it is because they use FIPS or ISO 3166-1 alpha-3 or some (non-standardized) version of the plain English name. Other times it is because the data are part of a national dataset and the country is assumed.

Wouldn’t it be nice if every spatial dataset you worked with was guaranteed to have a column named countryCode with the ISO 3166-1 alpha-2 encoding? We certainly think so!

The heart of spatial data ‘normalization’ in this package is the conversion of various spatial datasets into .RData files with guaranteed and uniformly named identifiers including at least:

  • countryCode – ISO 3166-1 alpha-2
  • stateCode – ISO 3166-2 alpha-2 (if appropriate)

Any datasets with timezone information will include:

  • timezone – Olson timezone

The uniformity of identifiers in the spatial datasets makes it easy to generate maps with data from any dataset that uses standard ISO codes for countries or states.

Location searches

Version 0.1 of this package is targeted to the following use case:

How can we determine the timezones associated with a set of locations?

Here is how we arrived at this question:

We are working with pollution monitoring data collected by sensors around the United States. Data are collected hourly and aggregated into a single annual dataset with a GMT time axis. So far so good. Not surprisingly, pollution levels show a strong diurnal signal so it is useful to identify measurements as being either during the daytime or nighttime. Luckily, the maptools package has a suite of ‘sun-methods’ for calculating the local sunrise and sunset if you provide a longitude, latitude and POSIXct object with the proper timezone.

Determining the timezone associated with a location is an inherently spatial question and can be addressed with a point-in-polygon query as enabled by the sp package. Once we enabled this functionality with a timezone dataset we realized that we could extract more metadata for our monitoring stations from other spatial datasets: country, state, watershed, legislative district, etc. etc. But we’re getting ahead of ourselves.

Here is an example demonstrating a search for Olson timezone identifiers:

library(MazamaSpatialUtils)

# Vector of lons and lats
lons <- seq(-120,-60,5)
lats <- seq(20,80,5)

# Get Olson timezone names
timezones <- getTimezone(lons, lats)
print(timezones)

[1] NA                   NA                   "America/Hermosillo"
 [4] "America/Denver"     "America/Chicago"    "America/Chicago"   
 [7] "America/Toronto"    "America/Toronto"    NA                  
[10] "America/Iqaluit"    "America/Iqaluit"    NA                  
[13] "America/Godthab"

Additional information is available by specifying allData=TRUE:

# Get all information in the dataset
timezoneDF <- getTimezone(lons, lats, allData=TRUE)
print(timezoneDF)

timezone UTC_offset UTC_DST_offset countryCode  longitude latitude
1                <NA>         NA             NA        <NA>         NA       NA
2                <NA>         NA             NA        <NA>         NA       NA
3  America/Hermosillo         -7             -7          MX -110.96667 29.06778
4      America/Denver         -7             -6          US -104.98417 39.74556
5     America/Chicago         -6             -5          US  -87.65000 41.86417
6     America/Chicago         -6             -5          US  -87.65000 41.86417
7     America/Toronto         -5             -4          CA  -79.38333 43.66083
8     America/Toronto         -5             -4          CA  -79.38333 43.66083
9                <NA>         NA             NA        <NA>         NA       NA
10    America/Iqaluit         -5             -4          CA  -68.46667 63.74556
11    America/Iqaluit         -5             -4          CA  -68.46667 63.74556
12               <NA>         NA             NA        <NA>         NA       NA
13    America/Godthab         -3             -2          GL  -51.73333 64.18639

 They also make cool maps

Using spatial data to create location-specific metadata can be very rewarding. But it doesn’t satisfy our ever-present craving for eye candy. As long as we have all of this spatial data at our fingertips, let’s do something fun.

library(MazamaSpatialUtils)
library(sp)                  # for spatial plotting

# Remove Antarctica
tz <- subset(SimpleTimezones, countryCode != 'AQ')

# Assign timezone polygons an index based on UTC_offset
colorIndices <- .bincode(tz@data$UTC_offset, breaks=seq(-12.5,12.5,1))

# Color timezones by UTC_offset
plot(tz, col=rainbow(25)[colorIndices])

 RainbowTimezones

 Optimization

Large spatial searches can be slow so our package does include two simplified datasets: SimpleCountries and SimpleTimezones. The existence of the SimpleCountries dataset combined with the promise that every dataset will have a countryCode can be used to pre-filter large datasets, improving the performance of spatial searches as in the following example:

library(MazamaSpatialUtils)
library(stringr)

# Specify the directory for spatial data
setSpatialDataDir('~/SpatialData')

# Install NaturalEarthAdm1 if not already installed
installSpatialData('NaturalEarthAdm1', adm=1)

# Load the data
loadSpatialData('NaturalEarthAdm1')

# Vector of random lons and lats
lons <- runif(1000,-180,180)
lats <- runif(1000,-90,90)

# Get country dataframe from SimpleCountries
countryDF <- getCountry(lons, lats, allData=TRUE)

# Determine which countries are involved (NA values indicate "over water")
CC <- unique(countryDF$countryCode[!is.na(countryDF$countryCode)])

# Get state dataframe
stateDF <- getState(lons, lats, countryCodes=CC , allData=TRUE)

# Create Country-State names
names <- paste(countryDF$countryName,'-',stateDF$stateName)

# Display names that aren't "NA - NA" = "over water"
names[names != 'NA - NA'][1:20]

[1] "Australia - Western Australia"  "Antarctica - Antarctica"       
 [3] "Argentina - Santa Fe"           "NA - Sakha (Yakutia)"          
 [5] "Iraq - Al-Anbar"                "Algeria - Bordj Bou Arréridj"  
 [7] "Antarctica - Antarctica"        "China - Hubei"                 
 [9] "Yemen - Amran"                  "Russia - Chita"                
[11] "Antarctica - Antarctica"        "Antarctica - Antarctica"       
[13] "Russia - Chita"                 "Russia - Omsk"                 
[15] "China - Xinjiang"               "China - Liaoning"              
[17] "Russia - Sakha (Yakutia)"       "Russia - Krasnoyarsk"          
[19] "Antarctica - Antarctica"        "Australia - Northern Territory"

Future Plans

Our plans for this package will depend upon project needs but we will certainly be adding convert~() functions for the administrative boundaries data from gadm.org and for some of the USGS water resources datasets.

We encourage interested parties to contribute convert~() functions for their own favorite spatial datasets. If they produce SpatialPolygonDataFrames that adhere to the package standards, we’ll include them in the next release.

Happy Mapping!

To leave a comment for the author, please follow the link and comment on his blog: Working With Data » R.


Eight New Ideas From Data Visualization Experts


(This article was first published on Plotly, and kindly contributed to R-bloggers)
This post summarizes and visualizes eight key ideas we’ve heard from data visualization experts. Check out our first Case Study to learn more about using Plotly Enterprise on-premise, on your servers. To get started on free online graphing like in this post, check out our tutorials.


Make Interactive Graphs




Pictures of graphs in PowerPoints, dashboards, and emails can be dull. Viewers get value when they can see data with their mouse, zoom, filter, drill down, and study graphs. Plotly uses D3.js so all your graphs are interactive. The graph below models the historical temperature record and associated temperature changes contributed by each factor.








Make Graphs With IPython Widgets In Plotly & Domino




Our friends at Domino wrote a blog post showing how you or a developer on your team can use Domino, Plotly’s APIs, and IPython Notebooks to add sliders, filters, and widgets to graphs and take exploration a new direction. See our tutorial to learn more.







Reproducible Research with Plotly & Overleaf




Institutional memory is crucial for research. But data is easily lost if it’s on a thumbdrive, in an email, or on a desktop. The team at Overleaf is enabling reproducible, online publication. You can import Plotly graphs into Overleaf papers and write together.







Plotly graphs are mobile-optimized, reproducible, downloadable, and web-based. Just add the URL in your presentation or Overleaf project to share it all. For examples, for the climate graph above:







Use Statistical Graphs




Graphing pros love using statistical graphs like histograms, heatmaps, 2D histograms, and box plots to investigate and explain data. Below, we’re showing a log axis plot, a boxplot, and a histogram. The numbers are Facebook users per country. Curious? We have tutorials.











Use 3D Graphs




Below see the prestige, education, and income of a few professions, sorted by gender. 3D graphing enables a whole new dimension of interactivity. The points are projected on the outside of the graph. Click and hold to flip the plot or toggle to zoom. Click here to visit the graph. Or take a 3D graphing tutorial.







Embed Interactive Graphs With Dashboards




In a fast-moving world, it’s crucial to get the most recent data. That’s why we make it easy to embed updating graphs in dashboards, like the temperature graph below of San Francisco and Montréal (live version here). See our tutorials on updating graphs, interactive dashboards or graphing from databases.




Customize Interactive Graphs With JavaScript




For further customizations, use our JavaScript API. You (or a developer on your team) can build custom controls that change anything about an embedded Plotly graph.





Embed Interactive Graphs With Shiny




If you are an R user, you can render and embed interactive ggplot2 graphs in Shiny with Plotly. See our tutorial.
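
A minimal sketch of that workflow, assuming the current plotly R package API (ggplotly, plotlyOutput and renderPlotly), which may differ from the interface available when this post was written:

library(shiny)
library(ggplot2)
library(plotly)
shinyApp(
  ui = fluidPage(plotlyOutput("p")),
  server = function(input, output) {
    gg <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
    output$p <- renderPlotly(ggplotly(gg))  # ggplot2 figure rendered as an interactive graph
  }
)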





If you liked what you read, please consider sharing. We’re at feedback at plot dot ly, and @plotlygraphs.

To leave a comment for the author, please follow the link and comment on his blog: Plotly.


A Talk and Course in NYC Next Week


(This article was first published on Blog - Applied Predictive Modeling, and kindly contributed to R-bloggers)

I'll be giving a talk on Tuesday, February 17 (7:00PM-9:00PM) that will be an overview of predictive modeling. It will not be highly technical and here is the current outline:

  • "Predictive modeling" definition
  • Some example applications
  • A short overview and example
  • How is this different from what statisticians already do?
  • What can drive choice of methodology?
  • Where should we focus our efforts?

The location is Thoughtworks NYC (99 Madison Avenue, 15th Floor).

The next day (Wednesday Feb 18th) I will be teaching Applied Predictive Modeling for the NYC Data Science Academy from 9:00am – 4:30pm at 205 E 42nd Street, New York, NY 10017. This will focus on R.

To leave a comment for the author, please follow the link and comment on his blog: Blog - Applied Predictive Modeling.


The United States In Two Words


(This article was first published on Ripples, and kindly contributed to R-bloggers)

Sweet home Alabama, Where the skies are so blue; Sweet home Alabama, Lord, I’m coming home to you (Sweet home Alabama, Lynyrd Skynyrd)

This is the second post I have written to show the abilities of the twitteR package, and also the second post I have written for KDnuggets. In this case my goal is to get an insight into what people tweet about American states. To do this, I look for tweets containing the exact phrase “[STATE NAME] is” for every state. Once I have the set of tweets for each state I do some simple text mining: cleaning, standardizing, removing empty words and crossing with these sentiment lexicons. Then I choose the two most common words to describe each state. You can read the original post here. This is the visualization I produced to show the result of the algorithm:

States In Two Words v2

Since the right side of the map is a little bit messy, the original post includes a table with the pair of words describing each state. This is just an experiment to show how to use and combine some interesting tools of R. If you don’t like what Twitter says about your state, don’t take it too seriously.

This is the code I wrote for this experiment:

# Do this if you have not registered your R app in Twitter
library(twitteR)
library(RCurl)
setwd("YOUR-WORKING-DIRECTORY-HERE")
if (!file.exists('cacert.pem'))
{
  download.file(url = 'http://curl.haxx.se/ca/cacert.pem', destfile='cacert.pem')
}
requestURL="https://api.twitter.com/oauth/request_token"
accessURL="https://api.twitter.com/oauth/access_token"
authURL="https://api.twitter.com/oauth/authorize"
consumerKey = "YOUR-CONSUMER_KEY-HERE"
consumerSecret = "YOUR-CONSUMER-SECRET-HERE"
Cred <- OAuthFactory$new(consumerKey=consumerKey,
                         consumerSecret=consumerSecret,
                         requestURL=requestURL,
                         accessURL=accessURL,
                         authURL=authURL)
Cred$handshake(cainfo=system.file("CurlSSL", "cacert.pem", package="RCurl"))
save(Cred, file="twitter authentification.Rdata")
# Start here if you have already your twitter authentification.Rdata file
library(twitteR)
library(RCurl)
library(XML)
load("twitter authentification.Rdata")
registerTwitterOAuth(Cred)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
#Read state names from wikipedia
webpage=getURL("http://simple.wikipedia.org/wiki/List_of_U.S._states")
table=readHTMLTable(webpage, which=1)
table=table[!(table$"State name" %in% c("Alaska", "Hawaii")), ]
#Extract tweets for each state
results=data.frame()
for (i in 1:nrow(table))
{
  tweets=searchTwitter(searchString=paste("\"", table$"State name"[i], " is\"", sep=""), n=200, lang="en")
  tweets.df=twListToDF(tweets)
  results=rbind(cbind(table$"State name"[i], tweets.df), results)
}
results=results[,c(1,2)]
colnames(results)=c("State", "Text")
library(tm)
#Lexicons
pos = scan('positive-words.txt',  what='character', comment.char=';')
neg = scan('negative-words.txt',  what='character', comment.char=';')
posneg=c(pos,neg)
results$Text=tolower(results$Text)
results$Text=gsub("[[:punct:]]", " ", results$Text)
# Extract most important words for each state
words=data.frame(Abbreviation=character(0), State=character(0), word1=character(0), word2=character(0), word3=character(0), word4=character(0))
for (i in 1:nrow(table))
{
  doc=subset(results, State==as.character(table$"State name"[i]))
  doc.vec=VectorSource(doc[,2])
  doc.corpus=Corpus(doc.vec)
  stopwords=c(stopwords("english"), tolower(unlist(strsplit(as.character(table$"State name"), " "))), "like")
  doc.corpus=tm_map(doc.corpus, removeWords, stopwords)
  TDM=TermDocumentMatrix(doc.corpus)
  TDM=TDM[Reduce(intersect, list(rownames(TDM),posneg)),]
  v=sort(rowSums(as.matrix(TDM)), decreasing=TRUE)
  words=rbind(words, data.frame(Abbreviation=as.character(table$"Abbreviation"[i]), State=as.character(table$"State name"[i]),
                                   word1=attr(head(v, 4),"names")[1],
                                   word2=attr(head(v, 4),"names")[2],
                                   word3=attr(head(v, 4),"names")[3],
                                   word4=attr(head(v, 4),"names")[4]))
}
# Visualization
require("sqldf")
statecoords=as.data.frame(cbind(x=state.center$x, y=state.center$y, abb=state.abb))
#To make names of right side readable
texts=sqldf("SELECT a.abb,
            CASE WHEN a.abb IN ('DE', 'NJ', 'RI', 'NH') THEN a.x+1.7
            WHEN a.abb IN ('CT', 'MA') THEN a.x-0.5  ELSE a.x END as x,
            CASE WHEN a.abb IN ('CT', 'VA', 'NY') THEN a.y-0.4 ELSE a.y END as y,
            b.word1, b.word2 FROM statecoords a INNER JOIN words b ON a.abb=b.Abbreviation")
texts$col=rgb(sample(0:150, nrow(texts)),sample(0:150, nrow(texts)),sample(0:150, nrow(texts)),max=255)
library(maps)
jpeg(filename = "States In Two Words v2.jpeg", width = 1200, height = 600, quality = 100)
map("state", interior = FALSE, col="gray40", fill=FALSE)
map("state", boundary = FALSE, col="gray", add = TRUE)
text(x=as.numeric(as.character(texts$x)), y=as.numeric(as.character(texts$y)), apply(texts[,4:5] , 1 , paste , collapse = "\n" ), cex=1, family="Humor Sans", col=texts$col)
dev.off()

To leave a comment for the author, please follow the link and comment on his blog: Ripples.


Making Maps in R with Ryan Peek and Michele Tobias


(This article was first published on Noam Ross - R, and kindly contributed to R-bloggers)

Today, Ryan Peek and Michele Tobias gave an introduction to making maps in R. Here’s the webcast:

(Pardon the little scuffle at the beginning and as we switched computers halfway through. Still getting the hang of hangouts.)

Resources:

  • Download all of Ryan’s code and HTML files here.
  • See Michele’s slides on Slideshare here.
  • Code for Michele’s example maps in her GitHub repository.

To leave a comment for the author, please follow the link and comment on his blog: Noam Ross - R.


using the httr package to retrieve data from apis in R


(This article was first published on numbr crunch - Blog, and kindly contributed to R-bloggers)
For a project I’m working on, I needed to access residential electricity rates and associated coordinate information (lat/long) for locations in the US. After a little searching, I found that data.gov offers the rate information in two forms: a static list of approximate rates by region and an API, which returns annual average utility rates ($/kWh) for residential, commercial, and industrial users. Most of my project work will take place in R, so I thought why not see how well APIs interact with it. I came across the “httr” package, which, for my purposes, worked extremely well.

For this tutorial, we are only going to look at the GET() command in httr. You can view the full list of functions in the httr package here. The GET() command will access the API, provide it some request parameters, and receive an output. For the Utility Rate API, the request parameters are api_key, address, lat, and lon. You can request the API Key from the previous link. Format (json or xml) is technically a request parameter as well. For the purposes of this tutorial we will be using json for all requests.
In the sample code above, you can see that there are two ways to use the GET() command. You can either append all the request parameters directly to the API url as shown in line 7 or you can use the query option in GET() and include a list of keys. I prefer the second option because it allows you to change individual key values more easily. In either case, the content() command will produce the same output. In my case, I wanted specific values from content(sample2). You can do this by adding the subscripts from the content(sample2) command output, see lines 14 and 15.
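
The gist embedded in the original post is not reproduced here; below is a minimal sketch of the two calling styles described above, assuming the data.gov/NREL utility rates endpoint, an API key, and the response field names (all assumptions, so check the API documentation for the exact URL and parameters):

library(httr)
base_url <- "https://developer.nrel.gov/api/utility_rates/v3.json"  # assumed endpoint
api_key  <- "YOUR_API_KEY_HERE"

# Option 1: append the request parameters directly to the URL
sample1 <- GET(paste0(base_url, "?api_key=", api_key, "&lat=35.45&lon=-82.98"))

# Option 2: pass the parameters as a query list (easier to change individual key values)
sample2 <- GET(base_url, query = list(api_key = api_key, lat = 35.45, lon = -82.98))

content(sample2)$outputs$residential  # drill into the parsed JSON; field names assumed from the API docs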

The above code gives me the necessary utility information and residential electricity rate by location. But I also want coordinate information, which is not available from this API. Luckily, the Google Maps JavaScript API does this pretty easily. For this API, the only required request parameter is address. You can also provide a key (API Key) if you would like to track the API requests being made.
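
A corresponding sketch for the geocoding request (assuming the public Google geocoding endpoint; the address and the response structure are illustrative):

geo <- GET("https://maps.googleapis.com/maps/api/geocode/json",
           query = list(address = "1600 Amphitheatre Parkway, Mountain View, CA"))
loc <- content(geo)$results[[1]]$geometry$location  # assumed structure: list(lat = ..., lng = ...)
loc$lat; loc$lng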
The above code works just like the code for the Utility Rate API; now I have the necessary coordinate information that I was looking for. To make things even easier, I combined both APIs into a single function which outputs all the information I need into a neat list, see below. The entire script is also available on my Github. Thanks to Hadley Wickham for creating the httr package and providing some useful scripts to start with.

To leave a comment for the author, please follow the link and comment on his blog: numbr crunch - Blog.


How Big Is The Vatican City?


(This article was first published on Ripples, and kindly contributed to R-bloggers)

Dici che il fiume trova la via al mare e come il fiume giungerai a me (Miss Sarajevo, U2)

One way to approximate the area of a place is to circumscribe it within a polygon whose area you know. After that, generate coordinates inside the polygon and count how many of them fall into the place. The proportion of coordinates inside the place, multiplied by the area of the polygon, is an approximation of the desired area.

I applied this technique to calculate the area of the Vatican City. I generated a square grid of coordinates around the Capella Sistina (located inside the Vatican City). To calculate the area I obtain the convex hull polygon of the coordinates using the chull function of the grDevices package. Then, I calculate the area of the polygon using the areaPolygon function of the geosphere package.

To determine how many coordinates of the grid fall inside the Vatican City, I use the revgeocode function of the ggmap package (I love this function). For me, a coordinate is inside the Vatican City if its reverse-geocoded address contains the words “Vatican City”.

What happens when generating a grid of 20×20 coordinates? I obtain that the area of the Vatican City is about 0.32 Km2, but according to Wikipedia the area is 0.44 Km2: this method underestimates the area by around 27%. But why? Look at this:

Vatican2

This plot shows which addresses of the grid fall inside the Vatican City (ones) and which do not (zeros). As you can see, there is a big zone in the south, and a smaller one in the north of the city, where reverse geocoding does not return “Vatican City” addresses.

Maybe Pope Francis should phone Larry Page and Sergey Brin to claim this 27% of his wonderful country.

I was willing to do this experiment since I wrote this post. This is the code:

require(geosphere)
require(ggmap)
require(plotGoogleMaps)
require(grDevices)
setwd("YOUR-WORKING-DIRECTORY-HERE")
#Coordinates of Capella Sistina
capella=geocode("capella sistina, Vatican City, Roma")
#20x20 grid of coordinates around the Capella
g=expand.grid(lon = seq(capella$lon-0.010, capella$lon+0.010, length.out=20),
lat = seq(capella$lat-0.005, capella$lat+0.005, length.out=20))
#Hull Polygon containing coordinates
p=g[c(chull(g), chull(g)[1]),]
#Address of each coordinate of grid
a=apply(g, 1, revgeocode)
#Estimated area of the vatican city
length(grep("Vatican City", a))/length(a)*areaPolygon(p)/1000/1000
s=cbind(g, a)
s$InOut=apply(s, 1, function(x) grepl('Vatican City', x[3]))+0
coordinates(s)=~lon+lat
proj4string(s)=CRS('+proj=longlat +datum=WGS84')
ic=iconlabels(s$InOut, height=12)
plotGoogleMaps(s, iconMarker=ic, mapTypeId="ROADMAP", legend=FALSE)

To leave a comment for the author, please follow the link and comment on his blog: Ripples.


John Snow, and OpenStreetMap


(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

While I was preparing a training course on data visualization, I wanted to get a nice visual for John Snow’s cholera dataset. This dataset can actually be found in a great package of famous historical datasets.

library(HistData)
data(Snow.deaths)
data(Snow.streets)

One can easily visualize the deaths, on a simplified map, with the streets (here simple grey segments, see Vincent Arel-Bundock’s post)

plot(Snow.deaths[,c("x","y")], col="red", pch=19, cex=.7,xlab="", ylab="", xlim=c(3,20), ylim=c(3,20))
slist <- split(Snow.streets[,c("x","y")],as.factor(Snow.streets[,"street"]))
invisible(lapply(slist, lines, col="grey"))

Of course, one might add isodensity curves (estimated using kernels)

require(KernSmooth)
kde2d <- bkde2D(Snow.deaths[,2:3], bandwidth=c(0.5,0.5))
contour(x=kde2d$x1, y=kde2d$x2,z=kde2d$fhat, add=TRUE)

Now, what if we want to visualize that dataset on a nice background, from Google Maps, or OpenStreetMaps? The problem here is that locations are in a weird coordinate representation system. So let us use a different dataset. For instance, on Robin Wilson’s blog, one can get datasets in a more traditional representation (here the epsg 27700). We can extract the dataset from

library(foreign)
deaths=read.dbf(".../Cholera_Deaths.dbf")

Then, we need our background,

library(OpenStreetMap)
map = openmap(c(lat= 51.516,   lon= -.141),
              c(lat= 51.511,   lon= -.133))
map=openproj(map, projection = "+init=epsg:27700") 
plot(map)
points(deaths@coords,col="red", pch=19, cex=.7 )

If we zoom in (the code above will be just fine), we get

And then, we can compute the density

X=deaths@coords
kde2d <- bkde2D(X, bandwidth=c(bw.ucv(X[,1]),bw.ucv(X[,2])))

based on the same function as before (here I use marginal cross-validation techniques to get optimal bandwidths). To get a nice gradient, we can use

clrs=colorRampPalette(c(rgb(0,0,1,0), rgb(0,0,1,1)), alpha = TRUE)(20)

and finally, we add it on the map

image(x=kde2d$x1, y=kde2d$x2,z=kde2d$fhat, add=TRUE,col=clrs)
contour(x=kde2d$x1, y=kde2d$x2,z=kde2d$fhat, add=TRUE)

To leave a comment for the author, please follow the link and comment on his blog: Freakonometrics » R-english.


John Snow, and Google Maps


(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

In my previous post, I discussed how to use OpenStreetMaps (and standard plotting functions of R) to visualize John Snow’s dataset. But it is also possible to use Google Maps (and ggplot2 types of graphs).

library(ggmap)
get_london <- get_map(c(-.137,51.513), zoom=17)
london <- ggmap(get_london)

Again, the tricky part comes from the fact that the coordinate representation system, here, is not the same as the one used on Robin Wilson’s blog.

library(foreign)
deaths=read.dbf(".../Cholera_Deaths.dbf")

So we have to change it

df_deaths=data.frame(deaths@coords)
library(sp)
library(rgdal)
coordinates(df_deaths)=~coords.x1+coords.x2
proj4string(df_deaths)=CRS("+init=epsg:27700") 
df_deaths = spTransform(df_deaths,CRS("+proj=longlat +datum=WGS84"))

Here, we have the same coordinate system as the one used in Google Maps. Now, we can add a layer, with the points,

london + geom_point(aes(x=coords.x1, y=coords.x2),data=data.frame(df_deaths@coords),col="red")

Again, it is possible to add the density, as an additional layer,

london + geom_point(aes(x=coords.x1, y=coords.x2), 
data=data.frame(df_deaths@coords),col="red")+
geom_density2d(data = data.frame(df_deaths@coords), 
aes(x = coords.x1, y=coords.x2), size = 0.3) + 
stat_density2d(data = data.frame(df_deaths@coords), 
aes(x = coords.x1, y=coords.x2,fill = ..level.., alpha = ..level..),size = 0.01, bins = 16, geom = "polygon") + scale_fill_gradient(low = "green", high = "red",guide = FALSE) + 
scale_alpha(range = c(0, 0.3), guide = FALSE)

 

To leave a comment for the author, please follow the link and comment on his blog: Freakonometrics » R-english.


Silhouettes


(This article was first published on Ripples, and kindly contributed to R-bloggers)

Romeo, Juliet, balcony in silhouette, makin o’s with her cigarette, it’s juliet (Flapper Girl, The Lumineers)

Two weeks ago I published this post, for which I designed two different visualizations. In the end, I decided to place the words on the map of the United States. The discarded visualization was this other one, where I place the words over the silhouette of each state:

States In Two Words v1

I do not want to set aside this chart because I really like it and also because I think it is a nice example of the possibilities one has when working with R.

Here you have the code. It substitutes the fragment of the code headed by “Visualization” of the original post:

library(ggplot2)
library(maps)
library(grid)       # for viewport, grid.text and gpar
library(gridExtra)
library(extrafont)
opt=theme(legend.position="none",
             panel.background = element_blank(),
             panel.grid = element_blank(),
             axis.ticks=element_blank(),
             axis.title=element_blank(),
             axis.text =element_blank(),
             plot.title = element_text(size = 28))
vplayout=function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
grid.newpage()
jpeg(filename = "States In Two Words.jpeg", width = 1200, height = 600, quality = 100)
pushViewport(viewport(layout = grid.layout(6, 8)))
for (i in 1:nrow(table))
{
  wd=subset(words, State==as.character(table$"State name"[i]))
  p=ggplot() + geom_polygon( data=subset(map_data("state"), region==tolower(table$"State name"[i])), aes(x=long, y=lat, group = group), colour="white", fill="gold", alpha=0.6, linetype=0 )+opt
  print(p, vp = vplayout(floor((i-1)/8)+1, i%%8+(i%%8==0)*8))
  txt=paste(as.character(table$"State name"[i]),"\n is", wd$word1,"\n and", wd$word2, sep=" ")
  grid.text(txt, gp=gpar(font=1, fontsize=16, col="midnightblue", fontfamily="Humor Sans"), vp = viewport(layout.pos.row = floor((i-1)/8)+1, layout.pos.col = i%%8+(i%%8==0)*8))
}
dev.off()

To leave a comment for the author, please follow the link and comment on his blog: Ripples.
