Search Results for “maps” – R-bloggers

Heatmap of Toronto Traffic Signals using RGoogleMaps


(This article was first published on everyday analytics, and kindly contributed to R-bloggers)
A little while back there was an article in blogTO about how a reddit user had used data from Toronto's Open Data initiative to produce a rather cool-looking map of the locations of all the traffic signals here in the city.

It's neat because, as the author on blogTO notes, it is recognizable as Toronto without any other geographic data being plotted - the structure of the city comes out in the data alone.

Still, I thought it'd be interesting to see it as a geographic heat map, and it was also a good excuse to fool around with mapping using RgoogleMaps.

The finished product is below:


Despite my best efforts with transparency (using my helper function), it's difficult for anything but the city core to really come out in the intensity map.

The image without the Google maps tile, and the coordinates rotated, shows the density a little better in the green-yellow areas:


And it's also straightforward to reproduce the original black and white figure:



The R code is below. Interpolation uses the trusty kde2d function from the MASS package, and a rotation is applied for the latter two figures so that the grid of Toronto's streets faces 'up', as in the original map.

# Toronto Traffic Signals Heat Map
# Myles Harrison
# http://www.everydayanalytics.ca
# Data from Toronto Open Data Portal:
# http://www.toronto.ca/open

library(MASS)
library(RgoogleMaps)
library(RColorBrewer)
source('colorRampPaletteAlpha.R')

# Read in the data
data <- read.csv(file="traffic_signals.csv", skip=1, header=T, stringsAsFactors=F)
# Keep the lon and lat data
rawdata <- data.frame(as.numeric(data$Longitude), as.numeric(data$Latitude))
names(rawdata) <- c("lon", "lat")
data <- as.matrix(rawdata)

# Rotate the lat-lon coordinates using a rotation matrix
# Trial and error led to pi/15.0 (= 12 degrees)
theta = pi/15.0
m = matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow=2)
data <- as.matrix(data) %*% m

# Reproduce William's original map
par(bg='black')
plot(data, cex=0.1, col="white", pch=16)

# Create heatmap with kde2d and overplot
k <- kde2d(data[,1], data[,2], n=500)
# Intensity from green to red
cols <- rev(colorRampPalette(brewer.pal(8, 'RdYlGn'))(100))
par(bg='white')
image(k, col=cols, xaxt='n', yaxt='n')
points(data, cex=0.1, pch=16)

# Mapping via RgoogleMaps
# Find map center and get map
center <- rev(sapply(rawdata, mean))
map <- GetMap(center=center, zoom=11)
# Translate original data
coords <- LatLon2XY.centered(map, rawdata$lat, rawdata$lon, 11)
coords <- data.frame(coords)

# Rerun heatmap
k2 <- kde2d(coords$newX, coords$newY, n=500)

# Create exponential transparency vector and add
alpha <- seq.int(0.5, 0.95, length.out=100)
alpha <- exp(alpha^6-1)
cols2 <- addalpha(cols, alpha)

# Plot
PlotOnStaticMap(map)
image(k2, col=cols2, add=T)
points(coords$newX, coords$newY, pch=16, cex=0.3)
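
The addalpha helper used above comes from the sourced colorRampPaletteAlpha.R file and isn't shown in the post. A minimal stand-in, assuming it simply applies a vector of alpha values to a vector of colours, could look something like this:

# Minimal stand-in for addalpha (assumed behaviour, not the author's version):
# convert the colours to RGB and re-emit them with the supplied alpha values
addalpha <- function(colors, alpha) {
  rgbvals <- col2rgb(colors) / 255
  rgb(rgbvals[1, ], rgbvals[2, ], rgbvals[3, ], alpha=alpha)
}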
This is a neat little start, and you can see how this type of thing could easily be extended to create a generalized mapping tool, stood up as a web service for example (they're out there). Case in point: Google Fusion Tables. I'm unsure what algorithm they use, but I find the result less satisfying; it looks like some kind of simple point blending:


As always, all the code is on GitHub.

To leave a comment for the author, please follow the link and comment on his blog: everyday analytics.


GeoCoding, R, and The Rolling Stones – Part 1


(This article was first published on Rolling Your Rs, and kindly contributed to R-bloggers)

Originally posted on Rolling Your Rs:

In this article I discuss a general approach for geocoding a location from within R, processing XML reports, and using R packages to create interactive maps. There are various ways to accomplish this, though using Google's Geocoding service is a good place to start. We'll also talk a bit about the XML package, which is a very useful tool for parsing reports returned from Google. XML is a powerful markup language with wide support across many Internet databases, so being comfortable with it is helpful. Lastly, we'll use our knowledge to create a map of the tour dates on the Rolling Stones' 1975 Tour of the Americas. When I use the word “GeoCoding”, I basically mean the process of taking a geographic location and turning it into a latitude/longitude pair.
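
To make the idea concrete before digging into the details, here is a rough sketch of a single geocoding call (the helper name is mine, not from the original post, and current versions of the API also expect a key parameter):

library(XML)
library(RCurl)

# Sketch: geocode one address with Google's Geocoding service and pull the
# coordinates out of the XML reply
geocode_address <- function(address) {
  url <- paste0("https://maps.googleapis.com/maps/api/geocode/xml?address=",
                URLencode(address, reserved=TRUE))
  doc <- xmlParse(getURL(url))
  lat <- xpathSApply(doc, "//result/geometry/location/lat", xmlValue)[1]
  lng <- xpathSApply(doc, "//result/geometry/location/lng", xmlValue)[1]
  c(lat=as.numeric(lat), lon=as.numeric(lng))
}

# e.g. geocode_address("Madison Square Garden, New York, NY")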

What does Google Offer?

Check out the main Geocoding page, which presents implementation details of the API as…




To leave a comment for the author, please follow the link and comment on his blog: Rolling Your Rs.


Analyzing Microbial Growth with R


(This article was first published on Brian Connelly » R | Brian Connelly, and kindly contributed to R-bloggers)

In experimental evolution research, few things are more important than growth. Both the rate of growth and the resulting yield can provide direct insights into a strain or species’ fitness. Whether one strain with a trait of interest can outgrow (and outcompete) another that possesses a variation of that trait often depends primarily on the fitnesses of the two strains.

Zachary Blount and his mountain of Petri dishes (Photo: Brian Baer)

Because of its importance both for painting the big picture and for properly designing experiments, I spend a very large portion of my time studying the growth of different candidate bacterial strains in different environments. This usually means counting colonies that grow on mountains of Petri dishes or using a spectrophotometer to measure the absorbance of light as it passes through populations growing in clear microtiter plates. In science, replication is key, so once my eyes have glazed over from counting colonies, or once the plate reader dings from across the lab to tell me that it's done, my job becomes assembling all the data from the replicate populations of each strain in each environment to try to figure out what's going on. Do the growth rates show the earth-shattering result that I'm hoping for, or will I have to tweak my experimental design and go through it all again? The latter is almost always the case.

Because analyzing growth is so fundamental to what I do, and because it is something I repeat ad nauseam, having a solid and easy-to-tweak pipeline for analyzing growth data is a must. Repeating the same analyses over and over again is not only unpleasant, but it also eats up a lot of time.

Here, I’m going to describe my workflow for analyzing growth data. It has evolved quite a bit over the past few years, and I’m sure it will continue to do so. As a concrete example, I’ll use some data from a kinetic read in a spectrophotometer, which means I have information about growth for each well in a 96-well microtiter plate measured periodically over time. I have chosen a more data dense form of analysis to highlight how easy it can be to analyze these data. However, I use the same pipeline for colony counts, single reads in the spec, and even for data from simulations. The following is an overview of the process:

  1. Import the raw data, aggregating from multiple sources if necessary
  2. Reformat the data to make it “tidy”: each record corresponds to one observation
  3. Annotate the data, adding information about experimental variables
  4. Group the replicate data
  5. Calculate statistics (e.g., mean and variation) for each group
  6. Plot the data and statistics

This workflow uses R, which is entirely a matter of preference. Each of these steps can be done in other environments using similar tools. I’ve previously written about grouping data and calculating group summaries in Summarizing Data in Python with Pandas, which covers the most important stages of the pipeline.

If you’d like to work along, I’ve included links for sample data in the section where they’re used. The entire workflow is also listed at the end, so you can easily copy and paste it into your own scripts.

For all of this, you’ll need an up-to-date version of R or RStudio. You’ll also need the dplyr, reshape2, and ggplot2 packages, which can be installed by running the following:

install.packages(c('reshape2', 'dplyr', 'ggplot2'))


The Initial State of the Data

To get started, I’ll put on my TV chef apron and pull some pre-cooked data out of the oven. This unfortunate situation occurs because all of the different software that I’ve used for reading absorbance measurements export data in slightly different formats. Even different versions of the same software can export differently.

So we’ll start with this CSV file from one of my own experiments. Following along with my data will hopefully be informative, but it is no substitute for using your own. So if you’ve got it, I really hope you will use your own data instead. Working this way allows you to see how each of these steps transforms your data, and how the process all comes together. For importing, try playing with the formatting options to read.table. There’s also a rich set of command line tools that make creating and manipulating tabular data quick and easy if that’s more your style. No matter how you get there, I’d recommend saving the data as a CSV (see write.csv) as soon as you’ve dealt with the import so that you never again have to face that unpleasant step.

# Read the data from a CSV file named raw.csv in the data directory
rawdata <- read.csv("data/raw.csv")

Each row of the example file contains the Time (in seconds), Temperature (in degrees Celsius), and absorbance readings at 600 nm for the 96 wells in a microtiter plate over a 24-hour period. These 96 values each have their own column in the row. To see the layout of this data frame, run summary(rawdata). Because of its large size, I’m not including the output of that command here.

Even microtiter plates can be mountainous, as Peter Conlin's bench shows.

No matter if your software exports as text, XML, or something else, this basic layout is very common. Unfortunately, it’s also very difficult to work with, because there’s no easy way to add more information about what each well represents. In order to be aware of which well corresponds to which treatment, you’ll most likely have to keep referring either to your memory, your lab notes, or something else to remind yourself how the plate was laid out. Not only is this very inconvenient for analysis—your scripts will consist of statements like treatment_1_avg <- (B4 + E6 + H1) / 3, which are incomprehensible in just about all contexts besides perhaps Battleship—but it also almost guarantees a miserable experience when looking back on your data after even just a few days. In the next step, we’ll be re-arranging the data and adding more information about the experiment itself. Not only will this make the analysis much easier, but it’ll also help sharing the data with others or your future self.

Tidying the Data

As we saw before, our original data set contains one row for each point in time, where each row has the absorbance value for each of our 96 wells. We’re now going to follow the principles of Tidy Data and rearrange the data so that each row contains the value for one read of one well. As you will soon see, this means that each point in time will correspond to 96 rows of data.

To do this rearranging, we’re going to be using the melt function from reshape2. With melt, you specify which columns are identity variables, and which columns are measured variables. Identity variables contain information about the measurement, such as what was measured (e.g., which strain or environment), how (e.g., light absorbance at 600 nm), when, and so on. These are kind of like the 5 Ws of Journalism for your experiment. Measured variables contain the actual values that were observed.

# Load the reshape2 library
library(reshape2)

reshaped <- melt(rawdata, id=c("Time", "Temperature"), variable.name="Well",
                 value.name="OD600")

In the example data, our identity variables are Time and Temperature, while our measured variable is absorbance at 600 nm, which we’ll call OD600. Each of these will be represented as a column in the output. The output, which we’re storing in a data frame named reshaped, will also contain a Well column that contains the well from which data were collected. The Well value for each record will correspond to the name of the column that the data came from in the original data set.

Now that our data are less "wide", we can take a peek at their structure and first few records:

summary(reshaped)
       Time        Temperature        Well            OD600       
  Min.   :    0   Min.   :28.2   A1     :  4421   Min.   :0.0722  
  1st Qu.:20080   1st Qu.:37.0   A2     :  4421   1st Qu.:0.0810  
  Median :42180   Median :37.0   A3     :  4421   Median :0.0970  
  Mean   :42226   Mean   :37.0   A4     :  4421   Mean   :0.3970  
  3rd Qu.:64280   3rd Qu.:37.0   A5     :  4421   3rd Qu.:0.6343  
  Max.   :86380   Max.   :37.1   A6     :  4421   Max.   :1.6013  
                                 (Other):397890
head(reshaped)
   Time Temperature Well  OD600
 1    0        28.2   A1 0.0777
 2   20        28.9   A1 0.0778
 3   40        29.3   A1 0.0779
 4   60        29.8   A1 0.0780
 5   80        30.2   A1 0.0779
 6  100        30.6   A1 0.0780

There’s a good chance that this format will make you a little bit uncomfortable. How are you supposed to do things like see what the average readings across wells B4, E6, and H1 are? Remember, we decided that doing it that way—although perhaps seemingly logical at the time—was not the best way to go because of the pain and suffering that it will cause your future self and anyone else who has to look at your data. What’s so special about B4, E6, and H1 anyway? You may know the answer to this now, but will you in 6 months? 6 days?

Annotating the Data

Based solely on the example data set, you would have no way of knowing that it includes information about three bacterial strains (A, B, and C) grown in three different environments (1, 2, and 3). Now we’re going to take advantage of our newly-rearranged data by annotating it with this information about the experiment.

One of the most important pieces of this pipeline is a plate map, which I create when designing any experiments that use microtiter plates (see my templates here and here). These plate maps describe the experimental variables tested (e.g., strain and environment) and what their values are in each of the wells. I keep the plate map at my bench and use it to make sure I don’t completely forget what I’m doing while inoculating the wells.


A plate map for our example data. In this case, strain A is colored blue, strain B is red, and strain C is colored black.

For the analysis, we’ll be using a CSV version of the plate map pictured. This file specifies where the different values of the experimental variables occur on the plate. Its columns describe the wells and each of the experimental variables, and each row contains a well and the values of the experimental variables for that well.

In this sample plate map file, each row contains a well along with the letter of the strain that was in that well and the number of the environment in which it grew. If you look closely at this plate map, you'll notice that I had four replicate populations for each treatment. In some of the wells, the strain is NA. These are control wells that just contained medium; don't worry about them, we'll filter them out later on.

# Load the plate map and look at the first few rows of it
platemap <- read.csv("data/platemap.csv")
head(platemap, n=10)
    Well Strain Environment
 1    B2      A           1
 2    B3      B           1
 3    B4      C           1
 4    B5   <NA>           1
 5    B6      A           2
 6    B7      B           2
 7    B8      C           2
 8    B9   <NA>           2
 9   B10      A           3
 10  B11      B           3

We can combine the information in this plate map with the reshaped data by pairing the data by their Well value. In other words, for each row of the reshaped data, we’ll find the row in the plate map that has the same Well. The result will be a data frame in which each row contains the absorbance of a well at a given point in time as well as information about what was actually in that well.

To combine the data with the plate map, we’ll use the inner_join function from dplyr, indicating that Well is the common column. Inner join is a term from databases that means to find the intersection of two data sets.

library(dplyr)

# Combine the reshaped data with the plate map, pairing them by Well value
annotated <- inner_join(reshaped, platemap, by="Well")

# Take a peek at the first few records in annotated
head(annotated)
   Time Temperature Well  OD600 Strain Environment
 1    0        28.2   B2 0.6100      A           1
 2   20        28.9   B2 0.5603      A           1
 3   40        29.3   B2 0.1858      A           1
 4   60        29.8   B2 0.1733      A           1
 5   80        30.2   B2 0.1713      A           1
 6  100        30.6   B2 0.1714      A           1

This produces a new table named annotated that contains the combination of our absorbance data with the information from the plate map. The inner join will also drop data for all of the wells in our data set that do not have a corresponding entry in the plate map. So if you don't use a row or two in the microtiter plate, just don't include those rows in the plate map (there's nothing to describe anyway). Since the inner join takes care of matching the well data with its information, an added benefit of the plate map approach is that it makes data from experiments with randomized well locations much easier to analyze (unfortunately, it doesn't help with the pipetting portion of those experiments).

Let’s pause right here and save this new annotated data set. Because it contains all information related to the experiment—both the measured variables and the complete set of identity variables—it’s now in an ideal format for analyzing and for sharing.

# Write the annotated data set to a CSV file
write.csv(annotated, "data-annotated.csv")


Grouping the Data

Now that the data set is annotated, we can arrange it into groups based on the different experimental variables. With the example data set, it makes sense to collect the four replicate populations of each treatment at each time point. Using this grouping, we can begin to compare the growth of the different strains in the different environments over time and make observations such as “Strain A grows faster than Strain B in Environment 1, and slower than Strain B in Environment 2.” In other words, we're ready to start learning what the data have to tell us about our experiment.

For this and the following step, we’re once again going to be using the dplyr package, which contains some really powerful (and fast) functions that allow you to easily filter, group, rearrange, and summarize your data. We’ll group the data by Strain, then by Environment, and then by Time, and store the grouping in grouped. As shown in the Venn diagram, this means that we’ll first separate the data based on the strain. Then we’ll separate the data within each of those piles by the environment. Finally, within these smaller collections, we’ll group the data by time.

grouped <- group_by(annotated, Strain, Environment, Time)

Grouping the data by Strain, Environment, and Time

What this means is that grouped contains all of the growth measurements for Strain A in Environment 1 at each point in Time, then all of the measurements for Strain A in Environment 2 at each point in Time, and so on. We’ll use this grouping in the next step to calculate some statistics about the measurements. For example, we’ll be able to calculate the average absorbance among the four replicates of Strain A in Environment 1 over time and for each of the other treatments.




Calculating Statistics for Each Group

Now that we have our data partitioned into logical groups based on the different experimental variables, we can calculate summary statistics about each of those groups. For this, we'll use dplyr's summarise function, which allows you to execute one or more functions on any of the columns in the grouped data set. For example, to calculate the number of measurements, the average absorbance (from the OD600 column), and the standard deviation of the absorbance values:

stats <- summarise(grouped, N=length(OD600), Average=mean(OD600), StDev=sd(OD600))

The resulting stats data set contains a row for each of the different groups. Each row contains the Strain, Environment, and Time that define that group as well as our sample size, average, and standard deviation, which are named N, Average, and StDev, respectively. With summarise, you can apply any function to the group's data that returns a single value, so we could easily replace the standard deviation with 95% confidence intervals:

# Create a function that calculates 95% confidence intervals for the given
# data vector using a t-distribution
conf_int95 <- function(data) {
    n <- length(data)
    error <- qt(0.975, df=n-1) * sd(data)/sqrt(n)
    return(error)
}

# Create summary for each group containing sample size, average OD600, and
# 95% confidence limits
stats <- summarise(grouped, N=length(OD600), Average=mean(OD600),
                   CI95=conf_int95(OD600))


Combining Grouping, Summarizing, and More

One of the neat things that dplyr provides is the ability to chain multiple operations together using the %.% operator. This allows us to combine the grouping and summarizing from the last two steps (and filtering, sorting, etc.) into one line:

stats <- annotated %.%
          group_by(Environment, Strain, Time) %.%
          summarise(N=length(OD600), 
                    Average=mean(OD600),
                    CI95=conf_int95(OD600)) %.%
          filter(!is.na(Strain))

Note that I’ve put the input data set, annotated, at the beginning of the chain of commands and that group_by and summarise no longer receive an input data source. Instead, the data flows from annotated, through summarise, and finally through filter just like a pipe. The added filter removes data from the control wells, which had no strain.

Plotting the Results

Now that we have all of our data nicely annotated and summarized, a great way to start exploring it is through plots. For the sample data, we’d like to know how each strain grows in each of the environments tested. Using the ggplot2 package, we can quickly plot the average absorbance over time:

ggplot(data=stats, aes(x=Time/3600, y=Average, color=Strain)) + 
       geom_line() + 
       labs(x="Time (Hours)", y="Absorbance at 600 nm")

The obvious problem with this plot is that although we can differentiate among the three strains, we can't see the effect that environment has. This can be fixed easily, but before we do that, let's quickly dissect what we did.

We’re using the ggplot function to create a plot. As arguments, we say that the data to plot will be coming from the stats data frame. aes allows us to define the aesthetics of our plot, which are basically what ggplot uses to determine various visual aspects of the plot. In this case, the x values will be coming from our Time column, which we divide by 3600 to convert seconds into hours. The corresponding y values will come from the Average column. Finally, we will color things such as lines, points, etc. based on the Strain column.

The ggplot function sets up a plot, but doesn’t actually draw anything until we tell it what to draw. This is part of the philosophy behind ggplot: graphics are built by adding layers of different graphic elements. These elements (and other options) are added using the + operator. In our example, we add a line plot using geom_line. We could instead make a scatter plot with geom_point, but because our data are so dense, the result isn’t quite as nice. We also label the axes using labs.
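
For comparison, the geom_point version mentioned above is the same call with the geom swapped out:

# The same plot drawn as points instead of lines; a small point size helps
# a little with the density of the data
ggplot(data=stats, aes(x=Time/3600, y=Average, color=Strain)) +
       geom_point(size=0.5) +
       labs(x="Time (Hours)", y="Absorbance at 600 nm")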

Back to the problem of not being able to differentiate among the environments. While we could use a different line type for each environment (using the linetype aesthetic), a more elegant solution would be to create a trellis chart. In a trellis chart (also called small multiples by Edward Tufte), the data are split up and displayed as individual subplots. Because these subplots use the same scales, it is easy to make comparisons. We can use ggplot’s facet_grid to create subplots based on the environments:

ggplot(data=stats, aes(x=Time/3600, y=Average, color=Strain)) + 
       geom_line() + 
       facet_grid(Environment ~ .) +
       labs(x="Time (Hours)", y="Absorbance at 600 nm")

Trellis plot showing the growth of the strains over time for each environment

Let’s take it one step further and add shaded regions corresponding to the confidence intervals that we calculated. Since ggplot builds plots layer-by-layer, we’ll place the shaded regions below the lines by adding geom_ribbon before using geom_line. The ribbons will choose a fill color based on the Strain and not color the edges. Since growth is exponential, we’ll also plot our data using a log scale with scale_y_log10:

ggplot(data=stats, aes(x=Time/3600, y=Average, color=Strain)) +
       geom_ribbon(aes(ymin=Average-CI95, ymax=Average+CI95, fill=Strain),
                   color=NA, alpha=0.3) + 
       geom_line() +
       scale_y_log10() +
       facet_grid(Environment ~ .) +
       labs(x="Time (Hours)", y="Absorbance at 600 nm")

Our final plot showing the growth of each strain as mean plus 95% confidence intervals for each environment


In Conclusion

And that’s it! We can now clearly see the differences between strains as well as how the environment affects growth, which was the overall goal of the experiment. Whether or not these results match my hypothesis will be left as a mystery. Thanks to a few really powerful packages, all it took was a few lines of code to analyze and plot over 200,000 data points.

I’m planning to post a follow-up in the near future that builds upon what we’ve done here by using grofit to fit growth curves.

Complete Script

library(reshape2)
library(dplyr)
library(ggplot2)

# Read in the raw data and the platemap. You may need to first change your
# working directory with the setwd command.
rawdata <- read.csv("data/raw.csv")
platemap <- read.csv("data/platemap.csv")

# Reshape the data. Instead of rows containing the Time, Temperature,
# and readings for each Well, rows will contain the Time, Temperature, a
# Well ID, and the reading at that Well.
reshaped <- melt(rawdata, id=c("Time", "Temperature"), variable.name="Well", 
                 value.name="OD600")

# Add information about the experiment from the plate map. For each Well
# defined in both the reshaped data and the platemap, each resulting row
# will contain the absorbance measurement as well as the additional columns
# and values from the platemap.
annotated <- inner_join(reshaped, platemap, by="Well")

# Save the annotated data as a CSV for storing, sharing, etc.
write.csv(annotated, "data-annotated.csv")

conf_int95 <- function(data) {
    n <- length(data)
    error <- qt(0.975, df=n-1) * sd(data)/sqrt(n)
    return(error)
}

# Group the data by the different experimental variables and calculate the
# sample size, average OD600, and 95% confidence limits around the mean
# among the replicates. Also remove all records where the Strain is NA.
stats <- annotated %.%
              group_by(Environment, Strain, Time) %.%
              summarise(N=length(OD600),
                        Average=mean(OD600),
                        CI95=conf_int95(OD600)) %.%
              filter(!is.na(Strain))

# Plot the average OD600 over time for each strain in each environment
ggplot(data=stats, aes(x=Time/3600, y=Average, color=Strain)) +
       geom_ribbon(aes(ymin=Average-CI95, ymax=Average+CI95, fill=Strain),
                   color=NA, alpha=0.3) + 
       geom_line() +
       scale_y_log10() +
       facet_grid(Environment ~ .) +
       labs(x="Time (Hours)", y="Absorbance at 600 nm")

Extending to Other Types of Data

I hope it’s also easy to see how this pipeline could be used in other situations. For example, to analyze colony counts or a single read from a plate reader, you could repeat the steps exactly as shown, but without Time as a variable. Otherwise, if there are more experimental variables, the only change needed would be to add a column to the plate map for each of them.

Acknowledgments

I’d like to thank Carrie Glenney and Jared Moore for their comments on this post and for test driving the code. Many thanks are also in order for Hadley Wickham, who developed each of the outstanding packages used here (and many others).


To leave a comment for the author, please follow the link and comment on his blog: Brian Connelly » R | Brian Connelly.


Finally, an easy way to fix the horizontal lines in ggplot2 maps


(This article was first published on cameron.bracken.bz » R, and kindly contributed to R-bloggers)
ggplot2 tries to make it super easy to add country or state borders to your map, and for the most part it works great as long as you include the entire map region in your plot (all the states or the entire world map, for example). A long-standing issue with the ggplot2 borders() function […]

To leave a comment for the author, please follow the link and comment on his blog: cameron.bracken.bz » R.


Using R — Working with Geospatial Data (and ggplot2)


(This article was first published on Working With Data » R, and kindly contributed to R-bloggers)

This is a follow-up blog-post to an earlier introductory post by Steven Brey: Using R: Working with Geospatial Data. In this post, we’ll learn how to plot geospatial data in ggplot2. Why might we want to do this? Well, it’s really about your personal taste. Some people are willing to forfeit the fine-grained control of base graphics in exchange for the elegance of a ggplot. The choice is entirely yours.

To get started, we’ll need the ggplot2 package and some data! The dataset we’ll look at are shapefiles defining watersheds in Washington state.

Loading libraries and data

# load libraries
library(ggplot2)
library(sp)
library(rgdal)
library(rgeos)

# create a local directory for the data
localDir <- "R_GIS_data"
if (!file.exists(localDir)) {
  dir.create(localDir)
}

# download and unzip the data
url <- "ftp://www.ecy.wa.gov/gis_a/hydro/wria.zip"
file <- paste(localDir, basename(url), sep='/')
if (!file.exists(file)) {
  download.file(url, file)
  unzip(file,exdir=localDir)
}

# create a layer name for the shapefiles (text before file extension)
layerName <- "WRIA_poly"

# read data into a SpatialPolygonsDataFrame object
dataProjected <- readOGR(dsn=localDir, layer=layerName)

Transforming the data

Thus far, we haven’t done anything radically different than before, but in order to prepare the data for plotting in a ggplot, we’ll have to do a couple manipulations to the structure of the data. ggplot2 will only work with a data.frame object, so our object of class of SpatialPolygonsDataFrame will not be appropriate for plotting. Let’s write some code and discuss why this kind of transformation is necessary.

# add to data a new column termed "id" composed of the rownames of data
dataProjected@data$id <- rownames(dataProjected@data)

# create a data.frame from our spatial object
watershedPoints <- fortify(dataProjected, region = "id")

# merge the "fortified" data with the data from our spatial object
watershedDF <- merge(watershedPoints, dataProjected@data, by = "id")

# NOTE : If we so choose, we could have loaded the plyr library to use the
#      : join() function. For those familiar with SQL, this may be a more
#      : intuitive way to understand the merging of two data.frames. An
#      : equivalent SQL statement might look something like this:
#      : SELECT *
#      : FROM dataProjected@data
#      : INNER JOIN watershedPoints
#      : ON dataProjected@data$id = watershedPoints$id

# library(plyr)
# watershedDF <- join(watershedPoints, dataProjected@data, by = "id")

What does all this code mean and why do we need it? Let’s go through this line by line.

dataProjected@data$id <- rownames(dataProjected@data)

Here we are appending to the data an extra column called “id”. This column will contain the rownames so that we define an explicit relationship between the data and the polygons associated with that data.

watershedPoints <- fortify(dataProjected, region = "id")

Fortify? What does that even mean? A quick search on the internet will yield some helpful documentation. (See the fortify.sp documentation.) Basically, fortify takes two arguments: model, which will consist of the SpatialPolygonsDataFrame object we wish to convert, and region, the name of the variable by which to split regions. If all goes according to plan, some magic happens and we get a data.frame, just like we wanted… well, not quite. If you inspect this data.frame, you'll notice it appears to be missing some critical information. Fret not! Using the relationship we created earlier, we can merge these two datasets with the following command.

watershedDF <- merge(watershedPoints, dataProjected@data, by = "id")

And voilà! Now that we've created a data.frame that ggplot2 likes, we can begin plotting. Before we get to plotting, let's take a quick look at this new data.frame we've created.

head(watershedDF)

##   id    long     lat order  hole piece group WRIA_ID WRIA_NR WRIA_AREA_
## 1  0 2377934 1352106     1 FALSE     1   0.1       1      62     789790
## 2  0 2378018 1352109     2 FALSE     1   0.1       1      62     789790
## 3  0 2382417 1352265     3 FALSE     1   0.1       1      62     789790
## 4  0 2387199 1352434     4 FALSE     1   0.1       1      62     789790
## 5  0 2387693 1352452     5 FALSE     1   0.1       1      62     789790
## 6  0 2392524 1352623     6 FALSE     1   0.1       1      62     789790
##        WRIA_NM Shape_Leng Shape_Area
## 1 Pend Oreille     983140   3.44e+10
## 2 Pend Oreille     983140   3.44e+10
## 3 Pend Oreille     983140   3.44e+10
## 4 Pend Oreille     983140   3.44e+10
## 5 Pend Oreille     983140   3.44e+10
## 6 Pend Oreille     983140   3.44e+10

Your first ggplot

If you’re coming from base graphics, some of the syntax may appear intimidating, but’s it’s all part of the “grammar of graphics” after which ggplot2 is modeled. You’ll notice a graph is built layer by layer, beginning with the data and the mapping of data to “aesthetic attributes”. We’ll add “geoms” or geometric objects and perhaps we’ll compute some statistics. We may also want to adjust the scale or coordinate system. All this can be added in a very modular fashion; this is one of the key advantages to using ggplot2. So, enough talk, let’s make a plot!

ggWatershed <- ggplot(data = watershedDF, aes(x=long, y=lat, group = group,
                                              fill = WRIA_NM)) +
  geom_polygon()  +
  geom_path(color = "white") +
  scale_fill_hue(l = 40) +
  coord_equal() +
  theme(legend.position = "none", title = element_blank(),
        axis.text = element_blank())

print(ggWatershed)

Plot of Washington's watersheds, colored by name

Alright, so we have created our first ggplot. Looks pretty spiffy, right? We started by passing a data.frame to the function ggplot. From there, we added some aesthetic mappings. x and y are fairly self-explanatory, group = group simply identifies the groups of coordinates that pertain to individual polygons, and fill = WRIA_NM will attempt to assign an appropriate color scale to the data based on the “WRIA_NM” column. Next, we added several “geoms” including polygons and paths. The polygons are the brightly colored shapes you see, and the path is the white outline around each shape. scale_fill_hue() changes the properties of the colors displayed and theme() can be used to change a number of properties; in this case I chose not to display a legend or axis labels since they add very little to the plot. Lastly, coord_equal() fixes the aspect ratio between the horizontal and vertical scales.

Manipulating spatial objects

Now, let’s get some practice working with spatial objects. For this exercise, we will subset the data and observe watersheds in the Puget Sound region. For ease of access, we’ll cut out some the data we don’t particularly care about and rename some of the columns to be more descriptive.

# identify some interesting attributes
attributes <- c("WRIA_NR", "WRIA_AREA_", "WRIA_NM")

# subset the full dataset extracting only the desired attributes
dataProjectedSubset <- dataProjected[,attributes]

# assign these attributes of interest to more descriptive names
names(dataProjectedSubset) <- c("number", "area", "name")

# create a data.frame name (potentially different from layerName)
dataName <- "WRIA"

# reproject the data onto a "longlat" projection and assign it to the new name
assign(dataName,spTransform(dataProjectedSubset, CRS("+proj=longlat")))

# save the data
save(list=c(dataName),file=paste(localDir,"WAWRIAs.RData",sep="/"))

# inspect the watershed names
WRIA$name

##  [1] Pend Oreille            Upper Lake Roosevelt
##  [3] Nooksack                Kettle                 
##  [5] Okanogan                Upper Skagit           
##  [7] Methow                  San Juan               
##  [9] Colville                Sanpoil                
## [11] Lower Skagit - Samish   Middle Lake Roosevelt  
## [13] Lyre - Hoko             Chelan                 
## [15] Soleduc                 Stillaguamish          
## [17] Island                  Nespelem               
## [19] Quilcene - Snow         Elwha - Dungeness      
## [21] Foster                  Little Spokane         
## [23] Middle Spokane          Wenatchee              
## [25] Entiat                  Lower Spokane          
## [27] Lower Lake Roosevelt    Grand Coulee           
## [29] Kitsap                  Upper Crab-Wilson      
## [31] Skokomish - Dosewallips Moses Coulee           
## [33] Queets - Quinault       Hangman                
## [35] Palouse                 Upper Yakima           
## [37] Lower Chehalis          Kennedy - Goldsborough 
## [39] Lower Crab              Alkali - Squilchuck    
## [41] Chambers - Clover       Deschutes              
## [43] Naches                  Nisqually              
## [45] Upper Chehalis          Willapa                
## [47] Esquatzel Coulee        Middle Snake           
## [49] Cowlitz                 Lower Snake            
## [51] Lower Yakima            Grays - Elochoman      
## [53] Walla Walla             Klickitat              
## [55] Lewis                   Rock - Glade           
## [57] Wind - White Salmon     Salmon - Washougal     
## [59] Snohomish               Cedar - Sammamish      
## [61] Duwamish - Green        Puyallup - White       
## 62 Levels: Alkali - Squilchuck Cedar - Sammamish ... Wind - White Salmon

# save a subset including only regions surrounding the Puget Sound (as it
# turns out, this will be the first 19 entries)
PSWRIANumbers <- c(1:19)
WRIAPugetSound <- WRIA[WRIA$number %in% PSWRIANumbers,]

# plot Puget Sound watersheds to make sure this is approximately what we want
plot(WRIAPugetSound)

Plot of the Puget Sound watershed subset

# save the data
dataName <- "WRIAPugetSound"
save(list=c(dataName),file=paste(localDir,"WRIAPugetSound.RData",sep="/"))

Since the plot was simply a “sanity check” of sorts, I decided to use base graphics for a quick peek. Perhaps you should try making it into a ggplot.
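
If you want to try it, one possible ggplot version (a sketch that reuses the fortify-and-merge pattern from above) would be:

# A possible ggplot version of the sanity check (a sketch, not from the
# original post)
WRIAPugetSound@data$id <- rownames(WRIAPugetSound@data)
pugetSoundPoints <- fortify(WRIAPugetSound, region = "id")
pugetSoundDF <- merge(pugetSoundPoints, WRIAPugetSound@data, by = "id")

ggplot(data = pugetSoundDF, aes(x = long, y = lat, group = group, fill = name)) +
  geom_polygon() +
  geom_path(color = "white") +
  coord_equal() +
  theme(legend.position = "none", axis.title = element_blank(),
        axis.text = element_blank())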

Before wrapping up this exercise, let's transform the subsetted dataset so it's ready to be used by ggplot2.

# add to data a new column termed "id" composed of the rownames of data
dataProjectedSubset@data$id <- rownames(dataProjectedSubset@data)

# create a data.frame from our spatial object
watershedPointsSubset <- fortify(dataProjectedSubset, region = "id")

# merge the "fortified" data with the data from our spatial object
watershedSubsetDF <- merge(watershedPointsSubset, dataProjectedSubset@data,
                           by = "id")

More ggplots

On to the more fun analysis. While this dataset is fairly limited in information, it does contain some numerical values that might be worth investigating. Let's get some information regarding the area of Washington's watersheds. You'll find that ggplot2 makes it very easy to visualize this kind of information.

ggWatershedArea <- ggplot(data = watershedSubsetDF, aes(x=long, y=lat,
                                                        group = group, 
                                                        fill = area)) +
  geom_polygon()  +
  geom_path(color = "white") +
  scale_fill_gradient(breaks=c(500000,1000000,1500000),
                      labels=c("Low","Medium","High")) +
  coord_equal() +
  theme(axis.title = element_blank(), axis.text = element_blank()) +
  labs(title = "Area of Washington's Watersheds", fill = "Area")

print(ggWatershedArea)

Plot of Washington's watersheds, colored by area

That seemed pretty painless. All we had to do was set the fill aesthetic to the region area and add scale_fill_gradient() to define a continuous color scale; ggplot seemed to handle everything else for us. Notice in the previous ggplot that I set fill = WRIA_NM, or the watershed name. This information was identified as a categorical variable and regions were therefore filled with a variety of colors. In the script above, I passed fill = area. If I did not add scale_fill_gradient(), I would have received the following error message: Error: Continuous value supplied to discrete scale. This ggplot expected a categorical variable, so it became necessary to add the scale_fill_gradient() layer.

Say, the way this ggplot colors the regions is actually kind of counterintuitive; light regions for larger areas and dark regions for smaller areas? That doesn't make much sense if you ask me. Let's do something about that. Since this is Washington, we'll don some UW spirit. Hope you like purple and gold! (Apologies to any WSU fans out there.)

ggWatershedAreaPurple <- ggWatershedArea + geom_path(color = "goldenrod1", 
                                                     size=1) +
  scale_fill_gradient(low = "plum1", high = "purple4",
                      breaks=c(500000,1000000,1500000),
                      labels=c("Low","Medium","High"))

print(ggWatershedAreaPurple)

Plot of Washington's watersheds by area, in purple and gold

Okay, so the purple and gold might be a bit obnoxious, but you get the idea. Let’s just color the largest of the watersheds. Before we get to that, let’s inspect the data to determine exactly which one that is. You can already sort of guess based on the size and color of the regions, but let’s be sure.

# find the largest area
maxArea <- max(dataProjectedSubset$area)

# create a "mask" identifying the biggest area
biggestAreaMask <- which(dataProjectedSubset$area == maxArea)
biggestAreaName <- dataProjectedSubset$name[biggestAreaMask]
biggestAreaName

## [1] Lower Yakima
## 62 Levels: Alkali - Squilchuck Cedar - Sammamish ... Wind - White Salmon

# NOTE : Each "mask" we create is a vector of logical values that we will use
#      : to subset the data.frame. Masks are particularly helpful when
#      : querying large datasets as the comparison of logicals is faster than
#      : comparing more complex data types.

# create a subset
biggestArea <- dataProjectedSubset[biggestAreaMask,]

So supposedly we have identified Washington's largest watershed. Let's make another ggplot to see what this watershed looks like. Remember, ggplot2 only likes data.frames, so we'll have to mess around with the data.

# add to data a new column termed "id" composed of the rownames of data
biggestArea@data$id <- rownames(biggestArea@data)

# create a data.frame from our spatial object
biggestAreaPoints <- fortify(biggestArea, region = "id")

# merge the "fortified" data with the data from our spatial object
biggestAreaDF <- merge(biggestAreaPoints, biggestArea@data, by = "id")

If we’re going to do this kind of transformation every time we make a ggplot, perhaps we can make a method to reduce coding time. Alas, that will wait for another day. Now we plot!

ggBiggestArea <- ggplot(data = biggestAreaDF, aes(x=long, y=lat)) +
  geom_polygon(fill = "deepskyblue2")  +
  coord_equal() +
  theme(legend.position = "none", axis.title = element_blank(),
        axis.text = element_blank()) +
  labs(title=paste(biggestAreaName, "\n Area=", maxArea, " (units)", sep=""))

print(ggBiggestArea)

Plot of the biggest watershed (Lower Yakima)

That seemed like a lot of trouble to go through just to find the largest area. Can we do that without creating an entirely new spatial object? Of course we can!

# create a "mask" identifying the biggest area
biggestAreaMask <- which(watershedSubsetDF$area == max(watershedSubsetDF$area))

# create a subset
biggestWatershed <- watershedSubsetDF[biggestAreaMask,]

ggBiggestAreaPlus <- ggplot(data = watershedDF, aes(x=long, y=lat,
                                                    group = group)) +
  geom_polygon()  +
  geom_polygon(data = biggestWatershed, fill ="deepskyblue2") +
  geom_path(color = "white") +
  coord_equal() +
  theme(legend.position = "none", title = element_blank(),
        axis.text = element_blank())

print(ggBiggestAreaPlus)

Plot of all the watersheds with the largest one highlighted
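
Since the fortify-then-merge dance has now come up three times, the helper alluded to a moment ago might look something like this sketch (not from the original post):

# A possible helper: add an id, fortify, and merge the attribute data back
# onto the coordinates
spdfToDF <- function(spdf) {
  spdf@data$id <- rownames(spdf@data)
  spdfPoints <- fortify(spdf, region = "id")
  merge(spdfPoints, spdf@data, by = "id")
}

# e.g. biggestAreaDF <- spdfToDF(biggestArea)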

Using ggmap

The last thing I’ll describe in this post is the function and use of ggmap. This library is used for visualizing spatial data with the likes of Google Maps using ggplot2. A quick example is provided below.

# load library
library(ggmap)

# reproject the data onto a "longlat" projection
subsetTransform <- spTransform(dataProjectedSubset, CRS("+proj=longlat"))

# determine the bounding box of the spatial object
b <- bbox(subsetTransform)

# get and plot a map
washingtonState <- ggmap(get_map(location = b, maptype = "satellite", zoom = 6))

subsetTransformFortified <- fortify(subsetTransform, region = "id")
subsetTransformFortified <- merge(subsetTransformFortified,
                                  subsetTransform@data, by.x = "id")

washingtonState + geom_polygon(data = subsetTransformFortified,
                               aes(x = long, y = lat, group = group,
                                   fill = name), alpha = 0.5) +
  scale_x_continuous(limits = c(b[1,1],b[1,2])) +
  scale_y_continuous(limits = c(b[2,1],b[2,2])) +
  theme(legend.position = "none", title = element_blank())

ggmap example plot: watershed polygons overlaid on a satellite map

Final remarks

While these plots may look “nicer”, ggplot2 has a couple of disadvantages. Perhaps the most glaring is the increase in computing time. Base graphics are built to be fast. Generally speaking, visualizing geospatial data is not the fastest process in the world, but using base vs. ggplot2 can be the difference between 0.5 seconds and 10 seconds. So if you're producing graphics on-the-fly, stick with base, but if you're looking to create publication-quality graphics, ggplot2 is certainly worth learning.
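
If you're curious about the gap on your own machine, a rough (and unscientific) comparison only takes two lines:

# Rough timing comparison; the numbers will vary by machine and dataset
system.time(plot(dataProjectedSubset))
system.time(print(ggWatershed))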


To leave a comment for the author, please follow the link and comment on his blog: Working With Data » R.


Make your ggplots shareable, collaborative, and with D3


(This article was first published on rOpenSci Blog - R, and kindly contributed to R-bloggers)

Editor's note: This is a guest post from Matt Sundquist from the Plot.ly team.

You can access the source code for this post at https://gist.github.com/sckott/10991885


Ggplotly and Plotly's R API let you make ggplot2 plots, add py$ggplotly(), and make your plots interactive, online, and drawn with D3. Let's make some.

1. Getting Started and Examples

Here is Fisher's iris data.

library("ggplot2")
ggiris <- qplot(Petal.Width, Sepal.Length, data = iris, color = Species)
print(ggiris)

Let's make it in Plotly. Install:

install.packages("devtools")
library("devtools")
install_github("plotly", "ropensci")

Load.

library("plotly")
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: RJSONIO

Sign up online, use our public keys below, or sign up like this:

signup("new_username", "your_email@domain.com")

That should have responded with your new key. Use that to create a plotly interface object, or use ours:

py <- plotly("RgraphingAPI", "ektgzomjbx")

It just works.

py$ggplotly(ggiris)

The call opens a browser tab. Or, in an .Rmd document, the plot is embedded if you specify the plotly=TRUE chunk option (see source). If you're running this from the source, it makes all the graphs at once in your browser. My reaction the first time: here be dragons.

If you click the data and graph link in the embed, it takes you to Plotly's GUI, where you can edit the graph, see the data, and share your plot with collaborators.



1.2 Maps

Next: Maps!

data(canada.cities, package="maps")
viz <- ggplot(canada.cities, aes(long, lat)) +
  borders(regions="canada", name="borders") +
  coord_equal() +
  geom_point(aes(text=name, size=pop), colour="red", alpha=1/2, name="cities")

Call Plotly.

py$ggplotly(viz)



1.3 Scatter

Want to make a scatter and add a smoothed conditional mean? Here's how to do it in Plotly. For the rest of the plots, we'll just print the Plotly version to save space. You can hover on text to get data, or click and drag across a section to zoom in.

model <- lm(mpg ~ wt + factor(cyl), data=mtcars)
grid <- with(mtcars, expand.grid(
  wt = seq(min(wt), max(wt), length = 20),
  cyl = levels(factor(cyl))
))

grid$mpg <- stats::predict(model, newdata=grid)

viz2 <- qplot(wt, mpg, data=mtcars, colour=factor(cyl)) + geom_line(data=grid)
py$ggplotly(viz2)



1.4 Lines

Or, take ggplotly for a spin with the orange dataset:

orange <- qplot(age, circumference, data = Orange, colour = Tree, geom = "line")
py$ggplotly(orange)



1.5 Alpha blend

Or, make plots beautiful.

prettyPlot <- ggplot(data=diamonds, aes(x=carat, y=price, colour=clarity))
prettyPlot <- prettyPlot + geom_point(alpha = 1/10)
py$ggplotly(prettyPlot)



1.6 Functions

Want to draw functions with a curve?

eq <- function(x) {x*x}
tmp <- data.frame(x=1:50, y=eq(1:50))

# Make plot object
p <- qplot(x, y, data=tmp, xlab="X-axis", ylab="Y-axis")
c <- stat_function(fun=eq)

py$ggplotly(p + c)



2. A GitHub for data and graphs

Like we might work together on code on GitHub or a project in a Google Doc, we can edit graphs and data together on Plotly. Here's how it works:

  • Your URL is shareable.
  • Public use is free.
  • You can set the privacy of your graph.
  • You can edit and add to plots from our GUI or with R or APIs for Python, MATLAB, Julia, Perl, Arduino, Raspberry Pi, and REST.
  • You get a profile of graphs, like Rhett Allain from Wired Science.
  • You can embed interactive graphs in iframes.



2.1 Inspiration and team

Plotly's API is part of rOpenSci (ropensci.org), is being developed by the brilliant Toby Hocking, and is on GitHub. Your thoughts, issues, and pull requests are welcome. Right now, you can make scatter and line plots; let us know what you'd like to see next.

The project was inspired by Hadley Wickham and the elegance and precision of ggplot2. Thanks to Scott Chamberlain, Joe Cheng, and Elizabeth Morrison-Wells for their help.



3. ggthemes and Plotly

Using ggthemes opens up another set of custom graph filters for styling your graphs. To get started, you'll want to install ggthemes.

library("devtools")
install_github("ggthemes", "jrnold")

and load your data.

library("ggplot2")
library("ggthemes")
dsamp <- diamonds[sample(nrow(diamonds), 1000), ]



Inverse gray.

gray <- (qplot(carat, price, data = dsamp, colour = cut) + 
           theme_igray())
py$ggplotly(gray)



The Tableau scale.

tableau <- (qplot(carat, price, data = dsamp, colour = cut) + 
              theme_igray() + 
              scale_colour_tableau())
py$ggplotly(tableau)



Stephen Few's scale.

few <- (qplot(carat, price, data = dsamp, colour = cut) + 
          theme_few() + 
          scale_colour_few())
py$ggplotly(few)
## Error: Failed connect to plot.ly:443; Operation timed out

To leave a comment for the author, please follow the link and comment on his blog: rOpenSci Blog - R.


Notes from the Tokyo R User Group meeting, 17 April 2014


(This article was first published on mages' blog, and kindly contributed to R-bloggers)
Last Thursday I had the pleasure to attend the Tokyo R user group meeting. And what a fun meeting it was! Over 40 R users had come together in central Tokyo. Yohei Sato, who organises the meetings, allowed me to talk a little about the recent developments of the googleVis package.


Thankfully all talks were given in English:

Following the meeting the user group had booked a pub around the corner for a few drinks and some food. Brilliant!

Delicious chicken steaks and rice porridge

The next morning, as I woke up on the 23rd floor of a hotel in Shinjuku, I felt that my bed was moving. I am sure it was an earthquake, but what a weird feeling it was with a little hangover.

To leave a comment for the author, please follow the link and comment on his blog: mages' blog.


Overlaying species occurrence data with climate data


(This article was first published on rOpenSci Blog - R, and kindly contributed to R-bloggers)

One of the goals of rOpenSci is to facilitate interoperability between different data sources around the web with our tools. We can achieve this by providing functionality within our packages that converts data coming down via web APIs in one format (often a provider-specific schema) into a standard format. The new version of rWBclimate that we just posted to CRAN does just that. In an earlier post I wrote about how users could combine data from both rgbif and rWBclimate. Back then I just thought it was pretty cool that you could overlay the points on a nice climate map. Now we've come a long way, with the development of an easier-to-use and more comprehensive package for accessing species occurrence data, spocc, and with added conversion functions to create spatial objects out of both climate map data and species occurrence data. The result is that you can grab data from both sources and then extract climate information about your species occurrence data.

In the example below I'm going to download climate data at the basin level for the US and Mexico, and then species occurrences for eight different tree species. I'll then extract the temperature at each point with a spatial overlay and look at the distribution of temperatures for each species. Furthermore, the conversion functions to spatial objects will allow you to use our data with any shapefiles you might have.

The first step is to grab the KML files for each river basin making up the US and Mexico, which we identify with an integer.

library(rWBclimate)
library(spocc)
library(taxize)
library(plyr)  # for ldply() and ddply() used below
### Create path to store kml's
dir.create("~/kmltmp")
options(kmlpath = "~/kmltmp")
options(stringsAsFactors = FALSE)

usmex <- c(273:284, 328:365)
### Download KML's and read them in.
usmex.basin <- create_map_df(usmex)
## Download temperature data
temp.dat <- get_historical_temp(usmex, "decade")
temp.dat <- subset(temp.dat, temp.dat$year == 2000)


# Bind temperature data to map data frame

usmex.map.df <- climate_map(usmex.basin, temp.dat, return_map = F)

Now we have created a map of the US and Mexico, downloaded the average temperature in each basin between 1990 and 2000, and bound them together. Next let's grab occurrence data using spocc for our eight tree species

## Grab some species occurrence data for the 8 tree species.

splist <- c("Acer saccharum", "Abies balsamea", "Arbutus xalapensis", "Betula alleghaniensis", 
    "Chilopsis linearis", "Conocarpus erectus", "Populus tremuloides", "Larix laricina")

## get data from bison and gbif
splist <- sort(splist)
out <- occ(query = splist, from = c("bison", "gbif"), limit = 100)

## scrub names
out <- fixnames(out, how = "query")

## Create a data frame of all data.

out_df <- occ2df(out)

Now that we've downloaded the data using their Latin names, we might want to know the common names. Luckily the taxize package is great for that, and we can grab them with just a couple of lines of code.

### grab common names
cname <- ldply(sci2comm(get_tsn(splist), db = "itis", simplify = TRUE), function(x) {
    return(x[1])
})[, 2]
### Now let's create a vector of common names for easy plotting. But first,
### order on names so we can just add the names
out_df <- out_df[order(out_df$name), ]
### strip NA values and 0 values of coordinates
out_df <- out_df[!is.na(out_df$lat), ]
out_df <- out_df[out_df$lat > 0, ]
out_df$common <- rep(cname, table(out_df$name))

Now we have all the components we need: species data and spatial polygons with temperature data bound to them. Before we do the spatial overlay, let's do a quick visualization.

## Now just create the base temperature map
usmex.map <- ggplot() + geom_polygon(data = usmex.map.df, aes(x = long, y = lat, 
    group = group, fill = data, alpha = 0.9)) + scale_fill_continuous("Average annual \n temp: 1990-2000", 
    low = "yellow", high = "red") + guides(alpha = F) + theme_bw(10)

## And overlay of gbif data
usmex.map <- usmex.map + geom_point(data = out_df, aes(y = latitude, x = longitude, 
    group = common, colour = common)) + xlim(-125, -59) + ylim(5, 55)

print(usmex.map)

plot of chunk mapping

Now the question is, what's the temperature at each point for each tree species? We can convert our species data to spatial points with occ_to_sp, and our data from rWBclimate can be converted to spatial polygons with kml_to_sp. Next we can loop through each grouping of species, and call the over function to get the temperature at each point.

## Create a spatial polygon dataframe binding kml polygons to temperature
## data
temp_sdf <- kml_to_sp(usmex.basin, df = temp.dat)
### Now we can change the points to a spatial polygon:
sp_points <- occ_to_sp(out)

tdat <- vector()
### Get averages
for (i in 1:length(splist)) {
    tmp_sp <- sp_points[which(sp_points$name == splist[i]), ]
    tmp_t <- over(tmp_sp, temp_sdf)$data
    tdat <- c(tdat, tmp_t)
}

The last step is to create a new data frame with our data. Unfortunately our old data frame out_df won't be the same size, due to some invalid lat/longs that came down with our data, so the entire data frame will be reassembled. After we assemble the data frame we can summarize it with plyr, getting the mean temperature and latitude for each species.

### Assemble new dataframe
spDF <- data.frame(matrix(nrow = dim(sp_points)[1], ncol = 0))
spDF$species <- sp_points$name
spDF <- cbind(coordinates(sp_points), spDF)

### This is important: be sure to order all the points alphabetically, as we
### did earlier
spDF <- spDF[order(spDF$species), ]

spDF$cname <- rep(cname, table(sp_points$name))
spDF$temp <- tdat
### Strip NA's
spDF <- spDF[!is.na(spDF$temp), ]

## Create summary
summary_data <- ddply(spDF, .(cname), summarise, mlat = mean(latitude), mtemp = mean(temp), 
    sdlat = sd(latitude), sdtemp = sd(temp))

First let's look at a plot of mean temperature vs latitude; to identify the points we'll plot their common names.

ggplot(summary_data, aes(x = mlat, y = mtemp, label = cname)) + geom_text() + 
    xlab("Mean Latitude") + ylab("Mean Temperature (C)") + theme_bw() + xlim(10, 
    50)

plot of chunk means

This gives us a sense about how the means of each value are related, but we can also look at the distribution of temperatures with boxplots.

ggplot(spDF, aes(as.factor(cname), temp)) + geom_boxplot() + theme_bw(13) + 
    ylab("Temperature") + xlab("Common Name") + theme(axis.text.x = element_text(angle = 45, 
    hjust = 0.5, vjust = 0.5))

plot of chunk boxplots

This gives a sense of how wide the temperature distributions are, as well as showing some of the outliers. The distributions look pretty skewed, which probably reflects the large spatial granularity of our temperature data compared to the occurrence data. However, this example shows how you can easily combine data from multiple rOpenSci packages. We will continue to work towards enhancing the interoperability of heterogeneous data streams via our tools.

To leave a comment for the author, please follow the link and comment on his blog: rOpenSci Blog - R.


R activity around the world


(This article was first published on rapporter, and kindly contributed to R-bloggers)
This project was inspired by "Where is the R Activity?" and our follow-up post on the number of useR! 2013 attendees. But instead of static maps, this time we gathered a bunch of R-related data from a variety of different sources to create some interactive cartograms highlighting the focus of R activity from various points of view. Like the number of R Foundation members per country all over the world:

The number of ordinary members in the R Foundation
Figure 1. The number of ordinary members in the R Foundation (click for interactive map)

Please click on the above image or URL to see the interactive, D3.js-driven map, where hovering the mouse over any country reveals some detailed statistics on a number of R-related metrics. A menu with a few settings can be activated by hovering the mouse over the small blue triangle at the top.

We have also fetched the number of other members (supporters, donors etc.) from the main R-project.org site, and computed the number of all R Foundation members per 1,000 persons, which shows a slightly modified plot -- due to the population-weighted scale:

The number of R Foundation members per 1,000 persons
Figure 2. The number of R Foundation members per 1,000 persons (click for interactive map)

And there are quite a few other metrics we collected from different data sources and merged in R:
  • the lists of attendees and participants of the annual useR! conferences were usually fetched from publicly available data on the conference homepages (2004, 2006, 2008, 2010, 2011, 2013); in other cases (2009, 2012) the organizing committee kindly contributed the lists. 2007 is still missing.
  • the number of R User Groups and the number of members was fetched from meetup.com, although we are aware of the fact that only a subset of RUGs are hosted at that provider. This results in some degree of bias, and we would be extremely happy to get some help to fine-tune this database.
  • the number of CRAN package downloads in 2013 was fetched from the RStudio Cloud CRAN mirror, just like in the original blog post (a minimal sketch of this kind of log processing is shown after the figure captions below). We decided to check the number of overall downloads and also downloads for 5 selected packages. This latter extra work resulted in more options to render the cartogram; e.g. Rcmdr downloads might be higher than devtools downloads in countries where R is used in education but not much R development takes place.
  • online search queries were downloaded from Google Trends.
  • top R GitHub users were identified and fetched from its wonderful API. Unfortunately the search API limits the results to 1,000 elements, so this data should rather be considered a sample of the most active R users on GitHub. The plots reflect the proportion of such users in each country.
  • and the number of visits at R-bloggers.com (on Figure 3 and 4) were kindly contributed by Tal Galili. Thank you, Tal!
Figure 3. The number of visits at R-bloggers.com (click for interactive map)
Figure 4. The number of visits at R-bloggers.com per 1,000 persons (click for interactive map)
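
As a minimal sketch of this kind of log processing (not necessarily the exact code behind the cartograms; the URL pattern and column names are those documented at http://cran-logs.rstudio.com/, and the date is arbitrary):

url <- "http://cran-logs.rstudio.com/2013/2013-06-01.csv.gz"
tmp <- tempfile(fileext = ".csv.gz")
download.file(url, tmp, mode = "wb")
logs <- read.csv(gzfile(tmp), stringsAsFactors = FALSE)
## total downloads per country (two-letter ISO codes)
head(sort(table(logs$country), decreasing = TRUE))
## downloads of a single package, e.g. Rcmdr
sum(logs$package == "Rcmdr", na.rm = TRUE)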

The most time-consuming activity in data collection was standardizing the country names, and even more, manually identifying the country of each record when no location data was provided, which resulted in endless hours of desktop research. But our intern did a great job, although he probably knows the names of at least 2,000 different R users from all around the world by now :)
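
For the automated part of that cleanup, a package such as countrycode can map free-form country names to standard codes; a small sketch only (it will not catch the records that needed manual research):

library(countrycode)
raw <- c("United States", "United Kingdom", "South Korea", "Germany")
countrycode(raw, origin = "country.name", destination = "iso2c")
## [1] "US" "GB" "KR" "DE"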

Feedback is highly welcome; I would love to hear from the more than 3,000 useR! conference participants:

Figure 5. The number of useR! attendees per 1,000 persons in each country (click for interactive map)

Or from the more than 40,000 identified RUG members:

The number of R Meetup members
Figure 6. The number of R Meetup members (click for interactive map)

Or from anyone else who has contributed to the 31 and a half million R package downloads in 2013:

Figure 7. The number of R package downloads (click for interactive map)

And congrats to Switzerland, the number one countRy by our artificial and arbitrary Rank, which was computed by averaging the population-weighted R-related variables mentioned above:

Figure 8. The global R index (click for interactive map)

To leave a comment for the author, please follow the link and comment on his blog: rapporter.


Project Tycho, Correlation between states


(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)
In this fourth post on Measles data I want to have a look at correlation between states. As described before, the data is from Project Tycho, which contains data from all weekly notifiable disease reports for the United States dating back to 1888. These data are freely available to anybody interested.

Data

I discovered an error in the previous code which made 1960 appear twice, hence the updated script.
setwd('/home/kees/Documents/tycho/')
r1 <- read.csv('MEASLES_Cases_1909-1982_20140323140631.csv',
    na.strings='-',
    skip=2)
r2 <- reshape(r1,
    varying=names(r1)[-c(1,2)],
    v.names='Cases',
    idvar=c('YEAR' , 'WEEK'),
    times=names(r1)[-c(1,2)],
    timevar='STATE',
    direction='long')
r2$STATE=factor(r2$STATE)

####################
years <- dir(pattern='+.txt')
years

pop1 <-
    lapply(years,function(x) {
            rl <- readLines(x)
            print(x)
            sp <- grep('^U.S.',rl)
            st1 <- grep('^AL',rl)
            st2 <- grep('^WY',rl)
            rl1 <- rl[c(sp[1]-2,st1[1]:st2[1])]
            rl2 <- rl[c(sp[2]-2,st1[2]:st2[2])]
           
            read1 <- function(rlx) {
                rlx[1] <- paste('abb',rlx[1])
                rlx <- gsub(',','',rlx,fixed=TRUE)
                rt <- read.table(textConnection(rlx),header=TRUE)
                rt[,grep('census',names(rt),invert=TRUE)]
            }
            rr <- merge(read1(rl1),read1(rl2))
            ll <- reshape(rr,
                list(names(rr)[-1]),
                v.names='pop',
                timevar='YEAR',
                idvar='abb',
                times=as.integer(gsub('X','',names(rr)[-1])),
                direction='long')
        })
pop <- do.call(rbind,pop1)
pop <- pop[grep('19601',rownames(pop),invert=TRUE),]

states <- rbind(
    data.frame(
        abb=datasets::state.abb,
        State=datasets::state.name),
    data.frame(abb='DC',
        State='District of Columbia'))
states$STATE=gsub(' ','.',toupper(states$State))

r3 <- merge(r2,states)
r4 <- merge(r3,pop)
r4$incidence <- r4$Cases/r4$pop

r5 <- subset(r4,r4$YEAR>1927,-STATE)
r6 <- r5[complete.cases(r5),]

New variable

In previous posts it became clear that there is in general a yearly cycle. However, the minimum in this cycle is in summer. This means that for a yearly summary it might be best not to use calendar years, but rather something which breaks during summer. My choice is week 37.
with(r6[r6$WEEK>30 & r6$WEEK<45,],
    aggregate(incidence,by=list(WEEK=WEEK),mean))

   WEEK           x
1    31 0.016757440
2    32 0.013776182
3    33 0.011313391
4    34 0.008783259
5    35 0.007348603
6    36 0.006843930
7    37 0.006528467
8    38 0.007078171
9    39 0.008652546
10   40 0.016784205
11   41 0.013392375
12   42 0.016158805
13   43 0.018391632
14   44 0.021788221
r6$cycle <- r6$YEAR + (r6$WEEK>37)

Plot

States over time

Since not all states have complete data, it was decided to use state-year combinations with at least 40 observations (weeks). As can be seen there is some correlation between states, especially in 1945.  If anything, correlation gets weaker past 1955.
library(ggplot2)
ggplot(with(r6,aggregate(incidence,
                list(cycle=cycle,
                    State=State),
                function(x)
                    if(length(x)>40)
                        sum(x) else
                        NA)),
        aes(cycle, x,group=State)) +
    geom_line(size=.1) +
    ylab('Incidence registered Measles Cases Per Year') +
    theme(text=element_text(family='Arial')) +
    scale_y_log10()

Between states

I have seen too many examples of people rebuilding maps based on travelling times or distances. Now I want to do the same. A proper (Euclidean) distance between the states would make the year/week combinations the variables, which gives all kinds of scaling issues. What I did instead is use correlation and transform that into something distance-like. ftime is just a helper variable, so that I am sure the reshape works correctly.
r6$ftime <- interaction(r6$YEAR,r6$WEEK)
xm <- reshape(r6,
    v.names='incidence',
    idvar='ftime',
    timevar='State',
    direction='wide',
    drop=c('abb','Cases','pop'))

xm2 <- xm[,grep('incidence',names(xm))]
cc <- cor(xm2,use='pairwise.complete.obs')
dimnames(cc) <- lapply(dimnames(cc),function(x) sub('incidence.','',x))
dd <- as.dist(1-cc/2)

The heatmap reveals the structure best.
heatmap(as.matrix(dd),dist=as.dist,symm=TRUE)
MDS is the nicest to look at. I will leave comparisons to the US map to those who actually know all these states' relative locations.
library(MASS)
mdsx <- isoMDS(dd)
par(mai=rep(0,4))
plot(mdsx$points,
    type = "n",
    axes=FALSE,
    xlim=c(-1,1),
    ylim=c(-1,1.1))
text(mdsx$points, labels = dimnames(cc)[[1]])



References

Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158.

To leave a comment for the author, please follow the link and comment on his blog: Wiekvoet.


colormap


(This article was first published on Dan Kelley Blog/R, and kindly contributed to R-bloggers)

Introduction

Over the past month or so I have been trying different ways of handling GMT-style colormaps in Oce. I think my present solution is on the right track, but I am posting here to get more eyes on the problem.

Note that the function called Colormap() here will be called colormap() if I decide to incorporate it into Oce. That will mean that it replaces an old function of the same name. Also, as part of the change, the old function colorize() will disappear. (To call them “old” is a stretch. They are about a month old and were marked as “Alpha” code from the start.)

Procedure

The following code is direct from the help for Colormap(); all I’ve done is to put the example code into Rmarkdown to make for easier comparison with the resultant graphs.

library(oce)
## Loading required package: methods
## Loading required package: mapproj
## Loading required package: maps
## Loading required package: ncdf4
## Loading required package: tiff
## Example 1. color scheme for points on xy plot
x <- seq(0, 1, length.out = 40)
y <- sin(2 * pi * x)
par(mar = c(3, 3, 1, 1))
mar <- par("mar")  # prevent margin creep by drawPalette()
## First, default breaks
c <- Colormap(y)
drawPalette(c$zlim, col = c$col, breaks = c$breaks)
plot(x, y, bg = c$zcol, pch = 21, cex = 1)
grid()

par(mar = mar)
## Second, 100 breaks, yielding a smoother palette
c <- Colormap(y, breaks = 100)
drawPalette(c$zlim, col = c$col, breaks = c$breaks)
plot(x, y, bg = c$zcol, pch = 21, cex = 1)
grid()

par(mar = mar)

## Example 2. topographic image with a standard color scheme
par(mfrow = c(1, 1))
data(topoWorld)
cm <- Colormap(name = "gmt_globe")
imagep(topoWorld, breaks = cm$breaks, col = cm$col)

## Example 3. topographic image with modified colors
cm <- Colormap(name = "gmt_globe")
deep <- cm$x0 < -4000
cm$col0[deep] <- "black"
cm$col1[deep] <- "black"
cm <- Colormap(x0 = cm$x0, x1 = cm$x1, col0 = cm$col0, col1 = cm$col1)
imagep(topoWorld, breaks = cm$breaks, col = cm$col)

## Example 4. image of world topography with water colorized smoothly from
## violet at 8km depth to blue at 4km depth, then blending in 0.5km
## increments to white at the coast, with tan for land.
cm <- Colormap(x0 = c(-8000, -4000, 0, 100), x1 = c(-8000, -4000, 0, 100), col0 = c("violet", 
    "blue", "white", "tan"), col1 = c("violet", "blue", "white", "tan"), n = c(100, 
    8, 1))
lon <- topoWorld[["longitude"]]
lat <- topoWorld[["latitude"]]
z <- topoWorld[["z"]]
imagep(lon, lat, z, breaks = cm$breaks, col = cm$col)
contour(lon, lat, z, levels = 0, add = TRUE)


Resources

To leave a comment for the author, please follow the link and comment on his blog: Dan Kelley Blog/R.


Plotting Microtiter Plate Maps


(This article was first published on Brian Connelly » R | Brian Connelly, and kindly contributed to R-bloggers)

I recently wrote about my workflow for Analyzing Microbial Growth with R. Perhaps the most important part of that process is the plate map, which describes the different experimental variables and where they occur. In the example case, the plate map described which strain was growing and in which environment for each of the wells used in a 96-well microtiter plate. Until recently, I’ve always created two plate maps. The first one is hand-drawn using pens and markers and sat on the bench with me when I started an experiment. By marking the wells with different colors, line types, and whatever other hieroglyphics I decide on, I can keep track of where everything is and how to inoculate the wells.

A Plate Map

The second is a CSV file that contains a row for each well that I use and columns describing the values of each of my experimental variables. This file contains all of the information that I had on my hand-drawn plate map, but in a format that I can later merge with my result data to produce a fully-annotated data set. The fully-annotated data set is the perfect format for plotting with tools like ggplot2 or for sharing with others.

    Well Strain Environment
 1    B2      A           1
 2    B3      B           1
 3    B4      C           1
 4    B5     <NA>         1
 5    B6      A           2
 6    B7      B           2
 7    B8      C           2
 8    B9     <NA>         2
 9   B10      A           3
 10  B11      B           3

But when talking with Carrie Glenney, whom I’ve been convincing of the awesomeness of the CSV/dplyr/ggplot workflow, I realized that there’s really no need to have two separate plate maps. Since all the information is in the CSV plate map, why bother drawing one out on paper? This post describes how I’ve started using ggplot2 to create a nice plate map image that I can print and take with me to the bench or paste in my lab notebook.

Reading in the Plate Map

First, load your plate map file into R. You may need to first change your working directory with setwd or give read.csv the full path of the plate map file.

platemap <- read.csv("platemap.csv")

If you don’t yet have a plate map of your own, you can use this sample plate map.

Extracting Row and Column Numbers

In my plate maps, I refer to each well by its row-column pair, like “C6”. To make things easier to draw, we’re going to be splitting those well IDs into their row and column numbers. So for “C6”, we’ll get row 3 and column 6. This process is easy with dplyr’s mutate function. If you haven’t installed dplyr, you can get it by running install.packages('dplyr').

library(dplyr)

platemap <- mutate(platemap,
                   Row=as.numeric(match(toupper(substr(Well, 1, 1)), LETTERS)),
                   Column=as.numeric(substr(Well, 2, 5)))

Once this is done, the platemap data frame will now have two additional columns, Row and Column, which contain the row and column numbers associated with the well in the Well column, respectively.

Drawing the Plate

Microtiter plates are arranged in a grid, so it’s not a big leap to think about a plate as a plot containing the row values along the Y axis and the column values along the X axis. So let’s use ggplot2 to create a scatter plot of all of the wells in the plate map. We’ll also give it a title.

library(ggplot2)

ggplot(data=platemap, aes(x=Column, y=Row)) +
    geom_point(size=10) +
    labs(title="Plate Layout for My Experiment")

First Plot

As you can see, this plot doesn’t tell us anything about our experiment other than the wells it uses and their location.

Showing Empty Wells

I often don’t use all 96 wells in my experiments. It is useful, however, to show all of them. This makes it obvious which wells are used and helps orient your eyes when shifting between the plate map and the plate. Because of this, we’ll create some white circles with a light grey border for all 96 wells below the points that we’ve already created. We’ll also change the aspect ratio of the plot so that it better matches the proportions of a 96-well plate.

ggplot(data=platemap, aes(x=Column, y=Row)) +
    geom_point(data=expand.grid(seq(1, 12), seq(1, 8)), aes(x=Var1, y=Var2),
               color="grey90", fill="white", shape=21, size=6) +
    geom_point(size=10) +
    coord_fixed(ratio=(13/12)/(9/8), xlim = c(0.5, 12.5), ylim=c(0.5, 8.5)) +
    labs(title="Plate Layout for My Experiment")

plot of chunk plot2 blank wells and aspect ratio

Flipping the Axis

Now that we are showing all 96 wells, one thing becomes clear—the plot arranges the rows from 1 on the bottom to 8 at the top, which is opposite of how microtiter plates are labeled. Fortunately, we can easily flip the Y axis. While we’re at it, we’ll also tell the Y axis to use letters instead of numbers and to draw these labels for each value. Similarly, we’ll label each column value along the X axis.

ggplot(data=platemap, aes(x=Column, y=Row)) +
    geom_point(data=expand.grid(seq(1, 12), seq(1, 8)), aes(x=Var1, y=Var2),
               color="grey90", fill="white", shape=21, size=6) +
    geom_point(size=10) +
    coord_fixed(ratio=(13/12)/(9/8), xlim=c(0.5, 12.5), ylim=c(0.5, 8.5)) +
    scale_y_reverse(breaks=seq(1, 8), labels=LETTERS[1:8]) +
    scale_x_continuous(breaks=seq(1, 12)) +
    labs(title="Plate Layout for My Experiment")

Axes flipped

For those who would like to mimic the look of a microtiter plate even more closely, I have some bad news. It’s not possible to place the X axis labels above the plot. Not without some complicated tricks, at least.

Removing Grids and other Plot Elements

Although the plot is starting to look a lot like a microtiter plate, there’s still some unnecessary “chart junk”, such as grids and tick marks along the axes. To create a more straightforward plate map, we can apply a theme that will strip these elements out. My theme for doing this (theme_bdc_microtiter) is available as part of the ggplot2bdc package. Follow that link for installation instructions. Once installed, we can now apply the theme:

library(ggplot2bdc)

ggplot(data=platemap, aes(x=Column, y=Row)) +
    geom_point(data=expand.grid(seq(1, 12), seq(1, 8)), aes(x=Var1, y=Var2),
               color="grey90", fill="white", shape=21, size=6) +
    geom_point(size=10) +
    coord_fixed(ratio=(13/12)/(9/8), xlim=c(0.5, 12.5), ylim=c(0.5, 8.5)) +
    scale_y_reverse(breaks=seq(1, 8), labels=LETTERS[1:8]) +
    scale_x_continuous(breaks=seq(1, 12)) +
    labs(title="Plate Layout for My Experiment") +
    theme_bdc_microtiter()

plot of chunk theme

Highlighting Experimental Variables

Now that our plot is nicely formatted, it’s time to get back to the main point of all of this—displaying the values of the different experimental variables.

You’ll first need to think about how to best encode each of these values. For this, ggplot provides a number of aesthetics, such as color, shape, size, and opacity. There are no one-size-fits-all rules for this. If you’re interested in this topic, Jacques Bertin’s classic Semiology of Graphics has some great information, and Jeff Heer and Mike Bostock’s Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design is very interesting. After a little experimentation you should be able to figure out which encodings best represent your data.

You’ll also need to consider the data types of the experimental variables, because it’s not possible to map a shape or some other discrete property to continuous values.

Here, we’ll show the different environments using shapes, and the different strains using color. When R imported the plate map, it interpreted the Environment variable as continuous (not a crazy assumption, since it has values 1, 2, and 3). We’re first going to be transforming it to a categorical variable (factor in R speak) so that we can map it to a shape. We’ll then pass our encodings to ggplot as the aes argument to geom_point.

platemap$Environment <- as.factor(platemap$Environment)

ggplot(data=platemap, aes(x=Column, y=Row)) +
    geom_point(data=expand.grid(seq(1, 12), seq(1, 8)), aes(x=Var1, y=Var2),
               color="grey90", fill="white", shape=21, size=6) +
    geom_point(aes(shape=Environment, colour=Strain), size=10) +
    coord_fixed(ratio=(13/12)/(9/8), xlim=c(0.5, 12.5), ylim=c(0.5, 8.5)) +
    scale_y_reverse(breaks=seq(1, 8), labels=LETTERS[1:8]) +
    scale_x_continuous(breaks=seq(1, 12)) +
    labs(title="Plate Layout for My Experiment") +
    theme_bdc_microtiter()

plot of chunk full plot

Changing Colors, Shapes, Etc.

By default, ggplot will use a default ordering of shapes and colors. If you’d prefer to use a different set, either because they make the data easier to interpret (see Sharon Lin and Jeffrey Heer’s fascinating The Right Colors Make Data Easier To Read) or for some other reason, we can adjust them. I’ll change the colors used to blue, red, and black, which I normally associate with these strains. Although these colors aren’t quite as aesthetically pleasing as ggplot2’s defaults, I use them because they are the colors of markers I have at my bench.

ggplot(data=platemap, aes(x=Column, y=Row)) +
    geom_point(data=expand.grid(seq(1, 12), seq(1, 8)), aes(x=Var1, y=Var2),
               color="grey90", fill="white", shape=21, size=6) +
    geom_point(aes(shape=Environment, colour=Strain), size=10) +
    scale_color_manual(values=c("A"="blue", "B"="red", "C"="black")) +
    coord_fixed(ratio=(13/12)/(9/8), xlim=c(0.5, 12.5), ylim=c(0.5, 8.5)) +
    scale_y_reverse(breaks=seq(1, 8), labels=LETTERS[1:8]) +
    scale_x_continuous(breaks=seq(1, 12)) +
    labs(title="Plate Layout for My Experiment") +
    theme_bdc_microtiter()

plot of chunk scale colors

Wrap-Up

And that’s all it takes! You can now save the plot using ggsave, print it, add it to some slides, or anything else. In the future, I’ll describe similar visualizations that allow exploration of the annotated data set, which contains the plate map information along with the actual data.
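
For example, a minimal ggsave call (the file name and dimensions are arbitrary) writes the most recently drawn plot to disk:

ggsave("platemap.png", width = 10, height = 7, dpi = 300)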

Many thanks to Sarah Hammarlund for her comments on a draft of this post!

To leave a comment for the author, please follow the link and comment on his blog: Brian Connelly » R | Brian Connelly.


Introducing Statwing


(This article was first published on CoolStatsBlog » R, and kindly contributed to R-bloggers)

Recently, Greg Laughlin, the founder of a new statistical software package called Statwing, let me try his product for free. I happen to like free things very much (the college student is strong within me) so I gave it a try.

www.statwing.com

I mostly like how easy it is to use: For instance, to relate two attributes like Age and Income, you click Age, click Income, and click Relate.

So what can Statwing do?

  1. Summarize an attribute (like “age”): totals, averages, standard deviation, confidence intervals, percentiles, visual graphs like the one below
  2. Relate two columns together (“Openness” vs “Extraversion”)
  • Plots the two attributes against each other to see how they relate. It will include the formula of the regression line and the R-squared value.
  • Sometimes a chi-square-style table is more appropriate. The software determines how best to represent the data.
     
  • Tests the null hypothesis that the attributes are independent, by a T-test, F-test (ANOVA) or chi-square test. Statwing determines which one is appropriate.
  • Repeat the above for a ranked correlation.

For now, you can’t forecast a time series or represent data on maps. But Greg told me that the team is adding new features as I type this.

If you’d like to try the software yourself, click here. They’ve got three sample datasets to play with:

  1. Titanic passengers information
  2. The results of a psychological survey
  3. A list of congressmen, their voting records and donations.

Abbas Keshvani


To leave a comment for the author, please follow the link and comment on his blog: CoolStatsBlog » R.


In case you missed it: April 2014 roundup


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from April of particular interest to R users:  

Registration is now open for the useR! 2014 R conference in Los Angeles. 

A new Kaggle competition challenges R users to predict which shoppers will become repeat buyers.

Data on R usage around the world, presented as an interactive map.

The New York Times publishes the R code behind their new US Senate election forecast feature.

Talent Analytics uses R to understand the factors that lead employees to resign.

Thomas Dinsmore compares performance benchmarks for SAS and Revolution R Enterprise.

A succinct example of Simpson's Paradox: "Good for women, good for men, bad for people".

A replay of the Revolution Analytics webinar, Big-Data Trees for R.

A local newspaper features R and the weatherData package.

I talked about data scientists using R in a DM Radio podcast.

A look at the R H2O package, which provides an interface to the 0xdata distributed algorithms.

Some practical examples explain why vectorized programming in R improves code clarity and performance

Revolution Analytics' Daniel Hanson provides an introduction to Monte-Carlo simulation of financial time series.

A new CRAN task view dedicated to interfacing R with social media, open data, and other Web technologies.

An R script to create an impressionistic avatar from your Twitter followers

A summary of the new features in R 3.1.0 "Spring Dance"

R used to analyze character connections in the Star Wars movies, and other applications presented at the Bay Area R Users Group.

The choroplethr package can now create animated data maps.

A new R-based blog from Norman Matloff, author of The Art of R Programming.

A comprehensive overview of R packages for ensemble modeling 

A list of R packages and resources for generalized linear modeling.

An in-depth article in FastCompany Labs surveys open science with R

Seven data points quantifying the recent growth of R.

An example of vectorization in R, looking at the Collatz Conjecture.

General interest stories (not related to R) in the past month included: visible sound, how dogs react to magic, the generic brand video, arguments pro and con for Big Data and the 2048 game.

As always, thanks for the comments and please send any suggestions to me at david@revolutionanalytics.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


On the carbon footprint of the NBA


(This article was first published on Stat Of Mind, and kindly contributed to R-bloggers)

It’s no secret that I enjoy basketball, but I’ve often wondered about the carbon footprint that can be caused by 30 teams each playing an 82-game season. Ultimately, that’s 2460 air flights across the whole of the USA, each carrying 30+ individuals.

For these reasons, I decided to investigate the average distance travelled by each NBA team during the 2013-2014 NBA season. In order to do so, I had to obtain the game schedule for the whole 2013-2014 season, but also the distances between arenas in which games are played. While obtaining the regular season schedule was straightforward (a shameless copy and paste), for the distance between arenas, I first had to extract the coordinates of each arena, which could be achieved using the geocode function in the ggmap package.

Example: finding the coordinates of NBA arenas:

# find geocode location of a given NBA arena
library(maps)
library(mapdata)
library(ggmap)
geo.tag1 <- geocode('Bankers Life Fieldhouse')
geo.tag2 <- geocode('Madison Square Garden')
print(geo.tag1)
geo.tag1
        lon     lat
1 -86.15578 39.7639

Once the coordinates of all NBA arenas were obtained, we could use this information to compute the pairwise distance matrix between the NBA arenas. However, we first had to define a function to compute the distance between two latitude-longitude pairs.

Computing the distance between two coordinate points:

# Function to calculate distance in kilometers between two points
# reference: http://andrew.hedges.name/experiments/haversine/
earth.dist <- function (lon1, lat1, lon2, lat2, R)
{
  rad <- pi/180
  a1 <- lat1 * rad
  a2 <- lon1 * rad
  b1 <- lat2 * rad
  b2 <- lon2 * rad
  dlon <- b2 - a2
  dlat <- b1 - a1
  a <- (sin(dlat/2))^2 + cos(a1) * cos(b1) * (sin(dlon/2))^2
  c <- 2 * atan2(sqrt(a), sqrt(1 - a))
  d <- R * c
  real.d <- min(abs((R*2) - d), d)
  return(real.d)
}

Using the function above and the coordinates of NBA arenas, the distance between any two given NBA arenas can be computed with the following lines of code.
Computing the distance matrix between all NBA arenas:

# compute distance between each NBA arena
dist <- c()
R <- 6378.145 # define radius of earth in km
lon1 <- geo.tag1$lon
lat1 <- geo.tag1$lat
lon2 <- geo.tag2$lon
lat2 <- geo.tag2$lat
dist <- earth.dist(lon1, lat1, lon2, lat2, R)

print(dist)
485.6051

By performing this operation on all pairs of NBA teams, we can compute a distance matrix, which can be used in conjunction with the 2013-2014 regular season schedule to compute the total distance travelled by each NBA team. Finally, all that was left was to visualize the data in an attractive manner. I find the googleVis package is a great resource for that, as it provides a convenient interface between R and the Google Chart Tools API. Because wordpress.com does not support javascript, you can view the interactive graph by clicking on the image below.

distance_barchart

Total distance (in km) travelled by all NBA teams during the 2013-2014 NBA regular season
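
As a minimal sketch of that aggregation step (not the exact code used for the chart above; a schedule data frame with columns date, home and away, and a distance matrix D with team names as dimnames, are assumed), each team's season travel can be approximated by summing the distances between consecutive game venues:

## D: pairwise distance matrix between arenas, with team names as dimnames
## schedule: data frame with columns date, home, away (assumed layout)
season_km <- function(team, schedule, D) {
  games <- schedule[schedule$home == team | schedule$away == team, ]
  games <- games[order(games$date), ]
  venues <- games$home  # each game is played at the home team's arena
  sum(D[cbind(venues[-length(venues)], venues[-1])])
}
## total_km <- sapply(rownames(D), season_km, schedule = schedule, D = D)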

Incredibly, we see that the aggregate number of kilometers travelled by NBA teams amounts to 2,108,806 kms! I hope the players have some kind of frequent flyer card… We can take this a step further by computing the amount of CO2 emitted by each NBA team during the 2013-2014 season. The NBA charters standard A319 Airbus planes, which, according to the Airbus website, emit an average of 9.92 kg of CO2 per km. Again, you can view the interactive graph of CO2 by clicking on the image below.

distance_CO2

Total amount of CO2 (in kg) emitted by all NBA teams during the 2013-2014 NBA regular season
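
Given per-team totals such as the total_km vector sketched above (an assumed name), the CO2 estimate is a single multiplication by that 9.92 kg/km figure:

## co2_kg <- total_km * 9.92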

Not surprisingly, Oregon and California-based teams travel and pollute the most, since the NBA is mid-east / east coast heavy in its distribution of teams. It is somewhat ironic that the hipster / recycle-crazy / eco-friendly citizens of Portland also host the most polluting NBA team :-)
What is also interesting is to plot the trail of flights (or pollution) generated by the NBA throughout the season.

team_travels

Great circle maps of all airplane flights completed by NBA teams during the 2013-2014 regular season.
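
A minimal sketch of how such arcs can be drawn (an assumed approach using the geosphere package; the coordinates are approximate and purely illustrative):

library(maps)
library(geosphere)
portland <- c(-122.67, 45.53)  # approximate (lon, lat) of the Portland arena
miami <- c(-80.19, 25.78)      # approximate (lon, lat) of the Miami arena
map("usa")
arc <- gcIntermediate(portland, miami, n = 50, addStartEnd = TRUE)
lines(arc, col = rgb(0, 0, 1, 0.4), lwd = 2)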

I’ve been thinking about designing an algorithm that finds the NBA season schedule with minimal carbon footprint, which is essentially an optimization problem. The only issue is that there is a huge number of restrictions to consider, such as Christmas day games, first games of the season, etc. More on that later.
As usual, all the relevant code for this analysis can be found on my github account.


To leave a comment for the author, please follow the link and comment on his blog: Stat Of Mind.


rasterVis tutorials


(This article was first published on Omnia sunt Communia! » R-english, and kindly contributed to R-bloggers)

Agustin Lobo has recently published some good tutorials about rasterVis:

rastervisBTC

Regarding the second tutorial, there has been an interesting discussion in the R-sig-Geo mailing list about the projection of the ggmap output.

To leave a comment for the author, please follow the link and comment on his blog: Omnia sunt Communia! » R-english.


R has some sharp corners


(This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers)

R is definitely our first choice go-to analysis system. In our opinion you really shouldn’t use something else until you have an articulated reason (be it a need for larger data scale, different programming language, better data source integration, or something else). The advantages of R are numerous:

  • Single integrated work environment.
  • Powerful unified scripting/programming environment.
  • Many many good tutorials and books available.
  • Wide range of machine learning and statistical libraries.
  • Very solid standard statistical libraries.
  • Excellent graphing/plotting/visualization facilities (especially ggplot2).
  • Schema oriented data frames allowing batch operations, plus simple row and column manipulation.
  • Unified treatment of missing values (regardless of type).

For all that we always end up feeling just a little worried and a little guilty when introducing a new user to R. R is very powerful and often has more than one way to perform a common operation or represent a common data type. So you are never very far away from a strange and painful corner case. This is why, when you get R training, you need to make sure you get an R expert (and not an R apologist). One of my favorite very smart experts is Norm Matloff (even his most recent talk title is smart: “What no one else will tell you about R”). Also, buy his book; we are very happy we purchased it.

But back to corner cases. For each method in R you really need to double check if it actually works over the common R base data types (numeric, integer, character, factor, and logical). Not all of them do, and sometimes you get a surprise.

Recent corner case problems we ran into include:

  • randomForest regression fails on character arguments, but works on factors.
  • mgcv gam() model doesn’t convert strings to formulas.
  • R maps can’t use the empty string as a key (that is the string of length 0, not a NULL array or NA value).

These are all little things, but can be a pain to debug when you are in the middle of something else.

For our concrete example let’s concentrate on the pain generated by the empty string.

In R strings represent free-form text and factors represent strings from a pre-defined finite set of possibilities (actually called “levels”). The difference can be subtle and you may not always know which one you have (R may have converted for you) and which one will work (R may fail to convert for you).

Take for example the simple case of building a linear model mapping a string or factor valued x to numeric y-values:

d <- data.frame(y=c(1,2,3),x=c('a','b','c'))
lmModel <- lm(y~0+x,data=d)
d$lmPred <- predict(lmModel,newdata=d)

print(d)
  y x lmPred
1 1 a      1
2 2 b      2
3 3 c      3

We have used x to predict y. This works (under the covers) because R converts the string-values of x into factor levels and then uses those. For the basic details of how this works try: help('data.matrix') or help('contrasts'). For a bit more on conversion to indicators (which is pretty much automatic in R, but can be a painful manual step in some other machine learning frameworks) see chapter 2 section 2.2.3 of Practical Data Science with R.

Of course it is silly of us to use the entire lm() framework to model an expected value conditioned on a single categorical variable. That is what table() and aggregate() are for. lm() gets the right answer but it has to do some unnecessary steps (such as forming and inverting a design matrix, which happens to be diagonal, but probably wastes space in a non-sparse representation). To directly build a model that maps level to expected value we do the following:

mkMapModel <- function(yVarName,xVarName,data) {
  means <- aggregate(as.formula(paste(yVarName,xVarName,sep='~')),
     data=data,FUN=mean)
  model <- as.list(means[[yVarName]])
  names(model) <- means[[xVarName]]
  model
}
mapModel <- mkMapModel('y','x',d)
d$mapPred <- as.numeric(mapModel[d$x])

print(d)
  y x lmPred mapPred
1 1 a      1       1
2 2 b      2       2
3 3 c      3       3

This works great. It makes sense, and is much more efficient for variables that have a very large number of levels (see Modeling Trick: Impact Coding of Categorical Variables with Many Levels for more details). However it turns out this is only working because the x-variable is encoded as a factor. As the code below shows, when the x-variable takes on string values (called character in R) the lm() works (likely as it triggers a conversion to factor at some point), but our hand-rolled mkMapModel() fails:

d <- data.frame(y=c(1,2,3),x=c('a','b',''),stringsAsFactors=FALSE)
lmModel <- lm(y~0+x,data=d)
d$lmPred <- predict(lmModel,newdata=d)

print(d)
  y x lmPred
1 1 a      1
2 2 b      2
3 3        3

mapModel <- mkMapModel('y','x',d)
d$mapPred <- as.numeric(mapModel[d$x])

Error: (list) object cannot be coerced to type 'double'

This is because the empty string "" (not null, just the string of length 0) isn’t a legal map-key in R. Likely this is due to R’s linkages between evaluation environments and maps (and while the empty string may be a traditional string, it isn’t a traditional variable name; see the “environment->list coercion” example in help('as.list') for the connection).

One workaround is to make sure the x-variable is a factor (not a character array). We demonstrate this in the working code below. Another fix would be to use paste() to prefix all strings (so none are empty).

d <- data.frame(y=c(1,2,3),x=c('a','b',''),stringsAsFactors=TRUE)
mapModel <- mkMapModel('y','x',d)
d$mapPred <- as.numeric(mapModel[d$x])

print(d)
  y x mapPred
1 1 a       1
2 2 b       2
3 3         3
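
And here is a minimal sketch of the paste() prefix alternative mentioned above (the prefix itself is arbitrary), reusing mkMapModel():

d <- data.frame(y=c(1,2,3),x=c('a','b',''),stringsAsFactors=FALSE)
d$x <- paste('x',d$x,sep='_')   # '' becomes 'x_', which is a legal key
mapModel <- mkMapModel('y','x',d)
d$mapPred <- as.numeric(mapModel[d$x])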

The requirement to ensure strings are already converted to factors is not unique to our function mkMapModel() (for example the randomForest package has similar issues in some circumstances, but at least it does work with factors, unlike some Python scikit-learn packages). Even with such issues I think you are net-ahead using R (notice we didn’t have to write any for-loops, due to the vectorized nature of the operator []). With some defensive coding you certainly are ahead in using R’s built-in models (like lm()).

To leave a comment for the author, please follow the link and comment on his blog: Win-Vector Blog » R.


Next Kölner R User Meeting: Friday, 23 May 2014


(This article was first published on mages' blog, and kindly contributed to R-bloggers)
The next Cologne R user group meeting is scheduled for this Friday, 23 May 2014.

To celebrate our 10th meeting we welcome:
Followed by drinks and schnitzel at the Lux.

Further details available on our KölnRUG Meetup site. Please sign up if you would like to come along. Notes from past meetings are available here.


The organisers, Bernd Weiß and Markus Gesmann, gratefully acknowledge the sponsorship of Revolution Analytics, who support the Cologne R user group as part of their vector programme.



To leave a comment for the author, please follow the link and comment on his blog: mages' blog.


Allez les Bleus !


(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

In almost three weeks, the (FIFA) World Cup will start, in Brazil. I have to admit that I am not a big fan of soccer, so I will not talk too much about it. Actually, I wanted to talk about colors, and variations on some colors. For instance, there are a lot of blues. In order to visualize the standard blues, let us consider the following figure, inspired by the well-known chart of R colors:

BLUES=colors()[grep("blue",colors())]
RGBblues=col2rgb(BLUES)
library(grDevices)
HSVblues=rgb2hsv( RGBblues[1,], RGBblues[2,], RGBblues[3,])
HueOrderBlue=order( HSVblues[1,], HSVblues[2,], HSVblues[3,] )
SetTextContrastColor=function(color) ifelse( mean(col2rgb(color)) > 127, "black", "white")
TextContrastColor=unlist( lapply(BLUES, SetTextContrastColor) )
c=11
l=6
plot(0, type="n", ylab="", xlab="",axes=FALSE, ylim=c(0,11), xlim=c(0,6))
for (j in 1:11){
  for (i in 1:6){
  k=(j-1)*6 + i
rect(i-1,j-1,i,j, border=NA, col=BLUES[ HueOrderBlue[k] ])
text(i-.5,j-.5,paste(BLUES[k]), cex=0.75, col=TextContrastColor[ HueOrderBlue[k] ])}}

All the color names that contain “blue” are here.

 

Having the choice between several possible colors is interesting, but it can also be interesting to get a palette of blue colors. What we can get is the following:

library(RColorBrewer)
blues=colorRampPalette(brewer.pal(9,"Blues"))(100)

In order to illustrate the use of palette colors, consider some data on soccer players (officially registered). The dataset - lic-2012-v1.csv - can be downloaded from http://data.gouv.fr/fr/dataset/… (I will also use a dataset we have on the locations of all towns in France, with latitudes and longitudes).

base1=read.csv(
"http://freakonometrics.free.fr/popfr19752010.csv",
header=TRUE)
base1$cp=base1$dep*1000+base1$com
base2=read.csv("lic-2012-v1.csv", header=TRUE)
base2=base2[base2$fed_2012==111,]
names(base2)[1]="cp"
base2$cp=as.numeric(as.character(base2$cp))

The problem with France (I should probably say one of the many problems) is that regions and départements are not well coded in the standard functions. To explain where départements are, let us use the dept.rda file; then we can get a matching between R names and standard (administrative) ones:

base21=base2[,c("cp","l_2012","pop_2010")]
base21$dpt=trunc(base21$cp/1000)
library(maps)
load("dept.rda")
base21$nomdpt=dept$dept[match(as.numeric(base21$dpt),dept$CP)]
L=aggregate(base21$l_2012,by=list(Category=base21$nomdpt),FUN=sum)
P=aggregate(base21$pop_2010,by=list(Category=base21$nomdpt),FUN=sum)
base=data.frame(D=P$Category,Y=L$x/P$x,C=trunc(L$x/P$x/.0006))
france=map(database="france")
matche=match.map(france,base$D,exact=TRUE)
map(database="france", fill=TRUE,col=blues[base$C[matche]],resolution=0)

Here are the rates of soccer players (with respect to the total population). It is also possible to look at rates not by département, but by town:

base10=base1[,c("cp","long","lat","pop_2010")]
base20=base2[,c("cp","l_2012")]
base=merge(base10,base20)
Y=base$l_2012/base$pop_2010
QY=as.numeric(cut(Y,c(0,quantile(Y,(1:99)/100),10),labels=1:100))
library(maps)
map("france",xlim=c(-1,1),ylim=c(46,48))
points(base$long,base$lat,cex=.4,pch=19,col=blues[QY])

The darker the dot, the more players. We can also zoom in to get a better understanding, in the northern part of France, for instance, or in the southern part; a short sketch with illustrative window limits follows.
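
A minimal sketch (the xlim/ylim windows are illustrative only, roughly the area around Lille and then the Mediterranean coast):

map("france", xlim=c(1.5, 4.5), ylim=c(49, 51))
points(base$long, base$lat, cex=.4, pch=19, col=blues[QY])
map("france", xlim=c(3, 8), ylim=c(42.5, 44.5))
points(base$long, base$lat, cex=.4, pch=19, col=blues[QY])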

We can obtain a map which is not (too) far away from the one mentioned a few months ago on http://slate.fr/france/78502/.

To leave a comment for the author, please follow the link and comment on his blog: Freakonometrics » R-english.


R and Python Meetups, Philippines


(This article was first published on Analysis with Programming, and kindly contributed to R-bloggers)
There will be upcoming meetups for R User Group Philippines and Python Philippines (PythonPH) Community. Below are the details:

R Meetup

topic: R for SAS users, and planning of RUG activities 


date: Thursday, June 19, 2014
         7:00 pm

outline:
  • Introducing R to SAS users
  • Common SAS functions used at PPD - c/o Mark Javellosa
  • Group discussion on equivalent packages in R
  • Sharing of experiences of actual SAS converts
Question? Ask here.

Python Meetup

topic: Data Science

speakers: Jolo Balbin
               Bright.com (LinkedIn)

               Stephanie Sy
               Ex-Google, Ex-Wildfire

venue: 6750 Ayala Avenue, Makati

date: Friday, May 23, 2014
         7:00 pm 
 
Let's learn and talk about Python again. This month's meetup is about "data science". We have a couple of data scientists doing talks and the usual pizza and chat time. Invite your friends! Hope to see you guys!

We're also still selling some Two Scoops of Django books for Php 1,440 each. A must-have if you do or want to do web development with Django.

Question? Ask here.

To leave a comment for the author, please follow the link and comment on his blog: Analysis with Programming.
