Search Results for “maps” – R-bloggers

Slidify my R journey from @matlabulous to rCrimemap


(This article was first published on Blend it like a Bayesian!, and kindly contributed to R-bloggers)

My LondonR Talk

Thanks to Mango Solutions (the LondonR organiser), I was given the opportunity last night to talk about my mini project ‘CrimeMap’. Instead of going through all the technical details behind the scenes, I chose to talk the audience through my R journey from a noob to a heavy user. CrimeMap was used as a case study to show how one can benefit from learning R (or, in some ways, to justify the time I spent staring at the RStudio IDE last year). The feedback was really great and the talk effectively expanded my network in the data science community, so I am really grateful for that! You can find my presentation here.

Before the main event, there was an excellent R-Python workshop by Chris Musselle. The other two interesting presentations were "Dynamic Report Generation" by Kate Hanley and "Customer Clustering for Retail Marketing" by Jon Sedar. Their presentations will soon be made available here.

CrimeMap - A Wonderful Learning Experience

When I first started learning R for real, the goal was very simple - "let's plot something pretty with ggplot2". Well, a lot has changed since then. The more I learned, the more I discovered. It is really hard to summarise the 'R' awesomeness in a few slides due to its diversity. One thing I am absolutely certain of is that I made the right move about a year ago when I shifted from MATLAB to R. Yet, I am keeping my Twitter account name @matlabulous just to remind myself that one should always keep an open mind about new and evolving technology (... and should avoid getting a tattoo of a potential ex-gf/bf's name. On that note, no, I don't have a tattoo.) For more information about CrimeMap, please see my previous posts here, here and here.

Using Slidify for Professional Presentation

The talk was also the first time I presented something totally unrelated to water engineering. I thought, for a change, let’s try something different. Then I remembered looking at the Slidify slides from Jeff Leek’s Data Analysis course back in Jan–March last year. I thought that would fit perfectly for LondonR because the whole presentation would be coded completely in R. It would be a good reason to learn Slidify too. So I went through the Slidify examples, put some slides together, tweaked the CSS a little bit and then published it to GitHub – a streamlined Slidify workflow, well thought out and designed by Ramnath Vaidyanathan. To me, the results are amazing! So amazing that I am confident enough to leave PowerPoint behind and use Slidify for professional presentations in the future.
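For anyone who wants to try the same route, here is a minimal sketch of the standard Slidify workflow (the deck name is made up, and this is not the exact code behind my slides):

library(devtools)
install_github("ramnathv/slidify")
install_github("ramnathv/slidifyLibraries")

library(slidify)
author("londonr_deck")   # scaffolds a new deck and opens index.Rmd
# ... write the slides in index.Rmd, tweak assets/css as needed ...
slidify("index.Rmd")     # compiles the deck to index.html
# publishing to GitHub Pages is handled by slidify's publish() helper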


rMaps + CrimeMap = rCrimemap

Two weeks before the presentation, I wrote an email to Ramnath as I wanted to thank him for Slidify. I told him how I enjoyed using Slidify for the LondonR slides. Out of the blue, Ramnath told me that he had already seen my CrimeMap and he kindly pointed me to this blog post about using a Leaflet heat map in rMaps. I thought, OMG, why now? Then I thought, yeah, why not? So I created a new package called ‘rCrimemap’ based on Ramnath’s example and the code from the CrimeMap project – just in time for the LondonR meeting. At first, I wanted to call the package something different, but eventually I chose rCrimemap so it aligns well with Ramnath’s rCharts and rMaps.

Using ‘rCrimemap’

rCrimemap is still raw and experimental. It depends on some new packages such as dplyr and the dev versions of rCharts and rMaps. So far I have only developed and tested it on Linux. Please give it a try if you have a chance. All feedback and suggestions are welcome. The code is here.

To install it, you will need the RStudio IDE version 0.98.501 or newer and the following packages ...
require(devtools)
install.packages(c("base64enc", "ggmap", "rjson", "dplyr"))
install_github('ramnathv/rCharts@dev')
install_github('ramnathv/rMaps')

After that, install rCrimemap package via ... 
install_github('woobe/rCrimemap')

rCrimemap is basically a big wrapper function. In fact, there is only one function, rcmap(), in the package at the moment. (OK, it is obviously overkill ... but I really wanted to try developing a package.) The function is very similar to the first one I wrote for CrimeMap prior to the Shiny development. In terms of graphical functionality, it is not as flexible as CrimeMap yet (for example, CrimeMap can do all these colours and facets). However, it is much more powerful than CrimeMap in the sense that users can pan and zoom in and out as with a real digital map. The colour of the heat map also changes as you zoom in and out, which gives users much better visibility of where the local crime hot spots are when they zoom in. OK, enough said, let’s go through some example usage …

The arguments of the function 'rcmap()' are:
  1. location: point of interest within England, Wales and Northern Ireland
  2. period: a month between Dec 2010 and Jan 2014 (in the format of yyyy-mm)
  3. type: category of crime (e.g. "All", "Anti-social behaviour")
  4. map_size: the resolution of the map in pixels (e.g. Full HD = c(1920, 1080))
  5. provider: the base map provider (e.g. "Nokia.normalDay", "MapQuestOpen.OSM")
  6. zoom: zoom level of the map (e.g. I recommend starting with 10 to show all the crimes)

Example 1: “Ball Brothers EC3R 7PP” (the LondonR venue since March 2013) during the London riots (Aug 2011). The map can be viewed within the RStudio IDE or exported to a browser. The animation was created outside R. (Oh ... what if rCrimemap + the animation package? ... I will leave that for later.)
rcmap("Ball Brothers EC3R 7PP", "2011-08", "All", c(1000,1000),"Nokia.normalDay")


Example 2: Manchester in Jan 2014 - using "MapQuestOpen.OSM" as base map instead.

rcmap("Manchester", "2014-01", "All", c(1000,1000), "MapQuestOpen.OSM")



Credits



There you go, enjoy :)


Rcpp 0.11.1


(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)
A new minor release 0.11.1 of Rcpp is now on the CRAN network for GNU R; binaries for Debian have also been uploaded.

The release fixes a number of bugs that have come up since the 0.11.0 release in January, but also brings some extensions. See the NEWS file section below for details, or the ChangeLog file in the package and on the Rcpp Changelog page.

Once again, we tested this release by building against all CRAN packages which depend upon Rcpp. In short, three packages are blacklisted from tests, and three came up with something we noted --- but the remaining 177 packages all build and test cleanly. Detailed results of those tests (and the scripts for it) are on GitHub.

There are a number of other fixes, upgrades and other extensions detailed in NEWS file extract below, in the ChangeLog file in the package and on the Rcpp Changelog page.

Changes in Rcpp version 0.11.1 (2014-03-13)

  • Changes in Rcpp API:

    • Preserve backwards compatibility with Rcpp 0.10.* by allowing RObject extraction from vectors (or lists) of Rcpp objects

    • Add missing default constructor to Reference class that was omitted in the header-only rewrite

    • Fixes for NA and NaN handling of the IndexHash class, as well as the vector .sort() method. These fixes ensure that sugar functions depending on IndexHash (i.e. unique(), sort_unique(), match()) will now properly handle NA and NaN values for numeric vectors.

    • DataFrame::nrows now more accurately mimics R's internal behavior (checks the row.names attribute)

    • Numerous changes to permit compilation on the Solaris OS

    • Rcpp vectors gain a subsetting method – it is now possible to subset an Rcpp vector using CharacterVectors (subsetting by name), LogicalVectors (logical subsetting), and IntegerVectors (0-based index subsetting). Such subsetting will also work with Rcpp sugar expressions, enabling expressions such as x[x > 0] (see the short sketch after this changelog).

    • Comma initialization (e.g. CharacterVector x = "a", "b", "c";) has been disabled, as it causes problems with the behavior of the = operator with Rcpp::Lists. Users who want to re-enable this functionality can use #define RCPP_COMMA_INITIALIZATION, but should be aware of the above caveat. The more verbose CharacterVector x = CharacterVector::create("a", "b", "c") is preferred.

  • Changes in Rcpp Attributes

    • Fix issue preventing packages with Rcpp::interfaces attribute from compiling.

    • Fix behavior with attributes parsing of ::create for default arguments, and also allow constructors of a given size (e.g. NumericVector v = NumericVector(10) gives a default value of numeric(10) at the R level). Also make NAs preserve type when exported to R (e.g. NA_STRING as a default argument maps to NA_character_ at the R level)

  • Changes in Rcpp modules

    • Corrected the un_pointer implementation for object
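As a quick illustration of the new vector subsetting described above (my own sketch, not part of the release notes; it assumes Rcpp >= 0.11.1 and a working C++ toolchain):

library(Rcpp)
cppFunction('
NumericVector positives(NumericVector x) {
    // a sugar expression used directly as a logical subset
    return x[x > 0];
}')
positives(c(-2, 1.5, -0.3, 4))   # returns 1.5 4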

Thanks to CRANberries, you can also look at a diff to the previous release. As always, even fuller details are on the Rcpp Changelog page and the Rcpp page which also leads to the downloads, the browseable doxygen docs and zip files of doxygen output for the standard formats. A local directory has source and documentation too. Questions, comments etc. should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


Moving the North Pole to the Equator


(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

I am still working with @3wen on visualizations of the North Pole. So far, it was not that difficult to generate maps, but we started to have problems with the ice region in the Arctic. More precisely, it was complicated to compute the area of this region (even if we can easily get a shapefile). Consider the globe,

# world.df is a data frame of world polygons (e.g. as built with ggplot2's map_data("world"))
worldmap <- ggplot() + 
geom_polygon(data = world.df, aes(x = long, y = lat, group = group)) +
scale_y_continuous(breaks = (-2:2) * 30) +
scale_x_continuous(breaks = (-4:4) * 45)

and then, add three points in the northern hemisphere, and plot the associated triangle

P1 <- worldmap + geom_polygon(data = triangle, aes(x = long, y = lat, group = group), 
fill ="blue", alpha = 0.6, col = "light blue", size = .8)+
geom_point(data = triangle, aes(x = long, y = lat, group = group),colour = "red")+

for some given projection, e.g.

coord_map("ortho", orientation=c(61, -74, 0))

This can be done with the following function

proj1=function(x=75){
triangle <- data.frame(long=c(-70,-110,-90*(x<90)+90*(x>90)),
lat=c(60,60,x*(x<90)+(90-(x-90))*(x>90)),group=1, region=1)
worldmap <- ggplot() + 
geom_polygon(data = world.df, aes(x = long, y = lat, group = group)) +
scale_y_continuous(breaks = (-2:2) * 30) +
scale_x_continuous(breaks = (-4:4) * 45)
P1 <- worldmap + geom_polygon(data = triangle, aes(x = long, y = lat, group = group), 
fill ="blue", alpha = 0.6, col = "light blue", size = .8)+
geom_point(data = triangle, aes(x = long, y = lat, group = group),colour = "red")+
coord_map("ortho", orientation=c(61, -74, 0)) 
print(P1)
}

or

I am not sure I understand why the projection of the triangle is not convex on the graph above, but let’s say it’s not a big deal here. Actually, our problem is that our interest is in regions (polygons, from a geometrical point of view) that do contain the North Pole. And here, it starts to get messy. I can easily move the upper point to the other side of the globe, but the polygon is no longer correct,

I do understand that this is a non-trivial problem, but it means that it is not that simple to compute the area of a polygon (a region) that contains the North Pole. Which is exactly what we did observe. My skills in geometry are extremely poor, so do not expect that I will go through the code of the function that computes the area of a polygon! Actually, my idea is the following: if the problem is that the North Pole is in the region, let’s consider some rotation to shift the North Pole onto the Equator. The code below takes latitudes and longitudes and returns new latitudes and longitudes after a rotation around the y-axis (the North Pole goes down along the Greenwich meridian):

rotation=function(Z,theta){
# Z is a two-column matrix of (longitude, latitude) in degrees
lon=Z[,1]/180*pi; lat=Z[,2]/180*pi
# convert to Cartesian coordinates on the unit sphere
x=cos(lon)*cos(lat)
y=sin(lon)*cos(lat)
z=sin(lat)
pt1=cbind(x,y,z)
# rotation matrix around the y-axis
M=matrix(c(cos(theta),0,-sin(theta),0,1,0,sin(theta),0,cos(theta)),3,3)
pt2=t(M%*%t(pt1))
# back to longitude and latitude in degrees
lat=asin(pt2[,3])*180/pi
lon=atan2(pt2[,2],pt2[,1])*180/pi
return(cbind(lon,lat))}
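A quick sanity check of the function (my own, not from the original analysis): with theta = pi/2 the North Pole should land on the Equator, on the Greenwich meridian.

rotation(cbind(0, 90), pi/2)   # lon ~ 0, lat ~ 0 (up to numerical rounding)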

With a rotation angle from 0 (no change) to pi/2 (the North Pole on the Equator), we get

From now on, it is possible to compute the area of any region containing the North Pole! One should simply apply the function to all the databases generated from the shapefiles. We can then compute the centroid of the ice region,

r.glace=glace
r.glace[,1:2]=rotation(glace[,1:2],pi/2)
M=matrix(NA,length(unique(glace$id)),3)
j=0
for(i in unique(glace$id)){j=j+1
Polyglace <- as(r.glace[glace$id==i,c("long","lat")],"gpc.poly")
M[j,1]=area.poly(Polyglace)
M[j,2:3]=centroid(r.glace[r.glace$id==i,c("long","lat")])
}
Z=c(weighted.mean(M[,2],M[,1]),weighted.mean(M[,3],M[,1]))
# rotate the weighted centroid back to the original coordinates
rotation(rbind(Z),-pi/2)[1,]

And we get the result below; we can visualize all the locations of the centroid of the ice region over the past 25 years,


Species occurrence data


(This article was first published on rOpenSci Blog - R, and kindly contributed to R-bloggers)

The rOpenSci project aims to provide programmatic access to scientific data repositories on the web. A vast majority of the packages in our current suite retrieve some form of biodiversity or taxonomic data. Since several of these datasets have been georeferenced, they provide numerous opportunities for visualizing species distributions, building species distribution maps, and for use in analyses such as species distribution models. In an effort to streamline access to these data, we have developed a package called spocc, which provides a unified API to all the biodiversity sources we cover. The obvious advantage is that a user can interact with a common API and not worry about the nuances in syntax that differ between packages. As more data sources come online, users can access even more data without significant changes to their code. However, it is important to note that spocc will never replicate the full functionality that exists within specific packages. Therefore users with a strong interest in one of the specific data sources listed below would benefit from familiarising themselves with the inner workings of the appropriate packages.

Data Sources

spocc currently interfaces with several major biodiversity repositories. Many of these packages have been part of the rOpenSci suite:

  1. Global Biodiversity Information Facility (rgbif)
    GBIF is a government funded open data repository with several partner organizations with the express goal of providing access to data on Earth's biodiversity. The data are made available by a network of member nodes, coordinating information from various participant organizations and government agencies.

  2. Berkeley Ecoengine (ecoengine)
    The ecoengine is an open API built by the Berkeley Initiative for Global Change Biology. The repository provides access to over 3 million specimens from various Berkeley natural history museums. These data span more than a century and provide access to georeferenced specimens, species checklists, photographs, vegetation surveys and resurveys and a variety of measurements from environmental sensors located at reserves across University of California's natural reserve system. (related blog post)

  3. iNaturalist (rinat) iNaturalist provides access to crowd sourced citizen science data on species observations.

  4. VertNet (rvertnet) Similar to rgbif, ecoengine, and rbison (see below), VertNet provides access to more than 80 million vertebrate records spanning a large number of institutions and museums, primarily covering four major disciplines (mammalogy, herpetology, ornithology, and ichthyology). Note that we don't currently support VertNet data in this package, but we should soon.

  5. Biodiversity Information Serving Our Nation (rbison)
    Built by the US Geological Survey's core science analytic team, BISON is a portal that provides access to species occurrence data from several participating institutions.

  6. eBird (rebird)
    eBird is a database developed and maintained by the Cornell Lab of Ornithology and the National Audubon Society. It provides real-time access to checklist data, data on bird abundance and distribution, and community reports from birders.

  7. AntWeb (AntWeb)
    AntWeb is the world's largest online database of images, specimen records, and natural history information on ants. It is community driven and open to contribution from anyone with specimen records, natural history comments, or images. (related blog post)

Note: It's important to keep in mind that several data providers interface with many of the above-mentioned repositories. This means that occurrence data obtained from BISON may be duplicates of data that are also available through GBIF. We do not have a way to resolve these duplicates or overlaps at this time, but it is an issue we are hoping to address in future versions of the package.

Installing the package

install.packages("spocc")
# or install the most recent version
devtools::install_github("ropensci/spocc")
library(spocc)

Searching species occurrence data

The main workhorse function of the package is called occ. The function allows you to search for occurrence records for a single species or a list of species, from one particular source of interest or from several at once. The main input is a query, with sources specified under the argument from. So to look at a really simple query:

results <- occ(query = 'Accipiter striatus', from = 'gbif')
results
#> Summary of results - occurrences found for: 
#>  gbif  : 25 records across 1 species 
#>  bison :  0 records across 1 species 
#>  inat  :  0 records across 1 species 
#>  ebird :  0 records across 1 species 
#>  ecoengine :  0 records across 1 species 
#>  antweb :  0 records across 1 species

This returns the results as an S3 class with a slot for each data source. Since we only requested data from gbif, the remaining slots are empty. To view the data:

results$gbif
#> $meta
#> $meta$source
#> [1] "gbif"
#> 
#> $meta$time
#> [1] "2014-03-16 17:39:31.716 PDT"
#> 
#> $meta$query
#> [1] "Accipiter striatus"
#> 
#> $meta$type
#> [1] "sci"
#> 
#> $meta$opts
#> list()
#> 
#> 
#> $data
#> $data$Accipiter_striatus
#>                  name       key longitude latitude prov
#> 1  Accipiter striatus 891040018    -97.65   30.158 gbif
#> 2  Accipiter striatus 891040169   -122.44   37.490 gbif
#> 3  Accipiter striatus 891035119    -71.73   18.270 gbif
#> 4  Accipiter striatus 891035349    -72.53   43.132 gbif
#> 5  Accipiter striatus 891038901    -97.20   32.860 gbif
#> 6  Accipiter striatus 891048899    -73.07   43.632 gbif
#> 7  Accipiter striatus 891049443    -99.10   26.491 gbif
#> 8  Accipiter striatus 891050439    -97.88   26.102 gbif
#> 9  Accipiter striatus 891043765    -76.64   41.856 gbif
#> 10 Accipiter striatus 891056214   -117.15   32.704 gbif
#> 11 Accipiter striatus 891054792    -73.24   44.315 gbif
#> 12 Accipiter striatus 768992325    -76.10    4.724 gbif
#> 13 Accipiter striatus 859267562   -108.34   36.732 gbif
#> 14 Accipiter striatus 859267548   -108.34   36.732 gbif
#> 15 Accipiter striatus 859267717   -108.34   36.732 gbif
#> 16 Accipiter striatus 891043784    -73.05   43.605 gbif
#> 17 Accipiter striatus 891118711   -122.18   37.786 gbif
#> 18 Accipiter striatus 891116600    -97.32   32.821 gbif
#> 19 Accipiter striatus 891124493   -117.11   32.632 gbif
#> 20 Accipiter striatus 891125442   -122.88   38.612 gbif
#> 21 Accipiter striatus 891127900   -122.36   37.778 gbif
#> 22 Accipiter striatus 891128609    -97.98   32.761 gbif
#> 23 Accipiter striatus 891121966    -76.55   38.672 gbif
#> 24 Accipiter striatus 868487120    -83.83   42.333 gbif
#> 25 Accipiter striatus 891131416    -72.59   43.853 gbif

If you prefer data from more than one source, simply pass a vector of source names for the from argument. Example:

occ(query = 'Accipiter striatus', from = c('ecoengine', 'gbif'))
#> Summary of results - occurrences found for: 
#>  gbif  : 25 records across 1 species 
#>  bison :  0 records across 1 species 
#>  inat  :  0 records across 1 species 
#>  ebird :  0 records across 1 species 
#>  ecoengine :  25 records across 1 species 
#>  antweb :  0 records across 1 species

We can also search for multiple species across multiple engines.

species_list <- c("Accipiter gentilis", "Accipiter poliogaster", "Accipiter badius")
res_set <- occ(species_list, from = c('gbif', 'ecoengine'))

Similarly, we can search for data on the Sharp-shinned Hawk from other data sources too.

occ(query = 'Accipiter striatus', from = 'ecoengine')
# or look for data on other species
occ(query = 'Danaus plexippus', from = 'inat')
occ(query = 'Bison bison', from = 'bison')
occ(query = "acanthognathus brevicornis", from = "antweb")

occ is also extremely flexible and can take package-specific arguments for any source you might be querying. You can pass these as a list under package_name_opts (e.g. antweb_opts, ecoengine_opts). See the help file for ?occ for more information.
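For example (mirroring the gbifopts usage that appears in the mapping examples below), a GBIF-specific option can be passed through as a list:

occ(query = 'Accipiter striatus', from = 'gbif',
    gbifopts = list(georeferenced = TRUE))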

Visualizing biodiversity data

We provide several methods to visualize the resulting data. Current options include Leaflet.js, ggmap, a Mapbox implementation in a GitHub gist, or a static map.

Mapping with Leaflet

spp <- c("Danaus plexippus", "Accipiter striatus", "Pinus contorta")
dat <- occ(query = spp, from = "gbif", gbifopts = list(georeferenced = TRUE))
# occ2df, as the name suggests, converts the data contained inside an occ object to an R data.frame
data <- occ2df(dat)
mapleaflet(data = data, dest = ".")

Render a geojson file automatically as a GitHub gist

To have a map automatically posted as a gist, you'll need to set up your GitHub credentials ahead of time. You can either pass these as variables github.username and github.password, or store them in your options (taking regular precautions as you would with passwords of course). If you don't have these stored, you'll be prompted to enter them before posting.
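A minimal sketch of the options route (the placeholder values are obviously hypothetical):

# store GitHub credentials so mapgist() does not prompt for them
# (take the usual care with plain-text passwords)
options(github.username = "your_github_username",
        github.password = "your_github_password")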

spp <- c("Danaus plexippus", "Accipiter striatus", "Pinus contorta")
dat <- occ(query = spp, from = "gbif", gbifopts = list(georeferenced = TRUE))
dat <- fixnames(dat)
dat <- occ2df(dat)
mapgist(data = dat, color = c("#976AAE", "#6B944D", "#BD5945"))

Static maps

If interactive maps aren't your cup of tea, or you prefer to have one that you can embed in a paper, try one of our static map options. You can go with the more elegant ggmap option or stick with something from base graphics.

ecoengine_data <- occ(query = "Lynx rufus californicus", from = "ecoengine")
mapggplot(ecoengine_data)

spnames <- c("Accipiter striatus", "Setophaga caerulescens", "Spinus tristis")
base_data <- occ(query = spnames, from = "gbif", gbifopts = list(georeferenced = TRUE))
plot(base_data, cex = 1, pch = 10)

What's next?

  • As soon as we have an updated rvertnet package, we'll add the ability to query VertNet data from spocc.
  • We will add rCharts as an official import once the package is on CRAN (ETA: end of March).
  • We're helping with a new package, rMaps, to make interactive maps using various JavaScript mapping libraries, which will give access to a variety of awesome interactive maps. We will integrate rMaps once it's on CRAN.
  • We'll add a function to make interactive maps using RStudio's Shiny in a future version.

As always, issues or pull requests are welcome directly on the repo.


GSoC Proposal 2014: package bdvis: Biodiversity Data Visualizations


(This article was first published on Vijay Barve, and kindly contributed to R-bloggers)

I am applying for Google Summer of Code 2014 again with a “Biodiversity Data Visualizations using R” proposal. We are proposing to take the package bdvis to the next level by adding more functions and making it available through CRAN. I am posting this idea to get feedback and suggestions from the Biodiversity Informatics community.

[During the next few days I will keep updating this to accommodate suggestions. The example visualizations here are crude illustrations of the ideas, and need a lot of work to convert them into reusable functions.]

Background

The package bdvis is already under development and was a successful project in GSoC 2013. As of now the package has basic functionality to perform biodiversity data visualizations, but with a growing user base, requests for additional features are coming up. We propose to add the user-requested functionality and implement some new functions to take bdvis to the next level. The following are the major tasks of the proposed project.

  1. Fix currently reported bugs and complete documentation to submit package to CRAN.
  2. Implementation of additional features requested by users.
  3. Develop seamless data support.
  4. Additional functions for visualizations.
  5. Prepare detailed vignette.

User requested features

The features and functionality requested by users so far are the following:

  • A versatile function to subset the data based on taxonomy (a species, genus, family, etc.) or on dates (a particular year, a range of years, and so on).
  • tempolar: the ability to show average records per day/week/month rather than just the raw counts shown currently.
  • taxotree: additional parameters to control the diagram (title, legend, colours), plus the ability to choose summaries based on number of records, number of species or higher taxonomy.
  • bdsummary: the number of grid cells covered by data records and the percentage coverage of the bounding box.
  • Visualisation of the output of the completeness analysis function bdcomplete.
  • Improve gettaxo efficiency by adding the ability to search by genus rather than by the full scientific name (the current behaviour). Searching by full scientific names could remain an option in case the user needs it for some reason.

Data formats support

Develop functions for seamless support of the major biodiversity occurrence data formats available in the R environment so they work with the bdvis package. A preliminary list of packages that make such data available: rgbif, rvertnet, rinat, spocc. Get feedback from the user community on additional data sources they might be using and incorporate them into the worklist.

Additional visualizations

  • Distribution of collection efforts over time (line graph) [Fig 1 Soberon et al 2000]
  • Distribution of number of records among taxon, cells (histogram) [Fig 3,4 Soberon et al 2000]
  • Distribution of number of species among cells (histogram) [Fig 5 Soberon et al 2000]
  • Completeness vs number of species(scatterplot) [Fig 6 Soberon et al 2000]
  • Record densities for day of year and week of year [Otegui 2012]

RecordsPerDayofYear

  • Records per year dot plots [Otegui 2012]

RecPerYear

  • calendarHeat maps of the number of records or species recorded

IndianMoths_calenderheat

Vignette preparation

Prepare test data sets for the vignette: three data sets, one with global geographical coverage and wide species coverage, a second with country-level geographical coverage and class- or order-level species coverage, and a final one with a narrow species selection (perhaps at genus level) to demonstrate functionality. Write up code and an explanation for each function in the package, and add result tables, graphs and maps to complete the vignette.

References

  • Otegui, J., & Ariño, A. H. (2012). BIDDSAT: visualizing the content of biodiversity data publishers in the Global Biodiversity Information Facility network. Bioinformatics (Oxford, England), 28(16), 2207–8. doi:10.1093/bioinformatics/bts359
  • Soberón, J., Llorente, J., & Oñate, L. (2000). The use of specimen-label databases for conservation purposes: an example using Mexican Papilionid and Pierid butterflies. Biodiversity and Conservation, 9(Roman 1997), 1441–1466. Retrieved from http://www.springerlink.com/index/H58022627013233W.pdf


Updates on Interactive rCrimemap, rBlocks … and the Packt offer!


(This article was first published on Blend it like a Bayesian!, and kindly contributed to R-bloggers)

Testing rCrimemap as a Self-Contained Web Page

I've been learning more about rMaps and rCharts since the LondonR meeting. There are many amazing things you can do with rCharts, but it does take time to learn all the tweaks. For example, I just discovered that rMaps objects (like other rCharts objects) can be saved as self-contained web pages.
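For example, assuming rCharts' save() method with its cdn flag (the object and file names here are made up), an rCrimemap map can be written out like this:

m <- rcmap("Manchester", "2014-01", "All", c(1000, 1000), "MapQuestOpen.OSM")
m$save("rcrimemap_manchester.html", cdn = TRUE)   # writes a self-contained HTML page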

So here are the links to one of the maps I rendered with rCrimemap - visualising all the England, Wales and N. Ireland crimes in Jan 2014 (not sure why some of the crimes were recorded in Scotland - I'll need to further investigate this later). Eventually, I hope to build a new Shiny web app for rCrimemap that allows users to change the settings like the original CrimeMap.



Note: I would recommend NOT trying this on a smartphone. I will need to figure out how the map can be trimmed and optimised for smartphones later.

Yet Another rBlocks Experiment

Playing with the EBImage package this time, I wrote this script to pixelate a picture and re-colour it with rBlocks (just for fun - not practical at all ...) (Gist - rBlocks_test_04_pixelation.R)


Celebrating Packt's 2000th Book

Finally, Packt is offering "Buy One Get One Free" on all ebooks to celebrate the 2000th title!!!




Seamless analytical environment by WDI, dplyr, and rMaps


(This article was first published on My Life as a Mock Quant in English, and kindly contributed to R-bloggers)

Recently I found that my R guru @ramnath_vaidya is developing a new visualization package, rMaps.

I was so excited when I saw it for the first time and I think that it's really awesome for plotting any data on a map.

Let me explain how we can

  • Get data (WDI package)
  • Manipulate data (dplyr package)
  • Visualize the result (rMaps package)

with these great R packages.

Except for the rMaps package, you can install these packages (WDI, dplyr) from CRAN in the usual way.

install.packages(c("WDI", "dplyr"))

To install the rMaps package, just run the following commands in the R console.

require(devtools)
install_github("ramnathv/rCharts@dev")
install_github("ramnathv/rMaps")

(Don't forget to install the “devtools” package to use the install_github function.)

Now, as an example, I will show you how to:

  • Get “CO2 emissions (kt)” data from the World Bank with the WDI package
  • Summarize it with the dplyr package
  • Visualize it with the rMaps package

The result is shown below:

…Enjoy!!!

By the way, a Japanese R professional has recently been posting great articles. I recommend reading them if you are especially interested in visualization and dplyr.

Source codes:

library(WDI)
library(rMaps)
library(dplyr)
library(countrycode)
# Get CO2 emission data from World bank
# Data source : http://data.worldbank.org/indicator/EN.ATM.CO2E.KT/
df <- WDI(country=c("all"),
indicator="EN.ATM.CO2E.KT",
start=2004, end=2013)
# Data manipulation By dplyr
data <- df %.%
na.omit() %.%
#Add iso3c format country code
mutate(iso3c=countrycode(iso2c, "iso2c", "iso3c")) %.%
group_by(iso3c) %.%
#Get the most recent CO2 emission data
summarize(value=EN.ATM.CO2E.KT[which.max(year)])
# Visualize it by rMaps
i1 <- ichoropleth(value~iso3c, data, map="world")
i1$show("iframesrc", cdn = TRUE) # for blog post
#... or you can direct plot by just evaluating "i1" on R console.


Experimenting With R – Point to Point Mapping With Great Circles


(This article was first published on OUseful.Info, the blog... » Rstats, and kindly contributed to R-bloggers)

I’ve started doodling again… This time, around maps, looking for recipes that make it easier to plot lines connecting points on maps. The most attractive maps seem to use great circles to connect one point with another, these providing the shortest path between two points when you consider the Earth as a sphere.

Here’s one quick experiment (based on the Flowing Data blog post How to map connections with great circles), for an R/Shiny app that allows you to upload a CSV file containing a couple of location columns (at least) and an optional “amount” column, and it’ll then draw lines between the points on each row.

greatcircle map demo

The app requires us to solve several problems, including:

  • how to geocode the locations
  • how to plot the lines as great circles
  • how to upload the CSV file
  • how to select the from and to columns from the CSV file
  • how to optionally select a valid numerical column for setting line thickness

Let’s start with the geocoder. For convenience, I’m going to use the Google geocoder via the geocode() function from the ggmap library.

#Locations are in two columns, *fr* and *to* in the *dummy* dataframe
#If locations are duplicated in from/to columns, dedupe so we don't geocode same location more than once
locs=data.frame(place=unique(c(as.vector(dummy[[fr]]),as.vector(dummy[[to]]))),stringsAsFactors=F)
#Run the geocoder against each location, then transpose and bind the results into a dataframe
cbind(locs, t(sapply(locs$place,geocode, USE.NAMES=F))) 

The locs data frame holds the unique locations:

                    place
1              London, UK
2            Cambridge,UK
3            Paris,France
4       Sydney, Australia
5           Paris, France
6             New York,US
7 Cape Town, South Africa

The sapply(locs$place,geocode, USE.NAMES=F) function returns data that looks like:

    [,1]       [,2]     [,3]     [,4]      [,5]     [,6]      [,7]     
lon -0.1254872 0.121817 2.352222 151.207   2.352222 -74.00597 18.42406 
lat 51.50852   52.20534 48.85661 -33.86749 48.85661 40.71435  -33.92487

The transpose (t()) gives us:

     lon        lat      
[1,] -0.1254872 51.50852 
[2,] 0.121817   52.20534 
[3,] 2.352222   48.85661 
[4,] 151.207    -33.86749
[5,] 2.352222   48.85661 
[6,] -74.00597  40.71435 
[7,] 18.42406   -33.92487

The cbind() binds each location with its lat and lon value:

                    place        lon       lat
1              London, UK -0.1254872  51.50852
2            Cambridge,UK   0.121817  52.20534
3            Paris,France   2.352222  48.85661
4       Sydney, Australia    151.207 -33.86749
5           Paris, France   2.352222  48.85661
6             New York,US  -74.00597  40.71435
7 Cape Town, South Africa   18.42406 -33.92487

Code that provides a minimal example for uploading the data from a CSV file on the desktop to the Shiny app, then creating dynamic drop lists containing column names, can be found here: Simple file geocoder (R/shiny app).

The following snippet may be generally useful for getting a list of column names from a data frame that correspond to numerical columns:

#Get a list of column names for numerical columns in data frame df
nums <- sapply(df, is.numeric)
names(nums[nums])
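A quick illustration with a throwaway data frame (not part of the app):

df <- data.frame(place = c("London, UK", "Paris, France"), amount = c(10, 25))
nums <- sapply(df, is.numeric)
names(nums[nums])   # "amount"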

The code for the full application can be found as a runnable gist in RStudio from here: R/Shiny app – great circle mapping. [In RStudio, install.packages("shiny"); library(shiny); runGist(9690079). The gist contains a dummy data file if you want to download it to try it out...]

Here’s the code explicitly…

The global.R file loads the necessary packages, installing them if they are missing:

#global.R

##This should detect and install missing packages before loading them - hopefully!
list.of.packages <- c("shiny", "ggmap","maps","geosphere")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
lapply(list.of.packages,function(x){library(x,character.only=TRUE)}) 

The ui.R file builds the Shiny app’s user interface. The drop down column selector lists are populated dynamically with the names of the columns in the data file once it is uploaded. An optional Amount column can be selected – the corresponding list only displays the names of numerical columns. (The lists of location columns to be geocoded should really be limited to non-numerical columns.) The action button prevents the geocoding routines firing until the user is ready – select the columns appropriately before geocoding (error messages are not handled very nicely;-)

#ui.R
shinyUI(pageWithSidebar(
  headerPanel("Great Circle Map demo"),
  
  sidebarPanel(
    #Provide a dialogue to upload a file
    fileInput('datafile', 'Choose CSV file',
              accept=c('text/csv', 'text/comma-separated-values,text/plain')),
    #Define some dynamic UI elements - these will be lists containing file column names
    uiOutput("fromCol"),
    uiOutput("toCol"),
    #Do we want to make use of an amount column to tweak line properties?
    uiOutput("amountflag"),
    #If we do, we need more options...
    conditionalPanel(
      condition="input.amountflag==true",
      uiOutput("amountCol")
    ),
    conditionalPanel(
      condition="input.amountflag==true",
      uiOutput("lineSelector")
    ),
    #We don't want the geocoder firing until we're ready...
    actionButton("getgeo", "Get geodata")
    
  ),
  mainPanel(
    tableOutput("filetable"),
    tableOutput("geotable"),
    plotOutput("geoplot")
  )
))

The server.R file contains the server logic for the app. One thing to note is the way we isolate some of the variables in the geocoder reactive function. (Reactive functions fire when one of the external variables they contain changes. To prevent the function firing when a variable it contains changes, we need to isolate it. See the docs for more; for example, Shiny Lesson 7: Reactive outputs or Isolation: avoiding dependency.)

#server.R

shinyServer(function(input, output) {

  #Handle the file upload
  filedata <- reactive({
    infile <- input$datafile
    if (is.null(infile)) {
      # User has not uploaded a file yet
      return(NULL)
    }
    read.csv(infile$datapath)
  })

  #Populate the list boxes in the UI with column names from the uploaded file  
  output$toCol <- renderUI({
    df <-filedata()
    if (is.null(df)) return(NULL)
    
    items=names(df)
    names(items)=items
    selectInput("to", "To:",items)
  })
  
  output$fromCol <- renderUI({
    df <-filedata()
    if (is.null(df)) return(NULL)
    
    items=names(df)
    names(items)=items
    selectInput("from", "From:",items)
  })
  
  #If we want to make use of an amount column, we need to be able to say so...
  output$amountflag <- renderUI({
    df <-filedata()
    if (is.null(df)) return(NULL)
    
    checkboxInput("amountflag", "Use values?", FALSE)
  })

  output$amountCol <- renderUI({
    df <-filedata()
    if (is.null(df)) return(NULL)
    #Let's only show numeric columns
    nums <- sapply(df, is.numeric)
    items=names(nums[nums])
    names(items)=items
    selectInput("amount", "Amount:",items)
  })
  
  #Allow different line styles to be selected
  output$lineSelector <- renderUI({
    radioButtons("lineselector", "Line type:",
                 c("Uniform" = "uniform",
                   "Thickness proportional" = "thickprop",
                   "Colour proportional" = "colprop"))
  })
  
  #Display the data table - handy for debugging; if the file is large, need to limit the data displayed [TO DO]
  output$filetable <- renderTable({
    filedata()
  })
  
  #The geocoding bit... Isolate variables so we don't keep firing this...
  geodata <- reactive({
    if (input$getgeo == 0) return(NULL)
    df=filedata()
    if (is.null(df)) return(NULL)
    
    isolate({
      dummy=filedata()
      fr=input$from
      to=input$to
      locs=data.frame(place=unique(c(as.vector(dummy[[fr]]),as.vector(dummy[[to]]))),stringsAsFactors=F)      
      cbind(locs, t(sapply(locs$place,geocode, USE.NAMES=F))) 
    })
  })

  #Weave the geocoded data into the data frame we made from the CSV file
  geodata2 <- reactive({
    if (input$getgeo == 0) return(NULL)
    df=filedata()
    if (input$amountflag != 0) {
      maxval=max(df[input$amount],na.rm=T)
      minval=min(df[input$amount],na.rm=T)
      df$b8g43bds=10*df[input$amount]/maxval
    }
    gf=geodata()
    df=merge(df,gf,by.x=input$from,by.y='place')
    merge(df,gf,by.x=input$to,by.y='place')
  })
  
  #Preview the geocoded data
  output$geotable <- renderTable({
    if (input$getgeo == 0) return(NULL)
    geodata2()
  })
  
  #Plot the data on a map...
  output$geoplot<- renderPlot({
    if (input$getgeo == 0) return(map("world"))
    #Method pinched from: http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles/
    map("world")
    df=geodata2()
    
    pal <- colorRampPalette(c("blue", "red"))
    colors <- pal(100)
    
    for (j in 1:nrow(df)){
      inter <- gcIntermediate(c(df[j,]$lon.x[[1]], df[j,]$lat.x[[1]]), c(df[j,]$lon.y[[1]], df[j,]$lat.y[[1]]), n=100, addStartEnd=TRUE)

      #We could possibly do more styling based on user preferences?
      if (input$amountflag == 0) lines(inter, col="red", lwd=0.8)
      else {
        if (input$lineselector == 'colprop') {
          maxval <- max(df$b8g43bds)
          minval=  min(df$b8g43bds)
          colindex <- round( (df[j,]$b8g43bds[[1]]/10) * length(colors) )
          lines(inter, col=colors[colindex], lwd=0.8)
        } else if (input$lineselector == 'thickprop') {
          lines(inter, col="red", lwd=df[j,]$b8g43bds[[1]])
        } else lines(inter, col="red", lwd=0.8)
      } 
    } 
  })

})

So that’s the start of it… this app could be further developed in several ways, for example allowing the user to filter or colour displayed lines according to factor values in a further column (commodity type, for example), or produce a lattice of maps based on facet values in a column.

I also need to figure out how to save maps, and maybe produce zoomable ones. If the geocoded points all lie within a bounding box limited to a particular geographical area, scaling the map view to show just that area might be useful.

Other techniques might include using proportional symbols (circles) at line landing points to show the sum of values incoming to that point, or the sum of values outgoing, or the difference between the two (maybe use green for net incoming and another colour for net outgoing, then size by the absolute difference?).



A Simple Introduction to the Graphing Philosophy of ggplot2


(This article was first published on Learning as You Go » RStats, and kindly contributed to R-bloggers)

“The emphasis in ggplot2 is reducing the amount of thinking time by making it easier to go from the plot in your brain to the plot on the page.” (Wickham, 2012)

“Base graphics are good for drawing pictures; ggplot2 graphics are good for understanding the data.” (Wickham, 2012)

I’m not ggplot2’s creator, Hadley Wickham, but I do find myself in discussions trying to explain how to build graphs in ggplot2. It’s a very elegant system, but also very different from other graphing systems. Once you understand the organizing philosophy, ggplot2 becomes very easy to work with.

The grammar of ggplot2 graphics

There is a basic grammar to all graphics production. In R‘s base graphics or in Excel, you feed ranges of data to a plot as x and y elements, then manipulate colors, scale dimensions and other parts of the graph as graphical elements or options.

ggplot2’s grammar makes a clear distinction between your data and what gets displayed on the screen or page. You feed ggplot2 your data, then apply a series of mappings and transformations to create a visual representation of that data. Even with base graphics or Excel we never really plot the data itself, we only create a representation; ggplot2 makes this distinction explicit. In addition, ggplot2’s structure makes it very easy to tweak a graph to look the way you want by adding mappings.

A ggplot2 graph is built up from a few basic elements:

1. Data The raw data that you want to plot.
2. Geometries geom_ The geometric shapes that will represent the data.
3. Aesthetics aes() Aesthetics of the geometric and statistical objects, such as color, size, shape and position.
4. Scales scale_ Maps between the data and the aesthetic dimensions, such as data range to plot width or factor values to colors.

Putting it together, the code to build a ggplot2 graph looks something like:

data
+ geometry to represent the data,
+ aesthetic mappings of data to plot coordinates like position, color and size
+ scaling of ranges of the data to ranges of the aesthetics

A real example shows off how this all fits together.

library(ggplot2)
# Create some data for our example
some.data <- data.frame(timer = 1:12, 
                        countdown = 12:1, 
                        category = factor(letters[1:3]))
# Generate the plot
some.plot <- ggplot(data = some.data, aes(x = timer, y = countdown)) +
  geom_point(aes(colour = category)) +
  scale_x_continuous(limits = c(0, 15)) +
  scale_colour_brewer(palette = "Dark2") +
  coord_fixed(ratio=1)
# Display the plot
some.plot
Demonstration of the key concepts in the grammar of graphics: data, geometries, aesthetic mappings and scale mappings.

Here you can see that the data is passed to ggplot(), along with aesthetic mappings between the data and the plot coordinates, a geometry to represent the data, and a couple of scales to map between the data ranges and the plot ranges.

More advanced parts of the ggplot2 grammar

The above will get you a basic graph, but ggplot2 includes a few more parts of the grammar that you’ll want to be aware of as you try to visualize more complex data:

5. Statistical transformations stat_ Statistical summaries of the data that can be plotted, such as quantiles, fitted curves (loess, linear models, etc.), sums and so on.
6. Coordinate systems coord_ The transformation used for mapping data coordinates into the plane of the data rectangle.
7. Facets facet_ The arrangement of the data into a grid of plots (also known as latticing, trellising or creating small multiples).
8. Visual Themes theme The overall visual defaults of a plot: background, grids, axes, default typeface, sizes, colors, etc.
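As a hedged sketch (reusing the some.plot object from the earlier example, and not taken from the original post), the statistical, facetting and theming pieces bolt on in exactly the same way:

some.plot +
  stat_smooth(method = "lm", se = FALSE, colour = "grey40") +  # 5. a statistical transformation
  facet_wrap(~ category) +                                     # 7. facets (small multiples)
  theme_bw()                                                   # 8. a predefined visual theme
# (6., the coordinate system, already appears above as coord_fixed())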

Hadley Wickham describes various pieces of this grammar in recorded presentations on Vimeo and YouTube and the online documentation to ggplot2. The most complete explanation is in his book ggplot2: Elegant Graphics for Data Analysis (Use R!) (Wickham, 2009).

References

Wickham, Hadley. ggplot2: Elegant Graphics for Data Analysis. Dordrecht, Heibelberg, London, New York: Springer, 2009. Print.
Wickham, Hadley. A Backstage Tour of ggplot2 with Hadley Wickham. 2012. Video. YouTube. Web. 21 Mar 2014. Contributed by Revolution Analytics.



Calendar charts with googleVis


(This article was first published on mages' blog, and kindly contributed to R-bloggers)
My little series of posts about the new googleVis charts continues with calendar charts.

Google's calendar charts are still in beta, but they already provide a nice heat map visualisation of calendar-year data. The current development version of googleVis supports this new chart type via gvisCalendar. Here is an example displaying daily stock price data.


For the code below to run you will require the developer version (≥ 0.5.0-4) of googleVis from GitHub and R ≥ 3.0.2.
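The post's original snippet is not reproduced here; a minimal sketch along the same lines, with made-up daily closing "prices" rather than real stock data, could look like this:

library(googleVis)
set.seed(1)
stock <- data.frame(Date  = seq(as.Date("2013-01-01"), as.Date("2013-12-31"), by = "day"),
                    Close = 100 + cumsum(rnorm(365)))
cal <- gvisCalendar(stock, datevar = "Date", numvar = "Close",
                    options = list(title = "Daily closing price", height = 320))
plot(cal)   # renders the chart in the browser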

I suppose the biggest current drawback is that the colours of the cells cannot be defined by the user. However, this should change with future versions of the Google Chart Tools. For more information and installation instructions see the googleVis project site and Google documentation.

Interestingly, the calendar chart looks very similar to the visualisation R. Wicklin and R. Allison from SAS used for the winning poster at the Data Expo 2009. Paul Bleicher created a function in R, based on lattice that creates a very similar output. You may recall David Smith's blog post about this.



Session Info

R version 3.0.3 (2014-03-06)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] googleVis_0.5.0-4 chron_2.3-45 lattice_0.20-24

loaded via a namespace (and not attached):
[1] RJSONIO_1.0-3 tools_3.0.3


Mapping the March 2014 California Earthquake with ggmap


(This article was first published on rud.is » R, and kindly contributed to R-bloggers)

I had no intention to blog this, but @jayjacobs convinced me otherwise. I was curious about the recent (end of March, 2014) California earthquake “storm” and did a quick plot for “fun” and personal use using ggmap/ggplot.

I used data from the Southern California Earthquake Center (that I cleaned up a bit and that you can find here) but would have used the USGS quake data if the site hadn’t been down when I tried to get it from there.

The code/process isn’t exactly rocket-science, but if you’re looking for a simple way to layer some data on a “real” map (vs handling shapefiles on your own) then this is a really compact/self-contained tutorial/example.

You can find the code & data over at github as well.

There’s lots of ‘splainin in the comments (which are probably easier to read on the github site) but drop a note in the comments or on Twitter if it needs any further explanation. The graphic is SVG, so use a proper browser :-) or run the code in R if you can’t see it here.


(click for larger version)
library(ggplot2)
library(ggmap)
library(plyr)
library(grid)
library(gridExtra)
 
# read in cleaned up data
dat <- read.table("quakes.dat", header=TRUE, stringsAsFactors=FALSE)
 
# map decimal magnitudes into an integer range
dat$m <- cut(dat$MAG, c(0:10))
 
# convert to dates
dat$DATE <- as.Date(dat$DATE)
 
# so we can re-order the data frame
dat <- dat[order(dat$DATE),]
 
# not 100% necessary, but get just the numeric portion of the cut factor
dat$Magnitude <- factor(as.numeric(dat$m))
 
# sum up by date for the barplot
dat.sum <- count(dat, .(DATE, Magnitude))
 
# start the ggmap bit
# It's super-handy that it understands things like "Los Angeles" #spoffy
# I like the 'toner' version. Would also use a stamen map but I can't get 
# to it consistently from behind a proxy server
la <- get_map(location="Los Angeles", zoom=10, color="bw", maptype="toner")
 
# get base map layer
gg <- ggmap(la) 
 
# add points. Note that the plot will produce warnings for all points not in the
# lat/lon range of the base map layer. Also note that i'm encoding magnitude by
# size and color and using alpha for depth. because of the way the data is sorted
# the most recent quakes in the set should be on top
gg <- gg + geom_point(data=dat,
                      mapping=aes(x=LON, y=LAT, 
                                  size=MAG, fill=m, alpha=DEPTH), shape=21, color="black")
 
# this takes the magnitude domain and maps it to a better range of values (IMO)
gg <- gg + scale_size_continuous(range=c(1,15))
 
# this bit makes the right size color ramp. i like the reversed view better for this map
gg <- gg + scale_fill_manual(values=rev(terrain.colors(length(levels(dat$Magnitude)))))
gg <- gg + ggtitle("Recent Earthquakes in CA & NV")
 
# no need for a legend as the bars are pretty much the legend
gg <- gg + theme(legend.position="none")
 
 
# now for the bars. we work with the summarized data frame
gg.1 <- ggplot(dat.sum, aes(x=DATE, y=freq, group=Magnitude))
 
# normally, i dislike stacked bar charts, but this is one time i think they work well
gg.1 <- gg.1 + geom_bar(aes(fill=Magnitude), position="stack", stat="identity")
 
# fancy, schmanzy color mapping again
gg.1 <- gg.1 + scale_fill_manual(values=rev(terrain.colors(length(levels(dat$Magnitude)))))
 
# show the data source!
gg.1 <- gg.1 + labs(x="Data from: http://www.data.scec.org/recent/recenteqs/Maps/Los_Angeles.html", y="Quake Count")
gg.1 <- gg.1 + theme_bw() #stopthegray
 
# use grid.arrange to make the sizes work well
grid.arrange(gg, gg.1, nrow=2, ncol=1, heights=c(3,1))


Wright Map Tutorial – Part 3


(This article was first published on R Snippets for IRT, and kindly contributed to R-bloggers)

In this part of the tutorial, we’ll show how to load ConQuest output to make a CQmodel object and then WrightMaps. We’ll also show how to turn deltas into thresholds. All the example files here are available in the /inst/extdata folder of the GitHub repository. If you download the latest version of the package, they should be in a folder called /extdata wherever your R packages are stored. Set this folder as your working directory with setwd() to run the examples.

Making the model

Let’s load a model. The first parameter should be the name of the person estimates file, while the second should be the name of the show file. Both are necessary for creating Wright maps (although the CQmodel function will run fine with only one or the other, provided that they are properly passed).

model1 <- CQmodel(p.est = "ex2.eap", show = "ex2.shw")

This (model1) is a CQmodel object. Enter the name of the object to see the names of all the tables & information stored within this object.

model1


## 
## ConQuest Output Summary:
## ========================
## Partial Credit Analysis 
## 
## The item model: item+item*step 
## 1 dimension 
## 582 participants
## Deviance: 9273 (21 parameters)
## 
## Additional information available:
## Summary of estimation: $SOE
## Response model parameter estimates: $RMP
## Regression coefficients: $reg.coef
## Variances: $variances
## Reliabilities: $rel.coef
## GIN tables: $GIN
## EAP table: $p.est
## Additional details: $run.details

Type the name of any of these tables to see the information stored there.

model1$SOE


## 
## Summary of estimation
## 
## Estimation method: Gauss-Hermite Quadrature with 15 nodes 
## Assumed population distribution: Gaussian 
## Constraint: DEFAULT 
## 
## Termination criteria:
##       1000 iterations
##       0.0001 change in parameters
##       0.0001 change in deviance
##       100 iterations without a deviance improvement
##       10 Newton steps in M-step
## Estimation terminated after 27 iterations because the deviance convergence criteria was reached.
## 
## Random number generation seed: 1 
## 2000 nodes used for drawing 5 plausible values 
## 200 nodes used when computing fit 
## Value for obtaining finite MLEs for zero/perfects: 0.3
model1$equation


## [1] "item+item*step"
model1$reg.coef


##                CONSTANT
## Main dimension    0.972
## S. errors         0.062
model1$rel.coef


##                MLE Person separation RELIABILITY
## Main dimension NA                               
##                WLE Person separation RELIABILITY EAP/PV RELIABILITY
## Main dimension NA                                0.813
model1$variances


## [1] 2.162

The most relevant for our purposes are the RMP, GIN, and p.est tables. The RMP tables contain the Response Model Parameters. These are item parameters. Typing model1$RMP would display them, but they’re a little long, so I’m just going to ask for the names and then show the first few rows of each table.

names(model1$RMP)


## [1] "item"      "item*step"

For this model, the RMPs have item and item*step parameters. We could add these to get the deltas. Let’s see what the tables look like.

head(model1$RMP$item)


##   n_item item    est error U.fit U.Low U.High  U.T W.fit W.Low W.High  W.T
## 1      1    1  0.753 0.055  1.11  0.88   1.12  1.8  1.10  0.89   1.11  1.8
## 2      2    2  1.068 0.053  1.41  0.88   1.12  6.0  1.37  0.89   1.11  6.0
## 3      3    3 -0.524 0.058  0.82  0.88   1.12 -3.2  0.87  0.88   1.12 -2.3
## 4      4    4 -1.174 0.060  0.76  0.88   1.12 -4.3  0.85  0.88   1.12 -2.7
## 5      5    5 -0.389 0.057  0.95  0.88   1.12 -0.9  0.95  0.89   1.11 -0.9
## 6      6    6  0.067 0.055  1.03  0.88   1.12  0.6  1.02  0.89   1.11  0.3
head(model1$RMP$"item*step")


##   n_item item step    est error U.fit U.Low U.High  U.T W.fit W.Low W.High
## 1      1    1    0     NA    NA  2.03  0.88   1.12 13.3  1.18  0.89   1.11
## 2      1    1    1 -1.129 0.090  0.99  0.88   1.12 -0.1  1.00  0.95   1.05
## 3      1    1    2  1.129    NA  0.80  0.88   1.12 -3.5  0.95  0.89   1.11
## 4      2    2    0     NA    NA  2.25  0.88   1.12 15.4  1.40  0.90   1.10
## 5      2    2    1 -0.626 0.093  1.04  0.88   1.12  0.7  1.04  0.94   1.06
## 6      2    2    2  0.626    NA  1.08  0.88   1.12  1.2  1.08  0.89   1.11
##    W.T
## 1  3.0
## 2  0.0
## 3 -0.9
## 4  7.1
## 5  1.3
## 6  1.4

Let’s look at a more complicated example.

model2 <- CQmodel("ex4a.mle", "ex4a.shw")
model2$equation


## [1] "rater+topic+criteria+rater*topic+rater*criteria+topic*criteria+rater*topic*criteria*step"
names(model2$RMP)


## [1] "rater"                     "topic"                    
## [3] "criteria"                  "rater*topic"              
## [5] "rater*criteria"            "topic*criteria"           
## [7] "rater*topic*criteria*step"
head(model2$RMP$"rater*topic*criteria*step")


##   n_rater    rater n_topic topic n_criteria criteria step    est error
## 1       1      Amy       1 Sport          1 spelling    1     NA    NA
## 2       1      Amy       1 Sport          1 spelling    2  0.299 0.398
## 3       1      Amy       1 Sport          1 spelling    3 -0.299    NA
## 4       2 Beverely       1 Sport          1 spelling    0     NA    NA
## 5       2 Beverely       1 Sport          1 spelling    1 -0.184 0.491
## 6       2 Beverely       1 Sport          1 spelling    2  0.051 0.461
##   U.fit U.Low U.High  U.T W.fit W.Low W.High W.T
## 1  0.43  0.70   1.30 -4.7  0.99  0.00   2.00 0.1
## 2  1.34  0.70   1.30  2.1  1.05  0.42   1.58 0.3
## 3  1.28  0.70   1.30  1.7  1.05  0.51   1.49 0.3
## 4  0.41  0.74   1.26 -5.8  1.47  0.00   2.09 0.9
## 5  3.23  0.74   1.26 10.9  0.95  0.30   1.70 0.0
## 6  0.87  0.74   1.26 -1.0  1.30  0.62   1.38 1.5

The GIN tables show the threshold parameters.

model1$GIN


##           [,1]  [,2]
## Item_1  -0.469 1.977
## Item_2   0.234 1.906
## Item_3  -1.789 0.742
## Item_4  -2.688 0.336
## Item_5  -1.656 0.883
## Item_6  -1.063 1.195
## Item_7  -1.969 1.047
## Item_8  -1.617 1.289
## Item_9  -0.957 1.508
## Item_10 -0.992 2.094
model2$GIN


## $Amy
## $Amy$Sport
##              [,1]   [,2]   [,3]
## spelling  -31.996 -1.976 -1.250
## coherence  -1.447 -1.446 -1.209
## structure  -2.247 -0.911 -0.172
## grammar    -0.885 -0.773 -0.107
## content    -0.486  0.104  0.627
## 
## $Amy$Family
##              [,1]   [,2]   [,3]
## spelling  -31.996 -2.516 -0.912
## coherence  -1.401 -1.280 -1.103
## structure  -1.966 -1.260 -0.294
## grammar    -1.069 -0.380 -0.106
## content    -0.728 -0.012  0.950
## 
## $Amy$Work
##             [,1]   [,2]   [,3]
## spelling  -2.055 -2.051 -1.128
## coherence -1.515 -1.320 -0.862
## structure -1.402 -1.158 -0.631
## grammar   -0.816 -0.550  0.122
## content   -0.430  0.212  0.762
## 
## $Amy$School
##              [,1]   [,2]   [,3]
## spelling  -31.996 -2.059 -0.997
## coherence  -1.403 -1.402 -0.999
## structure  -1.629 -1.148 -0.462
## grammar    -0.967 -0.421  0.070
## content    -0.782 -0.027  1.121
## 
## 
## $Beverely
## $Beverely$Sport
##             [,1]   [,2]   [,3]
## spelling  -2.054 -1.339 -0.663
## coherence -1.751 -1.129 -0.674
## structure -1.042 -0.437  0.013
## grammar   -0.502 -0.082  0.529
## content   -0.253  0.613  1.184
## 
## $Beverely$Family
##              [,1]   [,2]   [,3]
## spelling  -31.996 -2.264 -0.718
## coherence  -1.524 -1.357 -0.684
## structure  -1.326 -0.577  0.164
## grammar    -0.796  0.118  0.599
## content    -0.469  0.690  1.230
## 
## $Beverely$Work
##             [,1]   [,2]   [,3]
## spelling  -2.366 -1.465 -0.672
## coherence -1.388 -1.088 -0.925
## structure -1.115 -0.621  0.197
## grammar   -0.345  0.045  0.495
## content   -0.212  0.482  1.282
## 
## $Beverely$School
##             [,1]   [,2]   [,3]
## spelling  -1.826 -1.611 -0.873
## coherence -1.632 -1.222 -0.794
## structure -1.270 -0.865  0.321
## grammar   -0.491 -0.037  0.413
## content   -0.361  0.449  1.137
## 
## 
## $Colin
## $Colin$Sport
##             [,1]   [,2]  [,3]
## spelling  -1.660 -0.685 0.564
## coherence -0.612 -0.168 0.362
## structure -0.485  0.519 1.512
## grammar    0.611  1.275 1.698
## content    1.037  1.853 2.343
## 
## $Colin$Family
##             [,1]   [,2]   [,3]
## spelling  -1.477 -0.677 -0.022
## coherence -0.441 -0.277  0.332
## structure -0.318  0.265  1.299
## grammar    0.361  1.252  1.839
## content    1.009  1.683  2.374
## 
## $Colin$Work
##             [,1]   [,2]  [,3]
## spelling  -1.697 -1.002 0.089
## coherence -0.654 -0.105 0.192
## structure -0.502  0.502 1.205
## grammar    0.662  1.218 1.573
## content    0.766  1.806 2.357
## 
## $Colin$School
##             [,1]   [,2]  [,3]
## spelling  -1.595 -0.788 0.095
## coherence -0.629 -0.389 0.123
## structure -0.470  0.122 1.237
## grammar    0.385  1.010 1.679
## content    0.698  1.520 2.310
## 
## 
## $David
## $David$Sport
##             [,1]   [,2]  [,3]
## spelling  -1.405 -0.482 0.412
## coherence -0.357  0.136 0.581
## structure  0.023  0.724 1.811
## grammar    0.714  1.454 1.959
## content    1.256  2.031 2.912
## 
## $David$Family
##             [,1]   [,2]  [,3]
## spelling  -1.271 -0.404 0.741
## coherence  0.028  0.415 0.977
## structure  0.474  1.069 1.756
## grammar    1.177  1.733 2.085
## content    1.284  2.169 3.596
## 
## $David$Work
##             [,1]   [,2]  [,3]
## spelling  -1.378 -0.587 0.498
## coherence -0.119  0.260 0.795
## structure  0.173  1.003 1.885
## grammar    1.199  1.592 2.008
## content    1.437  2.174 3.117
## 
## $David$School
##             [,1]   [,2]  [,3]
## spelling  -0.815 -0.330 0.424
## coherence  0.062  0.293 0.805
## structure  0.295  1.012 1.955
## grammar    1.035  1.642 2.260
## content    1.312  2.107 3.407

Finally, the p.est table shows person parameters.

head(model1$p.est)  ##EAPs


##   casenum est (d1) error (d1) pop (d1)
## 1       1  -0.0824     0.5050   0.8821
## 2       2   1.7592     0.5597   0.8551
## 3       3   0.1648     0.4912   0.8884
## 4       4   3.5734     0.8269   0.6837
## 5       5  -0.6230     0.5291   0.8705
## 6       6   0.1648     0.4912   0.8884
head(model2$p.est)  ##MLEs


##   casenum sscore (d1) max (d1) est (d1) error (d1)
## 1       1          23       60  -0.4969     0.2535
## 2       2          36       60   0.6931     0.2605
## 3       3          24       60  -0.2637     0.2638
## 4       4          52       60   1.8587     0.3782
## 5       5          47       60   1.9147     0.2884
## 6       6          47       60   0.5312     0.2835

CQmodel, meet wrightMap

Ok, we have person parameters and item parameters: Let’s make a Wright Map

wrightMap(model1)


## Using GIN table for threshold parameters

The above uses the GIN table as thresholds. But you may want to use RMP tables. For example, if you have an item table and an item*step table, you might want to combine them to make deltas. You could do this yourself, but you could also let the make.deltas function do it for you. This function reshapes the item*step parameters, checks the item numbers to see if there are any dichotomous items, and then adds the steps and items. This can be especially useful if you didn’t get a GIN table from ConQuest (see below).

model3 <- CQmodel("ex2a.eap", "ex2a.shw")
model3$GIN


## NULL
model3$equation


## [1] "item+item*step"

This model has no GIN table, but it does have item and item*step tables. The make.deltas function will read the model equation and look for the appropriate tables.

make.deltas(model3)


## Using item and item*step tables to create delta parameters


##                    1      2      3
## Earth shape   -0.961 -0.493     NA
## Earth pictu.. -0.650  0.256  2.704
## Falling off   -1.416  1.969  1.265
## What is Sun   -0.959  1.343     NA
## Moonshine      0.157 -0.482 -0.128
## Moon and ni.. -0.635  0.861     NA
## Night and d..  0.157 -0.075 -0.739
## Breathe on ..  0.657  1.152 -3.558
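
Under the hood the arithmetic is simple addition. Here is a rough sketch of it (not the package code), assuming model3’s RMP tables carry the same n_item, step and est columns shown for model1 earlier:

# delta_{item, step} = item estimate + item*step estimate
items <- model3$RMP$item[, c("n_item", "est")]
steps <- model3$RMP$"item*step"[, c("n_item", "step", "est")]
steps <- steps[!is.na(steps$est), ]  # drop the constrained rows with NA estimates
m <- merge(steps, items, by = "n_item", suffixes = c(".step", ".item"))
m$delta <- m$est.step + m$est.item
tapply(m$delta, list(m$n_item, m$step), mean)  # items in rows, steps in columns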

When sent a model with no GIN table, wrightMap will automatically send it to make.deltas without the user having to ask.

wrightMap(model3, label.items.row = 2)


## Using item and item*step tables to create delta parameters

The make.deltas function can also handle rating scale models.

model4 <- CQmodel("ex2b.eap", "ex2b-2.shw")
model4$GIN


## NULL
model4$equation


## [1] "item+step"

This rating scale model again has no GIN table (always the first thing wrightMap looks for) so we’ll need to make deltas.

make.deltas(model4)


## Using item and step tables to create delta parameters

Or let wrightMap make them automatically.

wrightMap(model4, label.items.row = 2)


## Using item and step tables to create delta parameters

Specifying the tables

In the above examples, we let wrightMap decide what parameters to graph. WrightMap starts by looking for a GIN table. If it finds that, it assumes they are thresholds and graphs them accordingly. If there is no GIN table, it then sends the function to make.deltas, which will examine the model equation to see if it knows how to handle it. Make.deltas can handle equations of the form

A (e.g. item)

A + B (e.g. item + step [RSM])

A + A * B (e.g. item + item * step [PCM])

A + A * B + B (e.g. item + item * gender + gender)

(It will also notice if there are minus signs rather than plus signs and react accordingly.)

But sometimes we may want something other than the default. Let’s look at model2 again.

model2$equation


## [1] "rater+topic+criteria+rater*topic+rater*criteria+topic*criteria+rater*topic*criteria*step"

Here’s the default Wright Map, using the GIN table

wrightMap(model2, min.logit.pad = -29, use.hist = FALSE)


## Using GIN table for threshold parameters

This doesn’t look great. Instead of showing all these estimates, we can specify a particular RMP table to use with the item.table parameter.

wrightMap(model2, item.table = "rater")


## Using rater tables to create delta parameters

That shows just the rater parameters. Here’s just the topics.

wrightMap(model2, item.table = "topic")


## Using topic tables to create delta parameters

What I really want, though, is to show the rater*topic estimates. For this, we can use the interactions and step.table parameters.

wrightMap(model2, item.table = "rater", interactions = "rater*topic", step.table = "topic")


## Using rater and rater*topic and topic tables to create delta parameters

Switch the item and step names to graph it the other way:

wrightMap(model2, item.table = "topic", interactions = "rater*topic", step.table = "rater")


## Using topic and rater*topic and rater tables to create delta parameters

You can leave out the interactions to have more of a rating scale-type model.

wrightMap(model2, item.table = "rater", step.table = "topic")


## Using rater and topic tables to create delta parameters

Or leave out the step table:

wrightMap(model2, item.table = "rater", interactions = "rater*topic")


## Using rater and rater*topic tables to create delta parameters

Again, make.deltas is reading the model equation to decide whether to add or subtract. If, for some reason, you want to specify a different sign for one of the tables, you can use item.sign, step.sign, and inter.sign for that.

wrightMap(model2, item.table = "rater", interactions = "rater*topic", step.table = "topic", 
    step.sign = -1)


## Using rater and rater*topic and topic tables to create delta parameters

The last few examples might not make sense for this model, but are just to illustrate how the function works. Note that all three of these parameters must be the exact name of specific RMP tables, and you can’t specify an interactions table or a step table without also specifying an item table (although JUST an item table is fine). And if your model equation is more complicated than the ones specified above, you will have to either use a GIN table or specify in the function call which tables to use for what. A model of the form item + item * step + booklet, for example, will not run unless there is a GIN table or you have defined at least the item.table.

Making thresholds

So far, we’ve seen how to use the GIN table to graph thresholds, or the RMP tables to graph deltas. We have one use case left: Making thresholds out of those RMP-generated deltas. Coulter (Dan) Furr has provided a lovely function for exactly this purpose. The example below uses the model3 deltas, but you can send it any matrix with items as rows and steps as columns.

deltas <- make.deltas(model3)


## Using item and item*step tables to create delta parameters
deltas


##                    1      2      3
## Earth shape   -0.961 -0.493     NA
## Earth pictu.. -0.650  0.256  2.704
## Falling off   -1.416  1.969  1.265
## What is Sun   -0.959  1.343     NA
## Moonshine      0.157 -0.482 -0.128
## Moon and ni.. -0.635  0.861     NA
## Night and d..  0.157 -0.075 -0.739
## Breathe on ..  0.657  1.152 -3.558
make.thresholds(deltas)


##                  [,1]    [,2]    [,3]
## Earth shape   -1.3229 -0.1311      NA
## Earth pictu.. -0.9242  0.4452  2.7832
## Falling off   -1.4503  1.3141  1.9729
## What is Sun   -1.0467  1.4307      NA
## Moonshine     -0.6759 -0.2253  0.4156
## Moon and ni.. -0.8077  1.0337      NA
## Night and d.. -0.6343 -0.1937  0.1853
## Breathe on .. -0.7007 -0.5079 -0.4742

Alternately, we can just send the model object directly:

make.thresholds(model3)


## Using item and item*step tables to create delta parameters
## Creating threshold parameters out of deltas


##                  [,1]    [,2]    [,3]
## Earth shape   -1.3229 -0.1311      NA
## Earth pictu.. -0.9242  0.4452  2.7832
## Falling off   -1.4503  1.3141  1.9729
## What is Sun   -1.0467  1.4307      NA
## Moonshine     -0.6759 -0.2253  0.4156
## Moon and ni.. -0.8077  1.0337      NA
## Night and d.. -0.6343 -0.1937  0.1853
## Breathe on .. -0.7007 -0.5079 -0.4742
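
For the curious, here is a rough sketch (not the package’s code) of what that conversion involves, assuming the thresholds are Thurstonian thresholds under the partial credit model, i.e. the ability at which the probability of scoring in or above a given category reaches 0.5. Applied to the first row of the deltas matrix above, it should reproduce the first row of thresholds:

# category probabilities for one item under the PCM, given its deltas
pcm.probs <- function(theta, deltas) {
    deltas <- deltas[!is.na(deltas)]
    num <- exp(cumsum(c(0, theta - deltas)))  # unnormalised category terms
    num / sum(num)
}

# Thurstonian threshold for category k: solve P(X >= k | theta) = 0.5
thresholds.from.deltas <- function(deltas) {
    deltas <- deltas[!is.na(deltas)]
    sapply(seq_along(deltas), function(k) {
        uniroot(function(theta) sum(pcm.probs(theta, deltas)[-seq_len(k)]) - 0.5,
                c(-10, 10))$root
    })
}

# e.g. the first item ("Earth shape") of the deltas matrix created above
thresholds.from.deltas(unlist(deltas[1, ]))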

You don’t have to do any of this to make a Wright Map. You can just send the model to wrightMap, and use the type parameter to ask it to calculate the thresholds for you.

wrightMap(model3, type = "thresholds", label.items.row = 2)


## Using item and item*step tables to create delta parameters
## Creating threshold parameters out of deltas

Again, the default type is to use the GIN table if present, and to make deltas if not. You can also force it to make deltas (and ignore the GINs) by setting type to deltas. Alternately, if you specify an item.table, the type will switch to deltas unless you then set type to thresholds.

Last, but not least, important time-saving note

Finally: If all you want is the Wright Maps, you can skip CQmodel entirely and just send your files to wrightMap:

wrightMap("ex2a.eap", "ex2a.shw", label.items.row = 3)


## Using item and item*step tables to create delta parameters

To leave a comment for the author, please follow the link and comment on his blog: R Snippets for IRT.


Mapping academic collaborations in Evolutionary Biology


(This article was first published on What is this? David Springate's personal blog :: R, and kindly contributed to R-bloggers)

Mapping academic collaborations in Evolutionary Biology

This post is a republication of a visualisation I did in 2011 for my (now defunct) datajujitsu.co.uk blog. It was a naive first attempt at web-scraping from an academic publisher's website. It was done before I was aware of the problems surrounding access to, and text-mining of, online academic content hosted by publishers such as Wiley and Elsevier. Producing such a piece now (in 2013) would certainly be regarded as a political act. The text and visualisations are unchanged from the original.

Like many people, I was immensely impressed with Paul Butler's global map of Facebook friend connections, a spectacular way of visualising, and humanising, a large amount of raw data. I was further impressed to find out that he did it solely using R. I recently found FlowingData's tutorial on creating the same effect using flight information and got to thinking about what other datasets I could apply it to. My original plan was to build a scraper to get all of the abstracts from a particular subject from Pubmed and visualise the academic collaborations between institutions for all of these abstracts. Unfortunately though, Pubmed only stores the addresses of the institutions of the corresponding author, so I decided to stick with my own subject, evolutionary biology, and get all the abstracts since 2009 from the journals Evolution (ISSN 1558-5646) and the Journal of Evolutionary Biology (ISSN 1420-9101). I could then extract the institution addresses using a hacked-together Python script, which would then feed them into the Yahoo PlaceFinder API to get a data set of coordinates for each cross-institution collaboration in every paper published in the journals for the last two and a half years. I then fed this data into R, generated great circles for each of the collaborations using the geosphere package and processed it a la FlowingData to get the following global map of academic collaboration in evolutionary biology since 2009:

Evolution social network 2009-2011

You can clearly see the main hubs of collaboration in Europe and the East and West coasts of the USA, with smaller hubs in Japan and South-Eastern Australia. There are further actively collaborating institutions in South America and Africa, but almost all of their collaborations are with North American and European universities. Looking into the data itself, the median longitude for JEB institutions is firmly in Europe, while the median longitude for Evolution is in the USA (this makes sense since Evolution is based in the States while JEB is a European journal, though there is no geographic imperative to publish in either). Technical info: I scraped the data from the Wiley website for the two journals using Python and BeautifulSoup. For the R analysis I used the packages maps, geosphere, reshape and gdata.
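
The original scripts aren't reproduced here, but the great-circle step boils down to something like the following sketch (the collabs data frame and its columns are invented for illustration; in the original the coordinates came from the geocoded author addresses):

library(maps)
library(geosphere)

# hypothetical geocoded collaboration pairs: one row per cross-institution link
collabs <- data.frame(lon1 = c(-0.13, -71.06), lat1 = c(51.51, 42.36),
                      lon2 = c(151.21, 2.35),  lat2 = c(-33.87, 48.86))

map("world", col = "grey20", fill = TRUE, bg = "black", lwd = 0.1)
for (i in seq_len(nrow(collabs))) {
    arc <- gcIntermediate(c(collabs$lon1[i], collabs$lat1[i]),
                          c(collabs$lon2[i], collabs$lat2[i]),
                          n = 100, addStartEnd = TRUE)
    lines(arc, col = "#FFFFFF40", lwd = 0.8)  # translucent arcs pile up at the hubs
}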

To leave a comment for the author, please follow the link and comment on his blog: What is this? David Springate's personal blog :: R.


Accessing global weather model data using the rNOMADS package in R


(This article was first published on Bovine Aerospace » R, and kindly contributed to R-bloggers)

The rNOMADS package interfaces with the NOAA Operational Model Archive and Distribution System to provide access to 55 operational (i.e. real time and prediction) models describing the state of the ocean and the atmosphere. rNOMADS has been used to get wind and wave data for a real time sailing game, to quantify solar energy available for power plants in Europe, and to predict helium balloon flights. We look forward to continuing to hear about novel and unexpected uses for this spectacular data set.

In this post, we show how to use rNOMADS to do the following:
1. Plot global temperature and wind at different atmospheric pressure levels
2. Examine the distribution of wave heights in the Atlantic ocean
3. Produce a precise atmospheric temperature and wind profile for a specific point at a specific time
4. Generate simultaneous atmospheric profiles at over 100 locations on the Eastern Seaboard of the USA

Links to source code are provided below each set of figures. A link to rNOMADS installation instructions is provided at the end of the post.

Global Temperature and Wind Maps

The Global Forecast System (GFS) model provides weather data on a 0.5 x 0.5 degree grid for the entire planet.  It is run 4 times daily, and produces a prediction every three hours out to 180 hours.  The zero hour “analysis” forecast is the state of the atmosphere at the model run time, and it uses some observational data to increase its accuracy.  Here, we plot the temperature at 2 m above the surface, the wind at 10 m above the surface, and the winds at 300 mb (around 9 kilometers elevation).  The 300 mb plot often shows the northern and southern jet streams quite well.

Temperature at the Earth’s surface determined using the Global Forecast System model.

Winds at the surface of the Earth from the GFS model. Note the little spot of high winds south of Indonesia – that’s Tropical Cyclone Gillian, a Category 3 storm when this image was generated.

Jet streams and Rossby waves are clearly visible in this image of the wind speeds of the upper troposphere/lower stratosphere.

Download the source code for these images here.

Wave Heights in the Atlantic Ocean

rNOMADS can also access the NOAA Wave Watch Model.  While I’m less familiar with this one, I was able to generate a plot of what (I think) are wave heights in the western Atlantic Ocean and the Caribbean.

Wave heights in the Atlantic ocean from the NOAA Wave Watch model.

Download the source code for this image here.

Instantaneous Atmospheric Profile over Sakura-Jima volcano, Japan

It’s important to know which direction the winds are going directly above active volcanoes, because eruptions can carry ash into air space and over inhabited areas.  One impetus for the continued development of rNOMADS was to provide a one-stop solution for generating high precision wind profiles over specific points, allowing ash distribution prediction as soon as an eruption begins.  Here, we have generated a spatially and temporally interpolated wind and temperature profile over Sakura-Jima volcano, Japan.  The profile is calculated for the exact time when the source code is run.

Instantaneous temperature profile above Sakura-jima volcano, Japan.

An eruption at this instant would produce ashfalls east of the volcano for a plume height of 15 km. However, if a truly massive 30 km high plume was produced, ashfalls would occur both east and west of the volcano.

Download the source code for these images here.

Wind and temperature profiles over the USArray seismic and infrasound network

The USArray is a massive grid of seismic and infrasound (low frequency sound) sensors currently occupying the Eastern seaboard of the United States.  Since infrasound propagation is strongly affected by temperature and wind, it’s important to know the weather conditions over each station.  Here, we’ve plotted the temperature and wind profiles for every single station in the USArray on the same plot.  The large variations in wind speed will result in very different infrasound propagation depending on where the stations are located.

Temperature profiles above USArray stations.

Wind speed profiles over USArray stations. It appears that wind speeds vary by over 100 km/hr at certain heights between stations.

Download the source code for these images here.

Some of these scripts require the aqfig package in R to generate the colorbar legends.

Instructions on installing R, rNOMADS, and their dependencies are here.


To leave a comment for the author, please follow the link and comment on his blog: Bovine Aerospace » R.


Earthquake Magnitude / Depth Chart


(This article was first published on Exegetic Analytics » R, and kindly contributed to R-bloggers)

I am working on a project related to secondary effects of earthquakes. To guide me in the analysis I need a chart showing the location, magnitude and depth of recent earthquakes. There are a host of such charts available already, but since I had the required data on hand, it seemed like a good idea to take a stab at it myself.

Getting the Data

The data was sourced from the US Geological Survey web site. I selected dates for the decade between 1 January 2004 and 1 January 2013, magnitudes greater than 5 and chose CSV as the output format.

Loading the data into R is then simple. Some small transformations are required in order to interpret the time field in the data. I discarded a few columns which were not going to be useful, and added fields for the year and date of observation (for convenience alone: these data were already in the time field).

> catalog <- read.csv(file.path("data", "earthquake-catalog.csv"), stringsAsFactors = FALSE)
> #
> catalog <- within(catalog, {
+   time <- sub("T", " ", time)
+   time <- sub("Z", "", time)
+   time <- strptime(time, format = "%Y-%m-%d %H:%M:%S")
+   date <- as.Date(time)
+   year <- as.integer(strftime(time, format = "%Y"))
+ })
>
> catalog <- catalog[, c(12, 16, 17, 1, 2:5, 14)]

This is what the resulting data frame looks like:

> head(catalog)
          id year       date                time latitude longitude depth mag                           place
1 usc000lv53 2013 2013-12-31 2013-12-31 23:41:47  19.1673  120.0807 10.28 5.2  92km NW of Davila, Philippines
2 usc000lv0r 2013 2013-12-31 2013-12-31 21:32:01  19.1223  120.1797 10.00 5.2 83km NNW of Davila, Philippines
3 usb000m2uh 2013 2013-12-31 2013-12-31 20:04:32  19.0589  120.3057 20.61 5.0 70km NNW of Davila, Philippines
4 usc000luwe 2013 2013-12-31 2013-12-31 20:01:06  19.1181  120.2719 10.95 5.7 77km NNW of Burgos, Philippines
5 usb000m2ub 2013 2013-12-31 2013-12-31 13:55:02 -17.6528 -173.6869 15.38 5.0      114km NNE of Neiafu, Tonga
6 usc000lumu 2013 2013-12-31 2013-12-31 08:36:30 -15.6555 -172.9340 31.82 5.1       93km ENE of Hihifo, Tonga

Making the Charts

Time to generate those charts. There are lots of ways to make maps in R; I chose to use a generic option: ggplot2.

> require(ggplot2)
> require(maps)
> require(grid)
> 
> world.map <- map_data("world")
> 
> ggplot() +
+   geom_polygon(data = world.map, aes(x = long, y = lat, group = group),
+                fill = "#EEEECC") +
+   geom_point(data = catalog, alpha = 0.25,
+              aes(x = longitude, y = latitude, size = mag, colour = depth)) +
+   labs(x = NULL, y = NULL) +
+   scale_colour_gradient("Depth [km]", high = "red") +
+   scale_size("Magnitude") +
+   coord_fixed(ylim = c(-82.5, 87.5), xlim = c(-185, 185)) +
+   theme_classic() +
+   theme(axis.line = element_blank(), axis.text = element_blank(),
+         axis.ticks = element_blank(),
+         plot.margin=unit(c(3, 0, 0, 0),"mm"),
+         legend.text = element_text(size = 6),
+         legend.title = element_text(size = 8, face = "plain"),
+         panel.background = element_rect(fill='#D6E7EF'))

The resulting plot gives the location of the earthquakes as points, with magnitudes indicated by the sizes of the points and depths given by their colour.

earthquake-map

The Earth’s tectonic plates are well defined by the numerous interplate earthquakes, and there is a liberal sprinkling of intraplate events as well.

I made another chart showing the distribution of earthquakes broken down by year.

earthquake-map-panels
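
That panel chart isn't listed in the post, but a plausible sketch of it (reusing the objects defined above; the facet layout is my assumption, not the original code) is simply the first plot with facet_wrap() on the year column:

ggplot() +
  geom_polygon(data = world.map, aes(x = long, y = lat, group = group),
               fill = "#EEEECC") +
  geom_point(data = catalog, alpha = 0.25,
             aes(x = longitude, y = latitude, size = mag, colour = depth)) +
  facet_wrap(~ year) +                                   # one panel per year
  labs(x = NULL, y = NULL) +
  scale_colour_gradient("Depth [km]", high = "red") +    # USGS reports depth in km
  scale_size("Magnitude") +
  coord_fixed(ylim = c(-82.5, 87.5), xlim = c(-185, 185)) +
  theme_classic() +
  theme(axis.line = element_blank(), axis.text = element_blank(),
        axis.ticks = element_blank(), legend.text = element_text(size = 6),
        legend.title = element_text(size = 8, face = "plain"),
        panel.background = element_rect(fill = '#D6E7EF'))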

Distribution of Earthquake Magnitudes

While we are taking a high level look at the data, it’s interesting to see how the magnitudes are distributed. A logarithmic scale is necessary to make the frequencies visible over the full range of magnitudes.

ggplot(catalog, aes(x = mag)) +
  xlab("Magnitude") + ylab("Number of Earthquakes") +
  stat_bin(drop = TRUE, binwidth = 0.25) +
  scale_y_log10(breaks = c(1, 10, 100, 1000)) +
  theme_classic()

earthquake-magnitude-histogram

Very nice: consistent with a power law, as described by the Gutenberg–Richter law.

To leave a comment for the author, please follow the link and comment on his blog: Exegetic Analytics » R.


categoryCompare Paper Finally Out!


(This article was first published on Deciphering life: One bit at a time :: R, and kindly contributed to R-bloggers)

categoryCompare Paper Finally Out!

I can finally say that the publication on my Bioconductor package categoryCompare has been published in the Bioinformatics and Computational Biology section of Frontiers in Genetics. This has been a long time coming, and I wanted to give some background on the inspiration and development of the method and software.

TL;DR

The software package has been in development in one form or another since 2010 and was released to Bioconductor in summer 2012; the publication has bounced around and been revised since spring of 2013, and it is finally available to you. All of the supplementary data and methods are available as an R package on github. Version control using git was instrumental in getting this work out in a timely manner. There is still a bunch of work to do on the package.

If I did it again, I would:

  • Write the manuscript using R markdown, as a vignette in a package
  • Ask to be able to make reviewer points issues on Github
  • Submit a preprint with submission

Inspiration

In spring of 2010 I started as a PostDoc with Eric Rouchka. One of his collaborators, Jeff Petruska, is interested in the process of collateral sprouting of neurons, especially as it compares to regeneration. Early in my PostDoc, Jeff wanted to do a gene-level comparison of his microarray data of collateral sprouting in skin compared to previously published studies with muscle.

Combing through the literature produced a number of genes differentially expressed in denervated muscle. However, when comparing with the genes resulting from the skin data, there was almost nothing in common and nothing that made sense from a functional standpoint. Skimming around the Bioconductor literature in GOStats, I was struck by Robert Gentleman's example of coloring Gene Ontology nodes by which data set they originated from. This is a very simple meta-analysis and visualization. When I tried it with the skin - muscle comparison, I got some very interesting (i.e. the Petruska group thought the results were very interpretable) results.

Note: At this point, because of the data sources (gene lists from publications with little to no original data), I was using the hypergeometric enrichment test in Category to determine significant GO terms from the two tissues.

V 0.0000001

I started developing this idea into a simple package (i.e. a collection of R scripts) that was able to do at least GO term enrichment, and that could be hosted on our group webserver to enable others to make use of it. Visualization and interrogation used the imageMaps function from Robert Gentleman's original demonstration; however, any number of data sets could now be compared.

categoryCompare Method – Summary

The basic method is to take gene (or really any annotated feature) lists from multiple experiments, and perform annotation (Gene Ontology Terms, KEGG Pathways, etc) enrichment (either hypergeometric or GSEA type) on each gene list, determine significant annotations from each list, and then examine which annotations come from which list.

Because this results in a lot of data to parse through, exploration of the results is facilitated by considering the annotations as a network of annotations related by the number of shared genes between them, and interacting with the networks in Cytoscape.
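
As a schematic illustration of that last idea (a toy sketch with made-up gene lists, not the categoryCompare API), the annotation network can be built by linking annotations that share genes:

library(igraph)

# toy enrichment results: significant annotations and their member genes
annot.genes <- list(GO1 = c("A", "B", "C"),
                    GO2 = c("B", "C", "D"),
                    GO3 = c("E", "F"))

# edge weight = number of genes shared by each pair of annotations
pairs <- t(combn(names(annot.genes), 2))
wts <- apply(pairs, 1, function(p)
    length(intersect(annot.genes[[p[1]]], annot.genes[[p[2]]])))
edges <- data.frame(pairs, weight = wts)[wts > 0, ]

g <- graph_from_data_frame(edges, directed = FALSE,
                           vertices = data.frame(name = names(annot.genes)))
plot(g, edge.width = E(g)$weight)  # annotations as nodes, shared genes as edge widths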

If you want to know more, check out the paper, or the vignette in the Bioconductor package

Others

As I developed this idea, I also started looking into other possible software implementations.

  • Shen & Tseng (Shen & Tseng, 2010) had recently published similar work in MAPE-I, but at my level of understanding it required the same identifiers. In retrospect, I should have read the paper better. However, they also did not have a released implementation with their publication.

  • ConceptGen (Sartor et al., 2010) is another interesting application that allows very similar analyses to categoryCompare, except that it is not possible to explore how the concepts map between multiple user-supplied data sets. The only way one can relate multiple data sets is if one data set maps as a concept to another, but one cannot visualize the interrelated concepts between user-supplied data sets.

  • Enrichment Map (Merico et al., 2010) was another alternative, but it did all of the comparison mathematics in Cytoscape itself, based on enrichments calculated outside of Cytoscape. The publication does give an example of a similar type of analysis as performed by categoryCompare. However, I wanted everything, from enrichment calculation to visualization, controlled by R. I did use their method of weighting the edges between annotations, however, ditching the GO directed acyclic graph (DAG) view I used initially.

Bioconductor Package

About this time I realized that I wanted the package to fit into the Bioconductor ecosystem. This required a complete redesign and rewrite of the code, as I moved to actually using some sort of OOP model (S4 in this case) and creating an actual package.

The package was released to the wild in the fall of 2012, as part of Bioconductor v 2.10. Of course, this was the last Bioconductor release based on R 2.15, which brought some particular challenges in namespaces with the switch in R 3.0.0 the next year.

Graphviz to RCytoscape

The original code used the graphviz package to do layouts for visualization, but it was difficult to install, and did not work the way I wanted. Thankfully Paul Shannon had just developed the RCytoscape package the year previous (Shannon et al., 2013), and this enabled truly interactive visualization and passing data back and forth between R and Cytoscape.

My use of RCytoscape has actually led to finding improvements in the package, and better use of it. I am also hoping to make much more use of RCytoscape and the igraph package and to combine them in novel ways.

First Publication Attempt

Our first attempt at publication focussed on the original data that inspired the method, and comparing it with gene-gene comparisons directly. We submitted to BMC Bioinformatics, and the reviews were not favorable, and took forever to get back. We actually wondered if the reviewers had read the paper that we submitted. We gave up on this venue. BTW, our last three publications have been submitted there, and the publication process has gotten worse and worse with every submission there. I don't think our papers are getting worse over the years. I don't know if we will bother submitting to BMC Bioinformatics again.

Second Publication Attempt

The second attempt was submitting to the current home in the Bioinformatics and Computational Biology section of Frontiers in Genetics. This time, although the reviews were harsh (i.e. they were not immediately favorable), they were fair, and actually contained useful critique, and pointed to a way forward.

Unfortunately for me, revising the manuscript to address the reviewers' criticisms meant a lot of work to construct theoretical examples, as well as a lot of thought in order to pare the manuscript down to make sure our primary message was clearly communicated.

Frontiers Interactive Review

I would like to note that the Frontiers interactive review system, with the ability to discuss individual points with the reviewers (still anonymously) really helped make it possible to determine which points were make or break, and discuss different ways to approach things. This was the best review experience I have ever had, and a large part of that I think was due to being able to interact with the reviewers directly, and not just through letters mediated by the editor.

I think it would be nice if Frontiers had an option for making the review history available if the authors and reviewers were agreeable to it.

Github package of supplementary materials

In the initial publication, I had included the set of scripts used for analysis, data files, results, etc. However, the amount of work required in the rewrite was so substantial that I created an R package specifically for the analyses that went into the paper, with separate R markdown vignettes for each result type (hypothetical, two different experimental comparisons). This package has documents on how raw data was processed, as well as the semi-processed experimental data.

Publication specific branch

Due to specific changes to the categoryCompare software needed to address the reviewer comments, a publication specific branch of the development version was created (paper). This allowed me to quickly introduce code and features that the reviewers had asked for, without worrying about breaking the current development version that will be released in Bioconductor shortly.

To Do

As with most software projects, there is still plenty to be done.

  • Incorporate new functionality from paper branch into dev

    • Specifically, the ability to do GSEA built in (probably using limma's romer and roast functions), and new visualization options
  • Change from latex vignette to R markdown (this is technically done, but hasn't made it into the dev branch)

  • Switch to roxygen2 documentation (will be interesting due to use of S4 objects and methods)

  • Implement proper testing using testthat

  • Refactor a lot of code to improve speed, use R conventions

    • I was still a relatively R newbie when I wrote the package, and as I looked at the code while making changes for the reviewers, I noticed some places where I had done some silly things, mainly because I didn't know better when I did it.
  • Consider splitting visualization into its own package

    • Although I developed the visualization specifically for categoryCompare, I think it is generic enough that others might benefit from having it available separately. Therefore I need to think about separating it out, and how best to go about it without making it hard for categoryCompare to keep working as it already does.
  • Make it easy to investigate the actual genes with particularly interesting annotations, and how they are linked together by annotations or other data sources, as well as their original expression levels in the experiments.

Side Effects

A nice side effect of the package is that almost any annotation enrichment I do now, I do in the framework of categoryCompare, just because it is a lot easier to make sense of the results using the visualization capabilities and the coupling with Cytoscape.

What Would I Do Differently?

Keeping in mind that I started this work almost 4 years ago, when I still didn't know any R, and had yet to be exposed to the reproducibility and open science movements, or knitr, or pandoc, here are some things that if I started today I would do differently (you can probably guess some of these from above):

  • Put the package under version control immediately! Thankfully I didn't have any moments early in the process when things were not in a git repo, but I am very thankful later on I was able to do diff and branch on my code to figure out where things broke and introduce new features.

  • Start thinking of the analysis as a standalone package from the very beginning, instead of a directory of data and scripts. This is what I do now (blogpost to come), and it makes it much easier if I come up with novel methods to spin them off into a fully fledged R / Bioconductor package

  • Don't underestimate the novelty of something as simple as visualization, and how much it may make or break your method. We ended up adding a good chunk of text to the manuscript on the visualization because we realized how important it was, but only after the reviewers pointed it out to us.

  • Write the paper as a vignette of the ccPaper package itself, and generate Word documents for collaborators who insist on Word docs using Pandoc

  • Start a github repo for the paper, and ask collaborators to try and work on it there

  • Submit a preprint when submitted, so that we start getting feedback on the manuscript early

  • Ask for permission to set up reviewer comments as issues on the github repo to easily track how well we are addressing them.

    • Wouldn't it be cool, in a totally open peer review journal, to actually do all of the peer review on a service like Github, and have reviewers leave issues, tag them, and comment directly on the text of the publication using the ability to comment on commits?

To leave a comment for the author, please follow the link and comment on his blog: Deciphering life: One bit at a time :: R.


Animated Choropleths in R


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Ari Lamstein has updated his choroplethr package with a new capability for creating animated data maps. I can't embed the animated version here, but click the image below to see an animation of US counties by average household income, from the richest to the poorest by percentile. (The code behind the animation is available on github.)

Animated choropleths

The choroplethr package is also now available on CRAN, so you can install the latest version (including the new choroplethr_animate function) with the command install.packages("choroplethr"). In addition to playing animations from start to finish, you can also step through each frame using the + and - buttons.

This version of choroplethr was created during Trulia’s latest innovation week. Ari Lamstein wrote most of the R code, and the animation code was written by Brian P Johnson.

Google Groups choroplethr: choroplethr v1.4.0 is now available  

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


Visualize violent crime rates in US with choroplethr package


(This article was first published on My Life as a Mock Quant in English, and kindly contributed to R-bloggers)
Visualize violent crime rates in different US States with choroplethr package

I learned about the choroplethr package from the blog post Animated Choropleths in R a few days ago. As another visualization tool in the R language, I wanted to give it a try.

To install the latest stable release (CRAN), type the following from an R console:

install.packages("choroplethr")

To install the development version using the devtools package from github:

library(devtools)
install_github("choroplethr", "trulia")
library(choroplethr)

Rather than just running the example code included in the choroplethr package, I used data from the rMaps package as a quick data source and visualized it!

library(devtools)
install_github("ramnathv/rCharts@dev")
install_github("ramnathv/rMaps")

Now we can use the US violent crime rate data included in the rMaps package.

We can create animated choropleths as described in the post linked above; in my case, I just process the data and visualize it with the following simple code:

# load packages
library(rMaps)
library(choroplethr)
# initialize the list that will hold one choropleth per year
choropleths = list()
# get the years for the loop
years <- sort(unique(violent_crime$Year))
# convert the crime rate to level data
violent_crime$Crime <- cut(violent_crime$Crime, 9)
# create a choropleth component for each year
for (i in 1:length(years)) {
    df <- subset(violent_crime, Year == years[i])
    # we need to change the column names for the choroplethr function
    colnames(df) <- c("Year", "region", "value")
    # df$value <- round(df$value)  # cut decimals off (not needed once Crime is cut into levels)
    title <- paste0("Violent crime rates: ", years[i])
    choropleths[[i]] = choroplethr(df, "state", title = title)
}
# visualize it!
choroplethr_animate(choropleths)

The result is published via Dropbox; the following image links to it.

Enjoy!

To leave a comment for the author, please follow the link and comment on his blog: My Life as a Mock Quant in English.


Shape File Selfies in ggplot2


(This article was first published on TRinker's R Blog » R, and kindly contributed to R-bloggers)

In this post you will learn how to:

  1. Create your own quasi-shape file
  2. Plot your homemade quasi-shape file in ggplot2
  3. Add an external svg/ps graphic to a plot
  4. Change a grid grob's color and alpha

*Note: get the simple .md version here


Background (See just code if you don't care much about the process)

I started my journey wanting to replicate a graphic called a space manikin by McNeil (2005) and fill areas in that graphic like a choropleth. I won't share the image from McNeil's book as it's his intellectual property but know that the graphic is from a gesturing book that divides the body up into zones (p. 275). To get a sense of what the manikin looks like here is the ggplot2 version of it:

Figure 1: ggplot2 Version of McNeil’s (2005) Space Manikin

While this is a map of areas of a body you can see where this could be extended to any number of spatial tasks such as mapping the layout of a room.


1. Creating a Quasi-Shape File

So I figured “zones”: that's about like states on a map. I have toyed with choropleth maps of the US in the past and figured I'd generalize this learning. The difference is I'd have to make the shape file myself, as the maps package doesn't seem to have McNeil’s space manikin.

Let's look at what ggplot2 needs from the maps package:

library(maps); library(ggplot2)
head(map_data("state"))
##     long   lat group order  region subregion
## 1 -87.46 30.39     1     1 alabama      <NA>
## 2 -87.48 30.37     1     2 alabama      <NA>
## 3 -87.53 30.37     1     3 alabama      <NA>
## 4 -87.53 30.33     1     4 alabama      <NA>
## 5 -87.57 30.33     1     5 alabama      <NA>
## 6 -87.59 30.33     1     6 alabama      <NA>

Hmm, coordinates, names of regions, and the order to connect the coordinates. I figured I could handle that. I don't 100% know what a shape file is, mostly that it’s a file that makes shapes. What we're making may or may not technically be a shape file, but know we're going to map shapes in ggplot2 (I use the quasi to avoid the wrath of those who do know precisely what a shape file is).

I needed to make the zones around an image of a person so I first grabbed a free png silhouette from: http://www.flaticon.com/free-icon/standing-frontal-man-silhouette_10633. I then knew I'd need to add some lines and figure out the coordinates of the outlines of each cell. So I read the raster image into R, plotted it in ggplot2 and added lots of grid lines for good measure. Here's what I wound up with:

library(png); library(grid); library(qdap)
url_dl(url="http://i.imgur.com/eZ76jcu.png")
file.rename("eZ76jcu.png", "body.png")
img <- rasterGrob(readPNG("body.png"), 0, 0, 1, 1, just=c("left","bottom"))
ggplot(data.frame(x=c(0, 1), y=c(0, 1)), aes(x=x, y=y)) + 
    geom_point() +
    annotation_custom(img, 0, 1, 0, 1) + 
    scale_x_continuous(breaks=seq(0, 1, by=.05))+ 
    scale_y_continuous(breaks=seq(0, 1, by=.05)) + theme_bw() +
    theme(axis.text.x=element_text(angle = 90, hjust = 0, vjust=0))

Figure 2: Silhouette from ggplot2 With Grid Lines


1b. Dirty Deeds Done Cheap

I needed to get reference lines on the plot so I could begin recording coordinates. Likely there's a better process, but this is how I approached it and it worked. I exported the ggplot in Figure 2 into (GASP) Microsoft Word (I may have just lost a few die hard command line folks). I added lines there and figured out the coordinates of the lines. It looked something like this:

Figure 3: Silhouette from ggplot2 with MS Word Augmented Border Lines

After that I began the tedious task of figuring out the corners of each of the shapes (“zones”) in the space manikin. Using Figure 3 and a list structure in R I mapped each of the corners, the approximate shape centers, and the order to plot the coordinates in for each shape. This is the code for corners:

library(qdap)
dat <- list(
    `01`=data.frame(x=c(.4, .4, .6, .6), y=c(.67, .525, .525, .67)),
    `02`=data.frame(x=c(.35, .4, .6, .65), y=c(.75, .67, .67, .75)),
    `03`=data.frame(x=c(.6, .65, .65, .6), y=c(.525, .475, .75, .67)),
    `04`=data.frame(x=c(.4, .35, .65, .6), y=c(.525, .475, .475, .525)),
    `05`=data.frame(x=c(.35, .35, .4, .4), y=c(.75, .475, .525, .67)),
    `06`=data.frame(x=c(.4, .4, .6, .6), y=c(.87, .75, .75, .87)),
    `07`=data.frame(x=c(.6, .6, .65, .65, .73, .73), y=c(.87, .75, .75, .67, .67, .87)),
    `08`=data.frame(x=c(.65, .65, .73, .73), y=c(.67, .525, .525, .67)),
    `09`=data.frame(x=c(.6, .6, .73, .73, .65, .65), y=c(.475, .28, .28, .525, .525, .475)),
    `10`=data.frame(x=c(.4, .4, .6, .6), y=c(.475, .28, .28, .475)),
    `11`=data.frame(x=c(.27, .27, .4, .4, .35, .35), y=c(.525, .28, .28, .475, .475, .525)),
    `12`=data.frame(x=c(.27, .27, .35, .35), y=c(.67, .525, .525, .67)),
    `13`=data.frame(x=c(.27, .27, .35, .35, .4, .4), y=c(.87, .67, .67, .75, .75, .87)),
    `14`=data.frame(x=c(.35, .35, .65, .65), y=c(1, .87, .87, 1)),
    `15`=data.frame(x=c(.65, .65, .73, .73, 1, 1), y=c(1, .87, .87, .75, .75, 1)),
    `16`=data.frame(x=c(.73, .73, 1, 1), y=c(.75, .475, .475, .75)),
    `17`=data.frame(x=c(.65, .65, 1, 1, .73, .73), y=c(.28, 0, 0, .475, .475, .28)),
    `18`=data.frame(x=c(.35, .35, .65, .65), y=c(.28, 0, 0, .28)),
    `19`=data.frame(x=c(0, 0, .35, .35, .27, .27), y=c(.475, 0, 0, .28, .28, .475)),
    `20`=data.frame(x=c(0, 0, .27, .27), y=c(.75, .475, .475, .75)),
    `21`=data.frame(x=c(0, 0, .27, .27, .35, .35), y=c(1, .75, .75, .87, .87, 1))
)

dat <- lapply(dat, function(x) {
    x$order <- 1:nrow(x)
    x
})

space.manikin.shape <- list_df2df(dat, "id")[, c(2, 3, 1, 4)]

And the code for the centers:

centers <- data.frame(
    id = unique(space.manikin.shape$id),
    center.x=c(.5, .5, .625, .5, .375, .5, .66, .69, .66, .5, .34, .31, 
        .34, .5, .79, .815, .79, .5, .16, .135, .16),
    center.y=c(.597, .71, .5975, .5, .5975, .82, .81, .5975, .39, .3775, .39, 
        .5975, .81, .935, .89, .6025, .19, .14, .19, .6025, .89)
)

There you have it folks: your very own quasi-shape file. Celebrate the fruits of your labor by plotting that bad Oscar.


2. Plot Your Homemade Quasi-Shape File

 ggplot(centers) + annotation_custom(img,0,1,0,1) +
    geom_map(aes(map_id = id), map = space.manikin.shape, colour="black", fill=NA) +
    theme_bw()+ 
    expand_limits(space.manikin.shape) +
    geom_text(data=centers, aes(center.x, center.y, label = id), color="grey60") 

Figure 4: Plotting the Quasi-Shape File and a Raster Image

Then I said I may want to tone down the color of the silhouette a bit so I can plot geoms atop without distraction. Here's that attempt.

img[["raster"]][img[["raster"]] == "#0E0F0FFF"] <- "#E7E7E7"

ggplot(centers) + annotation_custom(img,0,1,0,1) +
    geom_map(aes(map_id = id), map = space.manikin.shape, colour="black", fill=NA) +
    theme_bw()+ 
    expand_limits(space.manikin.shape) +
    geom_text(data=centers, aes(center.x, center.y, label = id), color="grey60") 


Figure 5: Altered Raster Image Color


3. Add an External svg/ps

I quickly realized a raster was messy. I read up a bit on rasters in the R Journal (click here). In the process of reading and fooling around with Picasa I turned my original silhouette (body.png) blue and couldn't fix him, so I headed back to http://www.flaticon.com/free-icon/standing-frontal-man-silhouette_10633 to download another. In doing so I saw you could download an svg file of the silhouette. I thought maybe this would be less messy and easier to recolor. That led me to a Google search, to the grImport package via this listserve post, and then to an article by Paul Murrell (2009). I figured I could turn the svg (I didn't realize what an svg was until I opened it in Notepad++) into a ps file, read it into R, and convert it to a flexible grid grob.

There are probably numerous ways to convert an svg to a ps file, but I chose a cloud convert service. After that I read the file in with grImport per the Paul Murrell (2009) article. You're going to have to download the ps file HERE and move it to your working directory.
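(As an aside, not part of the original workflow: the manual download-and-move step in the next chunk could probably be skipped by fetching the ps file straight into the working directory with download.file(); the URL is the same one the chunk browses to.)

ps_url <- "https://github.com/trinker/space_manikin/raw/master/images/being.ps"
download.file(ps_url, destfile = "being.ps", mode = "wb")  ## saves into the working directory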

browseURL("https://github.com/trinker/space_manikin/raw/master/images/being.ps")
## Move that file from your downloads to your working directory.
## Sorry I don't know how to automate this.
library(grImport)

## Convert to xml
PostScriptTrace("being.ps")

## Read back in and convert to a grob
being_img <- pictureGrob(readPicture("being.ps.xml"))

## Plot it
ggplot(centers) + annotation_custom(being_img,0,1,0,1) +
    geom_map(aes(map_id = id), map = space.manikin.shape, 
        colour="black", fill=NA) +
    theme_bw()+ 
    expand_limits(space.manikin.shape) +
    geom_text(data=centers, aes(center.x, center.y, 
        label = id), color="grey60") 


Figure 6: Quasi-Shape File with Grob Image Rather than Raster


4. Change a grid Grob's Color and Alpha

Now that we have a flexible grob, we can mess around with colors and alpha to our heart's content.

str() is our friend for figuring out where and how to mess with the grob (str(being_img)). That led me to the following changes to adjust the image's color and/or alpha (transparency).

being_img[["children"]][[1]][[c("gp", "fill")]] <- 
  being_img[["children"]][[2]][[c("gp", "fill")]] <- "black"

being_img[["children"]][[1]][[c("gp", "alpha")]] <- 
  being_img[["children"]][[2]][[c("gp", "alpha")]] <- .2

## Plot it
ggplot(centers) + annotation_custom(being_img,0,1,0,1) +
    geom_map(aes(map_id = id), map = space.manikin.shape, 
        colour="black", fill=NA) +
    theme_bw()+ 
    expand_limits(space.manikin.shape) +
    geom_text(data=centers, aes(center.x, center.y, 
        label = id), color="grey60") 


Figure 7: Quasi-Shape File with Grob Image Alpha = .2


Let's Have Some Fun

Let's turn it into a choropleth and a density plot. We'll generate some fake values to fill with.

set.seed(10)
## Fake fill values, one per zone
centers[, "Frequency"] <- rnorm(nrow(centers))

## Set the silhouette's alpha to 0.25 for this plot
being_img[["children"]][[1]][[c("gp", "alpha")]] <- 
  being_img[["children"]][[2]][[c("gp", "alpha")]] <- .25

ggplot(centers, aes(fill=Frequency)) +
    geom_map(aes(map_id = id), map = space.manikin.shape, 
        colour="black") +
    scale_fill_gradient2(high="red", low="blue") +
    theme_bw()+ 
    expand_limits(space.manikin.shape) +
    geom_text(data=centers, aes(center.x, center.y, 
        label = id), color="black") + 
    annotation_custom(being_img,0,1,0,1) 


Figure 8: Quasi-Shape File as a Choropleth

set.seed(10)
## Fake counts per zone; replicate each zone's center that many times to
## simulate individual observations for the density plot
centers[, "Frequency2"] <- sample(seq(10, 150, by=20), nrow(centers), TRUE)

centers2 <- centers[rep(1:nrow(centers), centers[, "Frequency2"]), ]

ggplot(centers2) +
#       geom_map(aes(map_id = id), map = space.manikin.shape, 
#       colour="grey65", fill="white") +
    stat_density2d(data = centers2, 
        aes(x=center.x, y=center.y, alpha=..level.., 
        fill=..level..), size=2, bins=12, geom="polygon") + 
    scale_fill_gradient(low = "yellow", high = "red") +
    scale_alpha(range = c(0.00, 0.5), guide = FALSE) +
    theme_bw()+ 
    expand_limits(space.manikin.shape) +
    geom_text(data=centers, aes(center.x, center.y, 
        label = id), color="black") + 
    annotation_custom(being_img,0,1,0,1) +
    geom_density2d(data = centers2, aes(x=center.x, 
        y=center.y), colour="black", bins=8, show_guide=FALSE) 


Figure 9: Quasi-Shape File as a Density Plot

Good times were had by all.


Created using the reports (Rinker, 2013) package

Get the .Rmd file here


References

Murrell, P. (2009). Importing vector graphics: The grImport package for R. Journal of Statistical Software, 30(4), 1-37.

Rinker, T. W. (2013). reports: Package to assist in report writing. University at Buffalo, Buffalo, NY.

To leave a comment for the author, please follow the link and comment on his blog: TRinker's R Blog » R.


Using memoise to cache R values


(This article was first published on Dan Kelley Blog/R, and kindly contributed to R-bloggers)

Introduction

The memoise package can be very handy for caching the results of slow calculations. In interactive work, the slowest step is often reading data, so that is what is demonstrated here. The microbenchmark package is used to show timing results.
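For readers without the CTD file used below, here is a minimal, self-contained illustration of the same idea (not from the original post): memoise a deliberately slow function and call it twice.

library(memoise)

slow_square <- function(x) {
    Sys.sleep(1)  ## stand-in for an expensive computation
    x^2
}
fast_square <- memoise(slow_square)

system.time(fast_square(10))  ## about one second: computed and cached
system.time(fast_square(10))  ## near-instant: returned from the cache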

Methods and results

Setup

First, load the package being tested, and also a benchmarking package.

library(memoise)
library(microbenchmark)

Test conventional function

The demonstration will be for reading a CTD file.

library(oce)
## Loading required package: methods
## Loading required package: mapproj
## Loading required package: maps
## Loading required package: ncdf4
## Loading required package: tiff
microbenchmark(d <- read.oce("/data/arctic/beaufort/2012/d201211_0002.cnv"))
## Unit: milliseconds
##                                                          expr   min    lq
##  d <- read.oce("/data/arctic/beaufort/2012/d201211_0002.cnv") 160.4 162.5
##  median    uq   max neval
##   162.9 167.6 258.6   100

Memoise the function

Memoising read.oce() is simple:

r <- memoise(read.oce)

Measure the speed of memoised code

microbenchmark(d <- r("/data/arctic/beaufort/2012/d201211_0002.cnv"))
## Unit: microseconds
##                                                   expr   min    lq median
##  d <- r("/data/arctic/beaufort/2012/d201211_0002.cnv") 47.47 48.61   49.5
##     uq    max neval
##  52.57 165199   100

Conclusions

In this example, the speedup was by a factor of about 3000.

The operation tested here is quick enough for interactive work, but this is a 1-dbar file; the time would increase to several seconds for raw CTD data, and to perhaps half a minute if a whole section of CTD profiles is to be read. Using memoise() would reduce that half minute to a hundredth of a second, easily converting an annoyingly slow operation into what feels like zero time in an interactive session.
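One caveat worth adding (not mentioned in the post): memoise caches by argument values, so if the file on disk changes the cached result goes stale. The package's forget() function clears a memoised function's cache:

forget(r)  ## discard the cached results of the memoised reader
d <- r("/data/arctic/beaufort/2012/d201211_0002.cnv")  ## reads from disk again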

Resources

To leave a comment for the author, please follow the link and comment on his blog: Dan Kelley Blog/R.
