Channel: Search Results for “maps”– R-bloggers

Modeling Plenitude and Speciation by Jointly Segmenting Consumers and their Preferences


(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)
In 1993, when music was sold in retail stores, it may have been informative to ask about preference across a handful of music genres. Today, now that the consumer has seized control and the music industry has responded, the market has exploded into more than a thousand different fragmented pairings of artists and their audiences. Grant McCracken, the cultural anthropologist, refers to such proliferation as speciation and the resulting commotion as plenitude. As with movies, genres become microgenres, forcing recommender systems to deal with more choices and narrower segments.

This mapping from the website Every Noise at Once is constantly changing. As the website explains, a generating algorithm with some additional adjustments keeps it all readable, and it works as an enjoyable learning interface. You click on a label to play a music sample, then continue to a list of artists associated with the category and hear additional samples from each artist. Although the map seems to have interpretable dimensions and reflects similarity among the microgenres, it does not appear to be a statistical model in its present form.

At any given point in time, we are stepping into a dynamic process of artists searching for differentiation and social media seeking to create new communities who share at least some common preferences. Word of mouth is most effective when consumers expect new entries and when spreading the word is its own reward. It is no longer enough for a brand to have a good story if customers do not enjoy telling that story to others. Clearly, this process is common to all product categories even if they span a much smaller scale. Thus, we are looking for a scalable statistical model that captures the dynamics through which buyers and sellers come to a common understanding. 

Borrowing a form of matrix factorization from recommender systems, I have argued in previous posts for implementing this kind of joint clustering of the rows and columns of a data matrix as a replacement for traditional forms of market segmentation. We can try it with a music preference dataset from the R package prefmod. Since I intend to compare my findings with another analysis of the same 1993 music preference data, using the new R package RCA and reported in the American Journal of Sociology, we will begin by duplicating the few data modifications that were made in that paper (see the R code at the end of this post).

In previous attempts to account for music preferences, psychologists have focused on the individual and turned to personality theory for an explanation. For the sociologist, there is always the social network. As marketing researchers, we will add the invisible hand of the market. What is available? How do consumers learn about the product category and obtain recommendations? Where is it purchased? When and where is it consumed? Are others involved (public vs private consumption)?

The Internet opens new purchase pathways, encourages new entities, increases choice and transfers control to the consumer. The resulting postmodern market with its plenitude of products, services, and features cannot be contained within a handful of segments. Speciation and micro-segmentation demand a model that reflects the joint evolution where new products and features are introduced to meet the needs of specific audiences and consumers organize their attention around those microgenre. Nonnegative matrix factorization (NMF) represents this process with a single set of latent variables describing both the rows and the columns at the same time.
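
To make the factorization concrete, here is a minimal toy sketch (my illustration, not the music analysis that follows): NMF approximates a nonnegative data matrix V, with consumers in the rows and genres in the columns, by the product W %*% H, where W holds the row (consumer) weights and H holds the column (genre) loadings on the same set of latent features.

library(NMF)
set.seed(1)
V <- matrix(rpois(20 * 6, lambda = 2), nrow = 20)  # 20 consumers x 6 genres (toy counts)
toy <- nmf(V, 2, "lee", nrun = 5)
W <- basis(toy)    # 20 x 2 mixing weights describing the rows (consumers)
H <- coef(toy)     # 2 x 6 loadings describing the columns (genres)
round(W %*% H, 1)  # low-rank reconstruction of V from the shared latent features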

After attaching the music dataset, NMF will produce a cluster heatmap summarizing the "loadings" of the 17 music genres (columns below) on the five latent features (rows below): Blues/Jazz, Heavy Metal/Rap, Country/Bluegrass, Opera/Classical, and Rock. The dendrogram at the top displays the results of a hierarchical clustering. Although there are five latent features, we could use the dendrogram to extract more than five genre clusters. For example, Big Band and Folk music seem to be grouped together, possibly as a link from classical to country. In addition, Gospel may play a unique role linking country and Blues/Jazz. Whatever we observe in the columns will need to be verified by examining the rows. That is, one might expect to find a segment drawn to country and jazz/blues who also like gospel.


We would have seen more of the lighter colors with coefficients closer to zero had we found greater separation. Yet, this is not unexpected given the coarseness of the music genres. As we get more specific, the columns become increasingly separated by consumers who only listen to or are aware of a subset of the available alternatives. These finer distinctions define today's market for just about everything. In addition, the use of a liking scale forces us to recode missing values to a neutral liking. We would have preferred an intensity scale with missing values coded as zeros because they indicate no interaction with the genre. Recoding missing to zero is not an issue when zero is the value given to "never heard of" or unaware.

Now, a joint segmentation means that listeners in the rows can be profiled using the same latent features accounting for covariation among the columns. Based on the above coefficient map, we expect those who like opera to also like classical music so that we do not require two separate scores for opera and classical but only one latent feature score. At least this is what we found with this data matrix. A second heatmap enables us to take a closer look at over 1500 respondents at the same time.


We already know how to interpret this heatmap because we have had practice with the coefficients. These colors indicate the values of the mixing weights for each respondent. Thus, in the middle of the heatmap you can find a dark red rectangle for latent feature #3, which we have already determined to represent country/bluegrass. These individuals give the lowest possible rating to everything except the genres loading on this latent feature. We do not observe that much yellow or lighter color in this heatmap because less than 13% of the responses fell into the lowest box labeled "dislike very much." However, most of the lighter regions are where you might expect them to be, for example, heavy metal/rap (#2), although we do uncover a heavy metal segment at the bottom of the figure.

Measuring Attraction and Ignoring Repulsion

We often think of liking as a bipolar scale, although what determines attraction can be different from what drives repulsion. Music is one of those product categories where satisfiers and dissatisfiers tend to be different. Negative responses can become extreme so that preference is defined by what one dislikes rather than what one likes. In fact, it is being forced to listen to music that we do not like that may be responsible for the lowest scores (e.g., being dragged to the opera or loud music from a nearby car). So, what would we find if we collapsed the bottom three categories and measured only attraction on a 3-point scale with 0=neutral, dislike or dislike very much, 1=like, and 2=like very much?

NMF thrives on sparsity, so increasing the number of zeros in the data matrix does not stress the computational algorithm. Indeed, the latent features become more separated, as we can see in the coefficient heatmap. Gospel stands alone as its own latent feature. Country and bluegrass remain, as do opera/classical, blues/jazz, and rock. When we "remove" dislike for heavy metal and rap, heavy metal moves into rock and rap floats with reggae between jazz and rock. The same is true for folk and easy mood music, only now both are attractive to country and classical listeners.

More importantly, we can now interpret the mixture weights for individual respondents as additive attractors, so that the first few rows are those with interest in all the musical genres. In addition, we can easily identify listeners with specific interests. As we continue to work our way down the heatmap, we find jazz/blues (#4), followed by rock (#5) and a combination of jazz and rock. Continuing, we see country (#2) plus rock and country alone, after which is a variety of gospel (#1) plus some other genres. We end with opera and classical music, by itself and in combination with jazz.

Comparison with the Cultural Omnivore Hypothesis

As mentioned earlier, we can compare our findings to a published study testing whether inclusiveness rules tastes in music (the eclectic omnivore) or whether cultural distinctions between highbrow and lowbrow still dominate. Interestingly, the cluster analysis is approached as a graph-partitioning problem where the affinity matrix is defined as similarity in the score pattern regardless of mean level. Not everyone agrees with this calculation, and we have a pair of dueling R packages using different definitions of similarity (the RCA vs. the CCA).

None of this is news for those of us who perform cluster analysis using the affinity propagation R package apcluster, which enables several different similarity measures including correlations (signed and unsigned). If you wish to learn more, I would suggest starting with the Orange County R User webinar for apcluster. The quality and breadth of the documentation will flatten your learning curve.

Both of the dueling R packages argue that preference similarity ought to be defined by the highs and lows in the score profiles ignoring the mean ratings for different individuals. This is a problem for marketing since consumers who do not like anything ought to be treated differently from consumers who like everything. One is a prime target and the other is probably not much of a user at all.

Actually, if I were interested in testing the cultural omnivore hypothesis, I would be better served by collecting familiarity data on a broader range of more specific music genres, perhaps not as detailed as the above map but more revealing than the current broad categories. The earliest signs of preference can be seen in what draws our attention. Recognition tends to be a less obtrusive measure than preference, and we can learn a great deal knowing who visits each region in the music genre map and how long they stayed.

NMF identifies a sizable audience who are familiar with the same subset of music genres. These are the latent features, the building blocks we have seen in the coefficient heatmaps. The lowbrow and the highbrow each confine themselves to separate latent features, residing in gated communities within the music genre map and knowing little of the other's world. The omnivore travels freely across these borders. Such class distinctions may be even more established in the cosmetics product category (e.g., women's makeup). Replacing genre with brand, you can read how this was handled in a prior post using NMF to analyze brand involvement.

R code to perform all the analyses reported in this post
library(prefmod)
data(music)
 
# keep only the 17 genres used
# in the AJS paper (see post)
prefer<-music[,c(1:11,13:18)]
 
# calculate number of missing values for each
# respondent and keep only those with no more
# than 6 missing values
miss<-apply(prefer,1,function(x) sum(is.na(x)))
prefer<-prefer[miss<7,]
 
# run frequency tables for all the variables
apply(prefer,2,function(x) table(x,useNA="always"))
# recode missing to the middle of the 5-point scale
prefer[is.na(prefer)]<-3
# reverse the scale so that larger values are
# associated with more liking and zero is
# the lowest value
prefer<-5-prefer
 
# longer names are easier to interpret
names(prefer)<-c("BigBand",
"Bluegrass",
"Country",
"Blues",
"Musicals",
"Classical",
"Folk",
"Gospel",
"Jazz",
"Latin",
"MoodEasy",
"Opera",
"Rap",
"Reggae",
"ConRock",
"OldRock",
"HvyMetal")
 
library(NMF)
fit<-nmf(prefer, 5, "lee", nrun=30)
coefmap(fit, tracks=NA)
basismap(fit, tracks=NA)
 
# recode bottom three boxes to zero
# and rerun NMF
prefer2<-prefer-2
prefer2[prefer2<0]<-0
# need to remove respondents with all zeros
total<-apply(prefer2,1,sum)
table(total)
prefer2<-prefer2[total>0,]
 
fit<-nmf(prefer2, 5, "lee", nrun=30)
coefmap(fit, tracks=NA)
basismap(fit, tracks=NA)


Making an R Package to use the HERE geocode API


(This article was first published on Stats and things, and kindly contributed to R-bloggers)

HERE is a product by Nokia, formerly called Nokia Maps and before that, Ovi Maps. It's the result of the acquisition of NAVTEQ in 2007, combined with Plazes and Metacarta, among others. It has a geocoding API, mapping tiles, routing services, and other things. I'm focused on the geocoding service. Under the "Base" license, you can run 10,000 geocoding requests per day. According to Wikipedia, that is the most among the free geocoding services. On top of that, HERE does bulk geocoding, where you submit a file of things to geocode and it returns a link to download a file of results. This sure beats making thousands of requests one at a time.

I figured coding up a quick function to use this service would be the perfect reason to build my first R package. So, after getting my HERE API keys, I fired up devtools and got started with the R package.

First, I had to install the latest version of devtools and roxygen2

install.packages("devtools", repos="http://cran.rstudio.com/")
install.packages("roxygen2", repos="http://cran.rstudio.com/")
library(devtools)
library(roxygen2)

Next, I chose a directory, and ran the create() function from devtools. This creates a package skeleton that contains the basic structure needed for an R package.

devtools::create("geocodeHERE")

Several files and directories are created after running create(). I moved over to the 'R' directory and created a file called “geocodeHERE_simple.R”. I put my function to use the HERE geocoding API in there. This function was designed to be minimalist and similar to the ggmap geocode() function.

#' Attempt to geocode a string
#'
#' Enter a string and have latitude and longitude returned using the HERE API
#' @param search A string to search
#' @param App_id App_id to use the production HERE API. Get one here... http://developer.here.com/get-started. If left blank, will default to demo key with an unknown usage limit.
#' @param App_code App_code to use the production HERE API. Get one here... http://developer.here.com/get-started. If left blank, will default to demo key with an unknown usage limit.
#' @keywords geocode
#' @export
#' @examples
#' \dontrun{
#' geocodeHERE_simple("chicago")
#' geocodeHERE_simple("wrigley field chicago IL")
#' geocodeHERE_simple("233 S Wacker Dr, Chicago, IL 60606")
#' }
geocodeHERE_simple <- function(search, App_id="", App_code=""){
if(!is.character(search)){stop("'search' must be a character string")}
if(!is.character(App_id)){stop("'App_id' must be a character string")}
if(!is.character(App_code)){stop("'App_code' must be a character string")}

if(App_id=="" & App_code==""){
App_id <- "DemoAppId01082013GAL"
App_code <- "AJKnXv84fjrb0KIHawS0Tg"
base_url <- "http://geocoder.cit.api.here.com/6.2/geocode."
}else{
base_url <- "http://geocoder.api.here.com/6.2/geocode."
}

search <- RCurl::curlEscape(search)

# request "json" output so the response can be parsed by RJSONIO below
final_url <- paste0(base_url, "json", "?app_id=", App_id, "&app_code=",
App_code, "&searchtext=", search)

response <- RCurl::getURL(final_url)
response_parsed <- RJSONIO::fromJSON(response)
if(length(response_parsed$Response$View) > 0){
ret <- response_parsed$Response$View[[1]]$Result[[1]]$Location$DisplayPosition
}else{
ret <- NA
}
return(ret)
}
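
A hypothetical call for illustration only (the demo key and the exact shape of the returned DisplayPosition element are HERE's to change):

coords <- geocodeHERE_simple("233 S Wacker Dr, Chicago, IL 60606")
coords  # should contain Latitude and Longitude elements if the lookup succeeds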

Note the text on the top of the code. That is added in order to automatically create help documentation for the function.

Now, if you look closely, I am using two other packages in that function… RCurl and RJSONIO. Also, notice that I'm not making any library() calls. Instead, the dependencies must be declared in the DESCRIPTION file that is automatically generated by create(). In addition to calling out which packages to import, you fill in the package name, author, description, etc.

Package: geocodeHERE
Title: Wrapper for the HERE geocoding API
Version: 0.1
Authors@R: "Cory Nissen <corynissen@gmail.com> [aut, cre]"
Description: Wrapper for the HERE geocoding API
Depends:
R (>= 3.1.1)
License: MIT
LazyData: true
Imports:
RJSONIO,
RCurl

Then, I set my working directory to the package root and ran the document() function from devtools that automatically creates package documentation.

setwd("geocodeHERE")
document()

From this point, you can upload your package to github and use install_github(), or use the install() function to install your package locally.

setwd("..")
install()

As far as the HERE geocoding API goes, I find it pretty decent at "fuzzy matching". That is, given a place name instead of an address, it does a good job of correctly returning the coordinates. The downside is that you must register for an API key, whereas Google does not require registration to use its geocoding service. But you only get 2,500 requests per day from Google, while HERE offers 10,000 per day.

Anyhow, a big shout out to Hilary Parker for her blog post on creating an R package using devtools, Hadley Wickham for the devtools package (among others), and RStudio for the fantastic and free (and open source) IDE for R.

The geocodeHERE package is available on github. I'll be adding bulk geocoding functionality as time permits. You can install the package with the following code:

devtools::install_github("corynissen/geocodeHERE")


Call for participation: AusDM 2014, Brisbane, 27-28 November


(This article was first published on blog.RDataMining.com, and kindly contributed to R-bloggers)

*********************************************************
12th Australasian Data Mining Conference (AusDM 2014)
Brisbane, Australia
27-28 November 2014

http://ausdm14.ausdm.org/

*********************************************************

The Australasian Data Mining Conference has established itself as the premier Australasian meeting for both practitioners and researchers in data mining. Since AusDM’02 the conference has showcased research in data mining, providing a forum for presenting and discussing the latest research and developments.

This year’s conference, AusDM’14, builds on this tradition of facilitating the cross-disciplinary exchange of ideas, experience and potential research directions. Specifically, the conference seeks to showcase: Industry Case Studies; Research Prototypes; Practical Analytics Technology; and Research Student Projects. AusDM’14 will be a meeting place for pushing forward the frontiers of data mining in industry and academia. We have lined up an excellent Keynote Speaker program.

Registration
=========

Registration site: http://wired.ivvy.com/event/FM12AD/
Registration fees:
Standard Registration: $495
Student Standard Registration: $320

If you are registering as a student, contact us at ausdm14@ausdm.org with evidence that you are an active student, and we will issue you a discount code to use on the registration website.

Keynotes
========

Keynote I: Learning in sequential decision problems
Prof. Peter Bartlett, University of California, Berkeley, USA

Abstract: Many problems of decision making under uncertainty can be formulated as sequential decision problems in which a strategy’s current state and choice of action determine its loss and next state, and the aim is to choose actions so as to minimize the sum of losses incurred.  For instance, in internet news recommendation and in digital marketing, the optimization of interactions with users to maximize long-term utility needs to exploit the dynamics of users. We consider three problems of this kind: Markov decision processes with adversarially chosen transition and loss structures; policy optimization for large scale Markov decision processes; and linear tracking problems with adversarially chosen quadratic loss functions. We present algorithms and optimal excess loss bounds for these three problems. We show situations where these algorithms are computationally efficient, and others where hardness results suggest that no algorithm is computationally efficient.

Keynote II: Making Sense of a Random World through Statistics
Prof. Geoff McLachlan, University of Queensland, Brisbane, Australia

Abstract: With the growth in data in recent times, it is argued in this talk that there is a need for even more statistical methods in data mining. In so doing, we present some examples in which there is a need to adopt some fairly sophisticated statistical procedures (at least not off-the-shelf methods) to avoid misleading inferences being made about patterns in the data due to randomness. One example concerns the search for clusters in data. Having found an apparent clustering in a dataset, as evidenced in a visualisation of the dataset in some reduced form, the question arises of whether this clustering is representative of an underlying group structure or is merely due to random fluctuations. Another example concerns the supervised classification in the case of many variables measured on only a small number of objects. In this situation, it is possible to construct a classifier based on a relatively small subset of the variables that provides a perfect classification of the data (that is, its apparent error rate is zero). We discuss how statistics is needed to correct for the optimism in these results due to randomness and to provide a realistic interpretation.

Workshop
========

Half-day workshop on R and Data Mining, Thursday afternoon, 27 November
Dr. Yanchang Zhao, RDataMining.com

The workshop will present an introduction on data mining with R, providing R code examples for classification, clustering, association rules and text mining. See workshop slides at http://www.rdatamining.com/docs.

Accepted Papers
============

Comparison of athletic performances across disciplines and disability classes
Chris Barnes

Factors Influencing Robustness and Effectiveness of Conditional Random Fields in Active Learning Frameworks
Mahnoosh Kholghi, Laurianne Sitbon, Guido Zuccon and Anthony Nguyen

Tree Based Scalable Indexing for Multi-Party Privacy Preserving Record Linkage
Thilina Ranbaduge, Peter Christen and Dinusha Vatsalan

Towards Social Media as a Data Source for Opportunistic Sensor Networking
James Meneghello, Kevin Lee and Nik Thompson

A Case Study of Utilising Concept Knowledge in a Topic Specific Document Collection
Gavin Shaw and Richi Nayak

An Efficient Tagging Data Interpretation and Representation Scheme for Item Recommendation
Noor Ifada and Richi Nayak

Evolving Wavelet Neural Networks for Breast Cancer Classification
Maryam Khan, Stephan Chalup and Alexandre Mendes

Dynamic Class Prediction with Classifier Based Distance Measure
Senay Yasar Saglam and Nick Street

Detecting Digital Newspaper Duplicates with Focus on eliminating OCR errors
Yeshey Peden and Richi Nayak

Improving Scalability and Performance of Random Forest Based Learning-to-Rank Algorithms by Aggressive Subsampling
Muhammad Ibrahim and Mark Carman

A Multidimensional Collaborative Filtering Fusion Approach with Dimensionality Reduction
Xiaoyu Tang, Yue Xu, Ahmad Abdel-Hafez and Shlomo Geva

The Schema Last Approach to Data Fusion
Neil Brittliff and Dharmendra Sharma

A Triple Store Implementation to support Tabular Data
Neil Brittliff and Dharmendra Sharma

Pruned Annular Extreme Learning Machine Optimization based on RANSAC Multi Model Response Regularization
Lavneet Singh and Girija Chetty

Automatic Detection of Cluster Structure Changes using Relative Density Self-Organizing Maps
Denny, Pandu Wicaksono and Ruli Manurung

Decreasing Uncertainty for Improvement of Relevancy Prediction
Libiao Zhang, Yuefeng Li and Moch Arif Bijaksana

Identifying Product Families Using Data Mining Techniques in Manufacturing Paradigm
Israt Jahan Chowdhury and Richi Nayak

Market Segmentation of EFTPOS Retailers
Ashishkumar Singh, Grace Rumantir and Annie South

Locality-Sensitive Hashing for Protein Classification
Lawrence Buckingham, James Hogan, Shlomo Geva and Wayne Kelly

Real-time Collaborative Filtering Recommender Systems
Huizhi Liang, Haoran Du and Qing Wang

Pattern-based Topic Modelling for Query Expansion
Yang Gao, Yue Xu and Yuefeng Li

Hartigan’s Method for K-modes Clustering and Its Advantages
Zheng Rong Xiang and Zahidul Islam

Data Cleansing during Data Collection from Wireless Sensor Networks
Md Zahidul Islam, Quazi Mamun and Md Geaur Rahman

Content Based Image Retrieval Using Signature Representation
Dinesha Chathurani Nanayakkara Wasam Uluwitige, Shlomo Geva, Vinod Chandran and Timothy Chappell

Organising Committee
====================

Conference Chairs
Richi Nayak, Queensland University of Technology, Brisbane, Australia
Paul Kennedy, University of Technology, Sydney

Program Chairs (Research)
Lin Liu, University of South Australia, Adelaide
Xue Li, University of Queensland, Brisbane, Australia

Program Chairs (Application)
Kok-Leong Ong, Deakin University, Melbourne
Yanchang Zhao, Department of Immigration & Border Protection, Australia; and RDataMining.com

Sponsorship Chair
Andrew Stranieri, University of Ballarat, Ballarat

Local Chair
Yue Xu, Brisbane, Australia

Steering Committee Chairs
Simeon Simoff, University of Western Sydney
Graham Williams, Australian Taxation Office

Other Steering Committee Members
Peter Christen, The Australian National University, Canberra
Paul Kennedy, University of Technology, Sydney
Jiuyong Li, University of South Australia, Adelaide
Kok-Leong Ong, Deakin University, Melbourne
John Roddick, Flinders University, Adelaide
Andrew Stranieri, University of Ballarat, Ballarat
Geoff Webb, Monash University, Melbourne

Join us on LinkedIn
===================

http://www.linkedin.com/groups/AusDM-4907891



Visualizing the History of Epidemics


(This article was first published on PirateGrunt » R, and kindly contributed to R-bloggers)

I really like National Geographic. Their magazine is great, their television documentaries are done well and they helped give me a lifelong love of maps. They generate very good information and help shed light on the world we all share. So why is this graphic so awful?

Let's have a look:
National Geographic image

We'll start off by saying that no one will mistake me for Edward Tufte or Stephen Few or Nathan Yau, though I love their stuff, have read it and have tried to adopt as many of their more sensible recommendations as I can. That understood, I think I'm on solid footing when I say that, at a minimum, all graphical elements should fit within the display surface. The first three quantities are so massive that they can't be contained. How big are they? Well, we have the numbers within the circles, but beyond that, who knows? The plague of Justinian looks like it could be Jupiter to the Black Plague's Saturn, with modern epidemics having more of an Earthly size.

Speaking of circles, I try to avoid them. If those three aforementioned experts have taught me anything, it's that the human brain cannot easily process the area of a round object. Quick: without looking at the numbers, tell me the relative sizes of the HIV and Ebola circles.

Did you have to scroll to look at both objects? I did. Not only do the largest epidemics spill over the display area, they make it difficult to view a large number of data points at the same time. As we scroll down, we eventually land on a display which has Asian flu at the top and the great plague of London at the bottom. Justinian, the black death and medieval history are erased from our thoughts.

And what's with the x-axis? The circles move from one side to the other, but this dimension conveys no meaning whatsoever.

As an aside, although I love having the years shown, it would have been good to use that to augment the graphic with something that conveys how epidemics have changed over time. Population has changed, medicine has changed and the character of human disease has changed. As I look at the graphic, what I tend to extrapolate from this is that surely the plague of Justinian wiped out most of southern Europe, Anatolia and Mesopotamia. In contrast, SARS likely appeared during a slow news cycle.

It would be disingenuous of me to criticize a display without proposing one of my own. So, here goes.

dfEpidemic = data.frame(Outbreak = c("Plague of Justinian", "Black Plague"
                                     , "HIV/AIDS", "1918 Flu", "Modern Plague"
                                     , "Asian Flu", "6th Cholera Pandemic"
                                     , "Russian Flu", "Hong Kong Flu"
                                     , "5th Cholera Pandemic", "4th Cholera Pandemic"
                                     , "7th Cholera Pandemic", "Swine Flu"
                                     , "2nd Cholera Pandemic", "First Cholera Pandemic"
                                     , "Great Plague of London", "Typhus Epidemic of 1847"
                                     , "Haiti Cholera Epidemic", "Ebola"
                                     , "Congo Measles Epidemic", "West African Meningitis"
                                     , "SARS")
                        , Count = c(100000000, 50000000, 39000000, 20000000
                                    , 10000000, 2000000, 1500000, 1000000
                                    , 1000000, 981899, 704596, 570000, 284000
                                    , 200000, 110000, 100000, 20000, 6631
                                    , 4877, 4555, 1210, 774)
                        , FirstYear = c(541, 1346, 1960, 1918, 1894, 1957, 1899, 1889
                                        , 1968, 1881, 1863, 1961, 2009, 1829, 1817
                                        , 1665, 1847, 2011, 2014, 2011, 2009, 2002))
dfEpidemic$Outbreak = factor(dfEpidemic$Outbreak
                             , levels=dfEpidemic$Outbreak[order(dfEpidemic$FirstYear
                                                                , decreasing=TRUE)])
library(ggplot2)
library(scales)
plt = ggplot(data = dfEpidemic, aes(x=Outbreak, y=Count)) + geom_bar(stat="identity") + coord_flip()
plt = plt + scale_y_continuous(labels=comma)
plt

plot of chunk GetDataFrame

I'm showing the data as a bar chart, so everything fits within the display and the relative sizes are easy to recognize. I also order the bars by starting year so that we can convey an additional item of information. Are diseases getting more extreme? Nope. Quite the reverse. The 1918 flu and HIV have been significant health issues, but they pale in comparison to the plague of Justinian or the Black Death. HIV is significant, but we've been living with that disease for longer than I've been alive. If we want to convey a fourth dimension, we could shade the bars based on the length of the disease.

dfEpidemic$LastYear = c(542, 1350, 2014, 1920, 1903, 1958, 1923, 1890, 1969, 1896, 1879
                        , 2014, 2009, 1849, 1823, 1666, 1847, 2014, 2014, 2014, 2010, 2003)
dfEpidemic$Duration = with(dfEpidemic, LastYear - FirstYear + 1)
dfEpidemic$Rate = with(dfEpidemic, Count / Duration)

plt = ggplot(data = dfEpidemic, aes(x=Outbreak, y=Count, fill=Rate)) + geom_bar(stat="identity")
plt = plt + coord_flip() + scale_y_continuous(labels=comma)
plt

plot of chunk AddDuration

The plague of Justinian dwarfs everything. We'll have one last look with this observation removed. I'll also take out the Black Death so that we're a bit more focused on modern epidemics.

dfEpidemic2 = dfEpidemic[-(1:2), ]
plt = ggplot(data = dfEpidemic2, aes(x=Outbreak, y=Count, fill=Rate)) + geom_bar(stat="identity")
plt = plt + coord_flip() + scale_y_continuous(labels=comma)
plt

plot of chunk SansJustinian

HIV/AIDS now stands out as having the most victims, though the 1918 flu pandemic caused people to succumb more quickly.

These bar charts are hardly the last word in data visualization. Still, I think they convey more information, more objectively, than National Geographic's exhibit. I'd love to see further comments and refinements.
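
One possible refinement, offered here as a hedged suggestion rather than as part of the original charts: a dot plot with a log10 count axis keeps the plague of Justinian and the Black Death on the same panel as the modern outbreaks without letting them dominate, at the cost of a less intuitive axis.

plt = ggplot(data = dfEpidemic, aes(x=Outbreak, y=Count, colour=Rate)) + geom_point(size=3)
plt = plt + coord_flip() + scale_y_log10(labels=comma)
plt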

Session info:

## R version 3.1.1 (2014-07-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.6        RWordPress_0.2-3 scales_0.2.4     ggplot2_1.0.0   
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-4 digest_0.6.4     evaluate_0.5.5   formatR_0.10    
##  [5] grid_3.1.1       gtable_0.1.2     htmltools_0.2.4  labeling_0.2    
##  [9] MASS_7.3-34      munsell_0.4.2    plyr_1.8.1       proto_0.3-10    
## [13] Rcpp_0.11.2      RCurl_1.95-4.1   reshape2_1.4     rmarkdown_0.2.50
## [17] stringr_0.6.2    tools_3.1.1      XML_3.98-1.1     XMLRPC_0.3-0    
## [21] yaml_2.1.13


R Class for Wildfire Scientists


(This article was first published on Working With Data » R, and kindly contributed to R-bloggers)
This entry is part 14 of 14 in the series Using R

Mazama Science has just finished creating class materials on using R for the AirFire team at the USFS Pacific Wildland Fire Sciences Lab in Seattle, Washington. This team of scientists works on monitoring and modeling wildfire emissions, smoke and air quality. The AirFire team has granted permission to release these class materials to the public in the interest of encouraging scientists in other agencies to experiment with R for their daily work. A detailed syllabus follows.

The complete class is available at this location:

http://mazamascience.com/Classes/PWFSL_2014/

Class materials are broken up into nine separate lessons that assume some experience coding but not necessarily any familiarity with R. Autodidacts new to R should take about 20-30 hrs to complete the course. The target audience for these materials consists of USFS employees or graduate students with a degree in the natural sciences and some experience using scientific software such as MATLAB or python. Lessons are presented in sequential order and assume the student already has R and RStudio set up on their computer. Additional system libraries such as NetCDF are required for later lessons.

Here is the basic outline of covered topics.

Lesson 01 — First Steps with R

The first lesson serves as an introduction to fundamental programming concepts in R: functions, operators, vectorized data and data structures (vector, list, matrix, dataframe). By the end of the first lesson, students should be able to open and plot simple data frames and access help documents and source code associated with R functions.
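
To give a flavor of those structures, here is a toy sketch (my own illustration, not the actual lesson code):

v  <- c(1.5, 2.0, 3.5)                 # vector
l  <- list(site = "A", values = v)     # list
m  <- matrix(1:6, nrow = 2)            # matrix
df <- data.frame(day = 1:3, pm25 = v)  # dataframe
plot(df$day, df$pm25, type = "b")      # plot a simple data frame
?plot                                  # access the help documentation for a function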

Lesson 02 — Working with Dataframes

Lesson 02 focuses on data frames and uses publicly available data on wildland fires and prescribed burns as an example. This lesson includes a discussion of factors and how to create logical masks for data subsetting, as well as the graphical parameters used in customizing basic plots.

Lesson 03 — ‘dplyr’ for Summary Statistics

Lesson 03 introduces the dplyr package and its core functions: filter(), select(), group_by(), summarize() and arrange(). This lesson ends with a set of tasks, encouraging students to write code similar to the following example given in the lesson:

# Take the "fires" dataset
#   then filter for type == "WF"
#   then group by state
#   then calculate total area by state
#   then arrange in descending order by total
#   finally, put the result in wildfireAreaByState
fires %>%
  filter(type == "WF") %>%
  group_by(state) %>%
  summarize(total=sum(area, na.rm=TRUE)) %>%
  arrange(desc(total)) ->
  wildfireAreaByState

Lesson 04 — Bar and Pie Plots

Lesson 04 focuses on the barplot() and pie() functions and associated plotting customizations, so that students end up converting summary tables from the previous lesson into multi-panel plots.

Lesson_04_4Donuts

Lesson 05 — Simple Maps

Lesson 05 introduces the maps package and uses it to plot wildfire data.

Lesson_05_maps
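
Here is a minimal sketch in the spirit of that lesson, with made-up fire locations standing in for the lesson's dataset:

library(maps)
fires <- data.frame(lon = c(-120.7, -118.2, -122.9), lat = c(39.2, 36.6, 42.1))
map("state", col = "gray80", fill = TRUE)              # base map of the lower 48
points(fires$lon, fires$lat, pch = 16, col = "red")    # overlay fire locations
title("Wildfire locations (toy data)")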

Lesson 06 — Dashboard

Lesson 06 consists of a longer script that defines several functions to encapsulate all of the work covered in previous lessons. The end result is a function that accepts a single datestamp argument, constructs an appropriate URL, imports CSV data as a data frame and then manipulates and plots that data as a summary ‘dashboard’ appropriate for use in a decision support system.

Lesson 07 — BlueSky First Steps

Lesson 07 introduces the ncdf4 package for working with BlueSky model output predicting the spatial extent and concentration of wildfire smoke. The lesson walks through the process of reading in and understanding a NetCDF file and then presents a script to convert existing files into modernized equivalents that are easier to work with.

Lesson 08 — Working with Arrays

The gridded model datasets introduced in Lesson 07 are made available as multi-dimensional R arrays. Lesson 08 describes in greater detail how to work with arrays and how to generate multi-dimensional statistics by using the apply() function. By the end of the lesson, students should be able to perform increasingly detailed analyses of subsets of the data.

Lesson_08_Exposure
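
The core move of that lesson can be illustrated with a toy array (invented numbers, not actual BlueSky output): apply() collapses the time dimension, averaging over hours for each grid cell.

conc <- array(runif(4 * 3 * 24), dim = c(4, 3, 24))   # lon x lat x hour
time_mean <- apply(conc, MARGIN = c(1, 2), FUN = mean)
dim(time_mean)                                        # a 4 x 3 grid of time-averaged values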

Lesson 09 — Working with Dates and Times

Lesson 09 goes into more detail about the time dimension and covers use of the POSIXct data type and the lubridate package, especially for work involving both local and UTC timezones. The openair package is also introduced especially for the rollingMean() and timeAverage() functions which make it easier to compare time series defined on different time axes — very important when comparing model and sensor data.
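
For readers new to the timezone issue, a small lubridate sketch (the timestamp is invented):

library(lubridate)
t_utc <- ymd_hm("2014-08-01 18:00", tz = "UTC")
with_tz(t_utc, "America/Los_Angeles")   # the same instant shown on the local clock
force_tz(t_utc, "America/Los_Angeles")  # the same clock time reinterpreted as local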


We hope these lessons encourage people working in the Forest Service or other government science agencies to take a look at R and experiment with it for a variety of data management, analysis and visualization needs. R does have a steep learning curve but, once mastered, provides users with an extremely powerful and customizable tool for all sorts of analysis.

Best of Luck Learning R!


My Commonly Done ggplot2 graphs


(This article was first published on A HopStat and Jump Away » Rbloggers, and kindly contributed to R-bloggers)

In my last post, I discussed how ggplot2 is not always the answer to the question “How should I plot this” and that base graphics were still very useful.

Why Do I use ggplot2 then?

The overall question still remains: why (do I) use ggplot2?

ggplot2 vs lattice

For one, ggplot2 replaced the lattice package for many plot types for me. The lattice package is a great system, but if you are plotting multivariate data, I believe you should choose either lattice or ggplot2. I chose ggplot2 for the syntax, added capabilities, and the philosophy behind it. The fact that Hadley Wickham is the developer never hurts either.

Having multiple versions of the same plot, with slight changes

Many times I want to do the same plot over and over, but vary one aspect of it, such as color of the points by a grouping variable, and then switch the color to another grouping variable. Let me give a toy example, where we have an x and a y with two grouping variables: group1 and group2.

library(ggplot2)
set.seed(20141016)
data = data.frame(x = rnorm(1000, mean=6))
data$group1 = rbinom(n = 1000, size =1 , prob =0.5)
data$y = data$x * 5 + rnorm(1000)
data$group2 = runif(1000) > 0.2

We can construct the ggplot2 object as follows:

g = ggplot(data, aes(x = x, y=y)) + geom_point()

The ggplot command takes the data.frame you want to use, and aes specifies which aesthetics we want to map; here we specify the x and y. Some aesthetics are optional depending on the plot, some are not. I think it's safe to say you always need an x. I then "add" (using +) to this object a "layer": I want a geometric "thing", and that thing is a set of points, hence I use geom_point. I'm doing a scatterplot.

If you just call the object g, print is called by default, which plots the object and we see our scatterplot.

g

plot of chunk print_g

I can color by a grouping variable and we can add that aesthetic:

g + aes(colour = group1)

plot of chunk color_group

g + aes(colour = factor(group1))

plot of chunk color_group

g + aes(colour = group2)

plot of chunk color_group

Note, g is the original plot, and I can add aes to this plot, which is the same as if I did ggplot(data, aes(...)) in the original call that generated g. NOTE: if the aes you are adding was not a column of the data.frame when you created the plot, you will get an error. For example, let's add a new column to the data and then add it to g:

data$newcol = rbinom(n = nrow(data), size=2, prob = 0.5)
g + aes(colour=factor(newcol))
Error: object 'newcol' not found

This fails because of the way the ggplot2 object was created. If we had added this column to the data, created the plot, and then added newcol as an aes, the command would work fine.

g2 = ggplot(data, aes(x = x, y=y)) + geom_point()
g2 + aes(colour=factor(newcol))

plot of chunk g2_create2

We see in the first plot, with colour = group1, that ggplot2 sees a numeric variable group1, so it tries a continuous mapping scheme for the color. The default is a range of blue colors denoting intensity. If we want to force a discrete mapping, we can turn it into a factor: colour = factor(group1). We see the colors are very different and are not a continuum of blue, but colors that separate the groups better.
The third plot illustrates that when ggplot2 takes logical vectors for mappings, it factors them and maps the groups to discrete colors.

Slight Changes with additions

In practice, I do this iterative process many times and the addition of elements to a common template plot is very helpful for speed and reproducing the same plot with minor tweaks.

In addition to doing similar plots with slight grouping changes I also add different lines/fits on top of that. In the previous example, we colored by points by different grouping variables. In other examples, I tend to change little things, e.g. a different smoother, a different subset of points, constraining the values to a certain range, etc. I believe in this way, ggplot2 allows us to create plots in a more structured way, without copying and pasting the entire command or creating a user-written wrapper function as you would in base. This should increase reproducibility by decreasing copy-and-paste errors. I tend to have many copy-paste errors, so I want to limit them as much as possible.

ggplot reduces number of commands I have to do

One example plot I make frequently is a scatterplot with a smoother to estimate the shape of bivariate data.

In base, I usually have to run at least 3 commands to do this, e.g. loess, plot, and lines. In ggplot2, geom_smooth() takes care of this for you. Moreover, it does the smoothing within each different aesthetic (aka smoothing per group), which is usually what I want to do as well (and takes more than 3 lines in base, usually a for loop or apply statement).
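
For comparison, here is roughly what the base-graphics version looks like using the simulated data from above (a sketch; ggplot2's smoothing defaults differ in the details):

lo <- loess(y ~ x, data = data)              # fit the smoother
ord <- order(data$x)                         # order by x for a clean line
plot(data$x, data$y, pch = 16, col = "grey60")
lines(data$x[ord], predict(lo)[ord], col = "blue", lwd = 2)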

g2 + geom_smooth()

plot of chunk smooth
By default, geom_smooth includes the standard error of the estimated relationship, but I usually only look at the estimate for a rough sketch of the relationship. Moreover, if the data are correlated (such as in longitudinal data), the standard errors given by default methods are usually not accurate anyway.

g2 + geom_smooth(se = FALSE)

plot of chunk smooth_no_se

Therefore, on top of the lack of copying and pasting, you can reduce the number of lines of code. Some say “but I can make a function that does that” – yes you can. Or you can use the commands already there in ggplot2.

Faceting

The other reason I frequently use ggplot2 is for faceting. You can do the same graph, conditioned on levels of a variable, which I frequently used. Many times you want to do a graph, subset by another variable, such as treatment/control, male/female, cancer/control, etc.

g + facet_wrap(~ group1)

plot of chunk facet_group

g + facet_wrap(~ group2)

plot of chunk facet_group

g + facet_wrap(group2 ~ group1)

plot of chunk facet_group

g + facet_wrap( ~ group2 + group1)

plot of chunk facet_group

g + facet_grid(group2 ~ group1)

plot of chunk facet_group

Spaghetti plot with Overall smoother

I also frequently have longitudinal data and make spaghetti plots of per-person trajectories over time.
For this example I took code from StackOverflow to create some longitudinal data.

library(MASS)
library(nlme)

### set number of individuals
n <- 200

### average intercept and slope
beta0 <- 1.0
beta1 <- 6.0

### true autocorrelation
ar.val <- .4

### true error SD, intercept SD, slope SD, and intercept-slope cor
sigma <- 1.5
tau0  <- 2.5
tau1  <- 2.0
tau01 <- 0.3

### maximum number of possible observations
m <- 10

### simulate number of observations for each individual
p <- round(runif(n,4,m))

### simulate observation moments (assume everybody has 1st obs)
obs <- unlist(sapply(p, function(x) c(1, sort(sample(2:m, x-1, replace=FALSE)))))

### set up data frame
dat <- data.frame(id=rep(1:n, times=p), obs=obs)

### simulate (correlated) random effects for intercepts and slopes
mu  <- c(0,0)
S   <- matrix(c(1, tau01, tau01, 1), nrow=2)
tau <- c(tau0, tau1)
S   <- diag(tau) %*% S %*% diag(tau)
U   <- mvrnorm(n, mu=mu, Sigma=S)

### simulate AR(1) errors and then the actual outcomes
dat$eij <- unlist(sapply(p, function(x) arima.sim(model=list(ar=ar.val), n=x) * sqrt(1-ar.val^2) * sigma))
dat$yij <- (beta0 + rep(U[,1], times=p)) + (beta1 + rep(U[,2], times=p)) * log(dat$obs) + dat$eij

I will first add an alpha level to the plotting lines for the next plot (remember this must be done before the original plot is created). tspag will be the template plot, and I will create a spaghetti plot (spag) where each colour represents an id:

library(plyr)
dat = ddply(dat, .(id), function(x){
  x$alpha = ifelse(runif(n = 1) > 0.9, 1, 0.1)
  x$grouper = factor(rbinom(n=1, size =3 ,prob=0.5), levels=0:3)
  x
})
tspag = ggplot(dat, aes(x=obs, y=yij)) + 
  geom_line() + guides(colour=FALSE) + xlab("Observation Time Point") +
  ylab("Y")
spag = tspag + aes(colour = factor(id))
spag

plot of chunk spag

Many other times I want to group by id but plot just a few lines (let's say 10% of them) dark and the other light, and not colour them:

bwspag = tspag + aes(alpha=alpha, group=factor(id)) + guides(alpha=FALSE)
bwspag

plot of chunk unnamed-chunk-1

Overall, these 2 plots are useful when you have longitudinal data and don't want to loop over ids or use lattice. The great addition is that all the faceting and such above can be used in conjunction with these plots to get spaghetti plots by subgroup.

spag + facet_wrap(~ grouper)

plot of chunk spag_facet

Spaghetti plot with overall smoother

If you want a smoother for the overall group in addition to the spaghetti plot, you can just add geom_smooth:

sspag = spag + geom_smooth(se=FALSE, colour="black", size=2)
sspag

plot of chunk spag_smooth

sspag + facet_wrap(~ grouper)

plot of chunk spag_smooth

bwspag + facet_wrap(~ grouper)

plot of chunk spag_smooth

Note that the group aesthetic and colour aesthetic do not perform the same way for some operations. For example, let's try to smooth bwspag:

bwspag + facet_wrap(~ grouper) + geom_smooth(se=FALSE, colour="red")

plot of chunk spag_smooth_bad

We see that it smooths each id, which is not what we want. We can achieve the desired result by setting the group aesthetic:

bwspag + facet_wrap(~ grouper) + 
  geom_smooth(aes(group=1), se=FALSE, colour="red", size =2)

plot of chunk spag_smooth_corr

I hope that this demonstrates some of the simple yet powerful commands ggplot2 allows users to execute. I agree that some behavior may not seem straightforward at first glance, but becomes more understandable as one uses ggplot2 more.

Another Case this is Useful – Save Plot Twice

Another (non-plotting) example I want to show is how saving ggplot2 objects can make saving duplicate plots much easier. In many cases for making plots to show to others, I open up a PDF device using pdf, make a series of plots, and then close the PDF. There may be 1-2 plots that are the real punchline, and I want to make a high-res PNG of them separately. Let me be clear, I want both – the PDF and PNG. For example:

pdf(tempfile())
print({g1 = g + aes(colour = group1)})
print({g1fac = g + aes(colour = factor(group1))})
print({g2 = g + aes(colour = group2)})
dev.off()

plot of chunk pdf_and_pngs

png(tempfile(), res = 300, height =7, width= 7, units = "in")
print(g2)
dev.off()

I am printing the objects while assigning them. (I have to use the { } brackets because I use = for assignment, and print would otherwise evaluate that as arguments.) As g2 is already saved as an object, I can close the PDF, open a PNG, and then print the object again.

No copying and pasting was needed for remaking the plot, nor some weird turning off and on devices.
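
As an aside that is not part of the workflow above, ggsave() can also write an already-saved ggplot object to disk without juggling devices:

# write g2 to a PNG at 300 dpi; tempfile() stands in for a real output path
ggsave(tempfile(fileext = ".png"), plot = g2, width = 7, height = 7, dpi = 300)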

Conclusion

I use ggplot2 (and think you should too) for a lot of reasons, especially the grammar and philosophy. The plots I have used here are some powerful representations of data that are simple to execute. I believe using this system reflects and helps the true iterative process of making figures. The syntax may not seem intuitive to a long-time R user, but I believe the startup cost is worth the effort. Let me know if you'd like to see any other plots that you commonly use.



Wrangling F1 Data With R – F1DataJunkie Book


(This article was first published on OUseful.Info, the blog... » Rstats, and kindly contributed to R-bloggers)

Earlier this year I started trying to pull together some of my #f1datajunkie R-related ramblings in book form. The project stalled, but to try to reboot it I’ve started publishing it as a living book over on Leanpub. Several of the chapters are incomplete, with TO DO items sketched in; others are still unpublished. The beauty of the Leanpub model is that if you buy a copy, you continue to get access to all future updated versions of the book. (And my idea is that by getting the book out there as it is, I’ll feel as if there’s more (social) pressure on actually trying to keep up with it…)

I’ll be posting more details about how the Leanpub process works (for me at least) in the next week or two, but for now, here’s a link to the book: Wrangling F1 Data With R: A Data Junkie’s Guide.

Here’s the table of contents so far:

  • Foreword
    • A Note on the Data Sources
  • Introduction
    • Preamble
    • What are we trying to do with the data?
    • Choosing the tools
    • The Data Sources
    • Getting the Data into RStudio
    • Example F1 Stats Sites
    • How to Use This Book
    • The Rest of This Book…
  • An Introduction to RStudio and R dataframes
    • Getting Started with RStudio
    • Getting Started with R
    • Summary
  • Getting the data from the Ergast Motor Racing Database API
    • Accessing Data from the ergast API
    • Summary
  • Getting the data from the Ergast Motor Racing Database Download
    • Accessing SQLite from R
    • Asking Questions of the ergast Data
    • Summary
    • Exercises and TO DO
  • Data Scraped from the F1 Website
    • Problems with the Formula One Data
    • How to use the FormulaOne.com alongside the ergast data
  • Reviewing the Practice Sessions
    • The Weekend Starts Here
    • Practice Session Data from the FIA
    • Sector Times
    • FIA Media Centre Timing Sheets
  • A Quick Look at Qualifying
    • Qualifying Session Position Summary Chart
    • Another Look at the Session Tables
    • Ultimate Lap Positions
  • Lapcharts
    • Annotated Lapcharts
  • Race History Charts
    • The Simple Laptime Chart
    • Accumulated Laptimes
    • Gap to Leader Charts
    • The Lapalyzer Session Gap
    • Eventually: The Race History Chart
  • Pit Stop Analysis
    • Pit Stop Data
    • Total pit time per race
    • Pit Stops Over Time
    • Estimating pit loss time
    • Tyre Change Data
  • Career Trajectory
    • The Effect of Age on Performance
    • Statistical Models of Career Trajectories
    • The Age-Productivity Gradient
    • Summary
  • Streakiness
    • Spotting Runs
    • Generating Streak Reports
    • Streak Maps
    • Team Streaks
    • Time to N’th Win
    • TO DO
    • Summary
  • Conclusion
  • Appendix One – Scraping formula1.com Timing Data
  • Appendix Two – FIA Timing Sheets
    • Downloading the FIA timing sheets for a particular race
  • Appendix – Converting the ergast Database to SQLite

If you think you deserve a free copy, let me know… ;-)



Exploration of Letter Make Up of English Words


(This article was first published on TRinker's R Blog » R, and kindly contributed to R-bloggers)

This blog post will do a quick exploration of the grapheme makeup of words in English. Specifically, we will use R and the qdap package to answer 3 questions:

  1. What is the distribution of word lengths (number of graphemes)?
  2. What is the frequency of letter (grapheme) use in English words?
  3. What is the distribution of letters positioned within words?

Click HERE for a script with all of the code for this post.


We will begin by loading the necessary packages and data (note you will need qdap 2.2.0 or higher):

if (!packageVersion("qdap") >= "2.2.0") {
    install.packages("qdap")
}
library(qdap); library(qdapDictionaries); library(ggplot2); library(dplyr)
data(GradyAugmented)

The Dictionary: Augmented Grady

We will be using qdapDictionaries::GradyAugmented to conduct the mini-analysis. The GradyAugmented list is an augmented version of Grady Ward's English words with additions from various other sources, including Mark Kantrowitz's names list. The result is a character vector of 122,806 English words and proper nouns.

GradyAugmented
?GradyAugmented

Question 1

What is the distribution of word lengths (number of graphemes)?

To answer this we will use base R's summary, qdap's dist_tab function, and a ggplot2 histogram.

summary(nchar(GradyAugmented))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    6.00    8.00    7.87    9.00   21.00 
dist_tab(nchar(GradyAugmented))
   interval  freq cum.freq percent cum.percent
1         1    26       26    0.02        0.02
2         2   116      142    0.09        0.12
3         3  1085     1227    0.88        1.00
4         4  4371     5598    3.56        4.56
5         5  9830    15428    8.00       12.56
6         6 16246    31674   13.23       25.79
7         7 23198    54872   18.89       44.68
8         8 27328    82200   22.25       66.93
9         9 17662    99862   14.38       81.32
10       10  9777   109639    7.96       89.28
11       11  5640   115279    4.59       93.87
12       12  3348   118627    2.73       96.60
13       13  2052   120679    1.67       98.27
14       14  1066   121745    0.87       99.14
15       15   582   122327    0.47       99.61
16       16   268   122595    0.22       99.83
17       17   136   122731    0.11       99.94
18       18    50   122781    0.04       99.98
19       19    17   122798    0.01       99.99
20       20     5   122803    0.00      100.00
21       21     3   122806    0.00      100.00
ggplot(data.frame(nletters = nchar(GradyAugmented)), aes(x=nletters)) + 
    geom_histogram(binwidth=1, colour="grey70", fill="grey60") +
    theme_minimal() + 
    geom_vline(xintercept = mean(nchar(GradyAugmented)), size=1, 
        colour="blue", alpha=.7) + 
    xlab("Number of Letters")

plot of chunk unnamed-chunk-3

Here we can see that the average word length is 7.87 letters long with a minimum of 1 (expected) and a maximum of 21 letters. The histogram indicates the distribution is skewed slightly right.

Question 2

What is the frequency of letter (grapheme) use in English words?

Now we will view the overall letter usage in the augmented Grady word list. Wheel of Fortune lovers…how will r, s, t, l, n, e fare? Here we will double loop through each word and each letter of the alphabet and grab the position of the letters in the words using gregexpr. gregexpr is a nifty function that reports the starting locations of regular expression matches. At this point the positioning isn't necessary for answering the 2nd question, but we're setting ourselves up to answer the 3rd question. We'll then use a frequency table and an ordered bar chart to see the frequency of letters in the word list.
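To see what gregexpr returns on a single word, here is a tiny standalone illustration (my own, not part of the original analysis):

# gregexpr() reports the starting position of every match;
# with fixed = TRUE the pattern is treated as a literal string.
gregexpr("t", "tattoo", fixed = TRUE)[[1]]
# starting positions 1, 3 and 4 (plus match.length attributes)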

Be patient with the double loop (lapply/sapply); it is 122,806 words and takes ~1 minute to run.

position <- lapply(GradyAugmented, function(x){

    z <- unlist(sapply(letters, function(y){
        gregexpr(y, x, fixed = TRUE)
    }))
    z <- z[z != -1] 
    setNames(z, gsub("\\d", "", names(z)))
})


position2 &lt;- unlist(position)

freqdat <- dist_tab(names(position2))
freqdat[["Letter"]] <- factor(toupper(freqdat[["interval"]]), 
    levels=toupper((freqdat %>% arrange(freq))[[1]] %>% as.character))

ggplot(freqdat, aes(Letter, weight=percent)) + 
  geom_bar() + coord_flip() +
  scale_y_continuous(breaks=seq(0, 12, 2), label=function(x) paste0(x, "%"), 
      expand = c(0,0), limits = c(0,12)) +
  theme_minimal()

plot of chunk letter_barpot

The output is given as the percentage of letter usage. Let's see if that jibes with the points one gets in a Scrabble game for the various tiles:
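As a rough check of my own (not from the original post), the letter percentages can be lined up against the standard English Scrabble tile values; note that these standard values may differ from whatever chart the author consulted:

# Standard English Scrabble tile values (these may differ from the chart used in the original post)
scrabble <- c(A=1, B=3, C=3, D=2, E=1, F=4, G=2, H=4, I=1, J=8, K=5, L=1, M=3,
              N=1, O=1, P=3, Q=10, R=1, S=1, T=1, U=1, V=4, W=4, X=8, Y=4, Z=10)

# Frequent letters should, roughly, carry low point values
comparison <- data.frame(Letter  = freqdat[["Letter"]],
                         percent = freqdat[["percent"]],
                         points  = scrabble[as.character(freqdat[["Letter"]])])
comparison[order(-comparison$percent), ]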

Overall, yeah I suppose the Scrabble point system makes sense. However, it makes me question why the “K” is worth 5 and why “Y” is only worth 3. I’m sure more thought went into the creation of Scrabble than this simple analysis**.

**EDIT: I came across THIS BLOG POST indicating that perhaps the point values of Scrabble tiles are antiquated.  

Question 3

What is the distribution of letters positioned within words?

Now we will use a heat map to tackle the question of which letters are found in which positions. I like the blue = high / yellow = low configuration of heat maps. For me it is a good contrast, but you may not agree. Please switch the high/low colors if they don't suit.

dat <- data.frame(letter=toupper(names(position2)), position=unname(position2))

dat2 <- table(dat)
dat3 <- t(round(apply(dat2, 1, function(x) x/sum(x)), digits=3) * 100)
qheat(apply(dat2, 1, function(x) x/length(position2)), high="blue", 
    low="yellow", by.column=NULL, values=TRUE, digits=3, plot=FALSE) +
    ylab("Letter") + xlab("Position") + 
    guides(fill=guide_legend(title="Proportion"))

plot of chunk letter_heat

The letters "S" and "C" dominate the first position. Interestingly, vowels and the consonants "R" and "N" lead the second spot. I'm guessing the latter is due to consonant blends. The letter "S" likes most spots except the second spot. This appears to be similar, though less pronounced, for other popular consonants. The letter "R", if this were a baseball team, would be the utility player, able to do well in multiple positions. One last observation: don't put "H" in the third position.



*Created using the reports package


To leave a comment for the author, please follow the link and comment on his blog: TRinker's R Blog » R.


Leveraging R for Job Openings for Economists


(This article was first published on fReigeist » R, and kindly contributed to R-bloggers)

Quite a few people emailed me regarding my post on Econ Job Market. This post is about how you can use very basic and simple R tools to help you sort through the Job Openings for Economists list from the American Economic Association. This definitely helped me in developing my spreadsheet of places to apply to.

Before I begin I would like to quickly revisit EJM again.

Revisiting Using R for Econ Job Market
It turns out that my post about scraping the EJM website to obtain a listing of the job posts was (partly) redundant. The new EJM system available on "myeconjobmarket.org" provides a facility to download a spreadsheet. Unfortunately, that spreadsheet does not contain the important "position ID". This position ID matters because it lets you construct a link to the application page.

An example:

https://econjobmarket.org/AdDetails.php?posid=2723

The Application Link then becomes:

https://econjobmarket.org/Apply/PosApp.php?posid=2723

In order for this to work, you'll need to have a current login session open; otherwise, you'll be redirected to the main homepage. I updated the spreadsheet and it's available here for download. I have emailed EJM to ask them to add the job opening ID to their spreadsheet, so that you can then merge the two spreadsheets.
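If that ID does get added, merging the two spreadsheets is a one-liner. A minimal sketch (the file names and the shared posid column are assumptions on my part):

# Hypothetical file names and a shared "posid" key -- adjust to the actual downloads
ejm_download <- read.csv("ejm_download.csv")   # the spreadsheet from myeconjobmarket.org
ejm_scraped  <- read.csv("ejm_scraped.csv")    # the scraped listing containing the position IDs
merged <- merge(ejm_download, ejm_scraped, by = "posid")
merged$apply_link <- paste0("https://econjobmarket.org/Apply/PosApp.php?posid=", merged$posid)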

econjobmarket-01-11-2014

Leveraging R for JOE?

Now I am turning to JOE. As on EJM, you can download the job openings. Again, they don't include a link to the job posting. However, you can easily construct one, because the Job Posting ID is simply a concatenation of the fields "joe_issue_ID" and "jp_id", separated by an underscore. This gives the JOE_ID.

https://www.aeaweb.org/joe/listing.php?JOE_ID=2014-02_111451008

Now you could use the filtering on JOE itself to limit the types of postings, but you can also do this in R, and you can try to add some features of your own.

Filtering Jobs/ adding a common country code

The first thing I want to do is show you how to filter the job listings and add a common country name or country code.


library(countrycode)
library(data.table)
library(stringr)
options(stringsAsFactors=FALSE)

JOBS<-data.table(read.csv(file="~/Downloads/joe_resultset.csv"))

JOBS$Application_deadline<-as.POSIXct(JOBS$Application_deadline)
JOBS<-JOBS[order(Application_deadline)][Application_deadline>as.POSIXct("2014-10-20")]

###this will keep all full time academic jobs, be it international or just within US
JOBS<-JOBS[grep("Assistant|Professor|Lecturer",jp_title)][grep("Full-Time Academic", jp_section)]

##split out the country
JOBS$country<-gsub("([A-Z]*)( [A-Z]{2,})?(.*)","\\1\\2", JOBS$locations)
###get harmonized country codes...
JOBS$iso3<-countrycode(JOBS$country, "country.name", "iso3c", warn = FALSE)
###transfer application deadline into a date format
JOBS$Application_deadline<-as.POSIXct(JOBS$Application_deadline)
###drop ones that have already passed
JOBS<-JOBS[order(Application_deadline)][Application_deadline>as.POSIXct("2014-10-20")]

When doing this, you will notice something weird. The application deadline is wrong… in quite a few cases.

Consider for example the job posting for an Assistant Professor position at the Harvard Kennedy School (see https://www.aeaweb.org/joe/listing.php?JOE_ID=2014-02_111451068).

In the spreadsheet, you will see a deadline of 31.01.2015 – which definitely can't be right, because the ASSA meetings are in early January. So how can we fix these up? If you look at the plain text of the posting, you will see that applications will begin to be considered on 18-Nov-14… that is much more reasonable…

If you sort applications by the application deadline field provided by JOE, you run the risk of missing out due to quirks like this.

joe-listing

One way around this is to run some regular expressions on the main text field to flag up common date formats. This way you do not need to read every individual job posting to find a date; you can simply look at the postings that seem to have odd application deadlines (such as later than December).

A regular expression could take the form:

(November|December) ([0-9]{1,2})(,)? (2014)?

which would match date formats like "November 20, 2014" or "November 20". The following code matches a few common date formats, and the resulting spreadsheet, filtering only academic jobs, is attached. This formed the starting point for my job market application spreadsheet.

JOBS$jp_full_text<-as.character(JOBS$jp_full_text)

###OTHER DATE MENTIONED IN DECEMBER / NOVEMBER MENTIONED IN THE FULL TEXT
JOBS$otherdate<-""
whichare<-regexpr("(November|December) ([0-9]{1,2})(,)? (2014)?",JOBS$jp_full_text, perl=TRUE, useBytes=TRUE)
JOBS[whichare[1:nrow(JOBS)]!=-1]$otherdate<-regmatches(JOBS$jp_full_text,whichare)
whichare<-regexpr("([0-9]{1,2}) (November|December)(,)? (2014)?",JOBS$jp_full_text, perl=TRUE, useBytes=TRUE)
JOBS[whichare[1:nrow(JOBS)]!=-1]$otherdate<-regmatches(JOBS$jp_full_text,whichare)
whichare<-regexpr("([0-9\\.\\/]{1,})([0-9\\.\\/]{1,})(2014)?",JOBS$jp_full_text, perl=TRUE, useBytes=TRUE)
JOBS[whichare[1:nrow(JOBS)]!=-1]$otherdate<-regmatches(JOBS$jp_full_text,whichare)
###add the JOB LISTING URL
JOBS$url<-JOBS[, paste("https://www.aeaweb.org/joe/listing.php?JOE_ID=",joe_issue_ID,"_",jp_id,sep="")]

The resulting spreadsheet is attached: joe-results-refined.

To leave a comment for the author, please follow the link and comment on his blog: fReigeist » R.


Shapefiles from Isodensity Curves


(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

Recently, with @3wen, we wanted to play with isodensity curves. The problem is that it is difficult to get – numerically – the equation of the contour (even if we can easily plot it). Consider the following surface (just for fun, in order to illustrate the idea)

> f=function(x,y) x*y+(1-x)*(1-y)
> u=seq(0,1,length=21)
> v=seq(0,1,length=11)
> angle=30   # viewing angle for persp(); not defined in the original excerpt
> f=outer(u,v,f)
> persp(u,v,f,theta=angle,phi=10,box=TRUE,
+ shade=TRUE,ticktype="detailed",xlab="",
+ ylab="",zlab="",col="yellow")

For instance, assume that we want to locate areas where the density exceeds 0.7 (here in the lower left corner, SW, and the upper right corner, NE)

> image(u,v,f)
> contour(u,v,f,add=TRUE,levels=.7)

Is it possible to get the shapefile of the area(s) where the density exceeds some given threshold?

Recall that our density is defined on a grid. The points (on the grid) such that the density exceeds the threshold are obtained using

> vectu=rep(u,times=length(v))   # grid coordinates, stacked column-wise
> vectv=rep(v,each=length(u))    # (these definitions are not shown in the original excerpt)
> vecti=as.vector(f>.7)          # indicator: density exceeds the threshold
> x=matrix(c(vectu[vecti==TRUE],
+ vectv[vecti==TRUE]),ncol=2)
> plot(x,xlim=0:1,ylim=0:1)

(here it is not perfectly symmetric since I wanted to have a thinner grid on one axis, and a larger one on the other, just for fun). To get a nice shapefile, let us consider the convex hull of the points (or, to be more specific, some α-convex hull, see Goswami (2013) or Pless (2012)), but actually we'd better add some random noise to avoid straight lines (for computational issues)

> x=matrix(c(vectu[vecti==TRUE],
+ vectv[vecti==TRUE]),ncol=2)+
+ rnorm(sum(vecti)*2)/10000

The α-hull is obtained using

> library(alphahull)
> alphashape <- ashape(x, alpha = .2)

The contour is obtained by connecting the following points (here, we have indices of points, used to draw segments),

> alphashape$edges[, 1:2]
      ind1 ind2
 [1,]    3    2
 [2,]    7   13
 [3,]    6    5
 [4,]    5    4
 [5,]    6   12
 [6,]   14   13
 [7,]   16   15
 [8,]   15   14
 [9,]   12   16
[10,]   17   18
[11,]   18   22
[12,]   22   28
[13,]   21   27
[14,]   34   27
[15,]   30   29
[16,]   29   28
[17,]   31   30
[18,]   33   32
[19,]   32   31
[20,]   34   33
[21,]    1    2
[22,]    1    7
[23,]    3    4
[24,]   17   21

The plot of this α-shape is the following

> plot(alphashape, col = "blue")

The problem is that we do not have a shapefile here. We have indices of points used to draw some segments. We should start from a given point, then move to its neighbour, and so on.

> id =alphashape$edges[, 1:2]
> boucle=FALSE
> listi=id[1,1:2]
> vk=2:nrow(id)
> i0=as.numeric(listi[length(listi)])
> while(boucle==FALSE){
+ idxi0=which(id[vk,1]==i0)
+ if(length(idxi0)>0)  {nb=id[vk,2][idxi0]}
+ if(length(idxi0)==0) {idxi0=which(id[vk,2]==i0)
+ nb=id[vk,1][idxi0]}
+ if(length(idxi0)==0) {boucle=TRUE}
+ if(boucle==FALSE){
+ listi=c(listi,nb)
+ vk=vk[-idxi0]
+ i0=nb}}

The shapefile for the region in the lower part is

> px=x[listi,]
> px
               [,1]          [,2]
 [1,]  1.000752e-01 -1.416283e-05
 [2,]  5.003630e-02  1.089980e-05
 [3,] -8.050675e-05 -1.453569e-04
 [4,]  6.210392e-05  9.996789e-02
 [5,]  1.017049e-04  1.999146e-01
 [6,]  5.006795e-02  1.998075e-01
 [7,]  1.001361e-01  2.001562e-01
 [8,]  1.499776e-01  1.998995e-01
 [9,]  2.500919e-01  1.000922e-01
[10,]  2.499919e-01 -6.847855e-06
[11,]  1.999688e-01  1.454993e-04
[12,]  1.499061e-01  1.595938e-04
[13,]  1.000752e-01 -1.416283e-05

If we build the polygon associated with those points, we can draw it

> polygon(px,col="red")

Now, remember that we added some noise. If we round the values, we get

> px=x[listi,]
> pxt=round(px*20)/20
> plot(x)
> polygon(pxt,col="purple")

Based on that technique, it is possible to draw almost anything. For instance, it is possible to visualize isodensity curves of the bivariate Gaussian distribution.
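As a minimal sketch of that idea (my own illustration, not code from the original post), one can estimate a bivariate Gaussian density on a grid with MASS::kde2d and feed the thresholded grid points into the same ashape()/re-ordering steps as above:

library(MASS)       # mvrnorm() and kde2d()
set.seed(1)
X <- mvrnorm(1000, mu = c(0, 0), Sigma = matrix(c(1, .5, .5, 1), 2))
d <- kde2d(X[, 1], X[, 2], n = 50)        # density estimated on a 50 x 50 grid
vectu <- rep(d$x, times = length(d$y))    # grid coordinates, stacked column-wise
vectv <- rep(d$y, each  = length(d$x))
vecti <- as.vector(d$z > 0.05)            # keep points where the estimated density exceeds 0.05
x <- cbind(vectu[vecti], vectv[vecti]) + rnorm(2 * sum(vecti)) / 10000
plot(x)                                   # these points then go into ashape() as before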

Now, we used a similar technique to visualize hot spots on various maps. In our original work, we have visualizations such as

@3wen recently uploaded a long post on his blog, to describe the algorithm we used, to go from the level curves on the graph above to the clusters described below,

Similar graphs can be found in the revised version of our joint paper, Kernel Density Estimation with Ripley’s Circumferential Correction.

To leave a comment for the author, please follow the link and comment on his blog: Freakonometrics » R-english.


A Look at the World Values Survey


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Peggy Fan
Ph.D. Candidate at Stanford's Graduate School of Education

Part of my dissertation at Stanford Graduate School of Education, International Comparative Education program, looks at the World Values Survey (WVS), a cross-national social survey that started in 1981. Since then there have been six waves, and the surveys include questions that capture the demographics, behaviors, personal beliefs, and attitudes of the respondents in a variety of contexts. I am interested in looking at civic participation, which is often measured by the extent to which a person belongs to an organization outside of family and work (and religion).

The goal is to create a tool that facilitates preliminary data analyses on this large dataset. The shiny app turns out to be a great tool for data visualization and exploration.

Data manipulation

There are 85 countries from the first five waves in this dataset, with about 255,000 observations. My outcome variable is based on a battery of questions from the WVS that ask if the respondent is a member of any of the following types of association: sports, arts, labor, politics, environmental, charity, women's rights, human rights, or other. A respondent gets a "1" if he or she answers "yes" to any of the associational membership questions.

For the purpose of this app, I extract the relevant variables, such as regions, country IDs, gender, educational attainment, and membership, from the larger data set. Because the lowest unit of analysis is the country, I calculate country and regional averages of membership broken down by gender and educational attainment.
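A minimal sketch of that aggregation step (the data frame wvs and the columns region, country, gender and membership are hypothetical names of my own, not the actual WVS variable names):

library(dplyr)

# Hypothetical individual-level data: region, country, gender, membership (0/1 indicator)
country_avg <- wvs %>%
  group_by(region, country, gender) %>%
  summarise(membership_rate = mean(membership, na.rm = TRUE))

region_avg <- wvs %>%
  group_by(region, gender) %>%
  summarise(membership_rate = mean(membership, na.rm = TRUE))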

Reactive input

The reactive input function in shiny adjusts the data displayed based on the criteria selected. This allows many ways to dissect data. I utilize it to present data at the world, region, and country levels as well as by gender and education.

I create three tabs by topic. Users can choose "the world" or a region of interest on the left, and using a reactive expression together with selectInput, I create an object that makes the main panel display the corresponding data. I also use observe to add another reactive element, which displays the list of countries once a region is selected.

For the gender and educational attainment tabs, renderDataTable is perfect for displaying the information because it allows users to sort the data by the variables of interest. It also has other options, such as including a search box in the table, but I do not want extraneous features to clutter the table, and a simple version is adequate for conveying the information.

server.R
 
  selectedData1 <- reactive({
    if (input$region == "the world") {
      highested_table[,-1]
    } else {
      region = input$region
      highested_table[(highested_table$region == region), -1]
    }
  })
 
  output$mytable1 = renderDataTable({
    selectedData1()
  }, options = list(lengthMenu = c(5,10), pageLength = 5, 
                    searching = FALSE)
  )
 
observe({
    region = input$region
    updateSelectInput(session, "country",
                      choices = levels(as.factor(as.character(wvs_c$country[wvs_c$region==region]))),
                      selected = levels(as.factor(as.character(wvs_c$country[wvs_c$region==region])))[1]
    )
  })
ui.R
 
 sidebarPanel(
      selectInput("region", "Select a region:",
                  list("All World"= "the world",
                       "North America & Western Europe"="region1",
                       "Central Europe"="region2",
                       "Asia"="region3",
                       "Latina America & Caribbean"="region4",
                       "Sub-Saharan Africa"="region5",
                       "Middle East & Northern Africa"="region6",
                       "Oceania"="region7"),
                  selected = "the world")
 
      mainPanel(
        tabPanel('Gender', dataTableOutput('mytable'),
                 selectInput('country', 'Select a Country:', 
                             names(wvs_c$country), selected=names(wvs_c$country)[1]),
                 plotOutput("myplot")
        ),

Visualization

The maps provide a holistic view for world and regional comparisons. I chose the rworldmap package because it uses ISO3 as the country identifier, and I also use ISO3 in my own data, which makes merging country-level data and spatial polygons quite easy. Moreover, its default is a choropleth map by country, for which I only have to adjust the palette for styling.

#server.R
 
library(rworldmap)
wvs_c <- read.csv("/Users/peggyfan/Downloads/R_data/Developing_data_products/wvs_c")
wvs_c <- wvs_c[, -1]
 
colourPalette1 <-c("#F5A9A9", "#F6D8CE", "#F8ECE0", "#EFFBFB", "#E0F2F7", "#CEE3F6", "#A9BCF5")
world <- joinCountryData2Map(wvs_c
                             ,joinCode = "ISO3"
                             ,nameJoinColumn = "iso3"
                             ,mapResolution="li")
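To actually draw the choropleth from the joined object, a call along the following lines is one way to do it with rworldmap's mapCountryData(); the column name membership_rate is a placeholder of mine for whichever country-level average is being mapped:

# Draw the country-level choropleth using the custom palette defined above
mapCountryData(world,
               nameColumnToPlot = "membership_rate",   # placeholder column name
               colourPalette    = colourPalette1,
               catMethod        = "fixedWidth",
               mapTitle         = "Associational membership by country")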

PF1

When “the world” is chosen, the map tab shows the mapping of the entire data set. The gender and educational attainment tabs show the regional breakdown of those two topics using ggplot.

PF2

Below the table, I embed another panel so users can choose a specific country (listed in alphabetical order) to view its gender and educational attainment breakdown in charts created also with ggplot.

PF3

Those who are interested in the tabular data displayed on the "gender" and "educational attainment" tabs can download the data from the website for their own research purposes.

App address: https://peggyfan.shinyapps.io/shinyapps/

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


The reddit Front Page is Not a Meritocracy


(This article was first published on Category: R | Todd W. Schneider, and kindly contributed to R-bloggers)

To leave a comment for the author, please follow the link and comment on his blog: Category: R | Todd W. Schneider.


Improving R Data Visualisations Through Design


(This article was first published on Spatial.ly » R, and kindly contributed to R-bloggers)

When I start an R class, one of my opening lines is nearly always that the software is now used by the likes of the New York Times graphics department or Facebook to manipulate their data and produce great visualisations. After saying this, however, I have always struggled to give tangible examples of how an R output blossoms into a stunning and informative graphic. That is until now…

I spent the past year working hard with an amazing designer – Oliver Uberti – to create a book of 100+ maps and graphics about London. The majority of graphics we produced for London: The Information Capital required R code in some shape or form. This was used to do anything from simplifying millions of GPS tracks, to creating bubble charts or simply drawing a load of straight lines. We had to produce a graphic every three days to hit the publication deadline, so without the efficiencies of copying and pasting old R code, or the flexibility to do almost any kind of plot, the book would not have been possible. So for those of you out there interested in the process of creating great graphics with R, here are 5 graphics shown from the moment they came out of R to the moment they were printed.

commute_flows_before_after
This graphic shows the origin-destination flows of commuters in Southern England. In R I used the geom_segment() command from the brilliant ggplot2 package to draw slightly transparent white lines between the centroids of the origins and destinations. I thought my R export looked pretty good on black, but we then imported it into Adobe Illustrator and Oliver applied a series of additional transparency effects to the lines to make them glow against the dark blue background (a colour we use throughout the book).
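A minimal sketch of that base plot (the flows data frame and its column names are hypothetical; the real graphic was built from census origin-destination centroids):

library(ggplot2)

# Hypothetical data frame: one row per origin-destination pair
# flows <- data.frame(o_lon, o_lat, d_lon, d_lat)
ggplot(flows) +
  geom_segment(aes(x = o_lon, y = o_lat, xend = d_lon, yend = d_lat),
               colour = "white", alpha = 0.05, size = 0.2) +  # faint, slightly transparent lines
  coord_equal() +
  theme(panel.background = element_rect(fill = "black"))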
day_night_before_after
This is a crop from a graphic we produced to show the differences between the daytime and nighttime population of London (we are showing nighttime here). It copies the code I used to produce my Population Lines print, but Oliver went to the effort of manually cleaning the edges of the plot (I couldn’t work out how to automatically clip the lines in ggplot2!) by following the red-line I over-plotted. Colours were tweaked and labels added, all in Illustrator.
treasures_before_after
One of my favourite graphics in the book shows the number of pieces of work by each artist in the Tate galleries. We can only show a small section here, but at full size it looks spectacular as it features a Turner painting at its centre. The graphic started life as a treemap that simply scaled the squares by the number of works per artist. R has a very easy to use treemap() function in the treemap package. Oliver then painstakingly broke the exported graphic into bits, converted the squares to picture frames and arranged them on "the wall".
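A minimal sketch of the treemap step (the tate data frame and its artist and works columns are hypothetical):

library(treemap)

# Hypothetical data: one row per artist with a count of their works in the collection
# tate <- data.frame(artist = ..., works = ...)
treemap(tate,
        index = "artist",   # one rectangle per artist
        vSize = "works",    # rectangle area scaled by the number of works
        title = "Works per artist")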
cycle_before_after
This map, showing cyclists in London by time of day, was created from code similar to this graphic. It is an example where very little needed to be done to the plot once exported – we only really needed to add the River Thames (this could have been done in R), some labels and then optimise the colours for printing. Hundreds of thousands of line segments are plotted here and the graphic is an excellent illustration of R's power to plot large volumes of data.
relationship_status_before_after
The graphic above (full size here) has been the most popular from the book so far. It takes 2011 Census data and maps people by marital status as well as showing the absolute numbers as a streamgraph. ggplot2 was used to create both the maps and the plot. We stuck to the exported colours for the maps and then manually edited the streamgraph colours. The streamgraph was created with the geom_ribbon() function in ggplot2.
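A minimal sketch of the geom_ribbon() idea (the status_counts data frame, with pre-computed lower and upper bounds per marital status, is hypothetical):

library(ggplot2)

# Hypothetical pre-computed ribbon bounds: one ribbon per marital status
ggplot(status_counts, aes(x = age, ymin = ymin, ymax = ymax, fill = status)) +
  geom_ribbon(alpha = 0.9) +
  theme_minimal()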
london_inspired_before_after

All the graphics shown so far started life as databases containing, as a minimum, several thousand rows of data. In this final example we show a "small data" example – the lives of 100 Londoners who have earned a blue plaque on one of London's buildings. The data were manually compiled, with each person having 3 attributes against their name: the age they lived to, the age when they created their most significant work, and the period of their life they lived in London. Thanks to ggplot2 I was able to use the code below to generate the coarse-looking plot above. Oliver could then take this and flip it before restyling and adding labels in Illustrator. The key thing here was that a couple of lines of code in R saved a day of manually drawing lines.

#We order by age of when the person started living in London, this is the order field.

ggplot(Data, aes(order, origin)) +
  geom_segment(aes(xend = order, yend = Age)) +
  geom_segment(aes(x = order, y = st_age, xend = order, yend = end_age), col = "red") +
  geom_segment(aes(x = order, y = st_age2, xend = order, yend = end_age2), col = "yellow") +
  coord_polar()

 

Purchase London: The Information Capital.

To leave a comment for the author, please follow the link and comment on his blog: Spatial.ly » R.


How to extract Google Analytics data in R using RGoogleAnalytics


(This article was first published on Tatvic Blog » R, and kindly contributed to R-bloggers)

I am extremely thrilled to announce that RGoogleAnalytics was recently released on CRAN. R is already a Swiss army knife for data analysis, largely due to its 6,000+ libraries. What this means is that digital analysts can now use the full analytical capabilities of R to explore their Google Analytics data. In this post, we will go through the basics of RGoogleAnalytics. Let's begin.

Fire up your favorite R IDE and install RGoogleAnalytics. Installation is pretty basic. In case you are new to RGoogleAnalytics, refer to this post to learn how to install it.

Since RGoogleAnalytics uses the Google Analytics Core Reporting API under the hood, every request to the API has to be authorized under the OAuth2.0 protocol. This requires an initial setup in terms of registering an app with the Google Analytics API so that you get a unique set of project credentials (Client ID and Client Secret). Here’s how to do this -

  • Navigate to Google Developers Console
  • Create a New Project and Open it
  • Navigate to APIs and ensure that the Analytics API is turned On for your project
  • Navigate to Credentials and create a New Client ID
  • Select Application Type – Installed Application

credentials

  • Once your Client ID and Client Secret are created, copy them to your R Script.
OAuth
Once the project is configured and the credentials are ready, we need to authenticate your Google Analytics account with your app. This ensures that your app (R script) can access your Google Analytics data, list your Google Analytics profiles, and so on. Once authenticated, you get a pair of tokens (an Access Token and a Refresh Token). An Access Token is appended to each API request so that Google's servers know that the requests came from your app and are authentic. Access Tokens expire after 60 minutes, so they need to be regenerated using the Refresh Token. I will show you how to do that, but first let's continue with the data extraction flow.
require(RGoogleAnalytics)

# Authorize the Google Analytics account
# This need not be executed in every session once the token object is created 
# and saved
client.id <- "xxxxxxxxxxxxxxxxxxxxxxxxx.apps.googleusercontent.com"
client.secret <- "xxxxxxxxxxxxxxxd_TknUI"
token <- Auth(client.id,client.secret)

# Save the token object for future sessions
save(token,file="./token_file")

The next step is to get the Profile ID/View ID of the Google Analytics profile for which the data extraction is to be carried out. It can be found within the Admin Panel of the Google Analytics UI. This profile ID maps to the table.id argument below.

The code below generates a query with the Standard Query Parameters – Start Date, End Date, Dimensions, Metrics etc. and hits the query to the Google Analytics API. The API response is converted in the form of a R DataFrame.

# Get the Sessions & Transactions for each Source/Medium sorted in 
# descending order by the Transactions

query.list <- Init(start.date = "2014-08-01",
                   end.date = "2014-09-01",
                   dimensions = "ga:sourceMedium",
                   metrics = "ga:sessions,ga:transactions",
                   max.results = 10000,
                   sort = "-ga:transactions",
                   table.id = "ga:123456")

# Create the Query Builder object so that the query parameters are validated
ga.query <- QueryBuilder(query.list)

# Extract the data and store it in a data-frame
ga.data <- GetReportData(ga.query, token)

# Sanity Check for column names
dimnames(ga.data)

# Check the size of the API Response
dim(ga.data)

In future sessions, you need not generate the Access Token every time. Assuming that you have saved it to a file, it can be loaded via the following snippet -

load("./token_file")

# Validate and refresh the token
ValidateToken(token)

Here are a few practices that you might find useful -

  • Before querying for a set of dimensions and metrics, you might want to check whether they are compatible. This can be done using the Dimensions and Metrics Explorer
  • The Query Feed Explorer lets you try out different queries in the browser and you can then copy the query parameters to your R Script. It can be found here. I have found this to be a huge time-saver for debugging failed queries
  • In case the API returns an error, here's a guide to understanding the cryptic error responses.

Did you find RGoogleAnalytics useful? Please leave your comments below. In case you have a feature request or want to file a bug, please use this link.

Kushan Shah

Kushan is a Web Analyst at Tatvic. His interests lie in getting the maximum insights out of raw data using R and Python. He is also the maintainer of the RGoogleAnalytics library. When not analyzing data, he reads Hacker News.

The post How to extract Google Analytics data in R using RGoogleAnalytics appeared first on Tatvic Blog.

To leave a comment for the author, please follow the link and comment on his blog: Tatvic Blog » R.


solar navigation


(This article was first published on Dan Kelley Blog/R, and kindly contributed to R-bloggers)

Introduction

Solar altitude is a function of time, longitude and latitude, and so it is possible to infer location by measuring altitude as a function of time. This form of solar navigation can be based on sunrise and sunset times, at least on non-equinox days.
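To see the quantity involved, here is a quick standalone illustration (my own, not part of the analysis below) of the sun's altitude at a given time and place, using the oce package:

library(oce)
# Sun altitude (degrees above the horizon) at noon UTC on Remembrance Day,
# at roughly Halifax's longitude and latitude
t0 <- as.POSIXct("2014-11-11 12:00:00", tz = "UTC")
sunAngle(t0, longitude = -63.57, latitude = 44.65)$altitude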

I have explored this for a school-based project I call “SkyView” [1] involving light sensors and Arduino microcontrollers, and I suspect that readers could imagine other applications as well.

I will illustrate the ideas and the accuracy of the method based on the example of sunrise and sunset on Remembrance Day in Halifax, Nova Scotia, a city where respect for the fallen is very strong.

Methods

According to various websites [e.g. 2], Remembrance Day sunrise in Halifax is at 7:06AM (11:06 UTC), with sunset at 4:51PM (20:51 UTC).

Rather than devising an inverse formula to infer location from time and solar altitude, the R function optim may be used to find the longitude and latitude that minimize the angle the sun makes with the horizon at the reported sunrise and sunset times. That angle is given by the altitude component of the list returned by oce::sunAngle().

Thus, a method for inferring the location of Halifax is as follows. The code should be self-explanatory to anyone who can read R.

library(oce)
## Loading required package: methods
## Loading required package: mapproj
## Loading required package: maps
## Loading required package: proj4
misfit <- function(lonlat)
{
    riseAlt <- sunAngle(rise, longitude=lonlat[1], latitude=lonlat[2], useRefraction=TRUE)$altitude
    setAlt <- sunAngle(set, longitude=lonlat[1], latitude=lonlat[2], useRefraction=TRUE)$altitude
    0.5 * (abs(riseAlt) + abs(setAlt))
}
start <- c(-50, 50) # set this vaguely near the expected location
rise <- as.POSIXct("2014-11-11 11:06:00", tz="UTC")
set <- as.POSIXct("2014-11-11 20:51:00", tz="UTC")
bestfit <- optim(start, misfit)

# Plot coastline
data(coastlineWorldFine, package="ocedata")
plot(coastlineWorldFine, clon=-64, clat=45, span=500)
grid()

# Plot a series of points calculated by perturbing the 
# suggested times by about the rounding interval of 1 minute.
set.seed(20141111) # setting the seed before the perturbations lets readers reproduce this exactly
n <- 25
rises <- rise + rnorm(n, sd=30)
sets <- set + rnorm(n, sd=30)
for (i in 1:n) {
    rise <- rises[i]
    set <- sets[i]
    fit <- optim(start, misfit)
    points(fit$par[1], fit$par[2], pch=21, bg="pink", cex=1.4)
}
# Show the unperturbed fit
points(bestfit$par[1], bestfit$par[2], pch=21, cex=2, bg="red")
# Show the actual location of Halifax
points(-63.571, 44.649, pch=22, cex=2, bg='yellow')
# A legend summarizes all this work
legend("bottomright", bg="white", 
       pch=c(21, 21, 22), pt.bg=c("red", "pink", "yellow"),
       legend=c("Best", "Range", "True"))

center

Results and conclusions

The diagram above indicates that varying times by half a minute (corresponding to the rounding interval in public forecasts of sunrise and sunset times) yields approximately 25km of uncertainty in geographical position, at this particular time of year. (Note that a degree of latitude is about 111km.)

Readers interested in exploring the uncertainty through the year should find the R code simple to modify. It is also easy to express the uncertainty statistically.
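For example, a quick way of expressing the positional error in kilometres (a small addition of my own, using oce's geodDist) is:

# Great-circle distance (km) between the unperturbed fit and the true Halifax location
geodDist(bestfit$par[1], bestfit$par[2], -63.571, 44.649)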

Further discussion of matters relating to solar angles can be found in my upcoming book [3].

Resources

  1. A website for the SkyView project is http://emit.phys.ocean.dal.ca/~kelley/skyview.

  2. A website providing the requisite sunrise and sunset times is http://www.timeanddate.com/astronomy/canada/halifax.

  3. Dan Kelley, in preparation. Oceanographic Analysis with R. Springer-Verlag.

  4. Source code: 2014-11-10-solar-navigation.R.

To leave a comment for the author, please follow the link and comment on his blog: Dan Kelley Blog/R.


Basic mapping and attribute joins in R


(This article was first published on Robin Lovelace - R, and kindly contributed to R-bloggers)

This post is based on the free and open source Creating-maps-in-R teaching resource for introducing R as a command-line GIS.

R is well known as a language ideally suited for data processing, statistics and modelling. R has a number of spatial packages, allowing analyses that would require hundreds of lines of code in other languages to be implemented with relative ease. Geographically weighted regression, analysis of space-time data and raster processing are three niche areas where R outperforms much of the competition, thanks to community contributions such as spgwr, spacetime and the wonderfully straightforward raster package.

What seems to be less well known is that R performs well as a self-standing Geographical Information System (GIS) in its own right. Everyday tasks such as reading and writing geographical data formats, reprojecting, joining, subsetting and overlaying spatial objects can be easy and intuitive in R, once you understand the slightly specialist data formats and syntax of spatial R objects and functions. These operations are the foundations of GIS, and mastering them will make much more advanced operations much easier. Based on the saying 'master walking before trying to run', this mini tutorial demonstrates how to load and plot a simple geographical object in R, illustrating the ease with which continuous and binned choropleth map colour schemes can be created using ggmap, an extension of the popular ggplot2 graphics package. Crucially, we will also see how to join spatial and non-spatial datasets, resulting in a map of where the Conservative party succeeded and failed in gaining council seats in the 2014 local elections.

As with any project, the starting point is to load the data we'll be using. In this case we can download all the datasets from a single source: the Creating-maps-in-R github repository, which is designed to introduce R's basic geographical functionality to beginners. We can use R to download and unzip the files using the following commands (from a Linux-based operating system). This ensures reproducibility:

# load the packages we'll be using for this tutorial
x <- c("rgdal", "dplyr", "ggmap", "RColorBrewer")
lapply(x, library, character.only = TRUE)
# download the repository:
download.file("https://github.com/Robinlovelace/Creating-maps-in-R/archive/master.zip", destfile = "rmaps.zip", method = "wget")
unzip("rmaps.zip") # unzip the files

Once ‘in’ the folder, R has easy access to all the datasets we need for this tutorial. As this is about GIS, the first stage is to load and plot some spatial data: a map of London:

setwd("/home/robin/Desktop/Creating-maps-in-R-master/") # navigate into the unzipped folder
london <- readOGR("data/", layer = "london_sport")
## OGR data source with driver: ESRI Shapefile 
## Source: "data/", layer: "london_sport"
## with 33 features and 4 fields
## Feature type: wkbPolygon with 2 dimensions
plot(london)

The data has clearly loaded correctly and can be visualised, but where is it? If the london object is simply printed, a load of unreadable information appears, including the coordinates defining the geographical extent of each zone and additional non-geographical attributes. R is polymorphic, meaning that generic functions behave differently depending on the type of data they are fed. The following command, for example, is actually calling mean.Date behind the scenes, allowing R to tell us that the 2nd of July was half way through the year; the default mean.default method would not give this result:

mean(as.Date(c("01/01/2014", "31/12/2014"), format = "%d/%m/%Y"))
## [1] "2014-07-02"

In the same way, we can use the trusty summary function to summarise our R object:

summary(london)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##        min      max
## x 503571.2 561941.1
## y 155850.8 200932.5
## Is projected: TRUE 
## proj4string :
## [+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000
## +y_0=-100000 +ellps=airy +units=m +no_defs]
## Data attributes:
##    ons_label                    name      Partic_Per       Pop_2001     
##  00AA   : 1   Barking and Dagenham: 1   Min.   : 9.10   Min.   :  7181  
##  00AB   : 1   Barnet              : 1   1st Qu.:17.60   1st Qu.:181284  
##  00AC   : 1   Bexley              : 1   Median :19.40   Median :216505  
##  00AD   : 1   Brent               : 1   Mean   :20.05   Mean   :217335  
##  00AE   : 1   Bromley             : 1   3rd Qu.:21.70   3rd Qu.:248917  
##  00AF   : 1   Camden              : 1   Max.   :28.40   Max.   :330584  
##  (Other):27   (Other)             :27

This has output some very useful information: the bounding box of the object, its coordinate reference system (CRS) and even summaries of the attributes associated with each zone. nrow(london) will tell us that there are 33 polygons represented within the object.

To gain a fuller understanding of the structure of the london object, we can use the str function (but only on the first polygon, to avoid an extremely long output):

str(london[1,])
## Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots
##   ..@ data       :'data.frame':  1 obs. of  4 variables:
##   .. ..$ ons_label : Factor w/ 33 levels "00AA","00AB",..: 6
##   .. ..$ name      : Factor w/ 33 levels "Barking and Dagenham",..: 5
##   .. ..$ Partic_Per: num 21.7
##   .. ..$ Pop_2001  : int 295535
##   ..@ polygons   :List of 1
##   .. ..$ :Formal class 'Polygons' [package "sp"] with 5 slots
##   .. .. .. ..@ Polygons :List of 1
##   .. .. .. .. ..$ :Formal class 'Polygon' [package "sp"] with 5 slots
##   .. .. .. .. .. .. ..@ labpt  : num [1:2] 542917 165647
##   .. .. .. .. .. .. ..@ area   : num 1.51e+08
##   .. .. .. .. .. .. ..@ hole   : logi FALSE
##   .. .. .. .. .. .. ..@ ringDir: int 1
##   .. .. .. .. .. .. ..@ coords : num [1:63, 1:2] 541178 541872 543442 544362 546662 ...
##   .. .. .. ..@ plotOrder: int 1
##   .. .. .. ..@ labpt    : num [1:2] 542917 165647
##   .. .. .. ..@ ID       : chr "0"
##   .. .. .. ..@ area     : num 1.51e+08
##   ..@ plotOrder  : int 1
##   ..@ bbox       : num [1:2, 1:2] 533569 156481 550541 173556
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:2] "x" "y"
##   .. .. ..$ : chr [1:2] "min" "max"
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slots
##   .. .. ..@ projargs: chr "+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs"

This shows us that the fundamental structure of a SpatialPolygonsDataFrame is actually rather complicated. This complexity is useful, allowing R to store the full range of information needed to describe almost any polygon-based dataset. The @ symbol in the structure represents slots, which are specific to the S4 object class and contain specific pieces of information within the wider london object. The basic slots within the london object are listed below (a short access example follows the list):

  • @data, which contains the attribute data for the zones
  • @polygons, the geographic data associated with each polygon (this confusingly contains the @Polygons slot: each polygon feature can contain multiple Polygons, e.g. if an administrative zone is non-contiguous)
  • @plotOrder is simply the order in which the polygons are plotted
  • @bbox is a slot associated with all spatial objects, representing its spatial extent
  • @proj4string the CRS associated with the object
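As a quick illustration of the slots listed above (my own example, using the same london object):

london@bbox         # the spatial extent of the object
london@proj4string  # its coordinate reference system
nrow(london@data)   # 33 rows of attribute data, one per borough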

Critically for exploring the attributes of london is the data slot. We can look at and modify the attributes of the subdivisions of london easily using the @ notation:

head(london@data)
##   ons_label                 name Partic_Per Pop_2001
## 0      00AF              Bromley       21.7   295535
## 1      00BD Richmond upon Thames       26.6   172330
## 2      00AS           Hillingdon       21.5   243006
## 3      00AR             Havering       17.9   224262
## 4      00AX Kingston upon Thames       24.4   147271
## 5      00BF               Sutton       19.3   179767

Having seen this notation, many (if not most) R beginners will tend to always use it to refer to attribute data in spatial objects. Yet @ is often not needed. To refer to the population of London, for example, the following lines of code yield the same result:

mean(london@data$Pop_2001)
## [1] 217335.1
mean(london$Pop_2001)
## [1] 217335.1

Thus we can treat the S4 spatial data classes as if they were regular data frames in some contexts, which is extremely useful for concise code. To plot the population of London zones on a map, the following code works:

cols <- brewer.pal(n = 4, name = "Greys")
lcols <- cut(london$Pop_2001,
  breaks = quantile(london$Pop_2001),
  labels = cols)
plot(london, col = as.character(lcols))

Now, how about joining additional variables to the spatial object? To join information to the existing variables, the join functions from dplyr (which replaces and improves on plyr) are a godsend. The following code loads a non-geographical dataset and joins an additional variable to london@data:

ldat <- read.csv("/home/robin/Desktop/Creating-maps-in-R-master/data/london-borough-profiles-2014.csv")
dat <- select(ldat, Code, contains("Anxiety"))
dat <- rename(dat, ons_label = Code, Anxiety = Anxiety.score.2012.13..out.of.10.)
dat$Anxiety <- as.numeric(as.character(dat$Anxiety))
## Warning: NAs introduced by coercion
london@data <- left_join(london@data, dat)
## Joining by: "ons_label"
head(london@data) # the new data has been added
##   ons_label                 name Partic_Per Pop_2001 Anxiety
## 1      00AF              Bromley       21.7   295535    3.20
## 2      00BD Richmond upon Thames       26.6   172330    3.56
## 3      00AS           Hillingdon       21.5   243006    3.34
## 4      00AR             Havering       17.9   224262    3.17
## 5      00AX Kingston upon Thames       24.4   147271    3.23
## 6      00BF               Sutton       19.3   179767    3.34

Plotting maps with ggplot

In order to plot the average anxiety scores across london we can use ggplot2:

lf <- fortify(london, region = "ons_label")
## Loading required package: rgeos
## rgeos version: 0.2-19, (SVN revision 394)
##  GEOS runtime version: 3.4.2-CAPI-1.8.2 r3921 
##  Polygon checking: TRUE
lf <- rename(lf, ons_label = id)
lf <- left_join(lf, london@data)
## Joining by: "ons_label"
ggplot(lf) + geom_polygon(aes(long, lat, group = group, fill = Anxiety))

The challenge

Using the skills you have learned in the above tutorial, see if you can replicate the graph below: the proportion of Conservative councilors selected in different parts of London. Hint: the data is contained in ldat, as downloaded from here: http://data.london.gov.uk/dataset/london-borough-profiles.

To leave a comment for the author, please follow the link and comment on his blog: Robin Lovelace - R.


Moving The Earth (well, Alaska & Hawaii) With R


(This article was first published on rud.is » R, and kindly contributed to R-bloggers)

In a previous post we looked at how to use D3 TopoJSON files with R and make some very D3-esque maps. I mentioned that one thing missing was moving Alaska & Hawaii a bit closer to the continental United States and this post shows you how to do that.

The D3 folks have it easy. They just use the built in modified Albers composite projection. We R folk have to roll up our sleeves a bit (but not much) to do the same. Thankfully, we can do most of the work with the elide (“ih lied”) function from the maptools package.

We’ll start with some package imports:

library(maptools)
library(mapproj)
library(rgeos)
library(rgdal)
library(RColorBrewer)
library(ggplot2)
 
# for theme_map
devtools::source_gist("33baa3a79c5cfef0f6df")

I'm using a GeoJSON file that I made from the 2013 US Census shapefile. I prefer GeoJSON mostly due to it being a single file and the easy conversion to TopoJSON if I ever need to use the same map in a D3 context (I work with information security data most of the time, so I rarely have to use maps at all for the day job). I simplified the polygons a bit (passing -simplify 0.01 to ogr2ogr) to reduce processing time.

We read in that file and then transform the projection to Albers equal area and join the polygon ids to the shapefile data frame:

# https://www.census.gov/geo/maps-data/data/cbf/cbf_counties.html
# read U.S. counties moderately-simplified GeoJSON file
us <- readOGR(dsn="data/us.geojson", layer="OGRGeoJSON")
 
# convert it to Albers equal area
us_aea <- spTransform(us, CRS("+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs"))
us_aea@data$id <- rownames(us_aea@data)

Now, to move Alaska & Hawaii, we have to:

  • extract them from the main shapefile data frame
  • perform rotation, scaling and transposing on them
  • ensure they have the right projection set
  • merge them back into the original spatial data frame

The elide function has parameters for all the direct shape munging, so we’ll just do that for both states. I took a peek at the D3 source code for the Albers projection to get a feel for the parameters. You can tweak those if you want them in other positions or feel the urge to change the Alaska rotation angle.

# extract, then rotate, shrink & move alaska (and reset projection)
# need to use state IDs via # https://www.census.gov/geo/reference/ansi_statetables.html
alaska <- us_aea[us_aea$STATEFP=="02",]
alaska <- elide(alaska, rotate=-50)
alaska <- elide(alaska, scale=max(apply(bbox(alaska), 1, diff)) / 2.3)
alaska <- elide(alaska, shift=c(-2100000, -2500000))
proj4string(alaska) <- proj4string(us_aea)
 
# extract, then rotate & shift hawaii
hawaii <- us_aea[us_aea$STATEFP=="15",]
hawaii <- elide(hawaii, rotate=-35)
hawaii <- elide(hawaii, shift=c(5400000, -1400000))
proj4string(hawaii) <- proj4string(us_aea)
 
# remove old states and put new ones back in; note the different order
# we're also removing puerto rico in this example but you can move it
# between texas and florida via similar methods to the ones we just used
us_aea <- us_aea[!us_aea$STATEFP %in% c("02", "15", "72"),]
us_aea <- rbind(us_aea, alaska, hawaii)

Rather than just show the resultant plain county map, we’ll add some data to it. The first example uses US drought data (from November 11th, 2014). Drought conditions are pretty severe in some states, but we’ll just show areas that have any type of drought (levels D0-D4). The color ramp shows the % of drought coverage in each county (you’ll need a browser that can display SVGs to see the maps):

# get some data to show...perhaps drought data via:
# http://droughtmonitor.unl.edu/MapsAndData/GISData.aspx
droughts <- read.csv("data/dm_export_county_20141111.csv")
droughts$id <- sprintf("%05d", as.numeric(as.character(droughts$FIPS)))
droughts$total <- with(droughts, (D0+D1+D2+D3+D4)/5)
 
# get ready for ggplotting it... this takes a cpl seconds
map <- fortify(us_aea, region="GEOID")
 
# plot it
gg <- ggplot()
gg <- gg + geom_map(data=map, map=map,
                    aes(x=long, y=lat, map_id=id, group=group),
                    fill="#ffffff", color="#0e0e0e", size=0.15)
gg <- gg + geom_map(data=droughts, map=map, aes(map_id=id, fill=total),
                    color="#0e0e0e", size=0.15)
gg <- gg + scale_fill_gradientn(colours=c("#ffffff", brewer.pal(n=9, name="YlOrRd")),
                                na.value="#ffffff", name="% of County")
gg <- gg + labs(title="U.S. Areas of Drought (all levels, % county coverage)")
gg <- gg + coord_equal()
gg <- gg + theme_map()
gg <- gg + theme(legend.position="right")
gg <- gg + theme(plot.title=element_text(size=16))
gg

img

While that shows Alaska & Hawaii in D3-Albers-style, it would be more convincing if we actually used the FIPS county codes on Alaska & Hawaii, so we’ll riff off the previous post and make a county land-mass area choropleth (since we have the land mass area data available in the GeoJSON file):

gg <- ggplot()
gg <- gg + geom_map(data=map, map=map,
                    aes(x=long, y=lat, map_id=id, group=group),
                    fill="white", color="white", size=0.15)
gg <- gg + geom_map(data=us_aea@data, map=map, aes(map_id=GEOID, fill=log(ALAND)),
                    color="white", size=0.15)
gg <- gg + scale_fill_gradientn(colours=c(brewer.pal(n=9, name="YlGn")),
                                na.value="#ffffff", name="County Land\nMass Area (log)")
gg <- gg + labs(title="U.S. County Area Choropleth (log scale)")
gg <- gg + coord_equal()
gg <- gg + theme_map()
gg <- gg + theme(legend.position="right")
gg <- gg + theme(plot.title=element_text(size=16))
gg

img

Now, you have one less reason to be envious of the D3 cool kids!

The code & shapefiles are available on github.

To leave a comment for the author, please follow the link and comment on his blog: rud.is » R.


Trading The Odds Volatility Risk Premium: Addressing Data Mining and Curve-Fitting


(This article was first published on QuantStrat TradeR » R, and kindly contributed to R-bloggers)

Several readers, upon seeing the risk and return ratio along with other statistics in the previous post, stated that the result may have been a product of data mining/over-optimization/curve-fitting/overfitting: that is, the bad practice of creating an amazing equity curve whose performance will decay out of sample.

Fortunately, there's a way to test that assertion. In their book "Trading Systems: A New Approach to System Development and Portfolio Optimization", Urban Jaekle and Emilio Tomasini use the concept of the "stable region" to demonstrate a way of visualizing whether or not a parameter specification is overfit. The idea of a stable region is to ask how robust a parameter specification is to slight changes going forward. If the system just happened to find one small good point in a sea of losers, the strategy is likely to fail out of sample. However, if small changes to the parameter specification still result in profitable configurations, then the chosen parameter set is a valid configuration.

As Frank’s trading strategy only has two parameters (standard deviation computation period, aka runSD for the R function, and the SMA period), rather than make line graphs, I decided to do a brute force grid search just to see other configurations, and plotted the results in the form of heatmaps.

Here’s the modified script for the computations (no parallel syntax in use for the sake of simplicity):

library(downloader)            # download()
library(quantmod)              # getSymbols(), Cl(); loads xts and TTR (runSD, SMA)
library(PerformanceAnalytics)  # Return.calculate(), maxDrawdown(), SharpeRatio.annualized()

download("https://dl.dropboxusercontent.com/s/jk6der1s5lxtcfy/XIVlong.TXT",
         destfile="longXIV.txt")

download("https://dl.dropboxusercontent.com/s/950x55x7jtm9x2q/VXXlong.TXT", 
         destfile="longVXX.txt")

xiv <- xts(read.zoo("longXIV.txt", format="%Y-%m-%d", sep=",", header=TRUE))
vxx <- xts(read.zoo("longVXX.txt", format="%Y-%m-%d", sep=",", header=TRUE))
vxmt <- xts(read.zoo("vxmtdailyprices.csv", format="%m/%d/%Y", sep=",", header=TRUE))

getSymbols("^VIX", from="2004-03-29")

vixvxmt <- merge(Cl(VIX), Cl(vxmt))
vixvxmt[is.na(vixvxmt[,2]),2] <- vixvxmt[is.na(vixvxmt[,2]),1]

xivRets <- Return.calculate(Cl(xiv))
vxxRets <- Return.calculate(Cl(vxx))

getSymbols("^GSPC", from="1990-01-01")
spyRets <- diff(log(Cl(GSPC)))

t1 <- Sys.time()
MARmatrix <- list()
SharpeMatrix <- list()
for(i in 2:21) {
  
  smaMAR <- list()
  smaSharpe <- list()
  for(j in 2:21){
    spyVol <- runSD(spyRets, n=i)
    annSpyVol <- spyVol*100*sqrt(252)
    vols <- merge(vixvxmt[,2], annSpyVol, join='inner')
    
    
    vols$smaDiff <- SMA(vols[,1] - vols[,2], n=j)
    vols$signal <- vols$smaDiff > 0
    vols$signal <- lag(vols$signal, k = 1)
    
    stratRets <- vols$signal*xivRets + (1-vols$signal)*vxxRets
    #charts.PerformanceSummary(stratRets)
    #stratRets[is.na(stratRets)] <- 0
    #plot(log(cumprod(1+stratRets)))
    
    stats <- data.frame(cbind(Return.annualized(stratRets)*100, 
                              maxDrawdown(stratRets)*100, 
                              SharpeRatio.annualized(stratRets)))
    
    colnames(stats) <- c("Annualized Return", "Max Drawdown", "Annualized Sharpe")
    MAR <- as.numeric(stats[1])/as.numeric(stats[2])    
    smaMAR[[j-1]] <- MAR
    smaSharpe[[j-1]] <- stats[,3]
  }
  rm(vols)
  smaMAR <- do.call(c, smaMAR)
  smaSharpe <- do.call(c, smaSharpe)
  MARmatrix[[i-1]] <- smaMAR
  SharpeMatrix[[i-1]] <- smaSharpe
}
t2 <- Sys.time()
print(t2-t1)

Essentially, just wrap the previous script in a nested for loop over the two parameters.

I chose ggplot2 to plot the heatmaps for more control over the coloring.

Here’s the heatmap for the MAR ratio (that is, returns over max drawdown):

library(reshape2)  # melt()
library(ggplot2)

MARmatrix <- do.call(cbind, MARmatrix)
rownames(MARmatrix) <- paste0("SMA", c(2:21))
colnames(MARmatrix) <- paste0("runSD", c(2:21))
MARlong <- melt(MARmatrix)
colnames(MARlong) <- c("SMA", "runSD", "MAR")
MARlong$SMA <- as.numeric(gsub("SMA", "", MARlong$SMA))
MARlong$runSD <- as.numeric(gsub("runSD", "", MARlong$runSD))
MARlong$scaleMAR <- scale(MARlong$MAR)
ggplot(MARlong, aes(x=SMA, y=runSD, fill=scaleMAR))+geom_tile()+scale_fill_gradient2(high="skyblue", mid="blue", low="red")

Here’s the result:

Immediately, we start to see some answers to questions regarding overfitting. First off, is the parameter set published by TradingTheOdds optimized? Yes. In fact, not only is it optimized, it's by far and away the best value on the heatmap. However, when discussing overfitting, curve-fitting, and the like, the question to ask isn't "is this the best parameter set available?" but rather "is the parameter set in a stable region?" The answer to that, in my opinion, is yes, as shown by the range of SMA values that still perform well alongside the 2-day sample standard deviation. Note that because this quantity is a two-day sample standard deviation, it is simply the square root of the sum of the two squared residuals in that window.
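A quick sanity check of that last identity (the two return values below are made up purely for illustration):

x <- c(0.012, -0.005)          # two hypothetical daily returns
sd(x)                          # sample SD with denominator n - 1 = 1
sqrt(sum((x - mean(x))^2))     # square root of the summed squared residuals
abs(diff(x))/sqrt(2)           # equivalent closed form; all three print the same value (~0.01202)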

Here are the MAR values for those configurations:

> MARmatrix[1:10,1]
    SMA2     SMA3     SMA4     SMA5     SMA6     SMA7     SMA8     SMA9    SMA10    SMA11 
2.471094 2.418934 2.067463 3.027450 2.596087 2.209904 2.466055 1.394324 1.860967 1.650588 

In this case, not only is the region stable, but the MAR values are all above 2 until the SMA9 value.

Furthermore, note that aside from the stable region of the 2-day sample standard deviation, a stable region using a standard deviation of ten days with less smoothing from the SMA (because there’s already an average inherent in the sample standard deviation) also exists. Let’s examine those values.

> MARmatrix[2:5, 9:16]
      runSD10  runSD11  runSD12  runSD13  runSD14  runSD15  runSD16   runSD17
SMA3 1.997457 2.035746 1.807391 1.713263 1.803983 1.994437 1.695406 1.0685859
SMA4 2.167992 2.034468 1.692622 1.778265 1.828703 1.752648 1.558279 1.1782665
SMA5 1.504217 1.757291 1.742978 1.963649 1.923729 1.662687 1.248936 1.0837615
SMA6 1.695616 1.978413 2.004710 1.891676 1.497672 1.471754 1.194853 0.9326545

Apparently, a standard deviation between 2 and 3 weeks with minimal SMA smoothing also produced some results comparable to the 2-day variant.

Off to the northeast of the plot, using longer periods for the parameters simply causes the risk-to-reward performance to drop steeply. This is essentially an illustration of the detriments of lag.

Finally, there’s a small rough patch between the two aforementioned stable regions. Here’s the data for that.

> MARmatrix[1:5, 4:8]
       runSD5    runSD6    runSD7   runSD8   runSD9
SMA2 1.928716 1.5825265 1.6624751 1.033216 1.245461
SMA3 1.528882 1.5257165 1.2348663 1.364103 1.510653
SMA4 1.419722 0.9497827 0.8491229 1.227064 1.396193
SMA5 1.023895 1.0630939 1.3632697 1.547222 1.465033
SMA6 1.128575 1.3793244 1.4085513 1.440324 1.964293

As you can see, there are some patches where the MAR is below 1, and many where it’s below 1.5. All of these are pretty detached from the stable regions.

Let’s repeat this process with the Sharpe Ratio heatmap.

SharpeMatrix <- do.call(cbind, SharpeMatrix)
rownames(SharpeMatrix) <- paste0("SMA", c(2:21))
colnames(SharpeMatrix) <- paste0("runSD", c(2:21))
sharpeLong <- melt(SharpeMatrix)
colnames(sharpeLong) <- c("SMA", "runSD", "Sharpe")
sharpeLong$SMA <- as.numeric(gsub("SMA", "", sharpeLong$SMA))
sharpeLong$runSD <- as.numeric(gsub("runSD", "", sharpeLong$runSD))
ggplot(sharpeLong, aes(x=SMA, y=runSD, fill=Sharpe))+geom_tile()+
  scale_fill_gradient2(high="skyblue", mid="blue", low="darkred", midpoint=1.5)

And the result:

Again, the TradingTheOdds parameter configuration lights up, but among a region of strong configurations. This time, we can see that in comparison to the rest of the heatmap, the northern stable region seems to have become clustered around the 10-day standard deviation (or 11) with SMAs of 2, 3, and 4. The regions to the northeast are also more subdued by comparison, with the Sharpe ratio bottoming out around 1.

Let’s look at the numerical values again for the same regions.

Two-day standard deviation region:

> SharpeMatrix[1:10,1]
    SMA2     SMA3     SMA4     SMA5     SMA6     SMA7     SMA8     SMA9    SMA10    SMA11 
1.972256 2.210515 2.243040 2.496178 1.975748 1.965730 1.967022 1.510652 1.963970 1.778401 

Again, these are numbers the likes of which I haven't been able to achieve with more conventional strategies, and which I haven't really seen anywhere for anything on daily data. So either the strategy is fantastic, or something is terribly wrong outside the scope of the parameter optimization.

Two week standard deviation region:

> SharpeMatrix[1:5, 9:16]
      runSD10  runSD11  runSD12  runSD13  runSD14  runSD15  runSD16  runSD17
SMA2 1.902430 1.934403 1.687430 1.725751 1.524354 1.683608 1.719378 1.506361
SMA3 1.749710 1.758602 1.560260 1.580278 1.609211 1.722226 1.535830 1.271252
SMA4 1.915628 1.757037 1.560983 1.585787 1.630961 1.512211 1.433255 1.331697
SMA5 1.684540 1.620641 1.607461 1.752090 1.660533 1.500787 1.359043 1.276761
SMA6 1.735760 1.765137 1.788670 1.687369 1.507831 1.481652 1.318751 1.197707

Again, pretty outstanding numbers.

The rough patch:

> SharpeMatrix[1:5, 4:8]
       runSD5   runSD6   runSD7   runSD8   runSD9
SMA2 1.905192 1.650921 1.667556 1.388061 1.454764
SMA3 1.495310 1.399240 1.378993 1.527004 1.661142
SMA4 1.591010 1.109749 1.041914 1.411985 1.538603
SMA5 1.288419 1.277330 1.555817 1.753903 1.685827
SMA6 1.278301 1.390989 1.569666 1.650900 1.777006

All Sharpe ratios are higher than 1, though some are below 1.5.

So, to conclude this post:

Was the replication using optimized parameters? Yes. However, those optimized parameters were found within a stable (and even strong) region. Furthermore, it isn’t as though the strategy exhibits poor risk-to-return metrics beyond those regions, either. Aside from raising the lookback period on both the moving average and the standard deviation to levels that no longer resemble the original replication, performance was solid to stellar.

Does this necessarily mean that there is nothing wrong with the strategy? No. It could be that the performance is an artifact of optimistic "observe the close, enter at the close" execution assumptions. For instance, quantstrat (the go-to backtest engine in R for more trading-oriented statistics) uses a next-bar execution method that defaults to the *next* day's close (so if you look back over my quantstrat posts, I use prefer="open" so as to get the open of the next bar instead of its close). It could also be that VXMT itself is an instrument that isn't very well known in the public sphere, seeing as how Yahoo Finance barely has any data on it. Lastly, it could simply be that although the risk-to-reward ratios seem amazing, many investors/mutual fund managers/etc. probably don't want to think "I'm down 40-60% from my peak." It is arguably easier to tame a strategy with a good reward-to-risk ratio but excess risk by adding cash (to use a cooking analogy, think of your favorite spice: good in small quantities) than it is to find leverage for a good reward-to-risk strategy with very small returns, not to mention incurring all the other risks that come with leverage in the first place, such as a 50% drawdown wiping out an account leveraged two to one.
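One cheap way to probe the execution-assumption concern, which I should stress is my own back-of-the-envelope check rather than anything from the original study, is to delay the signal by one extra day, so that a signal observed at today's close is not traded until tomorrow's close. A sketch, reusing the objects from the grid-search script above and one arbitrary parameter pair from the explored grid (runSD n = 2, SMA n = 5; swap in whichever pair you care about):

spyVol <- runSD(spyRets, n=2)
annSpyVol <- spyVol*100*sqrt(252)
vols <- merge(vixvxmt[,2], annSpyVol, join='inner')
vols$smaDiff <- SMA(vols[,1] - vols[,2], n=5)
vols$signal <- lag(vols$smaDiff > 0, k=2)   # k=2 instead of k=1: one extra day of execution lag
stratRetsLagged <- vols$signal*xivRets + (1-vols$signal)*vxxRets
SharpeRatio.annualized(stratRetsLagged)     # compare against the corresponding cell of the heatmap above

If the ratios collapse under that one-day delay, the edge lives in the execution assumption rather than in the signal itself.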

However, to address the question of overfitting, through a modified technique from Jaekle and Tomasini (2009), these are the results I found.

Thanks for reading.

Note: I am a freelance consultant in quantitative analysis on topics related to this blog. If you have contract or full time roles available for proprietary research that could benefit from my skills, please contact me through my LinkedIn here.


To leave a comment for the author, please follow the link and comment on his blog: QuantStrat TradeR » R.


R, an Integrated Statistical Programming Environment and GIS


(This article was first published on Robin Lovelace - R, and kindly contributed to R-bloggers)

This article was originally published in Geoinformatics magazine.

R is well known as a powerful, extensible and relatively fast statistical programming language and open software project with a command line interface (CLI). What is less well known is that R also has cutting-edge spatial packages that allow it to behave as a fully featured Geographical Information System in the full sense of the word. In fact, some cutting-edge algorithms for image processing and spatial statistics are implemented in R before they appear in any other widely available software product (Bivand et al. 2013). Sophisticated techniques such as geographically weighted regression and spatial interaction models can be custom built around your spatial data in R. But R also works as a general purpose GIS, with mature functions for performing all established techniques of spatial analysis such as spatial selections, buffers and clipping. What is unique about R is that all these capabilities are found in a single programme: R provides a truly integrated modelling environment.

The advantages and drawbacks of R as a GIS

Despite being able to perform the same operations as dedicated GIS software such as ArcGIS and QGIS, R is fundamentally different in the way that the user interacts with it. Not only are most operations completed by typing (e.g. you type "plot(map1)" to plot the data contained in the map1 object), the visualisation stage is also different. There is no dynamic canvas which can be used to pan and zoom: instead, R only produces visual or other types of output when commanded to do so, using functions such as plot.
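As a tiny illustration (a sketch only: map1, x and y are placeholder names echoing the selection syntax in Table 1 below, not objects from any particular dataset):

    plot(map1)                                               # nothing is drawn until you ask for it
    plot(map1[map1$x > map1$y, ], col = "red", add = TRUE)   # re-plot only a subset, overlaid in red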

Table 1: A summary of the relative merits of R compared with more traditional GIS software.

Attribute | Advantages of R | Drawbacks of R
User interface | Command line interface allows rapid description of workflow and reproducibility | Steep learning curve (eased by RStudio)
Visualising data | Sophisticated and customisable graphics | No dynamic zoomable canvas
Selecting data | Concise and consistent method using square brackets (e.g. "map1[x > y,]") | Difficult to dynamically select objects from map
Manipulating data | Very wide range of functions through additional packages | Only single core processing
Analysing/modelling data | Integrated processing, analysis, and modelling framework | Sometimes more than one solution available

How to get started with spatial data in R

Here is not the place to go into the details of spatial data analysis in R. Instead, we provide a number of resources that will enable rapidly getting up to speed with both R and its spatial packages. Note that when we say "packages" for R, we are referring to specific add-ons, analogous to extensions in ArcGIS and QGIS. The range of add-ons is truly vast, as can be seen on the R website: http://cran.r-project.org/web/views/Spatial.html. This diversity can be daunting and it is certainly frustrating when you know that a problem can be solved in several different ways: R teaches you to think about your data and analysis carefully. The most commonly used packages for spatial analysis are probably sp (the basis of spatial functionality in R), rgdal (for loading spatial file formats such as shapefiles) and rgeos (for spatial analysis). Each is installed and loaded in the same way: rgeos, for example, is installed and loaded by typing:


    install.packages("rgeos")
    library(rgeos)

An introductory tutorial on R as a GIS is the working paper “Introduction to visualising spatial data in R” (Lovelace and Cheshire, 2014). This document, along with sample code and data, is available free online. It contains links to many other spatial R resources, so is a good starting point for further explorations of R as a GIS.

R in action as a GIS

To demonstrate where R's flexibility comes into its own, imagine you have a large number of points that you would like to analyse. These are heavily clustered in a few areas, and you would like to a) know where these clusters are; b) remove the points in these clusters and create single points in their place, which summarise the information contained in the clustered points; and c) visualise the output.

[Figure: clustering results, showing the point density surface with the derived cluster zones filled in red]

We will not actually run through all the steps needed to do this in R. Suffice to know that it is possible in R and very difficult to do in other GIS packages such as QGIS, ArcGIS or even PostGIS (another command line GIS that is based on the database language Postgres, a type of SQL): I was asked to tackle this problem by the Spanish GIS service SIGTE after other solutions had been tried.

All of the steps needed to solve the problem, including provision of example data, are provided online. Here I provide an overview of the processes involved and some of the key functions to provide insight into the R way of thinking about spatial data.

First the data must be loaded and converted into yet another spatial data class. This is done using the readOGR function from the rgdal package mentioned above and then using the command as(SpatialPoints(stations), "ppp") to convert the spatial object stations into the ppp class from the spatstat package.

Next the data is converted into a density raster. The value of each pixel corresponds to the interpolated density of points in that area. This is visualised using the plot and contour functions to ensure that the conversion has worked properly.

The raster image is converted into smooth lines using the contourLines function. The lines, one for each cluster zone, must then be converted into polygons using the command gPolygonize(SLDF[5, ]). gPolygonize is a very useful function from the rgeos package which automates the conversion of lines into polygons.
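For readers who want to see those steps together, here is a compressed sketch. The layer name, the bandwidth passed to density() and the contour level picked are illustrative guesses, and the lnd boundary layer plotted further below comes from the example data in the online materials:

    library(rgdal)     # readOGR()
    library(spatstat)  # ppp class and density()
    library(maptools)  # sp <-> spatstat coercion and ContourLines2SLDF()
    library(rgeos)     # gPolygonize()

    stations <- readOGR(dsn = ".", layer = "stations")       # hypothetical point layer
    sSp <- as(SpatialPoints(stations), "ppp")                 # step 1: coerce to spatstat's ppp class
    Dens <- density(sSp, adjust = 0.2)                        # step 2: kernel density raster
    plot(Dens); contour(Dens, add = TRUE)                     # quick check of the surface

    Dsg  <- as(Dens, "SpatialGridDataFrame")                  # back to an sp grid
    Dcl  <- contourLines(as.image.SpatialGridDataFrame(Dsg))  # step 3a: trace contour lines
    SLDF <- ContourLines2SLDF(Dcl)                            # step 3b: lines as an sp object
    cAg  <- gPolygonize(SLDF[5, ])                            # step 3c: one contour level -> polygons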

The results are plotted using the rather lengthy set of commands shown below. This results in the figure displayed above (notice the red argument creates the red fill of the zones):


    plot(Dens, main = "")
    plot(lnd, border = "grey", lwd = 2, add = T)
    plot(SLDF, col = terrain.colors(8), add = T)
    plot(cAg, col = "red", border = "white", add = T)
    graphics::text(coordinates(cAg) + 1000, labels = cAg$CODE)

Finally the points inside the cluster polygons are extracted using R’s very concise spatial subsetting syntax:


    sIn <- stations[cAg, ]  # select the stations inside the clusters

The code above, translated into English, means "take all the station points within the cluster polygon called cAg and save them as a new object called sIn".

Conclusion

The purpose of this article has been to introduce the idea that GIS software can take many forms, including the rather unusual command line interface of the statistical programming language R. Hopefully any preconceptions about poor performance have been dispelled by the examples of converting text into spatial objects and of clustering points. Both examples would be difficult to undertake in more traditional GIS software packages. R has a steep learning curve and, for most users, should probably be seen more as a powerful tool to use in harmony with other GIS packages for dealing with particularly tricky or unconventional tasks than as a standalone GIS package in its own right. R interfaces to QGIS and ArcGIS should make this easier, although these solutions are not yet mature. In addition to new functionality, R should also provide a new way of thinking about spatial data for many GIS users (Lovelace and Cheshire, 2014). Yes, it has a steep learning curve, but it's a fun curve to be on, whether you are at the beginning of the ascent or in the near-vertical phase of the exponential function!

References

Bivand, R. S., Pebesma, E. J., & Gómez-Rubio, V. (2013). Applied spatial data analysis with R (Vol. 747248717). Springer.

Lovelace, R., & Cheshire, J. (2014). Introduction to visualising spatial data in R. National Centre for Research Methods, 1403. Retrieved from http://eprints.ncrm.ac.uk/3295/

Wickham, H. (2014). Tidy data. The Journal of Statistical Software, 14(5). Retrieved from http://vita.had.co.nz/papers/tidy-data.html

To leave a comment for the author, please follow the link and comment on his blog: Robin Lovelace - R.


Subjective Ways of Cutting a Continuous Variable


(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

You have probably seen @coulmont's maps. If you haven't, you should probably go and spend some time on his blog (but please, come back afterwards, I still have my story to tell you). Consider for instance the maps we obtained for a post published in Monkey Cage a few months ago.

The codes were discussed on a blog post (I spent some time on the econometric model, not really on the map, by that time).

My mentor in cartography, Reka (aka @visionscarto) taught me that maps were always subjective. And indeed.

Consider the population below 24 years old, in Paris. Or, to be more specific, the proportion of the population below 24 in each quartier (neighbourhood).

> Young=((df$POP0017+df$POP1824)/df$POP)*100

There is a nice package to cut properly a continuous variable

> library(classInt)

And there are many possible options. Breaks can be at equal distances,

> class_e=classIntervals(Young,7,style="equal")

or based on quantiles (here probabilities are at equal distances)

> class_q=classIntervals(Young,7,style="quantile")

So, what could be the impact on a map? Here, we consider a gradient of colors, with 200 values

> library(RColorBrewer)
> plotclr=colorRampPalette(brewer.pal(7, "RdYlBu")[7:1])(200)

With the so-called "equal" option (which divides the range of the variable into 200 parts), we have the breaks on the right of the legend. With the "quantile" option (where quantiles are obtained for various probabilities; here we divide the range of probabilities into 200 parts), we have the breaks on the left of the legend. If we get back to the graph of the cumulative distribution function, above, in the first case we equally split the range of the variable (on the x-axis), while in the second case we equally split the range of the probability (on the y-axis).
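Before mapping anything, a quick way to see how different the two classifications are (a small check I add here, using the 7-class objects created above) is to print the breaks and overlay them on the empirical cumulative distribution function,

> round(class_e$brks, 1)   # "equal": breaks evenly spaced along the range of Young
> round(class_q$brks, 1)   # "quantile": each class holds roughly the same number of quartiers
> plot(ecdf(Young), main = "")
> abline(v = class_e$brks, col = "blue")
> abline(v = class_q$brks, col = "red")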

Breaks are very different with those two techniques. Now, if we try to visualize where the young population is located, on a map, we use the following code

> colcode=findColours(class_e, plotclr)	
> plot(paris,col=colcode,border=colcode)

Here, with the equal option, we have the following map,

while with the quantile option, we get

> colcode=findColours(class_q, plotclr)	
> plot(paris,col=colcode,border=colcode)

Those two maps are based on the same data. But I have the feeling that they do tell different stories...

To leave a comment for the author, please follow the link and comment on his blog: Freakonometrics » R-english.
