Wednesday, June 22, 2016

It's Not Me. It's You.

A few years ago, I was asked to develop an app for the following task: let attendees of a party pick up to five other guests with whom they would like to have the last dance. An hour before the event, the app would compute matches (i.e., attendees who picked each other) and notify each guest of the number of matches obtained---without disclosing the identity of the matches. (Even if we had wanted to disclose the matches, we would not have been able to: the data going through the app was anonymized.) Also, attendees could voluntarily opt out of voting. 

It is worth mentioning three important aspects about the attendees: (a) their significant others were not invited to the party; (b) they could pick anyone from the list of attendees; and (c) they knew that "last dance" meant much more than holding each other for 3 minutes. We can thus safely assume that most guests cast their votes based on how attracted they were to their picks. In this blog post, we set the details of the app aside and, instead, try to understand the value in the data that we collected.

Everything has beauty, but not everyone sees it.
Confucius


In the remainder of this blog post, we analyze several interesting aspects of the data. We also invite readers to explore the data and share their observations.

A Bird's-Eye View


Let's start with a summary of the data:

  • Number of Attendees: 240
  • Percentage of Picks (by Position): 55% (1) - 49.1% (2) - 43.7% (3) - 37.5% (4) - 34.1% (5) (see the sketch below)
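A minimal sketch of how these percentages can be derived, assuming that dance.csv has one row per pick with voter, voted, and Position columns (as in the code further below) and treating each percentage as the fraction of the 240 attendees who cast a pick at that position:

#Load data and compute the fraction of attendees picking at each position
dance <- read.csv("[path_to_file]/dance.csv")
n.attendees <- 240
round(100 * table(dance$Position) / n.attendees, 1)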

Observation 1: A significant portion of the attendees apparently decided not to vote. This could be for a number of reasons; two likely ones are that they were in relationships or simply not interested in participating. However, let us focus on those who did participate: they consistently picked more than one person.

Distribution of Votes


Our next step is to understand how the votes above are distributed across attendees. For this, we produce a histogram of the number of votes that each attendee received. The R code used to process the data and, in turn, produce the histogram is as follows:

require(ggplot2)
#Load data
dance <- read.csv("[path_to_file]/dance.csv")
#Produce count of votes per person
counts <- as.data.frame(table(dance$voted))
#Add voters that received no votes
None <- as.data.frame(setdiff(dance$voter, dance$voted))
colnames(None) <- c("Var1")
None$Freq <- 0
#Combine tables
counts <- rbind(counts,None)
#Produce plot
ggplot(counts, aes(x=factor(Freq))) + 
    geom_bar(fill="grey", color="grey50") + 
    xlab("Number of votes") +
    ylab("Frequency") +
    ggtitle("Histogram of attractiveness") +
    theme_bw()

Note that we also account for the guests who cast a vote but did not receive any votes themselves (they appear in the zero bin). The histogram is as follows:

Figure 1: Histogram of the number of votes each guest obtained.


Observation 2: Many guests were picked by multiple people (see Number of votes >= 2 on the x-axis of Figure 1). (One of them even got picked by 36 different people!) We could say that these guests with multiple votes are rather popular and generally attractive. Nonetheless, the vast majority of guests got just one vote. This carries a hopeful message: somewhere, someone is choosing you!

I could have concluded this post with the line above and left a very inspiring message. However, data is also inspiring and there is much more we can learn from it.

Imagine Me and You, I do (Listen)


Our next concern is the distribution of votes by position. In other words, we want to know the number of guests who got N votes in position P, for all positions. This could potentially reinforce our message above if we saw a large number of guests being the top pick (P=1) of exactly one person (N=1). For completeness, we show this for all positions (i.e., P=[1..5]). The R code to plot this information is as follows:

#Count votes per guest and position
pos.counts <- as.data.frame(table(dance$voted, dance$Position))
colnames(pos.counts) <- c("voted", "Position", "Freq")
pos.counts <- pos.counts[pos.counts$Freq > 0, ]
#Produce one histogram per position
ggplot(pos.counts, aes(x=Freq, fill=as.factor(Position))) + 
    geom_histogram(color="black", position="identity", binwidth=1, boundary = -0.5) + 
    facet_grid(Position ~ ., labeller=label_both) + 
    scale_x_continuous(breaks=seq(1, max(pos.counts$Freq), by = 1)) + 
    theme_bw() + theme(legend.position="none") + 
    xlab("Number of votes") + ylab("Frequency") + ggtitle("Distribution of votes")

The histogram that we obtain is shown in Figure 2.

Figure 2: Histogram of votes by position.


Observation 3: The most notable result from the plot above is that we do not see 36 votes (or anything close) in any particular position. Furthermore, the highest number of votes for a guest in the first position (i.e., Position: 1) was 8 (see the top plot in Figure 2). This tells us that popular, generally attractive people are often not the top pick.


Is There a Match? 


Although the observations discussed so far are intriguing and worth exploring further, let's bring the focus back to the original motivation of the app: were there any matches? We start by looking at the distribution of the number of matches (see Table 1).

Matches    0    1    2    3    4    5
Guests   214   22    2    2    0    0
Table 1: Distribution of number of matches.
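For readers following along with the data, here is a minimal sketch (not the app's actual code) of how these match counts can be derived from the dance data frame loaded above; a pick counts as a match when the reverse pick also exists, and guests who did not vote have zero matches by definition:

#Flag picks whose reverse pick also exists
picks <- paste(dance$voter, dance$voted)
reversed <- paste(dance$voted, dance$voter)
dance$match <- picks %in% reversed
#Number of matches per voter and their distribution
matches.per.voter <- tapply(dance$match, dance$voter, sum)
table(matches.per.voter)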

The results above are very disappointing: only 26 guests (22 + 2 + 2 with 1, 2, and 3 matches, respectively), or about 10% of the total number of guests, got at least one match. We can still think positively and argue that a large fraction of the "zero-match" outcomes is explained by the rather low participation in the app.

To validate the intuition above, we plot a directed graph of picks, defined as follows: each node represents a guest and each (directed) edge indicates a pick from the voter to the voted guest. Furthermore, the size of each node is proportional to the number of votes or, in other words, the number of incoming edges. The following code produces the graph in Figure 3. A high-definition version of the plot is available for download here in PDF format.

require(igraph)
#Load the edge list (one row per pick: voter, voted)
dance.edge.list <- read.csv("[path_to_file]/dance-edge-list.csv", header=FALSE)
xlist <- graph_from_data_frame(dance.edge.list, directed=TRUE)
#Node size proportional to the number of votes received (incoming edges)
V(xlist)$size <- degree(xlist, mode="in")/5
plot(xlist, vertex.label=NA, edge.arrow.size=.2, edge.curved=.1, 
     layout=layout_with_fr)


The graph is as follows:

Figure 3: Graph of picks. Arrows indicate direction of pick.

It is easy to spot---by the size of the nodes---the guests who are popular and generally attractive. Interestingly, one can expect these (popular) guests to be surrounded by a large number of other guests, which makes our plot a rough preview of how attendees will be distributed across the dance floor.

Observation 4: In any case, these are far from the main reasons why we plotted the data as a graph in the first place. As we intuited above, a large fraction of the voters picked guests who did not participate in the vote (see the nodes on the perimeter of the graph with no outgoing edges). That being said, we can remain positive! It is not personal at all!

The analysis that I have performed above is by no means comprehensive. In fact, I would love to see what other observations you can extract from this valuable data. I have uploaded the anonymized data (dance.csv and dance-edge-list.csv), so feel free to download it and start experimenting with it. Once you find something worth sharing, go ahead and post it in the comment section below!

Observation 5 (Bonus Track): I do not want to end this post without pointing out something in Figure 3 that made me smile. If you look at the far left of the graph, there are two guests who have unequivocally picked each other. I do not know if they planned it like this; I do not know if it is 100% chance; what I do know is that these two guests enjoyed the party very, very much.



Saturday, February 27, 2016

It's All About the Data

There are many fascinating and intriguing hypotheses that eventually we would like to accept or refute. A notable example is Einstein's prediction on the existence of gravitational waves, the last puzzle piece in his general theory of relativity, which was recently confirmed by researchers from the Laser Interferometer Gravitational-Wave Observatory (LIGO). Other perhaps more approachable examples include "educated guesses" on what type of customers are more likely to buy certain products or whether we can accurately estimate arrival time to destinations. Likewise, there are diverse, powerful methods that we can adopt to accept or refute such hypotheses. To confirm the existence of gravitational waves, for instance, the LIGO researchers had to detect microscopic variations in laser beams. For other scenarios, methods such as statistical analysis, machine learning, or deterministic algorithms may suffice. However, the fundamental value of accepting or refuting a hypothesis lies in the data and how we use it.

Reality does not exist until it is measured
Physicists at The Australian National University (ANU)

Get Data, But Do It Well

Data collection is one of the most critical steps when conducting experiments or evaluations, for two main reasons. First, we must collect data that covers all relevant aspects of the hypothesis of interest. To predict the (type of) customers that are more likely to buy picture frames, for example, we need information about all types of customers as well as the products that they have bought and when. Second, we must collect representative data. For our customer prediction problem, for instance, we need data from different days, months, and years, since certain purchases may be seasonal. As an example, see the following plot for interest in winter jackets over time:

The peaks clearly indicate that winter jackets are "hot" only during winter.
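The original figure was produced elsewhere, but a hedged sketch of how a similar interest-over-time curve could be obtained in R is shown below, using the gtrendsR package to query Google Trends (an assumption on my part about the data source; the query requires network access):

require(gtrendsR)
require(ggplot2)
#Query Google Trends for the search term (network access required)
trends <- gtrends(keyword = "winter jacket")
#interest_over_time reports a 0-100 search-interest score per time period
ggplot(trends$interest_over_time, aes(x = date, y = as.numeric(hits))) +
    geom_line() +
    xlab("Date") + ylab("Search interest") +
    ggtitle("Interest in winter jackets over time") +
    theme_bw()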

We will often find that the data we are looking for has already been collected by someone out there. Other times, though, we will have to do it ourselves.

It's Our Lucky Day: Someone Did The (Dirty) Job For Us

Fortunately, a large number of datasets have been collected in a principled manner and made available for research and experimentation. The next list shows a few popular datasets:

Some researchers and bloggers have even compiled rather extensive lists of datasets, to make it easier for others to find data for their problems of interest. In these—and other—compilations, datasets are characterized by domain (e.g., biology, elections), size (e.g., small, medium, big), cleanliness (e.g., raw, normalized), type (e.g., structured, unstructured), problem (e.g., part-of-speech tagging, regression), and supervision (e.g., labeled, unlabeled), to name a few dimensions. Among the many lists available on the Web, the next three lists, which roughly characterize datasets by domain, are a great start when looking for data: 1001 Datasets, Awesome Datasets, and Kevin Chai's List.

Beyond the above lists—and the datasets therein—many, many more datasets can be found on the Web, albeit with a little bit of (extra) work. Conferences and research competitions often release datasets for specific problems. For instance, the Live Question Answering track in the Text REtrieval Conference (TREC) provides natural language questions and answers with judgements. Companies release datasets to have highly qualified researchers deliver solutions to timely problems. One of the arguably most relevant examples is the Netflix Prize, which was launched in 2006 to seek solutions for the well-known movie recommendation problem. Its dataset continues to attract researchers these days as well. Governments release datasets too, but for other reasons (e.g., transparency, compliance with regulations). A notable example is the DATA.gov effort, which offers over 170,000 datasets to date. Finally, researchers often release the datasets collected for their papers, to enable reproducibility. For example, Eamonn Keogh and his group routinely make their time-series datasets available.

Not Too Fast, Though: It's Time To Get Hands-On

However, and as expected, we will often need to generate the datasets ourselves (e.g., because existing datasets do not cover aspects of the problem of interest or because the data simply does not exist); it is then helpful to know what to do when that moment arrives. As a first step, I suggest doing a fresh online search or checking specific discussion threads (e.g., ongoing threads on Quora and Reddit). It is very likely that, from the moment you started studying the problem to the day you want to test your hypotheses, someone produced some relevant data. If you are still empty-handed after this first step, you will have to create the data yourself.

Data collection strategies deserve a post of their own, which I would eventually like to write. Nevertheless, I recommend two useful directions in the meantime. First, take advantage of available Web crawlers (e.g., Apache Nutch) and crawl websites relevant to your problem; you can then scrape their content from your offline, local copy of the websites. Second, see if the data you need can be obtained through Application Programming Interfaces (APIs). For example, the Twitter API lets you (almost) painlessly access the Twitter firehose. Other interesting APIs include the StreetEasy API—for apartment rentals and sales—and the 311 Content API—for government information and non-emergency services. Whatever you do when creating your very own dataset, though, make it public afterwards (e.g., on your personal website, in the discussion threads mentioned above, as a comment on this post, or in DataMarket). Others will find immeasurable value in it and will greatly appreciate it! After all, don't forget: it's all about the data...
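As a rough illustration of the API route, the following R sketch pulls JSON records from a hypothetical REST endpoint using the httr and jsonlite packages and stores them locally; the URL and query parameters are placeholders, and real APIs such as Twitter's additionally require authentication:

require(httr)
require(jsonlite)
#Hypothetical endpoint; replace with the API you are actually targeting
url <- "https://api.example.com/v1/records"
#Issue the GET request (most real APIs also need an API key or OAuth token)
response <- GET(url, query = list(limit = 100))
#Parse the JSON body into an R data frame
records <- fromJSON(content(response, as = "text", encoding = "UTF-8"))
#Save a local copy so the dataset can be shared later
write.csv(records, "records.csv", row.names = FALSE)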

Tuesday, January 26, 2016

A Blog as a Means to Stay Updated

Hi all,

I should probably have said "Hello World!" or something of the kind. However, I want to be honest with you and with myself. I started this blog to force myself to stay updated on the topics that I am (very) interested in—information extraction, information retrieval, natural language processing, machine learning, education, soccer, music, to name a few, and any combination thereof. At the same time, I hope that my posts are of interest to you, online readers, so that we engage in discussion and, hopefully, collaboration.

I have already outlined a few posts that I would like to write to get started. I am not going to list them here because they will become an unnecessary commitment; instead, I will briefly mention what type of posts I will be writing:

  • Self-contained post: The post will discuss a topic from beginning to end. No additional material (e.g., in the form of previous posts) will be required. 
  • Series-like post: The post will start, continue, or end a series of posts on a specific topic. I foresee a series-like post to be a well-defined building block of a rather long, research-oriented project.

I hope that sooner rather than later I get my first official post out, so that we get the conversation started!

Best,

Pablo.