Wednesday, June 22, 2016

It's Not Me. It's You.

A few years ago, I was asked to develop an app for the following task: Let attendees of a party pick up to five other guests with whom they would like to have the last dance. An hour before the event, the app would compute matches (i.e., pairs of attendees who picked each other) and notify each guest of the number of matches obtained, without disclosing the identity of any match. (Even if we had wanted to disclose the matches, we could not have done so: the data going through the app was anonymized.) Also, attendees could voluntarily opt out of voting.

It is worth mentioning three important aspects about attendees: (a) their significant others were not invited to the party; (b) they could pick anyone from the list of attendees; and (c) they knew that "last dance" meant much more than holding each other for 3 minutes. We can thus safely assume that most guests cast their votes based on how attracted they were to their picks. In this blog post, we leave the details of the app aside and, instead, try to understand the value in the data that we collected.

Everything has beauty, but not everyone sees it.

In the remainder of this blog post, we analyze several interesting aspects of the data. We also invite readers to explore the data and submit their observations.

A Bird's-Eye View

Let's start with a summary of the data:

  • Number of Attendees: 240
  • Percentage of Picks (by Position): 55% (1) - 49.1% (2) - 43.7% (3) - 37.5% (4) - 34.1% (5) 

Observation 1: A significant portion of the attendees (about 45%, since only 55% made a first pick) apparently decided not to vote. There could be a number of reasons for this; for instance, they may have been in relationships or simply not interested in participating. However, let us focus on those who did participate: they consistently picked more than one person.
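The percentages in the summary can be turned into rough head counts and drop-off rates. The following is a small sketch in R, not part of the original analysis. It assumes that each percentage is the share of all 240 attendees who filled that position, and that nobody filled a later position without filling the earlier ones; neither assumption is stated in the summary itself.

```r
# Share of attendees who filled each pick position (from the summary above)
attendees <- 240
pct_by_position <- c(55, 49.1, 43.7, 37.5, 34.1)

# Approximate head counts per position (rounded, since the percentages are rounded)
approx_counts <- round(attendees * pct_by_position / 100)

# Share of attendees who made no pick at all
pct_no_vote <- 100 - pct_by_position[1]

# Among those who filled position k, the share who also filled position k + 1
continuation <- pct_by_position[-1] / pct_by_position[-length(pct_by_position)]

print(approx_counts)
print(pct_no_vote)
print(round(continuation, 2))
```

Under these assumptions, every continuation rate stays above 85%, which is what "consistently picked more than one person" amounts to in numbers.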

Distribution of Votes

Our next step is to understand how the votes above are distributed across attendees. For this, we produce a histogram over the number of votes that each attendee got. We include the R code used to process the data and, in turn, produce the histogram:

# Load ggplot2 for plotting
library(ggplot2)
# Load data
dance <- read.csv("[path_to_file]/dance.csv")
# Produce count of votes per person (columns Var1 and Freq)
counts <- data.frame(table(dance$voted))
# Find voters who received no votes
None <- data.frame(setdiff(dance$voter, dance$voted))
colnames(None) <- c("Var1")
None$Freq <- 0
# Combine tables
counts <- rbind(counts, None)
# Produce plot (geom_bar, since the x aesthetic is a discrete factor)
ggplot(data.frame(counts), aes(x=factor(Freq))) + 
    geom_bar(fill="grey", color="grey50") + 
    xlab("Number of votes") +
    ylab("Frequency") +
    ggtitle("Histogram of attractiveness")

Note that we account both for people who received votes without casting any themselves and for people who cast votes but received none (the latter appear with zero votes). The histogram is as follows:

Figure 1: Histogram over count of votes each person obtained.

Observation 2: Many guests received votes from multiple people (see Number of votes >= 2 on the X axis in Figure 1). (One of them was even picked by 36 different people!) We could say that guests with multiple votes are rather popular and generally attractive. Nonetheless, the vast majority of guests got one vote. This sends a hopeful message: Somewhere, someone is choosing you!

I could have concluded this post with the line above and left a very inspiring message. However, data is also inspiring and there is much more we can learn from it.

Imagine Me and You, I do (Listen)

Our next concern is the distribution of votes by position. In other words, we want to know the number of guests that got N votes in position P, for all positions. This could reinforce our message above if we saw a large number of guests being the top pick (P=1) of a single person (N=1). For completeness, we show this for all positions (i.e., P=[1..5]). The R code to plot this information is as follows:

ggplot(dance, aes(x=Freq, fill=as.factor(Position))) + 
    geom_histogram(color="black", position="identity", binwidth=1, boundary = -0.5) + 
    facet_grid(Position ~ ., labeller="label_both") + 
    scale_x_continuous(breaks=seq(1, max(dance$Freq), by = 1)) + 
    theme_bw() + theme(legend.position="none") + 
    xlab("Number of votes") + ylab("Frequency") + ggtitle("Distribution of votes")

The histogram that we obtain is shown in Figure 2.

Figure 2: Histogram of votes by position.

Observation 3: The most notable result from the plot above is that we do not see 36 votes (or even close) in any particular position. Furthermore, the largest number of votes for a guest in the first position (i.e., Position:1) was 8 (see top plot in Figure 2). This tells us that popular, generally attractive people are often not the top pick.
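The comparison behind this observation, total votes per guest versus the most votes that guest received in any single position, can be sketched in a few lines of R. The toy data and the column names voter, voted, and position below are assumptions for illustration only; they are not taken from dance.csv.

```r
# Toy picks; the column names voter, voted and position are assumptions
picks <- data.frame(
  voter    = c("a", "a", "b", "b", "c", "c", "d"),
  voted    = c("x", "y", "x", "y", "y", "x", "y"),
  position = c(1, 2, 2, 1, 3, 4, 5)
)

# Votes each guest received in each position (one row per guest and position)
by_position <- as.data.frame(table(voted = picks$voted, Position = picks$position),
                             stringsAsFactors = FALSE)
by_position <- by_position[by_position$Freq > 0, ]

# Total votes per guest versus the most votes that guest got in any one position
total_votes <- tapply(by_position$Freq, by_position$voted, sum)
max_in_one_position <- tapply(by_position$Freq, by_position$voted, max)

print(total_votes)
print(max_in_one_position)
```

A guest whose total is much larger than their per-position maximum is popular overall without dominating any single position, which is the pattern described above.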

Is There a Match? 

Although the observations discussed so far are intriguing and worth discussing further, let's bring the focus back to the original motivation of the app: Were there any matches? We start by looking at the distribution of the number of matches (see Table 1).

Matches    0    1    2    3    4    5
Guests   214   22    2    2    0    0
Table 1: Distribution of number of matches.

The results above are very disappointing: Only 26 guests (22 + 2 + 2 with 1, 2, and 3 matches respectively), or about 11% of the 240 guests, got at least one match. We can still think positively and argue that a large fraction of the "zero-match" guests were due to the rather low participation in the app.
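For readers who want to reproduce a table like Table 1, the following is a minimal sketch of how matches could be counted in R. It is not the app's actual code. It uses toy data, borrows the column names voter and voted from the code earlier in this post, and assumes that no voter picked the same guest twice and that guest identifiers do not contain the string "->".

```r
# Toy picks; column names voter and voted follow the code earlier in the post
picks <- data.frame(
  voter = c("a", "b", "a", "c", "d", "c"),
  voted = c("b", "a", "c", "a", "a", "e"),
  stringsAsFactors = FALSE
)

# A pick is a match when the reverse pick also exists
pick_key    <- paste(picks$voter, picks$voted, sep = "->")
reverse_key <- paste(picks$voted, picks$voter, sep = "->")
picks$matched <- reverse_key %in% pick_key

# Number of matches per guest, including guests with none
guests <- sort(unique(c(picks$voter, picks$voted)))
matches_per_guest <- sapply(guests, function(g) sum(picks$matched[picks$voter == g]))

# Distribution in the style of Table 1
match_distribution <- table(factor(matches_per_guest, levels = 0:5))
print(match_distribution)
```

In this toy example, guest "a" has two matches, "b" and "c" have one each, and "d" and "e" have none.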

To validate the intuition above, we plot a directed graph of picks, defined as follows: Each node represents a guest and each (directed) edge indicates a pick from the voter to the voted guest. Furthermore, the size of each node is proportional to the number of votes or, in other words, the number of incoming edges. The following code produces the graph in Figure 3. A high-definition version of the plot is available for download here in PDF format.

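library(igraph)
dance.edge.list <- read.csv("[path_to_file]/dance-edge-list.csv", header=FALSE)
# Build a directed graph with one edge per pick (voter -> voted)
xlist <- graph_from_data_frame(dance.edge.list, directed=TRUE)
# degree() counts incoming and outgoing edges by default; mode="in" would count received votes only
V(xlist)$size <- degree(xlist)/5
plot.igraph(xlist, vertex.label=NA, edge.arrow.size=.2, edge.curved=.1)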

The graph is as follows:

Figure 3: Graph of picks. Arrows indicate direction of pick.

The size of the nodes makes it easy to spot the guests who are popular and generally attractive. Interestingly, one can assume that these (popular) guests will be surrounded by a large number of guests, which makes our plot representative of how attendees will be distributed across the dance floor.

Observation 4: In any case, these are far from being the main reasons why we plotted the data as a graph in the first place. As we had intuited above, a large fraction of the voters picked guests who did not participate in the survey (see the nodes on the perimeter of the graph with no outgoing edges). That being said, we can remain positive! This is not at all personal!
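The same point can be checked numerically rather than visually. The sketch below uses a toy edge list, with assumed column names voter and voted, to compute the share of picks aimed at guests who never voted; such picks can never become matches. It illustrates the idea and does not report a result from the real data.

```r
# Toy edge list: each row is one pick from voter to voted;
# the column names here are assumptions
edges <- data.frame(
  voter = c("a", "a", "b", "c", "c"),
  voted = c("b", "x", "a", "x", "y"),
  stringsAsFactors = FALSE
)

# Guests who cast at least one vote
participants <- unique(edges$voter)

# Picks whose target never voted can never turn into a match
to_non_participant <- !(edges$voted %in% participants)
share_unmatchable <- mean(to_non_participant)

print(share_unmatchable)
```

Here three of the five picks point at guests ("x" and "y") who never voted, so they had no chance of being reciprocated.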

The analysis that I have performed above is by no means comprehensive. In fact, I would love to see other observations that you can extract from this valuable data. I have uploaded the anonymized data (dance.csv and dance-edge-list.csv), so feel free to download it and start experimenting with it. Once you find something worth sharing, go ahead and post it in the comment section below!

Observation 5 (Bonus Track): I do not want to end this post without pointing out something in Figure 3 that made me smile. If you look at the far left of the graph, there are two guests who have unequivocally picked each other. I do not know if they planned it like this; I do not know if this is 100% chance; what I do know is that these two guests enjoyed the party very, very much.

