A few years ago, I was asked to develop an app for the following task: Let attendees of a party pick up to five other guests with whom they would like to have the last dance. An hour before the event, the app would compute *matches* (i.e., attendees that picked each other) and notify each guest of the number of matches obtained, without disclosing the identity of the match. (Even if we had wanted to disclose the match, we would not have been able to do it. The data going through the app was anonymized.) Attendees could also opt out of voting.

It is worth mentioning three important aspects about attendees: (a) Their significant others were not invited to the party; (b) they could pick anyone from the list of attendees; and (c) they knew that "last dance" meant much more than holding each other for 3 minutes. We can thus safely assume that most guests cast their votes based on how attracted they were to their picks. In this blog post, we leave the details of the app aside and, instead, try to understand the value in the data that we collected.

Everything has beauty, but not everyone sees it.

Confucius

In the remainder of this blog post, we analyze several interesting aspects of the data. We also invite readers to explore the data and submit their observations.

### A Bird's-Eye View

Let's start with a summary of the data:

- Number of Attendees: **240**
- Percentage of Picks (by Position): **55%** (1) - **49.1%** (2) - **43.7%** (3) - **37.5%** (4) - **34.1%** (5)

*Observation 1:* A significant portion of the attendees apparently decided not to vote. This could be for a number of reasons; for example, they were in relationships or simply not interested in participating. However, let us focus on those who did participate: They **consistently picked more than one person**.

### Distribution of Votes

Our next step is to understand how the votes above are distributed across attendees. For this, we produce a histogram over the number of votes that each attendee got. We include the R code used to process the data and, in turn, produce the histogram:

```r
library(ggplot2)

# Load data
dance <- read.csv("[path_to_file]/dance.csv")

# Count the votes received by each person
counts <- as.data.frame(table(dance$voted))

# Add voters who received no votes, with a count of zero
None <- data.frame(Var1 = setdiff(dance$voter, dance$voted))
None$Freq <- 0

# Combine tables
counts <- rbind(counts, None)

# Produce plot (geom_bar, because the x variable is a discrete factor)
ggplot(counts, aes(x = factor(Freq))) +
  geom_bar(fill = "grey", color = "grey50") +
  xlab("Number of votes") + ylab("Frequency") +
  ggtitle("Histogram of attractiveness") +
  theme_bw()
```

Note that we account for the people who received votes but did not cast any themselves. The histogram is as follows:

*Observation 2:* Many guests were voted for by multiple people (see Number of votes >= 2 on the *X* axis in Figure 1). (One of them even got picked by 36 different people!) We could say that these guests with multiple votes are rather popular and generally attractive. Nonetheless, **the vast majority of guests got one vote**. This sends a hopeful message: Somewhere, someone is choosing you!

I could have concluded this post with the line above and left a very inspiring message. However, data is also inspiring and there is much more we can learn from it.

### Imagine Me and You, I do (Listen)

Our next concern is the distribution of votes by position. In other words, we want to know the number of guests that got *N* votes in position *P*, for all positions. This could strengthen our message above if we saw a large number of guests being the top pick (*P=1*) of a single person (*N=1*). We show this for all positions (i.e., *P=[1..5]*), for completeness. The R code to plot this information is as follows:

```r
# Assumes `dance` holds one row per guest and position,
# with the number of votes received in the `Freq` column
ggplot(dance, aes(x = Freq, fill = as.factor(Position))) +
  geom_histogram(color = "black", position = "identity",
                 binwidth = 1, boundary = -0.5) +
  facet_grid(Position ~ ., labeller = label_both) +
  scale_x_continuous(breaks = seq(1, max(dance$Freq), by = 1)) +
  theme_bw() +
  theme(legend.position = "none") +
  xlab("Number of votes") + ylab("Frequency") +
  ggtitle("Distribution of votes")
```

The histogram that we obtain is shown in Figure 2.

Figure 2: Histogram of votes by position.
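The plotting code for Figure 2 expects a table with one row per guest and position, with the number of votes in a `Freq` column; the post does not show how that table was built. A minimal sketch of one way to derive it, assuming the raw picks carry `voted` and `Position` columns (the toy data frame `picks` and the name `by_position` are made up for illustration):

```r
# Toy picks: one row per (voter, voted, Position); values are illustrative only
picks <- data.frame(
  voter    = c("a", "a", "b", "c", "c", "d"),
  voted    = c("b", "c", "a", "b", "a", "b"),
  Position = c(1, 2, 1, 1, 2, 2)
)

# Count how many votes each guest received in each position
by_position <- as.data.frame(
  table(voted = picks$voted, Position = picks$Position)
)

# Drop guest/position pairs that never occurred
by_position <- by_position[by_position$Freq > 0, ]
print(by_position)
```

Each remaining row says "guest X received `Freq` votes in position `Position`", which is the shape that a faceted histogram over `Freq` needs.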

*Observation 3:* The most notable result from the plot above is that we do not see 36 votes (or even close) in any particular position. Furthermore, the highest number of votes for a guest in the first position (i.e., Position: 1) was 8 (see top plot in Figure 2). This tells us that **popular, generally attractive people are often not the top pick**.

### Is There a Match?

Although the observations discussed so far are intriguing and worth discussing further, let's bring the focus back to the original motivation of the app: *Were there any matches?* We start by looking at the distribution of the number of matches (see Table 1).

| Matches | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Guests | 214 | 22 | 2 | 2 | 0 | 0 |

Table 1: Distribution of the number of matches.

The results above are very disappointing: Only 26 guests (22 + 2 + 2 with 1, 2, and 3 matches respectively), or about 11% of the 240 guests, got at least one match. We can still think positively and argue that a large fraction of the "zero-match" guests can be attributed to the rather low participation in the app.
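The match computation itself is not shown in this post. A minimal sketch of how mutual picks could be counted, assuming an edge list with `voter` and `voted` columns in which each voter picks a given guest at most once (the toy data and variable names are illustrative, not the app's actual code):

```r
# Toy edge list: each row is one pick (voter -> voted); values are illustrative only
picks <- data.frame(
  voter = c("a", "b", "a", "c", "d", "c"),
  voted = c("b", "a", "c", "a", "a", "d"),
  stringsAsFactors = FALSE
)

# A pick is part of a match when the reverse pick also exists
forward <- paste(picks$voter, picks$voted)
reverse <- paste(picks$voted, picks$voter)
mutual  <- picks[forward %in% reverse, ]

# Number of matches per guest, including guests with zero matches
guests  <- union(picks$voter, picks$voted)
matches <- table(factor(mutual$voter, levels = guests))

# Distribution of the number of matches across guests
match_dist <- table(as.integer(matches))
print(matches)
print(match_dist)
```

Using `factor(..., levels = guests)` keeps guests without any mutual pick in the tally, so the zero-match column is counted rather than silently dropped.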

To validate the intuition above, we plot a directed graph of picks, defined as follows: Each *node* represents a guest and each (directed) *edge* indicates a pick from the voter to the voted guest. Furthermore, the size of each node is proportional to the number of votes received or, in other words, the number of incoming edges. The following code produces the graph in Figure 3. A high-definition version of the plot is available for download here in PDF format.
The graph is as follows:

It is easy to spot, by the size of the nodes, the guests who are popular and generally attractive. Interestingly, one can assume that these (popular) guests will be surrounded by a large number of guests, which makes our plot representative of how attendees will be distributed across the dance floor.

*Observation 4:* In any case, these are far from being the main reasons why we plotted the data as a graph in the first place. As we had intuited above, a **large fraction of the voters picked guests that did not participate in the survey** (see nodes in the perimeter of the graph with no outgoing edges). That being said, we can remain positive! This is not at all personal!
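The same observation can be checked numerically rather than by eye. A minimal sketch, assuming an edge list with `voter` and `voted` columns (the toy data and variable names below are hypothetical):

```r
# Toy edge list (voter -> voted); values are illustrative only
picks <- data.frame(
  voter = c("a", "a", "b", "c"),
  voted = c("b", "x", "y", "a"),
  stringsAsFactors = FALSE
)

# In-degree per guest: the quantity that drives node size in the graph
in_degree <- table(picks$voted)

# Guests who received picks but never voted (nodes with no outgoing edges)
non_voters <- setdiff(picks$voted, picks$voter)

# Share of all picks that were directed at non-voters
share_of_picks <- mean(picks$voted %in% non_voters)

# Share of voters with at least one pick directed at a non-voter
affected_voters <- unique(picks$voter[picks$voted %in% non_voters])
share_of_voters <- length(affected_voters) / length(unique(picks$voter))

print(non_voters)
print(share_of_picks)
print(share_of_voters)
```

The two shares answer slightly different questions: the first is about picks, the second about voters, and the second is the one closest to the wording of Observation 4.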

The analysis that I have performed above is by no means comprehensive. In fact, I would love to see other observations that you can extract from this valuable data. I have uploaded the anonymized data (dance.csv and dance-edge-list.csv), so feel free to download it and start experimenting with it. Once you find something worth sharing, go ahead and post it in the comments section below!

*Observation 5 (Bonus Track):* I do not want to end this post without pointing out something in Figure 3 that made me smile. If you look at the far left of the graph, there are **two guests that have unequivocally picked each other**. I do not know if they planned it like this; I do not know if this is 100% chance. What I do know is that these two guests enjoyed the party very, very much.
