Saturday, February 27, 2016

It's All About the Data

There are many fascinating and intriguing hypotheses that we would eventually like to accept or refute. A notable example is Einstein's prediction of the existence of gravitational waves, the last puzzle piece in his general theory of relativity, which was recently confirmed by researchers from the Laser Interferometer Gravitational-Wave Observatory (LIGO). Other, perhaps more approachable, examples include "educated guesses" about which types of customers are more likely to buy certain products, or whether we can accurately estimate arrival times at destinations. Likewise, there are diverse, powerful methods that we can adopt to accept or refute such hypotheses. To confirm the existence of gravitational waves, for instance, the LIGO researchers had to detect microscopic variations in laser beams. For other scenarios, methods such as statistical analysis, machine learning, or deterministic algorithms may suffice. However, the fundamental value of accepting or refuting a hypothesis lies in the data and how we use it.

Reality does not exist until it is measured
Physicists at The Australian National University (ANU)

Get Data, But Do It Well

Data collection is one of the most critical steps when conducting experiments or evaluations, for two main reasons. First, we must collect data that covers all relevant aspects of the hypothesis of interest. To predict the (type of) customers that are more likely to buy picture frames, for example, we need information about all types of customers as well as the products that they have bought and when. Second, we must collect representative data. For our customer prediction problem, for instance, we need data from different days, months, and years, since certain purchases may be seasonal. As an example, see the following plot for interest in winter jackets over time:

The peaks clearly indicate that winter jackets are "hot" only during winter.
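If you already have purchase data in hand, here is a minimal sketch (in Python, with pandas) of how one might check for this kind of seasonality; the file purchases.csv and its column names are hypothetical, chosen only for illustration:

import pandas as pd

# Load a (hypothetical) purchase log with one row per purchase.
purchases = pd.read_csv("purchases.csv", parse_dates=["date"])

# Focus on the product of interest, e.g., winter jackets.
jackets = purchases[purchases["product_category"] == "winter jacket"]

# Count purchases per month; pronounced peaks in the winter months would
# confirm that the data must span several seasons to be representative.
monthly_counts = jackets.set_index("date").resample("M").size()
print(monthly_counts)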

We will often find that the data we are looking for has already been collected by someone out there. Other times, though, we will have to collect it ourselves.

It's Our Lucky Day: Someone Did The (Dirty) Job For Us

Fortunately, a large number of datasets have been collected in a principled manner and made available for research and experimentation. The next list shows a few popular datasets:

Some researchers and bloggers have even compiled rather extensive lists of datasets, to make it easier for others to find data for their problems of interest. In these and other compilations, datasets are characterized by domain (e.g., biology, elections), size (e.g., small, medium, big), cleanliness (e.g., raw, normalized), type (e.g., structured, unstructured), problem (e.g., part-of-speech tagging, regression), and supervision (e.g., labeled, unlabeled), to name a few. Among the many lists available on the Web, the following three, which roughly characterize datasets by domain, are a great start when looking for data: 1001 Datasets, Awesome Datasets, and Kevin Chai's List.

Beyond the above lists (and the datasets therein), many more datasets can be found on the Web, although with a little extra work. Conferences and research competitions often release datasets for specific problems. For instance, the Live Question Answering track in the Text REtrieval Conference (TREC) provides natural language questions and answers with judgments. Companies release datasets to have highly qualified researchers deliver solutions to timely problems. One of the most relevant examples is arguably the Netflix Prize, launched in 2006 to seek solutions to the well-known movie recommendation problem; its dataset continues to attract researchers to this day. Governments release datasets for yet other reasons (e.g., transparency, compliance with regulations). A notable example is the DATA.gov effort, which offers over 170,000 datasets to date. Finally, researchers often share the datasets collected for their papers to enable reproducibility. For example, Eamonn Keogh and his group routinely make their time-series datasets available.
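If you want to explore what DATA.gov offers programmatically, the following rough sketch queries its catalog; it assumes the catalog exposes a CKAN-style package_search endpoint, and the query term "elections" is just an illustrative choice:

import requests

# Search the DATA.gov catalog (CKAN-style API assumed) for a topic of interest.
response = requests.get(
    "https://catalog.data.gov/api/3/action/package_search",
    params={"q": "elections", "rows": 5},
    timeout=30,
)
response.raise_for_status()

result = response.json()["result"]
print("Matching datasets:", result["count"])
for dataset in result["results"]:
    print("-", dataset["title"])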

Not Too Fast, Though: It's Time To Get Hands-On

However, and as expected, we will often need to generate the datasets ourselves (e.g., because existing datasets do not cover aspects of the problem of interest, or because the data simply does not exist); it is then helpful to know what to do when that moment arrives. As a first step, I suggest doing a fresh online search or checking specific discussion threads (e.g., ongoing threads on Quora and Reddit). It is very likely that, between the moment you started studying the problem and the day you want to test your hypotheses, someone produced some relevant data. If you are still empty-handed after this first step, you will have to create the data yourself.

Suggesting data collection strategies deserves a post of its own, which I would eventually like to write. In the meantime, I recommend two useful directions. First, take advantage of available Web crawlers (e.g., Apache Nutch) and crawl websites relevant to your problem; you can then scrape their content from your offline, local copy of those sites (see the sketch at the end of this post). Second, see whether the data you need can be obtained through Application Programming Interfaces (APIs). For example, the Twitter API lets you (almost) painlessly access the Twitter stream. Other interesting APIs include the StreetEasy API (for apartment rentals and sales) and the 311 Content API (for government information and non-emergency services). Whatever you do when creating your very own dataset, though, make it public afterwards (e.g., on your personal website, in the discussion threads mentioned above, as a comment on this post, or on DataMarket). Others will find immense value in it and will greatly appreciate it! After all, don't forget, it's all about the data...
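As promised above, here is a small sketch of the scraping step for the first direction; it assumes the crawler left plain HTML files under a local directory named crawl/ (the directory name and the extracted fields are illustrative, not tied to any particular crawler's output format):

from pathlib import Path
from bs4 import BeautifulSoup

records = []
for html_file in Path("crawl").glob("**/*.html"):
    soup = BeautifulSoup(html_file.read_text(encoding="utf-8"), "html.parser")

    # Extract whatever fields matter for the hypothesis at hand,
    # e.g., the page title and the main body text.
    title = soup.title.string if soup.title else ""
    text = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))

    records.append({"file": str(html_file), "title": title, "text": text})

print("Scraped", len(records), "pages")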