Using Data Science to Improve Super Bowl Pool

Apart from the game, the half-time show and the advertisements, there is one more thing associated with the Super Bowl – the Super Bowl Pool. However, the traditional pool has a known problem – the uneven distribution of win probabilities across squares. Certain squares in the pool are much likelier to win, while others are very unlikely to win. There are news articles like this that try to help people make the right selection when the numbers on the rows and columns are already assigned at the time of picking the squares. But when that is not the case, many players end up saying “My box sucks!”. This is buyer’s remorse, and it dampens participation because certain squares are perceived as useless before the game even starts.

The popularity of the Super Bowl pool and its shortcoming got me thinking about possible modifications. After some pondering, I came up with an idea – what if we pair the most likely number with the least likely number in a single row/column? This would lead to a 5×5 pool with far less disparity in win probabilities across squares.

The next challenge was to get the frequency of each digit 0–9 across past Super Bowls to come up with the pairing.

While looking for the digit frequencies on the web, I came across an article on dataists that not only discussed the frequency disparity but also listed R functions to grab the score data from Wikipedia using YQL. Borrowing the concept from there, I wrote a process in PHP to grab the frequencies for each digit using YQL.
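Independent of the scraping layer, the core of that process is just tallying the final digit of each team’s score. Here is a Python sketch (the original was PHP) over a hypothetical handful of scores; the real process pulls the full Super Bowl score list from Wikipedia:

```python
from collections import Counter

# Hypothetical sample of (winner, loser) final scores; the real process
# scrapes the complete Super Bowl score list from Wikipedia via YQL.
scores = [(31, 25), (31, 17), (27, 23), (21, 17), (48, 21)]

def digit_frequencies(scores):
    """Count how often each final digit 0-9 appears across all scores,
    returning {digit: (count, probability_percent)}."""
    counts = Counter(str(s)[-1] for pair in scores for s in pair)
    total = sum(counts.values())
    return {d: (counts.get(str(d), 0),
                round(100 * counts.get(str(d), 0) / total, 2))
            for d in range(10)}

freqs = digit_frequencies(scores)
```

Run against the full historical score list, this yields the frequency table below.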

Digit   Frequency   Probability (%)
0         101         27.45
1          21          5.71
2           9          2.45
3          56         15.22
4          38         10.33
5          10          2.72
6          30          8.15
7          77         20.92
8           9          2.45
9          17          4.62


With that information in hand, here is the digit pairing that yields a more even probability distribution.

Digit Pair   Combined Frequency   Probability (%)
0, 2               110                29.89
7, 8                86                23.37
3, 5                66                17.93
4, 9                55                14.95
6, 1                51                13.86
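The pairing follows a simple greedy rule: sort the digits by frequency and match the most common with the least common. A Python sketch that reproduces the pairs above from the frequency table:

```python
# Digit frequencies from the table above (all past Super Bowls).
freq = {0: 101, 1: 21, 2: 9, 3: 56, 4: 38, 5: 10, 6: 30, 7: 77, 8: 9, 9: 17}

def pair_digits(freq):
    """Pair the most frequent digit with the least frequent, the second
    most frequent with the second least, and so on, yielding five pairs
    with far more even combined probabilities."""
    # Sort by (frequency, digit) so ties break on the digit value.
    asc = sorted(freq, key=lambda d: (freq[d], d))
    top5, bottom5 = asc[5:][::-1], asc[:5]
    return [(hi, lo, freq[hi] + freq[lo]) for hi, lo in zip(top5, bottom5)]

pairs = pair_digits(freq)
# Each tuple is (likely digit, unlikely digit, combined frequency).
```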


As you can see, the probabilities in the modified pool are much more uniform than in the traditional pool.

Below is a sample modified Super Bowl pool initialized with random pairs on the rows and columns. In practice, the pool squares are ‘taken’ by players before the numbers are assigned.

Charts for Hacker News Polls

If you prefer to see the charts in action before going through the details, visit HN Charts.

The Idea

In my last hack, I created a process to visualize the support/oppose statistics for the Wikipedia blackout discussion on Jimbo Wales’s user talk page. The next time I saw a Hacker News poll on the front page, an idea struck me – why not create a process to visualize data from Hacker News polls? Having worked with the HN Search API and the charting process, I thought it would be fairly easy to create this service. The project turned out to be more complicated, but at the same time more fun, than I had originally anticipated.

Research

Polls are currently in beta on Hacker News, and I think that is why poll data are not accessible from the HN Search API. I wrote a quick e-mail to PG with the following details:

It appears that the Poll stats are not associated with the parent item
in the HN Search API:

  1. Parent item query does not include the Poll stats in the response (e.g. http://api.thriftdb.com/api.hnsearch.com/items/3420203-c2ed1 )
  2. Comments query for the parent item does not include the Poll stats either (e.g. http://api.thriftdb.com/api.hnsearch.com/items/_search?q=3420203-c2ed1&pretty_print=true)

I can imagine the volume of e-mail PG receives and was not anticipating a response, but I sent the e-mail anyway, just in case. I haven’t received a response yet.

That didn’t stop me from moving ahead. After all, I scraped the content in my previous hack using YQL. But then I found that HN is blocking YQL and Yahoo Pipes from accessing Hacker News stories. I ended up using another scraping service, which I will not disclose here so as to make it easier for HN to keep up with careless scrapers. I am caching data and using the HN API / RSS feeds whenever possible, resorting to scrapes only when absolutely needed.

Evolution

Once I built the visualization page for individual polls, I wanted to make it easier to find interesting polls to visualize. To do this, I added two new features: i) a list of front-page polls to visualize (using RSS) and ii) a list of the best Hacker News polls (using the HN Search API).
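The front-page list can be built by filtering the HN RSS feed on poll-style titles. A Python sketch over a canned feed; the real feed lives at news.ycombinator.com/rss, and the “Poll:” title prefix is my assumption about how polls are conventionally titled:

```python
import xml.etree.ElementTree as ET

# A minimal stand-in for the HN front-page RSS feed.
SAMPLE_RSS = """<rss><channel>
<item><title>Poll: Favorite editor?</title><comments>https://news.ycombinator.com/item?id=1</comments></item>
<item><title>Show HN: something</title><comments>https://news.ycombinator.com/item?id=2</comments></item>
</channel></rss>"""

def front_page_polls(rss_xml):
    """Return (title, comments-link) pairs for items that look like polls.
    The 'Poll:' title prefix is a heuristic, not a guarantee."""
    root = ET.fromstring(rss_xml)
    return [(item.findtext("title"), item.findtext("comments"))
            for item in root.iter("item")
            if item.findtext("title", "").startswith("Poll:")]

polls = front_page_polls(SAMPLE_RSS)
```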

Details

The end result is HN Charts, a service to visualize Hacker News polls.

The site is designed using Twitter Bootstrap and developed in PHP. The charts are rendered using the Google Visualization API. The chart pages are cached for ten minutes, so poll results may be up to ten minutes old.

I have also placed social share buttons on the individual chart pages so that you can share your favorite polls.

Final Note

Understanding the efforts of the HN community to prevent Eternal September, I decided not to place any links in the poll visualization pages to the parent poll. Your feedback will be highly appreciated.

Blanking all Wikipedia as SOPA Protest Live Stats

Jimbo Wales, the founder of Wikipedia, started a poll on Wikipedia to evaluate support for his idea of blanking all of Wikipedia in protest against SOPA. Hundreds of contributors are participating in the discussion, and the story is currently the top article on Hacker News.

One thing about the discussion is that it is entirely textual. It is fun to read through the different sides of the argument, but it is hard to gauge the overall support/oppose statistics. I created a simple process that scrapes the Wikipedia discussion in real time and creates a pie chart showing the current support/oppose statistics. You can see the live stats in the pie chart below.

[Update: The chart below is now an image of the final outcome as the discussion has already ended]

Tools used: YQL, Google Visualization API
Framework: PHP

The results of the YQL queries are fetched as JSON, and the number of matches is extracted from each result. The counts for the four queries (Support, Strong Support, Oppose, Strong Oppose) are passed to the Google Visualization API to create the chart.
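The counting step is simple enough to sketch. Here is a Python approximation of those four queries (the original used YQL match counts, and the '''bold''' vote markup is my assumption about the discussion’s wikitext conventions):

```python
import re

# Snippet of discussion wikitext; votes are conventionally bolded,
# e.g. '''Support''' or '''Strong support''' (an assumption about markup).
wikitext = """
*'''Support''' Blanking sends a clear message. ~~~~
*'''Strong support''' This matters. ~~~~
*'''Oppose''' Too disruptive. ~~~~
"""

def tally_votes(text):
    """Count each bolded vote marker; the four labels are distinct
    literal patterns, so strong votes are not double-counted."""
    labels = ("Support", "Strong support", "Oppose", "Strong oppose")
    return {label: len(re.findall(r"'''" + label + r"'''", text,
                                  re.IGNORECASE))
            for label in labels}

counts = tally_votes(wikitext)
```

The resulting counts feed straight into the pie chart.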

In the end, if Wikipedians have their way of holding a discussion, we hackers have our way of visualizing it.

Health Insurance? There is an app for that!

My last hack, Hacker News Like Button, received a mixed reaction on Hacker News. The community loved the hack but thought that a like button would lead to an Eternal September. I agree with the community on that, so I have been toying with a few ideas to make the Like button more ‘likeable’. But that is not what this post is about.

I have been brushing up my development skills in various areas. For example, I spent some time creating 3D animations in HTML5 using JavaScript, and the result is iframed below.

[The embedded page is currently unavailable.]

I had also been wanting to learn one of two languages: Python (and Django) or Ruby (on Rails). I did some research on both and decided to go with Python, the primary reason being Google App Engine’s support for Python.

I believe in learning by doing, so I was looking for a problem to work on in Python. Meanwhile, I had to select my health insurance plan for next year. Health insurance plan selection is an optimization problem: you can pay a higher premium to minimize your risk, or pay a lower premium and take on more risk. Employer contributions and HSA plans serve as additional constraints. And with annual enrollment arriving soon, this problem fit the definition of “interesting and useful”.

Here is the final project: Pick-a-Plan.

I used Bootstrap for the layout/design and the Google Visualization API for the gauges in Health-Meter. Health-Meter lets you experiment with different health-service scenarios: as you adjust your medical needs, the estimated annual cost updates. The idea is to let users compare different plans under different health-care scenarios so that they can make an informed decision.
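The estimate behind a meter like this boils down to premium plus your out-of-pocket share. A minimal Python sketch, with every plan number hypothetical:

```python
def annual_cost(premium, deductible, coinsurance, oop_max,
                employer_hsa, medical_expenses):
    """Estimate total yearly cost for one plan: you pay everything up to
    the deductible, then a coinsurance share of the rest, capped at the
    out-of-pocket maximum; employer HSA money offsets the total."""
    below = min(medical_expenses, deductible)
    above = max(medical_expenses - deductible, 0.0)
    out_of_pocket = min(below + coinsurance * above, oop_max)
    return premium + out_of_pocket - employer_hsa

# Hypothetical comparison: low-risk (high premium) vs high-deductible plan,
# both evaluated at $1,000 of expected medical expenses.
low_risk = annual_cost(premium=3600, deductible=500, coinsurance=0.1,
                       oop_max=2000, employer_hsa=0, medical_expenses=1000)
high_risk = annual_cost(premium=1200, deductible=2500, coinsurance=0.2,
                        oop_max=5000, employer_hsa=750, medical_expenses=1000)
```

Sweeping `medical_expenses` across scenarios is exactly the comparison Health-Meter lets you do interactively.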

Google App Engine’s getting-started document was adequate to help me complete this project. I am impressed by the simplicity of the Google App Engine development and deployment process. While there might be complexities I did not come across due to the simple nature of this application, the fact that I completed the project without much hurdle means a lot. As they say, well begun is half done.

Please give Pick-a-Plan a try and let me know what you think. Your encouragement will keep me motivated.

Hacker News Like Button

Let me begin by thanking you all for your love; my last post made it to the Hacker News front page and also got some buzz on Twitter, Reddit and Facebook. I am not just thanking those who agreed with me and upvoted my post; those of you who disagreed and commented are valued equally.

I have named my blog ‘Hacks and Thoughts’, but my last two posts (the only ones so far) were both thoughts, so I was due for a hack. So here it comes.

The Idea

A couple of times, I have submitted web content to Hacker News only to find that it had already been submitted. When my article made it to the HN front page, I realized that there is another disconnect as well: readers who come to the page from other sources (subscriptions, Twitter etc.) are not aware of the attention and thus remain isolated from the interesting discussions. In my last post, I invited them to join the discussion by providing a link at the end, but that was not a very good solution.

The Research

Having been around HN for some time, I suspected that someone might already have come up with a solution. I searched for Hacker News submission tools but found only a bookmarklet and a Chrome extension. A like button is much more inviting to readers visiting a page than a bookmarklet or an extension that has to be pre-installed. Nor would those help readers discover the discussions that might be going on. I decided to explore the idea a little.

The Details

Hacker News requires you to be logged in to your account to upvote and does not provide a programmatic way to do it on your behalf. I remembered from earlier experience that a resubmission (once per user) counts as an upvote. I found that the bookmarklet used the following JavaScript to submit the news:

javascript:window.location="http://news.ycombinator.com/submitlink?u="+encodeURIComponent(document.location)+"&t="+encodeURIComponent(document.title)

Basically, we could pass the URL (as query parameter u) and the title (as query parameter t) to http://news.ycombinator.com/submitlink programmatically, and the user would only need to click submit, provided they are already logged in. That was one part of the like button.
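The same URL can of course be built server-side; a Python equivalent of what the bookmarklet assembles:

```python
from urllib.parse import urlencode

def submit_link(url, title):
    """Build the HN submitlink URL that the bookmarklet points the
    browser at; urlencode handles the percent-escaping."""
    query = urlencode({"u": url, "t": title})
    return "http://news.ycombinator.com/submitlink?" + query

link = submit_link("http://example.com/post", "My Post")
```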

The other part was getting the HN points for the submission so that they could be displayed on the like button. This would indicate HN hotness right on the page without the user needing to look it up on HN. I was trying to come up with a way to scrape the points from the thread, but found that HN actually offers a search API that returns a JSON response. On top of that, it supports weighting different attributes for matching, so it was possible to search only submitted URLs matching the URL of the current content and retrieve the score if a submission already existed.
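A sketch of that lookup in Python; note that the filter parameter name and the response shape here are assumptions from memory of the HNSearch API, not documented guarantees:

```python
import json
from urllib.parse import urlencode

# The HNSearch endpoint used above; the filter parameter name below is
# an assumption from memory and may not be exact.
ENDPOINT = "http://api.thriftdb.com/api.hnsearch.com/items/_search"

def score_query(page_url):
    """Build the search URL that looks up submissions of page_url."""
    return ENDPOINT + "?" + urlencode({"filter[fields][url]": page_url,
                                       "limit": 1})

def extract_points(response_body):
    """Pull the points of the first hit out of the JSON response.
    The results/item/points shape is assumed, not guaranteed."""
    data = json.loads(response_body)
    hits = data.get("results", [])
    return hits[0]["item"]["points"] if hits else None
```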

That’s it! The rest is just skinning the cat, and the recently released Twitter Bootstrap toolkit came in handy for putting a page together.

The Test

Pre-launch, everything seems to be working fine (on three browsers, that is) except for the not-so-nice user experience of having to close the submit pop-up window after submitting. If this takes off after launch, the large volume of requests might strain the HN Search API; I didn’t see any rate-limiting details for the API, but there might be a limit. I registered a domain and forwarded it to my hobby-project server, so that may go down as well. If the community thinks this is useful, I am willing to build and maintain it as a reliable service.


The curious case of DuckDuckGo

I am searching for a search engine that doesn’t make me worry about privacy as I search. Why do I worry about privacy? Because private information at this scale, in evil hands, can wreak havoc. I am not saying that today’s search engines are evil. I am saying that this data may one day fall into evil hands. Here is a link to one source of that fear: Google Hack Attack Was Ultra Sophisticated, New Details Show.

Enter DuckDuckGo, a search company whose privacy policy states that it “… does not collect or share personal information” and links to an illustrated guide to the current privacy vulnerability. Sounds promising.

But what about the search itself? Is this one-man show at all comparable to its multi-billion-dollar competitors? I think DuckDuckGo does beat them on several fronts. In fact, it was two searches I did over the past week, not the privacy debate, that led me to this write-up.

Case I: Google overdoes search

I searched Google for “College or UnCollege?”, the topic of my previous post, to see whether my blog post shows up in the Google search results. The result was different from what I expected:

Google thought I had misspelled my query and showed results for “College or College?”. This must be because UnCollege is not in Google’s dictionary yet. The point is: Google decided that I had typed a wrong query and went ahead and showed results for what it thought I meant, rather than asking whether I meant to search for its corrected query.

I did the same search in DuckDuckGo and I got results as expected. Simple.

Case II: Google doesn’t do enough

This one was a test that I came up with. I searched for the word “Java” in Google. Google returned results in 0.11 seconds, and every result on the front page related to Java, the programming language. But aren’t there different meanings of the word Java? What if I were searching for the place Java, of Java coffee fame? I would need to scan through the search results, spanning multiple pages (the second page in this case), until I came across one relating to Java, the place.

I searched “Java” in DuckDuckGo. And guess what? DDG showed a “Get results for different meanings of Java” box that let me select among the different meanings to refine my search. “Java is the most populous island in Indonesia” was the text suggesting the place reference. DDG did take about a second to return the results (roughly ten times Google’s response time), but since I did not have to dig through results to figure out whether they matched my search context, I arrived at the desired result faster. Google has made search responses blazing fast, but not necessarily the process of reaching your desired result.

To conclude, I am impressed by DuckDuckGo as a pro-privacy search engine with nifty features that beat Google in certain circumstances. I am definitely going to try switching from Google. That would make a great post no matter the outcome :)


PS: Please join the discussion at Hacker News: http://news.ycombinator.com/item?id=2911175