Finding controversial and interesting posts on Hacker News

I’ve been an avid reader of Hacker News for the past year or so. If you don’t know what it is, in a few words it’s an aggregator centered around technology news and everything surrounding it. So, not only news about the latest and greatest frameworks, but also any news that the community finds interesting. Stories are upvoted to the front page, which is what you see when you go to the link above. It has been going since 2006.

As I read it, I started noticing certain patterns in the types of news that make it to the front page. There are a couple of posts already on various aspects of Hacker News (David Robinson’s post and Julia Silge’s post; she also mentioned the bigrquery package that I’ll use too), but what I wanted to look at is the distribution of points that each story gets and the number of comments under it.

You see, I have this theory that controversial posts tend to attract many more comments than an average post, while interesting posts tend to be upvoted quite highly without many comments. I think it is obvious why controversial posts have quite a few comments: people go into word battles with each other trying to prove a point (“Someone is wrong on the Internet”).

What makes an interesting post is very much open to interpretation, but I explain it by the fact that some posts are just good and there is nothing one needs to comment on them, so they are just upvoted.

As a sidenote, by calling some posts “interesting” I’m not saying that other posts are not interesting, but rather that posts with this characteristic tend to be different from the “average” ones.

So, the purpose of this post is to explore my intuition and also to try out the bigrquery package.

To operationalize my intuition, I considered multiple things, but in the end I settled on a very simple metric - the ratio of a story’s score to its number of comments. To make sure that there are as few outliers as possible, I’ve also decided to use only stories with 50+ points. This, of course, is arbitrary, but what can you do, amirite?
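To make the metric concrete, here is a minimal base R sketch on made-up data. The data frame `toy`, its values and the name `min_score` are purely illustrative; only the column names `score` and `descendants` (the comment count) come from the real dataset.

```r
# Toy data: the values are invented for illustration only.
toy <- data.frame(
  title       = c("A", "B", "C", "D"),
  score       = c(180, 60, 400, 30),
  descendants = c(90, 120, 20, 5)   # number of comments
)

# Keep stories above the score cutoff that have at least one comment,
# so the division below is always defined.
min_score <- 50
kept <- toy[toy$score > min_score & toy$descendants > 0, ]

# Points-to-comments ratio: low means lots of talk per upvote,
# high means lots of quiet upvotes.
kept$ratio <- kept$score / kept$descendants
print(kept)
```

Story "D" is dropped by the score cutoff, and the remaining ratios are 2, 0.5 and 20.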

Without looking into the data (you have to trust me that I didn’t), I have a couple of guesses that I want to check:

  1. Most posts have a fairly standard points-to-comments ratio. I would say that it is quite common for posts on the front page to have around 150-200 points with 70-100 comments. So, the ratio is 2-3 for most posts.
  2. This ratio is also distributed normally.
  3. Controversial posts will have a ratio close to 1 or below.
  4. Interesting posts will have a ratio of 5 or higher.
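The thresholds in the list above can be written down as a small labelling function. This is only a sketch of my guesses, not an established classification: the function name `label_ratio` and the middle label "typical" are made up for illustration, and I treat "close to 1 or below" as strictly below 1.

```r
# Hypothetical helper: bucket a points-to-comments ratio using the
# guessed thresholds (below 1, between 1 and 5, 5 or higher).
label_ratio <- function(ratio) {
  cut(ratio,
      breaks = c(0, 1, 5, Inf),
      labels = c("controversial", "typical", "interesting"),
      right  = FALSE)   # intervals are [0, 1), [1, 5), [5, Inf)
}

labels <- label_ratio(c(0.4, 1, 2.5, 5, 40))
print(labels)
```

With `right = FALSE`, a ratio of exactly 1 counts as "typical" and a ratio of exactly 5 counts as "interesting".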


As I’ve mentioned, Google provides access to all Hacker News posts and comments via BigQuery. You can use it via the bigrquery package by (surprise!) Hadley Wickham. The package’s Readme provides clear enough instructions on how to do it.

The only slight wrinkle is that I’ve stored the name of the project in a vault using the secret package (you can read about how to do it here). It’s not actually required, since authentication is done online anyway, but I think it’s a good habit: in cases where you do need to store some API keys, it’ll be the natural thing to do.


Getting the “secret” name of the project:

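library(secret)

# Location of the vault and of the private key used to decrypt its secrets
vault <- file.path("/home/misha/R/projects", ".vault")
key_dir <- file.path("~/.ssh")
my_private_key <- file.path(key_dir, "secret_example")
# Decrypt the stored project name
project <- get_secret("secret_two", key = my_private_key, vault = vault)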

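Now I can get the stories I need using a simple SQL query:

library(bigrquery)
# Legacy SQL table reference ([project:dataset.table]); keep items scoring above 50
sql <- "SELECT * FROM [bigquery-public-data:hacker_news.full] WHERE score > 50"
stories <- query_exec(sql, project = project, max_pages = Inf)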

I’m obviously not going to run this query over and over again, so I’ve cached the result. Retrieving it from the cache is just:

library(dplyr)  # provides %>%

stories <- readRDS("/home/misha/R/projects/hackernews/stories_cache.rds") %>%
  dplyr::filter(descendants > 0) %>%
  dplyr::filter(!is.na(descendants)) %>%
  # points-to-comments ratio
  dplyr::mutate(ratio = score / descendants)

I’ve also cleaned the dataframe a bit to have only non-missing and positive descendants.
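For completeness, the caching itself can be done with a tiny wrapper around `saveRDS()` and `readRDS()`. This is a sketch of one way to do it, not the exact code I ran: the helper `cached()` and the stand-in `fake_query()` are hypothetical names, and a temporary file replaces the real cache path.

```r
# Hypothetical helper: run `fetch` once, then reuse the saved result.
cached <- function(path, fetch) {
  if (file.exists(path)) {
    return(readRDS(path))   # cache hit: no query is run
  }
  result <- fetch()
  saveRDS(result, path)     # cache miss: store the result for next time
  result
}

# Stand-in for the expensive BigQuery call; counts how often it runs.
calls <- 0
fake_query <- function() {
  calls <<- calls + 1
  data.frame(score = c(98, 75), descendants = c(50, 44))
}

cache_file <- tempfile(fileext = ".rds")
first  <- cached(cache_file, fake_query)  # runs the query, writes the file
second <- cached(cache_file, fake_query)  # reads the file, no query
```

In the real workflow `fetch` would wrap the `query_exec()` call, so the query only runs when the cache file does not exist yet.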


Let’s look at what kind of data we have:

str(stories)

## 'data.frame':    131827 obs. of  15 variables:
##  $ by         : chr  "Garbage" "teng" "newman314" "ghosh" ...
##  $ score      : int  98 75 107 183 119 142 251 81 170 87 ...
##  $ time       : int  1352800497 1363053023 1484596441 1485235710 1282419418 1424670541 1379623273 1421604151 1502125673 1454162856 ...
##  $ timestamp  : POSIXct, format: "2012-11-13 09:54:57" "2013-03-12 01:50:23" ...
##  $ title      : chr  "Netflix Open Source" "Strikingly Creates Simple, Beautiful Web Sites in Minutes" "With Google's RAISR, images can be up to 75% smaller without losing detail" "Unexpected Consequences of Self-Driving Cars" ...
##  $ type       : chr  "story" "story" "story" "story" ...
##  $ url        : chr  "" "" "" "" ...
##  $ text       : chr  "" "" "" "" ...
##  $ parent     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ deleted    : logi  NA NA NA NA NA NA ...
##  $ dead       : logi  NA NA NA NA NA NA ...
##  $ descendants: int  50 44 51 209 52 32 228 71 80 62 ...
##  $ id         : int  4777242 5359507 13412464 13469038 1623482 9092781 6414162 8908197 14949107 11001833 ...
##  $ ranking    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ ratio      : num  1.96 1.705 2.098 0.876 2.288 ...

There are 134303 stories in total. The most interesting columns for me are score, title and descendants (the number of comments on the story).

Distribution of scores over time:

library(ggplot2)

# Smoothed trend of story scores over time
stories %>%
  ggplot(., aes(x = timestamp, y = score)) +
    geom_smooth() +
    scale_x_datetime(date_breaks = "1 year", date_labels = "%Y")
## `geom_smooth()` using method = 'gam'

Distribution of comments over time:

# Smoothed trend of comment counts over time
stories %>%
  ggplot(., aes(x = timestamp, y = descendants)) +
    geom_smooth() +
    scale_x_datetime(date_breaks = "1 year", date_labels = "%Y")
## `geom_smooth()` using method = 'gam'

For both scores and number of comments there is an upward trend. That is not surprising, considering that the site itself is probably growing in popularity. These two graphs also help in answering my first question: at least in the last couple of years, posts get 150-170 points and 70-100 comments on average.
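If you prefer numbers to smoothed curves, a plain per-year average tells a similar story. This is a different, cruder technique than the GAM smoother used in the plots, and the sketch below runs on a made-up four-row data frame (`toy`), not on the real stories.

```r
# Toy data: timestamps and values are invented for illustration only.
toy <- data.frame(
  timestamp = as.POSIXct(
    c("2012-03-01 12:00:00", "2012-11-13 09:54:57",
      "2013-03-12 01:50:23", "2013-07-04 08:30:00"),
    tz = "UTC"
  ),
  score       = c(102, 98, 75, 125),
  descendants = c(30, 50, 44, 56)
)

# Extract the calendar year and average both columns within each year.
toy$year <- format(toy$timestamp, "%Y")
yearly <- aggregate(cbind(score, descendants) ~ year, data = toy, FUN = mean)
print(yearly)
```

On the real data the same two lines would be applied to `stories` instead of `toy`.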

Now to the other questions I posed at the beginning.

Distribution of ratio

stories %>%
  ggplot(., aes(x = ratio)) +
    geom_histogram(bins = 100)

It looks like there are some really interesting stories based on my criteria :). Instead of filtering out the stories with a very high ratio, I’ll plot the log of the ratio:

stories %>%
  ggplot(., aes(x = log(ratio))) +
    geom_density()

The log of the ratio does look normally distributed, which means the ratio itself is closer to log-normal than normal. So, I wasn’t completely correct in my second assumption.

stories %>%
  dplyr::select(ratio) %>%
  summary()
##      ratio         
##  Min.   :  0.1652  
##  1st Qu.:  1.5077  
##  Median :  2.4348  
##  Mean   :  3.8337  
##  3rd Qu.:  4.0968  
##  Max.   :398.0000
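In the summary above the mean (3.83) sits well above the median (2.43), which is what you would expect from a right-skewed, roughly log-normal variable. The sketch below illustrates that pattern on simulated data only: the `rlnorm()` parameters are arbitrary choices of mine and `skewness()` is a small hand-rolled helper, so treat it as an illustration rather than a test of the real ratios.

```r
set.seed(42)

# Simulated ratios that are log-normal by construction
# (meanlog and sdlog are arbitrary, not fitted to the HN data).
sim_ratio <- rlnorm(10000, meanlog = log(2.4), sdlog = 0.8)

# Simple moment-based skewness: about 0 for a symmetric distribution.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(sim_ratio)        # clearly positive: long right tail
skew_log <- skewness(log(sim_ratio))   # close to 0 after taking logs

# For right-skewed data the mean is pulled above the median.
mean_above_median <- mean(sim_ratio) > median(sim_ratio)
print(c(raw = skew_raw, log = skew_log))
```

The same helper could be applied to `stories$ratio` and `log(stories$ratio)` to put a number on what the density plot shows.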

Controversial posts

I’ve defined controversial posts to be all posts where the ratio is below 1.

stories %>%
  dplyr::filter(ratio < 1) %>%
  nrow()
## [1] 11798

In total, there are 11798 (or 9%) such stories. For the sake of demonstration, I’ll only show the titles and urls of the 20 posts with the most controversial (lowest) ratio.
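The percentage is simple arithmetic. As a quick check, here it is in R, using the number of rows reported by the `str()` output earlier as the denominator (that choice of denominator is my assumption).

```r
n_controversial <- 11798    # stories with ratio < 1
n_stories       <- 131827   # rows in the cleaned data frame, per str()

share <- 100 * n_controversial / n_stories
share_rounded <- round(share)   # whole percent
print(share_rounded)
```

On a data frame the same share can be computed directly as `100 * mean(stories$ratio < 1)`.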

stories %>%
  dplyr::select(title, url, ratio) %>%
  dplyr::arrange(ratio) %>%
  # negative n keeps the 20 rows with the lowest ratio
  dplyr::top_n(-20)
## Selecting by ratio
| title | url | ratio |
|---|---|---|
| Please tell us what features you’d like in news.ycombinator | | 0.1651955 |
| Microsoft is Dead | | 0.1730337 |
| HN: share your unneeded Google Wave invites | | 0.1958763 |
| Ask HN: Where do you live? | | 0.2021661 |
| Soaring Student Debt Prompts Calls for Relief | | 0.2054264 |
| Inappropriate comments at pycon 2013 called out | | 0.2205882 |
| Poll: Where are you from? | | 0.2208333 |
| Ask HN: What annoys you? | | 0.2347826 |
| Why Do Obese Patients Get Worse Care? Many Doctors Don’t See Past the Fat | | 0.2355556 |
| Ask HN: Whom do you admire most? | | 0.2393162 |
| In Fed and Out, Many Now Think Inflation Helps | | 0.2467532 |
| Americans Are Putting Billions More Than Usual in Their 401(k)s | | 0.2544248 |
| Reddit CEO Cracks Down on Abusive Content to Protect Users, Attract Advertisers | | 0.2546917 |
| Ask HN: What do you drink? | | 0.2581967 |
| Ask HN: What are you reading right now? | | 0.2651007 |
| Why i’m done wearing a helmet | | 0.2669683 |
| How Women Got Crowded Out of the Computing Revolution | | 0.2723577 |
| YC Summer 2017 Invites/Rejections | | 0.2738095 |
| “A Statement with My View on Curtis Yarvin and Strange Loop” | | 0.2771084 |
| Ask HN: What are you working on today? | | 0.2774194 |

I’d forgotten about “Ask HN” posts, and it is obvious that they are the most prevalent when it comes to the metric I’ve chosen. But the rest of the posts do look quite controversial, so I would say that my intuition is confirmed.
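One way to look past the community threads would be to drop them by title prefix before ranking. I did not do this for the table above; the sketch below only shows the idea on five titles taken from it, and the list of prefixes in the pattern is my own guess at what counts as a community thread.

```r
titles <- c(
  "Ask HN: Where do you live?",
  "Microsoft is Dead",
  "Poll: Where are you from?",
  "Soaring Student Debt Prompts Calls for Relief",
  "HN: share your unneeded Google Wave invites"
)

# Assumed prefixes for community threads; adjust to taste.
is_community <- grepl("^(Ask HN|Poll|HN):", titles)

# Titles that remain after dropping the community threads
remaining <- titles[!is_community]
print(remaining)
```

On the full data the same `grepl()` condition could go inside a `dplyr::filter()` call ahead of the ranking step.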

Interesting posts

Now to interesting posts:

stories %>%
  dplyr::select(title, url, ratio) %>%
  dplyr::arrange(ratio) %>%
  # keeps the 20 rows with the highest ratio
  dplyr::top_n(20)
## Selecting by ratio
| title | url | ratio |
|---|---|---|
| ESO (European Southern Observatory) Announcing Unprecedented Discovery | | 145.0 |
| Ritchie, Stroustrup, and Gosling interview | | 146.0 |
| Fantasy author Pratchett dies aged 66 | | 152.0 |
| Introducing Kindle Oasis | | 152.0 |
| Microsoft Open Sources .NET and Mono | | 156.0 |
| Show HN: CoinMall Alpha: A Crypto Marketplace for Digital Goods | | 156.0 |
| Tell HN: Google removes rust, Netflix, other GitHub repos after DMCA takedown | | 163.0 |
| The DAO is currently being attacked, over 2M Ethereum missing so far | | 188.5 |
| Microsoft Surface Book | | 190.0 |
| Dream of New Kind of Credit Union Is Extinguished by Bureaucracy | | 190.0 |
| Dennis Ritchie RIP | | 212.0 |
| Magic Mushroom Drug Lifts Depression in Human Trial | | 223.0 |
| Shrinking to Zero: The Raspberry Pi Gets Smaller | | 240.0 |
| Dennis Ritchie | | 243.0 |
| Texas Student Is Under Police Investigation for Building a Clock | | 269.0 |
| Bill Gates Makes Statement on Steve Jobs | | 310.0 |
| Thanks, Steve | | 335.0 |
| Beware: Windows 10 Signature Edition Blocks Installing Linux | | 344.0 |
| Wikileaks’ Assange wins U.N. ruling on ‘arbitrary detention’ | | 373.0 |
| SpaceX: “Landing confirmed” | | 398.0 |

Hm, I would say that here I’m a bit off. Many of the posts with the highest ratio are about the deaths of people who are famous on HN. In fact, most of these posts are not necessarily interesting. Rather, they are posts where people don’t have a lot to add, yet the topic is relevant to the community at large, so there is a large number of upvotes with few comments.


I think all of us have intuitions about one thing or another. I like that it is rather trivial (and yet interesting) to check whether one’s intuition is correct. And, hey, 2.5 out of 4 ain’t bad!

