I’ve been an avid reader of Hacker News for the past year or so. If you don’t know what it is, in a few words it’s an aggregator centered on technology news and everything surrounding it. So it carries not only news about the latest and greatest frameworks, but also any news that the community finds interesting. Stories are upvoted to the front page, which is what you see when you follow the link above. It has been going since 2006.
As I read it, I started noticing certain patterns in the types of news that make it to the front page. There are already a couple of posts on various aspects of Hacker News (David Robinson’s post and Julia Silge’s post; she also mentioned the bigrquery package that I’ll use too), but what I wanted to look at is the distribution of points that each story gets and the number of comments under it.
You see, I have this theory that controversial posts tend to attract many more comments than an average post, while interesting posts tend to be upvoted quite highly without many comments. I think it is obvious why controversial posts collect so many comments: people go into word battles with each other trying to prove a point (“Someone is wrong on the Internet”).
What makes an interesting post is very much open to interpretation, but I explain it by the fact that some posts are just good and there is nothing one needs to comment on them, so they are just upvoted.
As a sidenote, by calling some posts “interesting” I’m not saying that other posts are not interesting, but rather that posts with this characteristic tend to be different from the “average” ones.
So, the purpose of this post is to explore my intuition and to also try out the bigrquery package.
To operationalize my intuition, I considered multiple things, but in the end I settled on a very simple metric - the ratio of a story’s score to its number of comments. To make sure that there are as few outliers as possible, I’ve also decided to use only stories with more than 50 points. This, of course, is arbitrary, but what can you do, amirite?
Without looking into the data (you have to trust me that I didn’t), I have a couple of guesses that I want to check:
- Most posts have a fairly typical points-to-comments ratio. I would say that it is quite common for posts on the front page to have around 150-200 points with 70-100 comments, so the ratio is 2-3 for most posts.
- The ratio is normally distributed.
- Controversial posts will have a ratio close to 1 or below.
- Interesting posts will have a ratio of 5 or higher.
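To make these guesses concrete, here is a minimal sketch of how the ratio could be turned into labels. The data frame is made up for illustration (these are not real HN stories), `classify_ratio` is a hypothetical helper, and the cutoffs (below 1, and 5 or higher) are just one reading of the guesses above.

```r
# Toy data: hypothetical scores and comment counts, not real HN stories
toy <- data.frame(
  score       = c(180, 60, 400, 120),
  descendants = c(75, 240, 20, 120)
)

# The metric from the text: points per comment
toy$ratio <- toy$score / toy$descendants

# Hypothetical helper: label a story by its ratio
classify_ratio <- function(ratio) {
  ifelse(ratio < 1, "controversial",
         ifelse(ratio >= 5, "interesting", "typical"))
}

toy$label <- classify_ratio(toy$ratio)
print(toy)
```

A story with exactly as many points as comments lands in the "typical" bucket here; moving that boundary is a one-character change.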
As I’ve mentioned, Google provides access to all Hacker News posts and comments via BigQuery. You can use it via the bigrquery package by (surprise!) Hadley Wickham. The Readme of the package provides clear enough instructions on how to do it.
The only slight wrinkle is that I’ve stored the name of the project in a vault using the secret package (you can read about how to do it here). It’s not actually required since authentication is done online anyway, but I think it’s a good habit: in cases where you do need to store some API keys, it’ll be a natural thing to do.
library(tidyverse)
library(bigrquery)
library(secret)
library(ggthemes)
Getting the “secret” name of the project:
vault <- file.path("/home/misha/R/projects", ".vault")
key_dir <- file.path("~/.ssh")
my_private_key <- file.path(key_dir, "secret_example")
project <- get_secret("secret_two", key = my_private_key, vault = vault)
Now I can get the stories I need using a simple SQL query:
sql <- "SELECT * FROM [bigquery-public-data:hacker_news.full] WHERE score > 50"
stories <- query_exec(sql, project = project, max_pages = Inf)
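As a side note, newer releases of bigrquery deprecate `query_exec()` in favour of `bq_project_query()` and `bq_table_download()`. The sketch below shows how the same request might look with that API and standard SQL. It is an assumption about a more recent setup, not what was run for this post; `fetch_stories` and `sql_standard` are names made up here, and the function is only defined, not called, since calling it needs BigQuery credentials.

```r
# Standard SQL uses backticks and dots instead of the legacy
# [project:dataset.table] form used in the query above.
sql_standard <- paste(
  "SELECT *",
  "FROM `bigquery-public-data.hacker_news.full`",
  "WHERE score > 50"
)

# Hypothetical wrapper; `project` is your own billing project id.
fetch_stories <- function(project, sql = sql_standard) {
  # Runs the query and returns a reference to the result table
  result_table <- bigrquery::bq_project_query(project, sql)
  # Downloads the result table into a data frame
  bigrquery::bq_table_download(result_table)
}
```

With credentials in place, `stories <- fetch_stories(project)` would be the counterpart of the `query_exec()` call.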
I’m obviously not going to run this query over and over again, so I’ve cached the result. Retrieving it from the cache is just:
stories <- readRDS("/home/misha/R/projects/hackernews/stories_cache.rds") %>%
  dplyr::filter(descendants > 0) %>%
  dplyr::filter(!is.na(descendants)) %>%
  dplyr::mutate(ratio = score/descendants)
I’ve also cleaned the dataframe a bit to have only non-missing and positive descendants.
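The caching step itself isn’t shown, so here is one way it could look: compute once, save with `saveRDS()`, and read back with `readRDS()` on later runs. This is a sketch under assumptions, not the code used for this post; `cached()` and `expensive()` are hypothetical helpers, and a temporary file stands in for the real cache path.

```r
# Hypothetical helper: return the cached value if the file exists,
# otherwise compute it, store it, and return it.
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

# Stand-in for the slow BigQuery call; counts how often it really runs.
calls <- 0
expensive <- function() {
  calls <<- calls + 1
  data.frame(score = c(98, 75), descendants = c(50, 44))
}

cache_path <- tempfile(fileext = ".rds")
first  <- cached(cache_path, expensive)  # computes and writes the file
second <- cached(cache_path, expensive)  # reads the file, no recompute
```

Deleting the file is enough to force a fresh download.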
Let’s look at what kind of data we have:
## 'data.frame': 131827 obs. of 15 variables:
##  $ by         : chr "Garbage" "teng" "newman314" "ghosh" ...
##  $ score      : int 98 75 107 183 119 142 251 81 170 87 ...
##  $ time       : int 1352800497 1363053023 1484596441 1485235710 1282419418 1424670541 1379623273 1421604151 1502125673 1454162856 ...
##  $ timestamp  : POSIXct, format: "2012-11-13 09:54:57" "2013-03-12 01:50:23" ...
##  $ title      : chr "Netflix Open Source" "Strikingly Creates Simple, Beautiful Web Sites in Minutes" "With Google's RAISR, images can be up to 75% smaller without losing detail" "Unexpected Consequences of Self-Driving Cars" ...
##  $ type       : chr "story" "story" "story" "story" ...
##  $ url        : chr "http://netflix.github.com/" "http://lifehacker.com/5989963/strikingly-creates-simple-beautiful-web-sites-in-minutes" "http://www.pcmag.com/news/351027/google-raisr-intelligently-makes-low-res-images-high-quality" "http://rodneybrooks.com/unexpected-consequences-of-self-driving-cars/" ...
##  $ text       : chr "" "" "" "" ...
##  $ parent     : int NA NA NA NA NA NA NA NA NA NA ...
##  $ deleted    : logi NA NA NA NA NA NA ...
##  $ dead       : logi NA NA NA NA NA NA ...
##  $ descendants: int 50 44 51 209 52 32 228 71 80 62 ...
##  $ id         : int 4777242 5359507 13412464 13469038 1623482 9092781 6414162 8908197 14949107 11001833 ...
##  $ ranking    : int NA NA NA NA NA NA NA NA NA NA ...
##  $ ratio      : num 1.96 1.705 2.098 0.876 2.288 ...
There are 134303 stories in total. The most interesting columns for me are score and descendants (the latter indicates the number of comments on the story).
Distribution of scores over time:
stories %>%
  ggplot(., aes(x = timestamp, y = score)) +
  geom_smooth() +
  scale_x_datetime(date_breaks = "1 year", date_labels = "%Y") +
  theme_solarized()
## `geom_smooth()` using method = 'gam'
Distribution of comments over time:
stories %>%
  ggplot(., aes(x = timestamp, y = descendants)) +
  geom_smooth() +
  scale_x_datetime(date_breaks = "1 year", date_labels = "%Y") +
  theme_solarized()
## `geom_smooth()` using method = 'gam'
Indeed, for both scores and number of comments there is an upward trend. That is not surprising considering that the site itself is probably growing in popularity. These two graphs also help answer my first question: at least in the last couple of years, posts get on average 150-170 points and 70-100 comments.
Now to the other questions I posed in the beginning.
Distribution of ratio
stories %>% ggplot(., aes(x = ratio)) + geom_histogram(bins = 100) + theme_solarized()
It looks like there are some really interesting stories based on my criteria :). Rather than filtering out the stories with a very high ratio, I’ll plot the log of the ratio:
stories %>% ggplot(., aes(x = log(ratio))) + geom_density() + theme_solarized()
The log of the ratio does look approximately normal, which means the ratio itself is closer to log-normal than to normal. So, I wasn’t completely correct in my second assumption.
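One rough way to back up the eyeball test is to compare skewness before and after taking logs: a log-normal variable is strongly right-skewed, while its log should be close to symmetric. The sketch below uses simulated data, not the HN stories, with parameters picked by hand; `skewness()` is a small helper defined here rather than taken from a package.

```r
set.seed(42)

# Simulated ratios, NOT the HN data: log-normal with made-up parameters
sim_ratio <- rlnorm(10000, meanlog = 0.9, sdlog = 0.8)

# Sample skewness: third central moment over the cubed standard deviation
skewness <- function(x) {
  mean((x - mean(x))^3) / sd(x)^3
}

skew_raw <- skewness(sim_ratio)       # expected to be clearly positive
skew_log <- skewness(log(sim_ratio))  # expected to be near zero

cat("raw:", skew_raw, " log:", skew_log, "\n")
```

Running the same two lines on `stories$ratio` would show whether the real data behaves similarly; `qqnorm(log(stories$ratio))` is another quick visual check.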
stories %>% dplyr::select(ratio) %>% summary()
##      ratio
##  Min.   :  0.1652
##  1st Qu.:  1.5077
##  Median :  2.4348
##  Mean   :  3.8337
##  3rd Qu.:  4.0968
##  Max.   :398.0000
I’ve defined controversial posts to be all posts where ratio is below 1.
stories %>% dplyr::filter(ratio < 1) %>% nrow()
## [1] 11798
In total, there are 11798 (or 9%) such stories. For the sake of demonstration, I’ll only show the titles and URLs of the 20 posts with the lowest (most controversial) ratio.
stories %>%
  dplyr::select(title, url, ratio) %>%
  dplyr::arrange(ratio) %>%
  dplyr::top_n(-20) %>%
  knitr::kable()
## Selecting by ratio
| title | url | ratio |
|:---|:---|---:|
| Please tell us what features you’d like in news.ycombinator | | 0.1651955 |
| Microsoft is Dead | http://www.paulgraham.com/microsoft.html | 0.1730337 |
| HN: share your unneeded Google Wave invites | | 0.1958763 |
| Ask HN: Where do you live? | | 0.2021661 |
| Soaring Student Debt Prompts Calls for Relief | http://www.wsj.com/articles/soaring-student-debt-prompts-calls-for-relief-1473759003?mod=pls_whats_news_us_business_f | 0.2054264 |
| Inappropriate comments at pycon 2013 called out | https://twitter.com/adriarichards/status/313417655879102464/photo/1 | 0.2205882 |
| Poll: Where are you from? | | 0.2208333 |
| Ask HN: What annoys you? | | 0.2347826 |
| Why Do Obese Patients Get Worse Care? Many Doctors Don’t See Past the Fat | http://www.nytimes.com/2016/09/26/health/obese-patients-health-care.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=second-column-region&region=top-news&WT.nav=top-news&_r=0 | 0.2355556 |
| Ask HN: Whom do you admire most? | | 0.2393162 |
| In Fed and Out, Many Now Think Inflation Helps | http://www.nytimes.com/2013/10/27/business/economy/in-fed-and-out-many-now-think-inflation-helps.html?pagewanted=all | 0.2467532 |
| Americans Are Putting Billions More Than Usual in Their 401(k)s | https://www.bloomberg.com/news/articles/2017-01-04/americans-are-putting-billions-more-than-usual-in-their-401-k-s | 0.2544248 |
| Reddit CEO Cracks Down on Abusive Content to Protect Users, Attract Advertisers | https://www.wsj.com/articles/reddit-ceo-cracks-down-on-abusive-content-to-protect-users-attract-advertisers-1510853654?mod=e2tw | 0.2546917 |
| Ask HN: What do you drink? | | 0.2581967 |
| Ask HN: What are you reading right now? | | 0.2651007 |
| Why i’m done wearing a helmet | http://www.bikinginmpls.com/im-done-wearing-helmet/ | 0.2669683 |
| How Women Got Crowded Out of the Computing Revolution | https://www.bloomberg.com/view/articles/2017-08-19/how-women-got-crowded-out-of-the-computing-revolution | 0.2723577 |
| YC Summer 2017 Invites/Rejections | | 0.2738095 |
| “A Statement with My View on Curtis Yarvin and Strange Loop” | https://s3.amazonaws.com/sl-notes/yarvin.txt | 0.2771084 |
| Ask HN: What are you working on today? | | 0.2774194 |
I had forgotten about “Ask HN” posts, and it is obvious that they dominate when it comes to the metric I’ve chosen. But the rest of the posts do look quite controversial, so I would say that my intuition is confirmed.
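If you wanted to take the “Ask HN” style posts out of the picture, one option is to keep only stories that link somewhere. This is a sketch on a few rows copied from the table above, and it rests on an assumption: that self-posts (Ask HN, polls and the like) are exactly the ones with an empty `url`, which holds for the rows shown but which I haven’t checked across the whole dataset.

```r
# A few rows copied from the table above
sample_posts <- data.frame(
  title = c(
    "Ask HN: Where do you live?",
    "Microsoft is Dead",
    "Poll: Where are you from?",
    "Inappropriate comments at pycon 2013 called out"
  ),
  url = c(
    "",
    "http://www.paulgraham.com/microsoft.html",
    "",
    "https://twitter.com/adriarichards/status/313417655879102464/photo/1"
  ),
  ratio = c(0.2021661, 0.1730337, 0.2208333, 0.2205882),
  stringsAsFactors = FALSE
)

# Assumption: a missing or empty url marks a self-post
has_link   <- !is.na(sample_posts$url) & nzchar(sample_posts$url)
link_posts <- sample_posts[has_link, ]

# Most controversial linked posts first
link_posts <- link_posts[order(link_posts$ratio), ]
print(link_posts[, c("title", "ratio")])
```

Matching on the title prefix (for example with `grepl("^Ask HN", title)`) would be an alternative, though it would miss posts like the feature request at the top of the table.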
Now to interesting posts:
stories %>%
  dplyr::select(title, url, ratio) %>%
  dplyr::arrange(ratio) %>%
  dplyr::top_n(20) %>%
  knitr::kable()
## Selecting by ratio
Hm, I would say that here I’m a bit off. Posts with the highest ratio seem to be posts about the deaths of famous (to people at HN) people. In fact, most of these posts are not necessarily interesting; they are rather posts where people don’t have a lot to add to the topic, and yet the topic is relevant to the community at large, so there is a large number of upvotes with few comments.
I think all of us have intuitions about one thing or another. I like that it is rather trivial (and yet interesting) to check whether one’s intuition is correct. And, hey, 2.5 out of 4 ain’t bad!