There was a useR! conference in July 2017. I’ve been meaning to go over some of the talks that are interesting (to me) for quite some time, so now that it is September in Berlin and the weather is atrocious, what could be better than spending some time listening to interesting people talking about interesting things? Well, maybe many things, but whatever, I’m doing this.
By way of preparation I went over the schedule and wrote down every single talk that I thought was interesting. In total I found 28 talks (for a couple of talks I can’t find videos or slides online). There were also a couple of tutorials (those were on the order of 2-3 hours each), but this post is only about talks.
The purpose of this post is mostly for me to have cliff notes about each talk. Basically, I was going to watch them anyway, so why not also put down links to them and a few things I found interesting for posterity?
The talk goes over the process of building a Shiny app for a client. From the slides it looks like the app is for internal users of a company that analyzes the voice of the person on the other end of the phone line.
What I found especially interesting is that the entire app (including infrastructure and deployment) was developed by data scientists. In the end, they are running Docker containers behind a load balancer.
Another interesting tidbit is that they were forced to use data.table instead of dplyr since it gave them a 25x speed-up in lookups. I’m not sure why they were loading entire tables into memory instead of using a database, but perhaps there were good reasons to do so.
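For reference, the kind of lookup where data.table shines is the keyed one; a minimal sketch (not the speakers’ actual code, and the exact speed-up will of course depend on the data):

```r
library(data.table)

# a keyed data.table uses a binary search on lookups
# instead of scanning the whole column
dt <- data.table(id = 1:1e6, value = (1:1e6) * 2)
setkey(dt, id)               # sort + index by id

row  <- dt[.(123456)]        # keyed lookup, roughly O(log n)
row2 <- dt[id == 123456]     # unkeyed scan, roughly O(n)
```

Both return the same row; the keyed form is what buys the speed on repeated lookups.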
The talk introduces the idea of a transformation forest, which is an extension of the random forest idea that is fairly well known to most people. The main idea is that random forest and many other regression approaches predict the conditional mean of the target given the predictors. With a transformation forest what you get instead is the conditional distribution of the target given the predictors. That allows for much richer inference, since you get to see higher moments of the distribution (variance, for example) and draw conclusions based on that.
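The distinction can be written down compactly: a standard regression forest estimates the conditional mean, while a transformation forest targets the whole conditional distribution function:

```latex
\hat{m}(x) \approx \mathbb{E}\left[\, Y \mid X = x \,\right]
\qquad \text{vs.} \qquad
\hat{F}(y \mid x) \approx \mathbb{P}\left(\, Y \le y \mid X = x \,\right)
```

Once you have $\hat{F}(y \mid x)$, quantities like the conditional variance or quantiles come for free.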
To understand that more fully there is a paper.
This is a talk by Szilard Pafka of benchmarking fame. He has a GitHub repo where he collects the results of his benchmarks. The bottom line is that not every tool is always going to work as advertised. In his talk, for example, he mentions Spark as the tool that gave him the most trouble AND produced the poorest results. This, of course, doesn’t mean that Spark is not a tool you should use, just that you shouldn’t use tools for the sake of using them.
What I thought this talk was about is ensembling in the Kaggle sense (i.e., mixing multiple models together to get higher accuracy). But in fact it is about putting similar packages together and creating a GUI so that people can use them with more ease. From the case studies in the talk it looks like it can be useful for people who don’t use/know R very much. I’m not in that group, so it’s difficult for me to say how useful this approach is. At the same time, the main idea of unifying packages is definitely useful.
The talk is mostly about a new package called pool. The idea of the package is to bring DB connections to Shiny. Since a Shiny app tends to be a single process with multiple users, you don’t want to create a connection every time any user queries a remote database. So, instead, you create a pool of connections that are recycled as needed and maintained separately. This should make things faster for your users with only a little overhead.
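The pattern looks roughly like this with an in-memory SQLite database (a minimal sketch using the pool and RSQLite packages; in a real Shiny app you would create the pool once at startup and close it in `onStop()`):

```r
library(pool)
library(RSQLite)

# create the pool of connections once, at app startup
pool <- dbPool(drv = RSQLite::SQLite(), dbname = ":memory:")

# DBI verbs work on the pool directly: a connection is checked out,
# used for the query, and returned to the pool afterwards
dbWriteTable(pool, "mtcars", mtcars)
res <- dbGetQuery(pool, "SELECT COUNT(*) AS n FROM mtcars")

poolClose(pool)   # tear the pool down when the app stops
```

The nice part is that app code stays plain DBI; the pooling is invisible.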
The package is centered around hyperparameter optimization. What Jakob Richter implemented is an online database where anyone can store their hyperparameter optimization routines for any type of model. The idea is that you can use this information to tune your own model without knowing the current state-of-the-art approaches to tuning. For example, for the random forest implemented in the caret package the author decided that the number of trees shouldn’t be tuned. There is some research that shows this is correct, but maybe some other research comes along and now you definitely should tune the number of trees in your forest. With the package in question, you can just download whatever approach is currently considered the best and use it for your case.
The talk is about actual translation. I thought it would be more about understanding what errors might mean, since sometimes they are quite cryptic. Oh, well.
The talk is about (shock!) ShinyProxy. It is an alternative to Shiny Server Pro from RStudio. It uses Docker on the backend and Java to manage user authentication. The basic idea is that you can create a Docker container with whatever Shiny app you want and then use ShinyProxy to create a protected environment, so that only people with sufficient rights get access to the app. The whole thing is open-sourced and in general can be used for multiple use-cases. And since it is based on Docker, everything that comes with Docker you get for free. So you can use Docker Swarm and all of those thingys.
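For flavor, ShinyProxy is driven by a single `application.yml`; a sketch along these lines (field names from memory, the image and group names are made up — check the ShinyProxy docs for the exact schema):

```yaml
shiny:
  proxy:
    title: Internal apps
    port: 8080
    authentication: ldap
    apps:
      - name: my-app
        display-name: My Shiny App
        docker-image: myorg/my-shiny-app   # any image that runs a Shiny app
        groups: analysts                   # only these users see the app
```

Each app entry maps a Docker image to a set of authorized user groups.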
And of course Docker all the things!
Very interesting talk that goes over a use-case of implementing a stream processor in R on AWS. At the very end Gergely showed an example where a Kinesis producer with flight data is processed by R and then stored in Redis. This is then used by a Shiny app that reads from Redis every two seconds and updates the UI. I think I’ll try to replicate that at some point, since it sounds interesting.
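The Shiny polling side of such an app might look roughly like this (a sketch, not Gergely’s actual code; I’m assuming the redux package for Redis access and a `latest_flights` key that the R consumer keeps updated):

```r
library(shiny)
library(redux)

r <- redux::hiredis()   # connects to Redis on localhost by default

server <- function(input, output, session) {
  flights <- reactive({
    invalidateLater(2000)       # re-run this reactive every two seconds
    r$GET("latest_flights")     # read whatever the consumer last stored
  })
  output$latest <- renderText(flights())
}

shinyApp(ui = fluidPage(textOutput("latest")), server = server)
```

`invalidateLater()` is what turns a plain read into the “UI updates every two seconds” behavior.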
Julia Silge and David Robinson have the tidytext package that can be used for all sorts of things related to text mining. The talk is just an overview of what is possible. But there are many blog posts by both of them where they go into more detail about how to do it.
- We R a Community: making high-quality courses for higher education accessible for all: the >eR-Biostat initiative
A project to develop courses to train students to use R with specific focus on biostatistics.
Some practical advice on how to manage change in a data science environment. The usual suspects: git, Docker, packrat, miniCRAN. One interesting idea is to introduce/have the title of Data Science Administrator, who is responsible for package updates and making sure there are no breaking changes.
Exponential random graph models with a Bayesian implementation. The author created the Bergm package that can be used for this purpose. This is not a topic I’m familiar with, but if I do come across something like this in future work I’ll be sure to check it out more fully.
How do you deploy a package with AWS credentials? This is a fairly common question and since you really don’t want to store them in plain text, there is this package that can encrypt and decrypt any information you want. And it is by Gabor Csardi, so it must be good :).
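I don’t remember the exact API from the talk, but the general shape of the idea — encrypt the secret once, ship the ciphertext, decrypt with a key at runtime — looks like this with the sodium package (the AWS-looking string is fake, of course):

```r
library(sodium)

key    <- keygen()                            # random symmetric key, kept out of the repo
secret <- charToRaw("AKIA-not-a-real-key")    # the credential to protect

cipher <- data_encrypt(secret, key)           # ciphertext is safe to commit/ship
plain  <- data_decrypt(cipher, key)           # decrypted at runtime with the key

rawToChar(plain)
```

The package from the talk wraps this kind of machinery so that credentials can live encrypted inside a package.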
This talk is responsible for this post, so not much more to add. And Docker all the things!
In my previous life I was a social scientist, and IRT was one of the most exciting things I’ve come across. In many ways I see it as a gateway drug for some parts of social science into the more rigorous world of the scientific method.
As for the talk – it is about how to link multiple test forms in order to compare them with each other. More info on the CRAN page.
A web framework for R that makes it easy to create API endpoints. Very similar to plumber, but one cool thing they introduced is a way to serve an API scalably.
DOCKER ALL THE THINGS!
A short introduction to reinforcement learning and an even shorter introduction to a package. I haven’t had a case where I thought I could use reinforcement learning, but maybe some day…
Renjin is an open-source implementation of the R language that runs on the JVM. It is designed to speed up code with, for example, JIT compilation of for-loops. One cool thing that was shown is that you can use Renjin as a package to speed up some parts of your code.
Overview of the odbc package, which is one part of the modern pipeline for using R with databases without too much headache. I’m still unclear where you’re supposed to get ODBC drivers if you are not using the professional version of RStudio.
An IBM package that abstracts away the work with Spark by providing a way to compile your R code to Spark code that can be executed in a distributed manner.
Update on the R Consortium-backed project to improve DBI in R.
No video. The slides are an extended and beautified version of the vignette.
Some practical cases of how to make sure that you are actually using the JIT. Similar to Renjin, you need to pay attention and avoid eval and similar things, since then there is no way for the compiler to work properly.
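Base R’s own compiler package makes the point easy to see; a minimal sketch (the contrast is illustrative — the body of a plain loop compiles to bytecode, while `eval` is opaque to the compiler):

```r
library(compiler)

f <- function(x) {
  s <- 0
  for (i in x) s <- s + i   # a plain loop compiles to bytecode just fine
  s
}
cf <- cmpfun(f)             # explicit bytecode compilation

# the expression inside eval() is only seen at run time,
# so the compiler can do little with a function like this
g <- function(x) eval(quote(sum(x)))
```

Both compute the same thing; only `f`/`cf` gets the full benefit of compilation.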
But the talk is about how one can use R from Haskell. Haskell provides strong typing, while R takes on the role of doing machine learning and other cool stuff that is difficult in Haskell.
The talk covers the shiny.collections package, which is designed to provide Google Docs-like collaboration. What it does is abstract away calls to a database that may change over time, so you always get the most recent version of the data in your version of the Shiny app. In the presentation Marek showed how you can create a chat application in only about 50 lines of code, so it can be a very interesting way to handle multiple users doing something in your Shiny app without too much overhead.