useR! 2017 review: Talks

There was a useR! conference in July 2017. I've been meaning to go over some of the talks that are interesting (to me) for quite some time, so now that it's September in Berlin and the weather is atrocious - what could be better than spending some time listening to interesting people talking about interesting things? Well, maybe many things, but whatever, I'm doing this.

By way of preparation, I went over the schedule and wrote down every talk that I thought looked interesting. In total I found 28 talks (for a couple of them I couldn't find videos or slides online). There were also a couple of tutorials (on the order of 2-3 hours each), but this post is only about the talks.

The purpose of this post is mostly for me to have cliff notes about each talk. Basically, I was going to watch them anyway, so why not also put down links to them and a few things I found interesting, for posterity?

  1. How we built a Shiny App for 700 users?

The talk goes over the process of building a Shiny app for a client. From the slides, it looks like the app is for internal users of a company that analyzes the voice of the person on the other end of the phone line.

What I found especially interesting is that the entire app (including infrastructure and deployment) was developed by data scientists. In the end, they are running Docker containers behind a load balancer.

Another interesting tidbit is that they were forced to use data.table instead of dplyr, since it gave them a 25x speed-up in lookups. I'm not sure why they were loading entire tables into memory instead of using a database, but perhaps there were good reasons to do so.
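For reference, this is the kind of lookup where a keyed data.table shines; a minimal sketch with made-up column names, not the presenters' actual code:

```r
library(data.table)
library(dplyr)

# Made-up example: a million rows keyed by an integer id
df <- data.frame(id = 1:1e6, value = rnorm(1e6))
dt <- as.data.table(df)
setkey(dt, id)  # sorts the table and enables binary-search lookups

system.time(filter(df, id == 123456))  # dplyr scans the whole column
system.time(dt[.(123456)])             # data.table binary-searches the key
```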

  2. Transformation Forests

The talk introduces the transformation forest, an extension of the random forest idea that is fairly well known to most people. The main idea is that random forests and many other regression approaches predict the conditional mean of the target given the predictors. With a transformation forest, what you get is the whole conditional distribution of the target given the predictors. That allows for much richer inference, since you get to see higher moments of the distribution (variance, for example) and draw conclusions based on them.
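In symbols (my own shorthand, not notation from the talk), the difference is between estimating a single number per observation and estimating a whole function:

$$\text{random forest: } \hat{m}(x) \approx \mathbb{E}(Y \mid X = x), \qquad \text{transformation forest: } \hat{F}(y \mid x) \approx P(Y \le y \mid X = x)$$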

To understand this more fully, there is a paper.

  3. A Benchmark of Open Source Tools for Machine Learning from R

This is a talk by Szilard Pafka of benchmarking fame. He has a GitHub repo where he collects the results of his benchmarks. The bottom line is that not every tool is always going to work as advertised. In his talk, for example, he mentions Spark as the tool that gave him the most trouble AND produced the poorest results. This, of course, doesn't mean you shouldn't use Spark, just that you shouldn't use tools for the sake of using them.

  4. Ensemble packages with user friendly interface: an added value for the R community

I thought this talk was about ensembles in the Kaggle sense (i.e., mixing multiple models together to get higher accuracy). In fact, it is about putting similar packages together and creating a GUI so that people can use them with more ease. From the case studies in the talk, it looks like this can be useful for people who don't use or know R much. I'm not in that group, so it's difficult to say how useful the approach is. That said, the main idea of unifying packages is definitely valuable.

  5. Interacting with databases from Shiny

The talk is mostly about a new package called pool. The idea of the package is to bring database connections to Shiny. Since a Shiny app tends to be a single process serving multiple users, you don't want to create a new connection every time a user queries a remote database. Instead, you create a pool of connections that are recycled as needed and maintained separately. This should make things faster for your users with only a little overhead.
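A minimal sketch of the pattern, using a local SQLite file and a hypothetical flights table as stand-ins for the remote database:

```r
library(shiny)
library(pool)
library(DBI)

# One pool per app instead of one connection per query
db_pool <- dbPool(RSQLite::SQLite(), dbname = "data.sqlite")
onStop(function() poolClose(db_pool))

ui <- fluidPage(tableOutput("table"))

server <- function(input, output, session) {
  output$table <- renderTable({
    # pool checks a connection out and returns it automatically
    dbGetQuery(db_pool, "SELECT * FROM flights LIMIT 10")
  })
}

shinyApp(ui, server)
```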

  6. mlrHyperopt: Effortless and collaborative hyperparameter optimization experiments

The package is centered around hyperparameter optimization. What Jakob Richter implemented is an online database where anyone can store their hyperparameter optimization routines for any type of model. The idea is that you can use this information to tune your own model without knowing the current state-of-the-art approaches to tuning. For example, for the random forest implemented in the caret package, the author decided that the number of trees shouldn't be tuned. There is some research showing that this is correct, but maybe some other research comes along and now you definitely should tune the number of trees in your forest. With this package, you can just download whatever approach is currently considered the best and use it for your case.
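If I read the package docs correctly, usage boils down to a single call; a sketch, not verified beyond the talk:

```r
library(mlr)
library(mlrHyperopt)

# Tune an SVM on iris using parameter spaces and tuning setups
# crowd-sourced from the mlrHyperopt database
res <- hyperopt(iris.task, learner = "classif.svm")
res  # tuned hyperparameters and resampled performance
```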

  7. RL10N: Translating Error Messages & Warnings

The talk is about actual translation (into other human languages). I thought it would be more about understanding what errors mean, since sometimes they are quite cryptic. Oh, well.

  8. ShinyProxy

The talk is about (shock!) ShinyProxy. It is an alternative to Shiny Server Pro from RStudio. It uses Docker on the backend and Java to manage user authentication. The basic idea is that you create a Docker container with whatever Shiny app you want, and then use ShinyProxy to create a protected environment so that only people with sufficient rights get access to the app. The whole thing is open source and can be used for multiple use cases. And since it is based on Docker, everything that comes with Docker comes to you for free, so you can use Docker Swarm and all of those thingys.

  9. Developing and deploying large scale Shiny applications for non-life insurance

Pretty cool presentation, but the main value is in the tutorial. It shows how to extend a Shiny app with slick-looking stuff using JavaScript and jQuery and all of those things. Definitely going to revisit it.

And of course Docker all the things!

  10. Stream processing with R in AWS

Very interesting talk that goes over a use case of implementing a stream processor in R on AWS. At the very end, Gergely showed an example where a Kinesis producer emits flight data that is processed by R and then stored in Redis. This is then used by a Shiny app that reads from Redis every two seconds and updates the UI. I think I'll try to replicate that at some point, since it sounds interesting.
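The Shiny side of that loop might look something like this; my own sketch using the redux Redis client and a hypothetical flights:latest key, not Gergely's actual code:

```r
library(shiny)
library(redux)
library(jsonlite)

redis <- hiredis()  # assumes Redis running on localhost:6379

ui <- fluidPage(tableOutput("latest"))

server <- function(input, output, session) {
  output$latest <- renderTable({
    invalidateLater(2000)  # re-read Redis every two seconds
    raw <- redis$GET("flights:latest")  # hypothetical key written by the stream processor
    if (is.null(raw)) return(NULL)
    fromJSON(raw)  # assumes the processor stores JSON
  })
}

shinyApp(ui, server)
```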

  11. Text mining, the tidy way

Julia Silge and David Robinson have the tidytext package, which can be used for all sorts of things related to text mining. The talk is just an overview of what is possible, but there are many blog posts by both of them where they go into more detail about how to do it.
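The core move is tokenizing text into a tidy one-word-per-row table, after which everything is an ordinary dplyr problem; a minimal example:

```r
library(dplyr)
library(tidytext)

text_df <- tibble(line = 1:2,
                  text = c("Text mining, the tidy way",
                           "one token per row makes text a dplyr problem"))

text_df %>%
  unnest_tokens(word, text) %>%          # one word per row
  anti_join(stop_words, by = "word") %>% # drop common stop words
  count(word, sort = TRUE)               # word frequencies
```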

  12. We R a Community - making high quality courses for high education accessible for all: The >eR-Biostat initiative

A project to develop courses that train students to use R, with a specific focus on biostatistics.

  13. Beyond Prototyping: Best practices for R in critical enterprise environments

Some practical advice on how to manage change in a data science environment. The usual suspects: git, Docker, packrat, miniCRAN. One interesting idea is to introduce the title of Data Science Administrator: someone responsible for package updates and for making sure there are no breaking changes.

  14. Bayesian social network analysis with Bergm

A Bayesian implementation of exponential random graph models. The author created the Bergm package for this purpose. This is not a topic I'm familiar with, but if I come across something like it in future work, I'll be sure to check it out more fully.

  15. Can you keep a secret?

How do you deploy a package with AWS credentials? This is a fairly common question, and since you really don't want to store them in plain text, there is this package that can encrypt and decrypt any information you want. And it is by Gabor Csardi, so it must be good :).
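If I'm matching the talk to the right package, it's called secret; a minimal sketch along the lines of its README, with made-up names and placeholder credentials:

```r
library(secret)

vault <- file.path(tempdir(), "vault")
create_vault(vault)

# Generate a keypair for illustration; normally you'd use your SSH key
alice_key <- openssl::rsa_keygen()
add_user("alice", alice_key, vault = vault)

# Credentials are stored encrypted; only listed users can decrypt them
add_secret("aws_credentials",
           list(key = "placeholder", secret = "placeholder"),
           users = "alice", vault = vault)

get_secret("aws_credentials", key = alice_key, vault = vault)
```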

  16. Clouds, Containers and R, towards a global hub for reproducible and collaborative data science

This talk is responsible for this post, so not much more to add. And Docker all the things!

  17. IRT test equating with the R package equateIRT

In my previous life I was a social scientist, and IRT was one of the most exciting things I came across. In many ways I see it as a gateway drug for some parts of social science into the more rigorous world of the scientific method.

As for the talk: it covers how to link multiple test forms so that scores can be compared across them. More info on the CRAN page.

  18. jug: Building Web APIs for R

A web framework for R that makes it easy to create API endpoints. Very similar to plumber, but one cool thing they showed is a way to serve an API scalably.
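A hello-world endpoint in jug looks roughly like this, adapted from its docs:

```r
library(jug)

jug() %>%
  get("/hello", function(req, res, err) {
    "Hello from jug!"  # the returned value becomes the response body
  }) %>%
  simple_error_handler_json() %>%
  serve_it()  # blocks and serves on localhost
```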

  19. Automatically archiving reproducible studies with Docker

DOCKER ALL THE THINGS!

  20. ReinforcementLearning: A package for replicating human behavior in R

A short introduction to reinforcement learning and an even shorter introduction to the package. I haven't had a case where I thought I could use reinforcement learning, but maybe some day…

  21. The renjin package: Painless Just-in-time Compilation for High Performance R

Renjin is an open-source implementation of the R language that runs on the JVM. It is designed to speed up code with, for example, JIT compilation of for-loops. One cool thing that was shown is that you can use renjin as a package to speed up parts of your code.
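If I understood the demo correctly, the CRAN renjin package exposes this as a simple wrapper around a code block; a sketch I haven't verified beyond the talk:

```r
library(renjin)

# Evaluate the block with the Renjin JVM interpreter instead of GNU R;
# loop-heavy code like this is where the JIT is supposed to help
renjin({
  x <- runif(1e6)
  total <- 0
  for (i in seq_along(x)) total <- total + x[i]
  total
})
```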

  22. odbc - A modern database interface

An overview of the odbc package, one part of the modern pipeline for using R with databases without too much headache. I'm still unclear on where you're supposed to get ODBC drivers if you are not using the professional version of RStudio.
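Once you do have a driver installed, connecting looks like any other DBI backend; the server details below are placeholders:

```r
library(DBI)

con <- dbConnect(odbc::odbc(),
                 Driver   = "PostgreSQL",      # requires an installed ODBC driver
                 Server   = "db.example.com",  # placeholder host
                 Database = "analytics",
                 UID      = "user",
                 PWD      = Sys.getenv("DB_PASSWORD"),
                 Port     = 5432)

dbGetQuery(con, "SELECT 1")
dbDisconnect(con)
```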

  23. R4ML: A Scalable R for Machine Learning

An IBM package that abstracts away working with Spark by compiling your R code to Spark code that can be executed in a distributed manner.

  24. Improving DBI

An update on the R Consortium-backed project to improve DBI in R.

  25. Programming with tidyverse grammars

No video. The slides are an extended and beautified version of the vignette.

  26. Taking Advantage of the Byte Code Compiler

Some practical cases of how to make sure you are actually using the JIT. Similar to renjin, you need to be careful not to use source, eval, and similar constructs, since then there is no way for the compiler to work properly.
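The basics, for reference; on recent R versions the JIT is on by default, but you can compile functions explicitly:

```r
library(compiler)

slow_sum <- function(x) {
  total <- 0
  for (i in seq_along(x)) total <- total + x[i]
  total
}

fast_sum <- cmpfun(slow_sum)  # explicitly byte-compile the function

x <- runif(1e6)
system.time(slow_sum(x))  # on recent R the JIT compiles this on first call anyway
system.time(fast_sum(x))

enableJIT(3)  # ask the JIT to compile (almost) everything automatically
```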

  27. R and Haskell: Combining the best of two worlds

The talk is about how one can use R from Haskell. Haskell provides strong typing, while R takes on the machine learning and other cool stuff that is difficult in Haskell.

  28. shiny.collections: Google Docs-like live collaboration in Shiny

The talk covers the shiny.collections package, which is designed to provide Google Docs-like collaboration. What it does is abstract away calls to a database that may change over time, so you always get the most recent version of the data in your Shiny session. In the presentation, Marek showed how you can create a chat application in only about 50 lines of code, so it can be a very interesting way to handle multiple users doing something in your Shiny app without too much overhead.
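The core of the chat demo reduces to something like this; a sketch from memory of the package README (shiny.collections needs a running RethinkDB behind it):

```r
library(shiny)
library(shiny.collections)

connection <- connect()  # connects to a local RethinkDB by default

ui <- fluidPage(
  textInput("message", "Message"),
  actionButton("send", "Send"),
  tableOutput("chat")
)

server <- function(input, output, session) {
  chat <- collection("chat", connection)  # a shared, reactive collection

  observeEvent(input$send, {
    insert(chat, list(text = input$message))  # visible to every connected user
  })

  output$chat <- renderTable(chat$collection)  # re-renders when anyone inserts
}

shinyApp(ui, server)
```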
