Buying a car data sceintist way

Recently I’ve been thinking about what car to buy. Searching the web is one way, but I wanted to do it in a more “data” way, so in this series I’ll show how I’ve extracted data about ~19.000 used cars in Berlin to find out what models/manufacturers seem to hold the price for longer.

In my case I’ve decided to use this as an excuse to practice some of the skills that I’ve been dying to try for a long time - web scrapping with R. Main idea is to get some data from a used car ad place (this will be part 1), process it (this will be also in this post), produce some visualisations (part 2), and finally run some machine learning algorithms to identify what are the most important factor deciding the price of a car (part 3). After that I’m thinking to put it all in Shiny and publish it online, but who knows.

Most of the things described above can be done with some tools from tidyverse. For the first part I’ll be using mostly rvest and dplyr.

How I got data

Since I’m currently based in Berlin, Germany (and I’m looking for a job too!), I’ve decided to use one of the bigger ad places of used and new cars - Moreover, I’ve decided to limit my search only to ~10 km radius around Berlin so that these data can be as useful to me as possible. In total there were ~23,000 ads placed on all models of cars with this search query.

Since I wanted to get as much of data as possible, I’ve decided to harvest data directly from each ad, rather than using API. I did it for three main reasons:

  1. Their API requires credentialing and, frankly, I was a bit lazy to register with them, because…
  2. This whole project is mostly for me to practice web scraping, rather than using API’s (I’ll do API’s later).
  3. The volume of data I was getting to scrape from them is rather small, so it probably didn’t matter for their servers anyways.

So, if you are going to use this approach to get ~200k ads, than you should probably think about using API’s directly, rather than what I’ve done here.

My approach was as follows:

Step 1: Since each search query gives you at most 1000 ads (50 pages with 20 ads per page), I needed a way to create multiple search queries such that they each have at most 1000 ads.

Step 2: First of all, I’ve gone through search and identified 38 pairings of “search-id” and “car manufacturer”. For example, if you are looking for Audi, than in your search query you need to add &makeModelVariant1.makeId=1900 to your query.

Step 3. To limit my search even further, I’ve used two more parameters: minMileage and maxMileage.

That is how it looks like in code:

prod_pairings <- read_csv("./Data/supplements/Producer-makeId-pairings.csv")
minMileage <- seq(0, 100000, by = 10000)
maxMileage <- seq(10000, 110000, by = 10000)
maxMileage[11] <- 200000
id_km <- map_df(1:nrow(prod_pairings), function(i){
  data.frame(carmake = prod_pairings$carmake[i], id = prod_pairings$id[i], minMileage = minMileage, maxMileage = maxMileage)

Resulting id_km provides a data frame with 418 rows of the form: Car Manufacturer ID - minMileage - maxMileage, where minMileage is 10,000 km less than maxMileage.

Step 4: I then used this id_km data frame further to find out how many pages each query generates:

id_km <- id_km %>% 
  group_by(id, minMileage, maxMileage) %>% 
  mutate(url = sprintf(url_base, "&userPosition=52.54503%2C13.44169", minMileage, maxMileage, id)) %>%
  mutate(counter = read_html(url) %>% html_node(css = "span.hit-counter:nth-child(2)") %>% html_text() %>% as.numeric() %/% 20 + 1) 

The actual query is in the second mutate statement and it uses rvest to find a number of results generated with the query that is then processed. Hadley Wickham himself suggested using SelectorGadget to find correct CSS or Xpath of the element, however, in my case it was far more straightforward to use Firefox “Inspect Element” tool.

Step 5: In the next step I produce actual queries using information from previous step:

search_url_df <- map_df(1:nrow(id_km), function(i){
  data.frame(carmake = id_km$carmake[i], 
             url = id_km$url[i], 
             search_url = paste0(id_km$url[i], "&pageNumber=", seq(from = 1, to = id_km$counter[i], by = 1)),
             stringsAsFactors = FALSE)

Step 6: I then use those queries to find out id’s of cars on those pages:

cars_id <- map_df(1:nrow(search_url_df), function(i){
  cat(".") #handy tracker
  ids <- read_html(search_url_df$search_url[i]) %>% html_nodes(".parking-block") %>% xml_attr("data-parking") %>% as.integer()
  data.frame(id = ids[2:length(ids)], car_maker = search_url_df$carmake[i]) #ids start from 2 because 1 corresponds to an ad

cars_id$id <- as.numeric(cars_id$id)
cars_id <- cars_id %>% filter(! %>% filter(!duplicated(id)) 

Notice that I use xml_attr function to find out id’s. Again, it is possible to find it out directly from the page and I’ve found out that in this particular case it makes id extraction much easier.

Step 7: Finally, I’ve had ~19,000 ids of cars that satisfied my search criteria (10 km around Berlin). Now we are finally ready to actually query id by id to find out whatever there is on the page. For that, I’ve created a function that is used in map_df and dummy data frame that contained all NA’s if for some reason query to was unsucessful (the most common is error 404 since some of the ads were removed in between my queries). All the attributes were found again using “Inspect Element” in Firefox.

dummy_df <- data.frame(id = NA_real_,
                       carmake = NA_real_,
                       price = NA_real_,
                       title = NA_real_,
                       owners = NA_real_,
                       envir = NA_real_,
                       category = NA_real_, #category
                       mileage = NA_real_, #mileage
                       eng_vol = NA_real_, #engine volume
                       eng_pow = NA_real_, #engine power
                       fuel_type = NA_real_, #type of fuel
                       fuel_cons = NA_real_, #fuel consumption
                       co2_em = NA_real_, #CO2 emmissions
                       seats = NA_real_, #Number of seats
                       doors = NA_real_, #Number of doors
                       transm = NA_real_, #Transmission
                       co2_class = NA_real_, #Emission class
                       co2_sticker = NA_real_, #Emission sticker
                       fr = NA_real_, #first registration
                       hu = NA_real_, #HU
                       clima = NA_real_, #Climatisation
                       extra = NA_real_,  #Extra features
                       stringsAsFactors = FALSE)

extract_data <- function(i){
  cat(i, ",")
  pg <- read_html(paste0(url_base_car, cars_id$id[i]))
  data.frame(id = cars_id$id[i],
             carmake = cars_id$car_maker[i],
             price = pg %>% html_node(".rbt-prime-price") %>% html_text(),
             title = pg %>% html_node("#rbt-ad-title") %>% html_text(),
             owners = pg %>% html_node("#rbt-numberOfPreviousOwners-v") %>% html_text(),
             envir = pg %>% html_node("#rbt-envkv") %>% html_text(),
             category = pg %>% html_node(".rbt-sl") %>% html_text(), #category
             mileage = pg %>% html_node("#rbt-mileage-v") %>% html_text(), #mileage
             eng_vol = pg %>% html_node("#rbt-cubicCapacity-v") %>% html_text(), #engine volume
             eng_pow = pg %>% html_node("#rbt-power-v") %>% html_text(), #engine power
             fuel_type = pg %>% html_node("#rbt-fuel-v") %>% html_text(), #type of fuel
             fuel_cons = pg %>% html_node("#rbt-envkv\\.consumption-v") %>% html_nodes("div") %>% html_text() %>% paste(collapse = ";"), #fuel consumption
             co2_em = pg %>% html_node("#rbt-envkv\\.emission-v") %>% html_text(), #CO2 emmissions
             seats = pg %>% html_node("#rbt-numSeats-v") %>% html_text(), #Number of seats
             doors = pg %>% html_node("#rbt-doorCount-v") %>% html_text(), #Number of doors
             transm = pg %>% html_node("#rbt-transmission-v") %>% html_text(), #Transmission
             co2_class = pg %>% html_node("#rbt-emissionClass-v") %>% html_text(), #Emission class
             co2_sticker = pg %>% html_node("#rbt-emissionsSticker-v") %>% html_text(), #Emission sticker
             fr = pg %>% html_node("#rbt-firstRegistration-v") %>% html_text(), #first registration
             hu = pg %>% html_node("#rbt-hu-v") %>% html_text(), #HU
             clima = pg %>% html_node("#rbt-climatisation-v") %>% html_text(), #Climatisation
             extra = pg %>% html_node("#rbt-features") %>% html_nodes("p") %>% html_text() %>% paste(collapse = ";"),  #Extra features
             stringsAsFactors = FALSE)

Then, using those elements is simply:

cars_df <- map_df(1:nrow(cars_id), possibly(extract_data, dummy_df)) 

Notice that I’m using possibly. This ensures that if there is an error with a query, than R will use dummy_df instead. In this case (and probably in most cases) it is crucial to use either possibly or safely since if you don’t than entire data frame won’t be created. Because of that my first attempt took ~10 hours just to fail at the very end. It took so long because I’ve had Sys.sleep() statement in a function to reduce the load on a server. When removed, above mapping takes ~1 hour.

Bonus cleaning

Since most of the data harvested was actually pretty clean already, I’ve just put what I did in terms of cleaning here. In the next post I’ll do some Exploratory Data Analysis (EDA) to see whether there are some interesting patterns:


df <- df %>% 
  filter(! %>%
  separate(category, c("category1", "category2"), sep = ",") %>%
  separate(eng_pow, c("eng_pow_kw", "eng_pow_ps"), sep = "\\(") %>%
  separate(fuel_cons, c("fuel_cons_comb", "fuel_cons_city", "fuel_cons_hwy"), sep = ";")

df <- type_convert(df, 
      .default = col_number(),
      carmake = col_factor(levels = levels(as.factor(df$carmake))),
      title = col_character(),
      category1 = col_factor(levels = levels(as.factor(df$category1))),
      category2 = col_character(),
      fuel_type = col_factor(levels = levels(as.factor(df$fuel_type))),
      doors = col_factor(levels = levels(as.factor(df$doors))),
      transm = col_factor(levels = levels(as.factor(df$transm))),
      co2_class = col_factor(levels = levels(as.factor(df$co2_class))),
      fr = col_date(format = "%m/%Y"),
      hu = col_character(),
      clima = col_factor(levels = levels(as.factor(df$clima))),
      extra = col_character()
    ), locale = locale(grouping_mark = "."))

With tidyverse cleaning is actually pretty easy, especially separating columns into more readable form.


There aren't any comments yet. Be the first to comment!

Leave a comment

Thank you

Your comment has been submitted and will be published once it has been approved.



Your post has not been submitted. Please return to the form and make sure that all fields are entered. Thank You!