# Buying a car data sceintist way

Recently I’ve been thinking about what car to buy. Searching the web is one way, but I wanted to do it in a more “data” way, so in this series I’ll show how I’ve extracted data about ~19.000 used cars in Berlin to find out what models/manufacturers seem to hold the price for longer.

In my case I’ve decided to use this as an excuse to practice some of the skills that I’ve been dying to try for a long time - web scrapping with R. Main idea is to get some data from a used car ad place (this will be part 1), process it (this will be also in this post), produce some visualisations (part 2), and finally run some machine learning algorithms to identify what are the most important factor deciding the price of a car (part 3). After that I’m thinking to put it all in Shiny and publish it online, but who knows.

Most of the things described above can be done with some tools from tidyverse. For the first part I’ll be using mostly rvest and dplyr.

## How I got data

Since I’m currently based in Berlin, Germany (and I’m looking for a job too!), I’ve decided to use one of the bigger ad places of used and new cars - mobile.de. Moreover, I’ve decided to limit my search only to ~10 km radius around Berlin so that these data can be as useful to me as possible. In total there were ~23,000 ads placed on all models of cars with this search query.

Since I wanted to get as much of data as possible, I’ve decided to harvest data directly from each ad, rather than using mobile.de API. I did it for three main reasons:

1. Their API requires credentialing and, frankly, I was a bit lazy to register with them, because…
2. This whole project is mostly for me to practice web scraping, rather than using API’s (I’ll do API’s later).
3. The volume of data I was getting to scrape from them is rather small, so it probably didn’t matter for their servers anyways.

So, if you are going to use this approach to get ~200k ads, than you should probably think about using API’s directly, rather than what I’ve done here.

My approach was as follows:

Step 1: Since each search query gives you at most 1000 ads (50 pages with 20 ads per page), I needed a way to create multiple search queries such that they each have at most 1000 ads.

Step 2: First of all, I’ve gone through mobile.de search and identified 38 pairings of “search-id” and “car manufacturer”. For example, if you are looking for Audi, than in your search query you need to add &makeModelVariant1.makeId=1900 to your query.

Step 3. To limit my search even further, I’ve used two more parameters: minMileage and maxMileage.

That is how it looks like in code:

prod_pairings <- read_csv("./Data/supplements/Producer-makeId-pairings.csv")
minMileage <- seq(0, 100000, by = 10000)
maxMileage <- seq(10000, 110000, by = 10000)
maxMileage[11] <- 200000
id_km <- map_df(1:nrow(prod_pairings), function(i){
data.frame(carmake = prod_pairings$carmake[i], id = prod_pairings$id[i], minMileage = minMileage, maxMileage = maxMileage)
})

Resulting id_km provides a data frame with 418 rows of the form: Car Manufacturer ID - minMileage - maxMileage, where minMileage is 10,000 km less than maxMileage.

Step 4: I then used this id_km data frame further to find out how many pages each query generates:

id_km <- id_km %>%
group_by(id, minMileage, maxMileage) %>%
mutate(url = sprintf(url_base, "&userPosition=52.54503%2C13.44169", minMileage, maxMileage, id)) %>%
mutate(counter = read_html(url) %>% html_node(css = "span.hit-counter:nth-child(2)") %>% html_text() %>% as.numeric() %/% 20 + 1) 

The actual query is in the second mutate statement and it uses rvest to find a number of results generated with the query that is then processed. Hadley Wickham himself suggested using SelectorGadget to find correct CSS or Xpath of the element, however, in my case it was far more straightforward to use Firefox “Inspect Element” tool.

Step 5: In the next step I produce actual queries using information from previous step:

search_url_df <- map_df(1:nrow(id_km), function(i){
data.frame(carmake = id_km$carmake[i], url = id_km$url[i],
search_url = paste0(id_km$url[i], "&pageNumber=", seq(from = 1, to = id_km$counter[i], by = 1)),
stringsAsFactors = FALSE)
})

Step 6: I then use those queries to find out id’s of cars on those pages:

cars_id <- map_df(1:nrow(search_url_df), function(i){
cat(".") #handy tracker
ids <- read_html(search_url_df$search_url[i]) %>% html_nodes(".parking-block") %>% xml_attr("data-parking") %>% as.integer() data.frame(id = ids[2:length(ids)], car_maker = search_url_df$carmake[i]) #ids start from 2 because 1 corresponds to an ad
})

cars_id$id <- as.numeric(cars_id$id)
cars_id <- cars_id %>% filter(!is.na(id)) %>% filter(!duplicated(id)) 

Notice that I use xml_attr function to find out id’s. Again, it is possible to find it out directly from the page and I’ve found out that in this particular case it makes id extraction much easier.

Step 7: Finally, I’ve had ~19,000 ids of cars that satisfied my search criteria (10 km around Berlin). Now we are finally ready to actually query mobile.de id by id to find out whatever there is on the page. For that, I’ve created a function that is used in map_df and dummy data frame that contained all NA’s if for some reason query to mobile.de was unsucessful (the most common is error 404 since some of the ads were removed in between my queries). All the attributes were found again using “Inspect Element” in Firefox.

dummy_df <- data.frame(id = NA_real_,
carmake = NA_real_,
price = NA_real_,
title = NA_real_,
owners = NA_real_,
envir = NA_real_,
category = NA_real_, #category
mileage = NA_real_, #mileage
eng_vol = NA_real_, #engine volume
eng_pow = NA_real_, #engine power
fuel_type = NA_real_, #type of fuel
fuel_cons = NA_real_, #fuel consumption
co2_em = NA_real_, #CO2 emmissions
seats = NA_real_, #Number of seats
doors = NA_real_, #Number of doors
transm = NA_real_, #Transmission
co2_class = NA_real_, #Emission class
co2_sticker = NA_real_, #Emission sticker
fr = NA_real_, #first registration
hu = NA_real_, #HU
clima = NA_real_, #Climatisation
extra = NA_real_,  #Extra features
stringsAsFactors = FALSE)

extract_data <- function(i){
cat(i, ",")
pg <- read_html(paste0(url_base_car, cars_id$id[i])) data.frame(id = cars_id$id[i],
carmake = cars_id$car_maker[i], price = pg %>% html_node(".rbt-prime-price") %>% html_text(), title = pg %>% html_node("#rbt-ad-title") %>% html_text(), owners = pg %>% html_node("#rbt-numberOfPreviousOwners-v") %>% html_text(), envir = pg %>% html_node("#rbt-envkv") %>% html_text(), category = pg %>% html_node(".rbt-sl") %>% html_text(), #category mileage = pg %>% html_node("#rbt-mileage-v") %>% html_text(), #mileage eng_vol = pg %>% html_node("#rbt-cubicCapacity-v") %>% html_text(), #engine volume eng_pow = pg %>% html_node("#rbt-power-v") %>% html_text(), #engine power fuel_type = pg %>% html_node("#rbt-fuel-v") %>% html_text(), #type of fuel fuel_cons = pg %>% html_node("#rbt-envkv\\.consumption-v") %>% html_nodes("div") %>% html_text() %>% paste(collapse = ";"), #fuel consumption co2_em = pg %>% html_node("#rbt-envkv\\.emission-v") %>% html_text(), #CO2 emmissions seats = pg %>% html_node("#rbt-numSeats-v") %>% html_text(), #Number of seats doors = pg %>% html_node("#rbt-doorCount-v") %>% html_text(), #Number of doors transm = pg %>% html_node("#rbt-transmission-v") %>% html_text(), #Transmission co2_class = pg %>% html_node("#rbt-emissionClass-v") %>% html_text(), #Emission class co2_sticker = pg %>% html_node("#rbt-emissionsSticker-v") %>% html_text(), #Emission sticker fr = pg %>% html_node("#rbt-firstRegistration-v") %>% html_text(), #first registration hu = pg %>% html_node("#rbt-hu-v") %>% html_text(), #HU clima = pg %>% html_node("#rbt-climatisation-v") %>% html_text(), #Climatisation extra = pg %>% html_node("#rbt-features") %>% html_nodes("p") %>% html_text() %>% paste(collapse = ";"), #Extra features stringsAsFactors = FALSE) } Then, using those elements is simply: cars_df <- map_df(1:nrow(cars_id), possibly(extract_data, dummy_df))  Notice that I’m using possibly. This ensures that if there is an error with a query, than R will use dummy_df instead. In this case (and probably in most cases) it is crucial to use either possibly or safely since if you don’t than entire data frame won’t be created. Because of that my first attempt took ~10 hours just to fail at the very end. It took so long because I’ve had Sys.sleep() statement in a function to reduce the load on a server. When removed, above mapping takes ~1 hour. ## Bonus cleaning Since most of the data harvested was actually pretty clean already, I’ve just put what I did in terms of cleaning here. In the next post I’ll do some Exploratory Data Analysis (EDA) to see whether there are some interesting patterns: library(tidyverse) df <- df %>% filter(!is.na(id)) %>% separate(category, c("category1", "category2"), sep = ",") %>% separate(eng_pow, c("eng_pow_kw", "eng_pow_ps"), sep = "\\(") %>% separate(fuel_cons, c("fuel_cons_comb", "fuel_cons_city", "fuel_cons_hwy"), sep = ";") df <- type_convert(df, cols( .default = col_number(), carmake = col_factor(levels = levels(as.factor(df$carmake))),
title = col_character(),
category1 = col_factor(levels = levels(as.factor(df$category1))), category2 = col_character(), fuel_type = col_factor(levels = levels(as.factor(df$fuel_type))),
doors = col_factor(levels = levels(as.factor(df$doors))), transm = col_factor(levels = levels(as.factor(df$transm))),
co2_class = col_factor(levels = levels(as.factor(df$co2_class))), fr = col_date(format = "%m/%Y"), hu = col_character(), clima = col_factor(levels = levels(as.factor(df$clima))),
extra = col_character()
), locale = locale(grouping_mark = "."))


With tidyverse cleaning is actually pretty easy, especially separating columns into more readable form.

## Comments

There aren't any comments yet. Be the first to comment!

### Thank you

Your comment has been submitted and will be published once it has been approved.

### Sorry

Your post has not been submitted. Please return to the form and make sure that all fields are entered. Thank You!