Counting cyclists in Berlin
I recently found out that berlin has a bunch of automatic bike-counters around berlin…and the data is public! There’s even an interactive map of the stations. This is all pretty awesome. Although the data is stored in multiple sheets of an excel file (#?&!!), I thought it’d be an interesting exercise to clean it up and see how useful it could be for other projects.
Get the data
Let’s grab the data.
filename <- 'cyclist_counts.xlsx'
url <- "https://www.berlin.de/senuvk/verkehr/lenkung/vlb/download/radzaehlung/Gesamtdatei_Stundenwerte_2014_2017.xlsx"
if (!file.exists(filename))
download.file(url, filename, 'libcurl')
It’s probably a good idea to take a look at the excel file so that it’s easier to follow along with the code below. You can also open up the R markdown in RStudio.
Unfortunately, there’s no official checksum provided to guarantee the data were downloaded correctly, but I can show you what my checksum is.
digest::digest(filename, algo = 'md5', file = TRUE)
[1] "e79c4ea5b5e1437d90a7d06f572f741e"
If you get a different md5 checksum, then you don’t have the same file as I do.
The counters
The first sheet Hinweise
(= advice) is mostly text describing some general information about the data. The notable parts tell us that it is the uneditted data from the counter systems and is published at the end of the counting-year. What’s a counting-year
? Good question; no idea. The excel file was last modified on the 20th of April 2018, so I guess the next update will probably be around April 2019.
There are also some disclaimers basically telling us that the counters don’t necessarily give an exact count due to weather conditions, maintainance, and other complications. This is described in more detail on the FAQ page (in German). They work using an induction loop, which detects a metal object moving passed. Full carbon bikes are not countable but there are likely so few that the difference is absorbed in the measurement error anyway. Apparently bikes with more or less than two wheels can also cause miscounts. However, groups of bikes can be counted accurately so long as they don’t ride too closely side-by-side.
The FAQ page also tells us that some counters are set up in pairs in order to count the two directions of traffic separately. Well see how to detect these pairs when we inspect the data.
The first data sheet, Standortdaten
= location data, lists the counters with a unique ID, a street name, lattitude-longitude (not shown here), and installation date.
counters <- filename %>%
readxl::read_excel(sheet = 'Standortdaten') %>%
transmute(
id = `Zähler` %>% as_factor(),
pair = str_replace(id, '-[NSOW]$', '') %>% as_factor(),
location = `Beschreibung - Fahrtrichtung`,
installation_date = as.Date(Installationsdatum)
) %>%
arrange(installation_date, location) %>%
write_csv('counters.csv')
n_counters <- counters %>% distinct(id) %>% nrow()
There are 26 counters listed in the sheet. I have seen it written that there are only 17, but I this number doesn’t count the paired counters as separate. The pairs are listed as north/south or east/west (in German), with the last letter of the ID indicating the direction. This allows us to identify the pair by removing the last character of the ID.
id | pair | location | installation_date |
---|---|---|---|
12-PA-SCH | 12-PA-SCH | Schwedter Steg | 2012-03-01 |
02-MI-JAN-N | 02-MI-JAN | Jannowitzbrücke Nord | 2015-04-01 |
02-MI-JAN-S | 02-MI-JAN | Jannowitzbrücke Süd | 2015-04-01 |
13-CW-PRI | 13-CW-PRI | Prinzregentenstraße | 2015-04-01 |
18-TS-YOR-O | 18-TS-YOR | Yorckstraße Ost | 2015-04-01 |
18-TS-YOR-W | 18-TS-YOR | Yorkstraße West | 2015-04-01 |
27-RE-MAR | 27-RE-MAR | Markstraße | 2015-05-01 |
19-TS-MON | 19-TS-MON | Monumentenstraße | 2015-05-01 |
03-MI-SAN-O | 03-MI-SAN | Invalidenstraße Ost | 2015-06-01 |
03-MI-SAN-W | 03-MI-SAN | Invalidenstraße West | 2015-06-01 |
05-FK-OBB-O | 05-FK-OBB | Oberbaumbrücke Ost | 2015-06-01 |
05-FK-OBB-W | 05-FK-OBB | Oberbaumbrücke West | 2015-06-01 |
26-LI-PUP | 26-LI-PUP | Paul-und-Paula-Uferweg | 2015-06-01 |
24-MH-ALB | 24-MH-ALB | Alberichstraße | 2015-07-01 |
10-PA-BER-N | 10-PA-BER | Berliner Straße Nord | 2016-05-01 |
10-PA-BER-S | 10-PA-BER | Berliner Straße Süd | 2016-05-01 |
17-SK-BRE-O | 17-SK-BRE | Breitenbachplatz Ost | 2016-05-01 |
17-SK-BRE-W | 17-SK-BRE | Breitenbachplatz West | 2016-05-01 |
23-TK-KAI | 23-TK-KAI | Kaisersteg | 2016-05-01 |
15-SP-KLO-S | 15-SP-KLO | Klosterstraße Süd | 2016-05-01 |
20-TS-MAR-N | 20-TS-MAR | Mariendorfer Damm Nord | 2016-05-01 |
20-TS-MAR-S | 20-TS-MAR | Mariendorfer Damm Süd | 2016-05-01 |
21-NK-MAY | 21-NK-MAY | Maybachufer | 2016-05-01 |
06-FK-FRA-O | 06-FK-FRA | Frankfurter Allee Ost | 2016-06-01 |
06-FK-FRA-W | 06-FK-FRA | Frankfurter Allee West | 2016-06-01 |
15-SP-KLO-N | 15-SP-KLO | Klosterstraße Nord | 2016-06-01 |
I’m not sure what the numbers in the ID are, but the first group of letters is an abbreviation of the district (Mitte, Pankow, etc), the second group is an abbreviation of the street name, and the third group (where it exists) is the direction of traffic (north, south, east, west).
Take a deep breath
Now for the data we came here for. As of this writing, there are 6 sheets, one for each year from 2012 to 2017.
years <- 2012:2017
The corresponding sheets are called Jahresdatei YYYY
(= year data).
sheets0 <- years %>%
as.character() %>%
paste0('Jahresdatei ', .)
sheets0
[1] "Jahresdatei 2012" "Jahresdatei 2013" "Jahresdatei 2014"
[4] "Jahresdatei 2015" "Jahresdatei 2016" "Jahresdatei 2017"
Let’s turn that into a named list so that it works well with the map_dfr
function below.
sheets <- sheets0 %>%
as.list() %>%
set_names(., .)
The read_excel
function isn’t as flexible as its read_csv
cousin so we’ll write a wrapper for loading the data from a particular sheet.
read_excel <- function(sheet, .filename=filename, .n=n_counters) {
readxl::read_excel(
.filename,
sheet = sheet,
col_types = c('date', rep('numeric', .n))
)
}
Now we can simply map that function over all of our sheet names to get a dataframe from each sheet, then bind the rows of these dataframes together so that we have all the data together. The columns also have awkward headers, so we’ll rename them to the counter ID.
df_untidy <- sheets %>%
map_dfr(read_excel, .id = 'sheet') %>%
# get rid of redundant info in the col names
# and use english names
set_names(names(.) %>% str_replace_all(" .*", "")) %>%
rename(measured_at = Zählstelle)
sheet | measured_at | 02-MI-JAN-N | 02-MI-JAN-S | 03-MI-SAN-O | 03-MI-SAN-W |
---|---|---|---|---|---|
Jahresdatei 2017 | 2017-12-31 19:00:00 | 29 | 29 | 19 | 14 |
Jahresdatei 2017 | 2017-12-31 20:00:00 | 31 | 22 | 16 | 15 |
Jahresdatei 2017 | 2017-12-31 21:00:00 | 21 | 18 | 11 | 5 |
Jahresdatei 2017 | 2017-12-31 22:00:00 | 11 | 18 | 6 | 10 |
Jahresdatei 2017 | 2017-12-31 23:00:00 | 7 | 9 | 10 | 20 |
All that remains is to tidy up the data:
- Each variable forms a column: we have the ID of the counter, the hour of measurement, and the actual measurement.
- Each observation forms a row: every hourly measurement should have exactly one row.
- Each type of observational unit forms a table: there’s only the hourly count.
The untidy data has a possible measurement for every counter, so we’ll need to gather those columns together. The nulls exists only because of the untidy format used - we can simply drop them.
df_tidy <- df_untidy %>%
gather(counter_id, total, -measured_at, -sheet) %>%
filter(!is.na(total)) %>%
mutate(counter_id = counter_id %>% factor(levels = levels(counters$id)))
sheet | measured_at | counter_id | total |
---|---|---|---|
Jahresdatei 2017 | 2017-12-31 18:00:00 | 27-RE-MAR | 13 |
Jahresdatei 2017 | 2017-12-31 19:00:00 | 27-RE-MAR | 4 |
Jahresdatei 2017 | 2017-12-31 20:00:00 | 27-RE-MAR | 5 |
Jahresdatei 2017 | 2017-12-31 21:00:00 | 27-RE-MAR | 2 |
Jahresdatei 2017 | 2017-12-31 22:00:00 | 27-RE-MAR | 2 |
Jahresdatei 2017 | 2017-12-31 23:00:00 | 27-RE-MAR | 4 |
After performing some checks, I realised there are a lot of duplicated data. In particular, Jahresdatei 2012
also contains a complete copy of Jahresdatei 2013
. Let’s get rid of the duplicates. Whilst we’re at it, we’ll add in some extra info about each counter by joining on the counter dataframe.
df <- df_tidy %>%
distinct(counter_id, measured_at, .keep_all = TRUE) %>%
# add a unique row id
ungroup() %>%
arrange(counter_id, measured_at) %>%
mutate(id = 1:n()) %>%
# other helpful stuff
# reorder columns
select(id, sheet, counter_id, measured_at, total) %>%
# save the data
write_csv('counter_data.csv') %>%
# add counter info
inner_join(counters, by = c('counter_id' = 'id')) %>%
arrange(id)
This leaves us with a total of 526,790 measurements.
I’ll use the stopifnot
function to list a bunch of properties that I know the data must satisfy if it is correct. This prevents this notebook from compiling if any of them have an entry that doesn’t evaluate to TRUE
.
stopifnot(
# all counts non-negative
df$total >= 0,
# measurments occur after installation
df$measured_at >= df$installation_date,
# measurements occur within the time range
year(df$measured_at) <= max(years),
year(df$measured_at) >= min(years),
# installations occur within the time range
year(df$installation_date) <= max(years),
year(df$installation_date) >= min(years),
# there are as many row ids as rows
max(df$id) == nrow(df),
# at most as many measurements as hours in the year
df %>%
group_by(counter_id, year = year(measured_at)) %>%
count() %>%
transmute(8760 + if_else(year %% 4 == 0, 24, 0) - n >= 0) %>%
pull(),
# exactly 26 counters
n_counters == df %>%
distinct(counter_id) %>%
nrow()
)
Exploring the data
In principle there should be one measurement per counter per hour, starting from when each counter is installed. Let’s check how many gaps appear in the data.
gaps <- df %>%
group_by(counter_id) %>%
mutate(
next_measurement = lead(measured_at),
difference = next_measurement - measured_at
) %>%
filter(difference > 1) %>%
select(id, counter_id, difference, measured_at, next_measurement)
id | counter_id | difference | measured_at | next_measurement |
---|---|---|---|---|
578 | 12-PA-SCH | 2 hours | 2012-03-25 01:00:00 | 2012-03-25 03:00:00 |
580 | 12-PA-SCH | 2 hours | 2012-03-25 04:00:00 | 2012-03-25 06:00:00 |
103630 | 13-CW-PRI | 16 hours | 2015-09-21 23:00:00 | 2015-09-22 15:00:00 |
146616 | 18-TS-YOR-O | 13 hours | 2017-11-15 16:00:00 | 2017-11-16 05:00:00 |
354380 | 10-PA-BER-N | 4 hours | 2016-05-14 00:00:00 | 2016-05-14 04:00:00 |
356679 | 10-PA-BER-N | 6 hours | 2016-08-17 22:00:00 | 2016-08-18 04:00:00 |
408515 | 15-SP-KLO-N | 16 hours | 2017-08-13 23:00:00 | 2017-08-14 15:00:00 |
414641 | 17-SK-BRE-O | 8 hours | 2016-08-24 20:00:00 | 2016-08-25 04:00:00 |
426976 | 17-SK-BRE-W | 2 hours | 2016-05-21 02:00:00 | 2016-05-21 04:00:00 |
466572 | 20-TS-MAR-S | 485 hours | 2017-07-24 23:00:00 | 2017-08-14 04:00:00 |
494816 | 23-TK-KAI | 198 hours | 2017-07-01 23:00:00 | 2017-07-10 05:00:00 |
494859 | 23-TK-KAI | 14 hours | 2017-07-11 23:00:00 | 2017-07-12 13:00:00 |
There are only 12 gaps, and all but two are resolved fairly quickly. I’m not sure what caused the two larger gaps.
Some counters are in higher traffic locations than others. Note that the larger the median is, the larger the interquartile range is too. This happens in almost all of the plots.
by_counter <- df %>%
group_by(counter_id) %>%
summarise(
q25 = quantile(total, 0.25),
q50 = quantile(total, 0.50),
q75 = quantile(total, 0.75)
)
There is also a commuter-effect, with peaks showing at 08:00 and 17:00. There is almost no traffic in the wee hours of the morning.
by_hour <- df %>%
group_by(h = hour(measured_at)) %>%
summarise(
q25 = quantile(total, 0.25),
q50 = quantile(total, 0.50),
q75 = quantile(total, 0.75)
)
Since there is a strong commuter-effect and some counters are paired to count different directions, it makes sense to take a look at the difference in traffic between the pairs during the day. We see that, for example, in Jannowitzbrücke (02-MI-JAN
) cyclists travel south-bound in the morning and north-bound in the evening. Interestingly, traffic seems to flow in mostly north-bound at Berlinerstraße (10-PA-BER
) at almost any time of the day. I’m not sure why this might be the case but it’s worth looking into later. The one exception is at 08:00, when many cyclists are likely commuting to the centre of Berlin (south-bound).
pairs_by_hour <- df %>%
filter(str_detect(location, 'Nord|Ost|Süd|West')) %>%
inner_join(df, by = c('pair', 'measured_at')) %>%
filter(as.numeric(counter_id.x) < as.numeric(counter_id.y)) %>%
select(measured_at, matches('id|total')) %>%
mutate(
comparison = paste0(counter_id.x, ' - ', counter_id.y),
difference = total.x - total.y
) %>%
group_by(g = hour(measured_at), comparison) %>%
summarise(difference = mean(difference))
In Klosterstraße (15-SP-KLO
) and Mariendorfer Damm (20-TS-MAR
) it looks like traffic is roughly equal in both directions all day. The former flows slightly more north-bound in evening, the latter more north-bound in the morning. Given the positions of these two pairs of counters, it is unlikely that they are part of a commuter-loop. The overall counts for these four counters is fairly small on average, so these apparent effects may just be noise.
The weekday counts seem fairly constant and fall by about a factor of a third on the weekend. This is consistent with the commuter-effect seen above.
by_weekday <- df %>%
group_by(g = wday(measured_at, label = TRUE, week_start = 1)) %>%
summarise(
q25 = quantile(total, 0.25),
q50 = quantile(total, 0.50),
q75 = quantile(total, 0.75)
)
There is also a yearly seasonal pattern. Apart from berliners being more willing to commute in summer, I’d be willing to bet that tourists are also responsible for the summer peak in June/July. In winter, the roads can become unsafe for cyclists due to the ice, snow, and lack of maintenance.
by_month <- df %>%
group_by(g = month(measured_at, label = TRUE)) %>%
summarise(
q25 = quantile(total, 0.25),
q50 = quantile(total, 0.50),
q75 = quantile(total, 0.75)
)
There is no clear yearly effect, as the median just ambles around 60 cyclist per hour.
by_year <- df %>%
group_by(g = year(measured_at)) %>%
summarise(
q25 = quantile(total, 0.25),
q50 = quantile(total, 0.50),
q75 = quantile(total, 0.75)
)
What’s next?
So…that was all easier than expected. The weird north-bound pattern at Berlinerstraße and the two large measurement gaps still need to be looked into, but overall the data are in a good state. I’d like to be able to combine this with data on bike-related traffic accidents, but from what I can see, the Berlin police chose one of the few formats worse than an excel sheet: pdf tables.
With 3D pie charts.
…