The dating game
By Zhian N. Kamvar in R
January 2, 2020
Dates are not so sweet
It is known: parsing dates entered by humans is a huge pain:
Source:
https://xkcd.com/1179/
ISO 8601 nerds: a crack has developed in the old way. The new year has wounded both the m/d/yy and the d/m/yy factions. The time is ripe for a new global order. TONIGHT (2020-01-01) WE RIDE https://t.co/1YTL1SWDm2
— Brooke Watson Madubuonwu (@brookLYNevery1) January 1, 2020
Dates are a never-ending source of fresh hell. There are so many different ways parsing dates can go wrong. It’s not quite as bad as the horror that is parsing HTML with regex, but it’s close. For example, consider the regex used to parse valid ISO 8601 dates… you know, the dates that are supposed to be “the good ones”:
^([\+-]?\d{4}(?!\d{2}\b))((-?)((0[1-9]|1[0-2])(\3([12]\d|0[1-9]|3[01]))?|W([0-4]\d|5[0-2])(-?[1-7])?|(00[1-9]|0[1-9]\d|[12]\d{2}|3([0-5]\d|6[1-6])))([T\s]((([01]\d|2[0-3])((:?)[0-5]\d)?|24\:?00)([\.,]\d+(?!:))?)?(\17[0-5]\d([\.,]\d+)?)?([zZ]|([\+-])([01]\d|2[0-3]):?([0-5]\d)?)?)?)?$
Oh deer lord 😱
This is only just the beginning. There are so many blog posts already dedicated to working with dates in R, and I’m not going to re-hash the whole rigormarole of explaining the difference between Date and POSIXt classes. I’m just going to point you to some excellent walkthroughs such as this gist by Bonnie Dixon (hosted by Noam Ross).
What I want to focus on is: what tools are available to parse date strings to
the Date
class in R and how well do they work to weird situations? Some of the
common things I’ve seen in my work for example:
- an ambiguous mix of dd/mm/yy and mm/dd/yy format (e.g. 02/04/20 or 04/02/20)
- Dates are in French, but you’re working in an English locale (e.g. 04 Février 2020)
- Dates are imported from Excel (e.g. 43865, which are represented as the either the number of days since 1899-12-30 if you are importing from Windows OR the number of days since 1904-01-01 on MacOS)1
There’s a package for that
I’m immediately familiar with three packages on CRAN that are solely2 dedicated to parsing dates: {lubridate}, {anytime}, {parsedate}. However, I know that I’m probably missing some, so I’ll try to use the {pkgsearch} package to find them. I know that I want to search for any package that mentions “date” in the title and has either “parse,” “handle,” “convert,” or “detect” in the description.
N.B. I was missing a package as Jim Hester kindly pointed out:
One other package you don't mention is readr, which can parse dates on import _or_ after the fact with `readr::parse_date()`, e.g. https://t.co/L0xOlMHOYC
— Jim Hester (@jimhester_) January 2, 2020
It doesn't handle the raw date count from excel, but it handles all the rest fine. pic.twitter.com/mMa0r21m0u
library("dplyr")
library("stringr")
library("pkgsearch")
date_pkgs <- pkg_search("date", size = 200) %>%
filter(str_detect(title, "[Dd]ate") &
str_detect(description, "[Pp]ars|[Hh]andl|[Cc]onvert|[Dd]etect")) %>%
filter(str_detect(maintainer_name, "Zhian", negate = TRUE)) %>% # my packages don't parse dates
filter(package != "chron") %>% # chron is a well-known package that doesn't do text to date conversions
arrange(desc(downloads_last_month))
date_pkgs[] %>%
mutate(package = str_glue("[{package}](https://cran.r-project.org/package={package})")) %>%
select(Package = package, `Downloads Last Month` = downloads_last_month, Title = title) %>%
knitr::kable(format.args = list(big.mark = ","))
Package | Downloads Last Month | Title |
---|---|---|
lubridate | 859,067 | Make Dealing with Dates a Little Easier |
anytime | 78,442 | Anything to ‘POSIXct’ or ‘Date’ Converter |
parsedate | 19,323 | Recognize and Parse Dates in Various Formats, Including All ISO |
8601 Formats | ||
datetime | 7,047 | Nominal Dates, Times, and Durations |
date | 4,351 | Functions for Handling Dates |
incidence | 3,271 | Compute, Handle, Plot and Model Incidence of Dated Events |
clock | 2,389 | Date-Time Types and Tools |
incidence2 | 1,071 | Compute, Handle and Plot Incidence of Dated Events |
MMWRweek | 828 | Convert Dates to MMWR Day, Week, and Year |
dint | 735 | A Toolkit for Year-Quarter, Year-Month and Year-Isoweek Dates |
datetimeutils | 695 | Utilities for Dates and Times |
datefixR | 389 | Fix Really Messy Dates |
rccdates | 340 | Date Functions for Swedish Cancer Data |
jalcal | 310 | Conversion Between Jalali (Persian or Solar Hijri) and Gregorian |
Calendar Dates | ||
unstruwwel | 229 | Detect and Parse Historic Dates |
There are a few more, but then there are some that don’t really parse dates, such as {dint}, {MMWRweek}, and {datetime}. This leaves us with a total of seven packages on CRAN that handle dates: {lubridate}, {anytime}, {parsedate}, {date}, {datetimeutils}, and {rccdates}.
Thunderdate
Let’s see how these packages do on our date gauntlet, specified from above.
the_dates <- c("2020-02-04", "04 February 2020", "2/4/20" , "4/2/20" , "04 Février 2020", 43865)
# Formats for {base} R need to be in the exact order
the_formats <- c("%Y-%m-%d" , "%d %B %Y" , "%m/%d/%y", "%d/%m/%y", "%d %B %Y")
# Formats for {lubridate} are much easier to read
print(lub_formats <- unique(gsub("[[:punct:][:space:]]", "", the_formats)))
## [1] "Ymd" "dBY" "mdy" "dmy"
Setting up the {readr} function for use with {purrr}
the_locales <- c("en", "en", "en", "en", "fr", "en")
readr_parse_date <- function(date, format, locale) {
readr::parse_date(date, format = format, locale = readr::locale(locale))
}
purrrlist <- list(the_dates, c(the_formats, NA), the_locales)
the_origin <- as.Date("1970-01-01")
res <- tibble(
original = the_dates,
base = as.Date(the_dates, format = the_formats),
lubridate = lubridate::parse_date_time(the_dates, orders = lub_formats),
readr = purrr::pmap_dbl(purrrlist, readr_parse_date) %>% as.Date(origin = the_origin),
anytime = anytime::anydate(the_dates),
parsedate = parsedate::parse_date(the_dates),
date = date::as.date(the_dates),
datetimeutils = datetimeutils::guess_datetime(the_dates),
rccdates = rccdates::as.Dates(the_dates)
) %>%
mutate_at(-1, as.Date)
## Warning: 1 failed to parse.
## Warning: 1 parsing failure.
## row col expected actual
## 1 -- date like NA 43865
original | base | lubridate | readr | anytime | parsedate | date | datetimeutils | rccdates |
---|---|---|---|---|---|---|---|---|
2020-02-04 | 2020-02-04 | 2020-02-04 | 2020-02-04 | 2020-02-04 | 2020-02-04 | NA | NA | 2020-02-04 |
04 February 2020 | 2020-02-04 | 2020-04-20 | 2020-02-04 | 2020-02-04 | 2020-02-04 | 2020-02-04 | NA | NA |
2/4/20 | 2020-02-04 | 2020-02-04 | 2020-02-04 | NA | 2020-02-04 | 1920-02-04 | NA | NA |
4/2/20 | 2020-02-04 | 2020-04-02 | 2020-02-04 | NA | 2020-04-02 | 1920-04-02 | NA | NA |
04 Février 2020 | NA | 2020-04-20 | 2020-02-04 | NA | 2020-01-04 | NA | NA | NA |
43865 | NA | NA | NA | 4386-01-01 | 2021-12-27 | NA | NA | NA |
Table 1: Results
base | lubridate | readr | anytime | parsedate | date | datetimeutils | rccdates |
---|---|---|---|---|---|---|---|
4 | 2 | 5 | 2 | 3 | 1 | 0 | 1 |
Table 2: Dates correctly parsed (of 6)
There’s a lot going on here, so I’ll summarize a few things:
- None of these format perfectly, locales and numeric dates are hard to handle.
- {datetimeutils} could not convert a single date presented and {rccdate} could only parse the ISO 8601 formatted date (but it’s not meant to be a general date parsing package).
- By far, the most successful package was {readr}, but it takes a bit of work to set up the function to return a date vector with multiple locales and formats.
- The most successful “magic” package (one without knowledge of formats) was {parsedate}.
- {anytime} is super conservative, but will assume that a string of numbers represents a date (e.g. 20200204).
- {lubridate} does fairly well, but it makes some strange mistakes with the month spelled out.
- {base} only does well here because I gave it the EXACT specifications for each date. If I didn’t, it would only be able to parse the ISO 8601 date.
For those curious, this is what the results would look like from a French locale:
library("withr")
with_locale(c(LC_TIME = "fr_FR.UTF-8"), {
res <- tibble(
original = the_dates,
base = as.Date(the_dates, format = the_formats),
lubridate = lubridate::parse_date_time(the_dates, orders = lub_formats),
readr = purrr::pmap_dbl(purrrlist, readr_parse_date) %>% as.Date(origin = the_origin),
anytime = anytime::anydate(the_dates),
parsedate = parsedate::parse_date(the_dates),
date = date::as.date(the_dates),
datetimeutils = datetimeutils::guess_datetime(the_dates),
rccdates = rccdates::as.Dates(the_dates)
) %>%
mutate_at(-1, as.Date)
})
knitr::kable(res, caption = "Les resultats", label = 3)
original | base | lubridate | readr | anytime | parsedate | date | datetimeutils | rccdates |
---|---|---|---|---|---|---|---|---|
2020-02-04 | 2020-02-04 | 2020-02-04 | 2020-02-04 | 2020-02-04 | 2020-02-04 | NA | NA | 2020-02-04 |
04 February 2020 | NA | 2020-04-20 | 2020-02-04 | 2020-02-04 | 2020-02-04 | 2020-02-04 | NA | NA |
2/4/20 | 2020-02-04 | 2020-02-04 | 2020-02-04 | NA | 2020-02-04 | 1920-02-04 | NA | NA |
4/2/20 | 2020-02-04 | 2020-04-02 | 2020-02-04 | NA | 2020-04-02 | 1920-04-02 | NA | NA |
04 Février 2020 | 2020-02-04 | 2020-04-20 | 2020-02-04 | NA | 2020-01-04 | NA | NA | NA |
43865 | NA | NA | NA | 4386-01-01 | 2021-12-27 | NA | NA | NA |
Table 3: Les resultats
No right answer
I think the main thing to take away from this exercise is that there is no right answer when it comes to parsing dates. I’ve helped out with a project that aims at trying to provide yet another magic solution, but even it has drawbacks. You ultimately get the best results when you know the formats and locales you are dealing with, so the best option is just to evangelize about the ISO 8601 and hope. All the available packages have their own idiosyncrasies and it’s an absolute minefield when it comes to dates (note that this doesn’t even get into things like the year 2038 problem, or leap seconds).
-
Note that these may be mixed in with text-based dates in Excel, making them even more difficult to parse ↩︎
-
This is a bit of a stretch since lubridate provides tools for manipulating timespans ↩︎