1.4 Date and Time
Date/time is by far the messiest data type. Handling it is an important skill for statistical analysis, because you will run into it often.
R handles date/time with three classes:
- The Date class represents dates.
- The POSIXct and POSIXlt classes represent times.
POSIX stands for Portable Operating System Interface of UNIX, ct for calendar time and lt for local time.
Internally, R stores dates as the number of days since 1970-01-01, and times as the number of seconds since 1970-01-01 (except for the POSIXlt class). This reference point, 1st January 1970, is called the epoch. In the POSIXlt class, date/time values are stored as a list of components (hour, min, sec, months etc.), which makes it easy to extract the parts.
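The internal storage described above can be checked directly. This is a minimal sketch in base R; the names d, tm and lt are chosen only for illustration.

```r
# A Date is the number of days since 1970-01-01
d <- as.Date("1970-01-10")
as.numeric(d)
#> [1] 9

# A POSIXct time is the number of seconds since 1970-01-01 00:00:00 UTC
tm <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
as.numeric(tm)
#> [1] 86400

# A POSIXlt time is a list of components instead
lt <- as.POSIXlt(tm)
is.list(unclass(lt))
#> [1] TRUE
```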
1.4.1 Parsing Date and Time
Often, you will get date/time as strings. There are several approaches for parsing strings into date/times. Let’s try some of them. In this section, we will use 4 packages to explore them.
Date
Dates can be written in many formats, and there are just as many ways to parse them.
With specified format
as.Date() and readr::parse_date() are good choices. They allow a wide range of input formats through the format = argument. The default format is a 4 digit year, followed by a 1 or 2 digit month, then a 1 or 2 digit day, separated by either dashes (-) or forward slashes (/), i.e., "%Y-%m-%d" and "%Y/%m/%d".
## Default ones
as.Date("2021-3-14")
#> [1] "2021-03-14"
as.Date("2004/11/6")
#> [1] "2004-11-06"
# But parse_date() has a drawback
parse_date("2004/11/6")
#> Warning: 1 parsing failure.
#> row col expected actual
#> 1 -- date like 2004/11/6
#> [1] NA
parse_date("2021-3-14")
#> Warning: 1 parsing failure.
#> row col expected actual
#> 1 -- date like 2021-3-14
#> [1] NA
# You have to zero-pad single digit months and days
parse_date("2004/11/06")
#> [1] "2004-11-06"
parse_date("2021-03-14")
#> [1] "2021-03-14"
For dates not in the standard format, you need to specify the format string using the codes in the table below.
Code | Value | Remark |
---|---|---|
"%d" | Day of the month (decimal number) | |
"%e" | Day of the month with an optional leading space | Only for readr::parse_date() |
"%m" | Month (decimal number) | |
"%B" | Month (full name) | Case doesn’t matter |
"%b" | Month (3 letter abbreviated name) | Case doesn’t matter |
"%Y" | Year (4 digits) | |
"%y" | Year (2 digits) | 00-69 -> 2000-2069, 70-99 -> 1970-1999 |
as.Date("2/8/2021", format = "%m/%d/%Y")
#> [1] "2021-02-08"
# While specifying the format,
# you don't need to include 0 for single digit decimals
parse_date("7-1-71", format = "%d-%m-%y")
#> [1] "1971-01-07"
# Case doesn't matter
parse_date("SepteMBer 28, 2002", format = "%B %d, %Y")
#> [1] "2002-09-28"
as.Date("18AuG03", format = "%d%b%y")
#> [1] "2003-08-18"
You can use non-English month names with parse_date() by passing a locale() object to the locale = argument.
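As a short illustration of the locale = argument, here is a sketch that parses a French month name with readr. The date string and the name french_date are made up for the example.

```r
library(readr)

# "janvier" is only recognised when the locale is French
french_date <- parse_date("1 janvier 2015",
  format = "%d %B %Y",
  locale = locale("fr")
)
french_date
#> [1] "2015-01-01"
```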
Without specified format
You can also use helpers provided by the lubridate package. They are short and unambiguous!
- Identify the order of the year, month and day in your dates.
- Arrange y, m and d in that exact order. It will be the name of the parsing function in lubridate.
# Like as.Date() and unlike parse_date(),
# you don't need to include 0 in single digit decimals
ymd("2002-12-8")
#> [1] "2002-12-08"
# You may include "th" after day
mdy("January 7th, 1971")
#> [1] "1971-01-07"
mdy("January 7, 1971")
#> [1] "1971-01-07"
# Unquoted numbers are allowed
dym(28200209)
#> [1] "2002-09-28"
# Case doesn't matter
ydm("2002-28-SeP")
#> [1] "2002-09-28"
Check out the lubridate cheatsheet for more.
Parsing Times
Unlike dates, the time part of a time string has two kinds of representation:
- 24 hour clock, i.e., hh:mm:ss (default one)
- 12 hour clock, hh:mm:ss followed by am or pm
But a time usually has to be given together with a specific date, so there are many ways to represent a specific time. The readr::parse_time(), readr::parse_datetime(), as.POSIXct() and as.POSIXlt() functions can be used to parse them by specifying their formats. The default formats are
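Format | Functions |
---|---|
"%Y-%m-%d %H:%M:%OS" | as.POSIXct(), as.POSIXlt() and readr::parse_datetime() |
"%Y-%m-%d %H:%M:%S" | as.POSIXct(), as.POSIXlt() and readr::parse_datetime() |
"%Y/%m/%d %H:%M:%OS" | as.POSIXct(), as.POSIXlt() and readr::parse_datetime() |
"%Y/%m/%d %H:%M:%S" | as.POSIXct(), as.POSIXlt() and readr::parse_datetime() |
"%Y-%m-%d %H:%M" | as.POSIXct(), as.POSIXlt() and readr::parse_datetime() |
"%Y/%m/%d %H:%M" | as.POSIXct(), as.POSIXlt() and readr::parse_datetime() |
"%Y-%m-%d" | as.POSIXct(), as.POSIXlt(), readr::parse_datetime() and readr::parse_date() |
"%Y/%m/%d" | as.POSIXct(), as.POSIXlt(), readr::parse_datetime() and readr::parse_date() |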
As you may have guessed, the code "%H" is for hours, "%M" for minutes, "%OS" for fractional seconds and "%S" for integer seconds.
# Defaults
parse_datetime("2023-07-24 23:55:26")
#> [1] "2023-07-24 23:55:26 UTC"
time_1 <- as.POSIXct("2023-07-24 23:55:26")
time_1
#> [1] "2023-07-24 23:55:26 UTC"
# Specifying format
time_2 <- as.POSIXlt("25072023 08:32:07", format = "%d%m%Y %H:%M:%S")
time_2
#> [1] "2023-07-25 08:32:07 UTC"
# Don't forget to include dates!
as.POSIXct("08:05:06")
#> Error in as.POSIXlt.character(x, tz, ...): character string is not in a standard unambiguous format
# But parse_time() allows that
parse_time("08:05:06")
#> 08:05:06
# am/pm can be specified
parse_time("4:06 pm")
#> 16:06:00
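The 12 hour clock can also be combined with a date in base R. The sketch below assumes an English or C locale, because the "%p" (AM/PM) code of strptime() depends on the locale; "%I" is the hour on a 12 hour clock. The name evening is made up for the example.

```r
# %I = hour on the 12 hour clock, %p = AM/PM indicator (locale dependent)
evening <- as.POSIXct("2023-07-24 04:06 PM",
  format = "%Y-%m-%d %I:%M %p",
  tz = "UTC"
)

# Show the parsed time on the 24 hour clock
format(evening, "%H:%M")
#> [1] "16:06"
```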
Specifying timezone
By default, the as.POSIXct() function stores times in the system’s time zone, but you can change this with the tz = argument. On the other hand, readr::parse_datetime() stores times in UTC (the same as GMT for practical purposes), which can be changed with locale = locale(tz = <TIME_ZONE>).
# In Asia/Singapore
parse_datetime("2020-01-01 11:42:03", locale = locale(tz = "Asia/Singapore"))
#> [1] "2020-01-01 11:42:03 +08"
# In GMT
as.POSIXct("2020-01-01 11:42:03", tz = "GMT")
#> [1] "2020-01-01 11:42:03 GMT"
# With system's tz
time_3 <- parse_datetime("2020-01-01 11:42:03",
  locale = locale(tz = Sys.timezone())
)
time_3
#> [1] "2020-01-01 11:42:03 UTC"
Sys.timezone() returns the time zone of your system.
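A time zone only changes how an instant is displayed, not the instant itself. Here is a small base R sketch of that idea; the names instant and singapore_clock are made up, and the time zone names assume the usual tz database is available on your system.

```r
# One instant, recorded in UTC
instant <- as.POSIXct("2020-01-01 11:42:03", tz = "UTC")

# The same instant as read on a clock in Singapore (UTC+8)
singapore_clock <- format(instant, tz = "Asia/Singapore")
singapore_clock
#> [1] "2020-01-01 19:42:03"

# The stored number of seconds since the epoch is untouched
as.numeric(instant) == as.numeric(as.POSIXct(singapore_clock, tz = "Asia/Singapore"))
#> [1] TRUE
```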
1.4.2 Extracting the components
When dealing with data over a long timeframe, the years, months, weekdays, weeks, quarters, days of the month etc. are often useful for insights. Let’s extract them from some famous statisticians’ birthdays.
statisticians_bdays <- c(
  CRRao = as.Date("1920-09-10"),
  PCMahalanobis = as.Date("1893-06-29"),
  Cramer = as.Date("1893-09-25"),
  KRParthasarathy = as.Date("1936-06-25")
)
With inbuilt functions
The functions months() and weekdays() come with base R, while year(), week(), quarter() and day() are provided by the lubridate package. The names of these functions are self-explanatory.
# vector of years
year(statisticians_bdays)
#> CRRao PCMahalanobis Cramer KRParthasarathy
#> 1920 1893 1893 1936
# vector of months
months(statisticians_bdays)
#> CRRao PCMahalanobis Cramer KRParthasarathy
#> "September" "June" "September" "June"
# vector of weekdays
weekdays(statisticians_bdays)
#> CRRao PCMahalanobis Cramer KRParthasarathy
#> "Friday" "Thursday" "Monday" "Thursday"
# vector of week numbers
week(statisticians_bdays)
#> [1] 37 26 39 26
# vector of quarters
quarter(statisticians_bdays)
#> [1] 3 2 3 2
# vector of days of the months
day(statisticians_bdays)
#> [1] 10 29 25 25
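If you prefer to stay in base R, the same components can be pulled out with format() and the strptime codes, plus quarters(). This is only a sketch of an alternative, not the approach used above; the name bday is made up.

```r
bday <- as.Date("1920-09-10")

# format() returns strings, so convert to integers where needed
as.integer(format(bday, "%Y")) # year
#> [1] 1920
as.integer(format(bday, "%m")) # month
#> [1] 9
as.integer(format(bday, "%d")) # day of the month
#> [1] 10

# quarters() is the base R counterpart for quarters
quarters(bday)
#> [1] "Q3"
```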
From default components
You can strip out the different components of a POSIXlt object with the unclass() and unlist() functions.
# a POSIXct object is just a number of seconds, so there are no components
unclass(time_1)
#> [1] 1690242926
#> attr(,"tzone")
#> [1] ""
# as a list of components
unclass(time_2)
#> $sec
#> [1] 7
#>
#> $min
#> [1] 32
#>
#> $hour
#> [1] 8
#>
#> $mday
#> [1] 25
#>
#> $mon
#> [1] 6
#>
#> $year
#> [1] 123
#>
#> $wday
#> [1] 2
#>
#> $yday
#> [1] 205
#>
#> $isdst
#> [1] 0
#>
#> $zone
#> [1] "UTC"
#>
#> $gmtoff
#> [1] 0
#>
#> attr(,"tzone")
#> [1] "UTC"
#> attr(,"balanced")
#> [1] TRUE
# as a named character vector
unlist(time_2)
#> sec min hour mday mon year wday yday isdst zone gmtoff
#> "7" "32" "8" "25" "6" "123" "2" "205" "0" "UTC" "0"
# extract seconds
time_2$sec
#> [1] 7
# extract weekday number
time_2$wday
#> [1] 2
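Note that some POSIXlt components are offsets rather than calendar values: mon and yday count from 0, and year counts from 1900. A minimal sketch of converting them back, using a fresh object named lt for illustration:

```r
lt <- as.POSIXlt("2023-07-25 08:32:07", tz = "UTC")

# mon is 0-based: 6 means July
lt$mon + 1
#> [1] 7

# year is counted from 1900
lt$year + 1900
#> [1] 2023

# yday is 0-based: add 1 for the day of the year
lt$yday + 1
#> [1] 206
```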
1.4.3 Operations on date/time
date_f1 <- as.Date("04/08/2021", format = "%m/%d/%Y")
date_f2 <- as.Date("October 8, 2021", format = "%B %d, %Y")
Difference between 2 dates/times
The subtraction operator can be used to get the difference between 2 dates in days, or between 2 times in a unit chosen automatically.
date_f1 - date_f2
#> Time difference of -183 days
time_2 - time_1
#> Time difference of 8.611389 hours
The inbuilt function difftime() returns the difference in the units you specify.
# in weeks
difftime(date_f1, date_f2, units = "weeks")
#> Time difference of -26.14286 weeks
# default is days
difftime(date_f1, date_f2)
#> Time difference of -183 days
# in seconds
difftime(time_1, as.POSIXct("1970-01-01 00:00:00", tz = "UTC"), units = "secs")
#> Time difference of 1690242926 secs
as.POSIXct("2021-03-10 08:32:07") - as.POSIXct("2023-03-09 23:55:26")
#> Time difference of -729.6412 days
You can even apply diff() to a vector of dates, which returns the differences between consecutive elements of the vector.
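As a small illustration of consecutive differences, the sketch below assumes that base R's diff() is the function meant here. The visits vector is made up for the example.

```r
visits <- as.Date(c("2021-04-08", "2021-04-15", "2021-05-06"))

# Gaps between each date and the one before it
gaps <- diff(visits)
gaps
#> Time differences in days
#> [1]  7 21
```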
Addition and Subtraction of days and seconds
Any number added to or subtracted from a date object is treated as day(s). On the other hand, the same for a time object is considered as seconds.
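For example, here is a quick sketch with a date and a time (the names some_date and some_time are made up):

```r
some_date <- as.Date("2021-04-08")
some_time <- as.POSIXct("2023-07-24 23:55:26", tz = "UTC")

# Numbers added to a Date are days
some_date + 10
#> [1] "2021-04-18"
some_date - 8
#> [1] "2021-03-31"

# Numbers added to a POSIXct time are seconds
some_time + 60
#> [1] "2023-07-24 23:56:26 UTC"
```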
Comparing with logical operators
Apart from the logical AND (&& and &) and logical OR (|| and |), all the usual comparison operators (==, !=, <, <=, >, >=) can be used on dates and times.
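A minimal sketch of such comparisons, with made-up names d1 and d2:

```r
d1 <- as.Date("2021-04-08")
d2 <- as.Date("2021-10-08")

# Earlier dates are "smaller"
d1 < d2
#> [1] TRUE
d1 == d2
#> [1] FALSE

# Comparisons are vectorised
as.Date(c("2021-01-01", "2021-12-31")) > d1
#> [1] FALSE  TRUE
```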
Sequence of dates
You can create a sequence of dates using the seq() function by specifying the starting date, the number of dates and the step between them.
# 7 dates, 1 week apart
seq(date_f1, length.out = 7, by = "week")
#> [1] "2021-04-08" "2021-04-15" "2021-04-22" "2021-04-29" "2021-05-06"
#> [6] "2021-05-13" "2021-05-20"
# 7 dates, 14 days apart
seq(date_f1, length.out = 7, by = 14)
#> [1] "2021-04-08" "2021-04-22" "2021-05-06" "2021-05-20" "2021-06-03"
#> [6] "2021-06-17" "2021-07-01"
# 7 dates, 2 weeks apart
seq(date_f1, length.out = 7, by = "2 weeks")
#> [1] "2021-04-08" "2021-04-22" "2021-05-06" "2021-05-20" "2021-06-03"
#> [6] "2021-06-17" "2021-07-01"
# 7 dates, 7 months apart
seq(date_f1, length.out = 7, by = "7 months")
#> [1] "2021-04-08" "2021-11-08" "2022-06-08" "2023-01-08" "2023-08-08"
#> [6] "2024-03-08" "2024-10-08"
# 7 dates, 4 years apart
seq(date_f1, length.out = 7, by = "4 years")
#> [1] "2021-04-08" "2025-04-08" "2029-04-08" "2033-04-08" "2037-04-08"
#> [6] "2041-04-08" "2045-04-08"
1.4.4 The chron package
chron date/time objects are different from the usual ones. The package stores and prints times in its own chron format.
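To give a feel for the chron format, here is a hedged sketch of creating two chron objects with chron(), which by default expects dates as "m/d/y" and times as "h:m:s". The names start_ch and end_ch are hypothetical and are not the objects used in the rest of this section.

```r
library(chron)

# Dates are given as month/day/year, times as hour:minute:second
start_ch <- chron(dates. = "03/09/13", times. = "23:55:26")
end_ch <- chron(dates. = "03/10/13", times. = "08:32:07")

# chron objects are numbers of days, so the difference is a fraction of a day
gap_ch <- end_ch - start_ch
gap_ch
```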
Arithmetic of chron objects
# comparison
time_2_ch > time_1_ch
#> [1] FALSE
# Adding 10 days
time_1_ch + 10
#> [1] (08/03/13 23:55:26)
# subtraction
time_2_ch - time_1_ch
#> [1] -730485
# difference in the specified unit
difftime(time_2_ch, time_1_ch, units = "hours")
#> Time difference of -17531640 hours
# difference in the time
as.chron("2013-03-10 08:32:07") - as.chron("2013-03-09 23:55:26")
#> [1] 08:36:41
Remember that chron does not adjust for time zones.