1.2 Data

Data is a collection of discrete values that conveys important factual information. Analyzing it systematically is an essential part of statistics.

1.2.1 Understanding different kinds of data

Generally you will deal with 3 kinds of data:

  • Discrete Numeric Data
  • Continuous Numeric Data
  • Categorical Data

Stevens (1946) gives a broad classification of data from measurements into four categories: nominal, ordinal, interval and ratio scales.

You may notice that many data are described in terms of numbers, and many variables naturally take only discrete values. Such data can be visualized with boxplots and histograms. Key features of such data are the centre, the spread and the shape.

  • Centre: The most widely used measure of centre is the mean (the average) of the data set. Other measures include the median and the mode. They tell us where the data are centred. For example, take a dataset of 10 numbers (say 1, 90, 48, 7, 7, 8, 9, 2, 3, 4) and order them from lowest to highest (i.e., 1, 2, 3, 4, 7, 7, 8, 9, 48, 90). The mean is 17.9, and the median and the mode are both 7. If you replace the largest value by a larger number and the smallest value by a smaller number, the median and the mode do not change, and the mean may not change either (the two changes can cancel out). If you replace only the smallest value by a smaller number, the mean changes but the median and the mode do not.
  • Spread: Understanding the variability of the given data is very important. If the mean specifies the centre, then how far the data range around it is determined by their variability or spread. It is often measured by the variance (var()), the standard deviation (sd()) or the inter-quartile range (IQR()). For example, suppose you have a dataset of Statistics exam scores where everyone does well and gets scores such as 98, 99, 100; then the spread of the data is low. But if in the same exam some students get 0, 4, 10 and some students get 90, 92, then the spread is high.
  • Shape: To understand various distributional aspects of the dataset, one needs to understand its shape, for example whether it is symmetric or skewed around its mean. Other aspects include which data points are more likely than others. For example, take the probability mass function of \(\Bin{10, \frac{1}{2}}\). Its shape is symmetric about its mean \(x = 5\), but that of \(\Poi{4}\) is not. The shape of the distribution is governed by the nature of its graph around the mean, in particular whether it is skewed left or right.
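
The centre and spread examples above can be checked directly in R. This is a minimal sketch: stat_mode() is a hypothetical helper written for this illustration (base R's mode() reports the storage mode of an object, not the most frequent value), and the replacement value -50 is an arbitrary choice.

```r
# Example data from the text above
x <- c(1, 90, 48, 7, 7, 8, 9, 2, 3, 4)

# Hypothetical helper: the most frequent value of a numeric vector
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

centre <- c(mean = mean(x), median = median(x), mode = stat_mode(x))

# Replace only the smallest value by a smaller one (arbitrary value -50)
y <- x
y[which.min(y)] <- -50
centre_changed <- c(mean = mean(y), median = median(y), mode = stat_mode(y))

# Spread of the two sets of exam scores
low_spread  <- c(98, 99, 100)
high_spread <- c(0, 4, 10, 90, 92)
spread <- c(sd_low = sd(low_spread), sd_high = sd(high_spread))

print(centre)
print(centre_changed)
print(spread)
```

Only the mean moves when the smallest value is changed, and the second set of scores has a much larger standard deviation than the first.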

1.2.2 Playing with Data

R has many single-valued data types. In this book, we will focus on the three most important ones.

  1. Numeric Data
marks_in_stat <- 40
mode(marks_in_stat)
#> [1] "numeric"
  2. Character Data surrounded by double quotes
course <- "B.Sc."
mode(course)
#> [1] "character"
  3. Logical Data TRUE and FALSE
is_smart <- TRUE
mode(is_smart)
#> [1] "logical"

Vectors, Lists, Matrices

Manipulating these 3 types of data structures is the most basic skill that will help you in data analysis. You have already seen the creation and slicing of vectors, but much more is left!

More on vectors

The functions rep(), seq() and their friends rep_len(), seq_along(), seq_len() may also help you to create vectors.

seq() creates a vector just like the colon : operator. You can specify the gap between numbers with the by = argument.

s_1 <- seq(1, 10)
s_half <- seq(1, 10, by = 0.5)

Explore the seq_along(), seq_len() functions.
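
As a starting point for that exploration, here is a small sketch of seq_len() and seq_along(). The vector marks is a made-up example.

```r
marks <- c(40, 55, 72)          # hypothetical vector

idx_len   <- seq_len(5)         # the integers 1, 2, ..., 5
idx_along <- seq_along(marks)   # one index per element of marks

# seq_len(0) is empty, whereas 1:0 counts down and gives c(1, 0)
empty_idx <- seq_len(0)
bad_idx   <- 1:0

print(idx_len)
print(idx_along)
print(length(empty_idx))
print(bad_idx)
```

The last two lines show why seq_len() is often preferred over 1:n when n may be zero.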

The rep() function replicates the 1st argument as many times as the 2nd argument.

rep(6, 7)
#> [1] 6 6 6 6 6 6 6

## 1st argument can be a vector
rep(2:6, 3)
#>  [1] 2 3 4 5 6 2 3 4 5 6 2 3 4 5 6
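
The rep() family has a few more options worth trying. The sketch below uses the each = and times = arguments of rep() and the rep_len() function mentioned earlier; the input vector 2:4 is an arbitrary choice.

```r
# Repeat every element twice before moving on to the next one
rep_each <- rep(2:4, each = 2)

# Repeat the i-th element times[i] times
rep_times <- rep(2:4, times = c(1, 2, 3))

# Recycle the vector until the result has exactly 7 entries
rep_fixed <- rep_len(2:4, 7)

print(rep_each)
print(rep_times)
print(rep_fixed)
```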
Lists

A list is similar to a vector but, unlike a vector, it can have components with mixed data types.

x_lis <- list(x = 1:3, b = "B")
mode(x_lis)
#> [1] "list"
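
Components of a list can be pulled out by name. A short sketch, reusing the same list as above, shows the difference between $ or [[ ]] (which return the component itself) and [ ] (which returns a shorter list).

```r
x_lis <- list(x = 1:3, b = "B")

comp_x   <- x_lis$x        # the integer vector 1:3
comp_b   <- x_lis[["b"]]   # the character string "B"
sub_list <- x_lis["b"]     # still a list, with one component

print(comp_x)
print(comp_b)
print(mode(sub_list))
```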
Matrices

You can create this two-dimensional data structure, the matrix, with the matrix() function, specifying the entries in the 1st argument as a vector. The number of rows is specified with nrow = and the number of columns with ncol =. You can use both, but be sure that they multiply up to match the number of entries.

mat_a <- matrix(seq(3, 5, by = 1 / 10), nrow = 7, ncol = 3)
mat_a
#>      [,1] [,2] [,3]
#> [1,]  3.0  3.7  4.4
#> [2,]  3.1  3.8  4.5
#> [3,]  3.2  3.9  4.6
#> [4,]  3.3  4.0  4.7
#> [5,]  3.4  4.1  4.8
#> [6,]  3.5  4.2  4.9
#> [7,]  3.6  4.3  5.0

## Only with number of columns
mat_b <- matrix(seq(3, 5, by = 1 / 10), ncol = 3)

It’s good practice to specify only one of the two. By default the entries are filled columnwise. You may fill them rowwise by setting byrow = TRUE.

mat_c <- matrix(seq(3, 5, by = 1 / 10), ncol = 7, byrow = TRUE)

The entries can be accessed with position or row number or column number.

## 4th row
mat_b[4, ]
#> [1] 3.3 4.0 4.7

## Particular entry (specifying position)
mat_a[2, 3]
#> [1] 4.5

But you also need to manipulate them. The usual entrywise operations are the same as in the case of vectors: +, -, * and / (check on your own!). %*% is different: it performs matrix multiplication.

## Two matrices matching the usual order
mat_c %*% mat_a
#>        [,1]   [,2]   [,3]
#> [1,]  76.51  92.68 108.85
#> [2,]  92.68 112.28 131.88
#> [3,] 108.85 131.88 154.91

## Matrix-vector product
mat_c %*% 1:7
#>       [,1]
#> [1,]  95.2
#> [2,] 114.8
#> [3,] 134.4
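
To see the difference between the entrywise * and the matrix product %*% on a case small enough to verify by hand, here is a sketch with two made-up 2 x 2 matrices.

```r
a <- matrix(1:4, nrow = 2)            # columns (1, 2) and (3, 4)
b <- matrix(c(0, 1, 1, 0), nrow = 2)  # swaps the two columns when multiplied

entrywise <- a * b     # multiplies matching entries
mat_prod  <- a %*% b   # row-by-column matrix multiplication

print(entrywise)
print(mat_prod)
```

Here the entrywise product keeps only the off-diagonal entries of a, while the matrix product returns a with its columns swapped.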

And finally, the transpose can be computed with the t() function.

t(mat_a)
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,]  3.0  3.1  3.2  3.3  3.4  3.5  3.6
#> [2,]  3.7  3.8  3.9  4.0  4.1  4.2  4.3
#> [3,]  4.4  4.5  4.6  4.7  4.8  4.9  5.0
t(1:7)
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,]    1    2    3    4    5    6    7

Also note that a matrix can be made of strings but, just like vectors, if you enter mixed data types they are coerced to the same data type.
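
The coercion rule can be observed directly. This is a small sketch with made-up values.

```r
# Numbers and logicals mixed with a string all become character
mixed_vec <- c(1, "a", TRUE)

# Logicals mixed with numbers become numeric (TRUE -> 1, FALSE -> 0)
num_log <- c(1, TRUE, FALSE)

# The same rule applies to the entries of a matrix
mixed_mat <- matrix(c(1, "two", 3, "four"), nrow = 2)

print(mixed_vec)
print(mode(num_log))
print(mode(mixed_mat))
```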

1.2.3 Dealing with data frames

A data frame is a generic data object used to store tabular data. It has three main components: the data, observations (or rows) and variables (or columns).

Create

Use the data.frame() function to create a data frame.

ka_district <- c(
  "Bagalakote", "Ballari", "Belagavi", "Bengaluru Rural", "Bengaluru Urban",
  "Bidar", "Chamarajanagara", "Chikkaballapura", "Chikkamagaluru",
  "Chitradurga", "Dakshina Kannada", "Davanagere", "Dharwada",
  "Gadag", "Hassana", "Haveri", "Kalaburagi", "Kodagu", "Kolara",
  "Koppala", "Mandya", "Mysuru", "Raichuru", "Ramanagara", "Shivamogga",
  "Tumakuru", "Udupi", "Uttara Kannada", "Vijayapura", "Yadagiri"
)

ka_dis <- c(
  215, 620, 558, 1109, 8813, 350, 780, 420,
  144, 478, 816, 242, 1051, 249, 1238, 315,
  807, 185, 1993, 515, 1997, 2886, 371, 156,
  589, 1746, 838, 964, 296, 128
)

ka_discharge <- data.frame(ka_district, ka_dis)

It is indeed a list with a specified class called data.frame!

class(ka_discharge)
#> [1] "data.frame"
mode(ka_discharge)
#> [1] "list"
sapply(ka_discharge, mode)
#> ka_district      ka_dis 
#> "character"   "numeric"

But it has some restrictions:

  • all variables must be same length vectors.
data.frame(x = 1:10, y = 1:11)
#> Error in data.frame(x = 1:10, y = 1:11): arguments imply differing number of rows: 10, 11
  • you can’t use the same name for two different variables. R will rename the duplicate.
data.frame(x = 1:3, y = 2:4, x = 3:5)
#>   x y x.1
#> 1 1 2   3
#> 2 2 3   4
#> 3 3 4   5

Due to these restrictions and the resulting two-dimensional structure, data frames can mimic some of the behaviour of matrices. You can select rows and do operations on rows. You can’t do that with lists, as a row is undefined there.

You should use a data frame for any dataset that fits in that two-dimensional structure. Essentially, you use data frames for any dataset where a column coincides with a variable and a row coincides with a single observation in the broad sense of the word. For all other structures, lists are the way to go.

Note that if you want a nested structure, you have to use lists. As elements of a list can be lists themselves, you can create very flexible structured objects.
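
A nested structure of this kind might look as follows. This is only a sketch: the list student and all its values are made up for illustration.

```r
# Hypothetical record: a list whose components have different shapes
student <- list(
  name    = "Asha",
  marks   = c(stat = 40, prob = 35),
  address = list(district = "Mysuru", state = "Karnataka")  # a list inside a list
)

# Components of the inner list are reached by chaining $
district <- student$address$district
print(district)

# str() gives a compact overview of the nesting
str(student)
```

Such a record does not fit the row-by-column shape of a data frame, which is why a list is the natural container here.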

Datasets as Data frames

R comes with many built-in datasets that one can use. Run data() to list the currently available data sets.

Datasets are often stored as data frames. Let us study one example: the airquality dataset. See its help page with ?airquality.

Some initial steps

You might want to know what the data frame looks like. You can execute airquality to print it whole, but datasets are often large. It’s better to look at the first few rows. Use the head() function to print the first six rows.

head(airquality)
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 5    NA      NA 14.3   56     5   5
#> 6    28      NA 14.9   66     5   6

## You may specify the number of lines
head(airquality, n = 10)
#>    Ozone Solar.R Wind Temp Month Day
#> 1     41     190  7.4   67     5   1
#> 2     36     118  8.0   72     5   2
#> 3     12     149 12.6   74     5   3
#> 4     18     313 11.5   62     5   4
#> 5     NA      NA 14.3   56     5   5
#> 6     28      NA 14.9   66     5   6
#> 7     23     299  8.6   65     5   7
#> 8     19      99 13.8   59     5   8
#> 9      8      19 20.1   61     5   9
#> 10    NA     194  8.6   69     5  10

Try the tail() function. The name is self-explanatory.

To get a specific data point, you can give its position with numbers, or use the variable name and the observation number. Just like a matrix!

airquality[148, 4]
#> [1] 63
airquality$Temp[148]
#> [1] 63

You can get an entire observation with its position.

airquality[148, ]
#>     Ozone Solar.R Wind Temp Month Day
#> 148    14      20 16.6   63     9  25
airquality[length(airquality$Temp), ]
#>     Ozone Solar.R Wind Temp Month Day
#> 153    20     223 11.5   68     9  30

Just like vector slicing, a data frame can be sliced with a vector.

airquality[, c(1, 4)] |> head(n = 3)
#>   Ozone Temp
#> 1    41   67
#> 2    36   72
#> 3    12   74
airquality[1:3, c(1, 4)]
#>   Ozone Temp
#> 1    41   67
#> 2    36   72
#> 3    12   74
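
Slicing also works with column names and with a logical vector for the rows. A small sketch on the same dataset, where the threshold of 90 degrees is an arbitrary choice:

```r
# Rows where the temperature exceeds 90, and three columns selected by name
hot_days <- airquality[airquality$Temp > 90, c("Month", "Day", "Temp")]

print(head(hot_days, n = 3))
print(nrow(hot_days))
```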
Five number summary

The five number summary is a set of descriptive statistics consisting of the five most important sample percentiles of a dataset (\(\nRV{X}{n}\)):

  • the sample minimum (smallest observation) - \(X_{\left(1\right)}\)
  • the lower quartile or first quartile - \(X_{\left(\left[\frac{n}{4}\right]\right)}\)
  • the median (the middle value) - \(X_{\left(\left[\frac{n}{2}\right]\right)}\)
  • the upper quartile or third quartile - \(X_{\left(\left[\frac{3n}{4}\right]\right)}\)
  • the sample maximum (largest observation) - \(X_{\left(n\right)}\)
summary(airquality$Temp)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   56.00   72.00   79.00   77.88   85.00   97.00

It gives you a rough idea of what a data set looks like. Note that summary() also reports the mean in addition to the five numbers.
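
If only the five numbers are wanted, base R also provides fivenum() and quantile(). A brief sketch follows; the two functions use slightly different rules for the quartiles, so their results can differ a little on other data.

```r
temp <- airquality$Temp

five  <- fivenum(temp)   # minimum, lower hinge, median, upper hinge, maximum
quart <- quantile(temp, probs = c(0, 0.25, 0.5, 0.75, 1))

print(five)
print(quart)
```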

Plotting

Histogram

par(mfrow = c(1, 2))

## left
hist(airquality$Temp)

## right: better insights
hist(airquality$Temp,
  breaks = airquality$Temp |> pretty(n = 24),
  xlab = "Temperature",
  main = "Temperature count \n New York, May-Sep 1973",
  col = "seagreen"
)

par() can be used for plotting multiple plots in a single frame, making it easy to create a figure arrangement with fine control.

With a single variable airquality$Temp, the plot() function plots airquality$Temp ~ index.

par(mfrow = c(1, 2))

## left
plot(airquality$Temp)

## right: scatter plot between the mentioned variables
plot(
  y = airquality$Ozone, x = airquality$Temp,
  xlab = "Temperature", ylab = "Ozone",
  main = "Ozone ~ Temperature \n New York, May-Sep 1973",
  col = "navyblue"
)

You may plot each combination of two variables by just passing the whole dataset.

plot(airquality)

Reading data frames

Dataset files come in various formats. You can read most of them in R, including files created by other software: Excel (in CSV, XLSX, or TXT format), SAS, Stata, SPSS and more.

As the name suggests, read.csv() loads data from CSV (comma separated values) formatted files and stores it in a data frame. The file = argument takes the file path, and header = takes a logical value indicating whether the 1st row contains the column (variable) names or not.

ka_bull_df <- read.csv(
  file = "../../assets/datasets/KAbulletin.csv",
  header = TRUE
)
class(ka_bull_df)
#> [1] "data.frame"

## column names
names(ka_bull_df)
#> [1] "District"           "Today.s.Positives"  "Total.Positives"   
#> [4] "Today.s.Discharges" "Total.Discharges"   "Total.Active.Cases"
#> [7] "Today.s.Deaths"     "Deaths"

For local files, specify the file path with respect to your present workspace folder. And, if you want to load the file from an online http/https source, specify the link verbatim.

For example, you may load the above KAbulletin.csv file by downloading it or using the link: https://iscd-r.github.io/assets/datasets/KAbulletin.csv
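
To try read.csv() without the bulletin file at hand, the sketch below first writes a tiny made-up data frame to a temporary CSV file and then reads it back. The data values are hypothetical; the commented-out line shows how the online file mentioned above would be read, assuming an internet connection.

```r
# Hypothetical toy data, written to a temporary CSV file
toy <- data.frame(
  District = c("Mysuru", "Udupi"),
  Total.Discharges = c(10, 20)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, file = csv_path, row.names = FALSE)

# Read it back; the 1st row holds the column names
toy_back <- read.csv(file = csv_path, header = TRUE)
print(toy_back)

# Reading from the online source works the same way (needs network access):
# ka_bull_df <- read.csv(
#   file = "https://iscd-r.github.io/assets/datasets/KAbulletin.csv",
#   header = TRUE
# )
```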

Some other reading functions include

  • read.table() (white space separated data values)
  • read.csv2() (when data values are ; separated instead of ,)
  • read.delim() (delimited text files)
  • read_excel() (Excel files; from the readxl package)
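
The base R readers in this list can be compared on the same small table. In this sketch the data are passed through the text = argument instead of a file, purely to keep the example self-contained; the values are made up.

```r
# The same 2 x 2 table in three different text layouts
ws_text   <- "x y\n1 2.5\n3 4.5"        # white space separated
semi_text <- "x;y\n1;2,5\n3;4,5"        # ; separated, with a decimal comma
tab_text  <- "x\ty\n1\t2.5\n3\t4.5"     # tab delimited

ws_df   <- read.table(text = ws_text, header = TRUE)
semi_df <- read.csv2(text = semi_text)  # expects ; and a decimal comma
tab_df  <- read.delim(text = tab_text)  # expects tabs

print(ws_df)
print(semi_df)
print(tab_df)
```

All three calls give the same data frame, which shows that the functions differ mainly in their default separator and decimal mark.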

1.2.4 Generating random data

R contains functions to handle most of the standard probability distributions. For each of them it has four functions:

  • random sample function: prefixed with r
  • density function: prefixed with d
  • cumulative distribution function: prefixed with p
  • quantile function (aka inverse c.d.f.): prefixed with q
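
For the Normal distribution, for instance, the four functions are rnorm(), dnorm(), pnorm() and qnorm(). A short sketch with the standard Normal; the seed and the point 1.96 are arbitrary choices.

```r
set.seed(1)             # arbitrary seed, only to make the sample repeatable

r_val <- rnorm(3)       # three random draws
d_val <- dnorm(0)       # density at 0, equal to 1 / sqrt(2 * pi)
p_val <- pnorm(1.96)    # P(X <= 1.96)
q_val <- qnorm(p_val)   # the quantile function undoes pnorm()

print(r_val)
print(c(d_val, p_val, q_val))
```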
Uniform random variable
Discrete

Let \(n \in \N\). \(X \sim U(\{1,2,\ldots,n\})\) is called a discrete Uniform random variable taking values in \(\{1,2,\ldots,n\}\) with probability mass function, \[\Prob{X = k} = \frac{1}{n} \ \forall \ k \in \{1,2,\ldots,n\}\]

  • Rolling a fair die 10 times
sample(1:6, size = 10, replace = TRUE)
#>  [1] 3 5 2 5 6 6 4 5 4 3
  • Tossing a biased coin 10 times with success probability 0.7
sample(c(0, 1), 10, replace = TRUE, prob = c(0.3, 0.7))
#>  [1] 1 1 1 0 1 1 1 1 1 1
Continuous (for an interval)

Let \(a,b \in \R\) with \(a < b\). \(X \sim U(a,b)\) is called a Uniform random variable taking values in \((a,b)\) with probability density function, \[f(x) = \frac{1}{b-a} \ \forall \ x \in (a,b)\]

  • 10 \(U(-1,1)\) sample
runif(10, -1, 1)
#>  [1]  0.23998107 -0.93655031 -0.07717456 -0.33995817  0.26704154 -0.15378254
#>  [7] -0.18958006  0.83741541  0.61752058  0.08018222
Binomial random variable

Let \(n \in \N\) and \(p \in (0,1)\). \(X \sim \Bin{n,p}\) is called a Binomial random variable taking values in \(\{0,1,\ldots,n\}\) with probability mass function, \[\Prob{X = k} = \binom{n}{k}p^k(1-p)^{n-k} \ \forall \ k \in \{0,1,\ldots,n\}\]

  • 20 \(\Bin{6,\frac{1}{2}}\) sample
rbinom(20, 6, 0.5)
#>  [1] 1 3 1 4 1 3 2 5 3 3 5 2 2 4 4 3 3 1 6 1
  • 10 \(\Bin{30, 0.3}\) sample
rbinom(n = 10, size = 30, prob = 0.3)
#>  [1]  9  6  8  7  6  9  9 11 10  8
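
The Binomial probability mass function can be checked numerically with dbinom(). A sketch for 6 trials with success probability 1/2:

```r
n <- 6
p <- 0.5
k <- 0:n   # all values the random variable can take, including 0

pmf_r      <- dbinom(k, size = n, prob = p)       # R's built-in p.m.f.
pmf_manual <- choose(n, k) * p^k * (1 - p)^(n - k)  # formula written out

print(round(pmf_r, 4))
print(sum(pmf_r))   # the probabilities add up to 1
```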

See Exercise 1.8

Normal random variable

Let \(\mu \in \R, \sigma > 0\). \(X \sim N(\mu,\sigma^2)\) is called a Normal random variable taking values in \(\R\) with probability density function, \[f(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \ \forall \ x \in \R\]

  • 1 \(N(10, 25)\) sample
rnorm(1, mean = 10, sd = 5)
#> [1] 9.414016
  • 10 \(N(3, 16)\) sample
rnorm(10, mean = 3, sd = 4)
#>  [1] -2.476862  2.293486  0.842391 -0.967609  9.071873  2.637864  4.537119
#>  [8]  2.501265  2.231686  3.924981
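
Note that rnorm() takes the standard deviation, not the variance, so \(N(10, 25)\) corresponds to sd = 5. A quick sketch with a large sample illustrates this; the seed and the sample size are arbitrary choices.

```r
set.seed(42)   # arbitrary seed

z <- rnorm(10000, mean = 10, sd = 5)

# The sample mean should be near 10 and the sample variance near 5^2 = 25
z_mean <- mean(z)
z_var  <- var(z)
print(c(mean = z_mean, variance = z_var))
```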

See Exercise 1.9

Exponential random variable

Let \(\lambda > 0\). \(X \sim \Exp{\lambda}\) is said to be distributed Exponentially with parameter \(\lambda\), taking values in \((0, \infty)\) with probability density function, \[f(x) = \begin{cases} \lambda e^{-\lambda x} &\text{ if } x > 0 \\ 0 &\text{otherwise} \end{cases} = \lambda e^{-\lambda x}\mathbb{1}\left[ x > 0 \right]\] where \(\mathbb{1}\) is the indicator function.

  • 10 \(\Exp{\frac{1}{2500}}\) sample
rexp(10, 1 / 2500)
#>  [1] 3358.55756 2429.46648   45.61405 2454.83229  545.83246 1546.48566
#>  [7] 1052.93050  518.34757 1375.16501  611.85440
  • 10 \(\Exp{3}\) sample
rexp(10, 3)
#>  [1] 0.38243272 1.91277774 0.28767228 0.46806652 0.06176535 0.13334676
#>  [7] 0.40827191 0.85497960 0.66295343 0.27423785
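
The second argument of rexp() is the rate \(\lambda\), so the mean of the distribution is \(1/\lambda\). A sketch checking this for \(\lambda = 1/2500\), with an arbitrary seed and sample size:

```r
set.seed(7)   # arbitrary seed

rate <- 1 / 2500
w <- rexp(100000, rate = rate)

w_mean <- mean(w)                 # should be close to 1 / rate = 2500
p_below_mean <- pexp(1 / rate, rate = rate)  # P(X <= mean) = 1 - exp(-1)

print(w_mean)
print(p_below_mean)
```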

See Exercise 1.10

Poisson random variable

Let \(\lambda > 0\). \(X \sim \Poi{\lambda}\) is called a Poisson random variable taking values in the non-negative integers with probability mass function, \[\Prob{X = k} = \frac{\lambda^k e^{-\lambda}}{k!} \ \forall \ k \in \N \cup \{0\}\]

  • 10 \(\Poi{\frac{2}{5}}\) sample
rpois(10, 0.4)
#>  [1] 1 0 1 1 0 0 0 0 0 0
  • 7 \(\Poi{8}\) sample
rpois(7, 8)
#> [1]  5  4 11 13  5 11  6
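
The Poisson probability mass function is available as dpois(). The sketch below compares it with the formula above for \(\lambda = 3\); the range 0 to 10 is an arbitrary choice.

```r
lambda <- 3
k <- 0:10

pmf_r      <- dpois(k, lambda = lambda)                 # built-in p.m.f.
pmf_manual <- lambda^k * exp(-lambda) / factorial(k)    # formula written out

print(round(pmf_r, 4))

# Cumulative probability P(X <= 10) via the p-prefixed function
p_upto_10 <- ppois(10, lambda = lambda)
print(p_upto_10)
```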

See Exercise 1.11

1.2.5 Working with the dplyr package

library("dplyr")

The Master.csv file contains Deceased data from Karnataka COVID-19 Bulletin

deceased_df <- read.csv(
  file = "../../assets/datasets/Master.csv",
  header = TRUE
)
head(deceased_df)
#>   Sno        District State.P.No Age.In.Years    Sex
#> 1   1      Kalaburagi          6           76   Male
#> 2   2 Chikkaballapura         53           70 Female
#> 3   3        Tumakuru         60           60   Male
#> 4   4      Bagalakote        125           75   Male
#> 5   5      Kalaburagi        177           65   Male
#> 6   6           Gadag        166           80 Female
#>                      Description Symptoms Co.Morbidities  DOA        DOD
#> 1 Travel history to Saudi Arabia     <NA>  HTN & Asthama <NA>       <NA>
#> 2        Travel history to Mecca     <NA>           <NA> <NA>       <NA>
#> 3        Travel history to Delhi     <NA>           <NA> <NA>       <NA>
#> 4                           <NA>     <NA>           <NA> <NA> 2020-04-03
#> 5                           SARI     <NA>           <NA> <NA>       <NA>
#> 6                           SARI     <NA>           <NA> <NA>       <NA>
#>      MB.Date Notes
#> 1 2020-03-13  <NA>
#> 2 2020-03-26  <NA>
#> 3 2020-03-27  <NA>
#> 4 2020-04-04  <NA>
#> 5 2020-04-08  <NA>
#> 6 2020-04-09  <NA>
names(deceased_df) <- c(
  "Sno", "District", "Pid", "Age", "Sex",
  "Description", "Symptoms", "CMB", "DOA",
  "DOD", "MB.Date", "Notes"
)

Some important dplyr functions:

filter():

  • Extract rows that meet logical criteria
  • filters data according to the given condition

Filters the data to the rows with age greater than 100

filter(deceased_df, Age > 100)
#>     Sno        District     Pid Age    Sex Description
#> 1  3277 Bengaluru Urban  180841 102   Male         ILI
#> 2 17972 Bengaluru Rural 1361618 102   Male        SARI
#> 3 24686 Bengaluru Urban 1341967 102   Male        SARI
#> 4 27273 Bengaluru Urban 2360283 102 Female        SARI
#> 5 33704          Mysuru 2807010 110   Male        SARI
#> 6 34793          Haveri 2843699 101   Male       SARI 
#> 7 35077           Kolar 2816836 103   Male         ILI
#> 8 37190          Kodagu 2947715 101 Female        SARI
#> 9 37373  Uttara Kannada 2996149 102   Male         ILI
#>                        Symptoms      CMB        DOA        DOD    MB.Date
#> 1                  Fever, Cough CKD, IHD 2020-08-08 2020-08-08 2020-08-10
#> 2                Breathlessness  DM, HTN 2021-04-24 2021-04-25 2021-05-08
#> 3                Breathlessness       DM 2021-04-24 2021-05-04 2021-05-23
#> 4                Breathlessness        - 2021-05-11 2021-05-25 2021-05-27
#> 5  Fever, Cough, Breathlessness        - 2021-06-12 2021-06-17 2021-06-19
#> 6 Fever, Cough, Breathlessness  DM, HTN  2021-06-18 2021-06-28 2021-06-28
#> 7                  Fever, Cough        - 2021-06-14 2021-06-30 2021-07-01
#> 8  Fever, Cough, Breathlessness        - 2021-08-01 2021-08-26 2021-08-27
#> 9                  Fever, cough      HTN       <NA> 2021-09-02 2021-09-06
#>                   Notes
#> 1                  <NA>
#> 2                  <NA>
#> 3                  <NA>
#> 4                  <NA>
#> 5                  <NA>
#> 6                  <NA>
#> 7                  <NA>
#> 8                  <NA>
#> 9 Died at his residence

Retains only the rows satisfying the given conditions

filter(deceased_df, Age > 100 & Sex == "Female")
#>     Sno        District     Pid Age    Sex Description
#> 1 27273 Bengaluru Urban 2360283 102 Female        SARI
#> 2 37190          Kodagu 2947715 101 Female        SARI
#>                       Symptoms CMB        DOA        DOD    MB.Date Notes
#> 1               Breathlessness   - 2021-05-11 2021-05-25 2021-05-27  <NA>
#> 2 Fever, Cough, Breathlessness   - 2021-08-01 2021-08-26 2021-08-27  <NA>
head(deceased_df$DOA)
#> [1] NA NA NA NA NA NA
head(deceased_df$MB.Date)
#> [1] "2020-03-13" "2020-03-26" "2020-03-27" "2020-04-04" "2020-04-08"
#> [6] "2020-04-09"

Drop the rows where DOD is NA

deceased_df <- filter(deceased_df, !is.na(DOD))

Base R’s subset() can do the same: subset(deceased_df, !is.na(DOD)).
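
The same filtering pattern can be tried without the bulletin file, on the built-in airquality data. This is a sketch: the column Ozone and the threshold 90 are chosen only for illustration.

```r
library("dplyr")

# Keep the rows where Ozone was actually measured
ozone_known <- filter(airquality, !is.na(Ozone))

# Several conditions separated by commas are combined with "and"
hot_known <- filter(airquality, !is.na(Ozone), Temp > 90)

print(nrow(airquality))
print(nrow(ozone_known))
print(nrow(hot_known))
```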

mutate():

  • Adds new variables without affecting the original ones
deceased_df <- mutate(
  deceased_df,
  # Columns can be referred to by name inside mutate()
  reporting.time = as.Date(MB.Date) - as.Date(DOD)
  # Here you have added a new variable "reporting.time" to the data frame
  # Original variables are not affected
)

Similarly, add a new variable Month

deceased_df <- mutate(deceased_df,
  Month = months(as.Date(MB.Date))
)
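
The date arithmetic inside mutate() can be checked on a tiny made-up data frame. In this sketch the two rows reuse dates that appear in the bulletin output, and toy is a hypothetical name.

```r
library("dplyr")

# Hypothetical two-row data frame with dates stored as character strings
toy <- data.frame(
  DOD     = c("2020-04-28", "2020-05-02"),
  MB.Date = c("2020-05-02", "2020-05-02")
)

toy <- mutate(
  toy,
  # Subtracting two Date objects gives a time difference in days
  reporting.time = as.Date(MB.Date) - as.Date(DOD),
  # months() returns the month name in the language of the current locale
  Month = months(as.Date(MB.Date))
)

print(toy)
```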

distinct():

  • Removes rows with duplicate values

Selects the distinct values of the Age variable

DT <- distinct(deceased_df, Age)

Other variables can be kept with .keep_all = TRUE argument

DT <- distinct(deceased_df, Age, .keep_all = TRUE)

slice():

  • Select rows by position
SL <- slice(deceased_df, 10:12)
head(SL, 2)
#>   Sno        District Pid Age  Sex Description       Symptoms
#> 1  24           Bidar 590  82 Male        SARI           <NA>
#> 2  25 Bengaluru Urban 557  63 Male        <NA> Breathlessness
#>                                        CMB        DOA        DOD    MB.Date
#> 1                                     <NA> 2020-04-27 2020-04-28 2020-05-02
#> 2 Diabetes & Hypertension & Hypothyroidism 2020-04-30 2020-05-02 2020-05-02
#>   Notes reporting.time Month
#> 1  <NA>         4 days   May
#> 2  <NA>         0 days   May

group_by():

  • Creates a “grouped” copy of a table, grouped by the columns given in the … argument. dplyr functions then manipulate each “group” separately and combine the results.
GS <- group_by(deceased_df, Sex)

This groups the data by the specified variable.

head(GS)
#> # A tibble: 6 × 14
#> # Groups:   Sex [2]
#>     Sno District          Pid   Age Sex   Description Symptoms CMB   DOA   DOD  
#>   <int> <chr>           <int> <dbl> <chr> <chr>       <chr>    <chr> <chr> <chr>
#> 1     4 Bagalakote        125    75 Male  <NA>        <NA>     <NA>  <NA>  2020…
#> 2    11 Chikkaballapura   250    65 Male  <NA>        H1N1 po… DM &… 2020… 2020…
#> 3    13 Bengaluru Urban   195    66 Male  <NA>        <NA>     <NA>  2020… 2020…
#> 4    14 Vijayapura        374    42 Male  <NA>        <NA>     <NA>  2020… 2020…
#> 5    19 Bengaluru Urban   465    45 Fema… SARI        Pneumon… Daib… 2020… 2020…
#> 6    20 Kalaburagi        422    57 Male  SARI        <NA>     CLD   2020… 2020…
#> # ℹ 4 more variables: MB.Date <chr>, Notes <chr>, reporting.time <drtn>,
#> #   Month <chr>

The display does not rearrange the rows by group, but the header specifies the grouping variable.

summarise():

  • Compute table of summaries
  • Summarises multiple values into a single value

Gives the mean of age for each gender.

summarise(GS, mean(Age, na.rm = TRUE))
#> # A tibble: 10 × 2
#>   Sex    `mean(Age, na.rm = TRUE)`
#>   <chr>                      <dbl>
#> 1 Female                      60.7
#> 2 F                           65.2
#> 3 Male                        60.7
#> 4 M                           66.1
#> 5 M E23                       71  
#> 6 N                           39  
#> # ℹ 4 more rows
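
The summary columns can be given readable names, and n() adds the group sizes. A sketch on the built-in airquality data, grouped by month; the names mean_temp and n_days are chosen for this example.

```r
library("dplyr")

by_month <- group_by(airquality, Month)

temp_summary <- summarise(
  by_month,
  mean_temp = mean(Temp),  # named summary column
  n_days    = n()          # number of rows in each group
)

print(temp_summary)
```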

sample_n():

  • Selects the specified number of rows at random

Selects 2 random rows from dataframe deceased_df.

sample_n(deceased_df, size = 2)
#>     Sno        District     Pid Age  Sex Description              Symptoms
#> 1 26593 Bengaluru Urban 1839041  75 Male        SARI Fever, Breathlessness
#> 2  9105 Bengaluru Urban  600977  65 Male        SARI        Breathlessness
#>        CMB        DOA        DOD    MB.Date                 Notes
#> 1 HTN, IHD       <NA> 2021-05-09 2021-05-26 Died at his residence
#> 2      IHD 2020-09-29 2020-09-29 2020-10-02                  <NA>
#>   reporting.time   Month
#> 1        17 days     May
#> 2         3 days October

Selects a 0.0001 fraction of the rows at random.

sample_frac(deceased_df, size = 0.0001)
#>     Sno        District     Pid Age    Sex Description
#> 1 31188 Bengaluru Urban 2629021  85 Female         ILI
#> 2 19862 Bengaluru Urban 1192493  79 Female        SARI
#> 3  1874 Bengaluru Urban   50930  80   Male         ILI
#> 4 35683 Bengaluru Urban 2867489  66 Female        SARI
#>                       Symptoms     CMB        DOA        DOD    MB.Date Notes
#> 1                 Fever, Cough DM, HTN 2021-05-29 2021-06-04 2021-06-05  <NA>
#> 2 Fever, Cough, Breathlessness DM, HTN 2021-04-18 2021-05-08 2021-05-12  <NA>
#> 3               Breathlessness     HTN 2020-07-18 2020-07-18 2020-07-27  <NA>
#> 4        Fever, Breathlessness     HTN 2021-06-24 2021-07-08 2021-07-09  <NA>
#>   reporting.time Month
#> 1         1 days  June
#> 2         4 days   May
#> 3         9 days  July
#> 4         1 days  July

count():

  • To count the unique values of one or more variables

Gives a frequency table for months

count(deceased_df, Month)
#>        Month     n
#> 1      April  2963
#> 2     August  4107
#> 3   December   429
#> 4   February   481
#> 5    January   837
#> 6       July  3561
#> 7       June  6046
#> 8      March   236
#> 9        May 13592
#> 10  November   736
#> 11   October  2593
#> 12 September  3643
#> 13      <NA>    17

arrange():

  • Order rows by values of a column or columns (low to high)
  • use with desc() to order from high to low
orderdf <- arrange(deceased_df, Age)

Creates a new data frame orderdf with rows arranged by Age.

head(orderdf, 2)
#>     Sno        District     Pid    Age    Sex Description       Symptoms CMB
#> 1 14253 Bengaluru Urban 1260623 0.0000   Male        SARI Breathlessness HTN
#> 2 20970      Ramanagara 2032210 0.0082 Female        SARI Breathlessness   -
#>          DOA        DOD    MB.Date Notes reporting.time Month
#> 1 2021-04-21 2021-04-23 2021-04-24  <NA>         1 days April
#> 2 2021-05-07 2021-05-10 2021-05-14  <NA>         4 days   May

Arranges the data in alphabetical order of the variable - Description

orderdf2 <- arrange(deceased_df, Description)
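
Ordering from high to low with desc() can be sketched on the built-in airquality data:

```r
library("dplyr")

# Hottest days first; ties are then ordered by increasing Wind
hottest_first <- arrange(airquality, desc(Temp), Wind)

print(head(hottest_first, 3))
```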

1.2.5.1 The pipe operator - %>%

  • Used to chain function calls
  • x %>% f(y) becomes f(x, y)
filtered_data <- filter(deceased_df, Month != "September")
grouped_data <- group_by(filtered_data, Month)
summarise(grouped_data, mean(Age, na.rm = TRUE))
#> # A tibble: 11 × 2
#>   Month    `mean(Age, na.rm = TRUE)`
#>   <chr>                        <dbl>
#> 1 April                         61.2
#> 2 August                        61.3
#> 3 December                      64.9
#> 4 February                      65.3
#> 5 January                       63.6
#> 6 July                          60.0
#> # ℹ 5 more rows

The same code written more concisely with the pipe %>%

deceased_df %>%
  filter(Month != "September") %>%
  group_by(Month) %>%
  summarise(mean(Age, na.rm = TRUE))
#> # A tibble: 11 × 2
#>   Month    `mean(Age, na.rm = TRUE)`
#>   <chr>                        <dbl>
#> 1 April                         61.2
#> 2 August                        61.3
#> 3 December                      64.9
#> 4 February                      65.3
#> 5 January                       63.6
#> 6 July                          60.0
#> # ℹ 5 more rows
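
That the nested and the piped forms compute the same thing can be verified on the built-in airquality data. This sketch also includes R's native pipe |> (available from R 4.1), which is used elsewhere in this chapter.

```r
library("dplyr")

# Nested calls, read from the inside out
nested <- summarise(
  group_by(filter(airquality, Month != 9), Month),
  mean_temp = mean(Temp)
)

# The same steps with the pipe from dplyr, read from top to bottom
piped <- airquality %>%
  filter(Month != 9) %>%
  group_by(Month) %>%
  summarise(mean_temp = mean(Temp))

# And with the native pipe
native <- airquality |>
  filter(Month != 9) |>
  group_by(Month) |>
  summarise(mean_temp = mean(Temp))

print(piped)
```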

Exercises

Exercise 1.4 (Dice Experiment) Rolling a die 1500 times.

x <- c(1, 2, 3, 4, 5, 6)
prob_x <- c(1 / 8, 1 / 8, 1 / 8, 1 / 8, 3 / 8, 1 / 8)
f_1500 <- sample(x, size = 1500, replace = TRUE, prob = prob_x)
  1. Describe what each R statement is performing in the above.
  2. Using the mean() and var() functions, find the mean and variance of f_1500. From this information alone, what would you conclude is the range of the random variable f_1500?
  3. Do the mean and variance of the generated sample compare closely with the true mean and variance of f_1500?

Exercise 1.5 (Sums of Rolls) Suppose you wish to simulate in R the experiment of rolling a die 5 times and noting down the sum. You can use the sample(), matrix() and apply() functions.

x <- c(1, 2, 3, 4, 5, 6)
prob_x <- c(1 / 6, 1 / 6, 1 / 6, 1 / 6, 1 / 6, 1 / 6)
rolls <- sample(x, size = 1500, replace = TRUE, prob = prob_x)
rolls_mat <- matrix(rolls, nrow = 5)
roll_sums <- apply(rolls_mat, 2, sum)
  1. Describe the functions matrix() and apply()
  2. Run the following R-code and observe the picture. What does \(\displaystyle\int_{12}^{21}\)norm_density(\(x,\mu,\sigma\))\(\dd{x}\) approximate?
library("ggplot2")
norm_density <- function(x, a, s) {
  (1 / ((2 * pi)^(0.5) * s)) * exp(-(x - a)^2 / (2 * s^2))
}

df_rolls <- data.frame(roll_sums)
mu <- mean(df_rolls$roll_sums)
sigma <- sd(df_rolls$roll_sums)

ggplot(data = df_rolls) +
  geom_histogram(
    mapping = aes(x = roll_sums, y = ..density..),
    color = "#00846b",
    fill = NA,
    binwidth = 1
  ) +
  xlim(5, 30) +
  geom_function(
    fun = norm_density,
    args = list(a = mu, s = sigma),
    color = "black"
    )
  3. If \[\begin{align} \text{Area under the histogram between 12 and 21 } \approx \int_{a}^{b} \frac{1}{\sqrt{2\pi}}\exp\left( -\frac{x^2}{2} \right)\dd{x} \end{align}\] then what would be your guess for \(a\) and \(b\)?

Exercise 1.6 Use the dplyr package to do the following computations.

  1. Create a new data frame iris_1 that contains only the species virginica and versicolor with sepal lengths longer than 6cm and sepal widths longer than 2.5cm. How many observations and variables are in the dataset?
  2. Now, create an iris_2 data frame from iris_1 that contains only the columns for Species, Sepal.Length, and Sepal.Width. How many observations and variables are in the dataset?
  3. Create an iris_3 data frame from iris_2 that orders the observations from largest to smallest sepal length. Show the first 6 rows of this dataset.
  4. Create an iris_4 data frame from iris_3 that creates a column with a Sepal.Area (length \(\X\) width) value for each observation. How many observations and variables are in the dataset?
  5. Create iris_5 that calculates the average sepal length, the average sepal width, and the sample size of the entire iris_4 data frame and print iris_5.
  6. Finally, create iris_6 that calculates the average sepal length, the average sepal width, and the sample size for each species in the iris_4 data frame and print iris_6.
  7. In these exercises, you have successively created different versions of the data frame: iris_1, iris_2, iris_3, iris_4, iris_5, iris_6. At each stage, the output data frame from one operation serves as the input for the next. An easier way to do this is to use the pipe operator %>%, which the dplyr package makes available. Rework all of your previous statements into an extended piping operation that uses iris as the input and generates iris_6 as the output.

Exercise 1.7 (Coin Toss Experiment) Do the following:

  1. Tossing a coin 10 times.
bin_1 <- rbinom(1000, 10, 0.5)
bin_2 <- rbinom(1000, 10, 0.25)
bin_3 <- rbinom(1000, 10, 0.75)
     (a) Using ?rbinom, explain what each of the above function calls is performing in R.
     (b) Using the mean() and var() functions, find the mean and variance of bin_1, bin_2, bin_3. Compare them with the true mean and variance of the respective Binomial distribution.
  2. The geom_histogram() function.
library(ggplot2)
df_bin_1 <- data.frame(bin_1)
plt_1_1 <- ggplot(df_bin_1) +
  geom_histogram(
    mapping = aes(x = bin_1),
    color = "#00846b",
    fill = "NA",
    binwidth = 1
  )
plt_2_1 <- ggplot(df_bin_1) +
  geom_histogram(
    mapping = aes(x = bin_1, y = ..density..),
    color = "#00846b",
    fill = "NA",
    binwidth = 1
  )
     (a) Explain what the plots plt_1_1 and plt_2_1 provide.
     (b) Rewrite the code to provide the plots for bin_2 and bin_3.
     (c) What can you say about the three plots?
  3. (Density Approximation) The code below plots the norm_density() function on the interval \([0, 10]\) with a \(= 5\), s \(= \sqrt{2.5}\) along with the plot plt_2_1.
library("ggplot2")

norm_density <- function(x, a, s) {
  (1 / ((2 * pi)^(0.5) * s)) * exp(-(x - a)^2 / (2 * s^2))
}
df_bin_1 <- data.frame(bin_1)

ggplot(df_bin_1) +
  geom_histogram(
    mapping = aes(x = bin_1, y = ..density..),
    color = "#00846b", fill = "NA", binwidth = 1
  ) +
  xlim(0, 10) +
  geom_function(fun = norm_density, args = list(a = 5, s = (2.5)^(0.5)))
     (a) From the picture what does \(\displaystyle\int_{3}^{6}\)norm_density(\(x,\mu,\sigma\))\(\dd{x}\) approximate?
     (b) If \[\begin{align} \text{Area under the histogram between 3 and 7 } \approx \int_{c}^{d} \frac{1}{\sqrt{2\pi}}\exp\left( -\frac{x^2}{2} \right)\dd{x} \end{align}\] then what would be your guess for \(c\) and \(d\)?
     (c) How would you try the same idea for bin_2 and bin_3? Would you get the same result?

Exercise 1.8 (Binomial Distribution Plot) Generate 1000 samples of \(\Bin{10,0.5}\). Using ggplot(), write R-code to plot a histogram of relative frequency along with a lineplot of the true Binomial probabilities. (as shown in the figure)

Exercise 1.9 (Normal Distribution Plot) Generate 1000 samples of \(N(0,1)\). Using ggplot(), write R-code to plot a histogram of relative frequency along with a lineplot of the true Normal density. (as shown in the figure)

Exercise 1.10 (Exponential Distribution Plot) Generate 1000 samples of \(\Exp{1}\). Using ggplot(), write R-code to plot a histogram of relative frequency along with a lineplot of the true Exponential density. (as shown in the figure)

Exercise 1.11 (Poisson Distribution Plot) Generate 1000 samples of \(\Poi{3}\). Using ggplot(), write R-code to plot a histogram of relative frequency along with a lineplot of the true Poisson probabilities. (as shown in the figure)

References