3.7 Method of least squares (Linear regression)

Exercises

Exercise 3.16 (From R for Data Science, Section 23) Consider the simulated data set sim1 from the modelr library. Do the following without using tibble.

  1. Using ggplot(), provide a scatter plot of sim1$y versus sim1$x.

  2. Assume that \(m \sim U(-5, 5)\) and \(c \sim U(-20, 40)\). Generate \(100\) lines with slopes \(m\) and intercepts \(c\). Plot all the lines layered on top of the scatter plot from part 1.

  3. Using the function below

    rss <- function(a, data) {
      # residuals of the line y = a[1] + a[2] * x  (a[1] = intercept, a[2] = slope)
      d <- data$y - (a[1] + data$x * a[2])
      # residual sum of squares
      sum(d^2)
    }

    compute the residual sum of squares for each of the lines.

  4. Using ggplot(), the inbuilt function rank(), and filter(), plot the 10 best lines (i.e., the 10 with lowest RSS) along with the data points. Colour the BRL := best random line in viridis plasma red (see the first sketch after this list).

  5. Understand the optim() function and the command

    ls_fit <- optim(c(0, 0), rss, data = sim1)

    Describe the output of the code, decide what ls_fit$par provides, and call this line BOL := best optim line.

  6. Use the inbuilt lm() function to compute the slope and intercept of the least squares line, and call this line LSL := least squares line (see the second sketch after this list).

  7. For LSL, BOL, and BRL, compute the residuals using the function given below

    residual <- function(a, data) {
      # residuals of the line y = a[1] + a[2] * x
      d <- data$y - (a[1] + data$x * a[2])
      d
    }

    and, for each of the three lines, plot the residuals as a histogram and as a scatter plot (see the third sketch after this list).
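
The following is a minimal sketch of parts 1–4, assuming the rss() function above and the sim1 data from modelr; the seed, the grey shade for the runner-up lines, and the exact plasma shade used for the BRL are arbitrary choices, not prescribed by the exercise.

    library(modelr)    # provides the sim1 data set
    library(ggplot2)
    library(dplyr)     # for filter()

    # Part 1: scatter plot of y versus x
    ggplot(sim1, aes(x, y)) + geom_point()

    # Part 2: 100 random lines, intercepts c ~ U(-20, 40), slopes m ~ U(-5, 5)
    set.seed(1)        # arbitrary seed, for reproducibility only
    models <- data.frame(
      c = runif(100, -20, 40),
      m = runif(100, -5, 5)
    )
    ggplot(sim1, aes(x, y)) +
      geom_abline(aes(intercept = c, slope = m), data = models, alpha = 1/4) +
      geom_point()

    # Part 3: RSS of each random line
    models$rss <- sapply(seq_len(nrow(models)),
                         function(i) rss(c(models$c[i], models$m[i]), sim1))

    # Part 4: keep the 10 lines with lowest RSS; single out the best one (BRL)
    best10 <- filter(models, rank(rss) <= 10)
    brl <- best10[which.min(best10$rss), ]
    ggplot(sim1, aes(x, y)) +
      geom_point() +
      geom_abline(aes(intercept = c, slope = m), data = best10, colour = "grey60") +
      geom_abline(intercept = brl$c, slope = brl$m,
                  colour = viridisLite::plasma(1, begin = 0.7))  # a reddish plasma shade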
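
A sketch of parts 5 and 6: optim() numerically minimises rss over the pair a = (intercept, slope), and ls_fit$par holds the fitted pair, which can be compared against the exact lm() coefficients.

    # Part 5: minimise rss starting from (0, 0); extra arguments are passed to rss
    ls_fit <- optim(c(0, 0), rss, data = sim1)
    ls_fit$par            # BOL: the fitted intercept and slope

    # Part 6: exact least squares fit
    lsl <- lm(y ~ x, data = sim1)
    coef(lsl)             # LSL: (Intercept) and slope on x; close to ls_fit$par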
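
For part 7, one possible set of residual plots, reusing objects from the sketches above (lsl, ls_fit, and brl are names introduced there, not in the exercise):

    res_lsl <- residual(coef(lsl), sim1)
    res_bol <- residual(ls_fit$par, sim1)
    res_brl <- residual(c(brl$c, brl$m), sim1)

    # e.g. for the LSL residuals; repeat for res_bol and res_brl
    ggplot(data.frame(r = res_lsl), aes(r)) +
      geom_histogram(bins = 20)                    # histogram of residuals
    ggplot(data.frame(x = sim1$x, r = res_lsl), aes(x, r)) +
      geom_point()                                 # residuals against x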

Exercise 3.17 (Boscovich's best fit line) Write R code that uses the optim() or optimize() function to solve Boscovich's formulation of finding the best line. That is, for data points \(\{(x_i, y_i)\}_{i=1}^{n}\), find \(m, c\) that minimize \[\begin{align} \tag{3.17} \sum_{i=1}^{n} \left| y_i - mx_i - c \right| \end{align}\]
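
A minimal sketch, again assuming sim1 from modelr as test data; sad() is a hypothetical helper implementing the objective (3.17), and Nelder–Mead (the optim() default) copes acceptably with this non-differentiable objective.

    library(modelr)   # sim1, used here only as example data

    # Boscovich's criterion (3.17): sum of absolute deviations
    sad <- function(a, data) {
      sum(abs(data$y - (a[1] + data$x * a[2])))
    }

    l1_fit <- optim(c(0, 0), sad, data = sim1)
    l1_fit$par        # (intercept, slope) of the least-absolute-deviations line

Unlike the least squares line, this fit is far less sensitive to outliers, since large residuals are penalised linearly rather than quadratically.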