R COLAB

Question 1

The data contained in lake.csv will now be used to create an OLS model with two predictors. Recall that the outcome of interest in the research was TN.

Remove outliers from the data. List the outliers (i.e., data rows) that have been removed for the predictors and use the outlier-free data for the remaining questions.
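One possible approach, sketched under assumptions: the snippet below flags outliers in the two predictors with the boxplot rule (values more than 1.5 x IQR beyond the quartiles). The data frame is synthetic and only stands in for lake.csv, the helper name is_outlier is made up for illustration, and the sheet does not say which outlier criterion to use, so the rule itself is an assumption.

```r
# Synthetic stand-in for lake.csv; only the column names come from the questions.
set.seed(1)
lake <- data.frame(
  NIN = c(runif(29, 1, 10), 60),   # row 30 is a planted NIN outlier
  TW  = c(15, runif(29, 0.1, 2))   # row 1 is a planted TW outlier
)
lake$TN <- 0.5 + 0.3 * lake$NIN + rnorm(30, sd = 0.3)

# TRUE for values outside the boxplot fences (1.5 * IQR beyond the quartiles).
is_outlier <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

flagged    <- is_outlier(lake$NIN) | is_outlier(lake$TW)
removed    <- lake[flagged, ]    # rows to list in the answer
lake_clean <- lake[!flagged, ]   # data for the remaining questions

print(removed)
```

Keeping the removed rows in their own data frame makes it easy to list them, and lake_clean can then be passed to every later model.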


Question 2

Fit two linear regression models to predict TN with the new outlier-free data set.

  • One model should use NIN as a predictor.
  • The other model should use both NIN and TW as predictors.

Report the formula (i.e., with calculated b values) for both models you create.

Please ensure clarity when referencing predictors within the formulas (i.e., don't just write x; provide clear labels).
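As a rough illustration of reporting a labelled formula, the sketch below fits both models and pastes the estimated coefficients next to their predictor names. The data are synthetic (not lake.csv), and as_equation is a hypothetical helper, not part of base R.

```r
# Synthetic, outlier-free stand-in data; column names follow the questions.
set.seed(2)
lake_clean <- data.frame(NIN = runif(30, 1, 10), TW = runif(30, 0.1, 2))
lake_clean$TN <- 0.5 + 0.3 * lake_clean$NIN - 0.4 * lake_clean$TW +
  rnorm(30, sd = 0.3)

m1 <- lm(TN ~ NIN, data = lake_clean)        # single predictor
m2 <- lm(TN ~ NIN + TW, data = lake_clean)   # two predictors

# Hypothetical helper: build "TN = b0 +b1*NIN ..." from the fitted coefficients.
as_equation <- function(model) {
  b <- coef(model)
  slopes <- sprintf("%+.3f*%s", b[-1], names(b)[-1])
  outcome <- all.vars(formula(model))[1]
  paste0(outcome, " = ", sprintf("%.3f", b[1]), " ",
         paste(slopes, collapse = " "))
}

cat(as_equation(m1), "\n")
cat(as_equation(m2), "\n")
```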


Question 3

As we learned in class, the summary() function can be used to extract various information about a linear model, such as whether the coefficients are significantly different than 0.

Find a way to extract only the Std. Error statistic that the summary() function displays for the TW predictor.
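One way this could look, assuming a two-predictor model fitted on synthetic stand-in data: summary() returns a list whose coefficients element is a matrix, so a single cell can be picked out by row and column name.

```r
# Synthetic stand-in data; not the real lake.csv.
set.seed(3)
lake_clean <- data.frame(NIN = runif(30, 1, 10), TW = runif(30, 0.1, 2))
lake_clean$TN <- 0.5 + 0.3 * lake_clean$NIN - 0.4 * lake_clean$TW +
  rnorm(30, sd = 0.3)

m2 <- lm(TN ~ NIN + TW, data = lake_clean)

# The coefficient table is a matrix with named rows and columns.
coef_table <- summary(m2)$coefficients
se_TW <- coef_table["TW", "Std. Error"]
print(se_TW)

# The same number can be derived from the coefficient covariance matrix.
se_TW_alt <- sqrt(diag(vcov(m2)))["TW"]
```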


Question 4

Does the model with only a single predictor have a slope significantly different than 0?

  • Write R code to extract the p-value for the slope.
  • Report the 95% confidence interval around the slope.
  • Use of the function confint() is prohibited.
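A minimal sketch of how the p-value and interval might be obtained without confint(), again on synthetic stand-in data. The interval is built from the usual formula: estimate plus or minus the critical t value (on the residual degrees of freedom) times the standard error.

```r
# Synthetic stand-in data; not the real lake.csv.
set.seed(4)
lake_clean <- data.frame(NIN = runif(30, 1, 10))
lake_clean$TN <- 0.5 + 0.3 * lake_clean$NIN + rnorm(30, sd = 0.3)

m1 <- lm(TN ~ NIN, data = lake_clean)
coef_table <- summary(m1)$coefficients

# p-value for the slope, read from the coefficient table.
p_slope <- coef_table["NIN", "Pr(>|t|)"]

# 95% confidence interval built by hand.
b      <- coef_table["NIN", "Estimate"]
se     <- coef_table["NIN", "Std. Error"]
t_crit <- qt(0.975, df = df.residual(m1))
ci_slope <- c(lower = b - t_crit * se, upper = b + t_crit * se)

print(p_slope)
print(ci_slope)
```

The same row-by-row lookup works for a model with more than one predictor.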

Question 5

Does the model with two predictors contain slopes significantly different than 0?

  • Write R code to extract the p-value for each predictor's slope.
  • Report the 95% confidence interval around each predictor's slope.
  • Use of the function confint() is prohibited.

Question 6

Conduct a test to determine whether the additional predictor in the second model significantly improves the fit relative to the single-predictor model.

  • Does the second predictor significantly improve the model?
  • Report the test-statistic, degrees of freedom, and p-value.
  • Use of anova() is prohibited.
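The comparison of nested models can be computed directly from the residual sums of squares. The sketch below assumes the standard nested-model F-test, F = ((RSS1 - RSS2) / (df1 - df2)) / (RSS2 / df2), and uses synthetic stand-in data rather than lake.csv.

```r
# Synthetic stand-in data; not the real lake.csv.
set.seed(6)
lake_clean <- data.frame(NIN = runif(30, 1, 10), TW = runif(30, 0.1, 2))
lake_clean$TN <- 0.5 + 0.3 * lake_clean$NIN - 0.4 * lake_clean$TW +
  rnorm(30, sd = 0.3)

m1 <- lm(TN ~ NIN, data = lake_clean)        # reduced model
m2 <- lm(TN ~ NIN + TW, data = lake_clean)   # full model

rss1 <- sum(resid(m1)^2)
rss2 <- sum(resid(m2)^2)

df_num <- df.residual(m1) - df.residual(m2)  # number of added predictors
df_den <- df.residual(m2)                    # residual df of the full model

F_stat  <- ((rss1 - rss2) / df_num) / (rss2 / df_den)
p_value <- pf(F_stat, df_num, df_den, lower.tail = FALSE)

cat("F(", df_num, ",", df_den, ") =", F_stat, " p =", p_value, "\n")
```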

Question 7

Assume that a lake has an average influent nitrogen concentration of 5.7 and a water retention time of 0.98.

Use the preferred model, as determined from the previous question, to predict the annual nitrogen concentration of that lake.
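A hedged sketch of the prediction step: predict() takes a data frame whose column names match the model's predictors. The two-predictor model is used here purely for illustration, since which model is preferred depends on the real data, and the fitted model below comes from synthetic data.

```r
# Synthetic stand-in data; not the real lake.csv.
set.seed(7)
lake_clean <- data.frame(NIN = runif(30, 1, 10), TW = runif(30, 0.1, 2))
lake_clean$TN <- 0.5 + 0.3 * lake_clean$NIN - 0.4 * lake_clean$TW +
  rnorm(30, sd = 0.3)

m2 <- lm(TN ~ NIN + TW, data = lake_clean)

# The new lake described in the question.
new_lake <- data.frame(NIN = 5.7, TW = 0.98)
predicted_TN <- predict(m2, newdata = new_lake)
print(predicted_TN)
```

If the single-predictor model turns out to be preferred, the same new_lake data frame can be passed to it; predict() only uses the columns the model needs.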


Question 8

A psychologist studying perceived quality of life in a large number of cities came up with the following equation using mean temperature (°F) and median income in $1,000 as predictors:

Predicted Quality of Life = 5.37 - 0.01 Temp + 0.05 Income

Interpret the regression equation in terms of the coefficients.
(i.e., state what each predictor of the model means in plain English)


Question 9

Using the model from the previous question, assume a city has a mean temperature of 55 degrees and a median income of $12,000.

What is its predicted Quality of Life score?
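The arithmetic can be checked with a small function. This sketch assumes the sign in front of the temperature coefficient is a minus (it appears to have been lost from the equation as printed), and it enters income in thousands of dollars, so $12,000 becomes 12. The function name quality_of_life is made up for illustration.

```r
# Assumption: the equation is 5.37 - 0.01 * Temp + 0.05 * Income,
# with Income measured in $1,000 units.
quality_of_life <- function(temp_f, income_thousands) {
  5.37 - 0.01 * temp_f + 0.05 * income_thousands
}

score <- quality_of_life(temp_f = 55, income_thousands = 12)
print(score)  # 5.37 - 0.55 + 0.60 = 5.42
```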


Question 10

You are a highfalutin marketing guru who wants to predict the sales of your brand using the data set DataDrivenMarketing.csv.

However, there are a number of missing (NA) values in this data set.

Using what you know about R, report how many NA values are in each of the dataset's columns.

(Note that there are many different ways to achieve this.)
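One of those ways, as a minimal sketch: is.na() returns a logical matrix for a data frame, and colSums() counts the TRUE values per column. The tiny data frame below is invented for illustration and is not DataDrivenMarketing.csv.

```r
# Toy data frame with a few planted NAs.
marketing <- data.frame(
  TV    = c(10, NA, 30),
  Radio = c(NA, NA, 5),
  Sales = c(100, 200, 300)
)

na_counts <- colSums(is.na(marketing))
print(na_counts)
```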


Question 11

Recall that base-R has a function read.csv() that can be used to load a CSV file. The tidyverse has the function read_csv().

Compare how both functions load the DataDrivenMarketing.csv data.

There is a subtle difference; explain what it is.

Tip: Look at the number of NAs.
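One difference that can change the NA counts, shown on a made-up file: read.csv() by default only treats the string "NA" as missing, so a blank cell in a text column is kept as an empty string, while read_csv() by default treats both "" and "NA" as missing. Whether this is the difference the question has in mind is an assumption. The sketch mimics the read_csv() default in base R with na.strings and only calls readr if it is installed.

```r
# Write a tiny CSV with one blank text cell and one blank numeric cell.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("TV,Influencer,Sales",
             "10,Mega,100",
             "20,,200",
             ",Micro,300"), csv_path)

# Base R default: the blank Influencer cell becomes "", not NA.
base_default <- read.csv(csv_path)
print(colSums(is.na(base_default)))

# Base R told to treat blanks as missing, mirroring the read_csv() default.
base_blank_na <- read.csv(csv_path, na.strings = c("", "NA"))
print(colSums(is.na(base_blank_na)))

# Optional: the tidyverse reader itself, if readr is available.
if (requireNamespace("readr", quietly = TRUE)) {
  tidy <- suppressMessages(readr::read_csv(csv_path))
  print(colSums(is.na(tidy)))
}
```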


Question 12

Using the DataDrivenMarketing.csv data, create a data frame that removes any row which has an NA value.

Report how many rows this new data frame has.

Use this cleaned up data set for all subsequent questions.
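A short sketch of row-wise NA removal on an invented data frame (not the real file): na.omit() drops every row containing at least one NA, and complete.cases() gives the equivalent logical index.

```r
# Toy data frame with NAs in two of its four rows.
marketing <- data.frame(
  TV    = c(10, NA, 30, 40),
  Radio = c(1, 2, NA, 4),
  Sales = c(100, 200, 300, 400)
)

marketing_clean <- na.omit(marketing)
print(nrow(marketing_clean))

# Equivalent selection using a logical index.
marketing_clean_alt <- marketing[complete.cases(marketing), ]
```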


Question 13

Plot a correlation matrix of the possible (quantitative) predictors you could include in your model that predicts sales.

  • Do not use default colours
  • Make the category labels black
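A possible sketch using the corrplot package, which is an assumption here since the sheet does not name a plotting package. The data are synthetic, the column names are taken from the later questions, and the colour choice is arbitrary. The plot call is guarded so the snippet still runs where corrplot is not installed.

```r
# Synthetic stand-in for the cleaned marketing data.
set.seed(13)
marketing_clean <- data.frame(TV = rnorm(50, mean = 50, sd = 10))
marketing_clean$Radio <- 0.3 * marketing_clean$TV + rnorm(50, sd = 3)
marketing_clean$Social.Media <- 0.1 * marketing_clean$Radio + rnorm(50, sd = 1)

# Correlation matrix of the quantitative predictors.
cor_mat <- cor(marketing_clean[, c("TV", "Radio", "Social.Media")])
print(round(cor_mat, 2))

if (requireNamespace("corrplot", quietly = TRUE)) {
  # Non-default palette and black text labels.
  custom_cols <- colorRampPalette(c("darkorange", "white", "purple4"))(200)
  corrplot::corrplot(cor_mat, method = "color",
                     col = custom_cols, tl.col = "black")
}
```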

Question 14

Because the predictors are somewhat correlated, you suspect that multicollinearity may be affecting the regression estimates.

Investigate this possibility by comparing two models:

Sales = b0 + b1(TV) + b2(Radio) + b3(Social Media)

to a model with just TV and Social Media:

Sales = b0 + b1(TV) + b2(Social Media)

Use an appropriate diagnostic to assess the extent of collinearity among the predictors in each model.

Summarize your findings and explain which model is more affected.
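One common diagnostic is the variance inflation factor (VIF), computed for each predictor as 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors in the model. The sketch below does this by hand in base R on synthetic data; vif_manual is a hypothetical helper, and whether VIF is the diagnostic the course expects is an assumption.

```r
# Synthetic stand-in data with correlated predictors.
set.seed(14)
ads <- data.frame(TV = rnorm(100, mean = 50, sd = 10))
ads$Radio <- 0.3 * ads$TV + rnorm(100, sd = 2)
ads$Social.Media <- 0.2 * ads$Radio + rnorm(100, sd = 1)

# Hypothetical helper: VIF for each predictor in a set.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

vif_full    <- vif_manual(ads, c("TV", "Radio", "Social.Media"))
vif_reduced <- vif_manual(ads, c("TV", "Social.Media"))

print(vif_full)
print(vif_reduced)
```

Larger values indicate that a predictor is more strongly explained by the others; comparing the two printed vectors shows which model is more affected.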


Question 15

Search for outliers in the $TV and $Social.Media columns.

Report how many you find in each and remove them from the data set for subsequent questions.


Question 16

Using the data set with outliers removed, begin a hierarchical regression by creating a model with just TV as a predictor.

Report the model's formula (with coefficients) and R² statistic.
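Both R-squared and its square root (multiple R) can be read off a fitted model; which of the two the question wants is left open here. A minimal sketch on synthetic data:

```r
# Synthetic stand-in for the outlier-free marketing data.
set.seed(16)
sales_data <- data.frame(TV = runif(100, 10, 100))
sales_data$Sales <- 3.5 * sales_data$TV + rnorm(100, sd = 10)

m_tv <- lm(Sales ~ TV, data = sales_data)

print(coef(m_tv))                     # intercept and TV slope for the formula
r_squared  <- summary(m_tv)$r.squared
multiple_r <- sqrt(r_squared)

print(r_squared)
print(multiple_r)
```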


Question 17

Repeat the previous question, but include Social Media as a predictor.


Question 18

Conduct a test to evaluate whether social media significantly improves the fit of the model.

  • Use of anova() is prohibited
  • Report the F-statistic, degrees of freedom, p-value, and conclusion

Question 19

Build an ordinary least-squares regression model with both TV and Influencer as predictors of Sales.

  • Are each of its coefficients significantly different than 0?
  • What is the multiple R of this model?

Question 20

Conduct an F-test to evaluate whether Influencer significantly improves the fit of the model over one with just TV as a predictor.

  • Which is the preferred model?
  • Use of anova() is prohibited
  • Report the F-statistic, degrees of freedom, and p-value

Question 21

Using the preferred model from the previous question, create a plot of the residuals to evaluate homogeneity of variance.

Is the assumption reasonable?
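A rough sketch of a residuals-versus-fitted plot. The model below (Sales predicted from TV on synthetic data) is only a placeholder for whichever model is preferred. Under homogeneity of variance the points should form a band of roughly constant vertical spread around zero.

```r
# Synthetic stand-in data and a placeholder model.
set.seed(21)
sales_data <- data.frame(TV = runif(100, 10, 100))
sales_data$Sales <- 3.5 * sales_data$TV + rnorm(100, sd = 10)
model <- lm(Sales ~ TV, data = sales_data)

fitted_vals <- fitted(model)
residuals_  <- resid(model)

plot(fitted_vals, residuals_,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)   # reference line at zero
```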


Question 22

Using the preferred model, evaluate whether the residuals are normally distributed.

Is the assumption reasonable?
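Two common checks, sketched on a placeholder model fitted to synthetic data: a normal Q-Q plot of the residuals, and the Shapiro-Wilk test. Which check the course expects is an assumption.

```r
# Synthetic stand-in data and a placeholder model.
set.seed(22)
sales_data <- data.frame(TV = runif(100, 10, 100))
sales_data$Sales <- 3.5 * sales_data$TV + rnorm(100, sd = 10)
model <- lm(Sales ~ TV, data = sales_data)

res <- resid(model)

# Visual check: points close to the line suggest approximate normality.
qqnorm(res)
qqline(res)

# Formal check: a small p-value is evidence against normality.
sw <- shapiro.test(res)
print(sw)
```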
