Computer Science Question

For this assignment you will write an R program to complete the tasks given below. You will hand in two files for this assignment.

A file with your R program. This file should contain only the code (no output) and must have the standard .R extension. No other file extensions will be accepted. The reason is that the assignment will be graded based on your R code and not the output file. The output file will be used to verify the code commands. Also, please make sure that all comments, discussion, and conclusions regarding results are annotated as part of your code.
A PDF/DOC file with your output. We are giving you more flexibility regarding how you want to present your output (tables, plots, etc.). You can either use RMD files that combine code, narrative text, and plots, or you can use a Word document with copy and paste from the R platform you are using. However, please remember that all output (tables, plots, comments, conclusions, etc.) shown in this file has to be generated by the same R code that you submit. This is important! Output that is generated by separate code, or that is not supported by the submitted code, will not be graded. Screenshots will not be accepted.

Use the following files:

R Data Set: HMEQ_Loss.csv (in the zip file attached).

The Data Dictionary in the zip file.

Grouping Variable: TARGET_BAD_FLAG

Step 1: Read in the Data
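
A minimal sketch of this step, assuming HMEQ_Loss.csv has been unzipped into the current working directory (the path and the helper name read_hmeq are illustrative assumptions, not part of the assignment):

```r
# Assumed location of the data file; adjust the path to wherever the zip was extracted.
csv_path <- "HMEQ_Loss.csv"

# Hypothetical helper: read the CSV and confirm the grouping variable is present.
read_hmeq <- function(path) {
  df <- read.csv(path, stringsAsFactors = FALSE)
  stopifnot("TARGET_BAD_FLAG" %in% names(df))
  df
}

# Only read and inspect the data if the file is actually there.
if (file.exists(csv_path)) {
  df <- read_hmeq(csv_path)
  str(df)      # column names and types
  summary(df)  # ranges and NA counts per column
}
```

Running str() and summary() right after the read is a quick way to see which columns are numeric, which are character, and which contain missing values.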

Step 2: Box-Whisker Plots

Plot a box plot of all the numeric variables split by the grouping variable. The plot needs the following:

The MAIN TITLE of the box plot should be set to your name
Add color to the boxes
Comment on whether or not there are any observable differences in the box plots between the two groups.
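
The requirements above can be sketched roughly as follows. The small data frame is a made-up stand-in so the example runs on its own; LOAN, VALUE, and JOB are assumed column names, and "Your Name" is a placeholder for the title:

```r
# Stand-in data so the sketch is self-contained; replace with the data read in Step 1.
set.seed(1)
df <- data.frame(
  TARGET_BAD_FLAG = rep(c(0, 1), each = 20),
  LOAN  = c(rnorm(20, 18000, 4000), rnorm(20, 15000, 6000)),     # assumed column
  VALUE = c(rnorm(20, 100000, 20000), rnorm(20, 90000, 30000)),  # assumed column
  JOB   = sample(c("Office", "Sales"), 40, replace = TRUE),      # non-numeric, skipped
  stringsAsFactors = FALSE
)

group_var <- "TARGET_BAD_FLAG"

# Every numeric column except the grouping variable itself
num_vars <- setdiff(names(df)[vapply(df, is.numeric, logical(1))], group_var)

# One box plot per numeric variable, split by the grouping variable
for (v in num_vars) {
  boxplot(df[[v]] ~ df[[group_var]],
          col  = c("steelblue", "tomato"),  # one color per group
          main = "Your Name",               # placeholder: put your own name here
          xlab = group_var,
          ylab = v)
  # Comment here on whether the two groups differ for this variable.
}
```

Looping over the numeric columns avoids repeating the boxplot() call by hand and keeps the title and colors consistent across plots.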

Step 3: Histograms

Plot a histogram of at least one of the numeric variables
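
One possible version of this step is sketched below. The LOAN column and its simulated values are assumptions used only to make the example runnable; the bin count and the density overlay are optional choices, not requirements:

```r
# Stand-in data; replace df$LOAN with a numeric column from the real data set.
set.seed(1)
df <- data.frame(LOAN = round(rlnorm(200, meanlog = 9.7, sdlog = 0.5)))

# Histogram on the density scale so a density curve can be drawn on top of it
h <- hist(df$LOAN,
          breaks = 20,          # suggested number of bins
          freq   = FALSE,
          col    = "lightblue",
          main   = "Histogram of LOAN",
          xlab   = "LOAN")

# Optional overlay: kernel density estimate of the same variable
lines(density(df$LOAN, na.rm = TRUE), lwd = 2)
```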

Step 4: Impute (“Fix”) all the numeric variables that have missing values

For the missing Target variables, simply set the missing values to zero
For the remaining numeric variables with missing values, create two new variables. One variable will have a name beginning with IMP_ and it will contain the imputed value. The second variable will have a name beginning with M_ and it will contain a 1 if the record was imputed and a 0 if it was not.
You may impute with any method that makes sense. The median or mean value will be useful in most cases.
Push yourself! Try one complex imputation like the one described in the lectures.
Delete the original variable after it has been imputed.
Run a summary to prove that all the variables have been imputed
Compute a sum for all the M_ variables to prove that the number of flags is equal to the number of missing values.
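
The steps above can be sketched as follows, using median imputation. The data frame is a made-up stand-in, TARGET_LOSS_AMT, LOAN, VALUE, and JOB are assumed column names, and treating every column whose name starts with TARGET_ as a target variable is also an assumption:

```r
# Stand-in data with missing values; replace with the data read in Step 1.
df <- data.frame(
  TARGET_BAD_FLAG = c(0, 1, 0, 1, 0, 1),
  TARGET_LOSS_AMT = c(NA, 500, NA, 1200, NA, NA),
  LOAN  = c(1000, 2000, 1500, NA, 3000, 2500),
  VALUE = c(50000, NA, 70000, 65000, NA, 80000),
  JOB   = c("Office", "Sales", "Office", "Other", "Sales", "Other"),
  stringsAsFactors = FALSE
)

# Remember the NA counts so the M_ flags can be checked against them later
missing_before <- colSums(is.na(df))

# Target variables: set missing values to zero
target_vars <- grep("^TARGET_", names(df), value = TRUE)
for (v in target_vars) {
  df[[v]][is.na(df[[v]])] <- 0
}

# Remaining numeric variables that contain at least one missing value
num_vars    <- setdiff(names(df)[vapply(df, is.numeric, logical(1))], target_vars)
vars_to_fix <- num_vars[vapply(df[num_vars], anyNA, logical(1))]

for (v in vars_to_fix) {
  was_missing <- is.na(df[[v]])
  med <- median(df[[v]], na.rm = TRUE)
  df[[paste0("IMP_", v)]] <- ifelse(was_missing, med, df[[v]])  # imputed value
  df[[paste0("M_", v)]]   <- as.integer(was_missing)            # 1 = was imputed
  df[[v]] <- NULL                                               # drop the original
}

# Proof 1: the summary should report no NA's for any numeric column
summary(df)

# Proof 2: each M_ sum should equal the original NA count of its variable
flag_cols <- grep("^M_", names(df), value = TRUE)
flag_sums <- colSums(df[flag_cols])
print(flag_sums)
stopifnot(all(flag_sums == missing_before[sub("^M_", "", flag_cols)]))
```

The median is used here because it is less sensitive to outliers than the mean. A more elaborate option would be to impute within groups (for example, the median per JOB category), though whether that matches the method from the lectures is not something this sketch can confirm.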

Step 5: One Hot Encoding

For the character / category variables, perform one hot encoding. To do this, create a Flag variable for each category.

Delete the original class variable

Run a summary to show that the category variables have been replaced by Flag variables.
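
A minimal sketch of the encoding, written in base R. The stand-in data, the REASON and JOB column names, the FLAG_ prefix, and the BLANK label for empty strings are all assumptions made for illustration:

```r
# Stand-in data; replace with the imputed data frame from Step 4.
df <- data.frame(
  TARGET_BAD_FLAG = c(0, 1, 0, 1),
  REASON = c("DebtCon", "HomeImp", "DebtCon", ""),   # "" stands for a blank category
  JOB    = c("Office", "Sales", "Other", "Office"),
  stringsAsFactors = FALSE
)

# All character (category) variables
char_vars <- names(df)[vapply(df, is.character, logical(1))]

for (v in char_vars) {
  # One 0/1 flag per distinct category value
  for (lev in sort(unique(df[[v]]))) {
    label     <- if (lev == "") "BLANK" else lev
    flag_name <- paste0("FLAG_", v, "_", make.names(label))
    df[[flag_name]] <- as.integer(df[[v]] == lev)
  }
  df[[v]] <- NULL  # delete the original class variable
}

# The summary should now list only numeric columns, including the new flags
summary(df)
```

This version creates a flag for every category, as the instructions ask. If the data contains true NA values in a category column rather than blank strings, those rows would get NA in the flags, so they would need to be recoded first.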
