Computer Science Question

For this assignment you will write an R program to complete the tasks given below. You will hand in two files for this assignment.

  • A file with your R program. This file should contain only the code (no output) and must have the standard .R extension. No other file extensions will be accepted. The reason is that the assignment will be graded based on your R code and not on the output file. The output file will be used to verify the code commands. Also, please make sure that all comments, discussion, and conclusions regarding results are included as annotations (comments) in your code.
  • A PDF/DOC file with the output of your code. We are giving you more flexibility regarding how you want to present your output (tables, plots, etc.). You can either use RMD files that combine code, narrative text, and plots, or you can use a Word document with output copied and pasted from the R platform you are using. However, please remember that all output (tables, plots, comments, conclusions, etc.) shown in this file has to be generated by the same R code that you submit. This is important! Output that is generated using separate code, or that is not supported by the submitted code, will not be graded. Screenshots will not be accepted.

Use the following files:

  • R Data Set: HMEQ_Loss.csv (in the zip file attached).
  • The Data Dictionary in the zip file.
  • Grouping Variable: TARGET_BAD_FLAG

    Step 1: Read in the Data

    • Read the data into R
    • List the structure of the data (str)
    • Execute a summary of the data
    • Print the first six records
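
A minimal sketch of Step 1 follows. So that it runs without the assignment's zip file, it first writes a small made-up CSV to a temporary folder and reads that back; the columns LOAN and JOB and all values are illustrative assumptions, not taken from the data dictionary. For the real assignment, the path would point at HMEQ_Loss.csv instead.

```r
# Toy stand-in for HMEQ_Loss.csv (made-up values; LOAN and JOB are assumed column names).
toy <- data.frame(
  TARGET_BAD_FLAG = c(1, 0, 0, 1, 0, 0, 1),
  LOAN            = c(1100, 1300, 1500, 1500, 1700, 1800, 2000),
  JOB             = c("Other", "Office", "Mgr", "Mgr", "Other", "Office", "Sales")
)
csv_path <- file.path(tempdir(), "HMEQ_Loss_toy.csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read the data into R. For the assignment, replace csv_path with "HMEQ_Loss.csv".
df <- read.csv(csv_path, stringsAsFactors = FALSE)

str(df)             # structure: column names, types, first few values
print(summary(df))  # summary statistics for every column
print(head(df))     # first six records (head() defaults to n = 6)
```

Wrapping `summary()` and `head()` in `print()` keeps the output visible when the script is run through `source()`.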

    Step 2: Box-Whisker Plots

    Plot a box plot of all the numeric variables split by the grouping variable. The plot needs the following:

    • The MAIN TITLE of the box plot should be set to your name
    • Add color to the boxes
    • Comment on whether or not there are any observable differences in the box plots between the two groups.
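
One possible way to approach Step 2 is sketched below on a small made-up data frame; the columns LOAN and VALUE and their values are assumptions for illustration. The loop draws one box plot per numeric variable, split by TARGET_BAD_FLAG, and the group medians at the end are an optional numeric aid for writing the comparison comment.

```r
# Toy data (made-up values; LOAN and VALUE are assumed column names).
df <- data.frame(
  TARGET_BAD_FLAG = c(0, 0, 0, 0, 1, 1, 1, 1),
  LOAN            = c(1300, 1500, 1700, 1800, 1100, 1500, 2000, 2400),
  VALUE           = c(68400, 112000, 40320, 57037, 39025, 16700, 62250, 43034)
)

# All numeric columns except the grouping variable itself.
num_vars <- names(df)[sapply(df, is.numeric)]
num_vars <- setdiff(num_vars, "TARGET_BAD_FLAG")

# One box plot per numeric variable, split by the grouping variable.
for (v in num_vars) {
  boxplot(df[[v]] ~ df$TARGET_BAD_FLAG,
          main = "Your Name",               # replace with your own name
          xlab = "TARGET_BAD_FLAG",
          ylab = v,
          col  = c("steelblue", "tomato"))  # one color per group
}

# Median of each variable within each group, to support the written comment.
group_medians <- sapply(num_vars, function(v) {
  tapply(df[[v]], df$TARGET_BAD_FLAG, median, na.rm = TRUE)
})
print(group_medians)
```

Any observations about differences between the two groups would then go into the script as `#` comments next to the plotting code.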

    Step 3: Histograms

    Plot a histogram of at least one of the numeric variables

    • Manually set the number of breaks to a value that makes sense
    • Superimpose a density line on the graph
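
The following is a rough sketch of Step 3 using simulated numbers in place of a real column (the variable name `loan`, the simulated distribution, and the choice of 20 breaks are all assumptions). The key detail is `freq = FALSE`, which puts the histogram on the density scale so the superimposed line matches the bars.

```r
# Simulated stand-in for one numeric column (not real HMEQ data).
set.seed(1)
loan <- round(rlnorm(200, meanlog = 9.7, sdlog = 0.5))

# Histogram on the density scale with a manually chosen number of breaks.
# Note: a single number for `breaks` is treated by hist() as a suggestion.
h <- hist(loan,
          breaks = 20,
          freq   = FALSE,
          main   = "Histogram of LOAN",
          xlab   = "LOAN",
          col    = "lightblue")

# Superimpose a kernel density estimate; na.rm matters for columns with NAs.
lines(density(loan, na.rm = TRUE), col = "red", lwd = 2)
```

With the real data, `loan` would be replaced by a column such as `df$LOAN`, and the number of breaks adjusted until the shape of the distribution is readable.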

    Step 4: Impute ("Fix") all the numeric variables that have missing values

    • For the Target variables, simply set the missing values to zero
    • For the remaining numeric variables with missing values, create two new variables. One variable will have a name beginning with IMP_ and it will contain the imputed value. The second variable will have a name beginning with M_ and it will contain a 1 if the record was imputed and a zero if it was not.
    • You may impute with any method that makes sense. The median or mean value will be useful in most cases.
    • Push yourself! Try one complex imputation like the one described in the lectures.
    • Delete the original variable after it has been imputed.
    • Run a summary to prove that all the variables have been imputed
    • Compute a sum for all the M_ variables to prove that the number of flags is equal to the number of missing values.
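
The steps above can be sketched as follows, here with simple median imputation on a made-up data frame. The columns TARGET_LOSS_AMT, LOAN, VALUE, and YOJ and their values are assumptions for illustration, and the more complex imputation from the lectures is not reproduced here.

```r
# Toy data with missing values (made-up; column names other than
# TARGET_BAD_FLAG are assumptions).
df <- data.frame(
  TARGET_BAD_FLAG = c(1, 0, 0, 1, 0, 0),
  TARGET_LOSS_AMT = c(641, NA, NA, 1109, NA, NA),
  LOAN            = c(1100, 1300, 1500, 1500, 1700, 1800),
  VALUE           = c(39025, 68400, NA, 16700, 112000, NA),
  YOJ             = c(10.5, 7, 4, NA, 3, 9)
)

# Target variables: set missing values to zero, as the assignment states.
target_vars <- grep("^TARGET_", names(df), value = TRUE)
for (v in target_vars) {
  df[[v]][is.na(df[[v]])] <- 0
}

# Count missing values per numeric column BEFORE imputing, for the final check.
num_vars  <- names(df)[sapply(df, is.numeric)]
na_counts <- colSums(is.na(df[num_vars]))
to_impute <- names(na_counts)[na_counts > 0]

for (v in to_impute) {
  missing <- is.na(df[[v]])
  # M_ flag: 1 if the record was imputed, 0 otherwise.
  df[[paste0("M_", v)]] <- as.integer(missing)
  # IMP_ variable: original value, or the column median where it was missing.
  df[[paste0("IMP_", v)]] <- ifelse(missing, median(df[[v]], na.rm = TRUE), df[[v]])
  # Delete the original variable after it has been imputed.
  df[[v]] <- NULL
}

print(summary(df))  # no column should report NA's any more

# The flag totals should match the original missing-value counts.
flag_sums <- colSums(df[paste0("M_", to_impute)])
print(flag_sums)
print(na_counts[to_impute])
```

Recording `na_counts` before the loop is what makes the last check meaningful, since the original columns no longer exist afterwards.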

    Step 5: One Hot Encoding

  • For the character / category variables, perform one hot encoding. For this, create a Flag variable for each category.
  • Delete the original class variable
  • Run a summary to show that the category variables have been replaced by Flag variables.
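
A minimal sketch of one hot encoding in base R, assuming made-up category columns REASON and JOB and a hypothetical FLAG_ naming scheme (neither is prescribed by the assignment). Blank or missing categories get no flag of their own in this version, so those rows carry zeros in every flag.

```r
# Toy data (made-up; REASON and JOB are assumed column names).
df <- data.frame(
  TARGET_BAD_FLAG = c(1, 0, 0, 1, 0),
  REASON          = c("HomeImp", "DebtCon", "DebtCon", "", "HomeImp"),
  JOB             = c("Other", "Office", "Mgr", "Other", ""),
  stringsAsFactors = FALSE
)

# Character or factor columns are treated as category variables.
char_vars <- names(df)[sapply(df, function(x) is.character(x) || is.factor(x))]

for (v in char_vars) {
  values <- as.character(df[[v]])
  # Distinct non-blank, non-missing categories.
  cats <- sort(unique(values[!is.na(values) & values != ""]))
  for (ct in cats) {
    # One 0/1 flag per category, e.g. FLAG_JOB_Office (hypothetical naming).
    df[[paste0("FLAG_", v, "_", ct)]] <- as.integer(!is.na(values) & values == ct)
  }
  # Delete the original class variable.
  df[[v]] <- NULL
}

print(summary(df))  # only numeric flag columns should remain
```

Whether to keep a flag for every category or drop one as a reference level is a modeling choice; this sketch keeps all of them.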

    Essential Activities:

    1. Watch all the training videos
    2. Execute the example code while watching the training videos.

    Notes:

    1. This assignment is due Sunday at 11:59 PM EST
