Questions
Familiarize yourself with the codebook for the Email dataset below, and then import (load) the Email dataset.
Question 1: In this assignment we will focus on predicting whether an email is spam or not. Determine the number of emails that make up this dataset and the proportion of the emails that were spam. (That is, you want to construct a univariate frequency table – something we did on Mini-Assignment 2.) How many emails make up this dataset? What percent of the emails were spam?
Question 2: Determine what proportion of emails with an attachment are spam and what proportion of emails without an attachment are spam. (That is, you want to find the appropriate conditional percent – look for appropriate code under ‘bivariate’.) Fill in the blanks:
____% of emails without an attachment were classified as spam, whereas ____% of emails with an attachment were classified as spam.
Question 3: Construct a chi-square test to determine whether there is an association between spam email and whether an email contained an attachment. State the corresponding test statistic and p-value to test this association. What is your conclusion?
Question 4: Now, fit an appropriate regression model between spam and attachment. What type of regression model is appropriate? Why?
Question 5: Using your model above, what can you say about how the odds of spam vary based on whether an email has an attachment?
Question 6: Determine what proportion of emails with an image are spam and what proportion of emails without an image are spam. Fill in the blanks:
____% of emails without an image were classified as spam, whereas ____% of emails with an image were classified as spam.
Question 7: Fit an appropriate regression model between spam and image. Determine what proportion of emails with an image are spam and what proportion of emails without an image are spam. Interpret the odds ratio to compare emails with images to emails without images.
Question 8: Construct an appropriate model with our response variable (spam) and explanatory variables (attachment, number of characters, image, and whether there is an exclamation point in the subject). Controlling for all other predictor variables, is whether a message is spam independently associated with whether there is an exclamation point in the subject? Why?
Question 9: Using your model from Question 8, answer the following: Controlling for all other predictor variables, as the number of characters in a message increases by 1, what can be said about the odds of it being spam?
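As a rough illustration of the calculations behind Questions 2 through 5, the sketch below works through a single 2×2 table in plain Python. The counts are made up for illustration and are not taken from the Email dataset, and the function name `two_by_two_summary` is hypothetical. It computes the two conditional percentages, the Pearson chi-square statistic (without a continuity correction) with its p-value on 1 degree of freedom, and the sample odds ratio. For a model with one binary predictor, that odds ratio equals the exponentiated slope of the fitted logistic regression.

```python
import math

def two_by_two_summary(a, b, c, d):
    """Summarize a 2x2 table of counts (hypothetical helper).

    Layout (rows = predictor, columns = outcome):
                       not spam   spam
        predictor = 0      a        b
        predictor = 1      c        d
    """
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    observed = [[a, b], [c, d]]

    # Conditional percent of spam within each predictor group.
    pct_spam_0 = 100 * b / row_totals[0]
    pct_spam_1 = 100 * d / row_totals[1]

    # Pearson chi-square: sum of (observed - expected)^2 / expected,
    # with no continuity correction applied.
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))

    # Odds of spam when predictor = 1 divided by odds when predictor = 0.
    odds_ratio = (d / c) / (b / a)

    return {
        "pct_spam_0": pct_spam_0,
        "pct_spam_1": pct_spam_1,
        "chi2": chi2,
        "p_value": p_value,
        "odds_ratio": odds_ratio,
    }

# Made-up counts, for illustration only (not the Email dataset).
summary = two_by_two_summary(a=900, b=100, c=80, d=20)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Statistical software may apply a continuity correction to 2×2 tables by default, so a chi-square statistic computed there can differ slightly from the uncorrected value in this sketch.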
CODEBOOK: E-mail Data
Today we will be working with a corpus of emails received by a single Gmail account over the first three months of 2012. Like any other email address, this account sent and received regular emails and also received a large amount of spam (unsolicited bulk email). We will use what we have learned about logistic regression models to see whether we can build a model that predicts whether or not a message is spam based on a variety of characteristics of the email (e.g. inclusion of words like winner, inherit, or password, the number of exclamation marks used, etc.). While the spam filters used by large corporations like Google and Microsoft are quite a bit more complex, the fundamental idea is the same – binary classification based on a set of predictors.
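To make the idea of binary classification from a set of predictors concrete, here is a minimal sketch of how a fitted logistic regression turns predictor values into a spam probability, and how each coefficient becomes an odds ratio. The intercept and coefficients below are invented for illustration and are not estimates from the Email dataset; the function names `spam_probability` and `odds_ratios` are hypothetical.

```python
import math

def spam_probability(intercept, coefs, features):
    """Predicted probability of spam from a logistic regression.

    log-odds = intercept + sum(coefficient * feature value)
    probability = 1 / (1 + exp(-log-odds))
    """
    log_odds = intercept + sum(coefs[name] * features[name] for name in coefs)
    return 1 / (1 + math.exp(-log_odds))

def odds_ratios(coefs):
    """Exponentiate each coefficient to get its odds ratio."""
    return {name: math.exp(beta) for name, beta in coefs.items()}

# Invented values, for illustration only (not fitted to the Email dataset).
intercept = -2.0
coefs = {"attach": 0.5, "num_char": -0.05}

# Same message length, with and without an attachment.
p_without = spam_probability(intercept, coefs, {"attach": 0, "num_char": 10})
p_with = spam_probability(intercept, coefs, {"attach": 1, "num_char": 10})

print(f"P(spam | no attachment) = {p_without:.4f}")
print(f"P(spam | attachment)    = {p_with:.4f}")
print(odds_ratios(coefs))
```

Holding the other predictors fixed, the ratio of the odds `p / (1 - p)` between the two messages equals the exponentiated coefficient for `attach`, which is the sense in which an odds ratio is interpreted "controlling for" the other variables.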
The description of the data is as follows:
- spam Indicator for whether the email was spam.
- tomultiple Indicator for whether the email was addressed to more than one recipient.
- from Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
- cc Indicator for whether anyone was CCed.
- sent_email Indicator for whether the sender had been sent an email in the last 30 days.
- image Indicates whether any images were attached.
- attach Indicates whether any files were attached.
- dollar Indicates whether a dollar sign or the word ‘dollar’ appeared in the email.
- winner Indicates whether “winner” appeared in the email
- inherent Indicates whether “inherit” (or an extension, such as inheritance) appeared in the email.
- password Indicates whether “password” appeared in the email.
- num_char The number of characters in the email, in thousands.
- line_breaks The number of line breaks in the email (does not count text wrapping).
- format Indicates whether the email was written using HTML (e.g. may have included bolding or active links) or plaintext.
- re_subj Indicates whether the subject started with “Re:”, “RE:”, “re:”, or “rE:”.
- exclaim_subj Indicates whether there was an exclamation point in the subject.
- urgent_subj Indicates whether the word “urgent” was in the email subject.
- exclaim_mess The number of exclamation points in the email message.
- number Factor variable saying whether there was no number, a small number (under 1 million), or a big number.