COEN140 Lab 7

Spam classification using logistic regression
Consider the email spam data set. This consists of 4601 email messages, from which 57
features have been extracted. These are as follows:
• 48 features, giving the percentage of words in a given message which match a given word
on the list. The list contains words such as “business”, “free”, “george”, etc. (The data was
collected by George Forman, so his name occurs quite a lot.)
• 6 features, giving the percentage of characters in the email that match a given character
on the list. The characters are ; ( [ ! $ #
• Feature 55: The average length of an uninterrupted sequence of capital letters
• Feature 56: The length of the longest uninterrupted sequence of capital letters
• Feature 57: The sum of the lengths of all uninterrupted sequences of capital letters
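To make the definitions of features 55 to 57 concrete, here is a minimal sketch of how they could be computed from raw text. The data set already contains these values, so this is only an illustration. The function name capital_run_features is hypothetical, and two choices are assumptions not stated above: only ASCII letters A-Z count as capitals, and a message with no capitals gets zeros.

```python
import re


def capital_run_features(text):
    """Return (average, longest, total) length of uninterrupted capital-letter runs.

    Assumption: only ASCII A-Z counts as a capital letter, and any other
    character (space, digit, punctuation) interrupts a run.
    """
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        # Assumption: a message without capitals gets zeros for all three features.
        return 0.0, 0, 0
    total = sum(runs)
    return total / len(runs), max(runs), total


# "FREE", "CLICK" and "NOW" are three separate runs of lengths 4, 5 and 3.
print(capital_run_features("FREE money, CLICK NOW"))  # → (4.0, 5, 12)
```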
1. Download the data at http://www.cse.scu.edu/~yfang/coen140/spambase.zip. The data is
split into a training set (of size 3065) and a test set (of size 1536).
2. Please normalize the features by standardizing the columns so they all have mean 0
and unit variance.
3. Build and fit a logistic regression model using gradient descent. Report the error rate on
the training and test sets.
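Steps 2 and 3 can be sketched as follows, using NumPy. This is one possible approach under stated assumptions, not a reference solution. The function names, the learning rate, the iteration count, and the 0.5 decision threshold are illustrative choices. The test set is scaled with the training-set mean and standard deviation, which is a common convention but is not specified above. Loading the files is omitted because the layout of the archive is not described here, so the demo runs on synthetic stand-in data.

```python
import numpy as np


def standardize(X_train, X_test):
    """Scale each column to mean 0 and unit variance.

    Assumption: statistics come from the training set only and are
    reused for the test set.
    """
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    return (X_train - mean) / std, (X_test - mean) / std


def sigmoid(z):
    # Clip the argument so np.exp does not overflow for large |z|.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))


def fit_logistic_gd(X, y, lr=0.1, n_iters=2000):
    """Batch gradient descent on the mean cross-entropy loss.

    lr and n_iters are illustrative defaults, not values from the assignment.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)          # predicted P(spam) for each row
        w -= lr * (X.T @ (p - y)) / n   # gradient with respect to the weights
        b -= lr * np.mean(p - y)        # gradient with respect to the bias
    return w, b


def error_rate(X, y, w, b):
    """Fraction of rows whose thresholded prediction differs from the label."""
    pred = (sigmoid(X @ w + b) >= 0.5).astype(int)
    return float(np.mean(pred != y))


# Demo on synthetic stand-in data (NOT the spam data set).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3)) * np.array([1.0, 10.0, 100.0])
y_demo = (X_demo[:, 0] > 0).astype(int)
X_tr, X_te = standardize(X_demo[:200], X_demo[200:])
w, b = fit_logistic_gd(X_tr, y_demo[:200])
print("train error:", error_rate(X_tr, y_demo[:200], w, b))
print("test error:", error_rate(X_te, y_demo[200:], w, b))
```

With the real data, the same three calls would be applied to the 3065-row training matrix and the 1536-row test matrix, with labels encoded as 0 and 1. If the loss oscillates or diverges, a smaller learning rate is the first thing to try.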