Description
1 Utilities (50 points)
File name: utilities.py
Implement: You will implement three functions listed here and detailed below.
def generate_vocab(dir, min_count, max_files)
def create_word_vector(fname, vocab)
def load_data(dir, vocab, max_files)
Write-Up: Describe your implementation concisely.
def generate_vocab(dir, min_count, max_files)
This function takes a starting directory dir (e.g. “aclImdb/train”) and min_count, the minimum
number of times a word must be seen before it is added to the vocabulary. If min_count=2, then only
words that appear 2 or more times in the dataset are considered part of the vocabulary. The
function also takes a parameter max_files, which is purely for implementation purposes: it allows you to
run small tests without using the full dataset. Generating feature vectors for the full dataset takes a long time,
so you may want to start with only 200 files. If max_files=-1 then all files are used. The function returns the
vocabulary as a list or numpy array. Remember that when using max_files, you should be sure to grab
an even number of positive and negative samples.
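One way to meet these requirements is sketched below. This is a hypothetical implementation, not the required one: it assumes the aclImdb layout (dir/pos/*.txt and dir/neg/*.txt), lowercased whitespace tokenization, and an even pos/neg split via max_files // 2 per class; your tokenizer and file handling may differ.

```python
import os
from collections import Counter

def generate_vocab(dir, min_count, max_files):
    """Build a vocabulary from the pos/ and neg/ review folders under `dir`.

    Hypothetical sketch: assumes dir/pos/*.txt and dir/neg/*.txt and
    whitespace tokenization.
    """
    counts = Counter()
    for label in ("pos", "neg"):
        folder = os.path.join(dir, label)
        files = sorted(os.listdir(folder))
        if max_files != -1:
            # Take an even split: half of max_files from each class.
            files = files[:max_files // 2]
        for fname in files:
            with open(os.path.join(folder, fname), encoding="utf-8") as f:
                counts.update(f.read().lower().split())
    # Keep only words seen at least min_count times.
    return [w for w, c in counts.items() if c >= min_count]
```

Counting first and filtering afterward keeps the min_count rule global across both classes, rather than per file or per class.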
def create_word_vector(fname, vocab)
This function takes the vocabulary and the filename fname of a review file and generates one feature
vector for that review. Assume that the aclImdb directory is in the same directory as the test script.py.
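A minimal sketch of this function, assuming a bag-of-words count vector (a binary presence/absence vector would be an equally reasonable choice) and the same lowercased whitespace tokenization used when building the vocabulary:

```python
import numpy as np

def create_word_vector(fname, vocab):
    """Turn one review file into a count feature vector over `vocab`.

    Hypothetical sketch: counts occurrences of each vocabulary word;
    out-of-vocabulary words are ignored.
    """
    index = {w: i for i, w in enumerate(vocab)}  # word -> column position
    vec = np.zeros(len(vocab))
    with open(fname, encoding="utf-8") as f:
        for word in f.read().lower().split():
            if word in index:                    # skip words not in vocab
                vec[index[word]] += 1
    return vec
```

Building the word-to-index dictionary once keeps the lookup O(1) per token instead of scanning the vocabulary list for every word.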
def load_data(dir, vocab, max_files)
This function loads the data, returning feature vectors and their associated labels as two lists/arrays
X and Y. max_files is again for implementation reasons, to allow for smaller tests. If max_files=-1
then all files are used.
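The loader can reuse the same vectorization as create_word_vector. The sketch below is hypothetical: it inlines the count-vector logic, labels positive reviews 1 and negative reviews 0, and applies the same even pos/neg split for max_files; your label encoding may differ.

```python
import os
import numpy as np

def load_data(dir, vocab, max_files):
    """Load reviews under dir/pos and dir/neg into features X and labels Y.

    Hypothetical sketch: pos -> 1, neg -> 0; count vectors over `vocab`.
    """
    index = {w: i for i, w in enumerate(vocab)}
    X, Y = [], []
    for label, y in (("pos", 1), ("neg", 0)):
        folder = os.path.join(dir, label)
        files = sorted(os.listdir(folder))
        if max_files != -1:
            files = files[:max_files // 2]  # even pos/neg split
        for fname in files:
            vec = np.zeros(len(vocab))
            with open(os.path.join(folder, fname), encoding="utf-8") as f:
                for word in f.read().lower().split():
                    if word in index:
                        vec[index[word]] += 1
            X.append(vec)
            Y.append(y)
    return np.array(X), np.array(Y)
```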
2 ML (50 points)
File name: ml.py
You will use sklearn to test many different models on the ACL IMDB data.
Implement: You will implement the functions listed here and detailed below.
def dt_train(X,Y)
def kmeans_train(X)
def knn_train(X,Y,K)
def perceptron_train(X,Y)
def nn_train(X,Y, hls)
def pca_train(X,K)
def pca_transform(X,pca)
def svm_train(X,Y,k)
def model_test(X,model)
def compute_F1(Y, Y_hat)
All of the _train functions are very similar: they take data in the form of feature vectors and labels
(except K-Means and PCA, which take no labels) and produce models. The models are then passed to
model_test to make predictions on a set of test data. knn_train takes a K, the number of neighbors to
consider. nn_train takes hls, which stands for hidden layer size; this can be a tuple of any length,
depending on how many layers you want (e.g. (5,2) or (3)). pca_train takes a K, the number of principal
components to keep when learning the transformation. pca_transform applies the learned transformation
pca to the data. svm_train takes a k, which stands for kernel; this is a string (see the sklearn
documentation). compute_F1 should take the labels and the predictions and compute the F1 score.
Write-Up: Describe your implementations concisely.
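Since the assignment leaves most hyperparameters open, the wrappers can be thin. The sketch below is one possible shape, assuming binary labels and sklearn defaults for everything the assignment does not specify (e.g. n_clusters=2 for KMeans is an assumption, not a requirement):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def dt_train(X, Y):
    return DecisionTreeClassifier().fit(X, Y)

def kmeans_train(X):
    return KMeans(n_clusters=2).fit(X)    # unsupervised: no labels used

def knn_train(X, Y, K):
    return KNeighborsClassifier(n_neighbors=K).fit(X, Y)

def perceptron_train(X, Y):
    return Perceptron().fit(X, Y)

def nn_train(X, Y, hls):
    return MLPClassifier(hidden_layer_sizes=hls).fit(X, Y)

def pca_train(X, K):
    return PCA(n_components=K).fit(X)     # learn the transformation only

def pca_transform(X, pca):
    return pca.transform(X)               # apply a learned PCA to data

def svm_train(X, Y, k):
    return SVC(kernel=k).fit(X, Y)        # k is a kernel string, e.g. "linear"

def model_test(X, model):
    return model.predict(X)

def compute_F1(Y, Y_hat):
    return f1_score(Y, Y_hat)
```

Note that KMeans assigns arbitrary cluster ids, so its predictions are not aligned with the true labels; this is one reason a clustering model can score much lower F1 than the supervised models.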
The test script gives the following results:
Decision Tree: 0.6250000000000001
Decision Tree + PCA: 0.5242718446601942
KMeans: 0.2769230769230769
KMeans + PCA: 0.2769230769230769
KNN: 0.576
KNN + PCA: 0.5736434108527131
Perceptron: 0.6095238095238096
Perceptron + PCA: 0.5454545454545454
Neural Network: 0.45161290322580644
Neural Network + PCA: 0.5000000000000001
SVM: 0.6222222222222222
SVM + PCA: 0.583941605839416