CMPSCI691 Homework 8


Question 1

The CSV file questions.csv contains 5,264 questions about science from
the Madsci Network question-and-answer website [5]. Each question is
represented by a unique ID and the text of the question itself.

Function preprocess() (in preprocess.py) takes a CSV file, an optional
stopword file, and an optional list of extra stopwords as input and
returns a corpus (i.e., an instance of the Corpus class), with any
stopwords removed. You can run this code as follows:

>>> corpus = preprocess('questions.csv', 'stopwordlist.txt', ['answer', 'dont', 'find', 'im', 'information', 'ive', 'message', 'question', 'read', 'science', 'wondering'])
>>> print 'V = %s' % len(corpus.vocab)
V = 21919
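The provided Corpus class handles the details, but the core of the preprocessing step (tokenize, lowercase, drop stopwords) can be sketched roughly as follows; simple_preprocess_text is a hypothetical helper for illustration, not part of preprocess.py:

```python
import re

def simple_preprocess_text(text, stopwords):
    """Lowercase the text, tokenize into alphabetic words,
    and drop any token that appears in the stopword set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords]
```

Applied to each question's text, with the stopword list and extra stopwords merged into a single set, this yields the token sequences from which the vocabulary is built.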

The resultant corpus may be split into a corpus of “training”
documents and a corpus of “testing” documents as follows:

>>> train_corpus = corpus[:-100]
>>> print len(train_corpus)
5164
>>> test_corpus = corpus[-100:]
>>> print len(test_corpus)
100
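Note that slicing produces sub-corpora that still share the full vocabulary (the assert statements in part (a) below rely on this). A minimal sketch of how a corpus class might support this; ToyCorpus is illustrative only, not the provided Corpus class:

```python
class ToyCorpus(object):
    """Hypothetical stand-in for Corpus: slicing returns a sub-corpus
    that shares the full vocabulary, so word indices stay consistent
    between training and testing splits."""

    def __init__(self, documents, vocab=None):
        self.documents = documents
        # build the vocabulary once; slices reuse it unchanged
        if vocab is None:
            vocab = sorted({w for doc in documents for w in doc})
        self.vocab = vocab

    def __len__(self):
        return len(self.documents)

    def __getitem__(self, key):
        if isinstance(key, slice):
            return ToyCorpus(self.documents[key], vocab=self.vocab)
        return self.documents[key]
```

Sharing the vocabulary object across slices is what lets a model trained on train_corpus score documents in test_corpus without re-indexing words.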

Class LDA (in hw8.py) implements latent Dirichlet allocation.

a) Implement code for slice sampling [6] the concentration parameters
alpha and beta, i.e., for drawing alpha and beta from P(alpha, beta |
w_1, …, w_N, z_1, …, z_N, H), by filling in the missing code
in slice_sample(). Hint: work in log space, i.e., represent all
probabilities as log probabilities, and use
log_evidence_corpus_and_z() to compute log P(w_1, …, w_N,
z_1, …, z_N | alpha, beta). You can run your code as follows:

>>> extra_stopwords = ['answer', 'dont', 'find', 'im', 'information', 'ive', 'message', 'question', 'read', 'science', 'wondering']
>>> corpus = preprocess('questions.csv', 'stopwordlist.txt', extra_stopwords)
>>> train_corpus = corpus[:-100]
>>> assert train_corpus.vocab == corpus.vocab
>>> test_corpus = corpus[-100:]
>>> assert test_corpus.vocab == corpus.vocab
>>> V = len(corpus.vocab)
>>> T = 100
>>> alpha = 0.1 * T
>>> m = ones(T) / T
>>> beta = 0.01 * V
>>> n = ones(V) / V
>>> lda = LDA(train_corpus, alpha, m, beta, n)
>>> lda.gibbs(num_itns=250)
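The slice_sample() stub is course-specific, but the underlying technique is Neal's univariate slice sampler [6] with stepping out and shrinkage. A hedged, generic sketch in log space (the function and parameter names here are illustrative, not the assignment's API):

```python
import math
import random

def slice_sample_1d(x0, log_prob, width=1.0, max_steps=100, rng=random):
    """One iteration of univariate slice sampling (Neal, 2003) in log space.

    x0       : current value, with log_prob(x0) finite
    log_prob : log of an unnormalized density, e.g. the role played by
               log_evidence_corpus_and_z() plus the log prior
    width    : initial bracket size for the stepping-out procedure
    """
    # Auxiliary height: log u = log p(x0) + log(Uniform(0, 1)),
    # which defines the slice {x : log p(x) > log u}
    log_y = log_prob(x0) + math.log(rng.random())

    # Step out to find an interval [left, right] that brackets the slice
    left = x0 - width * rng.random()
    right = left + width
    for _ in range(max_steps):
        if log_prob(left) <= log_y:
            break
        left -= width
    for _ in range(max_steps):
        if log_prob(right) <= log_y:
            break
        right += width

    # Shrink: sample uniformly from the interval, narrowing it toward x0
    # whenever the proposal falls outside the slice
    while True:
        x1 = left + (right - left) * rng.random()
        if log_prob(x1) > log_y:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1
```

Since alpha and beta must be positive, one common variant is to slice-sample over log(alpha) and log(beta) instead (adding the log-Jacobian term), which keeps every proposal in the valid domain automatically.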

b) If the random number generator is initialized with a value of 1000
as follows, what are the sampled values of concentration
parameters alpha and beta at the end of iteration 250?

>>> lda.gibbs(num_itns=250, random_seed=1000)

References

[1] http://www.python.org/

[2] http://numpy.scipy.org/

[3] http://www.scipy.org/

[4] http://ipython.org/

[5] http://research.madsci.org/dataset/

[6] http://www.cs.toronto.edu/~radford/ftp/slc-samp.pdf