EECS4415 Big Data Systems Assignment 2 (10%): Distributed Text Analytics using Python

Objective
In this assignment, you will design and implement MapReduce algorithms for performing basic
analytics on textual data. The dataset comes from a collection of 55000+ English song lyrics from
LyricsFreak [1] and can be accessed here (registration with Kaggle is required to download the raw dataset):
Dataset (songdata.csv): https://www.kaggle.com/mousehead/songlyrics
The first set of MapReduce algorithms computes n-grams of the lyrics; the second set computes k-skip-n-grams of the lyrics; the third set computes an inverted index of the lyrics. These are important
statistics and tools commonly used in computational linguistics and information retrieval tasks.
Important Notes:
• You must use the submit command to electronically submit your solution by the due date.
• All programs are to be written using Python 3.
• Your programs should be tested on the docker image that we provided before being submitted.
• To get full marks, your code must be well-documented.
What to Submit
When you have completed the assignment, move or copy your Python scripts and outputs into a directory
(e.g., assignment2), and use the following command to electronically submit your files:
% submit 4415 a2 umapper.py ureducer.py unigrams.txt bmapper.py breducer.py bigrams.txt
tmapper.py treducer.py trigrams.txt frequency-computation.txt skipgrammapper.py
skipgramreducer.py skipgrams.txt iimaper.py iireducer.py inverted-index.txt team.txt
The team.txt file includes information about the team members (first name, last name, student ID,
login, yorku email). You can also submit the files individually after you complete each part of the
assignment: simply execute the submit command with the name of the file that you wish to submit. Make
sure you name your files exactly as stated (including lower/upper case letters). Failure to do so will
result in a mark of 0 being assigned. You may check the status of your submission using the command:
% submit -l 4415 a2

[1] https://www.lyricsfreak.com/
A. Distributed Computation of n-grams (35%, 5% each)
In the field of computational linguistics, an n-gram is a contiguous sequence of n items from a given
sample of text. For this part of the assignment, you can assume that items are words collected from song
lyrics. An n-gram of size 1 is referred to as a “unigram”; one of size 2 is a “bigram”; one of size 3 is a “trigram”.
For example, given the text input “I love ice cream”, the following unigrams, bigrams and
trigrams are computed:
unigrams (“I”, “love”, “ice”, “cream”)
bigrams (“I love”, “love ice”, “ice cream”)
trigrams (“I love ice”, “love ice cream”)
Your task is to design and implement MapReduce algorithms that, given a collection of English songs:
• compute the number of occurrences of each unigram in the song collection (umapper.py,
ureducer.py) and output the results in a file called unigrams.txt
• compute the number of occurrences of each bigram in the song collection (bmapper.py,
breducer.py) and output the results in a file called bigrams.txt
• compute the number of occurrences of each trigram in the song collection (tmapper.py,
treducer.py) and output the results in a file called trigrams.txt
• Finally, explain how you would modify these scripts to compute the frequency of each of the quantities
(instead of the number of occurrences). Provide a short answer in plain text (up to half a page)
in a file named frequency-computation.txt
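The counting tasks above could be approached with a streaming-style mapper and reducer along the following lines. This is a sketch under stated assumptions, not a reference solution: the function names map_lines and reduce_lines are hypothetical, tokens are assumed to be whitespace-separated, n-grams are assumed not to cross line boundaries, and records are assumed to be tab-separated key/count pairs.

```python
from itertools import groupby


def map_lines(lines, n=1):
    """Mapper: yield 'ngram<TAB>1' for every n-gram found on each line."""
    for line in lines:
        words = line.split()  # assumption: whitespace tokenization
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n]) + "\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive records that share a key.

    The input must already be sorted by key, as it is after the shuffle
    phase; groupby only merges adjacent records.
    """
    pairs = (line.rstrip("\n").rsplit("\t", 1)
             for line in sorted_lines if line.strip())
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key + "\t" + str(sum(int(count) for _, count in group))


# In a mapper script, the generator would typically be fed from standard
# input, e.g.:  for record in map_lines(sys.stdin, n=1): print(record)
mapped = sorted(map_lines(["I love ice cream", "I love cake"], n=2))
for record in reduce_lines(mapped):
    print(record)
```

Keeping the logic in functions that accept any iterable of lines makes it possible to check the counting on small in-memory inputs before wiring the scripts to standard input and output.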
The collection of songs is provided in a file songdata.csv that follows the same format as the
original data set provided by Kaggle. The contents of the file might vary when testing your code.
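Because lyrics contain commas and line breaks, splitting the file on commas is unlikely to be reliable; Python's csv module handles quoted multi-line fields. The sketch below assumes the columns are named artist, song, link and text, which should be verified against the header of the actual songdata.csv. The sample row is made up for illustration, and the function name lyrics is hypothetical.

```python
import csv
import io

# Made-up sample in the assumed layout; the real file's header may differ.
SAMPLE = (
    "artist,song,link,text\n"
    '"Some Artist","Some Song","/s/some-song.html","I love\nice cream"\n'
)


def lyrics(csv_file):
    """Yield the lyrics of each song as a single whitespace-normalized line."""
    for row in csv.DictReader(csv_file):
        # Assumption: the lyrics live in a column called "text".
        yield " ".join(row["text"].split())


print(list(lyrics(io.StringIO(SAMPLE))))
```

How multi-line records behave when the input is split across mappers in the Hadoop environment is a separate question that this sketch does not address.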
Running the script:
The following webpage provides useful information on how to test your scripts first locally and then in
the Hadoop environment:
https://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
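A local test typically pipes the input through the mapper, sorts the mapper output, and pipes the result through the reducer. The sketch below emulates that pipeline from Python. It assumes that the mapper and reducer are scripts that read standard input and write standard output, and it uses a plain line sort as a stand-in for Hadoop's shuffle step; the function name run_local is hypothetical.

```python
import subprocess
import sys


def run_local(mapper, reducer, text):
    """Emulate `input | mapper | sort | reducer` on a single machine."""
    mapped = subprocess.run(
        [sys.executable, mapper], input=text,
        capture_output=True, text=True, check=True,
    ).stdout
    # Stand-in for the shuffle phase: records with equal keys become adjacent.
    shuffled = "".join(sorted(mapped.splitlines(keepends=True)))
    return subprocess.run(
        [sys.executable, reducer], input=shuffled,
        capture_output=True, text=True, check=True,
    ).stdout
```

For instance, run_local("umapper.py", "ureducer.py", sample_text) would return the reducer output for a small sample string, which is easier to inspect than a run over the full dataset.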