Assignment 1 ECO481

1 Exercise 1 (25%)
1. Your friend Albert builds a classification algorithm on a dataset with 10,000
features and 100 observations. Since you are the newest machine learning expert in your
group of friends, he decides to ask you for help. (10 points)
• Clearly explain to Albert what his model is likely to suffer from.
• Albert has implemented the model and finds an accuracy rate of 99% on the
training sample. Unfortunately, when he evaluates the model on a new dataset (the
test sample), he gets 50% accuracy. Explain to Albert what it means to have 50%
accuracy.
• Suggest a step you would take to fix the problem Albert is having.
• After correcting the problem, you fit two models. First, you use logistic regression
and get an error rate of 10% on training data and 15% on test data. Then you use
1-nearest neighbors (i.e. K=1) and get an average error rate (simple average
over the test and training datasets) of 9%. Based on these results, which method would
you recommend to Albert for his classification exercise? Clearly explain why.
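Hint for the last bullet: a simple average of training and test error can hide a large gap between the two. The sketch below backs out the test error from the average, assuming that with K=1 every training point is its own nearest neighbor (no duplicated inputs with conflicting labels), so the training error is 0%. The function name `implied_test_error` is hypothetical.

```python
def implied_test_error(average_error, train_error):
    """Recover the test error from a simple average of train and test error."""
    # average = (train + test) / 2  =>  test = 2 * average - train
    return 2 * average_error - train_error


# Assumption: 1-NN reproduces every training label, so its training error is 0.
knn_test_error = implied_test_error(average_error=0.09, train_error=0.0)
logit_test_error = 0.15  # given in the exercise

print(f"1-NN implied test error:        {knn_test_error:.0%}")
print(f"Logistic regression test error: {logit_test_error:.0%}")
```

Only the test errors are comparable across the two methods, since the training error of 1-NN is uninformative under this assumption.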
2. For each decision boundary, explain which classifier between the logistic regression and
KNN is likely to have generated it. (5 points)
[Figure: two decision boundaries, Panel A and Panel B]
3. After training a logistic regression classifier, Anne and Bryan find a data point
that is correctly classified and far away from the decision boundary. Bryan thinks that
removing this point will not affect the decision boundary. Anne disagrees and thinks that
it may affect the decision boundary. Who is right? Explain. (5 points)
4. You have a set of three data points with only one predictor and an output: x1 = 4 and
y1 = 1, x2 = −2 and y2 = 0, x3 = 1 and y3 = 1. Suppose you use it as a training sample.
In a logistic regression, what will be the value of the parameter β associated with x? Give
a short explanation for your choice. (5 points)
• a) β = 1
• b) β = 0
• c) β = ∞
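Hint: the three training points are perfectly separable (both y = 1 observations have x > 0 and the y = 0 observation has x < 0). The sketch below is one way to explore what that implies. It assumes a model with an intercept and uses plain gradient ascent on the log-likelihood rather than any particular library's solver; `fit_logit` and the learning rate are illustrative choices, not part of the assignment.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logit(xs, ys, steps, lr=0.1):
    """Gradient ascent on the logistic log-likelihood (intercept b0, slope b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        residuals = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += lr * sum(residuals)
        b1 += lr * sum(r * x for r, x in zip(residuals, xs))
    return b0, b1


xs = [4.0, -2.0, 1.0]
ys = [1, 0, 1]

# Fitted slope after increasingly long optimisation runs.
slopes = {steps: fit_logit(xs, ys, steps)[1] for steps in (100, 1_000, 10_000)}
for steps, b1 in slopes.items():
    print(f"steps={steps:>6}  beta={b1:.3f}")
```

Comparing the slope across run lengths shows whether the optimizer settles on a finite value or keeps moving in one direction.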
2 Exercise 2 (10%)
You work as a data analyst at Twitter. Because of all the scandals and growing fake news on
social media, Twitter would like to improve its algorithm to detect fake news. As the junior
analyst, you are in charge of analyzing the past “fake news” data.
You are given the following information:
• 10 out of 10,000 tweets are classified as fake by the current Twitter algorithm.
• After some human checks, we realize that 20% of the tweets classified as fake by the
algorithm are not fake.
• 10% of the tweets classified as "non fake" by the algorithm are in fact fake.
1. Your boss asks you to build the confusion matrix for 1,000,000 tweets (roughly the number
of tweets per 5 minutes in 2021). (5 points)
2. You are giving a talk in front of your team. How would you summarize, in a meaningful
way, how well the current algorithm is performing? (Show your calculations and explain
your decision.) (5 points)
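The arithmetic behind both questions can be sketched as follows. The three shares come from the exercise. The function name `confusion_counts`, its parameter names, and the choice of accuracy, precision and recall as summary metrics are illustrative assumptions, with "fake" treated as the positive class.

```python
def confusion_counts(n_tweets, flagged_share, wrong_among_flagged, fake_among_unflagged):
    """Turn the three shares from the exercise into confusion-matrix counts."""
    flagged = round(n_tweets * flagged_share)      # predicted fake
    unflagged = n_tweets - flagged                 # predicted non-fake
    fp = round(flagged * wrong_among_flagged)      # flagged but not fake
    fn = round(unflagged * fake_among_unflagged)   # fake but not flagged
    return {"TP": flagged - fp, "FP": fp, "FN": fn, "TN": unflagged - fn}


n = 1_000_000
counts = confusion_counts(n, 10 / 10_000, 0.20, 0.10)

accuracy = (counts["TP"] + counts["TN"]) / n
precision = counts["TP"] / (counts["TP"] + counts["FP"])
recall = counts["TP"] / (counts["TP"] + counts["FN"])

print(counts)
print(f"accuracy={accuracy:.4f}  precision={precision:.2f}  recall={recall:.4f}")
```

Because fake tweets are rare among flagged ones but numerous among unflagged ones, the three metrics give very different pictures of the same matrix, which is worth keeping in mind when choosing one for the talk.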
3 Exercise 3: Application… (65%)
“Mobile money (m-money) refers to the use of mobile phones to perform financial and banking
functions.” (IFC Mobile Money Study 2011: Summary Report).
In low-income countries, mobile money is a substitute for banking access. In fact, individuals
do not need a bank account to perform financial transactions (send and receive money) via
their mobile service. One of its biggest advantages is that it can reach the most remote and
vulnerable populations. Many observers agree that this new financial tool has an important
role in widening financial inclusion in low-income countries (See Jack and Suri 2011 and Suri
2017 for a review).
Therefore, it is crucial to understand the characteristics of mobile money users and non-users.
In this application, we want to analyze the determinants of mobile money adoption using (real)
data from a survey of 2,282 households in Kenya on M-PESA (“M” for Mobile and “Pesa” for
Money in Swahili), one of the most successful mobile money applications.
1. First, you have access to the following table:
Personal ID | Large household size | Have a cell phone | Have a mattress at home | M-pesa user
1           | False                | True              | False                   | True
2           | False                | False             | False                   | False
3           | False                | True              | False                   | True
4           | True                 | False             | False                   | False
5           | True                 | False             | True                    | True
6           | True                 | False             | False                   | False
• Using the table and an entropy-based information gain, construct a decision tree (by
hand, i.e. carry out the calculations and find the relevant splits) that would predict the use
of M-pesa for an individual. (NB: the logarithm to use in the entropy measurement
is the logarithm to the base 2.) (15 points)
• What will be the prediction generated by the tree for: “Large Household” = False,
“Have a cell phone” = False and “Have a mattress at home” = True? (1.5 points)
• What will be the prediction generated by the tree for: “Large Household” = True,
“Have a cell phone” = True and “Have a mattress at home” = True? (1.5 points)
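To check the hand calculations, the entropy and information gain of each candidate split at the root can be computed as sketched below. The rows reproduce the table above. The dictionary keys and function names are hypothetical, and the sketch only evaluates the first split, not the full tree.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, feature, target):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    parent = entropy([r[target] for r in rows])
    weighted = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return parent - weighted


columns = ("large_household", "cell_phone", "mattress", "mpesa_user")
table = [
    (False, True, False, True),
    (False, False, False, False),
    (False, True, False, True),
    (True, False, False, False),
    (True, False, True, True),
    (True, False, False, False),
]
rows = [dict(zip(columns, values)) for values in table]

gains = {f: information_gain(rows, f, "mpesa_user") for f in columns[:3]}
for feature, gain in gains.items():
    print(f"{feature:<16} gain = {gain:.4f}")
```

The same two functions can be reapplied to the subset of rows in each child node to find the next split.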
2. Now, you have access to a more complete database. Read the file “mobile money.csv” in
Python. (2 points)
3. Present descriptive statistics on the outcome variable mpesa user. Comment. (3 points)
4. Present descriptive statistics on the following variables depending on the mpesa user
status. (10 points: 5 points for each label)
• Own Cell Phone
• Per Capita Consumption
• Per Capita Food Consumption
• Total Wealth
• Household Size
• Education of Head (Years)
• Positive Shock
• Negative Shock
• Weather/Agricultural shock
• Illness Shock
• Send Remittances
• Receive Remittances
• Bank account
• Mattress
• Savings & Credit Cooperative (SACCO)
• Merry Go Round/ ROSCA
• Farmer
• Public Service
• Professional Occupation
• Househelp
• Run a Business
• Sales
• In Industry
• Other Occupation
• Unemployed
5. Comment on the main takeaways from the descriptive statistics in question 4 (no more
than 3 lines). (2 points)
6. Construct the following classifiers using the outcome variable mpesa user (11 points):
ˆ Logistic Classifier
ˆ Decision Tree Classifier
ˆ Random Forest classifier
NB: Consider a train-test split of 80-20. Also consider standardizing the data beforehand.
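A minimal standard-library sketch of the split and standardization step follows, on synthetic data. One common convention, assumed here rather than specified by the assignment, is to estimate the means and standard deviations on the training split only and reuse them on the test split. All function names are hypothetical, and in practice the labels would be shuffled and split together with the features.

```python
import random
import statistics


def train_test_split(rows, test_share=0.2, seed=0):
    """Shuffle a copy of the rows and hold out `test_share` of them."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(train_rows):
    """Column means and (population) standard deviations of the training rows."""
    cols = list(zip(*train_rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant columns
    return means, stds


def transform(rows, means, stds):
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Synthetic features standing in for the survey variables.
rng = random.Random(1)
X = [[rng.gauss(50, 10), rng.gauss(3, 1)] for _ in range(100)]

X_train, X_test = train_test_split(X)
means, stds = fit_scaler(X_train)
X_train_std = transform(X_train, means, stds)
X_test_std = transform(X_test, means, stds)  # reuses training statistics

print(len(X_train_std), len(X_test_std))
```

Fitting the scaler on the training rows only keeps information from the test sample out of the model-building step.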
7. Using the accuracy rate and the area under the curve (AUC) as criteria, find the best
classifier among those in question 6. (6 points: 2-2-2)
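The two criteria measure different things: accuracy depends on a classification threshold, while AUC depends only on how the predicted scores rank positives against negatives. The sketch below illustrates both on a toy example. The pairwise definition of AUC used here is a standard equivalent of the area under the ROC curve. The function names, toy scores and 0.5 threshold are illustrative assumptions.

```python
def accuracy(y_true, y_score, threshold=0.5):
    """Share of observations whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in y_score]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


def auc(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (not from the survey data).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.6, 0.35, 0.8]

acc = accuracy(y_true, y_score)
area = auc(y_true, y_score)
print(f"accuracy={acc:.2f}  AUC={area:.2f}")
```

In this toy example the two criteria disagree, which is why the question asks for both.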
8. What are the top 3 predictors based on the best classifier found in question 7? (3 points)
9. Now consider a KNN classifier. Using a "for" loop, let K range from 1 to 10 in
steps of 1. In ML jargon, this is a grid search: it aims to tune (find) the value
of the hyperparameter "K". Using cross-validation on the training dataset to evaluate
each value of K, find the optimal number of neighbours K. (5 points)
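The structure of such a grid search can be sketched with the standard library alone, on synthetic two-class data. The 5-fold split, the Euclidean distance, the majority vote and all function names are illustrative assumptions. A library implementation would normally replace the hand-written KNN.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points closest to x."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(X, y, k, n_folds=5):
    """Mean held-out accuracy over n_folds interleaved folds."""
    scores = []
    for fold in range(n_folds):
        held = set(range(fold, len(X), n_folds))
        tr_X = [X[i] for i in range(len(X)) if i not in held]
        tr_y = [y[i] for i in range(len(X)) if i not in held]
        correct = sum(knn_predict(tr_X, tr_y, X[i], k) == y[i] for i in held)
        scores.append(correct / len(held))
    return sum(scores) / n_folds


# Synthetic training data: two Gaussian clouds, 40 points per class.
rng = random.Random(0)
X_train, y_train = [], []
for label, center in ((0, -1.0), (1, 1.0)):
    for _ in range(40):
        X_train.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y_train.append(label)

# Grid search over K = 1, ..., 10 using only the training data.
cv_scores = {}
for k in range(1, 11):
    cv_scores[k] = cv_accuracy(X_train, y_train, k)

best_k = max(cv_scores, key=cv_scores.get)
print(f"best K = {best_k}, CV accuracy = {cv_scores[best_k]:.3f}")
```

The test sample is never touched inside the loop, so it remains available for an honest comparison with the other classifiers afterwards.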
10. Does the optimal KNN classifier found in question 9 outperform the best classifier
found in question 7? (2 points)
11. Based on what you have found, what is the key recommendation that you can make to
a government that would like to foster the use of M-PESA among the population? (No
more than 3 lines.) (3 points)
References:
Jack, W., & Suri, T. (2011). Mobile money: The economics of M-PESA (No. w16721). National Bureau of Economic Research.
Jack, W., & Suri, T. (2014). Risk sharing and transactions costs: Evidence from Kenya’s
mobile money revolution. American Economic Review, 104(1), 183-223.
Suri, T. (2017). Mobile money. Annual Review of Economics, 9, 497-520.
Variables Labels