Description
You are asked to use a decision tree model to predict the usage of a car. The data are in claim_history.csv, which has 10,302 observations. The analysis specifications are as follows.
Target Variable
- CAR_USE. The usage of a car. This variable has two categories: Commercial and Private. The Commercial category is the Event value.
Nominal Predictor
- CAR_TYPE. The type of a car. This variable has six categories: Minivan, Panel Truck, Pickup, SUV, Sports Car, and Van.
- OCCUPATION. The occupation of the car owner. This variable has nine categories: Blue Collar, Clerical, Doctor, Home Maker, Lawyer, Manager, Professional, Student, and Unknown.
Ordinal Predictor
- EDUCATION. The education level of the car owner. This variable has five ordered categories: Below High School < High School < Bachelors < Masters < Doctors.
Analysis Specifications
- Partition. Specify the target variable as the stratum variable. Use stratified simple random sampling to put 70% of the records into the Training partition, and the remaining 30% of the records into the Test partition. The random state is 27513.
- Decision Tree. The maximum number of branches is two. The maximum depth is two. The split criterion is the Entropy metric.
You need to write a few Python programs to assist you in answering the questions.
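The partition step above can be sketched with the standard library alone. This is only an illustration of stratified simple random sampling: the function names `stratified_split` and `frequency_table` and the toy labels are hypothetical, and the shuffle used here will not reproduce the partition that a specific library (for example scikit-learn's `train_test_split` with its `stratify` and `random_state` arguments) produces for random state 27513.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, train_fraction=0.7, seed=27513):
    """Return (train_idx, test_idx), sampling separately within each stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for i, label in enumerate(labels):
        by_stratum[label].append(i)

    train_idx, test_idx = [], []
    for label in sorted(by_stratum):
        idx = by_stratum[label]
        rng.shuffle(idx)  # simple random sampling inside the stratum
        n_train = round(train_fraction * len(idx))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return sorted(train_idx), sorted(test_idx)


def frequency_table(labels):
    """Map each category to (count, proportion)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: (n, n / total) for k, n in sorted(counts.items())}


# Toy target values (NOT the real data): 40 Commercial, 60 Private.
labels = ["Commercial"] * 40 + ["Private"] * 60
train_idx, test_idx = stratified_split(labels)
train_table = frequency_table([labels[i] for i in train_idx])
test_table = frequency_table([labels[i] for i in test_idx])
print(train_table)
print(test_table)
```

Because sampling is done within each category of the target, both partitions keep approximately the same target proportions as the full data.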
Question 1 (20 points)
Please provide information about your Data Partition step.
- (5 points). Please provide the frequency table (i.e., counts and proportions) of the target variable in the Training partition.
- (5 points). Please provide the frequency table (i.e., counts and proportions) of the target variable in the Test partition.
- (5 points). What is the probability that an observation is in the Training partition given that CAR_USE = Commercial?
- (5 points). What is the probability that an observation is in the Test partition given that CAR_USE = Private?
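One way to obtain the two conditional probabilities is Bayes' theorem applied to the partition frequency tables. The sketch below is a minimal illustration; the function name `partition_given_class` and the counts are made up for the example and are not results from claim_history.csv.

```python
def partition_given_class(train_counts, test_counts, target_class):
    """Return P(partition | target_class) for both partitions via Bayes' theorem."""
    n_train = sum(train_counts.values())
    n_test = sum(test_counts.values())
    n_total = n_train + n_test

    # Prior probabilities of the partitions.
    p_train = n_train / n_total
    p_test = n_test / n_total

    # Likelihoods: P(target_class | partition).
    p_class_given_train = train_counts[target_class] / n_train
    p_class_given_test = test_counts[target_class] / n_test

    # Total probability of the class across both partitions.
    evidence = p_class_given_train * p_train + p_class_given_test * p_test

    return {
        "Training": p_class_given_train * p_train / evidence,
        "Test": p_class_given_test * p_test / evidence,
    }


# Hypothetical counts for illustration only.
train_counts = {"Commercial": 28, "Private": 42}
test_counts = {"Commercial": 12, "Private": 18}

posterior_commercial = partition_given_class(train_counts, test_counts, "Commercial")
posterior_private = partition_given_class(train_counts, test_counts, "Private")
print(posterior_commercial["Training"], posterior_private["Test"])
```

With stratified sampling, these probabilities should land close to the partition fractions themselves, which makes them a useful check on the partition step.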
Question 2 (40 points)
Please provide information about your decision tree.
- (5 points). What is the entropy value of the root node?
- (5 points). What is the split criterion (i.e., predictor name and values in the two branches) of the first layer?
- (10 points). What is the entropy of the split of the first layer?
- (5 points). How many leaves?
- (15 points). Describe all your leaves. Please include the decision rules and the counts of the target values.
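The entropy quantities asked for above can be computed as in the following sketch. It assumes base-2 logarithms, defines the split entropy as the count-weighted average of the two branch entropies, and searches binary splits exhaustively: every two-way grouping of categories for a nominal predictor, and only order-preserving cut points for an ordinal one. The function names and the toy rows are hypothetical.

```python
from itertools import combinations
from math import log2


def entropy(counts):
    """Entropy (base 2) of a node given its target-class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def split_entropy(left_counts, right_counts):
    """Count-weighted average entropy of the two branches."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * entropy(left_counts) + (n_right / n) * entropy(right_counts)


def candidate_splits(categories, ordinal=False):
    """Yield the left-branch category set of every distinct binary split."""
    cats = list(categories)
    if ordinal:
        # Ordered categories: k - 1 cut points.
        for k in range(1, len(cats)):
            yield frozenset(cats[:k])
    else:
        # Unordered categories: 2**(k - 1) - 1 groupings. The first category
        # is pinned to the left branch so mirror images are not repeated.
        first, rest = cats[0], cats[1:]
        for size in range(len(rest)):
            for combo in combinations(rest, size):
                yield frozenset((first,) + combo)


def best_split(values, labels, categories, event, ordinal=False):
    """Return (split entropy, left-branch categories) of the lowest-entropy split."""
    best = None
    for left in candidate_splits(categories, ordinal):
        left_counts, right_counts = [0, 0], [0, 0]
        for v, y in zip(values, labels):
            side = left_counts if v in left else right_counts
            side[0 if y == event else 1] += 1
        if sum(left_counts) == 0 or sum(right_counts) == 0:
            continue  # one branch is empty, so this is not a real split
        h = split_entropy(left_counts, right_counts)
        if best is None or h < best[0]:
            best = (h, left)
    return best


# Toy rows for illustration only (NOT the real data).
values = ["Van", "Van", "Van", "SUV", "SUV", "Pickup", "Pickup"]
labels = ["Commercial", "Commercial", "Commercial", "Private", "Private",
          "Commercial", "Private"]
root_entropy = entropy([4, 3])
best_h, best_left = best_split(values, labels, ["Pickup", "SUV", "Van"], "Commercial")
print(root_entropy, best_h, sorted(best_left))
```

Under these assumptions, a depth-two tree repeats the same search inside each first-layer branch across all three predictors, keeping the split with the lowest entropy at each node.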
Question 3 (40 points)
Please apply your decision tree to the Test partition and then provide the following information.
- (10 points). Using the proportion of the target Event value in the Training partition as the threshold, what is the Misclassification Rate in the Test partition?
- (10 points). What is the Root Average Squared Error in the Test partition?
- (10 points). What is the Area Under Curve in the Test partition?
- (10 points). Generate the Receiver Operating Characteristic curve for the Test partition. The axes must be properly labeled. Also, don’t forget the diagonal reference line.
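The test-partition metrics above can be sketched in plain Python as follows. Several conventions here are assumptions rather than part of the assignment text: an observation is classified as the Event when its predicted Event probability is greater than or equal to the threshold, the Root Average Squared Error is taken as the square root of the mean squared difference between the Event indicator and the predicted Event probability, and the Area Under Curve is computed from concordant and tied Event/non-Event pairs. The function names and toy predictions are hypothetical.

```python
from math import sqrt


def misclassification_rate(y_true, p_event, threshold, event, non_event):
    """Share of observations whose predicted class differs from the observed one."""
    predicted = [event if p >= threshold else non_event for p in p_event]
    wrong = sum(1 for y, y_hat in zip(y_true, predicted) if y != y_hat)
    return wrong / len(y_true)


def root_average_squared_error(y_true, p_event, event):
    """Square root of the mean squared gap between Event indicator and probability."""
    squared = [((1.0 if y == event else 0.0) - p) ** 2 for y, p in zip(y_true, p_event)]
    return sqrt(sum(squared) / len(squared))


def area_under_curve(y_true, p_event, event):
    """(concordant pairs + 0.5 * tied pairs) / all Event x non-Event pairs."""
    event_p = [p for y, p in zip(y_true, p_event) if y == event]
    non_event_p = [p for y, p in zip(y_true, p_event) if y != event]
    score = 0.0
    for pe in event_p:
        for pn in non_event_p:
            if pe > pn:
                score += 1.0
            elif pe == pn:
                score += 0.5
    return score / (len(event_p) * len(non_event_p))


def roc_points(y_true, p_event, event):
    """(1 - Specificity, Sensitivity) pairs, one per distinct threshold, from (0, 0)."""
    n_event = sum(1 for y in y_true if y == event)
    n_non_event = len(y_true) - n_event
    points = [(0.0, 0.0)]
    for t in sorted(set(p_event), reverse=True):
        tp = sum(1 for y, p in zip(y_true, p_event) if p >= t and y == event)
        fp = sum(1 for y, p in zip(y_true, p_event) if p >= t and y != event)
        points.append((fp / n_non_event, tp / n_event))
    return points


# Toy observed classes and predicted Event probabilities (NOT real results).
y_true = ["Commercial", "Commercial", "Private", "Private", "Commercial", "Private"]
p_event = [0.9, 0.6, 0.6, 0.2, 0.2, 0.2]

mce = misclassification_rate(y_true, p_event, 0.5, "Commercial", "Private")
rase = root_average_squared_error(y_true, p_event, "Commercial")
auc = area_under_curve(y_true, p_event, "Commercial")
curve = roc_points(y_true, p_event, "Commercial")
print(mce, rase, auc, curve)
```

The list returned by `roc_points` can be passed to any plotting library. For the chart, plot Sensitivity against 1 - Specificity, label both axes, and add a straight line from (0, 0) to (1, 1) as the diagonal reference.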

