Wednesday, February 29, 2012
Classification with Decision Trees
I started by describing splitting rules and went over the Entropy and Gini examples. Other details such as leaf purity, English rules, and pruning were discussed during the activity time.
Here are the slides and here is the class activity. To help students understand the calculations for Entropy and Gini trees, I recreated the book example in Excel. You may find it useful for your students too.
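For readers who prefer code to Excel, here is a minimal Python sketch of the same impurity calculations, assuming a binary target; the split counts below are made up for illustration, not taken from the book example.

```python
import math

def entropy(counts):
    """Shannon entropy of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gini(counts):
    """Gini impurity of a class distribution given as raw counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_impurity(split, measure):
    """Impurity of a candidate split: size-weighted average over child nodes."""
    total = sum(sum(node) for node in split)
    return sum(sum(node) / total * measure(node) for node in split)

# Hypothetical split: left leaf has 8 churners / 2 stayers, right leaf 1 / 9.
split = [(8, 2), (1, 9)]
print(weighted_impurity(split, entropy))  # ~0.596; lower means a purer split
print(weighted_impurity(split, gini))     # ~0.25
```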
It is very important to remind students to pay attention to the number of observations in the leaves. The number of observations in each leaf indicates the extent to which the corresponding English rules generalize. Generalizability and pruning should be adequately discussed in the DTree lecture.
Note: I have changed the location of my files. If you need a file and can't find it at the link below, write to me: elahe dot j dot marmarchi at gmail dot com
Classification with KNN
On Monday, Feb. 20 we talked about KNN. Shazam is, of course, a good example to use for teaching KNN. I had seen the Shazam case in the Linoff & Berry book, and students like the example because many of them have seen or heard about Shazam. I use the free version of Shazam, and when I ask my students why my free version does not work as well as their paid ones, they get excited to come up with ideas.
It is good to follow the traditional order of topics when teaching KNN:
1- distance function
2- choosing k
3- combining information from the k nearest neighbors
To do so, I stay on my slide 7 (as I say in my second slide, all the pictures are from the web) and explain it as a patient/drug-prescription case. The data consists of previous patients and the drug that was prescribed for each. Then I ask the class to decide which drug to prescribe for a new patient.
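Here is a minimal Python sketch that follows those three steps (Euclidean distance, a fixed k, majority vote) for the patient/drug case; the records below are hypothetical, chosen only to mirror the slide, not Larose's actual numbers.

```python
import math
from collections import Counter

def euclidean(a, b):
    # Step 1: the distance function
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, new_point, k=3):
    # Step 2: keep the k closest previous patients
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], new_point))[:k]
    # Step 3: combine the neighbors' labels with a majority vote
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical records: (age, sodium/potassium ratio) -> prescribed drug
patients = [((25, 10.1), "Drug A"), ((40, 14.5), "Drug B"),
            ((35, 9.0), "Drug A"), ((50, 16.2), "Drug B"),
            ((28, 11.3), "Drug A")]
print(knn_predict(patients, (30, 10.5), k=3))  # -> "Drug A"
```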
My ppt slides are available here. The numerical examples come from Daniel Larose's book. I also like the KNN example James Hamilton has on his webpage.
The SAS EM MBR node is not a very exciting one. It does not offer many customization options, and the output does not include the details of the model. Here's the activity we completed in class.
It is a good idea to discuss sensitivity and specificity after we have talked about two classification models. We are still working with the "clean" ~3K-record churn data set that comes with the book. We used the SAS EM Model Comparison node, set to select the better of LogReg and MBR based on the validation misclassification rate. It chooses MBR over LogReg. But when I asked the class to calculate sensitivity and specificity for both models, some of them said they would choose LogReg, and that is a reasonable choice because its sensitivity is 2 percentage points higher. We then discussed other applications in which sensitivity or specificity may be an important criterion for evaluating models, like HIV tests, marketing campaigns, and mammograms, plus an example suggested by a student: pregnancy tests!
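For reference, here is how the two measures fall out of a confusion matrix. The counts below are invented (they are not our actual churn results), but they are chosen to reproduce the pattern we saw: MBR wins on misclassification rate while LogReg wins on sensitivity.

```python
def sensitivity(tp, fn):
    """True positive rate: the share of actual churners the model catches."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: the share of actual non-churners correctly cleared."""
    return tn / (tn + fp)

# Hypothetical validation counts for the two models
models = {"LogReg": dict(tp=60, fn=40, tn=800, fp=100),
          "MBR":    dict(tp=58, fn=42, tn=830, fp=70)}
for name, m in models.items():
    miscl = (m["fn"] + m["fp"]) / sum(m.values())
    print(name,
          "sensitivity=%.2f" % sensitivity(m["tp"], m["fn"]),
          "specificity=%.2f" % specificity(m["tn"], m["fp"]),
          "misclassification=%.3f" % miscl)
```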
I found this on the web and thought it was funny: one "neighboring" worm:
Monday, February 27, 2012
Target is Mining Pregnancy!
One of my students had read the story on Forbes.com and shared it with me in last week's class meeting (Feb. 20). We planned to share it with the class but got too busy with the KNN algorithm and our class activities. The story went so viral that Colbert took it on and made it the topic of one of his "Tonight's Word" segments. Enjoy: Colbert on Target, the pregnant girl, and her furious father.
Tuesday, February 14, 2012
Prediction with logistic regression
Yesterday evening we had the 3rd meeting of our BI class. The tradition here at the University of Arkansas is to start with linear regression and logistic regression, then go on to decision trees and other algorithms. I skipped linear regression, explained logistic regression theory, and moved on to dissecting and interpreting the SAS EM results. The data set is the telecom Churn data set, which is available on the book website; it has over 3000 records and is very clean.
The ppt slides can be found here. I started with a simple example of linear regression: a Georgian's enjoyment of snow over time, which I found here. Interestingly, we had snow on Monday; there was some on the ground when we woke up, so the example was very relevant! My students liked it.
I then talked about data partitioning. I explained the reason for moving from:
target -> probability of target -> odds of target -> log of odds -> conducting linear regression of log of odds on the input variables.
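A quick numeric sketch of that chain (the probability here is made up) shows why the last step works: the log-odds range over all real numbers, so a linear equation can model them directly.

```python
import math

p = 0.8                    # made-up probability that the target event occurs
odds = p / (1 - p)         # odds of the event: 4.0
log_odds = math.log(odds)  # logit: unbounded, so b0 + b1*x1 + ... can fit it

# The inverse (logistic) function recovers the probability from the log-odds
recovered = 1 / (1 + math.exp(-log_odds))
print(odds, log_odds, recovered)  # 4.0  ~1.386  0.8
```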
It is not easy for an undergraduate student to digest these steps, but I wanted to expose them to the ideas. I recommend that all teachers spend some time explaining the assumptions and theory behind logistic regression.
On the SAS EM output, I focused on coefficient estimates, significance levels for each input and for the whole regression equation, the misclassification rate, false positives, false negatives, lift, and the lift chart. The class activity and the follow-up concept checks are available here.
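Since the lift chart tends to be the least self-explanatory of these outputs, here is a minimal sketch of the underlying calculation on made-up scores: rank the records by predicted probability, then compare the response rate near the top of the list to the overall rate.

```python
# Hypothetical (predicted churn probability, actual churn flag) pairs
scored = [(0.92, 1), (0.85, 1), (0.77, 0), (0.64, 1), (0.51, 0),
          (0.43, 0), (0.35, 1), (0.22, 0), (0.15, 0), (0.08, 0)]

scored.sort(key=lambda r: r[0], reverse=True)  # rank by predicted probability
overall_rate = sum(y for _, y in scored) / len(scored)

top = scored[:len(scored) // 5]                # top 20% of the ranked list
top_rate = sum(y for _, y in top) / len(top)

print("lift in top 20%%: %.1f" % (top_rate / overall_rate))  # 2.5
```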
I plan to talk about stepwise, forward, and backward variable selection methods in the next class, and then move on to KNN.
Monday, February 6, 2012
Touching the data!
Today was the second meeting of our BI class. Students had done a simple exercise in SAS EM with the Churn data set, which has 3351 records. It's the same telecom Churn data I had used for my professional Master's class, but it's clean.
The topic was data preprocessing and some basic EDA. We talked about correlated variables, Chi-square, Cramer's V, variable worth, missing values, normalization, outliers, mean, mode, median, skewness, and kurtosis. The PPT slides are here. A student reminded me that I needed to explain what interval, nominal, ordinal, and binary variables are; it was a timely reminder. Phone numbers, area codes, and zip codes are always good examples.
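Because Cramer's V is the least familiar of these to students, here is a minimal sketch of how it comes out of the Chi-square statistic; the 2x2 table of counts is invented (a hypothetical plan-vs-churn cross-tab), not taken from our data set.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramer's V: Chi-square-based association between two nominal variables."""
    chi2, _, _, _ = chi2_contingency(table)
    n = table.sum()
    r, c = table.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))

# Hypothetical counts: international plan (yes/no) vs. churn (yes/no)
table = np.array([[50, 100],
                  [120, 730]])
print(cramers_v(table))  # 0 = no association, 1 = perfect association
```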
The class activity for today is here, and so is the concept check, which is on the second page at the same link.