Information System Notes

Friday, August 3, 2012

What is Big Data?

Big Data is a tech buzzword, and each of us may have our own understanding of what it is or what it contains. It contains data, we know; perhaps very large quantities of data, in formats that are new to us and stored differently from what we have seen before?

I was in a faculty meeting discussing BI/analytics courses when I first felt the extent of the diversity of meanings that the phrase "Big Data" conveys, so I decided to write about it.

source:  http://www.bigdatabytes.com/managing-big-data-starts-here/ 
Big Data is mostly associated with Hadoop, an open-source Apache framework for scalable, reliable processing of and intelligence on Big Data. Big Data and Hadoop are well described by Tim Elliott on his blog.




How is it different? I believe understanding HBase is the key. HBase is a non-relational, distributed data storage system for Big Data in Hadoop. InfoSys educators need to develop course modules (for one or two class meetings) to illustrate how Big Data is structured, or better said, unstructured in HBase, contrast its features with those of an RDBMS, and discuss NoSQL.
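To make the contrast with an RDBMS concrete in class, it can help to emulate HBase's data model in a few lines. The sketch below is illustrative only (plain Python dicts, not HBase itself): rows are keyed by a row key, each row holds column families, each family holds qualifiers, and each cell keeps timestamped versions; unlike relational rows, two rows need not share any columns.

```python
# Illustrative sketch (not HBase itself): emulating HBase's data model
# with nested Python dicts. Rows need not share columns -- the store is
# sparse and schemaless, unlike an RDBMS table.
table = {}

def put(row_key, family, qualifier, value, timestamp):
    """Store one versioned cell, creating structure as needed."""
    cell = table.setdefault(row_key, {}).setdefault(family, {}).setdefault(qualifier, [])
    cell.append((timestamp, value))
    cell.sort(reverse=True)  # keep the newest version first

def get(row_key, family, qualifier):
    """Return the newest version of a cell, or None if absent."""
    versions = table.get(row_key, {}).get(family, {}).get(qualifier, [])
    return versions[0][1] if versions else None

# Two "rows" with completely different columns:
put("user1", "info", "name", "Ada", 1)
put("user1", "info", "name", "Ada L.", 2)   # a second, newer version
put("user2", "clicks", "page7", "3", 1)     # user2 has no "info" family at all
```

A short exercise for students: try to express the same sparse, versioned layout as a relational table and discuss what gets awkward.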

Parallel processing, synchronization, and other technical matters belong in master's-level courses in more technical programs, I would think. But they can be briefly discussed in our DB/BI classes too.

I am working on my course module because I think it is necessary for our students to enter the world of Big Data informed and prepared. Will share it here as soon as it's ready, stay tuned!






Sunday, May 6, 2012

Brainstorming vs. Debate & Dissent


"Imagine: How Creativity Works" is the title of a book by Jonah Lehrer, I heard his talk on the Colbert Report, also on BookTV

My dissertation research was on idea integration in electronic brainstorming groups; Jonah Lehrer's definition of creativity also emphasizes the combination and integration of ideas. His book and his New Yorker article from January 2012 have provoked many responses from those who believe Jonah Lehrer has inadequately addressed (or in some cases misused) the scientific literature on brainstorming. In particular, Lehrer's advocacy of "Debate & Dissent" as an alternative to brainstorming for achieving creativity is believed to lack scientific grounding.

For instance, Scott Berkun's rebuttal raises several points, including the need for a skillful group leader, motivated and smart individuals, and phasing the ideation process to allow a safe time when idea divergence is encouraged.

Here I provide some additions to this interesting discussion:
  1. It is true that having two separate phases of idea divergence and convergence helps (Shalley & Zhou 2007). Allowing divergence helps with creating ideas that are far apart from each other, and when those ideas are shared, they are likely to activate concepts in associative memory that are far apart from each other. When that happens, the likelihood of generating creative ideas increases (Santanen, Briggs, & de Vreede 2004). This, of course, assumes that individuals attend to each other's ideas, process those ideas, and then use them to create combinative or integrative ideas.
  2. Having a safe time when critique/debate is not allowed is good. During the safe time, evaluation apprehension is decreased, and individuals are more likely to generate divergent ideas, or ideas that are perceived as not very useful. But as Scott Berkun notes, not-very-useful ideas may stimulate the generation of very useful ideas. This is again based on the assumption that individuals attend to and process each other's ideas.
  3. The above two points both require attending to and processing the ideas of others. This is where different forms of intervention can help, and one of them is using facilitators. Facilitators can encourage the generation of more ideas by providing a social comparison mechanism (Santanen, Briggs, & de Vreede 2004). Facilitators can control the flow of ideas to make sure individuals are not overwhelmed, as we know cognitive overload hampers creativity. Facilitators can encourage debate and dissent when it can improve ideas. Facilitators can encourage idea integration during the safe time, because idea integration can contribute to the divergent idea generation process. If individuals attend to each other's ideas and improve them (in a controlled manner), that may change the directions of divergence and improve the outcomes of the idea generation process.
  4. I have to agree with some of the commenters on Scott Berkun's post that it takes practice to do it right. I happen to subscribe to the situationalists' view of creativity (Shneiderman 2003). The combination of some mediocre ideas may turn out to be a brilliant idea, and thus what we need is motivated and open-minded individuals who are willing to think, attend, and process. One can imagine that an environment like MIT's Building 20, if built in Kansas and managed well, might eventually become the locus of many great and innovative ideas. But I understand Scott Berkun's point that Jonah Lehrer's approach of using this one building as evidence is flawed.

As an Information Systems researcher, I look for ways to use computers as an instrument to  promote idea combination and integration. In my next post, I explain this in detail.




Incomplete List of References: 
  • Santanen, E.L., Briggs, R.O., & de Vreede, G.J. (2004). Causal relationships in creative problem solving: Comparing facilitation interventions for ideation. Journal of Management Information Systems, 20(4), 167-197.
  • Shalley, C. & Zhou, J. (Eds.) (2007). Handbook of Organizational Creativity.
  • Shneiderman, Ben (2003). Human Needs and The New Computational Technologies. The MIT Press. 


Tuesday, April 17, 2012

Take Home or In-Class?

This was a hot topic in my BI class for two weeks. I am not a big believer in exams; I keep exams in my courses because they should be there. In this class, there is one exam worth 10% of the final grade, and students wanted it to be take-home. Exams are by default in-class tests, so the take-home advocates had to persuade the others to agree on a take-home test. Everyone agreed, and here's the exam: on my Class e-mail GDoc.


The exam covers LogReg, DTree, KNN, Clustering, and MBR; it is designed to be completed in 5 hours or less and is based on SAS EM, but the questions can be answered using any other DM software as well. This is the first BI exam I have created; for my professional master's BI class last semester, I used Dr. Douglas's in-class exam.


As a first exam, I am happy with it, because it led to a few interesting discussions: comparing models based on Cumulative Lift, understanding the importance of Expected Confidence in MBR rules, examining MBR rules that contain everyday essentials (items with very high expected confidence) on the right-hand side, understanding MBR rules with no time information, examining and redoing clusters based on application needs, and articulating business recommendations based on findings.
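For the model-comparison question, it can help students to see cumulative lift computed by hand rather than only read off a SAS EM chart. The sketch below is my own minimal version (function and variable names are mine, not SAS EM's): rank records by predicted probability, then compare the response rate in the top slice to the overall response rate; a lift above 1 means the model beats random targeting at that depth.

```python
# A minimal sketch of cumulative lift: response rate in the top
# `depth_fraction` of score-ranked records, divided by the overall rate.
def cumulative_lift(scores, actuals, depth_fraction):
    """scores: predicted probabilities; actuals: 0/1 outcomes."""
    ranked = sorted(zip(scores, actuals), reverse=True)  # best scores first
    n_top = max(1, int(len(ranked) * depth_fraction))
    top_rate = sum(a for _, a in ranked[:n_top]) / n_top
    overall_rate = sum(actuals) / len(actuals)
    return top_rate / overall_rate

# Toy example: 3 responders out of 8 records.
scores  = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
actuals = [1,   1,   0,   1,   0,   0,   0,   0]
# Top 25% (2 records) are both responders: lift = 1.0 / (3/8), about 2.67.
print(cumulative_lift(scores, actuals, 0.25))
```

At depth 1.0 the lift is always exactly 1, which is a nice sanity check to assign as a one-minute exercise.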


Please take a look and share your opinions with me.

Wednesday, February 29, 2012

Classification with Decision Trees

Feb. 27 was Tree Day in our class. Decision Trees are one of the more exciting DM algorithms.



I started by describing splitting rules and went over the Entropy and Gini examples. Other details, such as leaf purity, English rules, and pruning, were discussed during the activity time.


Here are the slides and here is the class activity. To help students understand the calculations for Entropy and Gini trees, I recreated the book example in Excel. You may find it useful for your students too.

It is very important to remind students to pay attention to the number of observations in the leaves. The number of observations in each leaf indicates the extent to which the corresponding English rules generalize. Generalizability and pruning should be adequately discussed in the DTree lecture.








Note: I have changed the location of my files. If you need a file and couldn't find it at its link, write to me: elahe dot j dot marmarchi at gmail dot com

Classification with KNN

On Monday, Feb. 20, we talked about KNN. Of course, Shazam is a good example to use for teaching KNN. I had seen the Shazam case in the Linoff & Berry book, and students like the example because many of them have seen or heard about Shazam. I use the free version of Shazam, and when I ask my students why my free version does not work as well as their paid ones, they get excited to come up with ideas.

It is good to follow the traditional order of topics when teaching KNN:
1- distance function
2- choosing k
3- combining information from the k nearest neighbors

To do so, I stay on my slide 7 (as I say in my second slide, all the pictures are from the web) and explain it as a patient/drug-prescription case. The data is on previous patients and the drug that was prescribed for each. Then I ask the class to decide which drug to prescribe for the new patient.
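The three steps above can be sketched in a few lines of code. This is a toy version of the patient/drug case with made-up data and features (age and blood pressure are my own hypothetical inputs, not the ones on my slide): Euclidean distance, pick the k nearest past patients, then combine their labels by majority vote.

```python
# Toy KNN: distance function, choose k, combine neighbor labels by vote.
import math
from collections import Counter

def knn_predict(train, new_point, k):
    """train: list of ((features), label); returns majority label of k nearest."""
    ranked = sorted(train, key=lambda rec: math.dist(rec[0], new_point))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical previous patients: (age, blood pressure) -> drug prescribed
patients = [((25, 110), "A"), ((30, 115), "A"),
            ((55, 150), "B"), ((60, 155), "B"), ((58, 148), "B")]
print(knn_predict(patients, (57, 152), k=3))  # the new patient's neighbors all got "B"
```

Changing k and the new patient's values in front of the class is a quick way to show how the vote, and therefore the prescription, flips.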

My ppt slides are available here. The numerical examples come from Daniel Larose's book. I also like the KNN example James Hamilton has on his webpage.

The SAS EM MBR node is not a very exciting one. It does not have many customization options, and the output does not include the details of the model. Here's the activity we completed in class.

It is a good idea to discuss sensitivity and specificity after we have talked about two classification models. We are still working with the "clean" ~3K-record churn data set that comes with the book. We used SAS EM Model Comparison, set it to select the better model between LogReg and MBR based on validation misclassification rate, and it chose MBR over LogReg. But when I asked the class to calculate sensitivity and specificity for both models, some of them said they would choose LogReg. And that is a reasonable choice, because it has 2 percentage points higher sensitivity. We then discussed other applications in which sensitivity or specificity may be an important criterion for evaluating models, such as HIV tests, marketing campaigns, mammograms... and an example by a student: pregnancy tests!
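The calculation the class did is simple enough to sketch. The confusion-matrix counts below are hypothetical (chosen only to mirror the "2 percentage points higher sensitivity" situation, not taken from our actual SAS EM runs), but they show how two models with similar misclassification rates can still differ on the error type that matters.

```python
# Sensitivity and specificity from confusion-matrix counts.
def sensitivity(tp, fn):
    """True positive rate: of the actual positives, how many we caught."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: of the actual negatives, how many we cleared."""
    return tn / (tn + fp)

# Hypothetical counts for two churn models on the same validation set:
print(sensitivity(40, 60), specificity(850, 50))  # model 1: sens 0.40, spec ~0.944
print(sensitivity(42, 58), specificity(845, 55))  # model 2: sens 0.42, spec ~0.939
```

With counts like these, the overall misclassification rates are nearly identical, so the choice between the models comes down to whether missed churners (sensitivity) or false alarms (specificity) cost more, which is exactly the HIV-test vs. marketing-campaign discussion.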

I found this on the web and thought it was funny: one "neighboring" worm:


Monday, February 27, 2012

Target is Mining Pregnancy!

My student had read the story on Forbes.com, and she shared it with me in last week's class meeting (Feb. 20). We planned to share it with the class but got too busy with the KNN algorithm and our class activities. The story went viral, so much so that Colbert took it on and made it the topic of one of his "Tonight's Word" segments; enjoy: Colbert on Target, the pregnant girl, and her furious father.

Tuesday, February 14, 2012

Prediction with logistic regression

Yesterday evening we had the 3rd meeting of our class. The tradition here at the University of Arkansas is to start with linear regression and logistic regression, then go on to decision trees and other algorithms. I skipped linear regression, explained logistic regression theory, and moved on to dissecting and interpreting SAS EM results. The data set is the telecom Churn data set, which is available on the book website, has over 3,000 records, and is very clean.


The ppt slides can be found here. I started with a simple example of linear regression: a Georgian's enjoyment of snow over time, which I found here. Interestingly, we had snow on Monday; there was some snow on the ground when we woke up, so this example was very relevant! My students liked it.
I then talked about data partitioning. I explained the reason for moving from:


target -> probability of target -> odds of target -> log of odds -> conducting linear regression of log of odds on the input variables.


It is not easy for an undergraduate student to digest these steps, but I wanted to expose them to the ideas. I recommend that all teachers spend some time explaining the assumptions and theory behind logistic regression.
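The chain above can be made concrete with one worked number. The sketch below walks a single probability through the transformation (0.8 is just an illustrative value): probability to odds to log-odds, the quantity logistic regression models linearly, and then back through the logistic function.

```python
# The logistic regression transformation chain, one value at a time:
# probability -> odds -> log-odds, and the inverse mapping back.
import math

def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)

def logit(p):
    """Log-odds: the quantity that is linear in the inputs."""
    return math.log(odds(p))

def logistic(z):
    """Inverse of logit: maps any real log-odds back into (0, 1)."""
    return 1 / (1 + math.exp(-z))

p = 0.8
print(odds(p))             # ~4.0 (4-to-1 odds)
print(logit(p))            # ~1.386
print(logistic(logit(p)))  # ~0.8 (the round trip recovers the probability)
```

A useful point for students: the logit can take any real value, which is exactly why it, rather than the bounded probability, can be regressed linearly on the inputs.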


In the SAS EM output, I focused on coefficient estimates, significance levels for each input and for the whole regression equation, misclassification rate, false positives, false negatives, lift, and the lift chart. The class activity and the follow-up concept checks are available here.
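When going over coefficient estimates, a one-line calculation helps students read them: exponentiating a logistic regression coefficient gives an odds ratio, the multiplicative change in the odds of the target per one-unit increase in that input. The coefficient below is a hypothetical value for illustration, not a number from our SAS EM output.

```python
# Interpreting a logistic regression coefficient as an odds ratio.
import math

coef_day_minutes = 0.013   # hypothetical estimate for daytime minutes
odds_ratio = math.exp(coef_day_minutes)
# Each extra daytime minute multiplies the odds of churn by ~1.013,
# i.e. roughly a 1.3% increase in the odds per minute.
print(odds_ratio)
```

This also makes the sign of a coefficient intuitive: a negative coefficient gives an odds ratio below 1, meaning the input reduces the odds of the target.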


I plan to talk about stepwise, forward, and backward variable selection methods in the next class, and then move on to KNN.
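For that class, a skeleton of forward selection may be handy as a board example. This is a generic sketch, not a SAS EM procedure: `score` is a stand-in for any model-quality measure (e.g. validation accuracy), and the variable names and toy scoring function are hypothetical. Start with no inputs, repeatedly add the single variable that most improves the score, and stop when no addition helps.

```python
# A generic forward-selection skeleton. `score(subset) -> float`,
# higher is better; the function returns the selected variable subset.
def forward_selection(variables, score):
    selected = []
    best = score(selected)
    while len(selected) < len(variables):
        candidates = [v for v in variables if v not in selected]
        best_var = max(candidates, key=lambda v: score(selected + [v]))
        if score(selected + [best_var]) <= best:
            break          # no remaining candidate improves the score
        selected.append(best_var)
        best = score(selected)
    return selected

# Toy score: rewards two "useful" variables, slightly penalizes model size.
useful = {"day_minutes", "intl_calls"}
toy_score = lambda s: sum(1 for v in s if v in useful) - 0.1 * len(s)
print(forward_selection(["day_minutes", "night_minutes", "intl_calls"], toy_score))
```

Backward elimination is the mirror image (start with all variables, drop the least useful one at a time), and stepwise alternates the two, which makes this skeleton a natural starting point for all three.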