Events

May 29: Dr. Wei Fan from IBM T.J. Watson Research: Let the Data Speak for Themselves - Simplistic Approaches to Difficult Data Mining Problems

2008-05-29

【Topic】Let the Data Speak for Themselves - Simplistic Approaches to Difficult Data Mining Problems

【Speaker】Wei Fan

【Time】2008-5-29 10:30-12:00

【Venue】Room 241, Weilun Building

【Language】English/Chinese

【Organizer】Department of Management Science and Engineering

【Target Audience】

【Background Information】

Abstract:

Inductive learning aims to construct an accurate model from labeled training examples, one that matches the true model generating the data. A major difficulty is that, for many practical problems, the true model is never known, so any assumption about its exact form could be wrong. The validity of such an assumption is difficult to verify, since labeled examples are non-exhaustive for most applications and little may be known about the exact generating mechanism. Mainstream machine learning research has focused on rather sophisticated, carefully designed approaches to approximating the true model in classification, regression, and probability estimation problems. Well-known algorithms in this family include Boosting, Bagging, SVM, mixture models, logistic regression, and GUIDE, among many others.

In this talk, we will discuss a family of Randomized Decision Tree (RDT) algorithms that can be used efficiently and accurately for classification, regression, and probability estimation problems. The training procedure of RDT incorporates some surprisingly simple and unconventional random factors that "encode" the training data into multiple decision trees. Despite this simplicity, its accuracy on all three problem types is higher, in some cases significantly higher, than that of many well-known sophisticated approaches on many difficult problems.
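As a rough illustration of the idea of "encoding" training data into several randomly built trees, here is a minimal Python sketch. It is not the speaker's released software, and the abstract does not specify the algorithm's details. The sketch assumes that each tree's structure is built by picking a feature and a threshold at random without consulting the labels, that the training data only fills in class counts at the leaves, and that predictions average the leaf class frequencies over all trees. It also assumes numeric features scaled to [0, 1]. All function names (`build_tree`, `find_leaf`, `fit`, `predict_proba`) are hypothetical.

```python
import random
from collections import Counter


def build_tree(n_features, depth, rng):
    """Build a tree structure purely at random (labels are never consulted)."""
    if depth == 0:
        return {"counts": Counter()}  # leaf: class counts, filled in later
    return {
        "feature": rng.randrange(n_features),  # assumption: random feature
        "threshold": rng.uniform(0.0, 1.0),    # assumption: features in [0, 1]
        "left": build_tree(n_features, depth - 1, rng),
        "right": build_tree(n_features, depth - 1, rng),
    }


def find_leaf(node, x):
    """Route example x down the tree and return the leaf it reaches."""
    while "counts" not in node:
        go_left = x[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node


def fit(X, y, n_trees=10, depth=3, seed=0):
    """Build random trees, then record the training labels at their leaves."""
    rng = random.Random(seed)
    trees = [build_tree(len(X[0]), depth, rng) for _ in range(n_trees)]
    for tree in trees:
        for x, label in zip(X, y):
            find_leaf(tree, x)["counts"][label] += 1
    return trees


def predict_proba(trees, x, classes):
    """Average the leaf class frequencies over all trees with evidence for x."""
    totals = {c: 0.0 for c in classes}
    used = 0
    for tree in trees:
        counts = find_leaf(tree, x)["counts"]
        n = sum(counts.values())
        if n == 0:
            continue  # empty leaf: this tree saw no training data here
        used += 1
        for c in classes:
            totals[c] += counts[c] / n
    if used == 0:
        # No tree has evidence for x: fall back to a uniform estimate.
        return {c: 1.0 / len(classes) for c in classes}
    return {c: totals[c] / used for c in classes}
```

With training data `X` (a list of feature lists) and labels `y`, calling `trees = fit(X, y)` followed by `predict_proba(trees, x, classes)` returns a dictionary of estimated class probabilities for `x`. The same averaged leaf statistics could in principle serve regression by storing target means instead of class counts, though that variant is not shown here.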

In summary, this talk offers the following insights:

1. Introduction of Randomized Decision Trees and their application in classification, regression, and probability estimation.

2. Selected applications of RDT

a. fraud detection

b. default and late payment prediction

c. advertisement cost and benefit optimization

d. system module performance estimation

e. ground ozone level estimation

f. chip failure rate

g. query optimization

3. A fresh and unconventional look at accurate machine learning and data mining without making strong assumptions.

Software and some datasets are available for download at:

http://www.weifan.info/software.htm or

http://www.cs.columbia.edu/~wfan/software.htm

Short Biography

Dr. Wei Fan received his PhD in Computer Science from Columbia University in 2001 and has worked at IBM T.J. Watson Research since then. He has published more than 50 papers in top data mining, machine learning, and database conferences, such as KDD, SDM, ICDM, ECML/PKDD, SIGMOD, VLDB, ICDE, AAAI, etc. Dr. Fan has served as Area Chair or Senior PC member of SIGKDD'06, SDM'08, and ICDM'08, as well as PC member of several prestigious conferences in the area, including KDD'08/07/05, ICDM'07/06/05/04/03, SDM'07/06/05/04, CIKM'07/06, ECML/PKDD'07/06, ICDE'04, AAAI'07, PAKDD'08/07, EDBT'04, WWW'08/07, etc. Dr. Fan was invited to speak at ICMLA'06. He served as a US NSF panelist in 2007/08. His main research interests and experience are in various areas of data mining and database systems, such as risk analysis, high performance computing, extremely skewed distributions, cost-sensitive learning, data streams, ensemble methods, graph mining, feature discovery, and commercial data mining systems. He is particularly interested in simple, unconventional, but effective methods for solving difficult problems. His thesis work on intrusion detection has been licensed by a start-up company since 2001. His co-authored paper in ICDM'06, which uses "Randomized Decision Tree" to predict skewed ozone days, won the best application paper award. His co-authored paper in KDD'97 on the distributed learning system "JAM" won the runner-up best research paper award.