5-29: Dr. Wei Fan, IBM Watson Research Center: Let the Data Speak for Themselves - Simplistic Approaches to Difficult Data Mining Problems - Tsinghua University School of Economics and Management

2008-05-29

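【Speaker】Wei Fan, PhD in Computer Science from Columbia University, currently with IBM T.J. Watson Research

【Topic】Let the Data Speak for Themselves - Simplistic Approaches to Difficult Data Mining Problems

【Time】2008-5-29 (Thursday) 10:30-12:00

【Venue】Room 241, Weilun Building

【Language】Chinese/English

【Organizer】Department of Management Science and Engineering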

【Background Information】

Abstract:

Inductive learning aims to construct an accurate model from labeled

training examples to match the true model that generates the data. One

major difficulty is that the actual true model is never known for many

practical problems. Any assumption about the exact form of the true

model could be wrong. The validity of any assumption is difficult to

verify since labeled examples are non-exhaustive for most applications,

and there may be little known truth about the exact generating

mechanism. Mainstream machine learning research has focused on

rather sophisticated and well-thought-out approaches to

approximating the true models in classification, regression and

probability estimation problems. Examples of well-known algorithms

belonging to this family include Boosting, Bagging, SVM, Mixture models,

Logistic Regression, and GUIDE, among many others.

In this talk, we will discuss a family of Randomized Decision Tree (RDT)

algorithms that can be used efficiently and accurately for

classification, regression and probability estimation problems. The

training procedure of RDT incorporates some surprisingly simple and

unconventional random factors that "encode" the training data into

multiple decision trees. Nevertheless, its accuracy on all three major

problems is higher, in some cases significantly higher, than that of many

well-known sophisticated approaches on many difficult problems.
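The abstract does not say what the random factors are. As a rough illustration only, the sketch below assumes one common reading of randomized trees: each tree picks its split feature and threshold at random instead of optimizing a purity measure, the training labels are stored as counts in the leaves, and the class probabilities of all trees are averaged at prediction time. The function names (`build_tree`, `fit`, `predict_proba`), the depth limit, the tree count, and the choice to draw thresholds from the value range seen at each node are all hypothetical simplifications for this example. They are not taken from the talk or from the software linked below.

```python
import random
from collections import Counter


def build_tree(rows, labels, rng, depth=0, max_depth=6):
    """Grow one tree whose split feature and threshold are chosen at random."""
    counts = Counter(labels)
    # Stop at the depth limit or when the node holds a single class.
    if depth >= max_depth or len(counts) == 1:
        return {"leaf": counts}
    feature = rng.randrange(len(rows[0]))      # random feature, no purity test
    values = [row[feature] for row in rows]
    lo, hi = min(values), max(values)
    if lo == hi:                               # nothing to split on
        return {"leaf": counts}
    threshold = rng.uniform(lo, hi)            # random threshold (assumption)
    left = [i for i, v in enumerate(values) if v < threshold]
    right = [i for i, v in enumerate(values) if v >= threshold]
    if not left or not right:
        return {"leaf": counts}
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left],
                           rng, depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right],
                            rng, depth + 1, max_depth),
    }


def fit(rows, labels, n_trees=30, seed=0):
    """Build several independent random trees from the same training data."""
    rng = random.Random(seed)
    return [build_tree(rows, labels, rng) for _ in range(n_trees)]


def leaf_probs(tree, x):
    """Route x to a leaf and turn the stored label counts into frequencies."""
    while "leaf" not in tree:
        branch = "left" if x[tree["feature"]] < tree["threshold"] else "right"
        tree = tree[branch]
    total = sum(tree["leaf"].values())
    return {label: n / total for label, n in tree["leaf"].items()}


def predict_proba(trees, x):
    """Average the per-tree class frequencies into one probability estimate."""
    avg = Counter()
    for tree in trees:
        for label, p in leaf_probs(tree, x).items():
            avg[label] += p / len(trees)
    return dict(avg)


if __name__ == "__main__":
    # Toy data invented for this example: two well-separated groups.
    toy_rows = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [1.0, 1.0], [0.9, 0.8], [0.8, 0.9]]
    toy_labels = ["neg", "neg", "neg", "pos", "pos", "pos"]
    forest = fit(toy_rows, toy_labels)
    print(predict_proba(forest, [0.5, 0.4]))
```

Averaging leaf frequencies yields a probability estimate directly, and taking the most probable class gives a classifier. Under the same assumptions, a regression variant would store target values in the leaves and average those instead.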

In summary, this talk offers the following insights:

1. An introduction to Randomized Decision Trees and their application in

classification, regression, and probability estimation.

2. Selected applications of RDT

a. fraud detection

b. default and late payment prediction

c. advertisement cost and benefit optimization

d. system module performance estimation

e. ground ozone level estimation

f. chip failure rate

g. query optimization

3. A fresh and unconventional look at accurate machine learning and

data mining without making strong assumptions.

Software and some datasets are available for download at:

http://www.cs.columbia.edu/~wfan/software.htm

Short Biography

Dr. Wei Fan received his PhD in Computer Science from Columbia

University in 2001 and has worked at IBM T.J. Watson Research since

then. He has published more than 50 papers in top data mining, machine

learning and database conferences, such as KDD, SDM, ICDM, ECML/PKDD,

SIGMOD, VLDB, ICDE, AAAI, etc. Dr. Fan has served as Area Chair or Senior

PC member of SIGKDD'06, SDM'08 and ICDM'08, as well as PC member of several

prestigious conferences in the area including KDD'08/07/05,

ICDM'07/06/05/04/03, SDM'07/06/05/04, CIKM'07/06, ECML/PKDD'07/06,

ICDE'04, AAAI'07, PAKDD'08/07, EDBT'04, WWW'08/07, etc. Dr. Fan was

invited to speak at ICMLA'06. He served as US NSF panelist in 2007/08.

His main research interests and experience are in various areas of data

mining and database systems, such as risk analysis, high-performance

computing, extremely skewed distributions, cost-sensitive learning, data

streams, ensemble methods, graph mining, feature discovery, and

commercial data mining systems. He is particularly interested in simple,

unconventional, but effective methods to solve difficult problems. His

thesis work on intrusion detection has been licensed by a start-up

company since 2001. His co-authored paper in ICDM'06 that uses

"Randomized Decision Tree" to predict skewed ozone days won the best

application paper award. His co-authored paper in KDD'97 on the distributed

learning system "JAM" won the runner-up best research paper award.