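【Speaker】Wei Fan, PhD in Computer Science (Columbia University), currently at IBM T.J. Watson Research
【Topic】Let the Data Speak for Themselves - Simplistic Approaches to Difficult Data Mining Problems
【Time】2008-5-29 (Thursday) 10:30-12:00
【Venue】Room 241, Weilun Building
【Language】Chinese/English
【Host】Department of Management Science and Engineering
【Background Information】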
Abstract:
Inductive learning aims to construct an accurate model from labeled
training examples to match the true model that generates the data. One
major difficulty is that the actual true model is never known for many
practical problems. Any assumption about the exact form of the true
model could be wrong. The validity of any assumption is difficult to
verify since labeled examples are non-exhaustive for most applications,
and there may be little known truth about the exact generating
mechanism. Mainstream machine learning research has focused on
rather sophisticated and carefully designed approaches to
approximating the true models in classification, regression and
probability estimation problems. Examples of well-known algorithms
belonging to this family include Boosting, Bagging, SVM, Mixture models,
Logistic Regression, and GUIDE, among many others.
In this talk, we will discuss a family of Randomized Decision Tree
algorithms or RDT that can be used efficiently and accurately for
classification, regression and probability estimation problems. The
training procedure of RDT incorporates some surprisingly simple and
unconventional random factors that "encode" the training data into
multiple decision trees. Despite this simplicity, its accuracy on all
three problem types is higher, sometimes significantly, than that of
many well-known sophisticated approaches on many difficult problems.
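The abstract does not spell out the random factors, so the following is only a minimal Python sketch of one way such a scheme can work, not the speaker's released software. It assumes numeric features, picks each split feature and threshold at random without consulting the labels, uses the labels only to fill class counts at the leaves, and averages the per-tree class frequencies into a probability estimate. The names build_tree, tree_proba and forest_proba, the depth limit, and the toy data are all hypothetical.

```python
import random
from collections import Counter

def build_tree(rows, labels, depth, rng, min_size=2):
    """Grow one tree whose structure ignores the labels entirely."""
    if depth == 0 or len(rows) < min_size:
        return {"leaf": Counter(labels)}
    f = rng.randrange(len(rows[0]))          # random feature (assumption)
    lo = min(r[f] for r in rows)
    hi = max(r[f] for r in rows)
    t = rng.uniform(lo, hi)                  # random threshold (assumption)
    left = [i for i, r in enumerate(rows) if r[f] < t]
    right = [i for i, r in enumerate(rows) if r[f] >= t]
    if not left or not right:                # degenerate split: stop here
        return {"leaf": Counter(labels)}
    return {
        "f": f,
        "t": t,
        "left": build_tree([rows[i] for i in left],
                           [labels[i] for i in left], depth - 1, rng, min_size),
        "right": build_tree([rows[i] for i in right],
                            [labels[i] for i in right], depth - 1, rng, min_size),
    }

def tree_proba(node, x):
    """Class frequencies of the leaf that x falls into."""
    while "leaf" not in node:
        node = node["left"] if x[node["f"]] < node["t"] else node["right"]
    counts = node["leaf"]
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def forest_proba(trees, x, classes):
    """Average the per-tree class frequencies over all trees."""
    avg = {c: 0.0 for c in classes}
    for tree in trees:
        for c, p in tree_proba(tree, x).items():
            avg[c] += p / len(trees)
    return avg

# Toy data (made up for illustration): two well-separated clusters.
rng = random.Random(0)
rows = ([[0.1 * i, 0.1 * (9 - i)] for i in range(10)]
        + [[10 + 0.1 * i, 10 + 0.1 * (9 - i)] for i in range(10)])
labels = [0] * 10 + [1] * 10
trees = [build_tree(rows, labels, depth=6, rng=rng) for _ in range(25)]
print(forest_proba(trees, [0.45, 0.45], classes=[0, 1]))
```

Because no split is chosen by a purity or gain criterion, training amounts to one pass of random partitioning plus counting, which is where the efficiency claimed above would come from under these assumptions.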
In summary, this talk offers the following insights:
1. An introduction to Randomized Decision Trees and their application in
classification, regression, and probability estimation.
2. Selected applications of RDT
a. fraud detection
b. default and late payment prediction
c. advertisement cost and benefit optimization
d. system module performance estimation
e. ground ozone level estimation
f. chip failure rate
g. query optimization
3. A fresh and unconventional look at accurate machine learning and
data mining without making strong assumptions.
Software and some datasets are available for download at:
http://www.weifan.info/software.htm or
http://www.cs.columbia.edu/~wfan/software.htm
Short Biography
Dr. Wei Fan received his PhD in Computer Science from Columbia
University in 2001 and has been working at IBM T.J. Watson Research since
then. He has published more than 50 papers in top data mining, machine
learning and database conferences, such as KDD, SDM, ICDM, ECML/PKDD,
SIGMOD, VLDB, ICDE, AAAI, etc. Dr. Fan has served as Area Chair or Senior
PC member of SIGKDD'06, SDM'08 and ICDM'08, as well as PC member of several
prestigious conferences in the area including KDD'08/07/05,
ICDM'07/06/05/04/03, SDM'07/06/05/04, CIKM'07/06, ECML/PKDD'07/06,
ICDE'04, AAAI'07, PAKDD'08/07, EDBT'04, WWW'08/07, etc. Dr. Fan was
invited to speak at ICMLA'06. He served as a US NSF panelist in 2007/08.
His main research interests and experience are in various areas of data
mining and database systems, such as risk analysis, high performance
computing, extremely skewed distributions, cost-sensitive learning, data
streams, ensemble methods, graph mining, feature discovery, and
commercial data mining systems. He is particularly interested in simple,
unconventional, but effective methods to solve difficult problems. His
thesis work on intrusion detection has been licensed by a start-up
company since 2001. A paper he co-authored at ICDM'06 that uses
"Randomized Decision Tree" to predict skewed ozone days won the best
application paper award. A paper he co-authored at KDD'97 on the
distributed learning system "JAM" won the runner-up best research paper award.