KDD 2010 Banner
Tutorials
KDD-2010 Tutorials provide introductions to new areas of particular interest to the data mining community, and are hosted by the leading practitioners and teachers in the field. This is a great way to bring your skills up-to-date on vibrant new areas of data mining and get the most out of the papers in the KDD-2010 main conference program.

Tutorial TitleSchedule
T1Large-scale Data Mining: MapReduce and BeyondSun-Morn
T2New Developments in the Theory of ClusteringSun-Morn
T3Temporal Pattern MiningSun-Morn
T4Learning through ExplorationSun-Morn
T5Geometric Tools for Graph Mining of Large Social and Information NetworksSun-Morn
T8Mining Web Search and Browse LogsSun-Morn
T6Privacy-aware Data Mining in Information NetworksSun-AftN
T7Introduction to Graphical Models for Data MiningSun-AftN
T9Mining Heterogeneous Information NetworksSun-AftN
T10Outlier Detection TechniquesSun-AftN
T11Recommender Problems for Web ApplicationsSun-AftN
T12Indexing and Mining Time SequencesSun-AftN

Detailed Descriptions
T1. Large-scale Data Mining: MapReduce and Beyond
Jimeng Sun (IBM Research), Spiros Papadimitriou (IBM Research) and Rong Yan (Facebook)

Abstract

Data are becoming available in unprecedented volumes. This difference in scale is difference in kind, presenting new opportunities. Map-reduce has drawn a lot of attention recent years for large-scale data processing and mining. In this tutorial, we introduce Map-reduce and its application and research in data mining. In particular, we want to answer the following questions:

  • What is Map-reduce and why do we need it for data mining?
  • What mining applications need Map-reduce?
  • What are the advantages and limitations using Map-Reduce?
  • How do you use Map-reduce?
  • What are other tools out there for large-scale data processing and mining?

More specifically, this tutorial is organized into three parts:

  1. MapReduce basic includes MapReduce programming model, system architecture, its OpenSource implementation Hadoop and its extensions such as HBase, Pig, Cascading, Hive.
  2. MapReduce algorithms cover MapReduce implementation of standard data mining algorithms such as clustering (K-means), classification (k-NN, naive Bayes), graph mining (page rank).
  3. MapReduce applications present the general applications of MapReduce that are beyond data mining, which include text processing, data warehousing.

Bios

Jimeng Sun is a research staff member at IBM TJ Watson lab. He received a Bachelor and MPhil in Computer Science from Hong Kong University of Science and Technology in 2002 and 2003. After that, he obtained a MS and PhD degree in Computer Science from Carnegie Mellon University in 2006 and 2007. His research interests include data mining for streams and networks, databases and health care analytics. He has received the best research paper award in ICDM 2008, the best research paper award in SDM 2007. He has published over 40 refereed articles and two book chapter. He filed four patents and has given five tutorials. He has served as a program committee member of SIGKDD, ICDM, SDM and CIKM and a reviewer for TKDD, TKDE, VLDB, ICDE.

Spiros Papadimitriou is a research staff member at IBM T.J. Watson. His main interests are data mining for graphs and streaming data, clustering, time series and systems for large-scale data processing. He has published more than thirty papers on these topics in refereed conferences and journals. He has three invited journal publications in best paper issues,several book chapters and he has filed multiple patents. He was a Siebel scholarship recipient in 2005 and received the best paper award in SDM 2008.He has also been invited to give keynote talks on graph and social network analysis (WAAMD 2008, and ADN 2009) and tutorials on time series stream mining (University of Maine Summer School, 2008). He obtained his BSc in Computer Science from the University of Crete, Heraclion and his MSc and PhD degrees from Carnegie Mellon University.

Rong Yan is a research scientist at FaceBook focusing on social ad-targeting and optimization. Dr. Yan received his M.Sc. (2004) and Ph.D. (2006) degree from Carnegie Mellon University's School of Computer Science, and worked at IBM Research from 2006 to 2009. His research interests include multimedia information retrieval, large-scale machine learning, data mining, video content analysis, and computer vision. Dr. Yan is the leading designer of the automatic video retrieval system that achieves the best performance in the world-wide TRECVID evaluation in 2003 / 2005. He received the Best Paper Runner-Up awards in ACM MM 2004 and ACM CIVR 2007.

Go to top


T2. New Developments in the Theory of Clustering
Sergei Vassilvitskii (Yahoo! Research) and Suresh Venkatasubramanian (U. of Utah)

Abstract

Theoretical and applied research in clustering have often followed separate paths, with only the occasional confluence of interest. In this tutorial, we provide an overview of recent results in the theory of clustering that bridge this divide and are of interest to practitioners. We describe a new approach to selecting the initial cluster centers in the k-means algorithm, which leads both to provable approximation guarantees, and practical improvements in the quality of the clustering. We continue by explaining why the algorithm works in non-Euclidean spaces, for example, for clustering under information measures like the Kullback-Leibler divergence, and present new algorithms for these metrics. Finally, we discuss recent results on the stability of clusterings and their implication for our ability to judge the quality of a clustering.

Bios

Sergei Vassilvitskii is a Research Scientist at Yahoo! Research and an Adjunct Professor at Columbia University. He received his PhD from Stanford University where he was a Microsoft Research and NDSEG fellow. At Yahoo! he works on computational advertising, an area at the intersection of large scale data analysis, machine learning and economics.

Suresh Venkatasubramanian is the John and Marva Warnock Assistant Professor at the University of Utah. He received his Ph.D from Stanford University, and his B. Tech from the Indian Institute of Technology, Kanpur. His research interests lie in algorithms, computational geometry, and large data mining and analysis.

Go to top


T3. Temporal Pattern Mining in Symbolic Time Point and Time Interval Data
Fabian Moerchen (Siemens Corporate Research)

Abstract

We present a unifying view of temporal concepts and data models in order to categorize existing approaches for unsupervised pattern mining from symbolic temporal data. We distinguish time point-based methods and interval-based methods as well as univariate and multivariate methods. For each of the main categories we present the most important algorithms. For time points, sequential pattern mining algorithms can be used to express equality and order of time points with gaps in multivariate data. Recently, efficient algorithms have been proposed to mine the more general concept of partial order from time points. For time interval data with precise start and end points the relations of Allen can be used to formulate patterns. Several alternatives and extensions have been proposed. We further point the audience to preprocessing methods from temporal data mining.

Bio

Fabian Moerchen graduated with a Ph.D. in Feb 2006 from University or Marburg, Germany after just over 3 years with summa cum laude. In his thesis he proposed a radically different approach to temporal interval patterns. Since 2006 he has been working at Siemens Corporate Research, a division of Siemens Corporation, on data mining projects with applications in predictive maintenance, text mining, healthcare, and sustainability. The studies of temporal data mining were continued in the context of real life applications leading among other publications to the first paper using semi-intervals for pattern mining unifying many previous publications that extended Allen's interval algebra. His broad expertise in temporal data mining was acknowledged with the publication of a survey in the SIGKDD Explorations and presentation of a tutorial at IEEE SSCI 2009. He co-organized the SIGKDD 2007 workshop on time series classification.

Go to top


T4. A Tutorial on Learning through Exploration
Alina Beygelzimer (IBM T.J. Watson Research) and John Langford (Yahoo! Research)

Abstract

This tutorial is about learning through exploration. The goal is to learn how to make decisions in partial feedback settings where an agent repeatedly observes some information, chooses an action, and then learns how this action paid off (but doesn't get to see how other actions would have paid off). We plan to cover all aspects of this general problem: learning, evaluation, limitations of ability to learn in this setting, and the relationship to traditional supervised learning.

Bios

Alina Beygelzimer is a research staff member at the IBM Thomas J. Watson Research Center, working on machine learning, learning theory and algorithms. She received a PhD in Computer Science from the University of Rochester in 2003.

John Langford is a senior research scientist at Yahoo! Research. His work includes research in machine learning, learning theory, algorithms, game theory, steganography, and Captchas. He earned a PhD in Computer Science from Carnegie Mellon University in 2002.

Go to top


T5. Geometric Tools for Graph Mining of Large Social and Information Networks
Michael Mahoney (Stanford University)

Abstract

The tutorial will cover recent algorithmic and statistical work on identifying and exploiting "geometric" structure in large informatics graphs such as large social and information networks. Such tools (e.g., Principal Component Analysis and related non-linear dimensionality reduction methods) are popular in many areas of machine learning and data analysis due to their relatively-nice algorithmic properties and their connections with regularization and statistical inference. These tools are not, however, immediately-applicable in many large informatics graphs applications since graphs are more combinatorial objects; due to the noise and sparsity patterns of many real-world networks, etc. Recent theoretical and empirical work has begun to remedy this, and in doing so it has already elucidated several surprising and counterintuitive properties of very large networks. Topics include: underlying theoretical ideas; tips to bridge the theory-practice gap; empirical observations; and the usefulness of these tools for such diverse applications as community detection, routing, inference, and visualization.

Bio

The tutor, Michael Mahoney, is currently at Stanford University. His research interests focus on theoretical and applied aspects of algorithms for large-scale data problems in scientific and Internet applications. Currently, he is working on geometric network analysis methods; developing approximate computation and regularization methods for large informatics graphs; and applications to community detection, clustering, learning, and information dynamics in large social and information networks. In the past, he has worked on the design and analysis of randomized algorithms for matrices, as well as applications of those methods in genetics and medical imaging. He has been a faculty member at Yale University and a researcher at Yahoo, and his PhD was is computational statistical mechanics at Yale University.

Go to top


T6. Privacy-aware Data Mining in Information Networks
Kun Liu (Yahoo! Research), Gerome Miklau (UMass), Jian Pei (SFU), and Evimaria Terzi (Boston U.)

Abstract

The proliferation of social information networks, as means of sharing information in many application domains, has raised privacy concerns for enterprises that own such data and for individual users that participate in such networks. For enterprises the main challenge is to satisfy two contradicting goals: release the network data for useful data analysis and also preserve the identities or sensitive relationships of the individuals participating in the network. Individual users, on the other hand, will benefit from personalized methods that focus on increasing user's privacy awareness, simplifying the management of user's information sharing, and encouraging online activity and presence.

In this tutorial, we will give a systematic survey on problems and state-of-the-art methods related to both enterprise and personalized privacy in social information networks. We will focus on privacy attacks and risks, privacy-preserving data publishing and personalized privacy monitoring and managements, all of which are particularly tailored for network data. We will also highlight the special nature of the data and describe the unique challenges they impose on the research community.

Bios

Kun Liu is a Scientist at Yahoo! Labs. His research interests include privacy-preserving data mining, social-network analysis, text analytics, and distributed data mining. He co-organized the First SIAM International Workshop on Practical Privacy-Preserving Data Mining. He has co-authored three book chapters on privacy-preserving social-network analysis and privacy-preserving data mining. His contributions to these areas have won him five IBM Invention Achievement Awards, one IBM Invention Plateau Award and one IBM Bravo Award. Recently, he is focusing on problems related to computational display advertising.

Gerome Miklau is an Assistant Professor at University of Massachusetts Amherst. His primary research interest is the secure management of large-scale data. His recent work related to the tutorial has included: assessing the threat of re-identification risk in social networks, theoretically characterizing topological properties that contribute to privacy threats, devising private graph transformations, and devising differentially-private output perturbation schemes for network data. He received an NSF CAREER Award in 2007 and won the 2006 ACM SIGMOD Dissertation Award.

Jian Pei is an Associate Professor at Simon Fraser University. His research focuses on data mining and analytic queries in databases. With prolific publications in refereed journals, conferences, and workshops, he is the recipient of the British Columbia Innovation Council 2005 Young Innovator Award, an associate editor of IEEE Transactions on Knowledge and Data Engineering, and a senior member of both ACM and IEEE. He has served regularly in the organization committees and the program committees of numerous international conferences and workshops.

Evimaria Terzi is an Assistant Professor at Boston University. His research focuses on algorithmic aspects of data mining and data analysis. Recently, she focuses on topics related to information propagation in social networks, privacy-preserving social network analysis. She has won the IBM outstanding award for her work in privacy. Her work at IBM has also won seven IBM Invention Achievement Awards and one IBM Invention Plateau Award.

Go to top


T7. Introduction to Graphical Models for Data Mining
Arindam Banerjee (U. of Minnesota)

Abstract

Graphical models for large scale data mining constitute an exciting development in statistical data analysis which has gained significant momentum in the past decade. Unlike traditional statistical models which often make `i.i.d.' assumptions, graphical models acknowledge dependencies among variables of interest and investigate inference/prediction while taking into account such dependencies. In recent years, latent variable Bayesian networks, such as latent Dirichlet allocation, stochastic block models, Bayesian co-clustering, and probabilistic matrix factorization techniques have achieved unprecedented success in a variety of application domains including topic modeling and text mining, recommendation systems, multi-relational data analysis, etc. The tutorial will give a broad overview of graphical models, and discuss recent developments in the context of mixed-membership models, matrix analysis models, and their generalizations. The tutorial will present a balanced mix of models, inference/learning methods, and applications.

Bio

Arindam Banerjee is an Assistant Professor and a McKnight Land Grant Professor in the Department of Computer Science and Engineering and a Resident Fellow in the Institute on Environment at the University of Minnesota, Twin Cities. He received his Ph.D. from the University of Texas at Austin in 2005, where his dissertation was nominated for the best dissertation award. His research interests are in Machine Learning, Data Mining, Information Theory, Convex Analysis and Optimization, and their applications in complex real world learning problems including problems in Text and Web Mining, Bioinformatics, Social Networks, Finance, and Climate Sciences. He has published extensively in the top conferences, including KDD, ICML, ICDM, and SDM, as well as top journals, including Journal of Machine Learning Research (JMLR), IEEE Transactions on Information Theory (ITIT), and IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI). He has won several awards including the NSF CAREER award, 2010, the McKnight Land-Grant Professorship at the University of Minnesota, Twin Cities, 2009-2011, the J. T. Oden Faculty Research Fellowship from the Institute for Computational Engineering and Sciences (ICES), University of Texas at Austin, 2006, and the prestigious IBM PhD fellowship for the academic years 2003-2004 and 2004-2005. He has also won several awards for his publications, including the Best Paper Award at the SIAM International Conference on Data Mining (SDM), 2004, the Best Research Paper Award under University Cooperative Society Research Excellence Awards, University of Texas at Austin, 2005, and the Best of SIAM Data Mining (SDM) Award at the SIAM International Conference on Data Mining, 2007.

Go to top


T8. Mining Web Search and Browse Logs: Challenges, Methods, and Applications
Daxin Jiang (Microsoft Research Asia), Jian Pei (SFU), and Hang Li (Microsoft Research Asia)

Abstract

Huge amounts of search log data have been accumulated in various search engines. Currently, a commercial search engine receives billions of queries and collects tera bytes of log data on any single day. Other than search log data, browse logs can be collected by client-side browser plug-ins, which record the browse information if users' permissions are granted. Such massive search/browse log data, on the one hand, provides great opportunities to mine the wisdom of crowds and improve Web search as well as online advertisement. On the other hand, designing effective and efficient algorithms to clean, model, and process large scale log data also presents great challenges. In this tutorial, we will give a systematic survey on the applications, challenges, fundamental principles, and state-of-the-art methods of mining large scale search and browse log data. We will start with an introduction of search and browse log data and an overview of various log mining applications. Then, we will focus on four popular areas of log mining applications, namely query understanding, document understanding, query-document matching, and user understanding. For each area, we will survey the major tasks, analyze the challenges, and exemplify several representative solutions. Finally, we will highlight several new directions in search and browse log mining.

Bios

Daxin Jiang is a Researcher at Microsoft Research Asia. His research focuses on data mining and information retrieval. He received Ph.D. in computer science from the State University of New York at Buffalo. He has published extensively in prestigious conferences and journals, and served as a PC member of many conferences. He received the Best Application Paper Award of SIGKDD'08 and the Runner-up for Best Application Paper Award of SIGKDD'04.

Jian Pei is an Associate Professor and the Associate Director, Research, of the School of Computing Science, Simon Fraser University. His research focuses on data mining and analytic queries in databases. With prolific publications in refereed journals and conferences, he is the recipient of several prestigious awards. He is an associate editor of ACM Transactions on Knowledge Discovery from Data (TKDD) and IEEE Transactions on Knowledge and Data Engineering (TKDE). He has served regularly in the organization committees and the program committees of numerous international conferences and workshops. He is a senior member of both ACM and IEEE.

Hang Li is a Senior Researcher and Research Manager at Microsoft Research Asia. His research areas include natural language processing, information retrieval, statistical machine learning, and data mining. He graduated from Kyoto University and holds a PhD in computer science from the University of Tokyo. Hang has about 80 publications in international conferences and journals. He is associate editor of ACM Transaction on Asian Language Information Processing and area editor of Journal for Computer and Science Technology, etc. His recent academic activities include senior PC member of SIGIR 2010, WSDM 2010, and KDD 2010, area chair of ACL 2010, and PC member of WWW 2010.

Go to top


T9. Mining Heterogeneous Information Networks
Jiawei Han (UIUC), Yizhou Sun (UIUC), Xifeng Yan (UCSB), and Philip S. Yu (UIC)

Abstract

With the ubiquity of information networks and their broad applications, there have been numerous studies on the construction, online analytical processing, and mining of information networks in multiple disciplines, including social network analysis, World-Wide Web, database systems, data mining, machine learning, and networked communication and information systems. Moreover, with a great demand of research in this direction, there is a need of a systematic introduction of methods for analysis of information networks from multiple disciplines. Recently there have been some tutorials on structures and laws of homogeneous information networks and graphs. However, there are few systematic tutorials on mining a more important kind of networks, heterogeneous information networks, where information networks are formed by interconnected, multi-typed nodes and links. In this tutorial, we will present an organized picture on scalable mining of heterogeneous information networks, which complements existing tutorials on knowledge discovery in homogeneous information networks. The tutorial includes the following topics:

  1. introduction: information networks and information network analysis,
  2. data integration, data cleaning and data validation in heterogeneous information networks,
  3. clustering and ranking in heterogeneous information networks
  4. classification of heterogeneous information networks,
  5. summarization, OLAP and multidimensional analysis in heterogeneous information networks,
  6. evolution of dynamic heterogeneous information networks, and
  7. research challenges on mining heterogeneous information networks.

Bios

Jiawei Han is a Professor at the Department of Computer Science at Univ. of Illinois at Urbana-Champaign. He has been researching into data mining, data warehousing, information network analysis, etc., with over 400 conference and journal publications. He is Fellow of ACM and IEEE, recipient of ACM SIGKDD Innovations Award (2004), the first author of the textbook “Data Mining: Concepts and Techniques" 2nd ed., (Morgan Kaufmann, 2006), and is the Director of Information Network Academic Research Center, supported by the U.S. Army Research Lab, in the amount of $16.75 Million, since 2009.

Yizhou Sun is a Ph.D. candidate at the University of Illinois at Urbana-Champaign. She received M.S. on Signal and Information Processing in 2007 and B.S. degrees on Computer Science and Statistics in 2004, all from Peking Univ. She has developed RankClus and NetClus algorithms and is one of the team members developing a system, iNextCube, for information network analysis on multidimensional text databases.

Xifeng Yan, (Ph.D., Univ. of Illinois), is a chair assistant professor at Univ. of California at Santa Barbara. He was a research staff member at IBM T. J. Watson Research Center in 2006-2008. His research interests are in the areas of data mining, database systems, and bioinformatics. Specifically, Xifeng investigates models and algorithms for managing and mining complex graphs and networks in the biological, physical, and social worlds. He has published more than 50 papers in refereed journals and conferences, including TODS, Bioinformatics, TSE, SIGMOD, SIGKDD, VLDB, FSE, and ISMB. He gave tutorials in major data mining and database conferences, including ICDM'05, ICDE'06, SIGKDD'06, SIGKDD'08, and EDBT'09. He received the 2007 ACM-SIGMOD (Special Interest Group on Management of Data) Doctoral Dissertation Runner-Up Award for his work in graph mining and graph data management. He also received the Best Student Paper Award in ICDE 2007 and PAKDD 2007.

Philip S. Yu, (Ph.D., Stanford University), is a Professor in the Department of Computer Science at the University of Illinois at Chicago and also holds the Wexler Chair in Information and Technology. He spent most of his career at IBM Thomas J. Watson Research Center and was manager of the Software Tools and Techniques group. Dr. Yu is a Fellow of the ACM and a Fellow of the IEEE. He served as the Editor-in-Chief of IEEE Transactions on Knowledge and Data Engineering (2001-2004). He is an associate editor of ACM Transactions of the Internet Technology and also ACM Transactions on Knowledge Discovery from Data. He serves on the steering committee of IEEE Int. Conference on Data Mining. He was a member of the IEEE Data Engineering steering committee. He received an IEEE Region 1 Award for promoting and perpetuating numerous new electrical engineering concepts. Dr. Yu's research interests include data mining, privacy, data stream, database systems and web technologies. Dr. Yu has published more than 540 papers in refereed journals and conferences. He holds or has applied for more than 300 US patents. Dr. Yu was an IBM Master Inventor.

Go to top


T10. Outlier Detection Techniques
Hans-Peter Kriegel (LMU Munich), Peer Kröger (LMU Munich), and Arthur Zimek (LMU Munich)

Abstract

This tutorial provides a comprehensive and comparative overview of a broad range of state-of-the-art algorithms for finding outliers in massive data sets. It sketches important applications of the introduced methods, and presents a taxonomy of existing approaches. In addition, relationships between the algorithmic approaches of each category of the taxonomy are discussed. Last but not least, at least one algorithm of each category is used for an empirical evaluation of the different approaches for outlier detection. The intended audience of this tutorial ranges from novice researchers to advanced experts as well as practitioners from any application domain where outlier detection methods are required.

Bios

Hans-Peter Kriegel is a full professor for database systems and data mining in the Department “Institute for Informatics” at the Ludwig-Maximilians-Universität München, Germany and has served as the department chair or vice chair over the last years. His research interests are in spatial and multimedia database systems, particularly in query processing, performance issues, similarity search, high-dimensional indexing as well as in knowledge discovery and data mining. Hans-Peter Kriegel has been chairman and program committee member in many international database and data mining conferences. He has published over 200 refereed conference and journal papers including several contributions in the scope of the tutorial, and he received the “SIGMOD Best Paper Award” 1997 and the “DASFAA Best Paper Award” 2006 together with members of his research team.

Peer Kröger holds a tenured position as Lecturer and Researcher (“Akademischer Rat”) at the database systems and data mining group of Hans-Peter Kriegel at the Ludwig-Maximilians-Universität München, Germany. He finished his PhD thesis on clustering moderate-to-high dimensional data in summer 2004 and his Habilitation in 2009. His research interests include data mining and similarity search in high dimensional multimedia and biomedical data including the scope of this tutorial. Peer Kröger has been a program committee member in several major database and data mining conferences. He has published more than 60 refereed conference and journal papers and he received the 2008 Best Paper Honorable Mention Award at the SIAM Data Mining Conference together with his co-authors.

Arthur Zimek is a postdoc in the database systems and data mining group of Hans-Peter Kriegel at the Ludwig-Maximilians-Universität München, Germany. He finished his PhD thesis in computer science on “Correlation Clustering” in summer 2008. He received the “Best Paper Honorable Mention Award” at the SIAM Int. Conf. on Data Mining (SDM) together with his co-authors in 2008 and has been selected as runner-up of the “SIGKDD Doctoral Dissertation Award” in 2009. His research interests include data mining for high dimensional data and structured data especially for bioinformatics applications. Several of his recent publications cover the scope of this tutorial.

Go to top


T11. Recommender Problems for Web Applications
Deepak Agarwal (Yahoo! Research) and Bee-Chung Chen (Yahoo! Research)

Abstract

In this half-day tutorial, we provide an in-depth introduction of data mining challenges that arise in the context of recommender problems for web applications. Since Netflix released a large movie ratings dataset, recommender problems have received considerable attention. The focus of this tutorial however is on web applications and we will cover topics that are significant extensions of those discussed in the context of Netflix contest. While the research pursued in the context of Netflix contest made great advances in “offline-research” (fitting model to historic data obtained from a recommender system), this tutorial goes beyond this and discusses important issues in “online-research” (the science of designing sequential schemes that serve items to users in an optimal fashion). At an abstract level, the goal of recommender systems is to display items from an inventory for each user visit to some website. Each display results in some response from the user (clicks, ratings and so on) that updates our belief about user preferences and results in better recommendations in the future. The data mining challenge is to construct a serving scheme or sequential design that learns user preferences through interactions with items in order to maximize some utility function over a long time horizon. For instance, a portal like Yahoo! maybe interested in constructing a serving scheme that displays articles to users visiting their front page to maximize click rates. The tutorial will begin with a formal definition of the problem through real life examples drawn from actual applications. We provide a detailed discussion of various scenarios under which recommendations may have to be made, including varying item pool size, dynamic versus static item pool, degrees of cold-start both for users and items, number of items to be selected for each display, user fatigue and so on. These scenarios go significantly beyond the classical movie recommendation problem. We then provide a comprehensive overview of state-of-the-art techniques in this area with detailed discussion of several open problems that we hope will stimulate further research. In particular, we describe time-series models, multi-armed bandit schemes, regression approaches, matrix factorization approaches, cold-start, similarity-based approaches exemplified through real world examples. We will end with detailed discussion of several technical challenges in this area.

Bios

Deepak Agarwal is currently a Principal research scientist at Yahoo! Research. Prior to joining Yahoo!, he was a member of the statistics department at AT&T Research. He is a statistician interested in scalable modeling approaches for large scale applications. He has done extensive research on large scale hierarchical random effects model, computational advertising, modeling massive social networks with applications to call graph that arise in the telecommunications industry and modeling massive dyadic data that arise in applications like recommender systems. He has won four best paper awards (JSM 2001, SDM 2004, KDD 2007, ICDM 2009) that are directly related to the material of the tutorial. He has also done research in anomaly detection using a time series approach and computational approaches for scaling spatial scan statistic to large data sets. He regularly serves on program committees of data mining and machine learning conferences. He is currently associate editor for Journal of American Statistical Association, the top journal in the field of Statistics. He have given two tutorials on Statistical Challenges in Online Advertising at CIKM 2009 and KDD 2009. Deepak has also worked on developing algorithms for real recommender systems that are successfully being used and thus has experience with both practical and scientific issues that arise in such applications.

Bee-Chung Chen is currently a research scientist at Yahoo! Research. His research interests include recommender systems, large scale data analysis and scalable methods for modeling and mining. He received his Ph.D. from the University of Wisconsin - Madison, and an outstanding graduate student research award from the Department of Computer Sciences. His thesis work on Cube-Space Data Mining is recognized by the ACM SIGMOD Doctoral Dissertation Award Honorable Mention. Recently, he also won the ICDM 2009 best paper award for his work on Explore/Exploit Schemes for Web Content Optimization. He has done significant research in extending OLAP with machine-learning techniques, data privacy and Web recommendation problems, and also worked on algorithms powering real recommendation applications including a number of major Yahoo! sites.

Go to top


T12. Indexing and Mining Time Sequences
Christos Faloutsos (CMU) and Lei Li (CMU)

Abstract

How can we find patterns in a sequence of sensor measurements (eg., a sequence of temperatures, or water-pollutant measurements)? How can we compress it? What are the major tools for forecasting and outlier detection? The objective of this tutorial is to provide a concise and intuitive overview of the most important tools that can help us find patterns in sensor sequences. Sensor data analysis becomes of increasingly high importance, thanks to the decreasing cost of hardware and the increasing on-sensor processing abilities. We review the state of the art in three related fields: (a) fast similarity search for time sequences, (b) linear forecasting with the traditional AR (autoregressive) and ARIMA methodologies, (c) non-linear forecasting, for chaotic/self-similar time sequences, using lag-plots and fractals, and (d) Kalman filters. The emphasis of the tutorial is to give the intuition behind these powerful tools, which is usually lost in the technical literature, as well as to give case studies that illustrate their practical use.

Bios

Christos Faloutsos is a Professor at Carnegie Mellon University. He has received the Presidential Young Investigator Award by the National Science Foundation (1989), the Research Contributions Award in ICDM 2006, fifteen ``best paper'' awards, and several teaching awards. He has served as a member of the executive committee of SIGKDD; he has published over 200 refereed articles, 11 book chapters and one monograph. He holds five patents and he has given over 30 tutorials and over 10 invited distinguished lectures. His research interests include data mining for graphs and streams, fractals, database performance, and indexing for multimedia and bio-informatics data.

Lei Li is currently a Ph.D. candidate at Computer Science Department, Carnegie Mellon University. Lei Li is received his B.E. degree from the Department of Computer Science and Engineering, Shanghai Jiao Tong University. His research interests include machine learning, data mining, time series analysis, and motion capture data.

Go to top

Washington Monument data mining picture


Tutorials - Call For Proposals (Expired)
KDD-2010 will host tutorials covering topics in data mining of interest to the research community as well as application developers. The tutorials will be part of the main conference technical program, and are free of charge to the attendees of the conference.
We invite proposals for half-day tutorials from active researchers and experienced tutors. Ideally, a tutorial will cover the state-of-the-art research, development and applications in a specific data mining direction, and stimulate and facilitate future work. Tutorials on interdisciplinary directions, novel and fast growing directions, and significant applications are highly encouraged.  Accepted tutorials will receive one free registration to the conference.
Areas of interest include, but are not limited to:
  • Introductory tutorials on data mining foundations and theory
  • "How to" tutorials for practitioners
  • Mining environmental, climate, and scientific data
  • Mining dynamic and evolving data
  • Mining social networks and graph data
  • Mining spatial and temporal data
  • Mining data streams and sensor data
  • Mining multimedia data
  • Mining biological and biomedical data
  • Mining text, Web, and semi-structured data
  • Mining uncertain data
  • Mining for computational advertising
  • Mining user behavioral and feedback data
  • Mixed-initiative data mining and active learning
  • Mining with auxiliary data sources and transfer learning
  • Topic models and matrix methods in data mining
  • Outlier analysis and anomaly detection
  • Robust and highly scalable statistical methods and mining algorithms
  • Security, privacy, and adversarial data mining
  • High performance and parallel/distributed data mining
  • Mining tera-/peta-scale data
  • Visual data mining and data visualization
  • Data integration issues in mining
A tutorial proposal should be formatted in the following sections.
  1. Title
  2. Abstract (up to 150 words)
  3. Target audience and prerequisites. Proposals must clearly identify the intended audience for the tutorial (e.g., novice users of statistical techniques, or expert researchers in text mining). What background will be required of the audience? Why is this topic important/interesting to the KDD community? What is the benefit to participants?
  4. Outline of the tutorial. Enough material should be included to provide a sense of both the scope of material to be covered and the depth to which it will be covered. The more details that can be provided, the better (up to and including links to the actual slides). Note that the tutors should NOT focus mainly on their own research results. A KDD tutorial is not a forum for promoting one's research or product.
  5. A list of forums and their time and locations if the tutorial or a similar/highly related tutorial has been presented by the same author(s) before, and highlight the similarity/difference between those and the one proposed for KDD-2010 (up to 100 words for each entry)
  6. A list of tutorials on the same/similar/highly related topics given by other people, and highlight the difference between yours and theirs (up to 100 words for each entry)
  7. A list of other tutorials given by the authors, please list the titles, the presenters and the forums only.
  8. Tutors' short bio and their expertise related to the tutorial (up to 200 words per tutor)
  9. A list of up to 20 most important references that will be covered in the tutorial
  10. (Optional) URLs of the slides/notes of the previous tutorials given by the authors, and any specific audio/video/computer requirements for the tutorial.
Important Dates
  • Proposals due: February 26, 2010
  • Notification of Acceptance: May 14, 2010
Please send your submission to Tutorial Co-chairs Francoise Soulie Fogelman and Tina Eliassi-Rad.

Gold Sponsor

Microsoft Advertising Logo

Silver sponsors

Yahoo Logo

SAS Logo

Google Logo

Accenture logo


Become a Corporate Sponsor!