New Directions in Statistical Graph Mining

Koji Tsuda (Max Planck Institute for Biological Cybernetics, Germany)

Sep 25, 13:30 - 14:30

Presentation download (PDF)

Abstract

Over the past three years, our group has been applying statistical methods such as boosting, LARS, probabilistic clustering and outlier detection to structured data such as graphs, sequences and itemsets. Applications include chemical compounds, mutation sets, videos and images. The common strategy is to apply weighted substructure mining repeatedly to collect substructure patterns progressively. We transform conventional statistical algorithms so that they accept structured data as input (i.e., structurization) by replacing one step of the algorithm with weighted substructure mining. Sometimes structurization is straightforward (e.g., boosting), but in most cases substantial effort is needed to reformulate the algorithm. Building on this groundwork, we are now moving toward more complex learning problems in structured domains. In this talk, I will present recent work on partial least squares (PLS), principal component analysis (PCA), and Bayesian regression for graph data.
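
To make the strategy concrete, here is a minimal sketch, assuming itemset-shaped data and single-item candidate patterns (it is an illustration, not the speaker's implementation): AdaBoost in which the weak-learner step is replaced by a weighted substructure search that, in each round, picks the pattern with the largest weighted edge. A real graph-mining system would enumerate subgraph patterns with a miner such as gSpan instead of iterating over single items.

# A minimal sketch of "structurized" boosting on itemsets: the weak-learner
# step of AdaBoost is replaced by a weighted substructure search.
import math

def mine_best_pattern(data, labels, weights, patterns):
    """Weighted substructure mining stand-in: return the pattern whose
    indicator stump h_p(x) = +1 if p is contained in x has the largest
    weighted edge sum_i w_i y_i h_p(x_i)."""
    best, best_edge = None, 0.0
    for p in patterns:
        edge = sum(w * y * (1 if p <= x else -1)
                   for x, y, w in zip(data, labels, weights))
        if abs(edge) > abs(best_edge):
            best, best_edge = p, edge
    return best, best_edge

def boost(data, labels, rounds=10):
    n = len(data)
    weights = [1.0 / n] * n
    # candidate substructures: all single items (a stand-in for mined subgraphs)
    patterns = [frozenset([i]) for i in sorted(set().union(*data))]
    ensemble = []
    for _ in range(rounds):
        p, edge = mine_best_pattern(data, labels, weights, patterns)
        sign = 1 if edge >= 0 else -1
        err = max((1 - abs(edge)) / 2, 1e-12)  # clipped weighted error
        if err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((p, sign, alpha))
        # AdaBoost reweighting: misclassified examples gain weight
        weights = [w * math.exp(-alpha * y * sign * (1 if p <= x else -1))
                   for x, y, w in zip(data, labels, weights)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(a * s * (1 if p <= x else -1) for p, s, a in ensemble)
    return 1 if score >= 0 else -1

# Toy itemset data: the positive class is "contains item a or item b".
data = [frozenset("a"), frozenset("b"), frozenset("ab"),
        frozenset("c"), frozenset("d"), frozenset("cd")]
labels = [1, 1, 1, -1, -1, -1]
model = boost(data, labels)
print([predict(model, x) for x in data])

Running the sketch prints the ensemble's predictions on the toy training set; the only point is to show where the mining call slots into the boosting loop.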

Privacy-preserving Data Mining and Machine Learning

Jun Sakuma (Tokyo Institute of Technology, Japan)

Sep 25, 15:45 - 16:45

Presentation download (PDF)

Abstract

Data mining technology has been extensively studied with the aim of discovering valuable knowledge from large-scale datasets. When data mining takes a unified dataset as input, the performance criteria are the accuracy of the obtained knowledge and the scalability of the algorithm. When the dataset is distributed among two or more individuals or enterprises, however, maintaining the confidentiality and preserving the privacy of the data become critical issues in addition to accuracy and scalability. In many cases, such information stays in-house and is not exploited for knowledge discovery because of the risk of leakage and legal restrictions.

Privacy-preserving data mining (PPDM) provides a secure way to discover knowledge from privately distributed datasets without violating privacy. Many PPDM algorithms are realized by composing data mining algorithms with cryptographic building blocks, such as secure function evaluation or homomorphic public-key cryptography, which have been extensively studied in the security community.

This talk introduces fundamental cryptographic tools that are often combined with distributed data mining and machine learning. Through a few examples of privacy-preserving data mining and machine learning, future directions for this line of research are also discussed.
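
As a flavor of the kind of building block involved, the following toy sketch (an illustration, not a protocol from the talk) computes a secure sum with additive secret sharing: each party splits its private value into random shares modulo a large number, so no single party learns another party's value, yet the revealed subtotals add up to the true total. Secure sums of this kind are a common primitive inside distributed counting and averaging steps.

# Toy secure-sum via additive secret sharing (illustrative only).
import secrets

MODULUS = 2**61 - 1  # any modulus larger than the maximum possible sum works

def share(value, n_parties):
    """Split `value` into n_parties additive shares that sum to value (mod MODULUS)."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def secure_sum(private_values):
    n = len(private_values)
    # each party i splits its own value and sends share j to party j
    all_shares = [share(v, n) for v in private_values]
    # party j adds the shares it received and reveals only that subtotal
    subtotals = [sum(all_shares[i][j] for i in range(n)) % MODULUS
                 for j in range(n)]
    # individually the subtotals look random; together they give the total
    return sum(subtotals) % MODULUS

salaries = [52000, 61000, 47000]            # each value is private to one party
print(secure_sum(salaries), sum(salaries))  # both equal 160000

A production protocol would of course need a real communication layer and a security analysis (e.g., against colluding parties); the sketch only shows the arithmetic idea behind such building blocks.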

Theoretical Explanation of the Boosting Algorithms

Liwei Wang (Peking University, China)

Sep 26, 13:30 - 14:30

Presentation download (PDF)

Abstract

There has been much debate about why boosting has excellent empirical performance and why it is often immune to overfitting. The most influential explanation is the margin theory, which is essentially an upper bound on the generalization error of any voting classifier in terms of the margin distribution over the training data. However, Breiman raised important questions about the margin explanation by developing a boosting algorithm, arc-gv, that provably generates a larger minimum margin than AdaBoost. He also gave a sharper bound in terms of the minimum margin and argued that the minimum margin governs generalization. In experiments, however, arc-gv usually performs worse than AdaBoost, casting serious doubt on the margin explanation. In this talk, we try to give a complete answer to Breiman's critique by proving a bound in terms of a new margin measure called the Equilibrium margin (Emargin). The Emargin bound is uniformly sharper than Breiman's minimum margin bound. This result suggests that the minimum margin is not crucial for the generalization error. We also show that a large Emargin implies good generalization. Experimental results on benchmark datasets demonstrate that AdaBoost usually has a larger Emargin and a smaller test error than arc-gv, which agrees well with our theory.
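
For readers unfamiliar with the quantities involved, the following small sketch (the numbers are made up for illustration) computes the normalized margins of a voting classifier, the minimum margin that Breiman's bound uses, and the empirical margin-distribution term P[margin <= theta] that margin-style bounds are stated in terms of.

# Normalized margins of a voting classifier f(x) = sum_t alpha_t h_t(x).
import numpy as np

def margins(votes, alphas, labels):
    """Normalized margins y_i * sum_t alpha_t h_t(x_i) / sum_t |alpha_t|, in [-1, 1]."""
    alphas = np.asarray(alphas, dtype=float)
    return labels * (votes @ alphas) / np.abs(alphas).sum()

# votes[i, t] = prediction (+1 or -1) of base classifier t on example i
votes = np.array([[ 1,  1, -1],
                  [ 1, -1,  1],
                  [-1, -1, -1],
                  [-1,  1, -1]])
labels = np.array([1, 1, -1, -1])
alphas = [0.7, 0.4, 0.9]   # voting weights of the base classifiers

m = margins(votes, alphas, labels)
print("minimum margin:", m.min())
print("fraction of examples with margin <= 0.2:", np.mean(m <= 0.2))

The Emargin measure refines this picture by looking at the whole margin distribution rather than only its minimum; the sketch stops at the raw quantities themselves.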