The DMSS and LLLL workshops jointly invite the following speakers. All IBISML/DMSS/LLLL participants can attend these invited talks. All talks are held at the Tamokuteki Space 2 room (3F).

All DMSS and LLLL attendees can join the invited talk of the 4th IBISML meeting by Yoshitaka Kameya in March 28th, 2011.

- Distributional Learning of Extensions of Context-Free Grammars (Ryo Yoshinaka)
- Computation over Topological Spaces via Streams with a Bottom (Hideki Tsuiki)
- Kernel-based Similarity Search in Massive Graph Databases with Wavelet Trees (Yasuo Tabei)

## Distributional Learning of Extensions of Context-Free Grammars

- ERATO MINATO Discrete Structure Manipulation System Project, Japan Science and Technology Agency
- Graduate School of Information Science and Technology, Hokkaido University

**Date:** March 29th, 2011

This talk is concerned with algorithmic learning of formal languages, rather than heuristic techniques. We would like to construct a finite description, called a grammar, of a language from limited information sources of the language. Intensive research on the learning of regular languages has achieved a plenty of positive results so far. Yet techniques for learning context-free structures are desired from many application areas such as natural language processing, bioinformatics, XML technologies, and so on. It had been thought to be quite a hard task to learn a considerably rich class of languages that handle context-free structures.

In these years, a new line of research on the learning of context-free languages, called “distributional learning”, has been proposed and has produced several fruitful results. Distributional learning algorithms model and exploit the distribution of substrings in grammatical sentences. In other words, they observe the relation on a set of substrings and a set of contexts obtained from the observed sentences, where a context is a pair of strings that wraps a string in the middle to form a complete sentence.

This talk demonstrates how the idea of distributional learning can be extended to even richer formalisms, namely, multiple context-free grammars and simple context-free tree grammars. Context-free grammars are thought to be fairly expressive, yet there are natural language phenomena and biological sequences that are beyond context-freeness. A virtue of those formalisms is the expressive power that handles those complex phenomena, while keeping the tractability. Whereas context-free grammars treat strings, multiple context-free grammars treat tuples of strings. Context-free tree grammars generate tree languages in the way analogous to the derivation of strings by usual context-free grammars. The notion of substrings and contexts are generalized accordingly to those formalisms so that the fruits on the distributional learning of context-free grammars can be smoothly transferred.

## Computation over Topological Spaces via Streams with a Bottom

- Course of Mathematical Science, Graduate School of Human and Environmental Studies, Kyoto University

**Date:** March 30th, 2011

If a stream contains a bottom cell, that is, if the evaluation of a part of the stream causes a non-terminating computation, then the rest of it is not accessible. In this talk, first, we show that such a bottomed stream is useful in exact representation of continuous data like real numbers. In particular, we show that the set of real numbers can be embed into the set of 1-bottomed sequences which are infinite sequences with at most one copy of bottom. We call this embedding the Gray-code embedding. Then, we introduce an IM2-machine which is an indeterministic machine which access such an extended stream with input/output tapes with multiple heads. An IM2-machine can be naturally implemented with logic programming languages with committed choice, and is also implemented as an extension of the Haskell language. Finally, we show that real number computation can be realized through the combination of Gray-code embedding and an IM2-machine, and present simple algorithms like conversions between this Gray representation and signed digit expansion, and addition and multiplication on real numbers, and show how an IM2-machine works by avoiding accesses to bottom cells. We also show some mathematical results about this kind of representation of topological spaces. For example, we show that every n-dimensional metric space can be embed into the space of infinite sequences with at most n copies of bottoms, and therefore an IM2-machine with (n+1)-heads can access such spaces.

## Kernel-based Similarity Search in Massive Graph Databases with Wavelet Trees

- ERATO MINATO Discrete Structure Manipulation System Project, Japan Science and Technology Agency
- Graduate School of Information Science and Engineering, Tokyo Institute of Technology

**Date:** March 30th, 2011

Labeled graphs are general and powerful data types that can be used to represent diverse kinds of real-world objects, including biological sequences, semi-structured texts such as HTML and XML, chemical compounds, social networks, and so forth. The amount of available graph data is ever increasing. For example, the PubChem database for chemical compounds files more than 20 million compounds. Similarity search in databases of labeled graphs is a fundamental task in managing graph data. Typically, a graph is decomposed to a set of substructures (e.g., paths, trees and subgraphs) and a similarity measure is defined via the number of common substructures. Using the representation, graphs can be stored in a document database by regarding graphs as documents and substructures as words. A graph similarity query then translates to a semi-conjunctive query that retrieves graphs sharing at least k substructures in common with the query graph. We argue that this kind of query cannot be solved efficiently by conventional inverted indexes, and develop a novel recursive search algorithm on wavelet trees (Grossi et al., SODA’03). Unlike gIndex, it does not require frequent subgraph mining for indexing. In experiments, our method was successfully applied to 25 million chemical compounds.