FFTED - Fast and Flexible Tree Edit Distance

Duration:
2017 - 2020
Funding:
388'836 EUR (Austrian Science Fund - FWF, P 29859 Einzelprojekte)

Abstract

Data that are structured in hierarchies are called trees. An example of tree data is the organizational hierarchy of a company with departments, their managers, and subordinate employees. Other examples include enterprise assets, bills of material, natural or computer language syntax trees, directories in file systems, or molecular structures in bioinformatics.

When tree data change, new tree versions are generated. An important task is to compute the difference (called diff) between two version of a tree. A good diff is as concise as possible. A common approach to compute diffs for trees is the so-called tree edit distance (TED), which computes the minimal diff.

The computation of the minimal diff is challenging. Current solutions suffer from two problems that make them impractical in many scenarios: (1) Efficiency: The computation of the minimal diff is too slow. When the tree size doubles, the runtime increase by a factor of four or more. Current techniques are applicable to hierarchies with about ten thousand elements, but many interesting trees have millions of elements (e.g., bills of material). (2) Flexibility: TED computes the most concise diff and disregards any application semantics, which may lead to non-intuitive results. For example, TED may turn a department into an employee to keep the diff small.

Various attempts have been made to overcome these problems, but all solutions suffer from at least one of the following issues: quality is traded off against speed (i.e., the diffs are too verbose), often without any guarantee about the deviation from the optimal result; the application semantics are hard-wired in the solution, thus restricting its scope to very specific applications; or only one of the issues (efficiency or flexibility) is addressed.

The goal of this project is to solve the efficiency and flexibility problems of TED without trading off quality. We will allow users to specify constraints that fit their application semantics as part of the input (e.g., departments cannot be turned into employees), and the minimal diff that respects these constraints will be computed. This will make our solution applicable to a wide range of different domains. In order to gain efficiency, we will exploit two observations that have received little attention in the past. First, changes between tree versions typically affect only a small portion of the tree; by focusing the computational effort on the parts that actually changed we hope to achieve considerable runtime improvements. Second, user constraints limit the number of allowable diffs, and fewer options must be considered; we plan to leverage this fact as well to gain performance.

If successful, our solution will be the first one to deal with user-defined constraints and compute the minimal diff efficiently.

Resources

[PDF] Slides by Nikolaus Augsten of the invited talk "Effiziente Techniken für Ähnlichkeitsabfragen in hierarchischen Daten" at ACSD 2017 / IMAGINE 2017 in Vienna.