II Unifying, Understanding and Improving Bundling Techniques

Télécharger le fichier original (Mémoire de fin d’études)

Clutter Reduction for Node-Link Diagrams

Bundling combines two concepts of Information Visualization: path drawing and clutter reduction. This chapter clarifies the main definition and concept for graph drawing and trails as well as clutter reduction techniques and criteria. We start by introducing the types of data (graph or trails) and their existing visualizations that bundling works on. Then we detail the basis of clutter reduction techniques to better understand the scope of bundling in this particular landscape. Finally, we introduce the main definitions and notations we deem necessary to understand the design space of bundling techniques.
Les techniques de bundling (simplification de liens par agrégation visuelle) combinent deux concepts majeurs de la visualisation d’informations : le dessin de liens et la ré-duction d’occultation. Dans ce chapitre, nous clarifions les principales définitions et concepts relatifs aux dessins de liens (dessins de graphes ou dessins de trajectoires) ainsi que les techniques et critères relatifs à la réduction d’occultation de la visuali-sation de données. Nous débutons par l’analyse du type de données utilisables par les techniques de bundling (graphes ou trajectoires) ainsi que les moyens existants pour les visualiser (voir section 2.1). Ensuite nous détaillons les fondations des techniques de réduction d’occultation afin de mesurer la place du bundling parmi ces dernières (sec-tion 2.2). Enfin, nous introduisons les principales définitions et notations nécessaires à la compréhension de l’espace de conception des algorithmes de bundling (section 2.3). Dans ce cours résumé en français, nous nous limiterons à définir les concepts et nota-tions relatifs aux techniques d’agrégation visuelle.

Definition de l’Opérateur Bundling

L’espace de conception des techniques de bundling est vaste et complexe contenant pléthore d’implémentations variées et de types de jeux de données utilisables. Dans l’optique d’unifier et d’améliorer la compréhension de cet espace, nous débutons notre travail par la définition des concepts et notations sur lequels se fondent cette thèse.

Les Objectifs du Bundling

Les graphes de grande taille, fortement connectés, issus d’applications concrètes ont beaucoup plus de liens que de nœuds. Au cause de cela, les représentations de ces graphes par des diagrammes nœuds-liens classiques deviennent rapidement ineﬃcaces pour les explorer et les analyser. Ce problème est souvent référencé dans la littérature comme l’encombrement de liens (edge congestion, Carpendale et Rong, 2001, l’occul-tation de liens (visual clutter, Ellis et Dix, 2007) ou le problème de la boule de poils (hairball problem, Schulz et Hurter, 2013). Ainsi pour résoudre ce problème, plusieurs techniques existent (voir section 2.2 et Ellis et Dix, 2007) et parmi elles le bundling.
De manière informelle, le bundling échange l’occultation pour du ”sur-dessin“ (Te-lea et Ersoy, 2010). Cependant, bien qu’il y ait des dizaines d’articles sur le bundling, aucun ne définit formellement ce qu’est le bundling. Dans notre travail, nous argu-mentons qu’une telle définition est nécessaire si l’on souhaite comprendre le processus, comparer les méthodes et débattre des garanties et des limitations du bundling et plus généralement faire avancer la recherche sur le bundling.

Définition du Bundling

Soit G = (V, E ⊂ V ×V ) un graphe avec des nœuds V = {vi} et des liens E = {ei}. Soit d la dimension de l’espace de dessin dans lequel nous souhaitons appliquer le bundling (généralement cette dimension est 2 ou 3). En parallèle, soit T = {ti } ce que nous appellerons un jeu de trajectoires comme définis par Phan et al., 2005 et Zhou et al., 2013. Une trajectoire ti ⊂ Rd est une courbe orientée. Généralement, les trajectoires décrivent les déplacements d’objets dans l’espace, e.g. des avions (Hurter, Tissoires et Conversy, 2009), des tracés oculaires (Peysakhovich, Hurter et Telea, 2015), des bateaux (Scheepens et al., 2011a), ou des personnes (Nagel, Pietsch et Dörk, 2016). Cependant, elles peuvent aussi représenter des courbes non liées à des déplacements, e.g. des lignes brisées dans la visualisation de type coordonnées parallèles (Parallel Coordinates Plots
– PCP) (Inselberg, 2009) ou des tracés de tenseurs de diﬀusion d’IRM (Everts et al., 2015). Il est important de noter qu’en théorie des graphes la notion de trajectoire a une signification diﬀérente (un type de chemin dans un graphe pour lequel tous les liens sont distincts, voir Harris, Hirst et Mossinghoﬀ, 2008). Soient G et T l’espace de tous les graphes et respectivement tous les jeux de trajectoires.
L’élément clé unifiant les graphes et les jeux de trajectoires est ce que l’on nomme l’opérateur de dessin D. Pour les graphes, D : G → Rd est typiquement une technique d’agencement ou de dessin des liens d’un graphe (voir Battista et al., 1998). Par analo-gie, nous noterons D(ei) et D(vi) comme la transformée (le dessin) des nœuds ei ⊂ E, respectivement des liens vi ⊂ V . Pour les trajectoires, l’opérateur de dessin est la fonc-tion identité, i.e. D(ti) = ti, car les trajectoires sont déjà intégrées dans un espace géométrique. Nous unifions les deux espaces G et T en définissant P l’espace des jeux de chemins. Ainsi P dénote soit un graphe G soit un jeu de trajectoires T et par ana-logie, un chemin p ∈ P est soit un lien de graphe e ou une trajectoire t. Les chemins peuvent avoir des attributs additionnels, tels que la direction, le poids, la temporalité, un nom ou un type générique comme introduit par Peysakhovich, Hurter et Telea, 2015 et Diehl et Telea, 2014. Ainsi, un chemin p peut être vu comme un vecteur de dimension n + d, avec n le nombre d’attributs et d la dimension spatiale du chemin.
Soit D ⊂ Rd l’espace de l’ensemble des dessins de chemins D(P ). Soit B : D → D un opérateur denotant l’agrégation (le bundling) d’un jeu de chemins ; et finalement soit B(D(p)) la courbe représentant l’agrégation du chemin p. Alors B est une méthode de bundling (ou d’agrégation) si :
∀(pi pj) ∈ P
×P |pi 6=pj ∧κ(pi,pj) < κmax → δ(B(D(pi)), B(D(pj))) δ(D(pi), D(pj)).

Équation du bundling

Dans cette équation, δ est une distance entre les chemins de Rd, e.g. la distance de Hausdorﬀ comme défini par Berg et al., 2010. κ : P × P → R+ est ce que nous appellerons une fonction de compatibilité dénotant la similarité entre les chemins. En d’autres termes, quand κ tend vers 0 alors les chemins sont similaires, lorsque κ tend vers l’infini alors les chemins sont diﬀérents (i.e. incompatibles). Dans tous les cas, κ se doit d’être une fonction qui reflète la compatibilité géométrique des chemins dans leur espace de dessins D(P ), i.e. lorsque κ(pi,pj) tend vers 0 alors δ(D(pi), D(pj)) tend lui aussi vers 0. De plus, κ peut inclure des compatibilités sur les n attributs additionnels (basés sur les données) mentionnés plus tôt, i.e. κ peut être modélisée par une distance entre chemins en dimension n + d (l’espace géométrique et des attributs, c.f. Peysakhovich, Hurter et Telea, 2015). Dans notre équation, seuls les chemins dont la compatibilité est inférieure à un seuil prédéfinis (κmax) doivent être bundlés – sinon le bundling de D(P ) subirait des déformations trop importantes pour être correctement analysé. De façon informelle, notre Équation du bundling définie que pour deux chemins compatibles alors le dessin bundlé de ces chemins B(D(pi)) est géométriquement plus proche que le dessin original de ces premiers D(pi).

Diﬀérences et Similitudes entre Graphes et Trajectoires

Dans la litérature des techniques de bundling, les auteurs font souvent référence aux termes edge bundling ou graph bundling de manière non discriminée, même dans le cas où les données à bundler sont des trajectoires. Ainsi, il est donc important de clarifier les diﬀérences et ressemblances entre le bundling de ces deux entités.
Pour résumer, les graphes et les trajectoires sont deux types de données radicalement diﬀérents – le premier n’étant pas lié à un espace géométrique (pour ce faire nous avons besoins du dessin du graphe) ; le second étant par définition intégré dans un espace géométrique (Euclidien). Les trajectoires sont typiquement des courbes orientées, alors que les liens d’un graphe ne le sont pas forcément. En théorie, la plupart des algorithmes de bundling peuvent bundler des dessins de graphes ou des trajectoires. Cependant, choisir le correct algorithme de bundling pour agréger un jeu de chemins donné est guidé par des critères plus subtiles que simplement le type de données d’entrée (e.g. la définition des fonctions de compatibilités, la quantité d’agrégation à fournir ou le type de structure à mettre en valeur). Tous ces aspects sont importants pour comprendre et utiliser de façon optimale le bundling, comme nous allons le démontrer dans la suite de cette thèse.
Au cours de ce chapitre, nous avons établi le cadre global de visualisation dans lequel les techniques de bundling prennent racine. Nous avons introduit les diﬀérents principes et méthodes de mise en forme des diﬀérents jeux de données utilisables par le bund-ling : les graphes et leurs techniques de dessins (voir section 2.1) et les trajectoires. En section 5.1.1, nous avons détaillé et explicité les techniques de réduction d’occultation existantes qui nous permettrons dans les chapitres suivant d’améliorer la compréhension et la place du bundling dans l’ensemble des techniques de réduction d’occultation. En-fin, nous avons défini (et résumé en français) les principes importants et les principales définitions propres au bundling (e.g. Équation du bundling) nécessaire pour débuter notre analyse de l’espace des techniques de bundling dans les chapitres suivants. Au cours des chapitres à venir, nous exposerons le bundling sous 3 angles diﬀérents, en partant d’une taxonomie orientée type de jeu de données d’entrée (chapitre 3) vers la construction d’un pipeline unifié des méthodes de bundling (chapitre 4) et une analyse des tâches et des domaines d’application du bundling (chapitre 5).

From Data to Point/Line Visualization

Bundling aims at providing a visual simplification by spatial grouping similar lines. This process requires the ability to represent information embedded in a given data-set to be represented as a point/line visualization. In this section, we will detail two formalisms of data-sets structure and how each type of data-set can be plotted on a screen in order to be bundled.

Graphs

“Graphs in general form one of the most important data models in computer science”, Beck et al., 2014. Formally, any relational structures consisting of a set of entities and relationships between those entities can be modeled as graphs. In the last two decades, graph theory and graph drawing have played a large part in the research themes of Human-Computer Interaction and Information Visualization. In this subsection, we will define and present the basis of graph drawing, specifically, how we transform a graph based data-set into a visual structure. This subsection highlights existing surveys of the field (i.e. Von Landesberger et al., 2011, Herman, Melançon, and Marshall, 2000, Battista et al., 1998).

Definitions

What we call graph refers to a set of vertices and a set of edges that connect pairs of vertices. A graph is a pair of these sets, G = (V, E) where V is a set of vertices and E is a set of edges between vertices (i.e. E ⊂ V 2 e.g. an edge e = (v1, v2) where v1 ∈ V and v2 ∈ V ), see Diestel, 2000. This definition can be extended by attaching attributes to vertices and/or edges to denote their type, size, or some other application related information.
The formal model of graph is further refined and classified into undirected and directed graphs (Herman, Melançon, and Marshall, 2000). A directed graph (or digraph) is defined similarly with the exception that the elements of E, called directed edges, are ordered pairs of vertices. In other terms, in a directed graph, for two vertices v1, v2, (v1, v2) 6=v(2, v1). Conversely, for an undirected graph, ∀v1, v2, (v1, v2) = (v2, v1). A graph containing both directed and undirected edges is called a mixed graph.
In graph theory, a (directed) path in a (directed) graph G is a sequence (v1, v2, …, vn) of distinct vertices where vi ∈ V and (vi, vi+1) ∈ E. A (directed) path is a (directed) cycle if (vh, v1) ∈ E. Similarly, a directed graph that has no directed cycles is called a directed acyclic graph (DAG).
A tree is a connected undirected graph without cycle (Diestel, 2000). A Tree T is called rooted when one vertex r is distinguished as a so called root node: T = (V, E, r). In graph drawing, we often called such rooted tree a hierarchical graph. However, in the formal framework of mathematical graph theory, a hierarchical graph is a DAG (meaning a node can have several paths to the root node).
Additionally, we call compound graph C as a pair (G, T ) where G = (V, EG) and
T = (V, ET , r) that share the same set of vertices, such that:
∀e = (v1, v2) ∈ EG, v1 ∈/ pathT (r, v2) & v2 ∈/ pathT (r, v1) (2.1)
Relationships between vertices are expressed by the tree T . Here, vertices sharing a common parent in T belong to the same “branch” (or “group”) of T .
Graphs may also change over time, forming what we call dynamic graphs. Time dependent changes can aﬀect the attributes of nodes and edges but also the graph structure. A dynamic graph G(t) = (V (t), E(t)), t ∈ R+. A a given time ti, G(ti) is called a frame. Two diﬀerent forms of dynamic graphs are known: streaming graphs and sequence graphs.
In a streaming graph G = (V, E), each edge ei ∈ E has a so-called lifetime [tstarti, tendi], of duration λj = tendi − tstarti, where tstarti < tendi. That is, ei exists only between tstarti and tendi. The dynamic graph G(t) thus contains all edges ei ∈ E that are alive at t, i.e. for which tstarti ≤ t ≤ tendi. The same holds for the streaming graph’s nodes vi ∈ V . Streaming graphs can be available in an online manner – that is, one does not know upfront all moments tstarti and tendi.
Sequence graphs are, as their name says, ordered sets G = {Gi} of static graphs Gi = (V i, Ei). In contrast to streaming graphs, edges Ei do not have a lifetime, but belong to a single frame i. They typically capture a system’s structure at several discrete time moments ti. Well-known examples are the set of call graphs mined from the several revisions of software system stored in a software repository (Diehl and Telea, 2014 and Reniers et al., 2014). In practice, the frames Gi are usually not unrelated, but have nodes and edges which capture the evolution in time of the same items. For instance, two edges eia ∈ Ei and eib+1 ∈ Ei+1 can represent the same call relation in two consecutive frames of a software system. Such links relating information between diﬀerent sequence frames can be modelled by a so-called correspondence function c : Ei → Ei+1 ∪ {?}. Here, c(e ∈ Ei) yields an edge e0 ∈ Ei+1 which logically corresponds to e, if such an edge exists in Ei+1, or the empty set, if there is no correspondence, i.e. if e disappears in the transition from Gi to Gi+1.

Graph Layout for Node-link Diagrams

Here, it is important to emphasize that a graph and its drawing are quite diﬀerent objects. Per definition, a graph does not have any drawing and is not embedded in any kind of Hilbertian space. “In general a graph has many diﬀerent drawings” (Battista et al., 1998), and choosing the correct layout for a given graph is a recurrent research of the Graph Drawing and Information Visualization community. Overall, graph layout techniques aim at optimizing several criteria depicted in three groups in the recent survey “Visual Analysis of Large Graphs: State-of-the-Art and Future Research Challenges” by Von Landesberger et al., 2011 (see Von Landesberger et al., 2011 section 3). These three groups are:
• General criteria: They encompass criteria such as reduction of visual clutter or maximization of space eﬃciency.
• Dynamic graphs: Maximization of display stability between time points or reduc-tion of cognitive load.
• Aesthetic scalability criteria: These criteria refer to graph readability for larger graphs, i.e. scalability in number of vertices, edges, and graphs.
According to Von Landesberger et al., 2011, despite being important these criteria cannot be simultaneously optimized and are not suﬃcient to design a good layout which is usually data and/or task dependent. Therefore, multiple graph layout are fundamental to exploratory graph visualization.
A plethora of graph layout methods and algorithm exist in the literature and gener-ally a graph layout methods is optimized for a specific type of graph (tree, coumpound, directed graph…). An extensive study of existing graph layout methods is detailled by Von Landesberger et al., 2011 in section 3. We can note a few classes of layout techniques for static graphs studied by Von Landesberger et al., 2011:
• Space-Filling techniques: They typically try to use the full area of the display to present the hierarchy of a graph (see figure 2.2a). Here, the spatial position of nodes are employed to display edges between nodes by using either closeness or enclosure.
• Matrix-Representation: These techniques displays the adjacency matrix of a given graph. Here each row and column displays a node and the links and attribute are displayed by a cell (see figure 2.2b).
• Node-Link Techniques: They are by far the most used layout methods for graphs. Here, the idea is to represents nodes as points and links as edges (see figure 2.2c). Node-link layout are generally embedded in 2 or 3 dimensional spaces. Many techniques exist to optimize the aforementionned layout criteria. Von Landes-berger et al., 2011 distinguish force-based, constraint-based, multi-scale, layered, and non-standard layouts.

Table des matières

1 Introduction
1.1 Context and Motivation
1.2 Research Questions and Approach
1.3 Thesis Outline
1.4 Introduction (Français)
I Bundling in the Information Visualization Context
2 Clutter Reduction for Node-Link Diagrams
2.0 Résumé (Français)
2.1 From Data to Point/Line Visualization
2.2 Taxonomy of Clutter Reduction
2.3 Definition of the Bundling Operator
2.4 Summary
3 A Data-based Taxonomy of Bundling Methods
3.0 Résumé (Français)
3.1 Graph Bundling Methods
3.2 Trail-Set Bundling Methods
3.3 Summary
II Unifying, Understanding and Improving Bundling Techniques
4 Towards a Unified Bundling Framework
4.0 Résumé (Français)
4.1 Similarity Definition
4.2 Bundling Operator Definition
4.3 Bundling Visualization
4.4 Future Challenges
4.5 Summary
5 Bundling Tasks and Applications
5.0 Résumé (Français)
5.1 Task Support
5.2 Typical Bundling Use-cases
5.3 Quality and Faith-fullness of Bundling
5.4 Summary
6 Improving Bundling Scalability
6.0 Résumé (Français)
6.1 Kernel Density Estimation Based Bundling Methods
6.2 FFTEB Framework
6.3 Example Applications
6.4 Discussion
6.5 Summary
III Application to the Study of Alzheimer Disease
7 A Visual Analytics Approach for Digitizing and Cleaning Health Data
7.0 Résumé (Français)
7.1 Related Work
7.2 Definitions
7.3 A User Centered Design Process
7.4 Errors and Uncertainties
7.5 Design Requirements
7.6 Tools
7.7 Scenario
7.8 Results
7.9 Summary
8 Bundling Cognitive Maps
8.0 Résumé (Français)
8.1 Preliminary Analysis
8.2 Visualizing and Bundling Fluency-tests
8.3 Exploration of Bundled Cognitive Maps
8.4 New Hypothesis on Alzheimer’s Disease
8.5 Summary
9 Conclusion
9.1 Summary and Contributions
9.2 Future Research
9.3 Conclusion (Français)
Bibliography