Flexible and Scalable Models For Clustering and Few-Shot Learning

The advancement of data acquisition technologies, through the Internet, digital media, social networks, video surveillance and growing business industries, has produced massive amounts of high-dimensional data. Handling these huge volumes of data is crucial for learning effective models, specific to the related applications, for automatic data analysis or predictive analytics. In this regard, machine learning comes into play as a very promising field, where methods are broadly categorized as supervised or unsupervised. In supervised learning, a predictive model is built from a set of labeled data points, whereas in the unsupervised setting the data are available but without labels. Unsupervised learning methods are difficult to design, and often lead to difficult optimization problems, due to inherent challenges such as the lack of prior knowledge. Supervised techniques, such as those based on deep learning, have achieved outstanding performance on classification tasks, given huge volumes of correctly labeled data, such as the well-known ImageNet (Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg & Fei-Fei, 2015), along with powerful and intensive computing resources (e.g., GPUs). However, collecting such colossal amounts of labeled data is often exhausting, and may not be possible in a breadth of applications, for instance in medical image analysis, where annotations from domain experts can be prohibitively costly and time consuming. There are also semi-supervised learning methods, which leverage unlabeled data along with small amounts of labeled samples that guide the training process. Overall, there is a clear consensus in the learning community that the future of artificial intelligence (AI) will transition towards more unsupervised forms of learning, which leverage information from large-scale amounts of unlabeled data.

In this thesis, we investigate bound-optimization based clustering methods, which integrate various constraints and regularizers (e.g., fairness constraints or Laplacian regularization), and can deal efficiently with large-scale applications. Specifically, we propose flexible models, which optimize multi-term functionals, each integrating clustering objectives with some constraints or regularizers. We further derive tight upper bounds (auxiliary functions) of these functionals to provide scalable (parallel-update) solutions, with convergence guarantees and competitive clustering performance. Furthermore, we address the interesting and challenging few-shot learning problem, where only a few labeled examples are available for training a model. Specifically, we propose a transductive Laplacian-regularized inference for few-shot tasks, which minimizes a quadratic binary-assignment function integrating supervision information from the few labeled samples with Laplacian regularization over the unlabeled samples. Our transductive few-shot inference can be viewed as a constrained graph clustering, and consistently outperforms state-of-the-art methods by significant margins, across different models, settings and datasets. Furthermore, it is very fast, and can be used for large-scale few-shot tasks.
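As a rough, generic illustration of this kind of transductive inference (not the exact algorithm developed in this thesis; the function names, the fixed-point update and the parameter `lam` are illustrative assumptions), the Python sketch below iterates a soft-assignment update that combines a unary term, computed from class prototypes of the few labeled shots, with a Laplacian term that encourages strongly connected unlabeled query points to receive similar assignments.

```python
import numpy as np

def _row_softmax(logits):
    # Numerically stable row-wise softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    z = np.exp(logits)
    return z / z.sum(axis=1, keepdims=True)

def transductive_assignments(queries, prototypes, W, lam=1.0, n_iters=20):
    """queries: (N, d) unlabeled query features; prototypes: (K, d) class
    prototypes computed from the few labeled shots; W: (N, N) non-negative
    affinity matrix among the queries; lam: weight of the pairwise term."""
    # Unary potentials: squared Euclidean distances to each class prototype.
    unary = ((queries[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)

    # Initialize soft assignments from the unary term alone.
    Z = _row_softmax(-unary)

    # Fixed-point iterations: each query is pulled toward its closest
    # prototype and toward the affinity-weighted assignments of its neighbors.
    for _ in range(n_iters):
        Z = _row_softmax(-unary + lam * (W @ Z))
    return Z  # (N, K) soft class assignments for the query points
```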

Clustering methods
Data clustering methods have been studied as a core part of unsupervised learning. These techniques have been used in a wide range of data analysis applications, such as segmentation, classification, business data analysis, market analysis, social network analysis, and many other computer vision, machine learning and data mining applications. The aim of a clustering method is to assign similar instances to the same group (cluster), given a pool of instances belonging to different groups. Among the numerous popular problems in unsupervised learning, there has been a surge of clustering algorithms that exploit the structure of a particular problem in some specialized field. Thus, a clustering algorithm that is successful in one specific application may not necessarily succeed in other applications. A large number of clustering methods appear in the literature, in various fields, including social science, biology, psychology, medical science, mathematics and computer science, to name a few. Therefore, it is very difficult to list, study and link all the existing clustering methods to date in an exhaustive taxonomy. Yet, several surveys of clustering methods are available (Hennig, Meila, Murtagh & Rocci, 2015; Xu & Tian, 2015).

With the very large number of clustering algorithms available in the literature, it is not possible to find a single algorithm capable of yielding the best accuracy in every application. If the clustering method can capture the underlying structure of the data, then one could expect good results. However, it is not possible to know the distribution and nature of the data a priori. The complexity of the problem is further compounded in the case of high dimensional applications, as in computer vision problems, where numerous images and videos are involved, even in a single application. There are several major challenges involved in clustering large-scale, high-dimensional data, e.g., images and/or other high-dimensional feature spaces:

– The well-known curse-of-dimensionality problem. It is not yet possible to visualize and represent high-dimensional data accurately. Also, complete enumeration of all sub-spaces in high dimensions is a very challenging task. Finding the nearest (or farthest) feature-space point from another given point becomes difficult as the number of dimensions grows. In fact, traditional distance measures between two data points may become uninformative in feature spaces of very high dimensions, as illustrated by the small numerical sketch after this list;
– The time and memory complexities increase with large-scale and high-dimensional applications, such as high-resolution image/video segmentation and large-scale image clustering, among many other interesting problems;
– The objective of clustering is to gather instances that are neighbors according to their high-dimensional feature-space values. However, when dealing with a large number of attributes (or feature dimensions), some of these might be irrelevant for a given cluster. This is also known as the local feature relevance problem, which states that different clusters might be found in different sub-spaces. Therefore, a global filtering of feature dimensions is not sufficient.
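The distance-concentration effect mentioned in the first point above can be observed numerically. The short Python snippet below (synthetic uniform data; the dimensions and sample size are arbitrary choices) measures the relative contrast between the farthest and nearest neighbors of a reference point; as the dimension grows, this contrast collapses, which is one concrete way in which plain Euclidean distances lose their discriminative power.

```python
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    X = rng.random((1000, dim))                    # uniform points in the unit cube
    dists = np.linalg.norm(X[1:] - X[0], axis=1)   # distances to one reference point
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:4d}  relative contrast (max - min) / min = {contrast:.3f}")
```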

Graph Based Clustering

Most of the clustering methods discussed earlier use standard dissimilarity measures between data points and prototypes, such as the Euclidean distance, so as to achieve a partition of the data. However, standard distance measures may not be effective in high-dimensional feature spaces. The use of the graph Laplacian from spectral graph theory has led to several widely used graph clustering methods, which can deal effectively with high dimensions, noise and outliers. This category of methods builds an affinity (similarity) matrix between pairs of data points, and uses the eigenvectors of the affinity matrix, or equivalently of the associated graph Laplacian, to find the clusters. The eigendecomposition of the Laplacian matrix makes it possible to capture the intrinsic non-linearity of the data, while preserving the locality of each cluster. Very popular methods belong to this class of spectral clustering techniques (Von Luxburg, 2007b), such as Normalized Cut (NC), which has been widely used in a breadth of computer vision problems, e.g., unsupervised image segmentation. These spectral clustering techniques are also closely related to very popular non-linear dimensionality reduction methods such as Laplacian Eigenmaps (Belkin & Niyogi, 2001).
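The sketch below is a minimal, self-contained version of this generic spectral-clustering pipeline: Gaussian-kernel affinities, the symmetric normalized Laplacian, and k-means on the spectral embedding. The bandwidth `sigma` and the use of scikit-learn's KMeans are illustrative assumptions, not choices made by the methods reviewed here.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, n_clusters, sigma=1.0):
    """Cluster the rows of X using Gaussian-kernel affinities and the
    symmetric normalized Laplacian."""
    # Pairwise Gaussian (RBF) affinities.
    sq_dists = cdist(X, X, metric="sqeuclidean")
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Symmetric normalized Laplacian: L_sym = I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt

    # Spectral embedding: eigenvectors of the smallest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(L_sym)
    embedding = eigvecs[:, :n_clusters]
    embedding /= np.linalg.norm(embedding, axis=1, keepdims=True) + 1e-12

    # Cluster the spectral embedding with k-means.
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedding)
```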

Normalized Cut (NC): The NC method (Shi & Malik, 2000) is one of the most popular clustering methods based on graph-theoretic formulations and pairwise affinities. NC is inspired by graph-partitioning methods based on the minimum-cut (min-cut) criterion, such as (Wu & Leahy, 1993). Let 𝐺(X, E) denote a graph, where X = {x1, x2, ..., x𝑁} is the set of 𝑁 vertices (or nodes), each corresponding to a data point, and E is the set of weighted edges, each connecting a pair of vertices. Each edge connecting x𝑝 and x𝑞 carries a non-negative weight given by some kernel-based similarity measure 𝑤(x𝑝, x𝑞). The similarity matrix is constructed as W = [𝑤(x𝑝, x𝑞)] ∈ ℝ^(𝑁×𝑁).
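Under the standard spectral relaxation of Shi & Malik (2000), a two-way NC solution is obtained from the second-smallest generalized eigenvector of (D − W)y = λDy, where D is the diagonal degree matrix of W. The sketch below assumes W has already been built (e.g., with a Gaussian kernel, as in the previous sketch) and simply thresholds that eigenvector; it is a minimal illustration rather than a production implementation.

```python
import numpy as np
from scipy.linalg import eigh

def normalized_cut_bipartition(W):
    """Split the graph with affinity matrix W into two clusters by solving
    the relaxed NC generalized eigenproblem (D - W) y = lambda * D y."""
    d = W.sum(axis=1)
    D = np.diag(d)
    L = D - W  # unnormalized graph Laplacian

    # Solve the generalized eigenproblem; eigenvalues are returned ascending.
    eigvals, eigvecs = eigh(L, D)

    # The second-smallest generalized eigenvector encodes the relaxed
    # bipartition; threshold it (here at zero) to obtain two clusters.
    y = eigvecs[:, 1]
    return (y > 0).astype(int)
```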

Table of Contents

INTRODUCTION
CHAPTER 1 LITERATURE REVIEW
1.1 Clustering methods
1.1.1 Prototype based clustering
1.1.2 Graph Based Clustering
1.1.3 Clustering with constraints
1.2 Few-shot Learning
1.2.1 Non parametric meta-learning
1.2.2 Optimization based meta-learning
1.2.3 Baselines based on regular training
1.2.4 Inductive vs Transductive inference
CHAPTER 2 SCALABLE LAPLACIAN K-MODES
2.1 Introduction
2.2 Concave-convex relaxation
2.3 Bound optimization
2.3.1 Mode updates
2.4 Experiments
2.4.1 Datasets and evaluation metrics
2.4.2 Implementation details
2.4.3 Clustering results
2.4.4 Comparison in terms of optimization quality
2.4.5 Running Time
2.5 Conclusion
2.6 Supplemental
2.6.1 Proof of Proposition 1
2.6.2 Convergence of SLK
CHAPTER 3 VARIATIONAL FAIR CLUSTERING
3.1 Introduction
3.2 Proposed bound optimization
3.3 Experiments
3.3.1 Datasets
3.3.2 Results
3.4 Conclusion
3.5 Supplemental
3.5.1 Proof of Proposition 2
3.5.2 Output clusters with respect to 𝜆
CHAPTER 4 LAPLACIAN REGULARIZED FEW-SHOT LEARNING
4.1 Introduction
4.2 Laplacian Regularized Few-Shot Learning
4.2.1 Proposed Formulation
4.2.2 Optimization
4.2.3 Proposed Algorithm
4.3 Experiments
4.3.1 Datasets
4.3.2 Evaluation Protocol
4.3.3 Network Models
4.3.4 Implementation Details
4.3.5 Results
4.3.6 Ablation Study
4.4 Conclusion
CONCLUSION
