Fast and Slow Machine Learning

In recent years we have witnessed a digital revolution that has dramatically changed the way in which we generate and consume data. As of 2016, 90% of the world's data had been created in the previous two years alone, and the digital universe is expected to reach 44 zettabytes (44 trillion gigabytes) by 2020. This new paradigm of ubiquitous data has impacted different sectors of society, including government, healthcare, banking and entertainment, to name a few. Due to the extent of its potential, data has been dubbed “the oil of the digital era”.

The term Big Data is used to refer to this groundbreaking phenomenon. A continuous effort exists to delimit this term; a popular approach is the so-called “4 Vs” of big data: Volume, Velocity, Variety and Veracity. Volume refers to the scale of the data, which varies depending on the application. Velocity considers the rate at which data is generated and collected. Variety represents the different types of data, from traditional structured data to emerging unstructured data. Finally, Veracity corresponds to the reliability that can be attributed to the data.

Another significant factor is the dramatic growth of the Internet of Things (IoT), the ecosystem in which devices (“things”) connect, interact and exchange data. Such devices can be analog or digital, e.g. cars, airplanes, cellphones, etc. By 2013, 20 billion such devices were connected to the internet, and this number is expected to grow to 32 billion by 2020, an increase from 7% to 15% of the total number of connected devices. The contribution of IoT to the digital universe is considerable: data from embedded systems alone accounted for 2% of the world’s data in 2013 and is expected to reach 10% by 2020.

In its raw state, data contains latent information which can potentially be converted into knowledge (and value). However, as the amount of data and its complexity increase, its analysis becomes infeasible for humans. To overcome this, Artificial Intelligence is the field of study that aims to produce machines that can replicate human intelligence without its biological limitations.

The ultimate goal is to convert data into knowledge, which in turn translates into actions. In other words, we are interested in finding patterns in the data. Machine Learning, a sub-field of artificial intelligence, is a kind of data analysis that automates the process of finding and describing patterns in data by building analytical models. Machine learning is based on the assumption that machines can learn from data, identify patterns and make predictions with minimal human intervention. In its minimal form, the machine learning pipeline takes data as input, trains a model on it, and uses the model to make predictions.
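As an illustration of this minimal pipeline (collect data, build a model, make predictions), the following sketch uses the open source scikit-learn library; the bundled dataset and the decision tree classifier are arbitrary placeholders chosen for brevity, not the data or methods studied in this work.

```python
# Minimal machine learning pipeline: data -> model -> predictions.
# Illustrative sketch only; dataset and model are arbitrary choices.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Collect data (a bundled benchmark dataset stands in for real data).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 2. Build an analytical model that finds patterns in the training data.
model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)

# 3. Use the learned patterns to make predictions on unseen data.
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
```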

Artificial intelligence (including machine learning) is already a key element for competitiveness and growth in different industries; moreover, it has disrupted not only the private sector but the public sector as well. Several countries include artificial intelligence initiatives as a key axis for national development and competitiveness. For example, artificial intelligence is one of the pillars of Horizon 2020, the EU’s Research and Innovation programme that implements the Europe 2020 initiative for Europe’s global competitiveness. Similarly, in March 2018 France unveiled its national artificial intelligence strategy, which not only aims to strengthen the role of France in this field, but also proposes an ethical framework to regulate it.

Machine learning is not new; it is a well-established field of study with a strong scientific and technical background. However, the disruptive impact of big data and the challenges it poses have reinvigorated the research community, and have contributed to turning machine learning into a matter of great interest to the general public. Currently, one of the most active research directions in the ubiquitous data era is to provide the mechanisms to perform machine learning at scale. In this context, it is vital that learning methods keep pace with data, not only in terms of volume but also in terms of the speed at which it is generated and processed, in order to be useful to humans.

For example, online financial operations are becoming the norm in multiple countries. In this context, automatic fraud detection mechanisms are required to protect millions of users. Training is performed on massive amounts of data, so runtime becomes critical: waiting until a model is fully trained means that potential frauds may pass undetected. Another example is the analysis of communication logs for security, where storing all logs is impractical (and in most cases unnecessary). The requirement to store all data is an important limitation of methods that rely on multiple passes over the data.
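To make the contrast with multi-pass (batch) methods concrete, the sketch below trains a classifier incrementally with scikit-learn's partial_fit interface, visiting each batch of data exactly once and discarding it afterwards; the synthetic stream and the linear model are illustrative placeholders, not the fraud detection or log analysis systems mentioned above.

```python
# Single-pass (incremental) learning sketch: the model is updated as data
# arrives, so the full dataset never needs to be stored or revisited.
# Illustrative only; the stream and the model below are placeholders.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
model = SGDClassifier(random_state=42)
classes = np.array([0, 1])  # labels must be declared up front for partial_fit

def synthetic_stream(n_batches=100, batch_size=50, n_features=10):
    """Stand-in for an unbounded data stream (e.g. transactions or logs)."""
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, n_features))
        y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
        yield X, y

for X_batch, y_batch in synthetic_stream():
    # Test-then-train: evaluate on the incoming batch before learning from it.
    if hasattr(model, "coef_"):
        print(f"batch accuracy: {model.score(X_batch, y_batch):.2f}")
    model.partial_fit(X_batch, y_batch, classes=classes)
```

This test-then-train pattern, in which each incoming batch is first used for evaluation and then for training, is a common way to assess streaming models without setting aside a separate hold-out set.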

Data Science is an emerging interdisciplinary field that unifies statistics, data analysis and machine learning to extract knowledge and insights from data. One of the major contributors to the fast adoption of data science is the Open Source movement. The designation “open source” refers to something that people other than the author can modify and share because its design is publicly available; in the context of software development, it refers to the process used to develop computer programs. The machine learning community has benefited from an ample number of open source frameworks covering multiple topics and platforms (operating systems and programming languages). Among the advantages of open source research we can pinpoint:

• Reproducible research, an essential part of the scientific process.
• Faster development, since researchers can focus on the core elements of their work without getting sidetracked by technical details.
• Collaboration, fostered by providing a common platform on which a community can thrive.
• Democratization of machine learning by reducing the technical gap for non-expert individuals.
• Maintainability supported by the community rather than by isolated individuals or groups.

Table of Contents

Introduction
1 Introduction
1.1 Motivation
1.2 Challenges and Opportunities
1.3 Open Data Science
1.4 Contributions
1.5 Publications
1.6 Outline
2 Preliminaries and Related Work
2.1 Streaming Supervised Learning
2.1.1 Performance Evaluation
2.1.2 Concept Drift
2.1.3 Ensemble Learning
2.1.4 Incremental Learning
2.2 Over-Indebtedness Prediction
2.3 Missing Data Imputation
3 Over-Indebtedness Prediction
3.1 A Multifaceted Problem
3.1.1 Feature Selection
3.1.2 Data Balancing
3.1.3 Supervised Learning
3.1.4 Stream Learning
3.2 A Data-driven Warning Mechanism
3.2.1 Generalization
3.3 Experimental Evaluation
3.4 Results
3.4.1 Feature Selection
3.4.2 Data Balancing
3.4.3 Batch vs Stream Learning
4 Missing Data Imputation at Scale
4.1 A Model-based Imputation Method
4.1.1 Cascade Imputation
4.1.2 Multi-label Cascade Imputation
4.2 Experimental Evaluation
4.2.1 Impact on Classification
4.2.2 Scalability
4.2.3 Imputed vs Incomplete Data
4.2.4 Multi-label Imputation
4.2.5 Measuring Performance
4.3 Results
5 Learning Fast and Slow
5.1 From Psychology to Machine Learning
5.2 Fast and Slow Learning Framework
5.2.1 Fast and Slow Classifier
5.3 Experimental Evaluation
5.4 Results
Conclusion
