Bayesian Probabilistic Model of Deep Convolutional Neural Networks

The Neural Networks

Artificial neural networks are brain-inspired systems that aim to replicate the way the human brain learns. Neural networks are currently the state of the art for image-based pattern recognition. Such a system can be trained to recognize images via a learning procedure, typically the error backpropagation algorithm, based on a set of labelled training images. It consists of a large number of interconnected processing nodes called neurons, which form so-called hidden layers located between the input and output layers. At the input, nodes correspond to input samples, e.g. the pixels of an image to be processed or classified. Node values in one layer are multiplied by weight parameters and summed to form the values of subsequent nodes in the network, up to the output layer, where the node values correspond to the predicted image label. Figure 1.1 shows the mathematical model corresponding to the biological neuron. Network learning, or training, consists of learning weight values that minimize the error of the task at hand, e.g. image classification, typically via the iterative backpropagation algorithm, although other methods can be used, e.g. a single pass of layer-wise estimation Kuo & Chen (2018); Gan et al. (2015).

Nodes in one layer are generally connected to all nodes in the next layer via trainable weight parameters, i.e. a fully connected neural network; however, this is generally computationally intractable for large input data such as images. The widely used convolutional neural network (CNN) addresses tractability by limiting connections, particularly in early layers, to small sets of shared weights that are equivalent to linear filters. Layers nearer to the output are typically fully connected. Information enters at the input layer and flows forward through the subsequent layers. Each layer consists of a set of nodes that compute the weighted sum of their inputs from the previous layer and pass it through a nonlinear function: each input is multiplied by the weight of its connection to the node, the products are summed, and the node's output is obtained by applying the function to this sum. This design is called a feedforward network LeCun et al. (2015), Hagan et al. (1996), Haykin (1994), Schmidhuber (2015). The simplest feedforward neural network is the single-layer perceptron, which has a single layer of weights. The weights are updated during training with a learning rule called gradient descent Rosenblatt (1958). Figure 1.2 shows the structure of the forward pass of a feedforward neural network.
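As a minimal sketch of the feedforward computation described above, the following pure-Python functions compute each node as a nonlinear function of the weighted sum of its inputs plus a bias. The layer sizes, the weight values, the weight layout, and the choice of a sigmoid activation are illustrative assumptions and are not taken from the text or from Figure 1.2.

```python
import math


def sigmoid(z):
    """Logistic activation; one possible choice for the nonlinear function."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases, activation=sigmoid):
    """Compute one fully connected layer.

    weights[j][i] is the weight from input i to node j (hypothetical layout).
    """
    outputs = []
    for node_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(activation(weighted_sum))
    return outputs


def forward(inputs, layers):
    """Pass the inputs through each (weights, biases) pair in order."""
    values = inputs
    for weights, biases in layers:
        values = dense_layer(values, weights, biases)
    return values


# Illustrative network: 2 inputs -> 2 hidden nodes -> 1 output node.
example_layers = [
    ([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.0]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                      # output layer
]
prediction = forward([1.0, 1.0], example_layers)
```

Each call to `dense_layer` corresponds to one layer of the network; stacking the calls in `forward` gives the flow of information from input to output.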

In Figure 1.2, the input 𝑋 is multiplied by the connection weights, summed with the other weighted inputs, and added to the bias value 𝑏1. The weighted sum of the inputs passes through the activation function 𝐹. The result of each node is multiplied by the weights of the next connections to form the output layer. Because the weights are drawn from a random distribution, they must be initialized so that the signal reaching each neuron is neither too large nor too small. Accordingly, the weights are initialized such that the variance remains the same from layer to layer Glorot & Bengio (2010); Joshi (2016). This is the Glorot uniform initializer, also known as the Xavier initializer. The backpropagation algorithm is used to train multi-layered networks efficiently by updating the weights iteratively using the gradient descent algorithm. In general, backpropagation calculates the gradient of the loss function, computes the weight updates, and passes them back through the network Hecht-Nielsen (1992), Rosenblatt (1961). Considering 𝑦 as the true label and ŷ as the network prediction, the loss function 𝐽 is calculated using the squared error loss between 𝑦 and ŷ. In this process, the weights of the connections between nodes are modified backward, from the output nodes to the input nodes, in order to reduce the difference between the output produced by the network and the output that is meant to be produced Rumelhart et al. (1988).
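The initialization and loss mentioned above can be sketched as follows. The sketch assumes the commonly used form of the Glorot uniform initializer, in which weights are drawn uniformly from [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)), and a squared error loss with a conventional factor of 1/2. The function names, the layer sizes, and the 1/2 factor are assumptions made for illustration.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, rng=None):
    """Return a fan_out x fan_in weight matrix drawn from U(-limit, limit).

    limit = sqrt(6 / (fan_in + fan_out)), which gives each weight a
    variance of 2 / (fan_in + fan_out).
    """
    rng = rng or random.Random()
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]


def squared_error(y, y_hat):
    """Squared error between a true label y and a prediction y_hat.

    The 1/2 factor is a common convention that simplifies the derivative;
    it is an assumption here.
    """
    return 0.5 * (y - y_hat) ** 2


# Hypothetical layer with 4 inputs and 3 output nodes, seeded for repeatability.
weights = glorot_uniform(fan_in=4, fan_out=3, rng=random.Random(0))
```

Scaling the sampling range by the number of incoming and outgoing connections is what keeps the variance of the signal roughly constant from one layer to the next.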

The multilayer neural network is typically trained with the stochastic gradient descent (SGD) learning rule. Weights are randomly initialized, then iteratively updated via alternating forward and backward passes of training data through the network. In the forward pass, a training image is sent through the network weight structure in order to generate the output. In the backward pass, the error or loss between the network output and the training label is computed, then backpropagated through the network in order to update the weights such that the output error is reduced. The process iterates through training items until convergence. Gradient descent can be used because the multilayer neural network is differentiable. The calculation relies on the chain rule of derivatives. The partial derivative of the loss function with respect to a particular weight, 𝜕𝐽/𝜕𝜔𝑖, gives the slope of the loss curve along that weight. Therefore, moving the weight in the direction opposite to the gradient reduces the loss. By the chain rule of differentiation, the derivative can be broken into a product of simpler derivatives Vink (2017). For example, applying the chain rule to 𝜔2 expresses 𝜕𝐽/𝜕𝜔2 as a product of such derivatives.
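The chain-rule computation and the SGD update can be sketched on a toy network. The network assumed here, h = F(w1·x + b1) followed by ŷ = w2·h with a sigmoid F and a squared error loss with a factor of 1/2, is a hypothetical stand-in and is not necessarily the network of Figure 1.2. The training pairs, starting weights, and learning rate are made up. Under these assumptions, the gradient for w2 is (ŷ − 𝑦)·h.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, b1, w2):
    """Assumed toy network: h = F(w1*x + b1), y_hat = w2*h."""
    h = sigmoid(w1 * x + b1)
    return h, w2 * h


def loss(y, y_hat):
    """Squared error with the conventional 1/2 factor (an assumption)."""
    return 0.5 * (y - y_hat) ** 2


def gradients(x, y, w1, b1, w2):
    """Chain-rule gradients of the loss J with respect to w1, b1 and w2."""
    h, y_hat = forward(x, w1, b1, w2)
    dJ_dyhat = y_hat - y                 # dJ/dy_hat
    dJ_dw2 = dJ_dyhat * h                # dJ/dy_hat * dy_hat/dw2
    dJ_dh = dJ_dyhat * w2                # dJ/dy_hat * dy_hat/dh
    dJ_dz = dJ_dh * h * (1.0 - h)        # times dh/dz for the sigmoid
    return dJ_dz * x, dJ_dz, dJ_dw2      # dz/dw1 = x, dz/db1 = 1


def sgd_train(samples, w1, b1, w2, learning_rate=0.5, epochs=200):
    """Update the weights one sample at a time, stepping against the gradient."""
    for _ in range(epochs):
        for x, y in samples:
            g1, gb, g2 = gradients(x, y, w1, b1, w2)
            w1 -= learning_rate * g1
            b1 -= learning_rate * gb
            w2 -= learning_rate * g2
    return w1, b1, w2


def total_loss(params):
    """Sum of the per-sample losses for params = (w1, b1, w2)."""
    return sum(loss(y, forward(x, *params)[1]) for x, y in samples)


samples = [(0.0, 0.2), (1.0, 0.8)]       # made-up (input, label) pairs
initial = (0.1, 0.0, 0.1)                # arbitrary starting weights
trained = sgd_train(samples, *initial)
```

Comparing `total_loss(initial)` with `total_loss(trained)` shows whether the updates reduced the error on the training pairs; the analytic gradients can also be checked against finite differences of `loss`.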

Deep Learning strength and challenges

Deep learning is largely responsible for the growth in computer vision and artificial intelligence. It gives computers the ability to classify images and recognize sound as well as humans do. Deep neural networks (DNNs) have many advantages. One advantage of DNNs over other machine learning algorithms is that there is no need for feature selection: DNNs can be fed raw data. Another is that they give the best results with unstructured data. A third is their efficiency in delivering high-quality results: a well-trained DNN can perform many tasks with a high level of precision Shchutskaya (2018), Lippi (2017). Besides these benefits, there are some major challenges in working with DNNs. One major problem is limited memory, which is one of the biggest challenges in deep neural networks today. Storing the large number of weights and activations requires dynamic random access memory (DRAM) devices with higher capacity. The memory must be large enough to store the input data, weight parameters and activations Hanlon (2017). Another challenge is the large amount of data needed to train a DNN model.
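To give a rough sense of the memory requirement described above, the following sketch counts the parameters of a single fully connected layer and converts the count to bytes. The layer size (4096 inputs and 4096 outputs) and the use of 32-bit floating-point storage are assumptions chosen for illustration.

```python
BYTES_PER_FLOAT32 = 4


def dense_parameter_count(n_inputs, n_outputs):
    """Weights plus one bias per output node of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


def memory_bytes(n_values, bytes_per_value=BYTES_PER_FLOAT32):
    """Storage needed for n_values numbers at the given precision."""
    return n_values * bytes_per_value


# Hypothetical layer: 4096 inputs fully connected to 4096 outputs.
params = dense_parameter_count(4096, 4096)
weight_memory_mib = memory_bytes(params) / 2**20
```

For this hypothetical layer the parameters alone occupy about 64 MiB, before counting the activations and the input data that must also be held in memory.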

The amount of training data required by a DNN is much larger than for other machine learning algorithms. Because the algorithm needs to learn about the domain, the model must be trained on a large amount of data and has a huge number of parameters to tune. The training data often include augmented data so that the model is robust and usable. Overfitting is another problem that may be encountered. When the algorithm models the training data too well, in other words overtrains, it learns the details and noise in the training data, which degrades the performance of the model on new data. Although there are ways to reduce overfitting, such as dropout and L1 or L2 regularization, modern neural networks still have a tendency to overfit Cogswell et al. (2015). The convolutional neural network (CNN) is an efficient form of deep neural network with a shared-weight architecture. The next section discusses this architecture.
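The regularization techniques named above (dropout, L1 and L2 regularization) can be sketched in a few lines. The penalty forms below follow common conventions, including a factor of 1/2 in the L2 penalty, and the dropout shown is the "inverted" variant that rescales the surviving activations; both choices, along with the weight values and the data-fit loss, are assumptions for illustration.

```python
import random


def l2_penalty(weights, lam):
    """L2 penalty: (lam / 2) * sum of squared weights."""
    return 0.5 * lam * sum(w * w for w in weights)


def l1_penalty(weights, lam):
    """L1 penalty: lam * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def dropout(activations, drop_prob, rng):
    """Inverted dropout: zero each activation with probability drop_prob
    and rescale the survivors so the expected value is unchanged."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


weights = [0.5, -1.0, 2.0]
data_loss = 0.3                                   # made-up data-fit loss
regularized_loss = data_loss + l2_penalty(weights, lam=0.01)
dropped = dropout([1.0] * 8, drop_prob=0.5, rng=random.Random(0))
```

The penalty terms discourage large weights by adding to the training loss, while dropout randomly silences nodes during training so that the network cannot rely on any single one.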

Table of Contents

INTRODUCTION
CHAPTER 1 RELATED WORK
1.1 The Neural Networks
1.1.1 Deep Learning in Neural Network
1.1.1.1 Deep Learning strength and challenges
1.1.2 The Convolutional Neural Network
1.1.2.1 Pooling
1.1.2.2 Data Augmentation and Regularization
1.1.2.3 Dropout
1.1.2.4 Batch Normalization
1.1.3 Several Common CNN Architectures
1.1.3.1 ResNet
1.1.3.2 DenseNet
1.1.3.3 Recurrent Neural Network
1.1.3.4 U-net Convolutional Network for Segmentation
1.2 Transfer Learning
1.3 Methods for Complexity Reduction
1.3.1 Low Rank methods
1.3.2 Weight Compression
1.3.3 Sparse Convolutions
1.3.4 Vector Quantization
1.3.5 Hashing
1.3.6 Pruning
1.4 Computation vs. Memory
1.5 Information Theory
1.5.1 Entropy
1.5.2 Joint and Conditional Entropy
1.5.3 Mutual Information
CHAPTER 2 METHODOLOGY
2.1 Bayesian Probabilistic Model of Deep Convolutional Neural Networks
2.1.1 Information Theory Analysis in Deep Convolutional Neural Networks
2.2 Using Principal Component to Reduce CNN Computation
2.2.1 Linear Subspace Model
CHAPTER 3 EXPERIMENTS
3.1 CENT Analysis
3.1.1 Analysis of CENT features efficiency using transfer learning
3.1.2 2D Classification of Visual Object Classes
3.1.2.1 Classification of 10 categories not used in VGG
3.1.2.2 Histogram of information and entropy
3.1.2.3 2D classification of two painting categories
3.1.2.4 2D classification of Alzheimer’s Disease vs. Healthy subjects
3.1.2.5 2D classification of easily fooled classes
3.1.3 3D Image Classification: Brain MRIs
3.1.3.1 3D CNN Architecture
3.1.3.2 Alzheimer’s Disease vs. Healthy Brains
3.1.3.3 Young vs. Old Brains
3.2 Classification performance over CNN reconstructed by principal component
CONCLUSION AND RECOMMENDATIONS
APPENDIX I METHODOLOGY WAVELET
APPENDIX II METHODOLOGY PCA
APPENDIX III RESULTS
BIBLIOGRAPHY