Accepted Manuscript
Ensemble of CNN for Multi-Focus Image Fusion
Mostafa Amin-Naji , Ali Aghagolzadeh , Mehdi Ezoji
PII: S1566-2535(18)30604-3
DOI: https://doi.org/10.1016/j.inffus.2019.02.003
Reference: INFFUS 1077
To appear in: Information Fusion
Received date: 29 August 2018
Revised date: 7 February 2019
Accepted date: 11 February 2019
Please cite this article as: Mostafa Amin-Naji, Ali Aghagolzadeh, Mehdi Ezoji, Ensemble of CNN for Multi-Focus Image Fusion, Information Fusion (2019), doi: https://doi.org/10.1016/j.inffus.2019.02.003
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Highlights:
Devise a new type of multi-focus dataset and a patch feeding strategy for the network
Propose three feeding modes for the images entering the network
Design an ensemble-learning-based CNN architecture for multi-focus image fusion
The cleanest initial decision map, requiring the fewest post-processing steps
The best fused-image quality among the other state-of-the-art methods
Ensemble of CNN for Multi-Focus Image Fusion
Mostafa Amin-Naji, Ali Aghagolzadeh1, and Mehdi Ezoji
Faculty of Electrical and Computer Engineering
Babol Noshirvani University of Technology, Babol, Iran.
Abstract
Convolutional Neural Network (CNN) based multi-focus image fusion methods have recently attracted enormous attention. They greatly improve the constructed decision map compared with the previous state-of-the-art methods operating in the spatial and transform domains. Nevertheless, these methods do not yield a satisfactory initial decision map and must undergo extensive post-processing to achieve an acceptable final decision map. In this paper, a novel CNN-based method assisted by ensemble learning is proposed. It is very reasonable to use various models and datasets rather than just one: ensemble learning based methods increase diversity among the models and datasets in order to reduce overfitting on the training dataset, and an ensemble of CNNs generally performs better than a single CNN. The proposed method also introduces a new, simple type of multi-focus image dataset. It merely changes the arrangement of the patches of the multi-focus datasets, which proves very useful for obtaining better accuracy. With this new arrangement, three different datasets, consisting of the original patches and their gradients in the vertical and horizontal directions, are generated from the COCO dataset. The proposed method therefore introduces a new network in which three CNN models, each trained on one of the three created datasets, construct the initial segmented decision map. These ideas greatly improve the initial segmented decision map of the proposed method, making it similar to, or even better than, the final decision maps of other CNN-based methods obtained after applying many post-processing algorithms. Many real multi-focus test images are used in the experiments, and the results are compared using quantitative and qualitative criteria. The experimental results indicate that the proposed CNN-based network is more accurate and produces a better decision map, without post-processing algorithms, than the other existing state-of-the-art multi-focus fusion methods that rely on many post-processing algorithms.
Keywords: Multi-Focus Image Fusion, Deep Learning, Convolutional Neural Network, Ensemble Learning
1. Introduction
In recent years, image fusion has been used in a wide variety of applications such as remote sensing, surveillance, medical diagnosis, and photography. The two major photographic applications of image fusion are the fusion of multi-focus images and the fusion of multi-exposure images [1]. This paper focuses on multi-focus image fusion, which is a subset of photography applications. The purpose of multi-focus image fusion is to gather the necessary information and focused areas from the input multi-focus images into a single image. Due to the limited depth of field of the optical lenses in CCD/CMOS cameras, it is difficult to capture a single image in which all components are in focus. Therefore, some areas of the images captured by camera sensors are blurred. Images with different depths of focus can be recorded using several cameras in a Visual Sensor Network (VSN) [1-4]. Since a camera
1 Corresponding author: Ali Aghagolzadeh, Faculty of Electrical and Computer Engineering, Babol Noshirvani University of
Technology, Babol, Iran, Post Code : 47148 – 71167.
E-mail Addresses: [email protected] (M. Amin-Naji), [email protected] (A. Aghagolzadeh), [email protected] (M. Ezoji)
generates a large amount of data compared to other sensors such as pressure or temperature sensors, there are limitations such as limited bandwidth, energy consumption, and processing time, which make it necessary to process the input images locally in order to decrease the amount of transmitted data [4-6]. For these reasons, many researchers are seeking effective methods for multi-focus image fusion.
1.1 Literature review and related works
Extensive research on multi-focus image fusion has been conducted in recent years, and the methods can be classified into two categories: transform domain and spatial domain [1-4, 7]. Commonly used transforms for image fusion are the Discrete Cosine Transform (DCT) and Multi-Scale Transforms (MST) [1-3, 7]. Many studies have been carried out in the DCT domain [2, 3, 8-13]. These works apply a focus-measurement criterion in the DCT domain to choose the suitable blocks from the divided blocks of the input multi-focus images. These methods are very suitable for real-time applications in VSNs, especially for JPEG images, but they suffer from blocking artifacts, which reduce the quality of the output fused image [2, 3]. In addition, multi-scale image fusion methods are very common and convenient [14-23]. Image fusion methods based on MSTs such as the Discrete Wavelet Transform (DWT) need a large number of convolution operations, and the quality of the output fused image is reduced by ringing artifacts at the edges of the image [2, 3, 7]. There are also image fusion methods based on sparse representation [1, 4, 19, 24, 27]. The framework of sparse representation (SR) based methods focuses on three key problems: sparse representation models, dictionary learning methods, and activity levels and fusion rules [24]. Zhang and Levine [25] introduced a prominent SR-based multi-focus image fusion method that applies a multi-task robust sparse representation model and spatial context information to the fusion of multi-focus images with mis-registration. Another class of multi-focus image fusion methods operates in the spatial domain. These methods usually fuse the multi-focus images at the pixel, region, or block level using the intensities of the input images directly. Several spatial-domain methods have been introduced in recent years [1, 28-49]. Many of them attempt to achieve a satisfactory decision map for creating the final fused image, but they have not yet produced an ideal focus map. In addition, all these methods create artifacts, blur the image, and reduce the image contrast.
Recently, Deep Learning (DL) has been thriving in several image processing and computer vision applications [50-52]. This is why DL-based methods have become an attractive topic in image fusion research [52-59]. Y. Liu et al. [54] were the first researchers to use a CNN for multi-focus image fusion; they used a Siamese architecture to compare focused and unfocused patches. Chaoben and Shesheng [55] introduced image segmentation-based multi-focus image fusion through a multi-scale convolutional neural network (MSCNN), which segments the focused and unfocused regions. They also introduced an all-convolutional neural network (ACNN) for multi-focus image fusion, which only replaced the max-pooling layers with strided convolution layers and did not offer a substantial contribution [56]. Nevertheless, these methods make many errors in their initial segmented decision maps. Tang et al. [57] proposed a pixel-wise convolutional neural network (p-CNN) for recognizing focused and unfocused pixels. They tried to create a new type of multi-focus dataset and to accelerate the construction of the fused image. However, their dataset does not provide better performance than the previous datasets, and the analysis in their paper is performed on tiny test multi-focus images. Their datasets are generated from the tiny 32×32 images of the CIFAR-10 dataset, considering many different focus conditions with 12 geometric types of 32×32 masks. We believe that such laborious dataset construction is not needed for the task of multi-focus image fusion, and these datasets remain limited by the tiny 32×32 source images. All of these state-of-the-art CNN-based multi-focus image fusion methods have greatly enhanced the decision map, but their initial segmented decision maps still contain many
errors. Along with the CNN-based multi-focus image fusion methods, fully convolutional networks (FCN) have also been utilized for multi-focus image fusion [58, 59]. K. Xu et al. [58] used an end-to-end fully convolutional two-stream network for multi-focus image fusion, with four convolution layers for each of the two input-image streams, which are then fused into one stream. The fused image is obtained by another four deconvolution layers following the convolution layers. Because the fused image is constructed without a decision map, it contains undesirable manipulated pixels that are not related to the source images. This network was not described in detail, and comprehensive, reliable experiments and comparisons were not given. X. Guo et al. [59] introduced a fully convolutional neural network for multi-focus image fusion. This network is much deeper than the other deep learning based methods for constructing the initial decision map: it includes eighteen convolution layers and three deconvolution layers. Consequently, their network has a huge number of weights that must be tuned during the training process. Their paper also involves laborious steps for creating the training dataset. They utilized PASCAL VOC 2012, a classification and segmentation dataset; since they need segmentation results to create their dataset, it may not contain all scenes found in nature. Moreover, the initial decision map of this method is very poor and does not show any superiority over the others, so a post-processing algorithm is required to polish it. In order to refine their inappropriate initial decision map, they utilize the fully connected conditional random field (CRF) [60], a method for multi-class image segmentation. In other words, the main share of the acceptable quality of their final decision map belongs to this additional algorithm, which is a separate issue from the deep learning based fusion network. All of these deep learning (CNN) based image fusion methods use many post-processing steps, such as segmentation, guided filtering, morphological operations (opening and closing), watershed, small region removal, and consistency verification (CV), on the initial decision map in order to enhance it and reach a satisfactory final decision map [41-44, 54-59]. Therefore, the large share of their good final decision map quality is due to extensive post-processing, which is entirely separate from their proposed CNN networks.
This paper introduces a new network for achieving a better initial segmented decision map than the others. The proposed method introduces a new architecture that uses an ensemble of three Convolutional Neural Networks (CNNs) trained on three different datasets. The proposed method also prepares a new, simple type of multi-focus image dataset that achieves better fusion performance than the other popular multi-focus image datasets. These ideas are very helpful for achieving an initial segmented decision map that matches or surpasses the initial segmented decision maps the other methods obtain only with extensive post-processing.
The rest of this article is organized as follows. Section 2 presents the proposed method, beginning with brief preliminaries on CNNs and ensemble learning. Section 3 compares the proposed method with previous algorithms through different experiments. Finally, conclusions are given in Section 4.
2. Proposed Method
In this section, the proposed method is introduced in detail, showing how an ensemble of CNNs can improve the initial segmented decision map. The first part briefly explains convolutional neural networks and ensemble learning. The second part describes the ensemble of CNNs for multi-focus image fusion together with the proposed way of creating simple and useful multi-focus image datasets. The third part details the proposed network architecture. The fourth part explains the fusion scheme applied with the trained network. Finally, the fifth part compares the complexity of deep learning based networks.
2.1 Preliminaries
Convolutional Neural Networks. A popular and well-known family of Deep Learning models is Convolutional Neural Networks (CNNs or ConvNets). CNNs are a special category of Artificial Neural Networks (ANN) designed for representing and processing data with a grid-like structure (e.g., images). Typically, a simple CNN architecture contains four kinds of layers: the convolutional layer (conv), the Rectified Linear Unit (ReLU), the pooling layer (subsampling), and the fully connected layer (FC) [61-63]. In a CNN, each convolutional layer transforms a volume of input images into a volume of feature maps; the next convolutional layer then transforms this volume of feature maps into another volume of feature maps by convolution operations with a set of filters. The convolution operation followed by the ReLU activation function in CNNs is expressed as below:

$$F_j = \max\Big(0,\ \sum_i X_i * K_{ij} + b_j\Big) \tag{1}$$

where $K_{ij}$ and $b_j$ are the convolutional kernel and the bias, respectively, and $*$ denotes the operation of convolution.
After the convolution operations, spatial pooling (e.g., max-pooling) is applied; further convolution layers are then arranged in the same manner. Finally, the FC layer is the last part of the CNN, which is simply a convolutional layer with a kernel of size 1×1. The general schematic diagram of a CNN is shown in Fig. 1.
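These building blocks can be illustrated in a few lines of PyTorch (the framework used later in Section 3); the layer sizes here are illustrative placeholders, not the proposed architecture:

```python
import torch
import torch.nn as nn

# A minimal conv -> ReLU -> max-pool -> FC stack, mirroring Eq. (1):
# each convolution computes sum_i(X_i * K_ij) + b_j, followed by max(0, .).
block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2),
)
x = torch.randn(1, 1, 32, 32)        # one gray-scale 32x32 patch
features = block(x)                  # -> volume of 64 feature maps (16x16)
fc = nn.Linear(64 * 16 * 16, 2)      # FC layer mapping to two class scores
scores = fc(features.flatten(1))
```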
The Ensemble Learning. Ensemble learning in neural networks is a learning paradigm in which several networks are trained on one or several datasets to solve the same problem. It is very reasonable to use various models and datasets instead of a single model and dataset: ensemble learning improves generalization compared with a single network or dataset [62-67]. There are many ideas and various ensemble methods in machine learning [66]. The simplest examples are hard voting and soft voting over the predictions of the individual models trained on the different datasets. Ensemble learning methods remain applicable and extendable to Deep Learning [67], as the following sketch of the two voting schemes illustrates.
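A minimal sketch of the two voting schemes; the `models` argument stands for any set of trained classifiers (here, hypothetically, networks trained on different datasets):

```python
import torch

def soft_vote(models, x):
    """Average the class probabilities of several trained models (soft voting)."""
    probs = [torch.softmax(m(x), dim=1) for m in models]
    return torch.stack(probs).mean(dim=0).argmax(dim=1)

def hard_vote(models, x):
    """Majority vote over each model's predicted labels (hard voting)."""
    labels = torch.stack([m(x).argmax(dim=1) for m in models])
    return labels.mode(dim=0).values
```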
2.2 The proposed approach of patch feeding to the ensemble of CNNs
The proposed method is based on the following facts:
Fact 1: Feeding the focused and unfocused patches simultaneously into the proposed network increases classification accuracy compared with feeding these patches separately.
Fact 2: The edges in unfocused patches are smoother than the corresponding edges in focused patches. Therefore, Gx and Gy, the gradient images in the horizontal and vertical directions, carry important information for detecting the focused patches.
Fact 3: Employing CNNs through ensemble learning and feeding the proposed datasets to the proposed architecture improves the accuracy of focused-patch detection and therefore enhances the initial decision map.
In the following, we briefly discuss how the above facts are applied in the proposed algorithm.
Figure 1. The general schematic diagram of a CNN.
Proposed Feeding Strategy and Constructing Training Dataset
The essential prerequisite is an appropriate multi-focus image dataset. Each of the previous CNN-based methods attempts to create a helpful dataset [54-59]. As far as we know, these methods feed either only focused or only unfocused patches, separately, to the multiple paths of their networks; i.e., each patch lies entirely in a focused or an unfocused area, so each path of the network knows nothing about the others. To resolve this problem, we suggest vertically concatenating the two patches extracted from the source images into a single patch. The schematic diagram of the dataset creation procedure, following the proposed patch feeding strategy of this paper, is shown in Fig. 2. As can be seen from Fig. 2, the created macro-patches contain both focused and unfocused areas of the image simultaneously. So, unlike other methods such as [54-59], each path of the proposed network sees both corresponding patches. In this paper, more than 2200 high-quality images of the COCO 2014 dataset [68] are used to create the training dataset. The images of the COCO dataset are captured from common objects in their natural context; some randomly selected sample images are shown in Fig. 3. The randomly selected COCO images are converted into gray-scale. In order to create unfocused conditions resembling real multi-focus images captured with a camera, each randomly selected COCO image is passed through four different Gaussian filters with a standard deviation of 9 and kernel sizes of 9×9, 11×11, 13×13, and 15×15. Therefore, five versions of each selected COCO image are obtained: the original image and four blurred versions. Then, the gradients in the horizontal (Gx) and vertical (Gy) directions are computed for each of these five versions. We create three groups containing, respectively, the original images and their Gx and Gy gradient images; thus, there are five versions of each randomly selected COCO image in each of the original, Gx, and Gy groups. Afterward, each image in these three groups is divided into 32×32 blocks or patches, producing a large number of patches in the original, Gx, and Gy groups. By construction, we know whether each patch comes from the non-blurred version of an image or from one of the four blurred versions. Using this prior knowledge to create our three proposed datasets, suppose PA and PB are 32×32 patches obtained from the non-blurred version of the image (A) and from one of the four blurred versions (B), respectively. As depicted in the schematic diagram of Fig. 2, our proposed method creates a 64×32 macro-patch by vertically concatenating PA and PB, so the upper and lower halves of this macro-patch come from the non-blurred version of the image and from one of the four blurred versions, respectively. This macro-patch is known as upper-focused data and is labeled 0. In the same way, the vertically mirrored macro-patch is known as lower-focused data and is labeled 1. The same procedure applied to the gradient images in the horizontal (Gx) and vertical (Gy) directions yields the macro-patches of the Gx and Gy datasets. By randomly selecting more than 2200 high-quality COCO images and following the procedure of Fig. 2, 1,000,000 macro-patches for training and 300,000 macro-patches for testing are generated for each of the original, Gx, and Gy datasets. So, in accordance with Fact 2 above, we construct the corresponding macro-patch for each mode of the input sources, i.e. the original, Gx, and Gy datasets, separately. A code sketch of this construction is given below.
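A minimal sketch of the construction, assuming OpenCV Gaussian blurring and Sobel gradients; the function name and the patch-tiling details are illustrative, not taken from the authors' released code:

```python
import cv2
import numpy as np

def make_macro_patches(gray, b=32):
    """Build labeled 64x32 macro-patches for the original, Gx, and Gy datasets."""
    # Four blurred versions: standard deviation 9, kernel sizes 9..15 (as in the paper).
    blurred = [cv2.GaussianBlur(gray, (k, k), 9) for k in (9, 11, 13, 15)]
    samples = {"orig": [], "gx": [], "gy": []}
    h, w = gray.shape
    for blur in blurred:
        versions = {
            "orig": (gray, blur),
            # Gradient images of the sharp and blurred versions (Fact 2).
            "gx": (cv2.Sobel(gray, cv2.CV_32F, 1, 0), cv2.Sobel(blur, cv2.CV_32F, 1, 0)),
            "gy": (cv2.Sobel(gray, cv2.CV_32F, 0, 1), cv2.Sobel(blur, cv2.CV_32F, 0, 1)),
        }
        for name, (sharp_img, blur_img) in versions.items():
            for r in range(0, h - b + 1, b):
                for c in range(0, w - b + 1, b):
                    pa = sharp_img[r:r + b, c:c + b]   # focused patch PA
                    pb = blur_img[r:r + b, c:c + b]    # unfocused patch PB
                    # Upper-focused macro-patch -> label 0; its mirror -> label 1.
                    samples[name].append((np.vstack([pa, pb]), 0))
                    samples[name].append((np.vstack([pb, pa]), 1))
    return samples
```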
After changing the arrangement of the focused and unfocused parts for the proposed method, the question is which type of dataset images is best for training. The best results in machine learning are obtained when the algorithm combines the advice of different networks trained on different datasets. Our proposed method prepares three suitable multi-focus image datasets, one for each mode of information source, according to the new arrangement of Fig. 2. All three datasets carry good features and information that help the proposed networks perform the classification.
Ensemble learning methods can even be extended into CNN models, where layers of convolutions are concatenated together to produce a better representation. This makes it possible to train the proposed architecture with various
models on the various datasets, then concatenate these models together, and finally combine their predictions into a final prediction. Combining all of these ideas helps achieve higher accuracy and capability for classification. Therefore, the original, Gx, and Gy datasets created according to the proposed patch feeding are a very appropriate basis for the three feeding modes of the proposed ensemble-based architecture.
2.3 The Proposed Network Architecture
The schematic diagram of the proposed architecture is shown in Fig. 4. There are many possible architectures for CNN models, but this paper implements simple CNNs in order to keep the problem simple. It would be wise to take the
Figure 2. The schematic diagram of generating the three datasets according to the proposed patch feeding, as used in the training procedure.
Figure 3. Some sample images used for training from the COCO dataset [68].
easiest and most straightforward way to obtain satisfactory results. This paper uses only a few convolution layers, because multi-focus image fusion is simple compared with advanced tasks such as object detection and semantic segmentation.
The proposed architecture contains convolution layers with a kernel size of 3×3, a stride of 1×1, a padding size of 1×1, and the non-linear activation function ReLU. Max-pooling of 2×2 is used in this architecture. The proposed network can be divided into five paths. The original dataset is fed to the network in the first path, and the Gx and Gy datasets are fed to the network in the second and third paths, respectively. The outputs of the second and third paths are concatenated in the fourth path; after that, the outputs of the first and fourth paths are concatenated in the fifth path. In the fifth path, the FC layer is mapped to the final two neurons for detecting the focused and unfocused labels. The details of these paths are as follows (a code sketch is given after the list):
The input 32×64 macro-patches are from the original, Gx, and Gy datasets.
Path #1:
1- The 32×64 macro-patch of the original dataset is fed to the first convolution layer for obtaining 64 feature maps.
The result of this layer has the volume of 32×64×64.
2- The volume of 32×64×64 is fed to the second, third, and fourth convolution layers with 128, 128, and 256 filters,
respectively. For these convolution layers, the max-pooling of 2×2 is used. Then the volume of 4×8×256 is
achieved at the end of the fourth convolution layer.
Path #2:
1- The 32×64 macro-patch of Gx dataset is fed to the first convolution layer for obtaining 64 feature maps. The
result of this layer is the volume of 32×64×64.
2- The volume of 32×64×64 is fed to the second and third convolution layers with 128 and 128 filters, respectively.
For these convolution layers, the max-pooling of 2×2 is used. Then the volume of 8×16×128 is achieved at the
end of the third convolution layer.
Path #3:
1- The 32×64 macro-patch of Gy dataset is fed to the first convolution layer for obtaining 64 feature maps. The
result of this layer has the volume of 32×64×64.
2- The volume of 32×64×64 is fed to the second and third convolution layers with 128 and 128 filters, respectively.
For these convolution layers, the max-pooling of 2×2 is used. Then the volume of 8×16×128 is achieved at the
end of the third convolution layer.
Path #4:
1- The two output volumes of 8×16×128 from path #2 and path #3 are concatenated to construct the volume of 8×16×256.
2- The volume of 8×16×256 is fed to a convolution layer with 256 filters, followed by the max-pooling of 2×2, to achieve the volume of 4×8×256.
Path #5:
1- The two output volumes of 4×8×256 from path #1 and path #4 are concatenated to achieve the volume of 4×8×512. This volume is flattened to 1×16384, forming the Fully Connected (FC) layer. This FC layer is mapped to the two neurons for the final prediction, which indicate the focused and unfocused labels.
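A minimal PyTorch sketch of this five-path design, under our reading of the layer list above (channel counts and pooling placement are taken from the list; details of the authors' implementation may differ):

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, pool=True):
    layers = [nn.Conv2d(cin, cout, 3, stride=1, padding=1),
              nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class ECNNSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Path #1 (original): conv 64 (no pool), then 128, 128, 256 with pooling.
        self.p1 = nn.Sequential(conv_block(1, 64, pool=False),
                                conv_block(64, 128), conv_block(128, 128),
                                conv_block(128, 256))
        # Paths #2 and #3 (Gx, Gy): conv 64 (no pool), then 128, 128 with pooling.
        def grad_path():
            return nn.Sequential(conv_block(1, 64, pool=False),
                                 conv_block(64, 128), conv_block(128, 128))
        self.p2, self.p3 = grad_path(), grad_path()
        # Path #4: concatenated Gx/Gy features -> conv 256 with pooling.
        self.p4 = conv_block(256, 256)
        # Path #5: FC layer, 4*8*512 = 16384 inputs -> 2 class scores.
        self.fc = nn.Linear(4 * 8 * 512, 2)

    def forward(self, x_orig, x_gx, x_gy):
        f1 = self.p1(x_orig)                                     # 256 maps, 4x8
        f4 = self.p4(torch.cat([self.p2(x_gx), self.p3(x_gy)], dim=1))
        return self.fc(torch.cat([f1, f4], dim=1).flatten(1))
```

Feeding a 32×64 macro-patch through this sketch reproduces the volumes listed above: 8×16×128 at the end of paths #2 and #3, and 4×8×256 at the end of paths #1 and #4.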
With this procedure, the proposed architecture is trained on the created 1,000,000 macro-patches of the original, Gx, and Gy datasets. To normalize the input patches, the means and variances of the 1,000,000 original, Gx, and Gy patches of the training datasets are calculated as µ1=0.45, σ1=0.1, µ2=0.05, σ2=0.09, and µ3=0.06, σ3=0.09, respectively. The proposed network is trained with stochastic gradient descent (SGD) with a learning rate of 0.0002, a momentum of 0.9, and a weight decay of 0.0005. A StepLR scheduler with a step size of 1 and a gamma value of 0.9 is also used. A batch
size of 64 is selected for training the proposed network. The cross-entropy loss is used as the criterion of the proposed network, and batch normalization is used during training. Owing to the three types of useful datasets and the ensemble-learning-based design of the network, training on the multi-focus patch datasets of this paper is very fast, and the network can be learned quickly. The classification accuracy of the trained network is 99.794% on the 1,000,000 training macro-patches and 99.786% on the 300,000 test macro-patches. These settings translate directly into the short training loop sketched below.
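A minimal PyTorch training loop under the stated hyper-parameters; `train_loader` (yielding batches of original, Gx, and Gy macro-patches with labels) and `num_epochs` are hypothetical placeholders, since the paper does not describe them:

```python
import torch.nn as nn
import torch.optim as optim

model = ECNNSketch()                      # the architecture sketch above
criterion = nn.CrossEntropyLoss()         # training criterion, as in the paper
optimizer = optim.SGD(model.parameters(), lr=0.0002,
                      momentum=0.9, weight_decay=0.0005)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)

num_epochs = 10                           # assumed; not reported in the paper
for epoch in range(num_epochs):
    for x_orig, x_gx, x_gy, labels in train_loader:   # batch size 64
        optimizer.zero_grad()
        loss = criterion(model(x_orig, x_gx, x_gy), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```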
2.4 The fusion scheme
For a simple description of the proposed method, two images A and B are considered; a part of image A is focused while the same part in image B is unfocused. It is assumed that the input images were aligned by an image registration method before the image fusion process. If the input multi-focus images are color images, they are converted into gray-scale for constructing the decision map; after the decision map is constructed, the color multi-focus images can be fused. The proposed method can easily be extended to fuse more than two input images. The input multi-focus images A and B are fed into the pre-trained network according to the patch feeding strategy used for creating the three datasets in Fig. 2. The patches extracted from the input multi-focus images overlap, so that feeding the pre-trained network simulates pixel-wise image fusion. The pre-trained network then returns the labels 0 and 1, which indicate the focused and unfocused labels, respectively. With this procedure, each pixel contributes several times to the focused/unfocused decision. Every macro-patch fed to the network updates the score map of the input multi-focus images according to the proposed fusion rule (2).
Figure 4. The schematic diagram of the proposed ECNN architecture with all details of the CNN models.
$$M(r:r+b,\ c:c+b) = \begin{cases} M(r:r+b,\ c:c+b) + 1, & \text{if the output label is } 0 \\ M(r:r+b,\ c:c+b) - 1, & \text{if the output label is } 1 \end{cases} \tag{2}$$

where r and c indicate the row and column of the input images, and M is the score map from which the decision map is derived. Also, b is the width and height of the patches extracted from the input multi-focus images for constructing macro-patches; each patch must be resized to 32×32 before being fed to the pre-trained network. The value of b can be set to 16 for tiny images and 32 for large images. Then, the initial segmented decision map of the proposed method is constructed as below:

$$M(r, c) = \begin{cases} 1, & \text{if } M(r, c) \geq 0 \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

Finally, the final fused image is calculated as below:

$$F(r, c) = M(r, c)\, A(r, c) + \big(1 - M(r, c)\big)\, B(r, c) \tag{4}$$

where A(r,c) and B(r,c) are the input multi-focus images. A code sketch of this scheme follows.
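A sketch of the sliding-window fusion of Eqs. (2)-(4); the stride, the gradient operator, and the border handling are our assumptions, and the b=16 case with resizing to 32 is omitted for brevity:

```python
import numpy as np
import torch

def fuse(model, A, B, b=32, stride=2):
    """Fuse two registered gray-scale images with a trained three-path network."""
    A, B = A.astype(np.float32), B.astype(np.float32)
    gyA, gxA = np.gradient(A)          # vertical and horizontal gradients of A
    gyB, gxB = np.gradient(B)
    H, W = A.shape
    score = np.zeros((H, W), dtype=np.float32)

    def macro(top, bot, r, c):         # 64x32 macro-patch as a 1x1xHxW tensor
        m = np.vstack([top[r:r + b, c:c + b], bot[r:r + b, c:c + b]])
        return torch.from_numpy(m[None, None])

    with torch.no_grad():
        for r in range(0, H - b + 1, stride):       # overlapping patches
            for c in range(0, W - b + 1, stride):
                out = model(macro(A, B, r, c),
                            macro(gxA, gxB, r, c), macro(gyA, gyB, r, c))
                label = out.argmax(dim=1).item()
                score[r:r + b, c:c + b] += 1.0 if label == 0 else -1.0  # Eq. (2)
    M = (score >= 0).astype(np.float32)             # Eq. (3)
    return M * A + (1 - M) * B                      # Eq. (4)
```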
This proposed method avoids the complexity of the previous state-of-the-art CNN- and FCN-based methods [54-59] in creating the initial and final decision maps. Those methods use extensive post-processing to compensate for the shortcomings of their initial segmented decision maps. If post-processing is needed, it can simply be applied to the initial decision map of the proposed method; however, this post-processing is not needed for every input image, because the initial segmented decision map of the proposed method already has satisfactory quality without any post-processing algorithm. The upcoming results show that the initial segmented decision map of the proposed method is much better than those of the other methods, even when they apply many post-processing algorithms to their initial segmented decision maps. The flowchart of the proposed method (ECNN) for obtaining the initial segmented decision map of an example multi-focus image is shown in Fig. 5.
2.5 Complexity comparison of deep learning based networks
In the last decade, GPU hardware has become widespread, encouraging researchers to apply deep learning to image processing and computer vision applications. Accordingly, deep learning based multi-focus image fusion methods have greatly enhanced the decision map and the quality of the fused image. These deep learning based fusion networks are implemented and trained in various frameworks such as PyTorch, Caffe, and TensorFlow, and most of their training and fusion source codes are not provided. There is therefore no way to compare the running times of the deep learning based methods under fair conditions. The fairest feasible complexity comparison for deep learning based multi-focus fusion methods is thus based on the number of weights of the main network instead of a running-time comparison.
Figure 5. The flowchart of the proposed ECNN method for obtaining the initial segmented decision map of multi-focus image fusion.
We calculate the weights and biases of a network with the following procedure. For each convolution layer, the number of weights is W×H×C×F, where W and H are the width and height of the kernel, C is the number of input channels, and F is the number of kernels (filters) of the convolution layer. In the first convolution layer, C equals the number of image or patch channels (C=1 for gray-scale and C=3 for RGB patches); in subsequent layers, C equals the number of kernels of the previous convolution layer. For each convolution layer, the number of biases equals the number of kernels. For calculating the number of weights of a fully connected (FC) layer, there are two cases. In the first case, the FC layer is connected to a convolution layer; the number of weights is then OW×OH×C×N, where OW, OH, and C are the width, height, and number of kernels of the output volume of the last convolution layer, and N is the number of neurons in the FC layer. In the second case, the FC layer is connected to another FC layer; the number of weights is then N1×N2, where N1 and N2 are the numbers of neurons of the previous and current FC layers, respectively. In both cases, the number of biases equals the number of neurons of the FC layer. With this procedure, we calculated the numbers of weights and biases of the proposed network and of four previous state-of-the-art deep learning based multi-focus image fusion methods; they are listed in Table 1. The number of weights of the proposed network is remarkably lower than those of CNN [54] and FCN [59], but higher than those of MSCNN [55] and p-CNN [57]. These weights and biases must be tuned during the training process, and the input multi-focus patches must pass through these parameters of the pre-trained network afterwards. Besides this, it is important to remember that the proposed network produces the cleanest initial segmented decision map among the compared methods and therefore has the least need of heavy post-processing, whereas the previous state-of-the-art deep learning based methods depend strongly on post-processing steps for refining the initial decision map. Consequently, our proposed network does not need to spend extra time and computation on post-processing to refine the initial segmented decision map. A sketch of the counting procedure is given below.
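The counting procedure above can be reproduced in a few lines; the layer list corresponds to the architecture of Section 2.3, and this script reproduces the ECNN entry of Table 1 (1,582,784 weights and 1,474 biases):

```python
def conv_params(w, h, c, f):
    """Weights and biases of a convolution layer: W*H*C*F weights, F biases."""
    return w * h * c * f, f

def fc_params(n_in, n_out):
    """Weights and biases of an FC layer: N1*N2 weights, N2 biases."""
    return n_in * n_out, n_out

# 3x3 conv layers of the five paths (in_channels, out_channels), per Section 2.3.
convs = [(1, 64), (64, 128), (128, 128), (128, 256),   # path #1
         (1, 64), (64, 128), (128, 128),               # path #2
         (1, 64), (64, 128), (128, 128),               # path #3
         (256, 256)]                                   # path #4
weights = biases = 0
for cin, cout in convs:
    w, b = conv_params(3, 3, cin, cout)
    weights += w; biases += b
w, b = fc_params(4 * 8 * 512, 2)                       # path #5 FC layer
weights += w; biases += b
print(weights, biases)                                 # 1582784 1474
```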
3. Experimental Results
This section discusses the performance of the proposed method and presents the simulation results for the proposed method and the other state-of-the-art methods used for comparison. The proposed algorithm is coded and trained using PyTorch (0.4.0) on Ubuntu Linux 16.04 LTS. The hardware is a Core i7 6900K CPU with 32 GB RAM and a STRIX-GTX1080-O8G GPU. The simulation results of the proposed method are compared with the results of state-of-the-art methods based on the spatial domain, multi-scale transforms, and CNN and FCN architectures. In order to compare the proposed method with the previous methods under fair conditions, we used the experimental results reported in recent papers, together with their fusion quality metrics and the non-referenced multi-focus images captured with real cameras. The 25 state-of-the-art methods compared with our proposed method are GFF [43], IMF [44], DSIFT [40], BFMM [42], MWGF [21], SSDI [41], SRCF [4], MSTSR [27], CBF [48], CSR [26], IFGD [23], GIF [37], DCHWT [18], ICA [49], NSCT [22], PCNN [46, 47], PCA [29], DCTLP [12], DCTV [3], WSSM [19], CNN [54], MSCNN [55], p-CNN [57], CAB [33], and FCN [59]. For some of these methods, the available source codes of their algorithms were used to obtain the initial decision maps and the fused images; for the rest, the results reported in [33, 54, 55, 59] were used in this paper. The evaluation performance metrics of
Table 1.
Comparison of the number of weights and biases between the deep learning based networks of the proposed method and the others.

Method    CNN [54]    MSCNN [55]   p-CNN [57]   FCN [59]     ECNN (proposed)
Weights   4,933,248   803,968      304,902      16,813,184   1,582,784
Biases    1,154       898          235          12,430       1,474
image fusion used in the previous methods [33, 54, 55, 59] are used here to assess our proposed method and compare it with the previous methods. These fusion metrics and the test multi-focus images were obtained directly through kind correspondence with their authors. This paper uses several non-referenced image fusion quality metrics: the total information transferred from the source images to the fused image QAB/F [69, 70], the similarity-based quality metric Yc or Q(A,B,F) [71], mutual information (MI), the phase congruency-based fusion metric QPC [72], the structural similarity-based fusion metric QW [73], the human perception-based fusion metric QCB [74], normalized mutual information (NMI) [75], visual information fidelity (VIF) [77], feature mutual information (FMI) [76], and the nonlinear correlation information entropy (QNICE) [78]. Assessing the fusion process is very hard for non-referenced multi-focus images, whose ground-truth images are not available. Also, when the results of the state-of-the-art methods are close together, the values of the non-referenced fusion metrics are not reliable for judgment. Therefore, the most reliable way to compare the proposed method with the previous methods is visual comparison of its initial segmented decision map with the initial and final decision maps of the other methods. The critical point to consider is that the initial segmented decision maps are obtained without applying any post-processing algorithm, unlike the final decision maps. This paper shows that the initial segmented decision map of the proposed method matches or surpasses those of the others. An example sketch of one of these metrics is given below.
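As an illustration of one of these metrics, mutual information between a source image and the fused image can be estimated from a joint gray-level histogram; this is a minimal sketch, and the exact implementations used in the cited papers may differ:

```python
import numpy as np

def mutual_information(a, f, bins=256):
    """Estimate MI between a source image a and the fused image f."""
    joint, _, _ = np.histogram2d(a.ravel(), f.ravel(), bins=bins)
    p = joint / joint.sum()                  # joint probability
    pa, pf = p.sum(axis=1), p.sum(axis=0)    # marginals
    nz = p > 0                               # avoid log(0)
    outer = pa[:, None] * pf[None, :]
    return float((p[nz] * np.log2(p[nz] / outer[nz])).sum())

# The fusion MI reported in such tables is typically MI(A, F) + MI(B, F).
```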
The proposed ECNN method is applied to a wide range of well-known test multi-focus images that were recently used in many state-of-the-art methods for comparison. We applied ECNN to the 20 pairs of color multi-focus images of the Lytro dataset and achieved high-quality fused images, as shown in Fig. 6. We also applied ECNN to other well-known gray-scale and color test multi-focus images such as "Flower" and "Leopard", whose source images and ECNN fused images are also shown in Fig. 6. We now compare some samples of these images both qualitatively and quantitatively.
[Fig. 6 image grid: lytro-01 to lytro-20, Temple, Flower, Lab, Calendar, Book, Leopard, Desk-02, Clock, and Newspaper, each shown as source images (A, B) and the ECNN fused image (F).]
Figure 6. The Lytro dataset and other well-known multi-focus images used in the experiments, with the fused images of the proposed ECNN method. The symbols "A" and "B" stand for the source input multi-focus images, and the symbol "F" stands for the fused image of the proposed ECNN method.
Fig. 7 compares the initial and final decision maps of the different methods for the "Flower" color multi-focus images. The two multi-focus source images of "Flower" are shown in Figs. 7(a) and (b). The final decision map of MWGF is shown in Fig. 7(c); this method introduces ringing and jagged artifacts at the edges of the decision map. The initial and final segmented decision maps of IMF are shown in Figs. 7(d) and (e), respectively. Fig. 7(d) shows many errors even after applying the post-processing algorithms, and the final decision map of IMF shows undesirable results compared with the source images; it is not acceptable. The initial segmented and final decision maps of GFF are shown in Figs. 7(f) and (g), respectively; these decision maps are also inappropriate and unacceptable. The initial and final decision maps of BFMM in Figs. 7(h) and (i) are very undesirable and useless for fusing these images, because of the jagged artifacts at the edges of the input images. The decision map of DSIFT in Fig. 7(j) is irregular and also shows a spurious hole in the map. The initial segmented and final decision maps of SSDI are shown in Figs. 7(k) and (l), respectively; these maps are also irregular and have thick boundaries in the decision area. The initial segmented decision maps, and the final decision maps obtained after applying many post-processing algorithms, are shown in Figs. 7(m)-(r) for the CNN-based methods CNN, MSCNN, and p-CNN, respectively. These methods make many errors in the initial segmented decision map compared with the regions of the input multi-focus images, even though these are the best results reported in their papers [54-57]. The final decision maps of these methods are obtained after applying many post-processing steps to their initial segmented decision maps, such as consistency verification (CV), guided filtering, small region removal, watershed, and morphological filtering (closing and opening). Even so, their final decision maps in Figs. 7(n), (p), and (r) do not match the focused areas of the input multi-focus images well; the final decision maps of these methods therefore remain deficient. The initial segmented decision map of the proposed method, obtained without any post-processing algorithm, is shown in Fig. 7(s). It is clearly better than the other initial and final segmented decision maps, which were obtained after applying extensive post-processing to the initial segmented decision maps. The fused image of the proposed method (ECNN), produced from its initial segmented decision map, is shown in Fig. 7(t).
Figure 7. The initial and final segmentation maps (with and without post-processing) of the proposed method and the others for the "Flower" image. (a) The first image, (b) the second image, (c) MWGF [21], (d) IMF [44] without post-processing, (e) IMF [44] with post-processing, (f) GFF [43] without post-processing, (g) GFF [43] with post-processing, (h) BFMM [42] without post-processing, (i) BFMM [42] with post-processing, (j) SSDI [41] without post-processing, (k) DSIFT [40] without post-processing, (l) DSIFT [40] with post-processing, (m) CNN [54] without post-processing, (n) CNN [54] with post-processing, (o) MSCNN [55] without post-processing, (p) MSCNN [55] with post-processing, (q) p-CNN [57] without post-processing, (r) p-CNN [57] with post-processing, (s) the initial map of ECNN (the proposed method, without post-processing), (t) the fused image using ECNN (the proposed method).
In a similar experiment, Fig. 8 compares the initial segmented and final decision maps for the "Children" color multi-focus images. The final decision map of MWGF is shown in Fig. 8(c), which shows ringing and jagged artifacts at the edges of the decision map. The initial and final segmented decision maps of IMF are shown in Figs. 8(d) and (e), respectively; the final map shows undesirable side effects at the edge boundaries of the decision map. The initial and final segmented decision maps of GFF are shown in Figs. 8(f) and (g), respectively; these decision maps are also very inappropriate and not acceptable as good decision maps. The initial and final decision maps of BFMM are shown in Figs. 8(h) and (i), respectively; this method shows undesirable jagged artifacts at the boundaries of the final decision map. The decision map of DSIFT in Fig. 8(j) is also irregular. The initial segmented and final decision maps of SSDI are shown in Figs. 8(k) and (l), respectively; these maps also show unfavorable areas in the decision map. The initial and final segmented decision maps of CNN are shown in Figs. 8(m) and (n), respectively. The initial segmented decision map of this method shows many errors compared with the focused areas of the source input images. The final decision maps of CNN and MSCNN, obtained by applying many post-processing algorithms, are shown in Figs. 8(n) and (o), respectively. The initial segmented and final decision maps of FCN are shown in Figs. 8(p) and (q), respectively. The initial segmented decision map of FCN is very poor compared with the others; the first results of the FCN method cannot be considered suitable for fusion, so some post-processing algorithms must be used to polish its initial decision map. To refine this inappropriate initial decision map and achieve the final decision map depicted in Fig. 8(q), the authors utilize the fully connected conditional random field (CRF) [60], a method for multi-class image segmentation. Overall, the main share of the acceptable quality of their final decision map belongs to this additional algorithm, which is a separate issue from the deep learning based network for image fusion. The initial segmented decision map of the proposed method (ECNN) is shown in Fig. 8(r). As expected, the initial segmented decision map of the proposed method (without any post-processing) is much better than all the others' initial and final segmented decision maps; only the final decision map of MSCNN, obtained after applying many post-processing algorithms, comes close to the initial segmented decision map of the proposed method, which is obtained without any post-processing. The fused image of our proposed method is shown in Fig. 8(s) and shows the best quality among the compared methods.
In another experiment, the two multi-focus source images of "Lytro-10" are shown in Figs. 9(a) and 9(b). Here we compare the initial segmented decision map of our proposed method with the initial and final decision maps of MWGF, SSDI, IMF, GFF, BFMM, DSIFT, CNN, and CAB. The decision map of MWGF is shown in Fig. 9(c), which has ringing artifacts at the edges of the focused region. The final decision maps of SSDI, IMF, GFF, BFMM, DSIFT, and CNN are shown in Figs. 9(d)-(i), respectively. These decision maps are achieved after applying extensive post-processing steps, yet they still have undesirable side effects such as jagged artifacts and areas mistakenly taken from the focused regions of the source images. The initial segmented and final decision maps of the recently published CAB method are shown in Figs. 9(j) and (k), respectively. The initial decision map of this method has severe jagged artifacts and mistaken areas of the focused region at the edges of the source images; its final decision map also has some areas that are not related to the focused regions of the source images. The initial decision map of our proposed ECNN method, achieved without applying any post-processing, is shown in Fig. 9(l). It is clearly much cleaner than those of the other methods, which are achieved after extensive post-processing steps. The fused image of the proposed ECNN method, produced with this initial segmented decision map, is shown in Fig. 9(m).
Figure 8. The initial and final segmentation maps (with and without post-processing) of the proposed method and the others for the "Children" image. (a) The first image, (b) the second image, (c) MWGF [21], (d) IMF [44] without post-processing, (e) IMF [44] with post-processing, (f) GFF [43] without post-processing, (g) GFF [43] with post-processing, (h) BFMM [42] without post-processing, (i) BFMM [42] with post-processing, (j) SSDI [41], (k) DSIFT [40] without post-processing, (l) DSIFT [40] with post-processing, (m) CNN [54] without post-processing, (n) CNN [54] with post-processing, (o) MSCNN [55] with post-processing, (p) FCN [59] without post-processing, (q) FCN [59] with post-processing, (r) the initial map of ECNN without any post-processing (the proposed method), (s) the fused image using ECNN (the proposed method).
In the last qualitative comparison, we compare our proposed ECNN method with MWGF, SSDI, IMF, GFF, BFMM, and CNN on the gray-scale multi-focus images of "Leopard". The two multi-focus source images of "Leopard" are shown in Figs. 10(a) and (b). The decision map of MWGF is shown in Fig. 10(c), which has ringing artifacts at the edges of the decision map. The initial and final decision maps of SSDI are shown in Figs. 10(d) and (e); they have many areas mistakenly taken from the focused region of the source images. The initial and final decision maps of IMF are shown in Figs. 10(f) and (g); they have many undesirable regions, such as fading, in the decision map. The initial and final decision maps of GFF are shown in Figs. 10(h) and (i); they also have many undesirable and mistaken regions that are not related to the focused region of the source images. The initial and final decision maps of BFMM are shown in Figs. 10(j) and (k); they have many jagged artifacts and are not suitable as an ideal decision map. The initial and final decision maps of CNN are shown in Figs. 10(l) and (m); both have a large mistaken area of the focused region according to the source images. This experiment shows that methods that do not produce an acceptable initial decision map cannot achieve a suitable decision map even after applying many post-processing algorithms. The initial segmented decision map and the fused image of our proposed ECNN method are shown in Figs. 10(n) and (o). The initial segmented decision map of our proposed ECNN method, achieved without applying any post-processing, is much neater and cleaner than the others' initial and final decision maps.
Figure 9. The initial and final segmentation maps (with and without post-processing) of the proposed method and the others for the "Lytro-10" image. (a) The first image, (b) the second image, (c) MWGF [21], (d) SSDI [41] with post-processing, (e) IMF [44] with post-processing, (f) GFF [43] with post-processing, (g) BFMM [42] with post-processing, (h) DSIFT [40] with post-processing, (i) CNN [54] with post-processing, (j) CAB [33] without post-processing, (k) CAB [33] with post-processing, (l) the initial map of ECNN without any post-processing (the proposed method), (m) the fused image using ECNN (the proposed method).
Figure 10. The initial and final segmentation maps (with and without post-processing) of our proposed method and the others for the "Leopard" image. (a) The first source image, (b) the second source image, (c) MWGF [21], (d) SSDI [41] without post-processing, (e) SSDI [41] with post-processing, (f) IMF [44] without post-processing, (g) IMF [44] with post-processing, (h) GFF [43] without post-processing, (i) GFF [43] with post-processing, (j) BFMM [42] without post-processing, (k) BFMM [42] with post-processing, (l) CNN [54] without post-processing, (m) CNN [54] with post-processing, (n) the initial map of ECNN without any post-processing (the proposed method), (o) the fused image using ECNN (the proposed method).
As mentioned before, the best way to compare the fused images of different methods is visual, qualitative comparison. Because the ground truth of real multi-focus images is not available, we have to use non-referenced image fusion quality metrics; such quantitative assessments are not always as reliable as referenced metrics like MSE and SSIM. Nevertheless, we compare the proposed method with the others using the reported results of the non-referenced quality metrics for MSCNN, p-CNN, FCN, and CAB [55, 57, 59, 33]. In the first objective assessment, the proposed method is compared with MWGF, SSDI, CNN, DSIFT, and MSCNN using the quality metrics MI, QAB/F, and Q(A,B,F) in Table 2. Overall, the values in Table 2 indicate that the proposed method gives better results in most cases. In the second objective assessment, the proposed method is compared with GFF, IMF, CNN, DSIFT, BFMM, and p-CNN using the quality metrics QPC, QW, and QCB in Table 3. As expected from the qualitative comparison, the results of our proposed method are better than those of the other methods in the quantitative comparison. In the last quantitative assessment, we compare our proposed method with NSCT, GFF, IMF, CBF, DCHWT, MWGF, BFMM, DSIFT, CNN, FCN, WSSM, PCNN, DCTLP, MSTSR, DCTV, SRCF, GIF, IFGD, ICA, PCA, CSR, and CAB using the quality metrics MI, QAB/F, VIF, NMI, FMI, Yc, and QNICE in Table 4. In this experiment, we used the 20 pairs of color multi-focus images of the Lytro dataset; the average scores of these fusion metrics over the 20 pairs for these 22 methods are listed in Table 4. The scores of these metrics for our proposed ECNN method are higher than those of the other 22 methods.
Table 2
Comparison of objective quality metrics of our proposed multi-focus image fusion method and the others. (* from [55])

Test Images   Fusion Metrics   MWGF* [21]   SSDI* [41]   DSIFT* [40]   CNN* [54]   MSCNN* [55]   ECNN (proposed)
Lab           MI               8.0618       8.1412       8.2501        8.6008      8.8044        8.8531
              QAB/F            0.7147       0.7528       0.7585        0.7573      0.7588        0.7588
              Q(A,B,F)         0.8746       0.8823       0.9132        0.8947      0.9148        0.9831
Temple        MI               5.9655       7.0896       7.3514        6.8895      7.4177        7.3727
              QAB/F            0.7501       0.7634       0.7643        0.7590      0.7623        0.7675
              Q(A,B,F)         0.8992       0.9125       0.9138        0.9063      0.9251        0.9908
Seascape      MI               7.1404       7.4824       7.9487        7.6285      8.0214        8.3935
              QAB/F            0.7059       0.7110       0.7126        0.7113      0.7122        0.7377
              Q(A,B,F)         0.9366       0.9473       0.9452        0.9481      0.9547        0.9752
Book (color)  MI               8.2368       8.4008       8.6623        8.7796      8.8947        8.9319
              QAB/F            0.7240       0.7260       0.7134        0.7277      0.7284        0.7259
              Q(A,B,F)         0.9120       0.9221       0.9045        0.9374      0.9473        0.9830
Leopard       MI               9.9474       10.8887      10.9226       10.8792     10.9420       10.9400
              QAB/F            0.8175       0.8171       0.8069        0.7973      0.8267        0.8275
              Q(A,B,F)         0.9435       0.9325       0.9572        0.9218      0.9748        0.9933
Children      MI               8.2622       7.8505       8.5252        8.3338      8.5363        8.4414
              QAB/F            0.6741       0.6799       0.7394        0.7408      0.7384        0.7467
              Q(A,B,F)         0.8675       0.8752       0.9255        0.9263      0.9341        0.9882
Flower        MI               8.3255       8.1049       8.5365        8.2695      8.6125        8.5859
              QAB/F            0.6913       0.6490       0.7159        0.7183      0.7157        0.7221
              Q(A,B,F)         0.9460       0.9207       0.9479        0.9566      0.9689        0.9793
Overall, the methods selected among the 25 compared with our proposed method include the state-of-the-art methods that have shown the best multi-focus image fusion results in recent years. However, the initial segmented decision maps of these methods are undesirable and unacceptable compared with the focused areas of the source images, and their final decision maps are still unsatisfactory even after applying many post-processing algorithms to the initial decision maps. By visual observation, the initial segmented decision map of our proposed method (ECNN), without any post-processing algorithms, is remarkably better than those of the others, with or without post-processing algorithms. In the many quantitative comparisons conducted, our proposed ECNN method also shows the best results across 10 fusion metrics against 25 other methods.
Table 3
Comparison of objective quality metrics of the proposed multi-focus image fusion method and the others. (* from [57])

Test Images   Fusion Metrics   GFF* [43]   IMF* [44]   DSIFT* [40]   BFMM* [42]   CNN* [54]   p-CNN* [57]   ECNN (proposed)
Book          QPC              0.6822      0.6827      0.6631        0.6812       0.6829      0.6835        0.8195
              QW               0.6272      0.6264      0.5985        0.6361       0.6219      0.6162        0.9279
              QCB              0.7143      0.7358      0.7355        0.7277       0.7224      0.7359        0.7771
Calendar      QPC              0.6479      0.6451      0.6476        0.6493       0.6494      0.6495        0.7534
              QW               0.6842      0.6970      0.6903        0.6924       0.6868      0.6870        0.9189
              QCB              0.7033      0.7217      0.7153        0.7256       0.7255      0.7332        0.8030
Flower        QPC              0.7084      0.6876      0.7014        0.7032       0.7093      0.7090        0.7594
              QW               0.4781      0.4683      0.5009        0.4123       0.5049      0.5051        0.9198
              QCB              0.8048      0.8387      0.8333        0.7153       0.5051      0.8049        0.8270
Lab           QPC              0.6865      0.6876      0.7014        0.7032       0.7037      0.7046        0.7935
              QW               0.5100      0.7014      0.5009        0.4984       0.5012      0.5018        0.9163
              QCB              0.8344      0.7032      0.8333        0.8321       0.8356      0.8398        0.7489
Desk          QPC              0.7364      0.7246      0.7154        0.7270       0.7037      0.7360        0.7796
              QW               0.7038      0.6743      0.6721        0.6745       0.7037      0.7269        0.9036
              QCB              0.5714      0.5585      0.5619        0.5628       0.5694      0.5720        0.7602
Newspaper     QPC              0.2195      0.1769      0.1849        0.1960       0.1845      0.1865        0.6483
              QW               0.6207      0.6171      0.6252        0.6182       0.6270      0.6273        0.7722
              QCB              0.7413      0.7441      0.7484        0.6732       0.7503      0.7512        0.7450
Clock         QPC              0.7089      0.7110      0.6855        0.7098       0.7016      0.7130        0.9060
              QW               0.6123      0.6158      0.5791        0.7016       0.6180      0.6197        0.9311
              QCB              0.7426      0.7448      0.7497        0.7130       0.7511      0.7512        0.7831
Leopard       QPC              0.7215      0.7112      0.7079        0.7205       0.7207      0.7206        0.9514
              QW               0.8226      0.8192      0.5297        0.8230       0.8205      0.8232        0.9572
              QCB              0.8225      0.8584      0.7275        0.8270       0.8581      0.8601        0.8820
4. Conclusions
A new multi-focus image fusion method based on convolutional neural networks was introduced in this paper. The main idea of this method is to use an ensemble of three CNNs trained on three different datasets, so that ensemble learning helps the network to predict the decision map correctly. The proposed method also introduces a simple rearrangement of the multi-focus dataset patches that yields better accuracy.

In the qualitative and quantitative assessments, the obtained results strongly indicate that the initial segmented decision map is considerably better than those of all previous state-of-the-art methods. It was also shown that the initial segmented decision map of the proposed method is similar to, or even better than, the others' initial and final segmented decision maps obtained after applying many post-processing algorithms. The assessments and experiments were conducted on many well-known real non-referenced multi-focus images using standard fusion quality metrics, and they demonstrate the superior quality of the fused images produced by the proposed algorithm in comparison with the other state-of-the-art methods. The source code of our proposed method and all of the supplementary files will be provided on the personal website² and GitHub³ of this paper's authors.
2 www.amin-naji.com and www.imagefusion.com
3 www.github.com/mostafaaminnaji
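As a rough illustration of the ensemble idea summarized above, the sketch below averages the focus-probability maps of three trained models, one fed with the original patches and two fed with vertical- and horizontal-gradient patches, and thresholds the average into an initial decision map. The model interface, the gradient operator, and the averaging-then-thresholding rule are illustrative assumptions, not the exact ECNN implementation:

    import numpy as np

    def vertical_gradient(img):
        # Forward difference along rows; a simple stand-in for the
        # vertical-gradient patches described in the paper.
        return np.diff(img, axis=0, append=img[-1:, ...])

    def horizontal_gradient(img):
        # Forward difference along columns.
        return np.diff(img, axis=1, append=img[:, -1:, ...])

    def ensemble_decision_map(src_a, src_b, models):
        # models: dict of three trained predictors (hypothetical interface),
        # each returning a per-pixel probability that src_a is in focus.
        probs = [
            models["original"](src_a, src_b),
            models["grad_v"](vertical_gradient(src_a), vertical_gradient(src_b)),
            models["grad_h"](horizontal_gradient(src_a), horizontal_gradient(src_b)),
        ]
        avg = np.mean(probs, axis=0)          # combine the three CNN outputs
        return (avg >= 0.5).astype(np.uint8)  # initial segmented decision map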
Table 4
MI, QAB/F, VIF, NMI, FMI, Yc, and QNICE comparison of various image fusion methods on 20 pairs of color multi-focus images of the Lytro dataset; entries are average values over the 20 pairs. (* from [59], ** from [33])

Methods            MI       QAB/F    VIF      NMI       FMI       Yc        QNICE
NSCT [22] 3.1473* 0.5709* 0.5132* N/A N/A N/A N/A
GFF [43] 4.1211* 0.7601* 0.7430* N/A N/A N/A N/A
IMF [44] 4.2879* 0.7534* 0.7233* 1.1420** 0.6543** 0.9861** 0.8440**
CBF [48] 3.8211* 0.7528* 0.6870* 1.0184** 0.6072** 0.9680** 0.8349**
DCHWT [18] 3.3649* 0.7124* 0.6465* 0.8971** 0.5481** 0.9280** 0.8275**
MWGF [21] 4.2336* 0.7479* 0.7316* 1.1479** 0.6527** 0.9884** 0.8427**
BFMM [42] 4.4376* 0.7572* 0.7412* N/A N/A N/A N/A
DSIFT [40] 4.4588* 0.7621* 0.7492* N/A N/A N/A N/A
CNN [54] 4.3211* 0.7618* 0.7465* N/A N/A N/A N/A
FCN [59] 4.4578* 0.7655* 0.7531* N/A N/A N/A N/A
WSSM [19] N/A 0.7296** N/A 0.9623** 0.5732** 0.9594** 0.8323**
PCNN [46, 47] N/A 0.7036** N/A 1.2068** 0.6354** 0.9690** 0.8482**
DCTLP [12] N/A 0.6562** N/A 0.8296** 0.5018** 0.8821** 0.8235**
MSTSR [13] N/A 0.7543** N/A 0.9995** 0.6081** 0.9675** 0.8323**
DCTV [3] N/A 0.7530** N/A 1.1860** 0.6333** 0.9657** 0.8428**
SRCF [4] N/A 0.7628** N/A 1.1930** 0.6623** 0.9892** 0.8465**
GIF [17] N/A 0.7608** N/A 1.1853** 0.6612** 0.9889** 0.8468**
IFGD [16] N/A 0.7174** N/A 1.0456** 0.5387** 0.8554** 0.8136**
ICA [49] N/A 0.7445** N/A 0.9374** 0.5834** 0.9555** 0.8286**
PCA [29] N/A 0.5992** N/A 0.8939** 0.5707** 0.8483** 0.8529**
CSR [26] N/A 0.7422** N/A 1.0135** 0.5575** 0.9402** 0.8327**
CAB [33] N/A 0.7645** N/A 1.2097** 0.6626** 0.9895** 0.8474**
ECNN (Proposed) 4.6565 0.7867 0.7595 1.2401 0.6782 0.9910 0.8551
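The NMI column in Table 4 is commonly computed as the entropy-normalized revision of the fusion mutual information due to Hossny et al. [75]. Assuming that definition and the same 256-bin histogram estimates as before, a minimal sketch:

    import numpy as np

    def entropy(x, bins=256):
        # Shannon entropy of the gray-level distribution, in bits.
        p, _ = np.histogram(x.ravel(), bins=bins)
        p = p / p.sum()
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))

    def joint_entropy(x, y, bins=256):
        pxy, _, _ = np.histogram2d(x.ravel(), y.ravel(), bins=bins)
        pxy = pxy / pxy.sum()
        pxy = pxy[pxy > 0]
        return float(-np.sum(pxy * np.log2(pxy)))

    def fusion_nmi(a, b, f, bins=256):
        # I(X;F) = H(X) + H(F) - H(X,F); each term is normalized by H(X) + H(F).
        ha, hb, hf = entropy(a, bins), entropy(b, bins), entropy(f, bins)
        mi_af = ha + hf - joint_entropy(a, f, bins)
        mi_bf = hb + hf - joint_entropy(b, f, bins)
        return 2.0 * (mi_af / (ha + hf) + mi_bf / (hb + hf))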
References
[1] S. Li, X. Kang, L. Fang, J. Hu, H. Yin, Pixel-level image fusion: A survey of the state of the art, Information
Fusion, 33 (2017) 100-112.
[2] M. Amin-Naji, A. Aghagolzadeh, Multi-Focus Image Fusion in DCT Domain using Variance and Energy of
Laplacian and Correlation Coefficient for Visual Sensor Networks, Journal of AI and Data Mining, 6 (2) (2018)
233-250.
[3] M. Haghighat, A. Aghagolzadeh, H. Seyedarabi, Multi-focus image fusion for visual sensor networks in DCT
domain, Computers & Electrical Engineering, 37 (5) (2011) 789-797.
[4] M. Nejati, S. Samavi, S. Shirani, Multi-focus image fusion using dictionary-based sparse representation, Information Fusion, 25 (2015) 72-84.
[5] D. Drajic, N. Cvejic, Adaptive fusion of multimodal surveillance image sequences in visual sensor networks, IEEE Transactions on Consumer Electronics, 53 (4) (2007) 1456-1462.
[6] S. Soro, W. Heinzelman. A Survey of Visual Sensor Networks, Advances in Multimedia, (2009).
[7] T. Stathaki, Image fusion: algorithms and applications, Academic Press Elsevier, 2011.
[8] M. Amin-Naji, P. Ranjbar-Noiey, A. Aghagolzadeh, Multi-focus image fusion using Singular Value
Decomposition in DCT domain, The 10th Iranian Conference on Machine Vision and Image Processing (MVIP),
2017, pp. 45-51.
[9] M. A. Naji, A. Aghagolzadeh, Multi-focus image fusion in DCT domain based on correlation coefficient, 2nd
International Conference on Knowledge-Based Engineering and Innovation (KBEI), 2015, pp. 632-639.
[10] M. A. Naji, A. Aghagolzadeh, A new multi-focus image fusion technique based on variance in DCT domain, 2nd
International Conference on Knowledge-Based Engineering and Innovation (KBEI), 2015, pp. 478-484.
[11] Y. Phamila, R. Amutha, Discrete Cosine Transform based fusion of multi-focus images for visual sensor
networks, Signal Processing, 95 (2014) 161-170.
[12] V. Naidu, B. Elias, A novel image fusion technique using DCT based Laplacian pyramid, International Journal of Inventive Engineering and Sciences (IJIES), ISSN 2319-9598, (2013).
[13] L. Cao, L. Jin, H. Tao, G. Li, Z. Zhuang, Y. Zhang, Multi-Focus Image Fusion Based on Spatial Frequency in
Discrete Cosine Transform Domain, IEEE Signal Processing Letters, 22 (2) (2015) 220-224.
[14] I. De, B. Chanda. A simple and efficient algorithm for multifocus image fusion using morphological wavelets,
Signal Processing. 86 (5) (2006) 924-936.
[15] H. Li, B. Manjunath, S. Mitra, Multisensor Image Fusion Using the Wavelet Transform, Graphical Models and
Image Processing, 57 (3), (1995) 235-245.
[16] O. Rockinger, Image sequence fusion using a shift-invariant wavelet transform, Proceedings of IEEE
International Conference on Image Processing, 3 (1997) 288-291.
[17] V.S. Petrovic, C.S. Xydeas. Gradient-based multiresolution image fusion, IEEE Transactions on Image
Processing. 13 (2) (2004) 228-237.
[18] B. K. S. Kumar, Multifocus and multispectral image fusion based on pixel significance using discrete cosine
harmonic wavelet transform, Signal, Image and Video Processing, 7 (6) (2013), 1125-1143.
[19] J. Tian, L. Chen, Adaptive multi-focus image fusion using a wavelet based statistical sharpness measure, Signal
Processing, 92 (9) (2012) 2137-2146.
[20] V. Naidu, J. Raol, Pixel-level image fusion using wavelets and principal component analysis, Defence Science
Journal, 58 (3) (2008) 338.
[21] Z. Zhou, S. Li, B. Wang, Multi-scale weighted gradient-based fusion for multi-focus images, Information Fusion,
20 (2014) 60-72.
[22] Q. Zhang, B.-L. Guo, Multifocus image fusion using the nonsubsampled contourlet transform, Signal Processing, 89 (7) (2009) 1334–1346.
[23] S. Paul, I. S. Sevcenco, P. Agathoklis, Multi-exposure and multi-focus image fusion in gradient domain, Journal
of Circuits, Systems and Computers, 25 (10) (2016) 1650123.
[24] Q. Zhang, Y. Liu, R.S. Blum, J. Han, D. Tao, Sparse representation based multi-sensor image fusion for multi-
focus and multi-modality images: a review, Information Fusion, 40 (2018) 57-75.
[25] Q. Zhang, M.D. Levine, Robust multi-focus image fusion using multi-task sparse representation and spatial
context, IEEE Transactions on Image Processing 25 (5) (2016) 2045-2058.
[26] Y. Liu, X. Chen, R.K. Ward, Z.J. Wang, Image fusion with convolutional sparse representation, IEEE Signal Processing Letters, 23 (12) (2016) 1882–1886.
[27] Y. Liu, S. Liu, Z. Wang, A general framework for image fusion based on multi-scale transform and sparse representation, Information Fusion, 24 (2015) 147–164.
[28] W. Huang, Z. Jing, Evaluation of focus measures in multi-focus image fusion, Pattern Recognition Letters, 28 (4)
(2007) 493-500.
[29] W. Wu, X. Yang, Y. Pang, J. Peng, G. Jeon, A multifocus image fusion method by using hidden Markov model,
Optics Communications, 287 (2013) 63-72.
[30] M. Nejati, S. Samavi, N. Karimi, S.M.R. Soroushmehr, S. Shirani, I. Rosta, K. Najarian, Surface area-based focus
criterion for multi-focus image fusion, Information Fusion, 36 (2017) 284–295.
[31] J. Liang, Y. He, D. Liu, X. Zeng, Image fusion using higher order singular value decomposition, IEEE
Transactions on Image Processing, 21 (5) (2012) 2898-2909.
[32] S. Pertuz, D. Puig, M. A. Garcia, Analysis of focus measure operators for shape-from-focus, Pattern Recognition, 46 (5) (2013) 1415-1432.
[33] M. S. Farid, A. Mahmood, S.A. Al-Maadeed, Multi-focus image fusion using Content Adaptive Blurring.
Information Fusion, 45 (2019) 96-112.
[34] S. Li, B. Yang, Multifocus image fusion using region segmentation and spatial frequency, Image and Vision
Computing, 26 (7) (2008) 971-979.
[35] S. Mahajan, A. Singh, A Comparative Analysis of Different Image Fusion Techniques, IPASJ International
Journal of Computer Science (IIJCS), 2 (1) (2014) 8-15.
[36] H. A. Eltoukhy, S. Kavusi, Computationally efficient algorithm for multifocus image reconstruction, Proceedings of SPIE Electronic Imaging, 5017 (2003) 332–341.
[37] K. Zhan, J. Teng, Q. Li, J. Shi. A Novel Explicit Multi-focus Image Fusion Method, Journal of Information
Hiding and Multimedia Signal Processing, 6 (3) (2015) 600-612.
[38] S. Li, J.T. Kwok, Y. Wang, Combination of images with diverse focuses using the spatial frequency, Information
Fusion, 2 (3) (2001) 169-176.
[39] Y. Yang, M. Yang, S. Huang, Y. Que, M. Ding, J. Sun, Multifocus image fusion based on extreme learning
machine and human visual system, IEEE Access, 5 (2017) 6989-7000.
[40] Y. Liu, S. Liu, Z. Wang, Multi-focus image fusion with dense SIFT, Information Fusion. 23 (2015) 139-155.
[41] D. Guo, J. Yan, X. Qu, High quality multi-focus image fusion using self-similarity and depth information. Optics
Communications, 338 (2015) 138-144.
[42] Y. Zhang, X. Bai, T. Wang, Boundary finding based multi-focus image fusion through multi-scale morphological
focus-measure, Information Fusion, 35 (2017) 81-101.
[43] S. Li, X. Kang, J. Hu, Image fusion with guided filtering, IEEE Transactions on Image Processing, 22 (7) (2013)
2864-2875.
[44] S. Li, X. Kang, J. Hu, B. Yang, Image matting for fusion of multi-focus images in dynamic scenes, Information
Fusion, 14 (2) (2013) 147-162.
[45] M. Li, W. Cai, Z. Tan, A region-based multi-sensor image fusion scheme using pulse-coupled neural network, Pattern Recognition Letters, 27 (16) (2006) 1948-1956.
[46] X. Qu, C. Hu, J. Yan, Image fusion algorithm based on orientation information motivated pulse coupled neural
networks, in: 7th World Congress on Intelligent Control and Automation, 2008, pp. 2437–2441.
[47] X.-B. Qu, J.-W. Yan, H.-Z. Xiao, Z.-Q. Zhu, Image fusion algorithm based on spatial frequency-motivated pulse
coupled neural networks in nonsubsampled contourlet transform domain, Acta Automatica Sinica 34 (12) (2008)
1508 – 1514.
[48] B. K. Shreyamsha Kumar, Image fusion based on pixel significance using cross bilateral filter, Signal Image
Video Process. 9 (5) (2015) 1193–1204.
[49] N. Mitianoudis, T. Stathaki, Pixel-based and region-based image fusion schemes using ICA bases, Information
Fusion, special Issue on Image Fusion: Advances in the State of the Art, 8 (2) (2007) 131 – 142.
[50] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature, 521 (7553) (2015) 436-444.
[51] I. Goodfellow, Y. Bengio, A. Courville. Deep Learning, MIT Press, 2016.
[52] Y. Liu, X. Chen, Z. Wang, Z. J. Wang, R. K. Ward, X. Wang, Deep learning for pixel-level image fusion: Recent advances and future prospects, Information Fusion, 42 (2018) 158-173.
[53] M. Amin-Naji, A. Aghagolzadeh, M. Ezoji, CNNs hard voting for multi-focus image fusion, Journal of Ambient
Intelligence and Humanized Computing, (2019), 1-21.
[54] Y. Liu, X. Chen, H. Peng, Z. Wang, Multi-focus image fusion with a deep convolutional neural network,
Information Fusion, 36 (2017) 191-207.
[55] C. Du, S. Gao, Image segmentation-based multi-focus image fusion through multi-scale convolutional neural
network, IEEE Access, 5 (2017) 15750-15761.
[56] C. B. Du, S. Gao, Multi-focus image fusion with the all convolutional neural network. Optoelectronics Letters, 14
(1) (2018) 71-75.
[57] H. Tang, B. Xiao, W. Li, G. Wang, Pixel convolutional neural network for multi-focus image fusion, Information Sciences, 433 (2017) 125-141.
[58] K. Xu, Z. Qin, G. Wang, H. Zhang, K. Huang, S. Ye, Multi-focus image fusion using fully convolutional two-stream network for visual sensors, KSII Transactions on Internet & Information Systems, 12 (5) (2018) 2253-2271.
[59] X. Guo, R. Nie, J. Cao, D. Zhou, W. Qian, Fully Convolutional network-based multifocus image fusion. Neural
Computation, 30 (7) (2018), 1775–1800.
[60] P. Krähenbühl, V. Koltun, Efficient inference in fully connected CRFS with Gaussian edge potentials. In:
Advances in neural information processing systems, (2011), 109–117
[61] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (11) (1998) 2278-2324.
[62] [Online]. Available: https://en.wikipedia.org/wiki/Deep_learning (accessed 2 August 2018).
[63] [Online]. Available: http://cs231n.github.io/convolutional-networks/ (accessed 2 August 2018).
[64] Z. H. Zhou, J. Wu, W. Tang, Ensembling neural networks: many could be better than all, Artificial intelligence,
137 (1-2) (2002) 239-263.
[65] T. G. Dietterich, Ensemble Methods in Machine Learning, International workshop on multiple classifier systems,
2000, pp. 1-15.
[66] D. Opitz, R. Maclin, Popular ensemble methods: An empirical study, Journal of Artificial Intelligence Research, 11 (1999) 169–198.
[67] D. Maji, A. Santara, P. Mitra, and D. Sheet, Ensemble of deep convolutional neural networks for learning to
detect retinal vessels in fundus images. arXiv preprint arXiv:1603.04833, (2016).
[68] [Dataset]. Available: http://cocodataset.org/ (accessed 2 August 2018).
[69] C. Xydeas, V. Petrovic, Objective image fusion performance measure, Electronics Letters, 36 (4) (2000) 308-309.
[70] V. Petrovic, C. Xydeas, Objective image fusion performance characterization, Tenth IEEE International
Conference on Computer Vision (ICCV), 2005, pp. 1866-1871.
[71] C. Yang, J.-Q. Zhang, X.-R. Wang, X. Liu, A novel similarity based quality metric for image fusion, Information
Fusion, 9 (2) (2008) 156-160.
[72] J. Zhao, R. Laganiere, Z. Liu, Performance assessment of combinative pixel-level image fusion based on an
absolute feature measurement. International Journal of Innovative Computing, Information and Control, 3 (6)
(2007) 1433-1447.
[73] G. Piella, H. Heijmans, A new quality metric for image fusion, Proceedings of the 2003 International Conference on Image Processing (ICIP), 2003, pp. 173-176.
[74] Y. Chen, R.S. Blum, A new automated quality assessment algorithm for image fusion, Image and Vision Computing, 27 (10) (2009) 1421-1432.
[75] M. Hossny, S. Nahavandi, D. Creighton, Comments on 'Information measure for performance of image fusion'.
Electronics letters, 44(18) (2008) 1066-1067.
[76] M. B. A. Haghighat, A. Aghagolzadeh, H. Seyedarabi, A non-reference image fusion metric based on mutual
information of image features. Computers & Electrical Engineering, 37(5) (2011) 744-756.
[77] H. Sheikh, A. Bovik, Image information and visual quality. IEEE Transaction on Image Processing 15 (2006)
430–444.
[78] Q. Wang, Y. Shen, J. Jin, Performance evaluation of image fusion techniques, Image Fusion: Algorithms and Applications, 19 (2008) 469–492.