Accepted Manuscript
Ensemble of CNN for Multi-Focus Image Fusion
Mostafa Amin-Naji , Ali Aghagolzadeh , Mehdi Ezoji
PII: S1566-2535(18)30604-3
DOI: https://doi.org/10.1016/j.inffus.2019.02.003
Reference: INFFUS 1077
To appear in: Information Fusion
Received date: 29 August 2018
Revised date: 7 February 2019
Accepted date: 11 February 2019
Please cite this article as: Mostafa Amin-Naji, Ali Aghagolzadeh, Mehdi Ezoji, Ensemble of CNN for Multi-Focus Image Fusion, Information Fusion (2019), doi: https://doi.org/10.1016/j.inffus.2019.02.003
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Highlights:
Devise a new type of multi-focus dataset and a patch feeding strategy for the network
Propose three feeding modes for the images entering the network
Design an ensemble-learning-based CNN architecture for multi-focus image fusion
The cleanest initial decision map, requiring the fewest post-processing steps
The best fused-image quality among the other state-of-the-art methods
Ensemble of CNN for Multi-Focus Image Fusion
Mostafa Amin-Naji, Ali Aghagolzadeh1, and Mehdi Ezoji
Faculty of Electrical and Computer Engineering
Babol Noshirvani University of Technology, Babol, Iran.
Abstract
Convolutional Neural Network (CNN) based multi-focus image fusion methods have recently attracted enormous attention. They greatly improve the constructed decision map compared with the previous state-of-the-art methods operating in the spatial and transform domains. Nevertheless, these methods do not yield a satisfactory initial decision map and must undergo extensive post-processing to achieve an acceptable final decision map. In this paper, a novel CNN-based method assisted by ensemble learning is proposed. It is very reasonable to use various models and datasets rather than just one: ensemble learning based methods increase diversity among the models and datasets in order to reduce overfitting on the training dataset, and an ensemble of CNNs generally performs better than a single CNN. The proposed method also introduces a new, simple type of multi-focus image dataset. It merely changes the arrangement of the patches of the multi-focus datasets, which proves very useful for obtaining better accuracy. With this new arrangement, three different datasets, consisting of the original patches and their gradients in the vertical and horizontal directions, are generated from the COCO dataset. The proposed method therefore introduces a new network in which three CNN models, each trained on one of the three created datasets, construct the initial segmented decision map. These ideas greatly improve the initial segmented decision map of the proposed method, making it similar to, or even better than, the final decision maps of other CNN-based methods obtained after applying many post-processing algorithms. Many real multi-focus test images are used in the experiments, and the results are compared using quantitative and qualitative criteria. The experimental results indicate that the proposed CNN-based network is more accurate and produces a better decision map, without post-processing algorithms, than the other existing state-of-the-art multi-focus fusion methods that rely on many post-processing algorithms.
Keywords: Multi-Focus Image Fusion, Deep Learning, Convolutional Neural Network, Ensemble Learning
1. Introduction
In recent years, image fusion has been used in a wide variety of applications such as remote sensing, surveillance, medical diagnosis, and photography. The two major photographic applications of image fusion are the fusion of multi-focus images and the fusion of multi-exposure images [1]. This paper focuses on multi-focus image fusion, which is a subset of photography applications. The purpose of multi-focus image fusion is to gather the necessary information and focused areas from the input multi-focus images into a single image. Due to the limited depth of field of the optical lenses in CCD/CMOS cameras, it is difficult to capture a single image in which all components are in focus. Therefore, some areas of the images captured by camera sensors are blurred. Images with different depths of focus can be recorded using several cameras in a Visual Sensor Network (VSN) [1-4]. Since a camera
1 Corresponding author: Ali Aghagolzadeh, Faculty of Electrical and Computer Engineering, Babol Noshirvani University of
Technology, Babol, Iran, Post Code : 47148 – 71167.
E-mail Addresses: [email protected] (M. Amin-Naji), [email protected] (A. Aghagolzadeh), [email protected] (M. Ezoji)
generates a large amount of data compared to other sensors such as pressure or temperature sensors, there are limitations such as limited bandwidth, energy consumption, and processing time, which make it necessary to process the input images locally in order to decrease the amount of transmitted data [4-6]. For these reasons, many researchers are seeking effective methods for multi-focus image fusion.
1.1 Literature review and related works
Extensive research on multi-focus image fusion has been conducted in recent years, and the methods can be classified into two categories: transform domain and spatial domain [1-4, 7]. Commonly used transforms for image fusion are the Discrete Cosine Transform (DCT) and Multi-Scale Transforms (MST) [1-3, 7]. Many studies have been carried out in the DCT domain [2, 3, 8-13]. These works apply a focus-measurement criterion in the DCT domain to choose the suitable blocks from the divided blocks of the input multi-focus images. These methods are very suitable for real-time applications in VSNs, especially for JPEG images, but they suffer from blocking artifacts, which reduce the quality of the output fused image [2, 3]. In addition, multi-scale image fusion methods are very common and convenient [14-23]. Image fusion methods based on MSTs such as the Discrete Wavelet Transform (DWT) need a large number of convolution operations, and the quality of the output fused image is reduced by ringing artifacts at the edges of the image [2, 3, 7]. There are also image fusion methods based on sparse representation [1, 4, 19, 24, 27]. The framework of sparse representation (SR) based methods focuses on three key problems: sparse representation models, dictionary learning methods, and activity levels and fusion rules [24]. Zhang and Levine [25] introduced a prominent SR-based multi-focus image fusion method that applies a multi-task robust sparse representation model and spatial context information to the fusion of multi-focus images with mis-registration. Another class of multi-focus image fusion methods operates in the spatial domain. These methods usually fuse the multi-focus images at the pixel, region, or block level using the intensities of the input images directly. Several spatial-domain methods have been introduced in recent years [1, 28-49]. Many of them attempt to achieve a satisfactory decision map for creating the final fused image, but they have not yet produced an ideal focus map. In addition, all these methods create artifacts, blur the image, and reduce the image contrast.
Recently, Deep Learning (DL) has been thriving in several image processing and computer vision applications [50-52]. This is why DL-based methods have become an attractive topic in image fusion research [52-59]. Y. Liu et al. [54] were the first researchers to use a CNN for multi-focus image fusion; they used a Siamese architecture to compare focused and unfocused patches. Chaoben and Shesheng [55] introduced image segmentation-based multi-focus image fusion through a multi-scale convolutional neural network (MSCNN), which segments the focused and unfocused regions. They also introduced an all-convolutional neural network (ACNN) for multi-focus image fusion, which only replaced the max-pooling layers with strided convolution layers and did not offer a substantial contribution [56]. Nevertheless, these methods make many errors in their initial segmented decision maps. Tang et al. [57] proposed a pixel-wise convolutional neural network (p-CNN) for recognizing focused and unfocused pixels. They tried to create a new type of multi-focus dataset and to accelerate the construction of the fused image. However, their dataset does not provide better performance than the previous datasets, and the analysis in their paper is performed on tiny test multi-focus images. Their datasets are generated from the tiny 32×32 images of the CIFAR-10 dataset, considering many different focus conditions with 12 geometric types of 32×32 masks. We believe that such laborious dataset construction is not needed for the task of multi-focus image fusion, and these datasets remain limited by the tiny 32×32 source images. All of these state-of-the-art CNN-based multi-focus image fusion methods have greatly enhanced the decision map, but their initial segmented decision maps still contain many
errors. Along with the CNN-based multi-focus image fusion methods, fully convolutional networks (FCN) have also been utilized for multi-focus image fusion [58, 59]. K. Xu et al. [58] used an end-to-end fully convolutional two-stream network for multi-focus image fusion, with four convolution layers for each of the two input-image streams, which are then fused into one stream. The fused image is obtained by another four deconvolution layers following the convolution layers. Because the fused image is constructed without a decision map, it contains undesirable manipulated pixels that are not related to the source images. This network was not described in detail, and comprehensive, reliable experiments and comparisons were not given. X. Guo et al. [59] introduced a fully convolutional neural network for multi-focus image fusion. This network is much deeper than the other deep learning based methods for constructing the initial decision map: it includes eighteen convolution layers and three deconvolution layers. Consequently, their network has a huge number of weights that must be tuned during the training process. Their paper also involves laborious steps for creating the training dataset. They utilized PASCAL VOC 2012, a classification and segmentation dataset; since they need segmentation results to create their dataset, it may not contain all scenes found in nature. Moreover, the initial decision map of this method is very poor and does not show any superiority over the others, so a post-processing algorithm is required to polish it. In order to refine their inappropriate initial decision map, they utilize the fully connected conditional random field (CRF) [60], a method for multi-class image segmentation. In other words, the main share of the acceptable quality of their final decision map belongs to this additional algorithm, which is a separate issue from the deep learning based fusion network. All of these deep learning (CNN) based image fusion methods use many post-processing steps, such as segmentation, guided filtering, morphological operations (opening and closing), watershed, small region removal, and consistency verification (CV), on the initial decision map in order to enhance it and reach a satisfactory final decision map [41-44, 54-59]. Therefore, the large share of their good final decision map quality is due to extensive post-processing, which is entirely separate from their proposed CNN networks.
This paper introduces a new network for achieving a better initial segmented decision map than the others. The proposed method introduces a new architecture that uses an ensemble of three Convolutional Neural Networks (CNNs) trained on three different datasets. The proposed method also prepares a new, simple type of multi-focus image dataset that achieves better fusion performance than the other popular multi-focus image datasets. These ideas are very helpful for achieving an initial segmented decision map that matches or surpasses the initial segmented decision maps the other methods obtain only with extensive post-processing.
The rest of this article is organized as follows. Section 2 presents the proposed method, beginning with brief preliminaries on CNNs and ensemble learning. Section 3 compares the proposed method with previous algorithms through different experiments. Finally, conclusions are given in Section 4.
2. Proposed Method
In this section, the proposed method is introduced in detail, showing how an ensemble of CNNs can improve the initial segmented decision map. The first part briefly explains convolutional neural networks and ensemble learning. The second part describes the ensemble of CNNs for multi-focus image fusion together with the proposed way of creating simple and useful multi-focus image datasets. The third part details the proposed network architecture. The fourth part explains the fusion scheme applied with the trained network. Finally, the fifth part compares the complexity of deep learning based networks.
2.1 Preliminaries
Convolutional Neural Networks. A popular and well-known family of Deep Learning models is Convolutional Neural Networks (CNNs or ConvNets). CNNs are a special category of Artificial Neural Networks (ANN) designed for representing and processing data with a grid-like structure (e.g., images). Typically, a simple CNN architecture contains four kinds of layers: the convolutional layer (conv), the Rectified Linear Unit (ReLU), the pooling layer (subsampling), and the fully connected layer (FC) [61-63]. In a CNN, each convolutional layer transforms a volume of input images into a volume of feature maps; the next convolutional layer then transforms this volume of feature maps into another volume of feature maps by convolution operations with a set of filters. The convolution operation followed by the ReLU activation function in CNNs is expressed as below:

$$F_j = \max\Big(0,\ \sum_i X_i * K_{ij} + b_j\Big) \tag{1}$$

where $K_{ij}$ and $b_j$ are the convolutional kernel and the bias, respectively, and $*$ denotes the operation of convolution.
After the convolution operations, spatial pooling (e.g., max-pooling) is applied; further convolution layers are then arranged in the same manner. Finally, the FC layer is the last part of the CNN, which is simply a convolutional layer with a kernel of size 1×1. The general schematic diagram of a CNN is shown in Fig. 1.
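These building blocks can be illustrated in a few lines of PyTorch (the framework used later in Section 3); the layer sizes here are illustrative placeholders, not the proposed architecture:

```python
import torch
import torch.nn as nn

# A minimal conv -> ReLU -> max-pool -> FC stack, mirroring Eq. (1):
# each convolution computes sum_i(X_i * K_ij) + b_j, followed by max(0, .).
block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2),
)
x = torch.randn(1, 1, 32, 32)        # one gray-scale 32x32 patch
features = block(x)                  # -> volume of 64 feature maps (16x16)
fc = nn.Linear(64 * 16 * 16, 2)      # FC layer mapping to two class scores
scores = fc(features.flatten(1))
```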
The Ensemble Learning. Ensemble learning in neural networks is a learning paradigm in which several networks are trained on one or several datasets to solve the same problem. It is very reasonable to use various models and datasets instead of a single model and dataset: ensemble learning improves generalization compared with a single network or dataset [62-67]. There are many ideas and various ensemble methods in machine learning [66]. The simplest examples are hard voting and soft voting over the predictions of the individual models trained on the different datasets. Ensemble learning methods remain applicable and extendable to Deep Learning [67], as the following sketch of the two voting schemes illustrates.
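A minimal sketch of the two voting schemes; the `models` argument stands for any set of trained classifiers (here, hypothetically, networks trained on different datasets):

```python
import torch

def soft_vote(models, x):
    """Average the class probabilities of several trained models (soft voting)."""
    probs = [torch.softmax(m(x), dim=1) for m in models]
    return torch.stack(probs).mean(dim=0).argmax(dim=1)

def hard_vote(models, x):
    """Majority vote over each model's predicted labels (hard voting)."""
    labels = torch.stack([m(x).argmax(dim=1) for m in models])
    return labels.mode(dim=0).values
```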
2.2 The proposed approach of patch feeding to the ensemble of CNNs
The proposed method is based on the following facts:
Fact 1: Feeding the focused and unfocused patches simultaneously into the proposed network increases classification accuracy compared with feeding these patches separately.
Fact 2: The edges in unfocused patches are smoother than the corresponding edges in focused patches. Therefore, Gx and Gy, the gradient images in the horizontal and vertical directions, carry important information for detecting the focused patches.
Fact 3: Employing CNNs through ensemble learning and feeding the proposed datasets to the proposed architecture improves the accuracy of focused-patch detection and therefore enhances the initial decision map.
In the following, we briefly discuss how the above facts are applied in the proposed algorithm.
Figure 1. The general schematic diagram of a CNN.
Proposed Feeding Strategy and Constructing Training Dataset
The essential prerequisite is an appropriate multi-focus image dataset. Each of the previous CNN-based methods attempts to create a helpful dataset [54-59]. As far as we know, these methods feed either only focused or only unfocused patches, separately, to the multiple paths of their networks; i.e., each patch lies entirely in a focused or an unfocused area, so each path of the network knows nothing about the others. To resolve this problem, we suggest vertically concatenating the two patches extracted from the source images into a single patch. The schematic diagram of the dataset creation procedure, following the proposed patch feeding strategy of this paper, is shown in Fig. 2. As can be seen from Fig. 2, the created macro-patches contain both focused and unfocused areas of the image simultaneously. So, unlike other methods such as [54-59], each path of the proposed network sees both corresponding patches. In this paper, more than 2200 high-quality images of the COCO 2014 dataset [68] are used to create the training dataset. The images of the COCO dataset are captured from common objects in their natural context; some randomly selected sample images are shown in Fig. 3. The randomly selected COCO images are converted into gray-scale. In order to create unfocused conditions resembling real multi-focus images captured with a camera, each randomly selected COCO image is passed through four different Gaussian filters with a standard deviation of 9 and kernel sizes of 9×9, 11×11, 13×13, and 15×15. Therefore, five versions of each selected COCO image are obtained: the original image and four blurred versions. Then, the gradients in the horizontal (Gx) and vertical (Gy) directions are computed for each of these five versions. We create three groups containing, respectively, the original images and their Gx and Gy gradient images; thus, there are five versions of each randomly selected COCO image in each of the original, Gx, and Gy groups. Afterward, each image in these three groups is divided into 32×32 blocks or patches, producing a large number of patches in the original, Gx, and Gy groups. By construction, we know whether each patch comes from the non-blurred version of an image or from one of the four blurred versions. Using this prior knowledge to create our three proposed datasets, suppose PA and PB are 32×32 patches obtained from the non-blurred version of the image (A) and from one of the four blurred versions (B), respectively. As depicted in the schematic diagram of Fig. 2, our proposed method creates a 64×32 macro-patch by vertically concatenating PA and PB, so the upper and lower halves of this macro-patch come from the non-blurred version of the image and from one of the four blurred versions, respectively. This macro-patch is known as upper-focused data and is labeled 0. In the same way, the vertically mirrored macro-patch is known as lower-focused data and is labeled 1. The same procedure applied to the gradient images in the horizontal (Gx) and vertical (Gy) directions yields the macro-patches of the Gx and Gy datasets. By randomly selecting more than 2200 high-quality COCO images and following the procedure of Fig. 2, 1,000,000 macro-patches for training and 300,000 macro-patches for testing are generated for each of the original, Gx, and Gy datasets. So, in accordance with Fact 2 above, we construct the corresponding macro-patch for each mode of the input sources, i.e. the original, Gx, and Gy datasets, separately. A code sketch of this construction is given below.
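A minimal sketch of the construction, assuming OpenCV Gaussian blurring and Sobel gradients; the function name and the patch-tiling details are illustrative, not taken from the authors' released code:

```python
import cv2
import numpy as np

def make_macro_patches(gray, b=32):
    """Build labeled 64x32 macro-patches for the original, Gx, and Gy datasets."""
    # Four blurred versions: standard deviation 9, kernel sizes 9..15 (as in the paper).
    blurred = [cv2.GaussianBlur(gray, (k, k), 9) for k in (9, 11, 13, 15)]
    samples = {"orig": [], "gx": [], "gy": []}
    h, w = gray.shape
    for blur in blurred:
        versions = {
            "orig": (gray, blur),
            # Gradient images of the sharp and blurred versions (Fact 2).
            "gx": (cv2.Sobel(gray, cv2.CV_32F, 1, 0), cv2.Sobel(blur, cv2.CV_32F, 1, 0)),
            "gy": (cv2.Sobel(gray, cv2.CV_32F, 0, 1), cv2.Sobel(blur, cv2.CV_32F, 0, 1)),
        }
        for name, (sharp_img, blur_img) in versions.items():
            for r in range(0, h - b + 1, b):
                for c in range(0, w - b + 1, b):
                    pa = sharp_img[r:r + b, c:c + b]   # focused patch PA
                    pb = blur_img[r:r + b, c:c + b]    # unfocused patch PB
                    # Upper-focused macro-patch -> label 0; its mirror -> label 1.
                    samples[name].append((np.vstack([pa, pb]), 0))
                    samples[name].append((np.vstack([pb, pa]), 1))
    return samples
```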
After changing the arrangement of the focused and unfocused parts for the proposed method, the question is which type of dataset images is best for training. The best results in machine learning are obtained when the algorithm combines the advice of different networks trained on different datasets. Our proposed method prepares three suitable multi-focus image datasets, one for each mode of information source, according to the new arrangement of Fig. 2. All three datasets carry good features and information that help the proposed networks perform the classification.
Ensemble learning methods can even be extended into CNN models, where layers of convolutions are concatenated together to produce a better representation. This makes it possible to train the proposed architecture with various
models on the various datasets, then concatenate these models together, and finally combine their predictions into a final prediction. Combining all of these ideas helps achieve higher accuracy and capability for classification. Therefore, the original, Gx, and Gy datasets created according to the proposed patch feeding are a very appropriate basis for the three feeding modes of the proposed ensemble-based architecture.
2.3 The Proposed Network Architecture
The schematic diagram of the proposed architecture is shown in Fig. 4. There are many possible architectures for CNN models, but this paper implements simple CNNs in order to keep the problem simple. It would be wise to take the
Figure 2. The schematic diagram of generating the three datasets according to the proposed patch feeding, as used in the training procedure.
Figure 3. Some sample images used for training from the COCO dataset [68].
easiest and most straightforward way to obtain satisfactory results. This paper uses only a few convolution layers, because multi-focus image fusion is simple compared with advanced tasks such as object detection and semantic segmentation.
The proposed architecture contains convolution layers with a kernel size of 3×3, a stride of 1×1, a padding size of 1×1, and the non-linear activation function ReLU. Max-pooling of 2×2 is used in this architecture. The proposed network can be divided into five paths. The original dataset is fed to the network in the first path, and the Gx and Gy datasets are fed to the network in the second and third paths, respectively. The outputs of the second and third paths are concatenated in the fourth path; after that, the outputs of the first and fourth paths are concatenated in the fifth path. In the fifth path, the FC layer is mapped to the final two neurons for detecting the focused and unfocused labels. The details of these paths are as follows (a code sketch is given after the list):
The input 32×64 macro-patches are from the original, Gx, and Gy datasets.
Path #1:
1- The 32×64 macro-patch of the original dataset is fed to the first convolution layer for obtaining 64 feature maps.
The result of this layer has the volume of 32×64×64.
2- The volume of 32×64×64 is fed to the second, third, and fourth convolution layers with 128, 128, and 256 filters,
respectively. For these convolution layers, the max-pooling of 2×2 is used. Then the volume of 4×8×256 is
achieved at the end of the fourth convolution layer.
Path #2:
1- The 32×64 macro-patch of Gx dataset is fed to the first convolution layer for obtaining 64 feature maps. The
result of this layer is the volume of 32×64×64.
2- The volume of 32×64×64 is fed to the second and third convolution layers with 128 and 128 filters, respectively.
For these convolution layers, the max-pooling of 2×2 is used. Then the volume of 8×16×128 is achieved at the
end of the third convolution layer.
Path #3:
1- The 32×64 macro-patch of Gy dataset is fed to the first convolution layer for obtaining 64 feature maps. The
result of this layer has the volume of 32×64×64.
2- The volume of 32×64×64 is fed to the second and third convolution layers with 128 and 128 filters, respectively.
For these convolution layers, the max-pooling of 2×2 is used. Then the volume of 8×16×128 is achieved at the
end of the third convolution layer.
Path #4:
1- The two output volumes of 8×16×128 from path #2 and path #3 are concatenated to construct the volume of 8×16×256.
2- The volume of 8×16×256 is fed to a convolution layer with 256 filters, followed by the max-pooling of 2×2, to achieve the volume of 4×8×256.
Path #5:
1- The two output volumes of 4×8×256 from path #1 and path #4 are concatenated to achieve the volume of 4×8×512. This volume is flattened to 1×16384, forming the Fully Connected (FC) layer. This FC layer is mapped to the two neurons for the final prediction, which indicate the focused and unfocused labels.
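A minimal PyTorch sketch of this five-path design, under our reading of the layer list above (channel counts and pooling placement are taken from the list; details of the authors' implementation may differ):

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, pool=True):
    layers = [nn.Conv2d(cin, cout, 3, stride=1, padding=1),
              nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class ECNNSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Path #1 (original): conv 64 (no pool), then 128, 128, 256 with pooling.
        self.p1 = nn.Sequential(conv_block(1, 64, pool=False),
                                conv_block(64, 128), conv_block(128, 128),
                                conv_block(128, 256))
        # Paths #2 and #3 (Gx, Gy): conv 64 (no pool), then 128, 128 with pooling.
        def grad_path():
            return nn.Sequential(conv_block(1, 64, pool=False),
                                 conv_block(64, 128), conv_block(128, 128))
        self.p2, self.p3 = grad_path(), grad_path()
        # Path #4: concatenated Gx/Gy features -> conv 256 with pooling.
        self.p4 = conv_block(256, 256)
        # Path #5: FC layer, 4*8*512 = 16384 inputs -> 2 class scores.
        self.fc = nn.Linear(4 * 8 * 512, 2)

    def forward(self, x_orig, x_gx, x_gy):
        f1 = self.p1(x_orig)                                     # 256 maps, 4x8
        f4 = self.p4(torch.cat([self.p2(x_gx), self.p3(x_gy)], dim=1))
        return self.fc(torch.cat([f1, f4], dim=1).flatten(1))
```

Feeding a 32×64 macro-patch through this sketch reproduces the volumes listed above: 8×16×128 at the end of paths #2 and #3, and 4×8×256 at the end of paths #1 and #4.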
With this procedure, the proposed architecture is trained on the created 1,000,000 macro-patches of the original, Gx, and Gy datasets. To normalize the input patches, the means and variances of the 1,000,000 original, Gx, and Gy patches of the training datasets are calculated as µ1=0.45, σ1=0.1, µ2=0.05, σ2=0.09, and µ3=0.06, σ3=0.09, respectively. The proposed network is trained with stochastic gradient descent (SGD) with a learning rate of 0.0002, a momentum of 0.9, and a weight decay of 0.0005. A StepLR scheduler with a step size of 1 and a gamma value of 0.9 is also used. A batch
size of 64 is selected for training the proposed network. The cross-entropy loss is used as the criterion of the proposed network, and batch normalization is used during training. Owing to the three types of useful datasets and the ensemble-learning-based design of the network, training on the multi-focus patch datasets of this paper is very fast, and the network can be learned quickly. The classification accuracy of the trained network is 99.794% on the 1,000,000 training macro-patches and 99.786% on the 300,000 test macro-patches. These settings translate directly into the short training loop sketched below.
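A minimal PyTorch training loop under the stated hyper-parameters; `train_loader` (yielding batches of original, Gx, and Gy macro-patches with labels) and `num_epochs` are hypothetical placeholders, since the paper does not describe them:

```python
import torch.nn as nn
import torch.optim as optim

model = ECNNSketch()                      # the architecture sketch above
criterion = nn.CrossEntropyLoss()         # training criterion, as in the paper
optimizer = optim.SGD(model.parameters(), lr=0.0002,
                      momentum=0.9, weight_decay=0.0005)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)

num_epochs = 10                           # assumed; not reported in the paper
for epoch in range(num_epochs):
    for x_orig, x_gx, x_gy, labels in train_loader:   # batch size 64
        optimizer.zero_grad()
        loss = criterion(model(x_orig, x_gx, x_gy), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```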
2.4 The fusion scheme
For a simple description of the proposed method, two images A and B are considered; a part of image A is focused while the same part in image B is unfocused. It is assumed that the input images were aligned by an image registration method before the image fusion process. If the input multi-focus images are color images, they are converted into gray-scale for constructing the decision map; after the decision map is constructed, the color multi-focus images can be fused. The proposed method can easily be extended to fuse more than two input images. The input multi-focus images A and B are fed into the pre-trained network according to the patch feeding strategy used for creating the three datasets in Fig. 2. The patches extracted from the input multi-focus images overlap, so that feeding the pre-trained network simulates pixel-wise image fusion. The pre-trained network then returns the labels 0 and 1, which indicate the focused and unfocused labels, respectively. With this procedure, each pixel contributes several times to the focused/unfocused decision. Every macro-patch fed to the network updates the score map of the input multi-focus images according to the proposed fusion rule (2).
Figure 4. The schematic diagram of the proposed ECNN architecture with all details of the CNN models.
$$M(r:r+b,\ c:c+b) = \begin{cases} M(r:r+b,\ c:c+b) + 1, & \text{if the output label is } 0 \\ M(r:r+b,\ c:c+b) - 1, & \text{if the output label is } 1 \end{cases} \tag{2}$$

where r and c indicate the row and column of the input images, and M is the score map from which the decision map is derived. Also, b is the width and height of the patches extracted from the input multi-focus images for constructing macro-patches; each patch must be resized to 32×32 before being fed to the pre-trained network. The value of b can be set to 16 for tiny images and 32 for large images. Then, the initial segmented decision map of the proposed method is constructed as below:

$$M(r, c) = \begin{cases} 1, & \text{if } M(r, c) \geq 0 \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

Finally, the final fused image is calculated as below:

$$F(r, c) = M(r, c)\, A(r, c) + \big(1 - M(r, c)\big)\, B(r, c) \tag{4}$$

where A(r,c) and B(r,c) are the input multi-focus images. A code sketch of this scheme follows.
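A sketch of the sliding-window fusion of Eqs. (2)-(4); the stride, the gradient operator, and the border handling are our assumptions, and the b=16 case with resizing to 32 is omitted for brevity:

```python
import numpy as np
import torch

def fuse(model, A, B, b=32, stride=2):
    """Fuse two registered gray-scale images with a trained three-path network."""
    A, B = A.astype(np.float32), B.astype(np.float32)
    gyA, gxA = np.gradient(A)          # vertical and horizontal gradients of A
    gyB, gxB = np.gradient(B)
    H, W = A.shape
    score = np.zeros((H, W), dtype=np.float32)

    def macro(top, bot, r, c):         # 64x32 macro-patch as a 1x1xHxW tensor
        m = np.vstack([top[r:r + b, c:c + b], bot[r:r + b, c:c + b]])
        return torch.from_numpy(m[None, None])

    with torch.no_grad():
        for r in range(0, H - b + 1, stride):       # overlapping patches
            for c in range(0, W - b + 1, stride):
                out = model(macro(A, B, r, c),
                            macro(gxA, gxB, r, c), macro(gyA, gyB, r, c))
                label = out.argmax(dim=1).item()
                score[r:r + b, c:c + b] += 1.0 if label == 0 else -1.0  # Eq. (2)
    M = (score >= 0).astype(np.float32)             # Eq. (3)
    return M * A + (1 - M) * B                      # Eq. (4)
```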
This proposed method avoids the complexity of the previous state-of-the-art CNN- and FCN-based methods [54-59] in creating the initial and final decision maps. Those methods use extensive post-processing to compensate for the shortcomings of their initial segmented decision maps. If post-processing is needed, it can simply be applied to the initial decision map of the proposed method; however, this post-processing is not needed for every input image, because the initial segmented decision map of the proposed method already has satisfactory quality without any post-processing algorithm. The upcoming results show that the initial segmented decision map of the proposed method is much better than those of the other methods, even when they apply many post-processing algorithms to their initial segmented decision maps. The flowchart of the proposed method (ECNN) for obtaining the initial segmented decision map of an example multi-focus image is shown in Fig. 5.
2.5 Complexity comparison of deep learning based networks
In the last decade, GPU hardware has become widespread, encouraging researchers to apply deep learning to image processing and computer vision applications. Accordingly, deep learning based multi-focus image fusion methods have greatly enhanced the decision map and the quality of the fused image. These deep learning based fusion networks are implemented and trained in various frameworks such as PyTorch, Caffe, and TensorFlow, and most of their training and fusion source codes are not provided. There is therefore no way to compare the running times of the deep learning based methods under fair conditions. The fairest feasible complexity comparison for deep learning based multi-focus fusion methods is thus based on the number of weights of the main network instead of a running-time comparison.
Figure 5. The flowchart of the proposed ECNN method for obtaining the initial segmented decision map of multi-focus image fusion.
We calculate the weights and biases of a network with the following procedure. For each convolution layer, the number of weights is W×H×C×F, where W and H are the width and height of the kernel, C is the number of input channels, and F is the number of kernels (filters) of the convolution layer. In the first convolution layer, C equals the number of image or patch channels (C=1 for gray-scale and C=3 for RGB patches); in subsequent layers, C equals the number of kernels of the previous convolution layer. For each convolution layer, the number of biases equals the number of kernels. For calculating the number of weights of a fully connected (FC) layer, there are two cases. In the first case, the FC layer is connected to a convolution layer; the number of weights is then OW×OH×C×N, where OW, OH, and C are the width, height, and number of kernels of the output volume of the last convolution layer, and N is the number of neurons in the FC layer. In the second case, the FC layer is connected to another FC layer; the number of weights is then N1×N2, where N1 and N2 are the numbers of neurons of the previous and current FC layers, respectively. In both cases, the number of biases equals the number of neurons of the FC layer. With this procedure, we calculated the numbers of weights and biases of the proposed network and of four previous state-of-the-art deep learning based multi-focus image fusion methods; they are listed in Table 1. The number of weights of the proposed network is remarkably lower than those of CNN [54] and FCN [59], but higher than those of MSCNN [55] and p-CNN [57]. These weights and biases must be tuned during the training process, and the input multi-focus patches must pass through these parameters of the pre-trained network afterwards. Besides this, it is important to remember that the proposed network produces the cleanest initial segmented decision map among the compared methods and therefore has the least need of heavy post-processing, whereas the previous state-of-the-art deep learning based methods depend strongly on post-processing steps for refining the initial decision map. Consequently, our proposed network does not need to spend extra time and computation on post-processing to refine the initial segmented decision map. A sketch of the counting procedure is given below.
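The counting procedure above can be reproduced in a few lines; the layer list corresponds to the architecture of Section 2.3, and this script reproduces the ECNN entry of Table 1 (1,582,784 weights and 1,474 biases):

```python
def conv_params(w, h, c, f):
    """Weights and biases of a convolution layer: W*H*C*F weights, F biases."""
    return w * h * c * f, f

def fc_params(n_in, n_out):
    """Weights and biases of an FC layer: N1*N2 weights, N2 biases."""
    return n_in * n_out, n_out

# 3x3 conv layers of the five paths (in_channels, out_channels), per Section 2.3.
convs = [(1, 64), (64, 128), (128, 128), (128, 256),   # path #1
         (1, 64), (64, 128), (128, 128),               # path #2
         (1, 64), (64, 128), (128, 128),               # path #3
         (256, 256)]                                   # path #4
weights = biases = 0
for cin, cout in convs:
    w, b = conv_params(3, 3, cin, cout)
    weights += w; biases += b
w, b = fc_params(4 * 8 * 512, 2)                       # path #5 FC layer
weights += w; biases += b
print(weights, biases)                                 # 1582784 1474
```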
3. Experimental Results
This section discusses the performance of the proposed method and presents the simulation results for the proposed method and the other state-of-the-art methods used for comparison. The proposed algorithm is coded and trained using PyTorch (0.4.0) on Ubuntu Linux 16.04 LTS. The hardware is a Core i7 6900K CPU with 32 GB RAM and a STRIX-GTX1080-O8G GPU. The simulation results of the proposed method are compared with the results of state-of-the-art methods based on the spatial domain, multi-scale transforms, and CNN and FCN architectures. In order to compare the proposed method with the previous methods under fair conditions, we used the experimental results reported in recent papers, together with their fusion quality metrics and the non-referenced multi-focus images captured with real cameras. The 25 state-of-the-art methods compared with our proposed method are GFF [43], IMF [44], DSIFT [40], BFMM [42], MWGF [21], SSDI [41], SRCF [4], MSTSR [27], CBF [48], CSR [26], IFGD [23], GIF [37], DCHWT [18], ICA [49], NSCT [22], PCNN [46, 47], PCA [29], DCTLP [12], DCTV [3], WSSM [19], CNN [54], MSCNN [55], p-CNN [57], CAB [33], and FCN [59]. For some of these methods, the available source codes of their algorithms were used to obtain the initial decision maps and the fused images; for the rest, the results reported in [33, 54, 55, 59] were used in this paper. The evaluation performance metrics of
Table 1.
Comparison of the number of weights and biases between the deep learning based networks of the proposed method and the others.

Method    CNN [54]    MSCNN [55]   p-CNN [57]   FCN [59]     ECNN (proposed)
Weights   4,933,248   803,968      304,902      16,813,184   1,582,784
Biases    1,154       898          235          12,430       1,474
image fusion used in the previous methods [33, 54, 55, 59] are used here to assess our proposed method and compare it with the previous methods. These fusion metrics and the test multi-focus images were obtained directly through kind correspondence with their authors. This paper uses several non-referenced image fusion quality metrics: the total information transferred from the source images to the fused image QAB/F [69, 70], the similarity-based quality metric Yc or Q(A,B,F) [71], mutual information (MI), the phase congruency-based fusion metric QPC [72], the structural similarity-based fusion metric QW [73], the human perception-based fusion metric QCB [74], normalized mutual information (NMI) [75], visual information fidelity (VIF) [77], feature mutual information (FMI) [76], and the nonlinear correlation information entropy (QNICE) [78]. Assessing the fusion process is very hard for non-referenced multi-focus images, whose ground-truth images are not available. Also, when the results of the state-of-the-art methods are close together, the values of the non-referenced fusion metrics are not reliable for judgment. Therefore, the most reliable way to compare the proposed method with the previous methods is visual comparison of its initial segmented decision map with the initial and final decision maps of the other methods. The critical point to consider is that the initial segmented decision maps are obtained without applying any post-processing algorithm, unlike the final decision maps. This paper shows that the initial segmented decision map of the proposed method matches or surpasses those of the others. An example sketch of one of these metrics is given below.
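As an illustration of one of these metrics, mutual information between a source image and the fused image can be estimated from a joint gray-level histogram; this is a minimal sketch, and the exact implementations used in the cited papers may differ:

```python
import numpy as np

def mutual_information(a, f, bins=256):
    """Estimate MI between a source image a and the fused image f."""
    joint, _, _ = np.histogram2d(a.ravel(), f.ravel(), bins=bins)
    p = joint / joint.sum()                  # joint probability
    pa, pf = p.sum(axis=1), p.sum(axis=0)    # marginals
    nz = p > 0                               # avoid log(0)
    outer = pa[:, None] * pf[None, :]
    return float((p[nz] * np.log2(p[nz] / outer[nz])).sum())

# The fusion MI reported in such tables is typically MI(A, F) + MI(B, F).
```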
The proposed ECNN method is applied to a wide range of well-known test multi-focus images that were recently used in many state-of-the-art methods for comparison. We applied ECNN to the 20 pairs of color multi-focus images of the Lytro dataset and achieved high-quality fused images, as shown in Fig. 6. We also applied ECNN to other well-known gray-scale and color test multi-focus images such as "Flower" and "Leopard", whose source images and ECNN fused images are also shown in Fig. 6. We now compare some samples of these images both qualitatively and quantitatively.
[Fig. 6 image grid: lytro-01 to lytro-20, Temple, Flower, Lab, Calendar, Book, Leopard, Desk-02, Clock, and Newspaper, each shown as source images (A, B) and the ECNN fused image (F).]
Figure 6. The Lytro dataset and other well-known multi-focus images used in the experiments, with the fused images of the proposed ECNN method. The symbols "A" and "B" stand for the source input multi-focus images, and the symbol "F" stands for the fused image of the proposed ECNN method.
Fig. 7 compares the initial and final decision maps of the different methods for the "Flower" color multi-focus images. The two multi-focus source images of "Flower" are shown in Figs. 7(a) and (b). The final decision map of MWGF is shown in Fig. 7(c); this method introduces ringing and jagged artifacts at the edges of the decision map. The initial and final segmented decision maps of IMF are shown in Figs. 7(d) and (e), respectively. Fig. 7(d) shows many errors even after applying the post-processing algorithms, and the final decision map of IMF shows undesirable results compared with the source images; it is not acceptable. The initial segmented and final decision maps of GFF are shown in Figs. 7(f) and (g), respectively; these decision maps are also inappropriate and unacceptable. The initial and final decision maps of BFMM in Figs. 7(h) and (i) are very undesirable and useless for fusing these images, because of the jagged artifacts at the edges of the input images. The decision map of DSIFT in Fig. 7(j) is irregular and also shows a spurious hole in the map. The initial segmented and final decision maps of SSDI are shown in Figs. 7(k) and (l), respectively; these maps are also irregular and have thick boundaries in the decision area. The initial segmented decision maps, and the final decision maps obtained after applying many post-processing algorithms, are shown in Figs. 7(m)-(r) for the CNN-based methods CNN, MSCNN, and p-CNN, respectively. These methods make many errors in the initial segmented decision map compared with the regions of the input multi-focus images, even though these are the best results reported in their papers [54-57]. The final decision maps of these methods are obtained after applying many post-processing steps to their initial segmented decision maps, such as consistency verification (CV), guided filtering, small region removal, watershed, and morphological filtering (closing and opening). Even so, their final decision maps in Figs. 7(n), (p), and (r) do not match the focused areas of the input multi-focus images well; the final decision maps of these methods therefore remain deficient. The initial segmented decision map of the proposed method, obtained without any post-processing algorithm, is shown in Fig. 7(s). It is clearly better than the other initial and final segmented decision maps, which were obtained after applying extensive post-processing to the initial segmented decision maps. The fused image of the proposed method (ECNN), produced from its initial segmented decision map, is shown in Fig. 7(t).
Figure 7. The initial and final segmentation maps (with and without post-processing) of the proposed method and the others for the "Flower" image. (a) The first image, (b) the second image, (c) MWGF [21], (d) IMF [44] without post-processing, (e) IMF [44] with post-processing, (f) GFF [43] without post-processing, (g) GFF [43] with post-processing, (h) BFMM [42] without post-processing, (i) BFMM [42] with post-processing, (j) SSDI [41] without post-processing, (k) DSIFT [40] without post-processing, (l) DSIFT [40] with post-processing, (m) CNN [54] without post-processing, (n) CNN [54] with post-processing, (o) MSCNN [55] without post-processing, (p) MSCNN [55] with post-processing, (q) p-CNN [57] without post-processing, (r) p-CNN [57] with post-processing, (s) the initial map of ECNN (the proposed method, without post-processing), (t) the fused image using ECNN (the proposed method).
In a similar experiment, Fig. 8 compares the initial segmented and final decision maps for the "Children" color multi-focus images. The final decision map of MWGF is shown in Fig. 8(c), which shows ringing and jagged artifacts at the edges of the decision map. The initial and final segmented decision maps of IMF are shown in Figs. 8(d) and (e), respectively; the final map shows undesirable side effects at the edge boundaries of the decision map. The initial and final segmented decision maps of GFF are shown in Figs. 8(f) and (g), respectively; these decision maps are also very inappropriate and not acceptable as good decision maps. The initial and final decision maps of BFMM are shown in Figs. 8(h) and (i), respectively; this method shows undesirable jagged artifacts at the boundaries of the final decision map. The decision map of DSIFT in Fig. 8(j) is also irregular. The initial segmented and final decision maps of SSDI are shown in Figs. 8(k) and (l), respectively; these maps also show unfavorable areas in the decision map. The initial and final segmented decision maps of CNN are shown in Figs. 8(m) and (n), respectively. The initial segmented decision map of this method shows many errors compared with the focused areas of the source input images. The final decision maps of CNN and MSCNN, obtained by applying many post-processing algorithms, are shown in Figs. 8(n) and (o), respectively. The initial segmented and final decision maps of FCN are shown in Figs. 8(p) and (q), respectively. The initial segmented decision map of FCN is very poor compared with the others; the first results of the FCN method cannot be considered suitable for fusion, so some post-processing algorithms must be used to polish its initial decision map. To refine this inappropriate initial decision map and achieve the final decision map depicted in Fig. 8(q), the authors utilize the fully connected conditional random field (CRF) [60], a method for multi-class image segmentation. Overall, the main share of the acceptable quality of their final decision map belongs to this additional algorithm, which is a separate issue from the deep learning based network for image fusion. The initial segmented decision map of the proposed method (ECNN) is shown in Fig. 8(r). As expected, the initial segmented decision map of the proposed method (without any post-processing) is much better than all the others' initial and final segmented decision maps; only the final decision map of MSCNN, obtained after applying many post-processing algorithms, comes close to the initial segmented decision map of the proposed method, which is obtained without any post-processing. The fused image of our proposed method is shown in Fig. 8(s) and shows the best quality among the compared methods.
In another experiment, the two multi-focus source images of "Lytro-10" are shown in Figs. 9(a) and 9(b). Here we compare the initial segmented decision map of our proposed method with the initial and final decision maps of MWGF, SSDI, IMF, GFF, BFMM, DSIFT, CNN, and CAB. The decision map of MWGF is shown in Fig. 9(c), which has ringing artifacts at the edges of the focused region. The final decision maps of SSDI, IMF, GFF, BFMM, DSIFT, and CNN are shown in Figs. 9(d)-(i), respectively. These decision maps are achieved after applying extensive post-processing steps, yet they still have undesirable side effects such as jagged artifacts and areas mistakenly taken from the focused regions of the source images. The initial segmented and final decision maps of the recently published CAB method are shown in Figs. 9(j) and (k), respectively. The initial decision map of this method has severe jagged artifacts and mistaken areas of the focused region at the edges of the source images; its final decision map also has some areas that are not related to the focused regions of the source images. The initial decision map of our proposed ECNN method, achieved without applying any post-processing, is shown in Fig. 9(l). It is clearly much cleaner than those of the other methods, which are achieved after extensive post-processing steps. The fused image of the proposed ECNN method, produced with this initial segmented decision map, is shown in Fig. 9(m).
Figure 8. The initial and final segmentation maps (with and without post-processing) of the proposed method and the others for the "Children" image. (a) The first image, (b) the second image, (c) MWGF [21], (d) IMF [44] without post-processing, (e) IMF [44] with post-processing, (f) GFF [43] without post-processing, (g) GFF [43] with post-processing, (h) BFMM [42] without post-processing, (i) BFMM [42] with post-processing, (j) SSDI [41], (k) DSIFT [40] without post-processing, (l) DSIFT [40] with post-processing, (m) CNN [54] without post-processing, (n) CNN [54] with post-processing, (o) MSCNN [55] with post-processing, (p) FCN [59] without post-processing, (q) FCN [59] with post-processing, (r) the initial map of ECNN without any post-processing (the proposed method), (s) the fused image using ECNN (the proposed method).
In the last qualitative comparison, we compare our proposed ECNN method with MWGF, SSDI, IMF, GFF, BFMM, and CNN on the gray-scale multi-focus images of "Leopard". The two multi-focus source images of "Leopard" are shown in Figs. 10(a) and (b). The decision map of MWGF is shown in Fig. 10(c), which has ringing artifacts at the edges of the decision map. The initial and final decision maps of SSDI are shown in Figs. 10(d) and (e); they have many areas mistakenly taken from the focused region of the source images. The initial and final decision maps of IMF are shown in Figs. 10(f) and (g); they have many undesirable regions, such as fading, in the decision map. The initial and final decision maps of GFF are shown in Figs. 10(h) and (i); they also have many undesirable and mistaken regions that are not related to the focused region of the source images. The initial and final decision maps of BFMM are shown in Figs. 10(j) and (k); they have many jagged artifacts and are not suitable as an ideal decision map. The initial and final decision maps of CNN are shown in Figs. 10(l) and (m); both have a large mistaken area of the focused region according to the source images. This experiment shows that methods that do not produce an acceptable initial decision map cannot achieve a suitable decision map even after applying many post-processing algorithms. The initial segmented decision map and the fused image of our proposed ECNN method are shown in Figs. 10(n) and (o). The initial segmented decision map of our proposed ECNN method, achieved without applying any post-processing, is much neater and cleaner than the others' initial and final decision maps.
Figure 9. The initial and final segmentation maps (with and without post-processing) of the proposed method and the others for the "Lytro-10" image. (a) The first image, (b) the second image, (c) MWGF [21], (d) SSDI [41] with post-processing, (e) IMF [44] with post-processing, (f) GFF [43] with post-processing, (g) BFMM [42] with post-processing, (h) DSIFT [40] with post-processing, (i) CNN [54] with post-processing, (j) CAB [33] without post-processing, (k) CAB [33] with post-processing, (l) the initial map of ECNN without any post-processing (the proposed method), (m) the fused image using ECNN (the proposed method).
Figure 10. The initial and final segmentation maps (with and without post-processing) of our proposed method and the others for the "Leopard" image. (a) The first source image, (b) the second source image, (c) MWGF [21], (d) SSDI [41] without post-processing, (e) SSDI [41] with post-processing, (f) IMF [44] without post-processing, (g) IMF [44] with post-processing, (h) GFF [43] without post-processing, (i) GFF [43] with post-processing, (j) BFMM [42] without post-processing, (k) BFMM [42] with post-processing, (l) CNN [54] without post-processing, (m) CNN [54] with post-processing, (n) the initial map of ECNN without any post-processing (the proposed method), (o) the fused image using ECNN (the proposed method).
As mentioned before, the best way to compare the fused images of different methods is visual, qualitative comparison. Because the ground truth of real multi-focus images is not available, we have to use non-referenced image fusion quality metrics; such quantitative assessments are not always as reliable as referenced metrics like MSE and SSIM. Nevertheless, we compare the proposed method with the others using the reported results of the non-referenced quality metrics for MSCNN, p-CNN, FCN, and CAB [55, 57, 59, 33]. In the first objective assessment, the proposed method is compared with MWGF, SSDI, CNN, DSIFT, and MSCNN using the quality metrics MI, QAB/F, and Q(A,B,F) in Table 2. Overall, the values in Table 2 indicate that the proposed method gives better results in most cases. In the second objective assessment, the proposed method is compared with GFF, IMF, CNN, DSIFT, BFMM, and p-CNN using the quality metrics QPC, QW, and QCB in Table 3. As expected from the qualitative comparison, the results of our proposed method are better than those of the other methods in the quantitative comparison. In the last quantitative assessment, we compare our proposed method with NSCT, GFF, IMF, CBF, DCHWT, MWGF, BFMM, DSIFT, CNN, FCN, WSSM, PCNN, DCTLP, MSTSR, DCTV, SRCF, GIF, IFGD, ICA, PCA, CSR, and CAB using the quality metrics MI, QAB/F, VIF, NMI, FMI, Yc, and QNICE in Table 4. In this experiment, we used the 20 pairs of color multi-focus images of the Lytro dataset; the average scores of these fusion metrics over the 20 pairs for these 22 methods are listed in Table 4. The scores of these metrics for our proposed ECNN method are higher than those of the other 22 methods.
Table 2
Comparison of objective quality metrics of our proposed multi-focus image fusion method and the others. (* from [55])

Test Images   Fusion Metrics   MWGF* [21]   SSDI* [41]   DSIFT* [40]   CNN* [54]   MSCNN* [55]   ECNN (proposed)
Lab           MI               8.0618       8.1412       8.2501        8.6008      8.8044        8.8531
              QAB/F            0.7147       0.7528       0.7585        0.7573      0.7588        0.7588
              Q(A,B,F)         0.8746       0.8823       0.9132        0.8947      0.9148        0.9831
Temple        MI               5.9655       7.0896       7.3514        6.8895      7.4177        7.3727
              QAB/F            0.7501       0.7634       0.7643        0.7590      0.7623        0.7675
              Q(A,B,F)         0.8992       0.9125       0.9138        0.9063      0.9251        0.9908
Seascape      MI               7.1404       7.4824       7.9487        7.6285      8.0214        8.3935
              QAB/F            0.7059       0.7110       0.7126        0.7113      0.7122        0.7377
              Q(A,B,F)         0.9366       0.9473       0.9452        0.9481      0.9547        0.9752
Book (color)  MI               8.2368       8.4008       8.6623        8.7796      8.8947        8.9319
              QAB/F            0.7240       0.7260       0.7134        0.7277      0.7284        0.7259
              Q(A,B,F)         0.9120       0.9221       0.9045        0.9374      0.9473        0.9830
Leopard       MI               9.9474       10.8887      10.9226       10.8792     10.9420       10.9400
              QAB/F            0.8175       0.8171       0.8069        0.7973      0.8267        0.8275
              Q(A,B,F)         0.9435       0.9325       0.9572        0.9218      0.9748        0.9933
Children      MI               8.2622       7.8505       8.5252        8.3338      8.5363        8.4414
              QAB/F            0.6741       0.6799       0.7394        0.7408      0.7384        0.7467
              Q(A,B,F)         0.8675       0.8752       0.9255        0.9263      0.9341        0.9882
Flower        MI               8.3255       8.1049       8.5365        8.2695      8.6125        8.5859
              QAB/F            0.6913       0.6490       0.7159        0.7183      0.7157        0.7221
              Q(A,B,F)         0.9460       0.9207       0.9479        0.9566      0.9689        0.9793
Overall, the methods selected among the 25 compared with our proposed method include the state-of-the-art methods that have shown the best multi-focus image fusion results in recent years. However, the initial segmented decision maps of these methods are undesirable and unacceptable compared with the focused areas of the source images, and their final decision maps are still unsatisfactory even after applying many post-processing algorithms to the initial decision maps. By visual observation, the initial segmented decision map of our proposed method (ECNN), without any post-processing algorithms, is remarkably better than those of the others, with or without post-processing algorithms. In the many quantitative comparisons conducted, our proposed ECNN method also shows the best results across 10 fusion metrics against 25 other methods.
Table 3
Comparison of objective quality metrics of the proposed multi-focus image fusion method and the others. (* from [57])

Test Images   Fusion Metrics   GFF* [43]   IMF* [44]   DSIFT* [40]   BFMM* [42]   CNN* [54]   p-CNN* [57]   ECNN (proposed)
Book          QPC              0.6822      0.6827      0.6631        0.6812       0.6829      0.6835        0.8195
              QW               0.6272      0.6264      0.5985        0.6361       0.6219      0.6162        0.9279
              QCB              0.7143      0.7358      0.7355        0.7277       0.7224      0.7359        0.7771
Calendar      QPC              0.6479      0.6451      0.6476        0.6493       0.6494      0.6495        0.7534
              QW               0.6842      0.6970      0.6903        0.6924       0.6868      0.6870        0.9189
              QCB              0.7033      0.7217      0.7153        0.7256       0.7255      0.7332        0.8030
Flower        QPC              0.7084      0.6876      0.7014        0.7032       0.7093      0.7090        0.7594
              QW               0.4781      0.4683      0.5009        0.4123       0.5049      0.5051        0.9198
              QCB              0.8048      0.8387      0.8333        0.7153       0.5051      0.8049        0.8270
Lab           QPC              0.6865      0.6876      0.7014        0.7032       0.7037      0.7046        0.7935
              QW               0.5100      0.7014      0.5009        0.4984       0.5012      0.5018        0.9163
              QCB              0.8344      0.7032      0.8333        0.8321       0.8356      0.8398        0.7489
Desk          QPC              0.7364      0.7246      0.7154        0.7270       0.7037      0.7360        0.7796
              QW               0.7038      0.6743      0.6721        0.6745       0.7037      0.7269        0.9036
              QCB              0.5714      0.5585      0.5619        0.5628       0.5694      0.5720        0.7602
Newspaper     QPC              0.2195      0.1769      0.1849        0.1960       0.1845      0.1865        0.6483
              QW               0.6207      0.6171      0.6252        0.6182       0.6270      0.6273        0.7722
              QCB              0.7413      0.7441      0.7484        0.6732       0.7503      0.7512        0.7450
Clock         QPC              0.7089      0.7110      0.6855        0.7098       0.7016      0.7130        0.9060
              QW               0.6123      0.6158      0.5791        0.7016       0.6180      0.6197        0.9311
              QCB              0.7426      0.7448      0.7497        0.7130       0.7511      0.7512        0.7831
Leopard       QPC              0.7215      0.7112      0.7079        0.7205       0.7207      0.7206        0.9514
              QW               0.8226      0.8192      0.5297        0.8230       0.8205      0.8232        0.9572
              QCB              0.8225      0.8584      0.7275        0.8270       0.8581      0.8601        0.8820
4. Conclusions
A new multi-focus image fusion method based on convolutional neural networks was introduced in this paper. The main idea of this method is to use an ensemble of three CNNs trained on three different datasets, so that ensemble learning helps the network to predict the decision map correctly. The proposed method also introduces a simple rearrangement of the multi-focus dataset patches that yields better accuracy.

In the qualitative and quantitative assessments, the obtained results strongly indicate that the initial segmented decision map is considerably better than those of all previous state-of-the-art methods. It was also shown that the initial segmented decision map of the proposed method is similar to, or even better than, the others' initial and final segmented decision maps obtained after applying many post-processing algorithms. The assessments and experiments were conducted on many well-known real non-referenced multi-focus images using standard fusion quality metrics, and they demonstrate the superior quality of the fused images produced by the proposed algorithm in comparison with the other state-of-the-art methods. The source code of our proposed method and all of the supplementary files will be provided on the personal website² and GitHub³ of this paper's authors.
2 www.amin-naji.com and www.imagefusion.com
3 www.github.com/mostafaaminnaji
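As a rough illustration of the ensemble idea summarized above, the sketch below averages the focus-probability maps of three trained models, one fed with the original patches and two fed with vertical- and horizontal-gradient patches, and thresholds the average into an initial decision map. The model interface, the gradient operator, and the averaging-then-thresholding rule are illustrative assumptions, not the exact ECNN implementation:

    import numpy as np

    def vertical_gradient(img):
        # Forward difference along rows; a simple stand-in for the
        # vertical-gradient patches described in the paper.
        return np.diff(img, axis=0, append=img[-1:, ...])

    def horizontal_gradient(img):
        # Forward difference along columns.
        return np.diff(img, axis=1, append=img[:, -1:, ...])

    def ensemble_decision_map(src_a, src_b, models):
        # models: dict of three trained predictors (hypothetical interface),
        # each returning a per-pixel probability that src_a is in focus.
        probs = [
            models["original"](src_a, src_b),
            models["grad_v"](vertical_gradient(src_a), vertical_gradient(src_b)),
            models["grad_h"](horizontal_gradient(src_a), horizontal_gradient(src_b)),
        ]
        avg = np.mean(probs, axis=0)          # combine the three CNN outputs
        return (avg >= 0.5).astype(np.uint8)  # initial segmented decision map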
Table 4
MI, QAB/F, VIF, NMI, FMI, Yc, and QNICE comparison of various image fusion methods on 20 pairs of color multi-focus images of the Lytro dataset; entries are average values over the 20 pairs. (* from [59], ** from [33])

Methods            MI       QAB/F    VIF      NMI       FMI       Yc        QNICE
NSCT [22] 3.1473* 0.5709* 0.5132* N/A N/A N/A N/A
GFF [43] 4.1211* 0.7601* 0.7430* N/A N/A N/A N/A
IMF [44] 4.2879* 0.7534* 0.7233* 1.1420** 0.6543** 0.9861** 0.8440**
CBF [48] 3.8211* 0.7528* 0.6870* 1.0184** 0.6072** 0.9680** 0.8349**
DCHWT [18] 3.3649* 0.7124* 0.6465* 0.8971** 0.5481** 0.9280** 0.8275**
MWGF [21] 4.2336* 0.7479* 0.7316* 1.1479** 0.6527** 0.9884** 0.8427**
BFMM [42] 4.4376* 0.7572* 0.7412* N/A N/A N/A N/A
DSIFT [40] 4.4588* 0.7621* 0.7492* N/A N/A N/A N/A
CNN [54] 4.3211* 0.7618* 0.7465* N/A N/A N/A N/A
FCN [59] 4.4578* 0.7655* 0.7531* N/A N/A N/A N/A
WSSM [19] N/A 0.7296** N/A 0.9623** 0.5732** 0.9594** 0.8323**
PCNN [46, 47] N/A 0.7036** N/A 1.2068** 0.6354** 0.9690** 0.8482**
DCTLP [12] N/A 0.6562** N/A 0.8296** 0.5018** 0.8821** 0.8235**
MSTSR [13] N/A 0.7543** N/A 0.9995** 0.6081** 0.9675** 0.8323**
DCTV [3] N/A 0.7530** N/A 1.1860** 0.6333** 0.9657** 0.8428**
SRCF [4] N/A 0.7628** N/A 1.1930** 0.6623** 0.9892** 0.8465**
GIF [17] N/A 0.7608** N/A 1.1853** 0.6612** 0.9889** 0.8468**
IFGD [16] N/A 0.7174** N/A 1.0456** 0.5387** 0.8554** 0.8136**
ICA [49] N/A 0.7445** N/A 0.9374** 0.5834** 0.9555** 0.8286**
PCA [29] N/A 0.5992** N/A 0.8939** 0.5707** 0.8483** 0.8529**
CSR [26] N/A 0.7422** N/A 1.0135** 0.5575** 0.9402** 0.8327**
CAB [33] N/A 0.7645** N/A 1.2097** 0.6626** 0.9895** 0.8474**
ECNN (Proposed) 4.6565 0.7867 0.7595 1.2401 0.6782 0.9910 0.8551
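The NMI column in Table 4 is commonly computed as the entropy-normalized revision of the fusion mutual information due to Hossny et al. [75]. Assuming that definition and the same 256-bin histogram estimates as before, a minimal sketch:

    import numpy as np

    def entropy(x, bins=256):
        # Shannon entropy of the gray-level distribution, in bits.
        p, _ = np.histogram(x.ravel(), bins=bins)
        p = p / p.sum()
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))

    def joint_entropy(x, y, bins=256):
        pxy, _, _ = np.histogram2d(x.ravel(), y.ravel(), bins=bins)
        pxy = pxy / pxy.sum()
        pxy = pxy[pxy > 0]
        return float(-np.sum(pxy * np.log2(pxy)))

    def fusion_nmi(a, b, f, bins=256):
        # I(X;F) = H(X) + H(F) - H(X,F); each term is normalized by H(X) + H(F).
        ha, hb, hf = entropy(a, bins), entropy(b, bins), entropy(f, bins)
        mi_af = ha + hf - joint_entropy(a, f, bins)
        mi_bf = hb + hf - joint_entropy(b, f, bins)
        return 2.0 * (mi_af / (ha + hf) + mi_bf / (hb + hf))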
References
[1] S. Li, X. Kang, L. Fang, J. Hu, H. Yin, Pixel-level image fusion: A survey of the state of the art, Information
Fusion, 33 (2017) 100-112.
[2] M. Amin-Naji, A. Aghagolzadeh, Multi-Focus Image Fusion in DCT Domain using Variance and Energy of
Laplacian and Correlation Coefficient for Visual Sensor Networks, Journal of AI and Data Mining, 6 (2) (2018)
233-250.
[3] M. Haghighat, A. Aghagolzadeh, H. Seyedarabi, Multi-focus image fusion for visual sensor networks in DCT
domain, Computers & Electrical Engineering, 37 (5) (2011) 789-797.
[4] M. Nejati, S. Samavi, S. Shirani, Multi-focus image fusion using dictionary-based sparse representation, Information Fusion, 25 (2015) 72-84.
[5] D. Drajic, N. Cvejic, Adaptive fusion of multimodal surveillance image sequences in visual sensor networks, IEEE Transactions on Consumer Electronics, 53 (4) (2007) 1456-1462.
[6] S. Soro, W. Heinzelman. A Survey of Visual Sensor Networks, Advances in Multimedia, (2009).
[7] T. Stathaki, Image fusion: algorithms and applications, Academic Press Elsevier, 2011.
[8] M. Amin-Naji, P. Ranjbar-Noiey, A. Aghagolzadeh, Multi-focus image fusion using Singular Value
Decomposition in DCT domain, The 10th Iranian Conference on Machine Vision and Image Processing (MVIP),
2017, pp. 45-51.
[9] M. A. Naji, A. Aghagolzadeh, Multi-focus image fusion in DCT domain based on correlation coefficient, 2nd
International Conference on Knowledge-Based Engineering and Innovation (KBEI), 2015, pp. 632-639.
[10] M. A. Naji, A. Aghagolzadeh, A new multi-focus image fusion technique based on variance in DCT domain, 2nd
International Conference on Knowledge-Based Engineering and Innovation (KBEI), 2015, pp. 478-484.
[11] Y. Phamila, R. Amutha, Discrete Cosine Transform based fusion of multi-focus images for visual sensor
networks, Signal Processing, 95 (2014) 161-170.
[12] V. Naidu, B. Elias, A novel image fusion technique using DCT based Laplacian pyramid, International Journal of Inventive Engineering and Sciences (IJIES), ISSN 2319-9598, (2013).
[13] L. Cao, L. Jin, H. Tao, G. Li, Z. Zhuang, Y. Zhang, Multi-Focus Image Fusion Based on Spatial Frequency in
Discrete Cosine Transform Domain, IEEE Signal Processing Letters, 22 (2) (2015) 220-224.
[14] I. De, B. Chanda. A simple and efficient algorithm for multifocus image fusion using morphological wavelets,
Signal Processing. 86 (5) (2006) 924-936.
[15] H. Li, B. Manjunath, S. Mitra, Multisensor Image Fusion Using the Wavelet Transform, Graphical Models and
Image Processing, 57 (3), (1995) 235-245.
[16] O. Rockinger, Image sequence fusion using a shift-invariant wavelet transform, Proceedings of IEEE
International Conference on Image Processing, 3 (1997) 288-291.
[17] V.S. Petrovic, C.S. Xydeas. Gradient-based multiresolution image fusion, IEEE Transactions on Image
Processing. 13 (2) (2004) 228-237.
[18] B. K. S. Kumar, Multifocus and multispectral image fusion based on pixel significance using discrete cosine
harmonic wavelet transform, Signal, Image and Video Processing, 7 (6) (2013), 1125-1143.
[19] J. Tian, L. Chen, Adaptive multi-focus image fusion using a wavelet based statistical sharpness measure, Signal
Processing, 92 (9) (2012) 2137-2146.
[20] V. Naidu, J. Raol, Pixel-level image fusion using wavelets and principal component analysis, Defence Science
Journal, 58 (3) (2008) 338.
[21] Z. Zhou, S. Li, B. Wang, Multi-scale weighted gradient-based fusion for multi-focus images, Information Fusion,
20 (2014) 60-72.
[22] Q. Zhang, B.-L. Guo, Multifocus image fusion using the nonsubsampled contourlet transform, Signal Processing, 89 (7) (2009) 1334–1346.
[23] S. Paul, I. S. Sevcenco, P. Agathoklis, Multi-exposure and multi-focus image fusion in gradient domain, Journal
of Circuits, Systems and Computers, 25 (10) (2016) 1650123.
[24] Q. Zhang, Y. Liu, R.S. Blum, J. Han, D. Tao, Sparse representation based multi-sensor image fusion for multi-
focus and multi-modality images: a review, Information Fusion, 40 (2018) 57-75.
[25] Q. Zhang, M.D. Levine, Robust multi-focus image fusion using multi-task sparse representation and spatial
context, IEEE Transactions on Image Processing 25 (5) (2016) 2045-2058.
[26] Y. Liu, X. Chen, R.K. Ward, Z.J. Wang, Image fusion with convolutional sparse representation, IEEE Signal Processing Letters, 23 (12) (2016) 1882–1886.
[27] Y. Liu, S. Liu, Z. Wang, A general framework for image fusion based on multi-scale transform and sparse representation, Information Fusion, 24 (2015) 147–164.
[28] W. Huang, Z. Jing, Evaluation of focus measures in multi-focus image fusion, Pattern Recognition Letters, 28 (4)
(2007) 493-500.
[29] W. Wu, X. Yang, Y. Pang, J. Peng, G. Jeon, A multifocus image fusion method by using hidden Markov model,
Optics Communications, 287 (2013) 63-72.
[30] M. Nejati, S. Samavi, N. Karimi, S.M.R. Soroushmehr, S. Shirani, I. Rosta, K. Najarian, Surface area-based focus
criterion for multi-focus image fusion, Information Fusion, 36 (2017) 284–295.
[31] J. Liang, Y. He, D. Liu, X. Zeng, Image fusion using higher order singular value decomposition, IEEE
Transactions on Image Processing, 21 (5) (2012) 2898-2909.
[32] S. Pertuz, D. Puig, M. A. Garcia, Analysis of focus measure operators for shape-from-focus, Pattern Recognition, 46 (5) (2013) 1415-1432.
[33] M. S. Farid, A. Mahmood, S.A. Al-Maadeed, Multi-focus image fusion using Content Adaptive Blurring.
Information Fusion, 45 (2019) 96-112.
[34] S. Li, B. Yang, Multifocus image fusion using region segmentation and spatial frequency, Image and Vision
Computing, 26 (7) (2008) 971-979.
[35] S. Mahajan, A. Singh, A Comparative Analysis of Different Image Fusion Techniques, IPASJ International
Journal of Computer Science (IIJCS), 2 (1) (2014) 8-15.
[36] H. A. Eltoukhy, S. Kavusi, Computationally efficient algorithm for multifocus image reconstruction, Proceedings of SPIE Electronic Imaging, 5017 (2003) 332–341.
[37] K. Zhan, J. Teng, Q. Li, J. Shi. A Novel Explicit Multi-focus Image Fusion Method, Journal of Information
Hiding and Multimedia Signal Processing, 6 (3) (2015) 600-612.
[38] S. Li, J.T. Kwok, Y. Wang, Combination of images with diverse focuses using the spatial frequency, Information
Fusion, 2 (3) (2001) 169-176.
[39] Y. Yang, M. Yang, S. Huang, Y. Que, M. Ding, J. Sun, Multifocus image fusion based on extreme learning
machine and human visual system, IEEE Access, 5 (2017) 6989-7000.
[40] Y. Liu, S. Liu, Z. Wang, Multi-focus image fusion with dense SIFT, Information Fusion. 23 (2015) 139-155.
[41] D. Guo, J. Yan, X. Qu, High quality multi-focus image fusion using self-similarity and depth information. Optics
Communications, 338 (2015) 138-144.
[42] Y. Zhang, X. Bai, T. Wang, Boundary finding based multi-focus image fusion through multi-scale morphological
focus-measure, Information Fusion, 35 (2017) 81-101.
[43] S. Li, X. Kang, J. Hu, Image fusion with guided filtering, IEEE Transactions on Image Processing, 22 (7) (2013)
2864-2875.
[44] S. Li, X. Kang, J. Hu, B. Yang, Image matting for fusion of multi-focus images in dynamic scenes, Information
Fusion, 14 (2) (2013) 147-162.
[45] M. Li, W. Cai, Z. Tan, A region-based multi-sensor image fusion scheme using pulse-coupled neural network, Pattern Recognition Letters, 27 (16) (2006) 1948-1956.
[46] X. Qu, C. Hu, J. Yan, Image fusion algorithm based on orientation information motivated pulse coupled neural
networks, in: 7th World Congress on Intelligent Control and Automation, 2008, pp. 2437–2441.
[47] X.-B. Qu, J.-W. Yan, H.-Z. Xiao, Z.-Q. Zhu, Image fusion algorithm based on spatial frequency-motivated pulse
coupled neural networks in nonsubsampled contourlet transform domain, Acta Automatica Sinica 34 (12) (2008)
1508 – 1514.
[48] B. K. Shreyamsha Kumar, Image fusion based on pixel significance using cross bilateral filter, Signal Image
Video Process. 9 (5) (2015) 1193–1204.
[49] N. Mitianoudis, T. Stathaki, Pixel-based and region-based image fusion schemes using ICA bases, Information
Fusion, special Issue on Image Fusion: Advances in the State of the Art, 8 (2) (2007) 131 – 142.
[50] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature, 521 (7553) (2015) 436-444.
[51] I. Goodfellow, Y. Bengio, A. Courville. Deep Learning, MIT Press, 2016.
[52] Y. Liu, X. Chen, Z. Wang, Z. J. Wang, R. K. Ward, X. Wang, Deep learning for pixel-level image fusion: Recent advances and future prospects, Information Fusion, 42 (2018) 158-173.
[53] M. Amin-Naji, A. Aghagolzadeh, M. Ezoji, CNNs hard voting for multi-focus image fusion, Journal of Ambient
Intelligence and Humanized Computing, (2019), 1-21.
[54] Y. Liu, X. Chen, H. Peng, Z. Wang, Multi-focus image fusion with a deep convolutional neural network,
Information Fusion, 36 (2017) 191-207.
[55] C. Du, S. Gao, Image segmentation-based multi-focus image fusion through multi-scale convolutional neural
network, IEEE Access, 5 (2017) 15750-15761.
[56] C. B. Du, S. Gao, Multi-focus image fusion with the all convolutional neural network. Optoelectronics Letters, 14
(1) (2018) 71-75.
[57] H. Tang, B. Xiao, W. Li, G. Wang, Pixel convolutional neural network for multi-focus image fusion, Information Sciences, 433 (2017) 125-141.
[58] K. Xu, Z. Qin, G. Wang, H. Zhang, K. Huang, S. Ye, Multi-focus image fusion using fully convolutional two-stream network for visual sensors, KSII Transactions on Internet & Information Systems, 12 (5) (2018) 2253-2271.
[59] X. Guo, R. Nie, J. Cao, D. Zhou, W. Qian, Fully Convolutional network-based multifocus image fusion. Neural
Computation, 30 (7) (2018), 1775–1800.
[60] P. Krähenbühl, V. Koltun, Efficient inference in fully connected CRFS with Gaussian edge potentials. In:
Advances in neural information processing systems, (2011), 109–117
[61] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (11) (1998) 2278-2324.
[62] [Online]. Available: https://en.wikipedia.org/wiki/Deep_learning (accessed 2 August 2018).
[63] [Online]. Available: http://cs231n.github.io/convolutional-networks/ (accessed 2 August 2018).
[64] Z. H. Zhou, J. Wu, W. Tang, Ensembling neural networks: many could be better than all, Artificial intelligence,
137 (1-2) (2002) 239-263.
[65] T. G. Dietterich, Ensemble Methods in Machine Learning, International workshop on multiple classifier systems,
2000, pp. 1-15.
[66] D. Opitz, R. Maclin, Popular ensemble methods: An empirical study, Journal of Artificial Intelligence Research, 11 (1999) 169–198.
[67] D. Maji, A. Santara, P. Mitra, and D. Sheet, Ensemble of deep convolutional neural networks for learning to
detect retinal vessels in fundus images. arXiv preprint arXiv:1603.04833, (2016).
[68] [Dataset]. Available: http://cocodataset.org/ (accessed 2 August 2018).
[69] C. Xydeas, V. Petrovic, Objective image fusion performance measure, Electronics Letters, 36 (4) (2000) 308-309.
[70] V. Petrovic, C. Xydeas, Objective image fusion performance characterization, Tenth IEEE International
Conference on Computer Vision (ICCV), 2005, pp. 1866-1871.
[71] C. Yang, J.-Q. Zhang, X.-R. Wang, X. Liu, A novel similarity based quality metric for image fusion, Information
Fusion, 9 (2) (2008) 156-160.
[72] J. Zhao, R. Laganiere, Z. Liu, Performance assessment of combinative pixel-level image fusion based on an
absolute feature measurement. International Journal of Innovative Computing, Information and Control, 3 (6)
(2007) 1433-1447.
[73] G. Piella, H. Heijmans, A new quality metric for image fusion, Proceedings of the 2003 International Conference on Image Processing (ICIP), 2003, pp. 173-176.
[74] Y. Chen, R.S. Blum, A new automated quality assessment algorithm for image fusion, Image and Vision Computing, 27 (10) (2009) 1421-1432.
[75] M. Hossny, S. Nahavandi, D. Creighton, Comments on 'Information measure for performance of image fusion'.
Electronics letters, 44(18) (2008) 1066-1067.
[76] M. B. A. Haghighat, A. Aghagolzadeh, H. Seyedarabi, A non-reference image fusion metric based on mutual
information of image features. Computers & Electrical Engineering, 37(5) (2011) 744-756.
[77] H. Sheikh, A. Bovik, Image information and visual quality. IEEE Transaction on Image Processing 15 (2006)
430–444.
[78] Q. Wang, Y. Shen, J. Jin, Performance evaluation of image fusion techniques, Image Fusion: Algorithms and Applications, 19 (2008) 469–492.