2018 International Conference on Sensor Network and Computer Engineering (ICSNCE 2018)

3D Target Recognition Based on Decision Layer Fusion

Ma Xing a*, Yu Fan b, Yu Haige, Wei Yanxi, Yang Wenhui
School of Computer Science and Engineering
Xi'an Technological University
Xi'an 710021, Shaanxi, China
e-mail: a* 512066020@qq.com; b yffshun@163.com

Abstract—Target recognition has always been a hot research topic in computer vision and pattern recognition. This paper proposes a target recognition method based on decision-layer fusion. The objects to be identified come from ModelNet [1], a 3D CAD model library. Features are extracted from each model's point cloud data and multi-view images: the images are recognized with the AlexNet [2] network, and the point clouds with the VoxNet [3] network. A fusion algorithm then combines the two results at the decision layer. The results show that the proposed method improves the accuracy of object recognition.

Keywords-Target Recognition; Convolutional Neural Network; Decision Fusion

I. INTRODUCTION

At present, methods for identifying objects fall into two main categories: recognizing the images generated from the objects, and recognizing the point clouds generated from the objects. For image recognition, deep learning methods currently achieve high recognition rates. For instance, Xie et al. [6] adopt a multi-view depth image representation and propose the multi-view deep extreme learning machine (MVD-ELM) to achieve fast, high-quality projective feature learning for 3D shapes. Zhu et al. [7] also project 3D shapes into 2D space and use an autoencoder for feature learning on the 2D images. In 2012, AlexNet, the ImageNet contest champion model, became a classic convolutional neural network for image classification.

For point cloud recognition, scholars at home and abroad have done a great deal of research. Some use manually extracted features for classification and recognition: R. B. Rusu et al. [8] used the relationship between the normal vectors of a region as a feature to classify objects, and Yasir Salih et al. [9] used VFH features with a support vector machine classifier to classify and recognize point clouds. Manually extracting features, however, requires highly specialized knowledge and rich experience. Convolutional neural networks can extract features and classify automatically, and they are invariant to displacement, scaling, and other rigid-body changes. Several groups have applied convolutional neural networks to point cloud classification, among which the VoxNet network achieves the highest recognition rate.

The methods above achieve high accuracy in target classification and recognition. However, image recognition is affected by factors such as lighting and viewing angle, while the accuracy of point cloud recognition is lower than that of image recognition. This paper therefore fuses the image and point cloud results at the decision-making level to improve the accuracy of object recognition.

II. NETWORK STRUCTURE

A. Convolutional neural network

Traditional shallow learning methods such as support vector machines require manually extracting image features and then feeding those features to a classifier for training. The problem is that a manually extracted feature is not necessarily the best description of the current image. Even when the selected feature suits the current image well, once external conditions such as viewing angle, scale, or illumination change, the manually selected feature cannot adapt to the change and must be re-engineered for the new situation. By contrast, the input to a convolutional neural network is the entire image. The network continuously adjusts its parameters through a learning algorithm and adaptively extracts the most significant features of the current image, which avoids manual intervention and saves a large amount of manpower; as the input pictures are continually updated, the essential features extracted keep pace with them, ensuring recognition accuracy and efficiency.

As an architecture particularly suited to classifying images, a convolutional neural network can use far fewer parameters than conventional approaches when faced with large-scale, high-resolution image classification problems, learns more image information in the same training time, and achieves higher classification accuracy than conventional shallow machine learning methods such as support vector machines; this is due to its unique network architecture. A convolutional neural network consists of three kinds of layers: convolutional layers, down-sampling layers, and fully connected layers. A down-sampling layer usually follows a convolutional layer, the two alternating, with fully connected layers at the end. Through local connections, weight sharing, and down-sampling that exploits spatial or temporal correlation, convolutional neural networks obtain good invariance to translation, scaling, and distortion, making the extracted features more discriminative.

CNN training comprises two processes: forward propagation and back propagation. Forward propagation passes the input signal from the input layer, through several hidden layers, to the output layer; back propagation passes the error signal from the output layer back toward the input layer. Training mainly uses the error back propagation (EBP) algorithm together with gradient descent to adjust the weights at every level of the network, and is similar to the training process of an ordinary neural network.
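To make these two processes concrete, the following minimal sketch shows one training step in Python with tf.keras (a modern API, not the TensorFlow 0.12.1 used in the experiments below). The layer sizes, loss, and optimizer are illustrative assumptions, not settings taken from this paper: the forward pass carries the input through the hidden layers to the output, and the gradient tape carries the error back to update every weight by gradient descent.

```python
import tensorflow as tf

# Illustrative model: conv -> pool -> conv -> pool -> fully connected,
# mirroring the convolution / down-sampling / fully-connected structure
# described above. Sizes are assumptions, not the paper's settings.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 5, activation="relu", input_shape=(32, 32, 1)),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),  # class scores for 10 categories
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)  # plain gradient descent

def train_step(images, labels):
    with tf.GradientTape() as tape:
        logits = model(images, training=True)  # forward propagation
        loss = loss_fn(labels, logits)         # error at the output layer
    # back propagation: gradients of the error w.r.t. every weight
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```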
B. AlexNet network structure

The standard convolutional neural network (CNN) is a special multilayer feed-forward neural network. It has a deep network structure, generally composed of an input layer, convolutional layers, down-sampling layers, fully connected layers, and an output layer; the convolutional, down-sampling, and fully connected layers are the hidden layers. AlexNet is trained on two GPUs. The model has eight layers: five convolutional layers and three fully connected layers. Each convolutional layer applies the ReLU activation function and local response normalization (LRN), followed by down-sampling (pooling). The network structure designed in this paper is shown in Figure 1.

Figure 1. Multi-view image network structure diagram

The input layer takes images, and there are five convolutional layers whose feature maps have sizes 55, 27, 13, 13, and 6, respectively. The corresponding convolution kernel sizes are 11, 5, 3, 3, and 3. Each of the first two convolutional layers is followed by a max-pooling layer and an LRN layer for local normalization; max pooling takes the maximum value of the feature points in a neighborhood. The convolutional layers produce a large number of features, so direct computation is very expensive and the growth in features makes overfitting particularly likely. Therefore, in the network constructed in this paper, a max-pooling layer is added each time a convolution is performed. The dropout layer has a discard rate of 0.5. Dropout temporarily discards part of the network during training with a certain probability, and each mini-batch discards a different part; it reduces the amount of computation and, more importantly, prevents overfitting. The two fully connected layers have 256 and 10 neurons, respectively. The final output layer is a Softmax layer, which outputs not the class of the input image directly but the probability that the image belongs to each class.
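A hedged tf.keras sketch of this multi-view image network follows. The kernel sizes (11, 5, 3, 3, 3), the 0.5 dropout rate, the 256- and 10-neuron fully connected layers, and the Softmax output come from the description above; the filter counts, strides, and 227x227x3 input size are assumptions borrowed from classic AlexNet, chosen so that the feature map sizes come out to 55, 27, 13, 13, and 6.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Filter counts (96, 256, 384, 384, 256) are assumed AlexNet values;
# the paper fixes only the kernel sizes, dropout rate, and FC widths.
alexnet_like = tf.keras.Sequential([
    layers.Conv2D(96, 11, strides=4, activation="relu",
                  input_shape=(227, 227, 3)),          # feature map 55x55
    layers.MaxPooling2D(3, strides=2),                 # -> 27x27
    layers.Lambda(tf.nn.local_response_normalization), # LRN after conv 1
    layers.Conv2D(256, 5, padding="same", activation="relu"),
    layers.MaxPooling2D(3, strides=2),                 # -> 13x13
    layers.Lambda(tf.nn.local_response_normalization), # LRN after conv 2
    layers.Conv2D(384, 3, padding="same", activation="relu"),
    layers.Conv2D(384, 3, padding="same", activation="relu"),
    layers.Conv2D(256, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(3, strides=2),                 # -> 6x6
    layers.Flatten(),
    layers.Dense(256, activation="relu"),   # fully connected, 256 neurons
    layers.Dropout(0.5),                    # discard rate 0.5, as in the text
    layers.Dense(10, activation="softmax"), # per-class probabilities
])
```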
C. VoxNet network structure

Figure 2 shows the network structure of VoxNet. The input layer accepts data in the form of a 32x32x32 voxel grid. There are two convolutional layers in total, each with 32 feature maps, using 5x5x5 and 3x3x3 convolution kernels, respectively. Their dropout layers have discard rates of 0.2 and 0.3, which prevent overfitting while reducing the amount of computation. The max-pooling layer uses a 2x2x2 filter. Finally, there is a fully connected layer with 128 neurons and a dropout layer with a discard rate of 0.4. The seventh layer is the output layer, with 10 neurons.

Figure 2. VoxNet network structure diagram
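A minimal tf.keras sketch of this structure is given below. The 32x32x32x1 occupancy input, kernel sizes, and first-layer stride are taken from the original VoxNet paper [3] and should be treated as assumptions here, since the text above fixes only the feature-map count (32), the dropout rates (0.2, 0.3, 0.4), the 128-neuron fully connected layer, and the 10 output neurons.

```python
import tensorflow as tf
from tensorflow.keras import layers

voxnet_like = tf.keras.Sequential([
    # 3D convolution over the voxel grid: 32 feature maps, 5x5x5 kernel,
    # stride 2 (kernel/stride assumed from the VoxNet paper).
    layers.Conv3D(32, 5, strides=2, activation="relu",
                  input_shape=(32, 32, 32, 1)),
    layers.Dropout(0.2),                     # discard rate 0.2
    layers.Conv3D(32, 3, activation="relu"), # 32 maps, 3x3x3 kernel
    layers.MaxPooling3D(2),                  # 2x2x2 max-pooling filter
    layers.Dropout(0.3),                     # discard rate 0.3
    layers.Flatten(),
    layers.Dense(128, activation="relu"),    # fully connected, 128 neurons
    layers.Dropout(0.4),                     # discard rate 0.4
    layers.Dense(10, activation="softmax"),  # output layer, 10 classes
])
```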
D. Network convergence structure

Multi-sensor information fusion extracts and integrates information about the same target from multiple source channels for further processing. Information fusion can occur at three levels: the data layer, the feature layer, and the decision layer, ordered from low to high. This paper adopts decision-layer fusion.

Fusion at the decision layer is usually the fusion of the prediction results of multiple classifiers. We extract several kinds of features with different feature extraction algorithms, assume that the features are independent of one another, and note that each kind can predict the recognition result on its own. On this basis, each kind of feature is fed to its own classifier; the predictions of all the classifiers are then combined into the final recognition result, completing the fusion of multiple features at the decision level. As shown in Figure 3, VoxNet extracts point cloud features and AlexNet extracts image features, a Softmax regression model completes recognition and classification for each, and the fusion algorithm combines the two at the decision layer.

Figure 3. Fusion model

III. EXPERIMENTS

A. Experiment environment

The environment used in this experiment was the TensorFlow-GPU 0.12.1 open source software library, the Windows 7 operating system, and an Nvidia GTX 950 graphics card. The data used in the experiment was ModelNet from Princeton University, a large-scale 3D CAD model database similar in spirit to ImageNet for 2D images.

B. Experiment datasets

This article uses the ModelNet40 dataset; the point cloud data comes from the dataset used in PointNet [5], and the 2D images come from multi-view renderings. The top of Figure 4 shows a point cloud; the bottom shows a two-dimensional image.

Figure 4. Point cloud image and two-dimensional image

C. Linear combination coefficient selection

A static linear combination of the prediction results of the AlexNet network and the VoxNet network yields the final prediction. Before combining, we must determine the coefficient of each classifier's prediction, which controls the relative importance of the results predicted by that classifier. Choosing suitable coefficients is very important: appropriate coefficients exploit the advantages of each classifier in the joint decision, so the final recognition accuracy exceeds that of any single classifier, while inappropriate coefficients can leave the joint decision even less accurate than a single classifier.

The last layer of both the AlexNet and the VoxNet network is a Softmax classifier. For each input sample, the Softmax classifier outputs a probability vector $p = (p_1, p_2, \ldots, p_n)$, where $p_i$ is the probability that the sample belongs to class $i$ and $n$ is the number of classes; since $\sum_{i=1}^{n} p_i = 1$, the probabilities that a sample belongs to all classes sum to 1. In single-classifier object recognition, the class label corresponding to the largest element of the probability vector is taken as the class of the sample. Here, for each test sample, let the two classifiers' probability vectors be $p^{A}$ (AlexNet) and $p^{V}$ (VoxNet), with coefficients $\alpha$ and $\beta$, respectively. We then fuse the base classifiers according to Eq. 1:

$p = \alpha p^{A} + \beta p^{V}, \quad \alpha + \beta = 1$  (1)

After the fusion is complete, we obtain an n-dimensional vector, where n is the number of categories, and take the class of the largest element as the final sample label.

D. Recognition results

To test the accuracy of this experimental method, the method of this paper is compared with the recognition accuracy of VoxNet and AlexNet. The experimental results are shown in Table 1.

TABLE I. ACCURACY OF DIFFERENT METHODS

Recognition method    Accuracy/%
AlexNet               85
VoxNet                83

In this paper, the coefficients α and β applied to AlexNet and VoxNet are set to different values: the method with higher accuracy receives the larger coefficient and the method with lower accuracy the smaller one, and the influence of different coefficient combinations on network accuracy is compared. The experimental results are shown in Table 2.

TABLE II. EFFECTS OF DIFFERENT COMBINATIONS OF COEFFICIENTS ON NETWORK ACCURACY

Weights (α, β)    Accuracy/%
(0.5, 0.5)        79.8
(0.6, 0.4)        81.2
(0.7, 0.3)        91
(0.8, 0.2)        87.5
(0.9, 0.1)        86.3

From Table 2, the network has the highest recognition rate when α and β are set to 0.7 and 0.3. Comparing this method with the recognition accuracy of VoxNet and AlexNet, the experimental results show that this method has the highest recognition rate.
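As a concrete illustration, the following Python sketch implements the decision-layer fusion of Eq. 1 with the best coefficients from Table 2. The function name and the use of NumPy are our own choices, not details from the paper.

```python
import numpy as np

def fuse_predictions(p_alexnet, p_voxnet, alpha=0.7, beta=0.3):
    """Decision-layer fusion per Eq. 1: p = alpha * p_A + beta * p_V.

    p_alexnet and p_voxnet are the length-n Softmax probability vectors
    produced by the two classifiers for one sample (n = number of classes).
    The default coefficients are the best combination found in Table 2.
    """
    p = alpha * np.asarray(p_alexnet) + beta * np.asarray(p_voxnet)
    return int(np.argmax(p))  # final label = index of the largest element

# Example with two hypothetical 3-class probability vectors:
# AlexNet favors class 0, VoxNet favors class 1; the larger coefficient
# on AlexNet decides the tie in its favor.
label = fuse_predictions([0.6, 0.3, 0.1], [0.2, 0.7, 0.1])
```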
IV. CONCLUSION

This paper presents a decision-level fusion algorithm for three-dimensional target recognition. Using different convolutional neural network frameworks, we extract the point cloud features and the visual features of three-dimensional objects, respectively, and finally fuse them effectively. The experimental results show that feature fusion at the decision-making layer is an effective fusion method that improves the accuracy of object recognition. In the fusion process, the method with the higher recognition accuracy is given the larger coefficient and the method with the lower accuracy the smaller coefficient, which yields the highest final accuracy.

REFERENCES

[1] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proc. CVPR, 2015.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, Curran Associates Inc., 2012, pp. 1097-1105.
[3] D. Maturana and S. Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In Proc. IROS, 2015.
[4] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In Proc. ICCV, 2015.
[5] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proc. CVPR, 2017.
[6] Z. Xie, K. Xu, W. Shan, L. Liu, Y. Xiong, and H. Huang. Projective feature learning for 3D shapes with multi-view depth images. Computer Graphics Forum, vol. 34, Wiley Online Library, 2015, pp. 1-11.
[7] Z. Zhu, X. Wang, S. Bai, C. Yao, and X. Bai. Deep learning representation using autoencoder for 3D shape retrieval. In Proc. International Conference on Security, Pattern Analysis, and Cybernetics (SPAC), IEEE, 2014, pp. 279-284.
[8] D. Holz, S. Holzer, R. B. Rusu, et al. Real-time plane segmentation using RGB-D cameras. In Robot Soccer World Cup XV, Springer-Verlag, 2012, pp. 306-317.
[9] Y. Salih and A. S. Malik. Comparison of stochastic filtering methods for 3D tracking. Pattern Recognition, 44(10-11):2711-2737, 2011.
[10] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. Guibas. Volumetric and multi-view CNNs for object classification on 3D data. In Proc. CVPR, IEEE, 2016.
[11] http://www.cnblogs.com/graphics/archive/2010/08/05/1793393.html
[12] The Princeton ModelNet. http://modelnet.cs.
[13] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. CVPR, 2014.