SCIENCE CHINA Information Sciences, February 2020, Vol. 63, 120112:1-120112:3
https://doi.org/10.1007/s11432-019-2721-0
© Science China Press and Springer-Verlag GmbH Germany, part of Springer Nature 2020

LETTER: Special Focus on Deep Learning for Computer Vision

Multi-attention based cross-domain beauty product image retrieval

Zhihui WANG, Xing LIU, Jiawen LIN, Caifei YANG & Haojie LI*

International School of Information Science and Engineering, Dalian University of Technology, Dalian 116086, China
* Corresponding author (email: hjli@dlut.edu.cn)

Received 28 July 2019 / Revised 9 October 2019 / Accepted 12 November 2019 / Published online 14 January 2020

Citation: Wang Z H, Liu X, Lin J W, et al. Multi-attention based cross-domain beauty product image retrieval. Sci China Inf Sci, 2020, 63(2): 120112, https://doi.org/10.1007/s11432-019-2721-0

Dear editor,

In recent years, the Perfect Half Million Beauty Product Image Recognition Challenge has been held at ACM Multimedia 2018 [1] for the beauty product image retrieval task, and the Perfect-500K dataset, a large-scale beauty product dataset, has been released. Some retrieval methods exploit classic CNN models to extract features and apply fusion or post-processing to enhance the accuracy of the feature description (e.g., [2-4]). Others design network architectures to achieve the same effect (e.g., [5-7]). By observing the dataset, we argue that the beauty product objects are conspicuous and that the text regions of the images are highly discriminative.

To concentrate on the salient object area as well as the prominent text content, we propose an end-to-end multi-attention classification network, MANet, which performs the basic feature extraction. In addition, a saliency-based regional maximum activation of convolutions (SR-MAC) module for feature representation is proposed; it reduces the effect of background regions unrelated to the salient region on MANet's convolution activations and increases the feature weight of regions related to the salient region. The module aggregates multiple local features and makes the feature representation of beauty product images more discriminative.

Furthermore, the word frequency statistics of the text description of each image in Perfect-500K are analyzed using the TF-IDF algorithm, and 44 rough categories are counted. Subsequently, images associated with these 44 categories are extracted from the Perfect-500K dataset, and a well-labeled "few-shot" dataset named Perfect-30K is built.

The proposed method for beauty product image retrieval is illustrated in Figure 1 and consists of an offline part and an online part. Given a query image online, MANet is used to extract the basic feature tensor, and SR-MAC is employed to aggregate local features from it. After post-processing with L2 normalization, the final feature for the query is obtained and matched against the offline gallery.
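For concreteness, the online stage can be summarized by the minimal sketch below. It reflects our reading of the pipeline rather than the released implementation: `manet` and `sr_mac` are placeholder callables for the two modules introduced next, the gallery features are assumed to be extracted and L2-normalized offline, the ranking is taken to be cosine similarity, and the reranking step shown in Figure 1 is omitted.

```python
import numpy as np

def retrieve(query_image, manet, sr_mac, gallery_features, top_k=7):
    """Hedged sketch of the online retrieval stage (placeholder components)."""
    # (1) Basic feature tensor from the multi-attention network (placeholder callable).
    feature_tensor = manet(query_image)            # e.g., shape (K, H, W)
    # (2) Aggregate local features into a single descriptor (placeholder callable).
    query_feature = sr_mac(feature_tensor)         # shape (K,)
    # (3) Post-process with L2 normalization, as described above.
    query_feature = query_feature / (np.linalg.norm(query_feature) + 1e-12)
    # (4) Cosine similarity against the L2-normalized offline gallery, then rank.
    scores = gallery_features @ query_feature      # gallery_features: shape (N, K)
    return np.argsort(-scores)[:top_k]
```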
Figure 1 (Color online) Method overview. The whole framework includes two parts: the offline part and the online part.

Multi-attention classification network. The proposed MANet has three branches and employs a fully convolutional structure: it is composed of the saliency attention mechanism, the backbone network, and the text attention mechanism. Because the branches have different tasks, different network structures are designed for them.

Saliency attention mechanism. The saliency attention mechanism in MANet covers the "up-to-down" and "down-to-up" processes of feature learning. The "up-to-down" process learns the high-level semantic information of the images, which can locate the salient regions but loses details. The "down-to-up" process then merges a wide range of outputs, so that the most visually distinctive objects can be extracted.

During the training phase, a pseudo-saliency mask is adopted to fine-tune the saliency branch network. The region whose values are greater than the mean value of the pseudo-saliency mask is treated as the product object region: values below the mean are set to 0, and the others are set to 1. The network is trained by minimizing the following loss function:

L_S = 1 - \sum_{i=1}^{N} \frac{2\,|f(X_i, \Theta) \cap Y_i|}{|f(X_i, \Theta)| + |Y_i|},  (1)

where X = \{X_i\}_{i=1}^{N} denotes the set of training images, \{Y_i\}_{i=1}^{N} denotes the pseudo-saliency masks corresponding to the training images, \Theta denotes all the parameters of the saliency branch network, and f refers to the saliency attention model.

Text attention mechanism. To draw upon the text information, a text attention mechanism is adopted to make the network notice the features of the text regions during learning. EAST [8] is a powerful pipeline that yields fast and accurate text detection, so its output text mask is directly adopted as the output of this branch. The text mask is adjusted by (2) to retain the weighting effect of the saliency attention branch on the backbone network's feature tensor:

T = \mathrm{sigmoid}(T_m) + 1,  (2)

where T_m refers to the text mask and T is the adjusted text mask. Let X_{kij} denote the feature of the backbone network's last convolution layer, weighted by the saliency attention, at channel k and spatial location (i, j), and let T_{ij} denote the adjusted text mask at location (i, j). The final weighted feature tensor \tilde{X} is produced by the text attention mechanism as

\tilde{X}_{kij} = T_{ij} \otimes X_{kij},  (3)

where \otimes represents element-wise multiplication. The text mask generated by EAST is applied to weight our feature map because training a text detector requires text region annotations, which are beyond the scope of this research. Therefore, to avoid extra supervision on the text regions, the text attention branch does not participate in training: its parameters are frozen, and only the parameters of the backbone network and the saliency attention branch are updated.
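As a concrete illustration of Eqs. (1)-(3), the following sketch shows one way the saliency loss and the text-mask weighting could be implemented. It encodes our reading of the text rather than the authors' released code, and the tensor shapes are assumptions.

```python
import torch

def saliency_dice_loss(pred_mask, pseudo_mask, eps=1e-6):
    """Per-image Dice-style term of Eq. (1); training sums this complement
    over the N training images. The pseudo-saliency mask is binarized at its
    mean (below the mean -> 0, otherwise -> 1), as described above."""
    target = (pseudo_mask >= pseudo_mask.mean()).float()   # binary mask Y_i
    inter = (pred_mask * target).sum()                     # overlap between f(X_i, Theta) and Y_i
    return 1.0 - 2.0 * inter / (pred_mask.sum() + target.sum() + eps)

def apply_text_attention(saliency_weighted_feat, text_mask):
    """Eqs. (2)-(3): shift the EAST text mask into [1, 2], then weight the
    saliency-weighted feature tensor element-wise."""
    t = torch.sigmoid(text_mask) + 1.0     # adjusted mask T, shape (H, W)
    return saliency_weighted_feat * t      # broadcasts over channels, shape (K, H, W)
```

The +1 offset in Eq. (2) keeps non-text locations at a weight of 1, so text regions are emphasized without suppressing the saliency-weighted activations elsewhere.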
SR-MAC feature representation. The global feature from MANet expresses the overall information of the image but lacks discriminative details. To aggregate the local features and obtain discriminative features of the beauty product images, an SR-MAC module is proposed. The feature of the backbone network's last convolutional layer, weighted by the saliency attention and the text attention, acts as the basic feature tensor from which local features are extracted. Following the R-MAC [9] method, we obtain the convolution responses X_R = \{X^1, \ldots, X^r, \ldots, X^m\} corresponding to m regions. We define the regional feature vector

X^r = [f_1^r, \ldots, f_i^r, \ldots, f_K^r],  (4)

where f_i^r = \max(C_i^r) denotes the maximum activation of the i-th channel on the considered region r. These local regions are defined on the space \Omega of all valid positions of the considered feature map (and not on the input image plane).

Our proposed SR-MAC uses the saliency attention mechanism to assign different weights to the respective regions. The local region weight is calculated as follows:

W_r = \frac{|R \cap S|}{|R|},  (5)

where R denotes the local region and S denotes the saliency region. The weighted feature of the local region r is calculated as follows:

\acute{X}^r = W_r \cdot [f_1^r, \ldots, f_i^r, \ldots, f_K^r].  (6)

The final SR-MAC feature is represented as follows:

\Phi = [\acute{F}_1, \ldots, \acute{F}_i, \ldots, \acute{F}_K], \quad \acute{F}_i = \sum_{r=1}^{m} \mathrm{norm}(\acute{f}_i^r),  (7)

where \acute{f}_i^r = W_r \cdot f_i^r.
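A minimal sketch of how the SR-MAC aggregation of Eqs. (4)-(7) could be implemented is given below, under our reading of the text: the regions are assumed to be R-MAC-style windows sampled on the feature-map grid as in [9], the saliency region S is assumed to be available as a binary mask on that grid, and norm(.) is taken to be L2 normalization of each weighted regional vector before summation. None of these choices is confirmed by the letter itself.

```python
import torch

def sr_mac(feature_map, saliency_mask, regions, eps=1e-12):
    """Aggregate a (K, H, W) attention-weighted feature map into a (K,) descriptor.

    feature_map:   basic feature tensor from MANet, shape (K, H, W).
    saliency_mask: binary saliency region S on the feature-map grid, shape (H, W).
    regions:       iterable of R-MAC-style windows (y0, y1, x0, x1) on that grid.
    """
    phi = feature_map.new_zeros(feature_map.shape[0])
    for (y0, y1, x0, x1) in regions:
        # Eq. (4): per-channel maximum activation inside region r.
        f_r = feature_map[:, y0:y1, x0:x1].amax(dim=(1, 2))
        # Eq. (5): region weight = overlap with the saliency region over the region area.
        w_r = saliency_mask[y0:y1, x0:x1].sum() / float((y1 - y0) * (x1 - x0))
        # Eqs. (6)-(7): weight the regional vector, L2-normalize it, and accumulate.
        f_r = w_r * f_r
        phi = phi + f_r / (f_r.norm() + eps)
    return phi
```

The resulting descriptor is then L2-normalized before matching, consistent with the post-processing step in the online pipeline.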
Experiments. In this study, the TF-IDF algorithm is applied to analyze the word frequency statistics of all text descriptions appearing in the Perfect-500K dataset, and 44 rough categories are counted. Subsequently, approximately 35000 images associated with the 44 categories are extracted from the Perfect-500K dataset based on the category keywords, and a "few-shot" dataset named Perfect-30K is built. The 44 categories include lipstick, sunscreen, razor, and mask, and each category contains nearly 800 images. Lastly, the Perfect-30K dataset is split into a training set and a validation set at a ratio of 8:2.

Compared with RA-MAC [2], MFF [3], and pre-trained ResNet50 [4], which are competitors in the Half Million Beauty Product Image Recognition Challenge 2018, our proposed MANet achieves the best result on Perfect-500K with 0.395 MAP@7, which is higher than RA-MAC (0.348 MAP@7), MFF (0.360 MAP@7), and pre-trained ResNet50 (0.207 MAP@7).

The advantages of this study are summarized as follows: (1) the saliency and text attention mechanisms make MANet pay more attention to the product objects and the text regions in the images; (2) a robust local feature aggregation method is proposed, which eliminates the interference of background information and retains the key local areas in the product object region by using the saliency mechanism; (3) a well-labeled beauty product image dataset is built, and the network is trained on it to learn a more accurate feature description for beauty products. For more detailed experimental results, please refer to the supplementary materials.

Conclusion. An end-to-end multi-attention classification network, MANet, is proposed for beauty product image retrieval; it focuses on the features of the saliency regions and text regions in the images and suppresses the interference of irrelevant information. To capture the details of beauty product images, an SR-MAC feature representation module is proposed. The feature obtained by SR-MAC eliminates the interference of object-independent regions in MANet's convolution activations and enhances the feature weight of regions related to the salient region. In addition, a "few-shot" beauty product dataset, Perfect-30K, with 44 categories is constructed for training the proposed MANet. The retrieval performance of our method on the Perfect-500K dataset surpasses that of the state-of-the-art methods, which indicates the effectiveness of our method.

Acknowledgements. This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61772108, 61932020, 61976038).

Supporting information. Experiments. The supporting information is available online at info.scichina.com and link.springer.com. The supporting materials are published as submitted, without typesetting or editing. The responsibility for scientific accuracy and content remains entirely with the authors.

References
1 Cheng W-H, Jia J, Huang J. Half Million Beauty Product Image Recognition Challenge. 2018. https://challenge2018.perfectcorp.com/
2 Lin Z, Yang Z, Huang F, et al. Regional maximum activations of convolutions with attention for cross-domain beauty and personal care product retrieval. In: Proceedings of ACM Conference on Multimedia, 2018. 2073-2077
3 Wang Q, Lai J X, Xu K, et al. Beauty product image retrieval based on multi-feature fusion and feature aggregation. In: Proceedings of ACM Conference on Multimedia, 2018. 2063-2067
4 Lim J H, Japar N, Ng C C, et al. Unprecedented usage of pre-trained CNNs on beauty product. In: Proceedings of ACM Conference on Multimedia, 2018. 2068-2072
5 Sun H Q, Pang Y W. GlanceNets: efficient convolutional neural networks with adaptive hard example mining. Sci China Inf Sci, 2018, 61: 109101
6 Zhong J, Sun Y X, Yu Y L, et al. Attribute-guided network for cross-modal zero-shot hashing. IEEE Trans Neural Netw Learn Syst, 2018. doi: 10.1109/TNNLS.2019.2904991
7 Li H J, Wang X H, Tang J H, et al. Combining global and local matching of multiple features for precise item image retrieval. Multimedia Syst, 2013, 19: 37-49
8 Zhou X, Yao C, Wen H, et al. EAST: an efficient and accurate scene text detector. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 5551-5560
9 Tolias G, Sicre R, Jegou H. Particular object retrieval with integral max-pooling of CNN activations. In: Proceedings of the 4th International Conference on Learning Representations, San Juan, 2016