key: cord-0643672-gp4gbab1 authors: Seol, Jinseok; Kim, Seongjae; Park, Sungchan; Lim, Holim; Na, Hyunsoo; Park, Eunyoung; Jung, Dohee; Park, Soyoung; Lee, Kangwoo; Lee, Sang-goo title: Technologies for AI-Driven Fashion Social Networking Service with E-Commerce date: 2022-03-11 journal: nan DOI: nan sha: be0710c5aaad941a935d8319f3dc893ad437c9f5 doc_id: 643672 cord_uid: gp4gbab1 The rapid growth of the online fashion market brought demands for innovative fashion services and commerce platforms. With the recent success of deep learning, many applications employ AI technologies such as visual search and recommender systems to provide novel and beneficial services. In this paper, we describe applied technologies for AI-driven fashion social networking service that incorporate fashion e-commerce. In the application, people can share and browse their outfit-of-the-day (OOTD) photos, while AI analyzes them and suggests similar style OOTDs and related products. To this end, we trained deep learning based AI models for fashion and integrated them to build a fashion visual search system and a recommender system for OOTD. With aforementioned technologies, the AI-driven fashion SNS platform, iTOO, has been successfully launched. With the development of the internet and computer technologies, the size of the e-commerce market has been growing steeply. Moreover, the social distancing environment of the COVID-19 pandemic has brought the growth and demands for innovative e-commerce platforms [1] . To meet the demands and achieve benefits, leading commerce companies such as Amazon [2] , eBay [3] , Alibaba [4] are introducing distinctive and novel services, including visual search and product recommendations [5, 6] . Meanwhile, in the fashion industry, a variety of innovative applications have emerged. 
For example, ViSENZE [7] provides fashion image processing solutions including image search and attribute prediction, and the Zalando research team has been working on fashion product recommendation [8]. There are other interesting applications such as Intelistyle [9], which provides a chatbot-based AI stylist, and Fitzme [10], a wardrobe-based AI stylist.

Fashion-focused social networking accounts for another large portion of consumer activities [11]. We claim that people look at the outfit-of-the-day (OOTD) of other people to acquire insights and trends in fashion. To do so, consumers browse online applications such as Lookbook, Polyvore, and Pinterest. Moreover, users commonly share their OOTD photos through general social networking services (SNS) (e.g., Facebook, Instagram, and Twitter). In this environment, connecting fashion SNS with e-commerce is undoubtedly beneficial, as in the case of Instagram and Styleshare. Therefore, it is reasonable to come up with a new kind of service that incorporates e-commerce with fashion SNS by applying AI technologies.

Normally, to connect user-uploaded OOTD photos with retail products, the uploader must manually attach a link to each product, which is a cumbersome task and calls for automation. We applied a fashion visual search system to overcome this challenge, so that users can easily share their OOTD in the form of a common fashion SNS. It gives other users an opportunity to purchase retail products similar to those in the OOTD, without burdening the uploader. To this end, we implemented AI components including a fashion object detector, a fashion category classifier, and a fashion attribute tagger. Moreover, we deployed a personalized recommender system that can handle OOTD data to attract more users to the application. Combining these AI technologies enabled the launch of iTOO.

The most important media type in the fashion domain is photographic images.
However, images in the fashion domain pose major challenges that make them difficult to process [12]. Typically, self-occlusions may occur in the target of interest, and when a person is wearing fashion garments, the image variance can be amplified by the viewpoint, posture, lighting, and scale of the subject [13]. In addition, several fashion items may appear in a single image simultaneously. Therefore, a localization procedure is essential. To this end, region-of-interest (RoI) detection, landmark detection [14], parsing with a fashion component [15], or pose estimation [16] is often employed. Moreover, classifying the categories of fashion products [17] and predicting colors and detailed attributes, including latent fashion style, are also major components of fashion item recognition [18, 19].

Image retrieval, or visual search, has been successfully applied in various areas such as face recognition [20] and product search [3]. Recently, many studies have implemented visual search models by encoding images with a CNN and training it through deep metric learning [21]. Academic datasets for fashion visual search are publicly available [22, 23], but to be applied in a real-world application, datasets should cover a broader range of categories. Therefore, collecting and refining the dataset is also an essential task [24]. To the best of our knowledge, proxy-based techniques that employ latent embeddings for each instance are the current state-of-the-art models for visual search [25]. However, when the number of items reaches the millions, training becomes another challenge and requires complex techniques [26].

The recommender system is a core technology in content platforms and e-commerce, as Amazon [27], YouTube [28], and Netflix [29] have shown.
Starting with Collaborative Filtering [30, 31], advanced methods including implicit feedback-based approaches [32], content-based filtering [33], and more recently, deep learning-based recommendation algorithms [34] have been studied. In the fashion domain, models using visual information have been mainly studied [35, 36]. Furthermore, in many cases multiple images form a single outfit, so outfit recommendation techniques that consider multiple images simultaneously are being studied [37, 38].

In this section, we introduce iTOO from LOTTE. iTOO is a service that integrates fashion SNS and commerce, where users can share and browse OOTD photos, look for related products and styles, get recommendations, and even purchase fashion products in place.

One of the main purposes of the application is to share a picture of your OOTD. A user can add a brief description and hashtags when uploading the picture. Immediately, the fashion items that comprise the OOTD are detected and analyzed automatically by AI. Therefore, users can easily share extensive information by simply taking a picture and leaving a short description.

Moreover, users can browse through OOTDs posted by other users through curation or exploration. As shown in Figure 2, the home screen recommends OOTDs that fit the preference, style, and body shape of the user. By examining the OOTD detail view, a user can check out purchasable retail products that are similar to the items comprising the OOTD, as illustrated in Figure 1. This function benefits users who want to buy fashion products in place through OOTD curation, and retail shops can merchandise their products through viral marketing. In addition, other OOTDs with a similar style are recommended. Such curations help users drill down into OOTD pools that fit their preferences.

Besides the aforementioned OOTD recommendations, the "style leaders" who often post trending and decent OOTDs are also recommended to a user as who-to-follow.
To get more accurate curations, a user can provide detailed information about their fashion persona, such as demographic information, body shape, and preferred style tags. All information, including OOTD interactions, profile, and the list of followed users, is gathered by the recommender system to provide personalized recommendations.

This section introduces the AI technologies that enable a fashion SNS focused on OOTD images. We used the best-performing deep learning models to our knowledge, and fine-tuned them to suit the fashion domain and the application. The core models of the application, the visual search model and the OOTD recommender system, are described in Sections 5 and 6, respectively. Note that some of the AI components are also used to construct datasets for training the other AI components.

A fashion image may present a single product, but in many cases it shows a person wearing several fashion garments. Therefore, localizing or detecting where the fashion items are in the image must come first. We considered pose estimation and human parsing methods; however, we adopted a model that predicts regions of interest (RoI) for practicality, because human information is not always available in the image. As a training dataset, we mixed and reorganized the Street2Shop [39], ModaNet [40], and DeepFashion [22, 23] datasets. In detail, we mapped category information to 6 super-categories (top, bottom, outer, dress, shoes, bag) in order to combine datasets from different sources. Considering throughput, YoloV4 [41] was selected as our detector model. Note that in the application, we can assume that each OOTD image has at least one fashion item. Moreover, top/bottom items and dresses are mutually exclusive, so considering these properties, we added a post-processing module and achieved a performance gain in terms of recall. Furthermore, we also included the fashion category classifier module from Section 4.2 to increase the precision.
The predicted RoI is cropped and passed to the category classifier, and it is filtered out if the predicted super-category differs from that of the detector model. By combining YoloV4 with the post-processing module, we could build a fast and accurate fashion object detector.

Classifying the category of a fashion item is another basic element of fashion item recognition. We constructed an integrated dataset using ModaNet, DeepFashion, iMaterialist [42], and crawled data from YOOX and Polyvore. To increase category coverage, we crawled data from the 30 most popular online fashion malls, summing up to 1.7M images in total. Since the crawled data does not have RoI labels, we used the fashion object detector model from Section 4.1 and filtered out RoIs that do not match the super-category label parsed from the metadata provided by the malls. It is necessary to reorganize the category hierarchy to integrate multiple datasets, so we designed a category hierarchy consisting of 6 super-categories and 32 sub-categories: 6 from outer, 6 from top, 6 from bottom, 2 from dress, 7 from shoes, and 5 from bag. As the classification model backbone, EfficientNet [43] was employed in consideration of inference speed and memory usage. We also leverage training techniques such as cosine annealing and label smoothing.

We built a fashion attribute tagger model to find detailed attributes of fashion items such as color, style, and length. Specifically, we defined 18 attribute groups, of which 11 are categorical and 7 are multi-label. Since some attribute groups apply only to certain sub-categories, inapplicable outputs are filtered out by a post-processing module. As with the other AI component models, we merged and reorganized multiple datasets from different sources: DeepFashion, iMaterialist, Fashion550k [44], and MVC [45]. The CNN backbone and training methods are the same as those of the fashion category classifier.
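The detector post-processing described in Section 4.1 can be illustrated with a minimal sketch. The paper does not give the exact rules, so everything here is an assumption made for illustration: the `Detection` record, the `classify` callback standing in for the category classifier, the score threshold, and the tie-breaking by best score are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Detection:
    super_category: str               # one of: top, bottom, outer, dress, shoes, bag
    score: float                      # detector confidence in [0, 1]
    box: Tuple[int, int, int, int]    # (x1, y1, x2, y2) in pixels


def postprocess(detections: List[Detection],
                classify: Callable[[Detection], str],
                score_threshold: float = 0.5) -> List[Detection]:
    """Hypothetical post-processing for one OOTD image."""
    if not detections:
        return []
    # Precision: drop RoIs whose classifier super-category disagrees with the detector.
    agreed = [d for d in detections if classify(d) == d.super_category]
    kept = [d for d in agreed if d.score >= score_threshold]
    # Recall: an OOTD image is assumed to contain at least one item,
    # so fall back to the best-scoring detection when nothing passes.
    if not kept:
        pool = agreed or detections
        kept = [max(pool, key=lambda d: d.score)]
    # Mutual exclusivity: keep either the dress or the top/bottom items, not both.
    dresses = [d for d in kept if d.super_category == "dress"]
    separates = [d for d in kept if d.super_category in ("top", "bottom")]
    if dresses and separates:
        best_dress = max(d.score for d in dresses)
        best_separate = max(d.score for d in separates)
        losers = {"top", "bottom"} if best_dress >= best_separate else {"dress"}
        kept = [d for d in kept if d.super_category not in losers]
    return kept
```

In this sketch the fallback step trades precision for recall only when the image would otherwise yield no items, which mirrors the assumption that every OOTD image contains at least one fashion item.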
To train the visual search model, same-class labels denoting different images of the same item are necessary. In the fashion domain, the image variance is large, especially the gap between product images provided by shopping malls and images uploaded by consumers. Therefore, it is important to construct a model and datasets that can cover the cross-domain image retrieval task. To this end, we collected multiple datasets and pre-processed them to build an in-house dataset suitable for the fashion visual search model in the application. Note that since it is common to learn through negative sampling rather than learning the distribution of all items, dataset quality is more sensitive to false positives than to false negatives.

Collecting Data. Academic datasets (e.g., DeepFashion) are often incomplete and cannot be directly applied to real-world applications due to limited category coverage. To fill this gap, we selected and crawled the 30 most popular online fashion malls and collected a total of 0.3M items with 1.3M images, including consumer photo review data. We conducted a small experiment to confirm that adding more data affects search performance in terms of category coverage.

Preprocessing. As with the fashion category classifier from Section 4.2, crawled data cannot be used for training without preprocessing. We used the fashion object detector model from Section 4.1 to acquire RoI crops for localization, and filtered out the crops that do not match the super-category information parsed from the target malls. Meanwhile, when crawling data from online malls, abundant image data is often located in the "descriptive image", which consists of multiple photos of a fashion item, description texts, and even irrelevant images such as advertisements. To gather meaningful data, we first separated the descriptive image with the connected components algorithm, then removed duplicate images using perceptual hashing [46].
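The duplicate-removal step above can be sketched as follows. The text does not say which perceptual hash function is used, so this example assumes a simple average hash computed by block-mean downsampling; `average_hash`, `deduplicate`, and the Hamming threshold `max_hamming` are hypothetical names and values chosen for illustration.

```python
import numpy as np


def average_hash(image: np.ndarray, hash_size: int = 8) -> np.ndarray:
    """Average hash of a 2-D grayscale image (at least hash_size pixels per side).

    Returns a flat boolean array with hash_size * hash_size bits.
    """
    h, w = image.shape
    # Crop so that both sides split evenly into hash_size blocks.
    h_crop, w_crop = h - h % hash_size, w - w % hash_size
    img = image[:h_crop, :w_crop].astype(np.float64)
    blocks = img.reshape(hash_size, h_crop // hash_size,
                         hash_size, w_crop // hash_size)
    small = blocks.mean(axis=(1, 3))          # block-mean downsampling
    return (small > small.mean()).ravel()     # one bit per block


def deduplicate(images, max_hamming: int = 4):
    """Return indices of images kept after dropping near-duplicates."""
    kept_indices, kept_hashes = [], []
    for idx, img in enumerate(images):
        bits = average_hash(img)
        # Keep the image only if it is far from every hash kept so far.
        if all(np.count_nonzero(bits != other) > max_hamming
               for other in kept_hashes):
            kept_indices.append(idx)
            kept_hashes.append(bits)
    return kept_indices
```

Because the hash compares block means against the global mean, a uniformly brightened copy of an image maps to the same bits, while an image with a different layout differs in many bits.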
The detector model and post-processing procedures are applied afterwards. Additionally, to reduce false-positive errors, we use the fashion category classifier and select the images whose predicted sub-category is the majority one.

An easy-to-miss aspect when building a dataset for a fashion visual search model is separating fashion items that have multiple color variants. Many online fashion malls, as well as the DeepFashion dataset, treat item images that differ only in color as the same item. However, this scheme can lead the CNN to neglect the color information of the input image, so that a "shortcut" through color information cannot be used. In our setting, it is beneficial to use this shortcut because the precision of the search result is more important than the recall, and through a benchmark experiment, we confirmed that separating the color variants into different items helps to improve precision. In the case of the DeepFashion dataset, the color information labels are provided with fine granularity, so we re-adjusted the same-class labels using the color tag of our fashion attribute tagger from Section 4.3. Again, when it comes to precision, only the false positives of the dataset matter, so the inaccuracy of the attribute tagging model does not have a critical effect.

Most of the recent state-of-the-art methods for image retrieval are based on metric learning. Once the model is trained, we can obtain a representation vector by feeding an input image into the model, and similar items can be retrieved through cosine similarity. We considered basic metric learning [20], AP learning [47], and proxy-based methods [25]. However, methods that require item embeddings often have difficulty dealing with a large or variable item pool. Moreover, we use an under/over-sampling scheme to balance the datasets from different sources, which means that the whole item pool changes on every epoch.
Therefore, for flexibility in training, we adopted simple N-pair contrastive learning [48]. Although basic metric learning cannot match state-of-the-art performance, it still serves as a decent baseline with advantages in other aspects. Concretely, to train with the N-pair loss, we sample one positive (same-item) image per input image and gather N negative image samples from the training batch. The metric learning is performed using the normalized temperature-scaled cross-entropy (NT-Xent) loss [49]. The rest of the training details, including the backbone CNN, are similar to those of the category classifier from Section 4.2. The dimension of the representation vector was set to 128 for memory efficiency; although a larger dimension was considered, the performance improvement relative to the added memory usage was not significant. The under/over-sampling ratios of the datasets are empirically adjusted considering the image types, characteristics, and category distribution of each dataset.

Table 1. Performance comparison on DeepFashion In-shop dataset, using top-k accuracy. Our N-pair based model may not be state-of-the-art, but it can be easily scaled out to millions of items. The proposed color separation training scheme (last row) shows that even with simple label modification, the top-1 accuracy can be improved by a large margin.
Model k=1 k=5 k=10 k=20

Performance on Benchmark Dataset. To show the precision gain of the color separation scheme, we experimented on the DeepFashion In-shop dataset, which has 52,712 images of 7,982 items. Note that this dataset provides RoI crop data, so we crop using the box coordinates with a 20-pixel margin and then resize the crops into 256 × 256 images. Table 1 shows the results in top-k accuracy, which checks whether a positive image is within the top-k items retrieved. Although the N-pair baseline model cannot reach state-of-the-art performance, it is still a decent baseline compared to older and more complex models.
On the other hand, we can see a significant performance gain in top-1 accuracy when the color separation scheme is applied.

Dataset Influence. To see the influence of dataset composition, we trained an N-pair model with three different dataset settings: using only the DeepFashion dataset, adding more consumer photo review data from crawled online fashion malls, and adding shoes and bags, which DeepFashion does not have. Examples of visual search results are shown in Figure 3. For the first query image, all three settings produced similar results. In the second case, since the query image partially includes a human figure, the setting with additional consumer review images shows more robustness in terms of cross-domain image retrieval. In the final case, where the query image shows shoes, the settings without shoes and bags could not preserve the sub-category of the query image. As a result, we argue that constructing a well-composed dataset is just as important as selecting the model.

Fig. 3. Examples of results from visual search models, trained with different dataset compositions. Results from the second query show that by adding more review data, robustness to cross-domain retrieval can be improved. The third query shows that the model cannot accurately deal with unseen categories. We can conclude that in the visual search model, dataset composition is as critical as model architecture.

In the application, the visual search model has to be combined with other AI components. As shown in Figure 4, when a user uploads an OOTD image, the fashion object detector first finds RoIs. Then, the fashion category classifier and the fashion attribute tagger are applied to each cropped RoI in parallel. Finally, the visual search model extracts representation vectors and stores them in the vector index corresponding to the super-category.
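The N-pair training objective with NT-Xent loss described earlier in this section can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' training code: it uses the other positives in the batch as the negatives for each anchor, applies the loss in one direction only, and the `temperature` value is a placeholder.

```python
import numpy as np


def nt_xent_loss(anchors: np.ndarray, positives: np.ndarray,
                 temperature: float = 0.1) -> float:
    """In-batch NT-Xent loss.

    anchors, positives: arrays of shape (B, D). Row i of `positives` shows the
    same item as row i of `anchors`; the remaining B - 1 rows act as negatives.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                         # (B, B)
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the matching positive (the diagonal) as the target.
    return float(-np.mean(np.diag(log_prob)))
```

Since the loss only needs pairs drawn from the current batch, no per-item embedding table has to be kept, which is the flexibility argued for above when the item pool changes on every epoch.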
With the recommender system, users receive personalized OOTD recommendations, similarly styled OOTDs for each OOTD, and style leaders to follow.

On the first screen of the service, users get recommendations of OOTDs that suit their preferences. The recommendation is based on CF-CBF, and the final recommendation list is generated by mixing its results with the weekly best products and the best products by demographic-based user segment. In the case of CF-CBF, both user-based and item-based CF are used.

Style Vector. A fashion item vector $v_i$ of item $I_i$ is the concatenation of the representations that the category classifier, the attribute tagger, and the visual search model produce for the item image. For each fashion item, the item style vector $\tilde{v}_i$ is obtained by subtracting from $v_i$ the average of the item vectors of the sub-category $c(i)$ to which the item belongs. The OOTD style vector is defined as the average of the item style vectors of the items that make up the OOTD.

Semantic OOTD Similarity. Given two OOTDs $o_{t_1}$ and $o_{t_2}$, we define the semantic OOTD similarity as a weighted sum of the cosine similarity between their OOTD style vectors and the Jaccard similarity between their hashtag sets $a_o(t_1)$ and $a_o(t_2)$, where $a_o(t)$ denotes the set of hashtags of an OOTD $o_t$. Note that we use this similarity to make similarly styled OOTD recommendations.

Semantic User Similarity. Similar to the semantic OOTD similarity, we define a user style vector by aggregating the style vectors of the $H$ OOTDs that the user has recently viewed or liked. We use a weighted average to reflect the preferences of recent interactions more strongly: for user $u_n$, the OOTD $t_m$ is weighted by $w_m = \left(\frac{H-m+1}{H}\right)^{\alpha}$, where $u_n^* = \{t_1, t_2, ..., t_H\}$ is the set of indices of the user's recent OOTD views or likes, and $0 < \alpha < 1$ is a recency decay hyperparameter.
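The style vector definitions above can be made concrete with a short sketch. Several details are assumptions for illustration: the mixing weight `weight` of the semantic OOTD similarity (the text only says "weighted sum"), the convention that row 0 of the user history is the most recent interaction, and the normalization of the weighted average by the sum of weights. All function names are hypothetical.

```python
import numpy as np


def item_style_vectors(item_vectors, sub_categories):
    """Subtract the mean item vector of each sub-category c(i) from its items."""
    item_vectors = np.asarray(item_vectors, dtype=float)
    sub_categories = np.asarray(sub_categories)
    styled = np.empty_like(item_vectors)
    for c in np.unique(sub_categories):
        mask = sub_categories == c
        styled[mask] = item_vectors[mask] - item_vectors[mask].mean(axis=0)
    return styled


def ootd_style_vector(styled, item_indices):
    """Average of the item style vectors of the items in one OOTD."""
    return styled[item_indices].mean(axis=0)


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def jaccard(x, y):
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if (x | y) else 0.0


def semantic_ootd_similarity(style_a, tags_a, style_b, tags_b, weight=0.5):
    """Weighted sum of style cosine similarity and hashtag Jaccard similarity."""
    return weight * cosine(style_a, style_b) + (1 - weight) * jaccard(tags_a, tags_b)


def user_style_vector(recent_ootd_styles, alpha=0.5):
    """Recency-weighted average of H OOTD style vectors (row 0 = most recent)."""
    recent = np.asarray(recent_ootd_styles, dtype=float)
    H = len(recent)
    m = np.arange(1, H + 1)
    w = ((H - m + 1) / H) ** alpha          # w_m from the text
    return (w[:, None] * recent).sum(axis=0) / w.sum()
```

Subtracting the sub-category mean removes what is common to, say, all sneakers, so that the remaining vector mostly encodes what distinguishes an item within its sub-category.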
The cosine similarity between two user style vectors and the Jaccard similarity between the preference tags in the user profiles are used to measure the semantic user similarity between users $u_{n_1}$ and $u_{n_2}$.

CF-CBF for OOTD. We use Collaborative Filtering (CF) as the basis for our recommendation algorithm. We first calculate TF-IDF values from user-OOTD interactions. Each TF-IDF value is decayed to reflect recency using the time decay coefficient $\beta^d$, where $d$ is the number of days passed since the interaction occurred, and $0 < \beta < 1$ is a decay rate hyperparameter. We use both item-based and user-based CF and combine them with other recommendation results. In the item-based case, the recommended OOTD list is obtained through the similarity to the OOTDs that the user has recently viewed. In the user-based case, the recommendation list is constructed by joining the users obtained through user similarity with the OOTD lists that those users have recently viewed. In both cases, CF-CBF can be implemented by considering TF-IDF as the CF part and semantic similarity as the Content-Based Filtering (CBF) part. Let $r_n$ be the TF-IDF vector of a user $u_n$, treating each user as a document when calculating the TF-IDF values. Then, the final similarity between two users $u_{n_1}$ and $u_{n_2}$ combines the CF similarity of $r_{n_1}$ and $r_{n_2}$ with their semantic user similarity, and includes a shrinkage term $h$ for the case with relatively few interactions [50]. A similar methodology is applied to item-based CF-CBF.

OOTD Curation. We mix the results of user-based and item-based CF-CBF with the weekly best OOTD list and the best OOTD list by segment based on demographic information. Note that since a decay term is used, users who have not used the application for a long time receive results closer to global taste rather than personalized content. This reflects the characteristics of the fashion domain, where trends change over time, and provides exploration opportunity and serendipity.
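A minimal sketch of the user-based CF-CBF similarity described above follows. The exact formula is not recoverable from the text, so this is one plausible reading and not the authors' definition: the smoothed IDF variant, the shrinkage form `n / (n + h)` in the style of neighborhood-based CF [50], and the linear mix controlled by `cbf_weight` are all assumptions.

```python
import numpy as np


def decayed_tfidf(interactions, n_users, n_ootds, beta=0.9):
    """Time-decayed TF-IDF matrix; users act as documents, OOTDs as terms.

    interactions: iterable of (user_index, ootd_index, days_ago).
    """
    tf = np.zeros((n_users, n_ootds))
    for user, ootd, days_ago in interactions:
        tf[user, ootd] += beta ** days_ago           # time decay beta^d
    df = np.count_nonzero(tf, axis=0)                # users who touched each OOTD
    idf = np.log((1 + n_users) / (1 + df)) + 1.0     # smoothed IDF (assumed variant)
    return tf * idf


def cf_cbf_user_similarity(r1, r2, semantic_sim, shrinkage=5.0, cbf_weight=0.5):
    """Mix a shrunk CF cosine similarity with a semantic (CBF) similarity."""
    denom = np.linalg.norm(r1) * np.linalg.norm(r2)
    cf = float(r1 @ r2 / denom) if denom > 0 else 0.0
    # Shrink the CF part toward zero when few OOTDs are shared.
    n_common = int(np.count_nonzero((r1 > 0) & (r2 > 0)))
    cf_shrunk = cf * n_common / (n_common + shrinkage)
    return (1 - cbf_weight) * cf_shrunk + cbf_weight * semantic_sim
```

With this form, two users who share a single OOTD get only a small CF contribution, so the semantic part dominates until more common interactions accumulate.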
A style leader is a user whom others can follow, and a user can receive better OOTD curation by following style leaders. For style leader recommendation, both a latent method and a graph-based method are used. In the latent method, recommendation candidates are determined using a modified semantic user similarity. Here, we use the cosine similarity between the user style vector from the follower's recent view/like OOTD history and the user style vectors from the followee candidates' recent upload OOTD history. In the graph-based method, recommendation candidates are obtained by performing a random walk twice in the following/follower relationship graph. Finally, we recommend a mixture of latent-based candidates, graph-based candidates, users of a similar segment based on demographic information, and popular users. Segment and the weekly best serve as exploration and baseline at the same time.

7 System Deployment

Serving deep learning models for real-world applications requires costly and complex infrastructure. To minimize the burden, we adopt AWS cloud services, mainly orchestrated through Kubernetes. Deep learning models are loaded on auto-scalable GPU pods to adapt to variable traffic. The overall architecture is illustrated in Figure 5.

The development of deep learning models is usually done using frameworks such as PyTorch or TensorFlow in an experimental environment. To serve the trained models for inference, we used NVIDIA Triton Inference Server since it can accommodate all types of neural network models exported in ONNX format, independent of the deep learning framework. For the communication between the service API and the deep learning models, we created an in-house gRPC client library. Each inference step is divided into CPU-heavy parts, such as image preprocessing or data loading, and the core GPU-consuming part, so that each component can be scaled out in parallel.
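The graph-based style leader candidate generation from Section 6 can be sketched as below. The paper only states that a random walk is performed twice on the following/follower graph, so this example makes assumptions: it reads that as a two-step walk along "following" edges with uniform transition probabilities, and it computes the exact end-point probabilities instead of sampling. `two_step_candidates` and the dictionary-based graph are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def two_step_candidates(following: Dict[str, List[str]], user: str,
                        top_n: int = 5) -> List[str]:
    """Rank users by the probability that a two-step uniform walk ends at them."""
    scores = defaultdict(float)
    first_hop = following.get(user, [])
    for mid in first_hop:
        second_hop = following.get(mid, [])
        for end in second_hop:
            # Probability of the path user -> mid -> end.
            scores[end] += (1.0 / len(first_hop)) * (1.0 / len(second_hop))
    # Do not recommend the user themselves or accounts they already follow.
    exclude = set(first_hop) | {user}
    ranked = sorted((u for u in scores if u not in exclude),
                    key=lambda u: (-scores[u], u))
    return ranked[:top_n]
```

For example, with `following = {"alice": ["bob", "carol"], "bob": ["dave"], "carol": ["dave", "erin", "alice"]}`, the candidates for `"alice"` are ranked with `"dave"` ahead of `"erin"`, because both of her followees follow `"dave"`.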
The throughput of the visual search system relies heavily on the similar vector search algorithm [51]. We considered well-known approaches including Deep Hashing [51] and hierarchical search methods [52]. Empirically, vectors from visual search models form intrinsically clustered spaces, so separating the hashing stage from the model does not degrade search accuracy compared to learning-to-hash methods. Therefore, HNSW [52] was adopted in consideration of implementation difficulty, search time complexity, and memory usage. In our situation, hundreds of fashion items are added every day, so the vector index is rebuilt every dawn to include them. In the case of HNSW, memory consumption increases linearly with the number of items in the database. Therefore, whenever the index cannot fit in a single computing instance, we apply the sharding technique [4, 26] and rearrange the merged search results through post-processing. Note that since the super-category of a fashion item is almost always accessible, vector indexes are built separately for each super-category.

When it comes to the recommender system, it is necessary to analyze logs and identify user-item relationships from large-scale data. To this end, we implemented data processing modules and a basic CF model using AWS Redshift, a data warehouse service. Moreover, both the visual search system and the recommender system are pipelines of relatively small modules. In this structure, task parallelism can be applied to improve throughput. We adopted Argo as a Directed Acyclic Graph (DAG) task management tool to implement the task parallelism. Through Argo and Kubernetes configurations, we can automatically scale out the bottlenecks in the DAG.

In this paper, we describe technologies for an AI-driven fashion SNS that incorporates fashion e-commerce.
Users can share and browse their OOTD, while detailed fashion attribute analysis, similar product search, and recommendations are all automatically provided by AI. To this end, we built a fashion object detector, a fashion category classifier, a fashion attribute tagger, a fashion visual search system, and an OOTD recommender system. With all these techniques, the fashion SNS platform iTOO from LOTTE has been launched. Future work is to tune the hyperparameters of the AI models and improve the model architecture with user feedback.

[1] The future of fashion: How the quest for digitization and the use of artificial intelligence and extended reality will reshape the fashion industry after covid-19
[2] Deep learning sentiment analysis of amazon
[3] Visual search at ebay
[4] Visual search at alibaba
[5] Recommender System with Machine Learning and Artificial Intelligence: Practical Tools and Applications in Medical, Agricultural and Other Industries
[6] Exploiting knowledge graphs for facilitating product/service discovery
[7] Transforming the vision of retail with ai
[8] Practical lessons from developing a large-scale recommender system at zalando
[9] fashion after fashion: A report of ai in fashion
[10] A deep learning based architecture for personal a.i. fashion stylist services
[11] Fashion and social networking: a motivations framework
[12] Fashion meets computer vision: A survey
[13] Cross-domain shoe retrieval using a three-level deep feature representation
[14] Fashion landmark detection in the wild
[15] Deep human parsing with active template regression
[16] Deeppose: Human pose estimation via deep neural networks
[17] Leveraging class hierarchy in fashion classification
[18] Human attribute recognition by deep hierarchical contexts
[19] Style2vec: Representation learning for fashion items from style sets
[20] Facenet: A unified embedding for face recognition and clustering
[21] Deep image retrieval: A survey
[22] Deepfashion: Powering robust clothes recognition and retrieval with rich annotations
[23] Deepfashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images
[24] Leveraging weakly annotated data for fashion image retrieval and label prediction
[25] Proxy anchor loss for deep metric learning
[26] Large-scale visual search with binary distributed graph at alibaba
[27] Amazon.com recommendations: Item-to-item collaborative filtering
[28] Deep neural networks for youtube recommendations
[29] The netflix prize
[30] Grouplens: An open architecture for collaborative filtering of netnews
[31] Item-based collaborative filtering recommendation algorithms
[32] Collaborative filtering for implicit feedback datasets
[33] Recommendation as classification: Using social and content-based information in recommendation
[34] Proceedings of the 26th international conference on world wide web
[35] Vbpr: visual bayesian personalized ranking from implicit feedback
[36] Enhancing fashion recommendation with visual compatibility relationship
[37] Outfitnet: Fashion outfit recommendation with attention-based multiple instance learning
[38] Personalized outfit recommendation with learnable anchors
[39] Where to buy it: Matching street clothing photos in online shops
[40] Modanet: A large-scale street fashion dataset with polygon annotations
[41] Yolov4: Optimal speed and accuracy of object detection
[42] The imaterialist fashion attribute dataset
[43] Efficientnet: Rethinking model scaling for convolutional neural networks
[44] Multi-label fashion image classification with minimal human supervision
[45] Mvc: A dataset for view-invariant clothing retrieval and attribute prediction
[46] Implementation and benchmarking of perceptual image hash functions
[47] Deep metric learning to rank
[48] Improved deep metric learning with multi-class n-pair loss objective
[49] A simple framework for contrastive learning of visual representations
[50] Improved neighborhood-based collaborative filtering
[51] Deep supervised hashing for fast image retrieval
[52] Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs

This work was made in collaboration with Seoul National University, IntelliSys Co., Ltd., and LOTTE Homeshopping Inc. Also, this work was partly sup-