key: cord-0649437-bg1fpil0
authors: Skenderi, Geri; Joppi, Christian; Denitto, Matteo; Scarpa, Berniero; Cristani, Marco
title: The multi-modal universe of fast-fashion: the Visuelle 2.0 benchmark
date: 2022-04-14
journal: nan
DOI: nan
sha: d3c30d4cbdd61f7bb3b34088f8cb00351e7b0e9f
doc_id: 649437
cord_uid: bg1fpil0

We present Visuelle 2.0, the first dataset useful for facing diverse prediction problems that a fast-fashion company has to manage routinely. Furthermore, we demonstrate how the use of computer vision is substantial in this scenario. Visuelle 2.0 contains data for 6 seasons / 5355 clothing products of Nuna Lie, a famous Italian company with hundreds of shops located in different areas within the country. In particular, we focus on a specific prediction problem, namely short-observation new product sale forecasting (SO-fore). SO-fore assumes that the season has started and a set of new products is on the shelves of the different stores. The goal is to forecast the sales for a particular horizon, given a short, available past (few weeks), since no earlier statistics are available. To be successful, SO-fore approaches should capture this short past and exploit other modalities or exogenous data. To these aims, Visuelle 2.0 is equipped with disaggregated data at the item-shop level and multi-modal information for each clothing item, allowing computer vision approaches to come into play. The main message that we deliver is that the use of image data with deep networks boosts performances obtained when using the time series in long-term forecasting scenarios, ameliorating the WAPE by 8.2% and the MAE by 7.7%. The dataset is available at: https://humaticslab.github.io/forecasting/visuelle.

Fashion forecasting has traditionally been studied in scientific sectors other than computer vision, such as operational research and logistics, with the primary aim of predicting trends [16, 22] , sales [21, 24, 25] , and performing demand forecasting [13, 34] . In recent years this trend has faded, showing an increasing cross-fertilization with computer vision [1, 7, 10, 26, 28] .

In this paper, we present Visuelle 2.0, which contains real data for 5355 clothing products of a retail fast-fashion Italian company, Nuna Lie. For the first time, a retail fast-fashion company has decided to share part of its data to provide a genuine benchmark for research and innovation purposes. Specifically, Visuelle 2.0 provides data from 6 fashion seasons (partitioned into Autumn-Winter and Spring-Summer) from 2017-2019, right before the Covid-19 pandemic (see footnote 2). Each product in our dataset is accompanied by an HD image, textual tags and more. The time series data are disaggregated at the shop level, and include the sales, inventory stock, max-normalized prices (see footnote 3) and discounts. This makes it possible to perform SO-fore for each individual store. Exogenous time series data are also provided, in the form of Google Trends based on the textual tags and multivariate weather conditions of the stores' locations. Finally, we also provide purchase data for 667K customers whose identity has been anonymized, to capture personal preferences. With these data, Visuelle 2.0 makes it possible to address several problems that characterize the activity of a fast fashion company: new product demand forecasting [13, 34], short-observation new product sales forecasting [21, 24, 25] and product recommendation [6].

Footnote 2: The pandemic represented a unicum in the dynamics of fast fashion companies, so it has not been included. The market effectively restarted in which is an ongoing season at the time of writing.

Footnote 3: Prices have been normalized for the sake of confidentiality.

In this paper, we focus on one of these problems: short-observation new product forecasting (SO-fore). SO-fore aims at predicting future sales in the short term, using as past statistics only the early sales of a given product (Fig. 2b provides a visual comparison with standard forecasting in Fig. 2a). In practice, a few weeks after a product's market delivery, one has to assess how well the clothing item has performed and forecast its behavior in the coming weeks. This is crucial to improve restocking policies [27]: a clothing item with a rapid selling rate should be restocked to avoid stockouts. Two particular cases of the SO-fore problem are taken into account: SO-fore 2−10, in which the observed window is 2 weeks long and the forecasting horizon is 10 weeks long, required when a company wants to implement few restocks [14]; and SO-fore 2−1, where the forecasting horizon changes to a single week, required instead when a company wants to take decisions on a weekly basis, as in the ultra-fast fashion supply chain [9, 39]. Our findings show that the use of image data is crucial in the absence of long term statistics characterizing the sales, because the pictorial content of a fast fashion product can be used to inherit long term statistics from visual similarity (see Fig. 2b): the images make it possible to refer with high precision to past data that are akin to the product under analysis, providing informative priors.

Visuelle 2.0 is a substantial extension of the unpublished Visuelle dataset [37] , used only for new product demand forecasting, where less data (no weather conditions, no customer data) was furnished, aggregated per-product over all the retail stores (no geographical/store dimension).

Visuelle 2.0 describes the sales between Nov. 2016 and Dec. 2019 of 5355 different products across 110 different shops. For each product, multi-modal information is available, as described in the following.

Time series data. Given a product $i$ of size $s$ at a retail store $r$, we refer to its product sale signal as $S(i, s, r, t)$, where $t$ refers to the $t$-th week of market delivery, with $i = 1, ..., N$, $s = 1, ..., M$, $r = 1, ..., L$ and $t = 1, ..., K$. We also define the inventory position signal $I(i, s, r, t)$, indicating the inventory on hand for the quadruplet $(i, s, r, t)$. Combining these data, we can identify all the legit sale signals, i.e. those that do not involve a stockout until the $K$-th week. Formally, a legit sale signal is $S(i, s, r, 1), ..., S(i, s, r, t_{legit})$, where $t_{legit} + 1$ indicates the first week with a stockout, $I(i, s, r, t_{legit} + 1) = 0$, and $t_{legit} > K$. Hence, we guarantee that a zero-sale legit signal (i.e. $S(i, s, r, t) = 0$) is provided only when nobody bought $i$ (even though it was available at the shop) and not because of a stockout. These signals are important because they focus on the net performance of a product, independently of the inventory management. To make the signals denser, we aggregate the different sizes, obtaining the final $sale(i, r, t) = \sum_{s=1}^{M} S(i, s, r, t)$. Additionally, we include the Restock flag signal $R(i, s, r, t)$, indicating when a restocking has been carried out. Fig. 3a reports one example of a legit sale signal. The initial stock is represented by the initial amount of items available in the inventory. Fig. 3b depicts a log-density plot of the legit sales of all the products, averaged over categories and retail stores during the SS19 season. Fig. 5 shows sales statistics for the 110 available shops.

Image data. Each product is associated with an RGB image whose resolution varies from 256x256 to 1193x1172, with a median size of 575x722 (WxH). Each image portrays the clothing item on a white background, with no person wearing it. Some examples of these images are provided in Fig. 1.

Text data. Multiple textual tags related to each product's visual attributes are available. These tags have been extracted with diverse procedures or chosen by hand, and carefully validated by the Nuna Lie team. The first tag is the category, taken from a vocabulary of 27 elements, visualized in Fig. 4a; the cardinality of the products shows large variability among categories, due to the fact that some categories (e.g. long sleeves) cost less and ensure higher earnings. The color tag (Fig. 4b) represents the most dominant color, chosen among 10 manually detected colors. The fabric tag comes directly from the technical sheet of the products, chosen from a vocabulary of 59 elements. Finally, the release date for each product-shop pair is recorded as a textual string.

Figure 6. Customer shopping history statistics: (a) the blob size indicates how many customers have a shopping history spanning n seasons (x-axis; the seasons need not be consecutive, with at least one item bought per "active" season) and consisting of m items in total (y-axis); (b) number of baskets associated with customers over shopping seasons.
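The legit-signal filter and the aggregation over sizes described under the time series data can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the dictionary layout keyed by (item, size, shop), the function names `is_legit` and `aggregate_sizes`, the 0-indexed weekly lists, and the choice to sum only the legit per-size signals are all hypothetical.

```python
from collections import defaultdict


def is_legit(inventory, K):
    """Return True if no stockout (inventory == 0) occurs in the first K weeks."""
    return all(stock > 0 for stock in inventory[:K])


def aggregate_sizes(sales, inventory, K):
    """Sum legit per-size signals S(i, s, r, t) over sizes s to get sale(i, r, t).

    sales, inventory: dicts mapping (item, size, shop) -> list of K weekly values
    (assumed layout, not the dataset's actual file format).
    """
    totals = defaultdict(lambda: [0] * K)
    for (item, size, shop), series in sales.items():
        if not is_legit(inventory[(item, size, shop)], K):
            continue  # skip signals affected by a stockout
        for t in range(K):
            totals[(item, shop)][t] += series[t]
    return dict(totals)


sales = {
    ("a", "S", "r1"): [1, 0, 2],
    ("a", "M", "r1"): [0, 3, 1],
    ("a", "L", "r1"): [5, 5, 5],
}
inventory = {
    ("a", "S", "r1"): [4, 3, 3],
    ("a", "M", "r1"): [2, 2, 1],
    ("a", "L", "r1"): [5, 0, 0],  # stockout from the second week onwards
}
aggregated = aggregate_sizes(sales, inventory, K=3)
```

In this toy input the size "L" series is discarded because its inventory reaches zero, so the zero sales that remain in the aggregated signal reflect products that were on the shelf but not bought.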

Customer purchase data. Visuelle 2.0 contains anonymized data for 667086 customers who have requested a fidelity card, thanks to which it is possible to extract the history of their purchases and the baskets of products they bought. These data consist of: ID of the purchased product, date-time of purchase, retail store ID and quantity. Fig. 6 gives a glimpse of the distribution of these data, where it is possible to note that there are around 6k users who have continuously bought a total of 25 products over 4 seasons (Fig. 6a). More than 2K users have continuously bought 15 baskets of products over 4 seasons. Customer data are useful to test recommendation approaches, whose goal is to recommend available products that the user will eventually buy, possibly exploiting image data to capture personal aesthetic preferences. In this paper, we do not tackle this problem.

Exogenous time series. Google Trends time series for each product are provided, based on the product's three associated attributes: color, category, fabric. The trends are downloaded starting from 52 weeks before the product's release date, essentially providing a popularity curve for each of the attributes. The efficacy of Google Trends for predictions on fast fashion problems has been demonstrated before in [35, 37]. Weather reports downloaded from IlMeteo are also supplied, containing the real weather conditions on a daily basis at the municipality level. The efficacy of weather reports for forecasting in fast fashion has been demonstrated in [3, 38].

Here we show how Visuelle 2.0 serves as a genuine benchmark for two types of SO-fore: i) SO-fore 2−10 and ii) SO-fore 2−1.

SO-fore 2−10 allows the company to customize the restocking operations for each product on the basis of the early sales, minimizing the number of such operations. Firstly, the sales series are split into an observation window and a horizon window (i.e. the known past and the future to forecast), set here to 2 and 10 weeks respectively, covering the 12-week fast fashion life-cycle [41] . Formally, the goal is to perform a multi-step forecast of the sales of an item i in a store r (sale(i, r, 3), ..., sale(i, r, 12)), given the first two time-steps. Two weeks is a standard period to sufficiently understand the whereabouts of the fashion market and take decisions for the future [40, 45] .
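The split into observation and horizon windows can be illustrated with a short sketch. The function name `split_windows` and the plain-list representation of the weekly series sale(i, r, 1), ..., sale(i, r, 12) are assumptions made for this example.

```python
def split_windows(series, n_obs=2, horizon=10):
    """Split a weekly sales series into (observed past, future to forecast)."""
    if len(series) < n_obs + horizon:
        raise ValueError("series is shorter than observation + horizon windows")
    return series[:n_obs], series[n_obs:n_obs + horizon]


# Twelve weeks of toy sales for one item-shop pair.
weekly_sales = [3, 5, 4, 4, 2, 6, 1, 0, 2, 1, 1, 0]
observed, future = split_windows(weekly_sales)
```

With the default arguments, `observed` holds the first two weeks and `future` holds weeks 3 to 12, which is the multi-step target of SO-fore 2−10.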

SO-fore 2−1 serves in other contexts where a weekly restocking schedule is adopted [9, 39] . The idea is to estimate the time-point sale(i, r, t) given the previous two time-steps sale(i, r, t − 1), sale(i, r, t − 2). Similarly to before, we set the initial observation window to 2, but use a sliding window approach to perform an autoregressive forecast.
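A sliding window of two observations can be sketched as below. The text does not specify whether each window is filled with ground-truth sales or with the model's previous predictions, so both readings are shown; the function names and the toy averaging model are hypothetical.

```python
def sliding_pairs(series, n_obs=2):
    """Build (window, target) pairs: the previous n_obs weeks and the next week."""
    return [(series[t - n_obs:t], series[t]) for t in range(n_obs, len(series))]


def autoregressive_forecast(model, observed, steps, n_obs=2):
    """Roll a one-step model forward, feeding its own predictions back in."""
    history = list(observed)
    preds = []
    for _ in range(steps):
        y_hat = model(history[-n_obs:])
        preds.append(y_hat)
        history.append(y_hat)
    return preds


def mean_model(window):
    """Toy stand-in for a trained one-step forecaster."""
    return sum(window) / len(window)


pairs = sliding_pairs([4, 5, 6, 7])
rollout = autoregressive_forecast(mean_model, [2, 4], steps=2)
```

`sliding_pairs` corresponds to estimating sale(i, r, t) from sale(i, r, t − 1) and sale(i, r, t − 2) at every week, while `autoregressive_forecast` shows how the same one-step model could be chained when later observations are not available.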

In both cases we split the data into train and test sets, where the test set contains the 10% most recent item-shop pairs, such that the items that are seen in training will have always been released before the ones we test on. We use the following three approaches as baselines:

• Classical forecasting methods, namely the Naive method [18] (using the last observed ground truth value as the forecast) and Simple Exponential Smoothing (SES) [18, 19];

• kNN [11, 29], which produces forecasts based on product similarity. This is done by finding the k-Nearest Neighbors from the past that are similar to the input product and performing a weighted average of their sales. We set k = 11 and compute the similarity between products using either the known time series or image features extracted by a pre-trained ResNet [15];

• An autoregressive, attention-based RNN architecture [11], where the different data modalities are first processed separately and then merged together through several additive attention modules [2].

Results are displayed in Table 1 and Table 2, where all the listed approaches only use sales time series data as input, unless specified otherwise. Classical forecasting approaches tend to give poor performances due to the small number of observations [18]. The kNN-based methods show an improvement over the statistical forecasting baselines, demonstrating that inter-product similarity is important when predicting future sales. This is tied to the notion that new products will sell comparably to older, similar products. Trivially utilizing the images with kNN lowers the performances, but their contribution is highlighted when utilising an expressive neural network architecture. Cross-Attention RNN [11] outperforms the others by a noticeable margin, because the model is able to learn non-linear, inter-product dependencies throughout the whole training set, as well as advanced temporal dynamics. It is worth noting that we reach the best performances in SO-fore 2−10 by pairing each product's time series input with its respective image. This shows that visual representations allow the model to better understand long term forecasting patterns. Another important takeaway is that the results for time-series-only methods are much better on SO-fore 2−1, due to the localized temporal information and shorter forecasting horizon, while the images become less important. Additionally, we tested Cross-Attention RNN on the demand forecasting task, i.e., predicting the full sales series without having access to any previous observations, but only to the product image. As expected, these results are inferior to the SO-fore variants, but comparable to the kNN approaches.

Table 1 and Table 2 (the opening of both captions was lost in extraction): "... [37] for the different baselines. The lower the better for both metrics." One of the two captions adds: "In the first row we also report results for the demand forecasting of new products without past sales, demonstrating how much the knowledge of the initial sale dynamics improves forecasting."

Our dataset and experiments provide a general overview of how problems in the fashion realm can be tackled and how the use of computer vision and multi-modal approaches is key to providing better solutions. The dataset page will contain further information regarding other possible challenges and tasks for Visuelle 2.0.
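The simpler baselines and the two reported metrics can be sketched in a few lines of plain Python. Several details here are assumptions rather than the paper's settings: the SES smoothing factor `alpha`, the Euclidean distance on the observed series, the inverse-distance weights used in the kNN average, and the WAPE formula (the commonly used sum of absolute errors over sum of absolute actuals, whereas the paper defers to [37] for the exact definition).

```python
import math


def naive_forecast(observed, horizon):
    """Repeat the last observed value over the whole horizon."""
    return [observed[-1]] * horizon


def ses_forecast(observed, horizon, alpha=0.5):
    """Simple Exponential Smoothing: flat forecast at the final smoothed level."""
    level = observed[0]
    for y in observed[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon


def knn_forecast(query, train_obs, train_future, k=11):
    """Weighted average of the future sales of the k most similar past series."""
    dists = [math.dist(query, obs) for obs in train_obs]
    nearest = sorted(range(len(train_obs)), key=dists.__getitem__)[:k]
    weights = [1.0 / (dists[j] + 1e-8) for j in nearest]  # assumed weighting
    total = sum(weights)
    horizon = len(train_future[0])
    return [
        sum(w * train_future[j][t] for w, j in zip(weights, nearest)) / total
        for t in range(horizon)
    ]


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def wape(y_true, y_pred):
    """Weighted Absolute Percentage Error, in percent (assumed definition)."""
    abs_err = sum(abs(a - b) for a, b in zip(y_true, y_pred))
    return 100.0 * abs_err / sum(abs(a) for a in y_true)
```

Replacing the observed series in `knn_forecast` with image feature vectors gives the image-based kNN variant, since only the similarity space changes.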

The main paper discusses two contributions: 1) the novel Visuelle 2.0 dataset and 2) the advantage of utilising a multi-modal approach for short-observation new product forecasting (SO-fore). This additional material expands on each of these main topics, as follows.

The Visuelle 2.0 dataset. The following topics are addressed in Sec. 4.1:

• How Visuelle 2.0 compares with respect to the current datasets of forecasting for fast-fashion, exploring the literature of time-series forecasting and computer vision in this area (Sec.4.1.1);

• A showcase of the data contained in Visuelle 2.0, including examples of products, the associated time series data, image data and the text data. An excerpt of these data is reported in Fig. 7 . Subsequently, we will show some examples of exogenous data i.e., the Google Trends associated to a given product, and the weather reports associated to a given shop; finally, customer purchase data will be presented (Sec. 4.1.2);

• A list of possible challenges that can be studied on Visuelle 2.0, specifically: more tasks related to short-observation new product forecasting; new product demand forecasting; and product recommendation (Sec. 4.1.3).

The SO-fore problem. In Sec. 4.2, the Cross-Attention RNN will be detailed, with a graphical representation that illustrates its components (Sec. 4.2.1).

The Visuelle 2.0 dataset can be related to two scientific fields: 1) forecasting for (fast) fashion [9, 30, 39], with particular emphasis on those works that exploit deep learning techniques [11, 24]; 2) the recent field at the intersection between computer vision and fashion [8], with special emphasis on the tasks of popularity prediction and fashion forecasting. In both cases, Visuelle 2.0 innovates for specific reasons, which are detailed in the following in two separate sections. In both cases, Visuelle 2.0 also offers an unprecedented richness of data which other datasets do not possess, such as customer purchase information and the different exogenous data.

Forecasting for (fast) fashion. The most used forecasting techniques for fast fashion include classical ARIMA, SARIMA, exponential smoothing [5], regression [31], Box & Jenkins [4] and Holt-Winters [43], as reported in the recent review of [24]; machine learning approaches (decision trees, random forests, SVMs, neural networks) are in their infancy on this topic and, most importantly, they do not consider multi-modal data, but only time series. This has naturally created an abundance of time-series datasets for sales forecasting [33] and demand forecasting [32], and an absence of datasets that include images, a gap which we fill with Visuelle 2.0. As a notable exception, the work of [11] proposes a set of techniques for demand forecasting in which images are taken into account by an attention-based RNN framework, which we also utilise for SO-fore as explained in Sec. 4.2.1. Unfortunately, the dataset on which they perform their experiments is not publicly available, while Visuelle 2.0 will be made publicly available.

The peculiarity of the field at the intersection of computer vision and fashion is the exploration of multi-modal data (images, text, time series) for prediction tasks. In general, computer vision approaches have been considered for the task of popularity prediction or fashion forecasting [8]. In both cases, the ground truth signal is built on top of the public ratings obtained on online platforms such as Chictopia.com [36, 44], Lookbook.nu [23] or Amazon [1], which consider outfits [23, 36, 44] or several outfits exhibiting the same style [1]: an outfit or a style is popular if it receives a high rating in terms of number of "likes" or "stars". In the case of Visuelle 2.0, we can assume that a product is more popular than others of the same category if, in the same season, it has sold more. In this setup, our dataset allows a more fine-grained analysis, since one can predict the popularity, in terms of sales, of a single product. Also, Visuelle 2.0 represents the very first dataset which permits multi-modal analysis on the data of a real fast fashion company, meaning that approaches which succeed on this benchmark can be directly applied to the fast fashion market.

Example of products. In Sec. 2 of the main paper we have given some statistics about the products in the Visuelle 2.0 dataset. Here we give some qualitative examples of their image data, text attributes and associated time series: product sales, inventory position, Restock flag, and discount, the latter omitted in the main paper due to lack of space. Formally, following the notation of the main paper, given a product $i$ at a retail store $r$, we refer to its discount signal as $D(i, r, t)$, where $t$ refers to the $t$-th week of market delivery, with $i = 1, ..., N$, $r = 1, ..., L$ and $t = 1, ..., K$. The discount signal is expressed as a percentage, describing how much a particular item is discounted; for example, $D(i, r, t) = 20$ indicates that the initial price defined for product $i$ in retail store $r$ was discounted by 20% at time $t$. Fig. 7 showcases all of these data for three products of season AW19. Other figures can be found at the end of this additional material, reporting products of other seasons (Fig. 12, Fig. 13 and Fig. 14).

Figure 8. Excerpt of exogenous weather data (humidity and maximum temperature) for three major Italian municipalities: Milan, Turin and Rome. Humidity in these areas tends to be negatively correlated with maximum temperature. Rome tends to be much hotter than Milan and Turin throughout the year. An interesting observation which can prove beneficial to forecasting is the seasonal nature of weather phenomena, which can provide information as to which clothing products may sell more in a particular period.

Exogenous data. Exogenous data are often neglected as a resource within datasets, especially in forecasting. This is due to their nature, since they come, by definition, from an external phenomenon that is not directly related to the data being analysed. Nevertheless, adding exogenous variables such as weather data [3, 38] or popularity data [35, 37] to forecasting models has proven extremely beneficial in terms of forecasting performance. For this reason, we provide in Visuelle 2.0 multivariate exogenous data for both weather and popularity, in the form of detailed weather reports (Fig. 8) and Google Trends (Fig. 9). A more detailed explanation of both examples is provided in the respective captions.

Customer purchase data. As reported in Sec. 2 and Fig. 6 of the main paper, among the 667086 total registered users of Nuna Lie, 6k users have continuously bought a total of 25 products over 4 seasons. Fig. 10 shows a random excerpt of 10 of these users, with a random subset of 9 purchased items each (no cherry picking); in many cases, personal styles do emerge, showing systematic preferences on diverse attributes, as described in the caption of the figure.

Figure 9. Example of the exogenous Google Trends data available in Visuelle 2.0 for two completely different products. The signals can have regular trend and seasonality (left) or be stationary and seem noisy (right). This can prove helpful for forecasting models, because it makes it possible to capture both global and cyclical popularity and therefore anticipate sales.

In this paper, we explored the problem of short-observation forecasting on Visuelle 2.0, with the precise focus of showing the benefit of image data on this task. We are far from saturating the performance, which encourages further improvements. These could be provided by diverse techniques (exploiting LSTM or transformer-based architectures), or by including the additional exogenous data available in Visuelle 2.0. Google Trends and weather reports, in fact, are signals which have been shown elsewhere to be predictive [3, 35, 37, 38], so this should be a natural next step.

Other challenges which can be explored on Visuelle 2.0 are listed in the following.

• Demand forecasting. Forecasting demand is a crucial issue for driving efficient operations management plans [13, 30, 41]. This is especially the case in the fast fashion industry, where demand uncertainty, lack of historical data, variable ultra-fast life-cycle of a product and seasonal trends usually coexist [20, 27]. In rough terms, demand forecasting outputs the amount of goods to buy from the suppliers. This amount is then distributed among the different retailers, with the aim of avoiding zero-stocks or excessive unsold inventory. In this paper we show a glimpse of demand forecasting on Visuelle 2.0, at the level of the single shop (i.e. predicting how much a single shop will need during the next season), adopting the recent RNN-based approach of [11] on time series and time series + image.

In [37] we report some results on the aggregated signal in the old Visuelle dataset; it is worth remembering that in that case the signal about the single shops was missing, fewer products were available, and the only important result was to show how Google Trends data are beneficial. In this case, a deep analysis of demand forecasting on Visuelle 2.0 needs to be carried out, including discounts and exogenous signals like weather reports (Fig. 8) and Google Trends (Fig. 9).

Figure 10. A random sampling of users/purchases (users U1 to U10). Personal styles do emerge: users 1 and 10 have no trousers in their logs, user 6 has bought almost only short sleeves and no trousers, while user 7 seems to prefer long sleeves and several trousers; user 10 has a marked preference for light yellow-grayish colors.

• Product recommendation. An important feature of Visuelle 2.0 is the presence of customer purchase data; over 8 seasons, 667086 customers have bought a total of 3253876 items, which cover a substantial share (84%) of the total purchases collected within the dataset. A graphical representation of these data is reported in Fig. 10, where it is visible that some users have marked preferences.

Product recommendation on these data would consist in defining a particular time index $t_{rec}$, such that the historical data of all the past purchases (older than $t_{rec}$) of all the customers are taken into account. Two types of inference are then possible: 1) suggesting which product (or category, or attribute) $z_k$ a specific customer $u_i$ could be interested in; a positive match occurs when $u_i$ actually purchases $z_k$ (or some item which is in the category $z_k$ or that expresses the attribute $z_k$) after time $t_{rec}$; 2) the same as before, but including a specific time interval $T_{buy}$ within which the customer will buy. In practice, a positive match occurs when $u_i$ actually purchases $z_k$ (or some item which is in the category $z_k$ or that expresses the attribute $z_k$) in the time interval $]t_{rec}, t_{rec} + T_{buy}]$. In general, product recommendation can be carried out by standard collaborative-filtering based techniques, but also by considering recommendation as an instance of forecasting [12, 17] and vice versa: this interplay could certainly be explored with the Visuelle 2.0 dataset.
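The two notions of a positive match can be made concrete with a small sketch. The purchase log layout (customer, product or attribute, time), the integer week indices used as times, and the function name `is_positive_match` are assumptions for illustration only.

```python
def is_positive_match(purchases, customer, target, t_rec, t_buy=None):
    """Check whether a recommendation of `target` to `customer` is a positive match.

    purchases: iterable of (customer_id, product_or_attribute, time) tuples.
    With t_buy=None, any purchase after t_rec counts (first inference type).
    Otherwise the purchase must fall in the interval ]t_rec, t_rec + t_buy]
    (second inference type).
    """
    for cust, item, t in purchases:
        if cust != customer or item != target or t <= t_rec:
            continue  # wrong customer, wrong item, or part of the known history
        if t_buy is None or t <= t_rec + t_buy:
            return True
    return False


# Toy purchase log: (customer, product, week index).
log = [("u1", "z1", 5), ("u1", "z2", 12), ("u2", "z1", 8)]
open_ended = is_positive_match(log, "u1", "z2", t_rec=10)
too_narrow = is_positive_match(log, "u1", "z2", t_rec=10, t_buy=1)
within_window = is_positive_match(log, "u1", "z2", t_rec=10, t_buy=2)
```

The interval is open on the left and closed on the right, so a purchase made exactly at `t_rec` belongs to the history, while one made exactly at `t_rec + t_buy` still counts as a match.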

The Cross-Attention RNN [11] can be described as an autoregressive, sequence-to-sequence neural network that tries to understand the different, non-linear relationships in the various data modalities and then performs predictions by understanding which part of the data is most important for the forecasting task. The attention modules constitute a large part of the model and are exactly as in [2], where at each decoding step we try to attend to the encoder features based on the current decoder hidden state. The encoder starts by embedding each input modality into a common feature space $\mathbb{R}^D$. The input observation sales (time series) are first passed through an additional self-attention layer [42], differently from the original work in [11], and then projected through a fully connected layer. This helps filter out initial noise from past sales observations. The image and the textual tags are embedded using a ResNet-101 [15] and learnable embedding layers, respectively. The temporal features extracted from the product's release date are also embedded using learnable embedding layers. Cross-Attention RNN works, by default, in an autoregressive manner; therefore, at each decoding step three different additive attention modules are applied. These modules allow the decoder hidden state to attend to the time series embedding, the image embedding and, most importantly, the concatenated, multi-modal embedding. A residual learning approach [15] is applied to allow the network to scale better with the number of hidden layers and also to learn to ignore null contributions from the attention mechanism. After each decoding step, the GRU hidden state is updated based on the last processed input. For extensive details on this model, we refer to [11].

Figure 11. A visual description of the Cross-Attention RNN model, the neural network architecture used in our SO-fore experiments. Taken from [11].
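One additive attention module of the kind introduced in [2] can be sketched in plain Python as follows. This is a didactic sketch, not the model's implementation: the weight matrices `W_q`, `W_k` and the vector `v` would be learned parameters, and here they are hypothetical hand-set values; the decoder hidden state plays the role of the query and the encoder features the role of the keys.

```python
import math


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def additive_attention(query, keys, W_q, W_k, v):
    """Additive attention: score_j = v . tanh(W_q q + W_k k_j)."""
    q_proj = matvec(W_q, query)
    scores = []
    for key in keys:
        k_proj = matvec(W_k, key)
        hidden = [math.tanh(a + b) for a, b in zip(q_proj, k_proj)]
        scores.append(sum(vi * h for vi, h in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(keys[0])
    # Context vector: attention-weighted sum of the encoder features.
    context = [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]
    return context, weights


identity = [[1.0, 0.0], [0.0, 1.0]]
context, weights = additive_attention(
    query=[0.0, 0.0],
    keys=[[0.0, 0.0], [2.0, 0.0]],
    W_q=identity,
    W_k=identity,
    v=[1.0, 0.0],
)
```

In the architecture described above, three such modules would run at every decoding step, one per embedding (time series, image, concatenated multi-modal), each returning a context vector for the decoder.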

The model is trained with a batch size of 128 and the MSE (Mean Squared Error) loss function, using the Adafactor optimizer, on two NVIDIA RTX TITAN GPUs. During training, we apply dropout after each embedding module and also apply teacher forcing at random with a probability $p_{tf} = 0.5$.
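Random teacher forcing can be illustrated with a framework-free sketch. The text does not say whether the random choice is made per decoding step or per sequence; the per-step choice below is an assumption, as are the function name and the `step_fn` callable that stands in for one GRU decoding step.

```python
import random


def decode_with_teacher_forcing(step_fn, first_input, targets, p_tf=0.5, rng=random):
    """Run a decoder for len(targets) steps with random teacher forcing.

    step_fn(prev_value) -> prediction for the next step (stand-in for the decoder).
    After each step, the next input is the ground truth with probability p_tf,
    otherwise the model's own prediction.
    """
    preds = []
    prev = first_input
    for target in targets:
        y_hat = step_fn(prev)
        preds.append(y_hat)
        prev = target if rng.random() < p_tf else y_hat
    return preds


def toy_step(prev):
    """Toy decoder step: predict the previous value plus one."""
    return prev + 1


always_forced = decode_with_teacher_forcing(toy_step, 0, [10, 20, 30], p_tf=1.0)
never_forced = decode_with_teacher_forcing(toy_step, 0, [10, 20, 30], p_tf=0.0)
```

With `p_tf=1.0` every step is conditioned on the ground truth, and with `p_tf=0.0` the decoder is fully autoregressive; the value 0.5 reported above mixes the two regimes during training.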

[1] Fashion forward: Forecasting visual style in fashion

[2] Neural machine translation by jointly learning to align and translate

[3] A survey on retail sales forecasting and prediction in fashion markets

[4] Time series analysis: forecasting and control

[5] Smoothing, forecasting and prediction of discrete time series

[6] Fashion recommendation systems, models and methods: A review

[7] Fashion trend forecasting using machine learning techniques: A review

[8] Fashion meets computer vision: A survey

[9] Fast fashion sales forecasting with limited data and time

[10] Leveraging multiple relations for fashion trend forecasting based on social media

[11] Attention based multi-modal new product sales time-series forecasting

[12] Recommendation systems: Algorithms, challenges, metrics, and business opportunities. Applied Sciences

[13] Machine learning demand forecasting and supply chain performance

[14] Optimizing inventory replenishment of retail fashion products. Manufacturing & Service Operations Management

[15] Deep residual learning for image recognition

[16] Fashion trend forecasting

[17] Web service recommendation based on time series forecasting and collaborative filtering

[18] Forecasting: Principles and Practice

[19] Forecasting with exponential smoothing: the state space approach

[20] An improved demand forecasting with limited historical sales data

[21] A data-driven forecasting approach for newly launched seasonal products by leveraging machine-learning approaches

[22] Fashion trends: Analysis and forecasting

[23] Dressing for attention: Outfit based fashion popularity prediction

[24] Exploring the use of deep neural networks for sales forecasting in fashion retail

[25] Retail sales forecasting with meta-learning

[26] Knowledge enhanced neural fashion trend forecasting

[27] Improving short-term demand forecasting for short-lifecycle consumer products with data mining techniques. Decision Analytics

[28] Geostyle: Discovering fashion trends and events

[29] A methodology for applying k-nearest neighbor to time series forecasting

[30] Demand forecasting in the fashion industry: a review

[31] A regression-based approach to short-term system load forecasting

[32] Using stacking approaches for machine learning models

[33] Linear, machine learning and probabilistic approaches for time series analysis

[34] Explainable AI based interventions for pre-season decision making in fashion retail

[35] Googling fashion: forecasting fashion consumer behaviour using Google Trends

[36] Neuroaesthetics in fashion: Modeling the perception of fashionability

[37] Well googled is half done: Multimodal forecasting of new fashion product sales with image-based Google Trends

[38] Fast fashion lessons

[39] Global commodity chains and fast fashion: How the apparel industry continues to re-invent itself

[40] Sales forecasting in apparel and fashion industry: A review. Intelligent Fashion Forecasting Systems: Models and Applications

[41] Intelligent demand forecasting systems for fast fashion

[42] Attention is all you need

[43] Forecasting sales by exponentially weighted moving averages

[44] Chic or social: Visual popularity analysis in online fashion networks

[45] A web-based system for fashion sales forecasting