Image-Based Virtual Try-on System: A Survey of Deep Learning-Based Methods

In recent years, technology has made rapid progress in many industries, in particular the garment industry, which aims to follow consumer desires and demands. One of these demands is to fit clothes before purchasing them online. Therefore, much research has focused on developing an intelligent apparel industry to improve the online shopping experience. Most of this work addresses the virtual try-on task, developing image-based virtual fitting systems that face various challenging issues, since people can appear in different poses and views. In recent years, many studies have applied deep learning methods to face the challenges of pose variation, occlusion, and illumination changes. In this paper, we therefore review a broad range of research on deep learning methods for image-based virtual fitting solutions, summarizing their challenges, their main frameworks, and the popular benchmark datasets used for training. An overview of different evaluation metrics is then presented, with some examples of performance comparison, and lastly, some promising future research directions are discussed.


Introduction
In the last few years, online shopping for clothes has become a common practice among millions of people around the world. It has shown great progress and become a habitual activity for many consumers. For this reason, online shopping for clothes has deservedly earned its place. As statistical proof, the global fashion apparel market currently exceeds 3 trillion US dollars and represents two percent of the world's Gross Domestic Product (GDP) [1]. In 2020, a revenue of 718 billion US dollars was attained in the fashion sector, with an expected growth of 8.4% for 2021 [2].
The main reason for the growth of online shopping in recent years is that this kind of trade has become more and more like shopping in person, thanks to the efforts of businesses to add new features and services intended to provide their customers the same support and comfort they would have during an in-person shopping experience. This goal has been achieved by using computer technology to develop virtual try-on applications that assist the fitting of garment products, letting consumers see how clothes look on themselves, how a top and bottom match together, and how the size of clothes fits them. Online shopping thus provides more information and availability for all kinds of products, encouraging fashion retailers to invest in exploring new sales methods and in optimizing the technological process of purchasing clothes, such as virtual fitting systems. These solutions draw a new picture of the online shopping experience and bring it to a high level of realism and comfort.
Since current graphics tools fail to meet the increasing demand for personalized visual content manipulation, many algorithms have been proposed to address clothes swapping using recent advances in computer vision tasks such as fashion detection, fashion analysis, and fashion synthesis. These solutions require considerable effort from researchers to change clothes across images while preserving details and identities. Current image editing tools, e.g., Adobe Photoshop or Adobe Illustrator, cannot give realistic results due to the many challenges of changing clothing in 2D images, such as the deformation of the clothes, different poses, and different textures. Hence, recent studies have adopted deep-learning-based methods to counter these problems and achieve more accurate results.
In the literature, a few fashion surveys have been proposed [3,4]. Earliest among them, an overview of intelligent facial and clothing analysis was presented by Liu et al. [3]. In 2018, Song and Mei [4] introduced the development of fashion tasks arising from their convergence with multimedia. Next, a general survey painted the global picture of intelligent fashion without targeting a specific problem [5]. Since then, due to the rapid development of computer vision, many new tasks have appeared within intelligent fashion, so many related works must be updated. In this direction, this research aims to conduct a comprehensive literature review of deep learning methods applied in the fashion industry, citing research published in recent years and noting its relationship to earlier studies. The contribution of this work consists in responding to the following research questions: -RQ1. What is the impact of Artificial Intelligence (AI) and deep learning (DL) on the fashion apparel industry? -RQ2. How are virtual try-on systems developed?
-RQ3. What are the planned improvements to extend research in this area?
The remainder of this paper is structured as follows: Section 2 outlines the research framework adopted for this review. Section 3 is dedicated to the literature review, which is divided into two main parts: the first presents fashion detection tasks, including fashion parsing, human pose estimation, and landmark detection; the second covers fashion synthesis, containing style transfer, pose transfer, and clothing simulation. Section 4 provides an overview of fashion benchmark datasets and presents the performance of popular works on different tasks. Section 5 shows related applications and future directions. Finally, a conclusion is given in Section 6.

Research Framework
In this study, a Systematic Literature Review (SLR) [6] was chosen to focus on research related to virtual fitting systems based on 2D images with deep learning methods, as applied in the fashion industry. The SLR methodology adopted is shown in Fig. 1. The review process commenced with collecting and preparing data from scientific databases. Subsequently, articles were selected in different phases.
According to our research framework, we selected more than 130 articles from both journals and conferences. Articles were retrieved from popular databases and engines such as Google Scholar and ResearchGate; a screening process was then used to select the articles relevant to the research questions mentioned in the previous section. Next, the research articles were categorized according to the main steps used to develop image-based virtual fitting systems with deep learning methods. After categorization, information was extracted and the selected articles classified based on the key terms of the research topic to address our research questions. As shown in Fig. 1, which presents the article classification according to the research questions, RQ1 focuses on understanding the overall trend of AI in the fashion industry; hence, the screening process was limited to articles discussing the implementation and execution of AI techniques to improve online shopping. RQ2 aims at identifying the various stages of the virtual fitting framework where AI methods are employed. RQ3 aims to understand the extent to which online shopping problems are a focus of research studies. These key modules were considered during information extraction from the research articles.

Review of the Literature
In recent years, advanced machine learning approaches have been successfully applied to various fashion-based problems. The topics of fashion research in the literature on image-based garment transfer are summarized in Fig. 2. One branch of fashion research is fashion detection, which aims to locate and label clothing-relevant structure in the scene (i.e., fashion parsing, landmark detection, and pose estimation); it is complemented by fashion synthesis, which brings us a step closer to an intelligent fashion assistant.

Fashion Detection
Fashion detection is an essential step for the virtual try-on task: it consists of detecting the human body parts to predict the region of clothing synthesis. To apply this step in virtual try-on systems, three aspects must be addressed: fashion parsing, human pose estimation, and fashion landmark detection.

Fashion Parsing
Fashion parsing, in other words human parsing with clothing classes, is a specific form of semantic segmentation. This task refers to generating pixel-level labels on the image based on clothing items and body parts such as hair, head, upper clothes, pants, etc. It is a very challenging problem, since the number of garment types and the variations in configuration and appearance are enormous. Fig. 3 presents examples of fashion parsing results based on semantic segmentation, generated by the work of Ji et al. [7].
In the fashion domain, a large number of potential applications have been devoted to various tasks, and particularly to human parsing [8][9][10][11][12][13][14]. The field began with Yamaguchi et al. [8], who proposed a combination of fashion parsing and human pose estimation. They then proposed clothes parsing with a retrieval-based approach [9,10] to resolve the constrained parsing problem. After that, a weak supervision approach for fashion parsing was presented by Liu et al. [11], who resorted to labeling images with color-category labels instead of pixel-level annotations. The inconsistent targets between pose estimation and clothing parsing in these works lead to results that are far from perfect. Therefore, other studies attempted to relax this restriction, such as the work of Dong et al. [12], which proposed a traditional hand-crafted pipeline that was not considered a perfect solution because many hand-designed processing steps were needed. Liang et al. [13,14] then treated human parsing with a contextualized approach by providing clothing tags at the image level. These hand-crafted approaches present many limitations because they need to be designed carefully.
To fix this problem, some CNN-based approaches were explored, such as the framework of Liang et al. [15] based on deep human parsing with active template regression for semantic labeling. With the intent to improve the parsing results of their previous work, Liang et al. [14] developed a Contextualized CNN (Co-CNN) architecture to capture, simultaneously, cross-layer context, global image-level context, and local super-pixel context. In 2018, Liao et al. [16] built a Matching CNN (M-CNN) network to solve the issues of parametric and non-parametric CNN-based methods. In the same year, Liang et al. [17] implemented an important self-supervised method named "Look Into Person" (LIP) to eschew the necessity of labeling human joints in model training (Fig. 4). Following their previous work [17], the same authors proposed a JPPNet network [18] to deal with both the human parsing and human pose estimation tasks. Different from the abovementioned works that only focus on the single-person parsing task, several works [20][21][22] handle scenarios with multiple persons. A deep Nested Adversarial Network (NAN) was presented in the work of Zhao et al. [20] to understand humans in crowded scenes. This network is composed of three Generative Adversarial Networks (GANs) for semantic saliency prediction, instance-agnostic parsing, and instance-aware clustering, respectively. Gong et al. [22] proposed the first attempt to explore a detection-free Part Grouping Network (PGN) for semantic part segmentation, instance-aware edge detection, and instance-level human parsing. In 2019, Ruan et al. [21] presented a Context Embedding with Edge Perceiving (CE2P) framework to deal with both single and multiple human parsing. Most recently, hierarchical graphs have been considered for human parsing tasks [23,24] to improve parsing performance. Wang et al. [23] considered the human body as a hierarchy of multi-level semantic parts to capture the human parsing information.
Building on transfer learning, Gong et al. [24] designed a human parsing model named Graphonomy by incorporating a hierarchical graph into a conventional parsing network.

Human Pose Estimation
Advances in computer vision have been realized in many tasks, especially with deep learning-based approaches such as Human Pose Estimation (HPE), which is applied in many fields like fashion fitting to obtain specific postures of the human body through joint localization. To overcome the challenges related to HPE, many research efforts have been devoted to the related fields. In this section, we present recent research on HPE methods based on 2D images, which are classified into two groups: single-person pose estimation and multi-person pose estimation.

A-Single-person Human Pose Estimation
Single-person human pose estimation (HPE) refers to the task of localizing the skeletal keypoints of a person from an image or video frames. Fig. 5 presents some results of single-person HPE obtained on the MPII Human Pose dataset [25] and the Leeds Sports Poses (LSP) dataset [26]. Early single-person HPE methods followed a traditional route, adopting handcrafted feature extraction and sophisticated body models to obtain local representations and global pose structures [27][28][29]. Deep learning-based methods have since resorted to neural networks [30,31] to extend the traditional works. According to the different structures of the HPE task, CNN-based methods can take different forms, such as regression methods and detection methods.
Regression-based methods produce joint coordinates directly by learning a mapping from the image [32]. The early deep learning network adopted by many studies was AlexNet [33], due to its simple architecture. Toshev et al. [32] applied AlexNet to learn joint coordinates from full images. Pfister et al. [34] also exploited this network to predict human poses from videos. Luvizon et al. [35] then proposed a regression approach with a soft-argmax function to directly convert feature maps to joint coordinates. This framework enabled the learning of heatmap representations without requiring extra steps of artificial ground-truth generation. Nibali et al. [36] proposed numerical coordinate regression using a CNN to calculate joint coordinates from heatmaps.
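The soft-argmax idea described above can be sketched in a few lines of numpy: a softmax over all heatmap pixels yields a probability map, and the expected coordinate under that distribution is the joint estimate. This is only an illustrative sketch of the operation, not the implementation of [35], and the sharpening factor `beta` is an assumed hyperparameter.

```python
import numpy as np

def soft_argmax(heatmap, beta=100.0):
    """Differentiable conversion of a 2D heatmap to (x, y) joint coordinates.

    A softmax over all pixels turns the heatmap into a probability map;
    the expected coordinate under that distribution is the joint estimate.
    """
    h, w = heatmap.shape
    # Numerically stable softmax over the flattened heatmap.
    exp = np.exp(beta * (heatmap - heatmap.max()))
    prob = exp / exp.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    # Expected coordinates under the probability map.
    x = (prob * xs).sum()
    y = (prob * ys).sum()
    return x, y

# A heatmap peaked at (x=12, y=5) maps back to approximately those coordinates.
hm = np.zeros((64, 64))
hm[5, 12] = 1.0
```

Unlike a hard argmax, this expectation is differentiable, which is what allows the network in [35] to be trained end-to-end on coordinates while internally learning heatmap representations.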
Due to the difficulty of predicting joint coordinates directly from input images, many works took up this challenge and proposed effective networks based on body model structure. Sun et al. [37] proposed a structure-aware regression method using bones instead of joints. Li et al. [38] employed AlexNet as a multi-task framework to predict joint coordinates from full images. An R-CNN architecture [39] was used to detect persons, estimate pose, and classify actions. Fan et al. [38] proposed dual-source deep CNNs that take image patches and combine both local and contextual information to generate an output designed as a combination of joint detection and joint localization. For video sequences, Luvizon et al. [40] used a multi-task deep learning method to deal with both pose estimation and action recognition.
Detection-based methods treat the body parts as detection targets based on two main representations: image patches and heatmaps of joint locations. Methods in this category predict approximate locations of body parts [28] or joints [41]. For richer supervision and easier training, recent works [42,43] used heatmap-based methods to indicate the ground-truth location of each joint. Papandreou et al. [44] proposed a fully convolutional ResNet to improve the representation of joint locations with the prediction of dense heatmaps and offsets. A GoogLeNet-based network with multi-scale inputs [45] and a ResNet-based network with deconvolutional layers [46] were proposed to improve on classic networks. Many works [47][48][49][50][51] designed networks in a multi-stage style to refine results from coarse predictions.
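The heatmap supervision used by such detection-based methods typically places a 2D Gaussian at each ground-truth joint location. A minimal sketch of that ground-truth generation follows; the spread `sigma` is an assumption, chosen per dataset and network stride in practice.

```python
import numpy as np

def gaussian_heatmap(size, joint, sigma=2.0):
    """Ground-truth heatmap: a 2D Gaussian centered on the joint location.

    `size` is (height, width); `joint` is (x, y) in pixel coordinates.
    The peak value is 1.0 at the joint and decays with distance.
    """
    h, w = size
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - joint[0]) ** 2 + (ys - joint[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# One heatmap per joint; the network regresses these maps pixel-wise.
hm = gaussian_heatmap((64, 48), joint=(20, 30))
```

Compared with direct coordinate regression, this dense target gives the network a supervision signal at every pixel, which is the "more supervision information and easier training" advantage noted above.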
While the previous works attempt to adjust detected body parts to body models, other recent works [52][53][54][55][56][57][58][59] aim to encode human body structure information into the networks themselves. Yang et al. [52] proposed a CNN to predict joint locations in heatmap representation. An RNN was proposed [54] to output joint locations one by one. Chu et al. [54] proposed transforming kernels by a bi-directional tree to pass information between corresponding joints in a tree body model. Tang et al. [59] proposed a hierarchical representation of body parts, then extended their work [60] to learn features specific to each part group. Additionally, Chou et al. [61] introduced adversarial training involving two hourglass networks with the same architecture. Chen et al. [62] proposed a CNN to effectively localize the human body parts by taking priors into account during training. Peng et al. [63] exploited data augmentation to avoid the need for more data during training. Luo et al. [64] exploited temporal information with an RNN redesigned from CPM by replacing the multi-stage architecture with an LSTM structure. Tang et al. [65] improved the network structure by proposing densely connected U-nets with efficient memory usage. Feng et al. [66] adopted a model learning strategy called Fast Pose Distillation (FPD) to design an efficient Hourglass network.

B-Multi-person Human Pose Estimation
The second category of HPE methods is multi-person HPE, which must handle both detection and localization tasks. It can be divided into top-down methods and bottom-up methods. Top-down methods use bounding boxes and single-person pose estimators to detect each person in the image and predict their pose. Bottom-up methods first predict the 2D joints of all persons in the image and then assemble them into skeletons. Fig. 6 shows example results from the work of Li et al. [67]. Top-down HPE methods combine existing detection networks with single-person HPE networks [25,26,44,53]; they have achieved state-of-the-art performance on most benchmark datasets, although their processing speed depends on the number of detected people. For bottom-up HPE methods, the main components are body joint detection and joint candidate grouping, which most algorithms handle separately. Bottom-up works [68,69] achieve strong performance except under certain conditions such as human occlusion or complex backgrounds.

Fashion Landmarks Detection
Fashion landmark detection is an important task in fashion analysis: it aims to predict clothing keypoints, which are essential for understanding fashion images by obtaining discriminative representations. The local regions of fashion landmarks show more significant variance, since clothes are more complicated than human body joints. Fig. 7 shows examples of fashion landmark detection from the work of Liu et al. [70]: the first row presents results on the DeepFashion-C test set [72], and the second row shows results on the FLD dataset [70].
Liu et al. [72] first presented the fashion landmark concept and, in parallel, proposed a deep model called FashionNet [72] applied to predicted clothing landmarks. They then proposed a deep fashion alignment framework [70] based on CNNs. This framework was trained on different datasets and evaluated on two fashion applications, clothing attribute prediction and clothes retrieval. Another regression model, proposed by Yan et al. [73], was used to relax the clothing bounding box constraint, which is difficult to apply in practice. A more recent work [74] noted that optimizing regression models is hard, so the authors proposed to directly predict a confidence map of positional distributions for each landmark. Lee et al. [75] resorted to contextual knowledge to achieve strong performance on landmark prediction. Ge et al. [76] built a Match R-CNN model to deal with their proposed versatile benchmark DeepFashion2 [77].

Fashion synthesis
Fashion synthesis is the task of generating a new style across images: imagining what a person would look like in a different clothing style by synthesizing a realistic-looking image. In the following, we review existing methods addressing the problem of generating images of people in clothing, focusing on style transfer, pose transformation, and physical simulation.

Style Transfer
In the fashion synthesis task, style transfer is an important step that aims to transfer style between images. It can be applied to various kinds of images, especially facial images and garment images. CNN-based methods for this task exploit feature extraction to obtain style information from the image. Isola et al. [78] proposed the well-known style transfer work pix2pix, a general solution for style transfer. For more specific goals, other works [79,80] transferred an input image or sketch to a corresponding texture based on a texture patch. An example of image style transfer from TextureGAN [80] is shown in Fig. 8. Han et al. [81] proposed the VIrtual Try-On Network (VITON) to try clothing on a person image by generating a coarse tried-on result and a predicted mask for the clothing item; a refinement network for the clothing region was then employed to synthesize a more detailed result. This framework fails to handle large deformations, especially with rich texture details, due to imperfect shape-context matching for aligning the clothes with the body shape. The CP-VTON model [82] was proposed to deal with this issue by handling the spatial deformation with a Geometric Matching Module, which explicitly aligns the input clothing with the body shape. Fig. 9 presents some results of VITON [83] and CP-VTON [82]. The previous works require an in-shop clothing image for virtual try-on, but other models such as FashionGAN [84] and M2E-TON [85] produce the target try-on result from a text description or a model image. Given an input image and a sentence describing a different outfit, a first GAN generates the segmentation map according to the description, and a second GAN then renders the output image from the segmentation map. M2E-TON [85] is able to transfer clothing from images of different people in different poses.
Other works attempt to resolve the problem of arbitrary poses, such as Fit-Me [86], the first virtual try-on work to deal with this challenge. FashionOn [87] then applied semantic segmentation to present more realistic results. SwapNet [88] proposed a pipeline to transfer garment information across images with arbitrary clothing, body poses, and shapes by operating in image space. VTNFP [89] proposed a design strategy that first generates the warped clothing, followed by the body segmentation map of the person wearing the target clothing, and ends with a try-on synthesis module that fuses all the information for the final image synthesis. In 2019, Zheng et al. [90] proposed an architecture for virtually trying on new clothing in arbitrary poses by using body shape mask prediction for pose transformation. The work of Han et al. [91] focused on transferring appearance naturally and synthesizing novel results with their proposed ClothFlow model. In addition to image-based virtual try-on approaches, Dong et al. [92] presented a Flow-Navigated Warping GAN for Video (FW-GAN), which synthesizes a video of virtual try-on results.
Recent works [88,89,93,94] address the challenging task of transferring garments between pictures of people while preserving identity in the source and target images. To solve the problems of missing body parts and visual details, Feng et al. [93] proposed a novel image-based virtual try-on network that maintains structural consistency between the generated image and the original image through human parsing. Outfit-VITON [94] then synthesizes a cohesive outfit from multiple images of clothed human models while fitting the outfit to the body shape and pose of the query person.

Pose Transformation
Pose transformation is a crucial task for fashion synthesis: it takes an input image of a person and a target pose to generate images of this person in different poses while preserving the original identity. Some examples of pose transformation are presented in Fig. 10. A two-stage adversarial network, PG2 [95], made an early attempt at the challenging task of transferring a person to different poses. This framework generated both pose and appearance by dividing the problem into two stages: pose information is used in the first stage to generate the human body structure of the desired image, then a deep convolutional GAN refines the output of the first stage. The texture details in its results were highly blurred; to tackle this problem, an affine transform was employed to better preserve textures in the generated images.
The work of Siarohin et al. [96] used a deformable GAN to generate images of a person according to a target pose, extracting the articulated object pose with a keypoint detector. Other recent work [97] addresses the problem of human pose synthesis with a modular generative neural network that synthesizes unseen poses using four modules: image segmentation, spatial transformation, foreground synthesis, and background synthesis. Si et al. [98] introduced a multi-stage pose-guided image synthesis framework that divides the network into three stages: pose transformation in a novel 2D view, foreground synthesis, and background synthesis.
The previous works suffer from data limitations, which were addressed by Pumarola et al. [99], who borrowed the idea of leveraging cycle consistency from [100]. Different approaches [101,102] aimed to model body shape but did not show good results for the appearance of reference images. In 2019, the work of Song et al. [103] presented a solution for this limitation, proposing a novel approach that decomposes the hard mapping into semantic parsing transformation and appearance generation subtasks to improve the appearance performance.

Clothing Simulation
To further improve fashion synthesis performance, clothing simulation is essential. The abovementioned synthesis works operate in the 2D domain, where clothing deformation is not considered when generating realistic appearance. This task presents many challenges, such as the need to create more realistic results in real time while handling more complex garments.
The traditional way to simulate realistic clothes is to build models using computer graphics tools [104][105][106][107]. To learn both stretching and bending from real cloth samples, Wang et al. [106] proposed a piecewise linear elastic model. To learn the physical properties of clothing on different human body shapes and poses, Guan et al. [104] designed a pose-dependent model to simulate clothes deformation. Pons-Moll et al. [105] designed ClothCap to simulate the clothing deformation of people in motion. As shown in Fig. 11, they separated garments from the human body to estimate the body shape and pose, then tracked the 3D deformations of the clothing over time from 4D scans to help simulate the physical clothing deformations in different human postures.

Benchmark datasets
Recent advances in virtual try-on systems have been driven by the construction of clothing datasets. Due to the large variations across tasks, it is difficult to build a universal dataset on which to evaluate all virtual try-on methods. Therefore, some researchers resort to creating datasets to evaluate their proposed methods; this diversity makes the comparison of different algorithms very difficult. Datasets also bring more challenges and complexity through their expansion and improvement. This section discusses the popular publicly available datasets for virtual try-on tasks and their characteristics.

Fashion datasets
A large number of benchmark datasets have been proposed for studying fashion applications such as virtual try-on systems. Table 1 summarizes some of these datasets.
As summarized in Table 1, each task has specific datasets with corresponding settings. Market-1501 [115] and DeepFashion [72] are the most popular datasets for virtual try-on. The Fashion Landmark Dataset [70] is the most used dataset for fashion landmark detection. For the fashion parsing task, there are multiple datasets, the most popular being the LIP dataset [17,18]. Datasets for physical simulation differ from those of other fashion tasks, since physical simulation is more related to computer graphics than computer vision. Physical simulation work within the fashion domain focuses on clothing-body interactions, and its datasets can be categorized into real data and created data. Despite the rapid evolution of datasets based on 2D images, such as DeepFashion [72], DeepFashion2 [77], and FashionAI [116], datasets based on 3D clothing remain rare or insufficient for training, an exception being the digital wardrobe released by MGN [117]. In 2020, Heming et al. [118] developed a comprehensive dataset named Deep Fashion3D, which is richly annotated and covers much larger variations of garment styles.

Performance Assessment
In image processing, measuring the perceptual quality of generated results is an important step in validating research works. There is an emerging demand for quantitative performance evaluation in image-based garment transfer, driven by the need to objectively judge the quality of virtual fitting systems, facilitate comparability of the various existing approaches, and measure their improvements.

Image Quality Assessment (IQA)
The performance of computer vision tasks is measured by image quality assessment methods, which are divided into objective and subjective methods. The latter rely on human perception to evaluate the realism of generated images. Each year, the number of IQA algorithms grows progressively, through new proposals or extensions of existing algorithms. In this section, we present the most popular IQA algorithms used to evaluate image-based garment transfer tasks.

IQA for fashion Detection
For image-based clothing fitting, fashion attributes must first be detected to predict the clothing style. Most works on clothing localization validate their results using different metrics on different tasks, such as landmark detection, pose estimation, and human parsing.

Fashion parsing
In fashion parsing, various metrics are used to evaluate the proposed approaches on different datasets, such as Fashionista [8] and LIP [17,18], in terms of average Pixel Accuracy (aPA), mean Average Garment Recall (mAGR), Intersection over Union (IoU), mean accuracy, average precision, average recall, average F-1 score over pixels, and foreground accuracy. Table 2 reports some quantitative results measured by these metrics.
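Two of the metrics above, pixel accuracy and IoU, can be computed directly from the predicted and ground-truth label maps. The following is a minimal numpy sketch; note that benchmarks differ in their class-averaging conventions (e.g., whether background and absent classes are counted).

```python
import numpy as np

def pixel_accuracy(pred, gt):
    """Fraction of pixels whose predicted class matches the ground truth."""
    return (pred == gt).mean()

def mean_iou(pred, gt, num_classes):
    """Mean Intersection over Union, skipping classes absent from both maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

# Tiny 3x3 example with 3 classes (e.g., background / upper clothes / pants).
gt = np.array([[0, 0, 1], [1, 2, 2], [2, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 2], [2, 2, 0]])
```

On this toy example, 7 of 9 pixels match, and the per-class IoUs are 1/3, 2/3, and 4/5, illustrating how IoU penalizes class-level confusion more sharply than raw pixel accuracy.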

Human pose Estimation
Research in HPE has made significant progress in recent years. In this section, we present the most important evaluation metrics used to measure the performance of human pose estimation models. Table 3 presents the different metrics used for comparisons of existing state-of-the-art approaches.

Fashion landmark detection
The most popular evaluation metrics in fashion landmark detection are Normalized Error (NE) and Percentage of Detected Landmarks (PDL). NE is the distance between predicted landmarks and the ground truth, normalized by a reference size, while PDL is the percentage of detected landmarks under an overlapping criterion. Typically, smaller values of NE or higher values of PDL indicate better results. Table 4 and Fig. 12 present examples of these performance results.
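Both metrics can be sketched in a few lines. The normalization factor and the PDL threshold below are assumptions for illustration, since papers normalize by different references (e.g., image width or an inter-landmark distance) and use different detection thresholds.

```python
import numpy as np

def normalized_error(pred, gt, norm):
    """Mean L2 distance between predicted and ground-truth landmarks,
    normalized by a reference size (e.g., image width)."""
    return np.linalg.norm(pred - gt, axis=1).mean() / norm

def pdl(pred, gt, norm, threshold=0.1):
    """Percentage of landmarks whose normalized distance is below a threshold."""
    d = np.linalg.norm(pred - gt, axis=1) / norm
    return (d < threshold).mean()

# Two landmarks in (x, y) pixel coordinates; one close hit, one large miss.
gt = np.array([[10.0, 10.0], [50.0, 40.0]])
pred = np.array([[12.0, 10.0], [50.0, 70.0]])
```

With a normalization of 100 pixels, the example gives NE = 0.16, and only the first landmark counts as detected at a 0.1 threshold, so PDL = 0.5.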

IQA for Fashion synthesis
Image quality evaluation is essential for image generation methods to synthesize the desired outputs. Recent image synthesis research [96,102,123,124] commonly uses simple loss functions to measure the difference between the generated image and the ground truth, e.g., L1-norm loss, adversarial loss, and perceptual loss. Here, we present the evaluation metrics related to each fashion synthesis task, including style transfer, pose transfer, and clothing simulation.
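Two of the losses just mentioned can be sketched as follows. This is an illustrative sketch only: `feature_fn` is a hypothetical stand-in for the pretrained feature extractor (e.g., a VGG layer) that perceptual losses compute distances in, and adversarial losses (which need a trained discriminator) are omitted.

```python
import numpy as np

def l1_loss(generated, target):
    """L1-norm loss: mean absolute pixel difference to the ground truth."""
    return np.abs(generated - target).mean()

def perceptual_loss(generated, target, feature_fn):
    """Mean squared distance in a feature space. `feature_fn` stands in for
    a fixed pretrained network layer (an assumption in this sketch)."""
    fg, ft = feature_fn(generated), feature_fn(target)
    return ((fg - ft) ** 2).mean()

# Toy images: an all-black and an all-white 4x4 patch.
a = np.zeros((4, 4))
b = np.ones((4, 4))
```

Training objectives in the cited works typically combine several such terms with weighting coefficients, trading pixel fidelity (L1) against perceptual similarity and realism.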

Style transfer and Pose transfer
Image-based garment transfer aims to transform a source person image to a target pose while retaining the appearance details. Two essential tasks are required to achieve this goal: style transfer and pose transfer. Both are very challenging, especially in the case of human body occlusion, large pose changes and complex textures, and common metrics are used to measure the quality of the generated images.
The evaluation of style transfer is generally based on subjective assessment: the results are rated into certain degrees, and the percentage of each degree is then calculated to evaluate the quality of the results. There are also objective comparisons for virtual try-on, in terms of the Inception Score (IS) or Structural Similarity (SSIM). IS [125] quantitatively evaluates the synthesis quality of images. SSIM [126] measures the similarity between input and output images, ranging from zero (dissimilar) to one (identical).
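For illustration, a simplified single-window SSIM can be written as follows. Note that standard implementations of SSIM [126] average the score over local (e.g., 11x11 Gaussian-weighted) windows; this global variant keeps the sketch short:

```python
import numpy as np

def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM computed over the whole image.
    c1 and c2 are the usual stabilizing constants derived
    from the dynamic range of the pixel values."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(((2 * mx * my + c1) * (2 * cov + c2)) /
                 ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))
```

An identical image pair scores 1.0, and the score drops as structure, luminance or contrast diverge.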

Physical simulation
There are few quantitative comparisons between physical simulation works. Most of them report only qualitative results within their own work, or show visual comparisons with related works. Fig. 13 presents an example of these comparisons.

Application and future work
Automating manual processes is a great achievement enabled by new technologies such as computer vision. Fashion is one of the popular industries being influenced by technological advancement at a much faster pace than ever before. Thanks to computer-vision-powered tools, a great experience can emerge for both retailers and consumers. In the following, we discuss emerging uses of fashion technology in some application areas and present the future work needed to achieve the promised benefits.

Application
The apparel industry is all about visuals, and computer vision can make computers recognize and understand images just as we do. Thus, creating AI systems that understand fashion in images can have a big impact on the industry and create a next-level customer experience such as online fashion shopping. This is where future research in this area will bring value and become useful for the fashion business by enabling smart shopping.
Going completely online brings a vast number of challenges for fashion retailers and inspires new innovative digital products, like virtual fitting systems, that make the wholesale process completely digital. This goal can be achieved with AI technology, which has the power to better engage customers with a personalized shopping experience that leads them to make more informed and confident purchase decisions. Large fashion brands have implemented online virtual fitting rooms in a bid to reduce return rates and improve customer satisfaction. A virtual fitting room would be a way to preview the virtual effects, but the problem is still far from solved because of the difficulty of virtually reproducing the texture and pattern of clothes under deformation and shading, especially when an image-based approach is used to transfer clothes.

Future Directions
Despite the great development of image-based fitting systems, there remain unresolved challenges and a gap between research and practical applications, such as handling body-part occlusion and crowded scenes. Many challenges therefore remain in adopting fashion technologies in industry, because real-world fashion is much more complex than in the experiments.
The main issue concerns system performance, which is still far from human performance in real-world settings. The demand for more robust systems consequently grows. Thus, it is crucial to pay attention to handling data bias and variations for performance improvement. Moreover, there is a definite need to perform the task in a lightweight but timely manner, so it is also beneficial to consider how to optimize the model to achieve higher performance.
Network efficiency is a very important factor for applying algorithms in real-life applications. Diverse data can improve the robustness of networks for handling complex scenes with irregular poses, occluded body limbs and crowded people. Data collection for specific complex scenarios is one option, and there are other ways to extend existing datasets. Synthetic data generation can theoretically produce unlimited data, although there is a domain gap between synthetic and real data. Cross-dataset supplementation, especially supplementing 3D datasets with 2D datasets, can mitigate the problem of insufficient diversity in training data. Transfer learning also proves useful in this setting.
In our future work, we aim to provide an efficient virtual try-on system for fashion retailers to ensure a better shopping experience for customers. This goal can be achieved by developing an intelligent system that understands fashion in images. The system must first perform fashion detection to localize where a fashion item appears in the image and where the different body parts are located. Then, it should swap clothes between images of different persons while dealing with large variations in body poses and shapes.

Conclusion
With the explosive growth of clothing images, their study has attracted increasing attention from researchers developing applications based on clothing models. Future directions must bridge the gap between research and real industry demand. Given the huge profit potential of the fashion industry, the representative intelligent fashion analysis techniques surveyed here are just the beginning of this expanding research field, despite the enormous research effort already spent on these tasks. In this direction, our future work will exploit AI to develop a virtual try-on system and overcome challenges spanning the most important topics in computer vision, especially the techniques used in virtual fitting such as fashion detection and fashion synthesis.