title: MeInGame: Create a Game Character Face from a Single Portrait
authors: Lin, Jiangke; Yuan, Yi; Zou, Zhengxia
date: 2021-02-04

Many deep learning based 3D face reconstruction methods have been proposed recently; however, few of them have applications in games. Current game character customization systems either require players to manually adjust considerable face attributes to obtain the desired face, or have limited freedom of facial shape and texture. In this paper, we propose an automatic character face creation method that predicts both facial shape and texture from a single portrait, and it can be integrated into most existing 3D games. Although 3D Morphable Face Model (3DMM) based methods can restore accurate 3D faces from single images, the topology of the 3DMM mesh is different from the meshes used in most games. To acquire high-fidelity textures, existing methods require a large amount of face texture data for training, while building such datasets is time-consuming and laborious. Besides, such a dataset collected under laboratory conditions may not generalize well to in-the-wild situations. To tackle these problems, we propose 1) a low-cost facial texture acquisition method, 2) a shape transfer algorithm that can transform the shape of a 3DMM mesh to games, and 3) a new pipeline for training 3D game face reconstruction networks. The proposed method can not only produce detailed and vivid game characters similar to the input portrait, but can also eliminate the influence of lighting and occlusions. Experiments show that our method outperforms state-of-the-art methods used in games.

Due to the COVID-19 pandemic, people have had to maintain social distancing, and most conferences this year have been switched to online/virtual meetings. Recently, Dr. Joshua D. Eisenberg organized a special conference, the Animal Crossing Artificial Intelligence Workshop (ACAI) 1, in the Nintendo game Animal Crossing New Horizons 2, which received great attention. It is also reported that this year's International Conference on Distributed Artificial Intelligence (DAI) 3 will be held in the game Justice 4 via cloud gaming techniques. As more and more social activities move online instead of taking place face-to-face, in this paper we focus on game character auto-creation, which allows users to automatically create 3D avatars in a virtual game environment by simply uploading a single portrait. Many video games feature character creation systems, which allow players to create personalized characters. However, creating characters that look similar to the users themselves or their favorite celebrities is not an easy task and can be time-consuming even after considerable practice. For example, a player usually needs several hours of patient manual adjustment of hundreds of parameters (e.g., face shape, eyes) to create a character similar to a specified portrait. To improve the player's gaming experience, several approaches for game character auto-creation have emerged recently. Shi et al. proposed a character auto-creation method that allows users to upload single face images and automatically generates the corresponding face parameters (Shi et al. 2019). However, this method and its latest variant (Shi et al. 2020) have limited freedom in the facial parameters and thus cannot handle buxom or slim faces very well.
Besides, these methods do not take textures into account, which further limits their adaptability to different skin colors. Using a mobile device to scan a human face from multiple views to generate a 3D face model is another possible solution; the game NBA 2K (https://www.nba2k.com/) adopts this type of method. However, users have to wait several minutes before a character is created. Besides, this approach is not suitable for creating 3D faces of celebrities or anime characters, since their multi-view photos are hardly available to players. To tackle the above problems, we propose a new method for automatic game character creation and a low-cost method for building a 3D face texture dataset. Given an input face photo, we first reconstruct a 3D face based on a 3D Morphable Face Model (3DMM) and Convolutional Neural Networks (CNNs), then transfer the shape of the 3D face to the template game mesh. The proposed network takes the face photo and the unwrapped coarse UV texture map as input, then predicts lighting coefficients and a refined texture map. By utilizing the power of neural networks, the undesired lighting components and occlusions in the input can be effectively removed. As the rendering process of a typical game engine is not differentiable, we also take advantage of differentiable rendering so that gradients can be backpropagated from the rendering output to every module that requires parameter updates during training. In this way, all network components can be trained smoothly in an end-to-end fashion. In addition to the differentiable rendering, we design a new training pipeline based on semi-supervised learning in order to reduce the dependence on training data. We use the paired data for supervised learning and the unlabeled data for self-supervised learning. Thus, our networks can be trained in a semi-supervised manner, reducing the reliance on pre-defined texture maps. Finally, by loading the generated face meshes and textures into the game environments, vivid in-game characters can be created for players. Various expressions can further be made on top of the created characters with blendshapes. Our contributions are summarized as follows:
• We propose a low-cost method for 3D face dataset creation. The dataset we created is balanced in race and gender, with both facial shape and texture created from in-the-wild images. We will make it publicly available after the paper is accepted.
• We propose a method to transfer the reconstructed 3DMM face shape to the game mesh, which can be directly used in the game environment. The proposed method is independent of the mesh connectivity and is computationally efficient for practical usage.
• To eliminate the influence of lighting and occlusions, we train a neural network to predict an integral diffuse map from a single in-the-wild human face image under an adversarial training paradigm.

2 Related Work

Recovering 3D information from a single 2D image has long been a challenging but important task in computer vision. 3D Morphable Face Models (3DMM), a group of representative methods for 3D face reconstruction, were originally proposed by Blanz and Vetter (Blanz and Vetter 1999). In 3DMM and its recent variants (Booth et al. 2016; Cao et al. 2013; Gerig et al. 2018; Huber et al. 2016; Li et al. 2017), the facial identity, expression, and texture are approximated by low-dimensional representations learned from multiple face scans. In a typical 3DMM model, given a set of facial identity coefficients c_i and expression coefficients c_e, the face shape S can be represented as

S = S_mean + I_base c_i + E_base c_e,   (1)

where S_mean is the mean face shape, and I_base and E_base are the PCA bases of identity and expression, respectively.
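To make Eq. 1 concrete, the following is a minimal NumPy sketch of how a face shape can be assembled from identity and expression coefficients. The array shapes (35,709 vertices, 80 identity and 64 expression coefficients) follow the dimensions reported later in this paper; the variable names and the random placeholder bases are ours, not from any released code.

```python
import numpy as np

N_VERTS = 35709          # vertices in the 3DMM face mesh used in this paper
N_ID, N_EXP = 80, 64     # identity / expression coefficient dimensions

# Hypothetical placeholders for the 3DMM statistics (normally loaded from a
# model file such as the Basel Face Model).
S_mean = np.zeros((N_VERTS * 3,))             # mean face shape, flattened (x, y, z)
I_base = np.random.randn(N_VERTS * 3, N_ID)   # PCA basis of identity
E_base = np.random.randn(N_VERTS * 3, N_EXP)  # PCA basis of expression

def face_shape(c_i: np.ndarray, c_e: np.ndarray) -> np.ndarray:
    """Eq. 1: S = S_mean + I_base @ c_i + E_base @ c_e, reshaped to (N, 3)."""
    S = S_mean + I_base @ c_i + E_base @ c_e
    return S.reshape(-1, 3)

# Example: all-zero coefficients reduce to the mean face shape.
S = face_shape(np.zeros(N_ID), np.zeros(N_EXP))
print(S.shape)  # (35709, 3)
```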
Face texture restoration aims to extract textures from input face photos. A popular solution is to frame this process as an image-to-texture prediction based on supervised training. Zafeiriou et al. captured a large-scale 3D face dataset (Booth et al. 2016). They made the shape model publicly available but kept the texture information private. With such a private large-scale dataset, Zafeiriou and his colleagues (Deng et al. 2018; Zhou et al. 2019; Gecer et al. 2019) produced good results on shape and texture restoration. However, many 3D face reconstruction methods still do not involve face texture restoration, since capturing face textures is expensive. On one hand, data acquired in controlled environments cannot easily be applied to in-the-wild situations. On the other hand, it is not easy to balance subjects from different races, which may lead to bias and a lack of diversity in the dataset. In particular, most subjects in (Booth et al. 2016) are Caucasian; therefore, methods based on such a dataset may not generalize well to Asians, Africans, or other races. Recently, Yang et al. (Yang et al. 2020) spent six months collecting a face dataset named FaceScape from 938 people (mostly Chinese) and made this dataset publicly available. However, compared to (Booth et al. 2016), FaceScape is still of very limited scale.

Figure 2: An overview of our method. Given an input photo, a pre-trained shape reconstructor predicts the 3DMM and pose coefficients, and a shape transfer module transforms the 3DMM shape to the game mesh while keeping its topology. Then, a coarse texture map is created by unwrapping the input image to UV space based on the game mesh. The texture is further refined by a set of encoder and decoder modules. We also introduce a lighting regressor to predict lighting coefficients from image features. Finally, the predicted shape and texture, together with the lighting coefficients, are fed to a differentiable renderer, and we force the rendered output to be similar to the input photo. Two discriminators are introduced to further improve the results.

To address the dataset issue, differentiable rendering techniques (Genova et al. 2018) have recently been introduced to face reconstruction. With a differentiable renderer, an unsupervised/self-supervised loop can be designed in which the predicted 3D objects (represented by meshes, point clouds, or voxels) are effectively back-projected to 2D space, thereby maximizing the similarity between the projection and the input image. However, even with a differentiable renderer, there is still a lack of constraints on the reconstructed 3D face information. Methods that use differentiable rendering (Genova et al. 2018; Deng et al. 2019) still rely on the prior knowledge of 3DMM and are therefore not fully unsupervised. Besides, the textures restored by 3DMM based methods cannot faithfully represent the personalized characteristics (e.g., makeup, moles) of the input portrait. Lin et al. (Lin et al. 2020) recently proposed to refine the textures from images by applying graph convolutional networks.
While achieving high-fidelity results, these 3DMM based methods aim to reconstruct the 3D shape and texture of the face region rather than the whole head, which cannot be directly used for game character creation. In contrast, we aim to create a whole head model with a complete texture, whose shape and appearance are similar to the input.

Shape transfer aims to transfer shapes between two meshes. To generate a full head model instead of a front face only, we use shape transfer to transform a 3DMM mesh into a head mesh with the game topology. The Non-rigid Iterative Closest Point (non-rigid ICP) algorithm (Amberg, Romdhani, and Vetter 2007) is the typical method for this task; it performs iterative non-rigid registration between the surfaces of two meshes. Non-rigid ICP and its variants usually perform well on meshes with regular topology. However, such methods normally take several seconds to complete a transfer, which is not fast enough for our task.

3 Approach

Fig. 2 shows an overview of the proposed method. We frame the reconstruction of the face shape and texture as a self-supervised facial similarity measurement problem. With the help of differentiable rendering, we design a rendering loop and force the 2D face rendered from the predicted shape and texture to be similar to the input face photo. Our method consists of several trainable sub-networks. The image encoder takes the face image as input and generates latent features. The image features are then flattened and fed to the lighting regressor, a lightweight network consisting of several fully-connected layers that predicts the lighting coefficients (light direction and ambient, diffuse, and specular color). Similar to the image encoder, we introduce a texture encoder. The features of the input image and the coarse texture map are concatenated and fed into the texture decoder, producing the refined texture map. With the game mesh, the refined texture map, the pose, and the lighting coefficients, we use the differentiable renderer (Ravi et al. 2020) to render the face mesh to a 2D image and enforce this image to be similar to the input face photo. To further improve the results, we also introduce two discriminators, one for the rendered face image and another for the generated face texture maps.

Here we introduce our low-cost 3D face dataset creation method. Unlike other methods that require multi-view images of subjects, which are difficult to capture, our method uses only single-view images, which are easy to acquire. With this method, we create a Race-and-Gender-Balanced (RGB) dataset and name it the "RGB 3D face dataset". The dataset creation includes the following steps (a code sketch of this pipeline is given at the end of this subsection):
i. Given an input face image, detect the skin region using a pre-trained face segmentation network.
ii. Compute the mean color of the input face skin and transfer the mean skin color to the template texture map (provided by the game developer).
iii. Unwrap the input face image to UV space according to the deformed game head mesh.
iv. Blend the unwrapped image with the template texture map using Poisson blending. Remove non-skin regions such as hair and glasses, and use symmetry to patch up the occluded regions when possible.
Fig. 3 shows some texture maps created by the above method. The texture maps with good quality are chosen for further refinement. We manually edit the coarse texture maps with an image editing tool (e.g., Photoshop) to fix shadows and highlights. Since we can control the quality of the generated textures, the workload of manual repair is very small, and each face only takes a few minutes to refine.

Figure 3: Some examples of generated texture maps. First row: input images; second row: coarse texture maps. We select the good-quality texture maps (the three on the right) to further create ground truth, as shown in the last row.
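The snippet below is a minimal sketch of steps i–iv, assuming a face segmentation mask is already available and that a UV-unwrapping routine driven by the deformed game mesh is passed in as a callable (`unwrap_to_uv`); both are placeholders for components described above, not real library functions. Poisson blending is done with OpenCV's seamlessClone; symmetry patching and the subsequent manual retouching are omitted.

```python
import cv2
import numpy as np

def coarse_texture(photo_bgr, skin_mask, template_uv, unwrap_to_uv):
    """Build a coarse UV texture map from a single portrait (steps ii-iv).

    photo_bgr    : HxWx3 input portrait (BGR, uint8)
    skin_mask    : HxW binary skin mask from a face segmentation network (step i)
    template_uv  : game-provided template texture map (BGR, uint8)
    unwrap_to_uv : callable mapping an image-space array to UV space using the
                   deformed game head mesh (an assumed helper, see Sec. 3)
    """
    # ii. shift the template's mean color toward the photo's mean skin color
    #     (a simple global shift standing in for the mean-skin-color transfer)
    mean_skin = photo_bgr[skin_mask > 0].mean(axis=0)
    template = template_uv.astype(np.float32)
    template += mean_skin - template.reshape(-1, 3).mean(axis=0)
    template = np.clip(template, 0, 255).astype(np.uint8)

    # iii. unwrap the portrait and its skin mask into UV space
    uv_face = unwrap_to_uv(photo_bgr)
    uv_mask = (unwrap_to_uv(skin_mask.astype(np.uint8) * 255) > 127).astype(np.uint8) * 255

    # iv. Poisson-blend the unwrapped face into the recolored template; the
    #     mask excludes non-skin regions such as hair and glasses. The two UV
    #     images are already aligned, so we clone about the image center.
    center = (template.shape[1] // 2, template.shape[0] // 2)
    return cv2.seamlessClone(uv_face, template, uv_mask, center, cv2.NORMAL_CLONE)
```

In the paper, the blended maps are further screened for quality and manually retouched; this sketch covers only the automatic part of the pipeline.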
The very first step of our method is to predict the 3DMM shape and pose coefficients from an input image. In this paper, we adopt the 3DMM coefficient regressor of (Deng et al. 2019) for face shape reconstruction, but other 3D face reconstruction methods will also work. Given a 2D image, we aim to predict a 257-dimensional vector (c_i, c_e, c_t, p, l) ∈ R^257, where c_i ∈ R^80, c_e ∈ R^64, and c_t ∈ R^80 represent the 3DMM identity, expression, and texture coefficients, respectively, p ∈ R^6 is the face pose, and l ∈ R^27 represents the lighting. With the predicted coefficients, the 3D positions S of the face vertices can be computed based on Eq. 1.

The shape transfer module aims to transfer the reconstructed 3DMM mesh to the game mesh. We design our shape transfer module based on Radial Basis Function (RBF) interpolation (De Boer, Van der Schoot, and Bijl 2007). Specifically, RBF interpolation defines a series of basis functions

ϕ_i(x) = ϕ(||x − x_i||),

where x_i represents the center of ϕ_i and ||x − x_i|| denotes the Euclidean distance between the input point x and the center x_i. In this way, given a point x and a set of RBF basis functions, the value of f(x) can be computed using RBF interpolation:

f(x) = Σ_i w_i ϕ(||x − x_i||),

where w_i represents the weight of each basis; these weights are the unknowns to be solved. In the scenario of mesh shape transfer, we first manually specify 68 landmark pairs between the 3DMM template mesh and the game template mesh, similar to those in dlib (King 2009). In the following, we denote the landmarks on the template meshes as the original face landmarks. Then we set the centers of the basis functions x_i to the original face landmarks L_g on the game mesh. The input x is set to the original position of a game mesh vertex, and the value f(x) we obtain is the offset of the transfer. In this way, the vertex's new position can be computed as x + f(x). To determine the weights w_i, we solve a linear least-squares problem that minimizes the distance between the face landmark pairs on the game mesh and the 3DMM mesh. For more details about the shape transfer, please refer to our supplementary materials and code.
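The following is a minimal sketch of the RBF-based transfer described above, assuming 68 corresponding landmark positions on the game mesh and on the reconstructed 3DMM mesh are already available. The Gaussian kernel and its width are illustrative choices rather than the exact kernel used in the paper.

```python
import numpy as np

def rbf_transfer(game_verts, game_lms, target_lms, sigma=0.1):
    """Deform game-mesh vertices so its landmarks land on the 3DMM landmarks.

    game_verts : (V, 3) vertices of the template game head mesh
    game_lms   : (68, 3) landmark positions on the game mesh (basis centers x_i)
    target_lms : (68, 3) corresponding landmark positions on the 3DMM mesh
    """
    def phi(dist):
        # Gaussian radial basis function (an illustrative kernel choice)
        return np.exp(-(dist ** 2) / (2.0 * sigma ** 2))

    # Pairwise basis values at the landmark centers: Phi[i, j] = phi(||x_i - x_j||)
    d_ll = np.linalg.norm(game_lms[:, None, :] - game_lms[None, :, :], axis=-1)
    Phi = phi(d_ll)

    # Least-squares solve of Phi @ W = offsets so that f(x_i) maps each game
    # landmark onto its 3DMM counterpart (the linear problem mentioned above).
    offsets = target_lms - game_lms              # (68, 3)
    W, *_ = np.linalg.lstsq(Phi, offsets, rcond=None)

    # Apply f(x) = sum_i w_i * phi(||x - x_i||) to every game-mesh vertex.
    d_vl = np.linalg.norm(game_verts[:, None, :] - game_lms[None, :, :], axis=-1)
    return game_verts + phi(d_vl) @ W            # x + f(x)
```

Because only a 68-by-68 linear system has to be solved, such a transfer is far cheaper than non-rigid ICP, which is consistent with the computational-efficiency claim in the contributions.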
We design our loss functions to minimize the distance between the rendered face image and the input face photo, and the distance between the refined texture map and the ground-truth texture map. Within the rendering loop, we design four types of loss functions, i.e., the pixel loss, the perceptual loss, the skin regularization loss, and the adversarial loss, to measure facial similarity in both global appearance and local details.

Pixel Loss. We compute our pixel loss in both the rendered image space and the texture UV space. For the rendered image R, the loss is computed between R and its corresponding input image I. We define the loss as the pixel-wise L1 distance between the two images:

L_ren(I, R) = (1 / Σ_i M_2d^i) Σ_i M_2d^i |I^i − R^i|,

where i is the pixel index and M_2d is the skin region mask obtained by the face segmentation network in 2D image space. For the pixel loss in UV space, we define the loss as the L1 distance between the refined texture map F and the ground-truth texture map G:

L_tex(F, G) = (1/N) Σ_i |F^i − G^i|,

where i is the pixel index and N is the number of pixels.

Perceptual Loss. Following Nazeri et al. (Nazeri et al. 2019), we design two perception-level losses, i.e., a perceptual loss L_perc and a style loss L_sty. Given a pair of images x and x', the perceptual loss is defined as the distance between their activation maps from a pre-trained network:

L_perc(x, x') = Σ_i ||φ_i(x) − φ_i(x')||_1,

where φ_i is the activation map of the i-th layer of the network. The style loss, on the other hand, is defined on the covariance matrices of the activation maps:

L_sty(x, x') = Σ_j ||G_j^φ(x) − G_j^φ(x')||_1,

where G_j^φ(x) is a Gram matrix constructed from the activation maps φ_j. We compute the above two losses on both the face images and the texture maps.

Skin Regularization Loss. To produce a constant skin tone across the whole face and remove highlights and shadows, we introduce two losses to regularize the face skin, namely a "symmetric loss" and a "standard-deviation loss". Unlike previous works (Tewari et al. 2018; Deng et al. 2019) that apply skin regularization directly to vertex colors, we impose the penalties on the Gaussian-blurred texture map. This is based on the fact that some personalized details (e.g., a mole) are not always symmetric and are not related to skin tone. We define the symmetric loss as

L_sym(F̃) = (1 / (N_U N_V)) Σ_{u,v} |F̃(u, v) − F̃(N_U + 1 − u, v)|,

where F̃ is the Gaussian-blurred refined texture map F, and N_U and N_V are the numbers of columns and rows of the texture map, respectively. We define the skin standard-deviation loss as

L_std(F̃) = sqrt( (1 / Σ_i M_uv^i) Σ_i M_uv^i (F̃^i − F̄)^2 ),

where F̄ is the mean value of F̃ over the skin region, M_uv is the skin region mask in UV space, and i is the pixel index.

Adversarial Loss. To further improve the fidelity of the reconstruction, we also use adversarial losses during training. We introduce two discriminators, one for the rendered face and one for the generated UV texture maps. We train the discriminators to tell whether the generated outputs are real or fake; at the same time, we train the other parts of our networks to fool the discriminators. The objective functions of the adversarial training are defined as

L_adv = Σ_i E_x̂ [log(1 − D_i(x̂))],   L_D = Σ_i ( E_x [log D_i(x)] + E_x̂ [log(1 − D_i(x̂))] ),

where D_i ∈ {D_img, D_tex} are the discriminators for the image and the texture map, respectively, x denotes real samples, and x̂ denotes generated samples.

Final Loss Function. By combining all the losses defined above, our final loss function for the generator can be written as

L_G = λ_l1 (L_ren(I, R) + L_tex(F, G)) + λ_perc (L_perc(I, R) + L_perc(F, G)) + λ_sty (L_sty(I, R) + L_sty(F, G)) + λ_sym L_sym(F̃) + λ_std L_std(F̃) + λ_adv L_adv,

where L_G is the loss for training the image encoder, texture map encoder, lighting regressor, and texture map decoder, L_D is the loss for training the discriminators, and the λs are the corresponding weights that balance the different loss terms. During training, we aim to solve the following minimax optimization problem: min_G max_D L_G + L_D. In this way, all the network components to be optimized can be trained in an end-to-end fashion. For texture maps that do not have paired ground-truth data, we simply ignore the corresponding loss terms during training.
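As a sketch of how the generator objective might be assembled in PyTorch, the snippet below combines the pixel, symmetry, standard-deviation, and adversarial terms for one training sample, using the weights reported in the implementation details. The perceptual and style terms and the network modules themselves are omitted; the box blur and the non-saturating adversarial formulation are common substitutes for the Gaussian blur and the objective above, not necessarily the authors' exact choices. Unpaired samples simply skip the texture ground-truth term, as described in the text.

```python
import torch
import torch.nn.functional as F_nn

def generator_loss(render, image, mask2d, tex, tex_gt, mask_uv, d_img_fake, d_tex_fake,
                   lam_l1=3.0, lam_sym=0.1, lam_std=3.0, lam_adv=0.001):
    """Simplified L_G for one sample. Image/texture tensors are (C, H, W),
    masks are (1, H, W); tex_gt may be None for unpaired samples."""
    # masked L1 between rendered face and input photo (L_ren)
    l_ren = (mask2d * (render - image).abs()).sum() / mask2d.sum().clamp(min=1)

    # L1 between refined texture and ground truth (L_tex), paired data only
    l_tex = (tex - tex_gt).abs().mean() if tex_gt is not None else tex.new_zeros(())

    # skin regularization on a blurred texture: horizontal symmetry + std-dev
    # (box blur used here as a stand-in for the Gaussian blur in the paper)
    blur = F_nn.avg_pool2d(tex.unsqueeze(0), 9, stride=1, padding=4).squeeze(0)
    l_sym = (blur - torch.flip(blur, dims=[-1])).abs().mean()
    skin = blur[:, mask_uv[0] > 0]
    l_std = skin.std() if skin.numel() > 1 else tex.new_zeros(())

    # non-saturating adversarial term on the two discriminators' fake logits
    l_adv = F_nn.softplus(-d_img_fake).mean() + F_nn.softplus(-d_tex_fake).mean()

    return lam_l1 * (l_ren + l_tex) + lam_sym * l_sym + lam_std * l_std + lam_adv * l_adv
```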
We use the Basel Face Model (Paysan et al. 2009) as our 3DMM and adopt the pre-trained network of (Deng et al. 2019) to predict the 3DMM and pose coefficients. The 3DMM face we use contains 35,709 vertices and 70,789 faces. We employ the head mesh from a real game, which contains 8,520 vertices and 16,020 faces. The facial segmentation network we use is from (Shi et al. 2019). We use the CelebA-HQ dataset (Karras et al. 2017) to create our dataset. During the UV unwrapping process, we set the resolution of the face images and the unwrapped texture maps to 512 × 512 and 1,024 × 1,024, respectively. The dataset we created consists of six subsets: {Caucasian, Asian, African} × {female, male}. Each subset contains 400 texture maps. From each subset, we randomly select 300 for training, 50 for evaluation, and 50 for testing. Note that the game mesh we produce corresponds only to an enclosed head; hair, beard, eyeballs, etc. are not considered in our method. This is because, in many games, heads, hair, and other parts are topologically independent modules: players can freely change hairstyles (including long hair) and other decorations without affecting the topological structure of the head.

We use a grid search over the loss weights from 0.001 to 10 and select the best configuration based on the overall loss on the validation set as well as quantitative comparison. The weights of the loss terms are finally set as follows: λ_l1 = 3, λ_perc = 1, λ_sty = 1, λ_sym = 0.1, λ_std = 3, λ_adv = 0.001. The learning rate is set to 0.0001; we use the Adam optimizer and train our networks for 50 epochs. We run our experiments on an Intel i7 CPU and an NVIDIA 1080Ti GPU, with PyTorch3D (v0.2.0) and its dependencies. Given a portrait and a coarse texture map, our network takes only 0.4 s to produce a 1,024 × 1,024 refined texture map.

We compare our method with other state-of-the-art game character auto-creation methods and systems, including the character customization systems in A Dream of Jianghu 6, Loomie 7, Justice 8 (which is based on the method of (Shi et al. 2020)), and ZEPETO 9. As shown in Fig. 4, our results are more similar to the input images than the other results in both face shape and appearance. The faces reconstructed by Justice (Shi et al. 2020), A Dream of Jianghu, and ZEPETO have limited shape variance and also fail to recover the textures of the input images. Loomie restores both facial shape and texture, but it cannot handle difficult lighting conditions (e.g., highlights), occluded regions (e.g., the eyebrow in the first example), or personalized details (e.g., makeup). We also compare with the state-of-the-art 3DMM based method in Fig. 4, which applies CNNs to reconstruct 3DMM faces from single images. We can see that the 3DMM only models the facial features and includes neither a complete head model nor textures, making it difficult to use directly in game environments. For more comparisons, please refer to our supplementary materials.

Since it is hard to acquire ground-truth data for game characters, we perform a user study comparing our results with the others. We invited 30 people to conduct the evaluation. Each person was assigned 480 groups of results; each group included a portrait, our result, and a result from another method. Participants were asked to choose the better of the two results by comparing them with the reference portrait. We believe the user-reported score reflects the quality of the results more faithfully than other indirect metrics. The statistical results are shown in Tab. 1.

In addition to the user study on the final reconstructions, we also compare our method with Deng et al. (Deng et al. 2018) on the quality of the restored textures. Deng et al. did not take lighting or occlusions into consideration, which makes their method more like image inpainting than texture reconstruction. We compute the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) between the refined texture maps and the ground-truth texture maps. The scores are shown in Tab. 2. Note that Deng et al. (Deng et al. 2018) reported their results on WildUV, a dataset similar to ours which is also constructed from an in-the-wild dataset (the UMD video dataset (Bansal et al. 2017)). A direct comparison with our results on the RGB 3D face dataset could therefore be unfair to some extent; nevertheless, we still list their result as a reference.
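For reference, PSNR and SSIM between a refined texture map and its ground truth can be computed with scikit-image as sketched below; this only illustrates the metrics themselves, not the authors' exact evaluation script.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def texture_metrics(refined: np.ndarray, gt: np.ndarray):
    """refined, gt: (H, W, 3) uint8 texture maps of the same size."""
    psnr = peak_signal_noise_ratio(gt, refined, data_range=255)
    # channel_axis requires scikit-image >= 0.19 (older versions use multichannel=True)
    ssim = structural_similarity(gt, refined, channel_axis=-1, data_range=255)
    return psnr, ssim

# Example with random stand-ins for a 1024x1024 texture pair.
a = np.random.randint(0, 256, (1024, 1024, 3), dtype=np.uint8)
b = np.clip(a.astype(int) + np.random.randint(-5, 6, a.shape), 0, 255).astype(np.uint8)
print(texture_metrics(a, b))
```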
To evaluate the contribution of the different loss functions, we conduct ablation studies on the perceptual loss, the skin regularization loss, and the adversarial loss. Some examples are shown in Fig. 5; the full model generates more realistic texture maps.

Figure 5: Ablation study on different loss functions (input images, our full model, and the model without the perceptual loss).

We also compute the PSNR and SSIM metrics; the scores are shown in Tab. 3. Although we achieve higher accuracy than other methods in quantitative and qualitative evaluations, our method still has some limitations. As shown in Fig. 6 (a), when there are heavy occlusions (e.g., a hat), our method fails to produce faithful results, since the renderer cannot model the shadows created by objects outside the head mesh. Fig. 6 (b, c) shows the results from two portraits of the same person under severe lighting changes. Given (b) or (c) alone, either result looks good; in theory, the method should produce similar results for the same person, yet the results are affected by the differently colored lights.

In this paper, we present a novel method for the automatic creation of game character faces. Our method produces character faces similar to the input photo in terms of both face shape and texture. Considering that it is expensive to build 3D face datasets with both shape and texture, we propose a low-cost alternative to generate the data we need for training. We introduce a neural network that takes a face image and a coarse texture map as inputs and predicts a refined texture map as well as lighting coefficients. Highlights, shadows, and occlusions are removed from the refined texture map while personalized details are preserved. We evaluate our method quantitatively and qualitatively. Experiments demonstrate that our method outperforms the existing methods applied in games.
References
Amberg, Romdhani, and Vetter (2007). Optimal step nonrigid ICP algorithms for surface registration.
Bansal et al. (2017). The do's and don'ts for CNN-based face verification.
Blanz and Vetter (1999). A morphable model for the synthesis of 3D faces.
Booth et al. (2016). A 3D morphable model learnt from 10,000 faces.
Cao et al. (2013). FaceWarehouse: A 3D facial expression database for visual computing.
Learning to predict 3D objects with an interpolation-based differentiable renderer.
De Boer, Van der Schoot, and Bijl (2007). Mesh deformation based on radial basis function interpolation.
Deng et al. (2018). UV-GAN: Adversarial facial UV map completion for pose-invariant face recognition.
Deng et al. (2019). Accurate 3D face reconstruction with weakly-supervised learning: From single image to image set.
Gecer et al. (2019). GANFIT: Generative adversarial network fitting for high fidelity 3D face reconstruction.
Genova et al. (2018). Unsupervised training for 3D morphable model regression.
Gerig et al. (2018). Morphable face models - an open framework.
Huber et al. (2016). A multiresolution 3D morphable face model and fitting framework.
Karras et al. (2017). Progressive growing of GANs for improved quality, stability, and variation.
King (2009). Dlib-ml: A machine learning toolkit.
Li et al. (2017). Learning a model of facial shape and expression from 4D scans.
Lin et al. (2020). Towards high-fidelity 3D face reconstruction from in-the-wild images using graph convolutional networks.
Soft Rasterizer: A differentiable renderer for image-based 3D reasoning.
Nazeri et al. (2019). EdgeConnect: Structure guided image inpainting using edge prediction.
Paysan et al. (2009). A 3D face model for pose and illumination invariant face recognition.
Ravi et al. (2020). Accelerating 3D deep learning with PyTorch3D.
Shi et al. (2019). Face-to-Parameter translation for game character auto-creation.
Shi et al. (2020). Fast and robust Face-to-Parameter translation for game character auto-creation.
Tewari et al. (2018). Self-supervised multi-level face model learning for monocular reconstruction at over 250 Hz.
Yang et al. (2020). FaceScape: A large-scale high quality 3D face dataset and detailed riggable 3D face prediction.
Zhou et al. (2019). Dense 3D face decoding over 2500 FPS: Joint texture & shape convolutional mesh decoders.