PRNet Extension: 3D Face Reconstruction and Dense Alignment

Project Video

Outline

  1. Introduction
  2. Background Work
  3. Datasets
  4. Baseline Model
  5. Baseline training and results
  6. Generalizability and testing of the baseline model
  7. Novelties: PRNet Extensions
  8. MobileNetV2 as encoder
  9. ResNet18 as encoder
  10. Training and results
  11. Generalizability and testing of the new model
  12. Footnote on our work prior to 3D face reconstruction

Introduction

3D face reconstruction is a long-standing topic in computer vision. When 2D images and their corresponding 3D face models are available as input-label pairs, learning to map a new 2D image to a 3D face is relatively straightforward. The challenge lies in reconstructing the 3D face from only a single 2D RGB image.

Closely related is the problem of face alignment: fitting a face model to an image and extracting the semantic meaning of facial pixels. Estimating alignment for large poses and using it in 3D reconstruction is a further challenge. CNN-based models are an effective tool for learning face alignment and then using that information to construct a 3D model. Following this pathway, we have implemented a method that simultaneously reconstructs the 3D facial structure and provides dense alignment, end to end, without being restricted to a low-dimensional solution space. Dense face alignment here means the method should offer correspondences between two face images as well as between a 2D facial image and a 3D facial reference geometry.

Background Work

Statistical 3D face models were first proposed about two decades ago with the 3D Morphable Model (3DMM). Since then, reconstructing 3D faces from 2D images has come a long way. The large-pose volumetric approach regresses a volumetric representation of the 3D face from a single 2D image, but it does not provide dense alignment. The problem with most of these models is that they are constrained by the model shape space and by the post-processing needed to generate a 3D mesh from the estimated parameters. They rely either on a 3D facial template or on warping the shape of a reference 3D model onto the current image, so their outputs change whenever the template changes. In addition, model-based methods, which regress a small set of model parameters rather than point coordinates, usually need special care during training (such as using a Mahalanobis distance) and inevitably limit the estimated face geometry to their model space.

Regarding face alignment, most earlier models were developed for small to medium poses. Large poses add complexity because the appearance of a face varies drastically with the viewing angle. 3DDFA by Zhu et al. introduced an alignment framework in which a dense 3DMM is fitted to the image via cascaded CNNs. Better performance and stability are achieved by 3DDFA_V2 by Guo et al., which obtains the best results on speed, accuracy, and stability by using a MobileNet backbone that regresses a small set of 3DMM parameters.

The Position Map Regression Network (PRN), or PRNet, by Feng et al. focuses on regressing the 3D facial geometry and its dense correspondence information from a single 2D image. It uses a UV position map representation that can be predicted directly by a deep network. No intermediate parameters such as 3DMM coefficients or warping parameters are needed, which makes the model fast while maintaining decent accuracy.

Datasets

The dataset used for training is 300W-LP, released by Zhu et al. alongside their publication Face Alignment Across Large Poses: A 3D Solution. It contains 61,225 images of human faces in large poses, together with fitted 3D face meshes. By large pose we mean that the dataset contains faces in a wide variety of poses, many of them at a large angle relative to the camera. For this project, we used the IBUG subset of 300W-LP.

Baseline Model

3DDFA_V2 by Guo et al. is the current state of the art in 3D dense face alignment on the 300W-LP dataset, but since the work is very recent (ECCV 2020, latest GitHub commit January 2021), little information is available about the model details and training. We therefore chose PRNet as our baseline model, made novel extensions to it, and examined how close those extensions come to the state of the art.

Model Details and Architecture

To jointly solve 3D face reconstruction and face alignment, Feng et al. designed a 2D representation called the UV position map, which records the 3D shape of a complete face in UV space. UV space is a 2D image plane parameterized from the 3D surface. Traditionally, it has been used to express information such as face texture (texture map), 2.5D geometry (height map), and correspondences between 3D facial meshes. Feng et al. instead use UV space to store the 3D coordinates of the points of the 3D face model.

UV position map illustration — ground truth 3D face point cloud exactly matches the face in the 2D image when projected to the x-y plane. Thus the position map can be easily comprehended as replacing the r, g, b values in the texture map by x, y, z coordinates.
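To make the representation concrete, here is a minimal sketch (not the authors' code) of how a predicted position map can be turned into a dense point cloud and 68 sparse landmarks. The .npy path is a placeholder, and uv_kpt_ind.txt refers to the 2 × 68 landmark index table shipped with the official PRNet release; the row/column convention is assumed.

```python
import numpy as np

# Hypothetical saved prediction: a 256 x 256 x 3 array whose channels hold x, y, z.
pos_map = np.load("example_posmap.npy")

# Every UV pixel stores one 3D vertex, so the dense point cloud is just a reshape.
dense_vertices = pos_map.reshape(-1, 3)                      # (256*256, 3)

# Sparse alignment: the 68 landmarks sit at fixed UV locations, so they can be read
# out by indexing with the 2 x 68 table from the official release (placeholder path).
uv_kpt_ind = np.loadtxt("uv_kpt_ind.txt").astype(np.int32)   # shape (2, 68)
landmarks_3d = pos_map[uv_kpt_ind[1], uv_kpt_ind[0], :]      # (68, 3)
landmarks_2d = landmarks_3d[:, :2]                           # x-y projection for 2D alignment
```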
The architecture of PRNet. The green rectangles represent residual blocks and the blue ones transposed convolutional layers. It is an encoder-decoder structure that learns to transfer an input image to a UV position map.

They then train a simple convolutional neural network to regress this position map from a single 2D image.

The network transfers the input RGB image into a position map image, and the authors employ an encoder-decoder structure to learn this transfer function. The encoder begins with one convolution layer followed by 10 residual blocks, which reduce the 256 × 256 × 3 input image to 8 × 8 × 512 feature maps; the decoder contains 17 transposed convolution layers that generate the predicted 256 × 256 × 3 position map.
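The snippet below is a simplified PyTorch sketch of this shape flow, not the exact PRNet hyper-parameters: a plain residual block stands in for the authors' blocks and the decoder is shortened to a handful of transposed convolutions, but the 256 × 256 × 3 → 8 × 8 × 512 → 256 × 256 × 3 progression is the same.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Basic residual block; stride 2 halves the spatial resolution."""
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride, 1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, 1, 1), nn.BatchNorm2d(out_ch),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1, stride) if (stride != 1 or in_ch != out_ch) else nn.Identity()

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

def build_prnet_sketch():
    # Encoder: one conv + 10 residual blocks; every other block halves the resolution,
    # turning the 256 x 256 x 3 input into an 8 x 8 x 512 feature map.
    chans = [16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 512]
    enc = [nn.Conv2d(3, 16, 3, 1, 1)]
    for i in range(1, 11):
        enc.append(ResBlock(chans[i - 1], chans[i], stride=2 if i % 2 == 1 else 1))
    # Decoder: transposed convolutions upsampling back to a 256 x 256 x 3 position map
    # (the paper uses 17 layers; this sketch uses fewer for brevity).
    dec, in_ch = [], 512
    for out_ch in [256, 128, 64, 32, 16]:
        dec += [nn.ConvTranspose2d(in_ch, out_ch, 4, 2, 1), nn.ReLU(inplace=True)]
        in_ch = out_ch
    dec += [nn.Conv2d(16, 3, 3, 1, 1), nn.Sigmoid()]   # 3 output channels: x, y, z
    return nn.Sequential(*enc, *dec)

net = build_prnet_sketch()
print(net(torch.randn(1, 3, 256, 256)).shape)          # torch.Size([1, 3, 256, 256])
```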

They also integrate a weight mask into the loss function. The weight mask is a grey image that records the weight of each point on the position map, and the loss computed during training is weighted accordingly, which improves the performance of the network. Simply using a plain MSE instead of this weighted loss would be inefficient for learning the position map, because it would assign equal weight to the entire UV map, whereas the central region carries the most distinctive facial features. A fixed weight ratio is used: subregion 1 (68 facial landmarks) : subregion 2 (eyes, nose, mouth) : subregion 3 (other face area) : subregion 4 (neck) = 16 : 4 : 3 : 0.

Illustration of the final weight mask. From left to right: UV texture map, UV position map, colored texture map with segmentation information (blue for eye region, red for nose region, green for mouth region, and purple for neck region), the final weight mask.
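The following is a small sketch of how such a weighted loss can be implemented, assuming the mask has already been loaded as a 256 × 256 tensor with the 16 : 4 : 3 : 0 values baked in.

```python
import torch

def weighted_mse(pred_pos, gt_pos, weight_mask):
    """pred_pos, gt_pos: (B, 3, 256, 256) position maps; weight_mask: (256, 256)."""
    # Squared error per UV pixel, summed over the x, y, z channels.
    sq_err = ((pred_pos - gt_pos) ** 2).sum(dim=1)     # (B, 256, 256)
    # Landmarks (16) > eyes/nose/mouth (4) > rest of the face (3) > neck (0), so errors
    # on semantically important regions dominate the loss.
    return (sq_err * weight_mask).mean()
```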

Baseline Training and Results

To test and debug our training code, we used Google Colab Pro. The final training was run on AWS p2.xlarge instances using PyTorch v1.1.0 and TensorBoard v1.14.0. The model was trained with a batch size of 16, a learning rate of 0.0001, the Adam optimizer, and an exponential LR scheduler for a total of 150 epochs. The exact details of the config can be found here.
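Below is a rough sketch of that training setup. It reuses the network and loss sketches from above, the dummy tensors merely stand in for the real 300W-LP data loader, and the scheduler decay factor is our own choice rather than a value from the actual config.

```python
import torch

model = build_prnet_sketch()                             # network from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)   # assumed decay

# Dummy data stands in for the 300W-LP image / position-map pairs, batch size 16.
dummy = torch.utils.data.TensorDataset(torch.rand(32, 3, 256, 256), torch.rand(32, 3, 256, 256))
train_loader = torch.utils.data.DataLoader(dummy, batch_size=16, shuffle=True)
weight_mask = torch.ones(256, 256)                       # stand-in for the real weight mask image

for epoch in range(150):
    for images, gt_pos_maps in train_loader:
        optimizer.zero_grad()
        loss = weighted_mse(model(images), gt_pos_maps, weight_mask)
        loss.backward()
        optimizer.step()
    scheduler.step()                                     # exponential LR decay once per epoch
```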

Each epoch took approximately 6 minutes 30 seconds, so the total training time for 150 epochs ran well over 17 hours. The associated logs and TensorBoard files for the baseline run can be found in the link here.

Baseline Training Loss over the epochs

We see that the training loss decreases steadily, reaching a final loss of roughly 0.007.

We evaluated the model performance with the original weighted loss and an SSIM loss. SSIM measures the structural similarity between the original and reconstructed images. For more information, refer here.
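As a sketch of how such an SSIM check can be computed with scikit-image (the real evaluation pipeline may differ, and channel_axis requires scikit-image ≥ 0.19):

```python
import numpy as np
from skimage.metrics import structural_similarity

def ssim_score(img_a, img_b):
    """SSIM between two H x W x 3 float arrays (e.g. original vs reconstructed)."""
    a = np.asarray(img_a, dtype=np.float64)
    b = np.asarray(img_b, dtype=np.float64)
    return structural_similarity(a, b, channel_axis=-1, data_range=b.max() - b.min())

# Example with random stand-in images; real use compares original and reconstructed outputs.
print(f"SSIM: {ssim_score(np.random.rand(256, 256, 3), np.random.rand(256, 256, 3)):.3f}")
```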

Baseline Results

We visualize the output of the model as sparse and dense alignment along with 2D facial landmark projections. Some of the final results are shown below.

Top Pose: Left to Right: Original Ground Truth Image, 2D facial landmark points, 3D sparse alignment, 3D dense alignment
Side Pose: Left to Right: Original Ground Truth Image, 2D facial landmark points, 3D sparse alignment, 3D dense alignment
Side Pose: Left to Right: Original Ground Truth Image, 2D facial landmark points, 3D sparse alignment, 3D dense alignment
A GIF shows the model’s outputs of sparse and dense alignment from the 10th epoch to the 140th epoch.

We can clearly see that the lips and the eyes are the hardest to learn. The model learns the basic structure of the face first, i.e., recognizing the presence of the jawline, the alignment of the nose with respect to the lips, and then the eyebrows with the eyes below them. As learning progresses, it first captures the bowl of the face, then the eyebrows, and then gradually fits the nose and eyes into their respective places.

Generalizability and Testing of the baseline model

We tested the model on images entirely different from the training set and found that it gives decent results. A general problem with these 3D reconstruction models is that learning is biased towards the texture, features, and alignment of the images in the training dataset.

Original Image of Prof. Deepak Pathak, 2D facial landmark points, 3D sparse alignment, 3D dense alignment
Testing the model's outcomes on in-the-wild images from different categories such as gender (male/female), pose (side/front/top), or accessories (with/without spectacles).
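For reference, here is a minimal sketch of how such an in-the-wild test can be run, assuming a trained model from the earlier sketches and a placeholder image path; the real pipeline also detects and crops the face before resizing.

```python
import torch
from PIL import Image
import torchvision.transforms as T

transform = T.Compose([T.Resize((256, 256)), T.ToTensor()])
image = transform(Image.open("test_face.jpg").convert("RGB")).unsqueeze(0)   # (1, 3, 256, 256)

model = build_prnet_sketch()      # in practice, load trained weights here
model.eval()
with torch.no_grad():
    pos_map = model(image)[0].permute(1, 2, 0).numpy()   # (256, 256, 3) position map

# Sparse and dense alignment then follow from the position map exactly as in the
# earlier landmark-extraction sketch.
```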

Novelties: PRNet Extensions

In order to improve on the baseline model, we primarily focus on 2 objectives:

  • Improving the long training time of the baseline model: based on our training of the baseline, it took well over 17 hours to complete only 30% of the training, and the total training time (500 epochs) was expected to be around two and a half days.
  • Improving the embeddings produced by the encoder: our intuition was to try different models as encoders, experiment with different encoding architectures, and see how the model's generalizability and performance vary.

After trying several different architectures as encoders, the following two fit the overall model architecture without diminishing the results, while also giving a significant improvement in training time.

MobileNetV2 as Encoder

In the original experiment, Feng et al. used ResNet10 as their encoder, while recent state-of-the-art implementations in the field (such as 3DDFA_V2 by Guo et al.) suggest the use of MobileNet. Since MobileNet is lightweight, we expected it to train faster than the baseline while still giving decent performance.

MobileNetV2 Architecture

One challenge we faced while training with the MobileNet encoder was that the loss exploded when we used the same learning rate as the baseline, so we decreased the learning rate and trained the architecture again. For training, we used a setup similar to the baseline: AWS p2.xlarge instances, PyTorch v1.1.0, and TensorBoard v1.14.0. The model was trained with a batch size of 16, a learning rate of 0.00001, the Adam optimizer, and an exponential LR scheduler for a total of 420 epochs. The exact details of the config can be found here.
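As a sketch of the encoder swap (our reconstruction, not necessarily the project's exact code): torchvision's MobileNetV2 feature extractor maps a 256 × 256 input to a 1280-channel 8 × 8 map, so we assume a 1 × 1 convolution projects it to the 512 channels the existing decoder expects.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class MobileNetV2Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = mobilenet_v2(pretrained=False).features   # (B, 1280, 8, 8) for 256 x 256 input
        self.project = nn.Conv2d(1280, 512, kernel_size=1)        # match the decoder's 512 channels

    def forward(self, x):
        return self.project(self.features(x))

encoder = MobileNetV2Encoder()
print(encoder(torch.randn(1, 3, 256, 256)).shape)                 # torch.Size([1, 512, 8, 8])
```

With this encoder in place, the decoder and the weighted loss stay as in the baseline; only the learning rate was lowered to 0.00001, as noted above.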

Improvements

Each epoch took approximately 1 minute 30 seconds, and the entire training of 420 epochs took around 11 hours to complete, almost a 75% reduction in training time over the baseline. Due to the low learning rate, however, the results were not quite as good. The associated logs and TensorBoard files for the MobileNet run can be found in the link here.

MobileNet Training Loss over the epochs

MobileNet Results

We visualize the output of the model as sparse and dense alignment along with 2D facial landmark projections. Some of the final results are shown below.

Left to Right: Original Ground Truth Image, 2D facial landmark points, 3D sparse alignment, 3D dense alignment

Generalizability and Testing of the model

While testing the generalizability of the model on our own test images, we found that the model with MobileNet as the encoder gives decent outputs: it is able to track faces and their orientation. The results are not as good as the baseline, but considering the lower training time, they are still acceptable.

Original Image of Prof. Deepak Pathak, 2D facial landmark points, 3D sparse alignment, 3D dense alignment
Testing the model's outcomes on in-the-wild images from different categories such as gender (male/female), pose (side/front/top), or accessories (with/without spectacles).

ResNet18 as Encoder

In our next experiment we tried different ResNet architectures as encoders. We first experimented with ResNet34 and larger variants, but they made the output of the encoder too small to give decent results, so we settled on ResNet18.
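A sketch of why ResNet18 slots in so cleanly (again our reconstruction, not the project's exact code): stripping the average-pool and fully connected head from torchvision's ResNet18 leaves a feature extractor whose output for a 256 × 256 input is 8 × 8 × 512, the same shape the original PRNet decoder consumes, so no extra projection layer is needed.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

backbone = resnet18(pretrained=False)
encoder = nn.Sequential(*list(backbone.children())[:-2])   # drop avgpool and fc

print(encoder(torch.randn(1, 3, 256, 256)).shape)          # torch.Size([1, 512, 8, 8])
```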

The model with the ResNet18 encoder was again trained on AWS p2.xlarge instances using PyTorch v1.1.0 and TensorBoard v1.14.0, with a batch size of 16, a learning rate of 0.0001, the Adam optimizer, and an exponential LR scheduler for a total of 500 epochs. The exact details of the config can be found here.

Each epoch took approximately 48 seconds, and the entire training of 500 epochs took under 7 hours to complete. The associated logs and TensorBoard files for the ResNet18 run can be found in the link here.

ResNet18 Training Loss over the epochs

ResNet18 Results

Left to Right: Original Ground Truth Image, 2D facial landmark points, 3D sparse alignment, 3D dense alignment
A GIF shows the model's sparse and dense alignment outputs from the 0th epoch to the 490th epoch for the ResNet18-based model.

ResNet18 Model’s Generalizability

Below are some images, completely different from the training set, on which we tested our model. We found that even with the reduced training time, the generalization results for ResNet18 are as good as or better than those of the baseline.

Original Image of Prof. Deepak Pathak, 2D facial landmark points, 3D sparse alignment, 3D dense alignment
Testing the model’s outcomes on the images in the wild with images from different categories such as gender (male/female), poses(side/front/top) or accessories (with/without spectacles)

Conclusion and Improvements

We could see that with almost an 87% reduction in training time, we get better results.

  1. The loss goes down to roughly 0.001, and the sparse/dense alignments improve as well.
  2. The predictions are well molded. For example, in the results for Barack Obama, the ears are outlined too and the jawline fits much better than with the original model.
  3. The training time went down from about 2.5 days to under 7 hours.
Comparison: Baseline PRNet (left) vs ResNet18 PRNet (right). The jawline is more prominent on the ResNet18-trained model, and the eye alignment is also more exact.
  4. The generalizability has also improved: we see better dense alignments on test images, even with obstacles on the face and more inclined poses.
Comparison: Baseline PRNet (left) vs ResNet18 PRNet: dense alignment of a side-facing picture
Comparison: Baseline PRNet (left) vs ResNet18 PRNet: dense alignment of a top-view picture

Code and Saved Models

Code and saved models for the different training experiments can be found at the links below:

Note

Motivated by ECCV 2020 papers on 3D animal reconstruction, we initially chose to work on 3D reconstruction of dogs, taking "Who Left the Dogs Out?" by Benjamin Biggs et al. as our base model. But since that work was quite recent (March 2021), we ran into setup issues that the authors themselves were still fixing in their GitHub code. Even though the authors offered to help, there were limits on how much could be made public, so we were unable to get the model running and reach conclusive results. We therefore changed direction and chose 3D face reconstruction as our final project, which is described above. The resources gathered for the 3D animal reconstruction project can be found here.

References

  1. PRNet unofficial PyTorch implementation: https://github.com/tomguluson92/PRNet_PyTorch
  2. PRNet paper, Feng et al.: https://arxiv.org/pdf/1803.07835.pdf
  3. PRNet official release: https://github.com/YadiraF/PRNet
  4. PRNet reference article: https://medium.com/@hyprsense/bridging-the-academia-gap-an-implementation-of-prnet-training-9fa035c27702
  5. State-of-the-art comparison: https://paperswithcode.com/sota/3d-face-reconstruction-on-aflw2000-3d
  6. 3DDFA_V2 code, Guo et al.: https://github.com/cleardusk/3DDFA_V2
  7. 3DDFA code, Zhu et al.: https://github.com/cleardusk/3DDFA
  8. 3DDFA paper: https://arxiv.org/pdf/1804.01005.pdf
  9. 3DDFA_V2 paper: https://guojianzhu.com/assets/pdfs/3162.pdf
  10. SSIM: https://ece.uwaterloo.ca/~z70wang/research/ssim/
