We present a novel framework for free-viewpoint facial performance relighting using diffusion-based image-to-image translation. Leveraging a subject-specific dataset containing diverse facial expressions captured under various lighting conditions, including flat-lit and one-light-at-a-time (OLAT) scenarios, we train a diffusion model for precise lighting control, enabling high-fidelity relit facial images from flat-lit inputs. Our framework combines spatially aligned conditioning on flat-lit captures and random noise with integrated lighting information for global control, utilizing prior knowledge from the pre-trained Stable Diffusion model. This model is then applied to dynamic facial performances captured in a consistent flat-lit environment and reconstructed for novel-view synthesis using a scalable dynamic 3D Gaussian Splatting method to maintain quality and consistency in the relit results. In addition, we introduce unified lighting control by integrating a novel area lighting representation with directional lighting, allowing joint adjustment of light size and direction. We also enable high dynamic range imaging (HDRI) composition using multiple directional lights to produce dynamic sequences under complex lighting conditions. Our evaluations demonstrate the model's effectiveness in achieving precise lighting control and generalizing across various facial expressions while preserving detailed features such as skin texture and hair. The model accurately reproduces complex lighting effects like eye reflections, subsurface scattering, self-shadowing, and translucency, advancing photorealism within our framework.
We show the effectiveness of our method in transforming flat-lit performance captures into relit performances. The left video shows a novel view of a novel performance rendered from our Gaussian Splatting (GS) reconstruction as a flat-lit video, and the right video shows its relit counterpart generated by our method given novel light directions. The slider near the bottom can be used to compare the two videos. You can also use the buttons, mouse wheel, or touch gestures to zoom in and out.
We show how our method compares against ground-truth images under unseen conditions (novel view, pose, and lighting). The ground-truth image is shown on the left and the predicted image on the right. The slider near the bottom can be used to compare the two images. You can also use the buttons, mouse wheel, or touch gestures to zoom in and out of the images.
Starting with multi-view performance data of a subject captured in a neutral, flat-lit environment, we train a deformable 3DGS to create novel-view renderings of the dynamic sequence. These renderings serve as inputs to a diffusion-based relighting model, trained on paired data to translate flat-lit input images into relit results under a specified lighting condition. Here, we show the inference step of the diffusion model, where the latent representation of the flat-lit image is concatenated with random noise as input to the diffusion U-Net. Lighting information, encoded with a spherical harmonics (SH) encoding together with the text embedding, conditions the diffusion process.
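To make the conditioning scheme concrete, below is a minimal PyTorch-style sketch of this inference step, assuming a Stable-Diffusion-like VAE and U-Net and a low-order SH encoding of the light direction. The module stubs, tensor shapes, context-fusion scheme, and the simplified denoising update are illustrative placeholders, not the actual trained networks or sampler.

```python
import torch
import torch.nn as nn


def sh_encoding(direction: torch.Tensor) -> torch.Tensor:
    """Degree-2 real spherical-harmonic basis of a unit light direction (9 coefficients)."""
    x, y, z = direction.unbind(-1)
    return torch.stack([
        0.282095 * torch.ones_like(x),                 # Y_0^0
        0.488603 * y, 0.488603 * z, 0.488603 * x,      # Y_1^{-1}, Y_1^0, Y_1^1
        1.092548 * x * y, 1.092548 * y * z,            # Y_2^{-2}, Y_2^{-1}
        0.315392 * (3 * z * z - 1),                    # Y_2^0
        1.092548 * x * z, 0.546274 * (x * x - y * y),  # Y_2^1, Y_2^2
    ], dim=-1)


class DummyVAE(nn.Module):
    """Stand-in for the pre-trained VAE (8x spatial downsampling to a 4-channel latent)."""
    def encode(self, img):   # (B, 3, H, W) -> (B, 4, H/8, W/8)
        return torch.randn(img.shape[0], 4, img.shape[2] // 8, img.shape[3] // 8)

    def decode(self, lat):   # (B, 4, h, w) -> (B, 3, 8h, 8w)
        return torch.rand(lat.shape[0], 3, lat.shape[2] * 8, lat.shape[3] * 8)


class DummyUNet(nn.Module):
    """Stand-in for the diffusion U-Net that takes 8 input channels (noise + condition)."""
    def forward(self, latents, t, context):
        return torch.randn_like(latents[:, :4])       # predicted noise for the 4 noisy channels


@torch.no_grad()
def relight(flat_img, light_dir, text_emb, vae, unet, num_steps=30):
    """Denoise random latents conditioned on the flat-lit latent and the lighting code."""
    cond_latent = vae.encode(flat_img)                # spatially aligned condition
    latents = torch.randn_like(cond_latent)           # start from pure noise
    light_emb = sh_encoding(light_dir)                # global lighting code, shape (B, 9)
    # Append the lighting code to every text token embedding (one possible fusion scheme).
    context = torch.cat(
        [text_emb, light_emb.unsqueeze(1).expand(-1, text_emb.shape[1], -1)], dim=-1)
    for t in reversed(range(num_steps)):
        x_in = torch.cat([latents, cond_latent], dim=1)   # channel-wise concatenation
        noise_pred = unet(x_in, t, context)
        latents = latents - noise_pred / num_steps        # toy update; a real sampler (e.g. DDIM) goes here
    return vae.decode(latents)


# Example call with placeholder inputs.
out = relight(torch.rand(1, 3, 512, 512),              # flat-lit input image
              torch.tensor([[0.0, 0.0, 1.0]]),         # novel light direction
              torch.zeros(1, 77, 768),                 # text embedding
              DummyVAE(), DummyUNet())
```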
To optimize 3D Gaussians for lengthy performance sequences, instead of training all frames together with one shared set of Gaussians, we partition the sequence into short, overlapping segments with an equal number of frames, allowing the Gaussians to vary across segments. To minimize temporal inconsistency at the transition frames between segments and to preserve a similar level of reconstruction detail across segments, we design a two-stage training strategy. We first sample the segment starting frames to partition the long sequence into segments. In Stage 1, we train the deformable 3DGS on these starting frames only to generate an initialization for each segment. In Stage 2, we train a deformable 3DGS for each segment conditioned on this initialization. Linear interpolation is used to blend the results of overlapping segments, ensuring temporal consistency at segment transitions.
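As a concrete illustration of the partitioning and overlap blending, here is a short Python sketch. The segment length, overlap size, and cross-fade weights below are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np


def make_segments(num_frames: int, segment_len: int, overlap: int):
    """Split [0, num_frames) into equal-length segments that share `overlap` frames."""
    step = segment_len - overlap
    starts = range(0, max(num_frames - overlap, 1), step)
    return [(s, min(s + segment_len, num_frames)) for s in starts]


def blend_overlap(frames_a, frames_b, overlap: int):
    """Linearly cross-fade the last `overlap` renders of segment A into the first of segment B."""
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)                  # weight ramps from segment A toward segment B
        blended.append((1 - w) * frames_a[-overlap + i] + w * frames_b[i])
    return blended


# Example: a 1200-frame capture split into 50-frame segments with a 5-frame overlap.
segments = make_segments(num_frames=1200, segment_len=50, overlap=5)
fade = blend_overlap(np.ones((50, 4, 3)), np.zeros((50, 4, 3)), overlap=5)
```

The cross-fade touches only the overlapping frames, so each segment can keep its own set of Gaussians while the rendered video stays continuous at transitions.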
We compare the HDRI relighting results from our method with reference images captured in four real-world environments. For each HDRI relighting example, we present the reference image and two relighting results (one using the area-light model and the other using the OLAT-based model), along with the captured HDRI map and its approximations by 15 Spherical Gaussians (SGs) and by 123 OLATs, overlaid on the image. Both models achieve results comparable to the reference images.
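For readers who want the mechanics of the OLAT-based composition, the snippet below shows one common way to approximate HDRI relighting as a weighted sum of per-light relit images, with weights sampled from the environment map at each light direction. The equirectangular sampling convention, array shapes, and random example inputs are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np


def hdri_weights(hdri: np.ndarray, light_dirs: np.ndarray) -> np.ndarray:
    """Sample RGB weights from an equirectangular HDRI (H, W, 3) at N unit light directions (N, 3)."""
    h, w, _ = hdri.shape
    x, y, z = light_dirs.T
    u = (np.arctan2(x, -z) / (2 * np.pi) + 0.5) * (w - 1)     # azimuth -> column index
    v = (np.arccos(np.clip(y, -1.0, 1.0)) / np.pi) * (h - 1)  # polar angle -> row index
    return hdri[v.astype(int), u.astype(int)]                 # (N, 3)


def compose_hdri_relight(olat_renders: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Linearly combine N OLAT relit images (N, H, W, 3) with per-light RGB weights (N, 3)."""
    return np.einsum('nhwc,nc->hwc', olat_renders, weights)


# Example: 123 relit frames composed under one environment map.
env = np.random.rand(64, 128, 3)
dirs = np.random.randn(123, 3)
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
relit_olats = np.random.rand(123, 128, 128, 3)
hdri_frame = compose_hdri_relight(relit_olats, hdri_weights(env, dirs))
```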
BibTeX TBA