Text-driven Visual Synthesis with Latent Diffusion Prior

University of Maryland, College Park



Text prompt: A very beautiful anime girl, full body, long braided curly silver hair, sky blue eyes, full round face, short smile, casual clothes, ice snowy lake setting, cinematic lightning, medium shot, mid-shot, highly detailed, trending on Artstation, Unreal Engine 4k, cinematic wallpaper by Stanley Artgerm Lau, WLOP, Rossdraws, James Jean, Andrei Riabovitchev, Marc Simonetti, and Sakimichan.

StyleGAN Adaptation.

Jacobian NeRF

Jacobian NeRF + Ours.

Text prompt: A high quality photo of a jug made of blue and white porcelain.

Text-to-3D comparison with Jacobian NeRF and Latent-NeRF.

Text prompt: Golden Horse.

Layer image editing.


There has been tremendous progress in large-scale text-to-image synthesis driven by diffusion models, enabling versatile downstream applications such as 3D object synthesis from text, image editing, and generation. We present a generic approach that uses latent diffusion models as powerful image priors for various visual synthesis tasks. Existing methods that utilize such priors fail to use these models' full capabilities. To improve this, our core ideas are (1) a latent score distillation gradient derived from the UNet's predicted noise and (2) a multi-level feature matching loss computed on the latent decoder's features.

We demonstrate the efficacy of our approach on three different applications, text-to-3D, StyleGAN adaptation, and layered image editing. Extensive results show our method compares favorably against baselines.


Our method guides generation and editing given a text prompt. We obtain the latent code v from a differentiable renderer in each application. This latent code v is perturbed following the latent diffusion model's scheduler at a random timestep t, such that F_t: z_t = α_t v + σ_t ε. The perturbed latent code z_t is passed to the UNet to produce the predicted noise ε̂, from which we derive the latent score distillation gradient. To derive the feature matching gradient, we feed the latent code v and the noised latent code v + (ε̂ − ε) into the decoder G_φ^dec(·), and compute the feature matching loss from the differences between multi-level features taken from three layers of the decoder. Finally, both the latent score distillation and multi-level feature matching gradients are backpropagated to the differentiable renderer.
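The step above can be sketched in a few lines. This is a minimal NumPy illustration, not the actual implementation: `unet_predict_noise` and `decoder_features` are hypothetical stubs standing in for the diffusion UNet and the three decoder layers, and the scheduler coefficients are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)

def unet_predict_noise(z_t, t):
    """Hypothetical stub for the diffusion UNet's noise prediction eps-hat."""
    return 0.5 * z_t  # placeholder; a real UNet conditions on z_t, t, and the prompt

def decoder_features(v):
    """Hypothetical stub returning multi-level features from three decoder layers."""
    return [v, np.tanh(v), np.tanh(v) ** 2]

# Latent code v produced by the differentiable renderer (shape is illustrative).
v = rng.normal(size=(4, 8, 8))

# Perturb v at a random timestep t following the scheduler: z_t = alpha_t v + sigma_t eps.
t = 10
alpha_t = 0.9                                  # placeholder scheduler coefficient
sigma_t = np.sqrt(1.0 - alpha_t ** 2)
eps = rng.normal(size=v.shape)
z_t = alpha_t * v + sigma_t * eps

# Predicted noise from the UNet, and the latent score distillation gradient.
eps_hat = unet_predict_noise(z_t, t)
lsd_grad = eps_hat - eps

# Multi-level feature matching loss between decoder features of v
# and of the noised latent v + (eps_hat - eps).
feats_clean = decoder_features(v)
feats_noisy = decoder_features(v + (eps_hat - eps))
fm_loss = sum(np.mean((a - b) ** 2) for a, b in zip(feats_clean, feats_noisy))
```

In the full method, both `lsd_grad` and the gradient of `fm_loss` would flow back through v to the renderer's parameters.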


(a) Text-to-3D

(b) StyleGAN adaptation

(c) Layered image editing.

To apply our proposed method, we first obtain the latent code v using the differentiable renderer of each application, as illustrated in the figure above. In (a) Text-to-3D, we render an image from a NeRF model at a random camera viewpoint; in (b) StyleGAN adaptation, we generate the image with a pretrained StyleGAN model; in (c) layered image editing, we use the generator of Text2LIVE to synthesize the edited image, the alpha map, and the alpha blend of the initial and edited images. The resulting image is then encoded into v with the Stable Diffusion encoder E_φ^enc.
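The alpha blend in (c) can be written out concretely. Below is a minimal sketch assuming a Text2LIVE-style generator that outputs an edit layer and an alpha map; the array shapes and values are illustrative, not from the paper.

```python
import numpy as np

def alpha_blend(source, edit_layer, alpha):
    """Composite the edit layer over the source: alpha * edit + (1 - alpha) * source."""
    return alpha * edit_layer + (1.0 - alpha) * source

source = np.zeros((2, 2, 3))      # input image (H, W, 3), all zeros for illustration
edit_layer = np.ones((2, 2, 3))   # generator's edited layer
alpha = np.full((2, 2, 1), 0.25)  # generator's alpha map, broadcast over channels

blended = alpha_blend(source, edit_layer, alpha)  # every pixel becomes 0.25
```

The blended image is what gets encoded into the latent code v and scored by the diffusion prior.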

Comparison: StyleGAN Adaptation


[Gal et al. 2022]

[Song et al. 2022]


Text prompt: Photo of a face [SEP] A realistic detailed portrait, single face, science fiction, artstation, volumetric lighting, octane render.


Comparison of our method with StyleGAN-NADA [Gal et al. 2022] and StyleGANFusion [Song et al. 2022] on FID (left, lower is better) and LPIPS/CLIP scores (right, higher is better).

*Prompt 1: "3d cute cat, closeup cute and adorable, cute big circular reflective eyes, long fuzzy fur, Pixar render, unreal engine cinematic smooth, intricate detail, cinematic"
**Prompt 2: "A beautiful portrait of a cute cat. character design by cory loftis, fenghua zhong, ryohei hase, ismail inceoglu and ruan jia. artstation, volumetric light, detailed, photorealistic, fantasy, rendered in octane"

Comparison: Text-to-3D

Jacobian NeRF
[Wang et al. 2022]


Text prompt: duck


Comparison: Layered Editing


Source Image



@article{liao2023text,
      title   = {Text-driven Visual Synthesis with Latent Diffusion Prior},
      author  = {Liao, Ting-Hsuan and Ge, Songwei and Xu, Yiran and Lee, Yao-Chih and AlBahar, Badour and Huang, Jia-Bin},
      journal = {arXiv preprint arXiv:},
      year    = {2023}
}

We thank the authors of Jacobian NeRF, Latent-NeRF, Text2LIVE, and StyleGANFusion.