Justin Pinkney

Experiments in Image Variation

Last touched December 17, 2022

I’ve been doing a bunch of quick experiments for my improved CLIP Image conditioned version of Stable Diffusion v1. We can actually do lots of fun stuff with this model including transforming this Ghibli house into a lighthouse!

First off you can find the actual model on huggingface hub. Throughout the post we’ll be using this crop from a Ghibli frame as our test image.

output 6 0

This is what the standard variations model produces when you feed this image in:


That was classifier free guidance scale 4, we can play with different levels of guidance: (0, 1, 2, 4, 8)

output 10 10

The image is turned into a CLIP image embedding, this is the condition vector. One thing I noticed is that because I set the unconditional embedding to zeros, scaling a condition vector has the effect of increasing its “strength”, this effect is different to the usual classifier free guidance. We can play with the length of this vector to control the “strength” of the conditioning by multiplying the vector by 0.25, 0.5, 1, 1.5, 2:

output 12 2

We can also mix different embeddings by averaging them together. Let’s mix between the embedding for our Ghibli image and a matisse painting. 0%, 25%, 50%, 75%, and 100% matisse:

output 14 2

Another way to mix is to concatenate the embeddings, yes we can have more than one if we want to. Here we do the following

  • ghibli, ghibli
  • ghibli, uncond
  • ghibli, matisse

output 16 2

We can also use text embeddings to mix with our image embedding, we can mix with the word “flowers” to add flowers. (The variations model actually works fine with text embeddings btw). At the far right it’s only using the text “flowers” (we re-invented a text to image model!). 0%, 25%, 50%, 75%, and 100% “flowers”:

output 18 2

We can do the same trick of concatenating the embeddings instead of averaging, it’s a bit funny though

  • ghibli, flowers
  • ghibli, flowers, flowers
  • ghibli, ghibli, flowers
  • ghibli, ghibli, flowers, uncond, uncond

output 20 8

More natural is to maybe use a direction in text embedding space to edit our ghibli conditioning. We can use the direction “still water” -> “wild flowers” and add this at different scales to your original conditioning. Here it’s adding 0%, 10%, 25%, 50% and 100% of the edit vector “still water” -> “wild flowers”:

output 22 2

Changing tack we can experiment with adding noise to our condition vector with scales = 0, 0.1, 0.2, 0.5

output 24 2

Or use multiplicative noise instead, scales = 0, 0.2, 0.5, 0.75

output 26 2

If we take random crops when computing the embeddings we get zoomed in images mostly, a bit boring.

output 28 2

Image inversion

Now we play with DDIM inversion, followed by editing, as shown in the DALLE-2 paper. We invert our image using standard DDIM inversion as in the original stable diffusion repo., we need to use lots of timesteps. Then check we can decode it to pretty much the same image (if we use cfg it’s more saturated). To get good inversion results we use: ddim_steps = 500, start_step=1, cfg scale = 3 (careful with cfg scale! not too high)

For decoding you don’t need to use so many timesteps, so it wont take so long, decode with 50 timesteps and you get the original picture back

output 33 2

Now we use our same text diff as before to replace the water with flowers! 10%, 20% 50% 100% and 150% edit:

output 35 2

Or how about turn the house into a lighthouse? 10%, 20% 50% 100% and 150% edit:

output 37 2

We can also use ddim_eta to make variations of our original, but ones that closely match in composition. Here each row is for ddim_eta 0.3, 0.6, and finally 1:

output 39 2

output 39 5

output 39 8

We can make more of these steadily ramping up the value of eta and make a variations video:

Last of all we can combine subtle variation (ddim_eta=0.4) with text diff editing: “ghibli matte painting” -> “dslr leica photo”

output 41 2

💬 If you want to comment say hello on Twitter.
© 2022, Justin Pinkney