Searching for Generative Train Journeys

05 November 2023
diffusion,
stylegan,
pixel2style2pixel

I've long had an obsession with trying to create a purely generated train journey with machine learning tools. Cruising through the latent space, inside the internal world of a model is super appealing, and this post documents my years long efforts to do using various generative models along the way.

In the end the rise of diffusion models made what I was trying to do possible, and if you want to see the end result, the thing I'd been striving for all these years here's the final video:

The StyleGAN era #

As many of the posts on this blog show I spent a lot of time playing with StyleGAN, and some of the very first StyleGAN models I trained myself were trying to generate non-existent train views. Turns out there are quite a lot of high quality YouTube videos of train rides taken from the cab to use as training data. Being the original StyleGAN, and some of the first GAN models I'd ever trained the results weren't great, but they were really exciting to me. Unfortunately they couldn't take me anywhere but the shifting landscapes and scenery in interpolation videos were still mesmerising. ^[0] 0. At the time I tried to fake the forward motion effect with some simple depth estimation and warping, it didn't look terrible but was pretty limited. .

Trying to move forward (pixel2style2pixel) #

Once I had a trained StyleGAN model I wanted to try travelling with it. There are some obvious approaches that I didn't try like trying to find a latent direction of "moving forward", and the method I took was inspired by the fairly convincing appearance of simply zooming into the image with some depth aware warping. If I could take an image generated by the model, warp it slightly to look like we're moving forward, then re-encode that image into the StyleGAN latent space I could recover the "next" image in the imagined journey, and keep repeating the process to create a video. To embed images into the latent space I trained a pixel2style2pixel ^[1] 1. E. Richardson et al., Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation, Jun. 2021. model on my train StyleGAN, then I generated an image, warped it, encoded the warped version in the StyleGAN latent space and generated the new image. Unfortunately my train could never move forward with this method!

Why won't my imaginary train move forward!?

Now I have a bit more experience and intuition of how StyleGAN works, it seems pretty clear that the injected noise maps are what's controlling the positions of small details like the poles and wires visible in the frame. As these noise maps are fixed during my video sequence, nothing really seems to change or move forward.

Swapping autoencoder #

After that failure I gave up on StyleGAN ^[2] 2. I did try one further StyleGAN based method which was generating infinite scrolling videos using Aligning Latent and Image Spaces to Connect the Unconnectable (ALIS), unfortunately without the effective of parallax the videos don't give the effect I wanted. , and instead of trying to generate a fully synthetic video, realised I could use a different model to fake the effect with a real video as a starting point.

A highly under-rated model of the GAN era was Swapping Autoencoder, it was trained to distentangle style from content so you could take two images and take the content/layout from one and apply the style of another. Like most GANs it worked best under a limited domain, and the released landscapes model was amazing ^[3] 3. T. Park et al., Swapping Autoencoder for Deep Image Manipulation,” 2020. .

Examples of Swapping Autoencoder outputs from the paper: "Swapping Autoencoder for Deep Image Manipulation".

I realised I could fake the effect I wanted by taking a real life video of the view from a train ride and applying the style of random other landscape photos, melting between them as the train traveled. It didn't quite achieve my goal of traveling real imagined landscapes but the effect was great.

Next frame prediction with Stable Diffusion #

Once stable diffusion came out and I had played with fine-tuning it a little I came back to my usual quest. This time I was took some side facing train video of my own and tried training a next frame prediction model. If you're not familiar with Next Frame Prediction (NFP) then it's essentially training a network to take in one image, and to predict another image that would be the next frame in a video. You can see a good tutorial on how to train your own by Derrick Schultz. The use of NFP models for making train video is very old (by machine learning standards), as seen in this amazing (and very inspirational) work by Damien Henry.

A schematic of the NFP version of Stable Diffusion.

After a while of training to predict the next frame from my tiny train dataset the results were almost too good. Recursive next frame prediction to make a video actually turns out so realistic it's kind of boring.

This is a video is made by giving the NFP Stable Diffusion model a single starting frame example, then continuously getting it to predict a new frame, and then feeding that frame back in to create a video.

To make it more interesting I needed to add some more levers of control. To do this I repurposed my image variations version of stable diffusion, and again fine-tuned it on the next frame prediction task. But this time instead of a only receiving the previous frame as input it also gets the clip image embedding of that frame. At inference time this lets you change the image embedding to anything you want which controls much of the style of the image, whereas the previous frame is most important for the global structure. This gives dream like videos of traveling sideways but with shifting and changing styles, colours and content depending on the clip embeddings used at each frame.

The final cut #

Putting all the pieces together I generated long sequences of frames where I would change the clip embeddings to that if various images and artwork. Sifted through for the section that worked best together and put them all together with an soundtrack assembled from various free audio samples of train noises.

You can jump to the final here, otherwise there are also some nice still frames that come out of this technique:

Previous: Stable Diffusion Image Variations
Next: Side notes and a Gallery