THIS AI CAN DRAW ME!!! – Dreambooth & Stable Diffusion
These pictures aren’t photoshopped pictures of me, and nobody drew or edited them. They come from two artificial intelligence tools called Stable Diffusion and Dreambooth. Let me explain.
Stable Diffusion is a machine learning tool that takes a pattern of random noise and uses a neural network to gradually pull an image out of that noise, guided by a text prompt. It’s pretty magical really, and it lets you create some amazing shots. Dreambooth is another machine learning tool that takes images of a person – or even a whole art style – and fine-tunes a model to recognise and recreate that person or style. Combining the two means you can put yourself into the Stable Diffusion images. To be clear, this isn’t just fancy Photoshop: it isn’t taking an actual picture of me and morphing it onto something else. This is a neural network that has genuinely learned my face and can recreate it in any style or form I ask it to.
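If you’re curious what that looks like as code rather than a web UI, here’s a minimal text-to-image sketch using the Hugging Face diffusers library. This is just my own illustration of the “prompt in, image out” idea – the rest of this video uses the Automatic1111 web UI instead – and the checkpoint name is an assumed example.

```python
# Minimal text-to-image sketch with Hugging Face diffusers (illustrative only;
# the video itself uses the AUTOMATIC1111 web UI). The checkpoint id is an
# assumed example, and this needs a CUDA GPU as written.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed Stable Diffusion 1.5 checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "a portrait of an astronaut, oil painting",
    num_inference_steps=30,   # how many denoising steps to run on the noise
    guidance_scale=7.5,       # CFG: how strongly to follow the prompt
).images[0]
image.save("astronaut.png")
```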
In this video, I want to give a rundown of how you can set this up on your own, and talk about my experience playing around with the software for the last week or so. This is very much what you’d call “bleeding edge” tech, so anything I show here will probably be out of date by some time next week – it’s not worth going too deep into a how-to, but here’s a brief overview.
Dreambooth in particular requires a whole lot of GPU power, and especially VRAM. Because it’s actively training the model rather than just running inference, it’s a memory- and compute-intensive process. That means you’ll want a beastly system like this one from Cyberpower – complete with an RTX 4090 – although if you’re among the 99% of people who don’t have a 4090 on hand, you can run it through various cloud services instead. You’ll need Python 3.10.6 and git installed. Then clone the GitHub repo from Automatic1111 linked in the description, and download the Stable Diffusion model itself – it’s linked in the Automatic1111 dependencies file and on Hugging Face – which you drop into the models/Stable-diffusion folder. Use PowerShell to git clone the repo, change directory into the new folder, then run webui.bat (assuming you’re on Windows, anyway). Let it install everything, and you can start making some incredible images. That part doesn’t need a 4090 or 24GB of VRAM, although it does help.
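Assuming the repo in question is the AUTOMATIC1111/stable-diffusion-webui project (the actual link is in the description), the Windows setup described above boils down to something like this in PowerShell:

```powershell
# Rough sketch of the setup described above, assuming the repo in question is
# the AUTOMATIC1111/stable-diffusion-webui project (the real link is in the
# video description). Requires Python 3.10.6 and git to already be installed.
git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git
cd stable-diffusion-webui

# Drop the downloaded Stable Diffusion checkpoint into .\models\Stable-diffusion\
# before launching.

# The first run installs its own dependencies, then starts the web UI
.\webui.bat
```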
For Dreambooth, head to the extensions tab, load the available extensions, then click install on the Dreambooth extension. Assuming it installs OK and isn’t broken when you try to use it – which it very well could be – hit apply and restart, then head to the new Dreambooth tab. You’ll want to use Stable Diffusion to generate a few hundred generic faces, then create a folder with as many pictures of your own face as you can find (or take). They’ll need to be 512×512, ideally with your face filling most of the frame. Then paste the paths to those folders into the setup, give it an instance prompt (“a photo of yourname person”), a class prompt (“person”), click the “optimise for people” button, then hit train. This might take a while. Check the output directory for sample images to see how the training is going – I had to train my models multiple times, on the order of 6,000–10,000 steps, to get a good result. Once it’s done, hit “Generate Checkpoint”, hit refresh on the checkpoint list, then use your prompt to have it recreate you!
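If your photos aren’t already square, a few lines of Python will crop and resize them for you. This is a quick sketch of my own using Pillow – not something the extension provides – and the folder names are just placeholders.

```python
# Hypothetical helper (my addition, not part of the Dreambooth extension) for
# prepping training photos: centre-crop each image to a square and resize it
# to the 512x512 the extension expects. Folder names are placeholders.
from pathlib import Path
from PIL import Image

SRC = Path("raw_photos")        # wherever your original photos live
DST = Path("training_photos")   # folder you'll point the Dreambooth setup at
DST.mkdir(exist_ok=True)

for path in SRC.glob("*.jpg"):
    img = Image.open(path).convert("RGB")
    side = min(img.size)                                   # shortest edge
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    img = img.crop((left, top, left + side, top + side))   # centre square crop
    img = img.resize((512, 512), Image.LANCZOS)
    img.save(DST / f"{path.stem}.png")
```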
I’ve been playing with this – both Stable Diffusion itself and a couple of Dreambooth models – and I want to talk a bit about what this is and isn’t good for. Obviously this is a frankly amazing tool, but it definitely has its quirks. One of the really interesting things is that the canvas size has a big influence on the composition you get. If you leave it at 512×512, it generally does close-up portrait shots. Set it to a wider aspect ratio and it’ll generally do wider, more landscape-style shots. Set it tall and narrow and it’ll do more full-body portraits. What’s more amusing, though, is that with anything other than a square it often draws multiple copies of the subject – frequently morphing into each other. That comes back to the whole noise-pattern thing: the model isn’t really aware of the larger canvas or of what it’s drawing out of the noise, so it effectively starts the drawing process in multiple locations at once, and you end up with multiple copies. There are a number of little quirks like that you’ll find, and it can give you some really weird or even gruesome results…
Even with the model of my face, depending on your settings it can be pretty hard to get what you’re looking for out of it. The settings can make a big difference to the result – for example, the CFG setting basically determines how closely the output has to follow the prompt. The lower the number, the more “creative” it’ll be. The art is in finding the balance between accuracy to the prompt – getting something that actually looks like me – and letting it be creative enough to match the style I’m asking for. If you can’t get it to output anything near what you want, you can swap checkpoints back to the standard Stable Diffusion model, create whatever you like with that, then use the “send to inpaint” option, switch checkpoints again, mask out the face and ask it to generate your face instead. The “Denoise” option isn’t what it sounds like – it’s the denoising strength, which controls how much the result is allowed to differ from what’s under the masked area. Set it higher and it’ll disregard the source image more; set it lower and it sticks closer to the original. The trick is that the higher the denoising strength, the less the repainted area will match the rest of the image.
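For anyone who’d rather do that inpainting step in code, here’s a rough diffusers-based sketch of the same idea. It’s my own illustration rather than what the web UI does under the hood, the checkpoint and file names are assumptions, and exact parameter names can vary a little between diffusers versions.

```python
# Rough sketch of "repaint just the face" with the diffusers inpainting pipeline.
# Illustrative only -- the video does this via the web UI's "send to inpaint"
# button. Checkpoint id and file names are assumptions; needs a CUDA GPU.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",   # assumed inpainting checkpoint;
    torch_dtype=torch.float16,                # in practice you'd load your own
).to("cuda")                                  # Dreambooth-trained checkpoint here

init = Image.open("base_render.png").convert("RGB").resize((512, 512))
mask = Image.open("face_mask.png").convert("RGB").resize((512, 512))  # white = area to repaint

result = pipe(
    prompt="a photo of yourname person",  # the Dreambooth instance prompt
    image=init,
    mask_image=mask,
    guidance_scale=7.5,   # CFG: higher sticks to the prompt, lower is more "creative"
    strength=0.75,        # like the denoising slider: how far to stray from the source
).images[0]
result.save("face_inpainted.png")
```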
One of the really interesting features is the ability to give it a source image and have the AI auto-magically improve the design. Here’s a terrible shot I made in Paint in two minutes, and with a few tweaks and an accompanying text prompt, Stable Diffusion spat out this. Seriously, how much better does that look! It’s clearly still based on my Paint masterpiece, but it’s much, much better.
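Under the hood this is the img2img workflow: start from an existing picture instead of pure noise, and only partially re-noise it before generating. Here’s a hedged diffusers sketch of the same trick – again my own illustration, with assumed file and checkpoint names, not the web UI’s implementation.

```python
# Rough img2img sketch with diffusers: take a two-minute Paint doodle and let
# Stable Diffusion redraw it guided by a prompt. Illustrative only -- the video
# uses the web UI's img2img tab. File names and checkpoint id are assumptions.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed Stable Diffusion 1.5 checkpoint
    torch_dtype=torch.float16,
).to("cuda")

doodle = Image.open("paint_doodle.png").convert("RGB").resize((512, 512))

better = pipe(
    prompt="a castle on a cliff at sunset, detailed digital painting",
    image=doodle,
    strength=0.6,         # keep the doodle's composition but redraw most of the detail
    guidance_scale=7.5,
).images[0]
better.save("improved_doodle.png")
```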
Photorealism generally isn’t this thing’s strong suit, though – the more stylised an image is, the better it’ll look. The more photo-real images look fine from a distance, but closer inspection reveals quite a few flaws. Also, beware of errant hands: it both can’t draw hands well and draws way, way too many of them. Seriously, it’s an epidemic at this point…
Here’s the thing: despite how it seems, this is still a tool for artists. At least at the moment, it’s almost impossible to get fully production-ready shots out of this without at least some tweaking in something like Photoshop. I can see it being a really useful tool for artists to use as a starting point, and I’m sure some talented people will be able to make the most of it as-is, but a lot of skill goes into crafting the right prompt alone, let alone tweaking the settings and using extra tools like inpainting. One of the risks with a tool like this is the ease with which someone can make use of someone else’s style, characters or designs. If I put in “Simpsons”, it gives me entirely new pictures of Homer. Corridor Crew used an artist’s style, there are models for Studio Ghibli’s art style, and plenty more – so it’s not hard. What’s arguably more worrying is that with a tool like Dreambooth, you can train a model of anyone you can get even a handful of pictures of. Impersonation and deepfake-style images are even easier than before, and that’s a genuine concern.
Depending on how you view it, the good or bad news is that this tech is only going to improve. Training models will become less resource-intensive, services will pop up to let you do it more easily, I’m sure, and the image generation will get more accurate over time too. The pre-trained models will get better, more content-rich and more accurate, and the tools will slowly become easier to use. Whether that’s a good thing or not I’ll leave up to you, but I know that for someone like me, with very little artistic talent, this is a whole new level of accessibility for making some really cool stuff.