How to Turn Your AMD GPU into a Local LLM Beast: A Beginner’s Guide with ROCm

Those of us with NVIDIA GPUs, particularly ones with enough VRAM, have been able to run large language models locally for quite a while. I did a guide last year showing you how to run Vicuna locally, but that really only worked on NVIDIA GPUs. Support has improved a little since then, including the option of running it all on your CPU instead, but with AMD’s ROCm software now in the picture, running large language models locally isn’t just possible, it’s insanely easy. If you’re just here for the guide, skip ahead to the how-to section. If you want to know how this works so damn well, stick around!

Technically speaking, ROCm – formerly known as the Radeon Open Compute platform – isn’t actually new. AMD launched it back in 2016, although the more polished “stable” releases have really only arrived within the last year or so. ROCm, similar to NVIDIA’s CUDA, is a platform of tools that allows your graphics card to act as a general-purpose processor, rather than the specialised beast it normally is. It turns out that having thousands of cores available to do work simultaneously is pretty handy! NVIDIA has dominated this space, as developers need to integrate CUDA into their applications to leverage the benefits of offloading work to the graphics card, and until now AMD hasn’t really had a comparable option that developers could integrate. There was OpenCL, but if you’ve ever tried using an AMD GPU for compute work, you’ll know that’s not great. ROCm, then, offers a wide set of tools for a whole range of applications. It’s open source and free to use, marking a significant change from NVIDIA’s proprietary licensing arrangement.

ROCm actually comprises a whole bunch of different tools, and for machine learning there are a few interesting ones. MIVisionX is a set of computer vision tools, MIGraphX (and Torch-MIGraphX) lets models – including PyTorch models – run through AMD’s graph inference engine, and MIOpen is AMD’s open-source deep learning primitives library. Together, these tools allow programs like LM Studio to add ROCm support with relative ease, and that means you, as the end user, get the compute performance to run things like large language models locally!
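If you want to sanity-check that this stack is actually being picked up on your own machine, the ROCm build of PyTorch is an easy way to do it. Here’s a minimal sketch, assuming you’ve installed PyTorch from AMD’s ROCm wheels (the ROCm build reuses the familiar torch.cuda API, just backed by HIP):

```python
import torch

# On a ROCm build, torch.cuda.* calls are backed by HIP rather than CUDA.
print("GPU available:", torch.cuda.is_available())
print("HIP version:", torch.version.hip)  # None on CUDA-only or CPU builds
if torch.cuda.is_available():
    # Should report your AMD card, e.g. an RX 7600 XT
    print("Device:", torch.cuda.get_device_name(0))
```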

It also turns out you don’t need a 7900 XTX to run this well – Gigabyte sent over their RX 7600 XT, a card that AMD marketed specifically for use with large language models thanks to its hefty 16GB of VRAM. This bad boy is currently retailing for around £330, comes with a triple-fan cooler and only needs two 8-pin PCIe power connectors, and as you’ll find out in an upcoming video it’s actually a pretty decent gaming card too – plus, as it turns out, it’s perfect for LLMs!

So, how do you make this work? It is ridiculously easy. Head to LM Studio’s website, specifically lmstudio.ai/rocm, and download LM Studio with beta ROCm support. Once it’s installed, let it close itself and open again, and you’re 90% done. Head to the search tab and find some models. If you want to use Llama 2, AMD recommends the Q4_K_M version from TheBloke. Just hit download, give it a minute, then head to the chat tab. On the right-hand side are all the settings – the key one to check is that LM Studio detected your GPU as “AMD ROCm”. Tick the “GPU Offload” checkbox, set the GPU layers slider to max, select the model at the top, and that’s it. Use it like ChatGPT – except here you can even change the system prompt, which works impressively well!
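Under the hood, LM Studio runs these GGUF models with llama.cpp, and that “GPU layers” slider controls how many of the model’s layers get offloaded to the card. If you’d rather script it than click it, here’s a rough sketch using the llama-cpp-python bindings – this assumes you’ve got a ROCm/HIP-enabled build of the package installed, and the model path is just a placeholder for whatever GGUF file you’ve downloaded:

```python
from llama_cpp import Llama

# n_gpu_layers=-1 offloads every layer to the GPU – the scripting
# equivalent of dragging LM Studio's "GPU layers" slider to max.
llm = Llama(
    model_path="llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,
    n_ctx=4096,
)

out = llm("Q: What does GPU offload actually do? A:", max_tokens=64)
print(out["choices"][0]["text"])
```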

Actually using this is ridiculously smooth and fast. I’m genuinely amazed – like, this is a 7600 XT. This isn’t a top-end £1000 card, it’s a budget card, and yet I’m getting responses faster than ChatGPT and Gemini. It’s crazy how fast this is, and even more interestingly, the VRAM usage with this 7 billion parameter model is insanely low. It’s only using around 6GB – and while I appreciate that’s as much VRAM as many of you have in total, in my experience running LLMs on NVIDIA cards, VRAM usage climbs a lot in use, and the RTX 3060 Ti I often use gets overwhelmed easily.
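That ~6GB figure actually lines up with some back-of-envelope maths. A Q4_K_M quant works out to roughly 4.8 bits per weight – treat that number as my assumption, it varies a bit by model – so the weights alone come in well under the card’s 16GB:

```python
params = 7e9           # 7 billion parameters
bits_per_weight = 4.8  # rough average for a Q4_K_M quant (assumption)

weights_gb = params * bits_per_weight / 8 / 1024**3
print(f"Weights alone: ~{weights_gb:.1f} GB")  # ~3.9 GB
# Add the KV cache and runtime overhead on top and you land in the 5-6GB range.
```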

One of the interesting features of LM Studio, besides the amazing design and ease of use, is access to the system prompt right there in the settings panel. That means you can change how the model responds – quite drastically too. The system prompt, for those that don’t know, is basically the bit of text the software injects before your prompt to set the context for the response. ChatGPT’s system prompt is likely incredibly long, and details things like the tone of the response, as well as, I’m sure, an awful lot of limitations. DPD, the parcel delivery company, had a bit of an issue with their AI chatbot in January, and during that time someone managed to get it to reveal its system prompt. I’ll put it on screen so you can pause and have a look, but in short it’s a wall of text limiting how the LLM should handle the response – things it shouldn’t do, and things it should. I don’t think their prompt was amazing – I mean, the bot was swearing at people, so clearly not – but it gives you an idea of what you can do to get the sort of responses you want. I’ve got a simple example here too. I told it to answer as a trusted friend, as kindly as possible. The tone difference in the replies is amazing. It goes from being pretty clinical to being a weirdly uncanny-valley ‘friend’. It even started using emojis in the responses! That’s a pretty big difference.
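If you want to experiment with system prompts outside the chat tab, LM Studio also has a local server mode that exposes an OpenAI-style API – it’s off by default, and while port 1234 is the usual default, treat the specifics here as assumptions about your setup. A quick sketch with a custom system prompt:

```python
import requests

# Assumes LM Studio's local server is running (default port 1234) and a
# model is already loaded – the server answers with whatever is loaded.
resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "messages": [
            # The system prompt: sets tone and ground rules before the user's text.
            {"role": "system", "content": "Answer as a trusted friend, as kindly as possible."},
            {"role": "user", "content": "I'm nervous about a job interview tomorrow."},
        ],
        "temperature": 0.7,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```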

I’m still pretty blown away by the performance here. Not only are we barely touching the 16GB of VRAM we have on tap, but it’s responding incredibly quickly regardless of the prompt. Of course, this is the 7 billion parameter model, but there are higher parameter count versions available. I did try Vicuna 13B, and that loaded fine, using around 10GB of VRAM. That’s still great! I also tried loading the 30 billion parameter version, but that failed to allocate the 20GB it wanted, so that won’t work, at least on this card. I can’t imagine what the 70B models would need…
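Running the same rough maths as before gives a decent idea of why, and a ballpark answer for the 70B question – again assuming roughly 4.8 bits per weight for a Q4_K_M quant:

```python
bits_per_weight = 4.8  # rough Q4_K_M average (assumption)

for name, params in [("7B", 7e9), ("13B", 13e9), ("30B", 30e9), ("70B", 70e9)]:
    gb = params * bits_per_weight / 8 / 1024**3
    print(f"{name}: ~{gb:.0f} GB of weights before KV cache and overhead")
# Roughly: 7B = 4GB, 13B = 7GB, 30B = 17GB, 70B = 39GB – so at 4-bit,
# anything past 13B is a squeeze on a 16GB card without offloading to the CPU.
```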

There are hundreds, if not thousands, of different models available, so obviously I can’t test them all here, but here’s what to look for if you’re giving this a try. The primary thing is to find a model that lists “FULL GPU OFFLOAD POSSIBLE”. That means it should load entirely into your GPU’s VRAM rather than run on your CPU – although even on the CPU it doesn’t seem heinously slow – and so run as fast as possible. Beyond that, the higher the quantisation bit count – that’s the Q number – the better the quality you generally get out; 4-bit seems to be pretty common. Otherwise, experiment with it, and I’d love to hear how you get on in the comments below!