Large language models have traditionally been associated with data centers packed full of GPUs. Models with over 100 billion parameters usually seem far beyond the reach of a home PC, even a well-equipped “gaming” PC, and having enough memory to load such a model is probably the largest roadblock. Thankfully, between my professional work and hobbies, all my personal hardware except for my laptop has a generous 128 GB of memory. The question I had was whether I could realistically run a 120-billion-parameter open-source model, and whether it would be usable.
To my surprise, it performed at a level that was genuinely usable.
My Daily Driver Desktop
The primary desktop I tested on is by no means a thread-ripping super-desktop-computer, though it’s a higher-end desktop by most standards. The specifications are:
- Intel Core i7 12700KF (12 Cores)
- 128 GB of DDR5 RAM
- NVIDIA RTX 5070 Ti with 16 GB of VRAM
I ran the model using LM Studio, which handled the heavy lifting of coordinating CPU and GPU workloads via llama.cpp. Unlike many “GPT-class” models, which insist on being mostly GPU-bound, this setup split the responsibilities between the graphics card and system memory, making the best of being limited to 16 GB of VRAM.
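To make the CPU/GPU split concrete, here is a rough back-of-the-envelope sketch in Python. It is not how LM Studio or llama.cpp actually decide placement. The bits-per-weight figure, the layer count, and the VRAM budget are illustrative assumptions of mine, and the helper names are hypothetical. It also pretends every layer is the same size and ignores the context cache and other runtime overhead.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def split_layers(total_gb: float, n_layers: int, vram_budget_gb: float) -> tuple[int, float]:
    """Return (layers that fit on the GPU, GB left for system RAM).

    Simplification: assumes all layers are the same size.
    """
    per_layer = total_gb / n_layers
    on_gpu = min(n_layers, int(vram_budget_gb // per_layer))
    return on_gpu, total_gb - on_gpu * per_layer


# Illustrative values only (assumptions, not measurements from my test):
# 120e9 parameters, 4.5 bits per weight, 36 layers, 12 GB of the 16 GB VRAM
# set aside for weights so there is headroom for everything else.
total = weights_gb(120e9, 4.5)
gpu_layers, ram_gb = split_layers(total, n_layers=36, vram_budget_gb=12.0)
print(f"weights ~{total:.1f} GB, {gpu_layers} layers on GPU, ~{ram_gb:.1f} GB in system RAM")
```

Under those assumptions only a small fraction of the model fits on the card, and the bulk sits in system memory, which is why the 128 GB of RAM matters more here than the GPU. In llama.cpp itself, the number of layers placed on the GPU is controlled with the `--n-gpu-layers` option.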
How It Performed
The initial raw numbers on my system were around 8.95 tokens per second, with a time to first token of 1.34 seconds. While this wasn’t lightning fast, it was responsive enough that I felt it was very usable for such a large model on what is basically a consumer-grade PC with a substantial amount of system memory.
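To put those two numbers in everyday terms, here is a small sketch that turns them into an estimated wait for a full response. The function name is hypothetical, and it assumes the generation rate stays constant and that the time to first token does not grow with longer prompts, neither of which is guaranteed in practice.

```python
def response_seconds(n_tokens: int,
                     tokens_per_second: float = 8.95,
                     ttft_seconds: float = 1.34) -> float:
    """Estimated wall-clock time for a response of n_tokens.

    Defaults are the figures measured in this test; a constant
    generation rate is an assumption.
    """
    return ttft_seconds + n_tokens / tokens_per_second


for n in (100, 500, 1000):
    print(f"{n:>5} tokens: ~{response_seconds(n):.1f} s")
```

By this estimate a 500-token answer takes a little under a minute, which matches how it felt in use: slower than a hosted service, but quick enough to keep working with.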

For reference, smaller models in the 13B or 30B parameter range will run substantially faster, but they also produce text of a different quality. Running something this large at an interactive pace on consumer hardware is simply outside the expectations most of us had even a year ago.
Why This Matters
There are a few angles to consider:
- Practical usability: Sub-10 tokens per second is not blazing fast, but it’s genuinely workable for most interactive writing, coding, or problem-solving use cases. I didn’t find myself waiting uncomfortably long for responses.
- Accessibility of frontier-scale models: Hosting a 120B parameter model locally represents a shift in where advanced AI can run. A desktop with lots of RAM and a decent GPU is hardly “entry-level,” but it’s nothing compared to professional inference servers.
- Ethics and Security: If you want or need the versatility and training of a 120B large language model but need to protect the information you are going to use with it, the ability to run such a model in the privacy of your own hardware is groundbreaking.
Cautions and Expectations
Despite being impressed, I think it’s important to temper expectations. Just because a model runs does not mean it’s the right fit for every task. Loading times are long, memory usage is significant, and running a machine this hard for sustained periods draws a lot of power and generates a lot of heat. There’s also a broader discussion around whether we really want to normalize everyone pushing enormous models from home versus using more efficient smaller models that are specialized or quantized.
Not to mention, things do not always go as expected. I happened to be on Discord, streaming this test to a friend of mine, and Discord had issues where the video and audio cut out. Even after I shut down the model, Discord could not recover, and I had to close it and restart it.
At the same time, from a research and experimentation perspective, being able to interact with such a large model directly is invaluable. It gives developers hands-on insight into what these systems are capable of without needing cloud credits, proprietary APIs, or exposing protected information.
Final Thoughts
Getting gpt-oss-120b to run at nearly 9 tokens per second on a consumer desktop felt like a milestone moment. A setup that many would have dismissed as totally inadequate for such a large model turned out to be a capable solution in a pinch.
Models of this size are still heavy lifts, but the direction is clear: open-source AI is closing the gap between what was once the exclusive domain of corporations and what individuals can explore at home. And with tools like llama.cpp making mixed CPU-GPU execution more accessible, we may see more enthusiasts pushing the hardware envelope in the months ahead.
For me, it wasn’t just a technical success — it was also a reminder that the AI hardware landscape is changing fast, and “impossible” claims have a short shelf life.

