Unlocking Large Language Models at Home

August 23, 2025

Unlocking Large Language Models at Home

Large language models have traditionally been something you associate with data centers packed full of GPUs. Models with over 100 billion parameters usually seem far beyond the reach of a home PC, even a well-equipped “gaming” PC. Having enough memory to load such a model being probably the largest roadblock. Thankfully, between my professional work and hobbies. All my personal hardware except for my laptop has a generous 128GB of memory. The question I had was whether I could realistically run a 120-billion parameter open-source model and would it be usable.

To my surprise, it performed at a level that was genuinely usable.

My Daily Driver Desktop

My primary desktop I tested on is by no means a thread-ripping super-desktop-computer, though it’s a higher-end desktop by most standards. The specifications are:

Intel Core i7 12700KF (12 Cores)
128 GB of DDR5 RAM
NVIDIA RTX 5070 Ti with 16 GB of VRAM

I ran the model using LM Studio, which handled the heavy lifting of coordinating CPU and GPU workloads via Llama.cpp. Unlike many “GPT-class” models which insist on being mostly GPU-bound, this setup split the responsibilities between the graphics card and system memory, making the best of being limited to 16GB of VRAM.

How It Performed

The initial raw numbers on my system what that it managed around 8.95 tokens per second, with an initial time to first token of 1.34 seconds. While this wasn’t lightning fast, it was responsive enough that I felt it was very usable for such a large model on what is basically a consumer grade PC that has a substantial amount of system memory.

Performance monitoring screen showing the utilization metrics of an NVIDIA GeForce RTX 5070 Ti graphics card, including CPU usage, memory statistics, and temperature.

For reference, smaller models like 13B or 30B parameters will run substantially faster, but they also produce different qualities of text. Running something this large at an interactive pace on consumer hardware is simply outside the expectations most of us had even a year ago.

Why This Matters

There are a few angles to consider:

Practical usability: Sub-10 tokens per second is not blazing fast, but it’s genuinely workable for most interactive writing, coding, or problem-solving use cases. I didn’t find myself waiting uncomfortably long for responses.
Accessibility of frontier-scale models: Hosting a 120B parameter model locally represents a shift in where advanced AI can run. A desktop with lots of RAM and a decent GPU is hardly “entry-level,” but it’s nothing compared to professional inference servers.
Ethics and Security: If you want or need the versatility and training of a 120b large language model but need to protect the information you are going to be using with it. The ability to run such a model on the privacy of your own hardware groundbreaking.

Cautions and Expectations

Despite being impressed, I think it’s important to temper expectations. Just because a model runs does not mean it’s the right fit for every task. Loading times are long, memory usage is significant, and running a machine this hard for sustained periods does draw a lot of heat and power. There’s also a broader discussion around whether we really want to normalize everyone pushing enormous models from home versus using more efficient smaller models that are specialized or quantized.

Not to mention, things do not always go as expected. I happened to be on Discord and was streaming this test to a friend of mine. Discord had issues where the video and audio cut out. Even after the shutdown the model, Discord could not recover, and I had to close it and restart it.

At the same time, from a research and experimentation perspective, being able to interact with such a large model directly is invaluable. It gives developers hands-on insight into what these systems are capable of without needing cloud credits, proprietary APIs, or exposing protected information.

Final Thoughts

Getting gpt-oss-120b to run at nearly 9 tokens per second on a consumer desktop felt like a milestone moment. A setup that would have been dismissed as totally inadequate for such a large model by many turned out to be a capable solution in a pinch.

Models of this size are still heavy lifts, but the direction is clear: open-source AI is closing the gap between what was once the exclusive domain of corporations and what individuals can explore at home. And with tools like Llama.cpp making mixed CPU-GPU execution more accessible, we may see more enthusiasts pushing the hardware envelope in the months ahead.

For me, it wasn’t just a technical success — it was also a reminder that the AI hardware landscape is changing fast, and “impossible” claims have a short shelf life.

Blog Post

AI, gpt-oss-120b, LLM, Privacy

Posted by:

TheTechDjinn

About Me

I’m a lifelong learner and passionate hobbyist with over 25 years of experience building and managing IT infrastructure—both professionally and for personal projects. My expertise spans Linux systems, servers, networks, storage, security, databases, and even some software development. I take pride in being a well-rounded technologist who enjoys solving complex problems and exploring new technologies.

I grew up in the Dallas–Fort Worth area, but when my career plateaued and I craved new challenges, a friend told me, “If you can make it in New York, you can make it anywhere.” Inspired, I moved to New York on January 1, 2005. That leap of faith took me from being a Linux-focused IT Infrastructure engineer to becoming the Chief Technology Officer of a medium sized company. I have since stepped away from the CTO role and have settled into a full-time consultant role.

Today, I’m diving into the world of Artificial Intelligence and Machine Learning—not just because it represents a paradigm shift in technology, but because of the immense opportunities and, more importantly, the challenges it presents. I’m excited to be part of this next frontier.

I was recently asked where the name The Tech Djinn came from. Back in the mid-90s. I was playing old Telnet MUD (Multi-User Dungeon) game called MajorMUD. I had created a Druid character and named him Djinn which is an intelligent spirit of lower rank than Angels. Basically, a genie. A few years later, I needed a username for site or service and as I was fond of the term Djinn and worked in the tech industry. I sewed them together and the name The Tech Djinn was created.

Unlocking Large Language Models at Home

Share this:

Like this:

Work in Progress...

Discover more from The Tech Djinn