Seville, Spain
Seville, Spain
+(34) 624 816 969
Lately, you see it a lot on social media, right? Screenshots of terminals downloading models with ollama pull, debates about whether Llama 3.1 surpasses Qwen3 in Spanish, or people running code assistants on their own laptops without an internet connection. If just a year ago 'AI' was synonymous with ChatGPT (and paying a subscription), the landscape of 2025 is radically different.
We are living through a true 'Cambrian explosion' of language models, and the real revolution, the quietest one, isn't happening in the gigantic cloud datacenters, but on our own hard drives.
Just yesterday, I came across a listing of models available on platforms like Ollama or Hugging Face, and it's overwhelming: gpt-oss, qwen3-vl, deepseek-r1, llama3.1, phi3, gemma3... Names that accumulate millions of 'pulls' (downloads), each with dozens of 'tags' (versions) indicating their size or specialization.
The question we all ask in the industry (and that you're surely asking yourself) isn't just 'which one is the most powerful,' but 'what is each one for?' and, above all, 'can I run this on my machine?'
At ForgeNEX, where we breathe IT and development daily, this conversation is our bread and butter. So let's bring order to this chaos, compare the most relevant models from that list, and understand what battle is being fought in the field of local AI.
Table of contents [Show]
First, a quick context. Why now? Mainly, thanks to Meta. When they released Llama 2, and especially now with Llama 3 and Llama 3.1, they reshuffled the deck. They demonstrated that it was possible to create high-performance models and offer them in an 'open-weight' format (a more precise term than 'open-source'), allowing anyone to download, modify, and run them wherever they want.
This has forced everyone's hand: Google has responded with Gemma, Microsoft with Phi-3, and Chinese giants like Alibaba (with Qwen) and 01.AI (with DeepSeek) have entered the global competition with incredibly powerful models.
An important note about 'Spanish': although the user asked for 'models in Spanish,' the reality is that most of these models (Llama, Qwen, Mistral) are not specifically Spanish. They are global models trained with such a massive amount of multilingual data (trillions of tokens) that their performance in Spanish is simply spectacular. They have far surpassed the old native models and have become the default option.
Comparing Llama 3.1 with embeddinggemma is like comparing a Formula 1 car with a freight truck. Both have an engine, but they serve radically different purposes.
To understand the battle, we first need to group the contenders from that long list:
These are the models that try to do everything well: chat, reason, summarize, program... They are the direct competition to GPT-4.
These models give up being good at everything to be excellent at one thing.
qwen3-coder, deepseek-v3.1 (with its thinking mode). They are trained with more code than natural language.qwen3-vl, llava. These models 'see'! You can give them an image and ask questions about it.gpt-oss, glm-4.6. Models designed not just to chat, but to use tools (APIs, functions), bringing them closer to the idea of autonomous 'agents.'
Small, fast, and surprisingly capable models, designed to run on devices with limited resources.
These are the 'invisible' but crucial models for businesses.
embeddinggemma, nomic-embed-text: You don't chat with them. Their job is to convert text (like your SharePoint documents, PDFs, etc.) into numerical vectors. They are the engine of RAG (Retrieval-Augmented Generation), the technique that allows an AI to 'read' your private documents and answer questions about them.
Okay, let's get to the point. You have a project, which model do you download?
You're looking for an intelligent chat companion to help you draft emails, brainstorm, or explain complex concepts in fluent and natural Spanish.
You're tired of Copilot taking too long to respond or don't want to send your proprietary code to the cloud. You need a local code assistant.
You want an AI that understands images. You give it a screenshot of an error and it tells you how to fix it, or a photo of a chart and it summarizes it for you.
This is the star use case at ForgeNEX. You want an AI that answers questions using your company's database, product manuals, or internal OneDrive documents.
nomic-embed-text or embeddinggemmanomic-embed-text is famous for its large context window, allowing it to 'embed' large documents with high fidelity. It's the engine that powers your vector database.
This is where the rubber meets the road. All this sounds great, but does it work on my laptop? The short answer is: it depends on your VRAM (your graphics card's memory).
In local AI, VRAM is the new RAM. It's where the model's 'weight' (parameters) is loaded.
Here's a quick guide to real requirements for running these models (using quantization, GGUF/Q4, which is the standard):
Phi-3-mini, Llama 3.2 (3B), Mistral (7B) (just barely).
Llama 3.1 (8B), Phi-3-Medium (14B).
Qwen3 (32B), Llama 3.1 (70B).
Llama 3.1 (405B), Qwen3 (480B).
The era of a single giant cloud model that does everything is giving way to a much richer ecosystem. The future isn't a single 400B model, but an intelligent 'agent' on your own PC that knows when to use Phi-3 (for a quick task), when to call Qwen3-VL (to see an image), and when to wake up Llama 3.1 70B (to draft a complex report).
The real power has shifted from 'renting' AI in the cloud to 'controlling' AI at the edge. And for us at ForgeNEX, that's the most exciting part: designing solutions that use the right tool, in the right place, whether on a cloud server or on the end-user's laptop.
The battle of the models is on, and the winners, for now, are us.