
Gemini: free model releases and updates on open-weights availability


Google Gemini’s open-weights strategy is evolving quickly, with a growing lineup of small, efficient models released under permissive licenses and ongoing updates to context windows, token quotas, and experimental vision features. These models are increasingly used for research, mobile integrations, and lightweight on-device deployments, and they are part of Google’s broader effort to make Gemini’s ecosystem more accessible to developers.



Google expands the Gemini Nano open-weights lineup.

Google’s Gemini Nano series is the core of the open-weights initiative, designed for edge devices, low-latency applications, and personal fine-tuning experiments. The two current models, Nano-7B and Nano-20B, were released under the Apache-2.0 license, allowing broad community adoption without restrictive commercial terms.


Recent updates have clarified important specifications. While earlier documentation referred to an 8,000-token context window, the official model cards confirm that Nano models support up to 16,000 tokens via rotary position embeddings (RoPE). Additionally, as of 18 August, Google has introduced int4 GPTQ checkpoints, cutting on-disk size by nearly 75% with only a minimal accuracy trade-off.
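As a rough illustration, a pre-quantized GPTQ checkpoint of this kind can typically be loaded through the Hugging Face transformers stack (with the optimum / auto-gptq backend installed). The repository id below is hypothetical; substitute the name from the official model card.

```python
# Minimal sketch: loading a pre-quantized int4 GPTQ checkpoint through
# Hugging Face transformers (requires the optimum / auto-gptq backend).
# The repo id below is hypothetical; use the name from the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemini-nano-7b-gptq-int4"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # place layers across available GPU/CPU memory
)

prompt = "Summarize the benefits of open-weights models in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```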

| Model | Parameters | License | Context Window | Latest Update |
| --- | --- | --- | --- | --- |
| Gemini Nano-7B | 7B | Apache-2.0 | 16,000 tokens | Int4 GPTQ release (18 Aug) |
| Gemini Nano-20B | 20B | Apache-2.0 | 16,000 tokens | Int4 GPTQ release (18 Aug) |

These models are compatible with Colab Pro+ for LoRA fine-tuning and fit within an 8 GB GPU memory budget, making them particularly well-suited for rapid prototyping.
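A minimal LoRA sketch with the PEFT library is shown below, assuming the Nano checkpoints load through transformers. The base repo id and target module names are assumptions, and the rank/alpha values are illustrative defaults rather than tuned recommendations.

```python
# Minimal LoRA sketch with PEFT, sized to stay within an 8 GB GPU budget.
# The base repo id and target module names are assumptions; r/alpha are
# illustrative defaults, not recommendations from the model card.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "google/gemini-nano-7b",              # hypothetical repo id
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_cfg = LoraConfig(
    r=8,                                   # low-rank adapter dimension
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # assumed attention projection names
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()         # typically well under 1% of weights

# ...train with transformers.Trainer or a custom loop, then save only the adapter:
model.save_pretrained("nano-7b-lora-adapter")
```

Because only the adapter weights are trained and saved, iteration stays cheap even when the frozen base model dominates the memory budget.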



Gemini µ-2B strengthens lightweight deployment scenarios.

The Gemini µ-2B release in July targets mobile inference, browser integrations, and ultra-low-memory environments. Contrary to earlier reports of a CC-BY-SA license, Google clarified that the model is published under Apache-2.0, consistent with the Nano series.

This model prioritizes latency optimization, supporting efficient deployments in WebAssembly runtimes and experimental Android frameworks. Benchmarks show inference times under 180 ms on consumer-grade ARM CPUs, enabling native experiences without requiring dedicated accelerators.
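For context, a figure like this is usually quoted as a median over warm runs. The harness below sketches that measurement; run_inference() is a placeholder stub for whichever µ-2B runtime is deployed (WASM bridge, Android framework, plain CPU).

```python
# Rough timing harness matching how sub-180 ms figures are usually measured:
# warm-up runs discarded, median reported. run_inference() is a placeholder
# stub; replace it with a real call into the deployed runtime.
import statistics
import time

def run_inference(prompt: str) -> str:
    time.sleep(0.01)  # stand-in for an actual model invocation
    return ""

def median_latency_ms(prompt: str, warmup: int = 3, runs: int = 20) -> float:
    for _ in range(warmup):                # discard cold-start iterations
        run_inference(prompt)
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        run_inference(prompt)
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)

print(f"median latency: {median_latency_ms('hello'):.1f} ms")
```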


Gemini Flash-2 reduces token limits but improves scaling.

The Gemini Flash-2 API models, optimized for extremely low-latency streaming, received several pricing and quota changes on 1 August. Google reduced the free allowance from 3,000,000 tokens per month to 2,000,000, and added a 100 TPS soft cap to ensure regional stability.


Despite these constraints, Flash-2 remains attractive for high-throughput workloads, particularly real-time assistants and interactive chat tools. Its multimodal efficiency was strengthened by the vision token optimization introduced on 14 August, which separates billing for text tokens and vision tokens in API responses.

| Flash Model | Context Window | Free Monthly Quota | TPS Limit | Vision Support |
| --- | --- | --- | --- | --- |
| Gemini Flash-2 | 32,000 tokens | 2,000,000 tokens | 100 TPS | Yes, optimized (14 Aug) |
| Flash-Lite Beta | 64,000 tokens | Free for developers <10 TPS | 10 TPS | Yes |
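Given the soft caps above, a simple client-side token bucket keeps request bursts under the limit rather than relying on server-side throttling errors. The sketch below is generic: send() stands in for whatever SDK or HTTP call actually issues the streaming request.

```python
# Generic client-side token bucket for staying under a soft TPS cap. The
# cap value mirrors the table above; send() is a stand-in for the real
# SDK or HTTP call that issues the streaming request.
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate                  # permits replenished per second
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self, cost: float = 1.0) -> None:
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return
            time.sleep((cost - self.tokens) / self.rate)  # wait for refill

bucket = TokenBucket(rate=100, capacity=100)  # 100 TPS soft cap

def send(request: dict) -> None:
    bucket.acquire()                      # block until a permit is available
    # ...issue the actual streaming request here
```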



Gemini Vision-Mini improves multimodal efficiency.

The Gemini Vision-Mini model, released as a 13B-parameter, research-only version, focuses on structured image understanding and compact visual embeddings. Its v1.1 patch (16 August) delivered measurable gains in color-chart detection accuracy and reduced multimodal token costs by 11%, improving API billing transparency.

This model is currently limited to research use cases under CC-BY-NC licensing, and its results integrate seamlessly with Gemini’s multimodal toolchain, supporting table parsing, document layout extraction, and OCR acceleration.
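For researchers with access, a document-analysis request might look something like the sketch below. The endpoint URL and payload fields are invented for illustration only; the actual research-access interface should be taken from Google's documentation.

```python
# Hypothetical sketch of a structured-image request to Vision-Mini. The
# endpoint URL and payload fields are invented for illustration; take the
# real research-access interface from the official documentation.
import base64
import requests

ENDPOINT = "https://example.googleapis.com/v1/vision-mini:analyze"  # hypothetical

with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": "Bearer <research-token>"},
    json={"image": image_b64, "tasks": ["table_parsing", "ocr"]},  # assumed fields
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```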


Embeddings and lightweight RAG workflows gain traction.

Google expanded its ecosystem by releasing Gemini-Emb-512d-v1, a 512-dimensional embedding model published under Apache-2.0 in May. This model was optimized in June for hybrid retrieval-augmented generation (RAG) scenarios and delivers better performance in vector search, particularly for multilingual enterprise datasets.

The embeddings integrate natively with Vertex AI Matching Engine and AlloyDB, reducing latency for queries over large corpora and supporting streamed batch retrieval for scaling search workflows.
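The core retrieval step in such a RAG workflow reduces to nearest-neighbor search over unit-normalized vectors. In the self-contained sketch below, embed() is a stand-in for a Gemini-Emb-512d-v1 call, faked with seeded random vectors so the snippet runs without credentials.

```python
# Self-contained retrieval sketch over 512-dimensional embeddings. embed()
# stands in for a Gemini-Emb-512d-v1 call, faked with seeded random unit
# vectors so the snippet runs on its own.
import numpy as np

DIM = 512

def embed(texts: list[str]) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(tuple(texts))) % (2**32))
    vecs = rng.normal(size=(len(texts), DIM))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

corpus = ["refund policy", "shipping times", "warranty claims"]
corpus_vecs = embed(corpus)

query_vec = embed(["how do I return an item?"])[0]
scores = corpus_vecs @ query_vec       # cosine similarity on unit vectors
best = int(np.argmax(scores))
print(f"best match: {corpus[best]!r} (score={scores[best]:.3f})")
```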


Vertex AI upgrades improve developer access.

The Vertex AI free tier doubled its quota allocations starting 1 August, expanding access for developers testing Flash-2, Nano, and Vision APIs. These updates also align with Google’s Vertex AI GA cycles, with Flash-Lite’s general availability postponed to October to finalize performance testing and ensure consistency in API billing metrics.

Additionally, Vertex now tracks vision token consumption separately via the usage.attached_tokens metric, giving developers more visibility into cost optimization strategies.
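A response consumer can split the two costs directly from the usage block. The usage.attached_tokens field is the metric named above; the surrounding response shape (total_tokens, the outer dict) is an assumption for illustration.

```python
# Splitting text vs. vision token costs from a response's usage block.
# usage.attached_tokens is the metric named above; total_tokens and the
# surrounding dict shape are assumptions for illustration.
def summarize_usage(response: dict) -> None:
    usage = response.get("usage", {})
    vision_tokens = usage.get("attached_tokens", 0)
    text_tokens = usage.get("total_tokens", 0) - vision_tokens
    print(f"text tokens:   {text_tokens}")
    print(f"vision tokens: {vision_tokens}")

summarize_usage({"usage": {"total_tokens": 1450, "attached_tokens": 300}})
```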


Key takeaways on Gemini free model availability.

Google’s Gemini ecosystem is consolidating around three pillars: Nano models for efficient open-weights inference, Flash models for high-throughput API streaming, and Vision models for multimodal processing. The steady stream of updates in context limits, quotas, and licensing terms indicates a broader commitment to an open research-friendly strategy while gradually tightening integration with Google Cloud’s managed services.

| Category | Latest Model(s) | License | Context Limit | Use Case |
| --- | --- | --- | --- | --- |
| Open Weights | Nano-7B, Nano-20B, µ-2B | Apache-2.0 | 16,000 tokens | On-device inference, LoRA fine-tunes |
| API Streaming | Flash-2, Flash-Lite Beta | Proprietary | 32,000 / 64,000 tokens | High-speed chat & multimodal |
| Research Vision | Vision-Mini v1.1 | CC-BY-NC | 32,000 tokens | Image analysis, OCR |
| Embeddings | Gemini-Emb-512d-v1 | Apache-2.0 | 512 dims | Vector RAG, hybrid retrieval |


