A stylized depiction of a brain with open-source code flowing through it, symbolizing the open-source safety model.

OpenAI's Safety Model Finally Revealed, and It's Open Source!

Ollama partners with OpenAI and ROOST to deliver gpt-oss-safeguard, an open-source safety model for content classification.

5 min read


In a significant move for AI safety and accessibility, OpenAI is collaborating with Ollama and ROOST (Robust Open Online Safety Tools) to release gpt-oss-safeguard, a suite of open-source models designed for safety classification tasks. This collaboration aims to empower developers and organizations with tools to better understand and mitigate potential harms associated with large language models. The release marks a pivotal step towards democratizing AI safety, offering a transparent and customizable approach to content moderation and policy enforcement. The open-source nature of gpt-oss-safeguard enables community-driven improvements and adaptation to diverse use cases, fostering a more responsible AI ecosystem.

What's New

The gpt-oss-safeguard models are specifically trained to reason about safety, making them suitable for applications like LLM input-output filtering, online content labeling, and offline analysis for trust and safety teams. Key highlights include:

  • Two Model Sizes: Available in 20B and 120B parameter versions, allowing users to choose the model size that best fits their computational resources and performance requirements.
  • Bring Your Own Policy: The models are designed to interpret user-defined policies, enabling generalization across different products and use cases with minimal engineering effort.
  • Reasoned Decisions: Unlike simple scoring systems, gpt-oss-safeguard provides access to the model's reasoning process, facilitating debugging and increasing trust in policy decisions.
  • Configurable Reasoning Effort: Users can adjust the reasoning effort (low, medium, high) to balance accuracy and latency based on their specific needs.
  • Permissive Licensing: Licensed under the Apache 2.0 license, allowing for experimentation, customization, and commercial deployment without copyleft restrictions or patent risks.
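As a sketch of how "bring your own policy" might look in practice, the snippet below assembles a request for Ollama's local REST API (`POST /api/chat` on the default port 11434), passing a custom policy as the system message. The policy wording and the `classify` helper are illustrative assumptions, not part of the model's documented interface.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

# Illustrative policy text -- gpt-oss-safeguard is designed to interpret
# whatever policy you supply, so this wording is an example, not canonical.
POLICY = (
    "Classify the user message as VIOLATING or NON-VIOLATING.\n"
    "Violating content: instructions for creating weapons, credible threats "
    "of violence, or targeted harassment."
)

def build_request(text: str, model: str = "gpt-oss-safeguard:20b") -> dict:
    """Assemble the JSON payload for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": POLICY},
            {"role": "user", "content": text},
        ],
        "stream": False,  # ask for a single JSON response, not a token stream
    }

def classify(text: str) -> str:
    """Send the request to a locally running Ollama server (requires `ollama serve`)."""
    payload = json.dumps(build_request(text)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

if __name__ == "__main__":
    # Print the payload; calling classify() requires a running Ollama instance.
    print(json.dumps(build_request("hello there"), indent=2))
```

Because the policy travels with every request, swapping policies across products is a prompt change rather than a retraining job.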

Why It Matters

The release of gpt-oss-safeguard matters for several reasons. First, it addresses the critical need for accessible and customizable AI safety tools: organizations can now enforce their own specific policies and adapt them to their unique contexts. Second, the open-source nature of the models fosters transparency and collaboration, enabling the community to contribute to their improvement and refinement. This is particularly important in the rapidly evolving landscape of AI safety, where continuous adaptation and innovation are essential. Finally, access to the model's reasoning process provides insight into its decision-making, increasing trust and enabling more effective debugging and refinement of safety policies. This ultimately empowers developers and safety teams to build safer and more responsible AI systems.

Technical Details

OpenAI evaluated the gpt-oss-safeguard models on both internal and external evaluation sets. The internal evaluation involved providing multiple policies simultaneously to the models and assessing their ability to correctly classify text under all included policies. This is a challenging task, as the model is only considered accurate if it exactly matches the golden set labels for all policies. OpenAI also evaluated the models on their moderation dataset released in 2022 and on ToxicChat, a public benchmark based on user queries to an open-source chatbot.

To run the models, users can download Ollama and execute the following commands in a terminal:

```shell
ollama run gpt-oss-safeguard:20b
ollama run gpt-oss-safeguard:120b
```

The models' performance is configurable, allowing users to adjust the reasoning effort based on their specific use case and latency requirements. This flexibility is crucial for adapting the models to diverse applications with varying performance constraints. The permissive Apache 2.0 license enables a wide range of use cases, from academic research to commercial deployment, without the restrictions associated with more restrictive licenses.
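One way reasoning effort might be selected is in the system prompt itself: the base gpt-oss models use a `Reasoning: low|medium|high` line for this, and the sketch below assumes (without confirmation from the source) that the same convention carries over to gpt-oss-safeguard.

```python
# Sketch: selecting reasoning effort by prefixing the system prompt.
# The "Reasoning: <level>" line is borrowed from the base gpt-oss models;
# treat its applicability to gpt-oss-safeguard as an assumption.

VALID_EFFORTS = ("low", "medium", "high")

def with_reasoning_effort(policy: str, effort: str = "medium") -> str:
    """Return a system prompt that requests a given reasoning effort."""
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}, got {effort!r}")
    return f"Reasoning: {effort}\n\n{policy}"

# Lower effort trades some classification accuracy for lower latency.
print(with_reasoning_effort("Classify messages as SAFE or UNSAFE.", "low"))
```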

| Feature             | Description                                                           |
| ------------------- | --------------------------------------------------------------------- |
| Model Sizes         | 20B and 120B parameters                                               |
| Training Data       | Trained and tuned for safety reasoning                                |
| Policy Enforcement  | Interprets user-defined policies                                      |
| Reasoning Access    | Provides access to the model's reasoning process                      |
| Reasoning Effort    | Configurable (low, medium, high)                                      |
| License             | Apache 2.0                                                            |
| Evaluation Datasets | OpenAI internal datasets, OpenAI moderation dataset (2022), ToxicChat |

Final Thoughts

The collaboration between OpenAI, Ollama, and ROOST to release gpt-oss-safeguard marks a significant step forward in democratizing AI safety. By providing open-source, customizable, and transparent safety models, this initiative empowers developers and organizations to build more responsible and trustworthy AI systems. As the field of AI safety continues to evolve, we anticipate further advancements in open-source tools and techniques that will contribute to a safer and more beneficial AI future.

Sources verified via Ollama as of November 4, 2025.

OpenAI's Safety Model Finally Revealed, and It's Open Source! · FineTunedNews