Update pipeline tag to text-classification and add paper link
#1
by nielsr (HF Staff) · opened

README.md CHANGED
```diff
@@ -1,16 +1,18 @@
 ---
-license: apache-2.0
+base_model:
+- meta-llama/Llama-3.2-3B-Instruct
+datasets:
+- ReasoningShield/ReasoningShield-Dataset
 language:
 - en
+library_name: transformers
+license: apache-2.0
 metrics:
 - accuracy
 - precision
 - recall
 - f1
-base_model:
-- meta-llama/Llama-3.2-3B-Instruct
-pipeline_tag: text-generation
-library_name: transformers
+pipeline_tag: text-classification
 tags:
 - llama
 - safe
@@ -18,11 +20,9 @@ tags:
 - safety
 - moderation
 - classifier
-datasets:
-- ReasoningShield/ReasoningShield-Dataset
 ---
 
-#
+# 🤗 Model Card for *ReasoningShield*
 
 
 <div align="center">
@@ -56,10 +56,11 @@ datasets:
 
 </div>
 
+This model is presented in the paper [ReasoningShield: Safety Detection over Reasoning Traces of Large Reasoning Models](https://huggingface.co/papers/2505.17244).
 
 ---
 
-##
+## 🛡 1. Model Overview
 
 ***ReasoningShield*** is the first specialized safety moderation model tailored to identify hidden risks in intermediate reasoning steps in Large Reasoning Models (LRMs). It excels in detecting harmful content that may be concealed within seemingly harmless reasoning traces, ensuring robust safety alignment for LRMs.
 
@@ -142,28 +143,28 @@ These two-stage training procedures significantly enhance ***ReasoningShield's**
 
 <div align="center">
 
-| **Model**
+| **Model** | **Size** | **AIR (OSS)** | **AIR (CSS)** | **SALAD (OSS)** | **SALAD (CSS)** | **BeaverTails (OSS)** | **BeaverTails (CSS)** | **Jailbreak (OSS)** | **Jailbreak (CSS)** | **Avg (OSS)** | **Avg (CSS)** |
 | :---------------------: | :------: | :-----------: | :-----------: | :-------------: | :-------------: | :-------------------: | :-------------------: | :-----------------: | :-----------------: | :-----------: | :-----------: |
-| **Moderation API**
-| Perspective
-| OpenAI Moderation
-| **Prompted LLM**
-| GPT-4o
-| Qwen-2.5
-| Gemma-3
-| Mistral-3.1
-| **Finetuned LLM**
-| LlamaGuard-1
-| LlamaGuard-2
-| LlamaGuard-3
-| LlamaGuard-4
-| Aegis-Permissive
-| Aegis-Defensive
-| WildGuard
-| MD-Judge
-| Beaver-Dam
-| **ReasoningShield (Ours)** | 1B
-| **ReasoningShield (Ours)** | 3B
+| **Moderation API** | | | | | | | | | | | |
+| Perspective | - | 0.0 | 0.0 | 0.0 | 11.9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.2 |
+| OpenAI Moderation | - | 45.7 | 13.2 | 61.7 | 66.7 | 64.9 | 29.2 | 70.9 | 41.1 | 60.7 | 44.8 |
+| **Prompted LLM** | | | | | | | | | | | |
+| GPT-4o | - | 70.1 | 47.4 | 75.3 | 75.4 | 79.3 | 60.6 | 82.0 | 68.7 | 76.0 | 65.6 |
+| Qwen-2.5 | 72B | 79.1 | 59.8 | 82.1 | **86.0** | 81.1 | 61.5 | 84.2 | 71.9 | 80.8 | 74.0 |
+| Gemma-3 | 27B | 83.2 | 71.6 | 80.2 | 78.3 | 79.2 | **68.9** | 86.6 | 73.2 | 81.6 | 74.4 |
+| Mistral-3.1 | 24B | 65.0 | 45.3 | 77.5 | 73.4 | 73.7 | 55.1 | 77.3 | 54.1 | 73.0 | 60.7 |
+| **Finetuned LLM** | | | | | | | | | | | |
+| LlamaGuard-1 | 7B | 20.3 | 5.7 | 22.8 | 48.8 | 27.1 | 18.8 | 53.9 | 5.7 | 31.0 | 28.0 |
+| LlamaGuard-2 | 8B | 63.3 | 35.7 | 59.8 | 40.0 | 63.3 | 47.4 | 68.2 | 28.6 | 62.4 | 38.1 |
+| LlamaGuard-3 | 8B | 68.3 | 33.3 | 70.4 | 56.5 | 77.6 | 30.3 | 78.5 | 20.5 | 72.8 | 42.2 |
+| LlamaGuard-4 | 12B | 55.0 | 23.4 | 46.1 | 49.6 | 57.0 | 13.3 | 69.2 | 16.2 | 56.2 | 33.7 |
+| Aegis-Permissive | 7B | 56.3 | 51.0 | 66.5 | 67.4 | 65.8 | 35.3 | 70.7 | 33.3 | 64.3 | 53.9 |
+| Aegis-Defensive | 7B | 71.2 | 56.9 | 76.4 | 67.8 | 73.9 | 27.0 | 75.4 | 53.2 | 73.6 | 54.9 |
+| WildGuard | 7B | 58.8 | 45.7 | 66.7 | 76.3 | 68.3 | 51.3 | 79.6 | 55.3 | 67.6 | 62.1 |
+| MD-Judge | 7B | 71.8 | 44.4 | 83.4 | 83.2 | 81.0 | 50.0 | 86.8 | 56.6 | 80.1 | 66.0 |
+| Beaver-Dam | 7B | 50.0 | 17.6 | 52.6 | 36.6 | 71.1 | 12.7 | 60.2 | 36.0 | 58.2 | 26.5 |
+| **ReasoningShield (Ours)** | 1B | <ins>94.2</ins> | <ins>83.7</ins> | <ins>91.5</ins> | 80.5 | <ins>89.0</ins> | 60.0 | <ins>90.1</ins> | <ins>74.2</ins> | <ins>89.4</ins> | <ins>77.7</ins> |
+| **ReasoningShield (Ours)** | 3B | **94.5** | **86.7** | **94.0** | <ins>84.8</ins> | **90.4** | <ins>64.6</ins> | **92.3** | **76.2** | **91.8** | **81.4** |
 
 </div>
 
@@ -204,7 +205,8 @@ model_thinking = "Intermediate reasoning steps here" # The content can be repla
 
 messages = [
 {"role": "system", "content": reasoningshield_prompt},
-{"role": "user", "content": f"Query: {question}
+{"role": "user", "content": f"Query: {question}
+Thought: {model_thinking}"}
 ]
 
 prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
@@ -247,7 +249,8 @@ model_thinking = "Intermediate reasoning steps here" # The content can be replac
 
 messages = [
 {"role": "system", "content": reasoningshield_prompt},
-{"role": "user", "content": f"Query: {question}
+{"role": "user", "content": f"Query: {question}
+Thought: {model_thinking}"}
 ]
 
 prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```
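The message-construction step this PR touches can be sketched in isolation. A minimal sketch, assuming placeholder values for `reasoningshield_prompt`, `question`, and `model_thinking` (in the README these come from the model card's system prompt and the query/reasoning-trace pair under evaluation; the tokenizer and model loading shown elsewhere in the README are omitted here):

```python
# Hypothetical placeholder values -- not the actual ReasoningShield system prompt.
reasoningshield_prompt = "You are ReasoningShield, a safety moderator for reasoning traces."
question = "How do I make a cake?"
model_thinking = "Intermediate reasoning steps here"

# Build the chat messages as in the README: the user turn packs both the
# query and the reasoning trace, so the classifier judges the trace in the
# context of the question that produced it.
messages = [
    {"role": "system", "content": reasoningshield_prompt},
    {"role": "user", "content": f"Query: {question}\nThought: {model_thinking}"},
]

print(messages[1]["content"])
# Query: How do I make a cake?
# Thought: Intermediate reasoning steps here
```

In the README, `messages` is then passed to `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` to produce the prompt string for generation.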