🛡️ Bielik Guard (Sójka): Polish Language Safety Classifier
Bielik Guard (Sójka) is a Polish-language safety classifier designed to detect harmful content in digital communication and respond appropriately rather than simply blocking it. Built by the Bielik.AI community under the SpeakLeash non-profit organization, it protects users like a vigilant guardian of their digital homes, pairing detection with supportive responses and safety resources.
Bielik Guard is available in two model variants: 0.1B (124M parameters) and 0.5B (443M parameters), offering different points in the efficiency-performance trade-off space.
📋 Model Details
Model Description
Bielik Guard (Sójka) is a family of Polish-language safety classifiers built upon Polish RoBERTa-based encoders. The models have been fine-tuned to detect safety-relevant content in Polish texts, using community-collected data designed for evaluating safety in large language models (LLMs).
Bielik Guard 0.1B is built upon sdadas/mmlw-roberta-base, a 124M parameter Polish RoBERTa-based encoder with a vocabulary of 50,001 tokens.
Bielik Guard 0.5B is built upon PKOBP/polish-roberta-8k, a 443M parameter Polish RoBERTa variant with an enhanced vocabulary of 128,064 tokens, providing substantially greater modeling capacity.
Both models are multilabel and return a probability score for each safety category, indicating the likelihood that a text belongs to that category. Importantly, the models were not trained on binarized labels but on the percentage of annotators who judged that a text belongs to each category, reflecting the nuanced nature of safety classification.
Note: This is version 1.1 of Bielik Guard (Sójka), featuring improved threshold calibration that significantly enhances precision and reduces false positive rates compared to v1.0. The team is actively working on future versions that will include additional safety categories and support for more languages.
- Developed by: See the Sójka Development Team section below.
- Model type: Text Classification
- Language(s) (NLP): Polish
- License: Apache-2.0
- Finetuned from model: See Technical Specifications below
🚀 Demo: Test Sójka at guard.bielik.ai
🛠️ Uses
✅ Direct Use
Bielik Guard (Sójka) can be used directly for:
- Real-time analysis of prompts and responses to detect threats and respond appropriately.
- Content moderation that provides supportive responses rather than simple blocking.
- Protection of AI chatbots and assistants with appropriate intervention strategies.
- Integration into systems that prioritize user support and safety resources.
🧩 Downstream Use
The model can be integrated into larger systems for:
- Content moderation pipelines
- AI safety frameworks
- Communication platform safety systems
- Educational and research applications
❌ Out-of-Scope Use
Bielik Guard (Sójka) is not designed to detect:
- Disinformation or misinformation
- Jailbreaking attempts
- Copyright violations
- Other categories not explicitly listed in the safety taxonomy
🏷️ Safety Categories
Bielik Guard (Sójka) detects and classifies potentially harmful content in five key safety categories:
- [HATE] Hate/Aggression: Content attacking or discriminating against groups based on race, religion, gender, sexual orientation, or nationality.
- [VULGAR] Vulgarities: Words commonly considered vulgar or profane, in both explicit and masked forms.
- [SEX] Sexual Content: Graphic descriptions of sexual activities or requests for generating erotic materials.
- [CRIME] Crime: Instructions or encouragement to commit crimes, drug production, or fraud.
- [SELF-HARM] Self-Harm: Content encouraging suicide, self-harm, or promoting eating disorders. When detected, the system should respond with appropriate support resources rather than simply blocking the content.
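The category-aware intervention idea described in this list can be sketched as follows; the thresholds, helper names, and messages are illustrative assumptions, not values shipped with Bielik Guard:

```python
# Illustrative sketch: route each detected category to an intervention
# strategy instead of blanket blocking. Thresholds and messages here are
# hypothetical, not part of the released model.

SUPPORT_MESSAGE = (
    "It sounds like you may be going through a difficult time. "
    "Please consider reaching out to a crisis helpline."
)

def choose_intervention(scores: dict[str, float], threshold: float = 0.5) -> str:
    """Map per-category safety scores to a response strategy."""
    flagged = {label for label, score in scores.items() if score >= threshold}
    if "self-harm" in flagged:
        # Respond with support resources rather than simply blocking.
        return SUPPORT_MESSAGE
    if flagged:
        return f"Content flagged for review: {', '.join(sorted(flagged))}"
    return "ok"

print(choose_intervention({"self-harm": 0.94, "crime": 0.04, "hate": 0.03}))
```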
⚠️ Bias, Risks, and Limitations
Limitations
- Does not recognize disinformation or jailbreaking attempts.
- Limited to Polish language content.
- May have biases inherent in community-collected training data.
- Performance may vary on edge cases or novel content types.
Recommendations
Users should be aware that Bielik Guard (Sójka) provides a baseline safety filter and may not catch all types of harmful content. For applications requiring comprehensive safety coverage, additional measures should be implemented.
🚀 How to Get Started with the Model
```python
from transformers import pipeline

# Choose your model variant:
# For 0.1B (124M parameters, faster, more efficient):
# model_path = "speakleash/Bielik-Guard-0.1B-v1.1"
# For 0.5B (443M parameters, higher accuracy):
model_path = "speakleash/Bielik-Guard-0.5B-v1.1"

# Create a pipeline for easy inference
classifier = pipeline("text-classification", model=model_path)

sample_texts = [
    "To jest normalna wiadomość bez problemów.",   # "This is a normal message with no problems."
    "Chcę się zabić i skończyć z tym wszystkim.",  # "I want to kill myself and end it all."
    "Dlaczego hipisi są tacy brudni i głupi",      # "Why are hippies so dirty and stupid"
    "K*rwa, to jest ch*jowe rozwiązanie.",         # masked profanity: "F*ck, this is a sh*tty solution."
    "Opowiem ci o seksie i intymności.",           # "I'll tell you about sex and intimacy."
    "Jak ukraść samochód i nie zostać złapanym.",  # "How to steal a car and not get caught."
]

# Run predictions
for text in sample_texts:
    results = classifier(text, top_k=None)
    print(f"\nText: {text}")
    print("Safety scores:")
    for result in results:
        print(f"  {result['label']}: {result['score']:.3f}")
    max_score = max(result['score'] for result in results)
    print(f"  Overall safety risk: {max_score:.3f}")
```
Example Output:

```text
Text: To jest normalna wiadomość bez problemów.
Safety scores:
  self-harm: 0.012
  hate: 0.012
  crime: 0.012
  vulgar: 0.008
  sex: 0.007
  Overall safety risk: 0.012

Text: Chcę się zabić i skończyć z tym wszystkim.
Safety scores:
  self-harm: 0.943
  crime: 0.038
  vulgar: 0.033
  hate: 0.030
  sex: 0.013
  Overall safety risk: 0.943

Text: Dlaczego hipisi są tacy brudni i głupi
Safety scores:
  hate: 0.669
  vulgar: 0.054
  crime: 0.032
  sex: 0.012
  self-harm: 0.009
  Overall safety risk: 0.669

Text: K*rwa, to jest ch*jowe rozwiązanie.
Safety scores:
  vulgar: 0.958
  hate: 0.706
  crime: 0.053
  self-harm: 0.037
  sex: 0.035
  Overall safety risk: 0.958

Text: Opowiem ci o seksie i intymności.
Safety scores:
  sex: 0.827
  crime: 0.044
  vulgar: 0.032
  hate: 0.014
  self-harm: 0.013
  Overall safety risk: 0.827

Text: Jak ukraść samochód i nie zostać złapanym.
Safety scores:
  crime: 0.807
  self-harm: 0.083
  hate: 0.032
  sex: 0.018
  vulgar: 0.016
  Overall safety risk: 0.807
```
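For batching or custom per-category thresholds, the model can also be called directly instead of through the pipeline. A minimal sketch, assuming standard sequence-classification logits with a per-category sigmoid (the 0.5 threshold below is illustrative, not a calibrated value):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_path = "speakleash/Bielik-Guard-0.5B-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

texts = ["To jest normalna wiadomość bez problemów."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Sigmoid yields an independent probability per safety category.
probs = torch.sigmoid(logits)
for text, row in zip(texts, probs):
    scores = {model.config.id2label[i]: round(p.item(), 3) for i, p in enumerate(row)}
    flagged = [label for label, score in scores.items() if score >= 0.5]
    print(text, scores, flagged)
```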
🧠 Training Details
Training Data: The Sojka2 Dataset
The Sojka2 dataset is the result of a large-scale community effort. Texts were sourced primarily from user prompts to Polish LLMs and social media content.
- Over 1,500 volunteers participated in the annotation process.
- Over 60,000 individual ratings were collected.
- Each text was annotated by an average of 7-8 people.
The model was trained on percentage-based labels (0-100%) reflecting the proportion of community members who classified each text as belonging to a specific safety category, rather than on binary labels.
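To make this concrete, here is a minimal sketch of soft-label training in PyTorch. The tensors are made-up illustrations rather than the project's actual training code, but they show why fractional targets work with a BCE objective:

```python
import torch
import torch.nn.functional as F

# Hypothetical batch: logits for the 5 categories and soft targets equal
# to the fraction of annotators who assigned each category (not 0/1).
logits = torch.tensor([[2.1, -1.3, 0.2, -2.0, -1.8]])
soft_targets = torch.tensor([[0.875, 0.0, 0.25, 0.0, 0.0]])  # e.g. 7 of 8 annotators

# binary_cross_entropy_with_logits accepts fractional targets, so the
# model learns to reproduce annotator agreement, not a hard decision.
loss = F.binary_cross_entropy_with_logits(logits, soft_targets)
print(loss.item())
```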
Data Structure and Distribution
The Sojka dataset consists of 6,885 unique texts in Polish. Its structure was intentionally designed with a balanced ratio of approximately 55% safe to 45% harmful content to ensure effective training. This ratio does not reflect the actual distribution of content online.
However, the class imbalance among the harmful categories is representative of real-world trends encountered in digital interactions in Poland (sourced from both user prompts to conversational AI and general content from the Polish internet).
| Category | Text Count | Percentage |
|---|---|---|
| self-harm | 796 | 11.56% |
| hate | 988 | 14.35% |
| vulgar | 411 | 5.97% |
| sex | 895 | 13.00% |
| crime | 311 | 4.52% |
| safe (no category) | 3,781 | 54.92% |
The dataset supports multi-label classification, meaning a single text can belong to multiple categories.
🔄 Continuous Improvement
Sójka is a living project. Community involvement is ongoing at guard.bielik.ai, where users can test the model, provide feedback (👍/👎), and contribute by annotating new data. All feedback is systematically analyzed to create future iterations of the dataset.
Training Procedure
Both model variants were fine-tuned using standard practices for transformer-based classification:
- Loss function: Binary Cross-Entropy (BCE) with soft labels derived from percentage-based annotations
- Optimizer: AdamW with weight decay of 0.01
- Learning rate: 2e-5 with 500 warmup steps followed by linear decay
- Batch size: 32
- Training duration: 3 epochs (approx. 2 hours on A100)
- Training infrastructure: A100 GPU cluster (ACK Cyfronet AGH)
Bielik Guard 0.1B was fine-tuned from the sdadas/mmlw-roberta-base checkpoint, a 124M parameter Polish RoBERTa-based encoder.
Bielik Guard 0.5B was fine-tuned from the PKOBP/polish-roberta-8k checkpoint, a 443M parameter Polish RoBERTa variant.
Version 1.1 Improvements
Version 1.1 features improved threshold calibration that resolves a classification threshold issue present in v1.0, particularly affecting crime-related content. This calibration fix results in substantially improved precision (77.65% vs. 67.27% for 0.1B v1.0 on user prompts) and lower false positive rates (0.63% vs. 1.20% for 0.1B v1.0). Both v1.0 and v1.1 models were trained using identical procedures and data splits; the difference lies solely in the threshold calibration optimization, making v1.1 the production-ready variant with optimal precision-recall trade-offs.
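The exact calibration procedure is not documented here, but a hedged sketch of one common approach, sweeping per-category decision thresholds on held-out data and keeping the F1-maximizing value, looks like this:

```python
import numpy as np
from sklearn.metrics import f1_score

def calibrate_threshold(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    """Pick the decision threshold maximizing F1 for one category."""
    candidates = np.linspace(0.05, 0.95, 19)
    scores = [f1_score(y_true, y_prob >= t, zero_division=0) for t in candidates]
    return float(candidates[int(np.argmax(scores))])

# Hypothetical validation scores for a single category (e.g. crime).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_prob = np.clip(y_true * 0.6 + rng.normal(0.2, 0.2, size=200), 0.0, 1.0)
print(calibrate_threshold(y_true, y_prob))
```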
⚙️ Technical Specifications
Model Architecture
Both variants use a multi-label classification head consisting of:
- A dropout layer (p=0.1) for regularization
- A linear projection layer mapping hidden dimensions to 5 output logits
- Sigmoid activation for independent binary classification per category
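A minimal PyTorch sketch of this head (class and argument names are illustrative; the exact module layout inside the released checkpoints may differ):

```python
import torch
import torch.nn as nn

class SafetyClassificationHead(nn.Module):
    """Dropout, then a linear projection to 5 logits; sigmoid at inference."""

    def __init__(self, hidden_size: int, num_categories: int = 5, p: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(p)
        self.out_proj = nn.Linear(hidden_size, num_categories)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.out_proj(self.dropout(pooled))

head = SafetyClassificationHead(hidden_size=768)
logits = head(torch.randn(2, 768))
probs = torch.sigmoid(logits)  # independent probability per category
print(probs.shape)  # torch.Size([2, 5])
```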
Bielik Guard 0.1B:
- Base Model: sdadas/mmlw-roberta-base
- Parameters: 124M
- Vocabulary Size: 50,001 tokens
- Hidden Dimensions: 768
- Architecture: RoBERTa-based encoder
- Task: Multi-label Text Classification (Regression)
Bielik Guard 0.5B:
- Base Model: PKOBP/polish-roberta-8k
- Parameters: 443M
- Vocabulary Size: 128,064 tokens
- Hidden Dimensions: 1024
- Architecture: RoBERTa-based encoder
- Task: Multi-label Text Classification (Regression)
Compute Infrastructure
Both models were trained with A100 GPU cluster support from ACK Cyfronet AGH.
📊 Evaluation
Dataset 1: Sojka
The Sojka test dataset was created by splitting the main Sojka dataset using a 2:1 train-to-test ratio (Configuration 1: 4,590 train / 2,295 test). This evaluation set contains 2,295 unique records. Results below are from models trained with this configuration (v1.1a).
The distribution of labels in the test set, determined using a 60% agreement threshold among annotators, is as follows:
- self-harm: 265 samples (11.55%)
- hate: 329 samples (14.34%)
- vulgar: 137 samples (5.97%)
- sex: 298 samples (12.98%)
- crime: 104 samples (4.53%)
- safe (no harmful category): 1,260 samples (54.90%)
| Metric | Bielik Guard 0.1B v1.1a | Bielik Guard 0.5B v1.1a |
|---|---|---|
| RMSE | 0.128 | 0.122 |
| F1 micro | 0.775 | 0.791 |
| F1 macro | 0.770 | 0.785 |
| Recall micro | 0.808 | 0.835 |
| Recall macro | 0.794 | 0.812 |
| Specificity micro | 0.968 | 0.968 |
| Specificity macro | 0.967 | 0.967 |
| ROC AUC micro | 0.974 | 0.980 |
| ROC AUC macro | 0.964 | 0.973 |
Per-Category Performance
Detailed per-category performance metrics for both model variants on the Sojka test set (v1.1a):
| Category | 0.1B v1.1a F1 | 0.1B v1.1a ROC AUC | 0.5B v1.1a F1 | 0.5B v1.1a ROC AUC |
|---|---|---|---|---|
| SELF-HARM | 0.886 | 0.991 | 0.879 | 0.992 |
| HATE | 0.628 | 0.919 | 0.667 | 0.934 |
| VULGAR | 0.742 | 0.973 | 0.750 | 0.977 |
| SEX | 0.889 | 0.988 | 0.915 | 0.993 |
| CRIME | 0.707 | 0.949 | 0.716 | 0.971 |
The CRIME category presents the greatest challenge for both models, likely due to its lower prevalence in the training data (4.52% of samples). The SELF-HARM and SEX categories achieve the strongest performance, with F1 scores exceeding 0.88 for both models. The 0.5B variant generally outperforms the 0.1B variant across categories, with the most notable improvements in HATE (0.667 vs. 0.628) and SEX (0.915 vs. 0.889). All categories maintain ROC AUC scores above 0.92 for both models, indicating consistent discriminative ability across the taxonomy.
Dataset 2: Sojka Augmented
The augmented dataset was created using 15 different text augmentation methods to test model robustness. Results below use models trained with Configuration 1 (2:1 split, v1.1a):
- `remove_diacritics`: Czesc, to jest przykładowy tekst z polskimi znakami!
- `add_diacritics`: Cżeść, to jest przykładowy tękśt z polskimi znąkami!
- `random_capitalization`: CZeśĆ, To jesT PRzyKŁaDoWy TEKST z POLSKIMi zNAkAMi!
- `snake_case_random`: czE_to_jesT_pRzYk_adowY_teKSt_z_POlskiMI_zNAkamI
- `all_uppercase`: CZEŚĆ, TO JEST PRZYKŁADOWY TEKST Z POLSKIMI ZNAKAMI!
- `all_lowercase`: cześć, to jest przykładowy tekst z polskimi znakami!
- `title_case`: Cześć, To Jest Przykładowy Tekst Z Polskimi Znakami!
- `swap_adjacent_letters`: Cezść, to jest przkyałdowy teskt z ploskimi znkamai!
- `split_letters_by_separator`: Cześć, to j e s t przykładowy tekst z p o l s k i m i znakami!
- `add_random_spaces`: Cześć, to jest przykładowy te kst z polskimi znak a mi!
- `remove_random_spaces`: Cześć,to jest przykładowytekst z polskimi znakami!
- `duplicate_characters`: Czeeśść, to jesstt pprzykładowy tekstt z polskimi zznaakami!
- `insert_random_characters`: Cześć, to jest przykładowy tekst z śpoźlskimi znakami!
- `reverse_words`: Cześć, to jest przykładowy tekst z imikslop znakami!
- `substitute_similar_characters`: Cześć, 7o jes7 przykładowy tek5t z polskimi znakami!
| Metric | Bielik Guard 0.1B v1.1a | Bielik Guard 0.5B v1.1a |
|---|---|---|
| RMSE | 0.181 | 0.163 |
| F1 micro | 0.638 | 0.694 |
| F1 macro | 0.619 | 0.679 |
| Recall micro | 0.621 | 0.686 |
| Recall macro | 0.602 | 0.650 |
| Specificity micro | 0.962 | 0.966 |
| Specificity macro | 0.961 | 0.965 |
| ROC AUC micro | 0.909 | 0.934 |
| ROC AUC macro | 0.884 | 0.915 |
While performance degrades on perturbed text as expected, the 0.5B v1.1a model shows substantially better robustness, with F1 micro of 0.694 compared to 0.638 for the 0.1B v1.1a model.
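For illustration, a minimal sketch of one of the listed perturbations, remove_diacritics. This is an assumed implementation, not the project's augmentation code (which, judging by the examples above, applies perturbations stochastically):

```python
# Map Polish diacritics to their ASCII counterparts.
POLISH_MAP = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")

def remove_diacritics(text: str) -> str:
    """Strip Polish diacritics, e.g. 'Cześć' -> 'Czesc'."""
    return text.translate(POLISH_MAP)

print(remove_diacritics("Cześć, to jest przykładowy tekst z polskimi znakami!"))
# -> Czesc, to jest przykladowy tekst z polskimi znakami!
```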
Dataset 3: Gadzi Jezyk
The Gadzi Jezyk dataset contains 520 toxic prompts with extreme class imbalance: 505 crime-related examples (97.1%), 43 hate/violence (8.3%), 31 self-harm (6.0%), 18 sexual content (3.5%), and 4 vulgarities (0.8%). This distribution makes it particularly suitable for evaluating crime category performance. Results below use models trained with Configuration 2 (near-complete data: 6,285 train / 600 test, v1.1), representing our best-performing models for deployment.
| Metric | Bielik Guard 0.1B v1.1 | Bielik Guard 0.5B v1.1 |
|---|---|---|
| RMSE | 0.286 | 0.241 |
| Precision | 0.985 | 0.973 |
| Recall | 0.557 | 0.714 |
| F1 | 0.712 | 0.823 |
| Specificity | 0.998 | 0.994 |
| ROC AUC | 0.959 | 0.967 |
The Gadzi Jezyk dataset directly tests the impact of the v1.0 to v1.1 threshold calibration fix. The v1.1 models achieve higher precision (98.5% vs. 97.7% for 0.1B v1.0) and improved specificity (99.8% vs. 99.5% for 0.1B v1.0), while recall decreases (55.7% vs. 70.2% for 0.1B v1.0). This precision-recall trade-off prioritizes user trust through high precision over maximum recall, which is critical for production deployment.
Metrics Explanation
- RMSE (Root Mean Square Error): Measures the average magnitude of prediction errors. Lower values indicate better performance.
- F1 micro: Harmonic mean of precision and recall calculated globally across all labels. Accounts for class imbalance.
- F1 macro: Average of F1 scores across all labels. Treats all classes equally regardless of frequency.
- Specificity micro/macro: Specificity (true negative rate) with micro/macro averaging. Measures the ability to correctly identify safe content.
- ROC AUC micro/macro: Area under the ROC curve, measuring the model's ability to distinguish between safe and unsafe content across all thresholds.
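For readers reproducing these aggregations, a small sketch using scikit-learn on hypothetical multi-label predictions (4 texts × 5 categories):

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score, roc_auc_score

# Hypothetical ground truth and predicted probabilities.
y_true = np.array([[1,0,0,0,0], [0,1,1,0,0], [0,0,0,0,0], [0,0,0,1,1]])
y_prob = np.array([[.9,.1,.2,.1,.0], [.2,.8,.6,.1,.1],
                   [.1,.1,.1,.2,.1], [.0,.1,.2,.7,.9]])
y_pred = (y_prob >= 0.5).astype(int)

print("F1 micro:     ", f1_score(y_true, y_pred, average="micro"))
print("F1 macro:     ", f1_score(y_true, y_pred, average="macro"))
print("Recall micro: ", recall_score(y_true, y_pred, average="micro"))
print("ROC AUC macro:", roc_auc_score(y_true, y_prob, average="macro"))
```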
The Bielik Guard 0.5B model generally outperforms the Bielik Guard 0.1B model across most metrics, particularly on the augmented test set, demonstrating better generalization capabilities.
Comparison with Other Safety Models
Evaluation on 3,000 random user prompts, annotated by two independent annotators and one super-annotator using each model's own category taxonomy. Results for Bielik Guard use models trained with Configuration 2 (near-complete data: 6,285 train / 600 test, v1.1):
| Model | Params | Precision | Alert Rate | FPR (Global) |
|---|---|---|---|---|
| Bielik Guard 0.1B v1.1 | 124M | 77.65% | 2.83% | 0.63% |
| Bielik Guard 0.5B v1.1 | 443M | 75.28% | 2.97% | 0.73% |
| Bielik Guard 0.1B v1.0 | 124M | 67.27% | 3.67% | 1.20% |
| HerBERT-PL-Guard | 124M | 31.55% | 6.87% | 4.70% |
| Llama-Guard-3-1B | 1B | 7.82% | 17.90% | 16.50% |
| Llama-Guard-3-8B | 8B | 13.62% | 10.77% | 9.30% |
| Qwen3Guard-Gen-0.6B | 600M | 11.36% | 19.37% | 17.17% |
Both Bielik Guard v1.1 models achieve strong results. Bielik Guard 0.1B v1.1 reaches 77.65% precision, meaning that over three-quarters of all flagged content is genuinely harmful. It substantially outperforms all compared models, including HerBERT-PL-Guard (31.55%) at an identical model size (124M parameters), as well as larger models such as Llama-Guard-3-8B (13.62%) and multilingual alternatives such as Qwen3Guard-Gen-0.6B (11.36%). Its 0.63% false positive rate is 7.5× lower than HerBERT-PL-Guard's 4.70% and substantially better than the generative multilingual models. Bielik Guard 0.5B v1.1 performs similarly, with 75.28% precision and 0.73% FPR, balancing accuracy and efficiency. Both v1.1 models improve substantially over v1.0: 0.1B precision rises from 67.27% to 77.65% and the false positive rate drops from 1.20% to 0.63%, making Bielik Guard significantly less intrusive for legitimate use cases.
Metrics for comparison:
- Precision: TP/(TP+FP) - Percentage of flagged content that is actually harmful (higher is better)
- Alert Rate: (TP+FP)/(TP+FP+TN+FN) - Percentage of all content that gets flagged (lower is better to reduce false positives)
- FPR (Global): FP/(TP+FP+TN+FN) - False Positive Rate - percentage of safe content incorrectly flagged as harmful (lower is better)
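A quick sketch of these three formulas on made-up confusion counts (illustrative only, not the evaluation numbers above):

```python
# Hypothetical confusion counts over a batch of prompts.
tp, fp, tn, fn = 50, 10, 2930, 10
total = tp + fp + tn + fn

precision = tp / (tp + fp)      # share of flagged content that is harmful
alert_rate = (tp + fp) / total  # share of all content that gets flagged
fpr_global = fp / total         # share of all content wrongly flagged

print(f"Precision:  {precision:.2%}")   # 83.33%
print(f"Alert rate: {alert_rate:.2%}")  # 2.00%
print(f"FPR global: {fpr_global:.2%}")  # 0.33%
```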
📜 License and Naming Policy
License: This model is licensed under the Apache 2.0 License.
Naming Requirements for Derivative Models: To maintain clear attribution and continuity of the Bielik-Guard project, we expect any fine-tuned or derivative models to include Bielik-Guard in their name. This helps recognize the model's origins and supports transparency within the community.
Recommended Naming Convention: `Bielik-Guard-{your-use-case-or-project-name}-{version}`

Examples: `Bielik-Guard-crime-finetune`, `Bielik-Guard-customer-support-v1`
👥 Sójka Development Team
- Jan Maria Kowalski: Project leadership, data and tool preparation, threat category definition, model training and testing.
- Krzysztof Wróbel: Data analysis, model training and evaluation, contribution to threat classification.
- Jerzy Surma: Threat category definition (AI & ethics perspective), data preparation.
- Igor Ciuciura: Data analysis, preparation, and cleaning; contribution to threat classification.
- Maciej Krystian Szymański: Project management support, community management, user and partner coordination.
We gratefully acknowledge the Polish high-performance computing infrastructure PLGrid (HPC Center: ACK Cyfronet AGH) for providing computing facilities and support within computational grant no. PLG/2025/018338.
📚 Citation
```bibtex
@misc{wróbel2026bielikguardefficientpolish,
  title={Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation},
  author={Krzysztof Wróbel and Jan Maria Kowalski and Jerzy Surma and Igor Ciuciura and Maciej Szymański},
  year={2026},
  eprint={2602.07954},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.07954},
}
```
📧 Model Card Contact
For questions about this model, please contact the Bielik.AI community through guard.bielik.ai.