QuillSafe

Polish PII and Sensitive Data Detection Model

QuillSafe is a production-oriented token classification model for detecting personally identifiable information (PII) and other regulated or high-sensitivity data in Polish-language text. It is designed for privacy-preserving NLP workflows, document redaction, secure data processing, and compliance-driven preprocessing pipelines.

Built on top of XLM-RoBERTa-base, QuillSafe identifies a broad set of sensitive entities across personal, financial, identity, health, geolocation, authentication, and special-category personal data. The model is intended for organizations that need reliable Polish-first detection of sensitive content before indexing, analytics, model training, or downstream AI processing.

This is version 0.1 of the model; we will continue to refine and improve it.

Key Highlights

  • Language support: Polish
  • Task: Token classification
  • Base model: XLM-RoBERTa-base
  • Global F1 score: 95%
  • Entity schema: 37 sensitive-data classes

Intended Use

QuillSafe is designed for automated detection and labeling of sensitive spans in Polish text, including both classic PII and higher-risk regulated categories.

Typical use cases include:

  • PII redaction in documents, tickets, emails, and chat logs
  • Dataset sanitization before model training, annotation, or analytics
  • Compliance workflows supporting GDPR-oriented processing controls
  • Enterprise data governance and sensitive-content discovery
  • Document intake pipelines for finance, healthcare, legal, HR, and public-sector use cases
  • Pre-ingestion filtering for search, retrieval, and RAG systems
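
To sketch the redaction use case: once the model's predictions have been aggregated into character-level spans (start/end offsets plus a label, as produced by standard token-classification post-processing), masking them in the source text is a small, model-independent step. The `redact` helper below is illustrative only and not part of the model's API; the spans are hand-written to match labels from QuillSafe's schema.

```python
def redact(text, spans, mask="[{label}]"):
    """Replace each detected span with a mask token.

    `spans` is a list of dicts with character offsets and a label,
    e.g. the output of a token-classification pipeline after span
    aggregation. Spans are applied right-to-left so that earlier
    character offsets remain valid while the text is being edited.
    """
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[:span["start"]] + mask.format(label=span["label"]) + text[span["end"]:]
    return text


# Example with hand-written spans using labels from the QuillSafe schema
text = "Jan Kowalski, PESEL 85010112345"
spans = [
    {"start": 0, "end": 12, "label": "PERSON_NAME"},
    {"start": 20, "end": 31, "label": "NATIONAL_ID_NUMBER"},
]
print(redact(text, spans))
# [PERSON_NAME], PESEL [NATIONAL_ID_NUMBER]
```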

Supported Language

  • Polish (pl)

QuillSafe is optimized for Polish-language content, including business, administrative, operational, and user-generated text. Real-world performance may vary depending on domain vocabulary, formatting quality, OCR noise, abbreviations, and annotation conventions.

Detected Entity Types

QuillSafe detects the following 37 classes:

Personal Identity and Profile

  • PERSON_NAME
  • DATE_OF_BIRTH
  • PERSON_ATTRIBUTE
  • PERSON_ALIAS

Contact and Location

  • EMAIL_ADDRESS
  • PHONE_NUMBER
  • CONTACT_HANDLE
  • POSTAL_ADDRESS
  • LOCATION
  • GEO_LOCATION

Technical and Digital Identifiers

  • IP_ADDRESS
  • DEVICE_IDENTIFIER
  • COOKIE_IDENTIFIER
  • ACCOUNT_IDENTIFIER
  • AUTH_SECRET

Financial Data

  • BANK_ACCOUNT
  • PAYMENT_CARD
  • PAYMENT_CARD_METADATA
  • FINANCIAL_TRANSACTION
  • FINANCIAL_AMOUNT
  • SALARY_COMPENSATION

Government and Official Identifiers

  • NATIONAL_ID_NUMBER
  • PASSPORT_NUMBER
  • DRIVER_LICENSE_NUMBER
  • TAX_ID_NUMBER
  • HEALTH_INSURANCE_ID
  • RESIDENCE_PERMIT_NUMBER

Health and Biometric Data

  • HEALTH_DATA
  • GENETIC_DATA
  • BIOMETRIC_DATA

Special-Category and Highly Sensitive Personal Data

  • RELIGION_OR_BELIEF
  • POLITICAL_OPINION
  • SEXUAL_ORIENTATION
  • TRADE_UNION_MEMBERSHIP
  • ETHNIC_ORIGIN
  • CRIMINAL_OFFENCE_DATA

Employment and Contextual Sensitivity

  • EMPLOYMENT_CONTEXT

Model Architecture

  • Architecture family: Transformer (encoder-only)
  • Base checkpoint: FacebookAI/xlm-roberta-base
  • Parameters: ~0.3B (F32 weights, Safetensors)
  • Fine-tuning task: Token classification / named-entity-style span labeling
  • Primary focus: Detection of PII and regulated sensitive entities in Polish

As a token classification model, QuillSafe predicts entity labels at the token level and supports span reconstruction through standard BIO-style or token-aligned post-processing, depending on the deployment pipeline.
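
As an illustration of the BIO-style post-processing mentioned above, the sketch below merges per-token tags into labeled spans. It assumes BIO-prefixed labels (`B-…`/`I-…`/`O`) over word-level tokens; the exact label format and any subword alignment depend on the deployment pipeline, so treat this as a starting point rather than the model's canonical decoder.

```python
def bio_to_spans(tokens, tags):
    """Merge BIO tags (e.g. B-PERSON_NAME, I-PERSON_NAME, O)
    into (label, token_list) spans."""
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span, closing any open one
            if current_label:
                spans.append((current_label, current_tokens))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the current span
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag closes any open span
            if current_label:
                spans.append((current_label, current_tokens))
            current_label, current_tokens = None, []
    if current_label:
        spans.append((current_label, current_tokens))
    return spans


tokens = ["Jan", "Kowalski", ",", "PESEL", "85010112345"]
tags = ["B-PERSON_NAME", "I-PERSON_NAME", "O", "O", "B-NATIONAL_ID_NUMBER"]
print(bio_to_spans(tokens, tags))
# [('PERSON_NAME', ['Jan', 'Kowalski']), ('NATIONAL_ID_NUMBER', ['85010112345'])]
```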

Quick Start

Example usage with Hugging Face Transformers:

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "bardsai/quillsafe"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Jan Kowalski, PESEL 85010112345, tel 123 456 789"

inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Predict a label for every token (no gradients needed at inference time)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Map token ids back to token strings and label ids back to label names
input_ids = inputs["input_ids"][0]
tokens = tokenizer.convert_ids_to_tokens(input_ids)
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Print only the (subword) tokens predicted as sensitive
for token, label in zip(tokens, labels):
    if label != "O":
        print(label, token)

Limitations

While QuillSafe is optimized for robust Polish PII detection, the following limitations should be considered:

  • The model may be sensitive to annotation-policy differences between organizations.
  • Ambiguous phrases may require context-aware post-processing.
  • OCR artifacts, broken formatting, or token fragmentation can reduce span accuracy.
  • Some classes, especially sensitive contextual categories, may require human validation in compliance-heavy environments.
  • The model should not be treated as a standalone legal or compliance decision-maker.

About bards.ai

At bards.ai, we focus on providing machine learning expertise and skills to our partners, particularly in the areas of NLP, machine vision, and time-series analysis. Our team is located in Wrocław, Poland. Please visit our website for more information: bards.ai

Let us know if you use our model :). Also, if you need any help, feel free to contact us at [email protected]
