QuillSafe
Polish PII and Sensitive Data Detection Model
QuillSafe is a production-oriented token classification model for detecting personally identifiable information (PII) and other regulated or high-sensitivity data in Polish-language text. It is designed for privacy-preserving NLP workflows, document redaction, secure data processing, and compliance-driven preprocessing pipelines.
Built on top of XLM-RoBERTa-base, QuillSafe identifies a broad set of sensitive entities across personal, financial, identity, health, geolocation, authentication, and special-category personal data. The model is intended for organizations that need reliable Polish-first detection of sensitive content before indexing, analytics, model training, or downstream AI processing.
This is v0.1 of QuillSafe; we will continue to refine and improve it.
Key Highlights
- Language support: Polish
- Task: Token classification
- Base model: XLM-RoBERTa-base
- Global F1 score: 95%
- Entity schema: 37 sensitive-data classes
Intended Use
QuillSafe is designed for automated detection and labeling of sensitive spans in Polish text, including both classic PII and higher-risk regulated categories.
Typical use cases include:
- PII redaction in documents, tickets, emails, and chat logs
- Dataset sanitization before model training, annotation, or analytics
- Compliance workflows supporting GDPR-oriented processing controls
- Enterprise data governance and sensitive-content discovery
- Document intake pipelines for finance, healthcare, legal, HR, and public-sector use cases
- Pre-ingestion filtering for search, retrieval, and RAG systems
Supported Language
- Polish (pl)
QuillSafe is optimized for Polish-language content, including business, administrative, operational, and user-generated text. Real-world performance may vary depending on domain vocabulary, formatting quality, OCR noise, abbreviations, and annotation conventions.
Detected Entity Types
QuillSafe detects the following 37 classes:
Personal Identity and Profile
- PERSON_NAME
- DATE_OF_BIRTH
- PERSON_ATTRIBUTE
- PERSON_ALIAS
Contact and Location
- EMAIL_ADDRESS
- PHONE_NUMBER
- CONTACT_HANDLE
- POSTAL_ADDRESS
- LOCATION
- GEO_LOCATION
Technical and Digital Identifiers
- IP_ADDRESS
- DEVICE_IDENTIFIER
- COOKIE_IDENTIFIER
- ACCOUNT_IDENTIFIER
- AUTH_SECRET
Financial Data
- BANK_ACCOUNT
- PAYMENT_CARD
- PAYMENT_CARD_METADATA
- FINANCIAL_TRANSACTION
- FINANCIAL_AMOUNT
- SALARY_COMPENSATION
Government and Official Identifiers
- NATIONAL_ID_NUMBER
- PASSPORT_NUMBER
- DRIVER_LICENSE_NUMBER
- TAX_ID_NUMBER
- HEALTH_INSURANCE_ID
- RESIDENCE_PERMIT_NUMBER
Health and Biometric Data
- HEALTH_DATA
- GENETIC_DATA
- BIOMETRIC_DATA
Special-Category and Highly Sensitive Personal Data
- RELIGION_OR_BELIEF
- POLITICAL_OPINION
- SEXUAL_ORIENTATION
- TRADE_UNION_MEMBERSHIP
- ETHNIC_ORIGIN
- CRIMINAL_OFFENCE_DATA
Employment and Contextual Sensitivity
- EMPLOYMENT_CONTEXT
Model Architecture
- Architecture family: Transformer
- Base checkpoint: FacebookAI/xlm-roberta-base
- Fine-tuning task: Token classification / named-entity-style span labeling
- Primary focus: Detection of PII and regulated sensitive entities in Polish
As a token classification model, QuillSafe predicts entity labels at the token level and supports span reconstruction through standard BIO-style or token-aligned post-processing, depending on the deployment pipeline.
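The BIO-style span reconstruction mentioned above can be sketched with a small helper. `merge_bio_spans` is a hypothetical name, not part of the model's API; production pipelines would typically also map token indices back to character offsets via the tokenizer's `offset_mapping`:

```python
def merge_bio_spans(labels):
    """Merge a sequence of token-level BIO labels into entity spans.

    Returns a list of (entity_type, token_indices) tuples. A "B-" label
    starts a new span; an "I-" label with a matching type extends the
    current span; anything else (including "O") closes it.
    """
    spans = []
    current = None  # (entity_type, [token indices])
    for i, label in enumerate(labels):
        if label.startswith("B-"):
            if current:
                spans.append(current)
            current = (label[2:], [i])
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[1].append(i)
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans
```

For example, `["B-PERSON_NAME", "I-PERSON_NAME", "O", "B-PHONE_NUMBER"]` yields one PERSON_NAME span over tokens 0-1 and one PHONE_NUMBER span over token 3.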
Quick Start
Example usage with Hugging Face Transformers:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "bardsai/quillsafe"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Jan Kowalski, PESEL 85010112345, tel 123 456 789"
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.argmax(outputs.logits, dim=-1)
input_ids = inputs["input_ids"][0]
tokens = tokenizer.convert_ids_to_tokens(input_ids)
labels = [model.config.id2label[p.item()] for p in predictions[0]]

for token, label in zip(tokens, labels):
    if label != "O":
        print(label, token)
```
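For redaction use cases, the predicted labels can be projected back to character offsets (e.g. by tokenizing with `return_offsets_mapping=True`) and the matched spans masked. The `redact` helper below is a minimal sketch, assuming you already have sorted, non-overlapping `(start, end, label)` spans:

```python
def redact(text, char_spans, mask="[{label}]"):
    """Replace detected character spans with placeholder tags.

    char_spans: iterable of (start, end, label) tuples, assumed to be
    sorted by start offset and non-overlapping.
    """
    parts = []
    pos = 0
    for start, end, label in char_spans:
        parts.append(text[pos:start])          # untouched text before the span
        parts.append(mask.format(label=label))  # placeholder for the entity
        pos = end
    parts.append(text[pos:])                   # trailing text after last span
    return "".join(parts)
```

For instance, redacting the PERSON_NAME and PHONE_NUMBER spans in `"Jan Kowalski, tel 123 456 789"` produces `"[PERSON_NAME], tel [PHONE_NUMBER]"`.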
Limitations
While QuillSafe is optimized for robust Polish PII detection, the following limitations should be considered:
- The model may be sensitive to annotation-policy differences between organizations.
- Ambiguous phrases may require context-aware post-processing.
- OCR artifacts, broken formatting, or token fragmentation can reduce span accuracy.
- Some classes, especially sensitive contextual categories, may require human validation in compliance-heavy environments.
- The model should not be treated as a standalone legal or compliance decision-maker.
About bards.ai
At bards.ai, we focus on providing machine learning expertise and skills to our partners, particularly in NLP, machine vision, and time series analysis. Our team is based in Wrocław, Poland. Please visit our website for more information: bards.ai
Let us know if you use our model :). Also, if you need any help, feel free to contact us at [email protected]