QuillSafe
Polish PII and Sensitive Data Detection Model
QuillSafe is a production-oriented token classification model for detecting personally identifiable information (PII) and other regulated or high-sensitivity data in Polish-language text. It is designed for privacy-preserving NLP workflows, document redaction, secure data processing, and compliance-driven preprocessing pipelines.
Built on top of XLM-RoBERTa-base, QuillSafe identifies a broad set of sensitive entities across personal, financial, identity, health, geolocation, authentication, and special-category personal data. The model is intended for organizations that need reliable Polish-first detection of sensitive content before indexing, analytics, model training, or downstream AI processing.
This is v0.1 of QuillSafe; we will continue to refine and improve it.
Key Highlights
- Language support: Polish
- Task: Token classification
- Base model: XLM-RoBERTa-base
- Global F1 score: 95%
- Entity schema: 37 sensitive-data classes
Intended Use
QuillSafe is designed for automated detection and labeling of sensitive spans in Polish text, including both classic PII and higher-risk regulated categories.
Typical use cases include:
- PII redaction in documents, tickets, emails, and chat logs
- Dataset sanitization before model training, annotation, or analytics
- Compliance workflows supporting GDPR-oriented processing controls
- Enterprise data governance and sensitive-content discovery
- Document intake pipelines for finance, healthcare, legal, HR, and public-sector use cases
- Pre-ingestion filtering for search, retrieval, and RAG systems
Supported Language
- Polish (pl)
QuillSafe is optimized for Polish-language content, including business, administrative, operational, and user-generated text. Real-world performance may vary depending on domain vocabulary, formatting quality, OCR noise, abbreviations, and annotation conventions.
Detected Entity Types
QuillSafe detects the following 37 classes:
Personal Identity and Profile
- PERSON_NAME
- DATE_OF_BIRTH
- PERSON_ATTRIBUTE
- PERSON_ALIAS
Contact and Location
- EMAIL_ADDRESS
- PHONE_NUMBER
- CONTACT_HANDLE
- POSTAL_ADDRESS
- LOCATION
- GEO_LOCATION
Technical and Digital Identifiers
- IP_ADDRESS
- DEVICE_IDENTIFIER
- COOKIE_IDENTIFIER
- ACCOUNT_IDENTIFIER
- AUTH_SECRET
Financial Data
- BANK_ACCOUNT
- PAYMENT_CARD
- PAYMENT_CARD_METADATA
- FINANCIAL_TRANSACTION
- FINANCIAL_AMOUNT
- SALARY_COMPENSATION
Government and Official Identifiers
- NATIONAL_ID_NUMBER
- PASSPORT_NUMBER
- DRIVER_LICENSE_NUMBER
- TAX_ID_NUMBER
- HEALTH_INSURANCE_ID
- RESIDENCE_PERMIT_NUMBER
Health and Biometric Data
- HEALTH_DATA
- GENETIC_DATA
- BIOMETRIC_DATA
Special-Category and Highly Sensitive Personal Data
- RELIGION_OR_BELIEF
- POLITICAL_OPINION
- SEXUAL_ORIENTATION
- TRADE_UNION_MEMBERSHIP
- ETHNIC_ORIGIN
- CRIMINAL_OFFENCE_DATA
Employment and Contextual Sensitivity
- EMPLOYMENT_CONTEXT
Model Architecture
- Architecture family: Transformer
- Base checkpoint: FacebookAI/xlm-roberta-base
- Fine-tuning task: Token classification / named-entity-style span labeling
- Primary focus: Detection of PII and regulated sensitive entities in Polish
As a token classification model, QuillSafe predicts entity labels at the token level and supports span reconstruction through standard BIO-style or token-aligned post-processing, depending on the deployment pipeline.
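The BIO-style span reconstruction mentioned above can be sketched with a small helper. `merge_bio_spans` is a hypothetical name, not part of the model's API; production pipelines would typically also map token indices back to character offsets via the tokenizer's `offset_mapping`:

```python
def merge_bio_spans(labels):
    """Merge a sequence of token-level BIO labels into entity spans.

    Returns a list of (entity_type, token_indices) tuples. A "B-" label
    starts a new span; an "I-" label with a matching type extends the
    current span; anything else (including "O") closes it.
    """
    spans = []
    current = None  # (entity_type, [token indices])
    for i, label in enumerate(labels):
        if label.startswith("B-"):
            if current:
                spans.append(current)
            current = (label[2:], [i])
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[1].append(i)
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans
```

For example, `["B-PERSON_NAME", "I-PERSON_NAME", "O", "B-PHONE_NUMBER"]` yields one PERSON_NAME span over tokens 0-1 and one PHONE_NUMBER span over token 3.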
Quick Start
Example usage with Hugging Face Transformers:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "bardsai/quillsafe"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Jan Kowalski, PESEL 85010112345, tel 123 456 789"
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.argmax(outputs.logits, dim=-1)
input_ids = inputs["input_ids"][0]
tokens = tokenizer.convert_ids_to_tokens(input_ids)
labels = [model.config.id2label[p.item()] for p in predictions[0]]

for token, label in zip(tokens, labels):
    if label != "O":
        print(label, token)
```
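For redaction use cases, the predicted labels can be projected back to character offsets (e.g. by tokenizing with `return_offsets_mapping=True`) and the matched spans masked. The `redact` helper below is a minimal sketch, assuming you already have sorted, non-overlapping `(start, end, label)` spans:

```python
def redact(text, char_spans, mask="[{label}]"):
    """Replace detected character spans with placeholder tags.

    char_spans: iterable of (start, end, label) tuples, assumed to be
    sorted by start offset and non-overlapping.
    """
    parts = []
    pos = 0
    for start, end, label in char_spans:
        parts.append(text[pos:start])          # untouched text before the span
        parts.append(mask.format(label=label))  # placeholder for the entity
        pos = end
    parts.append(text[pos:])                   # trailing text after last span
    return "".join(parts)
```

For instance, redacting the PERSON_NAME and PHONE_NUMBER spans in `"Jan Kowalski, tel 123 456 789"` produces `"[PERSON_NAME], tel [PHONE_NUMBER]"`.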
Limitations
While QuillSafe is optimized for robust Polish PII detection, the following limitations should be considered:
- The model may be sensitive to annotation-policy differences between organizations.
- Ambiguous phrases may require context-aware post-processing.
- OCR artifacts, broken formatting, or token fragmentation can reduce span accuracy.
- Some classes, especially sensitive contextual categories, may require human validation in compliance-heavy environments.
- The model should not be treated as a standalone legal or compliance decision-maker.
About bards.ai
At bards.ai, we focus on providing machine learning expertise and skills to our partners, particularly in NLP, machine vision, and time series analysis. Our team is based in Wrocław, Poland. Please visit our website for more information: bards.ai
Let us know if you use our model :). Also, if you need any help, feel free to contact us at [email protected]