HIPAA-BERT: PII/PHI Column Name Classifier
A fine-tuned BERT model for classifying database column names as PII (Personally Identifiable Information), PHI (Protected Health Information), or Other (O).
Model Details
| Property | Value |
|---|---|
| Developer | KronosX AI Labs |
| Model Type | BERT + LoRA (text classification) |
| Base Model | bert-base-uncased |
| Language | English |
| Fine-tuning Method | LoRA (Low-Rank Adaptation) |
| Task | Sequence Classification (3 classes) |
Labels
| Label | Description | Examples |
|---|---|---|
| O | Other/Safe columns | id, created_at, status |
| PII | Personally Identifiable Info | email, phone_number, address |
| PHI | Protected Health Info (HIPAA) | diagnosis_code, patient_name, ssn |
Training Details
Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 1e-3 |
| Batch Size | 64 |
| Epochs | 10 |
| Weight Decay | 0.01 |
| Max Sequence Length | 64 |
| LoRA Rank (r) | 16 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.1 |
| Target Modules | query, value |
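To give a sense of scale for these LoRA settings, the following back-of-the-envelope sketch counts the adapter weights they imply. It assumes the standard bert-base-uncased shape (12 encoder layers, hidden size 768) and that adapters are attached only to the query and value projections listed above. The classification head and any other trainable parameters are not counted, and the function name is illustrative.

```python
# Rough count of LoRA adapter weights for the settings in the table above.
# Assumes bert-base-uncased dimensions: 12 encoder layers, hidden size 768.

HIDDEN_SIZE = 768                    # width of each query/value projection
NUM_LAYERS = 12                      # encoder layers in BERT-base
TARGET_MODULES = ("query", "value")  # from the hyperparameter table
LORA_RANK = 16
LORA_ALPHA = 32


def lora_params_per_matrix(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA adapter adds two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


per_matrix = lora_params_per_matrix(HIDDEN_SIZE, HIDDEN_SIZE, LORA_RANK)
adapter_params = per_matrix * len(TARGET_MODULES) * NUM_LAYERS
scaling = LORA_ALPHA / LORA_RANK  # factor applied to the low-rank update

print(f"adapter weights per matrix: {per_matrix}")
print(f"total adapter weights:      {adapter_params}")
print(f"LoRA scaling (alpha / r):   {scaling}")
```

Under these assumptions the adapters add about 0.59M weights, a small fraction of the roughly 110M parameters in BERT-base.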
Training Data
Custom HIPAA-compliant dataset of more than 50,000 labeled column names drawn from healthcare databases.
Hardware
- GPU: NVIDIA GPU (Kaggle)
- Mixed Precision: FP16 enabled
Performance Metrics
| Metric | Score |
|---|---|
| Accuracy | ~95%+ |
| F1 (weighted) | ~94%+ |
| Precision | ~93%+ |
| Recall | ~94%+ |
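The card does not state how the averaged scores were computed. As a reference point, here is a minimal sketch of support-weighted precision, recall, and F1 over the three labels. The toy predictions and the function name are hypothetical and are not taken from the model's evaluation.

```python
from collections import Counter

LABELS = ["O", "PII", "PHI"]


def weighted_scores(y_true, y_pred, labels=LABELS):
    """Return accuracy and support-weighted precision, recall, and F1."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        p_l = tp / (tp + fp) if tp + fp else 0.0
        r_l = tp / (tp + fn) if tp + fn else 0.0
        f_l = 2 * p_l * r_l / (p_l + r_l) if p_l + r_l else 0.0
        weight = support[label] / total  # each class counts by its share of true labels
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy data, for illustration only.
y_true = ["O", "O", "PII", "PII", "PHI", "PHI"]
y_pred = ["O", "PII", "PII", "PII", "PHI", "O"]
scores = weighted_scores(y_true, y_pred)
print(scores)
```

With support weighting, recall coincides with accuracy by construction, which is a quick sanity check when reproducing the numbers.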
Usage
Installation
pip install transformers torch
Quick Start
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch

    # Load model
    model_name = "KronosXAI/HIPAA-BERT-v0.1"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    # Classify column names
    label_map = {0: "O", 1: "PII", 2: "PHI"}
    columns = ["patient_name", "diagnosis_code", "created_at", "email", "status"]
    for col in columns:
        inputs = tokenizer(col, return_tensors="pt", truncation=True, max_length=64)
        with torch.no_grad():
            outputs = model(**inputs)
        prediction = torch.argmax(outputs.logits, dim=-1).item()
        print(f"{col}: {label_map[prediction]}")
Expected Output
    patient_name: PHI
    diagnosis_code: PHI
    created_at: O
    email: PII
    status: O
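The Quick Start keeps only the arg-max class. If a confidence score is also useful, the logits can be passed through a softmax first. The sketch below shows that step in plain Python on hypothetical logit values. `logits_to_prediction` is an illustrative helper, and the label order is assumed to match the `label_map` in the Quick Start.

```python
import math

LABEL_MAP = {0: "O", 1: "PII", 2: "PHI"}  # same id order as the Quick Start


def logits_to_prediction(logits):
    """Convert one row of raw logits into (label, confidence) via softmax."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_MAP[best], probs[best]


# Hypothetical logits for a single column name (not real model output).
label, confidence = logits_to_prediction([0.2, 0.1, 3.5])
print(f"{label} ({confidence:.2f})")
```

With the real model, `outputs.logits[0].tolist()` would supply the input row.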
Intended Use
Primary Use Cases
- Automatic PII/PHI detection in database schemas
- Data privacy compliance audits
- HIPAA compliance automation
- Healthcare data anonymization pipelines
Out-of-Scope
- Inspecting actual data content: the model classifies column names only
- Classifying free-text or unstructured data
- Acting as the sole arbiter of compliance: use it as one part of a larger compliance workflow
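One way to keep the model from being the sole arbiter is to treat its output as a triage signal: confident "O" predictions pass through, and everything else is queued for a human reviewer. The sketch below illustrates this with a stub classifier standing in for the model. The threshold, the function names, and the stub's keyword rules are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Classifier = Callable[[str], Tuple[str, float]]  # column name -> (label, confidence)


def triage_schema(columns: List[str], classify: Classifier,
                  threshold: float = 0.9) -> Dict[str, List[str]]:
    """Split columns into auto-cleared and needs-review buckets."""
    result: Dict[str, List[str]] = {"cleared": [], "review": []}
    for col in columns:
        label, confidence = classify(col)
        # Only confident "O" predictions skip review; sensitive or uncertain
        # columns always go to a human.
        if label == "O" and confidence >= threshold:
            result["cleared"].append(col)
        else:
            result["review"].append(col)
    return result


def stub_classifier(column: str) -> Tuple[str, float]:
    """Hypothetical stand-in for the model, used only to make the sketch runnable."""
    if "patient" in column or "diagnosis" in column:
        return "PHI", 0.97
    if column in {"email", "phone_number"}:
        return "PII", 0.95
    if column == "notes":
        return "O", 0.55  # low confidence: should be reviewed
    return "O", 0.99


report = triage_schema(
    ["id", "patient_name", "email", "notes", "created_at"], stub_classifier
)
print(report)
```

The asymmetry is deliberate: a missed sensitive column is costlier than an unnecessary review, so only high-confidence "O" results are cleared automatically.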
Limitations & Bias
- Trained primarily on English column naming conventions
- May not generalize to non-standard or domain-specific naming patterns
- Should be validated with domain experts before production use
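Since the examples on this card are lowercase snake_case, it may help to normalize unusual column names before classification. The following sketch converts camelCase, PascalCase, and mixed separators to snake_case. Whether this improves predictions is untested here, and `normalize_column_name` is a hypothetical helper.

```python
import re


def normalize_column_name(name: str) -> str:
    """Convert camelCase, PascalCase, and mixed separators to lowercase snake_case."""
    # Insert an underscore between a lowercase letter or digit and an uppercase letter.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Split acronym runs from a following capitalized word: "SSNNumber" -> "SSN_Number".
    name = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "_", name)
    # Turn spaces, dots, and hyphens into underscores.
    name = re.sub(r"[\s.\-]+", "_", name)
    # Collapse repeated underscores and trim them from the ends.
    name = re.sub(r"_+", "_", name).strip("_")
    return name.lower()


for raw in ["patientName", "DiagnosisCode", "Phone-Number", "SSNNumber", "created_at"]:
    print(raw, "->", normalize_column_name(raw))
```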
Model Card Authors
Abishek - KronosX AI Labs
Citation
    @misc{hipaa-bert-2024,
      author = {KronosX AI Labs},
      title  = {HIPAA-BERT: PII/PHI Column Name Classifier},
      year   = {2026},
      url    = {https://huggingface.co/KronosXAI/HIPAA-BERT-v0.1}
    }
Links
- Organization: KronosX AI Labs