---
language:
- en
license: apache-2.0
tags:
- token-classification
- named-entity-recognition
- clinical-nlp
- addiction-medicine
- treatment-detection
- privacy-preserving
- information-masking
base_model: thomas-sounack/BioClinical-ModernBERT-large
datasets:
- private
model-index:
- name: BioClinical-Treatment-Detector
  results:
  - task:
      type: token-classification
      name: Token Classification
    dataset:
      name: Addiction Medicine Clinical Notes
      type: clinical-text
    metrics:
    - type: f1
      value: 0.892
      name: Treatment F1-Score
    - type: precision
      value: 0.885
      name: Treatment Precision
    - type: recall
      value: 0.899
      name: Treatment Recall
---

# BioClinical Treatment Information Detector

## Model Description

This model is a specialized token classification system designed to detect treatment-related information in addiction medicine clinical notes. It is fine-tuned from `thomas-sounack/BioClinical-ModernBERT-large` to identify current and future treatment plans, medication decisions, and therapeutic interventions while preserving patient privacy.

**Key Purpose**: Prevent information leakage about locus of care and medication decisions when training clinical decision support systems in addiction medicine. This model enables researchers to mask sensitive treatment information before using clinical data for machine learning applications.

## Intended Use

### Primary Use Case
- **Privacy-preserving clinical AI**: Mask treatment-related information in clinical notes before training decision support systems (see the masking sketch after this list)
- **Research data preparation**: Identify and redact sensitive treatment details while preserving other clinical information
- **Compliance support**: Help maintain patient confidentiality when sharing clinical datasets for research
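
A minimal masking sketch: given character-offset spans in the format the examples on this card produce, replace each detected span with a placeholder. The note and spans below are hypothetical, for illustration only.

```python
def mask_treatments(text: str, spans: list, placeholder: str = "[TREATMENT]") -> str:
    """Replace each detected span with a placeholder token."""
    # Work right-to-left so earlier character offsets stay valid.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[:span["start"]] + placeholder + text[span["end"]:]
    return text

note = "Plan: start buprenorphine 8mg twice daily. Patient reports stable housing."
spans = [{"start": 6, "end": 41}]  # hypothetical model output
print(mask_treatments(note, spans))
# Plan: [TREATMENT]. Patient reports stable housing.
```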

### What It Detects
- Current medication prescriptions and dosages
- Treatment plans and recommendations
- Therapeutic interventions and procedures
- Follow-up care instructions
- Clinical advice and care coordination

### What It Does NOT Detect
- **Past treatment history** (focuses only on current/future treatments)
- **Personally Identifiable Information (PII)** such as names, addresses, phone numbers
- **General medical conditions** or diagnoses
- **Demographics** or personal details

## Model Details

- **Model Type**: Token Classification (NER)
- **Base Model**: thomas-sounack/BioClinical-ModernBERT-large
- **Language**: English
- **Domain**: Clinical text (addiction medicine)
- **Training Data**: Single-center clinical notes from an addiction medicine department
- **Labels** (illustrated below):
  - `O`: Outside treatment information
  - `B-TREATMENT`: Beginning of a treatment entity
  - `I-TREATMENT`: Inside a treatment entity
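
For illustration, a hypothetical sentence tagged in this BIO scheme (word-level, before subword tokenization):

```python
tokens = ["Start", "buprenorphine", "8mg", "twice", "daily", ".", "Patient", "lives", "alone", "."]
labels = ["B-TREATMENT", "I-TREATMENT", "I-TREATMENT", "I-TREATMENT", "I-TREATMENT", "O", "O", "O", "O", "O"]
```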

## Performance

The model achieves strong performance on treatment detection:
- **Treatment F1-Score**: 0.892
- **Treatment Precision**: 0.885
- **Treatment Recall**: 0.899

These metrics reflect the model's ability to accurately identify treatment-related spans while keeping both false positives and false negatives low.
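
Span-level scores like these are commonly computed with the `seqeval` library; a minimal sketch, where the gold and predicted tag sequences are illustrative rather than taken from the evaluation set:

```python
from seqeval.metrics import classification_report

# One inner list of BIO tags per sentence
y_true = [["B-TREATMENT", "I-TREATMENT", "O", "O", "O"]]
y_pred = [["B-TREATMENT", "I-TREATMENT", "O", "B-TREATMENT", "O"]]

print(classification_report(y_true, y_pred))
```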

## Limitations and Bias

### Domain Specificity
- **Single-center training**: The model is trained exclusively on data from one addiction medicine center
- **Specialty focus**: Optimized for addiction medicine; may not generalize well to other medical specialties
- **Language limitation**: English-only model

### Temporal Focus
- **Current/future treatments only**: Does not detect historical treatment information
- **Context dependency**: Performance may vary with different clinical note structures

### Ethical Considerations
- This model is designed for **defensive security purposes only**
- It should be used to **protect patient privacy**, not to extract sensitive information
- Users must ensure compliance with healthcare privacy regulations (HIPAA, GDPR, etc.)

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "Lekhansh/bioclinical-treatment-detector"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

# Example clinical text
text = """
Treatment Plan:
1. Start Tablet Buprenorphine 8mg twice daily
2. Continue counseling sessions weekly
3. Follow up in outpatient clinic after 2 weeks
"""

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512,
                   return_offsets_mapping=True)
with torch.no_grad():
    outputs = model(**{k: v for k, v in inputs.items() if k != "offset_mapping"})

# Get per-token label predictions
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_labels = torch.argmax(predictions, dim=-1)[0]

# Map predictions back to character spans in the original text
id2label = {0: "O", 1: "B-TREATMENT", 2: "I-TREATMENT"}
offset_mapping = inputs["offset_mapping"][0]

treatment_spans = []
current_span = None

for label_id, (start, end) in zip(predicted_labels, offset_mapping):
    if start == 0 and end == 0:  # Skip special tokens
        continue

    label = id2label[label_id.item()]

    if label == "B-TREATMENT":
        if current_span:
            treatment_spans.append(current_span)
        current_span = {"start": start.item(), "end": end.item()}
    elif label == "I-TREATMENT" and current_span:
        current_span["end"] = end.item()
    else:
        if current_span:
            treatment_spans.append(current_span)
        current_span = None

if current_span:
    treatment_spans.append(current_span)

# Extract treatment text
for span in treatment_spans:
    treatment_text = text[span["start"]:span["end"]]
    print(f"Treatment detected: '{treatment_text}'")
```
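
For many use cases, the same spans can likely be obtained more simply with the `transformers` pipeline helper, which merges subword pieces into whole entity spans; a sketch:

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="Lekhansh/bioclinical-treatment-detector",
    aggregation_strategy="simple",  # merge B-/I- pieces into whole spans
)
print(ner("Start buprenorphine 8mg twice daily."))
```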

## Advanced Usage

For more sophisticated inference with confidence scores, a reusable detector class is shown below; a simple way to handle batches of notes follows after it:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification


class TreatmentDetector:
    def __init__(self, model_name):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(model_name)
        self.model.eval()
        self.id2label = {0: "O", 1: "B-TREATMENT", 2: "I-TREATMENT"}

    def detect_treatments(self, text, confidence_threshold=0.5):
        encoding = self.tokenizer(
            text, return_tensors="pt", truncation=True, max_length=8192,
            return_offsets_mapping=True,
        )

        with torch.no_grad():
            outputs = self.model(**{k: v for k, v in encoding.items() if k != "offset_mapping"})
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_labels = torch.argmax(predictions, dim=-1)[0]
        confidence_scores = torch.max(predictions, dim=-1)[0][0]

        treatment_spans = []
        current_span = None

        for label_id, confidence, (start, end) in zip(
            predicted_labels, confidence_scores, encoding["offset_mapping"][0]
        ):
            if start == 0 and end == 0:  # Skip special tokens
                continue

            label = self.id2label[label_id.item()]
            conf = confidence.item()

            if label == "B-TREATMENT" and conf > confidence_threshold:
                if current_span:
                    treatment_spans.append(current_span)
                current_span = {"start": start.item(), "end": end.item(), "confidence": conf}
            elif label == "I-TREATMENT" and current_span and conf > confidence_threshold:
                current_span["end"] = end.item()
                # Simple running average; later tokens are weighted more heavily
                current_span["confidence"] = (current_span["confidence"] + conf) / 2
            else:
                if current_span:
                    treatment_spans.append(current_span)
                current_span = None

        if current_span:
            treatment_spans.append(current_span)

        # Attach the surface text of each detected span
        for span in treatment_spans:
            span["text"] = text[span["start"]:span["end"]]

        return treatment_spans


# Usage
detector = TreatmentDetector("Lekhansh/bioclinical-treatment-detector")
clinical_text = "Plan: continue methadone 60mg daily; review in 2 weeks."  # example note
treatments = detector.detect_treatments(clinical_text)
```
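
To process many notes, the detector can simply be applied in a loop; a minimal sketch (the notes below are hypothetical):

```python
notes = [
    "Plan: start naltrexone 50mg daily.",
    "Patient declined further follow-up.",
]
for i, note in enumerate(notes):
    spans = detector.detect_treatments(note)
    print(f"note {i}: {len(spans)} treatment span(s) detected")
```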

## Training Details

### Training Data
- **Source**: Clinical notes from a single addiction medicine center
- **Annotation**: Manual annotation of treatment-related text spans
- **Size**: Balanced dataset with both positive and negative examples
- **Preprocessing**: Text segmentation with sliding windows for long documents (sketched below)
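
The card does not specify the exact windowing parameters; a typical sliding-window setup with the `transformers` tokenizer looks like the following, where the window size and stride are assumptions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("thomas-sounack/BioClinical-ModernBERT-large")
long_note = "... " * 2000  # stand-in for a long clinical note

enc = tokenizer(
    long_note,
    truncation=True,
    max_length=512,                  # assumed window size
    stride=128,                      # assumed overlap between consecutive windows
    return_overflowing_tokens=True,  # emit one encoding per window
    return_offsets_mapping=True,
)
print(len(enc["input_ids"]), "windows")
```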

### Training Configuration
The following hyperparameters were used (mapped to a `TrainingArguments` sketch below):
- **Base Model**: thomas-sounack/BioClinical-ModernBERT-large
- **Training Epochs**: 3
- **Batch Size**: 8 (with gradient accumulation)
- **Learning Rate**: 5e-5
- **Optimizer**: AdamW with weight decay
- **Hardware**: Single GPU training
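
For reference, these hyperparameters map onto `transformers` `TrainingArguments` roughly as follows; the output directory, accumulation steps, and weight-decay value are not given on this card and are assumptions:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bioclinical-treatment-detector",  # assumed
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,  # assumed; card only says "with gradient accumulation"
    learning_rate=5e-5,
    weight_decay=0.01,              # assumed; card only says "with weight decay"
)
```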

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{bioclinical-treatment-detector,
  title={Addiction Medicine Treatment Information Detector for Clinical AI},
  author={Lekhansh S and Prakrithi SN},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Lekhansh/bioclinical-treatment-detector}}
}
```

## Contact

For questions about this model or its applications in privacy-preserving clinical AI, please contact drlekhansh@gmail.com.

## License

This model is released under the Apache 2.0 License. Please ensure compliance with all applicable healthcare privacy regulations when using this model.