---
language:
- en
license: apache-2.0
tags:
- token-classification
- named-entity-recognition
- clinical-nlp
- addiction-medicine
- treatment-detection
- privacy-preserving
- information-masking
base_model: thomas-sounack/BioClinical-ModernBERT-large
datasets:
- private
model-index:
- name: BioClinical-Treatment-Detector
  results:
  - task:
      type: token-classification
      name: Token Classification
    dataset:
      name: Addiction Medicine Clinical Notes
      type: clinical-text
    metrics:
    - type: f1
      value: 0.892
      name: Treatment F1-Score
    - type: precision
      value: 0.885
      name: Treatment Precision
    - type: recall
      value: 0.899
      name: Treatment Recall
---

# BioClinical Treatment Information Detector

## Model Description

This model is a specialized token classification system designed to detect treatment-related information in addiction medicine clinical notes. It is fine-tuned from `thomas-sounack/BioClinical-ModernBERT-large` to identify current and future treatment plans, medication decisions, and therapeutic interventions while preserving patient privacy.

**Key Purpose**: Prevent information leakage about locus of care and medication decisions when training clinical decision support systems in addiction medicine. This model enables researchers to mask sensitive treatment information before using clinical data for machine learning applications.

## Intended Use

### Primary Use Case
- **Privacy-preserving clinical AI**: Mask treatment-related information in clinical notes before training decision support systems (see the masking sketch after this list)
- **Research data preparation**: Identify and redact sensitive treatment details while preserving other clinical information
- **Compliance support**: Help maintain patient confidentiality when sharing clinical datasets for research
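
A minimal masking sketch: given character-offset spans in the format the examples on this card produce, replace each detected span with a placeholder. The note and spans below are hypothetical, for illustration only.

```python
def mask_treatments(text: str, spans: list, placeholder: str = "[TREATMENT]") -> str:
    """Replace each detected span with a placeholder token."""
    # Work right-to-left so earlier character offsets stay valid.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[:span["start"]] + placeholder + text[span["end"]:]
    return text

note = "Plan: start buprenorphine 8mg twice daily. Patient reports stable housing."
spans = [{"start": 6, "end": 41}]  # hypothetical model output
print(mask_treatments(note, spans))
# Plan: [TREATMENT]. Patient reports stable housing.
```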

### What It Detects
- Current medication prescriptions and dosages
- Treatment plans and recommendations
- Therapeutic interventions and procedures
- Follow-up care instructions
- Clinical advice and care coordination

### What It Does NOT Detect
- **Past treatment history** (focuses only on current/future treatments)
- **Personally Identifiable Information (PII)** such as names, addresses, phone numbers
- **General medical conditions** or diagnoses
- **Demographics** or personal details

## Model Details

- **Model Type**: Token Classification (NER)
- **Base Model**: thomas-sounack/BioClinical-ModernBERT-large
- **Language**: English
- **Domain**: Clinical text (addiction medicine)
- **Training Data**: Single-center clinical notes from an addiction medicine department
- **Labels** (illustrated below):
  - `O`: Outside treatment information
  - `B-TREATMENT`: Beginning of a treatment entity
  - `I-TREATMENT`: Inside a treatment entity
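
For illustration, a hypothetical sentence tagged in this BIO scheme (word-level, before subword tokenization):

```python
tokens = ["Start", "buprenorphine", "8mg", "twice", "daily", ".", "Patient", "lives", "alone", "."]
labels = ["B-TREATMENT", "I-TREATMENT", "I-TREATMENT", "I-TREATMENT", "I-TREATMENT", "O", "O", "O", "O", "O"]
```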

## Performance

The model achieves strong performance on treatment detection:
- **Treatment F1-Score**: 0.892
- **Treatment Precision**: 0.885
- **Treatment Recall**: 0.899

These metrics reflect the model's ability to accurately identify treatment-related spans while keeping both false positives and false negatives low.
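
Span-level scores like these are commonly computed with the `seqeval` library; a minimal sketch, where the gold and predicted tag sequences are illustrative rather than taken from the evaluation set:

```python
from seqeval.metrics import classification_report

# One inner list of BIO tags per sentence
y_true = [["B-TREATMENT", "I-TREATMENT", "O", "O", "O"]]
y_pred = [["B-TREATMENT", "I-TREATMENT", "O", "B-TREATMENT", "O"]]

print(classification_report(y_true, y_pred))
```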

## Limitations and Bias

### Domain Specificity
- **Single-center training**: The model is trained exclusively on data from one addiction medicine center
- **Specialty focus**: Optimized for addiction medicine; may not generalize well to other medical specialties
- **Language limitation**: English-only model

### Temporal Focus
- **Current/future treatments only**: Does not detect historical treatment information
- **Context dependency**: Performance may vary with different clinical note structures

### Ethical Considerations
- This model is designed for **defensive security purposes only**
- It should be used to **protect patient privacy**, not to extract sensitive information
- Users must ensure compliance with healthcare privacy regulations (HIPAA, GDPR, etc.)

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "Lekhansh/bioclinical-treatment-detector"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

# Example clinical text
text = """
Treatment Plan:
1. Start Tablet Buprenorphine 8mg twice daily
2. Continue counseling sessions weekly
3. Follow up in outpatient clinic after 2 weeks
"""

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512,
                   return_offsets_mapping=True)
with torch.no_grad():
    outputs = model(**{k: v for k, v in inputs.items() if k != "offset_mapping"})

# Get per-token label predictions
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_labels = torch.argmax(predictions, dim=-1)[0]

# Map predictions back to character spans in the original text
id2label = {0: "O", 1: "B-TREATMENT", 2: "I-TREATMENT"}
offset_mapping = inputs["offset_mapping"][0]

treatment_spans = []
current_span = None

for label_id, (start, end) in zip(predicted_labels, offset_mapping):
    if start == 0 and end == 0:  # Skip special tokens
        continue

    label = id2label[label_id.item()]

    if label == "B-TREATMENT":
        if current_span:
            treatment_spans.append(current_span)
        current_span = {"start": start.item(), "end": end.item()}
    elif label == "I-TREATMENT" and current_span:
        current_span["end"] = end.item()
    else:
        if current_span:
            treatment_spans.append(current_span)
        current_span = None

if current_span:
    treatment_spans.append(current_span)

# Extract treatment text
for span in treatment_spans:
    treatment_text = text[span["start"]:span["end"]]
    print(f"Treatment detected: '{treatment_text}'")
```
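
For many use cases, the same spans can likely be obtained more simply with the `transformers` pipeline helper, which merges subword pieces into whole entity spans; a sketch:

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="Lekhansh/bioclinical-treatment-detector",
    aggregation_strategy="simple",  # merge B-/I- pieces into whole spans
)
print(ner("Start buprenorphine 8mg twice daily."))
```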

## Advanced Usage

For more sophisticated inference with confidence scores, a reusable detector class is shown below; a simple way to handle batches of notes follows after it:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification


class TreatmentDetector:
    def __init__(self, model_name):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(model_name)
        self.model.eval()
        self.id2label = {0: "O", 1: "B-TREATMENT", 2: "I-TREATMENT"}

    def detect_treatments(self, text, confidence_threshold=0.5):
        encoding = self.tokenizer(
            text, return_tensors="pt", truncation=True, max_length=8192,
            return_offsets_mapping=True,
        )

        with torch.no_grad():
            outputs = self.model(**{k: v for k, v in encoding.items() if k != "offset_mapping"})
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_labels = torch.argmax(predictions, dim=-1)[0]
        confidence_scores = torch.max(predictions, dim=-1)[0][0]

        treatment_spans = []
        current_span = None

        for label_id, confidence, (start, end) in zip(
            predicted_labels, confidence_scores, encoding["offset_mapping"][0]
        ):
            if start == 0 and end == 0:  # Skip special tokens
                continue

            label = self.id2label[label_id.item()]
            conf = confidence.item()

            if label == "B-TREATMENT" and conf > confidence_threshold:
                if current_span:
                    treatment_spans.append(current_span)
                current_span = {"start": start.item(), "end": end.item(), "confidence": conf}
            elif label == "I-TREATMENT" and current_span and conf > confidence_threshold:
                current_span["end"] = end.item()
                # Simple running average; later tokens are weighted more heavily
                current_span["confidence"] = (current_span["confidence"] + conf) / 2
            else:
                if current_span:
                    treatment_spans.append(current_span)
                current_span = None

        if current_span:
            treatment_spans.append(current_span)

        # Attach the surface text of each detected span
        for span in treatment_spans:
            span["text"] = text[span["start"]:span["end"]]

        return treatment_spans


# Usage
detector = TreatmentDetector("Lekhansh/bioclinical-treatment-detector")
clinical_text = "Plan: continue methadone 60mg daily; review in 2 weeks."  # example note
treatments = detector.detect_treatments(clinical_text)
```
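
To process many notes, the detector can simply be applied in a loop; a minimal sketch (the notes below are hypothetical):

```python
notes = [
    "Plan: start naltrexone 50mg daily.",
    "Patient declined further follow-up.",
]
for i, note in enumerate(notes):
    spans = detector.detect_treatments(note)
    print(f"note {i}: {len(spans)} treatment span(s) detected")
```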

## Training Details

### Training Data
- **Source**: Clinical notes from a single addiction medicine center
- **Annotation**: Manual annotation of treatment-related text spans
- **Size**: Balanced dataset with both positive and negative examples
- **Preprocessing**: Text segmentation with sliding windows for long documents (sketched below)
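
The card does not specify the exact windowing parameters; a typical sliding-window setup with the `transformers` tokenizer looks like the following, where the window size and stride are assumptions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("thomas-sounack/BioClinical-ModernBERT-large")
long_note = "... " * 2000  # stand-in for a long clinical note

enc = tokenizer(
    long_note,
    truncation=True,
    max_length=512,                  # assumed window size
    stride=128,                      # assumed overlap between consecutive windows
    return_overflowing_tokens=True,  # emit one encoding per window
    return_offsets_mapping=True,
)
print(len(enc["input_ids"]), "windows")
```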

### Training Configuration
The following hyperparameters were used (mapped to a `TrainingArguments` sketch below):
- **Base Model**: thomas-sounack/BioClinical-ModernBERT-large
- **Training Epochs**: 3
- **Batch Size**: 8 (with gradient accumulation)
- **Learning Rate**: 5e-5
- **Optimizer**: AdamW with weight decay
- **Hardware**: Single GPU training
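
For reference, these hyperparameters map onto `transformers` `TrainingArguments` roughly as follows; the output directory, accumulation steps, and weight-decay value are not given on this card and are assumptions:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bioclinical-treatment-detector",  # assumed
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,  # assumed; card only says "with gradient accumulation"
    learning_rate=5e-5,
    weight_decay=0.01,              # assumed; card only says "with weight decay"
)
```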

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{bioclinical-treatment-detector,
  title={Addiction Medicine Treatment Information Detector for Clinical AI},
  author={Lekhansh S and Prakrithi SN},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Lekhansh/bioclinical-treatment-detector}}
}
```

## Contact

For questions about this model or its applications in privacy-preserving clinical AI, please contact drlekhansh@gmail.com.

## License

This model is released under the Apache 2.0 License. Please ensure compliance with all applicable healthcare privacy regulations when using this model.