Ric, Claude committed
Commit e84ccb9 · 1 Parent(s): 47ec478
Remove misleading claims about training and results
- Changed language from definitive to exploratory ("research exploration")
- Clarified that LoRA code exists but was never successfully trained
- Removed training details section (no training was completed)
- Updated Quick Start to focus on working data generation pipeline
- Changed "Features" to "What's Included" to be more accurate
- Made it clear this is infrastructure/implementation, not working model
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
README.md
CHANGED
@@ -6,15 +6,17 @@
 
 ## Overview
 
-**Justitia** is a
+**Justitia** is a research exploration of **vision-level token masking** for PHI (Protected Health Information) compliance in OCR systems. Unlike traditional text-based redaction methods that operate after OCR extraction, this approach investigates detecting and masking sensitive information at the vision token stage, attempting to prevent PHI from ever being processed by the language model decoder.
 
-###
+### Research Approach
 
-The core
+The core idea is **selective vision token masking** - identifying and masking PHI tokens before they reach the text generation decoder. This approach theoretically could provide:
 
-- **Stronger Privacy Guarantees**: PHI never
-- **Better Utility Preservation**: Non-PHI medical context
-- **Efficient Implementation**:
+- **Stronger Privacy Guarantees**: PHI would never enter the text generation pipeline
+- **Better Utility Preservation**: Non-PHI medical context could remain intact for downstream processing
+- **Efficient Implementation**: LoRA adapters for lightweight PHI detection without modifying the base model
+
+**Note**: This repository contains the implementation and data generation infrastructure. No actual model training was successfully completed. See [Project Status](#project-status) for details.
 
 ## Architecture
 
@@ -40,14 +42,14 @@ Input PDF → Vision Encoder → PHI Detection (LoRA) → Token Masking → Deep
 - **Selective Attention Masking**: ToSA-inspired attention mechanism
 - **Hybrid Approach**: Combines both for optimal privacy-utility tradeoff
 
-##
+## What's Included
 
-- **
-- **
-- **
-- **
-- **
-- **
+- **Synthetic Data Pipeline**: Fully functional pipeline using Synthea for generating realistic medical PDFs with PHI annotations
+- **LoRA Architecture Code**: Implementation of LoRA adapters for vision token PHI detection (not trained)
+- **PHI Annotation Tools**: Preprocessing pipeline for marking PHI in synthetic documents
+- **Multiple Masking Strategies**: Code for token replacement and selective attention mechanisms (experimental)
+- **HIPAA Safe Harbor Coverage**: Designed to handle all 18 HIPAA PHI identifier categories
+- **Configuration Files**: Model and training configurations for DeepSeek-OCR integration
 
 ## Quick Start
 
@@ -78,6 +80,8 @@ python scripts/download_model.py
 
 ### Generate Synthetic Medical Data
 
+The primary working component is the data generation pipeline:
+
 ```bash
 # Setup Synthea (synthetic patient generator)
 bash scripts/setup_synthea.sh
@@ -89,25 +93,21 @@ bash scripts/generate_synthea_data.sh
 python scripts/generate_clinical_notes.py --output-dir data/pdfs --num-documents 1000
 ```
 
-###
+### Explore the Code
 
 ```bash
-
-
---data-dir data/pdfs \
---output-dir models/lora_adapters
-```
+# View the LoRA adapter implementation
+cat src/training/lora_phi_detector.py
 
-
+# Check the PHI annotation tools
+cat src/preprocessing/phi_annotator.py
 
-
-
---input path/to/medical.pdf \
---output path/to/redacted_output.txt \
---masking-strategy selective_attention \
---lora-checkpoint models/lora_adapters/best_model
+# Review configuration files
+cat config/training_config.yaml
 ```
 
+**Note**: The training and inference pipelines are not functional. The code is provided for reference and future development.
+
 ## Project Structure
 
 ```
@@ -223,31 +223,22 @@ target_modules: # Attention layers to target
 task_type: PHI_DETECTION
 ```
 
-### Training Details
-
-- **Dataset**: 10,000+ synthetic medical PDFs from Synthea
-- **Batch Size**: 8 (with gradient accumulation: 4)
-- **Learning Rate**: 2e-4 with cosine annealing
-- **Epochs**: 10
-- **Hardware**: Single A100 40GB GPU
-- **Training Time**: ~8 hours
-
 ## Project Status
 
-**Current State**:
+**Current State**: Research prototype with functional data generation infrastructure. The LoRA-based vision token masking approach was implemented but not successfully trained.
 
 ### Completed
 - [x] Project structure and configuration
 - [x] Synthea integration for synthetic patient data
 - [x] PDF generation pipeline with PHI annotations
 - [x] PHI annotation and preprocessing tools
-- [x]
-- [x] Basic training pipeline (results were suboptimal)
+- [x] LoRA adapter architecture implementation (code only, not trained)
 
 ### Known Limitations
--
--
--
+- No successful model training was completed
+- The vision-level masking approach proved more challenging than anticipated
+- Infrastructure and data generation are functional, but the core ML approach needs rethinking
+- Alternative architectures or hybrid approaches may be required
 
 ### Future Directions
 - Explore alternative masking strategies
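
The README text in this diff describes selective vision token masking only in prose. As a minimal sketch of the token-replacement variant, assuming vision tokens are plain float vectors and that some detector has already produced one boolean PHI flag per token, the idea could look like the following. The names `mask_phi_tokens`, `phi_flags`, and `mask_value` are hypothetical and are not taken from the repository, and the toy data stands in for real vision encoder output.

```python
from typing import List, Sequence

Vector = List[float]


def mask_phi_tokens(
    tokens: Sequence[Vector],
    phi_flags: Sequence[bool],
    mask_value: float = 0.0,
) -> List[Vector]:
    """Return a copy of `tokens` with each flagged token replaced by a constant vector.

    Hypothetical helper: the flags are assumed to come from a separate
    PHI detector (for example a LoRA-adapted head), which is not modeled here.
    """
    if len(tokens) != len(phi_flags):
        raise ValueError("tokens and phi_flags must have the same length")

    masked: List[Vector] = []
    for vec, is_phi in zip(tokens, phi_flags):
        if is_phi:
            # Replace the whole embedding so no PHI signal reaches the decoder.
            masked.append([mask_value] * len(vec))
        else:
            # Copy non-PHI tokens unchanged to preserve medical context.
            masked.append(list(vec))
    return masked


# Toy input: four 3-dimensional "vision tokens"; tokens 1 and 2 are flagged as PHI.
vision_tokens = [
    [0.2, 0.5, 0.1],
    [0.9, 0.3, 0.7],
    [0.4, 0.8, 0.6],
    [0.1, 0.1, 0.2],
]
phi_flags = [False, True, True, False]

masked_tokens = mask_phi_tokens(vision_tokens, phi_flags)
```

Replacing flagged embeddings with a constant vector is only one possible strategy; the README also mentions selective attention masking, which this sketch does not attempt to model.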