Ric Claude committed on
Commit e84ccb9 · 1 Parent(s): 47ec478

Remove misleading claims about training and results

- Changed language from definitive to exploratory ("research exploration")
- Clarified that LoRA code exists but was never successfully trained
- Removed training details section (no training was completed)
- Updated Quick Start to focus on working data generation pipeline
- Changed "Features" to "What's Included" to be more accurate
- Made it clear this is infrastructure/implementation, not working model

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Files changed (1)
  1. README.md +32 -41
README.md CHANGED
@@ -6,15 +6,17 @@

  ## Overview

- **Justitia** is a novel approach to privacy-preserving OCR that implements **vision-level token masking** for PHI (Protected Health Information) compliance. Unlike traditional text-based redaction methods that operate after OCR extraction, Justitia detects and masks sensitive information at the vision token stage, preventing PHI from ever being processed by the language model decoder.

- ### Key Innovation

- The core innovation is **selective vision token masking** - identifying and masking PHI tokens before they reach the text generation decoder. This approach provides:

- - **Stronger Privacy Guarantees**: PHI never enters the text generation pipeline
- - **Better Utility Preservation**: Non-PHI medical context remains intact for downstream processing
- - **Efficient Implementation**: Uses LoRA adapters for lightweight PHI detection without modifying the base model

  ## Architecture

@@ -40,14 +42,14 @@ Input PDF → Vision Encoder → PHI Detection (LoRA) → Token Masking → Deep
  - **Selective Attention Masking**: ToSA-inspired attention mechanism
  - **Hybrid Approach**: Combines both for optimal privacy-utility tradeoff

- ## Features

- - **Vision-Level PHI Detection**: Identifies sensitive information before text extraction
- - **HIPAA Safe Harbor Compliant**: Covers all 18 HIPAA PHI identifiers
- - **Dual Masking Strategies**: Token replacement and selective attention mechanisms
- - **Synthetic Data Pipeline**: Uses Synthea for generating realistic medical PDFs with PHI annotations
- - **Efficient Fine-tuning**: LoRA adapters for parameter-efficient training
- - **Evaluation Framework**: Comprehensive metrics for privacy and utility assessment

  ## Quick Start

@@ -78,6 +80,8 @@ python scripts/download_model.py

  ### Generate Synthetic Medical Data

  ```bash
  # Setup Synthea (synthetic patient generator)
  bash scripts/setup_synthea.sh
@@ -89,25 +93,21 @@ bash scripts/generate_synthea_data.sh
  python scripts/generate_clinical_notes.py --output-dir data/pdfs --num-documents 1000
  ```

- ### Train PHI Detection LoRA

  ```bash
- python src/training/lora_phi_detector.py \
- --config config/training_config.yaml \
- --data-dir data/pdfs \
- --output-dir models/lora_adapters
- ```

- ### Run Inference

- ```bash
- python src/inference/process_documents.py \
- --input path/to/medical.pdf \
- --output path/to/redacted_output.txt \
- --masking-strategy selective_attention \
- --lora-checkpoint models/lora_adapters/best_model
  ```

  ## Project Structure

  ```
@@ -223,31 +223,22 @@ target_modules: # Attention layers to target
  task_type: PHI_DETECTION
  ```

- ### Training Details
-
- - **Dataset**: 10,000+ synthetic medical PDFs from Synthea
- - **Batch Size**: 8 (with gradient accumulation: 4)
- - **Learning Rate**: 2e-4 with cosine annealing
- - **Epochs**: 10
- - **Hardware**: Single A100 40GB GPU
- - **Training Time**: ~8 hours
-
  ## Project Status

- **Current State**: Early research prototype with synthetic data generation pipeline completed. Initial LoRA training experiments showed limitations in the approach.

  ### Completed
  - [x] Project structure and configuration
  - [x] Synthea integration for synthetic patient data
  - [x] PDF generation pipeline with PHI annotations
  - [x] PHI annotation and preprocessing tools
- - [x] Initial LoRA adapter implementation
- - [x] Basic training pipeline (results were suboptimal)

  ### Known Limitations
- - Initial training approach did not achieve target performance
- - Vision token masking effectiveness needs further research
- - Alternative architectures may be required

  ### Future Directions
  - Explore alternative masking strategies
 

  ## Overview

+ **Justitia** is a research exploration of **vision-level token masking** for PHI (Protected Health Information) compliance in OCR systems. Unlike traditional text-based redaction methods that operate after OCR extraction, this approach investigates detecting and masking sensitive information at the vision token stage, attempting to prevent PHI from ever being processed by the language model decoder.

+ ### Research Approach

+ The core idea is **selective vision token masking** - identifying and masking PHI tokens before they reach the text generation decoder. This approach theoretically could provide:

+ - **Stronger Privacy Guarantees**: PHI would never enter the text generation pipeline
+ - **Better Utility Preservation**: Non-PHI medical context could remain intact for downstream processing
+ - **Efficient Implementation**: LoRA adapters for lightweight PHI detection without modifying the base model
+
+ **Note**: This repository contains the implementation and data generation infrastructure. No actual model training was successfully completed. See [Project Status](#project-status) for details.
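
To make the token-replacement idea in the overview concrete, here is a minimal sketch in plain Python. It is not taken from the repository. It assumes a vision encoder that splits the page into square patches in row-major order, PHI regions given as pixel bounding boxes that lie inside the image, and a fixed mask embedding; the names `bbox_to_token_indices` and `mask_vision_tokens` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

# Assumption: (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
BBox = Tuple[int, int, int, int]


def bbox_to_token_indices(bbox: BBox, image_width: int, patch_size: int) -> Set[int]:
    """Return the row-major indices of every patch that overlaps the box."""
    x0, y0, x1, y1 = bbox
    patches_per_row = image_width // patch_size
    first_col, last_col = x0 // patch_size, (x1 - 1) // patch_size
    first_row, last_row = y0 // patch_size, (y1 - 1) // patch_size
    return {
        row * patches_per_row + col
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    }


def mask_vision_tokens(
    tokens: Sequence[Sequence[float]],
    phi_indices: Set[int],
    mask_embedding: Sequence[float],
) -> List[List[float]]:
    """Replace the embedding of every PHI token with a fixed mask embedding."""
    return [
        list(mask_embedding) if index in phi_indices else list(token)
        for index, token in enumerate(tokens)
    ]


if __name__ == "__main__":
    # Hypothetical 64x64 page with 16x16 patches, i.e. a 4x4 grid of 16 tokens.
    tokens = [[float(i), float(i)] for i in range(16)]
    phi = bbox_to_token_indices((20, 0, 40, 16), image_width=64, patch_size=16)
    masked = mask_vision_tokens(tokens, phi, mask_embedding=[0.0, 0.0])
    print(sorted(phi), masked[:4])
```

Because masking happens on the token list before any decoding step, the decoder in this sketch would only ever see the mask embedding at PHI positions. Whether that holds for a real encoder, where neighbouring tokens can carry information about masked patches, is exactly the open question the README describes.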

  ## Architecture

 
  - **Selective Attention Masking**: ToSA-inspired attention mechanism
  - **Hybrid Approach**: Combines both for optimal privacy-utility tradeoff
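
As an illustration of what attention-level masking can look like, the sketch below applies a generic additive mask: scores for PHI key positions are set to negative infinity before the softmax, so those positions receive zero attention weight. This is a common masking pattern and an assumption on my part; it is not necessarily the ToSA mechanism or the repository's implementation, and `masked_attention_weights` is a hypothetical name.

```python
import math
from typing import List, Set


def masked_attention_weights(
    scores: List[List[float]], phi_keys: Set[int]
) -> List[List[float]]:
    """Row-wise softmax over attention scores, with PHI key columns masked out."""
    weights: List[List[float]] = []
    for row in scores:
        # Masked keys get -inf so that exp() maps them to exactly 0.0.
        masked = [-math.inf if j in phi_keys else s for j, s in enumerate(row)]
        row_max = max(masked)
        if row_max == -math.inf:
            # Every key is masked: return all zeros rather than dividing by zero.
            weights.append([0.0] * len(row))
            continue
        # Subtract the row maximum for numerical stability.
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


if __name__ == "__main__":
    scores = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
    print(masked_attention_weights(scores, phi_keys={1}))
```

Unlike token replacement, this leaves the token embeddings untouched and only stops other positions from attending to them, which is one reason a hybrid of the two strategies might be worth exploring.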

+ ## What's Included

+ - **Synthetic Data Pipeline**: Fully functional pipeline using Synthea for generating realistic medical PDFs with PHI annotations
+ - **LoRA Architecture Code**: Implementation of LoRA adapters for vision token PHI detection (not trained)
+ - **PHI Annotation Tools**: Preprocessing pipeline for marking PHI in synthetic documents
+ - **Multiple Masking Strategies**: Code for token replacement and selective attention mechanisms (experimental)
+ - **HIPAA Safe Harbor Coverage**: Designed to handle all 18 HIPAA PHI identifier categories
+ - **Configuration Files**: Model and training configurations for DeepSeek-OCR integration
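
For reference, the 18 identifier categories of the HIPAA Safe Harbor method (45 CFR 164.514(b)(2)) can be written down as an enumeration. The member names below are hypothetical labels chosen for this sketch; the repository's annotation tools may use different names or a different grouping.

```python
from enum import Enum


class PHICategory(Enum):
    """The 18 HIPAA Safe Harbor identifier categories (labels are illustrative)."""

    NAME = "Names"
    GEOGRAPHIC = "Geographic subdivisions smaller than a state"
    DATE = "Dates (except year) directly related to an individual"
    PHONE = "Telephone numbers"
    FAX = "Fax numbers"
    EMAIL = "Email addresses"
    SSN = "Social Security numbers"
    MEDICAL_RECORD = "Medical record numbers"
    HEALTH_PLAN = "Health plan beneficiary numbers"
    ACCOUNT = "Account numbers"
    LICENSE = "Certificate/license numbers"
    VEHICLE = "Vehicle identifiers and serial numbers, including license plates"
    DEVICE = "Device identifiers and serial numbers"
    URL = "Web URLs"
    IP_ADDRESS = "IP addresses"
    BIOMETRIC = "Biometric identifiers, including finger and voice prints"
    PHOTO = "Full-face photographs and comparable images"
    OTHER_UNIQUE_ID = "Any other unique identifying number, characteristic, or code"


if __name__ == "__main__":
    for category in PHICategory:
        print(f"{category.name}: {category.value}")
```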

  ## Quick Start

 

  ### Generate Synthetic Medical Data

+ The primary working component is the data generation pipeline:
+
  ```bash
  # Setup Synthea (synthetic patient generator)
  bash scripts/setup_synthea.sh
 
  python scripts/generate_clinical_notes.py --output-dir data/pdfs --num-documents 1000
  ```

+ ### Explore the Code

  ```bash
+ # View the LoRA adapter implementation
+ cat src/training/lora_phi_detector.py

+ # Check the PHI annotation tools
+ cat src/preprocessing/phi_annotator.py

+ # Review configuration files
+ cat config/training_config.yaml
  ```

+ **Note**: The training and inference pipelines are not functional. The code is provided for reference and future development.
+
  ## Project Structure

  ```
 
  task_type: PHI_DETECTION
  ```
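
The configuration above targets attention layers with LoRA adapters. For readers unfamiliar with the technique, the sketch below shows the standard LoRA forward pass, `y = W x + (alpha / r) * B (A x)`, in plain Python. It illustrates the general method only; the rank, scaling factor, and function names here are assumptions and do not reflect values from `config/training_config.yaml`.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float) -> Vector:
    """Frozen base projection plus a scaled low-rank update.

    w: frozen base weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    rank = len(a)
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


if __name__ == "__main__":
    w = [[1.0, 2.0], [3.0, 4.0]]
    a = [[1.0, 0.0]]        # rank r = 1
    b_zero = [[0.0], [0.0]]  # B starts at zero, so the adapter is a no-op
    print(lora_forward(w, a, b_zero, [1.0, 1.0], alpha=2.0))
```

With `B` initialised to zero the adapted layer reproduces the base model exactly, which is why LoRA can be attached without modifying the base weights. Only `A` and `B` would be updated during training.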

  ## Project Status

+ **Current State**: Research prototype with functional data generation infrastructure. The LoRA-based vision token masking approach was implemented but not successfully trained.

  ### Completed
  - [x] Project structure and configuration
  - [x] Synthea integration for synthetic patient data
  - [x] PDF generation pipeline with PHI annotations
  - [x] PHI annotation and preprocessing tools
+ - [x] LoRA adapter architecture implementation (code only, not trained)

  ### Known Limitations
+ - No successful model training was completed
+ - The vision-level masking approach proved more challenging than anticipated
+ - Infrastructure and data generation are functional, but the core ML approach needs rethinking
+ - Alternative architectures or hybrid approaches may be required

  ### Future Directions
  - Explore alternative masking strategies