# Vision Token Masking Cannot Prevent PHI Leakage: A Negative Result
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![Paper](https://img.shields.io/badge/paper-under%20review-red.svg)](https://deepneuro.ai/richard)
> **Author**: Richard J. Young
> **Affiliation**: Founding AI Scientist, [DeepNeuro.AI](https://deepneuro.ai/richard) | University of Nevada, Las Vegas, Department of Neuroscience
> **Links**: [HuggingFace](https://huggingface.co/richardyoung) | [DeepNeuro.AI](https://deepneuro.ai/richard)
## Overview
This repository contains the systematic evaluation code and data generation pipeline for our paper **"Vision Token Masking Alone Cannot Prevent PHI Leakage in Medical Document OCR"**.
**Key Finding**: Vision-level token masking in VLMs achieves only **42.9% overall PHI reduction**, regardless of masking strategy. It fully suppresses long-form identifiers (names, addresses; 100% reduction) but fails completely on short structured identifiers (SSNs, medical record numbers; 0% reduction).
### The Negative Result
We evaluated **seven masking strategies** (V3-V9) across different architectural layers of DeepSeek-OCR using 100 synthetic medical billing statements (from a corpus of 38,517 annotated documents):
- **What Worked**: 100% reduction of patient names, dates of birth, physical addresses (spatially-distributed long-form PHI)
- **What Failed**: 0% reduction of SSN, medical record numbers, email addresses, account numbers (short structured identifiers)
- **The Ceiling**: All strategies converged to 42.9% total PHI reduction
- **Why**: Language model contextual inference, rather than insufficient visual masking, drives structured identifier leakage
This establishes **fundamental boundaries** for vision-only privacy interventions in VLMs and redirects future research toward decoder-level fine-tuning and hybrid defense-in-depth architectures.
## Architecture
```
Input PDF → Vision Encoder → PHI Detection → Vision Token Masking → DeepSeek Decoder → Text Output
            (SAM + CLIP)     (Ground Truth)  (V3-V9 Strategies)     (3B-MoE)
```
### Experimental Approach
1. **Base Model**: DeepSeek-OCR
   - Vision Encoder: SAM-base + CLIP-large blocks
   - Text Decoder: DeepSeek-3B-MoE
   - Processes 1024×1024 images to 256 vision tokens
2. **PHI Detection**: Ground-truth annotations from synthetic data generation
   - Perfect bounding box annotations for all 18 HIPAA PHI categories
   - No learned detection model - direct annotation-based masking
3. **Seven Masking Strategies (V3-V9)**:
   - **V3-V5**: SAM encoder blocks at different depths
   - **V6**: Compression layer (4096→1024 tokens)
   - **V7**: Dual vision encoders (SAM + CLIP)
   - **V8**: Post-compression stage
   - **V9**: Projector fusion layer
## Research Contributions
### What This Work Provides
1. **First Systematic Evaluation** of vision-level token masking for PHI protection in VLMs
2. **Negative Result**: Establishes that vision masking alone is insufficient for HIPAA compliance
3. **Boundary Conditions**: Identifies which PHI types are amenable to vision-level vs language-level redaction
4. **38,517 Annotated Documents**: Massive synthetic medical document corpus with ground-truth PHI annotations
5. **Seven Masking Strategies**: V3-V9 targeting SAM encoders, compression layers, dual vision encoders, and projector fusion
6. **Ablation Studies**: Mask expansion radius variations (r=1,2,3) demonstrating spatial coverage limitations
7. **Hybrid Architecture Simulation**: Shows 88.6% reduction when combining vision masking with NLP post-processing
### What's in This Repository
- **Synthetic Data Pipeline**: Fully functional Synthea-based pipeline generating 38,517+ annotated medical PDFs
- **PHI Annotation Tools**: Ground-truth annotation pipeline for all 18 HIPAA identifier categories
- **Evaluation Framework**: Code for measuring PHI reduction across masking strategies
- **Configuration Files**: DeepSeek-OCR integration and experimental parameters
## Quick Start
### Prerequisites
- Python 3.12+
- CUDA 11.8+ compatible GPU (8GB+ VRAM recommended)
- Java JDK 11+ (for Synthea data generation)
- 50GB+ free disk space
### Installation
```bash
# Clone repository
git clone https://github.com/ricyoung/Justitia-Selective_Vision_Token_Masking_for_PHI-Compliant_OCR.git
cd Justitia-Selective_Vision_Token_Masking_for_PHI-Compliant_OCR
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Download DeepSeek-OCR model
python scripts/download_model.py
```
### Generate Synthetic Medical Data
The primary working component is the data generation pipeline:
```bash
# Setup Synthea (synthetic patient generator)
bash scripts/setup_synthea.sh
# Generate synthetic patient data
bash scripts/generate_synthea_data.sh
# Convert to annotated PDFs
python scripts/generate_clinical_notes.py --output-dir data/pdfs --num-documents 1000
```
### Explore the Code
```bash
# View the LoRA adapter implementation
cat src/training/lora_phi_detector.py
# Check the PHI annotation tools
cat src/preprocessing/phi_annotator.py
# Review configuration files
cat config/training_config.yaml
```
**Note**: The training and inference pipelines are not functional. The code is provided for reference and future development.
## Project Structure
```
.
├── config/ # Configuration files
│ ├── model_config.yaml # Model architecture and hyperparameters
│ └── training_config.yaml # Training settings
├── data/ # Data directory (generated, not in repo)
│ ├── synthetic/ # Synthea synthetic patient data
│ ├── pdfs/ # Generated medical PDFs with PHI
│ └── annotations/ # PHI bounding box annotations
├── models/ # Model directory (not in repo)
│ ├── deepseek_ocr/ # Base DeepSeek-OCR model
│ ├── lora_adapters/ # Trained LoRA adapters
│ └── checkpoints/ # Training checkpoints
├── scripts/ # Utility scripts
│ ├── download_model.py # Download DeepSeek-OCR
│ ├── setup_synthea.sh # Install Synthea
│ ├── generate_synthea_data.sh # Generate patient data
│ ├── generate_clinical_notes.py # Create medical PDFs
│ ├── generate_realistic_pdfs.py # Realistic PDF generation
│ ├── generate_additional_documents.py # Additional document types
│ └── generate_final_document_types.py # Final document generation
├── src/ # Source code
│ ├── data_generation/ # Synthea integration and PDF generation
│ │ ├── synthea_to_pdf.py
│ │ └── medical_templates.py
│ ├── preprocessing/ # PHI annotation pipeline
│ │ └── phi_annotator.py
│ ├── training/ # LoRA training implementation
│ │ └── lora_phi_detector.py
│ ├── inference/ # OCR with PHI masking (placeholder)
│ └── utils/ # Evaluation and metrics (placeholder)
├── tests/ # Unit tests
├── notebooks/ # Jupyter notebooks for experiments
├── .gitignore # Git ignore file
├── requirements.txt # Python dependencies
├── setup.py # Package setup
├── SETUP.md # Detailed setup instructions
├── README.md # This file
└── LICENSE # MIT License
```
## PHI Categories Detected
Following the HIPAA Safe Harbor guidelines, the Justitia annotation pipeline labels the 18 identifier types, grouped here as:
| Category | Examples |
|----------|----------|
| **Names** | Patients, physicians, family members, guarantors |
| **Dates** | Birth dates, admission/discharge, death dates, appointments |
| **Geographic** | Street addresses, cities, counties, zip codes, facility names |
| **Contact** | Phone numbers, fax numbers, email addresses |
| **Medical IDs** | Medical record numbers, account numbers, health plan IDs |
| **Personal IDs** | SSN, driver's license, vehicle IDs, device identifiers |
| **Biometric** | Photos, fingerprints, voiceprints |
| **Web & Network** | URLs, IP addresses, certificate numbers |
## Masking Strategies
### 1. Token Replacement
- Replaces PHI vision tokens with learned privacy-preserving embeddings
- Fast inference, low memory overhead
- Good utility preservation for non-PHI content
### 2. Selective Attention Masking
- Applies attention masking to prevent PHI token information flow
- Based on ToSA (Token-level Selective Attention) approach
- Stronger privacy guarantees, moderate computational cost
### 3. Hybrid Approach
- Combines token replacement with selective attention
- Aims for the best privacy-utility tradeoff of the three approaches
- Not validated for production use (see the Disclaimer below)
## Evaluation Metrics
| Metric | Description | Target |
|--------|-------------|--------|
| **PHI Removal Rate** | % of PHI successfully masked | >99% |
| **OCR Accuracy Retention** | Character accuracy on non-PHI text | >95% |
| **False Positive Rate** | Non-PHI incorrectly masked | <5% |
| **Processing Speed** | Seconds per page | <2s |
| **F1 Score** | Harmonic mean of precision/recall | >0.90 |
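The first and last metrics in the table can be sketched as follows. This is a minimal illustration, not the repository's evaluation framework: it assumes leakage is detected by case-insensitive verbatim string matching, and the function names `leaked`, `phi_reduction`, and `f1_score` are hypothetical.

```python
def leaked(phi_values, ocr_text):
    """Return the PHI strings that appear verbatim (case-insensitive) in OCR output."""
    text = ocr_text.lower()
    return [value for value in phi_values if value.lower() in text]


def phi_reduction(phi_values, baseline_text, masked_text):
    """Fraction of PHI leaked by the unmasked baseline that masking removed."""
    before = set(leaked(phi_values, baseline_text))
    if not before:
        return 0.0  # nothing leaked in the baseline, so nothing to reduce
    after = set(leaked(phi_values, masked_text))
    return len(before - after) / len(before)


def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, written in count form."""
    denominator = 2 * true_positives + false_positives + false_negatives
    return 2 * true_positives / denominator if denominator else 0.0


# Made-up example: the name is suppressed, the SSN still leaks.
phi = ["Jane Doe", "123-45-6789"]
baseline = "Patient: Jane Doe  SSN: 123-45-6789"
masked = "Patient:   SSN: 123-45-6789"
print(phi_reduction(phi, baseline, masked))  # 0.5
```

Verbatim matching will miss partially reconstructed identifiers, so a stricter evaluation might compare normalized digit sequences or use edit distance instead.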
## Technical Details
### Vision Token Processing
DeepSeek-OCR compresses a 1024×1024 image through multiple stages:
1. **SAM-base block**: Windowed attention for local detail (4096 tokens)
2. **CLIP-large block**: Global attention for layout understanding (1024 tokens)
3. **Convolution layer**: 16x token reduction to 256 tokens
4. **Projector fusion**: Maps vision tokens to language model space
Each of the 256 final vision tokens covers roughly a 64×64 pixel region (a 16×16 grid over the 1024×1024 input) and carries both semantic and spatial information.
### Masking Implementation
Vision tokens corresponding to PHI bounding boxes are zeroed at different architectural layers (V3-V9). Ablation studies tested mask expansion radius r=1,2,3 to determine if spatial coverage affects reduction rates.
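The mapping from a PHI bounding box to masked tokens can be sketched as below. This is an illustration under stated assumptions rather than the repository's implementation: it assumes a 16×16 row-major token grid over the 1024×1024 input, pixel boxes given as `(x0, y0, x1, y1)` with exclusive upper bounds, and an expansion radius measured in token cells. The names `bbox_to_token_indices` and `zero_masked_tokens` are hypothetical.

```python
def bbox_to_token_indices(bbox, radius=0, image_size=1024, grid=16):
    """Return the set of token indices whose cells overlap a pixel bounding box.

    bbox is (x0, y0, x1, y1) in pixels with x1/y1 exclusive (assumed format).
    radius widens the masked region by that many token cells on every side.
    """
    cell = image_size // grid  # 64 px per token cell under these assumptions
    x0, y0, x1, y1 = bbox
    col0 = max(x0 // cell - radius, 0)
    row0 = max(y0 // cell - radius, 0)
    col1 = min((x1 - 1) // cell + radius, grid - 1)
    row1 = min((y1 - 1) // cell + radius, grid - 1)
    return {
        row * grid + col  # row-major token ordering (assumption)
        for row in range(row0, row1 + 1)
        for col in range(col0, col1 + 1)
    }


def zero_masked_tokens(tokens, indices):
    """Return a copy of the token embeddings with the masked ones zeroed."""
    return [
        [0.0] * len(vec) if i in indices else list(vec)
        for i, vec in enumerate(tokens)
    ]


# A box spanning x = 100..299, y = 200..229 touches four cells in one row.
print(sorted(bbox_to_token_indices((100, 200, 300, 230))))        # [49, 50, 51, 52]
print(len(bbox_to_token_indices((100, 200, 300, 230), radius=1)))  # 18
```

Masking at earlier layers (for example the 4096-token SAM stage) would use the same idea with a finer grid, passed here through the `grid` parameter.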
## Experimental Results
### Main Findings
| Masking Strategy | Layer Target | PHI Reduction | Names | DOB | SSN | MRN | Addresses |
|-----------------|--------------|---------------|-------|-----|-----|-----|-----------|
| **V3-V9 (all)** | Various | **42.9%** | 100% | 100% | 0% | 0% | 100% |
| Baseline | None | 0% | 0% | 0% | 0% | 0% | 0% |
| Hybrid (sim) | Vision + NLP | **88.6%** | 100% | 100% | 80% | 80% | 100% |
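The overall figures in the table are consistent with an unweighted average over the seven PHI types named in this README (three suppressed, four leaked). The check below is a worked reading of the numbers, not the paper's stated method: the README does not say how totals are aggregated, and the 80% rate for email addresses and account numbers in the hybrid row is an assumption extrapolated from the SSN and MRN columns.

```python
# Per-type reduction rates for vision-only masking, as reported in this README.
vision_only = {
    "names": 1.0, "dob": 1.0, "addresses": 1.0,
    "ssn": 0.0, "mrn": 0.0, "email": 0.0, "account": 0.0,
}

# Hybrid simulation: SSN and MRN are reported at 80%.
# Assumption: email and account numbers reach the same 80%.
hybrid = {**vision_only, "ssn": 0.8, "mrn": 0.8, "email": 0.8, "account": 0.8}


def mean_reduction(rates):
    """Unweighted mean of per-type reduction rates."""
    return sum(rates.values()) / len(rates)


print(f"{mean_reduction(vision_only):.1%}")  # 42.9%
print(f"{mean_reduction(hybrid):.1%}")       # 88.6%
```

If the paper weights by the number of PHI instances per document instead, the agreement here would be coincidental.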
### Key Insights
1. **Convergence**: All seven masking strategies (V3-V9) achieved identical 42.9% reduction regardless of architectural layer
2. **Spatial Invariance**: Mask expansion radius (r=1,2,3) did not improve reduction beyond this ceiling
3. **Type-Dependent Success**:
   - ✅ Long-form spatially-distributed PHI: 100% reduction
   - ❌ Short structured identifiers: 0% reduction
4. **Root Cause**: Language model contextual inference reconstructs masked structured identifiers from document context
### Implications for Privacy-Preserving VLMs
- Vision-only masking is **insufficient for HIPAA compliance**: a 42.9% reduction falls far short of the >99% PHI removal target used in this project
- Hybrid architectures combining vision masking with NLP post-processing are necessary
- Future work should focus on decoder-level fine-tuning or defense-in-depth approaches
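As an illustration of the NLP post-processing stage in a hybrid design, the sketch below redacts short structured identifiers from OCR output with regular expressions. The README does not describe how the simulated hybrid stage works, so regex matching is a stand-in chosen for this example. The `PATTERNS` table and `redact` function are hypothetical, and the MRN pattern is an invented format, since real formats vary by institution.

```python
import re

# Patterns for identifiers that vision masking failed to suppress.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    # Hypothetical MRN format: the label "MRN" followed by 6-10 digits.
    "MRN": re.compile(r"\bMRN[:\s#]*\d{6,10}\b", re.IGNORECASE),
}


def redact(text):
    """Replace every pattern match with a bracketed category label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "SSN: 123-45-6789, email jane@example.com, MRN: 00123456"
print(redact(sample))  # SSN: [SSN], email [EMAIL], [MRN]
```

Pattern-based redaction only catches identifiers with a predictable surface form, which is one reason a hybrid pipeline may still fall short of full suppression.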
## Paper
A paper describing this work has been submitted for peer review. The paper, experimental results, and additional materials are available in the `not_uploaded/` directory (not included in this public repository).
## Citation
If you use this work in your research, please cite:
```bibtex
@article{young2025visionmasking,
title={Vision Token Masking Alone Cannot Prevent PHI Leakage in Medical Document OCR: A Systematic Evaluation},
author={Young, Richard J.},
institution={DeepNeuro.AI; University of Nevada, Las Vegas},
journal={Under Review},
year={2025},
note={Code available at: https://github.com/ricyoung/Justitia-Selective_Vision_Token_Masking_for_PHI-Compliant_OCR}
}
```
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- **DeepSeek AI** for the DeepSeek-OCR model
- **MITRE Corporation** for Synthea synthetic patient generator
- **Hugging Face** for PEFT library and model hosting
- **Meta AI** for Segment Anything Model (SAM)
- **OpenAI** for CLIP vision encoder
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Disclaimer
**IMPORTANT**: This is a research project for academic purposes. It is **NOT** intended for production use with real patient PHI. Always consult with legal and compliance teams before deploying PHI-related systems in healthcare settings.
## Contact
**Richard J. Young**
- Founding AI Scientist, DeepNeuro.AI
- University of Nevada, Las Vegas, Department of Neuroscience
- Website: [deepneuro.ai/richard](https://deepneuro.ai/richard)
- HuggingFace: [@richardyoung](https://huggingface.co/richardyoung)
- GitHub: Open an issue on this repository