# Vision Token Masking Cannot Prevent PHI Leakage: A Negative Result

[MIT License](https://opensource.org/licenses/MIT) | [Python](https://www.python.org/downloads/) | [DeepNeuro.AI](https://deepneuro.ai/richard)

> **Author**: Richard J. Young
> **Affiliation**: Founding AI Scientist, [DeepNeuro.AI](https://deepneuro.ai/richard) | University of Nevada, Las Vegas, Department of Neuroscience
> **Links**: [HuggingFace](https://huggingface.co/richardyoung) | [DeepNeuro.AI](https://deepneuro.ai/richard)
## Overview

This repository contains the systematic evaluation code and data generation pipeline for our paper **"Vision Token Masking Alone Cannot Prevent PHI Leakage in Medical Document OCR"**.

**Key Finding**: Vision-level token masking in VLMs achieves only **42.9% PHI reduction** regardless of masking strategy. It suppresses long-form identifiers (names, addresses) with 100% effectiveness while failing entirely on short structured identifiers (SSNs, medical record numbers) at 0% effectiveness.

### The Negative Result

We evaluated **seven masking strategies** (V3-V9) across different architectural layers of DeepSeek-OCR, using 100 synthetic medical billing statements drawn from a corpus of 38,517 annotated documents:

- **What Worked**: 100% reduction of patient names, dates of birth, and physical addresses (spatially distributed, long-form PHI)
- **What Failed**: 0% reduction of SSNs, medical record numbers, email addresses, and account numbers (short structured identifiers)
- **The Ceiling**: All strategies converged to 42.9% total PHI reduction
- **Why**: Language model contextual inference, not insufficient visual masking, drives structured identifier leakage

This establishes **fundamental boundaries** for vision-only privacy interventions in VLMs and redirects future research toward decoder-level fine-tuning and hybrid defense-in-depth architectures.
## Architecture

```
Input PDF → Vision Encoder → PHI Detection → Vision Token Masking → DeepSeek Decoder → Text Output
            (SAM + CLIP)     (Ground Truth)  (V3-V9 Strategies)     (3B-MoE)
```
### Experimental Approach

1. **Base Model**: DeepSeek-OCR
   - Vision Encoder: SAM-base + CLIP-large blocks
   - Text Decoder: DeepSeek-3B-MoE
   - Processes 1024×1024 images into 256 vision tokens
2. **PHI Detection**: Ground-truth annotations from synthetic data generation
   - Perfect bounding-box annotations for all 18 HIPAA PHI categories
   - No learned detection model; masking is driven directly by the annotations
3. **Seven Masking Strategies (V3-V9)**, each intervening at a different architectural layer (a minimal sketch follows this list):
   - **V3-V5**: SAM encoder blocks at different depths
   - **V6**: Compression layer (4096→1024 tokens)
   - **V7**: Dual vision encoders (SAM + CLIP)
   - **V8**: Post-compression stage
   - **V9**: Projector fusion layer
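Conceptually, every strategy performs the same operation at a different depth: zero the activations at PHI token positions as they flow through the encoder. The following minimal PyTorch sketch (not the repository's code; module and variable names are illustrative) shows the forward-hook mechanism such an experiment could use:

```python
import torch
import torch.nn as nn

def make_phi_zeroing_hook(phi_token_indices: torch.Tensor):
    """Return a forward hook that zeroes activations at PHI token positions."""
    def hook(module, inputs, output):
        # Assumes the module emits (batch, num_tokens, hidden_dim).
        masked = output.clone()
        masked[:, phi_token_indices, :] = 0.0
        return masked  # the returned tensor replaces the module's output
    return hook

# Stand-in for one encoder block; V3-V9 would attach the same hook to SAM
# blocks at different depths, the compression layer, or the projector.
block = nn.Linear(64, 64)
phi_idx = torch.tensor([17, 18, 33, 34])      # hypothetical PHI token positions
handle = block.register_forward_hook(make_phi_zeroing_hook(phi_idx))

tokens = torch.randn(1, 256, 64)              # (batch, tokens, hidden)
out = block(tokens)
assert out[:, phi_idx, :].abs().sum() == 0.0  # PHI positions are zeroed
handle.remove()
```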
## Research Contributions

### What This Work Provides

1. **First Systematic Evaluation** of vision-level token masking for PHI protection in VLMs
2. **Negative Result**: Establishes that vision masking alone is insufficient for HIPAA compliance
3. **Boundary Conditions**: Identifies which PHI types are amenable to vision-level vs. language-level redaction
4. **38,517 Annotated Documents**: Large-scale synthetic medical document corpus with ground-truth PHI annotations
5. **Seven Masking Strategies**: V3-V9, targeting SAM encoder blocks, compression layers, dual vision encoders, and projector fusion
6. **Ablation Studies**: Mask expansion radius variations (r = 1, 2, 3) demonstrating spatial coverage limitations
7. **Hybrid Architecture Simulation**: Shows 88.6% reduction when combining vision masking with NLP post-processing
### What's in This Repository

- **Synthetic Data Pipeline**: Fully functional Synthea-based pipeline generating 38,517+ annotated medical PDFs
- **PHI Annotation Tools**: Ground-truth annotation pipeline for all 18 HIPAA identifier categories
- **Evaluation Framework**: Code for measuring PHI reduction across masking strategies
- **Configuration Files**: DeepSeek-OCR integration and experimental parameters
## Quick Start

### Prerequisites

- Python 3.12+
- CUDA 11.8+ compatible GPU (8GB+ VRAM recommended)
- Java JDK 11+ (for Synthea data generation)
- 50GB+ free disk space

### Installation

```bash
# Clone repository
git clone https://github.com/yourusername/Justitia-Selective_Vision_Token_Masking_for_PHI-Compliant_OCR.git
cd Justitia-Selective_Vision_Token_Masking_for_PHI-Compliant_OCR

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download DeepSeek-OCR model
python scripts/download_model.py
```
### Generate Synthetic Medical Data

The primary working component is the data generation pipeline:

```bash
# Setup Synthea (synthetic patient generator)
bash scripts/setup_synthea.sh

# Generate synthetic patient data
bash scripts/generate_synthea_data.sh

# Convert to annotated PDFs
python scripts/generate_clinical_notes.py --output-dir data/pdfs --num-documents 1000
```
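Each generated PDF is paired with ground-truth PHI annotations. As a rough illustration only (the field names below are hypothetical; the actual schema is defined by the pipeline code), an annotation record might be consumed like this:

```python
import json
from pathlib import Path

# Hypothetical annotation layout; see src/preprocessing/phi_annotator.py
# for the actual schema produced by the pipeline.
record = json.loads(Path("data/annotations/doc_00001.json").read_text())
for box in record["phi_boxes"]:
    # Each entry pairs a HIPAA category with a pixel-space bounding box.
    print(box["category"], box["x0"], box["y0"], box["x1"], box["y1"])
```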
### Explore the Code

```bash
# View the LoRA adapter implementation
cat src/training/lora_phi_detector.py

# Check the PHI annotation tools
cat src/preprocessing/phi_annotator.py

# Review configuration files
cat config/training_config.yaml
```

**Note**: The training and inference pipelines are not functional. The code is provided for reference and future development.
## Project Structure

```
.
├── config/                               # Configuration files
│   ├── model_config.yaml                 # Model architecture and hyperparameters
│   └── training_config.yaml              # Training settings
│
├── data/                                 # Data directory (generated, not in repo)
│   ├── synthetic/                        # Synthea synthetic patient data
│   ├── pdfs/                             # Generated medical PDFs with PHI
│   └── annotations/                      # PHI bounding box annotations
│
├── models/                               # Model directory (not in repo)
│   ├── deepseek_ocr/                     # Base DeepSeek-OCR model
│   ├── lora_adapters/                    # Trained LoRA adapters
│   └── checkpoints/                      # Training checkpoints
│
├── scripts/                              # Utility scripts
│   ├── download_model.py                 # Download DeepSeek-OCR
│   ├── setup_synthea.sh                  # Install Synthea
│   ├── generate_synthea_data.sh          # Generate patient data
│   ├── generate_clinical_notes.py        # Create medical PDFs
│   ├── generate_realistic_pdfs.py        # Realistic PDF generation
│   ├── generate_additional_documents.py  # Additional document types
│   └── generate_final_document_types.py  # Final document generation
│
├── src/                                  # Source code
│   ├── data_generation/                  # Synthea integration and PDF generation
│   │   ├── synthea_to_pdf.py
│   │   └── medical_templates.py
│   ├── preprocessing/                    # PHI annotation pipeline
│   │   └── phi_annotator.py
│   ├── training/                         # LoRA training implementation
│   │   └── lora_phi_detector.py
│   ├── inference/                        # OCR with PHI masking (placeholder)
│   └── utils/                            # Evaluation and metrics (placeholder)
│
├── tests/                                # Unit tests
├── notebooks/                            # Jupyter notebooks for experiments
│
├── .gitignore                            # Git ignore file
├── requirements.txt                      # Python dependencies
├── setup.py                              # Package setup
├── SETUP.md                              # Detailed setup instructions
├── README.md                             # This file
└── LICENSE                               # MIT License
```
## PHI Categories Detected

Following HIPAA Safe Harbor guidelines, Justitia detects and masks:

| Category | Examples |
|----------|----------|
| **Names** | Patients, physicians, family members, guarantors |
| **Dates** | Birth dates, admission/discharge, death dates, appointments |
| **Geographic** | Street addresses, cities, counties, zip codes, facility names |
| **Contact** | Phone numbers, fax numbers, email addresses |
| **Medical IDs** | Medical record numbers, account numbers, health plan IDs |
| **Personal IDs** | SSN, driver's license, vehicle IDs, device identifiers |
| **Biometric** | Photos, fingerprints, voiceprints |
| **Web & Network** | URLs, IP addresses, certificate numbers |
## Masking Strategies

### 1. Token Replacement

- Replaces PHI vision tokens with learned privacy-preserving embeddings
- Fast inference, low memory overhead
- Good utility preservation for non-PHI content

### 2. Selective Attention Masking

- Applies attention masking to prevent PHI token information flow
- Based on ToSA (Token-level Selective Attention) approach
- Stronger privacy guarantees, moderate computational cost

### 3. Hybrid Approach

- Combines token replacement with selective attention
- Optimal privacy-utility tradeoff
- Recommended for production use
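To make the distinction concrete, here is a toy PyTorch sketch of the first two mechanisms (illustrative only; shapes and names are assumptions, not the repository's implementation):

```python
import torch
import torch.nn.functional as F

batch, n_tokens, dim = 1, 256, 64
tokens = torch.randn(batch, n_tokens, dim)   # vision tokens
phi_idx = torch.tensor([10, 11, 12])         # hypothetical PHI positions

# 1. Token replacement: overwrite PHI tokens with a privacy embedding
# (a constant here; it would be learned during training).
privacy_embedding = torch.zeros(dim)
replaced = tokens.clone()
replaced[:, phi_idx, :] = privacy_embedding

# 2. Selective attention masking: forbid attention *to* PHI positions,
# so their content cannot flow into any other token.
attn_mask = torch.zeros(n_tokens, n_tokens)
attn_mask[:, phi_idx] = float("-inf")        # no token may attend to PHI
q = k = v = tokens
scores = q @ k.transpose(-2, -1) / dim**0.5 + attn_mask
out = F.softmax(scores, dim=-1) @ v          # PHI content excluded from mixing
```

The hybrid approach applies both: PHI tokens are replaced *and* excluded from attention, so neither their embeddings nor their positional context leaks downstream.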
## Evaluation Metrics

| Metric | Description | Target |
|--------|-------------|--------|
| **PHI Removal Rate** | % of PHI successfully masked | >99% |
| **OCR Accuracy Retention** | Character accuracy on non-PHI text | >95% |
| **False Positive Rate** | Non-PHI incorrectly masked | <5% |
| **Processing Speed** | Seconds per page | <2s |
| **F1 Score** | Harmonic mean of precision/recall | >0.90 |
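As a reference point, the two headline metrics could be computed along these lines (a minimal sketch under simple string-matching assumptions; function names are ours, not the evaluation framework's):

```python
def phi_removal_rate(phi_spans: list[str], ocr_output: str) -> float:
    """Fraction of ground-truth PHI strings that no longer appear in the OCR text."""
    if not phi_spans:
        return 1.0
    leaked = sum(span in ocr_output for span in phi_spans)
    return 1 - leaked / len(phi_spans)

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example: 3 of 7 PHI strings leak into the decoded text -> ~57.1% removal.
print(phi_removal_rate(["123-45-6789"] * 3 + ["Jane Doe"] * 4, "SSN: 123-45-6789"))
```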
## Technical Details

### Vision Token Processing

DeepSeek-OCR compresses a 1024×1024 image through multiple stages:

1. **SAM-base block**: Windowed attention for local detail (4096 tokens)
2. **CLIP-large block**: Global attention for layout understanding (1024 tokens)
3. **Convolution layer**: Compresses to the final 256 tokens (a 16× reduction from the initial 4096)
4. **Projector fusion**: Maps vision tokens into the language model's embedding space

Each of the final 256 vision tokens therefore represents a ~64×64 pixel region (a 16×16 grid over the 1024×1024 input), carrying both semantic and spatial information.
### Masking Implementation

Vision tokens whose regions overlap PHI bounding boxes are zeroed at different architectural layers (V3-V9). Ablation studies varied the mask expansion radius (r = 1, 2, 3) to determine whether wider spatial coverage affects reduction rates.
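A minimal sketch of the bounding-box-to-token mapping, assuming the 16×16 token grid described above (grid arithmetic only; this is not the repository's code):

```python
import torch

GRID = 16             # 256 tokens = 16 × 16 grid over a 1024 × 1024 image
PATCH = 1024 // GRID  # each token covers a 64 × 64 pixel region

def bbox_to_token_indices(x0, y0, x1, y1, radius=0):
    """Map a pixel-space PHI bounding box to flat vision-token indices.

    `radius` expands the mask by that many tokens in each direction,
    mirroring the r = 1, 2, 3 ablations.
    """
    c0 = max(x0 // PATCH - radius, 0)
    c1 = min(x1 // PATCH + radius, GRID - 1)
    r0 = max(y0 // PATCH - radius, 0)
    r1 = min(y1 // PATCH + radius, GRID - 1)
    return [r * GRID + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]

tokens = torch.randn(1, 256, 64)          # (batch, tokens, hidden)
idx = bbox_to_token_indices(128, 64, 300, 96, radius=1)
tokens[:, idx, :] = 0.0                   # zero the overlapping PHI tokens
```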
## Experimental Results

### Main Findings

| Masking Strategy | Layer Target | PHI Reduction | Names | DOB | SSN | MRN | Addresses |
|------------------|--------------|---------------|-------|-----|-----|-----|-----------|
| **V3-V9 (all)** | Various | **42.9%** | 100% | 100% | 0% | 0% | 100% |
| Baseline | None | 0% | 0% | 0% | 0% | 0% | 0% |
| Hybrid (sim) | Vision + NLP | **88.6%** | 100% | 100% | 80% | 80% | 100% |
### Key Insights

1. **Convergence**: All seven masking strategies (V3-V9) achieved an identical 42.9% reduction regardless of architectural layer
2. **Spatial Invariance**: Increasing the mask expansion radius (r = 1, 2, 3) did not improve reduction beyond this ceiling
3. **Type-Dependent Success**:
   - ✅ Long-form, spatially distributed PHI: 100% reduction
   - ❌ Short structured identifiers: 0% reduction
4. **Root Cause**: The language model's contextual inference reconstructs masked structured identifiers from surrounding document context
### Implications for Privacy-Preserving VLMs

- Vision-only masking is **insufficient for HIPAA compliance** (which requires 99%+ reduction)
- Hybrid architectures combining vision masking with NLP post-processing are necessary
- Future work should focus on decoder-level fine-tuning or defense-in-depth approaches
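The simulated hybrid row in the table above pairs vision masking with a text-side filter. As a rough illustration of what such an NLP post-processing pass could look like (the patterns below are illustrative, not the simulation's actual rules):

```python
import re

# Illustrative patterns for structured identifiers that survive vision masking.
PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "MRN":   re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
}

def redact(text: str) -> str:
    """Replace residual structured identifiers in decoded OCR text."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("MRN: 00123456, SSN 123-45-6789, jane.doe@example.com"))
# -> "[MRN], SSN [SSN], [EMAIL]"
```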
## Paper

A paper describing this work has been submitted for peer review. The paper, experimental results, and additional materials are available in the `not_uploaded/` directory (not included in this public repository).
## Citation

If you use this work in your research, please cite:

```bibtex
@article{young2025visionmasking,
  title={Vision Token Masking Alone Cannot Prevent PHI Leakage in Medical Document OCR: A Systematic Evaluation},
  author={Young, Richard J.},
  institution={DeepNeuro.AI; University of Nevada, Las Vegas},
  journal={Under Review},
  year={2025},
  note={Code available at: https://github.com/ricyoung/Justitia-Selective_Vision_Token_Masking_for_PHI-Compliant_OCR}
}
```
## License

This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- **DeepSeek AI** for the DeepSeek-OCR model
- **MITRE Corporation** for the Synthea synthetic patient generator
- **Hugging Face** for the PEFT library and model hosting
- **Meta AI** for the Segment Anything Model (SAM)
- **OpenAI** for the CLIP vision encoder

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Disclaimer

**IMPORTANT**: This is a research project for academic purposes. It is **NOT** intended for production use with real patient PHI. Always consult with legal and compliance teams before deploying PHI-related systems in healthcare settings.

## Contact

**Richard J. Young**

- Founding AI Scientist, DeepNeuro.AI
- University of Nevada, Las Vegas, Department of Neuroscience
- Website: [deepneuro.ai/richard](https://deepneuro.ai/richard)
- HuggingFace: [@richardyoung](https://huggingface.co/richardyoung)
- GitHub: Open an issue on this repository

---

**Note**: The `not_uploaded/` directory contains paper drafts, experimental results, and other materials not included in the public repository.