# Vision Token Masking Cannot Prevent PHI Leakage: A Negative Result

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/) [![Paper](https://img.shields.io/badge/paper-under%20review-red.svg)](https://deepneuro.ai/richard)

> **Author**: Richard J. Young
> **Affiliation**: Founding AI Scientist, [DeepNeuro.AI](https://deepneuro.ai/richard) | University of Nevada, Las Vegas, Department of Neuroscience
> **Links**: [HuggingFace](https://huggingface.co/richardyoung) | [DeepNeuro.AI](https://deepneuro.ai/richard)

## Overview

This repository contains the systematic evaluation code and data generation pipeline for our paper **"Vision Token Masking Alone Cannot Prevent PHI Leakage in Medical Document OCR"**.

**Key Finding**: Vision-level token masking in VLMs achieves only **42.9% PHI reduction** regardless of masking strategy: it suppresses long-form identifiers (names, addresses) at 100% effectiveness while failing completely on short structured identifiers (SSNs, medical record numbers) at 0% effectiveness.
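To make "vision token masking" concrete: DeepSeek-OCR represents a 1024×1024 page as 256 vision tokens on a 16×16 grid, so masking amounts to zeroing the tokens whose ~64×64 px cells overlap a PHI bounding box. A minimal sketch of that grid arithmetic (function names and the single-layer zeroing simplification are ours, not the repository's API):

```python
GRID = 16          # 256 vision tokens laid out as a 16 x 16 grid
IMAGE_SIZE = 1024  # input resolution; each token covers a 64 x 64 px cell
CELL = IMAGE_SIZE // GRID

def phi_token_indices(x0, y0, x1, y1, radius=0):
    """Flat indices of the vision tokens overlapped by a PHI bounding box
    (pixel coords, half-open [x0, x1) x [y0, y1)), optionally expanded by
    `radius` extra tokens on every side (the paper's r = 1, 2, 3 ablation)."""
    c0 = max(x0 // CELL - radius, 0)
    c1 = min((x1 - 1) // CELL + radius, GRID - 1)
    r0 = max(y0 // CELL - radius, 0)
    r1 = min((y1 - 1) // CELL + radius, GRID - 1)
    return [r * GRID + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]

def mask_tokens(tokens, indices):
    """Zero the selected token embeddings (tokens: a list of 256 vectors)."""
    drop = set(indices)
    dim = len(tokens[0])
    return [[0.0] * dim if i in drop else t for i, t in enumerate(tokens)]
```

A one-line SSN on a billing statement typically spans only a handful of such cells, which is consistent with the finding below that the failure mode is not one of spatial coverage.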
### The Negative Result

We evaluated **seven masking strategies** (V3-V9) across different architectural layers of DeepSeek-OCR using 100 synthetic medical billing statements (drawn from a corpus of 38,517 annotated documents):

- **What Worked**: 100% reduction of patient names, dates of birth, physical addresses (spatially distributed, long-form PHI)
- **What Failed**: 0% reduction of SSNs, medical record numbers, email addresses, account numbers (short structured identifiers)
- **The Ceiling**: All strategies converged to 42.9% total PHI reduction
- **Why**: Structured identifier leakage is driven by language model contextual inference, not by insufficient visual masking

This establishes **fundamental boundaries** for vision-only privacy interventions in VLMs and redirects future research toward decoder-level fine-tuning and hybrid defense-in-depth architectures.

## Architecture

```
Input PDF → Vision Encoder → PHI Detection  → Vision Token Masking → DeepSeek Decoder → Text Output
             (SAM + CLIP)     (Ground Truth)   (V3-V9 Strategies)     (3B-MoE)
```

### Experimental Approach

1. **Base Model**: DeepSeek-OCR
   - Vision Encoder: SAM-base + CLIP-large blocks
   - Text Decoder: DeepSeek-3B-MoE
   - Processes 1024×1024 images into 256 vision tokens
2. **PHI Detection**: Ground-truth annotations from synthetic data generation
   - Exact bounding box annotations for all 18 HIPAA PHI categories
   - No learned detection model; masking is applied directly from annotations
3. **Seven Masking Strategies (V3-V9)**:
   - **V3-V5**: SAM encoder blocks at different depths
   - **V6**: Compression layer (4096 → 1024 tokens)
   - **V7**: Dual vision encoders (SAM + CLIP)
   - **V8**: Post-compression stage
   - **V9**: Projector fusion layer

## Research Contributions

### What This Work Provides

1. **First Systematic Evaluation** of vision-level token masking for PHI protection in VLMs
2. **Negative Result**: Establishes that vision masking alone is insufficient for HIPAA compliance
3.
   **Boundary Conditions**: Identifies which PHI types are amenable to vision-level vs. language-level redaction
4. **38,517 Annotated Documents**: Large synthetic medical document corpus with ground-truth PHI annotations
5. **Seven Masking Strategies**: V3-V9, targeting SAM encoder blocks, compression layers, dual vision encoders, and projector fusion
6. **Ablation Studies**: Mask expansion radius variations (r = 1, 2, 3) demonstrating spatial coverage limitations
7. **Hybrid Architecture Simulation**: Shows 88.6% reduction when combining vision masking with NLP post-processing

### What's in This Repository

- **Synthetic Data Pipeline**: Fully functional Synthea-based pipeline generating 38,517+ annotated medical PDFs
- **PHI Annotation Tools**: Ground-truth annotation pipeline for all 18 HIPAA identifier categories
- **Evaluation Framework**: Code for measuring PHI reduction across masking strategies
- **Configuration Files**: DeepSeek-OCR integration and experimental parameters

## Quick Start

### Prerequisites

- Python 3.12+
- CUDA 11.8+ compatible GPU (8 GB+ VRAM recommended)
- Java JDK 11+ (for Synthea data generation)
- 50 GB+ free disk space

### Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/Justitia-Selective_Vision_Token_Masking_for_PHI-Compliant_OCR.git
cd Justitia-Selective_Vision_Token_Masking_for_PHI-Compliant_OCR

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download the DeepSeek-OCR model
python scripts/download_model.py
```

### Generate Synthetic Medical Data

The primary working component is the data generation pipeline:

```bash
# Set up Synthea (synthetic patient generator)
bash scripts/setup_synthea.sh

# Generate synthetic patient data
bash scripts/generate_synthea_data.sh

# Convert to annotated PDFs
python scripts/generate_clinical_notes.py --output-dir data/pdfs --num-documents 1000
```

### Explore the Code
```bash
# View the LoRA adapter implementation
cat src/training/lora_phi_detector.py

# Check the PHI annotation tools
cat src/preprocessing/phi_annotator.py

# Review configuration files
cat config/training_config.yaml
```

**Note**: The training and inference pipelines are not functional. The code is provided for reference and future development.

## Project Structure

```
.
├── config/                          # Configuration files
│   ├── model_config.yaml            # Model architecture and hyperparameters
│   └── training_config.yaml         # Training settings
│
├── data/                            # Data directory (generated, not in repo)
│   ├── synthetic/                   # Synthea synthetic patient data
│   ├── pdfs/                        # Generated medical PDFs with PHI
│   └── annotations/                 # PHI bounding box annotations
│
├── models/                          # Model directory (not in repo)
│   ├── deepseek_ocr/                # Base DeepSeek-OCR model
│   ├── lora_adapters/               # Trained LoRA adapters
│   └── checkpoints/                 # Training checkpoints
│
├── scripts/                         # Utility scripts
│   ├── download_model.py            # Download DeepSeek-OCR
│   ├── setup_synthea.sh             # Install Synthea
│   ├── generate_synthea_data.sh     # Generate patient data
│   ├── generate_clinical_notes.py   # Create medical PDFs
│   ├── generate_realistic_pdfs.py   # Realistic PDF generation
│   ├── generate_additional_documents.py  # Additional document types
│   └── generate_final_document_types.py  # Final document generation
│
├── src/                             # Source code
│   ├── data_generation/             # Synthea integration and PDF generation
│   │   ├── synthea_to_pdf.py
│   │   └── medical_templates.py
│   ├── preprocessing/               # PHI annotation pipeline
│   │   └── phi_annotator.py
│   ├── training/                    # LoRA training implementation
│   │   └── lora_phi_detector.py
│   ├── inference/                   # OCR with PHI masking (placeholder)
│   └── utils/                       # Evaluation and metrics (placeholder)
│
├── tests/                           # Unit tests
├── notebooks/                       # Jupyter notebooks for experiments
│
├── .gitignore                       # Git ignore file
├── requirements.txt                 # Python dependencies
├── setup.py                         # Package setup
├── SETUP.md                         # Detailed setup instructions
├── README.md                        # This file
└── LICENSE                          # MIT License
```

## PHI Categories Detected

Following HIPAA Safe Harbor guidelines, Justitia detects and masks:

| Category | Examples |
|----------|----------|
| **Names** | Patients, physicians, family members, guarantors |
| **Dates** | Birth dates, admission/discharge dates, death dates, appointments |
| **Geographic** | Street addresses, cities, counties, zip codes, facility names |
| **Contact** | Phone numbers, fax numbers, email addresses |
| **Medical IDs** | Medical record numbers, account numbers, health plan IDs |
| **Personal IDs** | SSNs, driver's licenses, vehicle IDs, device identifiers |
| **Biometric** | Photos, fingerprints, voiceprints |
| **Web & Network** | URLs, IP addresses, certificate numbers |

## Masking Strategies

### 1. Token Replacement

- Replaces PHI vision tokens with learned privacy-preserving embeddings
- Fast inference, low memory overhead
- Good utility preservation for non-PHI content

### 2. Selective Attention Masking

- Applies attention masking to prevent PHI token information flow
- Based on the ToSA (Token-level Selective Attention) approach
- Stronger privacy guarantees, moderate computational cost

### 3. Hybrid Approach

- Combines token replacement with selective attention
- Optimal privacy-utility tradeoff
- Recommended for production use

## Evaluation Metrics

| Metric | Description | Target |
|--------|-------------|--------|
| **PHI Removal Rate** | % of PHI successfully masked | >99% |
| **OCR Accuracy Retention** | Character accuracy on non-PHI text | >95% |
| **False Positive Rate** | Non-PHI text incorrectly masked | <5% |
| **Processing Speed** | Seconds per page | <2 s |
| **F1 Score** | Harmonic mean of precision and recall | >0.90 |

## Technical Details

### Vision Token Processing

DeepSeek-OCR compresses a 1024×1024 image through multiple stages:

1. **SAM-base block**: Windowed attention for local detail (4096 tokens)
2. **CLIP-large block**: Global attention for layout understanding (1024 tokens)
3.
   **Convolution layer**: Compresses the sequence to 256 tokens (a 16× reduction relative to the 4096 SAM tokens)
4. **Projector fusion**: Maps vision tokens into the language model's embedding space

Each vision token represents a ~64×64 pixel region with semantic and spatial information.

### Masking Implementation

Vision tokens corresponding to PHI bounding boxes are zeroed at different architectural layers (V3-V9). Ablation studies tested mask expansion radii r = 1, 2, 3 to determine whether spatial coverage affects reduction rates.

## Experimental Results

### Main Findings

| Masking Strategy | Layer Target | PHI Reduction | Names | DOB | SSN | MRN | Addresses |
|-----------------|--------------|---------------|-------|-----|-----|-----|-----------|
| Baseline | None | 0% | 0% | 0% | 0% | 0% | 0% |
| **V3-V9 (all)** | Various | **42.9%** | 100% | 100% | 0% | 0% | 100% |
| Hybrid (simulated) | Vision + NLP | **88.6%** | 100% | 100% | 80% | 80% | 100% |

### Key Insights

1. **Convergence**: All seven masking strategies (V3-V9) achieved an identical 42.9% reduction regardless of architectural layer
2. **Spatial Invariance**: Expanding the mask radius (r = 1, 2, 3) did not improve reduction beyond this ceiling
3. **Type-Dependent Success**:
   - ✅ Long-form, spatially distributed PHI: 100% reduction
   - ❌ Short structured identifiers: 0% reduction
4. **Root Cause**: Language model contextual inference reconstructs masked structured identifiers from document context

### Implications for Privacy-Preserving VLMs

- Vision-only masking is **insufficient for HIPAA compliance** (which requires 99%+ reduction)
- Hybrid architectures combining vision masking with NLP post-processing are necessary
- Future work should focus on decoder-level fine-tuning or defense-in-depth approaches

## Paper

A paper describing this work has been submitted for peer review. The paper, experimental results, and additional materials are kept in the `not_uploaded/` directory, which is not included in this public repository.
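The hybrid configuration reported in the results above pairs vision masking with NLP post-processing of the decoded text, catching the short structured identifiers that survive visual masking. A minimal sketch of such a post-filter (the regex patterns are illustrative assumptions, not the exact rules used in the paper):

```python
import re

# Illustrative patterns for structured identifiers that vision masking
# missed (SSNs, MRNs, emails leak at 100% in the vision-only setting).
# These are assumptions for demonstration, not the paper's rule set.
PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "MRN":   re.compile(r"\bMRN[:#]?\s*\d{6,10}\b", re.IGNORECASE),
}

def redact(text):
    """Replace any remaining structured identifiers with [REDACTED-<type>]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED-{label}]", text)
    return text
```

Because this pass operates on the decoder's output rather than its input, it is unaffected by the contextual-inference failure mode that caps vision-only masking at 42.9%.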
## Citation

If you use this work in your research, please cite:

```bibtex
@article{young2025visionmasking,
  title={Vision Token Masking Alone Cannot Prevent PHI Leakage in Medical Document OCR: A Systematic Evaluation},
  author={Young, Richard J.},
  institution={DeepNeuro.AI; University of Nevada, Las Vegas},
  journal={Under Review},
  year={2025},
  note={Code available at: https://github.com/ricyoung/Justitia-Selective_Vision_Token_Masking_for_PHI-Compliant_OCR}
}
```

## License

This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- **DeepSeek AI** for the DeepSeek-OCR model
- **The MITRE Corporation** for the Synthea synthetic patient generator
- **Hugging Face** for the PEFT library and model hosting
- **Meta AI** for the Segment Anything Model (SAM)
- **OpenAI** for the CLIP vision encoder

## Contributing

Contributions are welcome! Please feel free to submit a pull request.

## Disclaimer

**IMPORTANT**: This is a research project for academic purposes. It is **NOT** intended for production use with real patient PHI. Always consult legal and compliance teams before deploying PHI-related systems in healthcare settings.

## Contact

**Richard J. Young**

- Founding AI Scientist, DeepNeuro.AI
- University of Nevada, Las Vegas, Department of Neuroscience
- Website: [deepneuro.ai/richard](https://deepneuro.ai/richard)
- HuggingFace: [@richardyoung](https://huggingface.co/richardyoung)
- GitHub: open an issue on this repository

---

**Note**: The `not_uploaded/` directory contains paper drafts, experimental results, and other materials not included in the public repository.