
Justitia PHI-OCR Setup Guide

Table of Contents

  1. Prerequisites
  2. Installation
  3. Synthea Setup
  4. Model Download
  5. Data Generation
  6. Training Setup
  7. Troubleshooting

Prerequisites

System Requirements

  • OS: Linux, macOS, or Windows with WSL2
  • Python: 3.12 or newer
  • RAM: 16GB minimum (32GB recommended)
  • Storage: 50GB+ free space
  • GPU: NVIDIA GPU with 8GB+ VRAM (optional but recommended)
    • CUDA 11.8 or newer
    • cuDNN 8.6 or newer

Software Requirements

  • Git
  • Java JDK 11+ (for Synthea)
  • Python virtual environment tool (venv, conda, etc.)
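
The version floors above can be checked programmatically before you start. A minimal sketch using only the standard library (the helper name is illustrative, not part of the repo):

```python
import sys

def meets_minimum(version, minimum):
    """Return True if a version tuple satisfies a minimum, e.g. (3, 12, 1) vs (3, 12)."""
    return tuple(version[:len(minimum)]) >= tuple(minimum)

if __name__ == "__main__":
    ok = meets_minimum(sys.version_info[:3], (3, 12))
    print(f"Python {sys.version.split()[0]}: {'OK' if ok else 'too old (need 3.12+)'}")
```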

Installation

1. Clone the Repository

git clone https://github.com/yourusername/Justitia-PHI-OCR.git
cd Justitia-PHI-OCR

2. Create Virtual Environment

# Using venv
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Or using conda
conda create -n justitia python=3.12
conda activate justitia

3. Install Python Dependencies

# Install base requirements
pip install -r requirements.txt

# Install the package in development mode
pip install -e .

# Install Flash Attention (optional but recommended for speed)
pip install flash-attn==2.7.3 --no-build-isolation

4. Verify Installation

python -c "import torch; print(f'PyTorch: {torch.__version__}')"
python -c "import transformers; print(f'Transformers: {transformers.__version__}')"
python -c "import peft; print(f'PEFT: {peft.__version__}')"

Check whether CUDA is available:

python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
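
If any of the imports above fail, you can find out exactly which packages are missing without triggering a full (and slow) import. A small sketch using only the standard library:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of package names that are not importable in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Prints [] when the environment is complete
print(missing_packages(["torch", "transformers", "peft"]))
```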

Synthea Setup

Synthea generates synthetic patient data with realistic PHI for training.

Automatic Setup

# Run the setup script
./scripts/setup_synthea.sh

Manual Setup

If the automatic setup fails:

# 1. Create external directory
mkdir -p external
cd external

# 2. Clone Synthea
git clone https://github.com/synthetichealth/synthea.git
cd synthea

# 3. Build Synthea
./gradlew build -x test

# 4. Test generation
./run_synthea -p 5

cd ../..

Generate Synthetic Patients

# Generate 1000 patients from Massachusetts
./scripts/generate_synthea_data.sh 1000 Massachusetts ./data/synthetic/patients

# Generate from different states for diversity
./scripts/generate_synthea_data.sh 500 California ./data/synthetic/ca_patients
./scripts/generate_synthea_data.sh 500 Texas ./data/synthetic/tx_patients
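
Each invocation above takes the same three positional arguments (count, state, output directory). If you prefer to drive the script from Python, here is a sketch that builds the argument lists; the helper is illustrative and assumes the script interface shown above:

```python
import subprocess

def synthea_cmd(count, state, out_dir):
    """Build the argv list for generate_synthea_data.sh: count, state, output dir."""
    return ["./scripts/generate_synthea_data.sh", str(count), state, out_dir]

# Generate several states for geographic diversity
jobs = [(500, "California", "./data/synthetic/ca_patients"),
        (500, "Texas", "./data/synthetic/tx_patients")]
for count, state, out_dir in jobs:
    cmd = synthea_cmd(count, state, out_dir)
    print("would run:", " ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually run
```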

Model Download

Download DeepSeek-OCR

# Download the model
python scripts/download_model.py

# Verify download
python scripts/download_model.py --test-only

Alternative: Manual Download

If automatic download fails:

# Using Hugging Face CLI
pip install huggingface-hub
huggingface-cli download deepseek-ai/DeepSeek-OCR --local-dir ./models/deepseek_ocr

Data Generation

1. Generate Synthetic Medical PDFs

# Convert Synthea data to PDFs with PHI
python src/data_generation/synthea_to_pdf.py \
    --synthea-output data/synthetic/patients \
    --pdf-output data/pdfs \
    --annotations-output data/annotations \
    --num-documents 5000
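
The exact annotation schema is defined by synthea_to_pdf.py. As a rough illustration of the kind of record the --annotations-output directory might hold — the field names below are hypothetical, not the repo's actual schema:

```python
import json

# Hypothetical PHI annotation for one rendered page: each entry ties a PHI
# category to a text span and its bounding box on the page (x0, y0, x1, y1).
annotation = {
    "document_id": "patient_0001_encounter_03",
    "page": 1,
    "phi_spans": [
        {"category": "NAME", "text": "Jane Doe", "bbox": [72.0, 110.5, 148.2, 124.0]},
        {"category": "DATE", "text": "2024-03-17", "bbox": [300.1, 110.5, 361.0, 124.0]},
    ],
}
print(json.dumps(annotation, indent=2))
```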

2. Create Training/Validation Split

# Split data into train/val/test sets
python scripts/split_data.py \
    --input-dir data/pdfs \
    --output-dir data/split \
    --train-ratio 0.8 \
    --val-ratio 0.1 \
    --test-ratio 0.1
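
The split script's internals aren't shown here, but the core idea of an 80/10/10 split is simple. A self-contained sketch (not the repo's actual implementation):

```python
import random

def split_dataset(items, train_ratio=0.8, val_ratio=0.1, seed=42):
    """Shuffle deterministically, then partition into train/val/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_ratio)
    n_val = int(len(items) * val_ratio)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])  # test set gets the remainder

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # → 80 10 10
```

The fixed seed makes the split reproducible across runs, which matters when you retrain against the same validation set.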

Training Setup

1. Configure Training Parameters

Edit config/training_config.yaml to adjust:

  • Batch size (based on your GPU memory)
  • Learning rate
  • Number of epochs
  • LoRA rank
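
As a rough illustration of what those knobs might look like in config/training_config.yaml (the exact keys depend on the repo's actual config schema; the names below are assumptions):

```yaml
training:
  batch_size: 8                 # lower this first if you hit CUDA OOM
  learning_rate: 2.0e-4
  num_epochs: 3
lora:
  rank: 16                      # higher rank = more trainable parameters
  alpha: 32
```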

2. Start Training

# Train the LoRA adapter
python src/training/train_lora.py \
    --config config/training_config.yaml \
    --data-dir data/split \
    --output-dir models/lora_adapters/experiment_1

3. Monitor Training

# If using Weights & Biases
wandb login  # First time only

# If using TensorBoard
tensorboard --logdir logs/tensorboard

4. Distributed Training (Multiple GPUs)

# Using accelerate
accelerate config  # Configure multi-GPU setup
accelerate launch src/training/train_lora.py --config config/training_config.yaml

Quick Start Script

For a complete setup from scratch:

#!/bin/bash
# save as quick_setup.sh

echo "Setting up Justitia PHI-OCR..."

# 1. Create virtual environment
python -m venv venv
source venv/bin/activate

# 2. Install dependencies
pip install -r requirements.txt
pip install -e .

# 3. Setup Synthea
./scripts/setup_synthea.sh

# 4. Download model
python scripts/download_model.py

# 5. Generate test data
./scripts/generate_synthea_data.sh 100 Massachusetts ./data/synthetic/patients
python src/data_generation/synthea_to_pdf.py --num-documents 100

echo "Setup complete! You can now start training."

Troubleshooting

Common Issues

1. CUDA Out of Memory

Reduce batch size in config/training_config.yaml:

training:
  batch_size: 4  # Reduce from 8
  gradient_accumulation_steps: 8  # Increase to maintain effective batch size
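
The two settings trade off against each other: the effective batch size is batch_size × gradient_accumulation_steps, so halving one and doubling the other leaves the optimizer's view of the data unchanged. A quick check:

```python
def effective_batch_size(batch_size, grad_accum_steps, num_gpus=1):
    """Samples contributing to each optimizer step."""
    return batch_size * grad_accum_steps * num_gpus

print(effective_batch_size(8, 4))  # → 32 (original config)
print(effective_batch_size(4, 8))  # → 32 (less memory, same effective batch)
```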

2. Flash Attention Installation Failed

Flash Attention is optional. If installation fails:

# Edit config/model_config.yaml
optimization:
  use_flash_attention: false

3. Synthea Build Failed

Ensure Java is installed:

java -version  # Should show Java 11 or newer

# On Ubuntu/Debian
sudo apt-get install openjdk-11-jdk

# On macOS
brew install openjdk@11

4. Model Download Timeout

Try alternative sources:

python scripts/download_model.py --model-name unsloth/DeepSeek-OCR

5. ImportError for Custom Modules

Ensure the package is installed in development mode:

pip install -e .

GPU Memory Requirements

Configuration          VRAM Required  Batch Size  Notes
Full Training          40GB+          8-16        A100 recommended
LoRA Training          16-24GB        4-8         RTX 3090/4090
LoRA Training (8-bit)  8-12GB         2-4         RTX 3070/3080
CPU Training           System RAM     1-2         Very slow

Performance Optimization

  1. Enable Mixed Precision:

training:
  mixed_precision:
    enabled: true
    dtype: "fp16"  # or "bf16" for newer GPUs

  2. Use Gradient Checkpointing:

training:
  gradient_checkpointing: true

  3. Enable Compiled Model (PyTorch 2.0+):

optimization:
  compile_model: true

Next Steps

After setup:

  1. Generate Training Data: Create at least 5,000 synthetic PDFs
  2. Train Initial Model: Start with a small dataset to verify pipeline
  3. Evaluate Performance: Test PHI detection accuracy
  4. Fine-tune: Adjust hyperparameters based on results
  5. Deploy: Set up inference pipeline for production use

Getting Help

  • Issues: GitHub Issues
  • Documentation: See /docs directory
  • Community: Join our Discord/Slack (if applicable)

License

This project is licensed under the MIT License. See LICENSE file for details.

Citation

If you use this project in research, please cite:

@software{justitia2025,
  title={Justitia: Selective Vision Token Masking for PHI-Compliant OCR},
  year={2025},
  url={https://github.com/yourusername/Justitia-PHI-OCR}
}