# MSME Document Completeness Scorer

## Model Overview

This repository hosts an ensemble of 5 independent binary XGBoost classifiers that automate the Document Completeness Scoring step in Indian MSME (Micro, Small, and Medium Enterprises) dispute resolution workflows.

Each classifier is a serialized scikit-learn pipeline (TfidfVectorizer → XGBClassifier) that detects the presence or absence of one specific mandatory document type from raw OCR-extracted text. The models are designed to be robust against common real-world challenges, including OCR noise, scanned-document artifacts, and adversarial near-miss inputs such as Proforma Invoices or Draft documents, which structurally resemble valid legal documents but are legally insufficient for dispute filings.
## Included Models

| Model File | Target Document Type | Precision (Missing) | Recall (Present) |
|---|---|---|---|
| `invoice_model.pkl` | Tax Invoice | ~99% | ~90% |
| `po_model.pkl` | Purchase Order | ~99% | ~87% |
| `delivery_model.pkl` | Delivery Challan / Proof of Delivery | ~99% | ~90% |
| `gst_model.pkl` | GST Registration Certificate | ~99% | ~90% |
| `contract_model.pkl` | Supply Agreement / Contract | ~99% | ~90% |

Models were trained on the `msme-dispute-document-corpus`, a synthetic OCR dataset of 8,000+ samples generated via Gemini 2.5 Flash.
## Intended Use Cases

This model suite is intended for:

- **Dispute Resolution Platforms**: Automatically flagging missing evidence documents in arbitration or legal case files before human review.
- **MSME Samadhaan Portals**: Programmatically filtering incomplete applications to reduce officer workload.
- **Legal Tech Pipelines**: Converting unstructured text dumps from scanned case files into structured document-presence classifications.

## Out-of-Scope Use

These models are not intended for general-purpose document classification, non-Indian business contexts, or languages other than English.
## Getting Started

### Installation

```bash
pip install scikit-learn xgboost pandas joblib huggingface_hub
```
### Loading the Models

```python
import joblib
from huggingface_hub import hf_hub_download

# Replace with your actual Hugging Face repository ID
REPO_ID = "your-username/msme-document-completeness-scorer"

MODEL_FILES = {
    "invoice": "invoice_model.pkl",
    "po": "po_model.pkl",
    "delivery": "delivery_model.pkl",
    "gst": "gst_model.pkl",
    "contract": "contract_model.pkl",
}

models = {}
for doc_type, filename in MODEL_FILES.items():
    model_path = hf_hub_download(repo_id=REPO_ID, filename=filename)
    models[doc_type] = joblib.load(model_path)
    print(f"Loaded: {filename}")
```
### Running Inference

```python
def predict_document_status(text: str, doc_type: str) -> tuple[str, float]:
    """
    Predicts whether a given document type is present in the provided OCR text.

    Args:
        text: Raw text string extracted from a scanned document via OCR.
        doc_type: Document classifier key. One of: 'invoice', 'po',
            'delivery', 'gst', 'contract'.

    Returns:
        status: 'Present' if the document type is detected, else 'Missing'.
        confidence: Probability score (0.0 to 1.0) from the classifier.
    """
    model = models.get(doc_type)
    if model is None:
        raise ValueError(f"No model loaded for doc_type='{doc_type}'.")
    # predict_proba returns [[prob_class_0 (Missing), prob_class_1 (Present)]]
    confidence = model.predict_proba([text])[0][1]
    # Production threshold: >= 0.85 confidence required to classify as Present
    status = "Present" if confidence >= 0.85 else "Missing"
    return status, confidence


# Example
sample_text = """
TAX INVOICE
Inv No: INV-2024-001
Date: 12/12/2024
Total: 50,000 INR
GSTIN: 29ABCDE1234F1Z5
"""

status, confidence = predict_document_status(sample_text, "invoice")
print(f"Status     : {status}")
print(f"Confidence : {confidence:.4f}")
```

Expected output:

```text
Status     : Present
Confidence : 0.9731
```
### Scoring a Full Case File

To check completeness across all five mandatory document types at once:

```python
def score_case_file(documents: dict[str, str]) -> dict:
    """
    Args:
        documents: A dict mapping doc_type keys to their OCR-extracted text.
            Example: {"invoice": "...", "po": "...", "gst": "..."}

    Returns:
        A results dict with status and confidence per document type,
        plus a top-level 'is_complete' boolean flag.
    """
    results = {}
    for doc_type, text in documents.items():
        status, confidence = predict_document_status(text, doc_type)
        results[doc_type] = {"status": status, "confidence": round(confidence, 4)}
    # A case file is complete only if all five mandatory document types were
    # supplied AND each was classified as Present; a doc_type absent from the
    # input dict counts as missing.
    results["is_complete"] = all(
        doc_type in results and results[doc_type]["status"] == "Present"
        for doc_type in MODEL_FILES
    )
    return results
```
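For downstream reporting, it can be useful to list exactly which document types still need resubmission. The helper below is a hypothetical addition (not part of the published models); it only assumes the results-dict shape returned by `score_case_file` above:

```python
def missing_documents(results: dict) -> list[str]:
    """Return the doc_type keys in a score_case_file result not classified as Present."""
    return [
        doc_type
        for doc_type, entry in results.items()
        # Skip top-level flags like 'is_complete'; only per-document entries are dicts.
        if isinstance(entry, dict) and entry["status"] != "Present"
    ]


# Example with a hand-written results dict:
example = {
    "invoice": {"status": "Present", "confidence": 0.9731},
    "po": {"status": "Missing", "confidence": 0.4102},
    "is_complete": False,
}
print(missing_documents(example))  # ['po']
```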
## Technical Details

### Architecture

Each model is a two-stage scikit-learn pipeline:

- **TF-IDF Vectorizer**: Converts raw OCR text into a sparse term-frequency matrix. Configured with sublinear TF scaling and character n-gram ranges tuned for OCR noise tolerance.
- **XGBoost Classifier**: Gradient-boosted tree classifier trained on the resulting feature vectors with binary cross-entropy loss.
### Classification Threshold

The default decision threshold is 0.85 (rather than the standard 0.50). This was selected to minimize false positives on the Missing class: the models require high confidence before declaring a document present. This conservative threshold is appropriate for legal and compliance workflows, where falsely accepting an incomplete filing carries greater risk than requesting resubmission.
### Training Data
| Property | Details |
|---|---|
| Dataset | msme-dispute-document-corpus (synthetic) |
| Generation Method | Gemini 2.5 Flash with structured OCR simulation |
| Total Samples | 8,000+ labeled examples across all 5 document classes |
| Noise Augmentation | OCR character substitutions, broken line breaks, skewed formatting |
| Adversarial Samples | Proforma invoices, draft purchase orders, unsigned contracts |
## Limitations

**Synthetic Training Distribution.** All training data is synthetically generated. While OCR noise augmentation is applied, model behavior on extremely degraded scans (e.g., below 150 DPI, severe skew, or handwritten annotations) is not guaranteed and should be validated on representative production samples before deployment.

**Language and Locale.** Models are optimized exclusively for English-language documents using Indian business conventions: INR currency formatting, GSTIN identifiers, and Indian-specific terminology such as "Challan". Performance on documents from other jurisdictions or in regional languages is untested.

**OCR Dependency.** These models process text only. PDF, image, or scanned document inputs must be pre-processed through an external OCR engine before inference. Prediction quality is directly bounded by the quality of the OCR output.
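In an ingestion pipeline, this means routing binary inputs through OCR before calling `predict_document_status`. A hypothetical dispatch helper (the suffix list is an assumption; adjust to your intake formats):

```python
from pathlib import Path

# File types that must pass through an OCR engine before scoring (assumed list).
OCR_REQUIRED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def needs_ocr(path: str) -> bool:
    """Return True if the input file must be OCR-processed before inference."""
    return Path(path).suffix.lower() in OCR_REQUIRED_SUFFIXES
```

Files for which `needs_ocr` returns True would be sent to one of the engines listed below; plain-text exports can be scored directly.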
## Compatible OCR Engines
| Engine | Type | Notes |
|---|---|---|
| Tesseract OCR | Open-source | Good baseline; benefits from image pre-processing |
| Azure AI Document Intelligence | Managed API | Strong performance on structured forms and tables |
| Google Cloud Vision API | Managed API | Reliable across varied scan quality |
## Citation

If you use this model in research or production, please cite this repository and acknowledge the synthetic training corpus.

## License

See LICENSE for full terms.