# MSME Document Completeness Scorer

## Model Overview

This repository hosts an ensemble of 5 independent binary XGBoost classifiers that automate the Document Completeness Scoring step in Indian MSME (Micro, Small, and Medium Enterprises) dispute resolution workflows.

Each classifier is a serialized scikit-learn pipeline (TfidfVectorizer → XGBClassifier) that detects the presence or absence of one specific mandatory document type from raw OCR-extracted text. The models are designed to be robust against common real-world challenges, including OCR noise, scanned-document artifacts, and adversarial near-miss inputs such as Proforma Invoices or Draft documents, which structurally resemble valid legal documents but are legally insufficient for dispute filings.
## Included Models

| Model File | Target Document Type | Precision (Missing) | Recall (Present) |
|---|---|---|---|
| `invoice_model.pkl` | Tax Invoice | ~99% | ~90% |
| `po_model.pkl` | Purchase Order | ~99% | ~87% |
| `delivery_model.pkl` | Delivery Challan / Proof of Delivery | ~99% | ~90% |
| `gst_model.pkl` | GST Registration Certificate | ~99% | ~90% |
| `contract_model.pkl` | Supply Agreement / Contract | ~99% | ~90% |

Models were trained on the `msme-dispute-document-corpus`, a synthetic OCR dataset of 8,000+ samples generated via Gemini 2.5 Flash.
## Intended Use Cases

This model suite is intended for:

- **Dispute Resolution Platforms**: Automatically flagging missing evidence documents in arbitration or legal case files before human review.
- **MSME Samadhaan Portals**: Programmatically filtering incomplete applications to reduce officer workload.
- **Legal Tech Pipelines**: Converting unstructured text dumps from scanned case files into structured document-presence classifications.

## Out-of-Scope Use

These models are not intended for general-purpose document classification, non-Indian business contexts, or languages other than English.
## Getting Started

### Installation

```bash
pip install scikit-learn xgboost pandas joblib huggingface_hub
```
### Loading the Models

```python
import joblib
from huggingface_hub import hf_hub_download

# Replace with your actual Hugging Face repository ID
REPO_ID = "your-username/msme-document-completeness-scorer"

MODEL_FILES = {
    "invoice": "invoice_model.pkl",
    "po": "po_model.pkl",
    "delivery": "delivery_model.pkl",
    "gst": "gst_model.pkl",
    "contract": "contract_model.pkl",
}

models = {}
for doc_type, filename in MODEL_FILES.items():
    model_path = hf_hub_download(repo_id=REPO_ID, filename=filename)
    models[doc_type] = joblib.load(model_path)
    print(f"Loaded: {filename}")
```
### Running Inference

```python
def predict_document_status(text: str, doc_type: str) -> tuple[str, float]:
    """
    Predicts whether a given document type is present in the provided OCR text.

    Args:
        text: Raw text string extracted from a scanned document via OCR.
        doc_type: Document classifier key. One of: 'invoice', 'po',
            'delivery', 'gst', 'contract'.

    Returns:
        status: 'Present' if the document type is detected, else 'Missing'.
        confidence: Probability score (0.0 to 1.0) from the classifier.
    """
    model = models.get(doc_type)
    if model is None:
        raise ValueError(f"No model loaded for doc_type='{doc_type}'.")
    # predict_proba returns [[prob_class_0 (Missing), prob_class_1 (Present)]]
    confidence = model.predict_proba([text])[0][1]
    # Production threshold: >= 0.85 confidence required to classify as Present
    status = "Present" if confidence >= 0.85 else "Missing"
    return status, confidence


# Example
sample_text = """
TAX INVOICE
Inv No: INV-2024-001
Date: 12/12/2024
Total: 50,000 INR
GSTIN: 29ABCDE1234F1Z5
"""

status, confidence = predict_document_status(sample_text, "invoice")
print(f"Status     : {status}")
print(f"Confidence : {confidence:.4f}")
```

Expected output:

```text
Status     : Present
Confidence : 0.9731
```
### Scoring a Full Case File

To check completeness across all five mandatory document types at once:

```python
def score_case_file(documents: dict[str, str]) -> dict:
    """
    Args:
        documents: A dict mapping doc_type keys to their OCR-extracted text.
            Example: {"invoice": "...", "po": "...", "gst": "..."}

    Returns:
        A results dict with status and confidence per document type,
        plus a top-level 'is_complete' boolean flag.
    """
    results = {}
    for doc_type, text in documents.items():
        status, confidence = predict_document_status(text, doc_type)
        results[doc_type] = {"status": status, "confidence": round(confidence, 4)}
    # A case file is complete only if all five mandatory document types were
    # supplied AND each was classified as Present; a doc_type absent from the
    # input dict counts as missing.
    results["is_complete"] = all(
        doc_type in results and results[doc_type]["status"] == "Present"
        for doc_type in MODEL_FILES
    )
    return results
```
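For downstream reporting, it can be useful to list exactly which document types still need resubmission. The helper below is a hypothetical addition (not part of the published models); it only assumes the results-dict shape returned by `score_case_file` above:

```python
def missing_documents(results: dict) -> list[str]:
    """Return the doc_type keys in a score_case_file result not classified as Present."""
    return [
        doc_type
        for doc_type, entry in results.items()
        # Skip top-level flags like 'is_complete'; only per-document entries are dicts.
        if isinstance(entry, dict) and entry["status"] != "Present"
    ]


# Example with a hand-written results dict:
example = {
    "invoice": {"status": "Present", "confidence": 0.9731},
    "po": {"status": "Missing", "confidence": 0.4102},
    "is_complete": False,
}
print(missing_documents(example))  # ['po']
```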
## Technical Details

### Architecture

Each model is a two-stage scikit-learn pipeline:

- **TF-IDF Vectorizer**: Converts raw OCR text into a sparse term-frequency matrix. Configured with sublinear TF scaling and character n-gram ranges tuned for OCR noise tolerance.
- **XGBoost Classifier**: Gradient-boosted tree classifier trained on the resulting feature vectors with binary cross-entropy loss.
### Classification Threshold

The default decision threshold is 0.85 (rather than the standard 0.50). This was selected to minimize false positives on the Missing class: the models require high confidence before declaring a document present. This conservative threshold is appropriate for legal and compliance workflows, where falsely accepting an incomplete filing carries greater risk than requesting resubmission.
### Training Data
| Property | Details |
|---|---|
| Dataset | msme-dispute-document-corpus (synthetic) |
| Generation Method | Gemini 2.5 Flash with structured OCR simulation |
| Total Samples | 8,000+ labeled examples across all 5 document classes |
| Noise Augmentation | OCR character substitutions, broken line breaks, skewed formatting |
| Adversarial Samples | Proforma invoices, draft purchase orders, unsigned contracts |
## Limitations

**Synthetic Training Distribution.** All training data is synthetically generated. While OCR noise augmentation is applied, model behavior on extremely degraded scans (e.g., below 150 DPI, severe skew, or handwritten annotations) is not guaranteed and should be validated on representative production samples before deployment.

**Language and Locale.** Models are optimized exclusively for English-language documents using Indian business conventions: INR currency formatting, GSTIN identifiers, and Indian-specific terminology such as "Challan". Performance on documents from other jurisdictions or in regional languages is untested.

**OCR Dependency.** These models process text only. PDF, image, or scanned document inputs must be pre-processed through an external OCR engine before inference. Prediction quality is directly bounded by the quality of the OCR output.
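In an ingestion pipeline, this means routing binary inputs through OCR before calling `predict_document_status`. A hypothetical dispatch helper (the suffix list is an assumption; adjust to your intake formats):

```python
from pathlib import Path

# File types that must pass through an OCR engine before scoring (assumed list).
OCR_REQUIRED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def needs_ocr(path: str) -> bool:
    """Return True if the input file must be OCR-processed before inference."""
    return Path(path).suffix.lower() in OCR_REQUIRED_SUFFIXES
```

Files for which `needs_ocr` returns True would be sent to one of the engines listed below; plain-text exports can be scored directly.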
## Compatible OCR Engines
| Engine | Type | Notes |
|---|---|---|
| Tesseract OCR | Open-source | Good baseline; benefits from image pre-processing |
| Azure AI Document Intelligence | Managed API | Strong performance on structured forms and tables |
| Google Cloud Vision API | Managed API | Reliable across varied scan quality |
## Citation

If you use this model in research or production, please cite this repository and acknowledge the synthetic training corpus.

## License

See LICENSE for full terms.