Political Meme Classification - MAF Model

Model Description

Multimodal Attention Fusion (MAF) model for binary classification of Bengali political memes:

  • NonPolitical (0): Non-political content
  • Political (1): Political content

This model combines visual features from CLIP (ViT-B-16) and textual features from XLM-RoBERTa-Large via multi-head attention to classify meme images that contain Bengali text.

Architecture

  • Visual Encoder: CLIP ViT-B-16 (fine-tuned, pretrained on LAION-2B)
  • Text Encoder: XLM-RoBERTa-Large (fine-tuned)
  • Fusion: Multi-head Attention (16 heads) for cross-modal interaction
  • Classifier: 2-layer fully connected network with dropout
  • Lexicon Boosting: Political keyword detection for improved accuracy
  • Input: 224x224 images + text (max 70 tokens)
  • Output: Binary classification (NonPolitical/Political)
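
The authoritative layer definitions live in model_architecture.py in this repository. As a rough orientation only, a fusion block of this kind could be wired as in the sketch below; the class name MAFSketch, the layer sizes, and the forward signature are assumptions for illustration, not the shipped code.

import torch
import torch.nn as nn

class MAFSketch(nn.Module):
    """Illustrative cross-modal fusion: text features attend over image features."""
    def __init__(self, clip_visual, text_encoder, hidden_dim=512, num_heads=16,
                 num_classes=2, dropout=0.3):
        super().__init__()
        self.clip_visual = clip_visual        # CLIP ViT-B-16 visual tower (512-d output)
        self.text_encoder = text_encoder      # XLM-RoBERTa-Large (1024-d hidden size)
        self.text_proj = nn.Linear(1024, hidden_dim)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.classifier = nn.Sequential(      # 2-layer fully connected head with dropout
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.clip_visual(pixel_values)                       # (B, 512)
        txt = self.text_encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state[:, 0]
        txt = self.text_proj(txt)                                  # (B, 512)
        fused, _ = self.cross_attn(txt.unsqueeze(1), img.unsqueeze(1), img.unsqueeze(1))
        return self.classifier(torch.cat([img, fused.squeeze(1)], dim=-1))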

Training Details

  • Task: Binary Image Classification
  • Dataset: PoliMemeDecode (2,290 training samples, 572 validation samples)
  • Epochs: 10
  • Learning Rate: 8e-05
  • Batch Size: 16
  • Max Text Length: 70
  • Attention Heads: 16
  • Optimizer: AdamW with linear warmup scheduler
  • Loss: CrossEntropyLoss
  • Fine-tuning: Both the CLIP visual encoder and XLM-RoBERTa are fully fine-tuned
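
A minimal sketch of how these settings map onto an AdamW optimizer, a linear-warmup scheduler, and the loss; the warmup fraction and the stand-in model are illustrative assumptions, not values taken from the actual training run:

import torch
from transformers import get_linear_schedule_with_warmup

epochs, batch_size, lr = 10, 16, 8e-5
steps_per_epoch = (2290 + batch_size - 1) // batch_size      # 2,290 training samples
total_steps = epochs * steps_per_epoch

model = torch.nn.Linear(512, 2)                              # stand-in for the MAF model
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),                 # warmup fraction assumed
    num_training_steps=total_steps,
)
criterion = torch.nn.CrossEntropyLoss()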

Files in Repository

  • maf_model_full.pth - Complete model state dict (includes all weights)
  • clip_visual_finetuned.pth - Fine-tuned CLIP visual encoder weights only
  • clip_config.pth - CLIP model configuration
  • model_architecture.py - Model architecture code with lexicon support

Usage

from huggingface_hub import hf_hub_download
import torch
import open_clip
from transformers import AutoTokenizer

# Setup device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download model files
model_weights_path = hf_hub_download(repo_id="pawmeow/bengali-political-maf-friday", filename="maf_model_full.pth")
arch_path = hf_hub_download(repo_id="pawmeow/bengali-political-maf-friday", filename="model_architecture.py")
clip_config_path = hf_hub_download(repo_id="pawmeow/bengali-political-maf-friday", filename="clip_config.pth")

# Load CLIP configuration
clip_config = torch.load(clip_config_path, map_location=device)

# Initialize CLIP model (base model)
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    clip_config['model_name'],  # 'ViT-B-16'
    pretrained=clip_config['pretrained'],  # 'laion2b_s34b_b88k'
    device=device
)

# Extract visual encoder
clip_model = clip_model.visual.float().to(device)

# Import architecture
import importlib.util
spec = importlib.util.spec_from_file_location("model_architecture", arch_path)
model_arch = importlib.util.module_from_spec(spec)
spec.loader.exec_module(model_arch)
MAF = model_arch.MAF

# Initialize model with CLIP encoder
model = MAF(clip_model, num_classes=2, num_heads=16, use_lexicon_boost=True)

# Load fine-tuned weights (this will load the fine-tuned CLIP weights too)
model.load_state_dict(torch.load(model_weights_path, map_location=device))
model = model.to(device)
model.eval()

print("โœ“ Model loaded with fine-tuned CLIP and XLM-RoBERTa weights!")

# Prepare for inference
# ... (prepare image and text inputs)
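
The exact forward signature of MAF is defined in model_architecture.py. Continuing the snippet above, inference could look roughly like the sketch below, assuming the model takes a preprocessed image tensor plus XLM-RoBERTa input IDs and attention mask, and that the meme text is supplied manually (the image path and caption are placeholders):

from PIL import Image

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

image = preprocess(Image.open("meme.jpg")).unsqueeze(0).to(device)   # (1, 3, 224, 224)
encoded = tokenizer(
    "মেমের বাংলা ক্যাপশন",                                           # the meme's Bengali text
    max_length=70, padding="max_length", truncation=True, return_tensors="pt",
).to(device)

with torch.no_grad():
    logits = model(image, encoded["input_ids"], encoded["attention_mask"])
    pred = logits.argmax(dim=-1).item()

print("Political" if pred == 1 else "NonPolitical")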

Model Performance

Evaluated on the validation set with binary classification metrics:

  • Accuracy, Precision, Recall, F1 Score
  • Class-specific metrics for Political class
  • Confusion matrix analysis
  • Lexicon-based boosting for improved political content detection
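
For example, these metrics can be computed from validation predictions with scikit-learn (an extra dependency not listed under Requirements; y_true and y_pred below are hypothetical label arrays):

from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix

y_true = [0, 1, 1, 0, 1]    # ground-truth labels (0 = NonPolitical, 1 = Political)
y_pred = [0, 1, 0, 0, 1]    # model predictions over the validation set

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary", pos_label=1)
print(f"Accuracy {acc:.3f}  Precision {prec:.3f}  Recall {rec:.3f}  F1 {f1:.3f}")
print(confusion_matrix(y_true, y_pred))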

Requirements

torch>=1.9.0
torchvision>=0.10.0
transformers>=4.41.2
open_clip_torch
pillow>=9.5.0
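
For example (huggingface_hub is additionally needed for the download snippet in the Usage section):

pip install "torch>=1.9.0" "torchvision>=0.10.0" "transformers>=4.41.2" open_clip_torch "pillow>=9.5.0" huggingface_hub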

Citation

@inproceedings{ahsan2024multimodal,
  title={A Multimodal Framework to Detect Target Aware Aggression in Memes},
  author={Ahsan, Shawly and Hossain, Eftekhar and Sharif, Omar and Das, Avishek and Hoque, Mohammed Moshiul and Dewan, M},
  booktitle={Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={2487--2500},
  year={2024}
}

License

Apache 2.0

Limitations

  • Trained specifically on Bengali political memes
  • Requires both image and text input
  • Performance may vary on out-of-domain content
  • Binary classification only (Political vs NonPolitical)

Model Details

CLIP Visual Encoder

  • Model: ViT-B-16 (Vision Transformer)
  • Pretrained: LAION-2B dataset (34B samples seen)
  • Patch Size: 16×16 (196 patches per image)
  • Output Dimension: 512
  • Fine-tuned: All parameters updated during training
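
A quick sanity check of the shapes described above (the dummy input stands in for a preprocessed 224x224 image):

import torch
import open_clip

clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="laion2b_s34b_b88k"
)
visual = clip_model.visual.float()

dummy = torch.randn(1, 3, 224, 224)     # one preprocessed 224x224 RGB image
with torch.no_grad():
    feats = visual(dummy)
print(feats.shape)                       # torch.Size([1, 512])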

Text Encoder

  • Model: XLM-RoBERTa-Large
  • Parameters: 559M
  • Fine-tuned: All layers updated during training

Political Lexicon

The model uses a curated lexicon of Bengali and English political keywords to boost detection of political content. The lexicon includes terms related to political parties, leaders, movements, and events.
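
The actual lexicon and boosting logic ship with model_architecture.py. A simplified, purely illustrative version of keyword-based boosting might look like this (the keywords, boost value, and function name are assumptions):

import torch

POLITICAL_KEYWORDS = ["নির্বাচন", "ভোট", "সরকার", "election", "minister"]   # example terms only

def boost_logits(logits: torch.Tensor, text: str, boost: float = 1.0) -> torch.Tensor:
    """If the text mentions a known political keyword, nudge the Political logit upward."""
    if any(kw in text.lower() for kw in POLITICAL_KEYWORDS):
        logits = logits.clone()
        logits[..., 1] += boost          # index 1 = Political
    return logits

print(boost_logits(torch.tensor([[0.2, 0.1]]), "আগামী নির্বাচন নিয়ে মিম"))   # tensor([[0.2000, 1.1000]])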
