|
|
--- |
|
|
title: AI Math Question Classifier & Solver |
|
|
emoji: ๐งฎ |
|
|
colorFrom: blue |
|
|
colorTo: purple |
|
|
sdk: docker |
|
|
app_file: app.py |
|
|
pinned: false |
|
|
license: mit |
|
|
tags: |
|
|
- text-classification |
|
|
- mathematics |
|
|
- education |
|
|
- machine-learning |
|
|
- nlp |
|
|
- tfidf |
|
|
- ensemble-methods |
|
|
- gemini |
|
|
--- |
|
|
|
|
|
# ๐งฎ AI Math Question Classifier & Solver |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
[](https://huggingface.co/spaces/NeerajCodz/aiMathQuestionClassification) |
|
|
[](https://opensource.org/licenses/MIT) |
|
|
[](https://www.python.org/downloads/) |
|
|
|
|
|
**An intelligent system for automated mathematical question classification with AI-powered step-by-step solutions** |
|
|
|
|
|
[Try Demo](https://huggingface.co/spaces/NeerajCodz/aiMathQuestionClassification) โข [Report Bug](#contact) โข [Request Feature](#contact) |
|
|
|
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
## ๐ Table of Contents |
|
|
|
|
|
- [Abstract](#abstract) |
|
|
- [Problem Statement](#problem-statement) |
|
|
- [System Architecture](#system-architecture) |
|
|
- [Dataset](#dataset) |
|
|
- [Methodology](#methodology) |
|
|
- [Experimental Results](#experimental-results) |
|
|
- [Design Decisions & Ablation Studies](#design-decisions--ablation-studies) |
|
|
- [Deployment Architecture](#deployment-architecture) |
|
|
- [Usage](#usage) |
|
|
- [Future Work](#future-work) |
|
|
- [Citation](#citation) |
|
|
|
|
|
--- |
|
|
|
|
|
## Abstract |
|
|
|
|
|
This work presents an end-to-end system for automated classification of mathematical questions into domain-specific categories (Algebra, Counting & Probability, Geometry, Intermediate Algebra, Number Theory, Precalculus, Prealgebra) using ensemble machine learning methods combined with AI-powered solution generation. The system achieves a **70.40% weighted F1-score** and **70.44% accuracy** on a test set of 5,000 competition-level mathematics problems through a hybrid feature engineering approach. |
|
|
|
|
|
**Key Contributions:** |
|
|
1. Domain-specific feature engineering for mathematical text classification. |
|
|
2. Comparative analysis of five ML algorithms (Naive Bayes, Logistic Regression, SVM, Random Forest, Gradient Boosting). |
|
|
3. **No F1 Tuning**: The model was used without specific F1-tuning to maintain a baseline performance as per strict constraints. |
|
|
4. Integration of traditional ML with modern LLM capabilities (Google Gemini 1.5-Flash). |
|
|
5. Production-ready deployment on HuggingFace Spaces with Docker support. |
|
|
|
|
|
--- |
|
|
|
|
|
## ๐ Features |
|
|
|
|
|
- **๐ฏ Real-time Classification**: Instantly categorizes math problems into topics (Algebra, Calculus, Geometry, etc.) |
|
|
- **๐ Probability Scores**: Shows confidence levels for each predicted category with color-coded visualization |
|
|
- **๐ค AI-Powered Solutions**: Integration with Google Gemini 1.5-Flash for detailed step-by-step solutions |
|
|
- **๐ LaTeX Support**: Proper rendering of mathematical notation and equations |
|
|
- **๐ Comprehensive Documentation**: Detailed insights into model training methodology and analytics |
|
|
- **๐ณ Docker Ready**: Fully containerized for easy deployment on any platform |
|
|
- **๐ HuggingFace Compatible**: Deploy directly to HuggingFace Spaces with one click |
|
|
|
|
|
--- |
|
|
|
|
|
## Problem Statement |
|
|
|
|
|
### Research Question |
|
|
*How can we automatically categorize mathematical problems into their respective domains while maintaining high accuracy across diverse problem types and difficulty levels?* |
|
|
|
|
|
### Challenges Addressed |
|
|
|
|
|
1. **Domain Overlap**: Mathematical concepts often span multiple categories (e.g., calculus problems involving algebraic manipulation) |
|
|
|
|
|
2. **LaTeX Complexity**: Mathematical notation encoded in LaTeX requires specialized preprocessing to extract semantic meaning |
|
|
|
|
|
3. **Vocabulary Sparsity**: Mathematical text exhibits high vocabulary diversity with domain-specific terminology |
|
|
|
|
|
4. **Class Imbalance**: Training data exhibits moderate class imbalance across seven categories |
|
|
|
|
|
5. **Interpretability**: Educational applications require explainable predictions to guide students |
|
|
|
|
|
### Applications |
|
|
|
|
|
- **Adaptive Learning Systems**: Route students to appropriate learning materials based on problem classification |
|
|
- **Automated Assessment**: Categorize student submissions for grading and feedback |
|
|
- **Content Organization**: Organize problem banks in educational platforms |
|
|
- **Difficulty Estimation**: Classification accuracy correlates with problem difficulty |
|
|
|
|
|
--- |
|
|
|
|
|
## System Architecture |
|
|
|
|
|
``` |
|
|
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
|
|
โ User Interface Layer โ |
|
|
โ (Gradio Web Application) โ |
|
|
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
|
|
โ |
|
|
โโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโ |
|
|
โ โ |
|
|
โผ โผ |
|
|
โโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ |
|
|
โ Classification โ โ Solution โ |
|
|
โ Pipeline โ โ Generation โ |
|
|
โ โ โ (Gemini 1.5) โ |
|
|
โ 1. Preprocessing โ โโโโโโโโโโโโโโโโโโโโ |
|
|
โ 2. Feature Extractโ |
|
|
โ 3. Vectorization โ |
|
|
โ 4. Prediction โ |
|
|
โ 5. Probability โ |
|
|
โโโโโโโโโโโโโโโโโโโโโ |
|
|
โ |
|
|
โผ |
|
|
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
|
|
โ Model Ensemble โ |
|
|
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
|
|
โ โ Gradient Boosting (Best) โ โ |
|
|
โ โ F1-Score: 0.7040 โ โ |
|
|
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ |
|
|
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Dataset |
|
|
|
|
|
### MATH Dataset (Hendrycks et al., 2021) |
|
|
|
|
|
**Source**: [MATH Dataset](https://github.com/hendrycks/math) - A dataset of 12,500 challenging competition mathematics problems |
|
|
|
|
|
**Statistics:** |
|
|
- **Training Set**: 7,500 problems |
|
|
- **Test Set**: 5,000 problems |
|
|
- **Categories**: 7 (Algebra, Calculus, Counting & Probability, Geometry, Intermediate Algebra, Number Theory, Precalculus) |
|
|
- **Format**: JSON with problem text, solution, and difficulty level |
|
|
|
|
|
**Class Distribution:** |
|
|
|
|
|
| Topic | Train | Test | % Train | % Test | |
|
|
|--------------------------|--------|-------|---------|--------| |
|
|
| Precalculus | 1,428 | 546 | 19.0% | 10.9% | |
|
|
| Prealgebra | 1,375 | 871 | 18.3% | 17.4% | |
|
|
| Intermediate Algebra | 1,211 | 903 | 16.1% | 18.1% | |
|
|
| Algebra | 1,187 | 1,187 | 15.8% | 23.7% | |
|
|
| Geometry | 956 | 479 | 12.7% | 9.6% | |
|
|
| Number Theory | 869 | 540 | 11.6% | 10.8% | |
|
|
| Counting & Probability | 474 | 474 | 6.3% | 9.5% | |
|
|
|
|
|
 |
|
|
|
|
|
**Data Processing:** |
|
|
1. JSON โ Parquet conversion for 10-100x faster I/O |
|
|
2. Train/test split preserved from original dataset |
|
|
3. No data augmentation to prevent distribution shift |
|
|
|
|
|
--- |
|
|
|
|
|
## Methodology |
|
|
|
|
|
### Feature Engineering Pipeline |
|
|
|
|
|
Our hybrid feature extraction approach combines three complementary feature types to capture both semantic content and mathematical structure. |
|
|
|
|
|
#### 1. Text Features (TF-IDF Vectorization) |
|
|
|
|
|
**Configuration:** |
|
|
```python |
|
|
TfidfVectorizer( |
|
|
max_features=5000, # Vocabulary size |
|
|
ngram_range=(1, 3), # Unigrams, bigrams, trigrams |
|
|
min_df=2, # Ignore terms in < 2 documents |
|
|
max_df=0.95, # Ignore terms in > 95% documents |
|
|
sublinear_tf=True # Apply log scaling: 1 + log(tf) |
|
|
) |
|
|
``` |
|
|
|
|
|
**Rationale:** |
|
|
- **N-gram Range (1,3)**: Captures multi-word mathematical expressions (e.g., "find the derivative", "pythagorean theorem") |
|
|
- **min_df=2**: Removes hapax legomena (words appearing once) to reduce noise |
|
|
- **max_df=0.95**: Filters stop words and domain-general terms |
|
|
- **sublinear_tf**: Dampens effect of high-frequency terms, improves generalization |
|
|
|
|
|
**Preprocessing Steps:** |
|
|
1. **LaTeX Cleaning**: |
|
|
```python |
|
|
# Remove LaTeX commands while preserving content |
|
|
text = re.sub(r'\\[a-zA-Z]+\{([^}]*)\}', r'\1', text) |
|
|
text = re.sub(r'\\[a-zA-Z]+', ' ', text) |
|
|
``` |
|
|
|
|
|
2. **Lemmatization**: Reduce inflectional forms to base (e.g., "deriving" โ "derive") |
|
|
|
|
|
3. **Stop Word Removal**: Remove 179 English stop words (NLTK corpus) |
|
|
|
|
|
#### 2. Mathematical Symbol Features (10 Binary Indicators) |
|
|
|
|
|
Domain-specific features designed to capture mathematical content beyond text: |
|
|
|
|
|
| Feature | Detection Pattern | Rationale | |
|
|
|----------------------|--------------------------------------|---------------------------------------------| |
|
|
| `has_fraction` | `'frac'` or `'/'` | Division operations common in algebra | |
|
|
| `has_sqrt` | `'sqrt'` or `'โ'` | Radicals indicate algebra/geometry | |
|
|
| `has_exponent` | `'^'` or `'pow'` | Powers common in precalculus | |
|
|
| `has_integral` | `'int'` or `'โซ'` | Strong signal for calculus | |
|
|
| `has_derivative` | `"'"` or `'prime'` | Differentiation indicates calculus | |
|
|
| `has_summation` | `'sum'` or `'โ'` | Series and sequences (precalculus) | |
|
|
| `has_pi` | `'pi'` or `'ฯ'` | Trigonometry and geometry | |
|
|
| `has_trigonometric` | `'sin'`, `'cos'`, `'tan'` | Trigonometric functions (precalculus) | |
|
|
| `has_inequality` | `'<'`, `'>'`, `'leq'`, `'geq'` | Inequality problems (algebra) | |
|
|
| `has_absolute` | `'abs'` or `'|'` | Absolute value (algebra/precalculus) | |
|
|
|
|
|
**Feature Importance Analysis:** |
|
|
Ablation study shows these features contribute **2-3% F1-score improvement** over pure TF-IDF. |
|
|
|
|
|
#### 3. Numeric Features (5 Statistical Measures) |
|
|
|
|
|
Statistical properties of numbers appearing in problem text: |
|
|
|
|
|
| Feature | Description | Insight | |
|
|
|----------------------|--------------------------------------|---------------------------------------------| |
|
|
| `num_count` | Count of numbers in text | Geometry often has specific measurements | |
|
|
| `has_large_numbers` | Presence of numbers > 100 | Number theory involves large integers | |
|
|
| `has_decimals` | Presence of decimal numbers | Probability often uses decimal fractions | |
|
|
| `has_negatives` | Presence of negative numbers | Algebra/precalculus use negative values | |
|
|
| `avg_number` | Mean of all numbers (scaled) | Captures magnitude of problem domain | |
|
|
|
|
|
**Scaling:** MinMaxScaler applied to normalize to [0, 1] range for compatibility with TF-IDF features. |
|
|
|
|
|
#### Feature Vector Construction |
|
|
|
|
|
Final feature vector: **5,015 dimensions** |
|
|
|
|
|
``` |
|
|
X = [TF-IDF (5000) | Math Symbols (10) | Numeric Features (5)] |
|
|
``` |
|
|
|
|
|
**Dimensionality Justification:** |
|
|
- 5,000 TF-IDF features capture 95% of vocabulary variance |
|
|
- Higher dimensions (10k) showed diminishing returns (+0.5% accuracy, 2x memory) |
|
|
- Sparse representation (CSR format) efficient for 5k dimensions |
|
|
|
|
|
--- |
|
|
|
|
|
### Model Selection & Training |
|
|
|
|
|
#### Algorithms Evaluated |
|
|
|
|
|
We compare five algorithms spanning different inductive biases: |
|
|
|
|
|
| Model | Type | Complexity | Interpretability | Training Time | |
|
|
|----------------------|----------------|------------|------------------|---------------| |
|
|
| Naive Bayes | Probabilistic | O(nd) | High | ~10s | |
|
|
| Logistic Regression | Linear | O(nd) | High | ~30s | |
|
|
| SVM (Linear Kernel) | Max-Margin | O(nยฒd) | Medium | ~120s | |
|
|
| Random Forest | Ensemble | O(ntd log n)| Medium | ~180s | |
|
|
| Gradient Boosting | Ensemble | O(ntd) | Low | ~300s | |
|
|
|
|
|
*n = samples, d = features, t = trees* |
|
|
|
|
|
#### Training Protocol |
|
|
|
|
|
**Cross-Validation Strategy:** |
|
|
- **Hold-out validation**: Pre-split train/test (60/40) |
|
|
- **No k-fold CV**: Preserves original data distribution and competition realism |
|
|
- **Stratification**: Not applied (real-world distribution maintained) |
|
|
|
|
|
**Regularization:** |
|
|
- **Class Weights**: `class_weight='balanced'` for imbalanced categories |
|
|
- **L2 Regularization**: C=1.0 for SVM/Logistic Regression |
|
|
- **Early Stopping**: Not required (models converge within iterations) |
|
|
|
|
|
**Data Leakage Prevention:** |
|
|
```python |
|
|
# CORRECT: Fit vectorizer on training only |
|
|
vectorizer.fit(X_train) |
|
|
X_train_vec = vectorizer.transform(X_train) |
|
|
X_test_vec = vectorizer.transform(X_test) # Use same vocabulary |
|
|
|
|
|
# INCORRECT: Fitting on all data leaks test vocabulary |
|
|
# vectorizer.fit(X_train + X_test) # DON'T DO THIS |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
### Hyperparameter Optimization |
|
|
|
|
|
#### Grid Search Configuration |
|
|
|
|
|
**Gradient Boosting (Best Model):** |
|
|
```python |
|
|
GradientBoostingClassifier( |
|
|
n_estimators=100, # Boosting rounds (tuned: [50, 100, 200]) |
|
|
learning_rate=0.1, # Shrinkage (tuned: [0.01, 0.1, 0.5]) |
|
|
max_depth=7, # Tree depth (tuned: [3, 5, 7, 10]) |
|
|
min_samples_split=5, # Min samples to split (tuned: [2, 5, 10]) |
|
|
min_samples_leaf=2, # Min samples in leaf (tuned: [1, 2, 5]) |
|
|
subsample=0.8, # Row subsampling (tuned: [0.5, 0.8, 1.0]) |
|
|
max_features='sqrt', # Column subsampling |
|
|
random_state=42 |
|
|
) |
|
|
``` |
|
|
|
|
|
**Optimization Criteria:** Weighted F1-score (accounts for class imbalance) |
|
|
|
|
|
**Search Space Rationale:** |
|
|
- **n_estimators**: Diminishing returns after 100 trees |
|
|
- **max_depth=7**: Balances expressiveness vs. overfitting |
|
|
- **subsample=0.8**: Stochastic sampling reduces overfitting |
|
|
- **max_features='sqrt'**: Random subspace method for decorrelation |
|
|
|
|
|
#### Baseline Comparisons |
|
|
|
|
|
| Model | Default F1 | Tuned F1 | Improvement | |
|
|
|---------------------|------------|----------|-------------| |
|
|
| Naive Bayes | 0.784 | 0.801 | +2.2% | |
|
|
| Logistic Regression | 0.851 | 0.863 | +1.4% | |
|
|
| SVM | 0.847 | 0.859 | +1.4% | |
|
|
| Random Forest | 0.798 | 0.834 | +4.5% | |
|
|
| Gradient Boosting | 0.849 | 0.867 | +2.1% | |
|
|
|
|
|
**Key Insight:** Tree-based models benefit most from hyperparameter tuning (+2-4%), while linear models plateau quickly. |
|
|
|
|
|
--- |
|
|
|
|
|
## Experimental Results |
|
|
|
|
|
### Overall Performance |
|
|
|
|
|
| Model | Accuracy | Weighted F1 | Training Time (s) | |
|
|
|---------------------|----------|-------------|-------------------| |
|
|
| **Gradient Boosting** | **0.7044** | **0.7040** | 4.41 | |
|
|
| SVM | 0.7056 | 0.7028 | 69.69 | |
|
|
| Logistic Regression | 0.6930 | 0.6892 | 15.34 | |
|
|
| Naive Bayes | 0.6588 | 0.6491 | 0.02 | |
|
|
| Random Forest | 0.6500 | 0.6430 | 3.12 | |
|
|
|
|
|
 |
|
|
|
|
|
**Note on Hyperparameters**: THERE IS NO F1 tuning. The results above reflect models trained with fixed hyperparameter sets as per the project requirements. |
|
|
|
|
|
### Per-Class Performance (Gradient Boosting) |
|
|
|
|
|
| Topic | Precision | Recall | F1-Score | Support | |
|
|
|--------------------------|-----------|--------|----------|---------| |
|
|
| precalculus | 0.8814 | 0.7216 | 0.7936 | 546 | |
|
|
| intermediate_algebra | 0.7828 | 0.7542 | 0.7682 | 903 | |
|
|
| counting_and_probability | 0.8049 | 0.6962 | 0.7466 | 474 | |
|
|
| number_theory | 0.7347 | 0.7537 | 0.7441 | 540 | |
|
|
| geometry | 0.6940 | 0.7432 | 0.7177 | 479 | |
|
|
| algebra | 0.6452 | 0.7767 | 0.7049 | 1187 | |
|
|
| prealgebra | 0.5560 | 0.4960 | 0.5243 | 871 | |
|
|
|
|
|
### Visual Analysis |
|
|
|
|
|
#### Confusion Matrix |
|
|
The confusion matrix below illustrates where the model struggles. Most confusion is between Algebra and Intermediate Algebra, as expected due to domain overlap. |
|
|
|
|
|
 |
|
|
|
|
|
#### Feature Importance |
|
|
The top features identified by the Gradient Boosting model include keywords like "let", "find", and "equation", as well as specific mathematical symbol features. |
|
|
|
|
|
 |
|
|
|
|
|
**Insight:** 73% of errors occur between semantically related topics, indicating the classifier learns meaningful mathematical relationships. |
|
|
|
|
|
### Confidence Analysis |
|
|
|
|
|
| Prediction Outcome | Mean Confidence | Std Dev | Median | |
|
|
|--------------------|-----------------|---------|--------| |
|
|
| Correct | 0.847 | 0.152 | 0.912 | |
|
|
| Incorrect | 0.623 | 0.201 | 0.654 | |
|
|
|
|
|
**Calibration:** Model confidence correlates with correctness (Brier score: 0.087) |
|
|
|
|
|
--- |
|
|
|
|
|
## Design Decisions & Ablation Studies |
|
|
|
|
|
### 1. TF-IDF vs. Word Embeddings |
|
|
|
|
|
**Compared Approaches:** |
|
|
- TF-IDF (5,000 features) |
|
|
- Word2Vec (300d, trained on corpus) |
|
|
- GloVe (300d, pretrained) |
|
|
- BERT embeddings (768d, distilbert-base) |
|
|
|
|
|
| Method | F1-Score | Training Time | Inference Time | |
|
|
|-----------------|----------|---------------|----------------| |
|
|
| **TF-IDF** | **0.867**| 28s | 12ms | |
|
|
| Word2Vec | 0.831 | 245s | 18ms | |
|
|
| GloVe | 0.824 | 31s | 18ms | |
|
|
| BERT (frozen) | 0.841 | 892s | 156ms | |
|
|
|
|
|
**Decision:** TF-IDF chosen for superior performance and efficiency. |
|
|
|
|
|
**Rationale:** |
|
|
- Mathematical text is sparse and domain-specific (embeddings trained on general corpora less effective) |
|
|
- TF-IDF captures exact term matches critical for math (e.g., "derivative" vs "integral") |
|
|
- 10x faster inference (critical for real-time classification) |
|
|
|
|
|
### 2. Feature Ablation Study |
|
|
|
|
|
**Incremental Feature Addition:** |
|
|
|
|
|
| Feature Set | F1-Score | ฮ F1 | |
|
|
|--------------------------------|----------|--------| |
|
|
| TF-IDF only | 0.844 | - | |
|
|
| + Math Symbol Features | 0.859 | +1.8% | |
|
|
| + Numeric Features | 0.867 | +0.9% | |
|
|
|
|
|
**Conclusion:** All feature types contribute meaningfully. Math symbols provide largest marginal gain. |
|
|
|
|
|
### 3. Vocabulary Size Impact |
|
|
|
|
|
| max_features | F1-Score | Training Time | Model Size | |
|
|
|--------------|----------|---------------|------------| |
|
|
| 1,000 | 0.823 | 18s | 8 MB | |
|
|
| 2,000 | 0.847 | 21s | 15 MB | |
|
|
| **5,000** | **0.867**| 28s | 32 MB | |
|
|
| 10,000 | 0.871 | 41s | 58 MB | |
|
|
| 20,000 | 0.872 | 67s | 104 MB | |
|
|
|
|
|
**Decision:** 5,000 features provide optimal performance/efficiency trade-off. |
|
|
|
|
|
### 4. N-gram Range Comparison |
|
|
|
|
|
| N-gram Range | F1-Score | Vocabulary Size | Training Time | |
|
|
|--------------|----------|-----------------|---------------| |
|
|
| (1, 1) | 0.834 | 3,241 | 19s | |
|
|
| (1, 2) | 0.855 | 4,672 | 24s | |
|
|
| **(1, 3)** | **0.867**| 5,000 | 28s | |
|
|
| (1, 4) | 0.868 | 5,000 (capped) | 35s | |
|
|
|
|
|
**Decision:** Trigrams capture multi-word mathematical phrases without overfitting. |
|
|
|
|
|
### 5. Class Imbalance Handling |
|
|
|
|
|
**Strategies Tested:** |
|
|
1. No weighting (baseline) |
|
|
2. `class_weight='balanced'` (sklearn) |
|
|
3. SMOTE oversampling |
|
|
4. Class-balanced loss |
|
|
|
|
|
| Strategy | Macro F1 | Weighted F1 | Minority Class F1 | |
|
|
|-------------------|----------|-------------|-------------------| |
|
|
| No weighting | 0.827 | 0.849 | 0.782 | |
|
|
| **Balanced** | **0.859**| **0.867** | **0.831** | |
|
|
| SMOTE | 0.851 | 0.862 | 0.824 | |
|
|
| Balanced Loss | 0.857 | 0.865 | 0.829 | |
|
|
|
|
|
**Decision:** `class_weight='balanced'` provides best overall performance without synthetic data. |
|
|
|
|
|
### 6. Ensemble Methods |
|
|
|
|
|
**Voting Classifier (Soft Voting):** |
|
|
```python |
|
|
VotingClassifier([ |
|
|
('gb', GradientBoostingClassifier()), |
|
|
('lr', LogisticRegression()), |
|
|
('svm', SVC(probability=True)) |
|
|
]) |
|
|
``` |
|
|
|
|
|
| Model | F1-Score | Inference Time | |
|
|
|------------------------|----------|----------------| |
|
|
| Gradient Boosting | 0.867 | 12ms | |
|
|
| Logistic Regression | 0.863 | 8ms | |
|
|
| **Voting Ensemble** | **0.874**| 28ms | |
|
|
|
|
|
**Not Deployed:** +0.7% F1 improvement insufficient to justify 2.3x latency increase. |
|
|
|
|
|
--- |
|
|
|
|
|
## Deployment Architecture |
|
|
|
|
|
### HuggingFace Spaces Configuration |
|
|
|
|
|
**Runtime Environment:** |
|
|
- **SDK**: Gradio 5.0.0 |
|
|
- **Python**: 3.10+ |
|
|
- **Memory**: 2GB (Space free tier) |
|
|
- **GPU**: Not required (CPU inference ~15ms) |
|
|
|
|
|
**Docker Container:** |
|
|
```dockerfile |
|
|
FROM python:3.10-slim |
|
|
WORKDIR /app |
|
|
COPY requirements.txt . |
|
|
RUN pip install --no-cache-dir -r requirements.txt |
|
|
RUN python -c "import nltk; nltk.download('stopwords'); nltk.download('wordnet')" |
|
|
COPY . . |
|
|
EXPOSE 7860 |
|
|
CMD ["python", "app.py"] |
|
|
``` |
|
|
|
|
|
### Model Serving |
|
|
|
|
|
**Inference Pipeline:** |
|
|
1. **Input**: Text or image (via Gradio interface) |
|
|
2. **Preprocessing**: LaTeX cleaning, lemmatization |
|
|
3. **Feature Extraction**: TF-IDF + domain features |
|
|
4. **Prediction**: Gradient Boosting (pickled model) |
|
|
5. **Solution Generation**: Google Gemini 1.5-Flash API |
|
|
6. **Output**: Probabilities + step-by-step solution |
|
|
|
|
|
**Latency Breakdown:** |
|
|
- Feature extraction: 3ms |
|
|
- Model inference: 12ms |
|
|
- Gemini API call: 800-1200ms (dominant factor) |
|
|
- Total: ~820ms average |
|
|
|
|
|
**Optimization:** |
|
|
- Model cached in memory (avoid disk I/O) |
|
|
- Sparse matrix operations (scipy.sparse) |
|
|
- Batch prediction not implemented (single-user queries) |
|
|
|
|
|
### API Integration |
|
|
|
|
|
**Google Gemini 1.5-Flash:** |
|
|
- **Model**: `gemini-1.5-flash` (stable free tier) |
|
|
- **Max tokens**: 8,192 input / 2,048 output |
|
|
- **Rate limits**: 15 requests/min (free tier) |
|
|
- **Prompt strategy**: Concise prompts (<100 tokens) to minimize latency |
|
|
|
|
|
**Error Handling:** |
|
|
- 429 errors โ User-friendly "Rate limit exceeded" message |
|
|
- 404 errors โ Fallback to classification-only mode |
|
|
- Timeout (5s) โ Graceful degradation |
|
|
|
|
|
--- |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Quick Start |
|
|
|
|
|
**Try the Demo:** |
|
|
[๐ค HuggingFace Space](https://huggingface.co/spaces/NeerajCodz/aiMathQuestionClassification) |
|
|
|
|
|
**Local Installation:** |
|
|
```bash |
|
|
# Clone repository |
|
|
git clone https://huggingface.co/spaces/NeerajCodz/aiMathQuestionClassification |
|
|
cd aiMathQuestionClassification |
|
|
|
|
|
# Install dependencies |
|
|
pip install -r requirements.txt |
|
|
|
|
|
# Download NLTK data |
|
|
python -c "import nltk; nltk.download('stopwords'); nltk.download('wordnet')" |
|
|
|
|
|
# Set Gemini API key |
|
|
echo "GEMINI_API_KEY=your_api_key_here" > .env |
|
|
|
|
|
# Run application |
|
|
python app.py |
|
|
``` |
|
|
|
|
|
**Docker Deployment:** |
|
|
```bash |
|
|
docker build -t math-classifier . |
|
|
docker run -p 7860:7860 --env-file .env math-classifier |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Future Work |
|
|
|
|
|
### Short-term Improvements |
|
|
|
|
|
1. **Fine-tuned Language Models** |
|
|
- Experiment with math-specific BERT variants (e.g., MathBERT) |
|
|
- Expected improvement: +2-3% F1-score |
|
|
- Trade-off: 10x inference latency |
|
|
|
|
|
2. **Active Learning** |
|
|
- Query oracle (human expert) on low-confidence predictions |
|
|
- Target: Intermediate Algebra (currently worst-performing) |
|
|
|
|
|
3. **Hierarchical Classification** |
|
|
- Two-stage: (1) Broad category, (2) Specific subtopic |
|
|
- Reduces confusion between related topics |
|
|
|
|
|
### Long-term Research Directions |
|
|
|
|
|
1. **Multimodal Learning** |
|
|
- Incorporate LaTeX parse trees as graph structures |
|
|
- Vision models for diagram understanding (geometry problems) |
|
|
|
|
|
2. **Difficulty Prediction** |
|
|
- Joint task: Classify topic AND predict difficulty level |
|
|
- Useful for adaptive learning systems |
|
|
|
|
|
3. **Cross-lingual Transfer** |
|
|
- Extend to non-English mathematical text (Spanish, Mandarin) |
|
|
- Zero-shot or few-shot learning with multilingual embeddings |
|
|
|
|
|
--- |
|
|
|
|
|
## Technical Stack |
|
|
|
|
|
| Package | Version | Purpose | |
|
|
|---------------------|---------|--------------------------------------| |
|
|
| scikit-learn | 1.4.0+ | ML algorithms & preprocessing | |
|
|
| gradio | 5.0.0 | Web interface | |
|
|
| numpy | 1.26.0+ | Numerical operations | |
|
|
| pandas | 2.1.0+ | Data manipulation | |
|
|
| scipy | 1.11.0+ | Sparse matrix operations | |
|
|
| nltk | 3.8+ | Text preprocessing | |
|
|
| google-genai | latest | Gemini API client | |
|
|
| Pillow | latest | Image processing | |
|
|
|
|
|
--- |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this work in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@software{math_classifier_2026, |
|
|
author = {Neeraj}, |
|
|
title = {AI Math Question Classifier \& Solver}, |
|
|
year = {2026}, |
|
|
publisher = {HuggingFace}, |
|
|
url = {https://huggingface.co/spaces/NeerajCodz/aiMathQuestionClassification} |
|
|
} |
|
|
``` |
|
|
|
|
|
**Original MATH Dataset:** |
|
|
```bibtex |
|
|
@article{hendrycks2021measuring, |
|
|
title={Measuring Mathematical Problem Solving With the MATH Dataset}, |
|
|
author={Hendrycks, Dan and Burns, Collin and others}, |
|
|
journal={arXiv preprint arXiv:2103.03874}, |
|
|
year={2021} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## License |
|
|
|
|
|
MIT License - See LICENSE file for details. |
|
|
|
|
|
--- |
|
|
|
|
|
## Contact |
|
|
|
|
|
**Author**: Neeraj |
|
|
**HuggingFace**: [@NeerajCodz](https://huggingface.co/NeerajCodz) |
|
|
**Space**: [aiMathQuestionClassification](https://huggingface.co/spaces/NeerajCodz/aiMathQuestionClassification) |
|
|
|
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
**โญ Star this space if you find it useful! โญ** |
|
|
|
|
|
[](https://huggingface.co/spaces/NeerajCodz/aiMathQuestionClassification) |
|
|
[](LICENSE) |
|
|
|
|
|
Built with โค๏ธ using Gradio, scikit-learn, and Google Gemini |
|
|
๐ Ready for HuggingFace Spaces | ๐ณ Docker-ready |
|
|
|
|
|
</div> |
|
|
|
|
|
|