TR-MTEB: Turkish Massive Text Embedding Benchmark
Welcome to the official Hugging Face organization for TR-MTEB,
the first large-scale and task-diverse benchmark for evaluating Turkish sentence embedding models.
Paper
TR-MTEB: A Comprehensive Benchmark and Embedding Model Suite for Turkish Sentence Representations
Mehmet Selman Baysan, Tunga Gungor
Findings of EMNLP 2025
- ACL Anthology: https://aclanthology.org/2025.findings-emnlp.471/
- DOI: https://doi.org/10.18653/v1/2025.findings-emnlp.471
We introduce TR-MTEB, the first comprehensive benchmark for Turkish sentence representations, covering six core embedding evaluation tasks and 26 datasets.
Benchmark Overview
TR-MTEB provides evaluation across 6 major embedding task categories:
- Classification
- Clustering
- Pair Classification
- Retrieval
- Bitext Mining
- Semantic Textual Similarity (STS)
Total datasets included: 26
Combination of native Turkish and high-quality translated datasets
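As an illustration of how one of these task categories is typically scored, the sketch below computes an STS score the way MTEB-style benchmarks commonly do: cosine similarity between the two sentence embeddings of each pair, then Spearman rank correlation against the gold similarity scores. It is a minimal, dependency-free sketch and not the TR-MTEB evaluation pipeline itself; the function names (`cosine`, `ranks`, `spearman`, `sts_score`) are hypothetical, and the embeddings are assumed to be plain lists of floats.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def sts_score(emb_a, emb_b, gold):
    """Correlate per-pair cosine similarities with gold similarity labels."""
    sims = [cosine(u, v) for u, v in zip(emb_a, emb_b)]
    return spearman(sims, gold)
```

Because the score depends only on ranks, a model is rewarded for ordering sentence pairs correctly, not for matching the absolute scale of the gold labels.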
Turkish Embedding Models
To complement the benchmark, we also release Turkish-specific embedding models trained on:
- 34.2 million weakly supervised Turkish sentence pairs
- Contrastive pretraining + supervised fine-tuning
These models achieve strong performance and significantly outperform monolingual baselines.
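To make the contrastive pretraining step more concrete, here is a minimal sketch of an InfoNCE-style loss with in-batch negatives, a common objective for training on weakly supervised sentence pairs. The description above does not say which loss the released models use, so the choice of InfoNCE, the function names (`normalize`, `info_nce_loss`), and the temperature value of 0.05 are all assumptions made for illustration only.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss over a batch of (anchor, positive) embedding pairs.

    For anchor i, positives[i] is the matching sentence and every other
    row of `positives` acts as an in-batch negative.
    The temperature default is an assumed value, not taken from the paper.
    """
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, pos)) / temperature for pos in p]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching pair as the correct class.
        total += log_denominator - logits[i]
    return total / len(a)
```

The loss approaches zero when each anchor is far more similar to its own pair than to the other sentences in the batch, and it equals the log of the batch size when the model cannot tell the pairs apart.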
Released Resources
This organization hosts:
- Benchmark datasets
- Evaluation pipeline
- Turkish embedding model suite
- Training corpus and scripts (where applicable)
All resources are released publicly to support research in:
- Turkish NLP
- Low-resource language embeddings
- Multilingual benchmark development
Citation
If you use TR-MTEB in your work, please cite:
@inproceedings{baysan-gungor-2025-tr,
title = "{TR}-{MTEB}: A Comprehensive Benchmark and Embedding Model Suite for {T}urkish Sentence Representations",
author = "Baysan, Mehmet Selman and Gungor, Tunga",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.471/",
doi = "10.18653/v1/2025.findings-emnlp.471",
pages = "8867--8887"
}
Contact & Contributions
We welcome contributions, new datasets, and collaborations.
Author: Mehmet Selman Baysan (mselmanbaysan@gmail.com)
Organization: TR-MTEB Project
Feel free to open issues or discussions on Hugging Face.
🇹🇷 Building better embedding benchmarks for Turkish and low-resource languages.