TR-MTEB: Turkish Massive Text Embedding Benchmark

Welcome to the official Hugging Face organization for TR-MTEB,
the first large-scale and task-diverse benchmark for evaluating Turkish sentence embedding models.


📌 Paper

TR-MTEB: A Comprehensive Benchmark and Embedding Model Suite for Turkish Sentence Representations
Mehmet Selman Baysan, Tunga Gungor
Findings of EMNLP 2025

We introduce TR-MTEB, the first comprehensive benchmark for Turkish sentence representations, covering six core embedding evaluation tasks and 26 datasets.


🔍 Benchmark Overview

TR-MTEB provides evaluation across 6 major embedding task categories:

  • Classification
  • Clustering
  • Pair Classification
  • Retrieval
  • Bitext Mining
  • Semantic Textual Similarity (STS)

📊 Total datasets included: 26
🌍 Combination of native Turkish + high-quality translated datasets
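
Several of the task categories above (Retrieval, STS, Pair Classification) are commonly scored by comparing sentence embeddings with cosine similarity. The following is a minimal sketch of that idea, not the TR-MTEB evaluation pipeline itself: the function names `cosine_similarity` and `rank_documents` are hypothetical, and the three-dimensional vectors are toy stand-ins for real model embeddings.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def rank_documents(query_vec: Sequence[float],
                   doc_vecs: List[Sequence[float]]) -> List[int]:
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors (assumption): real embeddings would come from an embedding model.
query = [1.0, 0.0, 0.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # nearly parallel to the query
    [0.5, 0.5, 0.0],  # in between
]
ranking = rank_documents(query, docs)
print(ranking)
```

A retrieval metric would then compare such a ranking against the relevance labels of the dataset, while an STS metric would correlate the raw similarity scores with human judgments.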


🧠 Turkish Embedding Models

To complement the benchmark, we also release Turkish-specific embedding models trained with:

  • 34.2 million weakly supervised Turkish sentence pairs
  • Contrastive pretraining + supervised fine-tuning

These models achieve strong performance and significantly outperform monolingual baselines.
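
This page does not state which objective is used for contrastive pretraining. One widely used choice for training on sentence pairs is an InfoNCE loss with in-batch negatives, sketched below in plain Python purely for illustration. The function name `info_nce_loss`, the temperature value, and the toy vectors are all assumptions, not details taken from the TR-MTEB models.

```python
import math
from typing import List


def _normalize(v: List[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce_loss(queries: List[List[float]],
                  positives: List[List[float]],
                  temperature: float = 0.05) -> float:
    """Mean InfoNCE loss with in-batch negatives.

    positives[i] is the positive for queries[i]; every other row of
    `positives` acts as a negative for that query.
    """
    q = [_normalize(v) for v in queries]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarity to every candidate in the batch.
        logits = [sum(a * b for a, b in zip(qi, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching pair as the target class.
        total += log_denominator - logits[i]
    return total / len(q)


# Toy batch (assumption): aligned pairs should give a much lower loss
# than the same vectors paired incorrectly.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < shuffled_loss)
```

The loss pulls each pair together while pushing apart the other sentences in the same batch, which is why weakly supervised pairs can serve as training signal without explicit negative labels.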


📂 Released Resources

This organization hosts:

✅ Benchmark datasets
✅ Evaluation pipeline
✅ Turkish embedding model suite
✅ Training corpus and scripts (where applicable)

All resources are released publicly to support research in:

  • Turkish NLP
  • Low-resource language embeddings
  • Multilingual benchmark development

🌟 Citation

If you use TR-MTEB in your work, please cite:

@inproceedings{baysan-gungor-2025-tr,
  title = "{TR}-{MTEB}: A Comprehensive Benchmark and Embedding Model Suite for {T}urkish Sentence Representations",
  author = "Baysan, Mehmet Selman and Gungor, Tunga",
  booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
  month = nov,
  year = "2025",
  address = "Suzhou, China",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2025.findings-emnlp.471/",
  doi = "10.18653/v1/2025.findings-emnlp.471",
  pages = "8867--8887"
}

🤝 Contact & Contributions

We welcome contributions, new datasets, and collaborations.

Author: Mehmet Selman Baysan (mselmanbaysan@gmail.com)

Organization: TR-MTEB Project

Feel free to open issues or discussions on Hugging Face.

🇹🇷 Building better embedding benchmarks for Turkish and low-resource languages.