Sifal KLIOUI

Sifal

https://sifal.social/

AI & ML interests

None yet

Recent Activity

liked a dataset 9 days ago

HuggingFaceTB/training-guide-nanotron-configs

Viewer • Updated Dec 22, 2025 • 2 • 247 • 10

liked a dataset 11 days ago

HealthDataHub/PARHAF

Viewer • Updated 17 days ago • 4.25k • 898 • 14

liked a model about 1 month ago

deepseek-ai/DeepSeek-V4-Pro

Text Generation • 862B • Updated 27 days ago • 5.85M • 4.53k

liked a model about 2 months ago

QwQZh/gated_attention

Updated May 10, 2025 • 24

upvoted a paper 2 months ago

ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models

Paper • 2603.19466 • Published Mar 19 • 41

commented on Tokenization in Transformers v5: Simpler, Clearer, and More Modular 5 months ago

Thanks for your response! I'll check that out. 👏

commented on Tokenization in Transformers v5: Simpler, Clearer, and More Modular 5 months ago

Thanks for doing this! I had to train some tokenizers with v4, and the behavior was indeed not straightforward to understand.

I had two questions:

You said that "older model implementations may rely on Python-specific behavior." I'm curious whether you have an example.
You sometimes write "fast" in quotes. Is that just to refer to the fast tokenizers backend, or can the implementation actually be slower than the Python implementation because of some kind of Rust overhead?

upvoted an article 5 months ago

Article

Tokenization in Transformers v5: Simpler, Clearer, and More Modular

itazap, ariG23498, ArthurZ, sergiopaniego, merve, pcuenq • Dec 18, 2025 • 124

liked a Space 5 months ago

The Ultra-Scale Playbook

🌌

3.86k

The ultimate guide to training LLM on large GPU Clusters

liked 3 models 5 months ago

upvoted a collection 5 months ago

Olmo 3.1

Collection

The latest members of the Olmo 3 family: another 3 weeks of RL for 32B Think, the 32B Instruct model, large post-training research datasets... • 9 items • Updated Dec 23, 2025 • 52

liked a dataset 5 months ago

osunlp/TravelPlanner

Viewer • Updated Jul 14, 2024 • 1.23k • 6.34k • 83

liked a model 6 months ago

jinaai/jina-code-embeddings-1.5b

Feature Extraction • 2B • Updated Oct 2, 2025 • 9.55k • 48

updated a dataset 6 months ago

Sifal/Kabyle-French

Viewer • Updated Dec 15, 2025 • 115k • 50 • 3

commented on Gotchas in Tokenizer Behavior Every Developer Should Know 6 months ago

Thanks for sharing! It is probably worth having a script to check for these gotchas:

import warnings
from transformers import AutoTokenizer

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

def check_tokenizer_gotchas(model_id):
    print(f"\n{'='*60}")
    print(f"Analyzing Tokenizer for: {model_id}")
    print(f"{'='*60}\n")

    try:
        # Load tokenizer (trust_remote_code=True is often needed for newer/custom models)
        tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    except Exception as e:
        print(f"Error loading tokenizer: {e}")
        return

    # Standard test input
    test_text = "Beautiful is better than ugly"
    
    # Standard test messages for Chat Templates
    messages = [
        {"role": "user", "content": "What is better than ugly?"},
        {"role": "assistant", "content": "Beautiful."}
    ]

    # --- GOTCHA 1 & 2: BOS Token Existence and Usage ---
    print(f"--- 1 & 2. BOS Token Analysis ---")
    if tokenizer.bos_token is None:
        print(f"⚠️  Gotcha #1: This tokenizer has NO BOS token defined.")
    else:
        print(f"✅  BOS token exists: '{tokenizer.bos_token}' (ID: {tokenizer.bos_token_id})")
        
        # Check usage in standard encoding
        encoded = tokenizer(test_text)["input_ids"]
        if tokenizer.bos_token_id in encoded:
            print(f"✅  BOS token IS automatically added during standard tokenization.")
        else:
            print(f"⚠️  Gotcha #2: BOS exists but is NOT added automatically.")

    # --- GOTCHA 3: EOS Token in Standard Tokenization ---
    print(f"--- 3. Standard EOS Token Analysis ---")
    encoded = tokenizer(test_text)["input_ids"]
    # Compare against None explicitly: an EOS token ID of 0 is valid but falsy
    if tokenizer.eos_token_id is not None and encoded[-1] == tokenizer.eos_token_id:
        print(f"ℹ️  EOS token WAS added automatically (Uncommon behavior).")
    else:
        print(f"⚠️  Gotcha #3: Tokenization did NOT add the EOS token automatically.")

    # --- GOTCHA 4: EOS in Chat Templates ---
    print(f"--- 4. Chat Template EOS Analysis ---")
    if tokenizer.chat_template:
        # Generate IDs without adding the generation prompt yet
        chat_encoded = tokenizer.apply_chat_template(messages, add_generation_prompt=False)
        
        if tokenizer.eos_token_id is None:
            print("❌  No EOS token defined in tokenizer.")
        
        elif len(chat_encoded) > 0:
            last_id = chat_encoded[-1]
            # Check if the very last token is EOS
            if last_id == tokenizer.eos_token_id:
                print(f"✅  Chat template correctly appends EOS ({tokenizer.eos_token}) at the very end.")
            
            # Check if EOS is second to last (common issue)
            elif len(chat_encoded) > 1 and chat_encoded[-2] == tokenizer.eos_token_id:
                # Decode the actual last token to show the user
                trailing_token = tokenizer.decode([last_id])
                # Escape newlines for visibility in print output
                trailing_repr = repr(trailing_token) 
                
                print(f"⚠️  Gotcha #4: EOS is present but NOT at the end.")
                print(f"    The actual last token is ID {last_id} ({trailing_repr}).")
                print(f"    (This is likely a trailing newline from the Jinja template).")
            
            else:
                print(f"⚠️  Gotcha #4: Chat template does NOT append the EOS token.")
    else:
        print("ℹ️  No chat template defined for this tokenizer.")

    # --- GOTCHA 5: PAD == EOS ---
    print(f"--- 5. Pad Token Collision Check ---")
    if tokenizer.pad_token_id is not None and tokenizer.eos_token_id is not None:
        if tokenizer.pad_token_id == tokenizer.eos_token_id:
            print(f"⚠️  Gotcha #5: PAD token ID equals EOS token ID ({tokenizer.pad_token_id}).")
            print(f"    Warning: Masking logic `input_ids == pad_token_id` will unintentionally mask EOS tokens.")
        else:
            print(f"✅  PAD ({tokenizer.pad_token_id}) and EOS ({tokenizer.eos_token_id}) are distinct.")
    else:
        print("ℹ️  PAD or EOS token not defined for this tokenizer.")

    # --- GOTCHA 6 & 7: Composition and Double Special Tokens ---
    print(f"--- 6 & 7. Chat Template Composition ---")
    if tokenizer.chat_template:
        # Step 1: Apply template directly to IDs (Correct way)
        direct_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=False)
        
        # Step 2: Apply template to string, THEN tokenize (Incorrect way often used)
        str_template = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
        composed_ids = tokenizer(str_template)["input_ids"]

        if direct_ids != composed_ids:
            print(f"⚠️  Gotcha #7: Tokenizing the output of `apply_chat_template` ADDS extra special tokens.")
            print(f"    Direct ID length: {len(direct_ids)} vs Re-tokenized length: {len(composed_ids)}")
        else:
            print(f"✅  Tokenization of chat template string matches direct ID generation.")
    else:
        print("ℹ️  No chat template defined for this tokenizer.")

# Run for all models mentioned in the text
models = [
    "Qwen/Qwen2.5-0.5B",
    "microsoft/Phi-3-mini-128k-instruct",
    "CohereLabs/aya-expanse-8b",
    "meta-llama/Llama-3.2-1B-Instruct",
    "databricks/dbrx-instruct",
    "Qwen/Qwen2.5-0.5B-Instruct"
]

for model in models:
    check_tokenizer_gotchas(model)
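
The PAD == EOS collision from check 5 can also be illustrated without loading any model. The sketch below is a toy example, not part of the original script: the token IDs and the two helper functions (`naive_mask`, `length_based_mask`) are hypothetical and chosen only for illustration. It shows how a mask built by comparing IDs to the pad ID also drops the real EOS token, whereas a mask built from the known unpadded length does not.

```python
# Hypothetical token IDs, chosen for illustration only
# (not taken from any real tokenizer).
EOS_ID = 2
PAD_ID = 2  # same ID as EOS, i.e. the collision case


def naive_mask(input_ids, pad_id):
    """Mask every position whose ID equals pad_id (1 = keep, 0 = ignore)."""
    return [0 if tok == pad_id else 1 for tok in input_ids]


def length_based_mask(real_length, total_length):
    """Mask built from the known unpadded length, independent of token IDs."""
    return [1] * real_length + [0] * (total_length - real_length)


# Three content tokens followed by EOS, right-padded to length 6.
sequence = [11, 12, 13, EOS_ID]
padded = sequence + [PAD_ID] * 2

naive = naive_mask(padded, PAD_ID)
safe = length_based_mask(len(sequence), len(padded))

print(naive)  # [1, 1, 1, 0, 0, 0]  -> the real EOS at index 3 is masked out
print(safe)   # [1, 1, 1, 1, 0, 0]  -> the real EOS is kept
```

In practice, using the attention mask returned by the tokenizer, rather than recomputing one from `input_ids == pad_token_id`, should avoid the same problem, assuming the tokenizer did the padding.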

upvoted an article 6 months ago

Article

Gotchas in Tokenizer Behavior Every Developer Should Know

qgallouedec • Apr 18, 2025 • 72

New activity in Sifal/Kabyle-French 6 months ago

Wrong translation in some cases ?

#3 opened 6 months ago by Djame

commented on Model statistics of the 50 most downloaded entities on Hugging Face 6 months ago

Very interesting example regarding CamemBERT. These were actually what I was referring to when I said "with a few exceptions"; I didn't know it was much more common. Your point about how this biases the results now makes much more sense. Thanks for the clarifications!