Hugging Face

Ray

rayw2k25

AI & ML interests

None yet

Recent Activity

upvoted a collection 3 days ago
Unsloth Dynamic 2.0 Quants
upvoted a collection 3 days ago
Gemma 4
reacted to Parveshiiii's post with 🔥 about 2 months ago
Just did something I’ve been meaning to try for ages. In only 3 hours, on 10 billion+ tokens, I trained a custom BPE + tiktoken-style tokenizer using my new library microtok, and it hits the same token efficiency as Qwen3.

Tokenizers have always felt like black magic to me. We drop them into every LLM project, but actually training one from scratch? That always seemed way too complicated. Turns out it doesn’t have to be.

microtok makes the whole process stupidly simple: literally just 3 lines of code. No heavy setup, no GPU required. I built it on top of the Hugging Face tokenizers library so it stays clean, fast, and actually understandable.

If you’ve ever wanted to look under the hood and build your own optimized vocabulary instead of just copying someone else’s, this is the entry point you’ve been waiting for. I wrote up the full story, threw in a ready-to-run Colab template, and dropped the trained tokenizer on Hugging Face.

Blog → https://parveshiiii.github.io/blogs/microtok/
Trained tokenizer → https://huggingface.co/Parveshiiii/microtok
GitHub repo → https://github.com/Parveshiiii/microtok
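
For readers wondering what "training a BPE tokenizer" involves under the hood, here is a minimal pure-Python sketch of the merge-learning loop. It is not microtok's API and does not use the Hugging Face tokenizers library; the function names (`train_bpe`, `merge_pair`, `encode`) and the toy corpus are hypothetical, chosen only to illustrate the idea on a few words.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of strings (illustrative only)."""
    # Each whitespace-separated word starts as a tuple of single characters.
    words = Counter(tuple(word) for line in corpus for word in line.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties go to the lexicographically smallest pair.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def encode(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    toy_corpus = ["low low low lower lowest", "new newer newest"]
    rules = train_bpe(toy_corpus, 2)
    print(rules)
    print(encode("lowest", rules))
```

A production tokenizer adds byte-level fallback, pre-tokenization rules, and special tokens on top of this loop, which is the part a library would be expected to handle.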

Organizations

None yet

models 0

None public yet

datasets 0

None public yet