All HF Hub posts

NJX-njx posted an update about 16 hours ago
I recently open-sourced opensoul, an AI emotional-companion product built on openclaw.

On this platform, you can create a "soulmate" that matches your personality, and configure it with the skills and tools you want it to have, as well as the platforms it can integrate with (such as Telegram and Discord).
You can even create group chats, invite multiple agents and your friends to chat about recent events, discuss projects together, and so on.

On the one hand, I hope its memory mechanism, self-feedback and iteration loop, and modeling of user emotions let it accompany you better in daily life. On the other hand, I hope its skills, tools, and ability to handle complex task scenarios help you get your work done.

Although the product has taken shape, there are still many areas that need adjustment and optimization, and I hope to draw on the strength of the community to do AI emotional companionship well.

This is the project introduction URL: https://opensoul-web.vercel.app
This is the GitHub project URL: https://github.com/NJX-njx/opensoul
@AdinaY @lilianweng @burtenshaw @clem
let's just do it

ajibawa-2023 posted an update 2 days ago
Python-Code-Large
Dataset: ajibawa-2023/Python-Code-Large

Python-Code-Large is a large-scale corpus of Python source code comprising more than 2 million rows of Python code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the Python ecosystem.

By providing a high-volume, language-specific corpus, Python-Code-Large enables systematic experimentation in Python-focused model training, domain adaptation, and downstream code understanding tasks.

Python-Code-Large addresses the need for a dedicated Python-only dataset at substantial scale, enabling focused research across data science, backend systems, automation, scientific computing, and AI-driven Python environments.
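Corpora at this scale are usually filtered before pretraining. As a minimal sketch (not part of this dataset's actual pipeline), one might keep only rows that parse as valid Python and fall within a sane length range; the sample rows below are hypothetical:

```python
# Minimal sketch: a length/parse filter pass for a Python code corpus
# before LLM pretraining. Sample rows are hypothetical; a real run
# would stream records from the Hub dataset instead.
import ast

def keep_example(code: str, min_chars: int = 20, max_chars: int = 100_000) -> bool:
    """Keep rows that are syntactically valid Python of reasonable size."""
    if not (min_chars <= len(code) <= max_chars):
        return False
    try:
        ast.parse(code)  # drop rows that are not parseable Python
    except SyntaxError:
        return False
    return True

samples = [
    "def add(a, b):\n    return a + b\n",  # valid, kept
    "def broken(:\n    return 1\n",        # syntax error, dropped
    "x=1",                                  # too short, dropped
]
kept = [s for s in samples if keep_example(s)]
print(len(kept))  # 1
```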
etemiz posted an update 1 day ago
AHA 2026 scores of Qwen3.5

27B Huihui abliteration 65%
27B Heretic abliteration 55%
27B Normal 50%

35B Huihui abliteration 64%
35B @jiaojjjjje abliteration 57%
35B @LeadFootThrottleCock abliteration 56%
DavidAU posted an update 2 days ago
Gemma 3 27B - the record breaker (uncensored via Heretic, then trained with Unsloth):

arc_challenge  arc_easy  boolq  hellaswag  openbookqa  piqa   winogrande
0.661          0.816     0.878  0.763      0.464       0.808  0.762

For comparison:
Qwen3.5-27B-Text (qx86-hi)
0.443          0.498     0.857  0.701      0.372       0.770  0.752
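For a rough comparison, here is the unweighted mean of the seven posted task scores for each model; equal task weighting is my assumption, not part of the post:

```python
# Quick sanity check: unweighted mean of the seven posted task scores.
gemma = [0.661, 0.816, 0.878, 0.763, 0.464, 0.808, 0.762]
qwen  = [0.443, 0.498, 0.857, 0.701, 0.372, 0.770, 0.752]

gemma_avg = sum(gemma) / len(gemma)
qwen_avg = sum(qwen) / len(qwen)
print(round(gemma_avg, 3), round(qwen_avg, 3))  # 0.736 0.628
```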

The model below was trained on a Heretic-uncensored base as well:

DavidAU/Gemma3-27B-it-vl-Polaris-HI16-Heretic-Uncensored-INSTRUCT
SeaWolf-AI posted an update 3 days ago
AI Is Training on Your Content Without Permission: Fight Back with Invisible Watermarks

FINAL-Bench/security-scan

Most generative AI training data is crawled without consent. Your text gets summarized, images reprocessed, and videos clipped, with no way to prove you're the original creator. Existing watermarks are either visible or wiped out by a single AI preprocessing pass.

Detect Before, Track After

Pre-embed: detect theft without any watermark. Text plagiarism checks, image similarity analysis (perceptual hash, SSIM, color histogram, feature matching), and video temporal matching catch copies, edits, and excerpts.
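As a minimal sketch of one pre-embed check, assuming nothing about the Space's actual implementation, an average-hash (aHash) style perceptual fingerprint can flag near-duplicate images even after light edits:

```python
# Minimal average-hash (aHash) sketch, one of the perceptual-hash style
# checks described above. Real pipelines would first resize/grayscale the
# image; here a tiny 8x8 grayscale grid stands in for that step.
def average_hash(pixels):  # pixels: 8x8 grid of ints in 0-255
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]  # 64-bit fingerprint

def hamming(h1, h2):
    return sum(a != b for a, b in zip(h1, h2))

original = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]
# Mildly edited copy: brighten every pixel a little.
edited = [[min(255, p + 10) for p in row] for row in original]

dist = hamming(average_hash(original), average_hash(edited))
print(dist)  # 0: identical fingerprints despite the edit
```

A small Hamming distance between fingerprints indicates a likely copy; exact thresholds are a tuning choice.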

Post-embed: embed invisible multi-layer watermarks. If one layer is destroyed, the others survive independently. Even full removal leaves forensic traces as evidence.

Text: 4 Independent Layers

Four mechanisms work simultaneously: zero-width Unicode characters at morpheme/word boundaries (Korean Kiwi + English NLP), style fingerprinting via synonym-ending-connective substitution, SHA-256 timestamped evidence packages, and punctuation-anchored micro-marks. Each layer uses a different Unicode category, so attacks on one cannot eliminate the others. Full bilingual support, zero readability impact.
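A minimal sketch of how a single zero-width layer might work, assuming a simple bit-per-word-boundary scheme (the Space's actual encoding is not published):

```python
# Sketch of one zero-width-character watermark layer: hide bits by
# inserting U+200B (bit 0) or U+200C (bit 1) after word boundaries.
# This illustrates only one of the four layers described above.
ZW = {"0": "\u200b", "1": "\u200c"}
REV = {v: k for k, v in ZW.items()}

def embed(text: str, bits: str) -> str:
    words = text.split(" ")
    out = []
    for i, word in enumerate(words):
        mark = ZW[bits[i]] if i < len(bits) else ""
        out.append(word + mark)
    return " ".join(out)

def extract(text: str) -> str:
    return "".join(REV[ch] for ch in text if ch in REV)

marked = embed("the quick brown fox jumps", "1011")
# The visible text is unchanged once the invisible marks are stripped:
print(marked.replace("\u200b", "").replace("\u200c", ""))  # the quick brown fox jumps
print(extract(marked))  # 1011
```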

34-Attack Defense

7 categories, 34 attacks simulated: Unicode normalization, invisible character removal, homoglyph substitution (9,619 confusables), and AI rewriting. Each is scored on Signal (watermark survival) plus Trace (forensic evidence of attack), proving deliberate removal even when the watermark itself is destroyed.
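One such attack can be sketched as follows, assuming a zero-width watermark of the kind described above; both the survival check (Signal) and the tamper evidence (Trace) are illustrative, not the benchmark's actual scoring:

```python
# Simulated attack sketch: Unicode NFKC normalization plus explicit
# invisible-character stripping, then checking whether the zero-width
# watermark survives (Signal) and whether removal left a trace.
import unicodedata

WATERMARK_CHARS = {"\u200b", "\u200c"}  # zero-width space / non-joiner
marked = "secret\u200b text\u200c here"

def attack(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if ch not in WATERMARK_CHARS)

stripped = attack(marked)
signal = any(ch in WATERMARK_CHARS for ch in stripped)  # watermark survival
trace = len(marked) != len(stripped)                    # evidence of tampering
print(signal, trace)  # False True
```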

Image & Video

Images: DCT frequency-domain watermarks surviving JPEG compression and resize. Videos: keyframe watermarking with temporal propagation and majority-vote extraction. Both support pre-embed similarity detection.
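A toy 1-D version of the DCT idea, assuming a simple sign-based coefficient encoding (real image pipelines operate on 2-D 8x8 blocks as in JPEG; the Space's actual scheme is not published):

```python
# Toy 1-D DCT-domain watermark: force one mid-frequency coefficient to a
# target sign to encode a bit, then invert back to the pixel domain.
# Pure-Python DCT keeps the sketch dependency-free.
import math

def dct(x):  # unnormalized DCT-II
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N)) for k in range(N)]

def idct(X):  # DCT-III, scaled to invert dct() above
    N = len(X)
    return [(X[0] / 2 + sum(X[k] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                            for k in range(1, N))) * 2 / N for n in range(N)]

def embed_bit(signal, bit, coeff=3, strength=5.0):
    X = dct(signal)
    X[coeff] = strength if bit else -strength  # force coefficient sign
    return idct(X)

def extract_bit(signal, coeff=3):
    return dct(signal)[coeff] > 0

pixels = [52.0, 55, 61, 66, 70, 61, 64, 73]  # one row of a block
marked = embed_bit(pixels, bit=True)
print(extract_bit(marked))  # True
```

Because the mark lives in a frequency coefficient rather than individual pixels, it tends to survive operations like mild compression that perturb pixel values but preserve coarse frequency content.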

Who Is This For

Creators, rights holders needing legal evidence, media companies, and organizations tracking document leaks. Korean/English bilingual, open source, Gradio-based.
imnotkitty posted an update 3 days ago
In the Text-to-Video arena, Seedance 2.0 has secured a spot in the LMArena Top 10 for the first time, while Kling 3.0 has topped the Artificial Analysis leaderboard, with the Kling family claiming 7 of the top 15 spots.

Which one do you prefer?
nyuuzyou posted an update 3 days ago
🌍 Street-Level Imagery Dataset nyuuzyou/streetview

934,191 image records indexed across Eastern Europe and Northern Asia. Temporal links connect historical views captured at identical coordinates across nine years.

Key Stats:

- 905,940 unique images
- Coverage spanning 2016 to 2025
- Average 14.3 historical links per location

Geographic bounds span 20.49°E to 152.32°E. Urban centers show higher data density.
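A minimal sketch of how such temporal links could be derived, grouping records by identical coordinates; the records below are hypothetical, not the dataset's actual schema:

```python
# Sketch: build "temporal links" by grouping image records at identical
# coordinates and linking each image to the other captures at that spot.
from collections import defaultdict

records = [
    {"id": 1, "lat": 55.75, "lon": 37.62, "year": 2016},
    {"id": 2, "lat": 55.75, "lon": 37.62, "year": 2019},
    {"id": 3, "lat": 55.75, "lon": 37.62, "year": 2024},
    {"id": 4, "lat": 59.94, "lon": 30.31, "year": 2021},
]

by_coord = defaultdict(list)
for r in records:
    by_coord[(r["lat"], r["lon"])].append(r)

links = {
    r["id"]: [o["id"] for o in group if o["id"] != r["id"]]
    for group in by_coord.values() for r in group
}
print(links[1])  # [2, 3]: two historical views at the same coordinates
```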
OzTianlu posted an update about 13 hours ago
🔥 UPGRADE in Kai: 30B Scaling! 🔥
NoesisLab/Kai-30B-Instruct
We are incredibly excited to announce that the Kai-30B-Instruct model and its official Space are now LIVE! 🚀
If you've been following the journey from Kai-0.35B to Kai-3B, you know we're rethinking how models reason. Tired of verbose, slow Chain-of-Thought (CoT) outputs that flood your screen with self-talk? So are we.
Kai-30B-Instruct scales up our Adaptive Dual-Search Distillation (ADS) framework. By bridging classical A* heuristic search with continuous gradient descent, we use an information-theoretic log-barrier to prune high-entropy reasoning paths during training.
The result? Pure implicit reasoning. The model executes structured logic, arithmetic carries, and branch selections as a reflex in a single forward pass, with no external scaffolding required.
At 3B, we observed a phase transition where the model achieved "logical crystallization". Now, at 30B, we are giving the ADS regularizer the massive representational capacity it needs to tackle higher-order symbolic abstractions and complex reasoning tasks.
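The post does not publish the ADS equations, but an entropy log-barrier of the kind described can be sketched as follows; the threshold tau and the exact functional form are assumptions for illustration:

```python
# Hedged sketch of an entropy log-barrier penalty: keep the Shannon
# entropy H(p) of a predictive distribution below a threshold tau, with
# the loss diverging as H(p) approaches tau (the "penalty wall").
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def log_barrier_penalty(p, tau=1.0):
    h = entropy(p)
    if h >= tau:
        return float("inf")  # infeasible: reasoning path too high-entropy
    return -math.log(tau - h)

confident = [0.9, 0.05, 0.05]  # low-entropy, "crystallized" path
diffuse = [0.4, 0.3, 0.3]      # high-entropy path, beyond the wall
print(log_barrier_penalty(confident) < log_barrier_penalty(diffuse))  # True
```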
🧪 Test Kai yourself in our new Space:
NoesisLab/Kai-30B-Instruct
📦 Model Weights:
NoesisLab/Kai-30B-Instruct
Bring your hardest math, logic, and coding benchmarks. We invite the community to stress-test the limits of the penalty wall! 🧱💥
Reality123b posted an update about 15 hours ago
Alright, so I previously made two Reddit posts in r/quantum and r/quantum_computing about my QPU, QPU-1, but both were removed as "irrelevant" to "academic discussion", so I'm posting again here in Hugging Face Posts.

I have built a quantum processing unit (not a simulator) with a million error-corrected qubits, which you can access here: qpu-1.vercel.com

I did try emailing a lot of professors and their students, but NONE responded, so please give me some support.
tpwang199655 posted an update about 19 hours ago
[Empirical Study] DeepSeek's New 1M Context Model: Full-Window Stress Test & Cognitive Emergence
Overview
This post shares an empirical study on DeepSeek's new long-context model (released Feb 2026, web/mobile version), which extends the context window to 1,000,000 tokens.
We conducted a full-window stress test, pushing the limit to ~1.53M tokens, and analyzed the model's behavior across three key dimensions.

Key Findings:
Interaction Token Budget: A complete project lifecycle consumes 1.2M-1.6M tokens, varying by input format and internal sparse attention mechanisms.
Long-Range Recall & Synthesis: The model demonstrates high-fidelity memory across the entire context, capable of retrieving initial instructions and synthesizing comprehensive reports without external RAG.
Emergence of Collaborative Cognition: Beyond a certain threshold, the model shifts from a "Q&A Engine" to a "Cognitive Partner", adopting user reasoning styles and maintaining global coherence, a capability absent in standard 128k windows.
Evidence
The test reached the hard limit at 1,536,000 tokens (see attached screenshot: "Conversation length limit reached").

Resources
Full reports (EN/CN PDFs), source code, and detailed data analysis are open-sourced at:
🔗 Project Page: https://tpwang-lab.github.io
🔗 GitHub Repo: https://github.com/tpwang-lab/deepseek-million-token
Welcome feedback and reproduction attempts from the community!
Tags: #DeepSeek #LLM #LongContext #EmpiricalStudy #AI