
MnemoCore Pattern Learner - Specification Draft

Version: 0.1-draft
Date: 2026-02-20
Status: Draft for Review
Author: Omega (GLM-5) for Robin Granberg


Executive Summary

Pattern Learner is a MnemoCore module that learns from user interactions without storing personal data. It extracts statistical patterns, topic clusters, and quality metrics that can be used to improve chatbot performance over time.

Key principle: Learn patterns, forget people.


Problem Statement

Healthcare Chatbot Challenges

| Challenge | Consequence |
|---|---|
| GDPR/HIPAA compliance | Conversations cannot be stored |
| Multi-tenancy | Data must not leak between clinics |
| Quality improvement | Need to know what works |
| Knowledge gaps | Need to identify what is missing in the docs |

Current Solutions (Limitations)

  • Stateless RAG: no learning at all
  • Full memory: GDPR risk, confidentiality problems
  • Manual analytics: time-consuming, not real-time

Solution: Pattern Learner

Core Concept

User Query ──► Anonymize ──► Extract Pattern ──► Aggregate
                  β”‚
                  └── PII removed before storage

What IS stored:

  • Topic clusters (anonymized)
  • Query frequency distributions
  • Response quality aggregates
  • Knowledge gap indicators

What is NOT stored:

  • User identities
  • Clinic associations
  • Patient data
  • Raw conversations

Architecture

High-Level Design

┌──────────────────────────────────────────────────────────────┐
│                    Pattern Learner Module                    │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────┐    ┌───────────────┐    ┌──────────────┐   │
│  │  Anonymizer  │───►│Topic Extractor│───►│  Aggregator  │   │
│  └──────────────┘    └───────────────┘    └──────────────┘   │
│         │                    │                    │          │
│         │                    ▼                    ▼          │
│         │            ┌──────────────┐     ┌──────────────┐   │
│         │            │Topic Embedder│     │ Stats Store  │   │
│         │            │  (MnemoCore) │     │ (Encrypted)  │   │
│         │            └──────────────┘     └──────────────┘   │
│         │                    │                    │          │
│         └────────────────────┴────────────────────┘          │
│                              │                               │
│                              ▼                               │
│                      ┌──────────────┐                        │
│                      │ Insights API │                        │
│                      └──────────────┘                        │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Components

1. Anonymizer

Purpose: Remove all PII before processing

Methods:

  • Named Entity Recognition (NER) for person names
  • Pattern matching for phone numbers, addresses
  • Clinic/organization detection
  • Session ID hashing
import re

class Anonymizer:
    """Remove PII from queries before pattern extraction"""

    def __init__(self, clinic_blacklist=None):
        self.ner_model = load_ner_model("sv")  # Swedish NER pipeline (placeholder)
        self.clinic_blacklist = clinic_blacklist or []  # configurable clinic names
        self.patterns = {
            "phone": r"\+?\d{1,3}[\s-]?\d{2,4}[\s-]?\d{2,4}[\s-]?\d{2,4}",
            "email": r"[\w\.-]+@[\w\.-]+\.\w+",
            "personal_number": r"\d{6,8}[-\s]?\d{4}",  # Swedish personnummer
        }

    def anonymize(self, text: str) -> str:
        """Remove all PII from text"""

        # 1. NER for names and organizations
        entities = self.ner_model.extract(text)
        for entity in entities:
            if entity.type in ["PER", "ORG"]:
                text = text.replace(entity.text, "[ANON]")

        # 2. Pattern matching
        for pattern_type, pattern in self.patterns.items():
            text = re.sub(pattern, f"[{pattern_type.upper()}]", text)

        # 3. Remove clinic names (configurable blacklist)
        for clinic_name in self.clinic_blacklist:
            text = text.replace(clinic_name, "[KLINIK]")

        return text
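The regex layer can be exercised on its own, without the NER model. One detail worth noting: pattern order matters, because a Swedish personnummer would otherwise be consumed by the broader phone regex. A minimal, self-contained sketch (the `scrub` helper name is illustrative):

```python
import re

# Order matters: the personnummer pattern must run before the broader
# phone pattern, which would otherwise consume the same digits.
# (The NER step for names is omitted in this sketch.)
PATTERNS = {
    "personal_number": r"\d{6,8}[-\s]?\d{4}",
    "phone": r"\+?\d{1,3}[\s-]?\d{2,4}[\s-]?\d{2,4}[\s-]?\d{2,4}",
    "email": r"[\w\.-]+@[\w\.-]+\.\w+",
}

def scrub(text: str) -> str:
    """Replace every pattern match with a typed placeholder."""
    for pattern_type, pattern in PATTERNS.items():
        text = re.sub(pattern, f"[{pattern_type.upper()}]", text)
    return text

print(scrub("Call 070-123 45 67 or mail anna@example.com"))
# -> Call [PHONE] or mail [EMAIL]
```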

2. Topic Extractor

Purpose: Extract semantic topics from anonymized queries

Methods:

  • Keyword extraction (TF-IDF)
  • Topic modeling (LDA, BERTopic)
  • Embedding-based clustering
class TopicExtractor:
    """Extract topics from anonymized queries"""
    
    def __init__(self, mnemocore_engine):
        self.engine = mnemocore_engine
        self.topic_threshold = 0.5
    
    async def extract_topics(self, query: str) -> List[str]:
        """Extract topics from anonymized query"""
        
        # 1. Get keywords
        keywords = self._extract_keywords(query)
        
        # 2. Find similar topics in MnemoCore
        similar = await self.engine.query(query, top_k=5)
        
        # 3. Cluster into topics
        topics = []
        for memory_id, similarity in similar:
            if similarity > self.topic_threshold:
                memory = await self.engine.get_memory(memory_id)
                topics.extend(memory.metadata.get("topics", []))
        
        # 4. Deduplicate
        return list(set(topics + keywords))
    
    def _extract_keywords(self, text: str) -> List[str]:
        """Extract candidate keywords (stopword filter; TF-IDF planned)"""
        # STOPWORDS_SV: module-level set of Swedish stopwords
        words = text.lower().split()
        return [w for w in words if len(w) > 3 and w not in STOPWORDS_SV]
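The keyword step above is only a stopword filter; real TF-IDF weighting needs document frequencies from a background corpus. A self-contained sketch of how that scoring could look (the corpus, tokenizer, and smoothing are illustrative, not part of the spec):

```python
import math
from collections import Counter

def tfidf_keywords(doc: str, corpus: list, top_k: int = 3) -> list:
    """Rank words in `doc` by TF-IDF against a background corpus."""
    def tokenize(text):
        return [w for w in text.lower().split() if len(w) > 3]

    docs = [tokenize(d) for d in corpus]
    tf = Counter(tokenize(doc))
    total = sum(tf.values())
    n = len(docs)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for d in docs if word in d)      # document frequency
        idf = math.log((n + 1) / (df + 1)) + 1      # smoothed IDF
        scores[word] = (count / total) * idf        # TF * IDF
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]
```

Rare, query-specific words (e.g. a treatment name) score higher than words that appear in most background documents.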

3. Aggregator

Purpose: Store statistical patterns without PII

Data structures:

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class TopicStats:
    """Statistics for a topic"""
    topic: str
    count: int = 0
    first_seen: Optional[datetime] = None
    last_seen: Optional[datetime] = None
    trend: float = 0.0  # recent increase/decrease

@dataclass
class ResponseQuality:
    """Aggregated response quality (no individual ratings)"""
    response_signature: str  # hash of response template
    avg_rating: float = 0.5
    sample_count: int = 0
    last_updated: Optional[datetime] = None

@dataclass
class KnowledgeGap:
    """Topics with no good answers"""
    topic: str
    query_count: int = 0
    failure_rate: float = 1.0  # fraction of queries answered "I don't know"
    suggested_action: str = ""  # e.g. "add documentation", "improve answer"

Storage:

class PatternStore:
    """Store patterns (encrypted, no PII)"""
    
    def __init__(self, encryption_key: bytes):
        self.key = encryption_key
        self.topics: Dict[str, TopicStats] = {}
        self.qualities: Dict[str, ResponseQuality] = {}
        self.gaps: Dict[str, KnowledgeGap] = {}
    
    def record_topic(self, topic: str):
        """Record that a topic was queried"""
        if topic not in self.topics:
            self.topics[topic] = TopicStats(
                topic=topic,
                first_seen=datetime.utcnow()
            )
        
        stats = self.topics[topic]
        stats.count += 1
        stats.last_seen = datetime.utcnow()
    
    def record_quality(self, response_sig: str, rating: int):
        """Record response quality (aggregated)"""
        if response_sig not in self.qualities:
            self.qualities[response_sig] = ResponseQuality(
                response_signature=response_sig
            )
        
        q = self.qualities[response_sig]
        # Exponential moving average
        q.avg_rating = 0.9 * q.avg_rating + 0.1 * (rating / 5.0)
        q.sample_count += 1
        q.last_updated = datetime.utcnow()
    
    def record_gap(self, topic: str, had_answer: bool):
        """Record knowledge gap"""
        if topic not in self.gaps:
            self.gaps[topic] = KnowledgeGap(topic=topic)
        
        gap = self.gaps[topic]
        gap.query_count += 1
        if not had_answer:
            gap.failure_rate = (gap.failure_rate * (gap.query_count - 1) + 1) / gap.query_count
        else:
            gap.failure_rate = (gap.failure_rate * (gap.query_count - 1)) / gap.query_count
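The incremental update in record_gap maintains failure_rate as an exact running mean, so after n queries it equals failures/n regardless of arrival order. A quick standalone check (the GapCounter class is illustrative, mirroring the update rule above):

```python
class GapCounter:
    """Incremental failure-rate tracking, mirroring record_gap above."""
    def __init__(self):
        self.query_count = 0
        self.failure_rate = 1.0  # prior; cancelled out on the first record

    def record(self, had_answer: bool):
        self.query_count += 1
        failed = 0.0 if had_answer else 1.0
        # running mean: new_rate = (old_rate * (n - 1) + failed) / n
        self.failure_rate = (
            self.failure_rate * (self.query_count - 1) + failed
        ) / self.query_count

g = GapCounter()
for had_answer in [False, False, True, False]:
    g.record(had_answer)
# 3 failures out of 4 queries -> failure_rate 0.75
```

On the first record the old rate is multiplied by zero, so the initial prior of 1.0 never biases the statistic.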

4. Insights API

Purpose: Provide actionable insights to admins/developers

Endpoints:

# GET /insights/topics?top_k=10
{
    "topics": [
        {"topic": "implantat", "count": 1250, "trend": 0.15},
        {"topic": "rotfyllning", "count": 980, "trend": -0.02},
        {"topic": "priser", "count": 850, "trend": 0.30}
    ],
    "period": "30d"
}

# GET /insights/gaps
{
    "knowledge_gaps": [
        {
            "topic": "tandreglering vuxna",
            "query_count": 145,
            "failure_rate": 0.85,
            "suggested_action": "add documentation"
        },
        {
            "topic": "akut tandvård",
            "query_count": 89,
            "failure_rate": 0.72,
            "suggested_action": "improve answer"
        }
    ]
}

# GET /insights/quality
{
    "top_responses": [
        {"signature": "abc123", "avg_rating": 4.8, "sample_count": 520},
        {"signature": "def456", "avg_rating": 4.5, "sample_count": 340}
    ],
    "worst_responses": [
        {"signature": "xyz789", "avg_rating": 2.1, "sample_count": 45}
    ]
}
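The payloads above can be assembled directly from the in-memory dataclasses. A minimal sketch for /insights/topics (the trimmed TopicStats and the topics_payload helper are illustrative, not the final API):

```python
from dataclasses import dataclass

@dataclass
class TopicStats:
    """Trimmed to the fields the endpoint needs."""
    topic: str
    count: int = 0
    trend: float = 0.0

def topics_payload(topics: dict, top_k: int = 10, period: str = "30d") -> dict:
    """Assemble the /insights/topics response from in-memory stats."""
    ranked = sorted(topics.values(), key=lambda t: t.count, reverse=True)
    return {
        "topics": [
            {"topic": t.topic, "count": t.count, "trend": t.trend}
            for t in ranked[:top_k]
        ],
        "period": period,
    }
```

The gaps and quality endpoints would follow the same shape: sort the aggregate records, truncate, and serialize only non-identifying fields.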

MnemoCore Integration

Usage Pattern

from mnemocore import HAIMEngine
from mnemocore.pattern_learner import PatternLearner

# Initialize MnemoCore (stores topic embeddings)
engine = HAIMEngine(dimension=16384)
await engine.initialize()

# Initialize Pattern Learner
learner = PatternLearner(
    engine=engine,
    encryption_key=get_encryption_key(),
    anonymizer=Anonymizer()
)

# Process a query (automatic learning)
async def handle_query(user_query: str, tenant_id: str):
    # 1. Anonymize
    anon_query = learner.anonymize(user_query)
    
    # 2. Extract patterns (no PII)
    topics = await learner.extract_topics(anon_query)
    
    # 3. Record topic usage
    for topic in topics:
        learner.record_topic(topic)
    
    # 4. Get answer from RAG
    answer = await rag_lookup(anon_query)
    
    # 5. Record if we had an answer
    learner.record_gap(
        topic=topics[0] if topics else "unknown",
        had_answer=(answer is not None)
    )
    
    return answer

# Get insights (admin only)
async def get_dashboard():
    top_topics = learner.get_top_topics(10)
    gaps = learner.get_knowledge_gaps()
    quality = learner.get_response_quality()
    
    return {
        "popular_topics": top_topics,
        "needs_documentation": gaps,
        "response_performance": quality
    }

GDPR Compliance

Data Minimization

| Data Type | Stored? | Justification |
|---|---|---|
| Raw queries | ❌ | PII risk |
| User IDs | ❌ | Not needed |
| Session IDs | ❌ | Not needed |
| Clinic IDs | ❌ | Not needed |
| Topic labels | ✅ | Anonymized |
| Topic counts | ✅ | Statistical |
| Quality scores | ✅ | Aggregated |
| Gap indicators | ✅ | Anonymized |

Right to Erasure (GDPR Art 17)

Since no personal data is stored, there is nothing to erase: right-to-erasure requests are satisfied by design.

Data Retention

# Configurable retention (in days)
retention_policy = {
    "topic_stats": 365,    # keep for 1 year
    "quality_scores": 90,  # keep for 3 months
    "gap_indicators": 30,  # refresh monthly
}

# Automatic cleanup
async def cleanup_old_patterns():
    cutoff = datetime.utcnow() - timedelta(days=retention_policy["topic_stats"])
    expired = [
        topic for topic, stats in learner.topics.items()
        if stats.last_seen < cutoff
    ]
    for topic in expired:  # collect first: never mutate a dict while iterating it
        del learner.topics[topic]

Security Considerations

Encryption

  • All pattern data encrypted at rest (AES-256)
  • Encryption keys managed via HSM or Azure Key Vault
  • Per-tenant encryption optional (for multi-tenant isolation)
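The cryptography package (already in the dependency list) provides AES-256-GCM for encryption at rest. A sketch of encrypting serialized pattern data; key management via HSM/Key Vault is out of scope here, and the helper names are illustrative:

```python
import json
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_patterns(patterns: dict, key: bytes) -> bytes:
    """Serialize pattern data and encrypt it with AES-256-GCM."""
    nonce = os.urandom(12)  # 96-bit nonce, must be unique per message
    ciphertext = AESGCM(key).encrypt(nonce, json.dumps(patterns).encode(), None)
    return nonce + ciphertext  # prepend nonce so decryption is self-contained

def decrypt_patterns(blob: bytes, key: bytes) -> dict:
    """Split off the nonce, decrypt, and deserialize."""
    nonce, ciphertext = blob[:12], blob[12:]
    return json.loads(AESGCM(key).decrypt(nonce, ciphertext, None))

key = AESGCM.generate_key(bit_length=256)  # in production: from HSM / Key Vault
blob = encrypt_patterns({"implantat": 1250}, key)
```

GCM also authenticates the ciphertext, so tampered pattern data fails to decrypt rather than silently corrupting the statistics.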

Access Control

# Insights API requires admin role
@app.get("/insights/topics")
@require_role("admin")
async def get_topics():
    return learner.get_top_topics(10)
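require_role is not defined above. One minimal, framework-agnostic sketch of the idea; a real deployment would use the web framework's own auth mechanism (e.g. FastAPI dependencies), and the assumption that the caller's roles arrive as a keyword argument is purely illustrative:

```python
from functools import wraps

def require_role(role: str):
    """Decorator: reject calls unless the caller holds the given role.

    Assumes roles are passed as a `user_roles` keyword argument;
    a real app would read them from the authenticated session.
    """
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, user_roles=(), **kwargs):
            if role not in user_roles:
                raise PermissionError(f"requires role: {role}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@require_role("admin")
def get_topics():
    return ["implantat", "priser"]
```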

Audit Logging

# Log all pattern access (not the patterns themselves)
async def log_access(user_id: str, endpoint: str, timestamp: datetime):
    await audit_log.store({
        "user_id": user_id,
        "endpoint": endpoint,
        "timestamp": timestamp.isoformat(),
        # No pattern data logged
    })

Implementation Roadmap

Phase 1: MVP (2 weeks)

  • Anonymizer with Swedish NER
  • Basic topic extraction (keywords)
  • Topic counter (no MnemoCore yet)
  • Simple insights API

Phase 2: MnemoCore Integration (2 weeks)

  • Topic embedding storage in MnemoCore
  • Semantic topic clustering
  • Gap detection using similarity search

Phase 3: Quality Metrics (2 weeks)

  • Response quality tracking
  • Feedback integration
  • Quality dashboard

Phase 4: Production Hardening (2 weeks)

  • Encryption at rest
  • Access control
  • Audit logging
  • Performance optimization

Business Value

For Healthcare Organizations

| Value | Metric |
|---|---|
| Documentation gaps | Know what to add to the knowledge base |
| Popular topics | Prioritize documentation efforts |
| Response quality | Improve user satisfaction |
| Trend analysis | Identify emerging needs |

For Opus Dental (Competitive Advantage)

| Advantage | Value |
|---|---|
| Continuous improvement | Chatbot gets smarter without storing PII |
| Customer insights | Know what clinics need |
| Compliance by design | GDPR-safe from day 1 |
| Unique selling point | "Learning chatbot" vs competitors |

Technical Requirements

Dependencies

mnemocore>=4.5.0
spacy>=3.7.0  # plus a Swedish NER pipeline, e.g. sv_core_news_sm
numpy>=1.24.0
cryptography>=41.0.0  # Encryption

Infrastructure

  • MnemoCore instance (can be shared or per-tenant)
  • Encrypted storage (Azure SQL, PostgreSQL with TDE)
  • Optional: Azure Key Vault for key management

Performance

  • Topic extraction: <50ms per query
  • Insights API: <200ms
  • Storage: ~1 KB per unique topic

Open Questions

  1. Topic granularity: How specific should topics be? "Implantat" vs "Implantat pris" vs "Implantat komplikationer"

  2. Trend detection: What time window for trend analysis? 7d? 30d?

  3. Multi-language: Support for Finnish/Norwegian in addition to Swedish?

  4. Tenant isolation: Should patterns be shared across tenants (anonymized) or kept separate?

  5. Feedback mechanism: How to collect ratings? Thumbs up/down? 1-5 stars?


Conclusion

Pattern Learner enables continuous improvement of healthcare chatbots without GDPR risk. It learns what users ask about, which answers work, and where documentation is missing, all without storing any personal data.

Key innovation: transform "memory" into "patterns" for compliance-safe learning.


Next Steps

  1. Review this spec
  2. Decide on open questions
  3. Prioritize MVP features
  4. Start implementation

Draft by Omega (GLM-5) for Robin Granberg
2026-02-20