anonymous12321 committed on
Commit 19c9a07 · verified · 1 Parent(s): 9785536

Create README.md

Files changed (1)
  1. README.md +112 -0
README.md ADDED
@@ -0,0 +1,112 @@
+ language:
+ - pt
+ - en
+ license: cc-by-nc-nd-4.0
+ colorTo: red
+ sdk: docker
+ app_port: 8501
+ tags:
+ - streamlit
+ - text-segmentation
+ - topic-segmentation
+ - bert
+ - next-sentence-prediction
+ - document-segmentation
+ - meeting-minutes
+ library_name: transformers
+ base_model:
+ - neuralmind/bert-base-portuguese-cased
+
+ NSP-CouncilSeg: Linear Text Segmentation for Municipal Meeting Minutes
+ Model Description
+
+ NSP-CouncilSeg is a BERT model fine-tuned for text segmentation of municipal council meeting minutes. It uses Next Sentence Prediction (NSP) to identify topic boundaries in long-form documents, which makes it well suited to segmenting administrative and governmental meeting minutes.
+
+ Try out the model: Hugging Face Space Demo
+ Key Features
+
+ 🎯 Specialized for Meeting Minutes: Fine-tuned on Portuguese municipal council meeting minutes
+ 🌍 Multilingual Capability: Works with both Portuguese and English text
+ ⚡ Fast Inference: Efficient BERT-base architecture for real-time segmentation
+ 📊 High Accuracy: Achieves a BED F-measure of 0.79 on the CouncilSeg dataset
+ 🔄 Sentence-Level Segmentation: Identifies topic boundaries at sentence granularity
+
+ Model Details
+
+ Base Model: google-bert/bert-base-uncased
+ Architecture: BERT with Next Sentence Prediction head
+ Parameters: 110M
+ Max Sequence Length: 512 tokens
+ Fine-tuning Dataset: CouncilSeg (Portuguese Municipal Meeting Minutes)
+ Fine-tuning Method: Focal Loss with boundary-aware weighting
+ Training Framework: PyTorch + Transformers
+
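The card names focal loss with boundary-aware weighting but gives neither the formula nor the hyperparameters. As a rough illustration only, the standard binary focal loss is FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The sketch below assumes hypothetical values `alpha=0.75` (weight on the boundary class) and `gamma=2.0`; neither comes from the source, and the author's actual weighting scheme may differ.

```python
import math


def focal_loss(p_boundary: float, is_boundary: bool,
               alpha: float = 0.75, gamma: float = 2.0) -> float:
    """Binary focal loss for a single sentence pair.

    alpha and gamma are illustrative assumptions, not the author's settings.
    """
    # p_t is the probability the model assigns to the true class
    p_t = p_boundary if is_boundary else 1.0 - p_boundary
    # alpha_t up-weights the boundary class, which is typically the rarer one
    alpha_t = alpha if is_boundary else 1.0 - alpha
    # Clamp to avoid log(0)
    p_t = min(max(p_t, 1e-12), 1.0)
    # (1 - p_t)^gamma shrinks the loss of examples that are already easy
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.95, is_boundary=True)  # confident and correct
hard = focal_loss(0.30, is_boundary=True)  # boundary mostly missed
print(f"easy={easy:.5f} hard={hard:.5f}")
```

With `gamma=0` and `alpha=1` the expression reduces to ordinary cross-entropy, which is a convenient sanity check.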
+ How It Works
+
+ The model predicts whether two consecutive sentences belong to the same topic (label 0: "is_next") or mark a topic transition (label 1: "not_next"). Applying this classifier to every pair of adjacent sentences in a document yields the topic boundaries.
+
+ Sentence A: "By the President, minutes no. 28 of 20.12.2023 were present at the meeting."
+ Sentence B: "After considering and analyzing the matter, the Municipal Executive unanimously decided to approve minute no. 28 of 12.20.2023."
+ → Prediction: Same Topic (confidence: 76%)
+
+ Sentence A: "After considering and analyzing the matter, the Municipal Executive unanimously decided to approve minute no. 28 of 12.20.2023."
+ Sentence B: "There were no various processes and requests to submit."
+ → Prediction: Topic Boundary (confidence: 82%)
+
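To make the step from pairwise predictions to segments concrete, here is a minimal sketch that needs no model. The helper `group_into_segments` and the 0.5 threshold are illustrative assumptions. The probabilities mirror the two examples above, assuming the reported confidence is the probability of the predicted class (so 76% "same topic" corresponds to a boundary probability of 0.24).

```python
from typing import List


def group_into_segments(sentences: List[str],
                        boundary_probs: List[float],
                        threshold: float = 0.5) -> List[List[str]]:
    """Group sentences into topic segments.

    boundary_probs[i] is P(not_next) for the pair (sentences[i], sentences[i + 1]).
    """
    if not sentences:
        return []
    if len(boundary_probs) != len(sentences) - 1:
        raise ValueError("expected one probability per adjacent sentence pair")

    segments = [[sentences[0]]]
    for sentence, prob in zip(sentences[1:], boundary_probs):
        if prob > threshold:
            segments.append([sentence])    # topic boundary: start a new segment
        else:
            segments[-1].append(sentence)  # same topic: extend the current segment
    return segments


sents = ["Minutes presented.", "Minutes approved.", "No other requests."]
segments = group_into_segments(sents, [0.24, 0.82])
print(segments)
```

A document with N sentences yields N - 1 pair predictions, so the number of segments is one more than the number of predicted boundaries.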
+ Usage
+ Quick Start with Transformers
+
+ from transformers import AutoTokenizer, AutoModelForNextSentencePrediction
+ import torch
+
+ # Load model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("anonymous15135/nsp-councilseg")
+ model = AutoModelForNextSentencePrediction.from_pretrained("anonymous15135/nsp-councilseg")
+
+ # Prepare input
+ sentence_a = "By the President, minutes no. 28 of 20.12.2023 were present at the meeting."
+ sentence_b = "After considering and analyzing the matter, the Municipal Executive unanimously decided to approve minute no. 28 of 12.20.2023."
+
+
+ # Tokenize the pair as a single two-segment input
+ inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
+
+ # Predict (no gradients needed for inference)
+ with torch.no_grad():
+     outputs = model(**inputs)
+     logits = outputs.logits
+     probs = torch.softmax(logits, dim=1)
+
+ # Interpret results: index 0 = is_next (same topic), index 1 = not_next (boundary)
+ is_next_prob = probs[0][0].item()
+ not_next_prob = probs[0][1].item()
+
+ print(f"Is Next (same topic): {is_next_prob:.3f}")
+ print(f"Not Next (topic boundary): {not_next_prob:.3f}")
+
+ if not_next_prob > 0.5:
+     print("🔴 Topic boundary detected!")
+ else:
+     print("🟢 Same topic continues")
+
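The quick start scores a single pair. A possible way to extend it to a whole document is sketched below. The helper names (`consecutive_pairs`, `make_hf_scorer`, `find_boundaries`), the batch size, and the 0.5 threshold are assumptions for illustration and are not part of the model repository. The scorer is passed in as a callable so the pairing logic can be exercised without downloading the model.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


def consecutive_pairs(sentences: List[str]) -> List[Pair]:
    """Pair every sentence with the one that follows it."""
    return list(zip(sentences, sentences[1:]))


def make_hf_scorer(model_id: str = "anonymous15135/nsp-councilseg",
                   batch_size: int = 16) -> Callable[[List[Pair]], List[float]]:
    """Build a scorer that returns P(not_next) for each pair."""
    # Heavy imports are deferred so the other helpers work without torch installed
    import torch
    from transformers import AutoTokenizer, AutoModelForNextSentencePrediction

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForNextSentencePrediction.from_pretrained(model_id)
    model.eval()

    def score(pairs: List[Pair]) -> List[float]:
        probs: List[float] = []
        for start in range(0, len(pairs), batch_size):
            batch = pairs[start:start + batch_size]
            inputs = tokenizer([a for a, _ in batch], [b for _, b in batch],
                               padding=True, truncation=True, max_length=512,
                               return_tensors="pt")
            with torch.no_grad():
                logits = model(**inputs).logits
            # Column 1 holds the not_next (topic boundary) probability
            probs.extend(torch.softmax(logits, dim=1)[:, 1].tolist())
        return probs

    return score


def find_boundaries(sentences: List[str],
                    score_pairs: Callable[[List[Pair]], List[float]],
                    threshold: float = 0.5) -> List[int]:
    """Return each index i at which sentences[i] starts a new topic."""
    probs = score_pairs(consecutive_pairs(sentences))
    return [i + 1 for i, p in enumerate(probs) if p > threshold]
```

With the model available, `find_boundaries(sentences, make_hf_scorer())` would return the indices where new topics start. `truncation=True` with `max_length=512` keeps each pair within the model's context window.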
+ Evaluation Results
+ CouncilSeg Test Set
+ | Metric              | Score |
+ |---------------------|-------|
+ | BED F-measure       | 0.79  |
+ | Boundary Similarity | 0.59  |
+ | Pk Score            | 0.08  |
+ | WindowDiff          | 0.10  |
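Pk and WindowDiff are error rates, so lower is better, whereas F-measure and Boundary Similarity are higher-is-better. The card does not say which implementation produced these scores. The sketch below uses the common sliding-window formulation (the same convention as `nltk.metrics.segmentation`) on lists where 1 marks a boundary position. The function names and the toy data are illustrative.

```python
from typing import List, Optional


def _window_size(ref: List[int], k: Optional[int]) -> int:
    """Default k: half the mean reference segment length."""
    if k is None:
        k = max(1, round(len(ref) / (2 * max(1, sum(ref)))))
    if k > len(ref):
        raise ValueError("window size k exceeds sequence length")
    return k


def pk(ref: List[int], hyp: List[int], k: Optional[int] = None) -> float:
    """Fraction of windows where ref and hyp disagree on 'any boundary here?'."""
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same length")
    k = _window_size(ref, k)
    windows = len(ref) - k + 1
    errors = sum((sum(ref[i:i + k]) > 0) != (sum(hyp[i:i + k]) > 0)
                 for i in range(windows))
    return errors / windows


def window_diff(ref: List[int], hyp: List[int], k: Optional[int] = None) -> float:
    """Fraction of windows where the number of boundaries differs."""
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same length")
    k = _window_size(ref, k)
    windows = len(ref) - k + 1
    errors = sum(sum(ref[i:i + k]) != sum(hyp[i:i + k]) for i in range(windows))
    return errors / windows


reference = [0, 0, 1, 0, 0, 1, 0, 0]
hypothesis = [0, 1, 0, 0, 0, 1, 0, 0]  # first boundary placed one position early
print(pk(reference, hypothesis), window_diff(reference, hypothesis))
```

In this toy case the window size is 2 and both metrics penalize the misplaced boundary in two of the seven windows.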
+ Limitations
+
+ Domain Specificity: Best performance on administrative/governmental meeting minutes
+ Language: Optimized for Portuguese; English performance may vary
+ Document Length: Designed for documents with 10-50 segments
+ Context Window: Limited to 512 tokens per sentence pair
+ Ambiguous Boundaries: May struggle with subtle topic transitions
+
+ Model Card Contact
+
+ For questions or feedback, please open an issue in the model repository.
+ License
+
+ This model is released under the Attribution-NonCommercial-NoDerivatives 4.0 International license (CC BY-NC-ND 4.0).