Ch3nYe committed · Commit e79e8bd · verified · 1 Parent(s): 83f398c

Update README.md

Files changed (1)
  1. README.md +323 -0
README.md CHANGED
@@ -11,3 +11,326 @@ tags:
- binary
- sentence-similarity
---

# BinSeek: Cross-modal Retrieval Models for Stripped Binary Analysis

BinSeek is the first two-stage cross-modal retrieval framework specifically designed for stripped binary code analysis. It bridges the semantic gap between natural language queries and binary code (decompiled pseudocode), enabling effective retrieval of relevant binary functions from large-scale codebases.

BinSeek bridges this gap with a two-stage retrieval strategy:

- **BinSeek-Embedding**: An embedding model trained to learn the semantic relevance between binary code and natural language descriptions, used for efficient first-stage candidate retrieval.
- **BinSeek-Reranker**: A reranking model that judges the relevance of each candidate to the description, using calling context augmentation for more precise results.

<p align="center">
    <img src="https://raw.githubusercontent.com/XingTuLab/BinSeek/main/assets/binseek.png" alt="Overview of BinSeek" width="95%">
</p>

## Model Information

| Model | Domain | Parameters | Embedding Dim | Max Tokens |
|:-------------------------------------------------------------------|:------:|:----------:|:-------------:|:----------:|
| [🤗 BinSeek-Embedding](https://huggingface.co/XingTuLab/BinSeek-Embedding) | Binary | 0.3B | 1024 | 4096 |
| [🤗 BinSeek-Reranker](https://huggingface.co/XingTuLab/BinSeek-Reranker) | Binary | 0.6B | / | 16384 |

Retrieval performance on binary code, compared with the Qwen3 8B baselines. The full two-stage pipeline gives the best score on every metric:

| Model | Model Size | Recall@1 | Recall@3 | MRR@3 |
|:-------------------------|:----------:|:--------:|:--------:|:------:|
| Qwen3-Embedding-8B | 8B | 57.50 | 65.00 | 60.75 |
| BinSeek-Embedding | 0.3B | 67.00 | 80.50 | 72.83 |
| Qwen3-Reranker-8B | 8B | 62.50 | 80.50 | 70.83 |
| BinSeek-Reranker | 0.6B | 61.50 | 83.00 | 70.50 |
| BinSeek (Emb + Rerank) | / | 76.75 | 84.50 | 80.25 |

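The metrics in the table above are not defined in this README. As a rough guide, the sketch below computes Recall@k and MRR@k under the standard definitions, assuming exactly one relevant function per query and scaling to percentages; the paper's evaluation may differ in detail. The function names and the toy data are hypothetical and only serve the illustration.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """Return 1.0 if the relevant item appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def reciprocal_rank_at_k(ranked_ids, relevant_id, k):
    """Return 1/rank of the relevant item if it is within the top-k, else 0.0."""
    for rank, item_id in enumerate(ranked_ids[:k], start=1):
        if item_id == relevant_id:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k):
    """Average both metrics over (ranked_ids, relevant_id) pairs, as percentages."""
    n = len(runs)
    recall = sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / n
    mrr = sum(reciprocal_rank_at_k(ranked, rel, k) for ranked, rel in runs) / n
    return 100.0 * recall, 100.0 * mrr


# Hypothetical toy data: four queries, each with a ranked list of corpus indices
# and the index of the single function that actually matches the query.
toy_runs = [
    ([7, 2, 9], 7),  # hit at rank 1
    ([4, 1, 3], 1),  # hit at rank 2
    ([5, 6, 8], 0),  # miss
    ([2, 0, 9], 9),  # hit at rank 3
]

recall_1, mrr_1 = evaluate(toy_runs, k=1)
recall_3, mrr_3 = evaluate(toy_runs, k=3)
print(f"Recall@1={recall_1:.2f}  Recall@3={recall_3:.2f}  MRR@3={mrr_3:.2f}")
# prints: Recall@1=25.00  Recall@3=75.00  MRR@3=45.83
```

Under these definitions MRR@3 always lies between Recall@1 and Recall@3, which is consistent with every row of the table.
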
## Model Usage

### Dependencies

```bash
# Quote the version specifiers so the shell does not treat ">" as a redirection
pip install torch "sentence-transformers>=5.1.2" "transformers>=4.57.1"
```

Our models can be used with Sentence-Transformers and Transformers. We recommend using the **two-stage pipeline** (Embedding + Reranker) for optimal retrieval performance.

### Sentence-Transformers

```python
import torch
from sentence_transformers import SentenceTransformer, CrossEncoder

# Query and Corpus
query = "A function that implements XTEA encryption algorithm"

# Binary pseudocode corpus (decompiled by IDA Pro)
corpus = [
    '''char *__fastcall sub_100000924(__int64 a1, __int64 a2, unsigned int a3)
{
  unsigned int i; // [xsp+1Ch] [xbp-34h]
  char *v5; // [xsp+20h] [xbp-30h]
  unsigned int v6; // [xsp+2Ch] [xbp-24h]
  __int64 v9; // [xsp+40h] [xbp-10h] BYREF

  v6 = a3;
  v9 = 0;
  if ( a3 % 8 )
    v6 = a3 + 8 - a3 % 8;
  v5 = (char *)malloc(v6);
  __memset_chk(v5, 0, v6, -1);
  for ( i = 0; i < v6; i += 8 )
  {
    v9 = *(_QWORD *)(a1 + (int)i);
    sub_100000A68(32, (unsigned int *)&v9, a2);
    __memcpy_chk(&v5[i], &v9, 8, -1);
  }
  return v5;
}''',
    '''void *__fastcall sub_401000(size_t size){
  void *ptr = malloc(size);
  if (!ptr) { perror("malloc failed"); exit(1); }
  return ptr;
}''',
    '''int __fastcall sub_402000(char *s1, char *s2){
  return strcmp(s1, s2);
}''',
    # ... more functions in your corpus
]
# Context functions for each binary function in the corpus, selected from its
# callees and concatenated into a single string (see our paper for details).
# Use "" when a function has no context.
corpus_context = [
    '''__int64 __fastcall sub_100000A68(__int64 result, unsigned int *a2, __int64 a3)
{
  unsigned int v3; // [xsp+8h] [xbp-28h]
  unsigned int v4; // [xsp+Ch] [xbp-24h]
  unsigned int v5; // [xsp+10h] [xbp-20h]
  unsigned int i; // [xsp+14h] [xbp-1Ch]

  v5 = *a2;
  v4 = a2[1];
  v3 = 0;
  for ( i = 0; i < (unsigned int)result; ++i )
  {
    v5 += (((v4 >> 5) ^ (16 * v4)) + v4) ^ (v3 + *(_DWORD *)(a3 + 4LL * (v3 & 3)));
    v3 -= 1640531527;
    v4 += (((v5 >> 5) ^ (16 * v5)) + v5) ^ (v3 + *(_DWORD *)(a3 + 4LL * ((v3 >> 11) & 3)));
  }
  *a2 = v5;
  a2[1] = v4;
  return result;
}''',
    "",
    "",
    # ... more context functions in your corpus
]

# Embedding-based Retrieval
embedding_model = SentenceTransformer(
    "XingTuLab/BinSeek-Embedding",
    model_kwargs={"dtype": torch.bfloat16},
    trust_remote_code=True
)

query_embeddings = embedding_model.encode([query])
corpus_embeddings = embedding_model.encode(corpus, batch_size=64)

similarity_matrix = embedding_model.similarity(query_embeddings, corpus_embeddings)
scores = similarity_matrix[0].cpu().float().numpy()
top_k = 10  # Number of candidates to retrieve
top_k_indices = scores.argsort()[::-1][:top_k]
candidates = [corpus[i] for i in top_k_indices]

print("=== Stage 1: Embedding Retrieval Results ===")
for i, idx in enumerate(top_k_indices):
    print(f"Rank {i+1}: Score={scores[idx]:.4f}, Corpus Index={idx}")

def build_candidates_with_context(candidates_ids):
    # Wrap each candidate and its context in the tags the reranker expects
    candidates_with_context = []
    for candidate_id in candidates_ids:
        data = f"<pseudocode>\n{corpus[candidate_id]}\n</pseudocode>\n<context>\n{corpus_context[candidate_id]}\n</context>"
        candidates_with_context.append(data)
    return candidates_with_context

candidates_with_context = build_candidates_with_context(top_k_indices)

# Reranking for Precise Results
reranker = CrossEncoder(
    "XingTuLab/BinSeek-Reranker",
    model_kwargs={"dtype": torch.bfloat16},
    trust_remote_code=True
)

# rank() returns dicts sorted by score; 'corpus_id' indexes into candidates_with_context
reranked_results = reranker.rank(query, candidates_with_context)

print("\n=== Stage 2: Reranking Results ===")
print(f"Query: {query}")
for i, rank in enumerate(reranked_results):
    original_idx = top_k_indices[rank['corpus_id']]
    print(f"Rank {i+1}: Score={rank['score']:.4f}, Corpus Index={original_idx}")
```

### Transformers

```python
import torch
import numpy as np
from transformers import AutoModel, AutoTokenizer, AutoModelForSequenceClassification

# Query and Corpus
query = "A function that implements XTEA encryption algorithm"

# Binary pseudocode corpus (decompiled by IDA Pro)
corpus = [
    '''char *__fastcall sub_100000924(__int64 a1, __int64 a2, unsigned int a3)
{
  unsigned int i; // [xsp+1Ch] [xbp-34h]
  char *v5; // [xsp+20h] [xbp-30h]
  unsigned int v6; // [xsp+2Ch] [xbp-24h]
  __int64 v9; // [xsp+40h] [xbp-10h] BYREF

  v6 = a3;
  v9 = 0;
  if ( a3 % 8 )
    v6 = a3 + 8 - a3 % 8;
  v5 = (char *)malloc(v6);
  __memset_chk(v5, 0, v6, -1);
  for ( i = 0; i < v6; i += 8 )
  {
    v9 = *(_QWORD *)(a1 + (int)i);
    sub_100000A68(32, (unsigned int *)&v9, a2);
    __memcpy_chk(&v5[i], &v9, 8, -1);
  }
  return v5;
}''',
    '''void *__fastcall sub_401000(size_t size){
  void *ptr = malloc(size);
  if (!ptr) { perror("malloc failed"); exit(1); }
  return ptr;
}''',
    '''int __fastcall sub_402000(char *s1, char *s2){
  return strcmp(s1, s2);
}''',
    # ... more functions in your corpus
]
# Context functions for each binary function in the corpus, selected from its
# callees and concatenated into a single string (see our paper for details).
# Use "" when a function has no context.
corpus_context = [
    '''__int64 __fastcall sub_100000A68(__int64 result, unsigned int *a2, __int64 a3)
{
  unsigned int v3; // [xsp+8h] [xbp-28h]
  unsigned int v4; // [xsp+Ch] [xbp-24h]
  unsigned int v5; // [xsp+10h] [xbp-20h]
  unsigned int i; // [xsp+14h] [xbp-1Ch]

  v5 = *a2;
  v4 = a2[1];
  v3 = 0;
  for ( i = 0; i < (unsigned int)result; ++i )
  {
    v5 += (((v4 >> 5) ^ (16 * v4)) + v4) ^ (v3 + *(_DWORD *)(a3 + 4LL * (v3 & 3)));
    v3 -= 1640531527;
    v4 += (((v5 >> 5) ^ (16 * v5)) + v5) ^ (v3 + *(_DWORD *)(a3 + 4LL * ((v3 >> 11) & 3)));
  }
  *a2 = v5;
  a2[1] = v4;
  return result;
}''',
    "",
    "",
    # ... more context functions in your corpus
]

# Embedding-based Retrieval (this example requires a CUDA-capable GPU)
embed_tokenizer = AutoTokenizer.from_pretrained(
    "XingTuLab/BinSeek-Embedding",
    trust_remote_code=True
)
embed_model = AutoModel.from_pretrained(
    "XingTuLab/BinSeek-Embedding",
    dtype=torch.bfloat16,
    trust_remote_code=True
).eval().cuda()

def get_embeddings(texts, tokenizer, model, max_length=4096):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=max_length)
    inputs = {k: v.cuda() for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    # Last token pooling: use attention_mask to find the last valid token position.
    # Note: sum(mask) - 1 is that position only when padding is on the right.
    attention_mask = inputs["attention_mask"]
    last_token_indices = attention_mask.sum(dim=1) - 1  # (batch_size,)
    batch_indices = torch.arange(outputs.last_hidden_state.size(0), device=outputs.last_hidden_state.device)
    embeddings = outputs.last_hidden_state[batch_indices, last_token_indices, :]
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    return embeddings.cpu().float().numpy()

query_embedding = get_embeddings([query], embed_tokenizer, embed_model)
corpus_embeddings = get_embeddings(corpus, embed_tokenizer, embed_model)

# Embeddings are L2-normalized, so the dot product is the cosine similarity
scores = np.dot(query_embedding, corpus_embeddings.T)[0]
top_k = 10  # Number of candidates to retrieve
top_k_indices = np.argsort(scores)[::-1][:min(top_k, len(corpus))]
candidates = [corpus[i] for i in top_k_indices]

print("=== Stage 1: Embedding Retrieval Results ===")
for i, idx in enumerate(top_k_indices):
    print(f"Rank {i+1}: Score={scores[idx]:.4f}, Corpus Index={idx}")

def build_candidates_with_context(candidates_ids):
    # Wrap each candidate and its context in the tags the reranker expects
    candidates_with_context = []
    for candidate_id in candidates_ids:
        data = f"<pseudocode>\n{corpus[candidate_id]}\n</pseudocode>\n<context>\n{corpus_context[candidate_id]}\n</context>"
        candidates_with_context.append(data)
    return candidates_with_context

candidates_with_context = build_candidates_with_context(top_k_indices)

# Reranking for Precise Results
rerank_tokenizer = AutoTokenizer.from_pretrained(
    "XingTuLab/BinSeek-Reranker",
    trust_remote_code=True
)
rerank_model = AutoModelForSequenceClassification.from_pretrained(
    "XingTuLab/BinSeek-Reranker",
    dtype=torch.bfloat16,
    trust_remote_code=True
).eval().cuda()

def rerank(query, candidates, tokenizer, model, max_length=16384):
    pairs = [[query, cand] for cand in candidates]
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors="pt", max_length=max_length)
    inputs = {k: v.cuda() for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze(-1)
    scores = torch.sigmoid(logits).float().cpu().numpy()  # Apply sigmoid activation
    return scores

rerank_scores = rerank(query, candidates_with_context, rerank_tokenizer, rerank_model)
reranked_order = np.argsort(rerank_scores)[::-1]

print("\n=== Stage 2: Reranking Results ===")
print(f"Query: {query}")
for i, idx in enumerate(reranked_order):
    original_idx = top_k_indices[idx]
    print(f"Rank {i+1}: Score={rerank_scores[idx]:.4f}, Corpus Index={original_idx}")
```

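One detail of `get_embeddings` in the Transformers example is worth spelling out: taking `attention_mask.sum(dim=1) - 1` as the last-token index is only correct when padding sits on the right. The framework-free sketch below contrasts that rule with a scan from the end of the mask, which works for either padding side. The helper names and the toy masks are hypothetical, and this README does not state which side the BinSeek tokenizer pads on.

```python
def last_index_by_sum(mask_row):
    """Number of real tokens minus one; correct only for right padding."""
    return sum(mask_row) - 1


def last_index_by_scan(mask_row):
    """Scan from the end for the last position with mask 1; works for either side."""
    for pos in range(len(mask_row) - 1, -1, -1):
        if mask_row[pos] == 1:
            return pos
    raise ValueError("sequence contains no real tokens")


# Hypothetical 5-slot batch rows, each holding the same 3-token sequence.
right_padded = [1, 1, 1, 0, 0]  # real tokens at positions 0..2
left_padded = [0, 0, 1, 1, 1]   # real tokens at positions 2..4

for name, row in [("right", right_padded), ("left", left_padded)]:
    print(name, last_index_by_sum(row), last_index_by_scan(row))
# prints:
# right 2 2
# left 2 4
```

With left padding the sum-based rule points at position 2, the first real token rather than the last. Checking `embed_tokenizer.padding_side` before relying on the sum-based index is a cheap safeguard.
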
## License

This project is released under the GPL-3.0 License and is intended for research purposes only. Please use it responsibly and in accordance with applicable laws and regulations.

## Citation

If you find our work helpful, please consider citing it.

```bibtex
@misc{chen2025BinSeek,
      title={Cross-modal Retrieval Models for Stripped Binary Analysis},
      author={Guoqiang Chen and Lingyun Ying and Ziyang Song and Daguang Liu and Qiang Wang and Zhiqi Wang and Li Hu and Shaoyin Cheng and Weiming Zhang and Nenghai Yu},
      year={2025},
      eprint={2512.10393},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2512.10393},
}
```