Spaces:

openbmb
/

UltraData-Math-L3-Generator

Running

App Files Files Community

UltraData-Math-L3-Generator / README.md

ZhouChuYue

Initial commit: UltraData-Math L3 Generator Space

787a7ad 18 days ago

preview code

raw

history blame

4.93 kB

UltraData-Math-L3-Generator

L3 合成数据层：基于 LLM 的多格式数学数据合成工具。

📂 目录结构

UltraData-Math-L3-Generator/
├── run_synthesis.py                 # OpenAI API 调用脚本
├── qa_synthesis.py                  # Q&A 问答对合成 Prompt
├── conversation_synthesis.py        # 多轮对话合成 Prompt
├── multistyle_rewrite.py           # 多风格改写 Prompt
├── knowledge_textbook.py           # 知识点提取 + 教材练习 Prompt
└── README.md

🔧 安装依赖

pip install openai

🚀 快速开始

环境配置

# 设置 API Key
export OPENAI_API_KEY="your-api-key"

# 可选：设置自定义 API 地址（兼容 OpenAI 格式的 API）
export OPENAI_BASE_URL="https://your-api-endpoint/v1"

基本用法

python run_synthesis.py \
    --input data.jsonl \
    --output output.jsonl \
    --task qa \
    --level high_school \
    --model gpt-4o \
    --workers 10

📋 任务类型

1. Q&A 问答对合成 (`qa`)

根据数学内容生成问答对，按教育难度分级。

参数 --level：

值	说明
`grade_school`	小学
`middle_school`	初中
`high_school`	高中（默认）
`college`	大学

python run_synthesis.py -i data.jsonl -o output.jsonl -t qa --level high_school

2. 多轮对话合成 (`conversation`)

将数学内容转换为多轮对话格式。

参数 --style：

值	说明
`two_professors`	两位教授对话
`teacher_student`	师生对话（默认）
`two_students`	两位学生对话
`interview`	面试风格
`problem_solving`	问题解决
`layman_expert`	外行与专家
`debate`	辩论风格

python run_synthesis.py -i data.jsonl -o output.jsonl -t conversation --style teacher_student

3. 多风格改写 (`rewrite`)

将数学内容改写为不同风格。

参数 --style：

值	说明
`wikipedia`	维基百科风格
`textbook`	教科书风格（默认）
`blog`	博客风格
`popular_science`	科普风格
`academic_paper`	学术论文风格
`learning_note`	学习笔记风格
`lecture_note`	讲义风格

python run_synthesis.py -i data.jsonl -o output.jsonl -t rewrite --style textbook

4. 知识点提取 (`knowledge`)

从数学内容中提取定义、定理、性质等知识点。

python run_synthesis.py -i data.jsonl -o knowledge_output.jsonl -t knowledge

5. 教材练习生成 (`textbook`)

基于知识点生成不同难度的教材式练习。

参数 --difficulty：

值	说明
`easy`	简单（默认）
`medium`	中等
`hard`	困难

python run_synthesis.py -i knowledge.jsonl -o output.jsonl -t textbook --difficulty medium

注意： 输入文件需包含 knowledge_point 字段（可通过 --knowledge-field 自定义）。

⚙️ 参数说明

参数	说明	默认值
`-i, --input`	输入 JSONL 文件路径	必填
`-o, --output`	输出 JSONL 文件路径	必填
`-t, --task`	任务类型：`qa`, `conversation`, `rewrite`, `knowledge`, `textbook`	必填
`--level`	Q&A 难度级别	`high_school`
`--style`	对话/改写风格	-
`--difficulty`	教材练习难度	`easy`
`--text-field`	输入文本字段名	`text`
`--knowledge-field`	知识点字段名	`knowledge_point`
`--api-key`	OpenAI API Key	环境变量
`--base-url`	API Base URL	环境变量
`--model`	模型名称	`gpt-4o`
`--temperature`	采样温度	`0.7`
`--max-tokens`	最大生成 token 数	`4096`
`-w, --workers`	并发数	`10`
`--max-retries`	最大重试次数	`3`
`--limit`	限制处理样本数量	-
`-q, --quiet`	静默模式	`False`

📊 输入输出格式

输入： JSONL 格式，每行一个 JSON 对象（参见 example_data.jsonl）：

{"text": "The quadratic formula states that for any quadratic equation..."}
{"text": "The Pythagorean theorem is a fundamental relation..."}

输出： 在原数据基础上添加 synthesis_result 字段：

{
  "text": "原始数学内容",
  "synthesis_result": {
    "raw": "完整响应",
    "problem": "生成的问题",
    "solution": "详细解答"
  }
}

🔌 兼容其他 API

支持任何 OpenAI 兼容的 API（如 Qwen、DeepSeek、vLLM 等）：

# 使用阿里云 Qwen API
export OPENAI_API_KEY="your-dashscope-api-key"
export OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"

python run_synthesis.py -i data.jsonl -o output.jsonl -t qa --model qwen-plus

📜 许可证

本项目基于 Apache 2.0 许可证发布。