How to distill math reasoning data from DeepSeek-R1 with CAMEL: a hands-on guide


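DeepSeek R1 is a leading reasoning model that has drawn wide attention recently for its strong mathematical reasoning and efficient logical processing. It handles everything from basic arithmetic to difficult math problems, which makes it a useful source of reasoning traces for developers.

Combined with the CAMEL framework, we can use long chain-of-thought (Long CoT) to capture the detailed reasoning behind each math problem and distill high-quality math reasoning data from DeepSeek R1. At the end, we upload the dataset to Hugging Face so the community can share and use it for further research on math reasoning.

This tutorial walks step by step through using the CAMEL framework to extract DeepSeek R1's math reasoning ability and build a useful dataset. Let's get started.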

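In this tutorial you will explore:

The CAMEL framework: a multi-agent framework that can generate synthetic data and simulate multi-agent role-playing scenarios.
The data distillation pipeline: a systematic way to extract and refine high-quality reasoning datasets, including detailed thought processes, from models such as DeepSeek R1.
Hugging Face integration: a simple workflow for uploading and sharing the distilled dataset on Hugging Face.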

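Using our synthetic data generation tools, CAMEL-AI has built three high-quality datasets, all published on Hugging Face:

AMC AIME STaR dataset
Contains 4,000 difficult math problems with solutions, including the iterative improvement history of each solution, which shows how an answer is refined step by step.
View the dataset: https://huggingface.co/datasets/camel-ai/amc_aime_star
AMC AIME distilled dataset
Contains 4,000 difficult math problems with solutions, each with a clear step-by-step explanation.
View the dataset: https://huggingface.co/datasets/camel-ai/amc_aime_distilled
GSM8K distilled dataset
Contains 7,000 high-quality, linguistically diverse grade-school math word problems with solutions, each with a detailed step-by-step explanation.
View the dataset: https://huggingface.co/datasets/camel-ai/gsm8k_distilled

Whether you want to explore how AI solves complex problems or dig deeper into mathematical reasoning, these datasets are a good starting point.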

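Steps for generating a math reasoning dataset with the CAMEL data distillation pipeline

Preparation

1. Install dependencies

First, install the required Python libraries by running the following commands from the command line:

pip install "git+https://github.com/camel-ai/camel.git@4210cb0849f3f13d6a46fefeb9e2c3e791c158cb#egg=camel-ai"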
pip install datasets
pip install rouge
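
After installing, it can help to confirm that the packages are importable before moving on. The snippet below is an optional sketch and not part of the original tutorial: the helper name `missing_packages` is made up for illustration, and `camel`, `datasets`, and `rouge` are assumed to be the import names of the packages installed above.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for the three packages installed above
    missing = missing_packages(["camel", "datasets", "rouge"])
    print("Missing packages:", missing or "none")
```

If the list is not empty, rerun the matching `pip install` command in the same Python environment you use to run the tutorial.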

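2. Set the API keys

Set SILICONFLOW_API_KEY or DEEPSEEK_API_KEY. These keys are used to call DeepSeek R1 and distill math reasoning data together with its thought process.

Tip: you can also choose another model provider, such as Fireworks or Together AI.

from getpass import getpass
import os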

SILICONFLOW_API_KEY = getpass('Enter your SILICONFLOW_API_KEY: ')
os.environ["SILICONFLOW_API_KEY"] = SILICONFLOW_API_KEY

DEEPSEEK_API_KEY = getpass('Enter your DEEPSEEK_API_KEY: ')
os.environ["DEEPSEEK_API_KEY"] = DEEPSEEK_API_KEY

# To make DeepSeek R1 return its thought process content, set the following environment variable
os.environ["GET_REASONING_CONTENT"]="True"
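
When the script runs unattended, it may be useful to check which of these variables are actually set instead of prompting for them. The following is a minimal sketch under that assumption; `missing_env_keys` is a hypothetical helper, not a CAMEL function.

```python
import os


def missing_env_keys(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    wanted = ["SILICONFLOW_API_KEY", "DEEPSEEK_API_KEY", "GET_REASONING_CONTENT"]
    print("Not set:", missing_env_keys(wanted) or "none")
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.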

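3. Download data from Hugging Face

We start by preparing raw math data from Hugging Face. Each record has two core parts: a question and an answer. We use the GSM8K dataset as the example and walk through the steps below.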

# Number of problems to download from GSM8K on Hugging Face
NUMBER_OF_PROBLEMS = 10

import json
import re
import uuid

from datasets import load_dataset

def download_gsm8k_dataset():
  try:
    # Load the dataset using the datasets library
    dataset = load_dataset("openai/gsm8k", "main")

    # Take the first NUMBER_OF_PROBLEMS items from the train split
    data = dataset['train'].select(range(NUMBER_OF_PROBLEMS))

    # Convert to the desired format
    formatted_data = []
    for item in data:
      # Extract the final answer from the solution
      solution = item['answer']
      if solution:
        # GSM8K solutions typically end with "#### number"
        match = re.search(r'####\s*(\d+)', solution)
        if match:
          number = match.group(1)
          # Replace the "#### number" with "\boxed{number}"
          solution = re.sub(
            r'####\s*\d+', f'\\\\boxed{{{number}}}', solution
          )

      formatted_item = {
        "id": str(uuid.uuid4()),  # GSM8K doesn't provide IDs
        "problem": item['question'],
        "type": "openai/gsm8k",  # All problems are from GSM8K
        "solution": solution,  # Use the modified solution with \boxed
      }
      formatted_data.append(formatted_item)

    # Save to a file
    output_file = "downloaded_gsm8k_10.json"
    with open(output_file, "w") as f:
      json.dump(formatted_data, f, indent=2)

    print(f"Successfully downloaded and saved GSM8K dataset to {output_file}")
  except Exception as e:
    print(f"Error downloading GSM8K dataset: {e}")

if __name__ == "__main__":
  download_gsm8k_dataset()
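
The `#### number` to `\boxed{number}` rewrite can be tried in isolation, without downloading anything. The sketch below is a slightly more tolerant variant than the pattern above: it assumes a final answer might be negative, decimal, or written with thousands separators, and strips the commas. The name `to_boxed` and the sample strings are made up for illustration.

```python
import re

# Matches "#### 18", "#### -5", "#### 1,200" or "#### 3.5"
ANSWER_RE = re.compile(r"####\s*(-?[\d,]+(?:\.\d+)?)")


def to_boxed(solution):
    """Replace the first GSM8K-style '#### number' marker with '\\boxed{number}'."""
    match = ANSWER_RE.search(solution)
    if match is None:
        return solution  # leave solutions without a marker untouched
    number = match.group(1).replace(",", "")
    # A function replacement avoids backslash-escape handling in re.sub templates
    return ANSWER_RE.sub(lambda _m: f"\\boxed{{{number}}}", solution, count=1)


if __name__ == "__main__":
    print(to_boxed("She earns 9 * 2 = 18 dollars.\n#### 18"))
```

Using a function as the replacement sidesteps the quadruple backslash needed when the replacement is a plain string.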

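With some sample data in the target format, we can now distill math reasoning data that includes the detailed thought process.

Distilling math reasoning data with thought process (Long CoT data)

1. Import the required libraries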

import nest_asyncio
nest_asyncio.apply()

import json
import os
import time

from camel.agents import ChatAgent
from camel.datagen import STaRPipeline
from camel.models import ModelFactory
from camel.types import ModelPlatformType, ModelType

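2. Set up the reasoning models and the evaluation model

Because DeepSeek's own API service is currently not very stable, we call DeepSeek R1 through SiliconFlow and keep DeepSeek's own endpoint as a second model. CAMEL's model manager automatically switches between models depending on whether a request succeeds.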

# Set DeepSeek R1 served by siliconflow as reason model 1
reason_model_1 = ModelFactory.create(
  model_platform=ModelPlatformType.OPENAI_COMPATIBLE_MODEL,
  model_type="deepseek-ai/DeepSeek-R1",
  api_key=os.environ["SILICONFLOW_API_KEY"],
  url="https://api.siliconflow.cn/v1",
  model_config_dict={"max_tokens": 4096},  # Set max_tokens with care: output beyond this limit is cut off
)

# Set DeepSeek R1 served by deepseek cloud as reason model 2
reason_model_2 = ModelFactory.create(
  model_platform=ModelPlatformType.DEEPSEEK,
  model_type=ModelType.DEEPSEEK_REASONER,
)
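
To make the idea of switching models on failure concrete, here is a small standalone sketch of a fallback strategy. It is an illustration only, not CAMEL's model manager: `call_with_fallback` and the two stub backends are hypothetical, and the real scheduling logic inside CAMEL may differ.

```python
def call_with_fallback(backends, prompt):
    """Try each backend in order and return the first successful response."""
    last_error = None
    for backend in backends:
        try:
            return backend(prompt)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    raise RuntimeError("all backends failed") from last_error


def flaky_backend(prompt):
    # Stands in for a provider that is temporarily unavailable
    raise ConnectionError("service unavailable")


def stable_backend(prompt):
    # Stands in for a provider that answers normally
    return f"answer to: {prompt}"


if __name__ == "__main__":
    print(call_with_fallback([flaky_backend, stable_backend], "1 + 1 = ?"))
```

The caller only sees a failure if every configured backend fails. That is why the tutorial registers two providers for the same model.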

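3. Run CAMEL's Self-Improve data generation module

Before running, note a few key parameters:

problems_path: path to the raw math problems.
output_path: path where the generated data is saved.
max_iterations: maximum number of iterations, which controls how many rounds of refinement each solution goes through.
rationalization: whether to include the correct content as a reference when generating the reasoning.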
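
A toy model of an iterative refinement loop may help clarify what `max_iterations` controls. The sketch below is an assumption about the general shape of such a loop, not CAMEL's actual pipeline code: `self_improve`, the stub callables, and the `improvement_history` field are hypothetical and stand in for real model calls.

```python
def self_improve(problem, generate, evaluate, improve, max_iterations, threshold=1.0):
    """Toy loop: generate once, then refine at most max_iterations times."""
    trace = generate(problem)
    history = []
    for iteration in range(max_iterations):
        score = evaluate(problem, trace)
        history.append({"iteration": iteration, "trace": trace, "score": score})
        if score >= threshold:
            break  # good enough, stop refining
        trace = improve(problem, trace)
    return {"problem": problem, "final_trace": trace, "improvement_history": history}


def run_demo(max_iterations):
    # Stub callables standing in for the reasoning and evaluation models
    return self_improve(
        problem="2 + 3 = ?",
        generate=lambda p: "draft answer",
        evaluate=lambda p, t: 1.0 if t.endswith("(revised)") else 0.0,
        improve=lambda p, t: t + " (revised)",
        max_iterations=max_iterations,
    )


if __name__ == "__main__":
    print(run_demo(0)["final_trace"])
    print(run_demo(2)["final_trace"])
```

In this toy version, `max_iterations=0` keeps the first draft without any evaluation, which corresponds to plain distillation of the model's first answer.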

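Notes:

Some optional settings are commented out in the code; enable them as needed.
The generated data can be used directly for training or for further analysis.

When the run finishes, you will find the generated math reasoning dataset at output_path.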

start_time = time.time()
problems_path = "downloaded_gsm8k_10.json"
output_path = "generated_data.json"

# Load problems from JSON file
with open(problems_path, 'r') as f:
  problems = json.load(f)

# Initialize agent
reason_agent_system_message = """Answer my question and give your
final answer within \\boxed{}."""
evaluate_agent_system_message = """You are a highly critical teacher who
evaluates the student's answers with a meticulous and demanding approach.
"""

# Set up reason agent
reason_agent = ChatAgent(
  system_message=reason_agent_system_message,
  model=[reason_model_1, reason_model_2],  # Add models to the list; you can also switch to other models
)

# # Set up evaluate agent(optional)
# evaluate_agent = ChatAgent(
#   system_message=evaluate_agent_system_message
# )

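# # Initialize reward model (optional)
# # Note: NemotronRewardModel must be imported before enabling this block
# # (assumed import path: from camel.models.reward import NemotronRewardModel)
# reward_model = NemotronRewardModel(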
#   model_type=ModelType.NVIDIA_NEMOTRON_340B_REWARD,
#   url="https://integrate.api.nvidia.com/v1",
#   api_key=os.environ.get("NVIDIA_API_KEY"),
# )

# # Set score thresholds for different dimensions (optional)
# score_threshold = {
#   "correctness": 1.0,
#   "clarity": 0.0,
#   "completeness": 0.0,
# }
# # Or use a single threshold for all dimensions:
# score_threshold = 0.9


# Create and run pipeline
pipeline = STaRPipeline(
  reason_agent=reason_agent,
  problems=problems,  # Pass problems list directly
  output_path=output_path,
  max_iterations=0,
  batch_size=100,  # Size of batch to process the data (optional)
  # evaluate_agent=evaluate_agent,  # To use evaluate agent (optional)
  # score_threshold=score_threshold,  # Score thresholds for agent evaluation (optional)
  # reward_model=reward_model,  # To use a reward model (optional)
)

print("Start generation! May take some time, please wait..")

results = pipeline.generate(rationalization=False)

end_time = time.time()
execution_time = end_time - start_time

print(f"\nProcessed {len(results)} problems")
print(f"Results saved to: {output_path}")
print(f"Total execution time: {execution_time:.2f} seconds")

Use the following code to view the generated CoT data:

with open('generated_data.json', 'r') as f:
  data = json.load(f)
  print(json.dumps(data, indent=2))
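
Dumping the whole file can be hard to read once the traces get long. As an alternative, the sketch below prints a compact summary. It assumes the output JSON has a top-level `traces` list whose items carry `problem` and `final_trace` fields, the same fields the upload code later in this tutorial reads; `summarize_traces` is a hypothetical helper.

```python
def summarize_traces(data, preview_chars=80):
    """Return one small summary row per trace: a problem preview and the trace length."""
    rows = []
    for trace in data.get("traces", []):
        final_trace = str(trace.get("final_trace", ""))
        rows.append({
            "problem": str(trace.get("problem", ""))[:preview_chars],
            "trace_chars": len(final_trace),
        })
    return rows


if __name__ == "__main__":
    sample = {
        "traces": [
            {"problem": "What is 1 + 1?", "final_trace": "1 + 1 = 2, so \\boxed{2}"},
        ]
    }
    for row in summarize_traces(sample):
        print(row)
```

A trace that is unexpectedly short or empty is often the first sign that the reasoning content was not returned or that the output was truncated.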

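Uploading the data to Hugging Face

The steps are:

Load the generated data: load the generated dataset from the local file.
Convert to Hugging Face format: convert the data into Hugging Face's Dataset format.
Generate the dataset card: create a card with the dataset description, tags, and license information.
Log in to Hugging Face: authenticate with your Hugging Face account using an API token.
Upload the dataset: upload the dataset and its card to Hugging Face.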

# Import necessary modules and classes
from camel.datahubs.huggingface import HuggingFaceDatasetManager  # Manages interactions with Hugging Face datasets
from camel.datahubs.models import Record  # Represents a single record in the dataset
from datetime import datetime  # Handles date and time operations
import json  # For reading JSON files

def load_star_output(file_path):
  r"""Load and parse the star output JSON file.

  Args:
    file_path (str): Path to the star_output.json file.

  Returns:
    list: List of traces from the JSON file.
  """
  with open(file_path, 'r') as f:
    data = json.load(f)
  return data['traces']

# Main function: Upload dataset to Hugging Face
def upload_to_huggingface(transformed_data, username, dataset_name=None):
  r"""Uploads transformed data to the Hugging Face dataset platform.

  Args:
    transformed_data (list): Transformed data, typically a list of dictionaries.
    username (str): Hugging Face username.
    dataset_name (str, optional): Custom dataset name.

  Returns:
    str: URL of the uploaded dataset.
  """
  # Initialize HuggingFaceDatasetManager to interact with Hugging Face datasets
  manager = HuggingFaceDatasetManager()

  # Generate or validate the dataset name
  dataset_name = generate_or_validate_dataset_name(username, dataset_name)

  # Create the dataset on Hugging Face and get the dataset URL
  dataset_url = create_dataset(manager, dataset_name)

  # Create a dataset card to add metadata
  create_dataset_card(manager, dataset_name, username)

  # Convert the transformed data into a list of Record objects
  records = create_records(transformed_data)

  # Add the Record objects to the dataset
  add_records_to_dataset(manager, dataset_name, records)

  # Return the dataset URL
  return dataset_url

# Generate or validate the dataset name
def generate_or_validate_dataset_name(username, dataset_name):
  r"""Generates a default dataset name or validates and formats a user-provided name.

  Args:
    username (str): Hugging Face username.
    dataset_name (str, optional): User-provided custom dataset name.

  Returns:
    str: Formatted dataset name.
  """
  if not dataset_name:
    # If no dataset name is provided (None or an empty input), fall back to a default name based on the current date
    current_date = datetime.now().strftime("%Y%m%d")
    dataset_name = f"star_traces_{current_date}"

  # Format the dataset name to include the username
  return f"{username}/{dataset_name}"

# Create a dataset on Hugging Face
def create_dataset(manager, dataset_name):
  r"""Creates a new dataset on Hugging Face and returns the dataset URL.

  Args:
    manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
    dataset_name (str): Name of the dataset.

  Returns:
    str: URL of the created dataset.
  """
  dataset_url = manager.create_dataset(dataset_name)
  return dataset_url

# Create a dataset card with metadata
def create_dataset_card(manager, dataset_name, username):
  r"""Creates a dataset card to add metadata.

  Args:
    manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
    dataset_name (str): Name of the dataset.
    username (str): Hugging Face username.
  """
  manager.create_dataset_card(
    dataset_name=dataset_name,
    description="A dataset containing mathematical problem-solving traces with step-by-step solutions and improvement history. Each record includes a mathematical problem, its final solution, and the iterative improvement process.",
    license="mit", # Using lowercase 'mit' as required by HuggingFace
    tags=["math","problem-solving","step-by-step","traces"],
    authors=[username],
    language=["en"],
    task_categories=["text-generation"],
    content="This dataset contains mathematical problem-solving traces generated using the CAMEL framework. Each entry includes:\n\n"
       "- A mathematical problem statement\n"
       "- A detailed step-by-step solution\n"
  )

# Convert transformed data into Record objects
def create_records(transformed_data):
  r"""Converts transformed data into a list of Record objects.

  Args:
    transformed_data (list): List of trace dictionaries from star_output.json.

  Returns:
    list: List of Record objects.
  """
  records = []
  for trace in transformed_data:
    record = Record(
      source_type=trace['type'],
      problem=trace['problem'],
      solution=trace['final_trace'],
    )
    records.append(record)
  return records

# Add Record objects to the dataset
def add_records_to_dataset(manager, dataset_name, records):
  r"""Adds a list of Record objects to the dataset.

  Args:
    manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
    dataset_name (str): Name of the dataset.
    records (list): List of Record objects.
  """
  manager.add_records(dataset_name, records)
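
Before uploading, it may be worth checking that every trace has the fields that `create_records` reads (`type`, `problem`, `final_trace`), so that a missing key does not interrupt the upload halfway. The helper below is a hypothetical sketch for that check and does not call any CAMEL or Hugging Face API.

```python
REQUIRED_KEYS = ("type", "problem", "final_trace")


def find_incomplete_traces(traces, required=REQUIRED_KEYS):
    """Return (index, missing_keys) pairs for traces lacking a required field."""
    incomplete = []
    for index, trace in enumerate(traces):
        missing = [key for key in required if not trace.get(key)]
        if missing:
            incomplete.append((index, missing))
    return incomplete


if __name__ == "__main__":
    sample = [
        {"type": "openai/gsm8k", "problem": "1 + 1 = ?", "final_trace": "\\boxed{2}"},
        {"type": "openai/gsm8k", "problem": "2 + 2 = ?"},
    ]
    print(find_incomplete_traces(sample))
```

Traces reported here can be dropped or regenerated before calling the upload function.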

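Configure the Hugging Face access token and upload the dataset

Go to https://huggingface.co/settings/tokens/new?tokenType=write to get a Hugging Face API token, and make sure write access to repositories is enabled.

Next, create a new dataset on Hugging Face: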

# Get HuggingFace token and username
HUGGING_FACE_TOKEN = getpass('Enter your HUGGING_FACE_TOKEN: ')
os.environ["HUGGING_FACE_TOKEN"] = HUGGING_FACE_TOKEN
username = input("Enter your HuggingFace username: ")
dataset_name = input("Enter your dataset name: ")

# Load the star output data
current_dir = os.getcwd()
star_output_path = os.path.join(current_dir, 'generated_data.json')
traces = load_star_output(star_output_path)

# Upload the data to HuggingFace
dataset_url = upload_to_huggingface(traces, username, dataset_name)
print("\nDataset uploaded successfully!")
print(f"You can view your dataset at: {dataset_url}")

Preview of the final uploaded data