|
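DeepSeek R1 is a reasoning model that has recently drawn wide attention for its strong mathematical reasoning and efficient logical processing. It handles problems ranging from basic arithmetic to difficult math challenges, which makes it a useful source of reasoning data for developers. Combined with the CAMEL framework, we can use long chain-of-thought (Long CoT) to extract the detailed reasoning process behind each math problem and distill high-quality mathematical reasoning data from DeepSeek R1. Finally, we upload the resulting dataset to Hugging Face so the community can share and reuse it.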
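This tutorial walks through, step by step, how to use the CAMEL framework to extract DeepSeek R1's mathematical reasoning and generate a reusable dataset. It covers:

- CAMEL framework: a multi-agent framework that can generate synthetic data and simulate multi-agent role-playing scenarios.
- Data distillation pipeline: a systematic process for extracting and refining high-quality reasoning datasets, including detailed thought processes, from models such as DeepSeek R1.
- Hugging Face integration: a streamlined process for uploading and sharing the distilled dataset on the Hugging Face platform.

Using our synthetic data generation tools, CAMEL-AI has built three high-quality datasets, all now available on Hugging Face: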
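- 4,000 challenging math problems with solutions, including the iterative improvement history of each solution, which shows how an answer is refined step by step. Dataset: https://huggingface.co/datasets/camel-ai/amc_aime_star
- 4,000 challenging math problems with solutions, each with a clear step-by-step explanation. Dataset: https://huggingface.co/datasets/camel-ai/amc_aime_distilled
- 7,000 high-quality, linguistically diverse grade-school math word problems with solutions, each with a detailed step-by-step explanation. Dataset: https://huggingface.co/datasets/camel-ai/gsm8k_distilled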
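Whether you want to explore how AI solves complex problems or dig deeper into mathematical reasoning, these datasets are a good resource. ✨

Steps for generating a math reasoning dataset with the CAMEL data distillation pipeline

1. Install the required libraries

Install the required Python libraries by running the following commands from the command line:

    pip install "git+https://github.com/camel-ai/camel.git@4210cb0849f3f13d6a46fefeb9e2c3e791c158cb#egg=camel-ai"
    pip install datasets
    pip install rouge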
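After installing, it can help to confirm that the packages are visible to the interpreter you will run the tutorial with. The following is a minimal sketch, not part of the CAMEL tooling; it assumes the three packages import as `camel`, `datasets`, and `rouge`, and the helper name `missing_modules` is hypothetical.

```python
from importlib.util import find_spec

# Assumed import names for the three packages installed above.
REQUIRED_MODULES = ["camel", "datasets", "rouge"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

`find_spec` only locates a module without importing it, so the check is fast and has no side effects.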
2. Set the API keys

Set SILICONFLOW_API_KEY or DEEPSEEK_API_KEY. These keys are used to call DeepSeek R1 and distill math reasoning data together with its thought process.

⭐ Tip: you can also choose another model provider, such as Fireworks or Together AI.

    from getpass import getpass
    import os

    SILICONFLOW_API_KEY = getpass('Enter your SILICONFLOW_API_KEY: ')
    os.environ["SILICONFLOW_API_KEY"] = SILICONFLOW_API_KEY

    DEEPSEEK_API_KEY = getpass('Enter your DEEPSEEK_API_KEY: ')
    os.environ["DEEPSEEK_API_KEY"] = DEEPSEEK_API_KEY

    # To make DeepSeek R1 respond with its thought process content,
    # set the following environment variable
    os.environ["GET_REASONING_CONTENT"] = "True"
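Since either key can be used, a quick check of which ones are actually set can save a failed run later. This is a small sketch under the assumption that an empty string counts as "not set"; `configured_providers` is a hypothetical helper, not a CAMEL function.

```python
import os

# Environment variable names taken from the step above.
PROVIDER_KEYS = ("SILICONFLOW_API_KEY", "DEEPSEEK_API_KEY")


def configured_providers(env=None):
    """Return the key names that are present and non-empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in PROVIDER_KEYS if env.get(name)]


available = configured_providers()
if not available:
    print("No provider key is set; set at least one before continuing.")
else:
    print("Configured keys:", ", ".join(available))
```

Only the names of the keys are printed, never their values.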
3. Download data from Hugging Face

We start by preparing the raw math data from Hugging Face. Each item consists of two core parts: a problem and an answer. The following uses the GSM8K dataset as an example.

    # Set the number of problems to download from GSM8K on Hugging Face
    NUMBER_OF_PROBLEMS = 10

    import json
    import re
    import uuid
    from pathlib import Path

    from datasets import load_dataset


    def download_gsm8k_dataset():
        try:
            # Load the dataset using the datasets library
            dataset = load_dataset("openai/gsm8k", "main")

            # Get the items from the train split
            data = dataset['train'].select(range(NUMBER_OF_PROBLEMS))

            # Convert to the desired format
            formatted_data = []
            for item in data:
                # Extract the final answer from the solution
                solution = item['answer']
                if solution:
                    # GSM8K solutions typically end with "#### number"
                    match = re.search(r'####\s*(\d+)', solution)
                    if match:
                        number = match.group(1)
                        # Replace the "#### number" with "\boxed{number}"
                        solution = re.sub(
                            r'####\s*\d+', f'\\\\boxed{{{number}}}', solution
                        )

                formatted_item = {
                    "id": str(uuid.uuid4()),  # GSM8K doesn't provide IDs
                    "problem": item['question'],
                    "type": "openai/gsm8k",  # All problems are from GSM8K
                    "solution": solution,  # Use the modified solution with \boxed
                }
                formatted_data.append(formatted_item)

            # Save to a file
            output = formatted_data
            output_file = "downloaded_gsm8k_10.json"
            with open(output_file, "w") as f:
                json.dump(output, f, indent=2)

            print(f"Successfully downloaded and saved GSM8K dataset to {output_file}")
        except Exception as e:
            print(f"Error downloading GSM8K dataset: {e}")


    if __name__ == "__main__":
        download_gsm8k_dataset()
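The key transformation in the download step is rewriting GSM8K's trailing `#### number` marker as `\boxed{number}`. The sketch below isolates that rewrite on a made-up sample string so it can be tried without downloading anything. It uses a function as the replacement instead of the escaped template string in the code above, which is a stylistic substitution; `box_final_answer` is a hypothetical name.

```python
import re

# Same pattern as in the download step: "####", optional spaces, digits.
FINAL_ANSWER = re.compile(r"####\s*(\d+)")


def box_final_answer(solution: str) -> str:
    """Rewrite '#### N' as '\\boxed{N}'; leave other text untouched."""
    # A function replacement avoids backslash escaping in a template string.
    return FINAL_ANSWER.sub(lambda m: "\\boxed{" + m.group(1) + "}", solution)


# Made-up sample in the GSM8K answer style.
sample = "She sold 48 clips in April and 24 in May, 48 + 24 = 72.\n#### 72"
print(box_final_answer(sample))
```

Like the original pattern, this only matches unsigned integers, so answers with a minus sign, decimal point, or thousands separator would need a broader pattern.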
With some sample data in the target format, we can start distilling math reasoning data that includes detailed thought processes.

Distilling math reasoning data with thought processes (Long CoT data)

1. Import the required libraries

    import nest_asyncio
    nest_asyncio.apply()

    import json
    import os
    import time

    from camel.agents import ChatAgent
    from camel.datagen import STaRPipeline
    from camel.models import ModelFactory
    from camel.types import ModelPlatformType, ModelType
2. Set up the reasoning model and the evaluation model

Because DeepSeek's own API service is currently not very stable, we also call DeepSeek R1 through SiliconFlow. CAMEL's model manager switches between the configured models automatically depending on whether requests succeed.

    # Set DeepSeek R1 served by SiliconFlow as reason model 1
    reason_model_1 = ModelFactory.create(
        model_platform=ModelPlatformType.OPENAI_COMPATIBLE_MODEL,
        model_type="deepseek-ai/DeepSeek-R1",
        api_key=os.environ["SILICONFLOW_API_KEY"],
        url="https://api.siliconflow.cn/v1",
        model_config_dict={"max_tokens": 4096},  # Configure max_tokens carefully
    )

    # Set DeepSeek R1 served by DeepSeek cloud as reason model 2
    reason_model_2 = ModelFactory.create(
        model_platform=ModelPlatformType.DEEPSEEK,
        model_type=ModelType.DEEPSEEK_REASONER,
    )
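To make the idea of switching models on failure concrete, here is a generic fallback loop written with plain callables. It is an illustration of the concept only and is not how CAMEL's model manager is implemented; `call_with_fallback`, `flaky_backend`, and `stable_backend` are all hypothetical names.

```python
from typing import Callable, Sequence


def call_with_fallback(backends: Sequence[Callable[[str], str]], prompt: str) -> str:
    """Try each backend in order and return the first successful answer."""
    last_error = None
    for backend in backends:
        try:
            return backend(prompt)
        except Exception as exc:  # remember the failure and try the next one
            last_error = exc
    raise RuntimeError("All backends failed") from last_error


def flaky_backend(prompt: str) -> str:
    """Stand-in for a provider that is currently unavailable."""
    raise ConnectionError("service unavailable")


def stable_backend(prompt: str) -> str:
    """Stand-in for a provider that answers normally."""
    return f"stable: {prompt}"


print(call_with_fallback([flaky_backend, stable_backend], "2+2"))
```

Listing the preferred provider first keeps it as the default while the second entry acts only as a backup.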
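3. Run CAMEL's Self-Improve data generation module

Before running, review the key parameter settings in the code, such as `problems_path`, `output_path`, `max_iterations`, and `batch_size`.

Notes:
- Some optional configuration code is commented out; enable it as needed.

The generated data can be used directly for training or further analysis.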
When the run finishes, the generated math reasoning dataset is written to `output_path`.

    start_time = time.time()
    problems_path = "downloaded_gsm8k_10.json"
    output_path = "generated_data.json"

    # Load problems from JSON file
    with open(problems_path, 'r') as f:
        problems = json.load(f)

    # Initialize agent system messages
    reason_agent_system_message = """Answer my question and give your
    final answer within \\boxed{}."""
    evaluate_agent_system_message = """You are a highly critical teacher who
    evaluates the student's answers with a meticulous and demanding approach.
    """

    # Set up reason agent
    reason_agent = ChatAgent(
        system_message=reason_agent_system_message,
        # Add models to the list; you can also switch to other models
        model=[reason_model_1, reason_model_2],
    )

    # # Set up evaluate agent (optional)
    # evaluate_agent = ChatAgent(
    #     system_message=evaluate_agent_system_message
    # )

    # # Initialize reward model (optional; NemotronRewardModel must be
    # # imported from CAMEL before enabling this)
    # reward_model = NemotronRewardModel(
    #     model_type=ModelType.NVIDIA_NEMOTRON_340B_REWARD,
    #     url="https://integrate.api.nvidia.com/v1",
    #     api_key=os.environ.get("NVIDIA_API_KEY"),
    # )

    # # Set score thresholds for different dimensions (optional)
    # score_threshold = {
    #     "correctness": 1.0,
    #     "clarity": 0.0,
    #     "completeness": 0.0,
    # }
    # # Or use a single threshold for all dimensions:
    # score_threshold = 0.9

    # Create and run pipeline
    pipeline = STaRPipeline(
        reason_agent=reason_agent,
        problems=problems,  # Pass problems list directly
        output_path=output_path,
        max_iterations=0,
        batch_size=100,  # Size of batch to process the data (optional)
        # evaluate_agent=evaluate_agent,  # To use evaluate agent (optional)
        # score_threshold=score_threshold,  # Score thresholds for agent evaluation (optional)
        # reward_model=reward_model,  # To use a reward model (optional)
    )

    print("Start generation! May take some time, please wait..")

    results = pipeline.generate(rationalization=False)

    end_time = time.time()
    execution_time = end_time - start_time

    print(f"\nProcessed {len(results)} problems")
    print(f"Results saved to: {output_path}")
    print(f"Total execution time: {execution_time:.2f} seconds")
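The commented-out `score_threshold` option accepts either a per-dimension dictionary or a single number for all dimensions. The sketch below shows one plausible reading of how such a threshold could be checked against a set of scores. It is an assumption made for illustration, not the pipeline's actual logic; `passes_threshold` and the sample scores are hypothetical.

```python
def passes_threshold(scores, threshold):
    """Check scores against a per-dimension dict or a single number."""
    if isinstance(threshold, dict):
        # Each listed dimension must reach its own minimum;
        # a dimension missing from `scores` is treated as 0.0 here.
        return all(
            scores.get(dimension, 0.0) >= minimum
            for dimension, minimum in threshold.items()
        )
    # A single number applies to every dimension.
    return all(score >= threshold for score in scores.values())


# Made-up scores for one evaluated answer.
scores = {"correctness": 1.0, "clarity": 0.4, "completeness": 0.7}

per_dimension = {"correctness": 1.0, "clarity": 0.0, "completeness": 0.0}
print(passes_threshold(scores, per_dimension))
print(passes_threshold(scores, 0.9))
```

With the per-dimension values from the commented code, only correctness is effectively enforced, whereas a single threshold of 0.9 requires every dimension to score highly.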
Use the following code to inspect the generated CoT data:

    with open('generated_data.json', 'r') as f:
        data = json.load(f)
        print(json.dumps(data, indent=2))
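Printing the whole file can be long because each trace contains a full reasoning chain. A shorter preview is sketched below. It assumes the output has a top-level `traces` list whose items carry a `problem` field, which matches how the upload code later in this tutorial reads the file; `preview_traces` is a hypothetical helper.

```python
import textwrap


def preview_traces(data, width=80):
    """Return one short line per trace: its index and a truncated problem."""
    lines = []
    for index, trace in enumerate(data["traces"]):
        problem = textwrap.shorten(trace["problem"], width=width)
        lines.append(f"[{index}] {problem}")
    return lines


# Made-up data in the assumed output shape.
example = {
    "traces": [
        {"problem": "What is 2 + 2?", "final_trace": "2 + 2 = 4, so \\boxed{4}"},
    ]
}
for line in preview_traces(example):
    print(line)
```

To use it on the real output, load `generated_data.json` with `json.load` and pass the resulting dictionary in place of `example`.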
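Uploading the data to Hugging Face

The process has four steps:
- Convert to Hugging Face format: convert the data to the Hugging Face Dataset format.
- Generate a dataset card: create a card containing the dataset description, tags, and license information.
- Log in to Hugging Face: authenticate with your Hugging Face account using an API token.
- Upload the dataset: upload the dataset and its card to the Hugging Face platform.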
    # Import necessary modules and classes
    import json  # For reading JSON files
    from datetime import datetime  # Handles date and time operations

    # Manages interactions with Hugging Face datasets
    from camel.datahubs.huggingface import HuggingFaceDatasetManager
    from camel.datahubs.models import Record  # Represents a single record in the dataset


    def load_star_output(file_path):
        r"""Load and parse the star output JSON file.

        Args:
            file_path (str): Path to the star_output.json file.

        Returns:
            list: List of traces from the JSON file.
        """
        with open(file_path, 'r') as f:
            data = json.load(f)
        return data['traces']


    # Main function: Upload dataset to Hugging Face
    def upload_to_huggingface(transformed_data, username, dataset_name=None):
        r"""Uploads transformed data to the Hugging Face dataset platform.

        Args:
            transformed_data (list): Transformed data, typically a list of dictionaries.
            username (str): Hugging Face username.
            dataset_name (str, optional): Custom dataset name.

        Returns:
            str: URL of the uploaded dataset.
        """
        # Initialize HuggingFaceDatasetManager to interact with Hugging Face datasets
        manager = HuggingFaceDatasetManager()

        # Generate or validate the dataset name
        dataset_name = generate_or_validate_dataset_name(username, dataset_name)

        # Create the dataset on Hugging Face and get the dataset URL
        dataset_url = create_dataset(manager, dataset_name)

        # Create a dataset card to add metadata
        create_dataset_card(manager, dataset_name, username)

        # Convert the transformed data into a list of Record objects
        records = create_records(transformed_data)

        # Add the Record objects to the dataset
        add_records_to_dataset(manager, dataset_name, records)

        # Return the dataset URL
        return dataset_url


    # Generate or validate the dataset name
    def generate_or_validate_dataset_name(username, dataset_name):
        r"""Generates a default dataset name or validates and formats a user-provided name.

        Args:
            username (str): Hugging Face username.
            dataset_name (str, optional): User-provided custom dataset name.

        Returns:
            str: Formatted dataset name.
        """
        if dataset_name is None:
            # If no dataset name is provided, generate a default name
            # from the current date
            current_date = datetime.now().strftime("%Y%m%d")
            dataset_name = f"star_traces_{current_date}"

        # Format the dataset name to include the username
        return f"{username}/{dataset_name}"


    # Create a dataset on Hugging Face
    def create_dataset(manager, dataset_name):
        r"""Creates a new dataset on Hugging Face and returns the dataset URL.

        Args:
            manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
            dataset_name (str): Name of the dataset.

        Returns:
            str: URL of the created dataset.
        """
        dataset_url = manager.create_dataset(dataset_name)
        return dataset_url


    # Create a dataset card with metadata
    def create_dataset_card(manager, dataset_name, username):
        r"""Creates a dataset card to add metadata.

        Args:
            manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
            dataset_name (str): Name of the dataset.
            username (str): Hugging Face username.
        """
        manager.create_dataset_card(
            dataset_name=dataset_name,
            description="A dataset containing mathematical problem-solving traces "
            "with step-by-step solutions and improvement history. Each record "
            "includes a mathematical problem, its final solution, and the "
            "iterative improvement process.",
            license="mit",  # Using lowercase 'mit' as required by HuggingFace
            tags=["math", "problem-solving", "step-by-step", "traces"],
            authors=[username],
            language=["en"],
            task_categories=["text-generation"],
            content="This dataset contains mathematical problem-solving traces "
            "generated using the CAMEL framework. Each entry includes:\n\n"
            "- A mathematical problem statement\n"
            "- A detailed step-by-step solution\n",
        )


    # Convert transformed data into Record objects
    def create_records(transformed_data):
        r"""Converts transformed data into a list of Record objects.

        Args:
            transformed_data (list): List of trace dictionaries from star_output.json.

        Returns:
            list: List of Record objects.
        """
        records = []
        for trace in transformed_data:
            record = Record(
                source_type=trace['type'],
                problem=trace['problem'],
                solution=trace['final_trace'],
            )
            records.append(record)
        return records


    # Add Record objects to the dataset
    def add_records_to_dataset(manager, dataset_name, records):
        r"""Adds a list of Record objects to the dataset.

        Args:
            manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
            dataset_name (str): Name of the dataset.
            records (list): List of Record objects.
        """
        manager.add_records(dataset_name, records)
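`create_records` reads `trace['type']`, `trace['problem']`, and `trace['final_trace']` directly, so a trace that lacks any of these keys would raise a `KeyError` partway through the upload. A pre-upload check along the following lines can catch that earlier. It is a sketch rather than part of CAMEL, and `find_incomplete_traces` is a hypothetical helper.

```python
# Keys that create_records reads from every trace.
REQUIRED_KEYS = ("type", "problem", "final_trace")


def find_incomplete_traces(traces):
    """Return (index, missing_keys) for each trace lacking a required key."""
    incomplete = []
    for index, trace in enumerate(traces):
        missing = [key for key in REQUIRED_KEYS if key not in trace]
        if missing:
            incomplete.append((index, missing))
    return incomplete


# Made-up traces: the second one is missing two keys.
sample_traces = [
    {"type": "openai/gsm8k", "problem": "What is 2 + 2?", "final_trace": "\\boxed{4}"},
    {"problem": "What is 3 + 3?"},
]
for index, missing in find_incomplete_traces(sample_traces):
    print(f"Trace {index} is missing: {', '.join(missing)}")
```

Running this on the list returned by `load_star_output` before calling `upload_to_huggingface` makes it easy to drop or repair incomplete traces first.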
Configuring the Hugging Face access token and uploading the dataset

Go to https://huggingface.co/settings/tokens/new?tokenType=write to get a Hugging Face API token, and make sure write access to repositories is enabled. Then create a new dataset on Hugging Face:

    # Get HuggingFace token and username
    HUGGING_FACE_TOKEN = getpass('Enter your HUGGING_FACE_TOKEN: ')
    os.environ["HUGGING_FACE_TOKEN"] = HUGGING_FACE_TOKEN
    username = input("Enter your HuggingFace username: ")
    dataset_name = input("Enter your dataset name: ")

    # Load the star output data
    current_dir = os.getcwd()
    star_output_path = os.path.join(current_dir, './generated_data.json')
    traces = load_star_output(star_output_path)

    # Upload the data to HuggingFace
    dataset_url = upload_to_huggingface(traces, username, dataset_name)
    print("\nDataset uploaded successfully!")
    print(f"You can view your dataset at: {dataset_url}")
|