项目概述背景:随着信息时代的快速发展,从网站上高效抓取和处理数据的需求急剧增加。Firecrawl 由 Mendable.ai 和 Firecrawl 社区共同开发,专门用于将网站内容转换为 LLM(大型语言模型)可直接使用的 Markdown 格式,解决了数据采集和转换的效率问题。 目标:主要目标是提供一个简单、高效的方式,使开发者和研究人员能够快速从网站抓取和转换数据,支持Markdown和结构化数据转换,以便进行数据分析和机器学习应用。

主要功能: 全面网站爬取:即使无需站点地图,也能自动爬取所有可访问子页面。 高级数据采集:即使网站使用 JavaScript 渲染内容,Firecrawl 也能有效采集数据。 格式化输出:返回干净、格式良好的 Markdown 文档,直接适用于 LLM 应用。 并行处理:抓取过程支持并行处理,能够快速返回结果。 智能缓存:会缓存内容,除非有新内容出现,否则无需等待全面搜索。 LLM 提取:利用大型语言模型快速完成网页数据的智能提取。
安装与配置前提条件Python 3.7或更高版本 Node.js(可选,用于Node SDK) API密钥:需在Firecrawl网站注册以获取
安装步骤克隆仓库:执行命令 git clone https://github.com/mendable/firecrawl.git 安装依赖:Python 用户运行 pip install firecrawl-py,Node.js 用户运行 npm install @mendable/firecrawl-js 运行项目:根据语言环境设置适当的环境变量或直接在代码中配置API密钥,然后开始调用API或运行SDK函数。
配置说明使用指南爬取(Crawling)使用爬取功能,可以对指定的URL及其所有可访问的子页面进行爬取。此操作提交一个爬取任务,并返回一个任务ID,用于检查爬取状态。 启动爬取任务 curl -X POST https://api.firecrawl.dev/v0/crawl \ -H 'Content-Type: application/json' \ -H 'Authorization: Bearer YOUR_API_KEY' \ -d '{ "url": "https://mendable.ai" }'
成功调用后返回的 jobId: { "jobId": "1234-5678-9101" }
检查爬取任务状态 curl -X GET https://api.firecrawl.dev/v0/crawl/status/1234-5678-9101 \ -H 'Content-Type: application/json' \ -H 'Authorization: Bearer YOUR_API_KEY'
返回爬取任务的状态: { "status": "completed", "current": 22, "total": 22, "data": [ { "content": "Raw Content ", "markdown": "# Markdown Content", "provider": "web-scraper", "metadata": { "title": "Mendable | AI for CX and Sales", "description": "AI for CX and Sales", "language": null, "sourceURL": "https://www.mendable.ai/" } } ] }
抓取(Scraping)使用抓取功能,可以获取单个URL的内容。 抓取单个URL curl -X POST https://api.firecrawl.dev/v0/scrape \ -H 'Content-Type: application/json' \ -H 'Authorization: Bearer YOUR_API_KEY' \ -d '{ "url": "https://mendable.ai" }'
返回抓取的数据: { "success": true, "data": { "content": "Raw Content ", "markdown": "# Markdown Content", "provider": "web-scraper", "metadata": { "title": "Mendable | AI for CX and Sales", "description": "AI for CX and Sales", "language": null, "sourceURL": "https://www.mendable.ai/" } } }
搜索(Search)使用搜索功能,可以根据查询词进行网页搜索,获取最相关的结果,并抓取每个页面返回Markdown格式的内容。 进行搜索 curl -X POST https://api.firecrawl.dev/v0/search \ -H 'Content-Type: application/json' \ -H 'Authorization: Bearer YOUR_API_KEY' \ -d '{ "query": "firecrawl", "pageOptions": { "fetchPageContent": true // false for a fast serp api } }'
返回搜索和抓取结果: { "success": true, "data": [ { "url": "https://mendable.ai", "markdown": "# Markdown Content", "provider": "web-scraper", "metadata": { "title": "Mendable | AI for CX and Sales", "description": "AI for CX and Sales", "language": null, "sourceURL": "https://www.mendable.ai/" } } ] }
智能提取(Intelligent Extraction)利用LLM技术从抓取的页面中提取结构化数据。 从URL提取结构化数据 curl -X POST https://api.firecrawl.dev/v0/scrape \ -H 'Content-Type: application/json' \ -H 'Authorization: Bearer YOUR_API_KEY' \ -d '{ "url": "https://www.mendable.ai/", "extractorOptions": { "mode": "llm-extraction", "extractionPrompt": "Based on the information on the page, extract the information from the schema.", "extractionSchema": { "type": "object", "properties": { "company_mission": {"type": "string"}, "supports_sso": {"type": "boolean"}, "is_open_source": {"type":
"boolean"},
"is_in_yc": {"type": "boolean"}
},
"required": ["company_mission", "supports_sso", "is_open_source", "is_in_yc"]
}
}
}' 返回提取的结构化数据: ```json { "success": true, "data": { "content": "Raw Content", "metadata": { "title": "Mendable", "description": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide", "robots": "follow, index", "ogTitle": "Mendable", "ogDescription": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide", "ogUrl": "https://mendable.ai/", "ogImage": "https://mendable.ai/mendable_new_og1.png", "ogLocaleAlternate": [], "ogSiteName": "Mendable", "sourceURL": "https://mendable.ai/" }, "llm_extraction": { "company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to", "supports_sso": true, "is_open_source": false, "is_in_yc": true } } }
这些指南提供了对Firecrawl功能的全面概述,助力用户有效地利用其强大的网页数据爬取和处理能力。 常见问题文档与资源API文档参考资源授权协议ingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;letter-spacing: 0.544px;text-wrap: wrap;background-color: rgb(255, 255, 255);">注:本文内容仅供参考,具体项目特性请参照官方 GitHub 页面的最新说明。 https://github.com/mendableai/firecrawl |