As the core B2B supply-chain platform, 1688's full shop catalogs are key input for competitor research and supplier screening. But integrations keep tripping over the same issues: muddled shop-ID parsing, pagination parameters that silently stop working, and triggered anti-scraping limits. I've deployed 1688 data-collection systems for 30+ cross-border merchants; the worst incident involved unhandled price-filter logic that led a client to misjudge sourcing costs and overspend 200,000 CNY on procurement. This article skips the official-docs boilerplate and focuses on the practical core of shop-page parsing → paginated collection → data structuring. All code has been tuned for compliance and stripped of sensitive details, so it can go straight into supply-chain analysis work.

I. Core Interface Logic and Compliance Prerequisites
1688 does not expose a single "full catalog" endpoint for a shop's complete product data; you assemble it by combining shop-page parsing with the paginated offer-list endpoint. Three compliance baselines come first:
- Request frequency: keep the per-IP interval for the same shop at 15 seconds or more, and run no more than 3 collection passes per day (in testing, exceeding this readily triggers a temporary IP block) — see the interval-and-quota sketch after this list;
- Data scope: collect only public product information (title, price, MOQ, sales volume); never pull shop transaction data or customer contact details;
- Usage: the data is for internal supply-chain analysis only; it must not be used to generate product pages that compete with the source shop or for bulk re-listing.
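These limits are easier to honor when they are enforced in code rather than by habit. Below is a minimal sketch of an interval-and-quota guard; the RateGuard name and the on-disk quota file are my own illustration, not part of any 1688 tooling:

```python
import json
import time
from datetime import date
from pathlib import Path


class RateGuard:
    """Enforce per-shop limits: >=15s between requests, <=3 collection runs per day."""

    def __init__(self, min_interval=15, daily_runs=3, quota_file="quota.json"):
        self.min_interval = min_interval
        self.daily_runs = daily_runs
        self.quota_file = Path(quota_file)  # hypothetical persistence file
        self.last_request = 0.0

    def wait_turn(self):
        """Sleep until at least min_interval has passed since the last request."""
        elapsed = time.time() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.time()

    def start_run(self, member_id):
        """Count one collection run for this shop today; raise if over quota."""
        quota = {}
        if self.quota_file.exists():
            quota = json.loads(self.quota_file.read_text())
        key = f"{member_id}:{date.today().isoformat()}"
        if quota.get(key, 0) >= self.daily_runs:
            raise RuntimeError(f"Daily quota reached for shop {member_id}")
        quota[key] = quota.get(key, 0) + 1
        self.quota_file.write_text(json.dumps(quota))
```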
Core technical flow: parse the shop URL → extract memberId (the shop's unique identifier) → build pagination parameters → request the offer-list pages with multithreading → structure the data → dedup and store.

II. Key Modules in Practice (with Pitfall-Proofed Code)
1. Shop ID Parser: Taming 1688's 3 URL Formats
1688 shop URLs come in three formats: main domain, numeric ID, and the standard shop page. Naively extracting the ID from them is error-prone; the parser below covers roughly 98% of the shops I've encountered:

```python
import re
import requests
from lxml import etree


class AlibabaShopIdParser:
    """1688 shop memberId parser (handles 3 URL formats)."""

    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                          "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Referer": "https://www.1688.com/"
        }
        # Patterns for the 3 URL formats. The numeric shopNNN pattern must come
        # before the generic subdomain pattern, or the captured ID keeps the
        # "shop" prefix.
        self.url_patterns = [
            r"https?://shop(\d+)\.1688\.com",  # numeric-ID format: https://shop123456789.1688.com
            r"https?://www\.1688\.com/shop/view_shop\.htm\?memberId=(\w+)",  # standard shop page
            r"https?://(\w+)\.1688\.com"  # main-domain format: https://abc123.1688.com
        ]

    def get_member_id(self, shop_url):
        """Extract the shop's unique memberId."""
        # 1. Try the URL patterns first
        for pattern in self.url_patterns:
            match = re.search(pattern, shop_url)
            if match:
                candidate = match.group(1)
                # Sanity check: a valid memberId is 8-20 word characters
                if 8 <= len(candidate) <= 20 and re.match(r"^\w+$", candidate):
                    return candidate
        # 2. URL matching failed -- fall back to the page content
        return self._extract_from_page(shop_url)

    def _extract_from_page(self, shop_url):
        """Extract memberId from the shop homepage HTML (fallback path)."""
        try:
            response = requests.get(
                shop_url,
                headers=self.headers,
                timeout=15,
                allow_redirects=True  # old shops redirect to new domains
            )
            response.encoding = "utf-8"
            tree = etree.HTML(response.text)
            # Prefer the meta tag (stable when present)
            meta_id = tree.xpath('//meta[@name="memberId"]/@content')
            if meta_id and meta_id[0]:
                return meta_id[0]
            # Otherwise scan the script tags (covers dynamically rendered pages)
            for script in tree.xpath('//script/text()'):
                id_match = re.search(r'memberId\s*[:=]\s*["\'](\w+)["\']', script)
                if id_match:
                    return id_match.group(1)
            return None
        except Exception as e:
            print(f"Page-based ID extraction failed: {e} (shop may not exist or may be closed)")
            return None


# Usage example
if __name__ == "__main__":
    parser = AlibabaShopIdParser()
    # Exercise all 3 URL formats
    test_urls = [
        "https://shop123456789.1688.com",
        "https://abc123.1688.com",
        "https://www.1688.com/shop/view_shop.htm?memberId=abc123456789"
    ]
    for url in test_urls:
        member_id = parser.get_member_id(url)
        print(f"URL: {url} -> memberId: {member_id}")
```

Pitfalls: some older shops redirect to a new domain, so allow_redirects=True is required or parsing fails outright. If the parser returns None, first check whether the shop is still in business: 1688 carries plenty of "closed" shops whose URLs still resolve but hold no data.

2. Pagination Parameter Builder: Avoiding B2B Pagination Traps
1688 shop product lists support default, sales, and price ordering, each with its own parameter rules. On an earlier bulk-collection job we failed to set the sortType parameter, so every page came back in default order and roughly 80% of the high-sales items were missed:

```python
import time
import random
import urllib.parse


class AlibabaPaginationParams:
    """Pagination parameter generator for 1688 shop offer lists."""

    def __init__(self):
        self.base_url = "https://offerlist.1688.com/offerlist.htm"
        # Sort parameter mapping (verified by testing; the docs go stale)
        self.sort_mapping = {
            "default": "",
            "newest": "create_desc",
            "sales_desc": "volume_desc",
            "price_asc": "price_asc",
            "price_desc": "price_desc"
        }

    def generate_params(self, member_id, page=1, sort_type="default", category_id="", **filters):
        """Build a paginated request URL.

        :param member_id: shop memberId
        :param page: page number (1-based; pages beyond 50 return empty data)
        :param sort_type: sort order (a key of sort_mapping)
        :param category_id: category ID (empty string = all categories)
        :param filters: extra filters (price_start / price_end / is_wholesale)
        """
        params = {
            "memberId": member_id,
            "pageNum": page,
            "pageSize": 60,  # 60 is the per-page ceiling; larger values get blocked
            "sortType": self.sort_mapping.get(sort_type, ""),
            "categoryId": category_id,
            "offline": "false",  # online offers only
            "sample": "false",   # exclude sample listings
            "isNoReload": "true",
            "timestamp": str(int(time.time() * 1000)),
            "rn": str(random.randint(1000000000, 9999999999))  # random nonce to defeat caching
        }
        # Price filters (integers only; decimals get rounded by the platform)
        if filters.get("price_start"):
            params["priceStart"] = int(filters["price_start"])
        if filters.get("price_end"):
            params["priceEnd"] = int(filters["price_end"])
        # Wholesale-only filter (the core B2B need)
        if filters.get("is_wholesale", False):
            params["wholesale"] = "true"
        # URL-encode so Chinese and special characters don't break the request
        return f"{self.base_url}?{urllib.parse.urlencode(params)}"


# Usage example
if __name__ == "__main__":
    params_gen = AlibabaPaginationParams()
    # Page 2, sorted by sales descending, price 10-50 CNY, wholesale only
    request_url = params_gen.generate_params(
        member_id="abc123456789",
        page=2,
        sort_type="sales_desc",
        price_start=10,
        price_end=50,
        is_wholesale=True
    )
    print(f"Paginated request URL: {request_url}")
```

Key reminders: page numbers top out at 50, beyond which the endpoint returns empty data. Price filters take integers only; to filter 10.5-50.8 CNY, pass price_start=10 and price_end=50 and let the platform adapt. Raising pageSize above 60 trips parameter validation and returns a 403.
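The three constraints above are easy to forget once the calls are buried in a loop, so it can help to validate them once at the call site. A minimal sketch (the safe_generate wrapper is my own; the 50-page cap and integer-price rule are the observed limits just described):

```python
def safe_generate(params_gen, member_id, page, **kwargs):
    """Clamp known endpoint limits before building the request URL."""
    if not 1 <= page <= 50:
        raise ValueError(f"page must be 1-50, got {page}")  # pages beyond 50 return empty data
    # Coerce decimal prices to integers up front so the rounding is explicit
    for key in ("price_start", "price_end"):
        if kwargs.get(key) is not None:
            kwargs[key] = int(float(kwargs[key]))
    return params_gen.generate_params(member_id=member_id, page=page, **kwargs)
```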
3. Product Data Parser: Extracting Core B2B Fields

1688 product listings render either from an embedded JSON payload or as plain HTML. Pure-XPath parsing used to fail outright on dynamically rendered pages; switching to "JSON first, HTML fallback" raised the parse success rate from 65% to 98%:

```python
import re
import json
from lxml import etree


class AlibabaProductParser:
    """1688 shop product parser (B2B fields only)."""

    def parse_product_page(self, html_content):
        """Parse one offer-list page and extract the core fields."""
        if not html_content:
            return {"products": [], "total_page": 1}
        # 1. Prefer the embedded JSON payload (dynamically rendered pages)
        json_data = self._extract_json_data(html_content)
        if json_data and "offerList" in json_data:
            return self._parse_from_json(json_data)
        # 2. No usable JSON -- fall back to HTML parsing (static pages)
        return self._parse_from_html(html_content)

    def _extract_json_data(self, html_content):
        """Pull the JSON payload out of a page script."""
        # Match the window.__page__data__ global 1688 commonly uses for page state
        json_match = re.search(r'window\.__page__data__\s*=\s*({.*?});\s*',
                               html_content, re.DOTALL)
        if json_match:
            try:
                return json.loads(json_match.group(1))
            except json.JSONDecodeError:
                print("Malformed JSON payload (page may not have fully loaded)")
        return None

    def _parse_from_json(self, json_data):
        """Build product records from the JSON payload."""
        products = []
        offer_list = json_data.get("offerList", [])
        total_page = json_data.get("totalPage", 1)  # total page count
        for item in offer_list:
            # Core B2B fields only (trim to taste; avoid hauling redundant data)
            product = {
                "商品ID": item.get("offerId", ""),          # product ID
                "标题": item.get("title", "").strip(),       # title
                "价格区间": f"{item.get('priceStart', 0)}-{item.get('priceEnd', 0)}元",  # price range
                "起订量": item.get("moq", 0),                # minimum order quantity
                "销量": item.get("volume", 0),               # sales volume
                "商品链接": f"https://detail.1688.com/offer/{item.get('offerId', '')}.html",
                "是否支持批发": item.get("isWholesale", False)  # wholesale flag
            }
            products.append(product)
        return {"products": products, "total_page": total_page}

    def _parse_from_html(self, html_content):
        """Build product records from raw HTML (fallback path)."""
        products = []
        tree = etree.HTML(html_content)
        # Offer-list items (XPath verified against ~90% of shop pages)
        product_items = tree.xpath('//div[@class="offer-item-offer"]')
        # Total page count (missing on some pages; default to 1)
        total_page_text = tree.xpath('//span[@class="total-page"]/text()')
        total_page = int(total_page_text[0].replace("共", "").replace("页", "")) if total_page_text else 1

        def first_or(node, xpath, default=""):
            """Return the first XPath hit, stripped, or a default."""
            hits = node.xpath(xpath)
            return hits[0].strip() if hits else default

        for item in product_items:
            tags = item.xpath('.//div[@class="offer-item-tags"]/text()')
            product = {
                "商品ID": first_or(item, './@data-offer-id'),
                "标题": first_or(item, './/h3[@class="offer-item-title"]/a/text()'),
                "价格区间": first_or(item, './/div[@class="offer-item-price"]/text()'),
                "起订量": first_or(item, './/div[@class="offer-item-moq"]/text()'),
                "销量": first_or(item, './/div[@class="offer-item-sales"]/text()'),
                "商品链接": first_or(item, './/h3[@class="offer-item-title"]/a/@href'),
                "是否支持批发": "批发" in tags[0] if tags else False
            }
            products.append(product)
        return {"products": products, "total_page": total_page}


# Usage example
if __name__ == "__main__":
    # Stand-in HTML file (replace with a real response body in production)
    with open("1688_product_page.html", "r", encoding="utf-8") as f:
        html = f.read()
    parser = AlibabaProductParser()
    result = parser.parse_product_page(html)
    print(f"Parsed {len(result['products'])} products across {result['total_page']} pages")
    for idx, prod in enumerate(result['products'][:3]):  # show the first 3
        print(f"Product {idx + 1}: {prod['标题']} -> {prod['价格区间']}")
```

Parsing notes: B2B users care most about 起订量 (MOQ) and 价格区间 (price range), so those fields come first. 商品ID is the dedup key and must always be kept. Some shops hide their sales volume; the 销量 field then comes back empty and should be marked "销量未公开" (sales not disclosed) downstream.
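The "mark it downstream" step is not shown anywhere else in this article, so here is one way that normalization pass could look. A minimal sketch; the helper name is mine, and the exact cleanup rules are assumptions to adapt to your data:

```python
def normalize_sales(products):
    """Mark hidden sales volumes instead of leaving them blank.

    Some shops hide sales, so 销量 arrives as "" or None; downstream
    sorting and filtering should see an explicit marker, not a blank.
    """
    for p in products:
        raw = p.get("销量")
        if raw in ("", None):
            p["销量"] = "销量未公开"  # "sales not disclosed"
        elif isinstance(raw, str):
            # Strip decorations like "500+件" down to an integer where possible
            digits = "".join(ch for ch in raw if ch.isdigit())
            p["销量"] = int(digits) if digits else "销量未公开"
    return products
```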
4. Multithreaded Collection: Throttling Concurrency Against Anti-Scraping

Single-threaded collection of a 50-page shop takes around 12 minutes; multithreading brings it down to about 3. But 1688 is sensitive to concurrency (in testing, anything above 2 parallel requests triggers IP throttling), so the thread pool is pinned at 2 workers:

```python
import csv
import time
import random
import threading
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed


class AlibabaProductCollector:
    """1688 shop product collector (multithreaded)."""

    def __init__(self, member_id, proxy_pool=None):
        self.member_id = member_id
        self.proxy_pool = proxy_pool or []  # optional proxy pool for higher-frequency runs
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                          "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Referer": f"https://{member_id}.1688.com"  # must point at the shop homepage
        }
        self.params_gen = AlibabaPaginationParams()
        self.parser = AlibabaProductParser()
        self.max_workers = 2    # 2 concurrent requests is the tested safe ceiling
        self.min_interval = 15  # seconds between requests, to stay under the limit
        self.last_request_time = 0
        self._interval_lock = threading.Lock()  # keeps interval control thread-safe

    def _get_proxy(self):
        """Pick a random proxy (optional; make sure the pool is healthy)."""
        if self.proxy_pool:
            return random.choice(self.proxy_pool)
        return None

    def _control_interval(self):
        """Space requests out so we never exceed the allowed frequency."""
        with self._interval_lock:
            elapsed = time.time() - self.last_request_time
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed + random.uniform(2, 3))
            self.last_request_time = time.time()

    def fetch_single_page(self, page, sort_type="default", **filters):
        """Collect one page; returns {"products": [...], "total_page": N}."""
        self._control_interval()  # throttle first
        # Build the URL outside the try block so the retry can reuse it
        request_url = self.params_gen.generate_params(
            member_id=self.member_id, page=page, sort_type=sort_type, **filters
        )
        for attempt in (1, 2):  # one retry to absorb transient network errors
            try:
                proxy = self._get_proxy()
                response = requests.get(
                    request_url,
                    headers=self.headers,
                    proxies={"http": proxy, "https": proxy} if proxy else None,
                    timeout=20,
                    verify=False  # some proxies require skipping SSL verification
                )
                response.encoding = "utf-8"
                result = self.parser.parse_product_page(response.text)
                print(f"Page {page} -> got {len(result['products'])} products")
                return result
            except Exception as e:
                print(f"Page {page} attempt {attempt} failed: {e}")
                if attempt == 1:
                    time.sleep(10)  # brief pause before the single retry
        return {"products": [], "total_page": 1}

    def collect_all_pages(self, max_page=None, sort_type="default", **filters):
        """Collect every page (or up to max_page)."""
        # Page 1 doubles as the probe for the total page count, so we never
        # fetch it twice (a duplicate probe wastes one of our scarce requests)
        first_result = self.fetch_single_page(page=1, sort_type=sort_type, **filters)
        if not first_result["products"]:
            print("Page 1 failed; aborting collection")
            return []
        total_page = first_result["total_page"]
        if max_page and max_page < total_page:
            total_page = max_page
        print(f"Shop has {total_page} pages to collect")
        all_products = list(first_result["products"])
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            # Submit pages 2..total_page
            futures = {
                executor.submit(self.fetch_single_page, page, sort_type, **filters): page
                for page in range(2, total_page + 1)
            }
            for future in as_completed(futures):
                page = futures[future]
                try:
                    all_products.extend(future.result()["products"])
                except Exception as e:
                    print(f"Error handling page {page}: {e}")
        # Dedup by product ID
        unique_products = list({p["商品ID"]: p for p in all_products}.values())
        print(f"Done: {len(unique_products)} unique products collected")
        return unique_products


# Usage example
if __name__ == "__main__":
    # 1. Resolve the shop ID
    shop_parser = AlibabaShopIdParser()
    member_id = shop_parser.get_member_id("https://shop123456789.1688.com")
    if not member_id:
        raise SystemExit("Shop ID parsing failed")
    # 2. Collect products (price 10-100 CNY, wholesale only, at most 20 pages)
    collector = AlibabaProductCollector(member_id=member_id)
    products = collector.collect_all_pages(
        max_page=20,
        sort_type="sales_desc",
        price_start=10,
        price_end=100,
        is_wholesale=True
    )
    # 3. Save to CSV (guard against an empty run)
    if products:
        with open("1688_products.csv", "w", encoding="utf-8-sig", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=products[0].keys())
            writer.writeheader()
            writer.writerows(products)
        print("Products saved to 1688_products.csv")
```

Anti-scraping essentials: Referer must be set to the shop homepage, or requests get flagged as illegitimate. min_interval must never drop below 15 seconds; I once tuned a client's setup down to 10 seconds and the IP was blocked for 24 hours. The proxy pool needs stable, high-anonymity proxies; free proxies mostly don't work and add account risk on top.
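Because dead proxies are the norm rather than the exception, it is worth screening the pool before handing it to the collector. A minimal sketch, assuming proxies come as host:port strings; the probe URL and 5-second timeout are my own choices, nothing 1688-specific:

```python
import requests


def filter_live_proxies(proxy_pool, probe_url="https://www.1688.com", timeout=5):
    """Return only the proxies that can complete a probe request."""
    live = []
    for proxy in proxy_pool:
        try:
            resp = requests.get(
                probe_url,
                proxies={"http": proxy, "https": proxy},
                timeout=timeout,
            )
            if resp.status_code == 200:
                live.append(proxy)
        except requests.RequestException:
            pass  # dead or too slow -- drop it
    print(f"{len(live)}/{len(proxy_pool)} proxies passed the probe")
    return live


# collector = AlibabaProductCollector(member_id, proxy_pool=filter_live_proxies(raw_pool))
```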
III. Data Storage and Compliance Optimization

1. Data Storage: Structured for B2B Needs
Collected product data should be stored in structured form so it can be filtered later (sorted by MOQ, grouped by price band, and so on). CSV or MySQL both work; the CSV example is already in the collector above, and here is the MySQL version:

```python
import pymysql
from pymysql.cursors import DictCursor


class AlibabaProductDBStorage:
    """MySQL storage for collected products."""

    def __init__(self, host, user, password, db_name):
        self.conn = pymysql.connect(
            host=host,
            user=user,
            password=password,
            db=db_name,
            charset="utf8mb4",
            cursorclass=DictCursor
        )
        self._create_table()  # make sure the table exists

    def _create_table(self):
        """Create the product table (B2B fields)."""
        create_sql = """
        CREATE TABLE IF NOT EXISTS alibaba_shop_products (
            id INT AUTO_INCREMENT PRIMARY KEY,
            product_id VARCHAR(50) NOT NULL COMMENT 'product ID (unique on 1688)',
            shop_member_id VARCHAR(50) NOT NULL COMMENT 'shop memberId',
            title VARCHAR(255) NOT NULL COMMENT 'product title',
            price_range VARCHAR(50) NOT NULL COMMENT 'price range (e.g. 10-20 CNY)',
            moq INT NOT NULL DEFAULT 0 COMMENT 'minimum order quantity',
            sales INT NOT NULL DEFAULT 0 COMMENT 'sales volume',
            product_url VARCHAR(255) NOT NULL COMMENT 'product URL',
            is_wholesale TINYINT(1) NOT NULL DEFAULT 0 COMMENT 'wholesale support (0=no, 1=yes)',
            collect_time DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'collected at',
            UNIQUE KEY uk_product_id (product_id)  -- unique product ID blocks duplicate rows
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='1688 shop products';
        """
        with self.conn.cursor() as cursor:
            cursor.execute(create_sql)
        self.conn.commit()

    def batch_insert(self, products, member_id):
        """Bulk-insert product rows (upsert on duplicate product_id)."""
        if not products:
            return 0
        insert_data = []
        for p in products:
            # MOQ/sales may arrive as strings like "2件" or "500+" -- keep the digits
            moq, sales = p["起订量"], p["销量"]
            if isinstance(moq, str):
                digits = "".join(ch for ch in moq if ch.isdigit())
                moq = int(digits) if digits else 0
            if isinstance(sales, str):
                digits = "".join(ch for ch in sales if ch.isdigit())
                sales = int(digits) if digits else 0
            insert_data.append((
                p["商品ID"], member_id, p["标题"], p["价格区间"],
                moq, sales, p["商品链接"], 1 if p["是否支持批发"] else 0
            ))
        insert_sql = """
        INSERT INTO alibaba_shop_products
            (product_id, shop_member_id, title, price_range, moq, sales, product_url, is_wholesale)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
        ON DUPLICATE KEY UPDATE
            title=VALUES(title), price_range=VALUES(price_range), moq=VALUES(moq),
            sales=VALUES(sales), product_url=VALUES(product_url), is_wholesale=VALUES(is_wholesale),
            collect_time=CURRENT_TIMESTAMP;
        """
        with self.conn.cursor() as cursor:
            affected_rows = cursor.executemany(insert_sql, insert_data)
        self.conn.commit()
        print(f"Batch insert done, {affected_rows} rows affected")
        return affected_rows


# Usage example
if __name__ == "__main__":
    # Assumes `products` and `member_id` from the collector above
    db_storage = AlibabaProductDBStorage(
        host="localhost",
        user="root",
        password="123456",
        db_name="alibaba_data"
    )
    db_storage.batch_insert(products, member_id)
```
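Once the rows are in MySQL, the B2B filtering mentioned above is a couple of queries away. A quick sketch against the table defined earlier (pass the pymysql connection, e.g. db_storage.conn). Since price_range is stored as a display string, the second query groups by MOQ band instead, and the band boundaries are my own illustrative choice:

```python
def top_low_moq_products(conn, member_id, limit=20):
    """One shop's products, lowest MOQ first -- handy for trial orders."""
    sql = """
    SELECT title, moq, price_range, sales FROM alibaba_shop_products
    WHERE shop_member_id = %s ORDER BY moq ASC, sales DESC LIMIT %s;
    """
    with conn.cursor() as cursor:
        cursor.execute(sql, (member_id, limit))
        return cursor.fetchall()


def count_by_moq_band(conn, member_id):
    """Distribution of a shop's catalog across MOQ bands."""
    sql = """
    SELECT CASE WHEN moq <= 10 THEN '<=10'
                WHEN moq <= 100 THEN '11-100'
                ELSE '>100' END AS moq_band,
           COUNT(*) AS product_count
    FROM alibaba_shop_products
    WHERE shop_member_id = %s
    GROUP BY moq_band;
    """
    with conn.cursor() as cursor:
        cursor.execute(sql, (member_id,))
        return cursor.fetchall()
```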
2. Compliance Optimization: 3 Risk Points to Avoid

- Incremental collection: record the product IDs from the previous run and only keep new or updated items next time, cutting request volume (example below, with a persistence sketch right after it):
```python
def incremental_collect(self, last_product_ids):
    """Incremental run: keep only products not seen in last_product_ids."""
    all_products = self.collect_all_pages()
    # Filter out items we already have
    new_products = [p for p in all_products if p["商品ID"] not in last_product_ids]
    print(f"Incremental run found {len(new_products)} new products")
    return new_products
```
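How last_product_ids persists between runs is left open above; here is a minimal sketch using a local JSON file (the file name and helper names are mine):

```python
import json
from pathlib import Path

SEEN_FILE = Path("seen_product_ids.json")  # hypothetical local store


def load_seen_ids():
    """Load the product-ID set from the previous run (empty set on first run)."""
    if SEEN_FILE.exists():
        return set(json.loads(SEEN_FILE.read_text(encoding="utf-8")))
    return set()


def save_seen_ids(products):
    """Merge this run's product IDs into the persisted set."""
    seen = load_seen_ids() | {p["商品ID"] for p in products}
    SEEN_FILE.write_text(json.dumps(sorted(seen), ensure_ascii=False), encoding="utf-8")


# new_products = collector.incremental_collect(load_seen_ids())
# save_seen_ids(new_products)
```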
- Dynamic UA: use a different User-Agent for each request so a fixed UA can't be fingerprinted (the fake_useragent library can supply them):
```python
from fake_useragent import UserAgent

ua = UserAgent()
# Rotate per request, e.g. at the top of AlibabaProductCollector.fetch_single_page:
self.headers["User-Agent"] = ua.random
```
- Data retention: in line with the PRC Personal Information Protection Law, don't store data longer than necessary; a MySQL scheduled event that purges records older than 3 months is a sensible default:
```sql
-- Purge collection data older than 3 months, on the 1st of each month
-- (requires the event scheduler: SET GLOBAL event_scheduler = ON;)
CREATE EVENT IF NOT EXISTS clean_old_products
ON SCHEDULE EVERY 1 MONTH STARTS '2024-01-01 03:00:00'
DO
  DELETE FROM alibaba_shop_products
  WHERE collect_time < DATE_SUB(NOW(), INTERVAL 3 MONTH);
```
IV. Common Problems and Solutions (Field-Tested Pitfalls)

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Every page parses to 0 products | 1. Shop has closed; 2. IP is restricted | 1. Check the shop URL loads normally; 2. Switch IP/proxy and retry after 24 hours |
| Some pages return 403 | 1. Request interval too short; 2. Wrong Referer | 1. Raise the interval to 15s or more; 2. Make sure Referer is the shop homepage URL |
| Price filter has no effect | 1. Decimal price passed; 2. Wrong parameter name | 1. Convert prices to integers (pass 10 for 10.5 CNY); 2. Confirm the names are priceStart/priceEnd |
| Some pages fail under multithreading | 1. Concurrency too high; 2. Unstable proxies | 1. Drop concurrency to 2; 2. Replace the proxy pool and add retries |
| JSON parsing fails | 1. Page not fully loaded; 2. Platform renamed the JSON variable | 1. Raise the request timeout to 20s; 2. Fall back to HTML parsing (_parse_from_html) |
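The first two rows of this table can be told apart programmatically before burning a 24-hour wait. A small diagnostic sketch; the messages and logic are my own reading of the table, using only the status code and the parse result the classes above already produce:

```python
def diagnose_page(response, parsed):
    """Map a response + parse result onto the failure modes in the table above."""
    if response.status_code == 403:
        # Interval too short or wrong Referer -- back off instead of retrying blindly
        return "blocked: raise min_interval above 15s and re-check the Referer"
    if response.status_code == 200 and not parsed["products"]:
        # Closed shop, restricted IP, or a page number past the 50-page cap
        return "empty: open the shop URL in a browser; if it loads, rotate IP and retry in 24h"
    return "ok"
```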
V. Summary and Discussion

The hard part of integrating 1688 full-catalog data isn't technical sophistication; it's detail discipline. A single wrong Referer can sink an entire collection run, and a request interval a few seconds too short can get an IP blocked for 24 hours. When I roll this system out for clients, the real time sink isn't writing code but repeatedly testing how different shops behave and how each scenario trips the anti-scraping defenses. If you hit problems integrating 1688 data, such as shop IDs that won't parse, incomplete product data, or IP blocks, describe your scenario in the comments (for example, "collecting high-sales products from 10 shops and constantly getting IP-banned") and I'll share targeted fixes. You can also DM me to have the pitfalls in your code reviewed and push collection efficiency up by 50% or more.
