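Exception handling is what keeps a crawler running reliably. A crawler can hit many problems at runtime: failed network requests, changes in the target page's structure, data-parsing errors, and so on. Handling these exceptions properly not only prevents crashes but also helps you locate and debug problems quickly. The sections below cover common exception-handling techniques and best practices for building a more robust crawler.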

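I. Common Exception Types

Exceptions commonly encountered in crawler development include:

Network request exceptions:
- requests.exceptions.RequestException: the base class for request failures (timeouts, connection errors, etc.).
- requests.exceptions.HTTPError: the response carried an error status code (404, 500, etc.). It is a subclass of RequestException and is raised by response.raise_for_status().
Parsing exceptions:
- AttributeError: the page structure changed and an element could not be found (for example, find() returned None and the code then accessed .text on it).
- ValueError: malformed data, such as a failure while parsing JSON.
Other exceptions:
- Exception: the generic base class, used to catch errors that are not handled explicitly.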
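
Because HTTPError is a subclass of RequestException, the order of except clauses matters: the more specific class has to come first. The sketch below illustrates this with a hypothetical helper, classify_error (not part of requests), that maps an exception to a coarse category. It constructs the exceptions directly, so no network access is needed.

```python
import json

import requests


def classify_error(exc):
    """Map an exception to a coarse category (hypothetical helper for illustration)."""
    try:
        raise exc
    except requests.exceptions.HTTPError:
        # Must come before RequestException, its parent class
        return "http"
    except requests.exceptions.RequestException:
        # Timeouts, connection errors and other request failures
        return "network"
    except (AttributeError, ValueError):
        # Missing elements, or malformed data such as invalid JSON
        return "parse"
    except Exception:
        return "unknown"


def json_error():
    """Return the exception that json.loads raises for malformed input."""
    try:
        json.loads("{not valid json")
    except ValueError as e:  # json.JSONDecodeError is a subclass of ValueError
        return e


print(classify_error(requests.exceptions.HTTPError("404 Not Found")))
print(classify_error(requests.exceptions.Timeout("timed out")))
print(classify_error(json_error()))
print(classify_error(KeyError("price")))
```

If the RequestException clause were listed first, the "http" branch would never be reached.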

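II. Exception-Handling Strategies

(1) Catching Exceptions

Use a try-except block to catch exceptions that may occur. In crawler code, key operations such as network requests and data parsing usually need to be wrapped this way.

Example code: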

import requests
from bs4 import BeautifulSoup


def get_product_details(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    try:
        # A timeout keeps a stalled connection from hanging the crawler
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raise HTTPError on a 4xx/5xx status
        soup = BeautifulSoup(response.text, 'html.parser')
        # find() returns None for a missing element, so .text raises AttributeError
        product_name = soup.find('h1', {'class': 'd-title'}).text.strip()
        product_price = soup.find('span', {'class': 'price-tag-text-sku'}).text.strip()
        product_image = soup.find('img', {'class': 'desc-lazyload'}).get('src')
        return {
            'name': product_name,
            'price': product_price,
            'image': product_image,
        }
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
    except AttributeError as e:
        print(f"Failed to parse page: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
    return None
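
One way to make parsing failures easier to diagnose is to check each find() result explicitly instead of relying on AttributeError. The sketch below is an illustration, not the original author's code: parse_product, require, and ParseError are hypothetical names, and the selectors are copied from the example above. Because it takes an HTML string, it can be exercised without any network request.

```python
from bs4 import BeautifulSoup


class ParseError(Exception):
    """Raised when an expected element is missing (hypothetical exception type)."""


def require(soup, tag, css_class):
    """Return the first matching element, or raise ParseError naming what is missing."""
    element = soup.find(tag, class_=css_class)
    if element is None:
        raise ParseError(f"missing <{tag} class='{css_class}'>")
    return element


def parse_product(html):
    """Extract name, price and image URL from a product page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "name": require(soup, "h1", "d-title").get_text(strip=True),
        "price": require(soup, "span", "price-tag-text-sku").get_text(strip=True),
        "image": require(soup, "img", "desc-lazyload").get("src"),
    }


sample = (
    '<h1 class="d-title"> Widget </h1>'
    '<span class="price-tag-text-sku">9.90</span>'
    '<img class="desc-lazyload" src="https://example.com/w.jpg">'
)
print(parse_product(sample))

try:
    # The price and image elements are absent here
    parse_product("<h1 class='d-title'>Widget</h1>")
except ParseError as e:
    print(f"Failed to parse page: {e}")
```

The error message names the missing selector, which is usually the first thing needed when the target site changes its markup.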

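(2) Logging

After catching an exception, record it in a log so the problem can be analyzed and traced later. Python's standard logging module can be used for this. With the configuration shown here, records go to the console; pass filename= to logging.basicConfig to write them to a log file instead.

Example code: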

import logging

import requests
from bs4 import BeautifulSoup

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')


def get_product_details(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        product_name = soup.find('h1', {'class': 'd-title'}).text.strip()
        product_price = soup.find('span', {'class': 'price-tag-text-sku'}).text.strip()
        product_image = soup.find('img', {'class': 'desc-lazyload'}).get('src')
        logging.info(f"Fetched product details: {product_name}")
        return {
            'name': product_name,
            'price': product_price,
            'image': product_image,
        }
    except requests.exceptions.RequestException as e:
        logging.error(f"Request failed: {e}")
    except AttributeError as e:
        logging.error(f"Failed to parse page: {e}")
    except Exception as e:
        logging.error(f"Unexpected error: {e}")
    return None
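
Logging only the exception message drops the traceback, which is often what pinpoints the failing line. As a sketch, logging.exception records the full traceback when called inside an except block. The logger name "crawler", the helper names, and writing to an in-memory stream instead of a file are assumptions made so the example runs anywhere; a logging.FileHandler could be attached the same way to write to a file.

```python
import io
import logging


def build_logger(stream):
    """Create a logger that writes formatted records to the given stream."""
    logger = logging.getLogger("crawler")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers if called more than once
    logger.propagate = False  # keep records out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
    logger.addHandler(handler)
    return logger


def extract_name(page, logger):
    """Return the product name from a parsed page dict, or None on failure."""
    try:
        return page["title"].strip()
    except (KeyError, AttributeError):
        # logger.exception logs at ERROR level and appends the traceback
        logger.exception("Failed to parse page")
        return None


buffer = io.StringIO()
log = build_logger(buffer)
extract_name({"title": None}, log)  # None has no .strip(), so AttributeError is raised
print(buffer.getvalue())
```

The captured output contains the ERROR line followed by the traceback ending in the AttributeError.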

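(3) Retry Mechanism

For exceptions caused by network fluctuations or other transient problems, add a retry mechanism. For example, when requests.exceptions.RequestException is caught, the request can be sent again.

Example code: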

import logging
import time

import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException


def get_product_details(url, max_retries=3):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    retries = 0
    while retries < max_retries:
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')
            product_name = soup.find('h1', {'class': 'd-title'}).text.strip()
            product_price = soup.find('span', {'class': 'price-tag-text-sku'}).text.strip()
            product_image = soup.find('img', {'class': 'desc-lazyload'}).get('src')
            logging.info(f"Fetched product details: {product_name}")
            return {
                'name': product_name,
                'price': product_price,
                'image': product_image,
            }
        except RequestException as e:
            retries += 1
            logging.warning(f"Request failed ({e}), retrying... ({retries}/{max_retries})")
            time.sleep(1)  # Wait 1 second before retrying
        except Exception as e:
            # Not a network problem, so retrying will not help
            logging.error(f"Unexpected error: {e}")
            return None
    # Reached only when every attempt failed with a request error
    logging.error("Retry limit reached, giving up")
    return None
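
A fixed one-second pause works, but a common variation is exponential backoff, where the wait doubles after each failure so a struggling server is contacted less and less often. The helper below is a generic sketch, not the article's original code: retry_call, base_delay, and the injectable sleep parameter are hypothetical names chosen for illustration. With requests, retry_on would typically be requests.exceptions.RequestException.

```python
import time


def retry_call(func, max_retries=3, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with exponential backoff.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last exception once max_retries attempts have failed.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * 2 ** (attempt - 1))


# Demonstration with a function that fails twice before succeeding
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays = []  # record the waits instead of actually sleeping
print(retry_call(flaky, retry_on=(ConnectionError,), sleep=delays.append))
print(delays)
```

Passing sleep as a parameter is a design choice that lets the backoff schedule be checked in tests without real waiting. Exceptions outside retry_on propagate immediately, which matches the idea that only transient failures are worth retrying.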

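(4) Handling Exceptions by Category

Different kinds of exceptions can be handled differently. For example, retry on network exceptions, but on a parsing exception skip the current item and log it.

Example code: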

import logging
import time

import requests
from bs4 import BeautifulSoup


def get_product_details(url, max_retries=3):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    retries = 0
    while retries < max_retries:
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')
            product_name = soup.find('h1', {'class': 'd-title'}).text.strip()
            product_price = soup.find('span', {'class': 'price-tag-text-sku'}).text.strip()
            product_image = soup.find('img', {'class': 'desc-lazyload'}).get('src')
            logging.info(f"Fetched product details: {product_name}")
            return {
                'name': product_name,
                'price': product_price,
                'image': product_image,
            }
        except requests.exceptions.RequestException as e:
            # Network problem: worth retrying
            retries += 1
            logging.warning(f"Request failed ({e}), retrying... ({retries}/{max_retries})")
            time.sleep(1)
        except AttributeError as e:
            # Page structure problem: retrying will not help, so skip this item
            logging.error(f"Failed to parse page: {e}")
            return None
        except Exception as e:
            logging.error(f"Unexpected error: {e}")
            return None
    # Reached only when every attempt failed with a request error
    logging.error("Retry limit reached, giving up")
    return None
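
At the level of a whole crawl, handling by category usually means a failure on one page should not stop the rest. The loop below is a minimal sketch under assumed names: crawl_all is hypothetical, and fetch_one stands for any function like get_product_details that returns a dict, or None after it has logged its own failure. Unexpected exceptions are also caught per URL so the remaining URLs still run.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")


def crawl_all(urls, fetch_one):
    """Fetch every URL, skipping failures; return (results, failed_urls)."""
    results, failed = [], []
    for url in urls:
        try:
            item = fetch_one(url)
        except Exception:
            # Anything fetch_one did not handle itself: log it and move on
            logging.exception("Unexpected error for %s", url)
            item = None
        if item is None:
            failed.append(url)  # keep the URL so it can be retried or inspected later
            continue
        results.append(item)
    return results, failed


def fake_fetch(url):
    """Stand-in for a real fetcher so the example runs offline."""
    if url.endswith("/bad"):
        return None
    if url.endswith("/boom"):
        raise ValueError("malformed data")
    return {"name": url.rsplit("/", 1)[-1]}


items, failed = crawl_all(
    [
        "https://example.com/a",
        "https://example.com/bad",
        "https://example.com/boom",
        "https://example.com/b",
    ],
    fake_fetch,
)
print(items)
print(failed)
```

Returning the failed URLs alongside the results makes it easy to re-run just those pages once the underlying problem is fixed.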

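(5) Resource Cleanup

When an exception occurs, make sure resources that were already acquired, such as HTTP connections and database connections, are released. Cleanup can be done in a finally block.

Example code: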

import logging

import requests
from bs4 import BeautifulSoup


def get_product_details(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    response = None  # Defined up front so the finally block can test it safely
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        product_name = soup.find('h1', {'class': 'd-title'}).text.strip()
        product_price = soup.find('span', {'class': 'price-tag-text-sku'}).text.strip()
        product_image = soup.find('img', {'class': 'desc-lazyload'}).get('src')
        logging.info(f"Fetched product details: {product_name}")
        return {
            'name': product_name,
            'price': product_price,
            'image': product_image,
        }
    except requests.exceptions.RequestException as e:
        logging.error(f"Request failed: {e}")
    except AttributeError as e:
        logging.error(f"Failed to parse page: {e}")
    except Exception as e:
        logging.error(f"Unexpected error: {e}")
    finally:
        # response is still None if requests.get() itself raised
        if response is not None:
            response.close()  # Make sure the response is closed
    return None
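
Context managers are often a tidier alternative to an explicit finally block, because the cleanup runs automatically when the with block exits, whether normally or through an exception. The text mentions database connections, so this sketch uses the standard-library sqlite3 module; the table layout and the function names save_products and count_products are assumptions for illustration. requests.Session and Response objects can be used in a with statement in the same way.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def save_products(db_path, products):
    """Insert products in one transaction; the connection is always closed."""
    with closing(sqlite3.connect(db_path)) as conn:  # closing() calls conn.close() on exit
        conn.execute("CREATE TABLE IF NOT EXISTS products (name TEXT NOT NULL, price TEXT)")
        with conn:  # commits on success, rolls back if an exception is raised
            conn.executemany(
                "INSERT INTO products (name, price) VALUES (?, ?)",
                [(p["name"], p["price"]) for p in products],
            )


def count_products(db_path):
    """Return the number of rows currently stored."""
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]


with tempfile.TemporaryDirectory() as tmp:
    db = os.path.join(tmp, "products.db")
    save_products(db, [{"name": "Widget", "price": "9.90"}, {"name": "Gadget", "price": "19.90"}])
    try:
        # The second row violates NOT NULL, so the whole batch is rolled back
        save_products(db, [{"name": "Gizmo", "price": "5.00"}, {"name": None, "price": "1.00"}])
    except sqlite3.IntegrityError as e:
        print(f"Insert failed: {e}")
    print(count_products(db))  # only the first batch was stored
```

Note that a sqlite3 connection used as a context manager handles only the transaction; it does not close the connection, which is why closing() wraps it here.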

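III. Summary

A well-designed exception-handling mechanism makes a crawler markedly more stable and reliable. The main strategies are:

- Catch exceptions with try-except.
- Log exception details.
- Retry on network exceptions.
- Handle different kinds of exceptions by category.
- Release resources in a finally block.

In practice, adjust these strategies to the crawler's specific requirements and the characteristics of the target site so that it keeps running reliably in complex environments.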
