Python crawler: scraping all kinds of navigation sites and categorizing them

xiaotuzi1 · posted 2020-7-28 18:43

Last edited by xiaotuzi1 on 2020-7-28 22:40

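I recently needed a set of websites that were already sorted into categories, so I thought of using a crawler to scrape various navigation (link directory) sites. First, the technical approach.
It uses Scrapy, a commonly used Python crawling framework.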

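As the figure shows, the project basically consists of four core parts: items.py defines the data types, the pipeline processes the data returned by the spiders and writes it to the database, the spiders directory holds the individual spiders (an example follows), and begin.py starts the crawl.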
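
To make the pipeline's role a bit more concrete, here is a minimal sketch of what such a pipeline could look like. It is not the code behind this post: the class name SqlitePipeline, the table name sites, and the choice of sqlite3 are all assumptions for illustration, since the post does not say which database is used. The field names are the ones the spider in this post fills in. Scrapy calls open_spider, process_item and close_spider on a pipeline when they are defined.

```python
import sqlite3

# Fields the spider in this post sets on each item.
FIELDS = ("name", "url", "category1", "category2", "icon",
          "crawl_time", "content", "origin_website")


class SqlitePipeline:
    """Hypothetical pipeline: stores each scraped site in a SQLite table."""

    def __init__(self, db_path=":memory:"):
        # Assumption: a local SQLite file, or an in-memory database by default.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called by Scrapy once when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        columns = ", ".join(f"{field} TEXT" for field in FIELDS)
        self.conn.execute(f"CREATE TABLE IF NOT EXISTS sites ({columns})")

    def process_item(self, item, spider):
        # Called by Scrapy for every item the spider yields.
        if item.get("stop"):
            # The spider's end-of-page marker carries no site data, so skip it.
            return item
        values = [item.get(field) for field in FIELDS]
        placeholders = ", ".join("?" for _ in FIELDS)
        self.conn.execute(f"INSERT INTO sites VALUES ({placeholders})", values)
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called by Scrapy once when the spider finishes.
        self.conn.close()
```

In a real project a class like this would be switched on by listing it under ITEM_PIPELINES in settings.py.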

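Here I mainly cover the spiders themselves. There are plenty of tutorials for everything else, so I'll leave those for you to study on your own.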

Take crawling the site 好用好网 (haoyonghaowan.com) as an example:

[Python]

# coding=utf-8
'''
Created on 2018-3-27

@author: haoning
'''

import sys
from urllib.parse import urlparse

import scrapy
from lxml import etree

# Project-local helpers from the author's util module.
from util import get_absolute_path
from util import now_time
from util import convert_mg_url_to_base64
from util import lock
from util import unlock

# Make the project root importable before importing the item class.
sys.path.append(get_absolute_path())
from crawl.scrapy_crawl.items import ScrapyCrawlItem


class BookmarkSpider(scrapy.Spider):
    name = "haoyonghaowan"
    allowed_domains = ["haoyonghaowan.com"]

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own setup first.
        super().__init__(*args, **kwargs)
        self.base = "http://www.haoyonghaowan.com"
        self.request_url = self.base
        self.urls = []

    def start_requests(self):
        self.urls.append(self.request_url)
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse_link_page)

    def parse_link_page(self, response):
        # Collect the links to the category pages from the home page.
        parts = response.xpath('//div[contains(@class,"sitelist")]/ul/li/a/@href').extract()
        for url in parts:
            # urljoin also copes with relative hrefs.
            yield scrapy.Request(url=response.urljoin(url), callback=self.parse_item)

    def parse_item(self, response):
        # Each <a> under the site list is one bookmarked site.
        items = response.xpath('//article[contains(@class,"sitelist")]/div/ul/li/a').extract()
        for it in items:
            try:
                lock()
                item = ScrapyCrawlItem()
                item['category2'] = self.parse_category2(response)
                item['category1'] = self.parse_category1(response)

                # Re-parse the extracted <a> fragment so it can be queried on its own.
                it = etree.fromstring(it)
                item['name'] = self.parse_name(it)
                item['url'] = self.parse_url(it)
                if item['url']:
                    item['icon'] = self.parse_icon(item['url'])
                    item['crawl_time'] = self.parse_crawl_time(it)
                    item['content'] = self.parse_content(it)
                    item['origin_website'] = self.base
                    item['stop'] = False

                    print(item)

                    # Only pass on complete entries that point at an http(s) URL.
                    if item['name'] and item['url'] and item['category2'] and ("http" in item['url']):
                        yield item

            except Exception as e:
                print("error while parsing item:", e)
            finally:
                unlock()

        # Marker item telling the pipeline that this page is finished.
        item = ScrapyCrawlItem()
        item['stop'] = True
        yield item

    def parse_name(self, it):
        name = it.xpath('.//text()')
        if name:
            return name[0]
        return None

    def parse_url(self, it):
        href = it.xpath('.//@href')
        if href:
            return href[0]
        return None

    def parse_category2(self, response):
        name = response.xpath('//article[contains(@class,"sitelist")]/header/h2/text()').extract()
        if name:
            # Strip all whitespace from the category heading.
            return name[0].replace('\n', '').replace('\t', '').replace(' ', '').replace('\r', '')
        return None

    def parse_category1(self, response):
        return None

    def parse_icon(self, link):
        try:
            # Fetch the favicon for the site's domain and return it base64-encoded.
            url = "http://www.google.cn/s2/favicons?domain=" + urlparse(link).netloc
            return convert_mg_url_to_base64(url)
        except Exception as e:
            print("error while fetching icon:", e)
            return None

    def parse_crawl_time(self, response):
        return now_time()

    def parse_content(self, it):
        return None


if __name__ == '__main__':
    # Quick smoke test: build the spider and print the current timestamp.
    spider = BookmarkSpider()
    print(spider.name, now_time())

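As you can see, the core of the program is the parse_item(self, response) function. It parses the page with XPath, then returns the parsed content as items to the pipeline mentioned above, which processes them and prepares them for storage in the database.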
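
The extraction step in parse_item can also be tried out in isolation. The sketch below is only an illustration: the HTML fragment and the site names in it are made up, and it uses the standard library's xml.etree.ElementTree (which supports a small XPath subset and needs well-formed markup) rather than Scrapy's selectors. Because of that, the exact-match predicate [@class='sitelist'] stands in for the contains(@class,"sitelist") used in the spider.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page fragment shaped like the structure the spider expects.
PAGE = """
<html><body>
  <article class="sitelist">
    <header><h2> Design Tools </h2></header>
    <div><ul>
      <li><a href="http://example.com/a">Site A</a></li>
      <li><a href="/relative/b">Site B</a></li>
      <li><a href="https://example.org/c">Site C</a></li>
    </ul></div>
  </article>
</body></html>
"""


def extract_sites(page):
    """Return (category, [(name, url), ...]) for one category page."""
    root = ET.fromstring(page)
    article = root.find(".//article[@class='sitelist']")
    # The spider removes all whitespace from the heading; here only the edges are trimmed.
    category = article.findtext("header/h2", default="").strip()
    sites = []
    for a in article.findall("div/ul/li/a"):
        name, url = a.text, a.get("href")
        # Same filter as the spider: keep entries with a name and "http" in the URL.
        if name and url and "http" in url:
            sites.append((name, url))
    return category, sites


category, sites = extract_sites(PAGE)
print(category, sites)
```

The relative link is dropped by the "http" check, which mirrors what the spider's final if condition does before yielding an item.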

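The database is shown in the figure above. The code as a whole is very simple, so I won't embarrass myself any further in front of the experts here.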

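I also recently came across a nice site, 书签地球 (Bookmark Earth): https://www.bookmarkearth.com/. Its content is very neatly organized, but every crawl attempt returns a 503 error, and I haven't found a good way around that yet.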
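
A 503 on every request often means the server is rejecting or throttling automated clients, though the post does not say what the cause is in this case. As a starting point only, the sketch below lists a few real Scrapy settings that are commonly adjusted in that situation. The dict name, the values, and the User-Agent string are assumptions, and none of this is confirmed to work against bookmarkearth.com. If the site relies on a JavaScript challenge, these settings alone will not get past it.

```python
# Hypothetical per-spider overrides. In Scrapy this dict would be assigned to the
# spider's `custom_settings` class attribute or merged into settings.py.
POLITE_SETTINGS = {
    # Send a browser-like User-Agent instead of Scrapy's default (assumed value).
    "USER_AGENT": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/84.0 Safari/537.36"
    ),
    # Respect the site's robots.txt rules.
    "ROBOTSTXT_OBEY": True,
    # Wait between requests and avoid parallel hits on the same domain.
    "DOWNLOAD_DELAY": 2,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    # Let Scrapy adapt the delay to the server's response times.
    "AUTOTHROTTLE_ENABLED": True,
    # Retry a few times on 503 (already part of Scrapy's default retry codes,
    # listed here only to make the intent explicit).
    "RETRY_ENABLED": True,
    "RETRY_TIMES": 3,
    "RETRY_HTTP_CODES": [503],
}
```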

I'm still working on cracking this problem. That's all for now; I hope you all like my post.

寒雨孤夜 · posted 2020-7-29 09:00

Thanks, studying this. I'm just learning to write crawlers.
