1. Applicant ID: ykallan
2. Email: 815583442@qq.com
3. Original technical article: (for the article standard, please refer to the forum's featured posts)
My main areas of interest are web crawling, machine learning, and deep learning. I hope this application is approved so that I can register as a full member.
This post describes using the Scrapy framework to crawl exhibition information from 中国展会网 (China Exhibition Network).
中国展会网: http://www.china-show.net
Starting from a list page, the spider follows each exhibition's link to its detail page, then finds the "next page" link on the list page and requests it; these two loops together crawl all the required exhibition information and save it.
The key code is shown below.
Spider file:
[Python]
# -*- coding: utf-8 -*-
# 中国展会网 (china-show.net)
import scrapy

from ..items import ZhanhuiItem


class ZhSpider(scrapy.Spider):
    name = 'zh'
    start_uuu = 'http://www.china-show.net/exhibit/search-htm-page-1-kw--fields-0-fromdate-20200601-todate--catid-0-process-0-order-0-x-49-y-17.html'
    base_url = 'http://www.china-show.net'

    def __init__(self):
        super(ZhSpider, self).__init__()
        # Optionally start from a user-supplied list-page URL; press Enter to use the default
        self.start_req_url = input('Enter the list-page URL to crawl (press Enter to use the default): ')

    def start_requests(self):
        if len(self.start_req_url) < 10:
            yield scrapy.Request(url=self.start_uuu, callback=self.parse)
        else:
            yield scrapy.Request(url=self.start_req_url, callback=self.parse)

    def parse(self, response):
        # Collect the detail-page links and the current page number from the list page
        url_list = response.xpath('//td[@align="left"]/ul/li/a/@href').extract()
        page = response.xpath('//div[@class="pages"]/strong/text()').extract_first()
        for url in url_list:
            meta = {'page': page}
            yield scrapy.Request(url=url, callback=self.parse_detail, meta=meta)
        # Follow the 下一页 ("next page") link until it no longer exists
        next_page = response.xpath('//a[@title="下一页"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(url=self.base_url + next_page, callback=self.parse)

    def parse_detail(self, response):
        item = ZhanhuiItem()
        page = response.meta['page']
        title = response.xpath('//h1[@class="title"]/text()').extract_first()
        data = response.xpath('//table/tr[1]/td[2]/text()').extract_first()
        loc = response.xpath('//table/tr[2]/td[2]/text()').extract_first()
        address = response.xpath('//table/tr[3]/td[2]/text()').extract_first()
        name = response.xpath('//table/tr[4]/td[2]/text()').extract_first()
        host = response.xpath('//table/tr[5]/td[2]/text()').extract_first()
        contents = response.xpath('//div[@class="pd10 lh18 px13"]//*/text()').extract()
        content = ''
        for con in contents:
            if con != '\r\n':
                # Strip line breaks and spaces before appending to the description
                con = con.replace('\r', '').replace('\n', '').replace(' ', '')
                if len(con) != 0:
                    content += con
        item['title'] = title
        item['data'] = data
        item['loc'] = loc
        item['address'] = address
        item['name'] = name
        item['host'] = host
        item['content'] = content
        yield item
        print('Item from page', page)
The items file defines the fields to be saved:
[Python]
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class ZhanhuiItem(scrapy.Item):
    data = scrapy.Field()
    loc = scrapy.Field()
    address = scrapy.Field()
    name = scrapy.Field()
    host = scrapy.Field()
    content = scrapy.Field()
    title = scrapy.Field()
The pipeline file saves the crawled results to a CSV file:
[Python]
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import csv


class ZhanhuiPipeline(object):
    def __init__(self):
        # Open the file in append mode; newline="" stops the csv module from adding blank lines
        self.f = open("zhanhuixinxi.csv", "a", newline="", encoding='utf-8')
        # The column names must match the item field names yielded by the spider
        self.fieldnames = ['title', 'name', 'data', 'host', 'address', 'loc', 'content']
        # Use a csv dict writer: argument 1 is the file object, argument 2 the field names
        self.writer = csv.DictWriter(self.f, fieldnames=self.fieldnames)
        # Write the header row; it only needs to be written once, so it lives in __init__
        self.writer.writeheader()

    def process_item(self, item, spider):
        # Write the values passed in from the spider as one CSV row
        self.writer.writerow(item)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider when the spider finishes; close the file here
        self.f.close()
        # csv_file = pd.read_csv('zhanhuixinxi.csv', encoding='utf-8')
        # csv_file.to_excel('MyData.xlsx', sheet_name='data')
        # print('converted the csv file to xlsx')
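The commented-out lines at the end of the pipeline hint at converting the CSV to an Excel file after the crawl. A minimal standalone sketch of that step, assuming pandas and openpyxl are installed and using the same file names as above:
[Python]
# convert_to_xlsx.py (sketch; assumes pandas and openpyxl are installed)
import pandas as pd

# Read the CSV written by the pipeline and save it as an .xlsx workbook
csv_file = pd.read_csv('zhanhuixinxi.csv', encoding='utf-8')
csv_file.to_excel('MyData.xlsx', sheet_name='data', index=False)
print('converted zhanhuixinxi.csv to MyData.xlsx')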
In the settings file, disable robots.txt compliance and enable the pipeline, and the crawl will run normally. This site does not limit request rates, so no download delay is configured here; for other sites you should add one.
If you want features such as resuming an interrupted crawl, run the spider with a job directory specified on the command line:
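For reference, a minimal sketch of the relevant settings.py entries; the pipeline's module path assumes the Scrapy project is named zhanhui, so adjust it to your actual project name:
[Python]
# settings.py (sketch; module path assumes a project named "zhanhui")
ROBOTSTXT_OBEY = False  # do not follow robots.txt

# Enable the CSV pipeline defined above
ITEM_PIPELINES = {
    'zhanhui.pipelines.ZhanhuiPipeline': 300,
}

# This site does not throttle requests, so no delay is set;
# for other sites consider something like:
# DOWNLOAD_DELAY = 1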
scrapy crawl zh -s JOBDIR=./job
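With JOBDIR set, Scrapy persists the scheduler queue and seen-request fingerprints in the ./job directory; if the crawl is stopped gracefully (a single Ctrl-C), running the same command again resumes from where it left off.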