逸少凌仙 posted on 2020-3-28 00:40

Crawling nationwide residential community data with Python Scrapy and saving it to MySQL and Excel

This is my first project as a beginner, so corrections from the experts are very welcome!

### Goal
The target site is loupan.com (楼盘网). Since only community data is needed, the analysis starts from the Shenzhen community listing (http://sz.loupan.com/community/) and then fans out to the whole country.

The fields collected are: province, city, district, community name, community URL, detailed address, coordinates, transport, price, property type, property fee, floor area, number of households, year of completion, parking spaces, plot ratio, greening rate, property management company, and developer.

The data is saved to Excel and MySQL; it could just as well go to MongoDB, depending on your needs.

#### Field list
![Field list](https://img-blog.csdnimg.cn/20200327155924261.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTI0MjQzMTM=,size_16,color_FFFFFF,t_70)
### Results
#### MySQL table:
![在这里插入图片描述](https://img-blog.csdnimg.cn/20200327222727500.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTI0MjQzMTM=,size_16,color_FFFFFF,t_70)
#### Excel sheet:
![在这里插入图片描述](https://img-blog.csdnimg.cn/20200327222900282.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTI0MjQzMTM=,size_16,color_FFFFFF,t_70)
### Environment
OS: Windows 10, 64-bit
Python: Python 3.7.5 (tags/v3.7.5:5c02a39a0b, Oct 15 2019, 00:11:34) on win32
IDE: JetBrains PyCharm Community Edition 2019.2.5 x64
Scrapy version: ![Scrapy version](https://img-blog.csdnimg.cn/20200327165352914.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTI0MjQzMTM=,size_16,color_FFFFFF,t_70)

### Analysis
To keep responsibilities cleanly separated and crawl efficiently, this project uses Python's Scrapy framework. Scrapy ships with a dupefilters.py deduplication filter, so repeated crawling of the same URL is not a concern. Scrapy's runtime flow is roughly as follows:

When Scrapy runs, a request goes through roughly these steps (a minimal code sketch follows this list):

1. The spider hands the URLs to be requested (Requests) to the scheduler via the engine.
2. After queueing and sorting, requests pass through the engine and the downloader middlewares (User-Agent, proxies and so on) to the downloader.
3. The downloader sends the requests to the internet, receives the responses, and passes them back through the engine to the spiders.
4. The spiders process the responses, extract the data, and send it through the engine to the item pipelines for saving.
5. Newly extracted URLs go back through the engine to the scheduler for the next cycle, until there are no more requests and the program stops.
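To make the flow concrete, here is a minimal spider sketch (the class name, seed URL and selector are placeholders of mine, not this project's real spider): the spider yields Requests, the engine and scheduler route them to the downloader, and parsed items travel on to the pipelines.

```python
import scrapy


class FlowDemoSpider(scrapy.Spider):
    """Placeholder spider that just illustrates the request/item flow above."""
    name = 'flow_demo'
    start_urls = ['http://sz.loupan.com/community/']  # seed request handed to the scheduler (step 1)

    def parse(self, response):
        # The downloader fetched the page; the engine handed the response back here (step 3).
        for href in response.css('a::attr(href)').getall():
            # New requests go back through the engine to the scheduler for the next cycle (step 5).
            yield scrapy.Request(response.urljoin(href), callback=self.parse)
        # Extracted data goes through the engine to the item pipelines (step 4).
        yield {'url': response.url}
```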

First, create the Scrapy project:
![Project structure](https://img-blog.csdnimg.cn/20200327161036974.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTI0MjQzMTM=,size_16,color_FFFFFF,t_70)
### Crawling the cities
First, collect the links of all cities from the page:
![在这里插入图片描述](https://img-blog.csdnimg.cn/20200327161409575.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTI0MjQzMTM=,size_16,color_FFFFFF,t_70)
Then all cities can be crawled concurrently. Looking at the page structure, the city links cannot be scraped from the static HTML alone: they are loaded dynamically with JavaScript. In that case we simply grab the JSON data they come from, as shown below:
![在这里插入图片描述](https://img-blog.csdnimg.cn/20200327161911102.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTI0MjQzMTM=,size_16,color_FFFFFF,t_70)
Right-click to get the URL: http://sz.loupan.com/index.php/jsdata/common?_=1579245949843. This returns a JSON-style listing containing the links of all cities, but it also includes many irrelevant URLs, so the data has to be cleaned first.
![在这里插入图片描述](https://img-blog.csdnimg.cn/20200327162148125.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTI0MjQzMTM=,size_16,color_FFFFFF,t_70)
Here I use requests to fetch the endpoint, a regular expression to pull out the links, and a loop to filter out the unwanted ones (the spider code below does exactly this).
For example, Shenzhen: http://sz.loupan.com/
With all city links in hand, it is time to start crawling the communities themselves.

### Crawling the communities
On http://sz.loupan.com/community/ each page lists 25 communities and there are only 100 pages in total,
which means that simply paging through in the traditional way would yield at most 2,500 records.
![在这里插入图片描述](https://img-blog.csdnimg.cn/20200327223347426.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTI0MjQzMTM=,size_16,color_FFFFFF,t_70)
However, the page itself shows that Shenzhen actually has 10,243 communities:
![在这里插入图片描述](https://img-blog.csdnimg.cn/20200327223451645.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTI0MjQzMTM=,size_16,color_FFFFFF,t_70)
So I changed approach: open any community page at random, scroll down, and there is a "nearby communities" recommendation block:
![在这里插入图片描述](https://img-blog.csdnimg.cn/20200327223715502.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTI0MjQzMTM=,size_16,color_FFFFFF,t_70)
We can rely on this block and keep looping outward until every community has eventually been crawled. Many crawlers work this way: crawling all Sina Weibo users or all Zhihu users, for example, can start from one big account and keep expanding through its followers and followees until the whole user base is covered.

On http://sz.loupan.com/community, press F12 to inspect the first list page; the 25 community links on it are easy to extract. We then visit each of those 25 communities and, inside each one, walk through its nearby communities in turn.
Now every community can eventually be reached. But what about duplicates? Scrapy ships with the dupefilters.py deduplication filter, so repeated requests are not a problem.

#### How Scrapy deduplication works

1. Scrapy ships with this mechanism out of the box;
2. in the Scrapy source you can find a dupefilters.py deduplication filter;
3. deduplication is enabled by default, because a Request's dont_filter argument defaults to False; setting dont_filter=True bypasses the filter for that request;
4. for every request, the scheduler derives a hashed fingerprint from the request's details and compares it against the fingerprints already stored in a set(); if the fingerprint is already in the set, the Request is not put into the queue again, as sketched in code right after this list;
5. if the fingerprint is not yet in the set, the Request object is put into the queue and waits to be scheduled.
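The fingerprint idea can be sketched in a few lines of plain Python (this is my own toy version built on hashlib, not Scrapy's actual RFPDupeFilter code): every request is reduced to a hash, and a set of already-seen hashes decides whether it gets scheduled again.

```python
import hashlib


class SimpleDupeFilter:
    """Toy version of the fingerprint-and-set idea behind Scrapy's RFPDupeFilter."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, method, url, body=b''):
        # Reduce the request to a stable fingerprint (Scrapy also canonicalizes the URL).
        fp = hashlib.sha1()
        fp.update(method.encode())
        fp.update(url.encode())
        fp.update(body)
        digest = fp.hexdigest()
        if digest in self.fingerprints:
            return True                     # seen before: drop the request
        self.fingerprints.add(digest)
        return False                        # new request: let it through


df = SimpleDupeFilter()
print(df.request_seen('GET', 'http://sz.loupan.com/community/123.html'))  # False, first time
print(df.request_seen('GET', 'http://sz.loupan.com/community/123.html'))  # True, duplicate
```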

![在这里插入图片描述](https://img-blog.csdnimg.cn/20200327163120985.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTI0MjQzMTM=,size_16,color_FFFFFF,t_70)
Clearly, a community URL is simply the city URL + community + the community's ID ("community" here corresponds to 小区).

```python
import re

import requests
import scrapy
from pyquery import PyQuery as pq
from scrapy import Request

from xiaoqu.items import XiaoquItem


class XiaoquSpiderSpider(scrapy.Spider):
    name = 'xiaoqu_spider'
    # Fetch all city links from the JS data endpoint
    url = 'http://sz.loupan.com/index.php/jsdata/common?_=1579245949843'
    response = requests.get(url).text
    urls = list(set(re.findall(r'http://\w+?\.loupan\.com', response)))
    # Drop the non-city domains from the list
    url_delete = (
        'http://app.loupan.com', 'http://www.loupan.com',
        'http://public.loupan.com', 'http://user.loupan.com')
    urls = [url for url in urls if url not in url_delete]
```

Here I use the PyQuery class from the pyquery library to extract the community links.
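A minimal sketch of that step (the CSS selector is a placeholder of mine; the real class names depend on the list page's markup, so inspect it with F12 and substitute your own):

```python
import requests
from pyquery import PyQuery as pq

# Placeholder selector: replace '.listBox .name a' with the real one from the page source.
html = requests.get('http://sz.loupan.com/community/',
                    headers={'User-Agent': 'Mozilla/5.0'}).text
doc = pq(html)
for a in doc('.listBox .name a').items():   # iterate the 25 community links on one list page
    print(a.attr('href'))
```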
Once the links are collected, the next step is to analyze a community's detail page.
![在这里插入图片描述](https://img-blog.csdnimg.cn/20200327163523549.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTI0MjQzMTM=,size_16,color_FFFFFF,t_70)
The same way: press F12 to inspect the page source, then extract the community's detail fields.

```python
    def parse(self, response):
        doc = pq(response.text)
        item = XiaoquItem()

        url = doc('.pos > a:nth-child(4)').attr('href')   # community detail-page link
        item['url'] = url

        name = doc('.t p').text()                         # community name
        item['name'] = name

        # The page only gives a rough address; the geocoding API below resolves it
        # into province / city / district plus coordinates.
        addres = doc('.text_nr.bug2').text()              # rough community address
        citys = doc('.pos > a:nth-child(2)').text()
        city = ''.join(re.findall(r'(\w+)小区', citys)) + '市'   # e.g. turn '深圳小区' into '深圳市'
        districts = doc('span.font_col_o > a').text()     # district
        address = city + districts + addres + name        # address string sent to the API
        # Pass the address to the API to resolve province / city / district
        location = self.location(address)
        item['coord'] = location['coord']                 # longitude,latitude
        province = location['province']
        item['province'] = province
        city = location['city']
        item['city'] = city
        district = location['district']
        item['district'] = district
        item['detail_address'] = province + city + district + addres + name   # detailed address

        id = ''.join(re.findall(r'\d+', url))
        # Surroundings page (note: hard-coded to the sz subdomain)
        around_url = 'http://sz.loupan.com/community/around/' + id + '.html'
        around_doc = pq(requests.get(around_url).text)
        traffic = around_doc('.trend > p:nth-child(7)').text()   # transport
        item['traffic'] = traffic.replace('m', 'm,')             # add a comma after every 'm' to separate entries

        prices = doc('div.price > span.dj').text()         # reference price
        if prices == '暂无数据':
            item['price'] = None
        else:
            item['price'] = int(prices)

        item['property_type'] = doc('ul > li:nth-child(1) > span.text_nr').text()   # property type

        property_fees = doc('ul > li:nth-child(2) > span.text_nr').text()           # property fee
        if property_fees == '暂无数据':
            item['property_fee'] = None
        else:
            item['property_fee'] = float(''.join(re.findall(r'\d*\.\d*', property_fees)))

        areas = doc('ul > li:nth-child(3) > span.text_nr').text()                   # total floor area
        if areas == '暂无数据':
            item['area'] = None
        else:
            item['area'] = int(''.join(re.findall(r'\d*', areas)))

        house_counts = doc('ul > li:nth-child(4) > span.text_nr').text()            # number of households
        if house_counts in ('暂无数据', ''):
            item['house_count'] = None
        else:
            item['house_count'] = int(''.join(re.findall(r'\d*', house_counts)))

        completion_times = doc('ul > li:nth-child(5) > span.text_nr').text()        # year of completion
        if completion_times in ('暂无数据', '', None):
            item['completion_time'] = None
        else:
            item['completion_time'] = int(''.join(re.findall(r'\d*', completion_times)))

        item['parking_count'] = doc('ul > li:nth-child(6) > span.text_nr').text()   # parking spaces

        plot_ratios = doc('ul > li:nth-child(7) > span.text_nr').text()             # plot ratio
        if plot_ratios in ('暂无数据', ''):
            item['plot_ratio'] = None
        else:
            item['plot_ratio'] = float(''.join(re.findall(r'\d*\.\d*', plot_ratios)))

        greening_rates = doc('ul > li:nth-child(8) > span.text_nr').text()          # greening rate
        if greening_rates == '暂无数据':
            item['greening_rate'] = None
        else:
            item['greening_rate'] = ''.join(re.findall(r'\d*\.\d*%', greening_rates))

        item['property_company'] = doc('div.ps > p:nth-child(1) > span.text_nr').text()   # property management company
        item['developers'] = doc('div.ps > p:nth-child(2) > span.text_nr').text()         # developer
        yield item

        # Follow the "nearby communities" recommendations to keep expanding the crawl
        lis = doc('body > div.pages > div.main.esf_xq > div > div.main > div.tj_esf > ul > li')
        for li in pq(lis).items():
            url = li('div.text > a').attr('href')
            yield Request(url=url, callback=self.parse)
```
Because the data goes into MySQL, values that read '暂无数据' (no data) are converted here to NULL (None) or an empty string.
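That '暂无数据' handling could also be factored into one small helper; a sketch (the function name and defaults are my own, not part of the original project):

```python
import re


def clean_number(raw, pattern=r'\d+', cast=int):
    """Return None for '暂无数据'/empty values, otherwise the joined matches cast to a number."""
    if raw in ('暂无数据', '', None):
        return None
    matches = re.findall(pattern, raw)
    return cast(''.join(matches)) if matches else None


# e.g. item['house_count'] = clean_number(house_counts)
#      item['plot_ratio'] = clean_number(plot_ratios, r'\d*\.\d*', float)
```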

The scraped data is stored via the Item model; this is items.py:

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class XiaoquItem(scrapy.Item):
    collection = table = 'community'
    province = scrapy.Field()            # province
    city = scrapy.Field()                # city
    district = scrapy.Field()            # district
    name = scrapy.Field()                # community name
    url = scrapy.Field()                 # community URL
    detail_address = scrapy.Field()      # detailed address
    coord = scrapy.Field()               # coordinates (longitude,latitude)
    traffic = scrapy.Field()             # nearby transport
    price = scrapy.Field()               # reference price
    property_type = scrapy.Field()       # property type
    property_fee = scrapy.Field()        # property fee
    area = scrapy.Field()                # total floor area
    house_count = scrapy.Field()         # number of households
    completion_time = scrapy.Field()     # year of completion
    parking_count = scrapy.Field()       # parking spaces
    plot_ratio = scrapy.Field()          # plot ratio
    greening_rate = scrapy.Field()       # greening rate
    property_company = scrapy.Field()    # property management company
    developers = scrapy.Field()          # developer
```
One tricky point is getting the coordinates. The page does not show them, so they have to come from an API call: pass in the detailed address and get back longitude and latitude. I use the Amap (Gaode) Maps API here; Baidu's was hopeless for me, it kept dropping pins into the sea...

```python
    # Call the Amap (Gaode) geocoding API to get coordinates and the normalized address
    def location(self, detail_address):
        url = 'https://restapi.amap.com/v3/geocode/geo?address=' + detail_address + '&key=你的key码'
        response = requests.get(url).json()
        geocodes = response['geocodes']
        for geocode in geocodes:
            coord = geocode['location']          # "longitude,latitude"
            province = geocode['province']
            city = geocode['city']
            district = geocode['district']
            # Return the first geocode result
            return {'coord': coord, 'province': province, 'city': city, 'district': district}

```
You can register a developer account on the Amap open platform and get it verified; the free quota is 300,000 calls per day, which is more than enough.
![在这里插入图片描述](https://img-blog.csdnimg.cn/20200327164637170.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTI0MjQzMTM=,size_16,color_FFFFFF,t_70)
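One caveat worth handling: if Amap cannot resolve an address, geocodes comes back empty, location() falls through and returns None, and parse() then raises. A defensive sketch as a standalone function (geocode and AMAP_KEY are my own names, not the project's):

```python
import requests

AMAP_KEY = '你的key码'  # same placeholder key as above


def geocode(detail_address):
    """Like location(), but returns empty fields instead of None when Amap finds nothing."""
    url = 'https://restapi.amap.com/v3/geocode/geo'
    response = requests.get(url, params={'address': detail_address, 'key': AMAP_KEY}).json()
    geocodes = response.get('geocodes') or []
    if not geocodes:
        # Keep the item buildable even when geocoding fails
        return {'coord': '', 'province': '', 'city': '', 'district': ''}
    first = geocodes[0]
    return {'coord': first['location'], 'province': first['province'],
            'city': first['city'], 'district': first['district']}
```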
With that, all of the community fields have been covered.

### Saving the data
Saving is very convenient in Scrapy: everything goes through the item pipelines, where data can be both cleaned and stored. I save to Excel and MySQL; pipelines.py:

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

from openpyxl import Workbook
import pymysql


class XiaoquPipeline(object):
    def process_item(self, item, spider):
        return item


class ExcelPipeline(object):
    def __init__(self):
        self.wb = Workbook()
        self.ws = self.wb.active
        self.ws.append(['省', '市', '区', '小区名', '小区详情页链接', '详细地址', '经纬度', '交通',
                        '参考价格', '物业类型', '物业费', '总建面积', '总户数', '竣工时间', '停车位', '容积率', '绿化率',
                        '物业公司', '开发商'])

    def process_item(self, item, spider):
        line = [item['province'], item['city'], item['district'], item['name'], item['url'],
                item['detail_address'], item['coord'], item['traffic'], item['price'],
                item['property_type'], item['property_fee'], item['area'], item['house_count'],
                item['completion_time'], item['parking_count'], item['plot_ratio'],
                item['greening_rate'], item['property_company'], item['developers']]
        self.ws.append(line)
        # Note: the workbook is rewritten after every item, which is simple but slow
        self.wb.save('../小区' + '.xlsx')
        return item


class MysqlPipeline():
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT'),
        )

    def open_spider(self, spider):
        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                  database=self.database, charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        data = dict(item)
        keys = ', '.join(data.keys())
        values = ', '.join(['%s'] * len(data))
        sql = 'insert into %s (%s) values (%s)' % (item.table, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item

```
openpyxl writes the Excel file and pymysql writes to MySQL. The Excel pipeline automatically creates a workbook in the directory one level above the project and appends the data to it.
MySQL needs a running server plus an account and password; those settings live in settings.py.
I save into the community table of a database that is also named community, so when you run it, either create the same names or change them to your own database.
Before saving to MySQL, the table has to be created first:

```sql
CREATE TABLE `community`.`community`(
`id` int(11) UNSIGNED NOT NULL AUTO_INCREMENT,
`province` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT '省',
`city` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT '所属市',
`district` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT '所属区',
`name` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT '小区名',
`url` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT '小区链接',
`detail_address` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT '详细地址',
`coord` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT '经纬度',
`traffic` varchar(555) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT '交通',
`price` int(10) NULL DEFAULT NULL COMMENT '价格',
`property_type` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT '物业类型',
`property_fee` decimal(10, 2) NULL DEFAULT NULL COMMENT '物业价格',
`area` int(20) NULL DEFAULT NULL COMMENT '面积',
`house_count` int(10) NULL DEFAULT NULL COMMENT '户数',
`completion_time` int(4) NULL DEFAULT NULL COMMENT '竣工时间',
`parking_count` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT '车位数',
`plot_ratio` decimal(10, 2) NULL DEFAULT NULL COMMENT '容积率',
`greening_rate` varchar(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT '绿化率',
`property_company` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT '物业公司',
`developers` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT '开发商',
`create_time` datetime(0) NULL DEFAULT CURRENT_TIMESTAMP(0) ON UPDATE CURRENT_TIMESTAMP(0) COMMENT '插入时间',
PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 3080 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;
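The goal section mentioned MongoDB as an alternative sink. A minimal pipeline sketch with pymongo (my own addition, not part of this project; MONGO_URI and MONGO_DB would be settings you add to settings.py yourself):

```python
import pymongo


class MongoPipeline(object):
    """Hypothetical MongoDB pipeline; schema-free, so no CREATE TABLE step is needed."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DB', 'community'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # item.collection is 'community', as defined on XiaoquItem
        self.db[item.collection].insert_one(dict(item))
        return item
```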
```
### Settings
Almost everything is ready; the last piece is settings.py and how to configure it.
First, ROBOTSTXT_OBEY. What is that?

It refers to the robots exclusion protocol (robots.txt), which limits what a crawler is allowed to fetch. The rules are usually written in a robots.txt file stored on the website's server, and a well-behaved crawler checks this file before visiting the site. In a Scrapy project, settings.py defaults to ROBOTSTXT_OBEY = True, meaning the protocol is respected; if you want to crawl content the protocol does not allow, set ROBOTSTXT_OBEY = False.
We are a crawler, so this has to be False, otherwise nothing gets crawled.

Next is DOWNLOADER_MIDDLEWARES. It is commented out by default and has to be uncommented so that our custom downloader middleware is actually used when downloading.

Then comes ITEM_PIPELINES. The keys must match the pipeline classes in pipelines.py, otherwise Scrapy does not know where to send the items for saving.

```python
ITEM_PIPELINES = {
    # 'xiaoqu.pipelines.XiaoquPipeline': 300,
    'xiaoqu.pipelines.ExcelPipeline': 301,
    'xiaoqu.pipelines.MysqlPipeline': 302,
}
```
By default only the first line is there; I add the Excel and MySQL pipelines. If you do not have MySQL installed, simply comment out the MysqlPipeline entry.

My spider also uses a custom headers dict: most sites have anti-crawling measures, and although this one does not seem to yet, I add the headers anyway. The MySQL settings mentioned earlier also go here:

```python
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Cookie': '__customer_trace_id=B8D70080-5CA9-45D9-995E-F3DAB0EC0D1E; PHPSESSID=4qipinajmn8q9pvmkdcfpp3fh1; Hm_lvt_c07a5cf91cdac070faa1e701f45995a8=1577343206,1577347508; AGL_USER_ID=e79f0249-d96f-4a98-837a-398d4bb287b8; Hm_lvt_15e5e51b14c8efd1f1488ea51faa1172=1577347593,1577347761,1577771483,1578894243; loadDomain=http%3A%2F%2Fsz.loupan.com%2F; Hm_lvt_2c0f8f2133c1fc1d09538a565dd8d6c8=1577343206,1577347508,1579223144; Hm_lpvt_15e5e51b14c8efd1f1488ea51faa1172=1579240756; loupan_user_session=a%3A5%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%22622161d9507e199454d3eb844eb97b5d%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A13%3A%22121.15.170.60%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A115%3A%22Mozilla%2F5.0+%28Windows+NT+10.0%3B+Win64%3B+x64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F78.0.3904.108+Safari%2F537.36%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1579241140%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3B%7Dd2512f1f510eeb07c62f3db5e3ee5796; Hm_lpvt_2c0f8f2133c1fc1d09538a565dd8d6c8=1579241158; Hm_lpvt_c07a5cf91cdac070faa1e701f45995a8=1579241159',
    'Host': 'sz.loupan.com',
    'Upgrade-Insecure-Requests': 1,
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
#LOG_LEVEL = 'INFO'
RETRY_ENABLED = False
#DOWNLOAD_TIMEOUT = 10

MYSQL_HOST = 'localhost'
MYSQL_DATABASE = 'community'
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
MYSQL_PORT = 3306

```
With that configured, the project is basically complete. To raise efficiency and crawl faster, I also tune the following:

```python
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100
DOWNLOAD_DELAY = 0
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 100
CONCURRENT_REQUESTS_PER_IP = 100

# Disable cookies (enabled by default)
COOKIES_ENABLED = False
RETRY_ENABLED = False
```
High concurrency, no delay, no retries, and so on; since the site does not push back, none of this causes trouble. The full settings.py:

```python
# -*- coding: utf-8 -*-

# Scrapy settings for xiaoqu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   https://docs.scrapy.org/en/latest/topics/settings.html
#   https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#   https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'xiaoqu'

SPIDER_MODULES = ['xiaoqu.spiders']
NEWSPIDER_MODULE = 'xiaoqu.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 100
CONCURRENT_REQUESTS_PER_IP = 100

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    'xiaoqu.middlewares.XiaoquSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'xiaoqu.middlewares.XiaoquDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 'xiaoqu.pipelines.XiaoquPipeline': 300,
    'xiaoqu.pipelines.ExcelPipeline': 301,
    'xiaoqu.pipelines.MysqlPipeline': 302,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Cookie': '__customer_trace_id=B8D70080-5CA9-45D9-995E-F3DAB0EC0D1E; PHPSESSID=4qipinajmn8q9pvmkdcfpp3fh1; Hm_lvt_c07a5cf91cdac070faa1e701f45995a8=1577343206,1577347508; AGL_USER_ID=e79f0249-d96f-4a98-837a-398d4bb287b8; Hm_lvt_15e5e51b14c8efd1f1488ea51faa1172=1577347593,1577347761,1577771483,1578894243; loadDomain=http%3A%2F%2Fsz.loupan.com%2F; Hm_lvt_2c0f8f2133c1fc1d09538a565dd8d6c8=1577343206,1577347508,1579223144; Hm_lpvt_15e5e51b14c8efd1f1488ea51faa1172=1579240756; loupan_user_session=a%3A5%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%22622161d9507e199454d3eb844eb97b5d%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A13%3A%22121.15.170.60%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A115%3A%22Mozilla%2F5.0+%28Windows+NT+10.0%3B+Win64%3B+x64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F78.0.3904.108+Safari%2F537.36%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1579241140%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3B%7Dd2512f1f510eeb07c62f3db5e3ee5796; Hm_lpvt_2c0f8f2133c1fc1d09538a565dd8d6c8=1579241158; Hm_lpvt_c07a5cf91cdac070faa1e701f45995a8=1579241159',
    'Host': 'sz.loupan.com',
    'Upgrade-Insecure-Requests': 1,
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
#LOG_LEVEL = 'INFO'
RETRY_ENABLED = False
#DOWNLOAD_TIMEOUT = 10

MYSQL_HOST = 'localhost'
MYSQL_DATABASE = 'community'
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
MYSQL_PORT = 3306

```

### Running
In a console, go into the xiaoqu/spiders directory and run scrapy crawl xiaoqu_spider; xiaoqu_spider.py contains the main crawling code.
![在这里插入图片描述](https://img-blog.csdnimg.cn/20200327225508200.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTI0MjQzMTM=,size_16,color_FFFFFF,t_70)
Hit Enter to run:
![在这里插入图片描述](https://img-blog.csdnimg.cn/20200327231909804.gif)
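If you prefer launching it from an IDE instead of the console, a small runner script also works; a sketch (run.py is my own addition, placed at the project root next to scrapy.cfg):

```python
# run.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == '__main__':
    process = CrawlerProcess(get_project_settings())  # loads the project's settings.py
    process.crawl('xiaoqu_spider')                    # the spider's name attribute, not the file name
    process.start()                                   # blocks until the crawl finishes
```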
That basically completes the crawl. There are still various type-conversion issues that I have not dug into, so occasionally a field will raise an error.

Packaged exe: since some of you may need to edit the configuration file, I did not bundle everything into a single exe. Double-click the Spider-Man icon to run it.
![在这里插入图片描述](https://img-blog.csdnimg.cn/20200327235906730.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3UwMTI0MjQzMTM=,size_16,color_FFFFFF,t_70)
Running it looks like this:
![在这里插入图片描述](https://img-blog.csdnimg.cn/20200328001019877.gif)
Path of the configuration file: community_spider\xiaoqu\settings
Source code on GitHub: https://github.com/yishaolingxian/community_spiders
Packaged exe download: https://www.lanzouj.com/iapt67c

rekcard posted on 2020-3-30 16:25

Really detailed and impressive. I started learning Python from zero only yesterday; my goal is to pick up enough to use crawlers someday. Pulling down that much information automatically is just awesome.

铁头张 posted on 2020-3-29 10:54

Very detailed write-up, thanks. I have never used a Scrapy-style program before.

mengqiu posted on 2020-3-29 22:26

ciker_li posted on 2020-3-29 23:04

It makes my head spin a little; saving it to study later.

Justcodes posted on 2020-3-29 23:15

Very well written and detailed! I will study it first and come back later to see the results for Chengdu.

小肆 posted on 2020-3-30 15:33

Why crawl Baoji of all cities?

jydcb003 posted on 2020-3-30 16:38

Passing by as someone who only ever uses requests.

stefaniema posted on 2020-3-31 08:03

So far I only know requests and have been meaning to learn Scrapy. Thanks!

fsrank posted on 2020-3-31 20:40

Very well written, learned a lot, thanks for sharing.