***Getting Started and Hands-On with Scrapy, Python's Most Popular Crawler Framework: Key Source Code***
Last edited by huguo002 on 2019-11-15 09:12.
Writing this down for the record!
Create the Scrapy project:
scrapy startproject douban
1. Debugging the project in the PyCharm IDE
Create a new Python file, entrypoint.py, so the spider can be launched from the IDE:
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'douban'])
Here 'douban' is the spider's name (the name attribute of the spider class)!
2. items.py
Define the fields:
import scrapy

class DoubanItem(scrapy.Item):
    num = scrapy.Field()        # rank number
    name = scrapy.Field()       # movie title
    introduce = scrapy.Field()  # description
    star = scrapy.Field()       # star rating
    appraise = scrapy.Field()   # number of ratings
    survey = scrapy.Field()     # one-line tagline

Import the scrapy framework, then declare each field in the form:
field_name = scrapy.Field()
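Under the hood, a scrapy.Item behaves like a dict whose keys are restricted to the declared fields. A minimal pure-Python sketch of that pattern (an illustration only, not scrapy's actual source):

```python
# Minimal sketch of the declared-fields pattern used by scrapy.Item.
# Pure-Python illustration -- not scrapy's real implementation.
class Field(dict):
    """A marker type; scrapy's Field is likewise just a dict subclass."""

class ItemMeta(type):
    def __new__(mcs, name, bases, attrs):
        # Collect class attributes that are Field instances into cls.fields
        fields = {k: v for k, v in list(attrs.items()) if isinstance(v, Field)}
        for k in fields:
            attrs.pop(k)
        cls = super().__new__(mcs, name, bases, attrs)
        cls.fields = fields
        return cls

class Item(dict, metaclass=ItemMeta):
    def __setitem__(self, key, value):
        # Only declared fields may be assigned
        if key not in self.fields:
            raise KeyError(f"{key!r} is not a declared field")
        super().__setitem__(key, value)

class MovieItem(Item):
    num = Field()
    name = Field()

item = MovieItem()
item['name'] = 'Farewell My Concubine'
print(item)  # {'name': 'Farewell My Concubine'}
```

Assigning to an undeclared key (e.g. `item['bogus']`) raises KeyError, which is exactly how scrapy catches field typos early.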
3. The settings file
settings.py
Crawling Douban requires a proper request header! Enable the User-Agent:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)'
Enable caching for crawl debugging:
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/lates ... middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
What enabling these lines means: with the HTTP cache on, Scrapy caches the requests you make. When the same request is issued again, the cached document is returned instead of hitting the site, which both speeds up local debugging and reduces the load on the website.
Activating the item pipeline
After a pipeline is defined, it must be activated in the configuration before it takes effect, so add it to settings.py:
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}
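The number 300 is the pipeline's priority: values run from 0 to 1000, and pipelines with lower numbers run first. A quick sketch with a second, made-up pipeline name (`HypotheticalCleanPipeline` is invented purely for illustration):

```python
# Lower priority number = runs earlier in the pipeline chain.
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
    'douban.pipelines.HypotheticalCleanPipeline': 200,  # made-up name for illustration
}

# Scrapy applies pipelines in ascending priority order:
order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(order[0])  # douban.pipelines.HypotheticalCleanPipeline runs first
```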
4. The spider file
doub.py
# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem
from bs4 import BeautifulSoup
from scrapy.http import Request

class DoubSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']
    start_urls = ["https://movie.douban.com/top250"]

    def parse(self, response):
        item_list = response.xpath('//ol[@class="grid_view"]/li')
        for item in item_list:
            douban_item = DoubanItem()
            douban_item['num'] = item.xpath('.//div[@class="pic"]/em/text()').extract_first()
            douban_item['name'] = item.xpath('.//div[@class="hd"]/a/span/text()').extract_first()
            introduces = item.xpath('.//div[@class="bd"]/p/text()').extract()
            for introduce in introduces:
                introduce_date = "".join(introduce.split())
                douban_item['introduce'] = introduce_date
            douban_item['star'] = item.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()').extract_first()
            douban_item['appraise'] = item.xpath('.//div[@class="star"]/span/text()').extract_first()
            douban_item['survey'] = item.xpath('.//p[@class="quote"]/span[@class="inq"]/text()').extract_first()
            print(douban_item)
            yield douban_item
        # extract_first() gives the href string (or None); extract() would give a list,
        # which cannot be concatenated into the URL
        next_page = response.xpath('//div[@class="paginator"]/span[@class="next"]/a/@href').extract_first()
        if next_page:
            yield Request(f'https://movie.douban.com/top250{next_page}', callback=self.parse)

    '''Alternative version using BeautifulSoup, kept for reference:
    def parse(self, response):
        paginator_urls = []
        paginator_urls.extend(self.start_urls)
        paginators = BeautifulSoup(response.text, 'lxml').find('div', class_="paginator").find_all('a')[:-1]
        for paginator in paginators:
            paginator = f"https://movie.douban.com/top250{paginator['href']}"
            paginator_urls.append(paginator)
        print(paginator_urls)
        paginator_urls = set(paginator_urls)
        for paginator_url in paginator_urls:
            print(paginator_url)
            yield Request(paginator_url, callback=self.get_content)

    def get_content(self, response):
        thispage = BeautifulSoup(response.text, 'lxml').find('span', class_="thispage").get_text()
        print(thispage)'''
Scrapy's built-in XPath selectors work much like lxml's etree XPath.
Note the difference between .extract() and .extract_first().
The commented-out section grabs the data with bs4 instead; its code and ordering are not polished.
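The difference matters: .extract() always returns a list of all matches, while .extract_first() returns only the first match, or None when nothing matched (so it never raises IndexError). A plain-Python sketch of that behavior:

```python
def extract_first(results, default=None):
    # Mimics Selector.extract_first(): first match, or a default when empty
    return results[0] if results else default

matches = ['9.7', '9.6']       # what .extract() would give: a list of strings
print(matches)                 # ['9.7', '9.6']
print(extract_first(matches))  # '9.7'
print(extract_first([]))       # None -- no IndexError on an empty result
```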
5. pipelines.py
Write the items into a local database
import pymysql

class DoubanPipeline(object):
    def __init__(self):
        # Connect to the local MySQL database
        self.connect = pymysql.connect(
            host="localhost",
            user="root",
            password="123456",
            db="xiaoshuo",
            port=3306,
        )
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        # Parameterized placeholders let the driver escape values safely
        self.cursor.execute(
            'insert into movie(num,name,introduce,star,appraise,survey) VALUES (%s,%s,%s,%s,%s,%s)',
            (item['num'], item['name'], item['introduce'], item['star'], item['appraise'], item['survey']),
        )
        self.connect.commit()
        return item

    # Close the database connection
    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()
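The original post built the INSERT with str.format(), which breaks as soon as a field contains a double quote and invites SQL injection; passing the values as a separate tuple lets pymysql escape them. A sketch of why formatting is fragile (no database needed; the movie values are hypothetical):

```python
# Why parameterized queries beat string formatting (illustration only).
template = 'insert into movie(name, survey) VALUES ("{}", "{}")'
survey = 'A "masterpiece" of cinema'  # hypothetical value containing quotes

broken = template.format('Some Movie', survey)
# The embedded quotes prematurely close the SQL string literal:
print(broken)

# The safe form keeps SQL and data separate; the driver does the escaping:
sql = 'insert into movie(name, survey) VALUES (%s, %s)'
params = ('Some Movie', survey)
# cursor.execute(sql, params)  # pymysql escapes the quotes correctly
```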
6. Using proxy IPs (Abuyun)
Not tested, since I don't have an account...
Packaged project, two ways to get it
Baidu Cloud:
Link: https://pan.baidu.com/s/1GX9srMbh7aJbbpC8y6ZzDw  Extraction code: zp3h
Forum attachment:
Thanks to teacher Da Zhuang!
Updated 2019-11-14
Extension: a Django project
1. Create the Django project
django-admin startproject douban_movie
2. Create the app
Using PyCharm's built-in manage.py console:
startapp douban
3. Register the app
settings.py
In INSTALLED_APPS = [], add
'douban'
4. Add fields to the model
models.py
from django.db import models

# Create your models here.
class Movie(models.Model):
    num = models.IntegerField(max_length=10)  # Django warns max_length is ignored on IntegerField (see the error at the end)
    name = models.CharField(max_length=50)
    introduce = models.CharField(max_length=255)
    star = models.CharField(max_length=10)
    appraise = models.CharField(max_length=255)
    survey = models.CharField(max_length=100)
5. Add a urls.py file to the app
Code for urls.py:
from django.urls import path
from . import views

urlpatterns = [
    path('index/', views.hello_world)
]
Set up forwarding to the app's urls in the project urls.py:
from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    path('admin/', admin.site.urls),
    path('douban/', include('douban.urls')),
]
6. Add code to the app's views to implement hello world
from django.shortcuts import render
from django.http import HttpResponse

def hello_world(request):
    return HttpResponse("Hello_world!")

Visiting http://127.0.0.1:8000/douban/index/ returns the text Hello_world!
7. Switch the database
Modify settings.py. The default is:
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
    }
}
Change it to:
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',  # database engine
        'NAME': 'douban_movie',                # database name
        'USER': 'root',                        # user
        'PASSWORD': '123456',                  # password (same MySQL account as the pipeline above)
        'HOST': 'localhost',                   # host
        'PORT': '3306',                        # port
    }
}
Change the time zone
TIME_ZONE = 'UTC'
becomes
TIME_ZONE = 'Asia/Shanghai'
Add to the project's __init__.py:
import pymysql
pymysql.install_as_MySQLdb()
Database migration commands:
python manage.py makemigrations
python manage.py migrate
8. Handling errors from the database switch
https://blog.csdn.net/weixin_45476498/article/details/100098297
Some of the core code:
Model layer
models.py
from django.db import models

# Create your models here.
class Movie(models.Model):
    num = models.IntegerField(max_length=11)
    name = models.CharField(max_length=50)
    introduce = models.CharField(max_length=255)
    star = models.CharField(max_length=10)
    appraise = models.CharField(max_length=255)
    survey = models.CharField(max_length=100)

    def __str__(self):
        return self.name
App routes, urls.py
from django.urls import path
from . import views

urlpatterns = [
    path('index', views.hello_world,),
    # path('index/', views.movie),
    path('index/', views.index,),
]
App view layer, views.py
from django.shortcuts import render
from django.http import HttpResponse
from .models import Movie
from django.core.paginator import Paginator

def hello_world(request):
    return HttpResponse("Hello_world!")

'''def movie(request):
    movie_list = Movie.objects.all()
    movie = movie_list
    return HttpResponse('%s%s%s%s%s%s' % (movie.num, movie.name, movie.introduce, movie.star, movie.appraise, movie.survey,))'''

'''def index(request):
    movie_list = Movie.objects.all()
    return render(request, 'douban/index.html', {
        'movie_list': movie_list,
    })'''

def index(request):
    movie_list = Movie.objects.all()
    paginator = Paginator(movie_list, 25)
    page = request.GET.get('page')
    page_obj = paginator.get_page(page)
    return render(request, 'douban/index.html', {
        'paginator': paginator,
        'page_obj': page_obj,
    })
Django's Paginator (here paginator is a Paginator instance and page2 one of its page objects):
print(paginator.count)       # total number of items
print(paginator.num_pages)   # number of pages
print(paginator.page_range)  # range of page numbers, i.e. how many page buttons
print(page2.has_next())              # is there a next page?
print(page2.next_page_number())      # the next page's number
print(page2.has_previous())          # is there a previous page?
print(page2.previous_page_number())  # the previous page's number
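Paginator's arithmetic is easy to sketch in plain Python (the record count here is hypothetical, not from the crawl):

```python
import math

items = list(range(1, 54))  # 53 hypothetical records
per_page = 25

num_pages = math.ceil(len(items) / per_page)  # like Paginator.num_pages
page_range = range(1, num_pages + 1)          # like Paginator.page_range

def page(n):
    # Slice out page n (1-indexed), like paginator.page(n).object_list
    return items[(n - 1) * per_page : n * per_page]

print(num_pages)         # 3
print(list(page_range))  # [1, 2, 3]
print(page(3))           # [51, 52, 53] -- the last, partial page
```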
Project routes, urls.py
from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    path('admin/', admin.site.urls),
    path('douban/', include('douban.urls')),
]
__init__.py
import pymysql
pymysql.install_as_MySQLdb()
index.html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Top 250 Movies</title>
</head>
<body>
<div>
    <table>
        {% for movie in page_obj %}
        <tr>
            <td>No.: {{ movie.num }}</td>
            <td>Title: {{ movie.name }}</td>
            <td>Description: {{ movie.introduce }}</td>
            <td>Rating: {{ movie.star }}</td>
            <td>Ratings count: {{ movie.appraise }}</td>
            <td>Tagline: {{ movie.survey }}</td>
        </tr>
        {% endfor %}
    </table>
</div>
<div>
    <ul>
        <li>
            {% if page_obj.has_previous %}
            <a href="?page={{ page_obj.previous_page_number }}">Previous</a>
            {% endif %}
        </li>
        {% for i in paginator.page_range %}
        <li>
            <a href="?page={{ i }}">{{ i }}</a>
        </li>
        {% endfor %}
        <li>
            {% if page_obj.has_next %}
            <a href="?page={{ page_obj.next_page_number }}">Next</a>
            {% endif %}
        </li>
    </ul>
</div>
</body>
</html>
Result:
Douban Top 250 movies: https://movie.douban.com/top250
The code still has problems, so I won't package this part!!!
Remaining problem
Error output:
"D:\Program Files\JetBrains\PyCharm 2019.1.2\bin\runnerw64.exe" E:\douban_movie\venv\Scripts\python.exe E:/douban_movie/manage.py runserver 8000
Watching for file changes with StatReloader
Performing system checks...
System check identified some issues:
WARNINGS:
douban.Movie.num: (fields.W122) 'max_length' is ignored when used with IntegerField.
HINT: Remove 'max_length' from field
System check identified 1 issue (0 silenced).
November 14, 2019 - 19:46:31
Django version 2.2.7, using settings 'douban_movie.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CTRL-BREAK.
Could an expert explain this? Many thanks!

Reply (天行键丶, 2019-11-14 12:47): I once crawled the university's academic affairs system until it crashed and got a formal demerit for it; luckily a senior schoolmate had a provincial-level award, so it was revoked a few days later. Would switching to a proxy IP help?

Reply (huguo002, 2019-11-14 08:51): Maybe the crawl just isn't aggressive enough?

Reply (2Burhero, 2019-11-13 20:57): Learn web scraping well, and a "tea chat" with the authorities won't be far behind.

Other replies: That scared me into studying even harder! / Keep up the good work. / Supporting this; learning Python really does require writing projects yourself. / I'm a beginner, but still want to show support. / Thanks, you made me type it all out by hand. / I'd better not speak carelessly, lest I expose my ignorance. / Thanks for sharing! Learning Python takes lots of hands-on practice!