
[Python Original] Scraping Maoyan's The Wandering Earth Reviews with Scrapy, Part 2: Fetching the Data

py看考场, posted 2019-3-16 17:12
This post was last edited by py看考场 on 2019-3-25 20:16

The previous post covered installing and configuring Scrapy; this one covers fetching the reviews of The Wandering Earth.
Previous post in the series: Scrapy installation and configuration
Next post in the series: Data visualization

I. Analysis
1. First, the basic workflow of a Scrapy crawler:
define the data format of the content to crawl in items.py -----> issue requests and process the responses in the spider file under spiders -----> hand the finished items to pipelines, which store them in a database or file

2. Finding the Maoyan review API:
Opening the Maoyan PC page in Chrome shows only ten reviews, so switch the browser to mobile mode. In mobile mode you can see more reviews, and scrolling up finally reveals the API request carrying the data. The reviews come from this URL (pitfall ahead):
http://m.maoyan.com/review/v2/comments.json?movieId=248906&userId=-1&offset=0&limit=15&ts=0&type=3. Anyone with a bit of experience would now bump offset by 15 each time to page through the results, but the API stops returning reviews once offset reaches 1000, which means this approach yields only about 990 records. (Abandoned.)
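The abandoned offset-based paging can be sketched like this (the URL parameters come from the request above; `offset_urls` is a name of my own for illustration, and the 1000-offset cap is what limits it to roughly 990 records):

```python
# Build the offset-paged comment URLs (the approach abandoned above).
# The API returns nothing once offset reaches 1000, so with limit=15
# this tops out at about 990 usable records.
BASE = ('http://m.maoyan.com/review/v2/comments.json'
        '?movieId=248906&userId=-1&offset={offset}&limit=15&ts=0&type=3')

def offset_urls(max_offset=1000, step=15):
    """Yield one page URL per offset until the API's offset cap."""
    offset = 0
    while offset < max_offset:
        yield BASE.format(offset=offset)
        offset += step
```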

So I found another API online: http://m.maoyan.com/mmdb/comments/movie/248906.json?_v_=yes&offset=0&startTime=2019-02-05%2020:28:22. Here offset stays at 0, and more reviews are fetched by changing the startTime value: take the time of the last review on each page as the new startTime, build a new URL, and request again.
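The startTime paging just described (take the last review's time, step back one second, re-request) can be sketched as follows; `next_page_url` is a name of my own, and the space-to-%20 replacement mirrors the URLs shown above:

```python
import datetime

BASE = ('http://m.maoyan.com/mmdb/comments/movie/248906.json'
        '?_v_=yes&offset=0&startTime=')

def next_page_url(last_comment_time):
    """Build the next page's URL from the startTime of the last review
    on the current page, stepping back one second to avoid re-fetching it."""
    t = datetime.datetime.strptime(last_comment_time, '%Y-%m-%d %H:%M:%S')
    t -= datetime.timedelta(seconds=1)
    return BASE + t.strftime('%Y-%m-%d %H:%M:%S').replace(' ', '%20')
```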

II. The Code (this source does not collect the gender field; for a version that does, see the [modified version] further down the thread)
1. The items.py file
import scrapy

class MaoyanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    city = scrapy.Field()        # city
    content = scrapy.Field()     # review text
    user_id = scrapy.Field()     # user id
    nick_name = scrapy.Field()   # nickname
    score = scrapy.Field()       # rating
    time = scrapy.Field()        # review time
    user_level = scrapy.Field()  # user level
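The field names above map directly onto keys of the API's JSON records; the mapping the spider performs later looks like this (shown with a plain dict and a made-up sample record so it runs without Scrapy — `FIELD_MAP` and `record_to_item` are illustrative names of my own):

```python
# Map one raw comment record from the API's 'cmts' list onto the
# MaoyanItem field names. FIELD_MAP: item field -> key in the JSON record.
FIELD_MAP = {
    'city': 'cityName',
    'content': 'content',
    'user_id': 'userId',
    'nick_name': 'nickName',
    'score': 'score',
    'time': 'startTime',
    'user_level': 'userLevel',
}

def record_to_item(record):
    """Return a dict with MaoyanItem's fields, or None if any key is missing."""
    if not all(key in record for key in FIELD_MAP.values()):
        return None
    return {field: record[key] for field, key in FIELD_MAP.items()}
```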

2. The comment.py file
# -*- coding: utf-8 -*-
import scrapy
import random
from scrapy.http import Request
import datetime
import json
from maoyan.items import MaoyanItem

class CommentSpider(scrapy.Spider):
    name = 'comment'
    allowed_domains = ['maoyan.com']
    uapools = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; ) AppleWebKit/534.12 (KHTML, like Gecko) Maxthon/3.0 Safari/534.12',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.33 Safari/534.3 SE 2.X MetaSr 1.0',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.41 Safari/535.1 QQBrowser/6.9.11079.201',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E) QQBrowser/6.9.11079.201',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
        'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0'
    ]
    thisua = random.choice(uapools)
    header = {'User-Agent': thisua}
    current_time = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    # current_time = '2019-02-06 18:01:22'
    end_time = '2019-02-05 00:00:00' # the movie's release date
    url = 'http://m.maoyan.com/mmdb/comments/movie/248906.json?_v_=yes&offset=0&startTime=' + current_time.replace(' ', '%20')

    def start_requests(self):
        current_t = str(self.current_time)
        if current_t > self.end_time:
            try:
                yield Request(self.url, headers = self.header, callback = self.parse)
            except Exception as error:
                print('Request 1 failed-----' + str(error))
        else:
            print('All relevant reviews have been fetched')

    def parse(self, response):
        item = MaoyanItem()
        data = response.body.decode('utf-8','ignore')
        json_data = json.loads(data)['cmts']
        count = 0
        for item1 in json_data:
            required = ('cityName', 'nickName', 'userId', 'content', 'score', 'startTime', 'userLevel')
            if all(key in item1 for key in required):
                try:
                    city = item1['cityName']
                    comment = item1['content']
                    user_id = item1['userId']
                    nick_name = item1['nickName']
                    score = item1['score']
                    time = item1['startTime']
                    user_level = item1['userLevel']
                    item['city'] = city
                    item['content'] = comment
                    item['user_id'] = user_id
                    item['nick_name'] = nick_name
                    item['score'] = score
                    item['time'] = time
                    item['user_level'] = user_level
                    yield item
                    count += 1
                    if count >= 15:
                        temp_time = item['time']
                        current_t = datetime.datetime.strptime(temp_time, '%Y-%m-%d %H:%M:%S') + datetime.timedelta(seconds = -1)
                        current_t = str(current_t)
                        if current_t > self.end_time:
                            url1 = 'http://m.maoyan.com/mmdb/comments/movie/248906.json?_v_=yes&offset=0&startTime=' + current_t.replace(' ', '%20')
                            yield Request(url1, headers=self.header, callback=self.parse)
                        else:
                            print('All relevant reviews have been fetched')
                except Exception as error:
                    print('Failed to extract review info 1-----' + str(error))
            else:
                print('Incomplete record, skipped')
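A small aside on the URL building in the spider above: replace(' ', '%20') works because the space is the only unsafe character in the timestamp. urllib.parse offers a more general alternative (a sketch of my own, not part of the original source; safe=':' keeps the colons literal, matching the URLs in the post):

```python
import urllib.parse

def build_url(start_time):
    """Build a comment-page URL with the timestamp percent-encoded."""
    base = 'http://m.maoyan.com/mmdb/comments/movie/248906.json'
    query = '_v_=yes&offset=0&startTime=' + urllib.parse.quote(start_time, safe=':')
    return base + '?' + query
```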

3. The pipelines.py file
import pandas as pd

class MaoyanPipeline(object):
    def process_item(self, item, spider):
        dict_info = {'city': item['city'], 'content': item['content'], 'user_id': item['user_id'], 'nick_name': item['nick_name'],
                     'score': item['score'], 'time': item['time'], 'user_level': item['user_level']}
        try:
            data = pd.DataFrame(dict_info, index=[0])  # turn the dict into a one-row table; note the index=[0]
            data.to_csv('C:/Users/1/Desktop/流浪地球影评/info.csv', header=False, index=True, mode='a', encoding = 'utf_8_sig')  # mode='a' appends; utf_8_sig keeps the Chinese text readable in Excel
        except Exception as error:
            print('Failed to write to file-------->>>' + str(error))
        else:
            print(dict_info['content'] + '---------->>>written to file')
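Creating a one-row DataFrame per review is convenient but slow. A lighter alternative sketch with the standard-library csv module (same columns, append mode; `append_row` is a name of my own, and the path would need adjusting to your machine):

```python
import csv
import os

FIELDS = ['city', 'content', 'user_id', 'nick_name', 'score', 'time', 'user_level']

def append_row(path, item):
    """Append one review to a CSV file, writing the header only once."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)  # avoid 'No such file or directory'
    new_file = not os.path.exists(path)
    with open(path, 'a', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({k: item[k] for k in FIELDS})
```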


III. Running the Program
After writing the code, click the Terminal icon at the bottom left of PyCharm to open a shell in the project directory, type scrapy crawl comment, and press Enter to run.

IV. The Crawl and the Results
Here is a screenshot taken during the crawl (image omitted). The crawl runs long, roughly 5-6 hours, and collected about 470,000 records in all; the next post will do a visual analysis of this data.
In the final results (screenshot omitted), more than 90% of the reviews are positive, ratings are mostly full marks, and words like "good-looking", "not bad", and "great" come up constantly. No wonder the film earned such a high box office in so short a time.

V. Review
1. After about 50,000 records the program crashed twice: one user had no location info, others had no nickname, and so on. So I added a check that all fields are present before extracting.
2. The program still feels slow; pointers from more experienced folks are welcome.
3. The resulting CSV is large (over 50 MB), so I can only attach the source code for everyone to study.
4. Writing this up took effort; please leave a rating if it helped.

maoyan.zip (15.8 KB, downloaded 46 times)
5. Modified version (gender = 0, 1, 2 for male, female, and unspecified respectively):
maoyan.zip (15.88 KB, downloaded 63 times)
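On point 2 of the review above (speed): most of the wall-clock time is request latency plus the per-row DataFrame writes, and Scrapy's concurrency settings are the usual first knob. A settings.py fragment as a starting point (values are illustrative, not tuned for Maoyan's rate limits):

```python
# settings.py fragment (illustrative values, not tuned for this site)
CONCURRENT_REQUESTS = 16   # parallel in-flight requests (Scrapy's default)
DOWNLOAD_DELAY = 0.25      # seconds between requests to the same domain
RETRY_TIMES = 2            # retry transient failures a couple of times
```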









繁华转眼 posted 2019-3-25 13:57
It only crawls reviews from 1 p.m. today up to now; what's going on?
# -*- coding: utf-8 -*-
import scrapy
import random
from scrapy.http import Request
import datetime
import json
from pic.items import MaoyanItem


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['maoyan.com']
    uapools = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; ) AppleWebKit/534.12 (KHTML, like Gecko) Maxthon/3.0 Safari/534.12',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.33 Safari/534.3 SE 2.X MetaSr 1.0',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.41 Safari/535.1 QQBrowser/6.9.11079.201',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E) QQBrowser/6.9.11079.201',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
        'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0'
    ]
    thisua = random.choice(uapools)
    header = {'User-Agent': thisua}
    current_time = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    # current_time = '2019-02-06 18:01:22'
    end_time = '2019-02-05 00:00:00'  # the movie's release date
    url = 'http://m.maoyan.com/mmdb/comments/movie/248906.json?_v_=yes&offset=0&startTime=' + current_time.replace(' ',
                                                                                                                   '%20')

    def start_requests(self):
        current_t = str(self.current_time)
        if current_t > self.end_time:
            try:
                yield Request(self.url, headers=self.header, callback=self.parse)
            except Exception as error:
                print('Request 1 failed-----' + str(error))
        else:
            print('All relevant reviews have been fetched')

    def parse(self, response):
        item = MaoyanItem()
        data = response.body.decode('utf-8', 'ignore')
        json_data = json.loads(data)['cmts']
        count = 0
        for item1 in json_data:
            required = ('cityName', 'nickName', 'userId', 'gender', 'content', 'score', 'startTime', 'userLevel')
            if all(key in item1 for key in required):
                try:
                    city = item1['cityName']
                    comment = item1['content']
                    user_id = item1['userId']
                    nick_name = item1['nickName']
                    gender = item1['gender']
                    score = item1['score']
                    time = item1['startTime']
                    user_level = item1['userLevel']
                    item['city'] = city
                    item['content'] = comment
                    item['user_id'] = user_id
                    item['nick_name'] = nick_name
                    item['gender'] = gender
                    item['score'] = score
                    item['time'] = time
                    item['user_level'] = user_level
                    yield item
                    count += 1
                    if count >= 15:
                        temp_time = item['time']
                        current_t = datetime.datetime.strptime(temp_time, '%Y-%m-%d %H:%M:%S') + datetime.timedelta(
                            seconds=-1)
                        current_t = str(current_t)
                        if current_t > self.end_time:
                            url1 = 'http://m.maoyan.com/mmdb/comments/movie/248906.json?_v_=yes&offset=0&startTime=' + current_t.replace(
                                ' ', '%20')
                            yield Request(url1, headers=self.header, callback=self.parse)
                        else:
                            print('All relevant reviews have been fetched')
                except Exception as error:
                    print('Failed to extract review info 1-----' + str(error))
            else:
                print('Incomplete record, skipped')
niebaohua posted 2019-3-20 21:31
Failed to write to file-------->>>[Errno 2] No such file or directory: '/home/nianshao/PycharmProjects/流浪地球影评/info.csv'
2019-03-20 21:27:30 [scrapy.core.scraper] DEBUG: Scraped from <200 http://m.maoyan.com/mmdb/comments/movie/248906.json?_v_=yes&offset=0&startTime=2019-03-19%2020:46:15>
None
(the same write error repeats for every scraped item)
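The errors above come from pandas failing because the target directory does not exist; to_csv in append mode does not create parent directories. Creating the directory before the first write avoids this (a generic sketch; `ensure_parent` is a name of my own):

```python
import os

def ensure_parent(path):
    """Create the parent directory of a file path if it doesn't exist yet."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return parent
```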
hyolyn posted 2019-3-16 17:33
py看考场 (OP) replied 2019-3-16 17:43

Experts can safely skip this one.
苍余楚 posted 2019-3-16 18:28
Crawler code
runfog posted 2019-3-17 00:53
High praise and recognition
py看考场 (OP) replied 2019-3-17 01:34
runfog posted 2019-3-17 00:53
High praise and recognition

Thanks for the recognition
小雪龍 posted 2019-3-17 10:11
Impressive work, OP
uumesafe posted 2019-3-19 14:00
A Python tutorial post; here's my support
wdye posted 2019-3-20 16:52
Seems I'm out of points; just rated you a 3. Only wanted to say this is great.