Crawler Study [1]: Fetching the Douban Chart and Saving It Locally

Fullmoonbaka · posted 2021-7-21 10:21

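This crawler uses requests to fetch the page and XPath to parse the page's DOM tree, extracts the key information, and finally writes it to a txt file.
It is a beginner-level Python crawler exercise. Anyone interested is welcome to discuss and learn together, and I would appreciate it if more experienced members pointed out any problems.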

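The only required entry in headers is User-Agent; the other entries can be left out.
The code contains two functions:
one downloads the page to a local file,
the other parses the local page and generates the txt file.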
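
As a minimal sketch of the point about headers, a request can be built with User-Agent as its only header. The sketch uses requests.Request(...).prepare() so the headers can be inspected without sending anything. The name build_request is illustrative, and whether the site still accepts such a request is an assumption carried over from the post, not something the sketch checks.

```python
import requests

# User-Agent string copied from the headers dict in the post
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")


def build_request(url):
    """Prepare (but do not send) a GET request carrying only a User-Agent header."""
    request = requests.Request("GET", url, headers={"User-Agent": USER_AGENT})
    return request.prepare()


prepared = build_request("https://movie.douban.com/chart")
# Show the headers set on the prepared request; Host is added later, when the request is sent
print(dict(prepared.headers))
```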

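The point of writing it this way is to reduce the number of requests and the load on the site: the page is downloaded once, and the parsing can then be rerun against the local copy as often as needed.
Alternatively, the fetched page content can be parsed directly, skipping the save-to-disk and read-from-disk steps.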
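
Parsing the fetched content directly could look roughly like the sketch below. The function names parse_chart and first are hypothetical, and SAMPLE is made-up markup that only mirrors the class names used by the XPath expressions in this post; it is not real Douban output.

```python
from lxml import etree

# Made-up markup for illustration; it mimics the structure the XPath expressions expect
SAMPLE = '''
<table><tr class="item">
  <td><a class="nbg" href="#" title="Example Film"></a></td>
  <td><div class="pl2">
    <a href="#">Example Film / <span>Alias One</span></a>
    <p class="pl">2021-07-01 / Someone</p>
    <div class="star"><span class="rating_nums">8.1</span></div>
  </div></td>
</tr></table>
'''


def first(node, path):
    """Return the first XPath match as stripped text, or '' when nothing matches."""
    matches = node.xpath(path)
    return matches[0].strip() if matches else ""


def parse_chart(html_text):
    """Parse chart rows straight from an HTML string, with no file in between."""
    tree = etree.HTML(html_text)
    rows = []
    for item in tree.xpath('//tr[@class="item"]'):
        rows.append({
            "title": first(item, './/a[@class="nbg"]/@title'),
            "info": first(item, './/div[@class="pl2"]/p[@class="pl"]/text()'),
            "rating": first(item, './/div[@class="pl2"]//span[@class="rating_nums"]/text()'),
        })
    return rows


print(parse_chart(SAMPLE))
```

With a live response, the same function would be called as parse_chart(data.text), where data is the object returned by requests.get.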

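When writing a crawler, always keep the number of requests to the site as low as you can, and do not make trouble for the people who maintain and run it.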
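
One way to keep the request count down is to treat the saved file as a cache and go to the network only when it is missing. The sketch below is an illustration under that assumption; load_page and its fetch parameter are hypothetical names, and fetch stands for any function that returns the page text (for example, a wrapper around requests.get).

```python
from pathlib import Path


def load_page(cache_path, fetch):
    """Return the cached page if it exists; otherwise call fetch() once and cache the result."""
    path = Path(cache_path)
    if path.exists():
        # Cache hit: no request is made
        return path.read_text(encoding="utf-8")
    # Cache miss: fetch once and keep a local copy for later runs
    text = fetch()
    path.write_text(text, encoding="utf-8")
    return text
```

Passing the fetcher in as an argument keeps the caching logic separate from the HTTP code, so it can be tried out with a stand-in function before pointing it at a real site.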

[Python]

import requests
from lxml import etree


# Download the Douban page to a local file
def get_douban_html():
    """Fetch the Douban chart page and save it to a local file."""
    # Douban new-releases chart
    url = "https://movie.douban.com/chart"
    # Request headers
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Host": "movie.douban.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    # Send the request; the timeout keeps the script from hanging on a stalled connection
    data = requests.get(url=url, headers=headers, timeout=10)
    # Check the status code
    if data.status_code == 200:
        # The with block closes the file even if the write fails
        with open('001.html', 'w', encoding='UTF-8') as file:
            file.write(data.text)
        print('Download complete')
    else:
        print("Request failed: {}".format(data.status_code))


# Parse the local Douban page and write the processed chart entries to a txt file
def read_douban_lxml():
    """Parse 001.html and write one text record per chart entry to 001.txt."""
    # Build an element tree that can be queried with XPath
    with open('001.html', encoding='UTF-8') as file:
        html = etree.HTML(file.read(), etree.HTMLParser())

    def first(node, path):
        """Return the first XPath match as stripped text, or '' when nothing matches."""
        matches = node.xpath(path)
        return matches[0].strip() if matches else ''

    string = ''
    # Each chart entry is a <tr class="item"> row
    for item in html.xpath('//tr[@class="item"]'):
        string += "----------\n[Title]: {}\n".format(first(item, './/a[@class="nbg"]/@title'))
        # The link text ends with a "/" separator, so trim it along with the whitespace
        string += "[Alias]: {} / ".format(first(item, './/div[@class="pl2"]/a/text()').strip('/').strip())
        string += "{}\n".format(first(item, './/div[@class="pl2"]/a/span/text()'))
        string += "[Info]: {}\n".format(first(item, './/div[@class="pl2"]/p[@class="pl"]/text()'))
        string += "[Rating]: {}\n".format(first(item, './/div[@class="pl2"]//span[@class="rating_nums"]/text()'))
    with open('001.txt', 'w', encoding='utf-8') as file:
        file.write(string)


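if __name__ == "__main__":
    import os

    # Download only when the local copy is missing, so repeated runs do not hit the site again
    if not os.path.exists('001.html'):
        get_douban_html()
    # Parse only if the download succeeded (or the file was already there)
    if os.path.exists('001.html'):
        read_douban_lxml()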



"""
    # Passing HTMLParser lets lxml repair the HTML file, e.g. by filling in a missing declaration
    html2 = etree.parse('001.html', etree.HTMLParser())
    # Serialize to bytes
    result2 = etree.tostring(html2)
    # Serialize to a list
    # result2 = etree.tostringlist(html2)
    print(type(result2))
    print(type(html2))
    print(result2.decode('utf-8'))
"""

Fullmoonbaka · posted 2021-7-21 10:24

The commented-out block at the very bottom is some of my study notes that I forgot to delete. Sorry about that.

QingYi. · posted 2021-7-21 11:32

This site only needs a UA header.

yoyoma211 · posted 2021-7-21 12:27

I need to study this; I don't quite understand it yet.

emoheizi · posted 2021-7-22 12:39

Do you need a sleep to keep the crawler from going too fast and triggering anti-crawler measures?

simbro · posted 2021-7-22 17:45

I need to study this; it will definitely come in handy later.

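Fullmoonbaka · posted 2021-7-27 09:37

emoheizi posted 2021-7-22 12:39
Do you need a sleep to keep the crawler from going too fast and triggering anti-crawler measures?

Not in this case: my first function downloads the HTML file to disk, and the parsing reads that local file.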
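
For a crawler that does fetch several pages, a pause between requests might look like the sketch below. This is an illustration only: fetch_all, fetch, and delay_seconds are hypothetical names that do not appear in the code in this thread, and the two-second default is an arbitrary assumption, not a value recommended by any site.

```python
import time


def fetch_all(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL in order, pausing between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # No pause is needed before the first request, only between requests
            sleep(delay_seconds)
        pages.append(fetch(url))
    return pages
```

The sleep parameter defaults to time.sleep and exists only so the pause can be replaced with a stand-in when trying the function out.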
