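I've just picked up a bit of Python scraping and now I want to scrape everything I see, but then I wonder: what would I even do with the data?
I find re regular expressions and xpath the easier ones to follow. You basically lay a template over the page and pull out the pieces that fit.
I haven't figured out BeautifulSoup yet, so a bs4 version from anyone is welcome. I'd like to learn from it.
re version: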
import re
import requests

url = 'https://www.bilibili.com/v/popular/rank/all'
resp = requests.get(url)
page_content = resp.text
resp.close()

# Named groups capture rank, link, title and the "hot" figure; re.S lets . match newlines
obj = re.compile(r'data-rank="(?P<rank>.*?)" class="rank-item">.*?<div class="img"><a href="(?P<href>.*?)" target=.*?class="title">(?P<title>.*?)</a>.*?alt="play">.*?(?P<hot>.*?)</span>', re.S)
result = obj.finditer(page_content)
count = 0
for i in result:
    rank = i.group('rank')
    title = i.group('title').strip()
    hot = i.group('hot').strip()
    href = i.group('href')
    # href is protocol-relative (starts with //), so prepend the scheme
    print(rank, title, hot, 'https:' + href)
    count += 1
    if count >= 20:  # number of entries to show
        break
print('done')
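To see how the "template" idea works without hitting the network, here is a small offline sketch that runs the same pattern over a hand-written fragment. The `SAMPLE` markup, the video URLs in it, and the `extract` helper are all made up for illustration; the fragment is only shaped to fit the pattern, and the real page may look different.

```python
import re

# Hypothetical fragment shaped like what the pattern expects (not copied from the site).
SAMPLE = '''
<li data-rank="1" class="rank-item"><div class="content">
<div class="img"><a href="//www.bilibili.com/video/BV_EXAMPLE_1" target="_blank"><img></a></div>
<div class="info"><a href="//www.bilibili.com/video/BV_EXAMPLE_1" target="_blank" class="title">First video</a>
<div class="detail"><span class="data-box"><img alt="play">
  123.4万
</span></div></div></div></li>
<li data-rank="2" class="rank-item"><div class="content">
<div class="img"><a href="//www.bilibili.com/video/BV_EXAMPLE_2" target="_blank"><img></a></div>
<div class="info"><a href="//www.bilibili.com/video/BV_EXAMPLE_2" target="_blank" class="title">Second video</a>
<div class="detail"><span class="data-box"><img alt="play">
  56.7万
</span></div></div></div></li>
'''

# Same pattern as in the post, split across lines for readability.
PATTERN = re.compile(
    r'data-rank="(?P<rank>.*?)" class="rank-item">.*?'
    r'<div class="img"><a href="(?P<href>.*?)" target=.*?'
    r'class="title">(?P<title>.*?)</a>.*?'
    r'alt="play">.*?(?P<hot>.*?)</span>',
    re.S,  # let . match newlines, since one item spans several lines
)


def extract(html, limit=20):
    """Return up to `limit` entries as dicts, in page order."""
    rows = []
    for m in PATTERN.finditer(html):
        rows.append({
            "rank": m.group("rank"),
            "title": m.group("title").strip(),
            "hot": m.group("hot").strip(),
            "url": "https:" + m.group("href"),
        })
        if len(rows) >= limit:
            break
    return rows


rows = extract(SAMPLE)
for row in rows:
    print(row["rank"], row["title"], row["hot"], row["url"])
```

Each `(?P<name>.*?)` is a lazy named group: it grabs as little text as possible up to the next literal piece of the template, which is why the literal parts around it have to match the page exactly.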
xpath version:
import requests
from lxml import etree

url = 'https://www.bilibili.com/v/popular/rank/all'
resp = requests.get(url)
html = etree.HTML(resp.text)  # parse the body before closing the response
resp.close()
# Position-based path to the list items; the slice sets the number of entries to show
lists = html.xpath('//*[@id="app"]/div/div[2]/div[2]/ul/li')[0:20]
for i in lists:
    rank = i.xpath('./div/div[1]/i/span/text()')[0]
    title = i.xpath('./div/div[2]/a/text()')[0]
    # the text may come back as several nodes, so join them before stripping
    hot = "".join(i.xpath('./div/div[2]/div/div/span[1]/text()')).strip()
    href = i.xpath('./div/div[2]/a/@href')[0]
    print(rank, title, hot, "https:" + href)
print('done')
Result when run: [screenshot]
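bs4 version (rough sketch):
Since a bs4 version was asked for above, here is one possible starting point. It is not tested against the live page: the class names (`rank-item`, `title`) and the `alt="play"` marker are taken from the regex in the re version, while the `SAMPLE` markup, its URLs, and the `parse_rank` helper are made up for illustration. It uses the built-in `html.parser` so lxml is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for offline testing (not copied from the site).
SAMPLE = '''
<ul>
<li data-rank="1" class="rank-item"><div class="content">
<div class="img"><a href="//www.bilibili.com/video/BV_EXAMPLE_1" target="_blank"><img></a></div>
<div class="info"><a href="//www.bilibili.com/video/BV_EXAMPLE_1" target="_blank" class="title">First video</a>
<div class="detail"><span class="data-box"><img alt="play">
  123.4万
</span></div></div></div></li>
<li data-rank="2" class="rank-item"><div class="content">
<div class="img"><a href="//www.bilibili.com/video/BV_EXAMPLE_2" target="_blank"><img></a></div>
<div class="info"><a href="//www.bilibili.com/video/BV_EXAMPLE_2" target="_blank" class="title">Second video</a>
<div class="detail"><span class="data-box"><img alt="play">
  56.7万
</span></div></div></div></li>
</ul>
'''


def parse_rank(html, limit=20):
    """Return up to `limit` entries as dicts, assuming the markup described above."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select(".rank-item")[:limit]:
        link = item.select_one("a.title")
        # The "hot" figure is the text that shares a parent with the alt="play" element.
        icon = item.find(attrs={"alt": "play"})
        hot = icon.parent.get_text(strip=True) if icon is not None else ""
        rows.append({
            "rank": item.get("data-rank"),
            "title": link.get_text(strip=True),
            "hot": hot,
            "url": "https:" + link["href"],
        })
    return rows


rows = parse_rank(SAMPLE)
for row in rows:
    print(row["rank"], row["title"], row["hot"], row["url"])
```

To try it on the real page, pass `resp.text` from the `requests.get(url)` call to `parse_rank`. Selecting by class name rather than by position should hold up better than a long `div[2]/div[2]` path if the layout shifts, though it still breaks if the class names change.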