This post was last edited by poji123 on 2020-2-8 09:41
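I just learned regular expressions, so I wrote a scraper for Qiushibaike. Experts, please go easy on me. It should be suitable for beginners to learn from.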
Output screenshot:
Source code:
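"""
By default the program scrapes pages 1-10. To scrape a different number of pages,
change the for loop in main(): in range(1, x), x is the last page you want plus one.
For 10 pages, pass 11, because range() counts up to its end argument but does not
include it.
"""
import re

import requests


def pear_tree_page(url):
    # Send a browser-like User-Agent header with the request.
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.85 Safari/537.36 Edg/80.0.361.47"
    }
    response = requests.get(url, headers=headers)
    text = response.text
    # re.S lets '.' match newlines, so one match can span several lines of HTML.
    contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>', text, re.S)
    contents_list = []
    for i in contents:
        # Strip any remaining HTML tags (e.g. <br/>) from the captured text.
        r = re.sub(r'<.*?>', '', i)
        contents_list.append(r.strip())
    for j in contents_list:
        print(j)
        print('*' * 50)


def main():
    # Pages 1-10: range() excludes its end value, so the end is 11.
    for i in range(1, 11):
        url = 'https://www.qiushibaike.com/text/page/{}/'.format(i)
        pear_tree_page(url)


if __name__ == '__main__':
    main()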
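
To make the range() note in the docstring concrete, here is a small sketch that only builds the page URLs and sends no requests. The helper name page_urls and the constant BASE_URL are my own and are not part of the original script; the URL pattern is the one used in main().

```python
BASE_URL = 'https://www.qiushibaike.com/text/page/{}/'  # same pattern as in main()


def page_urls(first_page, last_page):
    """Return the URLs for first_page..last_page, both inclusive (hypothetical helper)."""
    # range() stops before its end argument, so add 1 to include last_page.
    return [BASE_URL.format(n) for n in range(first_page, last_page + 1)]


urls = page_urls(1, 10)
print(len(urls))
print(urls[0])
print(urls[-1])
```

Passing the last page number and adding 1 inside the helper keeps the off-by-one detail in a single place, so the caller does not have to remember it.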
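
For anyone who wants to try the two regex steps without hitting the site, here is a minimal offline sketch. The HTML fragment is made up for illustration and the real page markup may differ; the function name extract_contents is also my own. The two patterns are the same ones used in pear_tree_page.

```python
import re

# Made-up HTML fragment for illustration; the real page markup may differ.
SAMPLE_HTML = '''
<div class="content">
<span>
First joke<br/> with a line break
</span>
</div>
<div class="content">
<span>Second joke</span>
</div>
'''


def extract_contents(html):
    """Apply the same two regex steps as pear_tree_page to an HTML string."""
    # re.S lets '.' match newlines, so a match can span several lines.
    raw = re.findall(r'<div class="content">.*?<span>(.*?)</span>', html, re.S)
    # Remove leftover tags such as <br/>, then trim surrounding whitespace.
    return [re.sub(r'<.*?>', '', item).strip() for item in raw]


jokes = extract_contents(SAMPLE_HTML)
for joke in jokes:
    print(joke)
    print('*' * 50)
```

Keeping the parsing in a function that takes a string means it can be checked against a saved page before any network code is involved.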