import requests
from lxml import etree

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Host": "tieba.baidu.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36",
}
url = "https://tieba.baidu.com/f?ie=utf-8&kw=%E6%9D%8E%E6%AF%85"
html = requests.get(url, headers=headers).text
# Replace the HTML comment markers so markup wrapped in <!-- ... --> is parsed as regular elements
html_new = html.replace(r'<!--', '"').replace(r'-->', '"')
# Select the link text inside each thread-title div
titles = etree.HTML(html_new).xpath('//div[@class="threadlist_title pull_left j_th_tit "]/a/text()')
for title in titles:
    print(title)
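
There is a space at the end of the xpath. Also, you can add a few more fields to headers. And the UA the OP is using looks like an Android one, so the response is the mobile WAP version of the page, which is why it can't be parsed.

I suggest printing the HTML you got back, or saving it to a text file, and then checking whether the fetched content actually contains these titles.
If it does, try copying it out and matching it piece by piece.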
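
The check suggested above can be sketched roughly as follows. This is only a minimal sketch: the helper name `dump_and_check`, the default file name and the marker string are assumptions made for illustration and do not come from the thread.

```python
from pathlib import Path


def dump_and_check(html_text, marker, path="tieba_dump.html"):
    """Save the fetched HTML to a file and report whether `marker` occurs in it.

    `dump_and_check`, `path` and `marker` are hypothetical names for this sketch.
    """
    # Write the raw response so it can be inspected in an editor or browser.
    Path(path).write_text(html_text, encoding="utf-8")
    # Plain substring test: is the expected markup in the response at all?
    return marker in html_text


# Usage (assumes `html` holds the text returned by requests.get(...).text):
# found = dump_and_check(html, "j_th_tit")
# print("title markup present in response:", found)
```

If the marker is missing from the saved file, the problem lies in the request (headers, UA, page variant) rather than in the xpath.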

It is probably loaded asynchronously; try setting $.ajax.seting to false and see.

import requests
from lxml import html

url = 'https://tieba.baidu.com/f?ie=utf-8&kw=%E5%A4%B4%E5%83%8F'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0',
}
html_doc = requests.get(url, headers=headers).text
html_data = html_doc.replace(r'<!--', '"').replace(r'-->', '"')
selector = html.fromstring(html_data)
title_list = selector.xpath('//*[@id="thread_list"]/li')
for i in title_list:
    title = i.xpath('div/div/div/div/a/@title')
    print(title)

This post was last edited by 1170 on 2020-6-4 22:31
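loner. posted on 2020-6-4 21:20
Wow, impressive, your code runs fine. But what did you mean by the xpath having a space at the end? I didn't quite get it.
The xpath expression has a trailing space at the end, which you dropped when copying. It is here: @class="threadlist_title pull_left j_th_tit "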
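
To illustrate the point about the trailing space: an `@class="..."` test in XPath compares the whole attribute string, so the space has to be part of the expression. Below is a small offline sketch with lxml. The sample HTML and the variable names are made up for illustration, and the `contains(...)` variant is one common way to avoid depending on exact whitespace, not something proposed in the thread.

```python
from lxml import etree

# Made-up sample imitating the class attribute from the thread (note the trailing space).
sample = '<div class="threadlist_title pull_left j_th_tit "><a title="t">Hello</a></div>'
doc = etree.HTML(sample)

# Exact string comparison: matches only when the trailing space is included.
with_space = doc.xpath('//div[@class="threadlist_title pull_left j_th_tit "]/a/text()')
without_space = doc.xpath('//div[@class="threadlist_title pull_left j_th_tit"]/a/text()')

# Token test that ignores leading/trailing whitespace in @class.
by_token = doc.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " j_th_tit ")]/a/text()'
)

print(with_space, without_space, by_token)
```

The token-based form is usually less brittle, since it keeps working if the site adds or removes whitespace or extra classes in the attribute.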