loner. posted on 2020-6-3 22:56

loner. posted on 2020-6-3 23:00

hill_king posted on 2020-6-3 23:04

Is the content loaded asynchronously?

1170 posted on 2020-6-4 00:18

Last edited by 1170 on 2020-6-4 00:20

import requests
from lxml import etree


headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Host": "tieba.baidu.com",
    # Desktop browser UA, so the server returns the desktop page instead of the mobile WAP page
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36",
}
url = "https://tieba.baidu.com/f?ie=utf-8&kw=%E6%9D%8E%E6%AF%85"
html = requests.get(url, headers=headers).text
# Replace the HTML comment delimiters so that markup inside comments is parsed as elements
html_new = html.replace(r'<!--', '"').replace(r'-->', '"')
# The class value ends with a space; @class="..." is an exact string match, so the space must stay
titles = etree.HTML(html_new).xpath('//div[@class="threadlist_title pull_left j_th_tit "]/a/text()')
for title in titles:
    print(title)


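Note that the class value at the end of the XPath has a trailing space. You can also add more fields to headers. In addition, the UA the OP used looks like an Android one, so the server returned the mobile WAP version of the page, and that is why it could not be parsed.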
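
To illustrate the two points above (markup hidden inside HTML comments, and the trailing space in the class value), here is a minimal offline sketch. The SAMPLE string and the parse_uncommented helper are made up for illustration and only imitate the structure discussed in this thread; they are not Tieba's actual markup. The contains() variant is one possible way to avoid depending on the exact class string, not something the posters above used.

```python
from lxml import etree

# Hypothetical sample: the list sits inside an HTML comment,
# and the class value ends with a space, as described in the thread.
SAMPLE = '''
<html><body>
<div id="wrapper"><!--
<ul id="thread_list">
  <li><div class="threadlist_title pull_left j_th_tit "><a title="First post">First post</a></div></li>
  <li><div class="threadlist_title pull_left j_th_tit "><a title="Second post">Second post</a></div></li>
</ul>
--></div>
</body></html>
'''


def parse_uncommented(text):
    """Drop the comment delimiters so the parser sees the markup inside them."""
    return etree.HTML(text.replace('<!--', '').replace('-->', ''))


# While the markup is still commented out, no <div> with that class exists in the tree.
commented = etree.HTML(SAMPLE).xpath('//div[contains(@class, "j_th_tit")]/a/text()')

doc = parse_uncommented(SAMPLE)
# Exact match including the trailing space.
with_space = doc.xpath('//div[@class="threadlist_title pull_left j_th_tit "]/a/text()')
# Same expression without the trailing space: the strings differ, so nothing matches.
without_space = doc.xpath('//div[@class="threadlist_title pull_left j_th_tit"]/a/text()')
# Substring match on one class name, which does not depend on the trailing space.
by_contains = doc.xpath('//div[contains(@class, "j_th_tit")]/a/text()')

print(commented)
print(with_space)
print(without_space)
print(by_contains)
```

Because @class="..." compares the whole attribute string, a single missing space is enough to turn the result into an empty list.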

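徒想er posted on 2020-6-4 00:19

I suggest printing the HTML you get back, or saving it to a text file,

then checking whether the fetched content actually contains these titles.

If it does, you can copy that part out and match it piece by piece.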
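
The check suggested above could be sketched roughly like this. save_html and describe_occurrence are hypothetical helper names, and the sample string stands in for the text returned by requests.get(...).text; the in-comment test is a simple heuristic that only looks at the nearest comment delimiters before the match.

```python
from pathlib import Path


def save_html(html_text, path):
    """Write the fetched HTML to disk so it can be inspected in an editor."""
    Path(path).write_text(html_text, encoding="utf-8")


def describe_occurrence(html_text, needle, width=40):
    """Report whether needle occurs, whether it sits inside an HTML comment,
    and a slice of the surrounding text."""
    pos = html_text.find(needle)
    if pos == -1:
        return {"found": False, "in_comment": False, "context": ""}
    # Heuristic: the needle is inside a comment if the last "<!--" before it
    # comes after the last "-->" before it.
    last_open = html_text.rfind("<!--", 0, pos)
    last_close = html_text.rfind("-->", 0, pos)
    in_comment = last_open != -1 and last_open > last_close
    start = max(0, pos - width)
    context = html_text[start:pos + len(needle) + width]
    return {"found": True, "in_comment": in_comment, "context": context}


# Hypothetical response body standing in for requests.get(...).text
sample = '<div id="wrap"><!--<a title="Hello Tieba">Hello Tieba</a>--></div>'
report = describe_occurrence(sample, "Hello Tieba")
missing = describe_occurrence(sample, "Not there")
print(report["found"], report["in_comment"])
print(missing["found"])
```

If a title is found but reported as being inside a comment, the parser will not see it as an element until the comment delimiters are removed.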

天川天音 posted on 2020-6-4 01:31

It is probably loaded asynchronously; try setting async to false in the $.ajax settings and see.

yc19951005 posted on 2020-6-4 14:16

import requests
from lxml import html

url = 'https://tieba.baidu.com/f?ie=utf-8&kw=%E5%A4%B4%E5%83%8F'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0',
}
html_doc = requests.get(url, headers=headers).text
# Replace the HTML comment delimiters so that markup inside comments is parsed as elements
html_data = html_doc.replace(r'<!--', '"').replace(r'-->', '"')
selector = html.fromstring(html_data)
# Each <li> under the element with id "thread_list" is one entry in the list
title_list = selector.xpath('//*[@id="thread_list"]/li')
for i in title_list:
    # Relative XPath evaluated from the <li>; the result is a list and may be empty
    title = i.xpath('div/div/div/div/a/@title')
    print(title)
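
The loop above prints a list for every <li>, including an empty list for items that have no matching link. A small offline sketch of the same per-item lookup that keeps only the title strings; SAMPLE and extract_titles are hypothetical and only shaped like the div/div/div/div/a path used above, not copied from the real page.

```python
from lxml import html

# Hypothetical markup shaped like the path div/div/div/div/a used above.
SAMPLE = '''
<html><body>
<ul id="thread_list">
  <li><div><div><div><div><a title="Avatar pack 1">Avatar pack 1</a></div></div></div></div></li>
  <li><div class="other">an item without a title link</div></li>
  <li><div><div><div><div><a title="Avatar pack 2">Avatar pack 2</a></div></div></div></div></li>
</ul>
</body></html>
'''


def extract_titles(markup):
    """Collect the first title attribute from each list item that has one."""
    root = html.fromstring(markup)
    titles = []
    for item in root.xpath('//*[@id="thread_list"]/li'):
        # Relative XPath from the <li>; returns a (possibly empty) list of strings.
        found = item.xpath('div/div/div/div/a/@title')
        if found:
            titles.append(found[0])
    return titles


titles = extract_titles(SAMPLE)
print(titles)
```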

loner. posted on 2020-6-4 21:20

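1170 posted on 2020-6-4 22:30

Last edited by 1170 on 2020-6-4 22:31

loner. posted on 2020-6-4 21:20
Wow, impressive, your code runs successfully. But what do you mean by the trailing space at the end of the XPath? I didn't quite follow.
The class value in the XPath expression ends with a space, and you dropped it when copying. It is here: @class="threadlist_title pull_left j_th_tit "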
View full version: How to fix an empty list returned when scraping Baidu Tieba with XPath? Asking the experts for help