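Crawling the Douban rankings (loaded via AJAX). When I watch tutorial videos it all makes sense, but as soon as I try it myself I run into all kinds of problems. My advice to beginners: practice more, just watching won't...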
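import requests
import pymongo
from multiprocessing.dummy import Pool  # thread-based Pool with the multiprocessing API
from db_phb import *  # presumably provides MONGODB_URL, START and END
from lxml import etree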
"""
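url = https://movie.douban.com/typerank?type_name=%E5%89%A7%E6%83%85&type=11&interval_id=100:90&action=
1. Request the index URL and find the detail-page URLs in the response
2. Request each detail page and parse the movie title, score and synopsis from the response
3. Paginate the index: analyze the AJAX dynamic loading to get the pagination URL
4. Store the results
   Save to MongoDB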
"""
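def get_moive_index(url, data):
    # Fetch one page of the ranking from the AJAX endpoint; returns the parsed JSON or None
    try:
        response = requests.get(url=url, headers=headers, params=data)
        if response.status_code == 200:
            return response.json()
    except Exception:
        pass
    return None

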
def detail_mov(url):
    # Fetch a movie detail page; returns the HTML text or None
    try:
        response = requests.get(url=url, headers=headers)
        if response.status_code == 200:
            return response.text
    except Exception:
        pass
    return None


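# Parse the movie synopsis
def info(url):
    rep = detail_mov(url)
    if rep is None:  # the request failed, so there is nothing to parse
        return None
    try:
        tree = etree.HTML(rep)
        # lxml's HTML parser lowercases attribute names, so match @class rather than @Class
        mov_info = tree.xpath('//div[@class="indent"]//span/text()')
        # The argument to join() was missing in the post; stripping each fragment is an assumption
        mov_info = "".join(s.strip() for s in mov_info)
        # mov_info = str(mov_info)
        # mov_info = mov_info.replace("\\u3000","").replace("\\n","").replace(" ","").replace("[","").replace("]","")  # roughly the same effect as the comprehension above
        # print(mov_info)
        return mov_info
    except Exception:
        pass

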
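def save_to_mongo(item):
    # db must be a Collection; Collection.insert() was removed in PyMongo 4, so use insert_one()
    if db.insert_one(item).acknowledged:
        print("Saved to MongoDB successfully", item)
        return True
    return False

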
def main(data):
    # multiprocessing.dummy runs this in a worker thread, not a separate process
    print("Worker started successfully")
    respons = get_moive_index(url, data) or []  # None means the request failed
    for dic in respons:
        item = {}
        ur = dic['url']
        item['title'] = dic['title']
        item['regions'] = dic['regions']
        item['score'] = dic['score']
        item['vote_count'] = dic['vote_count']
        item['mov_info'] = info(ur)
        print(item)
        save_to_mongo(item)


if __name__ == '__main__':
    clint = pymongo.MongoClient(MONGODB_URL)
    # Inserts need a Collection, not the client itself;
    # "douban" and "movies" are placeholder names, not from the original post
    db = clint["douban"]["movies"]
    url = "https://movie.douban.com/j/chart/top_list?"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
    }
    datas = []
    for x in range(START, END + 1):
        data = {
            "type": "11",
            "interval_id": "100:90",
            "action": "",
            "start": x * 20,  # offset of the first record on page x
            "limit": "20"
        }
        datas.append(data)
    pool = Pool(4)
    pool.map(main, datas)
    pool.close()
    pool.join()
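    clint.close()

I'm not familiar with Python myself, but... AJAX is something used on the browser side; what a Python backend uses isn't AJAX, is it?

bigcan posted on 2021-6-13 23:04
I'm not familiar with Python myself, but... AJAX is something used on the browser side; what a Python backend uses isn't AJAX, is it?
He means that Douban uses AJAX, and he is crawling it with Python.

I've only just started learning too. Your code logic all looks right to me; if you can write this, I'd say you're no longer a beginner.

ablajan posted on 2021-6-13 23:35
I've only just started learning too. Your code logic all looks right to me; if you can write this, I'd say you're no longer a beginner.
I've been fumbling around on my own for two months, looking at everything and mastering nothing.

lihu5841314 posted on 2021-6-13 23:57
I've been fumbling around on my own for two months, looking at everything and mastering nothing.
Generally, as long as the site isn't encrypted, it can still be crawled; my code just doesn't flow well.

lihu5841314 posted on 2021-6-13 23:58
Generally, as long as the site isn't encrypted, it can still be crawled; my code just doesn't flow well.
Anyway, I figure that if I keep at it for a year or two, even a pig would have become a sage by then.

A beginner here to learn. I feel two libraries would be enough for this site, and the code isn't very nice to look at either.
There is so much code under the main entry point, and for a beginner with no background it is still really hard.

The way your title is written, I took it to mean using AJAX to do the crawling.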
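
On the AJAX question raised in the replies: the script does not run any AJAX itself. It requests the same JSON endpoint (`/j/chart/top_list`) that the page loads dynamically, and pages through it with the `start` and `limit` parameters. Below is a minimal sketch of that paging logic with no network access. `build_page_params` and `pick_fields` are hypothetical helper names that do not appear in the original post, and the sample record is made up purely for illustration; only the field names are taken from the script above.

```python
import json


def build_page_params(page, page_size=20, type_id="11", interval_id="100:90"):
    """Return the query parameters for one page of the top_list endpoint."""
    return {
        "type": type_id,
        "interval_id": interval_id,
        "action": "",
        "start": page * page_size,  # offset of the first record on this page
        "limit": page_size,
    }


def pick_fields(record):
    """Keep only the fields the original script reads from each record."""
    keys = ("title", "regions", "score", "vote_count", "url")
    return {key: record.get(key) for key in keys}


# Made-up sample: a JSON list of dicts, using the keys the script accesses
# plus one extra key that should be dropped.
sample = json.loads(
    '[{"title": "Example", "regions": ["X"], "score": "9.0",'
    ' "vote_count": 100, "url": "https://example.invalid/1", "extra": 1}]'
)

params = build_page_params(2)
items = [pick_fields(record) for record in sample]
```

Passing `params` to `requests.get(url, headers=headers, params=params)` would then fetch that page, which is what `get_moive_index` does. Moving the parameter building into a small function like this would also shorten the code under the main entry point.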