Last edited by Culaccino on 2020-1-19 16:23
Rating lookup page: https://v.dsb.ink
This crawler uses pyquery and can scrape most of the show information on yyets (人人影视). It currently only supports saving to CSV or MongoDB.
I wrote it because yyets itself has no way to filter shows by content rating.
Fields currently scraped: title, content rating, URL, site rank, category, score, original title, region, language, premiere date, production company/network, genre, translation, IMDB, aliases, screenwriter, director, cast, synopsis, and poster URL.
Here is part of the code:
def analyze(u):
    try:
        url = baseurl + u
        html = pq(requests.get(url, headers=headers).content.decode())
        # Grab the page heading, which holds the category and the title
        t = html(".resource-tit h2").text()
        title = re.search("【(.*)】《(.*)》", t).groups()
        # Grab the content rating
        items = html('.level-item img').attr('src')
        level = get_level(items, info["level"])
        # get_level returns False when the rating is filtered out; skip the entry
        if level is False:
            return
        # Grab the main info block
        main_info = get_info(html)
        # Category
        if "dramaType" in result_info:
            main_info["dramaType"] = title[0]
        # Score (the last 5 characters of the path, presumably the resource id)
        if "score" in result_info:
            main_info["score"] = get_score(u[-5:])
        if 'url' in result_info:
            main_info['url'] = url
        # Poster image
        if "imgurl" in result_info:
            main_info['imgurl'] = html('.imglink a').attr('href')
        # Site rank
        if "rank" in result_info:
            main_info["rank"] = re.search(r"本站排名:(\d*)", html(".score-con li:first-child .f4").text()).group(1)
        # Synopsis
        if "introduction" in result_info:
            main_info["introduction"] = html(".resource-desc .con:last-child").text()
        result = {}
        # Write the title and rating first
        result["title"] = title[1]
        result["level"] = level
        # Walk result_info and keep only the fields that were actually found
        for key in result_info:
            result[key] = main_info.get(key, '暂无')  # '暂无' = "not available"
        print(result['title'])
        # Persist the record
        if export == 'csv':
            wirtecsv(list(result.values()))  # CSV writer helper defined elsewhere ("wirtecsv" sic)
        else:
            mycol.insert_one(result)
    except Exception as e:
        throw_error(e)
        analyze(u)  # retry; note this recurses without bound on a permanent failure
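The 【category】《title》 heading format is what drives the regex at the top of analyze(); the pattern can be checked in isolation (the sample string below is made up for illustration, not real site output):

```python
import re

# A sample heading in the 【category】《title》 shape the crawler expects (made up)
t = "【美剧】《Westworld》"
title = re.search("【(.*)】《(.*)》", t).groups()
print(title)  # → ('美剧', 'Westworld')
```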
def main(num):
    try:
        url = baseurl + "/resourcelist/?page={}".format(num)
        doc = pq(requests.get(url, headers=headers).content.decode())
        urls = doc(".fl-info a").items()  # iterate over the <a> tags
        if info['threads']:
            for i in urls:
                Thread(target=analyze, args=(i.attr("href"),)).start()
        else:
            for i in urls:
                analyze(i.attr("href"))
                time.sleep(1)  # rate-limit sequential requests
    except Exception as e:
        throw_error(e)
        main(num)  # retry; recurses without bound on a permanent failure
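The snippet references several names defined elsewhere in the project (baseurl, headers, info, result_info, wirtecsv, mycol, throw_error). For the CSV branch, a writer along these lines would do the job; the default filename and the utf-8-sig encoding are my assumptions, not the author's actual helper:

```python
import csv

def wirtecsv(row, filename="result.csv"):
    # Append one scraped record as a CSV row.
    # utf-8-sig so Excel opens the Chinese text correctly -- my assumption,
    # not necessarily what the real helper does.
    with open(filename, "a", newline="", encoding="utf-8-sig") as f:
        csv.writer(f).writerow(row)
```

analyze() calls the writer once per record, which is why append mode makes sense here.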
Download from https://github.com/Kevin0z0/yyeTs-resource-crawler
or grab the attachment below.