This post was last edited by 创造太阳 on 2020-9-2 17:06
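I sent my girlfriend 20,000 sweet nothings, and for now she doesn't want to hear any more! (See: "My girlfriend complained that I can't say sweet nothings, so I used Python to send her 20,000 of them and show her the complete collection!" https://www.52pojie.cn/thread-1113388-1-1.html (source: 吾爱破解论坛))
My girlfriend hasn't asked me to play Gomoku lately! (See: "My girlfriend beats me at Gomoku and is really cocky about it. I can't stand it, so I wrote a helper in Python; let's see how cocky she is now!" https://www.52pojie.cn/thread-1116867-1-1.html (source: 吾爱破解论坛))
She hasn't roasted me either, so the memes I saved last time seem to have been for nothing. I'll keep them anyway, in case they come in handy. (See: "To stop my girlfriend from roasting me, I used Python to scrape 3,600 comeback memes and waited for her to bring it on!" https://www.52pojie.cn/thread-1118801-1-1.html (source: 吾爱破解论坛))
I don't know whether you have found girlfriends yet; if you have, let's compare notes! (See: "Stop being bitter. Handing out girlfriends is not going to happen, but Python can help you create a chance to get one! Tell me when you do!" https://www.52pojie.cn/thread-1119202-1-1.html (source: 吾爱破解论坛))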
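My girlfriend has saved lots of photos of male celebrities from the internet, so I'm going to use Python to swap all their faces for mine! https://www.52pojie.cn/thread-1120431-1-1.html (source: 吾爱破解论坛)
To learn my girlfriend's little secrets, I used Python to scrape 60,000 girls' secrets from under 榜姐's Weibo posts! https://www.52pojie.cn/thread-1123043-1-1.html (source: 吾爱破解论坛)
My girlfriend sends me tempting pictures every night, so I used Python to get even more tempting pictures and fight back! https://www.52pojie.cn/thread-1128807-1-1.html (source: 吾爱破解论坛)
My girlfriend challenged me to a speed contest, so I had to use Python to show her how fast my hands are! Young people, don't try to take on what you don't... https://www.52pojie.cn/thread-1139015-1-1.html (source: 吾爱破解论坛)
To see what my girlfriend looked like on her birthday, I used Python to scrape a year's worth of photos! https://www.52pojie.cn/thread-1144764-1-1.html (source: 吾爱破解论坛)
My girlfriend said A cups are the most popular, so I used Python to scrape several hundred thousand purchase records to prove she was just making excuses! https://www.52pojie.cn/thread-1145712-1-1.html (source: 吾爱破解论坛)
My girlfriend said that because opposites attract, true love comes easily! I think that needs verifying with Python! https://www.52pojie.cn/thread-1151862-1-1.html (source: 吾爱破解论坛)
Zhou Yangqing (周扬青) and Show Lo (罗志祥) broke up, and my girlfriend said she wanted to check Zhou Yangqing's Tieba forum for experience posts, so I had to use Python to quickly... https://www.52pojie.cn/thread-1163712-1-1.html (source: 吾爱破解论坛)
My girlfriend likes the homestay style, so I used Python to scrape a short-term homestay rental site, like a proper boyfriend! https://www.52pojie.cn/thread-1171476-1-1.html (source: 吾爱破解论坛)
To speed up my girlfriend's arithmetic, I used Python to build her a problem generator and boost her brainpower a bit! https://www.52pojie.cn/thread-1187174-1-1.html (source: 吾爱破解论坛)
Not 998, not 668, not 188, just 10 lines of code! A hands-on guide to writing your girlfriend a "Sutra of Eating Without Getting Fat"! https://www.52pojie.cn/thread-1235152-1-1.html (source: 吾爱破解论坛)
My girlfriend was so desperate she called me "Daddy", so I had to use Python to help her finish collecting questionnaires quickly: five hundred done easily in an hour! https://www.52pojie.cn/thread-1252330-1-1.html (source: 吾爱破解论坛)
Before Qixi I used Python to snap up "萝卜丁" for my girlfriend, and she didn't seem all that happy about it! Strong contempt for merchants who pick ridiculous names... https://www.52pojie.cn/thread-1256443-1-1.html (source: 吾爱破解论坛)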
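For Qixi I spent ages working out what to give my girlfriend. Giving her "萝卜丁" (literally "diced radish") made me look like an idiot; only after asking a girl who studies Chinese did I understand, and giving that really was a brainless move!
I promptly decided to clear out her shopping cart instead, but on second thought that seemed a bit too tacky and didn't suit her style!
Then I remembered she had said she once wrote a novel on Faloo (飞卢), long long ago, the kind that makes you blush to read. So why not track it down and get it published for her?
But every time I ask, she only gives vague details: a bit over 200,000 words, not updated for a while, and still blush-inducing to read now.
So I had to make a rough guess: it is probably unfinished, unsigned, long without updates, and around 200,000 words. A rough search on Faloo should be enough.
Collect the title, author, word count, last-update date and so on, then look through the results.
I wrote a quick script: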
[Python]
import random
import time

import requests
from lxml import etree
from pandas import DataFrame

all_book_titles = []   # all titles
all_book_urls = []     # book page URLs
all_book_types = []    # genres
all_book_authors = []  # authors
all_book_nums = []     # word counts
all_book_sums = []     # summaries
all_book_date = []     # last-update dates

for i in range(1, 201):  # iterate over listing pages 1-200
    print("Scraping novel info on page " + str(i))
    url = "https://b.faloo.com/y/0/0/0/1/6/1/" + str(i) + ".html"  # build the listing URL
    time.sleep(random.randint(1, 5))  # random delay of 1-5 seconds
    try:
        res = requests.get(url, timeout=10).text  # fetch the listing page; the timeout avoids hanging forever
        res_xpath = etree.HTML(res)  # parse the HTML so it can be queried with XPath
        book_titles = res_xpath.xpath('//h1[@class="fontSize17andHei"]/@title')  # titles
        book_urls = res_xpath.xpath('//h1[@class="fontSize17andHei"]/a/@href')  # book links
        book_types = res_xpath.xpath('//span[@class="fontSize14andHui"]/a/text()')  # genres
        book_authors = res_xpath.xpath('//div[@id="BookContent"]//span[@class="fontSize14andsHui"]/a/text()')  # authors
        book_nums = res_xpath.xpath('//*[@class="fontSize14andHui"]/span[4]/text()')  # word counts
        book_sums = res_xpath.xpath('//*[@id="BookContent"]/div/div/div[2]/div[3]/a/text()')  # summaries
        for book_title, book_url, book_type, book_author, book_num, book_sum in zip(
                book_titles, book_urls, book_types, book_authors, book_nums, book_sums):
            all_book_titles.append(book_title)
            all_book_types.append(book_type)
            all_book_authors.append(book_author)
            all_book_nums.append(book_num)
            all_book_sums.append(book_sum)
            book_url = "https:" + book_url  # prepend the scheme to the book link
            all_book_urls.append(book_url)
            book_date = ""  # stays empty if the detail page fails, so every list keeps the same length
            try:
                print("Scraping the update date of novel " + book_title)
                detail = requests.get(book_url, timeout=10).text
                time.sleep(0.15)
                detail_xpath = etree.HTML(detail)
                date_parts = detail_xpath.xpath('/html/body/div[3]/div[2]/div[3]/div[1]/div[1]/div[1]/span/span/text()')  # update date
                book_date = "".join(date_parts)  # convert the list of text nodes to a str
            except Exception as exc:
                print("Failed to fetch " + book_url + ": " + str(exc))
            all_book_date.append(book_date)
    except Exception as exc:
        print("Failed to fetch page " + str(i) + ": " + str(exc))

df = DataFrame({"title": all_book_titles, "author": all_book_authors, "genre": all_book_types,
                "words": all_book_nums, "updated": all_book_date, "summary": all_book_sums,
                "url": all_book_urls})
df.to_excel("faloo_free_novel_details.xlsx")
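The random delay and the try/except around each request in the script above could be factored into one small helper. What follows is only a minimal sketch under stated assumptions: `fetch_with_retry` and its parameters (`fetch`, `retries`, `base_delay`, `sleep`) are hypothetical names that do not appear in the original script, and the backoff policy is an illustrative choice, not something the site is known to require.

```python
import random
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on failure, wait with jittered exponential backoff and retry.

    Returns whatever fetch returns, or None once every attempt has failed.
    All names here are hypothetical, for illustration only.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as exc:
            print(f"attempt {attempt + 1} failed for {url}: {exc}")
            if attempt < retries - 1:
                # Wait base_delay * 1, 2, 4, ... seconds plus up to 1 second of jitter
                sleep(base_delay * (2 ** attempt) + random.random())
    return None
```

With requests, the callable passed in could be something like `lambda u: requests.get(u, timeout=10).text`. Taking `fetch` and `sleep` as parameters keeps the helper testable without touching the network.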
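But this approach only yields 600 books, which is nowhere near enough, so I had to fall back on the most brute-force method!
Simple and crude: scrape the info for every single book. The novel page URLs differ only in a number, so concatenating them is enough. Given the sheer volume of data, though, only multithreading will do: start with a few threads and add more if that's not enough!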
The code is as follows:
[Python]
import re         # regular expressions
import threading  # threads
import time       # timing

import requests   # HTTP requests
from lxml import etree
from pandas import DataFrame

# Example book page URLs:
# https://b.faloo.com/f/668302.html
# https://b.faloo.com/f/626871.html


def crawl(num):
    """Scrape the 20,000 book pages whose IDs start at num and save the results to Excel."""
    rows = []            # one dict per book found
    fail_book_urls = []  # URLs whose request or parsing failed
    # The end of range() is exclusive, so consecutive 20,000-ID blocks do not overlap
    for i in range(int(num), int(num) + 20000):
        book_url = "https://b.faloo.com/f/" + str(i) + ".html"  # build the book URL
        try:  # an ID may have no book behind it, or the request may fail
            res = requests.get(book_url, timeout=10).text  # fetch the book page
            # time.sleep(0.15)  # optional delay to go easy on the server and avoid being blocked
            res_xpath = etree.HTML(res)  # parse the HTML so it can be queried with XPath
            book_title = res_xpath.xpath('//*[@id="novelName"]/text()')  # text nodes need /text()
            if len(book_title) == 0:  # no title means there is no book at this URL
                continue
            book_title = "".join(book_title)  # convert the list of text nodes to a str
            print("Scraping novel " + book_title)
            book_nums = "".join(res_xpath.xpath('/html/body/div[3]/div[2]/div[5]/div[2]/div[2]/span/text()'))
            # Build the whole record first and append once, so a failure cannot leave partial data
            rows.append({
                "title": book_title,
                "author": "".join(res_xpath.xpath('/html/body/div[3]/div[2]/div[3]/div[1]/div[1]/div[1]/a/@title')),
                "status": "".join(res_xpath.xpath('/html/body/div[3]/div[2]/div[5]/div[2]/div[4]/span/text()')),
                "genre": "".join(res_xpath.xpath('/html/body/div[3]/div[2]/div[5]/div[1]/div[2]/div[1]/span/span/a/text()')),
                "subgenre": "".join(res_xpath.xpath('/html/body/div[3]/div[2]/div[5]/div[1]/div[2]/div[2]/span/a/text()')),
                "words": "".join(re.findall(r"\d+\.?\d*", book_nums)),  # keep only the digits of the word count
                "updated": "".join(res_xpath.xpath('/html/body/div[3]/div[2]/div[3]/div[1]/div[1]/div[1]/span/span/text()')),
                "summary": "".join(res_xpath.xpath('/html/body/div[3]/div[2]/div[3]/div[2]/div[2]/div[1]/p/text()')),
                "url": book_url,
            })
        except Exception:
            fail_book_urls.append(book_url)  # remember the URL so it can be retried later
    columns = ["title", "author", "status", "genre", "subgenre", "words", "updated", "summary", "url"]
    df = DataFrame(rows, columns=columns)  # one Excel column per field
    df.to_excel(str(num) + "_novel_info.xlsx")
    df = DataFrame({"failed_url": fail_book_urls})
    df.to_excel(str(num) + "_failed_urls.xlsx")
if __name__ == "__main__":
    starttime = time.time()  # record the start time
    # One thread per block of 20,000 IDs: 100000, 120000, ..., 980000 (45 threads).
    # target is the function; args is a tuple, so a single argument needs a trailing comma.
    threads = [threading.Thread(target=crawl, args=(str(start),))
               for start in range(100000, 1000000, 20000)]
    for t in threads:
        t.start()  # start every thread
    for t in threads:
        t.join()   # wait for every thread to finish, otherwise only thread start-up is timed
    endtime = time.time()  # record the end time
    print("Elapsed time:", (endtime - starttime), "seconds")  # elapsed = end - start
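As an alternative to managing the Thread objects directly, the same split into 20,000-ID blocks could be handed to a thread pool from the standard library. This is a hedged sketch rather than the method used in the post: `chunk_starts`, `run_chunks` and `fake_crawl` are hypothetical names, `fake_crawl` merely stands in for the real `crawl`, and `max_workers=8` is an arbitrary illustrative value.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_starts(first_id, last_id, chunk_size):
    """Return the first ID of each chunk covering [first_id, last_id)."""
    return list(range(first_id, last_id, chunk_size))


def run_chunks(worker, starts, max_workers=8):
    """Run worker(start) for every start ID in a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Leaving the with-block waits for all submitted work to finish
        return list(pool.map(worker, starts))


def fake_crawl(start):
    # Stand-in for crawl(): just reports the ID range it would cover
    return (start, start + 20000)


if __name__ == "__main__":
    starts = chunk_starts(100000, 1000000, 20000)
    results = run_chunks(fake_crawl, starts)
    print(len(results), results[0], results[-1])
```

The pool caps how many requests are in flight at once, which is easier to tune than adding or removing threads by hand.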
If any of you have some spare time, feel free to help look for it!
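For anyone who does want to help, the guessed criteria (around 200,000 words and not updated for a long time) could be applied to the scraped rows roughly as sketched below. The field names `words` and `updated`, the thresholds, the `YYYY-MM-DD` date format and the sample rows are all assumptions made for illustration; none of them are confirmed by the site or by the scripts in this post.

```python
from datetime import date, datetime, timedelta


def parse_date(text):
    """Parse the first 10 characters as YYYY-MM-DD; return None if that fails."""
    try:
        return datetime.strptime(str(text or "").strip()[:10], "%Y-%m-%d").date()
    except ValueError:
        return None


def is_candidate(row, today, min_words=150_000, max_words=250_000, stale_days=365):
    """True if the row has roughly 200k words and has not been updated recently."""
    try:
        words = float(row["words"])
    except (KeyError, TypeError, ValueError):
        return False  # missing or non-numeric word count
    updated = parse_date(row.get("updated"))
    if updated is None:
        return False  # unknown update date
    is_stale = today - updated >= timedelta(days=stale_days)
    return min_words <= words <= max_words and is_stale


if __name__ == "__main__":
    # Made-up rows, purely for illustration
    rows = [
        {"title": "A", "words": "203000", "updated": "2015-03-01"},
        {"title": "B", "words": "203000", "updated": "2020-08-30"},
        {"title": "C", "words": "900000", "updated": "2014-01-01"},
    ]
    today = date(2020, 9, 2)
    print([r["title"] for r in rows if is_candidate(r, today)])  # prints ['A']
```

Checks for "unfinished" and "unsigned" would need the exact status labels the site uses, so they are left out here.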