Python 3: recognizing web page titles
Last edited by zuoyou001 on 2020-8-5 11:57. Masters, this is a multithreaded web page title scanner I wrote in Python 3. I'm a beginner, so the code is pretty rough; please bear with me.
But masters, I have one question: when I first started with Python 3 multithreading, I ran into the problem of the ip_scan thread ending on its own. My workaround was to add a delay inside the ip_scan function, but it didn't work very well. I'd be grateful for any better solutions.
Code:
import re
import os
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from queue import Queue
from threading import Thread

ip_list = []
file_name = "url"
file_name = os.path.join('H:/file', file_name)

def file(q):
    # read one URL/IP per line and feed the work queue
    file_name1 = file_name + ".txt"
    with open(file_name1, 'r', encoding='UTF-8') as f:
        for line in f:
            q.put(line.strip('\n'))

def save(ip, title):
    file_name2 = file_name + "_target.txt"
    with open(file_name2, 'a+', encoding='UTF-8') as f3:
        f3.write(ip)
        f3.write(title + "\n")

def ip_scan(q):
    browser = webdriver.Firefox()
    browser.set_page_load_timeout(8)
    while not q.empty():
        ip = q.get()
        print(ip)
        if 'http' not in ip:
            url1 = 'http://' + ip
        else:
            url1 = ip
        try:
            browser.get(url1)
            title = browser.title
            print(title)
            if "阿里云404页面" not in title and "Not Found" not in title:
                save(url1, title)
        except Exception:
            pass
    browser.quit()

def main():
    q = Queue()
    s = Thread(target=file, args=(q,))
    s.start()
    s.join()
    threads = []
    for i in range(5):
        t = Thread(target=ip_scan, args=(q,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

if __name__ == "__main__":
    main()
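On the early-exit question in the post above: `while not q.empty()` can race, because the queue may momentarily look empty while other workers hold items, so a thread returns right away. A common fix, sketched here with placeholder URLs rather than the poster's real input file, is to fill the queue first and then add one sentinel per worker, so each thread exits only when explicitly told to:

```python
from queue import Queue
from threading import Thread

NUM_WORKERS = 5
SENTINEL = None          # special value telling a worker to stop

results = []             # list.append is thread-safe in CPython

def worker(q):
    while True:
        item = q.get()
        if item is SENTINEL:
            break        # deliberate shutdown, not an accidental early exit
        results.append(item)

q = Queue()
for url in ["a.com", "b.com", "c.com"]:
    q.put(url)
for _ in range(NUM_WORKERS):   # one sentinel per worker
    q.put(SENTINEL)

threads = [Thread(target=worker, args=(q,)) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))   # ['a.com', 'b.com', 'c.com']
```

With sentinels there is no need for an artificial delay: a worker blocks on `q.get()` until work (or a stop signal) arrives.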
I happen to have a multiprocessing example on hand. If the data is independent between threads, it's better to use multiprocessing; Python's multithreading is practically a toy by comparison:
import os
from multiprocessing.pool import Pool

import pymongo
from drivers import firefox, refresh_page, get_header_detail, click_each_url, \
    if_new_window_is_opened, if_detail_page_loaded, get_overview_detail

def get_products_in_each_page(page):
    browser = firefox()
    browser.get(
        f'https://www.alibaba.com/products/Schisandra_Chinensis.html?spm=a2700.galleryofferlist.0.0.2a1363ffJcGqtX&IndexArea=product_en&page={page}')
    refresh_page(browser)
    products = browser.find_elements_by_class_name('organic-list-offer-inner')
    items = []
    for product in products:
        item = {}
        get_header_detail(product, item)
        main_handles = browser.window_handles
        href = click_each_url(product)
        if not if_new_window_is_opened(browser, main_handles):
            continue
        browser.switch_to.window(browser.window_handles[-1])
        if not if_detail_page_loaded(browser):
            continue
        get_overview_detail(browser, item, href)
        browser.close()
        browser.switch_to.window(browser.window_handles[0])  # back to the list window
        items.append(item)
        print(item)
    browser.quit()
    return items

def get_products(page):
    client = pymongo.MongoClient('172.17.0.2', 27017)
    db = client['alibaba']
    db.authenticate('root', 'a')
    wwz = db['wwz']
    items = get_products_in_each_page(page)
    wwz.insert_many(items)

if __name__ == '__main__':
    print('Parent process %s.' % os.getpid())
    last_page = 37
    pool = Pool(2)
    for i in range(last_page):
        pool.apply_async(get_products, args=(i + 1,))
    print('Waiting for all subprocesses done...')
    pool.close()
    pool.join()
    print('All subprocesses done.')

Last edited by pzx521521 on 2020-8-5 16:35.
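For comparison, the same fan-out can be written with the stdlib concurrent.futures API, which also collects the per-page results instead of discarding the apply_async return values. This is a minimal self-contained sketch: the real scraping function is replaced by a stand-in so it runs anywhere:

```python
from concurrent.futures import ProcessPoolExecutor

def get_products(page):
    # stand-in for the real scraping work: just square the page number
    return page * page

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=2) as pool:
        # map preserves input order and returns the results
        results = list(pool.map(get_products, range(1, 6)))
    print(results)  # [1, 4, 9, 16, 25]
```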
Selenium generally isn't used for crawling; it's best to use requests carrying cookies, as in 2L.
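A browser-free sketch of the same title check using only the stdlib (helper names here are hypothetical; in practice something like requests.get(url, timeout=8).text would supply the HTML):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def normalize_url(ip):
    # same rule as the original script: add a scheme when one is missing
    return ip if ip.startswith("http") else "http://" + ip

def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()
```

This drops the browser entirely, so dozens of worker threads become cheap.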
If you insist on using it for crawling,
you must also use a thread pool (Chrome is far too resource-hungry to open dozens of instances).
On the "ip_scan thread ends on its own" issue:
-> that's not a problem; a thread naturally ends once its function finishes.
The problem you actually ran into is how to wait until the page has finished loading (you can google "selenium wait").
The usual approach is to wait until a particular HTML element appears:
WebDriverWait wait = new WebDriverWait(driver, 10);
wait.until(ExpectedConditions.presenceOfElementLocated(By.id("XXX")));
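The snippet above is Java syntax; the equivalent explicit wait in Python's Selenium bindings looks like this (the element id "XXX" is a placeholder, as in the reply):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox()
browser.get("http://example.com")
# Block for up to 10 seconds until the element exists in the DOM;
# raises TimeoutException if it never appears.
element = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, "XXX"))
)
browser.quit()
```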
thepoy posted on 2020-8-5 13:19:
I happen to have a multiprocessing example on hand. If the data is independent between threads, it's better to use multiprocessing; Python's multithreading is practically a toy by comparison:
Much appreciated, thank you, master. pzx521521 posted on 2020-8-5 16:33:
Selenium generally isn't used for crawling; it's best to use requests carrying cookies, as in 2L.
If you insist on using it for crawling,
you must also use a thread pool (Chrome ...
Much appreciated, thank you, master.