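[Original source code] Scraping Zhilian Zhaopin (智联招聘) with Python
Last edited by lz270978971 on 2019-9-5 10:22
I was bored, so I wrote a small crawler for Zhilian Zhaopin.
Python version: Python 3.7
Dependencies: selenium, pyquery
Enough talk, here is the code: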
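from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from pyquery import PyQuery as pq
import time


class ZhiLian:
    def __init__(self):
        # Run Chrome in headless (no window) mode
        self.chrome_options = Options()
        self.chrome_options.add_argument('--headless')
        self.chrome_options.add_argument('--disable-gpu')
        # `options=` replaces the deprecated `chrome_options=` keyword
        self.driver = webdriver.Chrome(options=self.chrome_options)

    def get_url(self, search='python'):
        """
        Search for a job title and return the HTML of the results page.
        The demo searches for "python" by default.
        :param search: keyword typed into the search box
        :return: outer HTML of the rendered results page
        """
        self.driver.get("https://www.zhaopin.com/")
        element = self.driver.find_element(By.CLASS_NAME, "zp-search__input")
        element.send_keys(search)
        element.send_keys(Keys.ENTER)
        # Switch windows: switch_to.window() takes a single handle, not the
        # whole list, so pick the most recently opened one
        self.driver.switch_to.window(self.driver.window_handles[-1])
        # Wait for the JavaScript rendering to finish before reading the HTML
        time.sleep(4)
        html = self.driver.find_element(By.XPATH, "//*").get_attribute("outerHTML")
        return html

    def data_processing(self, search='python'):
        """
        Parse the results page and yield one tuple per job posting.
        :param search: keyword passed on to get_url
        :return: generator of (job name, company name, salary, requirements)
        """
        html = self.get_url(search)
        # Only the HTML is needed from here on, so shut down Chrome and chromedriver
        self.driver.quit()
        doc = pq(html)
        contents = doc(".contentpile__content__wrapper")
        for content in contents.items():
            jobname = content(".contentpile__content__wrapper__item__info__box__jobname__title").text()
            companyname = content(".contentpile__content__wrapper__item__info__box__cname").text()
            # "saray" is the class name exactly as given in the original post
            salary = content(".contentpile__content__wrapper__item__info__box__job__saray").text()
            demand = content(".contentpile__content__wrapper__item__info__box__job__demand").text()
            yield jobname, companyname, salary, ",".join(demand.split("\n"))


if __name__ == "__main__":
    datas = ZhiLian().data_processing()
    for data in datas:
        print(data)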
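A note on the fixed time.sleep(4) in get_url: it always waits the full four seconds, even when the page is ready sooner, and it is too short when the page is slow. One possible alternative, sketched here with only the standard library, is to poll a condition until it holds or a timeout expires. The helper name wait_until and its parameters are hypothetical and not part of the original post; Selenium also ships its own WebDriverWait class for the same purpose.

```python
import time


def wait_until(probe, timeout=10.0, interval=0.5):
    """Call probe() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without a truthy result.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)


# Hypothetical use inside get_url, in place of time.sleep(4)
# (needs `from selenium.webdriver.common.by import By`):
#   wait_until(lambda: self.driver.find_elements(
#       By.CLASS_NAME, "contentpile__content__wrapper"), timeout=10)
```

find_elements returns an empty list while nothing matches, so the lambda stays falsy until at least one result block has been rendered.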
Here is a screenshot of the result:
I tried running the code and got errors. It looks like the site now requires a human verification check before the page opens normally. Has the OP run into this? The error output is below:
"Hello, we are the Zhaopin front-end team (智联大前端).
We help ordinary people find better jobs, and of course we would not want to miss you, at the cutting edge of front-end development.
Your résumé is welcome at zpfe@group.zhaopin.com.cn.", source: https://fecdn1.zhaopin.cn/www/index.web.ef23d8.js (1)
"%c [ASCII-art banner; layout lost] color: #1787fb", source: https://fecdn1.zhaopin.cn/www/index.web.ef23d8.js (1)
"adv: request started", source: https://fecdn1.zhaopin.cn/www/index.web.ef23d8.js (1)
"adv: data updated successfully, elapsed: ?8ms", source: https://fecdn1.zhaopin.cn/www/index.web.ef23d8.js (1)
"adv: request returned data, took 88ms", source: https://fecdn1.zhaopin.cn/www/index.web.ef23d8.js (1)
"The AudioContext was not allowed to start. It must be resumed (or created) after a user gesture on the page. https://goo.gl/7K7WLu", source: https://static.geetest.com/static/js/fullpage.8.8.5.js (1)
"The AudioContext was not allowed to start. It must be resumed (or created) after a user gesture on the page. https://goo.gl/7K7WLu", source: https://static.geetest.com/static/js/fullpage.8.8.5.js (1)
Exception ignored in: <function Popen.__del__ at 0x01243A08>
Traceback (most recent call last):
  File "D:\python3.7\lib\subprocess.py", line 860, in __del__
    self._internal_poll(_deadstate=_maxsize)
  File "D:\python3.7\lib\subprocess.py", line 1216, in _internal_poll
    if _WaitForSingleObject(self._handle, 0) == _WAIT_OBJECT_0:
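OSError: The handle is invalid.
Nice, this has real practical value. It would be even better if it could scrape more job sites.
It doesn't seem to let you specify a location?
Closed the page as soon as I saw Selenium.
Showing my support. I hope the OP keeps improving it. Keep it up!
Just passing by.
Showing my support.
It would be great to have a ready-made exe.
氓之嗤嗤 posted on 2019-9-5 10:44
It doesn't seem to let you specify a location?
The location defaults to the city you are in.
Antigen posted on 2019-9-5 10:49
Closed the page as soon as I saw Selenium.
It works very well for scraping dynamic sites, and it saves you from hunting for the API endpoints.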
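On the verification problem reported earlier in the thread: the console output mentions scripts loaded from static.geetest.com, which suggests a verification check was served instead of the search results. A rough way to notice this before parsing is to look for such a marker in the HTML that get_url returns. This is only a heuristic sketch. The function name looks_like_verification_page and the choice of "geetest" as the marker are assumptions taken from that log, not something the site documents.

```python
def looks_like_verification_page(html, markers=("geetest",)):
    """Return True if any marker substring occurs in the HTML (case-insensitive)."""
    lowered = html.lower()
    return any(marker.lower() in lowered for marker in markers)


# Made-up HTML fragments, for illustration only
blocked = '<script src="https://static.geetest.com/static/js/fullpage.8.8.5.js"></script>'
normal = '<div class="contentpile__content__wrapper">...</div>'

print(looks_like_verification_page(blocked))  # True
print(looks_like_verification_page(normal))   # False
```

If the normal results page also loads that script, this check will give false positives, so compare it against a real page before relying on it.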