A practical application of lxml and XPath
I've been studying distributed crawlers recently and found that fluent use of lxml and XPath matters a lot, so I wrote a practice script (scraping Tencent job postings).
No threading yet; so far I only know multiprocessing, and it scrambled the output order, so I left it out.
IDE: PyCharm
Python version: 3.6.5
import requests
from lxml import etree

BASE_DOMAIN = "https://hr.tencent.com/"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",
    "Cookie": "PHPSESSID=77cb9dm9pvcs7lgeu401lc0td1; pgv_pvi=9957434368; pgv_si=s9246081024",
    "Host": "hr.tencent.com",
    "Upgrade-Insecure-Requests": "1",
}
# Collect the detail-page URL of every position on one listing page
def get_urls(url):
    response = requests.get(url, headers=HEADERS)
    html = etree.HTML(response.text)
    hrefs = html.xpath("//td[@class='l square']/a/@href")
    links = []
    for href in hrefs:
        # hrefs are relative, so prepend the domain
        links.append(BASE_DOMAIN + href)
    return links
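The `@href` step above can be exercised offline. The snippet below is a made-up stand-in for the real listing page (the actual markup may differ); it only assumes lxml is installed:

```python
from lxml import etree

# Made-up stand-in for one listing page of the job site
SNIPPET = """
<html><body><table class="tablelist">
  <tr><td class="l square"><a href="position_detail.php?id=1">Backend Dev</a></td></tr>
  <tr><td class="l square"><a href="position_detail.php?id=2">Data Engineer</a></td></tr>
</table></body></html>
"""

html = etree.HTML(SNIPPET)
# @href selects the attribute value itself, not the <a> element
hrefs = html.xpath("//td[@class='l square']/a/@href")
print(hrefs)  # two relative links, ready to be joined with BASE_DOMAIN
```

Selecting `@href` directly saves a second pass over the elements to read the attribute.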
# Parse one detail page and collect the position's fields into a dict
def parse_tetail_page(url):
    position = {}
    response = requests.get(url, headers=HEADERS)
    html = etree.HTML(response.text)
    # xpath() returns a list, so take the first match before querying further
    table = html.xpath("//table[@class='tablelist textl']")[0]
    title = table.xpath(".//td[@id='sharetitle']/text()")[0]
    position['title'] = title
    # The bottom row of the table holds location / job category / headcount
    place = table.xpath(".//tr[@class='c bottomline']/td//text()")
    position['workplace'] = place[0]
    position['JobCategory'] = place[1]
    position['Hiring'] = place[2]
    # The first <ul class="squareli"> lists the duties, the second the requirements
    content = table.xpath(".//ul[@class='squareli']")
    position['duty'] = content[0].xpath(".//text()")
    position['requirements'] = content[1].xpath(".//text()")
    return position
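One lxml detail worth knowing when chaining queries off `table` as above: an XPath that starts with `//` searches the whole document even when called on a sub-element; prefix it with `.` to search only below that element. A minimal demonstration on a made-up snippet:

```python
from lxml import etree

SNIPPET = """
<html><body>
  <div class="a"><span>inside</span></div>
  <div class="b"><span>outside</span></div>
</body></html>
"""

html = etree.HTML(SNIPPET)
div_a = html.xpath("//div[@class='a']")[0]

absolute = div_a.xpath("//span/text()")   # searches the whole document
relative = div_a.xpath(".//span/text()")  # searches only under div_a
print(absolute, relative)
```

This is why the detail-page parser uses `.//` for every query scoped to `table`.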
# Main loop
def spider():
    informations = []
    # Listing URL for the search results
    page = "https://hr.tencent.com/position.php?keywords=python&start={}0#a"
    # Number of pages to crawl
    for x in range(0, 1):
        url = page.format(x)
        links = get_urls(url)
        for link in links:
            information = parse_tetail_page(link)
            informations.append(information)
    print(informations)

if __name__ == '__main__':
    spider()
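On the out-of-order results that kept multiprocessing out of this script: both `multiprocessing.Pool.map` and `concurrent.futures` `Executor.map` hand results back in input order, even when the workers finish in a different order. A stdlib-only sketch, with a hypothetical `fetch` standing in for the real request-and-parse step:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(page):
    # Hypothetical stand-in for get_urls + parse_tetail_page on one page
    time.sleep(random.uniform(0, 0.05))  # simulate variable network latency
    return "page-{}".format(page)

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() yields results in input order regardless of completion order
    results = list(pool.map(fetch, range(5)))

print(results)
```

Threads also suit this workload better than processes, since the script is I/O-bound on the HTTP requests.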
Thending posted on 2018-7-27 16:56:
Nice work! I'm more familiar with BeautifulSoup myself.

Reply: Mastering one tool is enough! bs4 is too bloated; it's clunky and slow. XPath is much nicer to work with.

Reply: Fair point. Looks like I should sit down and properly learn XPath syntax.