吾爱破解 - 52pojie.cn

[Python repost] Practical use of lxml and XPath

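luoluoovo posted on 2018-7-27 16:32
I've been learning distributed crawling lately
and found that being fluent with lxml and XPath matters a lot, so I wrote a practice script (scraping Tencent's job listings).
It doesn't use threads; for now I only know multiprocessing, and with multiprocessing the results came back out of order, so I left it out.
IDE: PyCharm
Python version: 3.6.5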
[Python]
import requests
from lxml import etree

BASE_DOMAIN = "https://hr.tencent.com/"

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",
    "Cookie": "PHPSESSID=77cb9dm9pvcs7lgeu401lc0td1; pgv_pvi=9957434368; pgv_si=s9246081024",
    "Host": "hr.tencent.com",
    "Upgrade-Insecure-Requests": "1",
}


# Collect the detail-page URLs listed on one page of search results
def get_urls(url):
    response = requests.get(url, headers=HEADERS)
    html = etree.HTML(response.text)
    hrefs = html.xpath("//td[@class='l square']/a/@href")
    # The hrefs are relative, so prefix each one with the site root
    return [BASE_DOMAIN + href for href in hrefs]


# Fetch one job's detail page and collect its fields into a dict
def parse_detail_page(url):
    position = {}
    response = requests.get(url, headers=HEADERS)
    html = etree.HTML(response.text)
    table = html.xpath("//table[@class='tablelist textl']")[0]
    # ".//" keeps each query inside this table; called on an element,
    # a bare "//" would still search the whole document
    title = table.xpath(".//td[@id='sharetitle']/text()")[0]
    position['title'] = title
    # Text nodes come in label/value pairs: location, category, headcount
    place = table.xpath(".//tr[@class='c bottomline']/td//text()")
    position['workplace'] = place[0] + place[1]
    position['JobCategory'] = place[2] + place[3]
    position['Hiring'] = place[4] + place[5]
    # First list holds the job duties, second the requirements
    content = table.xpath(".//ul[@class='squareli']")
    position['duty'] = content[0].xpath(".//text()")
    position['requirements'] = content[1].xpath(".//text()")
    return position


# Main loop
def spider():
    informations = []
    # Search results URL; "{}0" turns page index x into the offset x * 10
    page = "https://hr.tencent.com/position.php?keywords=python&start={}0#a"
    # Number of result pages to crawl
    for x in range(0, 1):
        url = page.format(x)
        links = get_urls(url)
        for link in links:
            information = parse_detail_page(link)
            informations.append(information)
    print(informations)


if __name__ == '__main__':
    spider()
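
One XPath detail worth checking in a parse function like the one above: when xpath() is called on an element such as table, an expression that starts with "//" still searches the whole document, while ".//" stays inside that element. Here is a minimal sketch using made-up HTML; the SAMPLE markup and its class names are assumptions for illustration only, not the Tencent page.

```python
from lxml import etree

# Hypothetical two-table page, just to show the scoping difference
SAMPLE = """
<html><body>
  <table class="first"><tr><td>A1</td><td>A2</td></tr></table>
  <table class="second"><tr><td>B1</td></tr></table>
</body></html>
"""

doc = etree.HTML(SAMPLE)
second = doc.xpath("//table[@class='second']")[0]

# "//" is absolute: it starts from the document root even though
# xpath() is being called on the second table
absolute = second.xpath("//td/text()")

# ".//" is relative: only descendants of the second table match
relative = second.xpath(".//td/text()")

print(absolute)  # ['A1', 'A2', 'B1']
print(relative)  # ['B1']
```

If a page has several tables with similar cells, the relative form is usually what you want once you have already picked out one table.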

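On the ordering concern mentioned at the top of the post: as far as I know, Pool.map in multiprocessing and Executor.map in concurrent.futures both return results in the order of the inputs, even when the workers finish out of order; it is prints inside the workers, imap_unordered, or as_completed that come out shuffled. Below is a small sketch with a thread pool. fake_parse is a hypothetical stand-in for the detail-page parser that sleeps instead of making a request, so it runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_parse(url):
    """Hypothetical stand-in for a detail-page parser (no network)."""
    # Earlier URLs sleep longer, so they finish last
    delay = 0.05 * (3 - int(url[-1]))
    time.sleep(delay)
    return {"url": url}


urls = ["page-0", "page-1", "page-2"]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() yields results in the order of `urls`,
    # not in the order the workers complete
    results = list(pool.map(fake_parse, urls))

print([r["url"] for r in results])  # ['page-0', 'page-1', 'page-2']
```

Swapping fake_parse for a real parser would be the obvious next step, but that part is untested here.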

Thending posted on 2018-7-27 16:56
Thanks for sharing! I'm more familiar with BeautifulSoup.
OP | luoluoovo posted on 2018-7-27 17:05
aixxa posted on 2018-11-12 18:07
小黑LLB posted on 2019-2-13 21:46
Makes sense. Looks like I need to study XPath syntax properly.