[Python] Scraping 51job (前程无忧) job listings with the PySpider crawler framework

小葫蘆 posted on 2018-5-14 09:46

Last edited by wushaominkk on 2018-5-14 17:26

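PySpider is a powerful web crawling system written by a Chinese developer, and it ships with a capable WebUI. It is written in Python, has a distributed architecture, and supports several database backends. The WebUI includes a script editor, a task monitor, a project manager, and a results viewer.
A small amount of code is enough to crawl a large number of pages, which makes the framework a good fit for beginners.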
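Installation:
1. Install Anaconda3 (I used Python 3.6).
2. Install the pycurl build that matches your Python version: pip install pycurl
3. Install PySpider: pip install pyspider
Then start it from a command prompt: pyspider
Open http://localhost:5000/ in a browser. If the startup succeeded, the WebUI page loads.
For more detailed installation instructions, search online.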
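As a quick sanity check after the steps above, the following minimal sketch reports whether the two packages can be found by the current interpreter. It assumes the import names are pycurl and pyspider; the helper name check_installed is hypothetical and is not part of either package.

```python
import importlib.util


def check_installed(module_names):
    """Return a dict mapping each module name to True if it can be found."""
    status = {}
    for name in module_names:
        # find_spec returns None when a top-level module is not installed
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_installed(["pycurl", "pyspider"]).items():
        print(name, "OK" if ok else "missing, try: pip install " + name)
```

If either package is reported as missing, rerun the matching pip command in the same Anaconda environment that you use to start pyspider.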
Below, the source code of a newly created crawler is compared with my modified version.
Source code of a newly created crawler:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2018-05-14 09:43:31
# Project: qcwy20180514

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=030200&keyword=python&keywordtype=2&lang=c&stype=2&postchannel=0000&fromType=1&confirmdate=9', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }

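The start URL passed to self.crawl in on_start carries the search settings as query parameters. The sketch below rebuilds that same URL from a dictionary using only the standard library, so the keyword or area code can be changed in one place. The helper name build_search_url is hypothetical, and the meaning of the parameters is only inferred from their names (keyword looks like the search term, jobarea like a region code); the post does not document them.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "http://search.51job.com/jobsearch/search_result.php"


def build_search_url(keyword="python", jobarea="030200"):
    """Build the search URL; defaults reproduce the URL used in the post."""
    # Parameter names, values and order are copied from the original URL.
    # Their meaning is an assumption based on the names alone.
    params = {
        "fromJs": "1",
        "jobarea": jobarea,
        "keyword": keyword,
        "keywordtype": "2",
        "lang": "c",
        "stype": "2",
        "postchannel": "0000",
        "fromType": "1",
        "confirmdate": "9",
    }
    # urlencode keeps the dict's insertion order (Python 3.7+)
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    url = build_search_url(keyword="java")
    print(url)
    # Parse the query string back to confirm which keyword was encoded
    print(parse_qs(urlsplit(url).query)["keyword"])
```

Inside the handler, the result could then be passed to self.crawl in place of the hard-coded string.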
My modified source code:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2018-01-29 11:56:33
# Project: qcwy

from pyspider.libs.base_handler import *
import pymongo


class Handler(BaseHandler):
    crawl_config = {
    }

    client = pymongo.MongoClient("localhost")  # local MongoDB server
    db = client["tb_qcwy"]  # database name

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=030200&keyword=python&keywordtype=2&lang=c&stype=2&postchannel=0000&fromType=1&confirmdate=9',
                   callback=self.index_page,
                   validate_cert=False,
                   connect_timeout=50,
                   timeout=500
                   )

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('p > span > a').items():  # link to each job's detail page
            self.crawl(each.attr.href, callback=self.detail_page, validate_cert=False)

        next_page = response.doc('.bk > a').attr.href  # link to the next page
        if next_page:  # skip when no link was found
            self.crawl(next_page, callback=self.index_page, validate_cert=False)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,  # page URL
            "location": response.doc('h1').text(),  # location
            "company": response.doc('.cname > a').text(),  # company name
            "work_location": response.doc('.lname').text(),  # work location
            "salary": response.doc('.cn > strong').text(),  # salary
            "requirements": response.doc('.sp4').text(),  # job requirements
            "zhiweixinxi": response.doc('.job_msg').text(),  # job description
            "address": response.doc('.bmsg > .fp').text(),  # company address
        }

    # Save to MongoDB
    def on_result(self, result):
        if result:
            self.save_to_mongo(result)

    def save_to_mongo(self, result):
        # insert_one replaces the deprecated insert() in pymongo 3+
        if self.db["qcwy20180129"].insert_one(result):  # collection name
            print("save to mongo", result)

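To try the save step without a running MongoDB server, the flow of on_result and save_to_mongo can be sketched with an in-memory stand-in. FakeCollection and ResultSaver are hypothetical names invented for this illustration. Only the method name insert_one mirrors a real pymongo collection method, and the stand-in returns the stored document rather than pymongo's result object.

```python
class FakeCollection:
    """In-memory stand-in for a MongoDB collection (hypothetical, illustration only)."""

    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        # Store a copy so later changes to the caller's dict do not leak in
        stored = dict(doc)
        self.docs.append(stored)
        return stored


class ResultSaver:
    """Mimics the on_result -> save_to_mongo flow of the handler above."""

    def __init__(self, collection):
        self.collection = collection

    def on_result(self, result):
        # Callbacks that return nothing produce an empty result; skip those
        if result:
            self.save(result)

    def save(self, result):
        if self.collection.insert_one(result):
            print("saved", result)


if __name__ == "__main__":
    saver = ResultSaver(FakeCollection())
    saver.on_result(None)  # ignored, nothing is stored
    saver.on_result({"url": "http://example.com/job/1", "salary": "n/a"})
    print(len(saver.collection.docs))
```

Swapping FakeCollection for a real collection such as the one returned by self.db["qcwy20180129"] should leave the rest of the flow unchanged, as long as only insert_one is used.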
That's a wrap!

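mythe777 posted on 2019-12-23 23:55

linuxprobe posted on 2018-5-14 10:37
Right now Java ranks first among programming languages and Python ranks fourth, and Python may well drop further.

It's in third place now, haha, so that prediction backfired a bit. Java really is an evergreen, though.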

heroic posted on 2018-5-14 09:59

Applause! Learned something.

benet posted on 2018-5-14 10:11

Great stuff!

hlink1021 posted on 2018-5-14 10:31

Applause, learned something.

zeng110114 posted on 2018-5-14 10:36

Learning from this. Thanks for sharing.

linuxprobe posted on 2018-5-14 10:37

Right now Java ranks first among programming languages and Python ranks fourth, and Python may well drop further.

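小卷毛吼吼 posted on 2018-5-14 11:03

Impressive, I'll study this.

wangchun1lei posted on 2018-5-14 11:21

Wow, this is really nice.

不一定 posted on 2018-5-14 11:40

Nice one, learned something.

Mcoco posted on 2018-5-14 11:56

Learned something, thanks for the post!!!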
