
[Python Original] Qixinbao (启信宝) Scraper

bobo2017365 posted on 2023-12-19 23:09
Last edited by bobo2017365 on 2023-12-22 22:42

Replace the xxxxx placeholders in the header dict with your real username and password, and remember not to crawl too fast.
For study purposes only, not for commercial use!
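On the "don't crawl too fast" point, the simplest approach is to sleep between page requests. A minimal sketch (the helper name and delay range are my own, not part of the original script):

import random
import time

import requests

def polite_get(url, min_delay=2.0, max_delay=5.0, **kwargs):
    """requests.get with a random pause so requests are spaced out."""
    time.sleep(random.uniform(min_delay, max_delay))
    return requests.get(url, **kwargs)

Swapping polite_get in for the requests.get calls in main() below keeps the scraper to at most one request every couple of seconds.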




2023-12-21
Now compatible with Python 3.
Requires bs4, installed with:
pip install bs4
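
A quick sanity check that bs4 is importable before running the script (my own one-liner, not from the post):

python -c "from bs4 import BeautifulSoup; print(BeautifulSoup('<em>ok</em>', 'html.parser').em.text)"

It should print ok.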


[Python]
# coding: utf-8
from __future__ import print_function
__author__ = 'bobo'
from math import ceil
import io
import sys

import requests

py_version = sys.version_info

# bs4 on Python 3; fall back to the old BeautifulSoup (BS3) module on Python 2
try:
    from bs4 import BeautifulSoup as BS
except ImportError:
    from BeautifulSoup import BeautifulSoup as BS

if py_version < (3, 0):
    reload(sys)
    sys.setdefaultencoding("utf-8")

home_url = 'http://www.qixin.com/search'
login_url = ""  # left blank in the original post; fill in the real login endpoint
get_params = {
    "area.city": 3303,   # administrative area codes used by the qixin.com search
    "area.province": 33,
    "key": '',           # search keyword, filled in by main()
    "page": 1,
}


# Posted as form data to login_url: "acc"/"pass" are the username/password
# placeholders to replace; the remaining keys mirror browser request headers.
header = {
    "acc": "xxxxxxxx",
    "pass": "xxxxxxx",
    "captcha": {"isTrusted": True},

    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    "Connection": "keep-alive",
    "Host": "www.qixin.com",
    "Upgrade-Insecure-Requests": 1,
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:54.0) Gecko/20100101 Firefox/54.0",
}


def get_page_number(html, EVERY_PAGE=10, bbq=False):
    """
    Read the total result count from the page and work out the page count.
    :param html: parsed result page (raw HTML text if bbq=True)
    :param EVERY_PAGE: results shown per page
    :param bbq: parse html with BeautifulSoup first
    :return: number of pages, or False if the count could not be parsed
    """
    if bbq:
        html = BS(html)
    shangjia_numbers = html.find("em").text
    print(shangjia_numbers)
    print()
    try:
        # float() keeps the ceil() below simple
        shangjia_numbers = float(shangjia_numbers)
        PAGE_NUMBER = int(ceil(shangjia_numbers / EVERY_PAGE))
    except (TypeError, ValueError):
        print("failed to parse the result count")
        PAGE_NUMBER = False
    return PAGE_NUMBER


# Scrape company names and addresses out of one page of search results
def get_company_info(html, status=u"存续", all=False, file_name=None, bbq=False):
    """
    :param status: company status to keep: 存续 (active) / 注销 (deregistered)
    :param all: keep every company regardless of status (currently unused)
    :param file_name: append results to this file if given
    :param bbq: parse html with BeautifulSoup first
    """
    if bbq:
        html = BS(html)
    for tag_div in html.findAll(attrs={"class": "col-2-1"}):
        for span in tag_div.findAll("span"):
            # if status in span.text or all:
            if status in span.text:
                company_name = tag_div.find(attrs={"title": u"点击查看公司详情"}).text
                for span_address in tag_div.findAll(attrs={"class": "legal-person"}):
                    if u"地址" in span_address.text:
                        # strip the "地址" prefix and the trailing link text
                        company_address = span_address.text[3:][:-4]
                        if file_name:
                            # io.open so UTF-8 writing works on both Python 2 and 3
                            with io.open(file_name, 'a', encoding='utf-8') as f:
                                print(company_name, company_address)
                                f.write(u"{0} {1}\n".format(company_name, company_address))

    return


def main(KEY_WORD=u"医美", file_name="address.txt", first_html=None):
    _cookies = None
    if not first_html:
        # log in first so the search pages are fetched with a session cookie
        r = requests.post(login_url, data=header)
        _cookies = r.cookies

        get_params["key"] = KEY_WORD
        print(get_params)
        first_html = requests.get(home_url, params=get_params, cookies=_cookies).text
        print(repr(first_html))
    first_html_content = BS(first_html)
    get_company_info(first_html_content, file_name=file_name)

    pages = get_page_number(first_html_content)
    if not pages:
        return
    for page_number in range(2, pages + 1):
        get_params["page"] = page_number
        response = requests.get(home_url, params=get_params, cookies=_cookies)
        get_company_info(response.text, file_name=file_name, bbq=True)


if __name__ == "__main__":
    # for word in [u"医美", u"整容", u"整形"]:
    #     main(KEY_WORD=word)
    main()
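
As the commented-out loop at the bottom hints, main() can be called once per keyword. A small usage sketch (the output file names are my own):

for word, fname in [(u"医美", "yimei.txt"), (u"整容", "zhengrong.txt"), (u"整形", "zhengxing.txt")]:
    main(KEY_WORD=word, file_name=fname)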



scz posted on 2023-12-20 09:12
blindcat posted on 2023-12-20 07:49
Misread it, I thought this was a WeChat Work (企业微信) scraper

I misread it the same way, and was still wondering what the angle was.
zheng8542 posted on 2023-12-21 10:09
For Python 3, change from BeautifulSoup import BeautifulSoup as BS
to from bs4 import BeautifulSoup
and disable
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
otherwise it errors:
Traceback (most recent call last):
  File "d:\python\企信网.py", line 12, in <module>
    reload(sys)
NameError: name 'reload' is not defined
But it still errors:
Traceback (most recent call last):
  File "d:\python\企信网.py", line 117, in <module>
    main()
  File "d:\python\企信网.py", line 93, in main
    r = requests.post(login_url, data=header)
  File "D:\Python39\lib\site-packages\requests\api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "D:\Python39\lib\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "D:\Python39\lib\site-packages\requests\sessions.py", line 573, in request
    prep = self.prepare_request(req)
  File "D:\Python39\lib\site-packages\requests\sessions.py", line 484, in prepare_request
    p.prepare(
  File "D:\Python39\lib\site-packages\requests\models.py", line 368, in prepare
    self.prepare_url(url, params)
  File "D:\Python39\lib\site-packages\requests\models.py", line 439, in prepare_url
    raise MissingSchema(
requests.exceptions.MissingSchema: Invalid URL '': No scheme supplied. Perhaps you meant http://?
Is that because login_url is empty?
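The traceback does point at the empty login_url: requests raises MissingSchema for any URL with no http:// or https:// scheme, and the posted script leaves login_url = "". A minimal guard, as a sketch (the message text is mine; the real endpoint is not given in the post):

if not login_url:
    raise SystemExit("login_url is blank in the posted script -- "
                     "fill in the real qixin.com login endpoint before calling main().")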
Python666999 posted on 2023-12-20 07:40
Bookmarked, will study it slowly
blindcat posted on 2023-12-20 07:49
Misread it, I thought this was a WeChat Work (企业微信) scraper
zql961213wgh posted on 2023-12-20 08:13
Bookmarked, will study it slowly
alanfish posted on 2023-12-20 08:14
Wow, the experts are here
milu1123 posted on 2023-12-20 08:16
    from BeautifulSoup import BeautifulSoup as BS
ModuleNotFoundError: No module named 'BeautifulSoup'
qlcyl110 posted on 2023-12-20 08:16
Here to learn
okmad posted on 2023-12-20 08:31
Learned something, thanks