[Original] Crawling Qichacha data with a Python crawler

wangyeyu2015 posted on 2019-5-25 17:56

This post was last edited by wangyeyu2015 on 2019-5-26 10:39

This is my fifth post. If it is in the wrong section, would the moderators kindly move it~

2019/05/26 10:30: fixed a small bug that caused duplicate data to be crawled. By the way, with this many bookmarks, is nobody going to give any points....

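Today I stumbled across my own blog, saw the miserable replies and traffic, and, on the principle that going too long without an update gets you unfollowed, I decided, ahem, to post an article about a Python crawler. It runs as follows:

The scraped data looks like this: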

The source code is as follows:
# -*- coding: utf-8 -*-
import requests
import lxml  # parser backend used by BeautifulSoup(..., 'lxml')
from bs4 import BeautifulSoup
import xlwt
import time
import urllib.parse

def craw(url, key_word, x):
    # Referer sent with the request: the plain search page for this keyword.
    # x (the page number) is kept for call compatibility; the page is already part of url.
    referer = 'https://www.qichacha.com/search?key=' + key_word
    headers = {
        'Host': 'www.qichacha.com',
        'Connection': 'keep-alive',
        'Accept': 'text/html, */*; q=0.01',
        'X-Requested-With': 'XMLHttpRequest',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
        'Referer': referer,
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cookie': 'xxxxxxxxx replace this with your own cookie xxxxxxxxx',
    }

    try:
        response = requests.get(url, headers=headers)
        response.encoding = 'utf-8'
        if response.status_code != 200:
            print(response.status_code)
            print('ERROR')
        soup = BeautifulSoup(response.text, 'lxml')
    except Exception:
        print('Even the request was refused. What is Qichacha playing at???')
        return  # nothing to parse
    try:
        # NOTE: the numeric indexes ([0], [1], ...) were stripped from the posted code by the
        # forum software. The ones below are a reconstruction from the field comments and
        # should be checked against the live page.
        com_all_info = soup.find_all(class_='m_srchList')[0].tbody
        com_all_info_array = com_all_info.select('tr')
        print('Crawling data; please do not open the Excel file')
        for row in com_all_info_array:
            # Assumed order of the <p> tags: [0] legal representative / capital / date,
            # [1] email / phone, [2] address.
            ps = row.select('p')
            temp_g_name = row.select('.ma_h1')[0].text  # company name
            temp_g_tag = row.select('.search-tags')[0].text  # company tags
            temp_r_name = ps[0].a.text  # legal representative
            temp_g_money = ps[0].select('span')[0].text.strip('注册资本：')  # registered capital
            temp_g_date = ps[0].select('span')[1].text.strip('成立日期：')  # date established
            # Email: take the line of the second <p> that carries the '邮箱：' label.
            temp_r_email = next((s for s in ps[1].text.split('\n') if '邮箱：' in s), '').strip().strip('邮箱：')
            temp_r_phone = ps[1].select('.m-l')[0].text.strip('电话：')  # representative's phone
            temp_g_addr = ps[2].text.strip().strip('地址：')  # company address
            temp_g_state = row.select('.nstatus.text-success-lt.m-l-xs')[0].text.strip()  # company status

            g_name_list.append(temp_g_name)
            g_tag_list.append(temp_g_tag)
            r_name_list.append(temp_r_name)
            g_money_list.append(temp_g_money)
            g_date_list.append(temp_g_date)
            r_email_list.append(temp_r_email)
            r_phone_list.append(temp_r_phone)
            g_addr_list.append(temp_g_addr)
            g_state_list.append(temp_g_state)
    except Exception:
        print('It looks like access was denied... please try again later...')

if __name__ == '__main__':
    # Result lists filled by craw(); index i across all lists describes one company.
    g_name_list = []
    g_tag_list = []
    r_name_list = []
    g_money_list = []
    g_date_list = []
    r_email_list = []
    r_phone_list = []
    g_addr_list = []
    g_state_list = []

    key_word = input('Enter the keyword to search for: ')
    num = int(input('Enter the number of searches to run: ')) + 1
    sleep_time = int(input('Enter the delay between searches, in seconds: '))

    key_word = urllib.parse.quote(key_word)

    print('Searching, please wait')

    for x in range(1, num):
        url = 'https://www.qichacha.com/search_index?key={}&ajaxflag=1&p={}&'.format(key_word, x)
        craw(url, key_word, x)
        time.sleep(sleep_time)

    workbook = xlwt.Workbook()
    # Create the sheet
    sheet1 = workbook.add_sheet('Qichacha data', cell_overwrite_ok=True)
    # --- Excel style: a plain style with the FangSong font ---
    style = xlwt.XFStyle()
    font = xlwt.Font()
    font.name = '仿宋'
    # font.bold = True  # bold
    style.font = font

    print('Saving data; please do not open the Excel file')
    # Header row
    name_list = ['Company name', 'Company tags', 'Legal representative', 'Registered capital',
                 'Date established', 'Representative email', 'Representative phone',
                 'Company address', 'Company status']
    for cc in range(0, len(name_list)):
        sheet1.write(0, cc, name_list[cc], style)
    # One row per company
    for i in range(0, len(g_name_list)):
        print(g_name_list[i])
        sheet1.write(i + 1, 0, g_name_list[i], style)  # company name
        sheet1.write(i + 1, 1, g_tag_list[i], style)  # company tags
        sheet1.write(i + 1, 2, r_name_list[i], style)  # legal representative
        sheet1.write(i + 1, 3, g_money_list[i], style)  # registered capital
        sheet1.write(i + 1, 4, g_date_list[i], style)  # date established
        sheet1.write(i + 1, 5, r_email_list[i], style)  # representative's email
        sheet1.write(i + 1, 6, r_phone_list[i], style)  # representative's phone
        sheet1.write(i + 1, 7, g_addr_list[i], style)  # company address
        sheet1.write(i + 1, 8, g_state_list[i], style)  # company status
    # Save the Excel file; an existing file with the same name is overwritten
    workbook.save(r"D:\wyy-qcc-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime()) + ".xls")
    print('Saved~')
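
A side note on the strip('注册资本：') style calls in the source: str.strip treats its argument as a set of characters to trim from both ends, not as a prefix, so it can also remove characters that belong to the value. The sketch below shows a prefix-only alternative. strip_label is a hypothetical helper name and the sample value is made up; neither is part of the original script.

```python
def strip_label(text, label):
    """Remove `label` only if `text` starts with it, then trim whitespace."""
    text = text.strip()
    if text.startswith(label):
        text = text[len(label):]
    return text.strip()


# Made-up sample value that happens to end in a character from the label.
raw = '地址：北京市某某路1号地'
print(raw.strip('地址：'))         # character-set strip: the trailing '地' is lost too
print(strip_label(raw, '地址：'))  # prefix strip: only the leading label is removed
```

Whether this matters in practice depends on the values the site returns; for most company data the original calls probably behave the same.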

……..
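What is it good for?
Heh, those who have a use for it will know; those who don't needn't bother.
PS: The file is saved in the root of drive D, named "wyy-qcc-year-month-day-hour-minute-second.xls".
PPS: What if you keep getting "It looks like access was denied... please try again later..."?
Then this thing is not for you, kid. Give up~
PPPS: If you only ever get five results per search, put your own cookie at line 26 (the 'Cookie' entry in headers).
PPPPS: If you find lots of duplicate rows and the like, please search Baidu for "Excel remove duplicates" yourself.
PPPPPS: Last one... if you find this useful... please follow and bookmark this link... as for me... I just paid for several years of server hosting... barring accidents it will be around for years, so please follow me... pretty please~
PPPPPPS: I have commented the code as much as I can... if you still don't get it... please keep working on your skills... and... follow my blog... everyone, draw your swords!
PPPPPPPS: Fixed a bug that caused duplicate data to be crawled.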
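
For the duplicate rows mentioned in the PPPPS, de-duplication can also be done in Python before anything is written to Excel. This is a minimal sketch, assuming a company is identified by its name (column 0). dedupe_rows is a hypothetical helper and the sample rows are made up.

```python
def dedupe_rows(rows, key_columns=(0,)):
    """Return rows with duplicates removed, keeping the first occurrence.

    rows        -- list of tuples, one per company
    key_columns -- indexes of the columns that identify a company
                   (assumed here: 0 = company name)
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    ('Company A', 'tag', 'address 1'),
    ('Company B', 'tag', 'address 2'),
    ('Company A', 'tag', 'address 1'),  # duplicate of the first row
]
for row in dedupe_rows(rows):
    print(row)
```

In the script, the rows could be built with zip(g_name_list, g_tag_list, ..., g_state_list) just before the write loop.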

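paxj168 posted on 2019-5-25 19:18

I'm learning Python crawlers right now, thanks for sharing

屠戮 posted on 2019-5-25 18:18

Nice, I'm studying Python crawling at the moment

Alones posted on 2019-5-25 18:15

What a pro, what a pro

yhzh posted on 2019-5-25 18:24

Thanks for sharing...

globlefaster posted on 2019-5-25 18:31

I hope you keep updating this; ideally it could also fetch the basic information from the detail pages

是你吗 posted on 2019-5-25 19:15

Doesn't this need a VIP account to crawl the complete data?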

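wangyeyu2015 posted on 2019-5-25 19:18

是你吗 posted on 2019-5-25 19:15
Doesn't this need a VIP account to crawl the complete data?

After a lot of requests the site makes you go through a proxy. Also, whether or not you are logged in, each search returns different results.
In theory you can get relatively complete data by searching repeatedly.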
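
The idea of searching repeatedly to approach a complete data set can be sketched as a merge step. This is an illustration only: it assumes each search pass yields dictionaries keyed by field name, and that a field left blank in one pass may be filled in by another. merge_passes and the sample records are hypothetical.

```python
def merge_passes(passes, key='name'):
    """Merge records from several search passes into one dict per company."""
    merged = {}
    for records in passes:
        for record in records:
            target = merged.setdefault(record[key], {})
            for field, value in record.items():
                # Fill a field only while it is still missing or blank.
                if not target.get(field):
                    target[field] = value
    return list(merged.values())


pass_1 = [{'name': 'Company A', 'phone': '', 'email': 'a@example.com'}]
pass_2 = [{'name': 'Company A', 'phone': '010-0000000', 'email': ''},
          {'name': 'Company B', 'phone': '', 'email': ''}]
for company in merge_passes([pass_1, pass_2]):
    print(company)
```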

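wangyeyu2015 posted on 2019-5-25 19:19

globlefaster posted on 2019-5-25 18:31
I hope you keep updating this; ideally it could also fetch the basic information from the detail pages

That part is up to you: build on it yourself, the complete code is already here

cyantea posted on 2019-5-25 19:39

Thanks for sharing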


吾爱破解 - 52pojie.cn's Archiver
