Newbie needs help!!! Hoping an expert can answer

Prozacs · Posted 2021-8-3 11:55

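Straight to the point. I wanted to learn web scraping and immediately ran into a problem. I want to fetch the data at https://mewe.groups.hk/%E9%A3%B2%E9%A3%9F, but it turns out to be loaded by an AJAX request. In the POST request, the value of g-recaptcha-response keeps changing. As usual I kept debugging, but I couldn't figure it out... Help!!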
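One way to confirm which POST field is the one that keeps changing is to copy the request body from the browser's Network tab twice and diff the two. The snippet below is only a minimal sketch using the standard library; the helper name `changed_fields` and the two sample bodies (including the `category` and `page` fields and the token values) are made up for illustration, since the real request body is not shown in this thread.

```python
from urllib.parse import parse_qs


def changed_fields(body_a: str, body_b: str) -> set:
    """Return the form field names whose values differ between two POST bodies."""
    a = parse_qs(body_a, keep_blank_values=True)
    b = parse_qs(body_b, keep_blank_values=True)
    return {key for key in a.keys() | b.keys() if a.get(key) != b.get(key)}


# Hypothetical bodies; the real request's other fields are not shown in this thread.
first = "category=food&page=1&g-recaptcha-response=03AGdBq25abc"
second = "category=food&page=1&g-recaptcha-response=03AGdBq27xyz"
print(changed_fields(first, second))  # {'g-recaptcha-response'}
```

If only one field differs between two otherwise identical requests, that field is probably a per-request token rather than something computed from the other parameters.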

lxl6832 · Posted 2021-8-3 14:14

That's an anti-scraping mechanism.

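Prozacs · Posted 2021-8-3 14:21

lxl6832 posted on 2021-8-3 14:14
That's an anti-scraping mechanism.

I'm hoping an expert can explain how that piece of JS generates the token. I can actually scrape the site with selenium: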
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from openpyxl import Workbook
import lxml.html
import time

wb = Workbook()
sheet = wb.active
sheet.append(['Group name', 'Link', 'Size', 'Description', 'Category'])

# Every pass re-parses the whole page, so remember what was already written.
seen_links = set()


def scroll(num, category):
    """Scroll down up to `num` times and save every group card not seen yet."""
    try:
        height = 0
        for _ in range(num):
            new_height = zh.execute_script('return document.body.scrollHeight;')
            if new_height <= height:
                print("Already at the bottom of the page!")
                break
            zh.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            time.sleep(3)  # give the AJAX-loaded cards time to render
            height = new_height
            tree = lxml.html.fromstring(zh.page_source)
            for card in tree.xpath('//div[@class="group-wrapper"]/a'):
                try:
                    name = card.xpath('./@title')[0]
                    link = card.xpath('./@href')[0]
                except IndexError:
                    continue  # card without a title or href
                if (category, link) in seen_links:
                    continue
                seen_links.add((category, link))
                nums = ' '.join(card.xpath('.//span/text()'))
                intro = ''.join(card.xpath('.//div[@class="group-desc"]/text()'))
                sheet.append([name, link, nums, intro, category])
    except Exception as e:
        print(e)


chrome_options = Options()
# Attach to a Chrome instance that is already running with remote debugging on port 9221.
chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9221")
# Selenium 3 style call; newer Selenium releases take the driver path through a Service object.
zh = webdriver.Chrome(r'D:\pythongc\chromedriver.exe', options=chrome_options)

url1 = 'https://mewe.groups.hk/'
zh.get(url1)
html = lxml.html.fromstring(zh.page_source)
page_urls = html.xpath('//ul[@class="cat-lists scroll"]/li/a/@href')
for i in page_urls:
    page_url = url1[:-1] + i
    zh.get(page_url)
    scroll(10, i[1:])  # the hrefs appear to start with '/', so drop it for the category name
wb.save('sss.xlsx')

Domado · Posted 2021-8-3 14:31

g-recaptcha-response is the token value of Google's CAPTCHA (reCAPTCHA); it is extremely hard to crack.

rsnodame · Posted 2021-8-3 14:34

When in doubt, selenium.

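Prozacs · Posted 2021-8-3 14:34

Domado posted on 2021-8-3 14:31
g-recaptcha-response is the token value of Google's CAPTCHA (reCAPTCHA); it is extremely hard to crack.

...... I knew it... That's why, after debugging for ages, it felt like some kind of verification, but I had no way to confirm it was Google's. Thanks a lot!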
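For anyone else who is unsure whether a page uses Google's reCAPTCHA, a rough check is to look for the markers it normally leaves in the HTML: the `api.js` loader script, a `g-recaptcha` widget class, or the `g-recaptcha-response` field itself. The following is only a sketch under that assumption; the helper name `recaptcha_markers` and the sample HTML are made up for illustration, and a page that loads the script dynamically could still slip past a plain text search.

```python
import re

# Markers that Google's reCAPTCHA normally leaves in a page's HTML.
RECAPTCHA_PATTERNS = {
    "api_script": re.compile(
        r"(?:google\.com|recaptcha\.net)/recaptcha/(?:api|enterprise)\.js"
    ),
    "widget_class": re.compile(r"class=[\"'][^\"']*\bg-recaptcha\b"),
    "response_field": re.compile(r"g-recaptcha-response"),
}


def recaptcha_markers(html: str) -> list:
    """Return the names of the reCAPTCHA markers found in the HTML, sorted."""
    return sorted(
        name for name, pattern in RECAPTCHA_PATTERNS.items() if pattern.search(html)
    )


# Hypothetical page fragment; SITE_KEY is a placeholder, not a real key.
sample = '<script src="https://www.google.com/recaptcha/api.js?render=SITE_KEY"></script>'
print(recaptcha_markers(sample))  # ['api_script']
```

With the selenium script from earlier in the thread, the same function could be called on `zh.page_source`. An empty list does not prove there is no CAPTCHA, it only means none of these three markers showed up in the rendered HTML.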
