依旧沉沉 posted on 2021-5-26 16:51

A question about Python web scraping.

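I want to use Python to scrape the icons from a website. The address is: https://flaticons.net/customize.php?dir=Application&icon=View-Incident.png
I have reached the second-level page, but I am stuck on the third-level page. The spot marked with a red box in the screenshot below is the download page, but its address does not appear in the page source (although clicking it by hand does open the third-level page). Is the address encrypted by JS? Does anyone have an idea for cracking it? I would appreciate any pointers from the experts here.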


lql3344521aaa. posted on 2021-5-26 17:00

https://flaticons.net/icon.php?slug_category=application&slug_icon=view-incident
Is this the icon you mean?

fanvalen posted on 2021-5-26 17:37

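If you can't analyze the JS, analyze the URLs that the JS requests instead, i.e. capture the traffic with F12 (the browser developer tools).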

pzx521521 posted on 2021-5-26 17:38

Last edited by pzx521521 on 2021-5-26 17:40

There is no JS encryption. Looking at the HTML, it is a form. Just trace the request:
curl "https://flaticons.net/customize.php?dir=Application&icon=View-Incident.png" ^
-H "authority: flaticons.net" ^
-H "pragma: no-cache" ^
-H "cache-control: no-cache" ^
-H "sec-ch-ua: ^\^" Not A;Brand^\^";v=^\^"99^\^", ^\^"Chromium^\^";v=^\^"90^\^", ^\^"Google Chrome^\^";v=^\^"90^\^"" ^
-H "sec-ch-ua-mobile: ?0" ^
-H "upgrade-insecure-requests: 1" ^
-H "origin: https://flaticons.net" ^
-H "content-type: application/x-www-form-urlencoded" ^
-H "user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36" ^
-H "accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" ^
-H "sec-fetch-site: same-origin" ^
-H "sec-fetch-mode: navigate" ^
-H "sec-fetch-user: ?1" ^
-H "sec-fetch-dest: document" ^
-H "referer: https://flaticons.net/customize.php?dir=Application&icon=View-Incident.png" ^
-H "accept-language: zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7" ^
-H "cookie: PHPSESSID=p70e3ci2v3j0s2fegn1grok0jn; __gads=ID=758a2093593297c7-226b12d5eac8003a:T=1622021420:RT=1622021420:S=ALNI_MYzuXb51WUtJlauQZ2kSlJstFsUEw" ^
--data-raw "background=dark&icon_size=256&icon_color=^%^23FFFFFF&icon_rotate=0&icon_flip=n&shape_id=0&shape_size=512&shape_color=^%^23FFFFFF" ^
--compressed
Tracing it makes things clear: POST to this address, https://flaticons.net/customize.php?dir=Application&icon=View-Incident.png,
with the data "background=dark&icon_size=256&icon_color=^%^23FFFFFF&icon_rotate=0&icon_flip=n&shape_id=0&shape_size=512&shape_color=^%^23FFFFFF"
The response is a 302 whose Location header contains the address:
https://flaticons.net/custom.php?i=v4ETz7TPwnEXizIQIeIOvcWgv5YiE
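
A note on the request body above: in the cmd-style curl command, `^%^23` appears to be `%23` with cmd escaping added, and `%23` is the URL-encoded form of `#`. The following is a minimal sketch, using only the standard library, of how the same body could be built from plain values in Python. The field names and values are copied from the command above; nothing else about the site is assumed.

```python
from urllib.parse import parse_qs, urlencode

# Form fields as plain values, copied from the curl command above.
form = {
    "background": "dark",
    "icon_size": "256",
    "icon_color": "#FFFFFF",
    "icon_rotate": "0",
    "icon_flip": "n",
    "shape_id": "0",
    "shape_size": "512",
    "shape_color": "#FFFFFF",
}

# urlencode() percent-encodes reserved characters, so "#" becomes "%23".
body = urlencode(form)
print(body)

# Decoding the body again recovers the original "#FFFFFF" values.
decoded = parse_qs(body)
print(decoded["icon_color"])
```

Libraries such as requests do this encoding themselves when a dict is passed as `data`, so the colors should be written as `#FFFFFF`, with no extra space and no manual `%23`.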

Ercilan posted on 2021-5-26 18:19

Right, it is a form submission. OP, you could search Baidu and read up a little on HTML forms.
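
For readers new to HTML forms, here is a small sketch of how the target address, method, and default field values can be read out of a `<form>` element with the standard library. The `SAMPLE_HTML` markup and the `FormFieldParser` class are hypothetical illustrations, not the real markup of flaticons.net; only the field names are taken from the request shown earlier in the thread.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect action, method, and named <input> defaults of the first form."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"  # HTML default when no method attribute is given
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True  # ignore any later forms on the page


# Hypothetical markup for illustration only.
SAMPLE_HTML = """
<form method="post" action="customize.php?dir=Application&amp;icon=View-Incident.png">
  <input type="hidden" name="background" value="dark">
  <input type="text" name="icon_size" value="256">
  <input type="text" name="icon_color" value="#FFFFFF">
  <button type="submit">Download</button>
</form>
"""

parser = FormFieldParser()
parser.feed(SAMPLE_HTML)
print(parser.method, parser.action)
print(parser.fields)
```

The point is that the "missing" download address never has to be in the page source: the browser builds the request from the form's `action`, `method`, and fields when the button is clicked.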

lifeixue posted on 2021-5-26 20:38

import os

import requests
from lxml import etree

head = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
# Create a folder for the downloaded icons if it does not exist yet
if not os.path.exists("./icons"):
    os.mkdir("./icons")

# Form fields submitted by the customize page (colors written without a space)
data = {
    "background": "dark",
    "icon_size": "256",
    "icon_color": "#FFFFFF",
    "icon_rotate": 0,
    "icon_flip": "n",
    "shape_id": 0,
    "shape_size": 512,
    "shape_color": "#FFFFFF",
}

for i in range(1, 2):  # only page 1; widen the range to fetch more pages
    url = "https://flaticons.net/category.php?c=Application&p={}".format(i)
    r = requests.get(url=url, headers=head).text
    html = etree.HTML(r)
    lst = html.xpath("//div[@class='row']/div/a/@href")
    for col in lst:
        baseurl = "https://flaticons.net" + col
        # requests follows the 302 after the POST, so this is the download page
        response = requests.post(url=baseurl, headers=head, data=data).content.decode("UTF-8")
        tree = etree.HTML(response)
        # xpath() returns a list, so check for a match and take the first item
        links = tree.xpath("//div[@class='input-group']/button/@data-value")
        names = tree.xpath("//section[@id='home']//p/b/text()")
        if not links or not names:
            continue
        down = "https://flaticons.net" + links[0]
        # Icon file name
        icon_name = names[0].replace(" ", "-") + ".png"
        icon_data = requests.get(url=down, headers=head).content
        icon_path = "icons/" + icon_name
        with open(icon_path, "wb") as fp:
            fp.write(icon_data)
        print(icon_name, "downloaded successfully")
I am a Python beginner too; this is the source code I wrote.

刘样andholiday posted on 2021-5-26 22:51

OP, once you reach the second-level page, try using F12.

618 posted on 2021-5-27 08:50

```python
import requests
url = "https://flaticons.net/customize.php?dir=Application&icon=View-Incident.png"
form = {
    "background": "dark",
    "icon_size": "256",
    "icon_color": "#FFFFFF",
    "icon_rotate": "0",
    "icon_flip": "n",
    "shape_id": "0",
    "shape_size": "512",
    "shape_color": "#FFFFFF"
}
r = requests.post(url,data = form, allow_redirects=False)
target_url = r.headers['Location']
```
target_url is the address of the redirect.
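
One caveat worth sketching: a `Location` header may be absolute or relative, so it is safer to resolve it against the request URL before fetching it. The helper below, `resolve_redirect`, is a hypothetical name introduced here for illustration. Whether this site returns a relative or an absolute value is not confirmed in the thread, and the `i=abc` tokens are placeholders.

```python
from urllib.parse import urljoin


def resolve_redirect(request_url, location):
    """Return an absolute URL for a Location header value.

    urljoin() leaves absolute URLs untouched and resolves relative ones
    against the URL that was requested.
    """
    return urljoin(request_url, location)


request_url = "https://flaticons.net/customize.php?dir=Application&icon=View-Incident.png"

# Placeholder Location values covering the three common shapes.
relative = resolve_redirect(request_url, "custom.php?i=abc")
rooted = resolve_redirect(request_url, "/custom.php?i=abc")
absolute = resolve_redirect(request_url, "https://flaticons.net/custom.php?i=abc")
print(relative, rooted, absolute)
```

With the snippet above, this would be used as `resolve_redirect(url, r.headers['Location'])` before requesting the download page.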

johngao posted on 2021-5-28 10:23

lifeixue posted on 2021-5-26 20:38
import requests
from lxml import etree
import os

Bro, how did you learn Python? I am teaching myself and can't get anywhere near your level.

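毕竟花了19元 posted on 2021-6-22 21:01

Hello OP, I have a question about the "批量修改v3.0" (batch rename) tool you made earlier. The files I need to rename all contain commas, for example "2021-06-03 10-31-39 (B,Radius8,Smoothing4).jpg", where the commas are ASCII commas. Exporting the spreadsheet works fine, and the original paths and names display correctly. But after I edit the spreadsheet and import it back into the tool, the commas turn into ASCII semicolons, for example "2021-06-03 10-31-39 (B;Radius8;Smoothing4).jpg", so the source files can no longer be found by path and the batch rename fails. Comments on the original thread are closed, so I can only reply in your most recent thread to ask whether you have a fix.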