Learning Python step by step, fully self-taught: how to scrape the content of 11 pages

wang770150597 · posted 2022-1-2 23:09

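Looking for guidance:
I want to extract the titles of all the articles on the 11 pages from http://fund.eastmoney.com/a/cjjgd_1.html to http://fund.eastmoney.com/a/cjjgd_11.html, for example "粤开策略大势研判：关注稳增长主线行情". How can I do this?
This is the code I have written so far: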

import requests
from bs4 import BeautifulSoup
html = requests.get('http://fund.eastmoney.com/a/cjjgd.html')

html.encoding = html.apparent_encoding
soup = BeautifulSoup(html.text,'lxml')
# print(soup.prettify())  # prints the page source

print(soup.find_all('a'))

小蜜蜂 · posted 2022-1-3 07:57

import requests
from bs4 import BeautifulSoup
html = requests.get('http://fund.eastmoney.com/a/cjjgd.html')

html.encoding = html.apparent_encoding
soup = BeautifulSoup(html.text,'lxml')
# print(soup.prettify())  # prints the page source

# Print the text of every link on the page
p = soup.find_all('a')
for i in p:
    print(i.text)

wang770150597 · posted 2022-1-3 10:56

小蜜蜂 posted 2022-1-3 07:57
import requests
from bs4 import BeautifulSoup
html = requests.get('http://fund.eastmoney.com/a/cjj ...

Thank you very much. 中金公司：关于银行房地产业务相关敞口的几个焦点问题
2
3
4
5
6
7
...
11
How do I get the titles on pages 2 through 11 to show up as well?

大爱九月 · posted 2022-1-3 11:16

import requests
from bs4 import BeautifulSoup

for q in range(1, 12):
    print('Page ' + str(q))
    html = requests.get('http://fund.eastmoney.com/a/cjjgd_' + str(q) + '.html')
    html.encoding = html.apparent_encoding
    soup = BeautifulSoup(html.text, 'lxml')
    # Keep links that open in a new tab, have a title attribute, and have no class
    p = soup.find_all('a', {"target": "_blank"})
    for i in p:
        if i.get('title'):
            if i.get('class') is None:
                print(i.text)
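
The three conditions in the reply above (target="_blank", a title attribute, no class attribute) are what separate article links from everything else that find_all('a') returns, such as the pagination links "2" to "11". A minimal offline sketch of the same filter follows, so it can be tried without hitting the site. It uses the standard-library html.parser instead of BeautifulSoup, and SAMPLE_HTML, TitleLinkParser and extract_titles are made-up names and markup for illustration only; the real page may be structured differently.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect the text of <a> tags that pass the filter from the reply above."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # text chunks while inside a matching <a>, else None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # Same three conditions: target="_blank", a non-empty title
        # attribute, and no class attribute at all.
        if (attrs.get("target") == "_blank"
                and attrs.get("title")
                and "class" not in attrs):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.titles.append("".join(self._buffer).strip())
            self._buffer = None


# Hypothetical markup, invented for this example (not copied from the site).
SAMPLE_HTML = """
<div class="infos"><ul>
  <li><a href="/a/1.html" target="_blank" title="Article one">Article one</a></li>
  <li><a href="/a/2.html" target="_blank" title="Article two">Article two</a></li>
</ul></div>
<a href="cjjgd_2.html" target="_self">2</a>
<a href="/ad.html" target="_blank" title="Ad" class="banner">Ad</a>
"""


def extract_titles(html_text):
    """Return the link texts that pass the filter, in document order."""
    parser = TitleLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.titles


print(extract_titles(SAMPLE_HTML))  # ['Article one', 'Article two']
```

The pagination link and the link with a class attribute are both dropped, which is the effect the nested if statements have in the BeautifulSoup version.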

a862427375 · posted 2022-1-3 13:49

import requests
from lxml import etree
import threading

ls = []
thread_ls = []
def thread(func):
    def main(*args,**kwargs):
        thread_ls.append(threading.Thread(target=func,args=args,kwargs=kwargs))
        thread_ls[-1].start()

    return main

@thread
def 爬取内容(page):
    url = f"http://fund.eastmoney.com/a/cjjgd_{page}.html"
    r =requests.get(url)
    ls.extend(etree.HTML(r.text).xpath(r'//div[@class="infos"]/ul/li/a/@title'))

if __name__ == '__main__':
    for i in range(1,11+1):
        爬取内容(i)
    for i in thread_ls:
        i.join()
    采集的所有内容 = "\n".join(ls)
    print(采集的所有内容)
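
One thing to be aware of with the threaded version above: each thread extends the shared list whenever its request finishes, so the pages can end up in a different order on each run. If page order matters, a possible alternative is concurrent.futures.ThreadPoolExecutor, whose map method returns results in input order. This is only a sketch: collect_titles and fake_fetch are hypothetical names, and fake_fetch stands in for the real request-and-parse step so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_titles(pages, fetch_titles, max_workers=4):
    """Call fetch_titles(page) for every page in parallel, keeping page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # not in the order the worker threads happen to finish.
        per_page = list(pool.map(fetch_titles, pages))
    # Flatten the per-page lists into one list of titles.
    return [title for titles in per_page for title in titles]


def fake_fetch(page):
    # Hypothetical stand-in for the requests.get(...) + xpath(...) step;
    # it returns two made-up titles per page.
    return [f"page {page} title {n}" for n in (1, 2)]


print(collect_titles(range(1, 4), fake_fetch))
```

To use it against the site, fake_fetch would be replaced by a function that downloads one page and returns its list of titles, like the body of the decorated function above but returning the list instead of extending a global.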

wang770150597 · posted 2022-1-3 14:21

a862427375 posted 2022-1-3 13:49
import requests
from lxml import etree
import threading

Thanks, expert!

唯爱丶雪 · posted 2022-1-3 14:54

Here is how to crawl it with scrapy:
import scrapy

class EastSpider(scrapy.Spider):
    name = 'east'
    allowed_domains = ['eastmoney.com']
    start_urls = ['http://fund.eastmoney.com/a/cjjgd_1.html']

    def parse(self, response):
        for each in response.xpath('//div[@class="infos"]/ul'):
            times = each.xpath('li/span/text()').extract()
            names = each.xpath('li/a/text()').extract()
            for i in range(len(times)):
                print(names[i])
                # One item per article: {span text: link text}
                yield {
                    times[i]: names[i]
                }
        # Follow the "下一页" (next page) link, if there is one
        url_path = response.xpath('//a[contains(text(),"下一页")]/@href').extract_first()
        print(url_path)
        if url_path is not None:
            next_url = 'http://fund.eastmoney.com/a/' + url_path
            # print(next_url)
            yield scrapy.Request(next_url)
and save the results to b.json
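
The spider above builds the next URL by putting 'http://fund.eastmoney.com/a/' in front of the href, which works only if the href is a bare file name such as cjjgd_2.html. I have not checked what form the href takes on the real page, so here is a small sketch that handles the relative, root-relative and absolute cases alike using urllib.parse.urljoin from the standard library. next_page_url is a hypothetical helper name, and the hrefs below are made-up examples.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve the href of a 'next page' link against the current page URL."""
    if not href:
        # No next-page link was found (e.g. on the last page).
        return None
    return urljoin(current_url, href)


current = "http://fund.eastmoney.com/a/cjjgd_1.html"

# All three forms resolve to the same absolute URL.
print(next_page_url(current, "cjjgd_2.html"))
print(next_page_url(current, "/a/cjjgd_2.html"))
print(next_page_url(current, "http://fund.eastmoney.com/a/cjjgd_2.html"))
```

Inside a Scrapy callback the same resolution is available as response.urljoin(href), which joins against the URL of the response being parsed.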

wang770150597 · posted 2022-1-3 17:54

唯爱丶雪 posted 2022-1-3 14:54
Here is how to crawl it with scrapy:
import scrapy

I haven't played with this file type yet

Crysis726 · posted 2022-1-4 13:12

wang770150597 posted 2022-1-3 17:54
I haven't played with this file type yet

It's not a file type, it's a web-crawling framework. You'll get to it later on.
