Help: image links fail to load when scraping Tmall product info
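A few days ago I worked through Cui Qingcai's (崔庆才) tutorial on scraping Taobao "美食" (food) product listings with selenium and Chrome, and then wrote a simple scraper of my own for Tmall "美食" listings. (Don't ask why not Taobao: Taobao search now requires login, while on Tmall the first page needs no login and only the second page does.) When collecting the product image URLs, the first four or five come back fine, but the rest fail and show up as 'data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw=='
https://attach.52pojie.cn//forum/202012/10/140249fevci6r6eiiz6ork.png?l
From a quick search, this value is not a real image link: it is a base64-encoded placeholder image (a 1×1 GIF) that the page's JS later swaps out. I then found that Tmall product images are loaded only once you scroll down to where they are displayed, and the page source changes accordingly (my guess is that JS rewrites it).
https://attach.52pojie.cn//forum/202012/10/140319ynywo19y2p61pw6n.png?l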
https://attach.52pojie.cn//forum/202012/10/140317fsj7cmv0mzzm9smj.png?l
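
The failing value is a data: URI rather than a link. As a small check (a sketch using only the Python standard library; the helper names is_data_uri and gif_size_from_data_uri are made up for illustration), decoding it shows a 1×1 GIF, which is what a placeholder standing in for a not-yet-loaded image would look like:

```python
import base64
import struct

# The exact value the scraper returned for the images that "failed".
PLACEHOLDER = (
    "data:image/gif;base64,"
    "R0lGODlhAQABAIAAAP///wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw=="
)


def is_data_uri(src):
    """Return True if src is an inline data: URI instead of a real link."""
    return bool(src) and src.startswith("data:")


def gif_size_from_data_uri(src):
    """Decode a base64 GIF data URI and return its (width, height)."""
    header, _, payload = src.partition(",")
    if not header.startswith("data:image/gif") or not header.endswith(";base64"):
        raise ValueError("not a base64-encoded GIF data URI")
    raw = base64.b64decode(payload)
    if raw[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("payload is not a GIF")
    # In a GIF, width and height are little-endian 16-bit values at bytes 6..9.
    width, height = struct.unpack("<HH", raw[6:10])
    return width, height


print(is_data_uri(PLACEHOLDER))
print(gif_size_from_data_uri(PLACEHOLDER))
```

A check like is_data_uri can be used to spot which products still hold the placeholder at the moment the page source is read.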
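So my idea was to drive the scroll bar through selenium (execute_script("window.scrollTo()")) and then wait for the images to load with time.sleep(), but scraping this way is rather inefficient. I'd like to ask the experts here: when you scrape pages like this, which tools or libraries do you usually use, and which approach improves efficiency?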
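
One way to cut the waiting might be to replace the fixed sleeps with a check of whether any placeholder is still present, and to stop as soon as none is left. The following is only a minimal sketch, assuming the placeholder src always starts with data: as in the value quoted above. scroll_until_images_loaded and its parameters are hypothetical names, and driver only needs selenium's execute_script method:

```python
import time

# JavaScript run in the page: count images matching the CSS selector
# (passed as arguments[0]) whose src is still a data: placeholder.
COUNT_PLACEHOLDERS_JS = """
return Array.from(document.querySelectorAll(arguments[0]))
    .filter(function (img) {
        return (img.getAttribute('src') || '').indexOf('data:') === 0;
    }).length;
"""


def scroll_until_images_loaded(driver, img_selector, step=700,
                               poll=0.2, timeout=10.0):
    """Scroll down in steps until no matching image has a data: src.

    Returns True once every matching image has a real src, or False if
    placeholders are still present when the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        remaining = driver.execute_script(COUNT_PLACEHOLDERS_JS, img_selector)
        if remaining == 0:
            return True
        # Move the viewport down a bit and give the page a moment to react.
        driver.execute_script("window.scrollBy(0, arguments[0]);", step)
        time.sleep(poll)
    return False
```

With the code in this post it could be called as scroll_until_images_loaded(browser, '.productImg-wrap a img') just before get_products(). The step, poll, and timeout defaults are guesses and would need tuning against the real page.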
Here is my code:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from pyquery import PyQuery as pq
import time

browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)


def search():
    try:
        browser.get('https://www.tmall.com')
        # Wait for the search box and the search button to be ready.
        search_box = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, '#mq'))
        )
        button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#mallSearch > form > fieldset > div > button')))
        search_box.send_keys('美食')  # search keyword: "food"
        time.sleep(3)
        button.click()
        # Wait for the pagination form, which shows the result list has rendered.
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#content > div > div.ui-page > div > b.ui-page-skip > form')))
        # Scroll down step by step so the lazy-loaded images get their real src.
        for y in (200, 700, 1400, 2100, 2800, 3500, 4200, 4900):
            browser.execute_script("window.scrollTo(0, arguments[0])", y)
            time.sleep(5)
        get_products()
        # find_element(By.NAME, ...) replaces find_element_by_name, which Selenium 4 removed.
        page = browser.find_element(By.NAME, 'totalPage')
        return page.get_attribute('value')
    except TimeoutException:
        # Retry the whole search if any wait timed out.
        return search()


def get_products():
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.main .view .product')))
    html = browser.page_source
    doc = pq(html)
    items = doc('.main .view .product').items()
    for item in items:
        product = {
            'image': item.find('.productImg-wrap a img').attr('src'),
            'name': item.find('.productTitle a').text(),
            'price': item.find('.productPrice em').text(),
            'deal': item.find('.productStatus span em').text(),
            'evaluate': item.find('.productStatus span a').text(),
            'shop': item.find('.productShop .productShop-name').text(),
        }
        print(product)


def main():
    search()


if __name__ == '__main__':
    main()
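
Another option that may avoid scrolling entirely: lazy-loading scripts often keep the real URL in a separate attribute until the image scrolls into view. The sketch below assumes such an attribute exists on this page. The attribute names data-ks-lazyload and data-src are assumptions to be checked against the actual page source, and pick_image_url, ImgCollector, and the example.com URLs are made up for illustration. It uses only the standard library so it can be run without a browser:

```python
from html.parser import HTMLParser

# Assumed attribute names; verify them in the real page source.
LAZY_ATTRS = ("data-ks-lazyload", "data-src")


def pick_image_url(attrs, lazy_attrs=LAZY_ATTRS):
    """Return the real image URL from an <img> tag's attributes.

    Uses src unless it is empty or a data: placeholder, then falls back
    to the first non-empty lazy-load attribute. Returns None if no URL
    can be found.
    """
    src = attrs.get("src") or ""
    if src and not src.startswith("data:"):
        return src
    for name in lazy_attrs:
        value = attrs.get(name)
        if value:
            return value
    return None


class ImgCollector(HTMLParser):
    """Collect one resolved URL (or None) per <img> tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.urls.append(pick_image_url(dict(attrs)))


GIF = "data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw=="
sample = (
    '<img src="//img.example.com/a.jpg">'
    f'<img src="{GIF}" data-ks-lazyload="//img.example.com/b.jpg">'
    f'<img src="{GIF}">'
)

collector = ImgCollector()
collector.feed(sample)
print(collector.urls)
# → ['//img.example.com/a.jpg', '//img.example.com/b.jpg', None]
```

In get_products() the same fallback would mean reading the lazy attribute with .attr() whenever src starts with data:. If the page really stores the URL this way, the scrolling and sleeping could be dropped for the image field.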