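This post was last edited by thepoy on 2020-12-7 21:04
Scenarios that call for a crawler usually also call for concurrency or parallelism, which means processes, threads, or coroutines. This example is a simple comparison between an asynchronous crawler and a synchronous one.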
Code
Asynchronous crawler code:
import asyncio
import json
from datetime import datetime
from typing import List, Optional

import aiohttp


class Spider:
    def __init__(self, urls: List[str], headers: Optional[dict] = None, cookie: Optional[str] = None):
        self.urls = urls
        self.headers = headers
        # Only build the cookie dict when a cookie was actually supplied.
        self.cookies = {'cookie': cookie} if cookie else None
        self.result = list()

    def execute(self):
        # asyncio.run() creates the event loop, runs the crawl, and closes the loop.
        asyncio.run(self.spiders())
        with open('main.json', 'w') as f:
            json.dump(self.result, f)

    async def spiders(self):
        # Allow at most 250 requests in flight at the same time.
        semaphore = asyncio.Semaphore(250)
        # asyncio.wait() needs Task objects on Python 3.11+, so wrap each coroutine.
        tasks = [asyncio.create_task(self.run(url, semaphore)) for url in self.urls]
        await asyncio.wait(tasks)

    async def run(self, url, semaphore):
        async with semaphore:
            async with aiohttp.ClientSession(headers=self.headers, cookies=self.cookies) as session:
                async with session.get(url) as response:
                    text = await response.text()
                    self.result.append(json.loads(text))


if __name__ == "__main__":
    urls = []
    for i in range(1, 1001):
        urls.append(f'http://httpbin.org/anything?page={i}')
    s = Spider(urls)
    start = datetime.now()
    s.execute()
    end = datetime.now()
    print((end - start).total_seconds(), "seconds")
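The semaphore is what keeps the asynchronous crawler from opening all 1000 connections at once. Here is a minimal sketch of that idea on its own, assuming `asyncio.sleep` as a stand-in for the HTTP request; the names `worker` and `measure_peak` are made up for illustration and are not part of the crawler above.

```python
import asyncio


async def worker(semaphore: asyncio.Semaphore, state: dict) -> None:
    # Hypothetical stand-in for one HTTP request.
    async with semaphore:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # simulated network wait
        state["active"] -= 1


async def measure_peak(total: int, limit: int) -> int:
    """Run `total` simulated requests and return the highest number in flight at once."""
    semaphore = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}
    await asyncio.gather(*(worker(semaphore, state) for _ in range(total)))
    return state["peak"]


if __name__ == "__main__":
    # The peak never goes above the semaphore limit.
    print(asyncio.run(measure_peak(total=1000, limit=250)))
```

The limit of 250 is simply the value used in the crawler; a smaller value is gentler on the target server, at the cost of a longer total run.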
Synchronous crawler code:
import json
from datetime import datetime

import requests

if __name__ == "__main__":
    start = datetime.now()
    result = []
    for i in range(1, 1001):
        url = f'http://httpbin.org/anything?page={i}'
        # Each call blocks until its response arrives, so requests run one at a time.
        result.append(requests.get(url).json())
    with open('test.json', 'w') as f:
        json.dump(result, f)
    end = datetime.now()
    print((end - start).total_seconds(), "seconds")
Results
# Asynchronous
20.837937 seconds
# Synchronous (I really did not expect it to take this long...)
650.712683 seconds
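Judging from these results, when crawling 1000 links the asynchronous crawler is more than 30 times as fast as the synchronous one.
Resource consumption is relatively low and the efficiency gain is this large, so from now on I will consider async first for my crawlers.
If you are interested, try comparing it against multithreading and multiprocessing as well, and please post your results in this thread.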
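For reference, the ratio can be worked out directly from the two timings reported above. This is only arithmetic on those two numbers from a single run each; the variable names are made up for illustration.

```python
# Timings copied from the results above (one run each).
ASYNC_SECONDS = 20.837937
SYNC_SECONDS = 650.712683
REQUESTS = 1000

speedup = SYNC_SECONDS / ASYNC_SECONDS
sync_per_request = SYNC_SECONDS / REQUESTS
async_per_request = ASYNC_SECONDS / REQUESTS  # amortized, since requests overlap

print(f"speedup: {speedup:.1f}x")                              # speedup: 31.2x
print(f"sync:  {sync_per_request:.3f} s per request")          # sync:  0.651 s per request
print(f"async: {async_per_request:.3f} s per request (amortized)")  # async: 0.021 s per request (amortized)
```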
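As a possible starting point for the multithreaded comparison, here is a minimal sketch using the standard library's `ThreadPoolExecutor`. It has not been timed here. The names `fetch_json` and `crawl_threaded` and the pool size of 50 are assumptions for illustration, and `urllib.request` is used in place of `requests` only so that the sketch needs no third-party package.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fetch_json(url: str) -> dict:
    # Standard-library stand-in for requests.get(url).json().
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def crawl_threaded(
    urls: List[str],
    fetch: Callable[[str], dict] = fetch_json,
    max_workers: int = 50,  # assumed pool size, tune as needed
) -> List[dict]:
    """Fetch every URL on a pool of worker threads; results keep the order of `urls`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

To time it the same way as the two crawlers above, build the same 1000 `http://httpbin.org/anything?page={i}` URLs, wrap a call to `crawl_threaded(urls)` between two `datetime.now()` calls, and dump the returned list with `json.dump`.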