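Python multithreading: why is the scraped data incomplete?
Why is the data incomplete when I scrape with multiple threads? There are 649 records in total, but only 619 were actually scraped. Where is the problem?

from concurrent.futures import ThreadPoolExecutor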
def down_data(i):
    html = gethtml(i)
    llist = parse_url(html)
    for u in llist:
        try:
            # First attempt: follow the susong-item link found on the detail page
            html_detail = gethtml_detail(u)
            u = re.findall(r'https://susong-item.taobao.com/auction/\d{1,}.htm', html_detail, re.S)[0].strip()
            html_detail = gethtml_detail(u)
            parse = parse_url_detail(html_detail)
            df = pd.DataFrame(parse)
            for i in df.index:
                df['索引'].at[i] = i + 1
            df2 = cpca.transform(df['地址'])
            df['区'] = df2.loc[:, ['区']]
            df['地址'] = df2.loc[:, ['地址']]
            # result = pd.concat(, axis=0)
            df.to_excel("C:/Users/Administrator/Desktop/Python/AL-SF/1-retail" + st + ".xlsx", index=False)
            print('第%s条数据已保存' % str(i + 1))
        except Exception:
            try:
                # Second attempt: follow the zc-item link instead
                html_detail = gethtml_detail(u)
                u = re.findall(r'https://zc-item.taobao.com/auction/\d{1,}.htm', html_detail, re.S)[0].strip()
                parse = parse_url_detail(u)
                df = pd.DataFrame(parse)
                for i in df.index:
                    df['索引'].at[i] = i + 1
                df2 = cpca.transform(df['地址'])
                df['区'] = df2.loc[:, ['区']]
                df['地址'] = df2.loc[:, ['地址']]
                # result = pd.concat(, axis=0)
                df.to_excel("C:/Users/Administrator/Desktop/Python/AL-SF/1-retail" + st + ".xlsx", index=False)
                print('第%s条数据已保存' % str(i + 1))
            except Exception:
                # Last resort: parse the detail page directly
                html_detail = gethtml_detail(u)
                parse = parse_url_detail(html_detail)
                df = pd.DataFrame(parse)
                for i in df.index:
                    df['索引'].at[i] = i + 1
                df2 = cpca.transform(df['地址'])
                df['区'] = df2.loc[:, ['区']]
                df['地址'] = df2.loc[:, ['地址']]
                # result = pd.concat(, axis=0)
                df.to_excel("C:/Users/Administrator/Desktop/Python/AL-SF/1-retail" + st + ".xlsx", index=False)
                print('第%s条数据已保存' % str(i + 1))
# Main program
def main():
    global p
    page = next_page()
    time_start = time.time()
    with ThreadPoolExecutor(curPage) as t:
        for i in page:
            p += 1
            t.submit(down_data, i)
    time_end = time.time()
    print('第%s页数据已保存!====用时%.1f秒' % (p, time_end - time_start))

Last edited by thepoy on 2021-5-29 14:34
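The submit function is non-blocking and returns immediately.
You also need to use the wait function from concurrent.futures to wait for all tasks to finish,
or use as_completed to watch for tasks as they complete.

# shutdown(True): the main thread exits before the worker threads have finished; add this line to wait until the tasks are done.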
# Main program
def main():
    global p
    page = next_page()
    time_start = time.time()
    with ThreadPoolExecutor(curPage) as t:
        for i in page:
            p += 1
            t.submit(down_data, i)
        t.shutdown(True)  # wait for all threads to finish
    time_end = time.time()
    print('第%s页数据已保存!====用时%.1f秒' % (p, time_end - time_start))
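
For reference, here is a minimal sketch of waiting on the futures that submit returns. `fetch_page` and `run_all` are hypothetical stand-ins, not the poster's `down_data` or `main`. Note that leaving a `with ThreadPoolExecutor(...)` block already calls `shutdown(wait=True)`, so the tasks have finished by the time the block exits; what keeping the futures adds is the ability to inspect each task afterwards.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def fetch_page(page_no):
    """Hypothetical stand-in for down_data: pretend to download one page."""
    time.sleep(0.01)
    return page_no * 10


def run_all(pages, max_workers=4):
    with ThreadPoolExecutor(max_workers) as pool:
        # submit() returns a Future immediately; keep it so the task can be checked later
        futures = [pool.submit(fetch_page, n) for n in pages]
        # wait() blocks until every future is finished (default: ALL_COMPLETED)
        done, not_done = wait(futures)
    # Leaving the with block has also called shutdown(wait=True)
    return sorted(f.result() for f in done), len(not_done)


if __name__ == "__main__":
    print(run_all(range(1, 6)))
```

If the elapsed time printed after the block is close to zero, the tasks most likely ended almost immediately (for example by raising an exception), rather than the main thread having skipped the wait.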
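
ReLoading wrote on 2021-5-29 14:34:
# shutdown(True): the main thread exits before the worker threads have finished; add this line to wait until the tasks are done.

I added that line, and now the whole run prints only one line: “第17页数据已保存!====用时0.0秒” (page 17 saved, elapsed 0.0 s).

double07 wrote on 2021-5-30 15:58:
I added that line, and now the whole run prints only one line: “第17页数据已保存!====用时0.0秒” (page 17 saved, elapsed 0.0 s).

You may have triggered the site's anti-scraping mechanism, or there may be a problem in your down_data function.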
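
One way to check whether the worker function itself is failing: an exception raised inside a submitted task is stored on its Future and is not printed unless `result()` (or `exception()`) is called, so failed tasks can disappear silently. The sketch below uses `as_completed` to collect both results and errors in the main thread. `parse_item` and `collect` are hypothetical names for illustration, and the failure rule is made up.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse_item(item_id):
    """Hypothetical worker: fails for some inputs, as a blocked request might."""
    if item_id % 5 == 0:
        raise ValueError(f"item {item_id} could not be parsed")
    return {"id": item_id}


def collect(item_ids, max_workers=4):
    rows, errors = [], {}
    with ThreadPoolExecutor(max_workers) as pool:
        # Map each future back to its input so failures can be identified
        future_to_id = {pool.submit(parse_item, i): i for i in item_ids}
        for future in as_completed(future_to_id):
            item_id = future_to_id[future]
            try:
                rows.append(future.result())  # re-raises the worker's exception here
            except Exception as exc:
                errors[item_id] = repr(exc)
    return rows, errors


if __name__ == "__main__":
    rows, errors = collect(range(1, 21))
    print(len(rows), "rows,", len(errors), "failures:", sorted(errors))
```

Collecting the rows in the main thread also makes it possible to write the output file once at the end. If `st` in the original code is the same for every call, each `to_excel` call would overwrite the file written by the previous one, which could also account for missing records.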