吾爱破解 - 52pojie.cn

Views: 5174 | Replies: 19

[Python, Repost] Eudic Dictionary Crawler

liuye posted on 2022-4-18 16:22
Last edited by liuye on 2022-4-18 16:53

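I've recently been building a website for English learners and needed a Chinese-English dictionary database. The ones sold on Taobao are expensive and incomplete, so I wrote a crawler to scrape one myself, and I'm sharing it here.
[Python]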
import pymysql
import requests
from lxml import etree


# Execute a write statement; roll back and report the error if it fails
def commitSQL(conn, sql, params=None):
    try:
        with conn.cursor() as cursor:
            # Values are passed separately so the driver escapes quotes
            cursor.execute(sql, params)
        conn.commit()
    except Exception as e:
        conn.rollback()
        print("SQL failed:", e)


# Open a database connection
def getConnect():
    canshu = {}  # connection parameters
    # Load the config file (one key=value pair per line)
    with open('configure', 'r', encoding='utf-8') as f:
        for x in f.read().splitlines():
            if '=' in x:
                # Split on the first '=' only, so values may contain '='
                key, _, value = x.partition('=')
                canshu[key.strip()] = value.strip()
    conn = pymysql.connect(host=canshu['host'],
                           port=int(canshu['port']),
                           user=canshu['user'],
                           password=canshu['password'],
                           database=canshu['database'],
                           charset=canshu['charset'])
    return conn


# Fetch one word's page and extract meanings, phonetics and example sentences
def getInfo(word):
    print("Crawling word: " + word)
    english = []
    chinese = []
    try:
        response = requests.get("https://dict.eudic.net/dicts/en/" + word, timeout=10)
        html = etree.HTML(response.content)
        meaningXpath = html.xpath("//ol/li/text()")
        if not meaningXpath:
            # Fallback for pages that do not use an ordered list
            meaningXpath = html.xpath("//div[@class='exp']/text()")
        British = html.xpath("//span[@class='Phonitic'][1]/text()")[0]
        American = html.xpath("//span[@class='Phonitic'][last()]/text()")[0]
        englishXpath = html.xpath("//p[@class='line']")
        chineseXpath = html.xpath("//div[@class='sentence']//p[@class='exp']")
        for i in englishXpath:
            english.append(i.xpath("string(.)"))
        for i in chineseXpath:
            chinese.append(i.xpath("string(.)"))
        print("Word " + word + " crawled successfully")
        print(meaningXpath, British, American, english, chinese)
        return meaningXpath, British, American, english, chinese
    except Exception:
        print("Failed to crawl word " + word)
        return None


# Store one word, then look up every word that appears in its example sentences
def save(word):
    print("Saving word: " + word)
    try:
        info = getInfo(word)
        if info is None:
            raise ValueError("nothing crawled for " + word)
        meaningXpath, British, American, english, chinese = info
        connect = getConnect()
        try:
            commitSQL(connect, "insert into word(word,british_pronunciation,american_pronunciation) "
                               "values (%s,%s,%s)", (word, British, American))
            for i in meaningXpath:
                commitSQL(connect, "insert into meaning(meaning,word) "
                                   "values (%s,%s)", (i, word))
            # zip stops at the shorter list if the two counts differ
            for en, zh in zip(english, chinese):
                commitSQL(connect, "insert into example(word,english,chinese) "
                                   "values (%s,%s,%s)", (word, en, zh))
        finally:
            # Close before recursing so connections do not pile up
            connect.close()
        print("Word " + word + " saved successfully")
        # Note: run() calls save() again, so a long chain of new words
        # can reach Python's recursion limit
        for sentence in english:
            tokens = sentence.replace(",", "").replace(".", "").replace("?", "").replace("!", "").lower().split(" ")
            for x in tokens:
                if x:  # skip empty tokens produced by repeated spaces
                    run(x)
    except Exception:
        print("Failed to save word " + word)


# Crawl and store startWord unless it is already in the word table
def run(startWord):
    conn = getConnect()
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1 FROM word WHERE word=%s LIMIT 1", (startWord,))
            exists = cur.fetchone() is not None
    finally:
        conn.close()
    if not exists:
        save(startWord)


if __name__ == '__main__':
    word = 'test'
    run(word)

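The punctuation stripping in save only handles commas, periods, question marks and exclamation marks, so a word in quotes or followed by a semicolon would be looked up with the punctuation still attached. Below is a minimal sketch of a regex-based alternative. The helper name tokenize and the decision to keep apostrophes and hyphens inside words are assumptions, not part of the original post.

```python
import re

# Hypothetical helper (not in the original post): pull lowercase English
# words out of an example sentence. Internal apostrophes and hyphens are
# kept so that "don't" and "well-known" stay single tokens.
WORD_RE = re.compile(r"[a-z]+(?:['-][a-z]+)*")


def tokenize(sentence):
    return WORD_RE.findall(sentence.lower())


print(tokenize('"Well-known" tests don\'t fail; do they?'))
# ['well-known', 'tests', "don't", 'fail', 'do', 'they']
```

Each token could then be handed to run in place of the chained replace calls.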

Config file (save it as "configure" in the same directory as the code)

[XML]
host=localhost
port=3306
user=root
password=root
database=dict
charset=utf8
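Reading this file is a simple key=value parse. The sketch below shows one way to do it that tolerates blank lines and values containing an equals sign (a password, for example). The function name parse_config and the treatment of lines starting with # as comments are assumptions for illustration.

```python
# Hypothetical helper: turn key=value text into a dict of strings.
def parse_config(text):
    params = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comment lines (an assumed convention) and
        # anything that is not a key=value pair
        if not line or line.startswith("#") or "=" not in line:
            continue
        # partition splits on the first '=' only
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip()
    return params


sample = "host=localhost\nport=3306\npassword=a=b\n\n"
print(parse_config(sample))
# {'host': 'localhost', 'port': '3306', 'password': 'a=b'}
```

All values come back as strings, so the port still needs int() before it is passed to pymysql.connect.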

Database schema
[SQL]
/*
SQLyog Enterprise v12.09 (64 bit)
MySQL - 5.5.40 : Database - dict
*********************************************************************
*/

/*!40101 SET NAMES utf8 */;

/*!40101 SET SQL_MODE=''*/;

/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/`dict` /*!40100 DEFAULT CHARACTER SET utf8 */;

USE `dict`;

/*Table structure for table `example` */

DROP TABLE IF EXISTS `example`;

CREATE TABLE `example` (
  `word` varchar(64) DEFAULT NULL,
  `english` text,
  `chinese` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

/*Table structure for table `meaning` */

DROP TABLE IF EXISTS `meaning`;

CREATE TABLE `meaning` (
  `meaning` text,
  `word` char(64) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

/*Table structure for table `word` */

DROP TABLE IF EXISTS `word`;

CREATE TABLE `word` (
  `word` varchar(64) DEFAULT NULL,
  `british_pronunciation` varchar(128) DEFAULT NULL,
  `american_pronunciation` varchar(128) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;
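The three tables are tied together only by the word column: one row in word, several rows in meaning, and several rows in example. The sketch below shows how a single crawled entry maps onto that layout. It uses Python's built-in sqlite3 as a stand-in for MySQL so it runs without a server, which is an assumption of this example; note that sqlite3 uses ? placeholders where pymysql uses %s. The function name store_entry and the sample data are made up.

```python
import sqlite3

# In-memory stand-in for the dict database (same three tables, simplified types)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word (word TEXT, british_pronunciation TEXT, american_pronunciation TEXT);
CREATE TABLE meaning (meaning TEXT, word TEXT);
CREATE TABLE example (word TEXT, english TEXT, chinese TEXT);
""")


def store_entry(conn, word, british, american, meanings, examples):
    # Placeholders let the driver escape quotes inside the values
    conn.execute("INSERT INTO word VALUES (?, ?, ?)", (word, british, american))
    conn.executemany("INSERT INTO meaning VALUES (?, ?)",
                     [(m, word) for m in meanings])
    conn.executemany("INSERT INTO example VALUES (?, ?, ?)",
                     [(word, en, zh) for en, zh in examples])
    conn.commit()


# Made-up sample data; the sentence contains both quote styles on purpose
store_entry(conn, "test", "/test/", "/test/",
            ["n. 测试", "v. 测验"],
            [('He said "it\'s a test".', "他说“这是个测试”。")])

rows = conn.execute(
    "SELECT m.meaning FROM word w JOIN meaning m ON m.word = w.word WHERE w.word = ?",
    ("test",)).fetchall()
print(rows)
```

A sentence with quotes like the one above would break an INSERT built by string formatting, which is the main reason to pass values as parameters.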


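The basic idea: give it a starting word, crawl its data (British pronunciation, American pronunciation, translations and example sentences), then repeat the same process for every word that appears in the example sentences.
Hitting this API too often will get your IP banned, so either spread the crawl out over a longer period or pay for a proxy IP service.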
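The traversal described above can also be written with an explicit queue instead of recursion, which avoids Python's recursion limit and gives a natural place to pause between requests. This is only a sketch under stated assumptions: crawl, fetch_examples, max_words and delay are hypothetical names, and fetch_examples stands for whatever function returns a word's example sentences (for instance one built on getInfo).

```python
import time
from collections import deque


def crawl(start_word, fetch_examples, max_words=100, delay=1.0, sleep=time.sleep):
    """Breadth-first crawl starting from start_word.

    fetch_examples(word) should return a list of English example
    sentences, or None if the lookup failed.
    """
    seen = {start_word}          # words already queued or crawled
    queue = deque([start_word])
    crawled = []
    while queue and len(crawled) < max_words:
        word = queue.popleft()
        sentences = fetch_examples(word)
        crawled.append(word)
        for sentence in sentences or []:
            for token in sentence.lower().split():
                token = token.strip('.,?!;:"\'()')
                if token and token not in seen:
                    seen.add(token)
                    queue.append(token)
        if queue:
            sleep(delay)         # spread requests out to reduce the risk of a ban
    return crawled


# Usage with a fake in-memory "dictionary" instead of real HTTP requests
fake = {"test": ["This is a test."], "this": ["Is this it?"]}
print(crawl("test", lambda w: fake.get(w), delay=0))
# ['test', 'this', 'is', 'a', 'it']
```

In a real run the seen set would be replaced or seeded by the lookup against the word table, and delay would be tuned to whatever request rate the site tolerates.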


zohoChou posted on 2022-4-22 23:27
Quoting liuye, posted on 2022-4-22 23:16:
The main problem is that proxy IPs are too expensive.

https://blog.csdn.net/weixin_44613063/article/details/102538757
jimmywong85 posted on 2022-4-18 16:33
aeqvkec posted on 2022-4-18 16:43
hawk005 posted on 2022-4-18 16:57
Awesome, learning from the OP!
WanShao posted on 2022-4-18 17:21
How do I use this code?
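Flytom posted on 2022-4-18 17:52
Awesome, learning from the OP!
liu2514 posted on 2022-4-18 18:42
Python newbie here to learn! Thanks for sharing!
头狼 posted on 2022-4-18 19:01
Could you please share the data you've already crawled?
f23258 posted on 2022-4-18 20:51
Ha, I just bought the official version on my phone.
Cacarot posted on 2022-4-18 21:50
Thanks, bookmarked for later.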