[Other Original] A general-purpose scraper static class based on C# 8.0, ready to use directly

jdclang posted on 2024-6-28 20:34
Last edited by jdclang on 2024-6-28 20:41

Two years ago I made a first attempt at general-purpose scraper code based on C# 7.0 (original thread linked there). That was my first try at writing a scraper in C#; it worked, but it required too many parameters and was awkward to use. This time, in order to grab the latest chapters of 《六朝》, I rewrote the scraper from scratch and packaged everything into a single static class. It now automatically handles the case where the target site paginates the chapter list or the chapter contents, so you no longer need to hunt down the "next page" XPath yourself; just call it directly from Program.cs. You do still need to pull in a few required libraries via NuGet: HtmlAgilityPack, NLog, and System.Text.Encoding.CodePages (the last one is mainly for handling GBK encoding).
To make testing easier for everyone, I have left the target site's address in the source code; feel free to try it yourselves.
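Assuming the .NET CLI (`dotnet`) is used, the three packages mentioned above can be added to the project like this (package IDs as named in the post):

```shell
# Add the libraries the scraper class depends on
dotnet add package HtmlAgilityPack
dotnet add package NLog
dotnet add package System.Text.Encoding.CodePages
```

Visual Studio users can install the same packages through the NuGet Package Manager instead.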

One more thing: the target site this time only allows access from mobile browsers, so the User-Agent I use is a mobile-emulating string: "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Mobile Safari/537.36 Edg/119.0.0.0". If you need to access a desktop site, swap in another common User-Agent or capture one from your own browser.
Full code below:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;
using HtmlAgilityPack;
using NLog;
 
namespace ConsoleBookSpyder.Sites
{
    /// <summary>
    /// Static class providing novel scraping and processing functionality
    /// </summary>
    public static class shubaoju
    {
        #region Constants
 
        private const string BaseUrl = "https://i.shubaoju.cc";
        private const string Encoding = "GBK";
        private const string XpathBookName = "//*[@id=\"_52mb_h1\"]";
        private const string XpathChapterList = "//ul[@class='chapter']/li";
        private const string XpathContents = "//*[@id=\"nr1\"]";
         
        // Maximum number of retries when fetching a page fails
        private const int MaxRetries = 5;
 
        // Maximum number of concurrent chapter-content downloads
        private static readonly int MaxConcurrency = 40;
 
        #endregion
 
        #region Fields
 
        private static readonly Logger Logger = LogManager.GetCurrentClassLogger();
        private static readonly HttpClient HttpClient = CreateHttpClient();
 
        // Task-progress counters
        private static int _completedTasks;
        private static int _totalTasks;
 
        #endregion
 
        #region Public Methods
 
        /// <summary>
        /// Runs the main scraping workflow
        /// </summary>
        /// <param name="targetUrl">URL of the target novel</param>
        public static async Task ExecuteAsync(string targetUrl)
        {
            try
            {
                var html = await GetHtmlAsync(targetUrl);
                var bookName = GetBookName(html);
                var chapters = await GetBookChaptersAsync(html);
                _totalTasks = chapters.Count;   // total number of chapters to fetch
                _completedTasks = 0;            // reset the completed counter
                Console.Title = $"Progress: {_completedTasks}/{_totalTasks}";
 
                var novelContent = await ProcessChaptersAsync(chapters);
                var amendedContent = AmendContent(novelContent);
                await WriteToFileAsync(amendedContent, GetFilePath(bookName));
            }
            catch (Exception ex)
            {
                Logger.Error(ex, "An error occurred during execution");
            }
        }
 
        #endregion
 
        #region Private Methods
 
        /// <summary>
        /// Extracts the book title from the HTML
        /// </summary>
        private static string GetBookName(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            var bookName = doc.DocumentNode.SelectSingleNode(XpathBookName)?.InnerText.Trim() ?? "Unknown title";
            bookName = System.Web.HttpUtility.HtmlDecode(bookName);
            Logger.Info($"Book title: 《{bookName}》");
            return bookName;
        }
 
        /// <summary>
        /// Collects every chapter entry of the novel, following list pagination
        /// </summary>
        private static async Task<List<Chapter>> GetBookChaptersAsync(string initialHtml)
        {
            var chapters = new List<Chapter>();
            var currentPageUrl = BaseUrl;
            var html = initialHtml;
 
            while (true)
            {
                var doc = new HtmlDocument();
                doc.LoadHtml(html);
 
                var chapterNodes = doc.DocumentNode.SelectNodes(XpathChapterList);
                if (chapterNodes != null)
                {
                    foreach (var node in chapterNodes)
                    {
                        var aNode = node.SelectSingleNode(".//a");
                        if (aNode != null)
                        {
                            var title = aNode.InnerText.Trim();
                            var href = aNode.GetAttributeValue("href", "");
                            if (!string.IsNullOrEmpty(href))
                            {
                                chapters.Add(new Chapter(title, href));
                            }
                        }
                    }
                }
                else
                {
                    Logger.Warn($"No chapter list found on page {currentPageUrl}");
                }
 
                // Look for the "next page" link (the site's link text is 下一页)
                var nextPageNode = doc.DocumentNode.SelectSingleNode("//a[contains(text(), '下一页')]");
                if (nextPageNode != null)
                {
                    var nextPageUrl = nextPageNode.GetAttributeValue("href", "");
                    if (!string.IsNullOrEmpty(nextPageUrl))
                    {
                        currentPageUrl = new Uri(new Uri(BaseUrl), nextPageUrl).AbsoluteUri;
                        Logger.Info($"Fetching next page of the chapter list: {currentPageUrl}");
                        html = await GetHtmlAsync(currentPageUrl);
                    }
                    else
                    {
                        Logger.Warn("Found a 'next page' link, but its URL is empty");
                        break;
                    }
                }
                else
                {
                    Logger.Info("No 'next page' link found; the chapter list is complete");
                    break;
                }
            }
 
            if (chapters.Count == 0)
            {
                Logger.Warn("No chapters found");
            }
            else
            {
                Logger.Info($"Fetched {chapters.Count} chapters in total");
            }
 
            return chapters;
        }
 
        /// <summary>
        /// Fetches the content of every chapter, with bounded concurrency
        /// </summary>
        private static async Task<string> ProcessChaptersAsync(List<Chapter> chapters)
        {
            var semaphore = new SemaphoreSlim(MaxConcurrency);
            var tasks = chapters.Select(async chapter =>
            {
                try
                {
                    await semaphore.WaitAsync();
                    chapter.Content = await GetChapterContentsAsync(chapter.Url);
                    UpdateConsoleTitle(); // update the progress shown in the console title
                }
                catch (Exception ex)
                {
                    Logger.Error(ex, $"Failed to fetch chapter content: {chapter.Url}");
                }
                finally
                {
                    semaphore.Release();
                }
            });
 
            await Task.WhenAll(tasks);
 
            Console.Title = $"Progress: {_totalTasks}/{_totalTasks}"; // mark the job as finished
            return string.Join(Environment.NewLine + Environment.NewLine,
                chapters.Select(chapter => $"{chapter.Title}{Environment.NewLine}{Environment.NewLine}{chapter.Content}"));
        }
 
        private static void UpdateConsoleTitle()
        {
            Interlocked.Increment(ref _completedTasks); // one more chapter done
            Console.Title = $"Progress: {_completedTasks}/{_totalTasks}";
        }
 
        /// <summary>
        /// Fetches the content of a single chapter, following content pagination
        /// </summary>
        private static async Task<string> GetChapterContentsAsync(string initialChapterUrl)
        {
            var contentBuilder = new StringBuilder();
            var currentUrl = initialChapterUrl;
 
            while (true)
            {
                var chapterHtml = await GetHtmlAsync(currentUrl);
                var doc = new HtmlDocument();
                doc.LoadHtml(chapterHtml);
 
                var contentNodes = doc.DocumentNode.SelectNodes(XpathContents);
                if (contentNodes != null)
                {
                    foreach (var node in contentNodes)
                    {
                        // InnerHtml (not InnerText) is kept so that <p>/<br> tags can be normalized later in AmendContent
                        contentBuilder.AppendLine(node.InnerHtml.Trim());
                    }
                }
                else
                {
                    Logger.Warn($"No content found on page {currentUrl}");
                }
 
                // Look for the "next page" link (the site's link text is 下一页)
                var nextPageNode = doc.DocumentNode.SelectSingleNode("//a[contains(text(), '下一页')]");
                if (nextPageNode != null)
                {
                    var nextPageUrl = nextPageNode.GetAttributeValue("href", "");
                    if (!string.IsNullOrEmpty(nextPageUrl))
                    {
                        currentUrl = new Uri(new Uri(BaseUrl), nextPageUrl).AbsoluteUri;
                        Logger.Info($"Fetching next page of content: {currentUrl}");
                    }
                    else
                    {
                        Logger.Info("Found a 'next page' link, but its URL is empty; chapter content complete");
                        break;
                    }
                }
                else
                {
                    Logger.Info("No 'next page' link found; chapter content complete");
                    break;
                }
            }
 
            var content = contentBuilder.ToString().Trim();
            Logger.Info($"Fetched chapter content length: {content.Length} characters");
            return content;
        }
 
        /// <summary>
        /// Cleans up and normalizes the novel text
        /// </summary>
        private static string AmendContent(string content)
        {
            content = content.Replace("</p><p>", Environment.NewLine + Environment.NewLine)
                .Replace("<br>", Environment.NewLine)
                .Replace("<br/>", Environment.NewLine)
                .Replace("<br />", Environment.NewLine);
            content = System.Web.HttpUtility.HtmlDecode(content);
 
            // 字数:.*(新群)\*  : promo text starting with "字数:" and ending with "(新群)*"
            // -->>.*继续阅读)   : "-->> ... 继续阅读)" pagination notices
            // 【.*】            : bracketed 【...】 notes; | is alternation between the three
            content = Regex.Replace(content, @"字数:.*(新群)\*|-->>.*继续阅读)|【.*】", "\r\n\r\n");
 
            var regex = new Regex("<.*?>", RegexOptions.Compiled);
            return regex.Replace(content, string.Empty);
        }
 
        /// <summary>
        /// Writes the processed content to a file
        /// </summary>
        private static async Task WriteToFileAsync(string content, string filePath)
        {
            try
            {
                var directory = Path.GetDirectoryName(filePath);
                if (!string.IsNullOrEmpty(directory))
                {
                    Directory.CreateDirectory(directory);
                }
 
                // Note: on modern .NET, Encoding.Default is UTF-8
                await File.WriteAllTextAsync(filePath, content, System.Text.Encoding.Default);
                Logger.Info($"File written successfully: {filePath}");
            }
            catch (Exception ex)
            {
                Logger.Error(ex, $"Error writing file: {filePath}");
            }
        }
 
        /// <summary>
        /// Builds the output file path for the novel
        /// </summary>
        private static string GetFilePath(string bookName)
        {
            var uri = new Uri(BaseUrl);
            var domain = uri.Host;
            return Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "下载", domain, $"{bookName}.txt");
        }
 
        /// <summary>
        /// Fetches the HTML of the given URL, retrying on failure
        /// </summary>
        private static async Task<string> GetHtmlAsync(string url)
        {
            for (int i = 0; i < MaxRetries; i++)
            {
                try
                {
                    var response = await HttpClient.GetAsync(url);
                    response.EnsureSuccessStatusCode();
                    var responseBody = await response.Content.ReadAsByteArrayAsync();
                    var responseString = System.Text.Encoding.GetEncoding(Encoding).GetString(responseBody);
                    Logger.Info($"Request to {url} succeeded, status code: {(int)response.StatusCode}");
                    return responseString;
                }
                catch (HttpRequestException ex)
                {
                    Logger.Warn($"Request to {url} failed: {ex.Message}; retrying in 10 seconds...");
                    await Task.Delay(TimeSpan.FromSeconds(10));
                }
            }
 
            Logger.Error($"Giving up on {url} after repeated failures.");
            return string.Empty;
        }
 
        /// <summary>
        /// Creates and configures the shared HttpClient instance
        /// </summary>
        private static HttpClient CreateHttpClient()
        {
            var client = new HttpClient
            {
                BaseAddress = new Uri(BaseUrl)
            };
 
            client.DefaultRequestHeaders.Clear();
            client.DefaultRequestHeaders.Add("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9");
            client.DefaultRequestHeaders.Add("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6");
            client.DefaultRequestHeaders.Add("Cache-Control", "max-age=0");
            client.DefaultRequestHeaders.Add("Connection", "keep-alive");
            // The Cookie header is optional; adjust it depending on whether the target site requires login
            client.DefaultRequestHeaders.Add("Cookie", "cJumpPV10861=1; autojumpPV10861=6; __51vcke__K0XyxJ3OaBItNhg9=3b3030a8-3fd7-5de0-8777-5218fa3ae91f; __51vuft__K0XyxJ3OaBItNhg9=1719545541437; __51uvsct__K0XyxJ3OaBItNhg9=7; PHPSESSID=363b7e9b0cf6d31cb701f1b747f8bc9c; autojumpStats10861=1; autojumpNum10861=1; cJumpPV10861=1; autojumpPV10861=3; __vtins__K0XyxJ3OaBItNhg9=%7B%22sid%22%3A%20%228336a7ae-3043-5bf6-a7c4-736a2d7c0f9c%22%2C%20%22vd%22%3A%208%2C%20%22stt%22%3A%20334475%2C%20%22dr%22%3A%203517%2C%20%22expires%22%3A%201719579012508%2C%20%22ct%22%3A%201719577212508%7D");
            client.DefaultRequestHeaders.Add("Upgrade-Insecure-Requests", "1");
            client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Mobile Safari/537.36 Edg/119.0.0.0");
 
            return client;
        }
 
        #endregion
 
        #region Inner Classes
 
        /// <summary>
        /// Represents a single chapter of the novel
        /// </summary>
        private class Chapter
        {
            public string Title { get; }
            public string Url { get; }
            public string Content { get; set; }
 
            public Chapter(string title, string url)
            {
                Title = title;
                Url = url;
                Content = string.Empty;
            }
        }
 
        #endregion
    }
}


Call it from Program.cs with the following statements:

// Register the System.Text.Encoding.CodePages provider so GBK/GB2312 encodings are available
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
await shubaoju.ExecuteAsync("https://i.shubaoju.cc/html/8/8481/index.html");


One last note: the core of this class is built on HttpClient. It could also be written against IHttpClientFactory, which gives more flexible and lighter-weight automatic resource management. I have in fact implemented that in another class, but invoking it is much more involved, since it requires registering .NET Core services, so I won't post it here for now; those interested can dig into it themselves.
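For reference, here is a minimal sketch of what the IHttpClientFactory wiring typically looks like. This is not the author's withheld class, just an illustration; it assumes the Microsoft.Extensions.Http package, and the client name "scraper" is made up for the example:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;

class FactoryDemo
{
    static async Task Main()
    {
        var services = new ServiceCollection();

        // Register a named client; the factory pools and recycles the
        // underlying handlers automatically.
        services.AddHttpClient("scraper", client =>
        {
            client.BaseAddress = new Uri("https://i.shubaoju.cc");
            client.DefaultRequestHeaders.Add("User-Agent",
                "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Mobile Safari/537.36 Edg/119.0.0.0");
        });

        await using var provider = services.BuildServiceProvider();
        var factory = provider.GetRequiredService<IHttpClientFactory>();

        // CreateClient is cheap to call; handler lifetime is managed for you.
        var http = factory.CreateClient("scraper");
        var html = await http.GetStringAsync("/html/8/8481/index.html");
        Console.WriteLine(html.Length);
    }
}
```

As the post says, the price of this approach is the service-registration ceremony; for a small console scraper a single static HttpClient is a reasonable trade-off.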

[Run screenshot]


jdclang (OP) replied on 2024-6-28 21:18
luxingyu329 wrote on 2024-6-28 21:16: "Finally, a .NET scraping example."

Writing a scraper in .NET is really no more complex than in Python; the stricter syntax just makes the code look longer. For multithreading and concurrency control, though, .NET is far more convenient.
gksj posted on 2024-6-29 00:17
I don't have VS2022 or .NET 8.0 installed. May I ask: what does a WinForms program AOT-compiled with VS2022 + .NET 8.0 actually look like? Can tools like dnSpy still recover its source code?
luxingyu329 posted on 2024-6-28 21:16
Finally, a .NET scraping example.
传闻中的喜哥哥 posted on 2024-6-29 11:16
Can it scrape links that are only shown after payment?
msmvc posted on 2024-6-29 11:40
You could give Playwright a try; the official site documents how to use it from .NET. It feels fairly simple to use.
L__ posted on 2024-6-29 17:03
"Writing a scraper in .NET is no more complex than in Python"
xiaofeng4929 posted on 2024-6-30 07:50
Thanks for sharing; I'm currently learning this area.
xiaochezi01 posted on 2024-7-2 14:54
gksj wrote on 2024-6-29 00:17: "I don't have VS2022 or .NET 8.0 installed... can tools like dnSpy ..."

WinForms can't be published with AOT yet, can it?
aisht posted on 2024-7-2 15:12
Reading your code, I find it rather convoluted...