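[Help] Extracting a specific URL from a string in C#
The string holds the HTML of a web page. I want to parse out the download-link fragment below: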
<div class="load" id="go"><a href="https://developer.lanzoug.com/file/?B2FaZFloAzJSWwc/Cz5dMVdoBz9X7gKWUeZRslK1A7gI7gLDC9ECsAfNCvALuVzLBtsP7FGvU/IFLFMhVHAHcgchWmtZbQM7UmEHDgs2XThXMAcxVz4COVFiUWdSPwM3CDoCcwswAiUHbAptC21cawZ6D3pRflNsBTBTZVQ6BzcHKFo9WTUDeFI2B2kLcF1pVzsHYFdtAmRRZVE0UjcDOwg/AmcLMgJgBzQKbgtnXG4GPg84UWhTNwVhU2JUbgc1BzRaOVlnA2dSYgc1C2tdcld5B3pXfAImUSNRJFJrA3AIYQIxC2sCZwdgCm4LbVxvBmsPOlEoUyUFa1M4VG0HYAc6WjxZMgNgUjUHZgtuXW1XPgc7Vz4CLlF4UXFSaANuCH8CaAtnAmAHZQprC29cZAZvDzpRO1NhBSRTIFR4B3EHOlo8WTIDZlI1B2kLb11pVz0HMFc7AiZRI1E+Un4DPwg6AmcLZQJ4B2kKZAtxXGwGaQ87USBTYwUwU2Q=" target="_blank" rel="noreferrer"><span class="txt">电信下载</span><span class="txt txtc">联通下载</span><span class="txt">普通下载</span></a></div>
Could anyone show me how to use C# to extract this part:
https://developer.lanzoug.com/file/?XXXXXXXXXX
(How should I match it?)
Last edited by xzqsr on 2022-12-22 21:22
Use a regular expression:
using System.Text.RegularExpressions;
// Search the input string for the first occurrence of the regular expression
string input = "page source goes here";
// [^""]* stops at the closing quote, so the match cannot run past the href value
string pattern = @"href=""https://developer\.lanzoug\.com/file/\?[^""]*"" target";
Match match = Regex.Match(input, pattern);
if (match.Success)
{
    string matchedValue = match.Value.Replace(@"href=""", "").Replace(@""" target", "");
    Console.WriteLine(matchedValue); // the extracted link
}
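I had no compiler at hand, so the hand-typed code might contain errors. (This line has been retracted.)
I have a compiler now and have revised the code above; it works, though I would recommend the approach in post #3 instead.
By the way, Lanzou Cloud links stop working after they have been matched too many times. A cloud drive this generous is rare, so please use this for learning only and not for anything else.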
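As a variation on the snippet above, here is a minimal sketch that captures the link in a named group instead of stripping the surrounding text with `Replace`. The class name `LanzouLinkExtractor`, the method name `ExtractDownloadUrl`, and the group name `url` are made up for illustration; the host and path come from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LanzouLinkExtractor
{
    // Hypothetical helper: the named group "url" holds everything between
    // the quotes of an href that points at developer.lanzoug.com/file/?...
    private static readonly Regex LinkPattern = new Regex(
        @"href=""(?<url>https://developer\.lanzoug\.com/file/\?[^""]*)""",
        RegexOptions.IgnoreCase);

    // Returns the first matching download URL, or an empty string if none is found.
    public static string ExtractDownloadUrl(string html)
    {
        Match match = LinkPattern.Match(html);
        return match.Success ? match.Groups["url"].Value : string.Empty;
    }
}
```

Calling `LanzouLinkExtractor.ExtractDownloadUrl(input)` on the page source should then return just the link, without depending on the `target` attribute that follows it.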
private static void DumpHRefs(string inputString)
{
    // Group 1 captures the href value, whether it is quoted or unquoted
    string hrefPattern = @"href\s*=\s*(?:[""'](?<1>[^""']*)[""']|(?<1>[^>\s]+))";
    try
    {
        Match regexMatch = Regex.Match(inputString, hrefPattern,
                                       RegexOptions.IgnoreCase | RegexOptions.Compiled,
                                       TimeSpan.FromSeconds(1));
        while (regexMatch.Success)
        {
            Console.WriteLine($"Found href {regexMatch.Groups[1]} at {regexMatch.Groups[1].Index}");
            regexMatch = regexMatch.NextMatch();
        }
    }
    catch (RegexMatchTimeoutException)
    {
        Console.WriteLine("The matching operation timed out.");
    }
}

public static void Main()
{
    string inputString = "My favorite web sites include:</P>" +
                         "<A HREF=\"https://docs.microsoft.com/en-us/dotnet/\">" +
                         ".NET Documentation</A></P>" +
                         "<A HREF=\"http://www.microsoft.com\">" +
                         "Microsoft Corporation Home Page</A></P>" +
                         "<A HREF=\"https://devblogs.microsoft.com/dotnet/\">" +
                         ".NET Blog</A></P>";
    DumpHRefs(inputString);
}
// The example displays the following output:
//       Found href https://docs.microsoft.com/en-us/dotnet/ at 43
//       Found href http://www.microsoft.com at 120
//       Found href https://devblogs.microsoft.com/dotnet/ at 194
EnterpriseSolu posted on 2022-12-22 20:48
private static void DumpHRefs(string inputString)
{
string hrefPattern = @"href\s*=\s*(?:[""'] ...
Hmm, that function does find quite a few links, but not the one I need.
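One way to narrow a list of href matches down to the wanted link is to filter them by prefix. The following is only a sketch under assumptions: `HrefFilter` and `FindHrefsWithPrefix` are hypothetical names, and the pattern is a simplified version of the one quoted above that handles quoted attribute values only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefFilter
{
    // Collects every quoted href value and keeps those that start with the prefix.
    public static List<string> FindHrefsWithPrefix(string html, string prefix)
    {
        var results = new List<string>();
        // Simplified pattern: single- or double-quoted href values only.
        MatchCollection matches = Regex.Matches(
            html, @"href\s*=\s*[""'](?<url>[^""']*)[""']", RegexOptions.IgnoreCase);
        foreach (Match m in matches)
        {
            string url = m.Groups["url"].Value;
            if (url.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            {
                results.Add(url);
            }
        }
        return results;
    }
}
```

For the page in the question, a call such as `HrefFilter.FindHrefsWithPrefix(html, "https://developer.lanzoug.com/file/?")` would be expected to return only the download link, provided it appears in the source as a quoted href.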
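xzqsr posted on 2022-12-22 20:38
Use a regular expression
using System.Text.RegularExpressions;
This seems to throw an error, starting inside the parentheses of Regex.Match. I'm not sure whether the function itself is the problem.
xzqsr posted on 2022-12-22 20:38
Use a regular expression
using System.Text.RegularExpressions;
It doesn't seem to match anything.
Last edited by woaidianqian on 2022-12-22 23:15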
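// See https://aka.ms/new-console-template for more information
using System.Text.RegularExpressions;
internal class Program
{
    private static void Main(string[] args)
    {
        // Raw string literal (C# 11) holding the HTML fragment from the question
        string input = """<div class="load" id="go"><a href="https://developer.lanzoug.com/file/?B2FaZFloAzJSWwc/Cz5dMVdoBz9X7gKWUeZRslK1A7gI7gLDC9ECsAfNCvALuVzLBtsP7FGvU/IFLFMhVHAHcgchWmtZbQM7UmEHDgs2XThXMAcxVz4COVFiUWdSPwM3CDoCcwswAiUHbAptC21cawZ6D3pRflNsBTBTZVQ6BzcHKFo9WTUDeFI2B2kLcF1pVzsHYFdtAmRRZVE0UjcDOwg/AmcLMgJgBzQKbgtnXG4GPg84UWhTNwVhU2JUbgc1BzRaOVlnA2dSYgc1C2tdcld5B3pXfAImUSNRJFJrA3AIYQIxC2sCZwdgCm4LbVxvBmsPOlEoUyUFa1M4VG0HYAc6WjxZMgNgUjUHZgtuXW1XPgc7Vz4CLlF4UXFSaANuCH8CaAtnAmAHZQprC29cZAZvDzpRO1NhBSRTIFR4B3EHOlo8WTIDZlI1B2kLb11pVz0HMFc7AiZRI1E+Un4DPwg6AmcLZQJ4B2kKZAtxXGwGaQ87USBTYwUwU2Q=" target="_blank" rel="noreferrer"><span class="txt">电信下载</span><span class="txt txtc">联通下载</span><span class="txt">普通下载</span></a></div>""";
        Match match = Regex.Match(input, "<a href=\"(.+)\" target=\"_blank\" rel=\"noreferrer\">");
        // Group 1 is the captured href value
        string result = match.Groups[1].Value;
        Console.WriteLine(result);
    }
}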
Output:
PS D:\Users\chengchao\Documents\临时c#> dotnet run
https://developer.lanzoug.com/file/?B2FaZFloAzJSWwc/Cz5dMVdoBz9X7gKWUeZRslK1A7gI7gLDC9ECsAfNCvALuVzLBtsP7FGvU/IFLFMhVHAHcgchWmtZbQM7UmEHDgs2XThXMAcxVz4COVFiUWdSPwM3CDoCcwswAiUHbAptC21cawZ6D3pRflNsBTBTZVQ6BzcHKFo9WTUDeFI2B2kLcF1pVzsHYFdtAmRRZVE0UjcDOwg/AmcLMgJgBzQKbgtnXG4GPg84UWhTNwVhU2JUbgc1BzRaOVlnA2dSYgc1C2tdcld5B3pXfAImUSNRJFJrA3AIYQIxC2sCZwdgCm4LbVxvBmsPOlEoUyUFa1M4VG0HYAc6WjxZMgNgUjUHZgtuXW1XPgc7Vz4CLlF4UXFSaANuCH8CaAtnAmAHZQprC29cZAZvDzpRO1NhBSRTIFR4B3EHOlo8WTIDZlI1B2kLb11pVz0HMFc7AiZRI1E+Un4DPwg6AmcLZQJ4B2kKZAtxXGwGaQ87USBTYwUwU2Q=
PS D:\Users\chengchao\Documents\临时c#>
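For pulling tags out of a web page, Python's BeautifulSoup library is the best choice. Why work with web pages and not use Python? There is really no need for C# here. Failing that, use a regex.
Last edited by lkhjzw on 2022-12-23 09:55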
/// <summary>
/// Extracts the text between two delimiters in an HTML string.
/// </summary>
/// <param name="html">Source HTML</param>
/// <param name="s">Start delimiter (regex pattern)</param>
/// <param name="e">End delimiter (regex pattern)</param>
/// <returns>The first non-empty match, trimmed; otherwise an empty string.</returns>
internal static string GetBetweenHtml(string html, string s, string e)
{
    // Lazily match any characters, including newlines, between the delimiters
    string rx = string.Format("{0}{1}{2}", s, "([\\s\\S]*?)", e);
    if (Regex.IsMatch(html, rx, RegexOptions.IgnoreCase))
    {
        MatchCollection matchs = Regex.Matches(html, rx, RegexOptions.IgnoreCase);
        foreach (Match match in matchs)
        {
            // Groups[0] is the whole match; Groups[1] is the text between the delimiters
            if (match != null && match.Groups.Count > 1 && !string.IsNullOrWhiteSpace(match.Groups[1].Value))
            {
                return match.Groups[1].Value.Trim();
            }
        }
    }
    return string.Empty;
}
Adjust it to your own needs, for example:
GetBetweenHtml(/* your HTML text */ html, "\\<a href\\=\"", "\"\\starget\\=\"_blank\"\\srel\\=\"noreferrer\"")
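If the delimiters are plain text rather than regex patterns, the manual backslash escaping can be avoided with `Regex.Escape`. This is a sketch of that variation, not the poster's code; `BetweenDemo` and `GetBetweenLiteral` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BetweenDemo
{
    // Returns the trimmed text between two literal delimiters,
    // or an empty string when the delimiters are not found.
    public static string GetBetweenLiteral(string html, string start, string end)
    {
        // Regex.Escape turns the delimiters into patterns that match them literally.
        string pattern = Regex.Escape(start) + @"([\s\S]*?)" + Regex.Escape(end);
        Match match = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value.Trim() : string.Empty;
    }
}
```

A call might look like `BetweenDemo.GetBetweenLiteral(html, "<a href=\"", "\" target=\"_blank\"")`. The trade-off is that the delimiters can no longer contain regex constructs such as `\s`.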