c# - Regular expressions in C#

Question

I want to parse the second div out of the following HTML:

<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>

That is, this value: <div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'>

The ID can contain any digits.

Here is what I am trying:

Regex rgx = new Regex(@"'post-body-\d*'");
var res = rgx.Replace("<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>", "");

I expect the result to be <div kubedfiuabefiudsabiubfg><div kubedfiuabefiudsabiubfg>, but that is not what I get.

score 1 · Accepted Answer

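You aren't getting the result you expect because your regex string only searches for 'post-body-\d*' and not the rest of the div tag. Also, Regex.Replace does not return the text you are searching for; it replaces that text, so you end up with everything except what you searched for.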
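To make that concrete, here is a minimal sketch of what the code in the question actually produces. The class and method names (ReplaceDemo, StripId) are made up for illustration; only Regex.Replace itself comes from the question.

```csharp
using System;
using System.Text.RegularExpressions;

class ReplaceDemo
{
    // Hypothetical helper: applies the pattern from the question and
    // returns whatever Regex.Replace leaves behind.
    public static string StripId(string html)
    {
        // Matches only the quoted id value, e.g. 'post-body-123'.
        var rgx = new Regex(@"'post-body-\d*'");
        // Replace removes the matched piece and keeps everything else.
        return rgx.Replace(html, "");
    }

    static void Main()
    {
        string html = "<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";
        Console.WriteLine(StripId(html));
        // prints: <div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id= itemprop='articleBody'><div kubedfiuabefiudsabiubfg>
    }
}
```

Only the quoted id value disappears; the rest of the second div is untouched, which is why the output still contains it.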

Try using Regex.Matches (or Regex.Match if you only care about the first occurrence), change the regex string to @"<div class='post-body entry-content' id='post-body-\d*' itemprop='articleBody'>", and process the matches.

For example:

string htmlText = @"<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";

Regex rgx = new Regex(@"<div class='post-body entry-content' id='post-body-\d*' itemprop='articleBody'>");
foreach (Match match in rgx.Matches(htmlText))
{
    // Process matches
    Console.WriteLine(match.ToString());
}
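If only the first occurrence matters, Regex.Match can be used instead of the loop. The sketch below is one possible variation, not part of the original answer: the helper name FindPostBodyTag and the named group "id" are my own additions, and I use \d+ rather than \d* on the assumption that an id with no digits should not match.

```csharp
using System;
using System.Text.RegularExpressions;

class MatchDemo
{
    // Hypothetical helper: returns the first matching div tag and its numeric id,
    // or null for both when nothing matches.
    public static string FindPostBodyTag(string html, out string id)
    {
        var rgx = new Regex(
            @"<div class='post-body entry-content' id='post-body-(?<id>\d+)' itemprop='articleBody'>");

        Match m = rgx.Match(html);
        // Groups["id"] holds just the digits captured by (?<id>...).
        id = m.Success ? m.Groups["id"].Value : null;
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        string html = "<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";

        string tag = FindPostBodyTag(html, out string id);
        Console.WriteLine(tag); // the whole second div tag
        Console.WriteLine(id);  // prints: 7494158715135407463
    }
}
```

Checking m.Success before reading the value avoids treating an empty match result as a real hit.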

score 1 · Accepted Answer

If you are 100% sure that the text before and after the number will always be the same, you can use the String class's .IndexOf and .Substring methods to split the string.

string original = @"<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";

// IndexOf returns the position in the string where the piece we are looking for starts
int startIndex = original.IndexOf(@"<div class='post-body entry-content' id='post-body-");
// For the endIndex, add the number of characters in the string that you are looking for
int endIndex = original.IndexOf(@"' itemprop='articleBody'>") + 25;

// this substring will retrieve just the inner part that you are looking for
string newString = original.Substring(startIndex, endIndex - startIndex);

// newString should now equal "<div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'>"


// or, if you want to just remove the inner part, build a different string like this:
// First, get everything leading up to the startIndex
string divString = original.Substring(0, startIndex);
// then, add everything after the endIndex
divString += original.Substring(endIndex);

// divString should now equal "<div kubedfiuabefiudsabiubfg><div kubedfiuabefiudsabiubfg>"
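The snippet above assumes both markers are always present. As a slightly more defensive sketch of the same idea, the version below checks for the -1 that IndexOf returns when a marker is missing, and uses .Length in place of the hard-coded 25. The helper name ExtractTag is hypothetical.

```csharp
using System;

class SubstringDemo
{
    // Hypothetical helper: returns the tag between the two markers,
    // or null if either marker is missing.
    public static string ExtractTag(string original)
    {
        const string prefix = "<div class='post-body entry-content' id='post-body-";
        const string suffix = "' itemprop='articleBody'>";

        int startIndex = original.IndexOf(prefix, StringComparison.Ordinal);
        if (startIndex < 0) return null; // prefix not found

        // Look for the suffix only after the end of the prefix.
        int suffixIndex = original.IndexOf(suffix, startIndex + prefix.Length, StringComparison.Ordinal);
        if (suffixIndex < 0) return null; // suffix not found

        // suffix.Length replaces the magic number 25.
        int endIndex = suffixIndex + suffix.Length;
        return original.Substring(startIndex, endIndex - startIndex);
    }

    static void Main()
    {
        string original = "<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";
        Console.WriteLine(ExtractTag(original));
        // prints: <div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'>
    }
}
```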

Hope this helps...

score 0 · Accepted Answer

You can parse the HTML fragment as an XML fragment and pull out the id attribute directly.

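// Note: XElement.Parse requires well-formed XML. This fragment is not (the first attribute has no value
// and the divs are never closed), so it has to be cleaned up first or Parse throws an XmlException.
var html = "<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";
var data = XElement.Parse(html).Element("div").Attribute("id");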
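Here is a minimal sketch of that approach on input that XElement.Parse can accept. The fragment is an assumed, cleaned-up variant of the one in the question (the valueless attribute is given a name and value, and every div is closed); the attribute name data-x and the helper ReadInnerDivId are invented for illustration.

```csharp
using System;
using System.Xml.Linq;

class XmlDemo
{
    // Hypothetical helper: returns the id of the first child <div>,
    // or null if there is no such child or it has no id attribute.
    public static string ReadInnerDivId(string xml)
    {
        // Throws System.Xml.XmlException if the text is not well-formed XML.
        XElement root = XElement.Parse(xml);
        // Element and Attribute return null when nothing is found; ?. propagates that.
        return root.Element("div")?.Attribute("id")?.Value;
    }

    static void Main()
    {
        // Assumption: a well-formed version of the fragment, not the original input.
        string xml = "<div data-x='kubedfiuabefiudsabiubfg'><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'></div></div>";
        Console.WriteLine(ReadInnerDivId(xml)); // prints: post-body-7494158715135407463
    }
}
```

For real-world HTML that is not valid XML, a dedicated HTML parser is usually a better fit than XElement.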
