regex - How to extract values from HTML using RegEx?

Question

Suppose I have the following HTML:

<p><span class="xn-location">OAK RIDGE, N.J.</span>, <span class="xn-chron">March 16, 2011</span> /PRNewswire/ -- Lakeland Bancorp, Inc. (Nasdaq:   <a href='http://studio-5.financialcontent.com/prnews?Page=Quote&Ticker=LBAI' target='_blank' title='LBAI'> LBAI</a>), the holding company for Lakeland Bank, today announced that it redeemed <span class="xn-money">$20 million</span> of the Company's outstanding <span class="xn-money">$39 million</span> in Fixed Rate Cumulative Perpetual Preferred Stock, Series A that was issued to the U.S. Department of the Treasury under the Capital Purchase Program on <span class="xn-chron">February 6, 2009</span>, thereby reducing Treasury's investment in the Preferred Stock to <span class="xn-money">$19 million</span>. The Company paid approximately <span class="xn-money">$20.1 million</span> to the Treasury to repurchase the Preferred Stock, which included payment for accrued and unpaid dividends for the shares. &#160;This second repayment, or redemption, of Preferred Stock will result in annualized savings of <span class="xn-money">$1.2 million</span> due to the elimination of the associated preferred dividends and related discount accretion. &#160;A one-time, non-cash charge of <span class="xn-money">$745 thousand</span> will be incurred in the first quarter of 2011 due to the acceleration of the Preferred Stock discount accretion. &#160;The warrant previously issued to the Treasury to purchase 997,049 shares of common stock at an exercise price of <span class="xn-money">$8.88</span>, adjusted for stock dividends and subject to further anti-dilution adjustments, will remain outstanding.</p>

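I would like to get the values inside the <span> elements, as well as the value of each element's class attribute.

Ideally, I could run the HTML through a function and get back a dictionary of the extracted entities (based on the parsing described above).

The code above is a snippet from a larger source HTML file that is not XML-valid, so an XML parser fails on it. I am therefore looking for a regular expression that could help extract the information of interest.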

score 9 · Accepted Answer

Use this tool (free): http://www.radsoftware.com.au/regexdesigner/

Use this regex:

"<span[^>]*>(.*?)</span>"

The value of group 1 (for each match) will be the text you need.

In C#, it would look like this:

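            // Requires: using System.Text.RegularExpressions;
            Regex regex = new Regex("<span[^>]*>(.*?)</span>");
            string toMatch = "<span class=\"ajjsjs\">Some text</span>";
            // Matches returns an empty collection when nothing matches,
            // so a separate IsMatch check is not needed.
            foreach (Match m in regex.Matches(toMatch))
            {
                // Group 1 is the text between the opening and closing tags.
                string val = m.Groups[1].Value;
                // Do something with the value
            }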

Edited to answer the comment:

            // Requires: using System.Text.RegularExpressions;
            Regex regex = new Regex("<span class=\"(.*?)\">(.*?)</span>");
            string toMatch = "<span class=\"ajjsjs\">Some text</span>";
            foreach (Match m in regex.Matches(toMatch))
            {
                // "class" is a reserved word in C#, so use another name.
                string className = m.Groups[1].Value; // the class attribute
                string val = m.Groups[2].Value;       // the inner text
                // Do something with the class and value
            }
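
For the dictionary the question asks for, here is a minimal sketch of the same two-group pattern. Python is my assumption (the question does not name a language), and extract_spans and sample are hypothetical names used only for illustration. It groups the text of each span under its class value and assumes, like the pattern above, that class is the only attribute and that spans are not nested.

```python
import re
from collections import defaultdict

# Same idea as the second C# pattern: group 1 = class value, group 2 = inner text.
SPAN_RE = re.compile(r'<span class="(.*?)">(.*?)</span>', re.DOTALL)


def extract_spans(html):
    """Return {class_name: [text, ...]} for each non-nested <span class="...">."""
    found = defaultdict(list)
    for class_name, text in SPAN_RE.findall(html):
        found[class_name].append(text.strip())
    return dict(found)


# Shortened version of the HTML from the question.
sample = (
    '<p><span class="xn-location">OAK RIDGE, N.J.</span>, '
    '<span class="xn-chron">March 16, 2011</span> redeemed '
    '<span class="xn-money">$20 million</span> of '
    '<span class="xn-money">$39 million</span></p>'
)

print(extract_spans(sample))
# {'xn-location': ['OAK RIDGE, N.J.'], 'xn-chron': ['March 16, 2011'],
#  'xn-money': ['$20 million', '$39 million']}
```

Values are kept in lists because the same class (xn-money here) can appear many times in one document.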

score 2 · Answer

Assuming there are no nested span tags, the following should work:

/<span(?:[^>]+class=\"(.*?)\"[^>]*)?>(.*?)<\/span>/

I have only done basic testing, but it matches the class of the span tag (if present) along with the data, up to the point where the tag is closed.
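
To see what "if present" means in practice, here is a small sketch that runs this pattern in Python (my choice of language, not something the answer specifies; the test string is made up). When a span has no class attribute, the optional group does not participate and group 1 comes back as None.

```python
import re

# The pattern from this answer, without the /.../ delimiters.
SPAN_RE = re.compile(r'<span(?:[^>]+class="(.*?)"[^>]*)?>(.*?)</span>')

html = '<span class="xn-money">$8.88</span> and <span>plain</span>'

# Collect (class, text) pairs; class is None when the attribute is absent.
pairs = [(m.group(1), m.group(2)) for m in SPAN_RE.finditer(html)]
print(pairs)
# [('xn-money', '$8.88'), (None, 'plain')]
```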

score 1 · Answer

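I strongly recommend using a real HTML or XML parser instead. You cannot reliably parse HTML or XML with regular expressions. The most you can do is come close, and the closer you get, the more convoluted and time-consuming your regex becomes. The larger the HTML file you need to parse, the more likely it is to break any simple regex pattern.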

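A regex like <span[^>]*>(.*?)</span> will work on your example, but there is a lot of XML-valid code that is difficult or even impossible to parse with regex (for example, nested spans such as <span>foo <span>bar</span></span> will break the pattern above). If you need something that works on other HTML samples, regex is not the way to go here.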
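
As a quick illustration of the failure mode, here is a sketch in Python (my assumption for the language) that uses a nested-span input chosen for the demonstration. The lazy group stops at the first closing tag, so the outer span is cut short and the inner span is never reported on its own.

```python
import re

SPAN_RE = re.compile(r'<span[^>]*>(.*?)</span>')

nested = '<span>foo <span>bar</span></span>'

# The match starts at the outer <span> and ends at the inner </span>.
captured = SPAN_RE.findall(nested)
print(captured)
# ['foo <span>bar']
```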

Since your HTML is not XML-valid, consider the HTML Agility Pack, which I have heard is very good.
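
HTML Agility Pack is a .NET library. For readers outside .NET, here is a rough sketch of the same parser-based idea using Python's standard html.parser module instead (a substitute I am swapping in, not the library recommended above). SpanCollector is a hypothetical helper name. It tracks open spans on a stack so that nested spans are handled, which the regex approach cannot do.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect (class, text) for every <span>, including nested ones."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.results = []  # finished spans as (class_name, text)
        self._open = []    # stack of (class_name, text_parts) for open spans

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._open.append((dict(attrs).get("class"), []))

    def handle_data(self, data):
        # Text belongs to every span that is currently open.
        for _, parts in self._open:
            parts.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._open:
            class_name, parts = self._open.pop()
            self.results.append((class_name, "".join(parts)))


collector = SpanCollector()
collector.feed('<p><span class="outer">foo <span class="inner">bar</span></span></p>')
collector.close()
print(collector.results)
# [('inner', 'bar'), ('outer', 'foo bar')]
```

Spans are reported in the order they close, so inner spans come before the span that contains them.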
