html - Trim string to length ignoring HTML

Question

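This problem is a challenging one. Our application allows users to post news on our homepage. That news is entered through a rich text editor that allows HTML. On the homepage, we want to display only a truncated summary of each news item.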

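For example, here is the full text we are displaying, including HTML:

In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us.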

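We want to trim the news item to 250 characters, not counting the HTML.

The method we currently use for trimming counts the HTML as well, so news posts that are heavy on HTML get truncated far too early.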

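For instance, if the example above contained a lot of HTML, the summary could end up looking like this:

In an attempt to make a bit more space in the office, kitchen, I've pulled...

This is not what we want.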

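Does anyone have a way of tokenizing HTML tags so that we can keep their positions in the string, perform the length check and/or trim on the string, and then restore the HTML at its old locations?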

score 10 · Accepted Answer

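Start at the first character of the post and step over each character. Every time you step over a character, increment a counter. When you find a '<' character, stop incrementing the counter until you hit a '>' character. Your position when the counter reaches 250 is where you actually want to cut.

Note that this leaves you with another problem to handle: an HTML tag may be opened before the cutoff but not closed before it.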
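The counting described above can be sketched in a few lines of Java. This is only an illustration under simple assumptions: every '<' starts a tag, every tag ends at the next '>', and the names `VisibleCut` and `cutIndex` are made up for this example.

```java
final class VisibleCut {

    /**
     * Returns the index just past the limit-th visible character of html,
     * or html.length() if there are fewer visible characters than limit.
     * Characters from '<' up to and including the next '>' are not counted.
     */
    static int cutIndex(String html, int limit) {
        if (limit <= 0) {
            return 0;
        }
        int visible = 0;
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;              // stop counting
            } else if (c == '>' && inTag) {
                inTag = false;             // resume counting after the tag
            } else if (!inTag) {
                visible++;
                if (visible == limit) {
                    return i + 1;          // cut right after this character
                }
            }
        }
        return html.length();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        int cut = cutIndex(html, 7);
        // prints "<p>Hello <b>w": both <p> and <b> are left unclosed
        System.out.println(html.substring(0, cut));
    }
}
```

The output of `main` shows the second problem mentioned above: the cut lands in the right place, but the tags opened before it still need to be closed.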

score 2 · Accepted Answer

Following the two-state finite state machine suggestion, I developed a simple HTML parser for this purpose in Java:

http://pastebin.com/jCRqiwNH

And here are the test cases:

http://pastebin.com/37gCS4tV

And here is the Java code:

import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

public class HtmlShortener {

    private static final String TAGS_TO_SKIP = "br,hr,img,link";
    private static final String[] tagsToSkip = TAGS_TO_SKIP.split(",");
    private static final int STATUS_READY = 0;

    private int cutPoint = -1;
    private String htmlString = "";

    final List<String> tags = new LinkedList<String>();

    StringBuilder sb = new StringBuilder("");
    StringBuilder tagSb = new StringBuilder("");

    int charCount = 0;
    int status = STATUS_READY;

    public HtmlShortener(String htmlString, int cutPoint){
        this.cutPoint = cutPoint;
        this.htmlString = htmlString;
    }

    public String cut(){

        // reset 
        tags.clear();
        sb = new StringBuilder("");
        tagSb = new StringBuilder("");
        charCount = 0;
        status = STATUS_READY;

        String tag = "";

        if (cutPoint < 0){
            return htmlString;
        }

        if (null != htmlString){

            if (cutPoint == 0){
                return "";
            }

            for (int i = 0; i < htmlString.length(); i++){

                String strC = htmlString.substring(i, i+1);


                if (strC.equals("<")){

                    // new tag or tag closure

                    // previous tag reset
                    tagSb = new StringBuilder("");
                    tag = "";

                    // find tag type and name
                    for (int k = i; k < htmlString.length(); k++){

                        String tagC = htmlString.substring(k, k+1);
                        tagSb.append(tagC);

                        if (tagC.equals(">")){
                            tag = getTag(tagSb.toString());
                            if (tag.startsWith("/")){

                                // closure: ignore a stray closing tag when nothing is open
                                if (!isToSkip(tag) && !tags.isEmpty()){
                                    sb.append("</").append(tags.get(tags.size() - 1)).append(">");
                                    tags.remove(tags.size() - 1);
                                }

                            } else {

                                // new tag
                                sb.append(tagSb.toString());

                                if (!isToSkip(tag)){
                                    tags.add(tag);  
                                }

                            }

                            i = k;
                            break;
                        }

                    }

                } else {

                    sb.append(strC);
                    charCount++;

                }

                // cut check
                if (charCount >= cutPoint){

                    // close previously open tags
                    Collections.reverse(tags);
                    for (String t : tags){
                        sb.append("</").append(t).append(">");
                    }
                    break;
                } 

            }

            return sb.toString();

        } else {
            return null;
        }

    }

    private boolean isToSkip(String tag) {

        if (tag.startsWith("/")){
            tag = tag.substring(1, tag.length());
        }

        for (String tagToSkip : tagsToSkip){
            if (tagToSkip.equals(tag)){
                return true;
            }
        }

        return false;
    }

    private String getTag(String tagString) {

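        String name;
        if (tagString.contains(" ")){
            // tag with attributes
            name = tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(" "));
        } else {
            // simple tag
            name = tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(">"));
        }

        // self-closing tags such as <br/> would otherwise yield "br/" and miss the skip list
        if (name.length() > 1 && name.endsWith("/")){
            name = name.substring(0, name.length() - 1);
        }
        return name;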


    }

}

score 0 · Accepted Answer

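I know this is quite a while after the posted date, but I had a similar issue and this is how I ended up solving it. My concern is the speed of regex versus iterating through an array.

Also note that if there is a space before an HTML tag, this code does not fix that up.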

private string HtmlTrimmer(string input, int len)
{
    if (string.IsNullOrEmpty(input))
        return string.Empty;
    if (input.Length <= len)
        return input;

    // this is necessary because regex "^" applies to the start of the string, not where you tell it to start from
    string inputCopy;
    string tag;

    string result = "";
    int strLen = 0;
    int strMarker = 0;
    int inputLength = input.Length;     

    Stack stack = new Stack(10);
    Regex text = new Regex("^[^<&]+");                
    Regex singleUseTag = new Regex("^<[^>]*?/>");            
    Regex specChar = new Regex("^&[^;]*?;");
    Regex htmlTag = new Regex("^<.*?>");

    while (strLen < len)
    {
        inputCopy = input.Substring(strMarker);
        //If the marker is at the end of the string OR 
        //the sum of the remaining characters and those analyzed is less than the maxlength
        if (strMarker >= inputLength || (inputLength - strMarker) + strLen < len)
            break;

        //Match regular text
        result += text.Match(inputCopy,0,len-strLen);
        strLen += result.Length - strMarker;
        strMarker = result.Length;

        inputCopy = input.Substring(strMarker);
        if (singleUseTag.IsMatch(inputCopy))
            result += singleUseTag.Match(inputCopy);
        else if (specChar.IsMatch(inputCopy))
        {
            //think of &nbsp; as 1 character instead of 6
            result += specChar.Match(inputCopy);
            ++strLen;
        }
        else if (htmlTag.IsMatch(inputCopy))
        {
            tag = htmlTag.Match(inputCopy).ToString();
            //This only works if this is valid Markup...
            if(tag[1]=='/')         //Closing tag
                stack.Pop();
            else                    //not a closing tag
                stack.Push(tag);
            result += tag;
        }
        else    //Bad syntax
            result += input[strMarker];

        strMarker = result.Length;
    }

    while (stack.Count > 0)
    {
        tag = stack.Pop().ToString();
        result += tag.Insert(1, "/");
    }
    if (strLen == len)
        result += "...";
    return result;
}
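The entity handling in the code above (counting &nbsp; as one character) can be illustrated on its own. The following is a minimal Java sketch, not a port of the C# code: it assumes tags never contain a '>' inside attribute values, and the names `VisibleLength` and `visibleLength` are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class VisibleLength {

    // Group 1: a tag. Group 2: a character reference such as &nbsp; or &#160;.
    // Group 3: any other single character.
    private static final Pattern TOKEN =
            Pattern.compile("(<[^>]*>)|(&#?[A-Za-z0-9]+;)|(.)", Pattern.DOTALL);

    /** Counts visible characters: tags count as 0, each entity counts as 1. */
    static int visibleLength(String html) {
        int count = 0;
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(1) == null) {   // not a tag: an entity or an ordinary character
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "Tea" (3) + "&nbsp;" (1) + "time" (4); prints 8
        System.out.println(visibleLength("Tea&nbsp;<b>time</b>"));
    }
}
```

A bare '&' that is not followed by a well-formed reference, as in "AT&T", simply falls through to the last alternative and is counted as an ordinary character.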

score 0 · Accepted Answer

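If I understand the problem correctly, you want to keep the HTML formatting, but you do not want it counted as part of the length of the string you are keeping.

You can accomplish this with code that implements a simple finite state machine.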

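2 states: InTag, OutOfTag
InTag:
- Goes to OutOfTag if a > character is encountered
- Goes to itself if any other character is encountered
OutOfTag:
- Goes to InTag if a < character is encountered
- Goes to itself if any other character is encountered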

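Your starting state will be OutOfTag.

You implement a finite state machine by processing one character at a time. Processing each character brings you to a new state.

As you run your text through the finite state machine, you also want to keep an output buffer and a variable holding the length encountered so far (so you know when to stop).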

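1. Increment your Length variable each time you are in the OutOfTag state and you process another character. You can optionally not increment this variable for whitespace characters.
2. End the algorithm when you have no more characters or when you have reached the desired length mentioned in #1.
3. In your output buffer, include the characters you encounter up to the length mentioned in #1.
4. Keep a stack of unclosed tags. When you reach the length, add an end tag for each element in the stack. As you run through the algorithm, you can tell when you encounter a tag by keeping a current_tag variable. This current_tag variable is started when you enter the InTag state and ended when you enter the OutOfTag state (or when a whitespace character is encountered while in the InTag state). If you have a start tag, push it onto the stack. If you have an end tag, pop it from the stack.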
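The steps above can be sketched in Java roughly as follows. This is a minimal sketch under stated assumptions, not a full HTML parser: the markup is reasonably well formed, comments and attribute values contain no '>', whitespace is counted like any other character, and the class name `HtmlTruncator`, the method `truncate`, and the short `VOID` tag list are all invented for the example.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

final class HtmlTruncator {

    private enum State { OUT_OF_TAG, IN_TAG }

    // Tags that never take an end tag (an illustrative, incomplete list).
    private static final Set<String> VOID =
            new HashSet<>(Arrays.asList("br", "hr", "img", "input", "link", "meta"));

    static String truncate(String html, int maxVisible) {
        StringBuilder out = new StringBuilder();
        StringBuilder currentTag = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();   // stack of unclosed tags
        State state = State.OUT_OF_TAG;
        int length = 0;                            // visible characters so far

        for (int i = 0; i < html.length() && length < maxVisible; i++) {
            char c = html.charAt(i);
            out.append(c);
            if (state == State.OUT_OF_TAG) {
                if (c == '<') {
                    state = State.IN_TAG;
                    currentTag.setLength(0);
                } else {
                    length++;
                }
            } else if (c == '>') {
                state = State.OUT_OF_TAG;
                recordTag(currentTag.toString(), open);
            } else {
                currentTag.append(c);
            }
        }

        // Close whatever is still open, innermost first.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    private static void recordTag(String body, Deque<String> open) {
        String trimmed = body.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("!") || trimmed.endsWith("/")) {
            return;                                // comment, doctype, or self-closing tag
        }
        boolean closing = trimmed.startsWith("/");
        String rest = closing ? trimmed.substring(1).trim() : trimmed;
        String name = rest.split("\\s+")[0].toLowerCase();
        if (closing) {
            if (!open.isEmpty()) {
                open.pop();                        // end tag: pop the matching start tag
            }
        } else if (!VOID.contains(name)) {
            open.push(name);                       // start tag: remember it
        }
    }

    public static void main(String[] args) {
        // prints "<p>Hello <b>w</b></p>"
        System.out.println(truncate("<p>Hello <b>world</b></p>", 7));
    }
}
```

Because the counter only advances in the OutOfTag state, the loop can never stop in the middle of a tag, so the output is cut between characters of text and the remaining stack entries are exactly the tags that need closing.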

score 0 · Accepted Answer

Here is the implementation I came up with, in C#:

public static string TrimToLength(string input, int length)
{
  if (string.IsNullOrEmpty(input))
    return string.Empty;

  if (input.Length <= length)
    return input;

  bool inTag = false;
  int targetLength = 0;

  for (int i = 0; i < input.Length; i++)
  {
    char c = input[i];

    if (c == '>')
    {
      inTag = false;
      continue;
    }

    if (c == '<')
    {
      inTag = true;
      continue;
    }

    if (inTag || char.IsWhiteSpace(c))
    {
      continue;
    }

    targetLength++;

    if (targetLength == length)
    {
      return ConvertToXhtml(input.Substring(0, i + 1));
    }
  }

  return input;
}

And a few of the unit tests I used while developing it with TDD:

[Test]
public void Html_TrimReturnsEmptyStringWhenNullPassed()
{
  Assert.That(Html.TrimToLength(null, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsEmptyStringWhenEmptyPassed()
{
  Assert.That(Html.TrimToLength(string.Empty, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsUnmodifiedStringWhenSameAsLength()
{
  string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                  "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                  "<br/>" +
                  "In an attempt to make a bit more space in the office, kitchen, I";

  Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(source));
}

[Test]
public void Html_TrimWellFormedHtml()
{
  string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
             "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
             "<br/>" +
             "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
             "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>" +
             "</div>";

  string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                    "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                    "<br/>" +
                    "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";

  Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(expected));
}

[Test]
public void Html_TrimMalformedHtml()
{
  string malformedHtml = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                         "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                         "<br/>" +
                         "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
                         "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>";

  string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
              "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
              "<br/>" +
              "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";

  Assert.That(Html.TrimToLength(malformedHtml, 250), Is.EqualTo(expected));
}

score 0 · Accepted Answer

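You can try the following npm package:

trim-html

It cuts off the text inside the HTML tags at the limit, preserves the original HTML structure, removes HTML tags after the limit is reached, and closes any opened tags.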

score -1 · Accepted Answer

Wouldn't the fastest way be to use jQuery's text() method?

For example:

<ul>
  <li>One</li>
  <li>Two</li>
  <li>Three</li>
</ul>

var text = $('ul').text();

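This gives the variable text the value OneTwoThree (plus any whitespace between the list items). That lets you get the actual length of the text without the HTML included.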
