java - Java Scanner & Regex: applying findWithinHorizon twice returns no result

Question

I have the following code. Its purpose is to detect the encoding/charset of a given web page using a regular expression.

I need to test the following two regexes (regexHTML1 and regexHTML2). In this case the correct one is the second, regexHTML2, which prints:

Found: <meta id="HtmlHead1_desc" name="description" content="Televisores,TV 3D, TV, vídeo e MP3. Compre online Televisores,TV 3D na Fnac" /><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><meta
Found: UTF-8

With this code:

URL url = new URL("http://www.fnac.pt/imagem-e-som/TV-3D/Televisores/s21075?bl=HGAChead");
InputStream is = url.openStream();

String regexHTML1 = "<meta.*content=\\\".*;.*charset=(.*)\\\"\\s*/?>";
String regexHTML2 = "<meta.*content=\\\".*;.*charset=(.*)\\\"\\s*/?>\\s*<meta";

//  Scanner s = new Scanner(is);        
//  s.findWithinHorizon(regexHTML1, 0);
//  MatchResult result = s.match();
//  for (int i = 0; i <= result.groupCount(); i++)
//      System.out.println("Found: " + result.group(i));
//  s.close();

Scanner s2 = new Scanner(is);
s2.findWithinHorizon(regexHTML2, 0);
MatchResult result2 = s2.match();
for (int i = 0; i <= result2.groupCount(); i++)
    System.out.println("Found: " + result2.group(i));
s2.close();

However, if I uncomment the commented-out block that tests the first regex (regexHTML1), the output becomes:

Found: <meta id="HtmlHead1_desc" name="description" content="Televisores,TV 3D, TV, vídeo e MP3. Compre online Televisores,TV 3D na Fnac" /><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><meta http-equiv="PICS-Label" content="(PICS-1.1 &quot;http://www.rsac.org/ratingsv01.html&quot; l gen true comment &quot;RSACi North America Server&quot; by &quot;webmaster@fnac.com&quot; for &quot;http://www.fnac.com/&quot; on &quot;1997.06.30T14:21-0500&quot; r (n 0 s 0 v 0 l 0))" /><link rel="shortcut icon" href="/favicon.ico" /><link id="HtmlHead1_canonicalLink" rel="canonical" href="http://www.fnac.pt/imagem-e-som/TV-3D/Televisores/s21075" />
Found: UTF-8" /><meta http-equiv="PICS-Label" content="(PICS-1.1 &quot;http://www.rsac.org/ratingsv01.html&quot; l gen true comment &quot;RSACi North America Server&quot; by &quot;webmaster@fnac.com&quot; for &quot;http://www.fnac.com/&quot; on &quot;1997.06.30T14:21-0500&quot; r (n 0 s 0 v 0 l 0))" /><link rel="shortcut icon" href="/favicon.ico" /><link id="HtmlHead1_canonicalLink" rel="canonical" href="http://www.fnac.pt/imagem-e-som/TV-3D/Televisores/s21075

So regexHTML1 is not suitable. But then, when regexHTML2 (the correct one) is tested, an exception is thrown:

java.lang.IllegalStateException: No match result available

How is this possible? regexHTML2 only works when I am not testing regexHTML1...

score 1 · Accepted Answer

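An input stream is consumed as it is read (that is, it keeps track of its current position). Because both Scanners share the same stream, the first scan consumes it (and closing the first Scanner also closes the underlying stream), so nothing is left for the second Scanner. findWithinHorizon then returns null, and the following call to match() throws the IllegalStateException.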
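The effect can be reproduced without a network connection. The following is a minimal sketch, assuming a small in-memory page in place of the real URL; the class name ConsumedStreamDemo, the helper firstMatch, the sample HTML and the simplified regex are all made up for illustration and are not part of the question's code.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class ConsumedStreamDemo {
    // Hypothetical helper: returns the first match of regex in the stream,
    // or null if the Scanner finds nothing.
    // The Scanner is deliberately not closed here, because close() would
    // also close the shared stream.
    static String firstMatch(InputStream in, String regex) {
        Scanner s = new Scanner(in, "UTF-8");
        return s.findWithinHorizon(regex, 0);
    }

    public static void main(String[] args) {
        // Assumed sample page; stands in for url.openStream().
        byte[] page = "<meta charset=\"UTF-8\"> <title>t</title>"
                .getBytes(StandardCharsets.UTF_8);
        InputStream is = new ByteArrayInputStream(page);

        String regex = "charset=\"([^\"]*)\"";
        // First Scanner: reads the bytes out of the stream and finds the match.
        System.out.println("first:  " + firstMatch(is, regex));
        // Second Scanner on the same stream: the bytes are already gone.
        System.out.println("second: " + firstMatch(is, regex));
    }
}
```

With an input this small, the first Scanner buffers the whole stream, so the second call has nothing to read and returns null. Calling match() after such a null result is what raises "No match result available".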

Either use two different streams, or download the whole stream into a String (or something similar) and match against that.
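The second option might look roughly like the sketch below: the stream is read once into a String, and both regexes from the question are then applied to that String with Pattern and Matcher. The class name MatchOnString, the helpers readAll and firstGroup, and the sample HTML are assumptions for illustration. Decoding as UTF-8 is also an assumption, since the page encoding is exactly what the question is trying to detect.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchOnString {
    // Hypothetical helper: drains the stream into memory and decodes it
    // as UTF-8 (assumed encoding).
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Hypothetical helper: capture group 1 of the first match, or null.
    static String firstGroup(String text, String regex) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample page; a real program would use url.openStream().
        String sample = "<meta name=\"description\" content=\"x\" />"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />"
                + "<meta http-equiv=\"PICS-Label\" content=\"y\" />";

        String regexHTML1 = "<meta.*content=\\\".*;.*charset=(.*)\\\"\\s*/?>";
        String regexHTML2 = "<meta.*content=\\\".*;.*charset=(.*)\\\"\\s*/?>\\s*<meta";

        try (InputStream is = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8))) {
            String html = readAll(is); // the stream is read exactly once
            // Both regexes run against the same String, in any order.
            System.out.println("Found: " + firstGroup(html, regexHTML1));
            System.out.println("Found: " + firstGroup(html, regexHTML2));
        }
    }
}
```

Because a String has no read position, testing regexHTML1 first no longer affects the result of regexHTML2.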

score 0 · Answer

It may have something to do with the InputStream; try using two input streams, is1 and is2, and check.
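One way to try this suggestion is to open a fresh stream for every scan. The sketch below assumes a Supplier that hands out a new stream each time; the class name FreshStreamPerScan, the method scan, the in-memory page and the simplified regex are hypothetical. With the real page, the supplier would call url.openStream() instead, which downloads the page once per scan.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.function.Supplier;

public class FreshStreamPerScan {
    // Hypothetical helper: opens a new stream, scans it, and returns
    // capture group 1 of the first match (or null if nothing matches).
    static String scan(Supplier<InputStream> source, String regex) {
        // Closing the Scanner also closes the stream it owns, which is
        // safe here because no other Scanner shares that stream.
        try (Scanner s = new Scanner(source.get(), "UTF-8")) {
            if (s.findWithinHorizon(regex, 0) == null) {
                return null;
            }
            return s.match().group(1);
        }
    }

    public static void main(String[] args) {
        // Assumed sample page; stands in for the real URL.
        byte[] page = "<meta charset=UTF-8>".getBytes(StandardCharsets.UTF_8);
        Supplier<InputStream> open = () -> new ByteArrayInputStream(page);

        String regex = "charset=([\\w-]+)";
        System.out.println("is1: " + scan(open, regex)); // first stream
        System.out.println("is2: " + scan(open, regex)); // second, independent stream
    }
}
```

Checking the return value of findWithinHorizon before calling match() also avoids the IllegalStateException when a regex does not match.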
