java - Javax XML parser gets stuck when building from an HTTP input stream

Question

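I'm trying to open an HTTP connection to a website and parse the HTML into an org.w3c.dom.Document. I can open the HTTP connection and print the web page to the console, but when I pass the InputStream object to the XML parser, it hangs for a minute and then prints an error: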

[Fatal Error] :108:55: Open quote is expected for attribute "{1}" associated with an  element type  "onload".

Code:

private static Document getInputStream(String url) throws IOException, SAXException, ParserConfigurationException
{
  System.out.println(url);
  URL webUrl = new URL(url);
  URLConnection connection = webUrl.openConnection();
  connection.setConnectTimeout(60 * 1000);
  connection.setReadTimeout(60 * 1000);

  InputStream stream = connection.getInputStream();

  DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
  domFactory.setNamespaceAware(true);
  DocumentBuilder builder = domFactory.newDocumentBuilder();
  Document doc = builder.parse(stream); // This line is hanging
  return doc;
}

Stack trace when paused:

Thread [main] (Suspended)   
    SocketInputStream.socketRead0(FileDescriptor, byte[], int, int, int) line: not available [native method]    
    SocketInputStream.read(byte[], int, int) line: not available    
    BufferedInputStream.fill() line: not available  
    BufferedInputStream.read1(byte[], int, int) line: not available 
    BufferedInputStream.read(byte[], int, int) line: not available  
    HttpClient.parseHTTPHeader(MessageHeader, ProgressSource, HttpURLConnection) line: not available    
    HttpClient.parseHTTP(MessageHeader, ProgressSource, HttpURLConnection) line: not available  
    HttpURLConnection.getInputStream() line: not available  
    XMLEntityManager.setupCurrentEntity(String, XMLInputSource, boolean, boolean) line: not available   
    XMLEntityManager.startEntity(String, XMLInputSource, boolean, boolean) line: not available  
    XMLEntityManager.startDTDEntity(XMLInputSource) line: not available 
    XMLDTDScannerImpl.setInputSource(XMLInputSource) line: not available    
    XMLDocumentScannerImpl$DTDDriver.dispatch(boolean) line: not available  
    XMLDocumentScannerImpl$DTDDriver.next() line: not available 
    XMLDocumentScannerImpl$PrologDriver.next() line: not available  
    XMLNSDocumentScannerImpl(XMLDocumentScannerImpl).next() line: not available 
    XMLNSDocumentScannerImpl.next() line: not available 
    XMLNSDocumentScannerImpl(XMLDocumentFragmentScannerImpl).scanDocument(boolean) line: not available  
    XIncludeAwareParserConfiguration(XML11Configuration).parse(boolean) line: not available 
    XIncludeAwareParserConfiguration(XML11Configuration).parse(XMLInputSource) line: not available  
    DOMParser(XMLParser).parse(XMLInputSource) line: not available  
    DOMParser.parse(InputSource) line: not available    
    DocumentBuilderImpl.parse(InputSource) line: not available  
    DocumentBuilderImpl(DocumentBuilder).parse(InputStream) line: not available 
    MSCommunicator.getInputStream(String) line: 45  
    MSCommunicator.getGamePageFromForum(int, int, int) line: 70 
    MSCommunicator.getGamePageFromForum(int, int) line: 57  
    Game.<init>(int, int) line: 21  
    MSCommunicator.main(String[]) line: 26

score 0 · Accepted Answer

Even if the HTML page you fetch is proper, valid HTML, it may not be well-formed XML. For example, this is valid in HTML4:

<p class=myclass>Paragraph<br>Next line</p>

In XML (XHTML), on the other hand, it has to be written like this to be valid:

<p class="myclass">Paragraph<br/>Next line</p>

Note the closed <br/> tag and the quotes around the class attribute of the p tag.
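To see the difference concretely, here is a minimal sketch that feeds both snippets above to the same kind of DocumentBuilder used in the question. The class name WellFormedCheck and the method isWellFormedXml are made up for illustration; only the standard JAXP classes are assumed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, for illustration only.
final class WellFormedCheck {

    // Returns true if the markup parses as well-formed XML, false if the parser rejects it.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser reports malformed markup as a SAXException (usually a SAXParseException).
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not expected for an in-memory string with a default factory.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String html4 = "<p class=myclass>Paragraph<br>Next line</p>";
        String xhtml = "<p class=\"myclass\">Paragraph<br/>Next line</p>";

        System.out.println(isWellFormedXml(html4)); // false: unquoted attribute value
        System.out.println(isWellFormedXml(xhtml)); // true
    }
}
```

The HTML4 snippet is rejected at the unquoted attribute value, which is the same kind of failure as the "Open quote is expected" error in the question.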

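Also, because the web is a wild place and content is more likely than not to be malformed, you have to "take everything with a grain of salt": use something that tolerates malformed input, such as jTidy or nekoHTML.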

score 0 · Accepted Answer

You can't simply expect HTML to parse into an XML DOM tree. It won't necessarily be valid XML. You will probably need to clean it up first. See the answers to this question:

Reading an HTML file into a DOM tree using Java
