html - Getting HTML elements via XPath in bash

Question

As described in another SO question, I was trying to parse a page (Kaggle Competitions) with xpath on MacOS:

curl "https://www.kaggle.com/competitions/search?SearchVisibility=AllCompetitions&ShowActive=true&ShowCompleted=true&ShowProspect=true&ShowOpenToAll=true&ShowPrivate=true&ShowLimited=true&DeadlineColumnSort=Descending" -o competitions.html
cat competitions.html | xpath '//*[@id="competitions-table"]/tbody/tr[205]/td[1]/div/a/@href'

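It just gets the href of a link in the table.

However, instead of returning the value, xpath starts validating the .html and returns errors like undefined entity at line 89, column 13, byte 2964.

Since man xpath does not exist and xpath --help prints nothing, I'm stuck. Also, many similar solutions relate to the xpath from GNU distributions, not the one on MacOS.

Is there a correct way to get HTML elements via XPath in bash?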

score 3 · Accepted Answer

Get HTML elements via XPath in bash

from an html file (which is not valid xml)

One possibility is to use xsltproc (I hope it is available on Mac). xsltproc has a --html option to use html as input. But you need an xslt stylesheet for that.

<xsl:stylesheet 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" /> 

  <xsl:template match="/*">
    <xsl:value-of  select="//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href" />
  </xsl:template>

</xsl:stylesheet>

Note that the xpath has changed: there is no tbody in the input file. Call xsltproc:

xsltproc --html  test.xsl competitions.html 2> /dev/null

The errors xsltproc complains about in the html are ignored (sent to /dev/null).

The output is: /c/R
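To try the approach without downloading the Kaggle page, here is a minimal sketch that runs the same kind of stylesheet against a tiny, deliberately sloppy HTML file. The function name demo_extract, the file names, the table id "demo" and the link /c/example are all made up for illustration. The XPath uses // between the table and the link, so it does not depend on whether a tbody element is present.

```shell
#!/usr/bin/env bash

# Illustrative only: demo_extract, page.html, extract.xsl and the sample
# markup are assumptions, not part of the original question or answer.
demo_extract() {
  local workdir rc
  workdir=$(mktemp -d) || return 1

  # HTML that is not well-formed XML: &nbsp; is undefined in plain XML,
  # and <p> and <br> are never closed.
  cat > "$workdir/page.html" <<'EOF'
<html><body>
<p>Two&nbsp;words<br>
<table id="demo"><tr><td><div><a href="/c/example">Example</a></div></td></tr></table>
</body></html>
EOF

  # Same shape as the stylesheet in the answer, with a tbody-independent XPath.
  cat > "$workdir/extract.xsl" <<'EOF'
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" />
  <xsl:template match="/">
    <xsl:value-of select="//table[@id='demo']//a/@href" />
  </xsl:template>
</xsl:stylesheet>
EOF

  # --html makes xsltproc read the input with the HTML parser;
  # parser complaints on stderr are discarded.
  xsltproc --html "$workdir/extract.xsl" "$workdir/page.html" 2> /dev/null
  rc=$?

  rm -rf "$workdir"
  return "$rc"
}
```

Calling demo_extract on a machine with xsltproc installed is expected to print /c/example.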

To use a different xpath expression from the command line, you can use an xslt template and replace the __xpath__ placeholder.

E.g. xslt template (test.xslt.tmpl):

<xsl:stylesheet 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" /> 
  <xsl:template match="/*">
    <xsl:value-of  select="__xpath__" />
  </xsl:template>
</xsl:stylesheet>

Then do the replacement with (e.g.) sed:

 sed -e "s,__xpath__,//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href," test.xslt.tmpl > test.xsl
 xsltproc --html  test.xsl competitions.html 2> /dev/null
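One caveat with the sed substitution is that a comma, an & or a backslash in the XPath (for example the comma in contains(@class, 'x')) would be interpreted by sed instead of being copied through. A rough sketch of a wrapper that escapes these characters first is shown below. render_stylesheet is a hypothetical helper name, and the placeholder token is passed as an argument because it has to match whatever the template file actually contains.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of xsltproc or sed): print the template with
# the placeholder replaced by an XPath expression.
#   $1 = XPath expression
#   $2 = template file
#   $3 = placeholder token, default __xpath__ (assumed to contain no regex
#        metacharacters and no commas)
render_stylesheet() {
  local xpath=$1 template=$2 placeholder=${3:-__xpath__}
  local escaped

  # In a sed replacement, backslash and & are special, and so is the
  # delimiter (a comma here); prefix each of them with a backslash.
  escaped=$(printf '%s' "$xpath" | sed -e 's/[\\&,]/\\&/g')

  sed -e "s,${placeholder},${escaped}," "$template"
}

# Possible usage, with the file names from the answer:
#   render_stylesheet "//a[contains(@class, 'x')]/@href" test.xslt.tmpl > test.xsl
#   xsltproc --html test.xsl competitions.html 2> /dev/null
```

The expression still ends up inside a double-quoted XML attribute, so a literal " or < in the XPath would have to be written as &quot; or &lt;. This sketch does not handle that.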
