azure-data-lake - U-SQL で XML Extractor を使用して XML 要素から属性値を抽出する方法

Question

Azure データレイク分析ジョブのために U-SQL でXML Extractorを使用して XML 要素から属性値を抽出するにはどうすればよいですか。

更新: 問題の詳細

私の XML ファイルは次のようになります。

<?xml version="1.0" encoding="utf-8"?>
<testelement testatr="xyz">
</testelement>

これが私の U-SQL スクリプトです。

DECLARE @testfile string = "sample2.xml";
@logText =
EXTRACT log string            
FROM @testfile
USING Extractors.Tsv();

@gethID = SELECT Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(@logText.log, "testelement/attribute::testatr").ElementAt(0) AS siteName FROM @logText;
OUTPUT @gethID TO "result.out" USING Outputters.Tsv();

デバッグ後、XPath クラスの Load メソッドを読み込もうとすると例外が発生しました。

"<?xml version=1.0 encoding=utf-8?>"

例外は次のとおりです。

Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException was unhandled
Message: An unhandled exception of type 'Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException' occurred in Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.dll
Additional information: {"diagnosticCode":195887111,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXPRESSIONEVALUATION","message":"Error while evaluating expression Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(log, \"testelement/attribute::testatr\").ElementAt(0)","description":"Inner exception from user expression: '1.0' is an unexpected token. The expected token is '\"' or '''. Line 1, position 15.\nCurrent row dump: \tlog:\t\"<?xml version=1.0 encoding=utf-8?>\"
\n","resolution":"","helpLink":"","details":"==== Caught exception System.Xml.XmlException\n\n   at System.Xml.XmlTextReaderImpl.Throw(Exception e)
\n   at System.Xml.XmlTextReaderImpl.ParseXmlDeclaration(Boolean isTextDecl)
\n   at System.Xml.XmlTextReaderImpl.Read()
\n   at System.Xml.XmlLoader.Load(XmlDocument doc, XmlReader reader, Boolean preserveWhitespace)
\n   at System.Xml.XmlDocument.Load(XmlReader reader)
\n   at System.Xml.XmlDocument.LoadXml(String xml)
\n   at Microsoft.Analytics.Samples.Formats.Xml.XPath.Load(String xml)
\n   at Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(String xml, String xpath)
\n   at ___Scope_Generated_Classes___.SqlFilterTransformer_2.Process(IRow row, IUpdatableRow output) in c:\\workarea\\bswbigdata\\USQLAppForLogs\\USQLAppForLogs\\bin\\Debug\\A06D46624BBA798\\ReadBlobs.usql.Debug_A54F30D359F939C7\\__ScopeCodeGen__.dll.cs:line 53","internalDiagnostics":""}

更新 2:

quotes:false を使用した後、別の例外が発生します。

Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException was unhandled
Message: An unhandled exception of type 'Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException' occurred in Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.dll
Additional information: {"diagnosticCode":195887111,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXPRESSIONEVALUATION","message":"Error while evaluating expression Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(log, \"testelement/attribute::testatr\").ElementAt(0)","description":"Inner exception from user expression: Root element is missing.\nCurrent row dump: \tlog:\t\"<?xml version=\"1.0\" encoding=\"utf-8\"?>\"
\n","resolution":"","helpLink":"","details":"==== Caught exception System.Xml.XmlException\n\n   at System.Xml.XmlTextReaderImpl.Throw(Exception e)
\n   at System.Xml.XmlTextReaderImpl.ParseDocumentContent()
\n   at System.Xml.XmlLoader.LoadDocSequence(XmlDocument parentDoc)
\n   at System.Xml.XmlDocument.Load(XmlReader reader)
\n   at System.Xml.XmlDocument.LoadXml(String xml)
\n   at Microsoft.Analytics.Samples.Formats.Xml.XPath.Load(String xml)
\n   at Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(String xml, String xpath)
\n   at ___Scope_Generated_Classes___.SqlFilterTransformer_2.Process(IRow row, IUpdatableRow output) in c:\\workarea\\bswbigdata\\USQLAppForLogs\\USQLAppForLogs\\bin\\Debug\\A06D46624BBA798\\ReadBlobs.usql.Debug_A54F30D359F939C7\\__ScopeCodeGen__.dll.cs:line 53","internalDiagnostics":""}

score 0 · Accepted Answer

新しいエラーメッセージについて: XML ドキュメントに XML 宣言の後に CR または LF、あるいはその両方が含まれているように見えるため、Tsv() エクストラクタが XML ドキュメントを分割します。前の回答の私のコメントを参照してください。

また、組み込みのテキストエクストラクタ (Extractors.*) を使用している場合は XML ドキュメントに CR/LF が含まれていないこと、および .Tsv を使用している場合はタブ値が含まれていないことを確認してください。

XML に CR や LF が含まれる場合は、カスタムエクストラクタを使用して別の行区切り記号を使用する必要があります。それを行う必要がある場合は、私にメッセージを残してください。私は現在、そのようなリクエストを追跡して、組み込みのエクストラクタで何が改善できるかを確認しています。

azure-data-lake - U-SQL で XML Extractor を使用して XML 要素から属性値を抽出する方法

2 に答える 2

Related

Reference