ios - XPath to extract text containing multiple tags using libxml2 on iOS

Question

In an iOS app that uses libxml2, I am parsing this piece of HTML (it is part of a larger page):

...
<span class="ingredient">
    <span class="amount">
        <span class="value">500 </span> 
        <span class="type">g</span>
    </span>    
    <a href="...">bread flour</a> 
    or 
    <span class="ingredient">
        <span class="amount">
            <span class="value">500 </span> 
            <span class="type">g</span>
        </span>  
        <span class="name">
            <a href="...">all-purpose flour</a>
        </span>
    </span>
</span>
...

I need to extract just the text "500 g bread flour or 500 g all-purpose flour".

This is the parsed NSDictionary result returned for the XPath query //span[@class="ingredient"]:

{
    nodeAttributeArray =     (
                {
            attributeName = class;
            nodeContent = ingredient;
        }
    );
    nodeChildArray =     (
                {
            nodeAttributeArray =             (
                                {
                    attributeName = class;
                    nodeContent = amount;
                }
            );
            nodeChildArray =             (
                                {
                    nodeAttributeArray =                     (
                                                {
                            attributeName = class;
                            nodeContent = value;
                        }
                    );
                    nodeContent = 500;
                    nodeName = span;
                },
                                {
                    nodeAttributeArray =                     (
                                                {
                            attributeName = class;
                            nodeContent = type;
                        }
                    );
                    nodeContent = g;
                    nodeName = span;
                }
            );
            nodeContent = "";
            nodeName = span;
        },
                {
            nodeAttributeArray =             (
                                {
                    attributeName = href;
                    nodeContent = "http://www.food.com/library/flour-64";
                }
            );
            nodeContent = "bread flour";
            nodeName = a;
        },
                {
            nodeAttributeArray =             (
                                {
                    attributeName = class;
                    nodeContent = ingredient;
                }
            );
            nodeChildArray =             (
                                {
                    nodeAttributeArray =                     (
                                                {
                            attributeName = class;
                            nodeContent = amount;
                        }
                    );
                    nodeChildArray =                     (
                                                {
                            nodeAttributeArray =                             (
                                                                {
                                    attributeName = class;
                                    nodeContent = value;
                                }
                            );
                            nodeContent = 500;
                            nodeName = span;
                        },
                                                {
                            nodeAttributeArray =                             (
                                                                {
                                    attributeName = class;
                                    nodeContent = type;
                                }
                            );
                            nodeContent = g;
                            nodeName = span;
                        }
                    );
                    nodeContent = "";
                    nodeName = span;
                },
                                {
                    nodeAttributeArray =                     (
                                                {
                            attributeName = class;
                            nodeContent = name;
                        }
                    );
                    nodeChildArray =                     (
                                                {
                            nodeAttributeArray =                             (
                                                                {
                                    attributeName = href;
                                    nodeContent = "http://www.food.com/library/flour-64";
                                }
                            );
                            nodeContent = "all-purpose flour";
                            nodeName = a;
                        }
                    );
                    nodeContent = "";
                    nodeName = span;
                }
            );
            nodeContent = "";
            nodeName = span;
        }
    );
    nodeContent = or;
    nodeName = span;
}

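The problem is that the "nodeContent" of the dictionary root is the text "or", and all the tags are placed as children of the root node, so the order of the fragments is lost. Concatenating all the text gives the string "or 500 g bread flour 500 g all-purpose flour".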

Can anyone find a way to extract the plain text with a single XPath query, or alternatively to read an ordered list of elements using the XPath engine?

score 0 · Accepted Answer

Since you need all the text nodes, this can be done simply with

//text()

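This returns all text nodes in document order. Whitespace in the content causes some problems, though, so you can omit all whitespace-only nodes. In XPath 2.0 that would be

//text()[not(matches(., '^\s+$'))]

but matches() is an XPath 2.0 function and libxml2 implements only XPath 1.0. With libxml2, use the XPath 1.0 equivalent, which keeps every text node that is non-empty after whitespace normalization:

//text()[normalize-space()]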

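After running the text-node query you will still need to do some trimming in Objective-C (for "g", for example), but you should get an in-order result set of all text nodes that contain printable characters.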
