python - Python HTMLParser cannot parse complex custom tag attributes

Question

I am trying to parse the following HTML tag with HTMLParser:

<input type="hidden" name="movingEventItemId" value="<dt:value property="movingEventItemId" javaScriptSafe="1"/>"/>

When I read the individual attributes, I expect to get back:

type - hidden

name - movingEventItemId

value - <dt:value property="movingEventItemId" javaScriptSafe="1"/>

But the value actually comes back as: <dt:value property=" javascriptsafe="1"/>

Python code:

import fileinput
import fnmatch
import os
import re
from html.parser import HTMLParser

# Regex character class matching a single or a double quote.
quote = '[\'"]'


class HiddenInputParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'input':
            id = self.findAttr('id', attrs)
            name = self.findAttr('name', attrs)
            if not name and not id:
                print('no name or id     ' + self.get_starttag_text())
            elif id is not None and name is not None and id != name:
                print('id != name    ' + self.get_starttag_text())
            else:
                # Use whichever of id/name is present for both attributes.
                id = (id if id else name)
                output = '<input type="hidden" id="' + id + '" name="' + id + '" '
                for attr in attrs:
                    key = attr[0]
                    if attr[1] and key != 'id' and key != 'name' and key != 'type':
                        # Recover the quote character used in the original tag text.
                        matcher = re.search(key + r'\s*=\s*(' + quote + ')', self.get_starttag_text(), re.IGNORECASE)
                        quote2 = matcher.group(1)
                        value = quote2 + attr[1] + quote2
                        output += ' ' + key + '=' + value
                output += '/>'
                print(output)
        else:
            print(self.get_starttag_text())

    def findAttr(self, id, attrs):
        # Return the value of the first attribute named `id`, or None.
        for attr in attrs:
            if attr[0] == id:
                return attr[1]


def fixHiddenInputs():
    inputParser = HiddenInputParser()
    # `extensions` and `path` are not shown in the question.
    for extension in extensions:
        for root, dirnames, filenames in os.walk(path):
            for filename in fnmatch.filter(filenames, extension):
                # With inplace=True, print() writes back into the file being read.
                for line in fileinput.input(os.path.join(root, filename), inplace=True):
                    line = line.rstrip()
                    if re.search('<input.*type=' + quote + 'hidden' + quote, line):
                        inputParser.feed(line)
                    else:
                        print(line)


def convertHiddenInputs():
    pass


convertHiddenInputs()
fixHiddenInputs()

The code is meant to add an id to the input tag, using the value of the name attribute. The final result should look like this:

<input type="hidden" id="displayOptions" name="displayOptions" value="<dt:value property="movingEventItemId" javaScriptSafe="1"/>"/>

And this is what I'm getting:

<input type="hidden" id="displayOptions" name="displayOptions"  value="<dt:value property=" javascriptsafe="1"/>

score 0 · Accepted Answer

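To add an id to the tag, you don't need to parse the whole line and extract every attribute. A better solution to your problem is to add the id to the element regardless of the remaining attributes. If the id needs to be the same as the name, it looks like you are already retrieving the name attribute successfully. Just take the name you retrieved and add id="NAME_VALUE" right before or after it.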
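A minimal sketch of that idea, assuming each hidden input sits on a single line and that a regular expression is an acceptable way to locate the name attribute. The names `add_id_from_name`, `NAME_ATTR` and `ID_ATTR` are made up for illustration and do not come from the question. The rest of the tag, including the nested `<dt:value .../>` in the value, is copied through untouched rather than parsed.

```python
import re

# name="..." or name='...'; captures the quote character and the value.
# The lookbehind avoids matching attributes such as data-name or classname.
NAME_ATTR = re.compile(r'''(?<![\w-])name\s*=\s*(["'])(.*?)\1''', re.IGNORECASE)

# A quoted id attribute anywhere in the line (hypothetical helper pattern).
ID_ATTR = re.compile(r'''(?<![\w-])id\s*=\s*["']''', re.IGNORECASE)


def add_id_from_name(line):
    """Insert id="<name value>" right before the name attribute.

    Lines that already contain an id, or that have no name, are returned as-is.
    """
    if ID_ATTR.search(line):
        return line
    match = NAME_ATTR.search(line)
    if match is None:
        return line
    quote_char, value = match.group(1), match.group(2)
    id_attr = 'id=' + quote_char + value + quote_char + ' '
    # Splice the new attribute in; everything else in the line is unchanged.
    return line[:match.start()] + id_attr + line[match.start():]


if __name__ == '__main__':
    tag = ('<input type="hidden" name="movingEventItemId" '
           'value="<dt:value property="movingEventItemId" javaScriptSafe="1"/>"/>')
    print(add_id_from_name(tag))
```

One caveat with this sketch: the id check looks at the whole line, so a line whose nested custom tag itself carries an `id="..."` attribute would be skipped. In the question's loop, a call like this could stand in for `inputParser.feed(line)`, with the result passed to `print`.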

Cheers!
