I would like to extract only the JavaScript from the script tags in an HTML document so that I can pass it to a JS parser such as esprima. I am writing this application in Node.js and already have the content of each script tag as a string. The problem is that the extracted JavaScript sometimes contains HTML comments, which I want to remove.
<!-- var a; --> should be converted to var a
Simply removing every <!-- and --> does not work: for <!-- if(j-->0); --> it also removes the --> in the middle.
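To make the failure concrete, here is a minimal sketch of the naive approach. The helper name `naiveStrip` is hypothetical and used only for illustration.

```javascript
// Hypothetical helper: strips every "<!--" and "-->" it finds, wherever they occur.
function naiveStrip(js) {
    return js.replace(/<!--|-->/g, "");
}

console.log(JSON.stringify(naiveStrip("<!-- var a; -->")));     // " var a; "
console.log(JSON.stringify(naiveStrip("<!-- if(j-->0); -->"))); // " if(j0); "
```

The second result is no longer the original code: the decrement-and-compare `j-->0` has been turned into `j0`.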
I would also like to remove markers such as [if !IE] and [endif], which are sometimes found inside script tags, and to extract the JS inside CDATA sections.
<![CDATA[ var a; ]]> should be converted to var a
Is all of this possible with a regex, or is something more required?
In short, I would like to sanitize the JS from script tags so that I can safely pass it to a parser like esprima.
Thanks!

EDIT:
Based on @user568109's answer, this is the rough code that handles HTML comments and CDATA sections inside script tags:

var htmlparser = require("htmlparser2");
var jstext = '';
var inScript = false; // true while inside a <script type="text/javascript"> element
var parser = new htmlparser.Parser({
    onopentag: function(name, attribs) {
        if (name === "script" && attribs.type === "text/javascript") {
            inScript = true;
            jstext = '';
        }
    },
    ontext: function(text) {
        // Collect text only while inside a script element
        if (inScript) {
            jstext += text;
        }
    },
    onclosetag: function(tagname) {
        if (tagname === "script" && inScript) {
            console.log(jstext);
            inScript = false;
            jstext = '';
        }
    },
    oncomment: function(data) {
        // Keep the body of a comment found inside a script, even when the
        // comment is the first thing in the script (jstext is still empty)
        if (inScript) {
            jstext += data;
        }
    }
}, {
    xmlMode: true
});
parser.write(input); // input: the HTML document as a string
parser.end();
1 Answer

That is a parser's job. See htmlparser2, or esprima itself. Please do not use regular expressions to parse HTML. It is tempting, but you will waste precious time and effort as you try to match more and more tags.

An example from the page:

var htmlparser = require("htmlparser2");
var parser = new htmlparser.Parser({
    onopentag: function(name, attribs){
        if(name === "script" && attribs.type === "text/javascript"){
            console.log("JS! Hooray!");
        }
    },
    ontext: function(text){
        console.log("-->", text);
    },
    onclosetag: function(tagname){
        if(tagname === "script"){
            console.log("That's it?!");
        }
    }
});
parser.write("Xyz <script type='text/javascript'>var foo = '<<bar>>';</script>");
parser.end();

Output (simplified):

--> Xyz 
JS! Hooray!
--> var foo = '<<bar>>';
That's it?!

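It will give you all the tags: div, comments, scripts and so on. However, you will have to validate the script inside the comments yourself. Also, CDATA is valid markup in XML (XHTML), so htmlparser2 will detect it as a comment; you will need to check for those as well.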
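As a rough illustration of that manual check, the sketch below strips only the wrappers at the very start and end of an extracted script body, so a `-->` in the middle of the code survives. The helper name `unwrapScript` and the exact patterns are assumptions made for this example, not part of htmlparser2 or esprima, and they cover only the cases mentioned in the question.

```javascript
// Hypothetical helper: removes wrappers only at the start and end of a script body.
function unwrapScript(js) {
    var s = js.trim();
    // Leading "<!--" and trailing "-->" (the latter is often written "//-->").
    s = s.replace(/^<!--/, "").replace(/(?:\/\/\s*)?-->$/, "");
    // Conditional-comment markers left behind: "[if !IE]>" and "<![endif]".
    s = s.replace(/^\s*\[if [^\]]*\]>/, "").replace(/<!\[endif\]\s*$/, "");
    // Leading "<![CDATA[" and trailing "]]>".
    s = s.replace(/^<!\[CDATA\[/, "").replace(/\]\]>$/, "");
    return s.trim();
}

console.log(unwrapScript("<!-- var a; -->"));                   // var a;
console.log(unwrapScript("<!-- if(j-->0); -->"));               // if(j-->0);
console.log(unwrapScript("<![CDATA[ var a; ]]>"));              // var a;
console.log(unwrapScript("<!--[if !IE]> var a; <![endif]-->")); // var a;
```

Because every pattern is anchored with `^` or `$`, this is a cleanup step for a string that a parser has already isolated, not a replacement for parsing the HTML itself. The result may still fail to parse, so it is worth wrapping the esprima call in a try/catch.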

answered 2013-07-19T12:08:16.273