I'm looking for better ideas for extracting tables from HTML files. Right now I'm using tidy ( http://tidy.sourceforge.net/ ) to convert an HTML file into XHTML, and then I use rapidxml to parse the XML. While parsing, I look for <table>, <tr>, and <td> nodes and build my table data structures from them.
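To make the approach concrete, here is a simplified sketch of the tree walk. It is not my actual code: `Node` is a hypothetical stand-in for whatever the XML parser hands back (with rapidxml you would walk its node type instead), and `Row`, `Table`, `collect_text`, `collect_rows` and `collect_tables` are names made up for this illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed XHTML node; not rapidxml's API.
struct Node {
    std::string name;            // element name, e.g. "table"; empty for text nodes
    std::string text;            // character data for text nodes
    std::vector<Node> children;  // child nodes in document order
};

using Row = std::vector<std::string>;
using Table = std::vector<Row>;

// Concatenate all character data found beneath a node.
std::string collect_text(const Node& n) {
    std::string out = n.text;
    for (const Node& c : n.children) out += collect_text(c);
    return out;
}

// Gather the rows of a single table. Descends through wrappers such as
// <tbody>, but does not pull rows out of nested tables.
void collect_rows(const Node& n, Table& table) {
    for (const Node& c : n.children) {
        if (c.name == "tr") {
            Row row;
            for (const Node& cell : c.children) {
                if (cell.name == "td" || cell.name == "th")
                    row.push_back(collect_text(cell));
            }
            table.push_back(row);
        } else if (c.name != "table") {
            collect_rows(c, table);
        }
    }
}

// Find every <table> in the tree, nested ones included, in document order.
void collect_tables(const Node& n, std::vector<Table>& out) {
    if (n.name == "table") {
        Table t;
        collect_rows(n, t);
        out.push_back(t);
    }
    for (const Node& c : n.children) collect_tables(c, out);
}
```

Calling `collect_tables(root, tables)` on the document root fills `tables` with one entry per <table>. Note that in this sketch the text of a cell that contains a nested table also includes the nested table's text.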
It works quite nicely, but I'm wondering if there are better ways to accomplish this. Also, the tidy library seems to be an abandoned project.
Has anyone ever tried the "experimental" patch in the tidy source code?
Thanks, Christian