perl - How to parse HTML which does not have id or class information?

Question

If I have HTML of the form

<ol>
    <li>Cheeses
        <ol>
            <li>Red Leicester</li>
            <li>Cheddar</li>
        </ol>
    <li>Wines
        <ol>
            <li>Burgundy</li>
            <li>Beaujolais</li>
        </ol>
</ol>

I would like to parse it into a structure something like

{"Cheeses":["Red Leicester", "Cheddar"], "Wines":["Burgundy", "Beaujolais"]}

There are many "tutorials" on how to use modules like HTML::TreeBuilder or Mojo::DOM to parse HTML, but they always seem to rely on "id=" or "class=" attributes. The HTML I want to parse has no id, class, or other attributes. How can I do this?

score 1 · Accepted Answer

I only have experience with Mojo::DOM, and admittedly there may be a better module for converting your HTML to a data structure. If you do use Mojo::DOM, a good place to start is the tree structure underlying the Mojo::DOM object:

#!/usr/bin/env perl

use strict;
use warnings;

use Mojo::DOM;
use Data::Dumper;

my $dom = Mojo::DOM->new(<<'END');
<ol>
    <li>Cheeses
        <ol>
            <li>Red Leicester</li>
            <li>Cheddar</li>
        </ol>
    <li>Wines
        <ol>
            <li>Burgundy</li>
            <li>Beaujolais</li>
        </ol>
</ol>
END

print Dumper $dom->tree;

With a little massaging you should be able to get that into the form you want. Perhaps someone knows of a module that goes more directly from HTML (or XML) to such a structure.
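
For what it's worth, here is a minimal sketch of that massaging. It does not touch the raw tree at all; it only walks the element structure with Mojo::DOM's at, children, and text methods, so no id or class attributes are needed. The %menu hash and the trimmed helper are names I made up for illustration, and I am assuming that every top-level <li> holds a label followed by a nested <ol>, as in your sample.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use Mojo::DOM;
use Data::Dumper;

my $dom = Mojo::DOM->new(<<'END');
<ol>
    <li>Cheeses
        <ol>
            <li>Red Leicester</li>
            <li>Cheddar</li>
        </ol>
    <li>Wines
        <ol>
            <li>Burgundy</li>
            <li>Beaujolais</li>
        </ol>
</ol>
END

# Hypothetical helper: strip surrounding whitespace ourselves, because
# whether text() trims it depends on the Mojolicious version.
sub trimmed {
    my $s = shift // '';
    $s =~ s/^\s+|\s+$//g;
    return $s;
}

my %menu;

# The first <ol> in the document is the outer list; children('li')
# returns only its direct <li> children, not the nested ones.
for my $category ($dom->at('ol')->children('li')->each) {

    # text() covers only this element's own text nodes, so the
    # nested list items are not included in the label.
    my $name = trimmed($category->text);

    my @items;
    for my $sublist ($category->children('ol')->each) {
        push @items, map { trimmed($_->text) } $sublist->children('li')->each;
    }

    $menu{$name} = \@items;
}

$Data::Dumper::Sortkeys = 1;
print Dumper \%menu;
```

If you need the JSON form from your question rather than a Perl hash, passing \%menu to encode_json from Mojo::JSON should get you there.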
