regex - Can I use a regular expression to copy a title onto each entry until the next title? (hyperlinking ebook endnotes)

Question

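All right, regex ninjas. I'm trying to devise a pattern that adds hyperlinks to the endnotes in an ePub ebook XHTML file. The problem is that the numbering restarts within each chapter, so I need to add a unique identifier to each anchor name for the link's hash fragment to point to.

Given a (much simplified) list like this: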

<h2>Introduction</h2>
<p> 1 Endnote entry number one.</p>
<p> 2 Endnote entry number two.</p>
<p> 3 Endnote entry number three.</p>
<p> 4 Endnote entry number four.</p>

<h2>Chapter 1: The Beginning</h2>
<p> 1 Endnote entry number one.</p>
<p> 2 Endnote entry number two.</p>
<p> 3 Endnote entry number three.</p>
<p> 4 Endnote entry number four.</p>

I need to turn it into something like this:

<h2>Introduction</h2>
<a name="endnote-introduction-1"></a><p> 1 Endnote entry number one.</p>
<a name="endnote-introduction-2"></a><p> 2 Endnote entry number two.</p>
<a name="endnote-introduction-3"></a><p> 3 Endnote entry number three.</p>
<a name="endnote-introduction-4"></a><p> 4 Endnote entry number four.</p>

<h2>Chapter 1: The Beginning</h2>
<a name="endnote-chapter-1-the-beginning-1"></a><p> 1 Endnote entry number one.</p>
<a name="endnote-chapter-1-the-beginning-2"></a><p> 2 Endnote entry number two.</p>
<a name="endnote-chapter-1-the-beginning-3"></a><p> 3 Endnote entry number three.</p>
<a name="endnote-chapter-1-the-beginning-4"></a><p> 4 Endnote entry number four.</p>

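Obviously, a similar search will be needed in the actual text of the book, where each endnote reference links to endnotes.xhtml#endnote-introduction-1 and so on.

The biggest obstacle is that each match search starts where the previous one ended, so unless I use recursion I can't match the same bit (in this case the title) for multiple entries. My attempts with recursion, however, have so far produced nothing but infinite loops.

I'm using TextWrangler's grep engine, but if you have a solution in another editor (vim, say), that's fine too.

Thanks!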

score 1 · Accepted Answer

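I think this is hard to do in a text editor, because it needs a two-step process: first the file has to be split into chapters, and then the contents of each chapter have to be processed. Assuming an "endnote paragraph" (where you want to add the anchor) is defined as a paragraph whose first word is an integer, this PHP script will do what you need.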

<?php
$data = file_get_contents('testdata.txt');
$output = processBook($data);
file_put_contents('testdata_out.txt', $output);
echo $output;

// Main function to process book adding endnote anchors.
function processBook($text) {
    $re_chap = '%
        # Regex 1: Get Chapter.
        <h2>([^<>]+)</h2>  # $1: Chapter title.
        (                  # $2: Chapter contents.
          .+?              # Contents are everything up to
          (?=<h2>|$)       # next chapter or end of file.
        )                  # End $2: Chapter contents.
        %six';
    // Match and process each chapter using callback function.
    $text = preg_replace_callback($re_chap, '_cb_chap', $text);
    return $text;
}
// Callback function to process each chapter.
function _cb_chap($matches) {
    // Build ID from H2 title contents.
    // Trim leading and trailing whitespace from title.
    $baseid = trim($matches[1]);
    // Strip everything except spaces and alphanumerics.
    // (Operate on $baseid, not $matches[1], so the trim above is kept.)
    $baseid = preg_replace('/[^ A-Za-z0-9]/', '', $baseid);
    // Append prefix and convert runs of spaces to a single dash.
    $baseid = 'endnote-'. preg_replace('/ +/', '-', $baseid);
    // Convert to lowercase.
    $baseid = strtolower($baseid);
    $text = preg_replace(
                '/(<p>\s*)(\d+)\b/',
                '<a name="'. $baseid .'-$2"></a>$1$2',
                $matches[2]);
    return '<h2>'. $matches[1] .'</h2>'. $text;

}
?>

This script correctly processes the sample data.
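For anyone without PHP to hand, the same two-step idea (match each chapter, then tag the numbered paragraphs inside it from a callback) can be sketched in Python. This is only a rough port under the same assumption about what counts as an endnote paragraph; the names `process_book`, `add_anchors`, and `slugify` are made up for illustration.

```python
import re

# Step 1 pattern: an <h2> title plus everything up to the next <h2> or end of text.
CHAPTER_RE = re.compile(r"<h2>([^<>]+)</h2>(.+?)(?=<h2>|\Z)", re.S | re.I)
# Step 2 pattern: a paragraph whose first word is an integer.
NOTE_RE = re.compile(r"(<p>\s*)(\d+)\b")


def slugify(title):
    """Trim the title, drop punctuation, join words with hyphens, lowercase."""
    cleaned = re.sub(r"[^ A-Za-z0-9]", "", title.strip())
    return re.sub(r" +", "-", cleaned).lower()


def add_anchors(chapter):
    """Callback for one chapter match: prefix each numbered paragraph with an anchor."""
    title, body = chapter.group(1), chapter.group(2)
    base = "endnote-" + slugify(title)
    body = NOTE_RE.sub(
        lambda m: f'<a name="{base}-{m.group(2)}"></a>{m.group(0)}', body
    )
    return f"<h2>{title}</h2>{body}"


def process_book(text):
    """Run the chapter-level substitution over the whole endnotes file."""
    return CHAPTER_RE.sub(add_anchors, text)
```

Calling `process_book()` on the contents of the endnotes file returns the tagged text; reading and writing the file is left out here.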

score 1 · Accepted Answer

A little awk might do the trick:

Create the following script (I named it add_endnote_tags.awk):

/^<h2>/ {
    i = 0;
    chapter_name = $0;
    gsub(/<[^>]+>/, "", chapter_name);
    chapter_name = tolower(chapter_name);
    # Keep digits too, so "Chapter 1: The Beginning" becomes "chapter-1-the-beginning".
    gsub(/[^a-z0-9]+/, "-", chapter_name);
    print;
}

/^<p>/ {
    i = i + 1;
    printf("<a name=\"endnote-%s-%d\"></a>%s\n", chapter_name, i, $0);
}

$0 !~ /^<h2>/ && $0 !~ /^<p>/ {
    print;
}

Then use it to process the file:

awk -f add_endnote_tags.awk < source_file.xml > dest_file.xml
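The same single-pass idea (remember the most recent title and stamp it onto every entry until the next title turns up) can also be sketched in Python if awk isn't available. This is a rough equivalent, not a drop-in replacement: `tag_endnotes` is a hypothetical name, it assumes one tag per line as in the sample, and unlike the awk script it reuses the number printed in the paragraph instead of keeping its own counter.

```python
import re

HEADING_RE = re.compile(r"^<h2>(.*?)</h2>")
NOTE_RE = re.compile(r"^<p>\s*(\d+)\b")


def tag_endnotes(lines):
    """Yield each line, prefixing numbered <p> lines with a chapter-specific anchor."""
    slug = ""
    for line in lines:
        heading = HEADING_RE.match(line)
        if heading:
            # Remember the current chapter until the next <h2> line.
            slug = re.sub(r"[^a-z0-9]+", "-", heading.group(1).lower()).strip("-")
        note = NOTE_RE.match(line)
        if note and slug:
            # Use the note's own number, taken from the start of the paragraph.
            line = f'<a name="endnote-{slug}-{note.group(1)}"></a>{line}'
        yield line
```

Feeding it the lines of the file (newlines stripped) and joining the results with newlines gives the tagged document; paragraphs that don't start with a number pass through unchanged.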

Hope that helps. If you're on a Windows platform, you may need to install awk, either by installing cygwin with the awk package or by downloading gawk for Windows.
