python - Parsing the Kindle "My Clippings.txt" file using regular expressions

Question

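I'm currently trying to parse the Kindle notes file with Python so that I can keep my notes better organized than the chronological list in which the Kindle saves them automatically. Unfortunately, I'm having trouble parsing the file with regular expressions. Here is my code so far: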

import re


def parse_file(in_file):
    read_file = open(in_file, 'r')
    file_lines = read_file.readlines()
    read_file.close()
    raw_note = "".join(file_lines)

    # Regex parts
    title_regex = "(.+)"
    title_author_regex = "(.+) \((.+)\)"

    loc_norange_regex = "(.+) (Location|on Page) ([0-9]+)"
    loc_range_regex = "(.+) (Location|on Page) ([0-9]+)-([0-9]+)"

    date_regex = "([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+)"  # Date
    time_regex = "([0-9]+):([0-9]+) (AM|PM)"  # Time

    content_regex = "(.*)"
    footer_regex = "=+"

    nl_re = "\r*\n"

    # No author
    regex_noauthor_str =\
    title_regex + nl_re +\
    "- Your " + loc_range_regex + " | Added on " +\
    date_regex + ", " + time_regex + nl_re +\
    content_regex + nl_re +\
    footer_regex

    regex_noauthor = re.compile(regex_noauthor_str)
    print regex_noauthor.findall(raw_note)

parse_file("testnotes")

The contents of "testnotes" are:

Title
- Your Highlight Location 3360-3362 | Added on Wednesday, March 21, 2012, 12:16 AM

Note content goes here
==========

What I want:

[('Title', 'Highlight', 'Location', '3360', '3362', 'Wednesday', 'March', '21', '2012', '12', '16', 'AM',

But when I run the program, I get:

[('Title', 'Highlight', 'Location', '3360', '3362', '', '', '', '', '', '', '', '')]

I'm fairly new to regular expressions, but I feel like this should be fairly simple.

score 2 · Accepted Answer

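You need to escape the | in:

"- Your " + loc_range_regex + " | Added on " +\

Change it to:

"- Your " + loc_range_regex + " \| Added on " +\

| is the regex OR (alternation) operator. Left unescaped, it splits the whole pattern into two alternatives, and findall returns empty strings for the groups of the alternative that did not match. That is why the date, time and content groups come back empty in your output.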
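To see the effect in isolation, here is a minimal sketch in Python 3. It uses a shortened pattern rather than the full one from the question, and the variable names are only for illustration:

```python
import re

line = "- Your Highlight Location 3360-3362 | Added on Wednesday, March 21, 2012, 12:16 AM"

# Unescaped: "|" splits the pattern into two alternatives,
# "([0-9]+)-([0-9]+) " OR " Added on ([a-zA-Z]+)".
unescaped = re.compile(r"([0-9]+)-([0-9]+) | Added on ([a-zA-Z]+)")

# Escaped: "\|" matches a literal pipe character.
escaped = re.compile(r"([0-9]+)-([0-9]+) \| Added on ([a-zA-Z]+)")

unescaped_result = unescaped.findall(line)
escaped_result = escaped.findall(line)

print(unescaped_result)  # [('3360', '3362', ''), ('', '', 'Wednesday')]
print(escaped_result)    # [('3360', '3362', 'Wednesday')]
```

Each tuple from the unescaped pattern comes from only one of the two alternatives, so the groups belonging to the other alternative are empty strings.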

score 2 · Accepted Answer

Where you have " | Added on ", you need to escape the |. Replace that string with " \| Added on ".
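Putting the escaped pipe back into the full pattern might look like the sketch below (Python 3). One assumption goes beyond the answers here: the sample in the question shows a blank line between the metadata line and the note text, so the line-break part is allowed to repeat. The names NL and CLIPPING_RE are made up for this example.

```python
import re

# The question's line-break pattern, allowed to repeat so blank lines are skipped
# (an assumption based on the sample file shown in the question).
NL = r"(?:\r*\n)+"

CLIPPING_RE = re.compile(
    r"(.+)" + NL                                            # title
    + r"- Your (.+) (Location|on Page) ([0-9]+)-([0-9]+)"   # kind and location range
    + r" \| Added on "                                      # literal pipe, escaped
    + r"([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+), "      # weekday, month, day, year
    + r"([0-9]+):([0-9]+) (AM|PM)" + NL                     # time
    + r"(.*)" + NL                                          # note content (one line only)
    + r"=+"                                                 # footer
)

sample = (
    "Title\n"
    "- Your Highlight Location 3360-3362 | Added on Wednesday, March 21, 2012, 12:16 AM\n"
    "\n"
    "Note content goes here\n"
    "==========\n"
)

clippings = CLIPPING_RE.findall(sample)
print(clippings)
```

On this sample it prints a single 13-element tuple ending in 'AM', 'Note content goes here'. Notes that span several lines, or entries without a location range, would still need extra handling.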
