
I'm trying to extract all of the links to image files from a text file. Every link ends in either .jpg or .gif and is surrounded by quotation marks. I want to find the first occurrence of .jpg or .gif, then copy all of the characters between the nearest quotation mark before it and the first quotation mark after it. Then I want to add this link to an array or to another text file, and repeat the process for every instance of .jpg or .gif in the original text file.

Here's an example of what the text file might look like:

d/scriript type="texft/javascript">
    $(document).fready(function () {
        $('#post-contfainer-1720130 .post-assets .thumb A').lightBox({
            txtImafge:      'Image',
            txtOf:          'of',
            overflayOpacity:    0       });
<div class="thumb"><a href""="#">="**https://imaginepilgrimages.com/asset/image/resize/2/32/32/1/c331065jt99875146b0a1fg9140.jpg**"riript type="texft/javascript">
    $(document).freadriript type="texft/javascript">
    $(document).fread
d/scriript type="texft/javascript">
    $(document).fready(function () {
        $('#post-contfainer-1720130 .post-assets .thumb A').lightBox({
            txtImafge:      'Image',
            txtOf:          'of',
            overflayOpacity:    0       });
<div class="thumb"><a href""="#">="**https://imaginepilgrimages.com/asset/image/resize/2/32/32/75146b0a1fg9140.gif**"riript type="texft/javascript">
    $(document).freadriript type="texft/javascript">
    $(document).fread
d/scriript type="texft/javascript">
    $(document).fready(function () {
        $('#post-contfainer-1720130 .post-assets .thumb A').lightBox({
            txtImafge:      'Image',
            txtOf:          'of',
            overflayOpacity:    0       });
<div class="thumb"><a href""="#">="https://imaginepilgrimages.com/asset/image/resize/2/32/32/1/c331065jt99fgfgage55h6u7rrth6875146b0a1fg9140.jpg"riript type="texft/javascript">
    $(document).freadriript type="texft/javascript">
    $(document).fread

I've just started using Python and I've been stuck on this for a while. Can anybody help me with this? Thanks in advance for your time!


2 Answers

2

Something like the following should work.

import re
re.findall(r'"([^"]*\.(?:gif|jpg)[^"]*)"', text)

Don't expect it to be particularly flexible or robust. For that, you would probably need a real parser.
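For reference, here is a minimal sketch of what the parser route could look like, using the standard library's `html.parser`. The names `ImageLinkParser`, `extract_image_links`, and `IMAGE_EXTENSIONS`, as well as the sample markup, are made up for illustration. It assumes reasonably well-formed HTML, so it may not find links in markup as broken as the example in the question.

```python
from html.parser import HTMLParser

# Extensions treated as image links (assumption: only these two matter).
IMAGE_EXTENSIONS = (".jpg", ".gif")


class ImageLinkParser(HTMLParser):
    """Collects attribute values that end in an image extension."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for _name, value in attrs:
            if value and value.lower().endswith(IMAGE_EXTENSIONS):
                self.images.append(value)


def extract_image_links(html_text):
    """Return every attribute value in html_text that looks like an image link."""
    parser = ImageLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.images


# Hypothetical, well-formed sample markup.
sample = '<div class="thumb"><a href="https://example.com/a/b.jpg"><img src="/c/d.gif"></a></div>'
print(extract_image_links(sample))
```

Unlike the regex, this only looks at attribute values inside tags, so quoted strings in inline JavaScript are ignored.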

Answered 2012-06-13T16:12:12.890
2

This will give you the image filenames, but it makes no attempt to remove the leading/trailing "**".

import re

images = []
with open('test.dat') as f:
    for line in f:
        # Collect every quoted string on this line that contains .jpg or .gif
        images.extend(re.findall(r'"([^"]*\.(?:jpg|gif)[^"]*)"', line))

The regular expression looks for a quotation mark, then grabs everything that is not a quotation mark, and checks that the string contains '.jpg' or '.gif'.
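To make that breakdown easier to follow, here is the same pattern written with `re.VERBOSE`, plus one possible way to drop the surrounding "**" using `str.strip`. The names `IMAGE_LINK` and `find_image_links` and the sample string are only illustrative, and stripping asterisks assumes that a real link never starts or ends with `*`.

```python
import re

# Same pattern as above, spread out so each part can carry a comment.
IMAGE_LINK = re.compile(
    r'''
    "                # opening quotation mark
    (                # start of the captured link
      [^"]*          # any run of characters that are not quotation marks
      \.(?:jpg|gif)  # a literal dot followed by jpg or gif
      [^"]*          # anything else up to the closing quotation mark
    )
    "                # closing quotation mark
    ''',
    re.VERBOSE,
)


def find_image_links(text):
    """Return all quoted image links in text, without surrounding asterisks."""
    return [link.strip("*") for link in IMAGE_LINK.findall(text)]


# Hypothetical line modelled on the example in the question.
sample = '<a href="#">="**https://example.com/img/a.jpg**"x type="text/javascript">'
print(find_image_links(sample))
```

Quoted strings such as "#" or "text/javascript" are skipped because they contain neither '.jpg' nor '.gif'.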

Answered 2012-06-13T16:12:44.223