bash - Comparing website migration results (both sites running in parallel)

Question

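I am migrating a client's homegrown site to Drupal 7. With design decisions and some new requirements, the process is taking a while.

I have started working on a tool that (a) gets a list of URL paths from the old database, (b) fetches the content of each page from both the Drupal site and the old site, (c) runs an xpath query on each page, using xidel to grab the content of div#maincontent and div#main, and (d) saves that data in new.txt and old.txt files, all while keeping a folder structure similar to the site for reference.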

collect_data.sh

#!/bin/bash
# get URL paths from the old server (keep only lines that start with "/")
urls=$(ssh user@old_ser "~/data_urls.sh" | grep -E "^/" | sort -u)

# clear out current working folder
rm -rf ./working

# loop through paths (relies on word splitting, so paths must not contain spaces)
for i in $urls
do

    # screen status update, set storage area with url_path in folder path, make folder
    echo "$i"
    storage_area=./working/$i/
    mkdir -p "$storage_area"

    # strip trailing slash
    i=${i%/}

    # pull each page and run the xpath query
    xidel "http://old_server$i" -e '//div[@id="maincontent"]//p' > "$storage_area/old.txt"
    xidel "http://new_server$i" -e '//div[@id="content"]//p' > "$storage_area/new.txt"

    # run a compare and output data into cmp.cmp
    cmp "$storage_area/old.txt" "$storage_area/new.txt" > "$storage_area/cmp.cmp"

done

A second script loops through the results of the cmp.cmp files.

run_diff.sh

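#!/bin/bash
echo "------------------------------------------------------- "
echo "The following may have differences in content based on wdiff analysis"

for i in $(find ./working/ -type d); do

  # skip intermediate folders that hold no page snapshot
  [ -f "$i/old.txt" ] || continue

  # turn the folder path back into the URL path
  better_url_name=$(echo "$i" | sed -e 's#\./working##g')

  # bold white header with both URLs, then reset the colour
  echo -e "\e[1;37m"
  echo -----------------------------------------------------------------------
  echo "http://old_server$better_url_name"
  echo "http://new_server$better_url_name"
  echo -----------------------------------------------------------------------
  echo -e "\e[00m"

  # word-level diff: show only the changed words (-3) plus statistics (-s)
  wdiff -3s "$i/old.txt" "$i/new.txt" | colordiff
done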
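
For reference, one way to restrict the report to pages that actually differ is sketched below. It relies on the exit status of cmp -s rather than on the contents of cmp.cmp, and the function name list_changed_pages is hypothetical, not part of either script above.

```shell
#!/bin/bash

# Hypothetical helper: print each folder under the given root whose
# old.txt and new.txt are not byte-for-byte identical.
list_changed_pages() {
    find "$1" -type f -name old.txt | sort | while IFS= read -r old; do
        dir=$(dirname "$old")
        # cmp -s prints nothing; it exits non-zero when the files differ
        # or when one of them cannot be read
        cmp -s "$old" "$dir/new.txt" 2>/dev/null || echo "$dir"
    done
}
```

Calling list_changed_pages ./working would then give the list of folders worth passing to wdiff.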

Running run_diff.sh produces output like the following:

-----------------------------------------------------------------------
http://old_server/career_services/career_fair.php
http://new_server/career_services/career_fair.php
-----------------------------------------------------------------------


======================================================================
 [-9. 
School-] {+9.School+}
======================================================================
 [-Imagination
April-] {+ImaginationApril+}
======================================================================
 [-contract.
April-] {+contract.April+}
======================================================================

{+ +}
======================================================================
./working/epics/career_services/career_fair.php/old.txt: 1001 words  995 99% common  0 0% deleted  6 1% changed
./working/epics/career_services/career_fair.php/new.txt: 999 words  995 100% common  1 0% inserted  3 0% changed

My questions:

How can I ignore these false positives?
How do I exclude spaces and newline characters?
Is this the right approach? Should I abandon this methodology and pick another one that gives better results?

score 0 · Accepted Answer

With the diff command, you can use the following options:

   -b  --ignore-space-change
         Ignore changes in the amount of white space.

   -w  --ignore-all-space
         Ignore all white space.

   -B  --ignore-blank-lines
         Ignore changes whose lines are all blank.

       --strip-trailing-cr
         Strip trailing carriage return on input.
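
As a rough sketch of how these options might be applied to the old.txt and new.txt pairs from the question: diff compares line by line, so -w and -B help with spacing inside a line and with blank lines, but text that moved across a line break (as in the "9." / "School" case) may still be reported. The second helper below therefore strips every whitespace character before comparing, on the assumption that whitespace is never significant in the page text. Both function names are made up for illustration.

```shell
#!/bin/bash

# Hypothetical helper: line-based diff that ignores all white space (-w),
# blank lines (-B) and trailing carriage returns.
diff_ignoring_whitespace() {
    diff -w -B --strip-trailing-cr "$1" "$2"
}

# Hypothetical helper: succeed (exit 0) when two files are identical
# once all whitespace, including newlines, has been removed.
same_ignoring_whitespace() {
    [ "$(tr -d '[:space:]' < "$1")" = "$(tr -d '[:space:]' < "$2")" ]
}
```

In the run_diff.sh loop, a line such as same_ignoring_whitespace "$i/old.txt" "$i/new.txt" && continue placed before the wdiff call would skip pages that differ only in spacing or line breaks.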
