awk - Use awk on prose

Question

I have a sorted list of phrases, list.txt. I want to use awk to strip any entry on that list from a lengthy file of prose and replace it with a return. It's not hard to find examples of using awk to compare two files, but they all assume both files are neatly structured, which prose is not.

Here's the relevant part of the script:

#! /bin/sh
...

sed '
s/[0-9]/\n/g        # strip out all numbers, replace with returns
s/[@€•\!¡%“”"_–=\*\&\/\?¿\,\.]/\n/g
' $1 > $1.z.tmp

cp stowplist.txt strip1.tmp

awk 'BEGIN { FS = "\t" } ; { print $1 }' SpanishGlossary.utf8 >> strip1.tmp
#sh ./awkwords SpanishGlossary.utf8 >> strip1.tmp

sort -u strip1.tmp > strip2.tmp

awk '{ print length(), $0 | "sort -rn" }' strip2.tmp > strip3.tmp
#echo "List ordered by length."

#echo "Now creating new script." # THIS AFFECTS THE SCRIPT, NOT THE OUTPUT FILE.
sed '
s/[0-9]//g      # strip out all numbers
s/[\t^\ *\ $]// # strip tabs, leading and trailing spaces
/^.\{0,5\}$/d       # delete lines with less than five characters
/^$/d           # delete blank lines
s/^/\\y/g           # begin word boundary
s/$/\\y/g           #end word boundary
s/\ /\\ /g      # make spaces into literals
' strip3.tmp > strip.tmp

echo "Eliminating existing entries. This may take a while."
awk 'NR==FNR{p = p s $0; s="|" ;next} {gsub(p,"\n");print}' strip.tmp $1.z.tmp > $1.1.tmp

...

And here's a representative sample of strip.tmp:

\yinfraestructura\ de\ la\ fabricación\y
\yFecha\ de\ Vencimiento\ del\ Contrato\y
\yfactores\ importantes\ a\ considerar\y
\yexcepto\ lo\ estrictamente\ personal\y
\yexamen\ de\ los\ ojos\ con\ dilatación\y
\yes\ un\ estado\ capitalista\ corrupto\y
\yes\ un\ derecho\ legal\ reconocido\ en\y
\yestimular\ la\ capacidad\ productiva\y
\yestimación\ de\ la\ edad\ gestacional\y
\yEste\ Programa\ de\ Transición\ Verde\y
\yEstán\ permanentemente\ enfrentados\y

And finally, a representative sample of the input text, with punctuation substituted with line breaks.

Es la historia de más de un siglo del cooperativismo en Argentina
 con empresas en todos los rincones de nuestra geografía y en todos los sectores de la economía

En plena crisis del sistema económico mundial
 con creciente alarma frente al deterioro a escala planetaria de las condiciones medio ambientales
 la comunidad internacional ha declarado
 desde la Organización de las Naciones Unidas
 a éste como el Año Internacional de las Cooperativas

No es casualidad
 el mundo está buscando nuevos caminos
 nuevos paradigmas para organizarse

score 2 · Accepted Answer

@Kent posted:

awk 'NR==FNR{p[$0];next}{a[FNR]=$0}END{for(i=1;i<=FNR;i++){for(v in p)gsub(v,"",a[i]);print a[i]}}' file1 file2

I changed the variable "l" to "v" for readability. Don't use "l" as a variable name, since it looks far too much like the numeral "1".

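The above reads all of file2 into an array and then loops over that array in the END block to do the replacements, instead of simply doing the replacement as each line is read, e.g.: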

awk 'NR==FNR{p[$0];next} {for(v in p)gsub(v,"");print}' file1 file2

An even faster alternative, though, is to build a single RE string instead of an array of the phrases you want to delete, so that you run one gsub() on each line of file2 rather than one gsub() per phrase in file1:

awk 'NR==FNR{p = p s $0; s="|" ;next} {gsub(p,"");print}' file1 file2
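To make the `p = p s $0; s="|"` idiom concrete, here is a minimal sketch that prints the RE it builds and then runs the full command. The sample data is borrowed from the small example elsewhere in this thread; the temporary directory and the names `dir`, `re`, and `out` are assumptions made only for this demonstration.

```shell
# Sample data from the small example in this thread; dir, re and out are
# illustrative names, not part of the original command.
dir=$(mktemp -d)
printf '%s\n' 'good for you' 'hi there' 'awk is nice' > "$dir/file1"
printf '%s\n' 'line1 blah (hi there) blah (good for you)' \
              'line2 blah (awk is nice) blah (hi there)(good for you)' > "$dir/file2"

# First pass only: s is empty for the first phrase and "|" afterwards,
# so p grows into phrase1|phrase2|phrase3 with no leading separator.
re=$(awk '{ p = p s $0; s = "|" } END { print p }' "$dir/file1")
printf '%s\n' "$re"

# Full command: a single gsub() per line of file2 using that one RE.
out=$(awk 'NR==FNR{p = p s $0; s="|" ;next} {gsub(p,"");print}' "$dir/file1" "$dir/file2")
printf '%s\n' "$out"

rm -r "$dir"
```

The first `printf` should show the three phrases joined by `|`, which is the alternation that `gsub()` then applies to every line of file2.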

Be aware that all of these do RE comparisons, so any RE metacharacters in file1 will affect what matches in file2. Since you are comparing this with a sed solution, I assume that's fine.
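If the phrases should instead be matched literally, one possible workaround (a sketch, not part of the original answer) is to backslash-escape the RE metacharacters in each phrase before appending it to `p`. The sample phrases, the temporary directory, and the variable `phrase` below are hypothetical.

```shell
# Hypothetical sample files, created only for this demonstration.
dir=$(mktemp -d)
printf '%s\n' 'cost (est.)' 'a+b' > "$dir/file1"
printf '%s\n' 'the cost (est.) is a+b today' 'cost est is aab' > "$dir/file2"

out=$(awk '
NR==FNR {
    phrase = $0
    # Put a backslash in front of every ERE metacharacter so that the
    # phrase is matched as literal text when used inside the dynamic RE.
    gsub(/[][\\.^$*+?(){}|]/, "\\\\&", phrase)
    p = p s phrase; s = "|"
    next
}
{ gsub(p, ""); print }
' "$dir/file1" "$dir/file2")

printf '%s\n' "$out"
rm -r "$dir"
```

With the escaping in place, the second input line is left untouched, whereas the unescaped phrases would match parts of it (`a+b` matches `aab`, for instance).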

If all you care about is speed, this GNU awk solution is probably faster still:

$ gawk -v RS='\0' -v FS='\n' -v OFS='|' 'NR==FNR{NF--; p=$0; next} {gsub(p,"");print}' file1 file2
line1 blah () blah ()
line2 blah () blah ()()
line3 blah blah () ()()

But it's fairly obscure, uses more memory than the others, and doesn't scale well, so I wouldn't bother.

Just use the solution above that builds "p" as a single RE and runs a single gsub() on each line.

score 1 · Accepted Answer

This one-liner may work for you:

 awk 'NR==FNR{p[$0];next}{a[FNR]=$0}END{for(i=1;i<=FNR;i++){for(l in p)gsub(l,"",a[i]);print a[i]}}' file1 file2

Notes:

file1 is your list.txt
file2 is your prose

A small example:

kent$  head file*
==> file1 <==
good for you
hi there
awk is nice

==> file2 <==
line1 blah (hi there) blah (good for you)
line2 blah (awk is nice) blah (hi there)(good for you)
line3 blah blah (good for you) (awk is nice)(hi there)

kent$  awk 'NR==FNR{p[$0];next}{a[FNR]=$0}END{for(i=1;i<=FNR;i++){for(l in p)gsub(l,"",a[i]);print a[i]}}' file1 file2
line1 blah () blah ()
line2 blah () blah ()()
line3 blah blah () ()()
