unix - Using sed to remove words in a stopword list (feeding sed a list of parameters to remove from a text file)

Question

So, we all know that sed is great at finding and replacing every occurrence of a word in a file:

sed -i 's/original_word/new_word/g' file.txt

But can someone tell me how to feed sed a list of 'original_words' from a file (similar to grep -f)? I just want to replace all of them with '' (that is, erase them).

The word list file is a series of stopwords, one per line (wordlist.txt):

a
about
above
according
across
after
afterwards

This would be an easy way to take a list of stopwords and remove them from a corpus (useful for data cleaning).

file.txt looks like this:

05ricardo   RT @shakira: Immigration reform isn't about politics. It's about people mothers, kids. Obama is working for all of them. http://t.co/rAW ...    0
05ricardo   ?@ItsReginaG: Don't vote Obama. Because you will lose jobs, and die.? Lol   0
05ricardo   ?@shakira: Obama doubles Pell Grants - 700,000 more Latinos get help to go to college. Meet Johanny Adames http://t.co/EMg8NLGl Shak?. ?    -1
05rodriguez_a   My Comm teacher gave me a copy of Obama's speech that he gave the other night and I cried while reading it. It was that moving. -3

score 2 · Accepted Answer

You can have sed write the sed script for you (tested with GNU sed):

<stopwords sed 's:.*:s/\\b&\\b//g:' | sed -f - file.txt

Output:

05ricardo   RT @shakira: Immigration reform isn't  politics. It's  people mothers, kids. Obama is working for all of them. http://t.co/rAW ...    0
05ricardo   ?@ItsReginaG: Don't vote Obama. Because you will lose jobs, and die.? Lol   0
05ricardo   ?@shakira: Obama doubles Pell Grants - 700,000 more Latinos get help to go to college. Meet Johanny Adames http://t.co/EMg8NLGl Shak?. ?    -1
05rodriguez_a   My Comm teacher gave me  copy of Obama's speech that he gave the other night and I cried while reading it. It was that moving. -3
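To see what the first sed actually produces, it can help to write the generated script to a file before applying it. The following is only a sketch under a few assumptions: the temporary directory, the file name stop.sed and the one-line sample text are made up for illustration, the g flag is placed inside each generated command so that every occurrence on a line is removed, and `\b` is a GNU sed extension.

```shell
#!/bin/sh
# Sketch: generate one s/\bWORD\b//g command per stopword, inspect it, apply it.
# All file names and sample data below are hypothetical.
tmp=$(mktemp -d)
printf '%s\n' a about above > "$tmp/wordlist.txt"
printf '%s\n' "gave me a copy, about politics, about people" > "$tmp/file.txt"

# Turn every line of the word list into a substitution command.
sed 's:.*:s/\\b&\\b//g:' "$tmp/wordlist.txt" > "$tmp/stop.sed"
cat "$tmp/stop.sed"
# s/\ba\b//g
# s/\babout\b//g
# s/\babove\b//g

# Apply the generated script; the spaces around removed words are left behind.
result=$(sed -f "$tmp/stop.sed" "$tmp/file.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

This only works as-is when the words contain no characters that are special to sed (such as `/`, `.` or `*`); a list with such entries would need escaping first.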

score 1 · Accepted Answer

First, not every sed supports -i, but it is not an essential option, because that functionality is easy to provide in a generic way. One simple option (assuming a shell that is not in the csh family):

inline() { f=$1; shift; "$@" < "$f" > "$f.out" && mv "$f.out" "$f"; }
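As a quick illustration of how such a helper behaves, here is a small sketch that runs it with `tr` as the filter. The temporary directory and demo.txt are hypothetical names chosen for the example; the function is the one above with its variables quoted.

```shell
#!/bin/sh
# inline FILE CMD [ARGS...]: run CMD with FILE on stdin, then replace FILE
# with CMD's output, but only if CMD succeeded.
inline() { f=$1; shift; "$@" < "$f" > "$f.out" && mv "$f.out" "$f"; }

tmp=$(mktemp -d)                      # hypothetical scratch directory
printf 'hello world\n' > "$tmp/demo.txt"

# Any filter that reads stdin and writes stdout can be used here.
inline "$tmp/demo.txt" tr a-z A-Z

result=$(cat "$tmp/demo.txt")
printf '%s\n' "$result"               # HELLO WORLD
rm -rf "$tmp"
```

Because the output goes to a separate file first and is moved into place only on success, a failing command leaves the original file untouched (a stray .out file may remain).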

Next, do the substitution. You did not specify how word delimiters should be handled, so if 'foo' is in the blacklist, 'bar foo baz' will end up with two spaces between 'bar' and 'baz'. This is quite easy with either awk or perl:

awk 'NR==FNR{a[$0]; next} {for( i in a ) gsub( i, "" )} 1' original-words file.txt
# @ARGV is still non-empty while the first file (the word list) is being read
perl -wne 'if( @ARGV ){ chomp; push @no, $_; next }
    foreach $x( @no ) { s/$x//g } print' original-words file.txt
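Note that both commands treat each blacklist entry as a regular expression with no word boundaries, so an entry is also removed when it occurs inside a longer word. A small sketch of that behaviour, using made-up sample data in a temporary directory:

```shell
#!/bin/sh
# Sketch: the awk one-liner removes "about" even inside "roundabout".
# The word list and the sample sentence are hypothetical.
tmp=$(mktemp -d)
printf '%s\n' about > "$tmp/original-words"
printf '%s\n' "out and about, roundabout" > "$tmp/file.txt"

# First file: collect the words as array keys. Second file: delete each one.
result=$(awk 'NR==FNR{a[$0]; next} {for( i in a ) gsub( i, "" )} 1' \
    "$tmp/original-words" "$tmp/file.txt")
printf '%s\n' "$result"               # out and , round
rm -rf "$tmp"
```

Whether that is acceptable depends on the stopword list; a short entry such as "a" would strip that letter from every word.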

If you are happy with the result, you can use -i with perl (not every sed supports -i, but every perl newer than 5.0 does), or modify the file in place with:

inline file.txt awk 'NR==FNR{a[$0]; next} 
    {for( i in a ) gsub( i, "" )} 1' original-words -

Either of these solutions will be significantly faster than invoking sed once for every word in the blacklist.

score 1 · Accepted Answer

Here is one way using GNU sed:

while IFS= read -r word; do sed -ri "s/( |)\b$word\b//g" file; done < wordlist

Contents of file:

how about I decide to look at it afterwards. What
across do you think? Is it a good idea to go out and about? I 
think I'd rather go up and above.

Results:

how I decide to look at it. What
 do you think? Is it good idea to go out and? I 
think I'd rather go up and.
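The loop rewrites the file once per word. If that becomes slow, the same substitutions can probably be collected into one script and applied in a single pass. This is only a sketch, assuming GNU sed and plain alphanumeric words; the temporary directory, the shortened word list and the two sample lines are made up for the example, and it prints the result instead of editing in place.

```shell
#!/bin/sh
# Sketch: build all "s/( |)\bWORD\b//g" commands first, then run sed once.
tmp=$(mktemp -d)                      # hypothetical scratch directory
printf '%s\n' a about above > "$tmp/wordlist"
printf '%s\n' "Is it a good idea to go out and about?" \
              "I'd rather go up and above." > "$tmp/file"

# One substitution command per line of the word list.
script=$(sed 's:.*:s/( |)\\b&\\b//g:' "$tmp/wordlist")

# -E enables the ( |) group; the newline-separated commands run in order.
result=$(sed -E "$script" "$tmp/file")
printf '%s\n' "$result"
# Is it good idea to go out and?
# I'd rather go up and.
rm -rf "$tmp"
```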

score 0 · Accepted Answer

Maybe this:

#!/bin/sh
while read k
do
  sed -i "s/$k//g" file.txt
done < dict.txt

answered 2013-02-07T05:58:22.380

score -1 · Accepted Answer

cat file.txt | grep  -vf wordlist.txt

answered 2014-05-21T16:18:50.380
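For comparison with the sed approaches, `grep -v -f` filters out whole lines that match any pattern in the list rather than deleting the matching words. A minimal sketch with made-up sample data in a temporary directory:

```shell
#!/bin/sh
# Sketch: grep -vf drops every line that contains any listed word.
tmp=$(mktemp -d)                      # hypothetical scratch directory
printf '%s\n' about above > "$tmp/wordlist.txt"
printf '%s\n' "It is about people" "Obama doubles Pell Grants" > "$tmp/file.txt"

result=$(grep -vf "$tmp/wordlist.txt" "$tmp/file.txt")
printf '%s\n' "$result"               # Obama doubles Pell Grants
rm -rf "$tmp"
```

The first sample line disappears entirely because it contains "about".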
