I need to find all the English words that can be formed from the letters in a string
sentence="Ziegler's Giant Bar"
I can make an array of the letters with
sentence.split(//)
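How can I make the 4500+ English words from the sentence in Ruby?
[Edit]
It may be best to split the problem into parts:
- only build an array of words with 10 letters or fewer
- the longer words can be looked up separately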
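The split suggested above could be sketched as follows. This is only an illustration: `partition_by_length` is a hypothetical helper name, and the sample words stand in for a real dictionary file.

```ruby
# Hypothetical helper: split a word list into short words (<= limit letters)
# and long words, so that each group can be processed separately.
def partition_by_length(words, limit = 10)
  words.partition { |word| word.length <= limit }
end

# A few sample words standing in for a real dictionary.
sample = %w[zebra zigzag zeitgeist zigzagging abintestate]
short, long = partition_by_length(sample)
puts "short: #{short.join(' ')}"
puts "long:  #{long.join(' ')}"
```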
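[Assuming you can reuse the source letters within one word]: For each word in your dictionary list, construct two arrays of letters: one for the candidate word and one for the input string. Subtract the input array-of-letters from the word array-of-letters; if no letters are left over, you've got a match. The code to do that looks like this: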
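def findWordsWithReplacement(sentence)
  out=[]
  splitArray=sentence.downcase.split(//)
  # the backticks return one big String; String#each is gone in Ruby 1.9+, so use each_line
  `cat /usr/share/dict/words`.each_line{|line|
    word=line.strip   # strip! would return nil when there is nothing to strip
    if (word.downcase.split(//) - splitArray).empty?
      out.push word
    end
  }
  return out
end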
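The reason this version allows letters to be reused is that Ruby's `Array#-` removes every occurrence of each element found in the right-hand array. A small sketch of that behaviour (the variable names here are just for illustration):

```ruby
input = "Ziegler's Giant Bar".downcase.split(//)

# "z" occurs once in the input but twice in "zigzag";
# the subtraction still removes both copies.
zigzag_leftover = "zigzag".split(//) - input

# "q" and "u" do not occur in the input, so they are left over.
quiz_leftover = "quiz".split(//) - input

p zigzag_leftover  # => []
p quiz_leftover    # => ["q", "u"]
```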
You can call that function from the irb debugger like so:
output=findWordsWithReplacement("some input string"); puts output.join(" ")
...or here's a wrapper you could use to call the function interactively from a script:
puts "enter the text."
ARGF.each {|line|
puts "working..."
out=findWordsWithReplacement(line)
puts out.join(" ")
puts "there were #{out.size} words."
}
When I run this on a Mac, the output looks like this:
$ ./findwords.rb
enter the text.
Ziegler's Giant Bar
working...
abet abettal Abie Abies abietate abietene abietin Abietineae Abiezer Abigail abigail abigeat abilla abintestate
[....]
Z z za Zabaean zabeta Zabian zabra zabti zabtie zag zain Zan zanella zant zante Zanzalian zanze Zanzibari zar zaratite zareba zat zati zattare Zea zeal zealless zeallessness zebra zebrass Zebrina zebrine zee zein zeist zig zigzag zigzagger Zilla zing zingel Zingiber zingiberene Zinnia zinsang Zinzar zira zirai Zirbanit Zirian Zirianian Zizania Zizia zizz
there were 6725 words.
That is well over 4500 words, but only because the Mac word dictionary is pretty large. If you want to reproduce Knuth's results exactly, download and unzip Knuth's dictionary from http://www.packetstormsecurity.org/Crackers/wordlists/dictionaries/knuth_words.gz and replace "/usr/share/dict/words" with the path to wherever you unpacked the substitute dictionary. If you did it right, you'll get 4514 words, ending in this collection:
zanier zanies zaniness Zanzibar zazen zeal zebra zebras Zeiss zeitgeist Zen Zennist zest zestier zeta Ziegler zig zigging zigzag zigzagging zigzags zing zingier zings zinnia
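I believe that answers the original question.
Alternatively, the questioner/reader may have wanted to list all the words one can construct from a string without reusing any of the input letters. My suggested code for that works as follows: copy the candidate word, then for each letter in the input string, destructively remove the first instance of that letter from the copy (using "slice!"). If this process absorbs all the letters, accept the word.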
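def findWordsNoReplacement(sentence)
  out=[]
  splitInput=sentence.downcase.split(//)
  # String#each is gone in Ruby 1.9+, so iterate with each_line
  `cat /usr/share/dict/words`.each_line{|line|
    word=line.strip        # strip! would return nil when there is nothing to strip
    copy=word.downcase     # downcase returns a new string, so word itself is untouched
    splitInput.each {|o| copy.slice!(o) }
    out.push word if copy==""
  }
  return out
end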
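A different way to express the same no-reuse check is to compare letter counts instead of slicing a copy. The sketch below is an alternative technique, not the code above; `letter_counts` and `spellable_without_reuse?` are hypothetical helper names, and non-letters such as the apostrophe are simply ignored.

```ruby
# Count how often each letter a-z occurs in a string (case-insensitive).
def letter_counts(text)
  counts = Hash.new(0)
  text.downcase.scan(/[a-z]/) { |letter| counts[letter] += 1 }
  counts
end

# Hypothetical helper: true when `word` can be spelled from `sentence`
# using each source letter at most once.
def spellable_without_reuse?(word, sentence)
  available = letter_counts(sentence)
  letter_counts(word).all? { |letter, needed| needed <= available[letter] }
end

sentence = "Ziegler's Giant Bar"
puts spellable_without_reuse?("zebra", sentence)   # => true
puts spellable_without_reuse?("zigzag", sentence)  # => false (needs two z's, only one available)
```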
If you want to find words whose letters and their frequencies are restricted by the given phrase, you can construct a regex to do this for you:
sentence = "Ziegler's Giant Bar"
# count how many times each letter occurs in the
# sentence (ignoring case, and removing non-letters)
counts = Hash.new(0)
sentence.downcase.gsub(/[^a-z]/,'').split(//).each do |letter|
counts[letter] += 1
end
letters = counts.keys.join
length = counts.values.inject { |a,b| a + b }
# construct a regex that matches up to that many occurrences
# of only those letters, ignoring non-letters
# (in a positive look ahead)
length_regex = /(?=^(?:[^a-z]*[#{letters}]){1,#{length}}[^a-z]*$)/i
# construct regexes that matches each letter up to its
# proper frequency (in a positive look ahead)
count_regexes = counts.map do |letter, count|
/(?=^(?:[^#{letter}]*#{letter}){0,#{count}}[^#{letter}]*$)/i
end
# combine the regexes, to form a regex that will only
# match words that are made of a subset of the letters in the string
regex = /#{length_regex}#{count_regexes.join('')}/
# open a big file of words, and find all the ones that match
words = File.open("/usr/share/dict/words") do |f|
f.map { |word| word.chomp }.find_all { |word| regex =~ word }
end
words.length #=> 3182
words #=> ["A", "a", "aa", "aal", "aalii", "Aani", "Ab", "aba", "abaiser", "Abantes",
"Abaris", "abas", "abase", "abaser", "Abasgi", "abate", "abater", "abatis",
...
"ba", "baa", "Baal", "baal", "Baalist", "Baalite", "Baalize", "baar", "bae",
"Baeria", "baetzner", "bag", "baga", "bagani", "bagatine", "bagel", "bagganet",
...
"eager", "eagle", "eaglet", "eagre", "ean", "ear", "earing", "earl", "earlet",
"earn", "earner", "earnest", "earring", "eartab", "ease", "easel", "easer",
...
"gab", "Gabe", "gabi", "gable", "gablet", "Gabriel", "Gael", "gaen", "gaet",
"gag", "gagate", "gage", "gageable", "gagee", "gageite", "gager", "Gaia",
...
"Iberian", "Iberis", "iberite", "ibis", "Ibsenite", "ie", "Ierne", "Igara",
"Igbira", "ignatia", "ignite", "igniter", "Ila", "ilesite", "ilia", "Ilian",
...
"laang", "lab", "Laban", "labia", "labiate", "labis", "labra", "labret", "laet",
"laeti", "lag", "lagan", "lagen", "lagena", "lager", "laggar", "laggen",
...
"Nabal", "Nabalite", "nabla", "nable", "nabs", "nae", "naegate", "naegates",
"nael", "nag", "Naga", "naga", "Nagari", "nagger", "naggle", "nagster", "Naias",
...
"Rab", "rab", "rabat", "rabatine", "Rabi", "rabies", "rabinet", "rag", "raga",
"rage", "rager", "raggee", "ragger", "raggil", "raggle", "raging", "raglan",
...
"sa", "saa", "Saan", "sab", "Saba", "Sabal", "Saban", "sabe", "saber",
"saberleg", "Sabia", "Sabian", "Sabina", "sabina", "Sabine", "sabine", "Sabir",
...
"tabes", "Tabira", "tabla", "table", "tabler", "tables", "tabling", "Tabriz",
"tae", "tael", "taen", "taenia", "taenial", "tag", "Tagabilis", "Tagal",
...
"zest", "zeta", "ziara", "ziarat", "zibeline", "zibet", "ziega", "zieger",
"zig", "zing", "zingel", "Zingiber", "zira", "zirai", "Zirbanit", "Zirian"]
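Positive lookaheads let you build a regex that matches a position in the string where some specified pattern matches, without consuming the part of the string that matches. We use them here to match the same string against multiple patterns in a single regex. The position only matches if all of our patterns match.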
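To see the lookahead idea in isolation, here is a minimal sketch with just two per-letter limits (at most one "z" and at most two "g"s). The variable names are made up for this illustration.

```ruby
# Each lookahead tests the whole word from its start but consumes nothing,
# so the next lookahead is tried again at the same position.
at_most_one_z = /(?=^(?:[^z]*z){0,1}[^z]*$)/i
at_most_two_g = /(?=^(?:[^g]*g){0,2}[^g]*$)/i
both = /#{at_most_one_z}#{at_most_two_g}/

results = %w[zig zigzag giggle].map { |word| [word, !!(both =~ word)] }
results.each { |word, ok| puts "#{word}: #{ok}" }
# zig: true
# zigzag: false   (two z's)
# giggle: false   (three g's)
```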
If we allow infinite reuse of letters from the original phrase (as Knuth did, according to glenra's comment), then it's even easier to construct a regex:
sentence = "Ziegler's Giant Bar"
# find all the letters in the sentence
letters = sentence.downcase.gsub(/[^a-z]/,'').split(//).uniq
# construct a regex that matches any line in which
# the only letters used are the ones in the sentence
regex = /^([^a-z]|[#{letters.join}])*$/i
# open a big file of words, and find all the ones that match
words = File.open("/usr/share/dict/words") do |f|
f.map { |word| word.chomp }.find_all { |word| regex =~ word }
end
words.length #=> 6725
words #=> ["A", "a", "aa", "aal", "aalii", "Aani", "Ab", "aba", "abaiser", "abalienate",
...
"azine", "B", "b", "ba", "baa", "Baal", "baal", "Baalist", "Baalite",
"Baalize", "baar", "Bab", "baba", "babai", "Babbie", "Babbitt", "babbitt",
...
"Britannian", "britten", "brittle", "brittleness", "brittling", "Briza",
"brizz", "E", "e", "ea", "eager", "eagerness", "eagle", "eagless", "eaglet",
"eagre", "ean", "ear", "earing", "earl", "earless", "earlet", "earliness",
...
"eternalize", "eternalness", "eternize", "etesian", "etna", "Etnean", "Etta",
"Ettarre", "ettle", "ezba", "Ezra", "G", "g", "Ga", "ga", "gab", "gabber",
"gabble", "gabbler", "Gabe", "gabelle", "gabeller", "gabgab", "gabi", "gable",
...
"grittiness", "grittle", "Grizel", "Grizzel", "grizzle", "grizzler", "grr",
"I", "i", "iba", "Iban", "Ibanag", "Iberes", "Iberi", "Iberia", "Iberian",
...
"itinerarian", "itinerate", "its", "Itza", "Izar", "izar", "izle", "iztle",
"L", "l", "la", "laager", "laang", "lab", "Laban", "labara", "labba", "labber",
...
"litter", "litterer", "little", "littleness", "littling", "littress", "litz",
"Liz", "Lizzie", "Llanberisslate", "N", "n", "na", "naa", "Naassenes", "nab",
"Nabal", "Nabalite", "Nabataean", "Nabatean", "nabber", "nabla", "nable",
...
"niter", "nitraniline", "nitrate", "nitratine", "Nitrian", "nitrile",
"nitrite", "nitter", "R", "r", "ra", "Rab", "rab", "rabanna", "rabat",
"rabatine", "rabatte", "rabbanist", "rabbanite", "rabbet", "rabbeting",
...
"riteless", "ritelessness", "ritling", "rittingerite", "rizzar", "rizzle", "S",
"s", "sa", "saa", "Saan", "sab", "Saba", "Sabaean", "sabaigrass", "Sabaist",
...
"strigine", "string", "stringene", "stringent", "stringentness", "stringer",
"stringiness", "stringing", "stringless", "strit", "T", "t", "ta", "taa",
"Taal", "taar", "Tab", "tab", "tabaret", "tabbarea", "tabber", "tabbinet",
...
"tsessebe", "tsetse", "tsia", "tsine", "tst", "tzaritza", "Tzental", "Z", "z",
"za", "Zabaean", "zabeta", "Zabian", "zabra", "zabti", "zabtie", "zag", "zain",
...
"Zirian", "Zirianian", "Zizania", "Zizia", "zizz"]
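I don't think Ruby has an English dictionary. But you could try storing all the permutations of the original string in an array and checking those strings against Google? Say that a word is actually a word if it has more than 100,000 hits?
You can get an array of letters like so: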
sentence = "Ziegler's Giant Bar"
letters = sentence.split(//)
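As a rough sketch of the permutation idea: the number of orderings grows factorially, so it is only practical for short inputs. The example below counts the full-length orderings for the sentence and then checks the permutations of a three-letter string against a small in-memory set. `known_words` is a stand-in for whatever word source is used (a dictionary file, or the search-hit test suggested above), and `factorial` is a helper defined here.

```ruby
require 'set'

# Number of orderings of n distinct items (n!); repeated letters are not collapsed.
def factorial(n)
  (1..n).inject(1) { |product, i| product * i }
end

letters_only = "Ziegler's Giant Bar".downcase.scan(/[a-z]/)
puts "#{letters_only.size} letters, #{factorial(letters_only.size)} full-length orderings"

# For a short input, every permutation can be checked against a word set.
known_words = Set.new(%w[bar bra arb])  # stand-in for a real dictionary
found = "bar".split(//).permutation.map(&:join).uniq.select { |w| known_words.include?(w) }
p found.sort
```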