c# - How to read text word by word

Question

I'm working with a txt or htm file. Currently I'm going through the document char by char using a for loop, but I need to go through the text word by word, and then char by char inside each word. How can I do this?

for (int i = 0; i < text.Length; i++)
{}

score 5 · Accepted Answer

A simple approach is to use string.Split without arguments (it splits on whitespace characters):

using (StreamReader sr = new StreamReader(path)) 
{
    while (sr.Peek() >= 0) 
    {
        string line = sr.ReadLine();
        string[] words = line.Split();
        foreach(string word in words)
        {
            foreach(Char c in word)
            {
                // ...
            }
        }
    }
}

I've used StreamReader.ReadLine to read an entire line at a time.

To parse HTML, I'd use a robust library like HtmlAgilityPack.
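
One thing to watch with the loop above: Split() with no arguments returns empty strings when a line is blank or contains runs of whitespace. A minimal sketch that skips those entries, assuming the same line-by-line traversal (the WordReader class and ReadWords method are illustrative names, not part of the answer):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class WordReader
{
    // Hypothetical helper: yields each non-empty whitespace-delimited word from a reader.
    public static IEnumerable<string> ReadWords(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // A null separator array means "split on whitespace";
            // RemoveEmptyEntries drops the empty strings produced by blank
            // lines and consecutive whitespace characters.
            foreach (string word in line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
            {
                yield return word;
            }
        }
    }
}
```

It can be fed a StreamReader for a file or a StringReader for text already in memory, and each returned word can then be walked with an inner foreach over its chars.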

score 2 · Accepted Answer

You can split the string on whitespace, but you will have to deal with punctuation and HTML markup (you said you were working with txt and htm files).

string[] tokens = text.Split(); // Split() with no arguments splits on whitespace
foreach(string tok in tokens)
{
    // process tok string here
}
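
For the punctuation part, one possible sketch is to trim punctuation from both ends of each token using char.IsPunctuation. This rests on an assumption about what counts as a word: inner punctuation such as an apostrophe is kept, and HTML markup is not handled at all. TokenCleaner and CleanWords are illustrative names.

```csharp
using System;
using System.Collections.Generic;

static class TokenCleaner
{
    // Hypothetical helper: splits on whitespace, then strips leading and
    // trailing punctuation from each token. Tokens made only of punctuation
    // are dropped.
    public static List<string> CleanWords(string text)
    {
        var result = new List<string>();
        foreach (string tok in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            int start = 0;
            int end = tok.Length - 1;
            while (start <= end && char.IsPunctuation(tok[start])) start++;
            while (end >= start && char.IsPunctuation(tok[end])) end--;
            if (start <= end)
            {
                result.Add(tok.Substring(start, end - start + 1));
            }
        }
        return result;
    }
}
```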

score 0 · Accepted Answer

Use text.Split(' ') to split the text into an array of words on spaces, then iterate over that.

So:

foreach(String word in text.Split(' '))
   foreach(Char c in word)
      Console.WriteLine(c);

score 0 · Accepted Answer

You can use HtmlAgilityPack to get all the text out of some HTML. If you think this is overkill, look here.

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);

foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    var nodeText = node.InnerText;
}

Then, once you have defined what a word is, you can split each node's text content into words.

Perhaps like this:

using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

static IEnumerable<string> WordsInHtml(string text)
{
    var splitter = new Regex(@"[^\p{L}]*\p{Z}[^\p{L}]*");

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(text);

    foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
    {
        foreach(var word in splitter.Split(node.InnerText))
        {
            yield return word;
        }
    }
}

Then, to go through the characters in each word:

foreach(var word in WordsInHtml(text))
{
    foreach(var c in word)
    {
        // an enumeration by word, then by char.
    }
}

score 0 · Accepted Answer

How about a regular expression?

using System;
using System.Linq;
using System.Text.RegularExpressions;

namespace ConsoleApplication58
{
    class Program
    {
        static void Main()
        {
            string input =
                @"I'm working with a txt or htm file. And currently I'm looking up the document char by char, using for loop, but I need to look up the text word by word, and then inside the word char by char. How can I do this?";
            var list = from Match match in Regex.Matches(input, @"\b\S+\b")
                       select match.Value; //Get IEnumerable of words
            foreach (string s in list) 
                Console.WriteLine(s); //doing something with it
            Console.ReadKey();
        }
    }
}

It works with any delimiter, and it's the fastest way to do it.
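
To cover the second half of the question (char by char inside each word), the same pattern can be wrapped in a small helper. This is only a sketch: RegexWords and Extract are illustrative names, and note that \b\S+\b keeps inner punctuation, so a token like "a,b" comes back as one word.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RegexWords
{
    // Hypothetical helper: returns every match of \b\S+\b in the input.
    // Leading and trailing punctuation is excluded by the word boundaries.
    public static List<string> Extract(string input)
    {
        var words = new List<string>();
        foreach (Match match in Regex.Matches(input, @"\b\S+\b"))
        {
            words.Add(match.Value);
        }
        return words;
    }

    // Walks the text word by word, then char by char inside each word.
    public static void PrintChars(string input)
    {
        foreach (string word in Extract(input))
        {
            foreach (char c in word)
            {
                Console.WriteLine(c);
            }
        }
    }
}
```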

score 0 · Accepted Answer

You can split on spaces:

string[] words = text.Split(' ');

That gives you an array of words, which you can then iterate over.

foreach(string word in words)
{
    // do something with each word
}

score 0 · Accepted Answer

I think you can also use Split:

var words = reader.ReadToEnd().Split(' ');

or use

foreach (string word in text.Split(' '))
   foreach (char c in word) { /* ... */ }
