10

これは単語では不可能だという印象を受けていますが、非常に長い論文のどこかに同じ順序で来る 3 ~ 4 語を探していると、同じフレーズの重複を見つけることができると思います。

私は過去の論文から多くのドキュメントをコピーして貼り付けました。この 40 ページ以上のドキュメントで繰り返される情報を見つける簡単な方法を見つけたいと思っていました。多くの異なる書式設定がありますが、順番に書式設定を一時的に削除したいと考えています。繰り返される情報を見つける。

4

3 に答える 3

17

重複するすべての文を強調表示するには、を使用することもできますActiveDocument.Sentences(i)。これが例です

論理

1)単語ドキュメントからすべての文を配列で取得します

2)配列を並べ替えます

3)重複を抽出する

4)重複を強調表示する

コード

Option Explicit

Sub Sample()
    Dim MyArray() As String
    Dim n As Long, i As Long
    Dim Col As New Collection
    Dim itm

    n = 0
    '~~> Get all the sentences from the word document in an array
    For i = 1 To ActiveDocument.Sentences.Count
        n = n + 1
        ReDim Preserve MyArray(n)
        MyArray(n) = Trim(ActiveDocument.Sentences(i).Text)
    Next

    '~~> Sort the array
    SortArray MyArray, 0, UBound(MyArray)

    '~~> Extract Duplicates
    For i = 1 To UBound(MyArray)
        If i = UBound(MyArray) Then Exit For
        If InStr(1, MyArray(i + 1), MyArray(i), vbTextCompare) Then
            On Error Resume Next
            Col.Add MyArray(i), """" & MyArray(i) & """"
            On Error GoTo 0
        End If
    Next i

    '~~> Highlight duplicates
    For Each itm In Col
        Selection.Find.ClearFormatting
        Selection.HomeKey wdStory, wdMove
        Selection.Find.Execute itm
        Do Until Selection.Find.Found = False
            Selection.Range.HighlightColorIndex = wdPink
            Selection.Find.Execute
        Loop
    Next
End Sub

'~~> Sort the array
Public Sub SortArray(vArray As Variant, i As Long, j As Long)
  Dim tmp As Variant, tmpSwap As Variant
  Dim ii As Long, jj As Long

  ii = i: jj = j: tmp = vArray((i + j) \ 2)

  While (ii <= jj)
     While (vArray(ii) < tmp And ii < j)
        ii = ii + 1
     Wend
     While (tmp < vArray(jj) And jj > i)
        jj = jj - 1
     Wend
     If (ii <= jj) Then
        tmpSwap = vArray(ii)
        vArray(ii) = vArray(jj): vArray(jj) = tmpSwap
        ii = ii + 1: jj = jj - 1
     End If
  Wend
  If (i < jj) Then SortArray vArray, i, jj
  If (ii < j) Then SortArray vArray, ii, j
End Sub

SNAPSHOTS

ここに画像の説明を入力してください

ここに画像の説明を入力してください

于 2012-07-24T04:07:01.630 に答える
4

私は自分の DAWG の提案を使用しませんでした。他の誰かがこれを行う方法を持っているかどうかを知りたいのですが、これを思いつくことができました:

Option Explicit

Sub test()
Dim ABC As Scripting.Dictionary
Dim v As Range
Dim n As Integer
    n = 5
    Set ABC = FindRepeatingWordChains(n, ActiveDocument)
    ' This is a dictionary of word ranges (not the same as an Excel range) that contains the listing of each word chain/phrase of length n (5 from the above example).
    ' Loop through this collection to make your selections/highlights/whatever you want to do.
    If Not ABC Is Nothing Then
        For Each v In ABC
            v.Font.Color = wdColorRed
        Next v
    End If
End Sub

' This is where the real code begins.
Function FindRepeatingWordChains(ChainLenth As Integer, DocToCheck As Document) As Scripting.Dictionary
Dim DictWords As New Scripting.Dictionary, DictMatches As New Scripting.Dictionary
Dim sChain As String
Dim CurWord As Range
Dim MatchCount As Integer
Dim i As Integer

    MatchCount = 0

    For Each CurWord In DocToCheck.Words
        ' Make sure there are enough remaining words in our document to handle a chain of the length specified.
        If Not CurWord.Next(wdWord, ChainLenth - 1) Is Nothing Then
            ' Check for non-printing characters in the first/last word of the chain.
            ' This code will read a vbCr, etc. as a word, which is probably not desired.
            ' However, this check does not exclude these 'words' inside the chain, but it can be modified.
            If CurWord <> vbCr And CurWord <> vbNewLine And CurWord <> vbCrLf And CurWord <> vbLf And CurWord <> vbTab And _
                CurWord.Next(wdWord, ChainLenth - 1) <> vbCr And CurWord.Next(wdWord, ChainLenth - 1) <> vbNewLine And _
                CurWord.Next(wdWord, ChainLenth - 1) <> vbCrLf And CurWord.Next(wdWord, ChainLenth - 1) <> vbLf And _
                CurWord.Next(wdWord, ChainLenth - 1) <> vbTab Then
                sChain = CurWord
                For i = 1 To ChainLenth - 1
                    ' Add each word from the current word through the next ChainLength # of words to a temporary string.
                    sChain = sChain & " " & CurWord.Next(wdWord, i)
                Next i

                ' If we already have our temporary string stored in the dictionary, then we have a match, assign the word range to the returned dictionary.
                ' If not, then add it to the dictionary and increment our index.
                If DictWords.Exists(sChain) Then
                    MatchCount = MatchCount + 1
                    DictMatches.Add DocToCheck.Range(CurWord.Start, CurWord.Next(wdWord, ChainLenth - 1).End), MatchCount
                Else
                    DictWords.Add sChain, sChain
                End If
            End If
        End If
    Next CurWord

    ' If we found any matching results, then return that list, otherwise return nothing (to be caught by the calling function).
    If DictMatches.Count > 0 Then
        Set FindRepeatingWordChains = DictMatches
    Else
        Set FindRepeatingWordChains = Nothing
    End If

End Function

このソースの 258 ページのドキュメント ( TheStory.txt) でこれをテストしたところ、わずか数分で実行されました。

test()使用方法についてはサブを参照してください。

オブジェクトを使用するには、Microsoft Scripting Runtime を参照する必要がありScripting.Dictionaryます。Collectionsそれが望ましくない場合は、代わりに使用するために小さな変更を加えることができますDictionaryが、便利な.Exists()方法があるため、 を好みます。

于 2012-07-20T15:05:21.713 に答える
2

私はかなり不十分な理論を選択しましたが、うまくいくようです (少なくとも、私は理解が遅い場合があるため、質問が正しければ)。テキスト全体を文字列に読み込み、個々の単語を配列に読み込み、配列をループして文字列を連結し、毎回 3 つの連続する単語を含めます。
結果はすでに 3 つの単語グループに含まれているため、4 つの単語グループ以上が自動的に認識されます。

Option Explicit

Sub Find_Duplicates()

On Error GoTo errHandler

Dim pSingleLine                     As Paragraph
Dim sLine                           As String
Dim sFull_Text                      As String
Dim vArray_Full_Text                As Variant

Dim sSearch_3                       As String
Dim lSize_Array                     As Long
Dim lCnt                            As Long
Dim lCnt_Occurence                  As Long


'Create a string from the entire text
For Each pSingleLine In ActiveDocument.Paragraphs
    sLine = pSingleLine.Range.Text
    sFull_Text = sFull_Text & sLine
Next pSingleLine

'Load the text into an array
vArray_Full_Text = sFull_Text
vArray_Full_Text = Split(sFull_Text, " ")
lSize_Array = UBound(vArray_Full_Text)


For lCnt = 1 To lSize_Array - 1
    lCnt_Occurence = 0
    sSearch_3 = Trim(fRemove_Punctuation(vArray_Full_Text(lCnt - 1) & _
                    " " & vArray_Full_Text(lCnt) & _
                    " " & vArray_Full_Text(lCnt + 1)))

    With Selection.Find
        .Text = sSearch_3
        .Forward = True
        .Replacement.Text = ""
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False

        Do While .Execute

            lCnt_Occurence = lCnt_Occurence + 1
            If lCnt_Occurence > 1 Then
                Selection.Range.Font.Color = vbRed
            End If
            Selection.MoveRight
        Loop
    End With

    Application.StatusBar = lCnt & "/" & lSize_Array
Next lCnt

errHandler:
Stop

End Sub

Public Function fRemove_Punctuation(sString As String) As String

Dim vArray(0 To 8)      As String
Dim lCnt                As Long


vArray(0) = "."
vArray(1) = ","
vArray(2) = ","
vArray(3) = "?"
vArray(4) = "!"
vArray(5) = ";"
vArray(6) = ":"
vArray(7) = "("
vArray(8) = ")"

For lCnt = 0 To UBound(vArray)
    If Left(sString, 1) = vArray(lCnt) Then
        sString = Right(sString, Len(sString) - 1)
    ElseIf Right(sString, 1) = vArray(lCnt) Then
        sString = Left(sString, Len(sString) - 1)
    End If
Next lCnt

fRemove_Punctuation = sString

End Function

このコードは、箇条書きのない連続したテキストを想定しています。

于 2012-07-25T17:12:11.767 に答える