xml - How can I parse XML faster in PowerShell, or optimize my script further?

Question

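I have a setup containing 7 million XML files, ranging in size from a few KB to several MB. In total, that is about 180 GB of XML. The job I need to perform is to analyze each XML file, determine whether it contains the string <ref>, and, if it does not, move it from the Chunk folder it currently sits in to the Referenceless folder.

The script I wrote works well enough, but it is far too slow for my purposes. At roughly 3 files per second, it is on track to finish analyzing all 7 million files in about 24 days. Is there anything I can change in the script to improve its performance?

To complicate matters further, I don't have the right permissions on my server box to run .PS1 files, so the script has to be runnable from PowerShell as a single command. If I had the authority to set the permissions, I would.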

# This script will iterate through the Chunk folders, removing pages that contain no 
# references and putting them into the Referenceless folder.

# Change this variable to start the program on a different chunk. This is the first   
# command to be run in Windows PowerShell. 
$chunknumber = 1
#This while loop is the second command to be run in Windows PowerShell. It will stop after completing Chunk 113.
while($chunknumber -le 113){
    #Jumps the terminal to the correct folder.
    cd C:\Wiki_Pages
    #Creates an index for the chunk being worked on.
    $items = Get-ChildItem -Path "Chunk_$chunknumber"
    echo "Chunk $chunknumber Indexed"
    #Jumps to chunk folder.
    cd C:\Wiki_Pages\Chunk_$chunknumber
    #Loops through the index. Each entry is one of the pages.
    foreach ($page in $items){
        #Creates a variable holding the page's content.
        $content = Get-Content $page
        #If the page has a reference, then it's echoed.
        if($content | Select-String "<ref>" -quiet){echo "Referenced!"}
        #if the page doesn't have a reference, it's copied to Referenceless then deleted.
        else{
            Copy-Item $page C:\Wiki_Pages\Referenceless -force
            Remove-Item $page -force
            echo "Moved to Referenceless!"
        }
    }
    #The chunk number is increased by one and the cycle continues.
    $chunknumber = $chunknumber + 1
}

I know very little about PowerShell. Yesterday was the first time I had ever opened the program.

score 4 · Accepted Answer

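I suggest adding the -ReadCount 0 argument to the Get-Content command to speed it up (it helps a lot). I learned this tip from this great article, which shows that running a foreach over a file's entire contents is faster than parsing it through the pipeline.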
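A minimal sketch of that tip, assuming a helper named Test-HasReference (a hypothetical name, not part of the original script). It reads the file with -ReadCount 0 and then walks the lines with a foreach statement instead of piping them to Select-String:

```powershell
# Hypothetical helper: returns $true if any line of the file contains '<ref>'.
function Test-HasReference {
    param([string]$Path)

    # -ReadCount 0 hands back all lines as one array instead of
    # sending them down the pipeline one at a time.
    $lines = Get-Content -LiteralPath $Path -ReadCount 0

    # A foreach statement avoids per-line pipeline overhead.
    foreach ($line in $lines) {
        if ($line -and $line.Contains('<ref>')) { return $true }
    }
    return $false
}
```

Inside the original loop this would replace the Get-Content and Select-String pair, for example `if (Test-HasReference $page.FullName) { ... }`.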

You can also use Set-ExecutionPolicy Bypass -Scope Process to run scripts in your current PowerShell session without needing any additional permissions.

score 2 · Accepted Answer

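The PowerShell pipeline can be significantly slower than native system calls.

PowerShell: Pipeline Performance

In that article, a performance test is run between two equivalent commands, one executed in PowerShell and one in a classic Windows command prompt.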

PS> grep [0-9] numbers.txt | wc -l > $null
CMD> cmd /c "grep [0-9] numbers.txt | wc -l > nul"

Here is a sample of its output.

PS C:\temp> 1..5 | % { .\perf.ps1 ([Math]::Pow(10, $_)) }

10 iterations

   30 ms  (   0 lines / ms)  grep in PS
   15 ms  (   1 lines / ms)  grep in cmd.exe

100 iterations

   28 ms  (   4 lines / ms)  grep in PS
   12 ms  (   8 lines / ms)  grep in cmd.exe

1000 iterations

  147 ms  (   7 lines / ms)  grep in PS
   11 ms  (  89 lines / ms)  grep in cmd.exe

10000 iterations

 1347 ms  (   7 lines / ms)  grep in PS
   13 ms  ( 786 lines / ms)  grep in cmd.exe

100000 iterations

13410 ms  (   7 lines / ms)  grep in PS
   22 ms  (4580 lines / ms)  grep in cmd.exe

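Edit: My original answer to this question mentioned pipeline performance along with a few other suggestions. To keep this post concise, I have removed the other suggestions, which really had nothing to do with pipeline performance.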
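One way to keep the per-file check out of the pipeline entirely is to call .NET directly. This is a sketch of a swapped-in technique rather than anything taken from the linked article, and the function name Test-HasReferenceDirect is hypothetical:

```powershell
# Hypothetical helper: checks for '<ref>' without using the pipeline.
function Test-HasReferenceDirect {
    param([string]$FullPath)

    # .NET resolves relative paths against the process working directory,
    # not the current PowerShell location, so pass a full path
    # (for example $page.FullName).
    $text = [System.IO.File]::ReadAllText($FullPath)

    # Ordinal, case-sensitive substring search on the whole file.
    return $text.Contains('<ref>')
}
```

Note that this loads each file into memory in full, which is reasonable for files of a few MB but is an assumption worth checking against the largest files in the set.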

score 0 · Accepted Answer

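I would try using the Start-Job cmdlet to parse five files at a time. There are plenty of good articles on PowerShell jobs. If for some reason that doesn't help, and you are hitting an I/O or actual resource bottleneck, you could even use Start-Job and WinRM to spin up workers on other machines.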
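A rough sketch of the job-based approach, assuming a hypothetical function Invoke-ReferenceScan that splits a list of full paths across a handful of background jobs. Starting one job per file would likely cost more than it saves, so each job here receives a batch; the batch count of 5 is only a starting point for testing:

```powershell
# Hypothetical helper: scans files for '<ref>' in several background jobs.
function Invoke-ReferenceScan {
    param(
        [string[]]$Paths,        # full paths; jobs start in a different directory
        [int]$BatchCount = 5
    )

    $jobs = @(for ($b = 0; $b -lt $BatchCount; $b++) {
        # Deal the paths out round-robin so batches are similar in size.
        $batch = @(for ($i = $b; $i -lt $Paths.Count; $i += $BatchCount) { $Paths[$i] })
        if ($batch.Count -gt 0) {
            # The leading comma passes the whole batch as a single argument.
            Start-Job -ArgumentList (,$batch) -ScriptBlock {
                param($files)
                foreach ($file in $files) {
                    $hasRef = [bool](Select-String -LiteralPath $file -SimpleMatch -Pattern '<ref>' -Quiet)
                    New-Object PSObject -Property @{ Path = $file; HasRef = $hasRef }
                }
            }
        }
    })

    # Emit one result object per file, then clean up the finished jobs.
    $jobs | Wait-Job | Receive-Job
    $jobs | Remove-Job
}
```

The caller could then move every file whose HasRef is false, for example with Move-Item, once the results come back.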

score 0 · Accepted Answer

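Before you start optimizing, you need to determine exactly where you need to optimize. Are you I/O bound (the time it takes to read each file)? Memory bound (probably not)? CPU bound (the time it takes to search the content)?

You said these are XML files. Have you tested reading the files into an XML object (rather than plain text) and locating the <ref> node via XPath? You would then have: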
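As a starting point for that diagnosis, here is a small sketch that times the read and the search separately for one sample file using Measure-Command. The function name Measure-ScanCost is hypothetical, and the numbers are only indicative, since the operating system may cache the file after the first read:

```powershell
# Hypothetical helper: rough timing of read cost versus search cost.
function Measure-ScanCost {
    param([string]$FullPath)

    # Time spent pulling the file off disk (I/O).
    $readTime = Measure-Command { [void][System.IO.File]::ReadAllText($FullPath) }

    # Read again outside the timer so the search is measured on its own.
    $text = [System.IO.File]::ReadAllText($FullPath)

    # Time spent searching the in-memory content (CPU).
    $searchTime = Measure-Command { [void]$text.Contains('<ref>') }

    New-Object PSObject -Property @{
        ReadMs   = $readTime.TotalMilliseconds
        SearchMs = $searchTime.TotalMilliseconds
    }
}
```

Running this over a few dozen representative files should indicate whether reading or searching dominates before any deeper rewrite.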

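#Parses the page as XML rather than treating it as plain text.
$content = [xml](Get-Content $page)
#If the page has a ref node, then it's echoed. SelectSingleNode returns $null when nothing matches.
if($content.SelectSingleNode("//ref")){echo "Referenced!"}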

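If you have CPU, memory, and I/O resources to spare, you may see some improvement by searching multiple files in parallel. See this discussion on how to run several jobs in parallel. Obviously you can't run a large number at once, but with some testing you can find the sweet spot (probably around 3 to 5). Everything inside the foreach ($page in $items){ would become the script block for the job.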
