powershell - Strip the BOM from a generated string (not a file)

Question

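I'm working with strings that look like MS Office documents. Note that this example contains two BOM "characters": one at the start of the string and one in the body. Sometimes there are several of them, sometimes none. In the PowerShell console they are displayed as ?.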

ï»¿<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=unicode"><meta name=Generator content="Microsoft Word 14 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
    {font-family:Calibri;
    panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
    <snip - bunch of style defs>
--></style></head><body lang=EN-US link=blue vlink=purple><div class=WordSection1>
<p class=MsoNormal style='text-autospace:none'>
 <span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>ï»¿</span>
 <span style='font-size:12.0pt;font-family:"Times New Roman","serif"'>Testing <o:p></o:p></span>
</p></div></body></html>

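The strings come from an object, so I can't simply force UTF8 encoding with Get-Content. How else can I strip these characters out? The output is only being piped to the display, and all I want is to get rid of the extra characters, so I'm not worried about losing them. I'm also stripping out the HTML.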

score 2 · Accepted Answer

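If the strings may also contain other, genuine UTF8 characters, another way to do this is the route below. It does assume, however, that the byte order mark characters are at the start of each string.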

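# Collect the raw bytes of every string (each character must fit in one byte, 0-255).
$bytes = @()
$strs | ForEach-Object { $bytes += [byte[]][char[]]$_ }

# Load the bytes into an in-memory stream and rewind it.
$memStream = New-Object System.IO.MemoryStream
$memStream.Write($bytes, 0, $bytes.Length)
$memStream.Position = 0

# Decode as UTF-8; StreamReader skips a BOM at the start of the stream.
$reader = New-Object System.IO.StreamReader($memStream, [System.Text.Encoding]::UTF8)
$reader.ReadToEnd()
$reader.Dispose()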
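
The sample in the question also has a BOM in the middle of the body. As far as I know, StreamReader only skips one at the very start of the stream. A minimal sketch of a variant that handles a BOM anywhere in the string follows. The function name Remove-MisdecodedBom is hypothetical, and the sketch assumes the text was mis-decoded as Latin-1 (code page 28591), so every character fits in one byte. Characters above U+00FF would be lost in the round trip.

```powershell
# Hypothetical helper, not part of the original answer.
# Assumption: the string holds UTF-8 bytes that were mis-decoded as Latin-1,
# so each character maps back to exactly one byte.
function Remove-MisdecodedBom {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [AllowEmptyString()]
        [string]$Text
    )
    process {
        # Recover the original bytes (EF BB BF for each BOM).
        $latin1 = [System.Text.Encoding]::GetEncoding(28591)
        $bytes = $latin1.GetBytes($Text)

        # Decode as UTF-8; GetString keeps each BOM as the character U+FEFF.
        $decoded = [System.Text.Encoding]::UTF8.GetString($bytes)

        # Drop every U+FEFF, wherever it appears in the string.
        $decoded.Replace([string][char]0xFEFF, '')
    }
}

# Build a sample from explicit code points so the script file's encoding does not matter.
$bom = "$([char]0xEF)$([char]0xBB)$([char]0xBF)"
$sample = $bom + '<p>' + $bom + 'Testing</p>'
$result = Remove-MisdecodedBom $sample
$result
```

Because the function accepts pipeline input, something like `$strs | Remove-MisdecodedBom` should clean each string separately instead of joining them into one stream.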

score 1 · Accepted Answer

When asking for help, you should include the code you use to get the output. Does this work?

$s = 'ï»¿<html>'    # placeholder: substitute your code that gets the output
$s -replace 'ï»¿'   # returns the output without the characters

or

( code that creates output ) -replace "ï»¿"
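
The literal "ï»¿" in the pattern depends on how the script file itself is encoded, and it will not match a BOM that was decoded correctly as the single character U+FEFF. A hedged sketch that covers both forms using regex escapes is shown here. The variable names are illustrative, and -creplace is used so the match is case-sensitive.

```powershell
# Sample built from explicit code points; in practice $s would be your real output.
$misdecoded = "$([char]0xEF)$([char]0xBB)$([char]0xBF)"   # the three characters shown as ï»¿
$realBom = [string][char]0xFEFF                           # a correctly decoded BOM
$s = $misdecoded + '<span>' + $realBom + '</span>Testing'

# Match either the three-character mis-decoded form or a real U+FEFF.
$bomPattern = '\u00EF\u00BB\u00BF|\uFEFF'

# With no replacement argument, -creplace removes every match.
$clean = $s -creplace $bomPattern
$clean
```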
