audio - Note onset detection

Question

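I am developing a system to help musicians with transcription. The aim is to perform automatic music transcription on a monophonic recording of a single instrument (it does not have to be perfect, since the user will correct glitches and mistakes afterwards). Does anyone here have experience with automatic music transcription? Or with digital signal processing in general? Help from anyone is greatly appreciated, whatever your background.

So far I have investigated using the Fast Fourier Transform for pitch detection. A number of tests, both in MATLAB and in my own Java test programs, have shown it to be good enough for my needs. Displaying the resulting MIDI data as sheet music is another part of the task, but I am not concerned with that for now.

In brief, what I am looking for is a good method for note onset detection, i.e. finding the positions in the signal where a new note begins. Since slow onsets can be quite difficult to detect properly, I will initially use the system with piano recordings. This is also partly because I play the piano, so I should be in a better position to obtain suitable recordings for testing. As stated above, early versions of the system will be used for simple monophonic recordings, possibly moving on to more complex input later, depending on the progress made in the coming weeks.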

score 50 · Accepted Answer

Here is a graphic that illustrates the threshold approach to note onset detection:

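[Image: waveform of three notes, with the threshold as a red line and the detected note starts as blue lines]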

This image shows a typical WAV file with three discrete notes played in succession. The red line represents a chosen signal threshold, and the blue lines represent note start positions returned by a simple algorithm that marks a start when the signal level crosses the threshold.

As the image shows, selecting a proper absolute threshold is difficult. In this case, the first note is picked up fine, the second note is missed completely, and the third note is (barely) picked up, but very late. In general, lowering the threshold causes you to pick up phantom notes, while raising it causes you to miss notes. One solution to this problem is to use a relative threshold that triggers a start if the signal increases by a certain percentage over a certain time, but this has problems of its own.
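To make the relative-threshold idea concrete, here is a minimal sketch in Java (the language of the asker's test programs; the other snippets in this answer are C#). It compares the average level of each block with the previous block. The class name, the block size, the ratio and the noise floor are all illustrative assumptions, not values taken from the answer.

```java
import java.util.ArrayList;
import java.util.List;

final class RelativeThresholdSketch {
    /**
     * Returns the start index of every block whose average level is at
     * least `ratio` times the previous block's level (hypothetical helper).
     */
    static List<Integer> detectOnsets(short[] samples, int blockSize,
                                      double ratio, double floor) {
        List<Integer> onsets = new ArrayList<>();
        double previous = 0.0;
        for (int start = 0; start + blockSize <= samples.length; start += blockSize) {
            double sum = 0.0;
            for (int i = start; i < start + blockSize; i++) {
                sum += Math.abs((int) samples[i]); // widen first: abs of -32768 overflows a short
            }
            double level = sum / blockSize;
            // The floor (an assumption) stops tiny rises in near-silence from triggering.
            if (level > floor && level > previous * ratio) {
                onsets.add(start);
            }
            previous = level;
        }
        return onsets;
    }
}
```

A call such as `detectOnsets(samples, 1024, 1.5, 200.0)` would flag blocks that are 50% louder than the one before. One of the "problems of its own" shows up right away: a slow attack spread over several blocks may never produce a large enough jump between two neighbouring blocks.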

A simpler solution is to use the somewhat-counterintuitively named compression (not MP3 compression - that's something else entirely) on your wave file first. Compression essentially flattens the spikes in your audio data and then amplifies everything so that more of the audio is near the maximum values. The effect on the above sample would look like this (which shows why the name "compression" appears to make no sense - on audio equipment it's usually labelled "loudness"):

[Image: the same waveform after compression]

After compression, the absolute threshold approach will work much better (although it's easy to over-compress and start picking up fictional note starts, the same effect as lowering the threshold). There are a lot of wave editors out there that do a good job of compression, and it's better to let them handle this task - you'll probably need to do a fair amount of work "cleaning up" your wave files before detecting notes in them anyway.

In coding terms, a WAV file loaded into memory is essentially just an array of two-byte integers, where 0 represents no signal and 32,767 and -32,768 represent the peaks. In its simplest form, a threshold detection algorithm would just start at the first sample and read through the array until it finds a value greater than the threshold.

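short threshold = 10000;
for (int i = 0; i < samples.Length; i++)
{
    // widen to int before taking the magnitude:
    // Math.Abs(short.MinValue) would throw an OverflowException
    if (Math.Abs((int)samples[i]) > threshold)
    {
        // here is one note onset point
    }
}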

In practice this works horribly, since normal audio has all sorts of transient spikes above a given threshold. One solution is to use a running average signal strength (i.e. don't mark a start until the average of the last n samples is above the threshold).

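short threshold = 10000;
int window_length = 100;
int running_total = 0;
bool above = false; // true while the average is over the threshold
// tally up the magnitudes of the first window_length samples
for (int i = 0; i < window_length; i++)
{
    running_total += Math.Abs((int)samples[i]);
}
// calculate moving average of the signal magnitude
for (int i = window_length; i < samples.Length; i++)
{
    // remove oldest sample and add current; magnitudes are used because
    // the signed samples themselves average out to roughly zero
    running_total -= Math.Abs((int)samples[i - window_length]);
    running_total += Math.Abs((int)samples[i]);
    int moving_average = running_total / window_length;
    if (moving_average > threshold && !above)
    {
        // here is one note onset point (centre of the window)
        int onset_point = i - (window_length / 2);
    }
    above = moving_average > threshold;
}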

All of this requires much tweaking and playing around with settings to get it to find the start positions of a WAV file accurately, and usually what works for one file will not work very well on another. This is a very difficult and not-perfectly-solved problem domain you've chosen, but I think it's cool that you're tackling it.

Update: this graphic shows a detail of note detection I left out, namely detecting when the note ends:

[Image: waveform with the off-threshold as a yellow line and the detected note ends as purple lines]

The yellow line represents the off-threshold. Once the algorithm has detected a note start, it assumes the note continues until the running average signal strength drops below this value (shown here by the purple lines). This is, of course, another source of difficulties, as is the case where two or more notes overlap (polyphony).
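A rough sketch of this start/stop logic in Java, assuming a running average of sample magnitudes and two separate thresholds. The names `segment`, `onThreshold` and `offThreshold` are illustrative only, and polyphony is not handled.

```java
import java.util.ArrayList;
import java.util.List;

final class NoteSegmenterSketch {
    /** Returns {start, end} sample-index pairs; end is exclusive. */
    static List<int[]> segment(short[] samples, int window,
                               int onThreshold, int offThreshold) {
        List<int[]> notes = new ArrayList<>();
        long runningTotal = 0;
        boolean inNote = false;
        int noteStart = 0;
        for (int i = 0; i < samples.length; i++) {
            runningTotal += Math.abs((int) samples[i]);
            if (i >= window) {
                runningTotal -= Math.abs((int) samples[i - window]);
            }
            if (i < window - 1) {
                continue; // the window is not full yet
            }
            long average = runningTotal / window;
            if (!inNote && average > onThreshold) {
                inNote = true;       // running average rose above the on-threshold
                noteStart = i;
            } else if (inNote && average < offThreshold) {
                inNote = false;      // running average fell below the off-threshold
                notes.add(new int[] {noteStart, i});
            }
        }
        if (inNote) {
            notes.add(new int[] {noteStart, samples.length}); // note still sounding at the end
        }
        return notes;
    }
}
```

Keeping the off-threshold well below the on-threshold means a note whose level hovers around a single threshold is not split into many short notes. Note that both positions lag the true start and end by up to one window length, because the average needs time to react.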

Once you've detected the start and stop points of each note, you can now analyze each slice of WAV file data to determine the pitches.

Update 2: I just read your updated question. Pitch-detection through auto-correlation is much easier to implement than FFT if you're writing your own from scratch, but if you've already checked out and used a pre-built FFT library, you're better off using it for sure. Once you've identified the start and stop positions of each note (and included some padding at the beginning and end for the missed attack and release portions), you can now pull out each slice of audio data and pass it to an FFT function to determine the pitch.
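For reference, a bare-bones auto-correlation pitch estimator might look like the following Java sketch. It returns the first strong peak of the autocorrelation within a caller-supplied frequency range. The 0.9 peak-acceptance factor and all the names are assumptions made for illustration, and a real recording will need more care (windowing, interpolation between lags, octave-error handling).

```java
final class AutocorrelationPitchSketch {
    /** Estimates the fundamental frequency in Hz, or returns -1 if none is found. */
    static double estimatePitch(short[] slice, double sampleRate,
                                double minHz, double maxHz) {
        int minLag = Math.max(1, (int) (sampleRate / maxHz));
        int maxLag = Math.min((int) (sampleRate / minHz), slice.length - 2);
        if (maxLag - minLag < 2) {
            return -1.0; // slice too short for the requested range
        }
        double[] score = new double[maxLag + 1];
        double best = 0.0;
        for (int lag = minLag; lag <= maxLag; lag++) {
            double sum = 0.0;
            for (int i = 0; i + lag < slice.length; i++) {
                sum += (double) slice[i] * slice[i + lag];
            }
            // Normalise by the number of terms so long lags are not penalised.
            score[lag] = sum / (slice.length - lag);
            best = Math.max(best, score[lag]);
        }
        if (best <= 0.0) {
            return -1.0; // silence or no positive correlation
        }
        // Take the first local peak that is nearly as strong as the best one;
        // picking the global maximum tends to land on a multiple of the period.
        for (int lag = minLag + 1; lag < maxLag; lag++) {
            boolean isPeak = score[lag] > score[lag - 1] && score[lag] >= score[lag + 1];
            if (isPeak && score[lag] >= 0.9 * best) {
                return sampleRate / lag;
            }
        }
        return -1.0;
    }
}
```

The resolution is limited to whole-sample lags, so at 44.1 kHz a note near 440 Hz can only come out as 44100/100 = 441 Hz or 44100/101 ≈ 436.6 Hz. That is usually close enough to pick the right MIDI note.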

One important point here is not to use a slice of the compressed audio data, but rather to use a slice of the original, unmodified data. The compression process distorts the audio and may produce an inaccurate pitch reading.

One last point about note attack times is that they may be less of a problem than you think. Often in music an instrument with a slow attack (like a soft synth) will begin a note earlier than a sharp attack instrument (like a piano) and both notes will sound as if they're starting at the same time. If you're playing instruments in this manner, the algorithm will pick up the same start time for both kinds of instruments, which is good from a WAV-to-MIDI perspective.

Last update (I hope): Forget what I said about including some padding samples from the early attack part of each note - I forgot this is actually a bad idea for pitch detection. The attack portions of many instruments (especially piano and other percussive-type instruments) contain transients that aren't multiples of the fundamental pitch, and will tend to screw up pitch detection. You actually want to start each slice a little after the attack for this reason.

Oh, and kind of important: the term "compression" here does not refer to MP3-style compression.

Update again: here is a simple function that does non-dynamic compression:

public void StaticCompress(short[] samples, float param)
{
    for (int i = 0; i < samples.Length; i++)
    {
        int sign = (samples[i] < 0) ? -1 : 1;
        float norm = ABS(samples[i] / 32768.0); // floating-point division by 32768, NOT short.MaxValue
        norm = 1.0 - POW(1.0 - norm, param);
        samples[i] = 32768 * norm * sign;
    }
}

When param = 1.0, this function will have no effect on the audio. Larger param values (2.0 is good, which will square the normalized difference between each sample and the max peak value) will produce more compression and a louder overall (but crappy) sound. Values under 1.0 will produce an expansion effect.

One other probably obvious point: you should record the music in a small, non-echoic room since echoes are often picked up by this algorithm as phantom notes.

Update: here is a version of StaticCompress that will compile in C# and explicitly casts everything. This returns the expected result:

public void StaticCompress(short[] samples, double param)
{
    for (int i = 0; i < samples.Length; i++)
    {
        Compress(ref samples[i], param);
    }
}

public void Compress(ref short orig, double param)
{
    double sign = 1;
    if (orig < 0)
    {
        sign = -1;
    }
    // 32768 is max abs value of a short. best practice is to pre-
    // normalize data or use peak value in place of 32768
    double norm = Math.Abs((double)orig / 32768.0);
    norm = 1.0 - Math.Pow(1.0 - norm, param);
    orig = (short)(32768.0 * norm * sign); // should round before cast,
        // but won't affect note onset detection
}

Sorry, my knowledge score on Matlab is 0. If you posted another question on why your Matlab function doesn't work as expected it would get answered (just not by me).

score 4 · Answer

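You should take a look at MIRToolbox (it is written for Matlab and has an onset detector built in); it works pretty well. The source code is GPL, so you can implement the algorithm in whatever language suits you. What language will your production code use?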

score 4 · Answer

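What you want to do is often called WAV-to-MIDI (google "wav-to-midi"). There have been many attempts at this process, with varying results (note onset is one of the difficulties; polyphony is much harder to deal with). I'd recommend starting with a thorough search of the off-the-shelf solutions, and only starting work on your own if nothing acceptable is out there.

The other part of the process you'd need is something to render the MIDI output as a traditional musical score, but there are countless products that do that.

Another answer is: yes, I've done a lot of digital signal processing (see the software on my website - it's an infinite-voice software synthesizer written in VB and C). The WAV-to-MIDI part isn't really that difficult conceptually; making it work reliably in practice is the hard part. Note onset is just a matter of setting a threshold, and errors can easily be adjusted forward or backward in time to compensate for differences in note attack. Pitch detection is much easier to do on a recording than in real time, and involves just implementing an auto-correlation routine.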

score 3 · Answer

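Hard onsets are easily detected in the time domain by using an average energy measurement:

sum from n = 0 to N of x[n]^2

Do this over chunks of the entire signal. You should see peaks where onsets occur (the window size is up to you; my suggestion is 50 ms or more).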
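A minimal Java sketch of that measurement, assuming consecutive non-overlapping chunks and 16-bit samples (the class and method names are illustrative). It divides each sum by the chunk length to give the average energy per chunk.

```java
final class ChunkEnergySketch {
    /** Mean of x[n]^2 for each consecutive, non-overlapping chunk. */
    static double[] chunkEnergies(short[] samples, int chunkSize) {
        int chunks = samples.length / chunkSize; // a trailing partial chunk is ignored
        double[] energy = new double[chunks];
        for (int c = 0; c < chunks; c++) {
            double sum = 0.0;
            for (int i = c * chunkSize; i < (c + 1) * chunkSize; i++) {
                sum += (double) samples[i] * samples[i];
            }
            energy[c] = sum / chunkSize;
        }
        return energy;
    }
}
```

At a 44.1 kHz sample rate, a 50 ms chunk is 2205 samples, so `chunkEnergies(samples, 2205)` gives 20 energy values per second. A chunk whose energy jumps well above that of its predecessor is then a candidate onset.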

Extensive papers on onset detection:

For hardcore engineers:

http://www.nyu.edu/classes/bello/MIR_files/2005_BelloEtAl_IEEE_TSALP.pdf

Easier for the average person to understand:

http://bingweb.binghamton.edu/~ahess2/Onset_Detection_Nov302011.pdf

score 3 · Answer

This library is centered around audio labelling:

aubio

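aubio is a library for audio labelling. Its features include segmenting a sound file before each of its attacks, performing pitch detection, tapping the beat and producing MIDI streams from live audio. The name aubio comes from "audio" with a typo: several transcription errors are likely to be found in the results too.

I have had good luck with it for onset detection and pitch detection. It's written in C, but there are SWIG/Python wrappers.

Also, the author of the library has a PDF of his thesis on the page, which has great information and background about labelling.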

score -1 · Answer

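You could try transforming the WAV signal into a graph of amplitude against time. A way to determine a consistent onset is then to calculate where the tangent at the inflection point of the signal's rising flank intersects the x-axis.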
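One possible reading of this, sketched in Java: take an amplitude envelope containing a single rising flank, find the point of steepest rise (the inflection point), and extend the tangent there down to zero amplitude. How the envelope is computed is left open here, and the names are assumptions.

```java
final class TangentOnsetSketch {
    /**
     * Returns the (fractional) index where the tangent at the steepest
     * point of the envelope crosses zero, or -1 if the envelope never rises.
     */
    static double tangentOnset(double[] envelope) {
        int steepest = -1;
        double maxSlope = 0.0;
        for (int i = 1; i < envelope.length - 1; i++) {
            // Central difference approximates the slope at index i.
            double slope = (envelope[i + 1] - envelope[i - 1]) / 2.0;
            if (slope > maxSlope) {
                maxSlope = slope;
                steepest = i;
            }
        }
        if (steepest < 0) {
            return -1.0;
        }
        // Tangent: y = envelope[steepest] + maxSlope * (x - steepest); solve y = 0.
        return steepest - envelope[steepest] / maxSlope;
    }
}
```

For a recording with several notes, this would have to be applied separately to each rising flank, for example to the envelope around each onset found by one of the threshold methods.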
