audio - Google Speech Recognition API: timestamp for each word?

Question

It is possible to get the transcript of an audio file (WAV, MP3, etc.) from Google's speech recognition API by making a request to http://www.google.com/speech-api/v2/recognize?...

Example: I said "one two three four five" in a WAV file. The Google API gives me this:

{
  u'alternative':
  [
    {u'transcript': u'12345'},
    {u'transcript': u'1 2 3 4 5'},
    {u'transcript': u'one two three four five'}
  ],
  u'final': True
}

Question: is it possible to get the time (in seconds) at which each word was spoken?

With my example:

['one', 0.23, 0.80], ['two', 1.03, 1.45], ['three', 1.79, 2.35], etc.

That is, the word "one" was spoken between 00:00:00.23 and 00:00:00.80,
and the word "two" was spoken between 00:00:01.03 and 00:00:01.45 (in seconds).
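
For illustration, the mapping between the float offsets in the list above and the HH:MM:SS notation can be sketched as follows. `format_offset` is a hypothetical helper name chosen for this example, not part of any speech API.

```python
def format_offset(seconds):
    """Format a non-negative offset in seconds as HH:MM:SS.cc (centiseconds)."""
    # Work in whole centiseconds to avoid float artifacts when splitting fields.
    centis = round(seconds * 100)
    hours, rest = divmod(centis, 360000)
    minutes, rest = divmod(rest, 6000)
    secs, centis = divmod(rest, 100)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{centis:02d}"


# Hypothetical data in the shape asked for in the question.
for word, start, end in [["one", 0.23, 0.80], ["two", 1.03, 1.45]]:
    print(word, format_offset(start), format_offset(end))
```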

PS: I'm looking for an API that supports languages other than English, especially French.

score 13 · Accepted Answer

EDIT 2020: this is now possible, see the other answer.

It is not possible with the Google API.

If you need word timestamps, you can use other APIs such as:

Vosk-API - a free offline speech recognition API (disclosure: I am the primary author of Vosk).

SpeechMatics SaaS speech recognition API

IBM's speech recognition API
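
As a rough sketch of the first option, the Vosk Python bindings can return per-word times once `SetWords(True)` is called on the recognizer. The function names, `wav_path`, and `model_path` below are placeholders for this example; the input is assumed to be a mono 16-bit PCM WAV file, and the model directory is assumed to be one downloaded for the language you need (for instance a French model).

```python
import json


def words_from_vosk_result(result_json):
    """Turn one Vosk result (a JSON string) into [word, start, end] lists."""
    result = json.loads(result_json)
    # The "result" list is only present when word output is enabled
    # and at least one word was recognized in this chunk.
    return [[w["word"], w["start"], w["end"]] for w in result.get("result", [])]


def transcribe_with_timestamps(wav_path, model_path):
    """Run Vosk over a WAV file and collect [word, start, end] entries."""
    import wave
    from vosk import Model, KaldiRecognizer  # third-party: pip install vosk

    words = []
    with wave.open(wav_path, "rb") as wf:
        rec = KaldiRecognizer(Model(model_path), wf.getframerate())
        rec.SetWords(True)  # request per-word start/end times
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            # AcceptWaveform returns True when an utterance is complete.
            if rec.AcceptWaveform(data):
                words.extend(words_from_vosk_result(rec.Result()))
        # Flush whatever is still buffered at end of file.
        words.extend(words_from_vosk_result(rec.FinalResult()))
    return words
```

Calling `transcribe_with_timestamps("audio.wav", "model")` (both paths hypothetical) should then give a list in the `['one', 0.23, 0.80]` shape from the question, with times in seconds.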

score 9 · Accepted Answer

Yes, it is very much possible. All you need to do is:

Set enable_word_time_offsets=True in your config

config = types.RecognitionConfig(
        ....
        enable_word_time_offsets=True)

Then, for each word in the alternative, you can print its start time and end time as in this code:

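# `response` is the value returned by the recognition request.
for result in response.results:
    # The first alternative is the most likely transcription.
    alternative = result.alternatives[0]
    print('Transcript: {}'.format(alternative.transcript))
    print('Confidence: {}'.format(alternative.confidence))

    # Word entries are only populated when enable_word_time_offsets=True.
    for word_info in alternative.words:
        word = word_info.word
        start_time = word_info.start_time
        end_time = word_info.end_time
        # Combine the seconds and nanos fields into float seconds.
        print('Word: {}, start_time: {}, end_time: {}'.format(
            word,
            start_time.seconds + start_time.nanos * 1e-9,
            end_time.seconds + end_time.nanos * 1e-9))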

This gives you output in the following format:

Transcript:  Do you want me to give you a call back?
Confidence: 0.949534416199
Word: Do, start_time: 1466.0, end_time: 1466.6
Word: you, start_time: 1466.6, end_time: 1466.7
Word: want, start_time: 1466.7, end_time: 1466.8
Word: me, start_time: 1466.8, end_time: 1466.9
Word: to, start_time: 1466.9, end_time: 1467.1
Word: give, start_time: 1467.1, end_time: 1467.2
Word: you, start_time: 1467.2, end_time: 1467.3
Word: a, start_time: 1467.3, end_time: 1467.4
Word: call, start_time: 1467.4, end_time: 1467.6
Word: back?, start_time: 1467.6, end_time: 1467.7

Source: https://cloud.google.com/speech-to-text/docs/async-time-offsets
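
To get the `[word, start, end]` lists asked for in the question rather than printed lines, the same fields can be collected with a small helper. This is only a sketch: `to_seconds` and `word_timings` are hypothetical names, and the `timedelta` branch is there because newer releases of the google-cloud-speech client appear to expose offsets as `datetime.timedelta` instead of objects with `seconds` and `nanos` fields. The stand-in objects at the bottom only mimic the response shape so the snippet runs without credentials.

```python
from datetime import timedelta
from types import SimpleNamespace


def to_seconds(offset):
    """Convert a word offset to float seconds."""
    if isinstance(offset, timedelta):
        # Assumed shape in newer client versions.
        return offset.total_seconds()
    # Duration-like object with seconds and nanos, as in the loop above.
    return offset.seconds + offset.nanos * 1e-9


def word_timings(alternative):
    """Return [word, start, end] for every word in one alternative."""
    return [[w.word, to_seconds(w.start_time), to_seconds(w.end_time)]
            for w in alternative.words]


# Stand-in for a real alternative, for demonstration only.
fake_alternative = SimpleNamespace(words=[
    SimpleNamespace(word="one",
                    start_time=SimpleNamespace(seconds=0, nanos=230000000),
                    end_time=SimpleNamespace(seconds=0, nanos=800000000)),
    SimpleNamespace(word="two",
                    start_time=timedelta(seconds=1, milliseconds=30),
                    end_time=timedelta(seconds=1, milliseconds=450)),
])
print(word_timings(fake_alternative))
```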
