
[Possible duplicate] However, I could not find answers to the following questions.

For the past two days I have been researching voice recognition, but I could not find answers to my questions.

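  1. Can voice recognition run as a service? I want to implement something like this: the phone should call a number by voice command even while it is in sleep mode.
  2. Does voice recognition work correctly and recognize words while I am travelling on a train, bus, etc.?
  3. Is there any sensor, other than voice recognition, that can detect voice?
  4. Does the user have to speak close to the phone for voice recognition to work properly?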

1 Answer


1) Putting voice recognition into a service is the proper approach, as the Google API does, with callback methods delivering the results. To keep it running continuously, the service must hold a wake lock, which prevents the device from falling into sleep mode. More information is provided here: Wake locks android service recurring. This has one big disadvantage: high battery usage, caused by the CPU working continuously and continuously processing the incoming sound data. (This can be reduced with filters, thresholds, etc.)

2) Voice recognition is not a simple task. It requires a huge number of calculations and a lot of reference data. If the input audio is not clear (noise, several human voices, etc.), it is harder to get proper output. To improve accuracy, filter the input audio: noise suppression, a low-pass filter, etc. You cannot expect 100% accuracy, but 80-95% can be achieved.
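To illustrate the low-pass filtering mentioned above, here is a minimal sketch of a single-pole (exponential smoothing) filter over 16-bit PCM samples. The class name LowPassFilter, the method apply and the parameter alpha are made up for this example and do not come from any speech library; a real recognizer front end would likely use something more elaborate.

```java
// Minimal sketch, not a production filter: single-pole low-pass over 16-bit PCM.
// LowPassFilter, apply and alpha are hypothetical names chosen for illustration.
final class LowPassFilter {
    private LowPassFilter() {}

    /**
     * Smooths the input with y[n] = y[n-1] + alpha * (x[n] - y[n-1]).
     * alpha must be in (0, 1]; smaller values smooth more aggressively,
     * alpha == 1 leaves the signal unchanged.
     */
    static short[] apply(short[] samples, double alpha) {
        if (alpha <= 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be in (0, 1]");
        }
        short[] out = new short[samples.length];
        double y = 0.0; // filter state, assumed to start from silence
        for (int i = 0; i < samples.length; i++) {
            y += alpha * (samples[i] - y);
            out[i] = (short) Math.round(y); // y never exceeds the input range
        }
        return out;
    }
}
```

Calling LowPassFilter.apply(buffer, 0.1) on each recorded buffer strongly damps fast sample-to-sample changes while slower changes pass through; the right alpha depends on the sample rate and has to be tuned by ear or measurement.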

Filtering out multiple human voices is harder. However, simple amplitude (audio strength level) algorithms with an adaptive threshold can decide when a word begins and ends. The idea is that the proper voice is the loudest one, i.e. the one nearest to the phone/device. So, regarding 4), accuracy is better when the user speaks close to the microphone, because that voice is then the loudest.
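The amplitude idea above can be sketched roughly as follows. This is only one possible reading of "adaptive threshold": the audio is cut into fixed-size frames, a noise floor is tracked on quiet frames, and a frame counts as speech when its mean absolute amplitude exceeds a multiple of that floor. All names (AmplitudeEndpointer, findSegments, frameSize, ratio, minLevel) and the 0.9/0.1 smoothing weights are assumptions made for this example.

```java
// Minimal sketch of amplitude-based word boundary detection with an adaptive threshold.
// All names and constants here are hypothetical and chosen for illustration only.
final class AmplitudeEndpointer {
    private AmplitudeEndpointer() {}

    /** Mean absolute amplitude of samples[from, to). */
    static double meanAbs(short[] samples, int from, int to) {
        long sum = 0;
        for (int i = from; i < to; i++) {
            sum += Math.abs((int) samples[i]);
        }
        return (double) sum / (to - from);
    }

    /**
     * Returns {startFrame, endFrameExclusive} pairs for runs of loud frames.
     * A frame is loud when its level exceeds max(minLevel, noiseFloor * ratio).
     */
    static java.util.List<int[]> findSegments(short[] samples, int frameSize,
                                              double ratio, double minLevel) {
        if (frameSize <= 0) {
            throw new IllegalArgumentException("frameSize must be positive");
        }
        java.util.List<int[]> segments = new java.util.ArrayList<>();
        int frames = samples.length / frameSize; // a trailing partial frame is ignored
        double noiseFloor = -1.0;                // unknown until the first frame
        int start = -1;                          // first frame of the open segment, or -1
        for (int f = 0; f < frames; f++) {
            double level = meanAbs(samples, f * frameSize, (f + 1) * frameSize);
            if (noiseFloor < 0) {
                noiseFloor = level; // assume the recording starts with background noise
            }
            double threshold = Math.max(minLevel, noiseFloor * ratio);
            if (level > threshold) {
                if (start < 0) {
                    start = f; // word begins
                }
            } else {
                if (start >= 0) {
                    segments.add(new int[] {start, f}); // word ends
                    start = -1;
                }
                // Adapt the noise floor on quiet frames only, so speech does not raise it.
                noiseFloor = 0.9 * noiseFloor + 0.1 * level;
            }
        }
        if (start >= 0) {
            segments.add(new int[] {start, frames}); // audio ended inside a word
        }
        return segments;
    }
}
```

For example, findSegments(buffer, 160, 3.0, 200.0) treats each 160-sample frame as loud when it is at least three times the tracked background level. The sketch assumes the recording starts with background noise; if it starts in the middle of speech, the threshold stays too high until enough quiet frames have been seen.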

3) I don't know what you mean by sensor, but there are algorithms that simply detect a human voice rather than decode words. These algorithms are called Voice Activity Detection (VAD). Some code can be found in the Speex project documentation: http://www.speex.org/

The simplest way to handle voice recognition is to use the Google Speech API, which is pretty good and recognizes plenty of languages, but it needs an Internet connection and takes a while to return a result.
CMU Sphinx is faster, but it has few language models and needs more RAM and processor time, since all decoding is done on the device. In my opinion it is very good when the dictionary (the set of words to be recognized) is small, such as commands (left, right, backward, stop, start, etc.).

answered 2012-12-25T02:32:57.943