Speech recognition technology: development and analysis of its difficulties

Development of speech recognition technology

Talking to a machine and having it understand what you say: speech recognition technology has turned this once distant human dream into reality. Speech recognition acts as a "machine's auditory system": through recognition and understanding, it allows machines to convert speech signals into corresponding text or commands.

In 1952, Davis and colleagues at Bell Laboratories developed the world's first experimental system capable of recognizing the pronunciations of the 10 English digits. In 1960, Denes and others in Britain developed the first computer-based speech recognition system.

Large-scale speech recognition research began in the 1970s and made substantial progress on small-vocabulary, isolated-word recognition. After the 1980s, the focus of research gradually turned to large-vocabulary, speaker-independent continuous speech recognition.

At the same time, the research thinking behind speech recognition also changed fundamentally: the traditional approach based on template matching gave way to approaches based on statistical models. In addition, some experts in the field once again proposed introducing neural network techniques into speech recognition.

After the 1990s, there was no major breakthrough in the overall framework of speech recognition systems, but great progress was made in applying and commercializing the technology. For example, DARPA (the US Defense Advanced Research Projects Agency) funded a program in the 1970s to support research and development of language understanding systems. In the 1990s the DARPA program was still ongoing, but its focus had turned to the natural language processing side of recognizers, with the recognition task set to "air travel information retrieval."

Speech recognition research in China started in 1958, when the Institute of Acoustics of the Chinese Academy of Sciences used vacuum-tube circuits to recognize 10 vowels. Because of the limited conditions of the time, Chinese speech recognition research long remained in a stage of slow development. It was not until 1973 that the Institute of Acoustics began computer-based speech recognition.

Since the 1980s, with the gradual spread of computer applications in China and the further development of digital signal processing technology, many domestic institutions acquired the basic conditions for researching speech technology. At the same time, after years of silence, speech recognition had again become an international research hotspot. Against this background, many domestic institutions devoted themselves to this research.

In 1986, speech recognition was explicitly listed as a research topic, as an important part of research on intelligent computer systems. With the support of the "863" Program, China began to organize research on speech recognition technology and decided to hold a dedicated conference on speech recognition every two years. Since then, China's speech recognition technology has entered a new stage of development.

Since 2009, with progress in deep learning within the field of machine learning and the accumulation of big-data corpora, speech recognition technology has developed by leaps and bounds.

Deep learning from the machine learning field was introduced into acoustic model training for speech recognition: multi-layer neural networks with RBM (restricted Boltzmann machine) pre-training improved the accuracy of acoustic models. Microsoft researchers made the first breakthrough here: after adopting deep neural network (DNN) models, the speech recognition error rate fell by about 30%, the fastest progress in speech recognition technology in the previous 20 years.
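
As a rough illustration (not Microsoft's actual system), the sketch below shows the shape of such a DNN acoustic model in PyTorch: a stack of fully connected layers mapping each acoustic feature frame to scores over HMM states. The layer sizes, the 39-dimensional input, and the use of random initialization rather than RBM pre-training are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal DNN acoustic model sketch: feature frame -> scores over HMM states.
# Sizes are typical orders of magnitude, chosen here for illustration only.
model = nn.Sequential(
    nn.Linear(39, 512), nn.Sigmoid(),   # 39 = MFCCs plus deltas, a common front end
    nn.Linear(512, 512), nn.Sigmoid(),
    nn.Linear(512, 2000),               # ~2000 tied HMM states is a typical order
)

frames = torch.randn(8, 39)             # a batch of 8 toy feature frames
state_scores = model(frames)            # shape (8, 2000): one score per HMM state
print(state_scores.shape)
```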

Around 2009, most mainstream speech recognition decoders had adopted decoding networks based on weighted finite-state transducers (WFSTs). A WFST decoding network can integrate the language model, the pronunciation dictionary, and the acoustic context dependencies into one large network, which improves decoding speed and lays the groundwork for real-time speech recognition.
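
To make the idea of composing knowledge sources into one network concrete, here is a toy sketch of weighted transducer composition in Python. It is not a real WFST toolkit (production systems use libraries such as OpenFst); the transducer encoding and the tiny lexicon L and grammar G below are illustrative assumptions, and epsilon labels are handled naively via explicit self-loops.

```python
from collections import defaultdict

def compose(t1, t2):
    """Compose two weighted transducers (tropical semiring: weights add).
    Each transducer: {"start": s, "finals": {...},
                      "arcs": {state: [(in_lbl, out_lbl, weight, next_state), ...]}}"""
    arcs, start = defaultdict(list), (t1["start"], t2["start"])
    stack, seen = [start], {start}
    while stack:
        s1, s2 = stack.pop()
        for i1, o1, w1, n1 in t1["arcs"].get(s1, []):
            for i2, o2, w2, n2 in t2["arcs"].get(s2, []):
                if o1 == i2:                      # T1's output feeds T2's input
                    nxt = (n1, n2)
                    arcs[(s1, s2)].append((i1, o2, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    finals = {s for s in seen if s[0] in t1["finals"] and s[1] in t2["finals"]}
    return {"start": start, "finals": finals, "arcs": dict(arcs)}

# L: lexicon transducer mapping the phones "h ay" to the word "hi".
L = {"start": 0, "finals": {2}, "arcs": {
    0: [("h", "hi", 0.5, 1)],           # emit the word on the first phone arc
    1: [("ay", "<eps>", 0.0, 2)]}}
# G: one-word grammar accepting "hi", with an <eps> self-loop at its final state.
G = {"start": 0, "finals": {1}, "arcs": {
    0: [("hi", "hi", 1.0, 1)],
    1: [("<eps>", "<eps>", 0.0, 1)]}}

LG = compose(L, G)                       # phones in, words out: one decoding network
print(LG["arcs"])
```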

With the rapid development of the Internet and the spread of mobile terminals such as mobile phones, large amounts of text and speech corpora can now be obtained through many channels. This provides rich resources for training the language models and acoustic models used in speech recognition, making it possible to build general-purpose, large-scale language and acoustic models.

In speech recognition, the match and richness of the training data is one of the most important factors driving system performance. However, annotating and analyzing corpora requires long-term accumulation; with the arrival of the big-data era, the accumulation of large-scale corpus resources has risen to strategic importance.

Nowadays, voice recognition is most popular on mobile terminals, where voice dialogue robots, voice assistants, and other interactive tools emerge in an endless stream. Many Internet companies have invested manpower, materiel, and financial resources in research and applications in this area, aiming to use the new, convenient model of voice interaction to capture customers quickly.

Related products

Siri

Siri's technology comes from the CALO project announced by the US Defense Advanced Research Projects Agency: a digital assistant with learning, organization, and cognitive capabilities that lets military personnel simplify complicated tasks. Siri, the virtual personal assistant, is the civilian software spun off from that project.

Siri was founded in 2007, initially offering text-based chat services; later, through a collaboration with the speech recognition vendor Nuance, Siri added voice recognition. In 2010, Siri was acquired by Apple for $200 million.

Siri became the voice control feature Apple applied to its iPhone and iPad Air products, turning them into intelligent robots of a sort. Siri supports natural language input; it can call built-in applications such as the weather forecast, calendar, and search; and it keeps learning new voices and intonations, providing conversational responses.

Google Now

Google Now is an application Google launched alongside the Android 4.1 system. It can learn the user's habits and ongoing actions and use that information to provide the user with relevant information.

On March 24 this year, Google announced that the Google Now voice service had officially landed in desktop Chrome browsers on Windows and Mac.

The Google Now app makes it easier for users to handle email: when new mail arrives, it automatically pops up for viewing. Google Now also introduced walking and driving mileage recording; this pedometer-style feature uses the Android device's sensors to count the user's monthly mileage, including walking and cycling distance.

In addition, Google Now has added travel and entertainment features, including car rental, concert ticket, and commute-sharing cards. The public transportation and TV program cards have been improved and can now surface music and program information, and users can set search reminders for the start of new media programs and receive real-time NCAA (National Collegiate Athletic Association) football scores.

Baidu Voice

Baidu Voice generally refers to Baidu Voice Search, a voice-based search service Baidu provides to Internet users. Users can initiate voice searches from a variety of clients; the server performs speech recognition on the user's voice request and feeds the search results back to the user.

Baidu Voice Search provides not only general-purpose voice search but also special search services for map users, with more personalized search and recognition services to follow.

At present, Baidu voice search uses mobile clients as its main platform and is embedded in other Baidu products, such as Palm Baidu and Baidu Mobile Maps. Users can experience voice search while using these clients, and all mainstream mobile operating systems are supported.

Microsoft Cortana

Cortana is a virtual voice assistant on the Windows Phone platform, voiced by Jen Taylor, the voice actor of Cortana in the game "Halo". The Chinese version of Cortana is also known as "Microsoft Xiaona".

Microsoft describes Cortana as "a personal assistant on your phone, providing you with more help with setting calendar items, suggestions, progress, and so on." It can interact with you, communicating in a way that simulates human voice and thinking as much as possible. In addition, its round icon button adjusts to your phone's theme: if you set a green theme, Cortana's icon is green.

You can also call up Cortana from the start screen or via the device's search button. Cortana works in a question-and-answer style, displaying information only when you consult it.

Difficulties in speech recognition technology

Speech recognition becomes the focus of contention

It is reported that artificial intelligence companies around the world tend to specialize in deep learning, and more than 70% of the roughly 200 Chinese startups working on artificial intelligence specialize in image or speech recognition. Which companies around the world are deploying speech recognition, and how are they developing?

In fact, the idea of automatic speech recognition was on the agenda long before the computer was invented; early vocoders can be regarded as prototypes of speech recognition and synthesis. The earliest computer-based speech recognition system was Audrey, developed at AT&T Bell Laboratories, which could recognize 10 English digits. By the end of the 1950s, Denes of University College London had added grammatical probabilities to speech recognition.

In the 1960s, artificial neural networks were introduced to speech recognition. The two major breakthroughs of that era were linear predictive coding (LPC) and dynamic time warping (DTW). The most significant breakthrough in speech recognition technology, however, was the application of the hidden Markov model: the underlying mathematics was developed by Baum and colleagues, and after further research by Rabiner and others, Kai-Fu Lee of Carnegie Mellon University finally built Sphinx, the first large-vocabulary speech recognition system based on hidden Markov models.

Apple Siri

Many people trace their awareness of speech recognition to Siri, Apple's famous voice assistant. In 2011, Apple integrated voice recognition technology into the iPhone 4S and released the Siri voice assistant. However, Siri was not developed by Apple; it came from the acquisition of Siri Inc., founded in 2007. After the iPhone 4S launch, Siri's experience was not ideal and it drew plenty of criticism. So in 2013 Apple acquired Novauris Technologies, whose speech recognition technology can recognize entire phrases: rather than simply recognizing single sentences, it draws on more than 245 million phrases to help understand context, which further improved Siri's capabilities.

However, Siri did not become perfect through the Novauris acquisition alone. In 2016, Apple acquired VocalIQ, a British voice technology startup whose artificial intelligence software was developed to help computers and users hold more natural conversations. Subsequently, Apple also acquired Emotient, a San Diego AI company, to obtain its facial expression analysis and emotion recognition technology; the emotion engine developed by Emotient is reported to read people's facial expressions and predict their emotional state.

Google: Google Now

Similar to Apple's Siri, Google's Google Now is also quite popular, but compared with Apple, Google's moves in the voice recognition field were slightly slower. In 2011, Google acquired only SayNow, a voice communications company, and Phonetic Arts, a speech synthesis company. SayNow can integrate voice communication, peer-to-peer conversations, and group calls with applications such as Facebook, Twitter, MySpace, Android, and iPhone, while Phonetic Arts can convert recorded voice conversations into a voice library and then combine those sounds to create spoken dialogue that sounds very realistic.

Google Now made its debut at the Google I/O developer conference in 2012.

In 2013, Google acquired Wavii, a news reader app developer, for more than $30 million. Wavii specializes in natural language processing: it can scan the Internet for news and directly produce a one-sentence summary with a link. Google then acquired a number of SR Tech Group patents related to speech recognition. These technologies and patents were quickly applied in the market: YouTube gained automatic voice transcription for captions, Google Glass used voice control, Android integrated voice recognition, and Google Now acquired a complete voice recognition engine.

In 2015, presumably out of strategic layout considerations, Google also invested in a Chinese voice company: a voice-navigation firm that has recently released smartwatches and has a background connected to well-known domestic acoustic device manufacturers.

Microsoft: Cortana and Xiaobing

Microsoft's most eye-catching speech recognition products are Cortana and Xiaobing. Cortana is Microsoft's attempt at machine learning and artificial intelligence: it records user behavior and usage habits, and uses cloud computing, search engines, and "unstructured data" analysis to read and learn from data including pictures, videos, mobile phone messages, and e-mail, in order to understand the user's semantics and context and thereby achieve human-computer interaction.

Microsoft Xiaobing is an artificial intelligence chatbot released by Microsoft Research Asia in 2014. In addition to intelligent conversation, Xiaobing offers practical skills such as group reminders, encyclopedia lookups, weather, horoscopes, jokes, transportation guides, and restaurant reviews.

In addition to Cortana and Xiaobing, Skype Translator provides real-time translation services for users of English, Spanish, Chinese, and Italian.

Amazon

Amazon's speech technology started in 2011, when it acquired the speech recognition company Yap. Yap, founded in 2006, mainly provided voice-to-text conversion services. In 2012, Amazon acquired Evi, a voice technology company, to strengthen the application of voice recognition in product search; Evi itself used Nuance's speech recognition technology. In 2013, Amazon went on to acquire Ivona Software, a Polish text-to-speech company whose technology has been applied in the Kindle Fire's text-to-speech feature, voice commands, and the Explore by Touch application. Amazon's smart speaker Echo also uses this technology.

Facebook

In 2013, Facebook acquired Mobile Technologies, a speech recognition startup. Its product Jibbigo lets users choose among 25 languages, record a voice clip or enter text in one of them, and then see the translation displayed on screen and read aloud in the selected language. This made Jibbigo a handy tool for traveling abroad and a good substitute for a phrasebook.

Facebook later acquired Wit.ai, a voice interaction solution provider. Wit.ai's solution lets users directly control mobile applications, wearable devices, robots, and almost any smart device through voice. Facebook hopes to apply this technology to targeted advertising and to integrate it closely with its own business model.

Nuance, the aristocrat of the traditional speech recognition industry

Beyond the well-known technology giants introduced above, Nuance, the aristocrat of the traditional speech recognition industry, is also worth knowing. Nuance once dominated the voice field: its recognition engine technology has been used in more than 80% of speech recognition systems worldwide, its voice products support more than 50 languages, it has more than 2 billion users around the world, and it virtually monopolized the financial and telecommunications industries. Nuance is still the world's largest voice technology company and holds the most voice technology patents. Apple's voice assistant Siri, Samsung's voice assistant S-Voice, and the automated call centers of major airlines and top-tier banks all initially used its voice recognition engine technology.

But because Nuance grew somewhat arrogant, it is no longer doing as well as it once did.

Other foreign speech recognition companies

In 2013, Intel acquired Indisys, a Spanish speech recognition technology company, and in the same year Yahoo acquired SkyPhrase, a natural language processing startup. Comcast, the largest cable company in the United States, has also begun to launch its own voice-interaction system, hoping that voice recognition will let users control the TV more freely and accomplish things a remote control cannot.

Domestic voice recognition manufacturers

IFLYTEK

iFLYTEK (USTC Xunfei) was founded at the end of 1999, relying on the speech processing technology of the University of Science and Technology of China and strong national support, and it quickly got on the right track. iFLYTEK went public in 2008, and its market value is now close to 50 billion yuan. According to a 2014 survey by the Speech Industry Alliance, iFLYTEK held more than 60% of the market and is unquestionably the leading domestic voice technology company.

When iFLYTEK comes up, most people think of speech recognition, but its biggest source of income is actually education. Around 2013 in particular, it acquired many speech-evaluation companies, including Qiming Technology, forming a near monopoly in the education market. After this series of acquisitions, the spoken-language evaluations of all provinces now use the iFLYTEK engine. Because it occupies the commanding heights of examinations, schools and parents are willing to pay for it.

Baidu Voice

Baidu established voice as a strategic direction very early: in 2010, it cooperated with the Institute of Acoustics of the Chinese Academy of Sciences to develop speech recognition technology, but market development was relatively slow. It was not until 2014 that Baidu reorganized its strategy and invited Andrew Ng (Wu Enda), a leading figure in artificial intelligence, formally assembling a voice team to specialize in voice-related technologies. Thanks to Baidu's strong financial support, the results so far have been substantial: with nearly 13% market share, its technical strength can be compared with that of iFLYTEK and its decade-plus of technology and experience.

Jietong Huasheng and Zhongke Truly

Backed by Tsinghua technology, Jietong Huasheng invited Mr. Lu Shinan of the Institute of Acoustics, Chinese Academy of Sciences to join in its early days, laying its foundation in speech synthesis. Zhongke Truly depends entirely on the Institute of Acoustics of the Chinese Academy of Sciences. Its initial technical strength was extremely strong: it not only trained a large number of talents for the domestic speech recognition industry, but also played a vital role in industrial applications, especially in the military sector.

The talents trained by the Chinese Academy of Sciences are extremely important to the development of the domestic speech recognition industry; call them the "Acoustics school". Relative to the market, however, these two companies have fallen far behind iFLYTEK. Zhongke Truly, confined to its industrial niche, still has little market presence, and Jietong Huasheng has recently been pushed into the spotlight by the fraud controversy around Nanda's "Jiaojiao" robot, a genuinely negative blow.

Spitz

Around 2009, DNNs were applied to speech recognition and recognition rates improved greatly, exceeding 90% and reaching commercial standards. This strongly promoted the development of the field, and over the past few years many speech recognition startups have been founded.

Spitz was founded in 2007, with most of its founders coming from a Cambridge team, so its technology has some overseas foundations. The company initially focused on speech evaluation, that is, education. After years of development it occupied part of that market, but with iFLYTEK holding the commanding heights it found breakthroughs difficult.

So in 2014, Spitz made the painful decision to divest its education business, selling that department to NetDragon for 90 million, and concentrated instead on smart hardware and the mobile Internet. Recently it has focused on in-car voice assistants and launched the "Radish", but the market response has been middling.

Yun Zhisheng

Riding the publicity wave of Apple's Siri in 2011, Yunzhisheng was founded in 2012. Its team mainly came from the Shanda Research Institute; coincidentally, its CEO and CTO also graduated from the University of Science and Technology of China, though its speech recognition technology derives more from the Institute of Automation of the Chinese Academy of Sciences. Its speech recognition has a certain uniqueness, and for a short period its recognition rate even exceeded iFLYTEK's, so it has also been well received by capital, with B-round financing reaching 300 million, aimed mainly at the smart home market. Yet in its three-plus years, publicity has outrun delivery: market development has been slow, the B2B market has never picked up, and actual B2C applications are rarely heard of.

Go out and ask

"Go out and ask" was founded in 2012. Its CEO previously worked at Google; after receiving angel investment from Sequoia Capital and ZhenFund, he resigned from Google and founded Shanghai Yufanzhi Information Technology Co., Ltd., whose voice search product is "Go out and ask".

The visible success of "Go out and ask" is its ranking on Apple's app charts, though it is unclear why users whose phones already have built-in maps would download the software; sometimes it is more trouble than consulting the map directly. The company also has strong financing ability: in 2015 it received C-round financing from Google, bringing total financing to 75 million US dollars. It mainly targets the wearables market and has recently launched smartwatches and other products, but these have been more thunder than rain, and it is hard to see how well the watches actually sell.

Other domestic voice recognition companies

The threshold for entering voice recognition is not especially high, so major domestic companies are gradually joining in. Sogou initially used Yunzhisheng's speech recognition engine but soon built its own, mainly for the Sogou input method; the results are decent.

Tencent, of course, will not fall behind: WeChat has also built its own speech recognition engine for converting speech to text, though its quality still lags somewhat.

Alibaba, iQiyi, 360, LeEco, and others are also building their own speech recognition engines, but these large companies mostly develop for their own use; the technology is basically sound, but their industry influence is small.

Of course, beyond the industry-recognized speech recognition companies mentioned above, Cambridge's HTK toolkit has greatly promoted academic research, and CMU, SRI, MIT, RWTH, and ATR have also advanced the development of speech recognition technology.

What is the principle of speech recognition technology?

Most readers have had some contact with speech recognition technology, and above we introduced the major domestic and foreign speech recognition companies. But you may still wonder: what is the principle behind speech recognition? Let me introduce it.

Speech recognition technology

Speech recognition technology lets machines convert speech signals into corresponding text or commands through a process of recognition and understanding. Its purpose is to give machines human-like hearing: to understand what people say and to act accordingly. At present, most speech recognition technology is based on statistical models. From the perspective of speech generation, speech recognition can be divided into two parts: the acoustic (speech) layer and the language layer.

Speech recognition is essentially a pattern recognition process: the pattern of the unknown speech is compared one by one with reference patterns of known speech, and the best-matching reference pattern is taken as the recognition result.

The mainstream algorithms in today's speech recognition include the dynamic time warping (DTW) algorithm, vector quantization (VQ) based on non-parametric models, the hidden Markov model (HMM) method based on parametric models, artificial neural networks (ANN), and support vector machines (SVM), among others.
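
As a concrete example of the template matching idea, here is a textbook dynamic time warping (DTW) sketch in Python, classifying an utterance by its nearest reference template. The random arrays stand in for real MFCC feature sequences and are purely illustrative.

```python
import numpy as np

def dtw_distance(x, y):
    """DTW distance between feature sequences x (n, d) and y (m, d);
    a plain O(n*m) dynamic-programming implementation."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy template matching: pick the reference word whose template warps best.
templates = {"yes": np.random.randn(40, 13), "no": np.random.randn(35, 13)}
utterance = np.random.randn(38, 13)
print(min(templates, key=lambda w: dtw_distance(utterance, templates[w])))
```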

[Figure: basic block diagram of a speech recognition system]

Classification of speech recognition:

According to the degree of dependence on the speaker, it is divided into:

(1) Speaker-dependent (SD) recognition: can recognize only a specific user's voice; the system must be trained before use (training → use).

(2) Speaker-independent (SI) recognition: can recognize anyone's voice, with no training required.

According to the requirements for speaking style, it is divided into:

(1) Isolated word recognition: only a single word can be recognized at a time.

(2) Continuous speech recognition: the user speaks at a normal speed and whole sentences are recognized.

Speech recognition system

The model of a speech recognition system usually consists of two parts, an acoustic model and a language model, which respectively compute the probability from speech to syllables and the probability from syllables to words.
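
In the standard statistical formulation (a textbook identity, stated here for completeness, not something specific to any one system), the recognizer chooses the word sequence W that best explains the acoustic observations O, and Bayes' rule factors the score into exactly these two models:

W* = argmax_W P(W | O) = argmax_W P(O | W) · P(W)

where P(O | W) is the acoustic model probability and P(W) is the language model probability.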

Sphinx, developed by Carnegie Mellon University, is a large-vocabulary, speaker-independent, continuous English speech recognition system. A continuous speech recognition system can be roughly divided into four parts: feature extraction, acoustic model training, language model training, and the decoder.

(1) Pre-processing module

This module processes the raw input speech signal, filtering out unimportant information and background noise, and performs endpoint detection (finding the beginning and end of the speech signal), speech framing (the speech signal is approximately stationary over short spans of roughly 10-30 ms, so it is cut into frames for analysis), and pre-emphasis (boosting the high-frequency part), among other steps.
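
A minimal numpy sketch of two of these steps, pre-emphasis and framing with a Hamming window, is shown below; the 25 ms frame length, 10 ms shift, and pre-emphasis coefficient 0.97 are conventional values assumed for illustration.

```python
import numpy as np

def preprocess(signal, sr=16000, frame_ms=25, shift_ms=10, alpha=0.97):
    """Pre-emphasis, then short-time framing with a Hamming window."""
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])  # boost highs
    frame_len, shift = int(sr * frame_ms / 1000), int(sr * shift_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // shift)
    frames = np.stack([emphasized[i * shift:i * shift + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)       # window each frame

frames = preprocess(np.random.randn(16000))     # 1 s of toy audio at 16 kHz
print(frames.shape)                             # (98, 400): 98 frames of 400 samples
```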

(2) Feature extraction

Feature extraction removes redundant information in the speech signal that is useless for recognition, retains the information that reflects the essential characteristics of the speech, and expresses it in a suitable form: key feature parameters reflecting the characteristics of the speech signal are extracted to form a feature vector sequence for subsequent processing.

There are many common methods for extracting features, but they are essentially all derived from the frequency spectrum.
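
One common concrete choice is mel-frequency cepstral coefficients (MFCCs), computed from the short-time spectrum. A minimal sketch using the librosa library (assuming it is installed; "utterance.wav" is a placeholder file name):

```python
import librosa

# Load an utterance and extract 13 MFCCs per frame (13 is a common choice).
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
print(mfcc.shape)
```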

(3) Acoustic model training

Acoustic model parameters are trained on the feature parameters of the training speech corpus. During recognition, the feature parameters of the speech to be recognized are matched against the acoustic model to obtain the recognition result.

Most current mainstream speech recognition systems use hidden Markov models (HMM) for acoustic modeling.
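
Here is a small sketch of this training-and-matching loop using the hmmlearn library (assumed to be installed), in the classic whole-word setup of one Gaussian HMM per vocabulary word; the random arrays stand in for real MFCC sequences.

```python
import numpy as np
from hmmlearn import hmm

# Train one HMM on 20 toy utterances of a single word (13-dim features).
sequences = [np.random.randn(50, 13) for _ in range(20)]
X, lengths = np.concatenate(sequences), [len(s) for s in sequences]

model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
model.fit(X, lengths)                        # Baum-Welch (EM) training

# Recognition by matching: score a test utterance under each word's HMM
# and pick the word with the highest log-likelihood (only one word here).
print(model.score(np.random.randn(60, 13)))
```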

(4) Language model training

The language model is a probability model used to compute the probability of a sentence. It mainly decides which word sequence is more likely, or predicts the next word given the preceding words. In other words, the language model constrains the word search: it defines which words may follow the previously recognized word (matching is a sequential process), so that impossible words can be excluded from the matching process.

Language modeling can effectively combine knowledge of Chinese grammar and semantics to describe the internal relationships between words, improving the recognition rate and reducing the search space. The language model spans three levels of knowledge: lexical (dictionary), grammatical, and syntactic.

Grammatical and semantic analysis is performed on the training text corpus, and the language model is obtained by training on statistical models. Language modeling methods mainly include rule-based models and statistical models.
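
The statistical approach is easy to make concrete. Below is a toy add-one-smoothed bigram model in Python: it assigns higher probability to a word order seen in training than to a scrambled one, which is exactly the constraint the decoder exploits. The two-sentence corpus is obviously illustrative.

```python
import math
from collections import Counter

corpus = [["i", "want", "to", "fly"], ["i", "want", "to", "eat"]]  # toy training text
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    unigrams.update(tokens[:-1])                    # history counts
    bigrams.update(zip(tokens[:-1], tokens[1:]))    # adjacent word pairs

V = len(set(unigrams) | {"</s>"})                   # vocabulary size for smoothing

def logprob(sentence):
    """Add-one smoothed bigram log-probability of a sentence."""
    tokens = ["<s>"] + sentence + ["</s>"]
    return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
               for a, b in zip(tokens[:-1], tokens[1:]))

print(logprob(["i", "want", "to", "fly"]) >
      logprob(["fly", "to", "i", "want"]))          # True: grammatical order wins
```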

(5) Voice decoding and search algorithm

Decoder: the recognition process proper in speech technology. Given an input speech signal, a recognition network is built from the trained HMM acoustic model, the language model, and the dictionary, and a search algorithm finds the best path through this network: the word string that outputs the speech signal with the greatest probability, thereby determining the text contained in the speech sample. The decoding operation thus centers on the search algorithm: finding the optimal word string via search techniques at the decoding end.

Search in continuous speech recognition looks for a sequence of word models that describes the input speech signal, thereby obtaining the word decoding sequence. The search is based on scoring the acoustic model and the language model together; in practice, it is often necessary, based on experience, to give the language model a high weight and to set a word insertion penalty. Today's mainstream decoding technology, including Sphinx, is based on the Viterbi search algorithm.
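
For completeness, here is a compact Viterbi sketch over log-probabilities. A real decoder searches a huge composed network and adds a weighted language model score plus a word insertion penalty to each path, as described above; this toy version just finds the best state path through one small HMM, with made-up numbers.

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Most likely state path. log_A: (S, S) transition log-probs,
    log_B: (T, S) emission log-probs per frame, log_pi: (S,) initial."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]                 # best score ending in each state
    back = np.zeros((T, S), dtype=int)        # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A       # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # trace backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta.max())

log_A = np.log([[0.7, 0.3], [0.4, 0.6]])      # toy 2-state HMM
log_B = np.log([[0.9, 0.2], [0.8, 0.1], [0.1, 0.9]])
log_pi = np.log([0.6, 0.4])
print(viterbi(log_A, log_B, log_pi))          # ([0, 0, 1], best path log-score)
```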

Difficulties of speech recognition technology

Speaker difference

Different speakers: vocal organs, accent, speaking style

The same speaker: different times, different states

Noise impact

Background noise

Transmission channel, microphone frequency response

Robustness techniques

Discriminative training

Feature compensation and model compensation

Specific applications of speech recognition

Command word system

The recognition grammar network is relatively limited, and requirements on the user are stricter

Menu navigation, voice dialing, car navigation, number and letter recognition, etc.

Intelligent interactive system

Requirements on the user are more relaxed; recognition needs to be combined with technologies from other fields

Call routing, POI voice fuzzy query, keyword detection

Large vocabulary continuous speech recognition system

Massive vocabulary entries and wide coverage; accuracy and real-time performance suffer

Audio transcription

Voice search combined with the Internet

Realizing speech-to-text conversion and voice-driven search
