You hear about speech recognition everywhere, but do you know its history?

Language matters because human thinking developed rapidly once it emerged, and this is part of what sets human intelligence apart from that of other species. Speech recognition, as an application area for artificial intelligence, is a problem researchers have long been trying to solve.

Speech recognition is a hot topic, but do you know its history?

At the end of October 2016, Microsoft announced a historic breakthrough in speech recognition: a word error rate of only 5.9%, which put its English speech transcription on par with professional stenographers. It was the first time a machine's recognition ability in English had matched or exceeded the human level. The news drew great attention across the industry, since speech recognition has long been one of the key technologies pursued by technology companies in China and abroad. Andrew Ng (Wu Enda), then Baidu's chief scientist, congratulated Microsoft on its English breakthrough and also recalled Baidu's breakthrough in Chinese speech recognition a year earlier, when the error rate of Deep Speech 2 on short phrases dropped to 3.7%. On certain kinds of speech, Deep Speech 2's transcription is essentially superhuman: it can transcribe short queries more accurately than native Mandarin speakers.

The dream begins at Bell Labs

Not long ago, MIT Technology Review, the magazine affiliated with the Massachusetts Institute of Technology (MIT), announced its "10 Breakthrough Technologies" of 2016. According to the magazine, each of the ten had reached a milestone in the past year or was about to reach one. The breakthrough in speech recognition ranked third among them.

Looking back on human development, it is not hard to see how communication evolved: from hands and limbs using simple tools to convey simple information, to controlled vocalization received by the ear. The result was a fast, speech-based channel for transmitting information, a closed loop of sending and receiving that became the most natural and important means of interaction between people. Sound waves, like video and radio signals, propagate without physical contact, and they are the only natural "wireless" resource that humans can control freely without tools. Sound also places looser demands on the direction of the receiver, an invaluable property that brings great convenience in many scenarios. For large groups of people who have impaired vision or touch (such as the elderly, people with low vision, and people with disabilities), or for whom screens are unsuitable (such as children whose eyesight needs protecting), speech is the best choice for interaction.

After the modern electronic computer appeared in 1946, computers came to outperform humans at many tasks. That raised a question: can machines understand natural language, so that you can talk to a machine and have it understand what you are saying? Speech recognition is a dream people have pursued ever since computers appeared.

The idea of machine intelligence was first put forward by Alan Turing, the father of computer science. In 1950 he published a paper titled "Computing Machinery and Intelligence" in the journal Mind. Turing did not propose any research method in the paper. Instead he proposed a way to verify whether a machine is intelligent: let a person converse with it, and if the person cannot tell whether they are talking to a human or a machine, the machine is intelligent. This method later became known as the Turing test. Turing left a question rather than an answer, but the machine processing of natural language is generally traced back to that moment.

Scientists describe speech recognition as a machine's "auditory system": it lets a machine recognize and understand a speech signal and convert it into the corresponding text or command. In 1952, Davis and colleagues at Bell Labs built the world's first experimental system able to recognize the ten spoken English digits. In 1960, Denes and colleagues in the United Kingdom developed the first computer-based speech recognition system.

The more than 60 years of speech recognition development can be divided into several stages. For the first 20 years or so, from the 1950s to the 1970s, scientists took a detour. Researchers around the world assumed that to recognize speech a computer must first understand natural language. They were confined by the way humans learn language, that is, they tried to make computers simulate the human brain. Those 20-plus years of research produced almost nothing.

Jelinek's contribution

It was not until the 1970s that the rise of statistical linguistics revived speech recognition and made today's achievements possible. The key figures behind this shift were Frederick Jelinek and his group at the IBM T. J. Watson Research Center, who began applying statistical methods. With them, IBM raised the speech recognition rate from 70% to 90%, and the vocabulary that could be handled grew from a few hundred words to tens of thousands, which gave speech recognition the potential to move from the laboratory into practical use.

"From Watergate to Monica Lewinsky" was the title of Jelinek's talk at ICASSP (the International Conference on Acoustics, Speech and Signal Processing) in 1999. The Watergate scandal broke in 1972, which happened to be when statistical speech recognition began, and the impeachment of President Clinton over the Lewinsky affair took place the year before the conference.

After ten years honing his craft at Cornell, Jelinek had studied information theory in depth and arrived at his key insight. In 1972 he went to the IBM Watson lab on sabbatical and, somewhat by accident, came to lead its speech recognition group. Two years later, choosing between Cornell and IBM, he decided to stay at IBM.

IBM in the 1970s was a bit like Microsoft in the 1990s or Google in the past decade (the Schmidt era): it let outstanding scientists pursue the research that interested them. In that relaxed environment, Jelinek and his colleagues proposed the framework for statistical speech recognition.

Before Jelinek, scientists treated speech recognition as a problem of artificial intelligence and pattern matching. Jelinek treated it as a communication problem and summarized it cleanly with two hidden Markov models, an acoustic model and a language model. This framework still has a far-reaching influence on the field: it not only made speech recognition fundamentally practical but also laid the foundation for today's natural language processing. Jelinek was later elected to the US National Academy of Engineering and was named one of the top 100 inventors of the 20th century by Technology magazine.
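
The communication view described above is usually written as a noisy-channel decision rule. As a sketch only, using notation that is ours rather than the article's (A for the observed acoustic signal, W for a candidate word sequence), the recognizer picks the word sequence that is most probable given the audio:

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        = \arg\max_{W} P(A \mid W)\, P(W)
```

Here P(A | W) plays the role of the acoustic model and P(W) the role of the language model. The denominator P(A) can be dropped because it does not depend on W and so does not change which W wins.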

Jelinek's predecessors who tried to apply statistical methods to speech recognition ran into two insurmountable obstacles: they lacked computers with enough computing power, and they lacked large machine-readable text corpora to compute statistics from. In the end they had to give up. At IBM in the 1970s, computers could not compare with today's, but they could already do a great deal, so the problem Jelinek and his colleagues had to solve was finding a large machine-readable corpus. Fortunately, one worldwide business already ran over the telecommunications network: telex. IBM's scientists began their research with the text of telex messages.

Why was statistical speech recognition proposed by IBM, which had no background in speech recognition at the start, rather than by Bell Labs or Carnegie Mellon University, which had worked in the field far longer? Many apparent accidents of history have inevitable causes behind them: IBM had the computing power and the material resources, and it had gathered many of the world's brightest minds.

It took about 15 years for statistical speech recognition to displace rule-based methods. That is because new research methods take many years to mature.

The puzzles that remain

What makes speech recognition unusual is not only what it has achieved: despite those achievements, the remaining problems are as daunting as the ones already overcome.

As research thinking shifted, large-scale speech recognition research began in the 1970s and made substantial progress on small-vocabulary, isolated-word recognition. From the 1980s the focus gradually turned to large-vocabulary, speaker-independent continuous speech recognition. Since the 1990s there has been no major breakthrough in the overall system framework.

However, great progress has been made in applying and commercializing speech recognition technology. For example, in the 1970s DARPA, the US Department of Defense's Defense Advanced Research Projects Agency, funded the research and development of language understanding systems. The DARPA program was still running in the 1990s, by which time its focus had shifted to the natural language processing part of the recognizer, and the recognition task was set to air travel information retrieval. Across the many DARPA-funded speech evaluations, word error rate has been the main metric for measuring progress.
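
Word error rate is commonly computed as the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The following is a minimal sketch under that common definition. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; they are not taken from the DARPA evaluations or from any particular toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name); tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a word error rate of 5.9% corresponds to roughly 59 substituted, deleted or inserted words per 1,000 reference words. Because insertions count as errors, the value can exceed 1.0 when the output is much longer than the reference.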

China's speech recognition research began in 1958, when the Institute of Acoustics of the Chinese Academy of Sciences used vacuum-tube circuits to recognize 10 vowels. Limited by the conditions of the time, the work developed slowly, and it was not until 1973 that the Institute began computer-based speech recognition.

From the 1980s, as computer applications spread in China and digital signal processing advanced, many domestic institutions acquired the basic conditions for speech technology research. At the same time, speech recognition became an international research hotspot again after years of quiet. Against this backdrop, many domestic institutions took up the research.

In 1986, speech recognition was listed as a dedicated research topic, as an important part of intelligent computer systems research. With support from the "863" program, China began organized research on speech recognition technology and decided to hold a dedicated speech recognition meeting every two years.

Big data and deep neural networks

Every technology has an accumulation phase and a breakout phase. The breakout of speech recognition came from the big data and ripple effect that accompanied the Internet, together with deep neural networks. The ripple effect refers to the role of Internet thinking in improving the performance of core technology; some people call it iterative optimization. Andrew Ng (Wu Enda), for example, describes it as combining research, product and user usage into a closed loop of iterative optimization, which expresses the same role of Internet thinking in optimizing and advancing core technology. In this way you obtain not only data but also experience: you learn how the system is used and what to adjust to make the user experience better.

Improving speech recognition requires combining experience, data and user feedback. User feedback is needed to identify patterns. For example, if users are being cut off while they speak, adjusting certain parameters can improve performance. The recognition rate does not rise simply because there is more data. Other factors matter too, such as how users feel, certain key parameters and accumulated experience, all of which can be learned. What Internet thinking brings is much like software iteration: at its core is the information that flows back through feedback.

In the big data era, the hidden Markov model shows its limitations: as the amount of data grows, its performance improves less than that of a deep neural network, although both are, in essence, statistical pattern recognition. In the development of speech recognition, deep learning arrived at the same moment. Without deep neural networks, big data and the ripple effect would still have made hidden Markov models practical. Deep neural networks took this further, lowering the barrier to entry and letting more people join in. With the same ripple effect, a deep neural network outperforms earlier algorithms, and the more data there is, the better it performs. More importantly, the deep neural network is only one link in the theoretical framework of statistical pattern recognition. The truly important link is the statistical decision system.

The use of deep neural networks in speech recognition was pioneered by Geoffrey Hinton and Microsoft researcher Li Deng. Google was the first company in the world to use deep neural networks at large scale, and Google's Voice Search also pioneered the use of Internet thinking in speech recognition. iFlytek (Keda Xunfei) took its inspiration from Google and followed quickly, becoming the first company in China to use deep neural networks in a commercial system.

Speech recognition technology has been developing for decades. With the application of big data and deep neural networks, the leading players in the field are now American technology giants such as Google, Amazon, Apple and Microsoft, and according to TechCrunch's count at least 26 companies in the United States are developing speech recognition technology.

The technical accumulation and first-mover advantage of giants like Google make it hard for latecomers to catch up. For policy and market reasons, however, these giants' speech recognition is geared mainly toward English, which has given iFlytek, Baidu and Sogou the opportunity to excel in Chinese. In China, these local products are also more familiar to users.

From recognition to perception and cognition

In speech recognition, how well the training data matches the task and how rich it is are among the most important drivers of system performance, but annotating and analyzing corpora takes long-term accumulation. With the arrival of the big data era, building up large-scale corpus resources will be raised to strategic importance. Today the hottest application of speech recognition is on mobile devices. Voice dialogue robots, voice assistants and interactive tools are appearing one after another, and many Internet companies have invested people, resources and money in research and applications in this area, aiming to win customers quickly through the new and convenient mode of voice interaction.

Although image and speech recognition have made tremendous progress in recent years, researchers caution that a great deal of work remains.

Looking ahead, researchers are working to make speech recognition perform well in more realistic environments, including places with heavy background noise such as crowded gatherings or a car on the highway. They will also focus on better distinguishing individual speakers in multi-party conversations, and on making sure recognition works across a wide variety of voices regardless of the speaker's age, accent or speaking ability.

In the longer term, researchers will focus on teaching computers not only to transcribe the sounds people utter but also to understand what is being said, so that the technology can answer questions or take action based on what it is told.

The next frontier is moving from recognition to understanding. We are moving from a world where humans must understand computers to a world where computers must understand us.

However, we should also be aware that true artificial intelligence is still on the distant horizon. A great deal of work remains before a computer can understand the real meaning of what it hears or sees.
