Speech processing

Speech processing is the study of speech signals and of the methods used to process them. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing applied to speech signals. Aspects of speech processing include the acquisition, manipulation, storage, transfer and output of speech signals. Speech processing tasks include speech recognition, speech synthesis, speaker diarization, speech enhancement and speaker recognition.^[1]

History

Early attempts at speech processing and recognition focused primarily on understanding a handful of simple phonetic elements such as vowels. Pioneering work on speech recognition based on spectral analysis was reported in the 1940s.^[3] In 1952, three researchers at Bell Labs, Stephen Balashek, R. Biddulph, and K. H. Davis, developed a system that could recognize digits spoken by a single speaker.^[2]

Linear predictive coding (LPC), a speech processing algorithm, was first proposed by Fumitada Itakura of Nagoya University and Shuzo Saito of Nippon Telegraph and Telephone (NTT) in 1966.^[4] Further developments in LPC technology were made by Bishnu S. Atal and Manfred R. Schroeder at Bell Labs during the 1970s.^[4] LPC was the basis for voice-over-IP (VoIP) technology,^[4] as well as speech synthesizer chips, such as the Texas Instruments LPC Speech Chips used in the Speak & Spell toys from 1978.^[5]

One of the first commercially available speech recognition products was Dragon Dictate, released in 1990. In 1992, technology developed by Lawrence Rabiner and others at Bell Labs was used by AT&T in their Voice Recognition Call Processing service to route calls without a human operator. By this point, the vocabulary of these systems was larger than the average human vocabulary.^[6]

By the early 2000s, the dominant speech processing strategy started to shift away from hidden Markov models (HMMs) towards neural networks and deep learning.^[7]

In 2012, Geoffrey Hinton and his team at the University of Toronto demonstrated that deep neural networks could significantly outperform traditional HMM-based systems on large vocabulary continuous speech recognition tasks. This breakthrough led to widespread adoption of deep learning techniques in the industry.^[8]^[9]

By the mid-2010s, companies like Google, Microsoft, Amazon, and Apple had integrated advanced speech recognition systems into their virtual assistants such as Google Assistant, Cortana, Alexa, and Siri.^[10] These systems utilized deep learning models to provide more natural and accurate voice interactions.

The development of Transformer-based models, like Google's BERT (Bidirectional Encoder Representations from Transformers) and OpenAI's GPT (Generative Pre-trained Transformer), further pushed the boundaries of natural language processing and speech recognition. These models enabled more context-aware and semantically rich understanding of speech.^[11]^[8] In recent years, end-to-end speech recognition models have gained popularity. These models simplify the speech recognition pipeline by directly converting audio input into text output, bypassing intermediate steps like feature extraction and acoustic modeling. This approach has streamlined the development process and improved performance.^[12]

Techniques

Dynamic time warping

Dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences that may vary in speed. In general, DTW calculates an optimal match between two given sequences (e.g. time series) subject to certain restrictions and rules. The optimal match is the one that satisfies all the restrictions and rules and has the minimal cost, where the cost is computed as the sum of the absolute differences between the values of each matched pair of indices.^{[citation needed]}
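
The cost computation described above can be sketched with a small dynamic program. This is a minimal illustration, not a reference implementation: the function name `dtw_distance` is hypothetical, the sequences are assumed to be lists of numbers, and the only restrictions assumed are the common ones (the match starts at the first pair of elements, ends at the last pair, and never moves backwards in either sequence).

```python
def dtw_distance(a, b):
    """Return the minimal cumulative cost of aligning sequences a and b.

    Assumed rules: the alignment starts at (a[0], b[0]), ends at
    (a[-1], b[-1]), and each step advances in a, in b, or in both.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: absolute difference of the matched pair of values.
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j],      # advance in a only
                                 cost[i][j - 1],      # advance in b only
                                 cost[i - 1][j - 1])  # advance in both
    return cost[n][m]
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero because the repeated `2` in the second sequence can be matched to the single `2` in the first, which is the sense in which DTW tolerates differences in speed.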

Hidden Markov models

A hidden Markov model can be represented as the simplest dynamic Bayesian network. The goal of the algorithm is to estimate a hidden variable x(t) given a list of observations y(t). By applying the Markov property, the conditional probability distribution of the hidden variable x(t) at time t, given the values of the hidden variable x at all times, depends only on the value of the hidden variable x(t − 1). Similarly, the value of the observed variable y(t) depends only on the value of the hidden variable x(t) (both at time t).^{[citation needed]}
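
One way to see how the two dependence assumptions are used is the forward (filtering) recursion, sketched below for a discrete model. The function name `forward_filter` and its parameters are hypothetical: `initial[i]` is assumed to be the probability of starting in state i, `transition[i][j]` the probability of moving from state i to state j, and `emission[j][y]` the probability of observing symbol y in state j. The sketch also assumes every observation has non-zero probability under the model.

```python
def forward_filter(initial, transition, emission, observations):
    """Return the distribution of x(t) given y(1..t), for every t."""
    n_states = len(initial)
    belief = list(initial)
    filtered = []
    for t, y in enumerate(observations):
        if t > 0:
            # Prediction step: x(t) depends only on x(t - 1).
            belief = [
                sum(belief[i] * transition[i][j] for i in range(n_states))
                for j in range(n_states)
            ]
        # Update step: y(t) depends only on x(t).
        belief = [belief[j] * emission[j][y] for j in range(n_states)]
        # Normalize so the belief is a probability distribution.
        total = sum(belief)
        belief = [b / total for b in belief]
        filtered.append(belief)
    return filtered
```

Each iteration needs only the previous belief and the current observation, which is a direct consequence of the Markov property stated above.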

Artificial neural networks

An artificial neural network (ANN) is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it. In common ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs.^{[citation needed]}
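
As a rough illustration of the last sentence, a single artificial neuron can be written in a few lines. The names `neuron_output` and `layer_output` are hypothetical, and the weighted sum with a bias term and the logistic sigmoid non-linearity are assumptions chosen for the sketch; the text only requires some non-linear function of the summed inputs.

```python
import math


def neuron_output(inputs, weights, bias):
    """Apply a non-linear function to the weighted sum of the inputs."""
    # Each incoming signal is a real number scaled by its connection weight.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Logistic sigmoid: one common choice of non-linearity (an assumption).
    return 1.0 / (1.0 + math.exp(-z))


def layer_output(inputs, weight_rows, biases):
    """Feed the same inputs to several neurons; return one output each."""
    return [neuron_output(inputs, w, b) for w, b in zip(weight_rows, biases)]
```

The outputs of one layer can then be passed as the inputs of another, which corresponds to a neuron signalling the additional neurons connected to it.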

Phase-aware processing

Phase is often assumed to be random, but it contains useful information. Phase wrapping^[13] arises from periodic jumps of $2\pi$. After phase unwrapping (see^[14] Chapter 2.3, "Instantaneous phase and frequency"), the phase can be expressed as^[13]^[15] $\phi (h,l)=\phi _{lin}(h,l)+\Psi (h,l)$, where $\phi _{lin}(h,l)=\omega _{0}(l')\Delta t$ is the linear phase ($\Delta t$ is the temporal shift at each analysis frame) and $\Psi (h,l)$ is the phase contribution of the vocal tract and the phase source.^[15] The resulting phase estimates can be used for noise reduction: temporal smoothing of the instantaneous phase^[16] and of its derivatives with respect to time (instantaneous frequency) and frequency (group delay),^[17] and smoothing of phase across frequency.^[17] Joint amplitude and phase estimators can recover speech more accurately, based on the assumption of a von Mises distribution of phase.^[15]
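
The wrapping and unwrapping step can be illustrated with a short sketch. It is a generic one-dimensional unwrapping routine, not the method of the cited works: the function name `unwrap_phase` is hypothetical, and it assumes that the true phase changes by less than $\pi$ between consecutive samples, so that any larger jump can be attributed to wrapping.

```python
import math


def unwrap_phase(wrapped):
    """Remove 2*pi jumps from a sequence of phase values in radians."""
    unwrapped = []
    offset = 0.0
    for k, phase in enumerate(wrapped):
        if k > 0:
            diff = phase - wrapped[k - 1]
            # A jump larger than pi is treated as a wrap; cancel it by
            # shifting all later samples by a whole multiple of 2*pi.
            offset -= 2 * math.pi * round(diff / (2 * math.pi))
        unwrapped.append(phase + offset)
    return unwrapped
```

Applied to a linearly increasing phase that has been wrapped into the interval from $-\pi$ to $\pi$, the routine recovers the original straight line, which corresponds to the linear phase term $\phi _{lin}$ above.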

Applications

References

[1] Sahidullah, Md; Patino, Jose; Cornell, Samuele; Yin, Ruiqing; Sivasankaran, Sunit; Bredin, Herve; Korshunov, Pavel; Brutti, Alessio; Serizel, Romain; Vincent, Emmanuel; Evans, Nicholas; Marcel, Sebastien; Squartini, Stefano; Barras, Claude (2019). "The Speed Submission to DIHARD II: Contributions & Lessons Learned". arXiv:1911.02388 [eess.AS].
[2] Juang, B.-H.; Rabiner, L. R. (2006). "Speech Recognition, Automatic: History". Encyclopedia of Language & Linguistics. Elsevier. pp. 806–819. doi:10.1016/b0-08-044854-2/00906-8. ISBN 9780080448541.
[3] Myasnikov, L. L.; Myasnikova, Ye. N. (1970). Automatic recognition of sound pattern (in Russian). Leningrad: Energiya.
[4] Gray, Robert M. (2010). "A History of Realtime Digital Speech on Packet Networks: Part II of Linear Predictive Coding and the Internet Protocol" (PDF). Found. Trends Signal Process. 3 (4): 203–303. doi:10.1561/2000000036. ISSN 1932-8346.
[5] "VC&G - VC&G Interview: 30 Years Later, Richard Wiggins Talks Speak & Spell Development".
[6] Huang, Xuedong; Baker, James; Reddy, Raj (2014). "A historical perspective of speech recognition". Communications of the ACM. 57 (1): 94–103. doi:10.1145/2500887. ISSN 0001-0782. S2CID 6175701.
[7] Furui, Sadaoki (2005). "50 Years of Progress in Speech and Speaker Recognition Research". ECTI Transactions on Computer and Information Technology. 1 (2): 64–74. doi:10.37936/ecti-cit.200512.51834. ISSN 2286-9131.
[8] "Deep Neural Networks for Acoustic Modeling in Speech Recognition" (PDF). 2025-08-05. Retrieved 2025-08-05.
[9] "Speech Recognition with Deep Recurrent Neural Networks" (PDF). 2025-08-05. Retrieved 2025-08-05.
[10] Hoy, Matthew B. (2018). "Alexa, Siri, Cortana, and More: An Introduction to Voice Assistants". Medical Reference Services Quarterly. 37 (1): 81–88. doi:10.1080/02763869.2018.1404391. ISSN 1540-9597. PMID 29327988.
[11] "Vbee". vbee.vn (in Vietnamese). Retrieved 2025-08-05.
[12] Hagiwara, Masato (2021). Real-World Natural Language Processing: Practical applications with deep learning. Simon and Schuster. ISBN 978-1-63835-039-2.
[13] Mowlaee, Pejman; Kulmer, Josef (August 2015). "Phase Estimation in Single-Channel Speech Enhancement: Limits-Potential". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 23 (8): 1283–1294. doi:10.1109/TASLP.2015.2430820. ISSN 2329-9290. S2CID 13058142.
[14] Mowlaee, Pejman; Kulmer, Josef; Stahl, Johannes; Mayer, Florian (2017). Single channel phase-aware signal processing in speech communication: theory and practice. Chichester: Wiley. ISBN 978-1-119-23882-9.
[15] Kulmer, Josef; Mowlaee, Pejman (April 2015). "Harmonic phase estimation in single-channel speech enhancement using von Mises distribution and prior SNR". Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE. pp. 5063–5067.
[16] Kulmer, Josef; Mowlaee, Pejman (May 2015). "Phase Estimation in Single Channel Speech Enhancement Using Phase Decomposition". IEEE Signal Processing Letters. 22 (5): 598–602. Bibcode:2015ISPL...22..598K. doi:10.1109/LSP.2014.2365040. ISSN 1070-9908. S2CID 15503015.
[17] Mowlaee, Pejman; Saeidi, Rahim; Stylianou, Yannis (July 2016). "Advances in phase-aware signal processing in speech communication". Speech Communication. 81: 1–29. doi:10.1016/j.specom.2016.04.002. ISSN 0167-6393. S2CID 17409161. Retrieved 2025-08-05.
