Computer Science Artificial Intelligence

Speech and dialogue systems

Description

This cluster of papers focuses on the modeling and optimization of dialogue acts in spoken language systems, utilizing techniques such as Markov decision processes, user simulation, multimodal interaction, reinforcement learning, natural language generation, and the hidden information state model. The research also delves into semantic processing, referring expressions, and the management of dialogues in various contexts.

Keywords

Spoken Dialogue Systems; Markov Decision Processes; User Simulation; Multimodal Interaction; Reinforcement Learning; Natural Language Generation; Hidden Information State Model; Dialog Management; Semantic Processing; Referring Expressions

This paper focuses on the processes involved in collaboration using a microanalysis of one dyad's work with a computer-based environment (the Envisioning Machine). The interaction between participants is analysed with respect to a 'Joint Problem Space', which comprises an emergent, socially-negotiated set of knowledge elements, such as goals, problem state descriptions and problem solving actions. Our analysis shows how this shared conceptual space is constructed through the external mediational framework of shared language, situation and activity. This approach has particular implications for understanding how the benefits of collaboration are realised and serves to clarify the possible roles of the computer in supporting collaborative learning.
Natural language generation (NLG) is a critical component of spoken dialogue and it has a significant impact both on usability and perceived quality. Most NLG systems in common use employ rules and heuristics and tend to generate rigid and stylised responses without the natural variation of human language. They are also not easily scaled to systems covering multiple domains and languages. This paper presents a statistical language generator based on a semantically controlled Long Short-term Memory (LSTM) structure. The LSTM generator can learn from unaligned data by jointly optimising sentence planning and surface realisation using a simple cross entropy training criterion, and language variation can be easily achieved by sampling from output candidates. With fewer heuristics, an objective evaluation in two differing test domains showed the proposed method improved performance compared to previous methods. Human judges scored the LSTM system higher on informativeness and naturalness and overall preferred it to the other systems.
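As an illustration, here is a minimal sketch (PyTorch; all names and sizes are hypothetical) of conditioning an LSTM generator on a dialogue-act vector and training it with cross entropy. The paper's SC-LSTM gates the dialogue-act vector inside the recurrent cell; this simplified version merely concatenates it with each word embedding, and variation would come from sampling the output softmax at decode time.

```python
# Minimal sketch of a dialogue-act-conditioned LSTM generator (PyTorch).
# Simplification: the DA vector is concatenated with each word embedding;
# the paper's SC-LSTM instead gates the DA vector inside the recurrent cell.
import torch
import torch.nn as nn

class CondLSTMGenerator(nn.Module):
    def __init__(self, vocab_size, da_dim, emb_dim=64, hid_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim + da_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens, da_vec):
        # tokens: (batch, seq); da_vec: (batch, da_dim), e.g. a 1-hot
        # dialogue act plus slot-value indicator bits
        x = self.emb(tokens)
        da = da_vec.unsqueeze(1).expand(-1, x.size(1), -1)
        h, _ = self.lstm(torch.cat([x, da], dim=-1))
        return self.out(h)  # next-word logits at every position

model = CondLSTMGenerator(vocab_size=1000, da_dim=20)
tokens = torch.randint(0, 1000, (2, 7))
logits = model(tokens[:, :-1], torch.rand(2, 20))
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 1000),
                             tokens[:, 1:].reshape(-1))  # cross-entropy objective
```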
Statistical dialog systems (SDSs) are motivated by the need for a data-driven framework that reduces the cost of laboriously handcrafting complex dialog managers and that provides robustness against the errors created by speech recognizers operating in noisy environments. By including an explicit Bayesian model of uncertainty and by optimizing the policy via a reward-driven process, partially observable Markov decision processes (POMDPs) provide such a framework. However, exact model representation and optimization is computationally intractable. Hence, the practical application of POMDP-based systems requires efficient algorithms and carefully constructed approximations. This review article provides an overview of the current state of the art in the development of POMDP-based spoken dialog systems.
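The robustness claim rests on maintaining a belief over dialog states rather than a single hypothesis. A minimal numpy sketch of the exact Bayesian belief update, b'(s') ∝ P(o|s') Σ_s P(s'|s,a) b(s), with toy sizes; real systems need factored or approximate representations to stay tractable:

```python
# Sketch of the exact POMDP belief update that gives robustness to ASR errors.
# Toy dimensions and random models; purely illustrative.
import numpy as np

def belief_update(b, T, O, a, o):
    """b: (S,) prior belief; T[a]: (S,S) with T[a][s,s']=P(s'|s,a);
    O[a]: (S,Obs) with O[a][s',o]=P(o|s')."""
    predicted = T[a].T @ b            # sum_s P(s'|s,a) b(s)
    unnorm = O[a][:, o] * predicted   # weight by observation likelihood
    return unnorm / unnorm.sum()      # renormalise

S, A, Obs = 3, 2, 2
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(S), size=(A, S))    # random transition model
O = rng.dirichlet(np.ones(Obs), size=(A, S))  # random observation model
b = np.ones(S) / S                            # start uncertain
b = belief_update(b, T, O, a=0, o=1)
print(b)  # posterior belief after one system act and one noisy observation
```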
One of the most astonishing features of human language is its capacity to convey information efficiently in context. Many theories provide informal accounts of communicative inference, yet there have been few successes in making precise, quantitative predictions about pragmatic reasoning. We examined judgments about simple referential communication games, modeling behavior in these games by assuming that speakers attempt to be informative and that listeners use Bayesian inference to recover speakers' intended referents. Our model provides a close, parameter-free fit to human judgments, suggesting that the use of information-theoretic tools to predict pragmatic reasoning may lead to more effective formal models of communication.
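The speaker-listener reasoning can be sketched as a small matrix recursion in the rational-speech-acts style; the lexicon, referents, and rationality parameter below are toy assumptions, not the study's materials.

```python
# Toy rational-speech-acts recursion: literal listener, informative
# speaker, Bayesian pragmatic listener. Lexicon and prior are invented.
import numpy as np

lexicon = np.array([  # rows: utterances, cols: referents; 1 = literally true
    [1, 1, 0],        # "glasses" is true of referents 0 and 1
    [0, 1, 1],        # "hat" is true of referents 1 and 2
], dtype=float)
prior = np.ones(3) / 3
alpha = 1.0           # speaker rationality

L0 = lexicon / lexicon.sum(axis=1, keepdims=True)  # literal listener P(r|u)
S1 = L0 ** alpha
S1 = S1 / S1.sum(axis=0, keepdims=True)            # speaker P(u|r)
L1 = S1 * prior
L1 = L1 / L1.sum(axis=1, keepdims=True)            # pragmatic listener P(r|u)
print(L1[0])  # "glasses" now favours referent 0, which only it can pick out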
College students (in Experiment 1) and 7th-grade students (in Experiment 2) learned how to design the roots, stem, and leaves of plants to survive in 8 different environments through a computer-based multimedia lesson. They learned by interacting with an animated pedagogical agent who spoke to them (Group PA) or received identical graphics and explanations as on-screen text without a pedagogical agent (Group No PA). Group PA outperformed Group No PA on transfer tests and interest ratings but not on retention tests. To investigate further the basis for this personal agent effect, we varied the interactivity of the agent-based lesson (Experiment 3) and found an interactivity effect: Students who participate in the design of plant parts remember more and transfer what they have learned to solve new problems better than students who learn the same materials without participation. Next, we varied whether the agent's words were presented as speech or on-screen text, and whether the agent's image appeared on the screen. Both with a fictional agent (Experiment 4) and a video of a human face (Experiment 5), students performed better on tests of retention and problem-solving transfer when words were presented as speech rather than on-screen text (producing a modality effect) but visual presence of the agent did not affect test performance (producing no image effect). Results support the introduction of interactive pedagogical agents who communicate with students via speech to promote meaningful learning in multimedia lessons.
Lee D. Erman (USC/Information Sciences Institute, Marina del Rey, California), Frederick Hayes-Roth (The Rand Corporation, Santa Monica, California), Victor R. Lesser (University of Massachusetts, Amherst), and D. Raj Reddy (Carnegie-Mellon University, Pittsburgh). The Hearsay-II Speech-Understanding System: Integrating Knowledge to Resolve Uncertainty. ACM Computing Surveys 12(2), pp. 213–253, June 1980. https://doi.org/10.1145/356810.356816
When people in conversation refer repeatedly to the same object, they come to use the same terms. This phenomenon, called lexical entrainment, has several possible explanations. Ahistorical accounts appeal only to the informativeness and availability of terms and to the current salience of the object's features. Historical accounts appeal in addition to the recency and frequency of past references and to partner-specific conceptualizations of the object that people achieve interactively. Evidence from 3 experiments favors a historical account and suggests that when speakers refer to an object, they are proposing a conceptualization of it, a proposal their addressees may or may not agree to. Once they do establish a shared conceptualization, a conceptual pact, they appeal to it in later references even when they could use simpler references. Over time, speakers simplify conceptual pacts and, when necessary, abandon them for new conceptualizations.
In evaluations of text entry methods, participants enter phrases of text using a technique of interest while performance data are collected. This paper describes and publishes (via the internet) a collection of 500 phrases for such evaluations. Utility programs are also provided to compute statistical properties of the phrase set, or any other phrase set. The merits of using a pre-defined phrase set are described as are methodological considerations, such as attaining results that are generalizable and the possible addition of punctuation and other characters.
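The kind of statistics such utility programs report can be sketched in a few lines; the sample phrases below are illustrative, not drawn from the published set.

```python
# Sketch of phrase-set statistics: counts, average lengths, and relative
# letter frequencies. Sample phrases are made up for illustration.
from collections import Counter

phrases = ["the quick brown fox", "never too rich and never too thin"]
letters = Counter(c for p in phrases for c in p.lower() if c.isalpha())
total = sum(letters.values())

print("phrases:", len(phrases))
print("avg length (chars):", sum(map(len, phrases)) / len(phrases))
print("avg words/phrase:", sum(len(p.split()) for p in phrases) / len(phrases))
for ch, n in letters.most_common(5):
    print(f"{ch}: {n / total:.3f}")  # relative letter frequency
```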
In order to discover design principles for a large memory that can enable it to serve as the base of knowledge underlying human-like language behavior, experiments with a model memory are being performed. This model is built up within a computer by "recoding" a body of information from an ordinary dictionary into a complex network of elements and associations interconnecting them. Then, the ability of a program to use the resulting model memory effectively for simulating human performance provides a test of its design. One simulation program, now running, is given the model memory and is required to compare and contrast the meanings of arbitrary pairs of English words. For each pair, the program locates any relevant semantic information within the model memory, draws inferences on the basis of this, and thereby discovers various relationships between the meanings of the two words. Finally, it creates English text to express its conclusions. The design principles embodied in the memory model, together with some of the methods used by the program, constitute a theory of how human memory for semantic and other conceptual material may be formatted, organized, and used.
This paper describes a corpus of unscripted, task-oriented dialogues which has been designed, digitally recorded, and transcribed to support the study of spontaneous speech on many levels. The corpus uses the Map Task (Brown, Anderson, Yule, and Shillcock, 1983) in which speakers must collaborate verbally to reproduce on one participant's map a route printed on the other's. In all, the corpus includes four conversations from each of 64 young adults and manipulates the following variables: familiarity of speakers, eye contact between speakers, matching between landmarks on the participants' maps, opportunities for contrastive stress, and phonological characteristics of landmark names. The motivations for the design are set out and basic corpus statistics are presented.
Our original paper (Grosz, Joshi, and Weinstein, 1983) on centering claimed that certain entities mentioned in an utterance were more central than others and that this property imposed constraints on a speaker's use of different types of referring expressions. Centering was proposed as a model that accounted for this phenomenon. We argued that the coherence of discourse was affected by the compatibility between centering properties of an utterance and choice of referring expression. Subsequently, we revised and expanded the ideas presented therein. We defined various centering constructs and proposed two centering rules in terms of these constructs. A draft manuscript describing this elaborated centering framework and presenting some initial theoretical claims has been in wide circulation since 1986. This draft (Grosz, Joshi, and Weinstein 1986, hereafter, gjw86) has led to a number of papers by others on this topic and has been extensively cited, but has never been published.[1] We have been urged to publish the more detailed description of the centering framework and theory proposed in gjw86 so that an official version would be archivally available. The task of completing and revising this draft became more daunting as time passed and more and more papers appeared on centering. Many of these papers proposed extensions to or revisions of the theory and attempted to answer questions posed in gjw86. It has become ever more clear that it would be useful to have a "definitive" statement of the original motivations for centering, the basic definitions underlying the centering framework, and the original theoretical claims. This paper attempts to meet that need. To accomplish this goal, we have chosen to remove descriptions of many open research questions posed in gjw86 as well as solutions that were only partially developed. We have also greatly shortened the discussion of criteria for and constraints on a possible semantic theory as a foundation for this work. Introduction: This paper presents an initial attempt to develop a theory that relates focus of attention, choice of referring expression, and perceived coherence of utterances within a discourse segment. The research described here is a further development of several strands of previous research. It fits within a larger effort to provide an overall theory of discourse structure and meaning. In this section we describe the larger research context of this work and then briefly discuss the previous work that led to it. Centering fits within the theory of discourse structure developed by Grosz and Sidner (1986), henceforth, G&S. G&S distinguish among three components of discourse structure: a linguistic structure, an intentional structure, and an attentional state. At the level of linguistic structure, discourses divide into constituent discourse segments; an embedding relationship may hold between two segments. The intentional structure comprises intentions and relations among them. The intentions provide the basic rationale for the discourse. [1] Early drafts of gjw86 were in circulation from 1983. Some citations to other work have dates between 1983 and 1986. This work utilized these earlier drafts.
We describe a statistical approach for modeling dialogue acts in conversational speech, i.e., speech-act-like units such as Statement, Question, Backchannel, Agreement, Disagreement, and Apology. Our model detects and predicts dialogue acts based on lexical, collocational, and prosodic cues, as well as on the discourse coherence of the dialogue act sequence. The dialogue model is based on treating the discourse structure of a conversation as a hidden Markov model and the individual dialogue acts as observations emanating from the model states. Constraints on the likely sequence of dialogue acts are modeled via a dialogue act n-gram. The statistical dialogue grammar is combined with word n-grams, decision trees, and neural networks modeling the idiosyncratic lexical and prosodic manifestations of each dialogue act. We develop a probabilistic integration of speech recognition with dialogue modeling, to improve both speech recognition and dialogue act classification accuracy. Models are trained and evaluated using a large hand-labeled database of 1,155 conversations from the Switchboard corpus of spontaneous human-to-human telephone speech. We achieved good dialogue act labeling accuracy (65% based on errorful, automatically recognized words and prosody, and 71% based on word transcripts, compared to a chance baseline accuracy of 35% and human accuracy of 84%) and a small reduction in word recognition error.
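The HMM view can be made concrete with a toy Viterbi decode: dialogue acts are hidden states linked by a bigram dialogue grammar, and each utterance contributes an emission likelihood from the lexical/prosodic models. All probabilities below are invented for illustration.

```python
# Toy Viterbi decode over dialogue-act states with a bigram "dialogue
# grammar" and per-utterance emission scores. Numbers are illustrative.
import numpy as np

acts = ["Statement", "Question", "Backchannel"]
log_trans = np.log(np.array([   # row i, col j: P(act_j | act_i)
    [0.6, 0.3, 0.1],
    [0.5, 0.1, 0.4],
    [0.7, 0.2, 0.1]]))
log_emit = np.log(np.array([    # rows: acts; cols: utterances in sequence
    [0.5, 0.2, 0.3],
    [0.2, 0.6, 0.1],
    [0.3, 0.2, 0.6]]))

T = log_emit.shape[1]
delta = log_emit[:, 0] + np.log(1 / 3)   # uniform initial distribution
back = np.zeros((T, len(acts)), dtype=int)
for t in range(1, T):
    scores = delta[:, None] + log_trans  # scores[i, j]: best path i -> j
    back[t] = scores.argmax(axis=0)
    delta = scores.max(axis=0) + log_emit[:, t]

path = [int(delta.argmax())]             # backtrack the best sequence
for t in range(T - 1, 0, -1):
    path.append(back[t][path[-1]])
print([acts[i] for i in reversed(path)])
```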
This paper presents a spreading-activation theory of human semantic processing, which can be applied to a wide range of recent experimental results. The theory is based on Quillian's theory of semantic memory search and semantic preparation, or priming. In conjunction with this, several of the misconceptions concerning Quillian's theory are discussed. A number of additional assumptions are proposed for his theory in order to apply it to recent experiments. The present paper shows how the extended theory can account for results of several production experiments by Loftus, Juola and Atkinson's multiple-category experiment, Conrad's sentence-verification experiments, and several categorization experiments on the effect of semantic relatedness and typicality by Holyoak and Glass, Rips, Shoben, and Smith, and Rosch. The paper also provides a critique of the Smith, Shoben, and Rips model for categorization judgments. Some years ago, Quillian (1962, 1967) proposed a spreading-activation theory of human semantic processing that he tried to implement in computer simulations of memory search (Quillian, 1966) and comprehension (Quillian, 1969). The theory viewed memory search as activation spreading from two or more concept nodes in a semantic network until an intersection was found. The effects of preparation (or priming) in semantic memory were also explained in terms of spreading activation from the node of the primed concept. Rather than a theory to explain data, it was a theory designed to show how to build human semantic structure and processing into a computer.
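A minimal sketch of the spreading-activation search (toy network and decay constant; Quillian's model additionally tags paths and distinguishes link types):

```python
# Sketch of spreading activation: energy starts at two concept nodes,
# decays per link, and the strongest commonly activated node signals the
# intersection. Network and decay value are invented.
network = {
    "canary": ["bird", "yellow"],
    "bird": ["animal", "wings"],
    "ostrich": ["bird", "tall"],
    "animal": [], "wings": [], "yellow": [], "tall": [],
}

def spread(source, decay=0.7, threshold=0.1):
    activation = {source: 1.0}
    frontier = [source]
    while frontier:
        node = frontier.pop()
        for nbr in network.get(node, []):
            a = activation[node] * decay
            if a > threshold and a > activation.get(nbr, 0.0):
                activation[nbr] = a       # keep the strongest activation
                frontier.append(nbr)
    return activation

a1, a2 = spread("canary"), spread("ostrich")
common = {n: a1[n] + a2[n] for n in a1.keys() & a2.keys()}
print(max(common, key=common.get))  # "bird": the intersection concept
```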
Currently, computational linguists and cognitive scientists working in the area of discourse and dialogue argue that their subjective judgments are reliable using several different statistics, none of which are easily interpretable or comparable to each other. Meanwhile, researchers in content analysis have already experienced the same difficulties and come up with a solution in the kappa statistic. We discuss what is wrong with reliability measures as they are currently used for discourse and dialogue work in computational linguistics and cognitive science, and argue that we would be better off as a field adopting techniques from content analysis.
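Cohen's kappa, the statistic advocated here, corrects raw agreement for chance: kappa = (P(A) - P(E)) / (1 - P(E)). A short sketch with made-up dialogue-act labels:

```python
# Cohen's kappa for two coders labelling the same utterances.
# Labels are invented; Q = question, S = statement, B = backchannel.
from collections import Counter

coder1 = ["Q", "S", "S", "B", "Q", "S", "B", "S"]
coder2 = ["Q", "S", "B", "B", "Q", "S", "S", "S"]
n = len(coder1)

p_agree = sum(a == b for a, b in zip(coder1, coder2)) / n   # observed P(A)
c1, c2 = Counter(coder1), Counter(coder2)
p_chance = sum((c1[k] / n) * (c2[k] / n) for k in set(c1) | set(c2))  # P(E)
kappa = (p_agree - p_chance) / (1 - p_chance)
print(f"observed {p_agree:.2f}, chance {p_chance:.2f}, kappa {kappa:.2f}")
```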
ELIZA is a program operating within the MAC time-sharing system of MIT which makes certain kinds of natural language conversation between man and computer possible. Input sentences are analyzed on the basis of decomposition rules which are triggered by key words appearing in the input text. Responses are generated by reassembly rules associated with selected decomposition rules. The fundamental technical problems with which ELIZA is concerned are: (1) the identification of key words, (2) the discovery of minimal context, (3) the choice of appropriate transformations, (4) generation of responses in the absence of key words, and (5) the provision of an editing capability for ELIZA "scripts". A discussion of some psychological issues relevant to the ELIZA approach as well as of future developments concludes the paper.
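A minimal sketch of the keyword-triggered decomposition/reassembly cycle (two toy rules only; the real ELIZA script adds ranked keywords, pronoun transformations, and script editing):

```python
# Tiny ELIZA-style responder: decomposition rules fire on key words,
# reassembly templates reuse the matched context. Rules are illustrative.
import re

rules = [
    (re.compile(r"\bI am (.*)", re.I), "How long have you been {0}?"),
    (re.compile(r"\bI feel (.*)", re.I), "Why do you feel {0}?"),
]
fallback = "Please go on."  # response in the absence of key words

def respond(sentence):
    for pattern, template in rules:
        m = pattern.search(sentence)                    # key word + context
        if m:
            return template.format(m.group(1).rstrip("."))  # reassembly
    return fallback

print(respond("I am very unhappy these days."))
# -> "How long have you been very unhappy these days?"
```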
Abstraction. Information that has been selected because it is important and/or relevant to the schema is further reduced during the encoding process by abstraction. This process codes the meaning but not the format of a message (e.g., Bobrow, 1970; Bransford, Barclay, & Franks, 1972). Thus, details such as the lexical form of an individual word (e.g., Schank, 1972, 1976) and the syntactic form of a sentence (e.g., Sachs, 1967) will not be preserved in memory. Because memory for syntax appears to be particularly sparse as well as brief (e.g., J. R. Anderson, 1974; Begg & Wickelgren, 1974; Jarvella, 1971; Sachs, 1967, 1974), the abstraction process is thought to operate during encoding. Additional support for the notion that what is stored is an abstracted representation of the original stimulus comes from studies that demonstrate that after a passage is read, it takes subjects the same amount of time to verify information originally presented in a complex linguistic format as it does to verify that same information presented in a simpler format (e.g., King & Greeno, 1974; Kintsch & Monk, 1972; Bransford et al., 1972; Brewer, 1975; Frederiksen, 1975a; Kintsch, 1974; Norman & Rumelhart, 1975; Schank, 1972, 1976). One formalized presentation of this idea is Schank's conceptual dependency theory (1972). The theory asserts that all propositions can be expressed by a small set of primitive concepts. All lexical expressions that share an identical meaning will be represented in one way (and so stored economically) regardless of their presentation format. As a result people should often incorrectly recall or misrecognize synonyms of originally presented words, and they do (e.g., Anderson & Bower, 1973; R. C. Anderson, 1974; Anisfeld & Knapp, 1968; Brewer, 1975; Graesser, 1978b; Sachs, 1974). Abstraction and memory theories. Since considerable information is lost via the abstraction process, this process can easily account for the incompleteness that is characteristic of people's recall of complex events. In light of the abstraction process, the problem for schema theories becomes one of accounting for accurate recall. Schema theories do this by borrowing a finding from psycholinguistic research, to wit, that speakers of a language share preferred ways of expressing information. If both the creator and perceiver of a message are operating with the same preferences or under the same biases, the perceiver's reproduction of the input may appear to be accurate. The accuracy, however, is the product of recalling the semantic content of the message and imposing the preferred structure onto it. Thus, biases operate in a manner that is similar to the probable detail reconstruction process. Biases have been documented for both syntactic information (J. R. Anderson, 1974; Bock, 1977; Bock & Brewer, 1974; Clark & Clark, 1968; James, Thompson, & Baldwin, 1973) and lexical information (Brewer, 1975; Brewer & Lichtenstein, 1974). Distortions may result from the abstraction process if biases are not shared by the person who creates the message and the one who receives it. More importantly, the ab-
This article reviews research on the use of situation models in language comprehension and memory retrieval over the past 15 years. Situation models are integrated mental representations of a described state of affairs. Significant progress has been made in the scientific understanding of how situation models are involved in language comprehension and memory retrieval. Much of this research focuses on establishing the existence of situation models, often by using tasks that assess one dimension of a situation model. However, the authors argue that the time has now come for researchers to begin to take the multidimensionality of situation models seriously. The authors offer a theoretical framework and some methodological observations that may help researchers to tackle this issue.
Software radios are emerging as platforms for multiband multimode personal communications systems. Radio etiquette is the set of RF bands, air interfaces, protocols, and spatial and temporal patterns that moderate the use of the radio spectrum. Cognitive radio extends the software radio with radio-domain model-based reasoning about such etiquettes. Cognitive radio enhances the flexibility of personal services through a radio knowledge representation language. This language represents knowledge of radio etiquette, devices, software modules, propagation, networks, user needs, and application scenarios in a way that supports automated reasoning about the needs of the user. This empowers software radios to conduct expressive negotiations among peers about the use of radio spectrum across fluents of space, time, and user context. With RKRL, cognitive radio agents may actively manipulate the protocol stack to adapt known etiquettes to better satisfy the user's needs. This transforms radio nodes from blind executors of predefined protocols to radio-domain-aware intelligent agents that search out ways to deliver the services the user wants even if that user does not know how to obtain them. Software radio provides an ideal platform for the realization of cognitive radio.
This paper introduces the Ubuntu Dialogue Corpus, a dataset containing almost 1 million multi-turn dialogues, with a total of over 7 million utterances and 100 million words. This provides a unique resource for research into building dialogue managers based on neural language models that can make use of large amounts of unlabeled data. The dataset has both the multi-turn property of conversations in the Dialog State Tracking Challenge datasets, and the unstructured nature of interactions from microblog services such as Twitter. We also describe two neural learning architectures suitable for analyzing this dataset, and provide benchmark performance on the task of selecting the best next response.
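The benchmark task, selecting the best next response given a context, can be sketched with a TF-IDF retrieval baseline; the paper's reported benchmarks use neural encoders, and the snippets below are invented rather than drawn from the corpus.

```python
# TF-IDF baseline for next-response selection: score each candidate by
# cosine similarity with the context and pick the best. Text is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

context = "my wifi driver broke after the upgrade , any ideas ?"
candidates = [
    "try reinstalling the driver package and reboot",
    "i like the new wallpaper in this release",
]
vec = TfidfVectorizer().fit([context] + candidates)
scores = cosine_similarity(vec.transform([context]),
                           vec.transform(candidates))[0]
print(candidates[scores.argmax()])  # the on-topic reply wins on overlap
```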
Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, Steve Young. A Network-based End-to-End Trainable Task-oriented Dialogue System. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. 2017.
We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.
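The methodology reduces to scoring responses with an automatic metric and correlating those scores with human ratings; a sketch with made-up numbers (scipy):

```python
# Correlating automatic metric scores with human ratings, per response.
# All numbers are invented; weak correlations are the paper's warning sign.
from scipy.stats import pearsonr, spearmanr

metric_scores = [0.10, 0.32, 0.05, 0.41, 0.22]  # e.g. word-overlap per response
human_scores = [3.5, 2.0, 4.1, 2.5, 3.0]        # mean human rating, 1 to 5

r, _ = pearsonr(metric_scores, human_scores)
rho, _ = spearmanr(metric_scores, human_scores)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```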
Recent technological advances in connected-speech recognition and position sensing in space have encouraged the notion that voice and gesture inputs at the graphics interface can converge to provide a concerted, natural user modality. The work described herein involves the user commanding simple shapes about a large-screen graphics display surface. Because voice can be augmented with simultaneous pointing, the free usage of pronouns becomes possible, with a corresponding gain in naturalness and economy of expression. Conversely, gesture aided by voice gains precision in its power to reference.
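The pronoun-plus-pointing idea can be sketched as time-aligned binding of deictic words to pointing events; timestamps and object names below are invented.

```python
# Toy deictic resolution: each "that"/"there" binds to the pointing event
# nearest in time. Streams are invented for illustration.
pointing = [(1.2, "blue square"), (2.6, "upper right corner")]  # (time, target)
words = [(1.1, "put"), (1.3, "that"), (2.5, "there")]           # (time, word)

def resolve(word_time):
    return min(pointing, key=lambda p: abs(p[0] - word_time))[1]

command = [resolve(t) if w in ("that", "there") else w for t, w in words]
print(" ".join(command))  # put blue square upper right corner
```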
Comparisons between Japanese and English prosodics have usually either focused on the strikingly apparent phonetic differences between the stress patterns of English and the tonal accent patterns of Japanese or concentrated upon formal similarities between the abstract arrangements of the stresses and tones. A recent investigation of tone structure in Japanese (Pierrehumbert & Beckman forthcoming), however, has convinced us that if the proper prosodic phenomena are compared, far more pervasive similarities can be discovered and of a much more concrete sort than hitherto suspected. In particular, there is now extensive evidence that Japanese tonal patterns are very sparsely specified, which suggests that they are much more similar to English intonational structures than earlier descriptions would have allowed.
Abstract: Interactional Competence (IC) is an important subcomponent of oral proficiency, but many computer-mediated oral English assessments fall short in assessing this construct mainly due to technological limitations. Spoken Dialogue Systems (SDSs) have shown promise in assessing L2 oral communication, yet further investigation is needed on their effectiveness in eliciting IC features in high-stakes assessment contexts. This study is unique in that it analyzed both test takers' and an SDS's spoken discourse in human-computer interactions. Using an SDS to simulate an examiner in an IELTS Speaking task, the study explored how well the system mimicked human-human interaction and elicited IC features from test takers, focusing on the IC features documented in prior research. Thirty participants completed the SDS-mediated test, with their performances rated by two trained raters. Semi-structured interviews with 10 test takers were conducted following the assessment. The findings revealed that the SDS successfully elicited key IC features, which helped distinguish test takers at different proficiency levels, with reliable scoring across raters. Most test takers found the SDS competent, though some noted its limitations in nonverbal communication and conversational flow. These results suggest that SDSs have potential in oral proficiency assessments and provide valuable insights for refining SDS design to ensure reliable and valid assessments.
This paper introduces an innovative mobile-assisted language-learning (MALL) system that harnesses deep learning technology to analyze pronunciation patterns and deliver real-time, personalized feedback. Drawing inspiration from how the human brain processes speech through neural pathways, our system analyzes multiple speech features using spectrograms, mel-frequency cepstral coefficients (MFCCs), and formant frequencies in a manner that mirrors the auditory cortex's interpretation of sound. The core of our approach utilizes a convolutional neural network (CNN) to classify pronunciation patterns from user-recorded speech. To enhance the assessment accuracy and provide nuanced feedback, we integrated a fuzzy inference system (FIS) that helps learners identify and correct specific pronunciation errors. The experimental results demonstrate that our multi-feature model achieved 82.41% to 90.52% accuracies in accent classification across diverse linguistic contexts. The user testing revealed statistically significant improvements in pronunciation skills, where learners showed a 5–20% enhancement in accuracy after using the system. The proposed MALL system offers a portable, accessible solution for language learners while establishing a foundation for future research in multilingual functionality and mobile platform optimization. By combining advanced speech analysis with intuitive feedback mechanisms, this system addresses a critical challenge in language acquisition and promotes more effective self-directed learning.
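A sketch of the described front-end/classifier pairing: MFCCs via librosa feeding a small CNN (PyTorch). Layer sizes are illustrative, not the paper's architecture, and a random tensor stands in for a real clip.

```python
# MFCC front end plus a small CNN classifier; an illustrative pairing,
# not the paper's model. The demo uses a random tensor as a stand-in.
import librosa
import torch
import torch.nn as nn

def mfcc_tensor(path, sr=16000, n_mfcc=13):
    y, _ = librosa.load(path, sr=sr)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return torch.tensor(m, dtype=torch.float32)[None, None]  # (1,1,13,T)

class AccentCNN(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes),
        )

    def forward(self, x):
        return self.net(x)  # logits over accent classes

model = AccentCNN()
dummy = torch.randn(1, 1, 13, 120)  # stands in for mfcc_tensor("clip.wav")
print(model(dummy).shape)           # torch.Size([1, 4])
```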
Listeners of all ages and hearing abilities must contend with the fact that speech is often heard in challenging conditions. Research has found that the process of spoken word recognition changes in these contexts, but what these changes represent and whether they have meaningful effects on outcomes is unknown. Here, we build on recent work by applying a principal component analysis approach to eye-tracking data to show that these changes reflect a set of underlying dimensions that are shared across two types of challenging listening (noise and vocoding). Moreover, we show that individual listeners' use of each dimension is largely consistent across challenge types and is predicted by domain-general factors. Finally, we show that changes to word recognition offer indirect benefits to performance in challenging conditions, but only for some listeners.
Carlos Acuña-Fariña | Routledge eBooks
In recent years, Virtual Reality (VR) has emerged as a powerful tool for disseminating Cultural Heritage (CH), often incorporating Virtual Humans (VHs) to guide users through historical recreations. The advent of Large Language Models (LLMs) now enables natural, unscripted communication with these VHs, even on limited devices. This paper details a natural interaction system for VHs within a VR application of San Cristóbal de La Laguna, a UNESCO World Heritage Site. Our system integrates Speech-to-Text, LLM-based dialogue generation, and Text-to-Speech synthesis. Adhering to user-centered design (UCD) principles, we conducted two studies: a preliminary study revealing user interest in historically adapted language, and a qualitative test that identified key user experience improvements, such as incorporating feedback mechanisms and gender selection for VHs. The project successfully developed a prioritized user experience, focusing on usability evaluation, immersion, and dialogue quality. We propose a generalist methodology and recommendations for integrating unscripted VH dialogue in VR. However, limitations include dialogue generation latency and reduced quality in non-English languages. While a formative usability test evaluated the process, the small sample size restricts broad generalizations about user behavior.
Personalized chatbots concentrate on learning human personalities, making them act like real users. When such a chatbot is authorized to respond to other people's messages, it speaks the same way the user would. Many personalized methods have been proposed that use several persona descriptions or key-value-based persona information to assign a personality to dialogue chatbots. Most of them employ explicit user profiles. However, obtaining extensive explicit user profiles is extremely time-consuming and requires tremendous manual labor. In addition, explicit user profiles cannot be updated as the user's interests change. In this paper, we propose a generation-based personalized chatbot model, IMDPchat, that learns latent user representations from users' abundant dialogue history. Specifically, we train a personalized language model to build a global user profile using dialogue responses. To take full advantage of the user information contained in the historical dialogue, we establish a key-value memory network and construct a post-sensitive personalized selection module. Both components are context-aware: we assign higher weights to historical post-response pairs that are related to the current post. To predict more personalized responses, we design a personalized response decoder that integrates two decoding modes: generating tokens and copying personalized words. Experimental results indicate that the IMDPchat model outperforms previous baselines remarkably.
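The key-value memory read at the heart of such a model can be sketched with numpy: history posts act as keys, the user's past responses as values, and the current post's embedding queries them, yielding the post-sensitive weighting described above. Embeddings here are random stand-ins.

```python
# Sketch of a post-sensitive key-value memory read. Random vectors stand
# in for learned embeddings of past posts, past responses, and the query.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d, n_history = 16, 4
keys = rng.normal(size=(n_history, d))     # embeddings of past posts
values = rng.normal(size=(n_history, d))   # embeddings of past responses
query = rng.normal(size=d)                 # embedding of the current post

weights = softmax(keys @ query)            # higher for related past posts
profile = weights @ values                 # latent, context-aware profile
print(weights.round(2), profile.shape)     # profile feeds the decoder
```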
Narasimha Rao | INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
Abstract: In today's technologically advanced world, bridging the communication gap between hearing-impaired individuals and the rest of society is a critical challenge. Sign language serves as a primary mode of communication for the deaf and hard-of-hearing community. However, due to the limited number of people proficient in sign language, there exists a significant communication barrier. This project aims to address this gap by developing an intelligent, real-time sign language recognition system using deep learning techniques. The proposed system utilizes a combination of computer vision and deep learning algorithms to accurately recognize hand gestures representing sign language. Leveraging tools such as MediaPipe for hand tracking and a Convolutional Neural Network (CNN) or keypoint-based classifier for gesture classification, the system processes live video input or uploaded images to identify signs and convert them into readable text. The model is trained on a custom or publicly available sign language dataset, ensuring accuracy and robustness across various lighting conditions and hand orientations. Key modules of the system include data preprocessing, feature extraction, model training using gesture sequences, and real-time inference. The model demonstrates high classification accuracy and low latency, making it suitable for real-world applications such as education, customer service, and accessibility platforms. This project not only highlights the potential of artificial intelligence in assistive technologies but also contributes to fostering inclusivity and equal communication opportunities for all individuals, regardless of physical ability. Keywords: Sign Language Recognition (SLR), Real-Time Gesture Recognition, Deaf and Hard-of-Hearing Communication, MediaPipe, Deep Learning, Computer Vision, DualNet-SLR, Point History Network, Keypoint History Network, Streamlit Interface.
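The keypoint-based route can be sketched as a small classifier over flattened hand landmarks (e.g., the 21 points a tracker such as MediaPipe emits per hand); the data below are random stand-ins for real frames.

```python
# Toy keypoint classifier: normalised (x, y) hand landmarks in, sign
# labels out. Random data replaces real tracked frames for illustration.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 42))        # 21 landmarks x (x, y), flattened
y = rng.integers(0, 5, size=200)      # 5 toy sign classes
X -= X.mean(axis=1, keepdims=True)    # crude translation normalisation

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300).fit(X, y)
print(clf.predict(X[:3]))             # predicted sign labels for 3 frames
```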
ABSTRACT-- This paper examines ASKQUESTIONS!, a real-time audience engagement platform employing multimodal Artificial Intelligence (AI) to optimize question management in conferences and events. The system allows attendees to submit questions via QR codes, eliminating app installations, and uses AI to filter and prioritize these questions based on relevance, tone, clarity, and crowd-sourced feedback like upvotes. Core technologies include React for interactive UIs, Node.js with Express for backend API, and BERT-based AI microservices for question analysis. Compared to traditional Q&A methods, ASKQUESTIONS! offers enhanced inclusivity, efficient question handling, and comprehensive session management tools for organizers. Evaluations demonstrate the platform's ability to streamline question curation, improve audience engagement, and provide valuable analytics. The integration of AI in ASKQUESTIONS! significantly enhances the event experience by ensuring that the most pertinent and representative questions are addressed, making interactions more meaningful and efficient for both speakers and attendees. KEYWORDS-- Audience engagement, real-time interaction, artificial intelligence, question prioritization, conference platform, event technology, natural language processing, BERT, session management, QR codes
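The described filtering and prioritization can be sketched as a weighted blend of AI-scored attributes and normalised upvotes; the weights and fields below are illustrative, not the product's actual scoring.

```python
# Toy question-ranking rule: blend model-scored relevance/clarity/tone
# with crowd upvotes. Weights and fields are invented for illustration.
questions = [
    {"text": "How does the model handle noisy audio?",
     "relevance": 0.9, "clarity": 0.8, "tone": 0.9, "upvotes": 12},
    {"text": "first!!",
     "relevance": 0.1, "clarity": 0.3, "tone": 0.5, "upvotes": 2},
]

def priority(q, max_upvotes):
    crowd = q["upvotes"] / max(1, max_upvotes)  # normalise upvotes to [0, 1]
    return (0.4 * q["relevance"] + 0.2 * q["clarity"]
            + 0.1 * q["tone"] + 0.3 * crowd)

m = max(q["upvotes"] for q in questions)
for q in sorted(questions, key=lambda q: priority(q, m), reverse=True):
    print(f"{priority(q, m):.2f}  {q['text']}")
```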
D. Kumar | INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
Abstract—The evolution of virtual personal assistants (VPAs) has been significantly influenced by advancements in voice command recognition and response optimization. Modern systems leverage sophisticated technologies such as Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) synthesis to facilitate seamless human-computer interactions. These integrations enable VPAs to comprehend and process voice inputs, interpret user intent, and generate contextually appropriate responses. Recent developments have introduced multimodal capabilities, allowing VPAs to engage in voice, text, and visual interactions. For instance, OpenAI's GPT-4o model supports real-time voice conversations, providing users with dynamic and natural interactions. This advancement enhances the VPA's ability to manage a wide spectrum of tasks, from answering questions and managing calendars to niche functions like coding. Furthermore, the integration of VPAs with hardware platforms, such as the ESP32 microcontroller, has facilitated the development of intelligent voice interfaces. These systems utilize cloud APIs and conversational intelligence to deliver comprehensive solutions for voice-based interactions, enhancing productivity across various environments. Despite these advancements, challenges persist in ensuring the accuracy, security, and privacy of voice interactions. Addressing issues related to data protection and system vulnerabilities is crucial for the continued success and adoption of voice-enabled VPAs. In conclusion, the integration of voice command recognition and response optimization in VPAs represents a significant leap towards more intuitive and efficient human-computer interactions. Ongoing research and development in this field are essential to overcome existing challenges and unlock the full potential of voice-enabled technologies.
Journal on Electronic and Automation Engineering
As voice interfaces become increasingly essential for smart devices, the demand for efficient, embedded AI solutions that operate reliably in real time is increasing. Audio input is captured using a high-sensitivity I2S microphone and processed over the I2S interface to ensure high-quality digital audio streaming. The captured voice data is securely transmitted over Wi-Fi with TLS encryption to the Deepgram cloud-based ASR engine for accurate speech-to-text conversion. The resulting text is then sent to Google Gemini AI for advanced natural language understanding, allowing the system to interpret user intent in context. Based on the interpreted query, a response is generated and sent via a text-to-speech (TTS) engine, either cloud-based or local, before being output via an onboard DAC and a connected audio amplifier. The system uses a hybrid processing model: simple voice commands, such as toggling GPIOs, controlling devices, or accessing local sensor data, are processed locally to reduce latency, while more complex or open-ended queries are handled in the cloud. A lightweight command parser on the ESP32 detects and processes predefined keywords or phrases. The entire communication system is designed to prioritize secure, low-latency performance, ensure fast response times, and protect user data. The modular design allows for easy customization and integration with additional sensors, devices, or third-party services. This voice assistant platform is a versatile solution for edge AI applications, ideal for smart home automation, IoT systems, portable assistants, and voice-controlled embedded devices.
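The hybrid local/cloud split can be sketched as a lightweight routing function in the spirit of the command parser described; phrases, handler names, and the cloud placeholder are all illustrative.

```python
# Toy hybrid router: predefined device commands answered locally,
# everything else deferred to the cloud ASR/LLM path. Names are invented.
LOCAL_COMMANDS = {
    "light on": lambda: "GPIO high: light on",
    "light off": lambda: "GPIO low: light off",
    "temperature": lambda: "local sensor: 22.5 C",
}

def send_to_cloud(text: str) -> str:
    # Placeholder for the cloud hop (ASR text -> LLM -> TTS response).
    return f"[cloud] answering: {text!r}"

def route(transcript: str) -> str:
    text = transcript.lower().strip()
    for phrase, handler in LOCAL_COMMANDS.items():
        if phrase in text:
            return handler()          # low-latency local path
    return send_to_cloud(text)        # complex or open-ended query

print(route("please turn the light on"))
print(route("what is the capital of peru"))
```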
Matheus Valente | THEORIA An International Journal for Theory History and Foundations of Science
Can others grasp my first-person thoughts, or are such thoughts inherently private? Philosophers disagree: some argue that first-person thoughts are apprehensible only by their owners, while others contend that they can be shared through communication—expressible by 'you' as readily as by 'I'. In this paper, I set out to clarify the stakes of this age-long dispute. Taking J. L. Bermúdez's forceful defence of shareability as the backdrop of my discussion, I examine how the intersubjective availability of thoughts interacts with issues concerning the objectivity of thought, testimonial knowledge transmission, and rational action. The bulk of this paper is an elaboration of the Asymmetry Argument, which grounds the privacy of first-person thoughts in the need to explain how thinkers who believe and desire the same as each other might nonetheless have distinct reasons for action. If successful, the argument reveals how first-person thoughts cannot be shareable in a philosophically significant sense without compromising their fundamental connection to motivating reasons for action.
This study presents a novel fuzzy logic framework to quantitatively evaluate dialogue coherence, integrating mathematical modeling with an experimental case study approach. Recognizing that dialogue coherence is a continuous and multidimensional construct, we employ fuzzy set theory to design membership functions for critical linguistic variables, including topical continuity, syntactic alignment, and semantic relevance. Unlike traditional binary metrics, our approach computes a continuous coherence score using a weighted aggregation model, where each score is derived from expert-calibrated fuzzy inference rules. The empirical case study uses a heterogeneous dialogue corpus consisting of interview transcripts and natural conversation recordings. The corpus was split into segments, which were annotated by linguistic experts. Pearson correlation analysis shows a strong correlation between the fuzzy coherence scores and the expert ratings, highlighting the robustness and reliability of the method. The research elaborates on implications for communication studies, such as applications to therapy, education, and human-computer interaction, as well as limitations such as subjectivity in defining the rules and challenges in scaling. We conclude by proposing several lines of future research, such as incorporating additional linguistic and non-verbal variables and creating automated calibration methods that would allow the model to personalize itself over time. In summary, our study supports the use of fuzzy logic for capturing the subtle gradience of dialogue coherence, enriching the theoretical notion of dialogue while also serving as a workable model for classification.
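The core computation can be sketched as triangular membership functions feeding a weighted aggregation; all breakpoints and weights below are illustrative, not the study's calibrated values.

```python
# Toy fuzzy coherence score: triangular memberships turn raw linguistic
# measurements into degrees, combined by a weighted aggregation.
def tri(x, a, b, c):
    """Triangular membership: rises over a->b, falls over b->c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def coherence(topical, syntactic, semantic):
    degrees = {
        "topical_high": tri(topical, 0.4, 0.8, 1.2),
        "syntactic_high": tri(syntactic, 0.3, 0.7, 1.1),
        "semantic_high": tri(semantic, 0.5, 0.9, 1.3),
    }
    weights = {"topical_high": 0.5, "syntactic_high": 0.2,
               "semantic_high": 0.3}          # weights sum to 1
    return sum(weights[k] * degrees[k] for k in weights)

print(f"{coherence(0.75, 0.6, 0.8):.2f}")     # continuous score in [0, 1]
```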