Computer Science / Artificial Intelligence

Topic Modeling

Description

This cluster of papers covers a wide range of advancements in natural language processing, including neural network architectures, word representation models, machine translation techniques, text classification algorithms, semantic similarity measures, named entity recognition methods, pretrained language models, sequence-to-sequence learning approaches, topic modeling strategies, and information retrieval systems.

Keywords

Neural Networks; Word Representation; Machine Translation; Text Classification; Semantic Similarity; Named Entity Recognition; Pretrained Models; Sequence-to-Sequence Learning; Topic Modeling; Information Retrieval

We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
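As a concrete illustration of the idea, the sketch below trains skip-gram word vectors with the gensim library, an independent open-source implementation of these architectures; the two-sentence toy corpus and all hyperparameters are illustrative assumptions, not the paper's setup.

from gensim.models import Word2Vec

# Toy corpus; in practice the model is trained on billions of words.
corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
]
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 selects skip-gram
vector = model.wv["king"]                   # the learned 50-dimensional word vector
neighbours = model.wv.most_similar("king")  # nearest words by cosine similarity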
Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.
We report on a series of experiments with convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. We additionally propose a simple modification to the architecture to allow for the use of both task-specific and static vectors. The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification.
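A minimal PyTorch sketch of this kind of sentence CNN (convolutions of several widths over word embeddings, max-over-time pooling, then a linear classifier) is given below; the vocabulary size, filter counts, and filter widths are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class SentenceCNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, num_classes=2):
        super().__init__()
        # In the static-vector variant, this embedding would be loaded
        # from pre-trained word vectors and frozen.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, 100, kernel_size=k) for k in (3, 4, 5)
        )
        self.fc = nn.Linear(3 * 100, num_classes)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)    # (batch, embed_dim, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))  # class logits

logits = SentenceCNN()(torch.randint(0, 10000, (8, 20)))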
Jay M. Ponte, W. Bruce Croft (University of Massachusetts, Amherst). A Language Modeling Approach to Information Retrieval. ACM SIGIR Forum, Volume 51, Issue 2, July 2017, pp. 202–208. https://doi.org/10.1145/3130348.3130368
Most current statistical natural language processing models use only local features so as to permit dynamic programming in inference, but this makes them unable to fully account for the long distance structure that is prevalent in language use. We show how to solve this dilemma with Gibbs sampling, a simple Monte Carlo method used to perform approximate inference in factored probabilistic models. By using simulated annealing in place of Viterbi decoding in sequence models such as HMMs, CMMs, and CRFs, it is possible to incorporate non-local structure while preserving tractable inference. We use this technique to augment an existing CRF-based information extraction system with long-distance dependency models, enforcing label consistency and extraction template consistency constraints. This technique results in an error reduction of up to 9% over state-of-the-art systems on two established information extraction tasks.
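To make the inference procedure concrete, here is a generic Gibbs-sampling loop over sequence labels: each position is resampled from its conditional given all other labels, so non-local scoring terms pose no problem. The score function is a placeholder (any unnormalized probability over full label sequences), not the paper's CRF; the toy call at the end is purely illustrative.

import random

def gibbs_sweeps(labels, label_set, score, sweeps=50):
    """Resample one label at a time; score() may use arbitrary non-local features."""
    labels = list(labels)
    for _ in range(sweeps):
        for i in range(len(labels)):
            weights = [score(labels[:i] + [y] + labels[i + 1:]) for y in label_set]
            labels[i] = random.choices(label_set, weights=weights)[0]
    return labels

# Toy non-local score that simply rewards more PER labels anywhere in the sequence.
result = gibbs_sweeps(["O", "O", "O"], ["O", "PER"],
                      score=lambda seq: 1.0 + seq.count("PER"), sweeps=10)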
The ability to accurately represent sentences is central to language understanding. We describe a convolutional architecture dubbed the Dynamic Convolutional Neural Network (DCNN) that we adopt for the semantic modelling of sentences. The network uses Dynamic k-Max Pooling, a global pooling operation over linear sequences. The network handles input sentences of varying length and induces a feature graph over the sentence that is capable of explicitly capturing short and long-range relations. The network does not rely on a parse tree and is easily applicable to any language. We test the DCNN in four experiments: small scale binary and multi-class sentiment prediction, six-way question classification and Twitter sentiment prediction by distant supervision. The network achieves excellent performance in the first three tasks and a greater than 25% error reduction in the last task with respect to the strongest baseline.
Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong" and "Paris" are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.
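The sketch below shows the same idea via gensim's Doc2Vec, an open-source implementation of Paragraph Vector; the two toy documents and the hyperparameters are illustrative assumptions.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["deep", "learning", "for", "text"], tags=["d0"]),
    TaggedDocument(words=["bag", "of", "words", "baseline"], tags=["d1"]),
]
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)
vec = model.dv["d0"]                                     # learned paragraph vector
new_vec = model.infer_vector(["new", "unseen", "text"])  # vector for unseen text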
Continuous space language models have recently demonstrated outstanding results across a variety of tasks. In this paper, we examine the vector-space word representations that are implicitly learned by the input-layer weights. We find that these representations are surprisingly good at capturing syntactic and semantic regularities in language, and that each relationship is characterized by a relation-specific vector offset. This allows vector-oriented reasoning based on the offsets between words. For example, the male/female relationship is automatically learned, and with the induced vector representations, “King − Man + Woman” results in a vector very close to “Queen.” We demonstrate that the word vectors capture syntactic regularities by means of syntactic analogy questions (provided with this paper), and are able to correctly answer almost 40% of the questions. We demonstrate that the word vectors capture semantic regularities by using the vector offset method to answer SemEval-2012 Task 2 questions. Remarkably, this method outperforms the best previous systems.
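The vector-offset method itself reduces to a few lines of NumPy, as sketched below; the random vectors are stand-ins, and the analogy only resolves correctly with vectors actually learned by a language model.

import numpy as np

def analogy(a, b, c, vocab):
    """Return the word whose vector is closest to vocab[b] - vocab[a] + vocab[c]."""
    target = vocab[b] - vocab[a] + vocab[c]
    cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max((w for w in vocab if w not in (a, b, c)),
               key=lambda w: cos(vocab[w], target))

rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=50) for w in ["king", "man", "woman", "queen"]}
print(analogy("man", "king", "woman", vocab))  # "queen", given real trained vectors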
This article offers an empirical exploration on the use of character-level convolutional networks (ConvNets) for text classification. We constructed several large-scale datasets to show that character-level convolutional networks could achieve state-of-the-art or competitive results. Comparisons are offered against traditional models such as bag of words, n-grams and their TFIDF variants, and deep learning models such as word-based ConvNets and recurrent neural networks.
Surveying a suite of algorithms that offer a solution to managing large document archives.
Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016.
Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, Eduard Hovy. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016.
Neural sequence-to-sequence models have provided a viable new approach for abstractive text summarization (meaning they are not restricted to simply selecting and rearranging passages from the original text). However, these models have two shortcomings: they are liable to reproduce factual details inaccurately, and they tend to repeat themselves. In this work we propose a novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways. First, we use a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator. Second, we use coverage to keep track of what has been summarized, which discourages repetition. We apply our model to the CNN / Daily Mail summarization task, outperforming the current abstractive state-of-the-art by at least 2 ROUGE points.
For most deep learning practitioners, sequence modeling is synonymous with recurrent networks. Yet recent results indicate that convolutional architectures can outperform recurrent networks on tasks such as audio synthesis and machine translation. Given a new sequence modeling task or dataset, which architecture should one use? We conduct a systematic evaluation of generic convolutional and recurrent architectures for sequence modeling. The models are evaluated across a broad range of standard tasks that are commonly used to benchmark recurrent networks. Our results indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory. We conclude that the common association between sequence modeling and recurrent networks should be reconsidered, and convolutional networks should be regarded as a natural starting point for sequence modeling tasks. To assist related work, we have made code available at http://github.com/locuslab/TCN .
Deep learning methods employ multiple processing layers to learn hierarchical representations of data, and have produced state-of-the-art results in many domains. Recently, a variety of model designs and methods have blossomed in the context of natural language processing (NLP). In this paper, we review significant deep learning related models and methods that have been employed for numerous NLP tasks and provide a walk-through of their evolution. We also summarize, compare and contrast the various models and put forward a detailed understanding of the past, present and future of deep learning in NLP.
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
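The "one additional output layer" recipe looks roughly like the following with the Hugging Face transformers library (not the authors' original TensorFlow code); the checkpoint name and label count are assumptions, and the training loop is elided.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # pre-trained encoder + a fresh classification head
)
inputs = tokenizer("A sentence to classify.", return_tensors="pt")
logits = model(**inputs).logits  # fine-tune end-to-end with a standard training loop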
Human ability to understand language is general, flexible, and robust. In contrast, most NLU models above the word level are designed for a specific task and struggle with out-of-domain data. If we aspire to develop models with understanding beyond the detection of superficial correspondences between inputs and outputs, then it is critical to develop a unified model that can execute a range of linguistic tasks across different domains. To facilitate research in this direction, we present the General Language Understanding Evaluation (GLUE, gluebenchmark.com): a benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models. For some benchmark tasks, training data is plentiful, but for others it is limited or does not match the genre of the test set. GLUE thus favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks. While none of the datasets in GLUE were created from scratch for the benchmark, four of them feature privately-held test data, which is used to ensure that the benchmark is used fairly. We evaluate baselines that use ELMo (Peters et al., 2018), a powerful transfer learning technique, as well as state-of-the-art sentence representation models. The best models still achieve fairly low absolute scores. Analysis with our diagnostic dataset yields similarly weak performance over all phenomena tested, with some exceptions.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). 2019.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018.
Summarization based on text extraction is inherently limited, but generation-style abstractive methods have proven challenging to build. In this work, we propose a fully data-driven approach to abstractive sentence summarization. Our method utilizes a local attention-based model that generates each word of the summary conditioned on the input sentence. While the model is structurally simple, it can easily be trained end-to-end and scales to a large amount of training data. The model shows significant performance gains on the DUC-2004 shared task compared with several strong baselines.
Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch. We propose Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduce techniques that are key for fine-tuning a language model. Our method significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100 times more data. We open-source our pretrained models and code.
This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system. We perform a validation experiment of NMT on English-Japanese machine translation, and find that it is possible to achieve comparable accuracy to direct subword training from raw sentences. We also compare the performance of subword training and segmentation with various configurations. SentencePiece is available under the Apache 2 license at https://github.com/google/sentencepiece.
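Basic usage of the library follows the pattern below; corpus.txt is a hypothetical file of raw, untokenized sentences, and the vocabulary size is an arbitrary choice.

import sentencepiece as spm

# Train a subword model directly on raw text (no pre-tokenization required).
spm.SentencePieceTrainer.train(input="corpus.txt", model_prefix="m", vocab_size=8000)
sp = spm.SentencePieceProcessor(model_file="m.model")
pieces = sp.encode("Hello world.", out_type=str)  # subword pieces
restored = sp.decode(pieces)                      # lossless round-trip back to text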
Kai Sheng Tai, Richard Socher, Christopher D. Manning. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2015.
We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research. The dataset is freely available at https://stanford-qa.com
Recent progress in hardware and methodology for training neural networks has ushered in a new generation of large networks trained on abundant data. These models have obtained notable gains in accuracy across many NLP tasks. However, these accuracy improvements depend on the availability of exceptionally large computational resources that necessitate similarly substantial energy consumption. As a result these models are costly to train and develop, both financially, due to the cost of hardware and electricity or cloud compute time, and environmentally, due to the carbon footprint required to fuel modern tensor processing hardware. In this paper we bring this issue to the attention of NLP researchers by quantifying the approximate financial and environmental costs of training a variety of recently successful neural network models for NLP. Based on these findings, we propose actionable recommendations to reduce costs and improve equity in NLP research and practice.
Adina Williams, Nikita Nangia, Samuel Bowman. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018.
Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch.
Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.
With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
Nils Reimers, Iryna Gurevych. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
Iz Beltagy, Kyle Lo, Arman Cohan. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models on the edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performance on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
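Of the triple loss, the distillation term is the easiest to make concrete: a temperature-softened KL divergence between teacher and student logits, sketched below in PyTorch. The temperature and tensor shapes are illustrative assumptions, and the masked-LM and cosine-embedding terms are omitted.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL between temperature-softened distributions; the T*T factor keeps
    # gradient magnitudes comparable across temperatures.
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

loss = distillation_loss(torch.randn(4, 30522), torch.randn(4, 30522))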
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, Alexander Rush. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2020.
Recent progress in natural language processing has been driven by advances in both model architecture and model pretraining. Transformer architectures have facilitated building higher-capacity models and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks. Transformers is an open-source library with the goal of opening up these advances to the wider machine learning community. The library consists of carefully engineered state-of-the-art Transformer architectures under a unified API. Backing this library is a curated collection of pretrained models made by and available for the community. Transformers is designed to be extensible by researchers, simple for practitioners, and fast and robust in industrial deployments. The library is available at https://github.com/huggingface/transformers.
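A one-liner from the library's high-level API gives a feel for the unified interface; the default sentiment model is downloaded on first use.

from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # bundles a pretrained model + tokenizer
print(classifier("Transformers makes state-of-the-art NLP accessible."))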
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at https://github.com/google-research/ALBERT.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.
We introduce a stochastic graph-based method for computing relative importance of textual units for Natural Language Processing. We test the technique on the problem of Text Summarization (TS). Extractive TS relies on the concept of sentence salience to identify the most important sentences in a document or set of documents. Salience is typically defined in terms of the presence of particular important words or in terms of similarity to a centroid pseudo-sentence. We consider a new approach, LexRank, for computing sentence importance based on the concept of eigenvector centrality in a graph representation of sentences. In this model, a connectivity matrix based on intra-sentence cosine similarity is used as the adjacency matrix of the graph representation of sentences. Our system, based on LexRank, ranked in first place in more than one task in the recent DUC 2004 evaluation. In this paper we present a detailed analysis of our approach and apply it to a larger data set including data from earlier DUC evaluations. We discuss several methods to compute centrality using the similarity graph. The results show that degree-based methods (including LexRank) outperform both centroid-based methods and other systems participating in DUC in most of the cases. Furthermore, the LexRank with threshold method outperforms the other degree-based techniques including continuous LexRank. We also show that our approach is quite insensitive to the noise in the data that may result from an imperfect topical clustering of documents.
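The core computation is compact enough to sketch in NumPy: eigenvector centrality of the sentence-similarity graph via damped power iteration. The 3x3 similarity matrix is a made-up example; in practice it would hold intra-sentence cosine similarities as described above.

import numpy as np

def lexrank(S, damping=0.85, iters=100):
    """S: (n, n) nonnegative sentence-similarity matrix."""
    n = len(S)
    P = S / S.sum(axis=1, keepdims=True)       # row-stochastic transition matrix
    r = np.full(n, 1.0 / n)
    for _ in range(iters):                     # damped power iteration
        r = (1 - damping) / n + damping * (r @ P)
    return r                                   # salience score per sentence

scores = lexrank(np.array([[1.0, 0.3, 0.1], [0.3, 1.0, 0.6], [0.1, 0.6, 1.0]]))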
We present DeepWalk, a novel approach for learning latent representations of vertices in a network. These latent representations encode social relations in a continuous vector space, which is easily exploited by statistical models. DeepWalk generalizes recent advancements in language modeling and unsupervised feature learning (or deep learning) from sequences of words to graphs.
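The algorithm is essentially "random walks as sentences", as the sketch below shows using gensim's Word2Vec as the language-modeling component; the four-node edge list and walk parameters are toy assumptions.

import random
from gensim.models import Word2Vec

edges = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}  # toy adjacency list

def random_walk(start, length=10):
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(edges[walk[-1]]))
    return [str(v) for v in walk]              # Word2Vec expects string tokens

walks = [random_walk(v) for v in edges for _ in range(20)]
model = Word2Vec(walks, vector_size=32, window=4, min_count=1, sg=1)
embedding = model.wv["0"]                      # latent representation of vertex 0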
We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
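A few-shot chain-of-thought prompt has the shape below; the worked exemplar illustrates the style of demonstration the paper describes, and the string would be sent as-is to any large language model's completion API.

prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis
balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought
6 more, how many apples do they have?
A:"""  # the model is expected to continue with reasoning steps, then "The answer is 9."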
Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
This paper explores Text-to-Knowledge Graph (T2KG) construction, assessing Zero-Shot Prompting, Few-Shot Prompting, and Fine-Tuning methods with Large Language Models. Through comprehensive experimentation with Llama2, Mistral, and Starling, we highlight the strengths of fine-tuning, emphasize the role of dataset size, and introduce nuanced evaluation metrics. Promising perspectives include synonym-aware metric refinement and data augmentation with Large Language Models. The study contributes valuable insights to KG construction methodologies, setting the stage for further advancements.
Abhinash Mohanta, Sneha, Kyvalya Garikapati, et al. Advances in Computational Intelligence and Robotics book series.
The integration of Deep Belief Networks (DBNs) into forensic linguistics enhances AI-driven policing in smart cities by automating witness testimony analysis. DBNs address limitations of traditional methods, such as subjectivity and inefficiency, by detecting linguistic markers of deception and inconsistency. AI tools have accelerated investigations by analyzing digital evidence, though specific architectures like DBNs were not disclosed in these cases. Ethical considerations, including algorithmic bias and data privacy, are addressed through strategies like adversarial debiasing and synthetic data augmentation. Collaboration between law enforcement, AI experts, and policymakers ensures ethical deployment and inclusive dataset curation. Future directions include expanding DBN applications to real-time threat detection and multilingual forensic analysis. DBNs improve urban security by enhancing testimony credibility assessments, fostering community trust, and aligning AI-driven policing with ethical standards.
Antonio Lobo-Santos, Joaquín Borrego-Díaz. Modelling (International Open Access Journal of Modelling in Engineering Science).
The rapid growth in scientific knowledge has created a critical need for advanced systems capable of managing mathematical knowledge at scale. This study presents a novel approach that integrates ontology-based knowledge representation with large language models (LLMs) to automate the extraction, organization, and reasoning of mathematical knowledge from LaTeX documents. The proposed system enhances Mathematical Knowledge Management (MKM) by enabling structured storage, semantic querying, and logical validation of mathematical statements. The key innovations include a lightweight ontology for modeling hypotheses, conclusions, and proofs, and algorithms for optimizing assumptions and generating pseudo-demonstrations. A user-friendly web interface supports visualization and interaction with the knowledge graph, facilitating tasks such as curriculum validation and intelligent tutoring. The results demonstrate high accuracy in mathematical statement extraction and ontology population, with potential scalability for handling large datasets. This work bridges the gap between symbolic knowledge and data-driven reasoning, offering a robust solution for scalable, interpretable, and precise MKM.
G. Wang, Sibo Cheng. Engineering Applications of Artificial Intelligence.
With the rapid increase in textual data sources in recent years, many studies have been carried out in the field of automatic text summarization. In this study, a new method is proposed for graph-based extractive text summarization. In addition, within the scope of the study, the Karcı Dominant Cluster Algorithm is used for the first time in text summarization systems. In the proposed method, firstly, a graph is created from the neighborhood matrix based on the common word numbers of the sentences belonging to the text to be summarized. In the second step, the sentences represented by the nodes in the dominant cluster of the graph are determined using the Karcı Dominant Set Algorithm. In the third step, a new graph is created from the remaining sentences by removing the sentences belonging to the dominant clusters determined from the main text. According to the eigenvector centrality values of the graph created in the last step, the central sentences were identified, sentences were selected starting with the most valuable one, and summaries were obtained. The study was carried out on the Document Understanding Conference (DUC-2002 and DUC-2004) datasets. Its performance was calculated with ROUGE evaluation metrics and the results were compared with other competitive methods. The developed model reached a ROUGE performance value of 0.35748 for a 100-word summary, 0.49049 for a 200-word summary, and 0.57586 for a 400-word summary. The values reported during the experimental processes of the study clearly reveal the contribution of this innovative method to the literature.
Extracting clinical entities from unstructured medical documents is critical for improving clinical decision support and documentation workflows. This study examines the performance of various encoder and decoder models trained for Named Entity Recognition (NER) of clinical parameters in pathology and radiology reports, highlighting the applicability of Large Language Models (LLMs) for this task. Three NER methods were evaluated: (1) flat NER using transformer-based models, (2) nested NER with a multi-task learning setup, and (3) instruction-based NER utilizing LLMs. A dataset of 2013 pathology reports and 413 radiology reports, annotated by medical students, was used for training and testing. The performance of encoder-based NER models (flat and nested) was superior to that of LLM-based approaches. The best-performing flat NER models achieved F1-scores of 0.87-0.88 on pathology reports and up to 0.78 on radiology reports, while nested NER models performed slightly lower. In contrast, multiple LLMs, despite achieving high precision, yielded significantly lower F1-scores (ranging from 0.18 to 0.30) due to poor recall. A contributing factor appears to be that these LLMs produce fewer but more accurate entities, suggesting they become overly conservative when generating outputs. LLMs in their current form are unsuitable for comprehensive entity extraction tasks in clinical domains, particularly when faced with a high number of entity types per document, though instructing them to return more entities in subsequent refinements may improve recall. Additionally, their computational overhead does not provide proportional performance gains. Encoder-based NER models, particularly those pre-trained on biomedical data, remain the preferred choice for extracting information from unstructured medical documents.
Tibetan medical named entity recognition (Tibetan MNER) involves extracting specific types of medical entities from unstructured Tibetan medical texts. Tibetan MNER provides important data support for work related to Tibetan medicine. However, existing Tibetan MNER methods often struggle to comprehensively capture multi-level semantic information, failing to sufficiently extract multi-granularity features and effectively filter out irrelevant information, which ultimately impacts the accuracy of entity recognition. This paper proposes an improved embedding representation method called syllable-word-sentence embedding. By leveraging features at different granularities and using un-scaled dot-product attention to focus on key features for feature fusion, the syllable-word-sentence embedding is integrated into the transformer, enhancing the specificity and diversity of feature representations. The model leverages multi-level and multi-granularity semantic information, thereby improving the performance of Tibetan MNER. We evaluate our proposed model on datasets from various domains. The results indicate that the model effectively identified three types of entities in the Tibetan news dataset we constructed, achieving an F1 score of 93.59%, which represents an improvement of 1.24% compared to the vanilla FLAT. Additionally, results from the Tibetan medical dataset we developed show that it is effective in identifying five kinds of medical entities, with an F1 score of 71.39%, which is a 1.34% improvement over the vanilla FLAT.
The current study investigated the impact of AI-mediated interaction on L2 learners' oral speaking ability, summarizing ability, and perceptions. Participants were EFL university students from South Korea who were randomly assigned to one of three groups: aiSPEAK, aiSUM, and control. The two experimental groups, aiSPEAK and aiSUM, engaged in interactional exchanges with ChatGPT-3.5 to complete oral summary tasks, while the control group completed the same tasks without using an AI tool. To initiate interaction with ChatGPT-3.5, the aiSPEAK group used a prompt that instructed ChatGPT-3.5 to play the role of a tutor, while the aiSUM group used a more specific prompt instructing ChatGPT-3.5 to also provide guidance on summarizing. Results showed that the two experimental groups performed significantly better than the control group on global speaking task performance, while the aiSUM group performed significantly better than the control group on global speaking task performance as well as summarizing performance. Perceptions regarding AI use were generally positive. Our findings point to the feasibility of using generative AI tools in the language classroom to facilitate interaction and improve language outcomes.
Medical imaging is crucial for diagnosing, monitoring, and treating medical conditions. The medical reports of radiology images are the primary medium through which medical professionals can attest to their findings, but their writing is time-consuming and requires specialized clinical expertise. Therefore, the automated generation of radiography reports has the potential to improve and standardize patient care and significantly reduce the workload of clinicians. Through our work, we have designed and evaluated an end-to-end transformer-based method to generate accurate and factually complete radiology reports for X-ray images. Additionally, we are the first to introduce curriculum learning for end-to-end transformers in medical imaging and demonstrate its impact in obtaining improved performance. The experiments were conducted using the MIMIC-CXR-JPG database, the largest available chest X-ray dataset. The results obtained are comparable with the current state of the art on the natural language generation (NLG) metrics BLEU and ROUGE-L, while setting new state-of-the-art results on the examples-averaged F1, F1-macro, and F1-micro metrics for clinical accuracy and on the METEOR metric widely used for NLG.
Deploying large-scale language models such as GPT on edge devices faces practical challenges such as high memory usage and high inference latency. This study proposes a two-stage compression scheme integrating knowledge extraction and parameter pruning to achieve model weight reduction while maintaining accuracy in the question-answering task. The six-layer reduced model was trained by extracting deep features from the medium-sized model GPT-2, and then structured pruning was performed on the attention head and the fully connected layer. Finally, the model achieved an F1 value of 86.8% on both SQuAD and TriviaQA tests. The data actually deployed on devices such as Jetson Nano shows that the model volume was compressed from 6.5GB to 1.8GB, and the time consumption for a single inference was reduced from 1200ms to 300ms, with reductions of 72% and 75% respectively. The ablation experiment confirmed that knowledge extraction retains key semantic features and that structured pruning optimizes the computational path. The synergy of the two helps control the question-answer accuracy loss of edge devices to less than 3%. This hybrid compression technology provides a solution that balances efficiency and performance for scenarios such as smart assistants and offline applications, opening a new avenue for edge deployment of large models.
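As a rough illustration of the structured-pruning half of such a pipeline, PyTorch's pruning utilities can remove whole rows of a weight matrix by L2 norm; the layer below is a stand-in, since the paper's exact layers, pruning ratios, and schedule are not given.

import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(768, 3072)   # stand-in for a transformer feed-forward block
prune.ln_structured(layer, name="weight", amount=0.5, n=2, dim=0)  # drop 50% of rows by L2 norm
prune.remove(layer, "weight")  # bake the pruning mask into the weights permanently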
The rise of large language models (LLMs) that generate human-like text has sparked debate over their potential to replace human participants in behavioral and cognitive research. We critically evaluate this replacement perspective to appraise the fundamental utility of language models in psychology and social science. Through a five-dimension framework—characterization, representation, interpretation, implication, and utility—we identify six fallacies that undermine the replacement perspective: (1) equating token prediction with human intelligence, (2) assuming LLMs represent the average human, (3) interpreting alignment as explanation, (4) anthropomorphizing AI, (5) essentializing identities, and (6) presenting LLMs as primary tools that directly reveal the human mind. Rather than replacement, the evidence and arguments support a simulation perspective, in which LLMs offer a new paradigm for simulating roles and modeling cognitive processes. We highlight limitations and considerations concerning internal, external, construct, and statistical validity, and provide methodological guidelines for effectively integrating LLMs into psychological research, with a focus on model selection, prompt design, interpretation, and ethical considerations. This perspective reframes the role of language models in behavioral and cognitive science as linguistic simulators and cognitive models that shed light on the similarities and differences between machine intelligence and human cognition.
Electronic Health Records store extensive patient health data and play a crucial role in healthcare management. Extracting information from these text-heavy records is difficult because their domain-specific vocabulary limits the applicability of general-domain techniques. Recent advancements in Large Language Models (LLMs) and growing interest in the field have sparked considerable progress on Clinical Information Extraction (IE) tasks. We review these applications in Clinical IE, highlighting the most common tasks, the most successful methods, and the most widely used datasets and evaluation criteria. Examining 85 studies, we synthesize and organize current research trends, highlighting common points between papers. LLMs feature prominently in the most common tasks, with novel approaches being attempted and showing promising results. However, breakthroughs are still needed in designing reliable end-to-end systems that can perform all Clinical IE tasks within a single system.
Entity alignment is a critical technique for integrating diverse knowledge graphs. Although existing methods have achieved impressive success in traditional entity alignment, they may struggle to handle the complexities arising from interactions and dependencies in multi-modal knowledge. In this paper, a novel multi-modal entity alignment model called ERMF is proposed, which leverages distinct modal characteristics of entities to identify equivalent entities across different multi-modal knowledge graphs. Symmetry in cross-modal interactions and hierarchical feature fusion is a core design principle of our approach. Specifically, we first use separate feature encoders to independently extract features from each modality. We then combine visual features with nearest-neighbor negative sampling to design a vision-guided negative sample generation strategy based on contrastive learning, ensuring a symmetric balance between positive and negative samples and guiding the model to learn effective relationship embeddings. In the feature fusion stage, we propose a multi-layer feature fusion approach that incorporates cross-attention and cross-modal attention mechanisms, with symmetric processing of intra- and inter-modal correlations, to obtain multi-granularity features. Extensive experiments were conducted on two public datasets, FB15K-DB15K and FB15K-YAGO15K. With 20% aligned seeds, ERMF improves Hits@1 by 8.4% and 26%, and MRR by 6% and 19.2%, compared with the best baseline. The symmetric architecture of our model ensures robust and balanced utilization of multi-modal information, aligning with the principles of structural and functional symmetry in knowledge integration.
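The following sketch illustrates one plausible reading of the vision-guided negative sampling strategy: nearest neighbours in visual-feature space are mined as hard negatives for an InfoNCE-style contrastive loss. The temperature, neighbour count, and loss form are assumptions, not ERMF's exact formulation.

```python
# Hedged sketch: visually similar entities make hard negatives for a
# contrastive alignment loss. Hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def vision_guided_infonce(emb_a, emb_b, vis_a, k=5, temperature=0.1):
    """Contrastive alignment loss with visually mined hard negatives.

    emb_a, emb_b -- (n, d) entity embeddings from the two KGs; row i of
                    each is an aligned (positive) pair
    vis_a        -- (n, d_v) visual features, used only to pick negatives
    """
    emb_a = F.normalize(emb_a, dim=1)
    emb_b = F.normalize(emb_b, dim=1)

    # Mine hard negatives: the k nearest neighbours in visual space.
    with torch.no_grad():
        v = F.normalize(vis_a, dim=1)
        vis_sim = v @ v.T
        vis_sim.fill_diagonal_(-float("inf"))     # exclude self-matches
        neg_idx = vis_sim.topk(k, dim=1).indices  # (n, k)

    pos = (emb_a * emb_b).sum(dim=1, keepdim=True)       # (n, 1) positive scores
    neg = (emb_a.unsqueeze(1) * emb_b[neg_idx]).sum(-1)  # (n, k) negative scores
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(len(emb_a), dtype=torch.long, device=emb_a.device)
    return F.cross_entropy(logits, labels)  # positives sit at index 0
```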
Text summarization condenses a text into its essential points for a quick grasp of the main ideas, and multi-document summarization integrates information from several sources to provide a comprehensive overview. Techniques include extractive methods, which select key sentences, and abstractive methods, which generate new sentences; hybrid methods combine the two to improve summary quality. Current approaches struggle to maintain coherence, context, and nuance, and further improvements are needed in coherence, accuracy, and comprehensiveness. To address these issues, this research proposes the Improved Bidirectional Gated Recurrent Unit (IBi-GRU) model for multi-document text summarization, tuned with a Coati Optimization Algorithm updated via COOT optimization (CuCOA). The process involves preprocessing, feature extraction, and summarization: tokenization is performed during preprocessing; pertinent features are then extracted from the preprocessed text; and summarization is carried out by the IBi-GRU model with its weight parameters optimally tuned by the CuCOA approach. Comprehensive simulations and experimental assessments in terms of accuracy, Matthews Correlation Coefficient (MCC), False Negative Rate (FNR), and other measures validate the IBi-GRU model, demonstrating its robustness and potential for a range of text summarization applications compared with conventional approaches. The CuCOA + IBi-GRU scheme achieved the highest scores, with a ROUGE of 0.866, precision of 0.887, recall of 0.853, and F-measure of 0.913.
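For orientation, a minimal bidirectional GRU sentence scorer for extractive summarization is sketched below. The "Improved" Bi-GRU modifications and the CuCOA weight tuning are the paper's contributions and are not reproduced here; this shows only the baseline Bi-GRU scoring idea.

```python
# Baseline sketch: a bidirectional GRU scores each sentence's salience
# from a sequence of precomputed sentence feature vectors.
import torch.nn as nn

class BiGRUSentenceScorer(nn.Module):
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        # The Bi-GRU reads sentence features in both directions.
        self.gru = nn.GRU(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.score = nn.Linear(2 * hidden, 1)  # one salience score per sentence

    def forward(self, sent_feats):              # (batch, n_sents, feat_dim)
        states, _ = self.gru(sent_feats)        # (batch, n_sents, 2*hidden)
        return self.score(states).squeeze(-1)   # (batch, n_sents)

# The highest-scoring sentences would form the summary; in the paper, the
# model's weights are further tuned by the CuCOA optimizer rather than by
# gradient descent alone.
```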
Large Language Models (LLMs) have demonstrated remarkable performance across diverse applications, yet they inadvertently absorb spurious correlations from training data, leading to stereotype associations between biased concepts and specific social groups. These associations perpetuate and even amplify harmful social biases, raising significant fairness concerns, a crucial issue in software engineering. To mitigate such biases, prior studies have attempted to project model embeddings into unbiased spaces during inference. However, these approaches have shown limited effectiveness due to their weak alignment with downstream social biases. Inspired by the observation that concept cognition in LLMs is primarily represented through a linear associative memory mechanism, in which key-value mapping occurs in the MLP layers, we posited that biased concepts and social groups are similarly encoded as entity (key) and information (value) pairs, which can be manipulated to promote fairer associations. To this end, we propose Fairness Mediator (FairMed), an effective and efficient bias mitigation framework that neutralizes stereotype associations. The framework comprises two main components: a stereotype association prober and an adversarial debiasing neutralizer. The prober captures stereotype associations encoded in MLP layer activations by using prompts centered on biased concepts (keys) to detect the emission probabilities for social groups (values). The adversarial debiasing neutralizer then intervenes in MLP activations during inference to equalize the association probabilities across social groups. Extensive experiments across nine protected attributes show that FairMed significantly outperforms state-of-the-art methods in effectiveness, achieving average bias reductions of up to 84.42%.
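A simplified sketch of the probing step is shown below: given a prompt built around a biased concept (key), read off the model's next-token probabilities for social-group terms (values). The model, prompt, and group terms are illustrative, this reads output probabilities rather than MLP activations directly, and the adversarial neutralizer is omitted.

```python
# Simplified association probe: measure how much next-token probability
# the model assigns to each group term after a concept-centered prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # illustrative small model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def group_probabilities(prompt: str, groups: list[str]) -> dict[str, float]:
    """Next-token probability mass assigned to each social-group term."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]             # logits for the next token
    probs = logits.softmax(dim=-1)
    # Leading space so each group word maps onto a single GPT-2 token.
    return {g: probs[tok.encode(" " + g)[0]].item() for g in groups}

# e.g. group_probabilities("The nurse said that", ["she", "he"]) exposes an
# association; a debiasing intervention would adjust activations at
# inference until such probabilities are equalized across groups.
```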
With the increasing complexity of power grid field operations, frequent operational violations have become a major concern for power grid field operation safety. To help dispatchers accurately identify and address violation risks, this paper introduces a profiling approach for power grid field operation violations based on knowledge graph techniques, enabling deep modeling and structured representation of violation behaviors. In the structured data processing phase, statistical analysis is conducted according to predefined rules, and mutual information is employed to quantify the contribution of various operational factors to violations. At the municipal bureau level, statistical modeling of violation characteristics supports regional risk assessment. For unstructured textual data, a multi-source embedding-based named entity recognition (NER) model is developed, incorporating a domain-specific power lexicon to improve the extraction of key entities. High-weight domain terms related to violations are further identified using the TF-IDF algorithm to characterize typical violation behaviors. Based on the extracted entities and relationships, a knowledge graph of field operation violations is constructed, providing a computable and inferable semantic representation of operational scenarios. Finally, visualization techniques are applied to present the structural patterns and distributional features of violations, offering graph-based support for violation risk analysis and dispatch decision-making. Experimental results demonstrate that the proposed method effectively identifies critical features of violation behaviors and provides a structured foundation for intelligent decision support in power grid operation management.
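As a small illustration of the TF-IDF step, the sketch below ranks terms in a toy corpus of violation records; the records and the resulting vocabulary are placeholders, not data from the paper.

```python
# Rank terms by average TF-IDF weight to surface high-weight domain terms
# that characterize typical violation behaviours.
from sklearn.feature_extraction.text import TfidfVectorizer

violation_records = [  # placeholder records for illustration
    "worker entered the live zone without a grounding wire",
    "operation ticket missing supervisor signature at the substation",
    "safety helmet not worn during pole-top maintenance",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(violation_records)

# Average each term's TF-IDF weight across all records, highest first.
weights = tfidf.toarray().mean(axis=0)
terms = vectorizer.get_feature_names_out()
top_terms = sorted(zip(terms, weights), key=lambda t: -t[1])[:10]
print(top_terms)
```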