Computer Science Information Systems

Web Data Mining and Analysis

Description

This cluster of papers focuses on techniques and technologies for extracting structured data from web pages, including web crawling, automatic wrapper generation, page segmentation, and mining data records. It also covers topics related to the hidden web, information retrieval, and content adaptation for different devices.

Keywords

Web Data Extraction; Web Crawling; Information Retrieval; Structured Data; Automatic Wrapper Generation; Hidden Web; Page Segmentation; Data Records Mining; Deep Web; Content Adaptation

Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples. This manual labor scales linearly with the number of target relations. This paper introduces Open IE (OIE), a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. The paper also introduces TEXTRUNNER, a fully implemented, highly scalable OIE system where the tuples are assigned a probability and indexed to support efficient extraction and exploration via user queries. We report on experiments over a 9,000,000 Web page corpus that compare TEXTRUNNER with KNOWITALL, a state-of-the-art Web IE system. TEXTRUNNER achieves an error reduction of 33% on a comparable set of extractions. Furthermore, in the amount of time it takes KNOWITALL to perform extraction for a handful of pre-specified relations, TEXTRUNNER extracts a far broader set of facts reflecting orders of magnitude more relations, discovered on the fly. We report statistics on TEXTRUNNER's 11,000,000 highest probability tuples, and show that they contain over 1,000,000 concrete facts and over 6,500,000 more abstract assertions.
The rapid growth of the Web in the last decade makes it the largest publicly accessible data source in the world. Web mining aims to discover useful information or knowledge from Web hyperlinks, page contents, and usage logs. Based on the primary kinds of data used in the mining process, Web mining tasks can be categorized into three main types: Web structure mining, Web content mining and Web usage mining. Web structure mining discovers knowledge from hyperlinks, which represent the structure of the Web. Web content mining extracts useful information/knowledge from Web page contents. Web usage mining mines user access patterns from usage logs, which record clicks made by every user. The goal of this book is to present these tasks, and their core mining algorithms. The book is intended to be a text with comprehensive coverage, and yet, for each topic, sufficient details are given so that readers can gain a reasonably complete knowledge of its algorithms or techniques without referring to any external materials. Four of the chapters, structured data extraction, information integration, opinion mining, and Web usage mining, make this book unique. These topics are not covered by existing books, but they are essential to Web data mining. Traditional Web mining topics such as search, crawling and resource discovery, and link analysis are also covered in detail in this book. Although the book is entitled Web Data Mining, it also includes the main topics of data mining and information retrieval, since Web mining uses their algorithms and techniques extensively. The data mining part mainly consists of chapters on association rules and sequential patterns, supervised learning (or classification), and unsupervised learning (or clustering), which are the three most important data mining tasks. The advanced topic of partially (semi-) supervised learning is included as well. For information retrieval, its core topics that are crucial to Web mining are described. This book is thus naturally divided into two parts. The first part, which consists of Chaps. 2–5, covers data mining foundations. The second part, which contains Chaps. 6–12, covers Web-specific mining. Two main principles have guided the writing of this book. First, the basic content of the book should be accessible to undergraduate students, and yet there are sufficient in-depth materials for graduate students who plan to … Many other researchers also assisted in various ways. Yang Dai and Rudy Setiono helped with Support Vector Machines (SVM). Chris Ding helped with link analysis.
From the Publisher: Tim Berners-Lee, the inventor of the World Wide Web, has been hailed by Time magazine as one of the 100 greatest minds of this century. His creation has already changed the way people do business, entertain themselves, exchange ideas, and socialize with one another. The book offers insights to help readers understand the true nature of the Web, enabling them to use it to their fullest advantage. He shares his views on such critical issues as censorship, privacy, the increasing power of software companies in the online world, and the need to find the ideal balance between the commercial and social forces on the Web. His criticism of the Web's current state makes clear that there is still much work to be done. Finally, Berners-Lee presents his own plan for the Web's future, one that calls for the active support and participation of programmers, computer manufacturers, and social organizations to make it happen.
The importance of a Web page is an inherently subjective matter, which depends on the reader's interests, knowledge and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. We compare PageRank to an idealized random Web surfer. We show how to efficiently compute PageRank for large numbers of pages. And, we show how to apply PageRank to search and to user navigation.
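The random-surfer intuition above translates into a short power-iteration computation. The following is a minimal sketch, assuming a tiny hand-made link graph and a standard damping factor of 0.85; it is illustrative only, not the paper's original implementation.

```python
# Minimal PageRank power-iteration sketch over a made-up four-page link graph.
import numpy as np

links = {            # page -> pages it links to (hypothetical graph)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = sorted(links)
n = len(pages)
idx = {p: i for i, p in enumerate(pages)}

# Column-stochastic transition matrix of the random surfer.
M = np.zeros((n, n))
for src, outs in links.items():
    for dst in outs:
        M[idx[dst], idx[src]] = 1.0 / len(outs)

d = 0.85                          # damping factor
rank = np.full(n, 1.0 / n)        # start from the uniform distribution
for _ in range(100):              # power iteration until (approximate) convergence
    rank = (1 - d) / n + d * M @ rank

print(dict(zip(pages, rank.round(3))))
```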
Information extraction systems usually require two dictionaries: a semantic lexicon and a dictionary of extraction patterns for the domain. We present a multilevel bootstrapping algorithm that generates both the semantic lexicon and extraction patterns simultaneously. As input, our technique requires only unannotated training texts and a handful of seed words for a category. We use a mutual bootstrapping technique to alternately select the best extraction pattern for the category and bootstrap its extractions into the semantic lexicon, which is the basis for selecting the next extraction pattern. To make this approach more robust, we add a second level of bootstrapping (metabootstrapping) that retains only the most reliable lexicon entries produced by mutual bootstrapping and then restarts the process. We evaluated this multilevel bootstrapping technique on a collection of corporate web pages and a corpus of terrorism news articles. The algorithm produced high-quality dictionaries for several semantic categories.
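The alternation described above (score patterns against the current lexicon, pick the best pattern, add its extractions to the lexicon) can be pictured with a toy loop. In the sketch below, the sentences, the seed word, and the crude "word before 'in'" pattern template are all invented for illustration; the real algorithm operates on richer extraction patterns and adds the meta-bootstrapping filtering layer.

```python
# Toy mutual-bootstrapping sketch: a pattern's score is the number of distinct
# lexicon entries it extracts; the best pattern's extractions join the lexicon.

sentences = [
    "offices in london and paris",
    "offices in tokyo",
    "headquartered in berlin",
    "born in 1970",
]
lexicon = {"london"}          # seed word for a hypothetical LOCATION category
patterns = set()

for _ in range(3):            # a few bootstrapping rounds
    candidates = {}
    for s in sentences:
        toks = s.split()
        if "in" in toks:
            i = toks.index("in")
            pat = toks[i - 1] + " in <X>"
            extracted = set(toks[i + 1:]) - {"and"}
            candidates.setdefault(pat, set()).update(extracted)
    # Score each candidate pattern by how many known lexicon entries it extracts.
    scored = {p: len(ext & lexicon) for p, ext in candidates.items()}
    best = max(scored, key=scored.get)
    patterns.add(best)
    lexicon |= candidates[best]   # bootstrap the best pattern's extractions into the lexicon

print(patterns, lexicon)
```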
Web mining research: a survey. Raymond Kosala and Hendrik Blockeel, Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, B-3001 Heverlee, Belgium. ACM SIGKDD Explorations Newsletter, Volume 2, Issue 1, June 2000, pp. 1–15. https://doi.org/10.1145/360402.360406
In this paper we study in what order a crawler should visit the URLs it has seen, in order to obtain more "important" pages first. Obtaining important pages rapidly can be very useful when a crawler cannot visit the entire Web in a reasonable amount of time. We define several importance metrics, ordering schemes, and performance evaluation measures for this problem. We also experimentally evaluate the ordering schemes on the Stanford University Web. Our results show that a crawler with a good ordering scheme can obtain important pages significantly faster than one without.
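One way to picture an importance-ordered crawl is a priority queue keyed by an importance estimate such as the number of in-links seen so far (one of several metrics the paper studies). The sketch below is a simplified illustration: fetch is a hypothetical function returning a page's out-links, and priorities of already-queued URLs are not refreshed when counts change, a deliberate simplification.

```python
# Sketch of a crawler that dequeues URLs by estimated importance (known in-link count).
import heapq
from collections import defaultdict

def crawl(seed_urls, fetch, max_pages=100):
    inlinks = defaultdict(int)                  # crude importance estimate
    frontier = [(0, url) for url in seed_urls]  # min-heap of (-inlink count, url)
    heapq.heapify(frontier)
    seen, visited = set(seed_urls), []

    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)        # most "important" known URL first
        visited.append(url)
        for out in fetch(url):                  # fetch() returns the page's out-links
            inlinks[out] += 1
            if out not in seen:
                seen.add(out)
                heapq.heappush(frontier, (-inlinks[out], out))
    return visited
```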
Keyword-based search engines are in widespread use today as a popular means for Web-based information retrieval. Although such systems seem deceptively simple, a considerable amount of skill is required in order to satisfy non-trivial information needs. This paper presents a new conceptual paradigm for performing search in context, that largely automates the search process, providing even non-professional users with highly relevant results. This paradigm is implemented in practice in the IntelliZap system, where search is initiated from a text query marked by the user in a document she views, and is guided by the text surrounding the marked query in that document (“the context”). The context-driven information retrieval process involves semantic keyword extraction and clustering to automatically generate new, augmented queries. The latter are submitted to a host of general and domain-specific search engines. Search results are then semantically reranked, using context. Experimental results testify that using context to guide search effectively offers even inexperienced users an advanced search tool on the Web.
This paper develops a general, formal framework for modeling term dependencies via Markov random fields. The model allows for arbitrary text features to be incorporated as evidence. In particular, we make use of features based on occurrences of single terms, ordered phrases, and unordered phrases. We explore full independence, sequential dependence, and full dependence variants of the model. A novel approach is developed to train the model that directly maximizes the mean average precision rather than maximizing the likelihood of the training data. Ad hoc retrieval experiments are presented on several newswire and web collections, including the GOV2 collection used at the TREC 2004 Terabyte Track. The results show significant improvements are possible by modeling dependencies, especially on the larger web collections.
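For reference, the sequential dependence variant mentioned above is commonly written as a weighted combination of single-term, ordered-window, and unordered-window potentials. The notation below is a sketch of that ranking function, with the lambda weights being the parameters tuned to maximize mean average precision rather than training-data likelihood.

```latex
% Sequential dependence variant (rank-equivalent scoring function, sketch).
P(D \mid Q) \;\stackrel{\mathrm{rank}}{=}\;
  \lambda_T \sum_{q \in Q} f_T(q, D)
\;+\; \lambda_O \sum_{i=1}^{|Q|-1} f_O(q_i, q_{i+1}, D)
\;+\; \lambda_U \sum_{i=1}^{|Q|-1} f_U(q_i, q_{i+1}, D)
```

Here f_T, f_O, and f_U are feature functions over single query terms, ordered phrases, and unordered windows of adjacent query terms in document D.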
Amenable to extensive parallelization, Google's web search application lets different queries run on different processors and, by partitioning the overall index, also lets a single query use multiple processors. To handle this workload, Google's architecture features clusters of more than 15,000 commodity-class PCs with fault-tolerant software. This architecture achieves superior performance at a fraction of the cost of a system built from fewer, but more expensive, high-end servers.
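The index-partitioning idea (one query fanned out to many shards, per-shard hits merged) can be pictured with a toy example. The shard contents, scores, and thread-pool fan-out below are invented placeholders, not Google's implementation.

```python
# Toy illustration of index partitioning: each shard holds part of the inverted index,
# a query is evaluated on every shard, and the per-shard hits are merged by score.
from concurrent.futures import ThreadPoolExecutor

shards = [
    {"web": [("doc1", 0.9), ("doc4", 0.2)]},
    {"web": [("doc7", 0.7)], "mining": [("doc7", 0.5)]},
    {"mining": [("doc9", 0.8)]},
]

def search_shard(shard, term):
    return shard.get(term, [])

def search(term, top_k=3):
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda s: search_shard(s, term), shards))
    merged = [hit for part in results for hit in part]
    return sorted(merged, key=lambda h: h[1], reverse=True)[:top_k]

print(search("web"))
```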
A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of search and profile-based routing and filtering. Therefore, an accurate classifier is an essential component of a hypertext database. Hyperlinks pose new problems not addressed in the extensive text classification literature. Links clearly contain high-quality semantic clues that are lost upon a purely term-based classifier, but exploiting link information is non-trivial because it is noisy. Naive use of terms in the link neighborhood of a document can even degrade accuracy. Our contribution is to propose robust statistical models and a relaxation labeling technique for better classification by exploiting link information in a small neighborhood around documents. Our technique also adapts gracefully to the fraction of neighboring documents having known topics. We experimented with pre-classified samples from Yahoo! and the US Patent Database. In previous work, we developed a text classifier that misclassified only 13% of the documents in the well-known Reuters benchmark; this was comparable to the best results ever obtained. This classifier misclassified 36% of the patents, indicating that classifying hypertext can be more difficult than classifying text. Naively using terms in neighboring documents increased error to 38%; our hypertext classifier reduced it to 21%. Results with the Yahoo! sample were more dramatic: the text classifier showed 68% error, whereas our hypertext classifier reduced this to only 21%.
The paper investigates techniques for extracting data from HTML sites through the use of automatically generated wrappers. To automate the wrapper generation and the data extraction process, the paper develops a novel technique to compare HTML pages and generate a wrapper based on their similarities and differences. Experimental results on real-life data-intensive Web sites confirm the feasibility of the approach.
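The page-comparison idea can be pictured by aligning two pages from the same site: tokens shared by both pages become the template, and mismatching spans become data slots. The sketch below uses Python's difflib on two invented product snippets; real wrapper generation works on the HTML structure and also handles repeated and optional patterns.

```python
# Rough sketch of template/data separation by aligning two pages from the same site.
from difflib import SequenceMatcher

page_a = "<li> Title: <b> DB Primer </b> Price: <i> 30 </i> </li>".split()
page_b = "<li> Title: <b> Web Mining </b> Price: <i> 45 </i> </li>".split()

matcher = SequenceMatcher(a=page_a, b=page_b)
wrapper = []
for op, a0, a1, b0, b1 in matcher.get_opcodes():
    if op == "equal":
        wrapper.extend(page_a[a0:a1])   # shared tokens become part of the template
    else:
        wrapper.append("#DATA#")        # mismatching spans become extraction slots
print(" ".join(wrapper))
# -> <li> Title: <b> #DATA# </b> Price: <i> #DATA# </i> </li>
```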
Manually querying search engines in order to accumulate a large body of factual information is a tedious, error-prone process of piecemeal search. Search engines retrieve and rank potentially relevant documents for human perusal, but do not extract facts, assess confidence, or fuse information from multiple documents. This paper introduces KnowItAll, a system that aims to automate the tedious process of extracting large collections of facts from the web in an autonomous, domain-independent, and scalable manner. The paper describes preliminary experiments in which an instance of KnowItAll, running for four days on a single machine, was able to automatically extract 54,753 facts. KnowItAll associates a probability with each fact, enabling it to trade off precision and recall. The paper analyzes KnowItAll's architecture and reports on lessons learned for the design of large-scale information extraction systems.
Targeted IE methods are transforming into open-ended techniques.
Application of data mining techniques to the World Wide Web, referred to as Web mining, has been the focus of several recent research projects and papers. However, there is no established vocabulary, leading to confusion when comparing research efforts. The term Web mining has been used in two distinct ways. The first, called Web content mining in this paper, is the process of information discovery from sources across the World Wide Web. The second, called Web usage mining, is the process of mining for user browsing and access patterns. We define Web mining and present an overview of the various research issues, techniques, and development efforts. We briefly describe WEBMINER, a system for Web usage mining, and conclude the paper by listing research issues.
The Internet presents a huge amount of useful information, usually formatted for its human users, which makes it difficult to extract relevant data from various sources automatically. Therefore, the availability of robust, flexible Information Extraction (IE) systems that transform Web pages into program-friendly structures such as a relational database will become a great necessity. Although many approaches for data extraction from Web pages have been developed, there has been limited effort to compare such tools. Unfortunately, in only a few cases can the results generated by distinct tools be directly compared since the addressed extraction tasks are different. This paper surveys the major Web data extraction approaches and compares them in three dimensions: the task domain, the techniques used, and the degree of automation. The criteria of the first dimension explain why an IE system fails to handle some Web sites of particular structures. The criteria of the second dimension classify IE systems based on the techniques used. The criteria of the third dimension measure the degree of automation for IE systems. We believe these criteria provide qualitative measures to evaluate various IE approaches.
Collaborative filters help people make choices based on the opinions of other people. GroupLens is a system for collaborative filtering of netnews, to help people find articles they will like in the huge stream of available articles. News reader clients display predicted scores and make it easy for users to rate articles after they read them. Rating servers, called Better Bit Bureaus, gather and disseminate the ratings. The rating servers predict scores based on the heuristic that people who agreed in the past will probably agree again. Users can protect their privacy by entering ratings under pseudonyms, without reducing the effectiveness of the score prediction. The entire architecture is open: alternative software for news clients and Better Bit Bureaus can be developed independently and can interoperate with the components we have developed.
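The "people who agreed in the past will probably agree again" heuristic is typically realized as a correlation-weighted prediction. The sketch below is a minimal illustration with made-up ratings: a user's predicted score for an article is their mean rating plus the Pearson-weighted deviations of other raters. It shows the idea only and is not the GroupLens implementation.

```python
# Predict a user's score for an article from other users' ratings, weighted by
# Pearson correlation over co-rated articles. Ratings are invented for illustration.
import math

ratings = {                                   # user -> {article: rating}
    "alice": {"a1": 5, "a2": 3, "a3": 4},
    "bob":   {"a1": 4, "a2": 2, "a3": 5, "a4": 4},
    "carol": {"a1": 1, "a2": 5, "a4": 2},
}

def pearson(u, v):
    common = set(ratings[u]) & set(ratings[v])
    if len(common) < 2:
        return 0.0
    ru = [ratings[u][i] for i in common]
    rv = [ratings[v][i] for i in common]
    mu, mv = sum(ru) / len(ru), sum(rv) / len(rv)
    num = sum((x - mu) * (y - mv) for x, y in zip(ru, rv))
    den = math.sqrt(sum((x - mu) ** 2 for x in ru) * sum((y - mv) ** 2 for y in rv))
    return num / den if den else 0.0

def predict(user, article):
    mean_u = sum(ratings[user].values()) / len(ratings[user])
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or article not in r:
            continue
        w = pearson(user, other)
        mean_o = sum(r.values()) / len(r)
        num += w * (r[article] - mean_o)
        den += abs(w)
    return mean_u + (num / den if den else 0.0)

print(round(predict("alice", "a4"), 2))
```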
Classic IR (information retrieval) is inherently predicated on users searching for information, the so-called "information need". But the need behind a web search is often not informational -- it might be navigational (give me the url of the site I want to reach) or transactional (show me sites where I can perform a certain transaction, e.g. shop, download a file, or find a map). We explore this taxonomy of web searches and discuss how global search engines evolved to deal with web-specific needs.
With the growth of the Web, there has been a rapid increase in the number of users who need to access online databases without having a detailed knowledge of the schema or of query languages; even relatively simple query languages designed for non-experts are too complicated for them. We describe BANKS, a system which enables keyword-based search on relational databases, together with data and schema browsing. BANKS enables users to extract information in a simple manner without any knowledge of the schema or any need for writing complex queries. A user can get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results. BANKS models tuples as nodes in a graph, connected by links induced by foreign key and other relationships. Answers to a query are modeled as rooted trees connecting tuples that match individual keywords in the query. Answers are ranked using a notion of proximity coupled with a notion of prestige of nodes based on inlinks, similar to techniques developed for Web search. We present an efficient heuristic algorithm for finding and ranking query results.
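The graph view described above can be illustrated with a toy database: tuples become nodes, foreign-key references become edges, and an answer connects one matching tuple per keyword. The sketch below approximates tree cost by summed shortest-path lengths from a candidate root; the table contents are invented, and the real BANKS ranking also factors in node prestige based on in-links.

```python
# Toy keyword search over a tuple graph: pick one matching tuple per keyword and the
# root that connects them most cheaply. Requires the third-party networkx package.
import itertools
import networkx as nx

tuples = {
    "author:1": "Jim Gray", "paper:7": "Transaction Concepts",
    "paper:9": "Data Cube", "conf:3": "SIGMOD",
}
g = nx.Graph()
g.add_nodes_from(tuples)
g.add_edges_from([("author:1", "paper:7"), ("author:1", "paper:9"), ("paper:9", "conf:3")])

def answer(keywords):
    matches = [[n for n, text in tuples.items() if kw.lower() in text.lower()]
               for kw in keywords]
    best = None
    for combo in itertools.product(*matches):            # one matching tuple per keyword
        for root in g.nodes:
            cost = sum(nx.shortest_path_length(g, root, n) for n in combo)
            if best is None or cost < best[0]:
                best = (cost, root, combo)
    return best

print(answer(["gray", "cube"]))
```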
This document describes the processes that govern the operation of the World Wide Web Consortium (W3C) and the relationships between W3C Members, the W3C Team, and the general public. This version of the Process Document was made available to the Advisory Committee on 1 November 1999 and to the general public on 11 November 1999.
This article aims to explore the relatively under-researched text–image relationships within reply GIFs, with and without embedded text and/or user-added text. Drawing on existing frameworks for analysing text–image relationships, the article qualitatively examines a set of 41 response GIFs replying to the @AITA_Online Twitter account between May 2019 and August 2020. Each text (the stimulus, the user-added text and the embedded text) is related separately to the visual content to tease out the relationships therein, to ultimately answer what role, if any, the GIFs’ visual element plays in the interpretation of a reply as a whole.
Mohini M. Sathe | INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
Serverless web development marks a significant transformation in how modern applications are built by eliminating the burden of server management for developers. This review examines the progression, advantages, drawbacks, and potential of serverless computing, with a primary focus on Function-as-a-Service (FaaS) and Backend-as-a-Service (BaaS) models. The main aim is to evaluate current academic and industry trends, draw comparisons between leading technologies, and provide synthesized insights into their real-world usage. The analysis is based on diverse scholarly publications, performance studies, and practical implementations. Key themes include scalability, operational cost, security challenges, and implementation scenarios. While serverless systems promote faster development and operational simplicity, limitations such as cold start delays, monitoring difficulties, and dependency on specific vendors remain significant concerns. This review identifies areas requiring further investigation and offers guidance for future research, aiming to deepen the understanding of serverless frameworks. It serves as a resource for both researchers and developers interested in adopting or refining serverless methodologies across a variety of use cases. As the demand for rapid, scalable, and cost-effective digital solutions increases, serverless computing has emerged as a strategic approach for organizations seeking to streamline DevOps processes and focus more on business logic than infrastructure. Its event-driven nature and automatic scaling capabilities align well with the dynamic needs of web-based services, making it a suitable architecture for microservices, real-time APIs, and data processing tasks. However, achieving consistent performance and maintaining observability in ephemeral environments present ongoing challenges that must be addressed through innovation in tooling and cross-platform standardization. Key Words: Serverless Computing, Function-as-a-Service (FaaS), Backend-as-a-Service (BaaS), Cloud Computing.
The hypertext transfer protocol (HTTP) request–response cycles during webpage access and content posting exhibit recognisable patterns; however, no unified standard currently streamlines both activities, despite the existence of independent specifications for each. Previous research has leveraged cycles of client–server requests and responses to predict outcomes such as user behaviour (UB) analysis, anomaly detection (AD), performance optimisation (PE), predictive maintenance (PM) and user authentication and security (UA), often without explicitly associating these activities. Addressing this gap, the present study focuses on the combined modelling of HTTP request–response cycles for both webpage access and personal information submission. An experimental study was conducted, where HTTP sessions were generated and analysed for both access and posting activities. Six machine learning models—Decision Tree, Random Forest, Gradient Boosting, k‐Nearest Neighbours (kNNs), Logistic Regression and Support Vector Machine—were applied to both the CSIC 2010 HTTP dataset and lab‐generated HTTP transmission datasets across the UB, AD, PE‐PM and UA tasks. Results indicate that the Random Forest classifier achieved the highest accuracy of 97.53% in predicting AD‐based HTTP request–response cycles during webpage access, and 85.93% accuracy in predicting PE‐PM tasks during content posting. Gradient Boosting, kNNs and Support Vector Machine models also demonstrated strong versatility and robustness across different HTTP cycle prediction tasks. Furthermore, the analysis concluded that HTTP request–response cycles for webpage access exhibit greater structural consistency compared to those associated with content posting activities.
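The modelling setup can be sketched with scikit-learn: numeric features of each request–response cycle feed a Random Forest classifier. The synthetic features and the toy label rule below are invented stand-ins for the CSIC 2010 and lab-generated datasets; they only show the shape of the pipeline, not the study's actual feature set.

```python
# Hedged sketch: train a Random Forest on simple numeric features of HTTP cycles.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.integers(0, 2, n),                 # 0 = GET (page access), 1 = POST (posting)
    rng.integers(0, 5000, n),              # request body size in bytes
    rng.choice([200, 302, 404, 500], n),   # response status code
    rng.random(n),                         # response time in seconds
])
y = (X[:, 0] == 1) & (X[:, 1] > 2500)      # toy "anomalous posting" label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```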
For large oil-immersed transformers, their metal-enclosed structure poses significant challenges for direct visual inspection of internal defects. To ensure the effective detection of internal insulation defects, this study employs a self-developed micro-robot for internal visual inspection. Given the substantial morphological and dimensional variations of target defects (e.g., carbon traces produced by surface discharge inside the transformer), the intelligent and efficient extraction of carbon trace features from complex backgrounds becomes critical for robotic inspection. To address these challenges, we propose the DCMC-UNet, a semantic segmentation model for carbon traces containing adaptive illumination enhancement and dynamic feature fusion. For blurred carbon trace images caused by unstable light reflection and illumination in transformer oil, an improved CLAHE algorithm is developed, incorporating learnable parameters to balance luminance and contrast while enhancing edge features of carbon traces. To handle the morphological diversity and edge complexity of carbon traces, a dynamic deformable encoder (DDE) was integrated into the encoder, leveraging deformable convolutional kernels to improve carbon trace feature extraction. An edge-aware decoder (EAD) was integrated into the decoder, which extracts edge details from predicted segmentation maps and fuses them with encoded features to enrich edge features. To mitigate the semantic gap between the encoder and the decoder, we replace the standard skip connection with a cross-level attention connection fusion layer (CLFC), enhancing the multi-scale fusion of morphological and edge features. Furthermore, a multi-scale atrous feature aggregation module (MAFA) is designed in the neck to enhance the integration of deep semantic and shallow visual features, improving multi-dimensional feature fusion. Experimental results demonstrate that DCMC-UNet outperforms U-Net, U-Net++, and other benchmarks in carbon trace segmentation. For the transformer carbon trace dataset, it achieves better segmentation than the baseline U-Net, with an improved mIoU of 14.04%, Dice of 10.87%, pixel accuracy (P) of 10.97%, and overall accuracy (Acc) of 5.77%. The proposed model provides reliable technical support for surface discharge intensity assessment and insulation condition evaluation in oil-immersed transformers.
Fan Wu, Cuiyun Gao, Shuqing Li +2 more | Proceedings of the ACM on Software Engineering
Converting user interfaces into code (UI2Code) is a crucial step in website development, which is time-consuming and labor-intensive. The automation of UI2Code is essential to streamline this task and improve development efficiency. Deep learning-based methods exist for the task; however, they heavily rely on a large amount of labeled training data and struggle to generalize to real-world, unseen web page designs. The advent of Multimodal Large Language Models (MLLMs) presents potential for alleviating the issue, but they struggle to comprehend the complex layouts in UIs and to generate accurate code with the layout preserved. To address these issues, we propose LayoutCoder, a novel MLLM-based framework generating UI code from real-world webpage images, which includes three key modules: (1) Element Relation Construction, which aims at capturing UI layout by identifying and grouping components with similar structures; (2) UI Layout Parsing, which aims at generating UI layout trees for guiding the subsequent code generation process; and (3) Layout-Guided Code Fusion, which aims at producing accurate code with the layout preserved. For evaluation, we build a new benchmark dataset named Snap2Code, which involves 350 real-world websites and is divided into seen and unseen parts to mitigate the data leakage issue, in addition to the popular dataset Design2Code. Extensive evaluation shows the superior performance of LayoutCoder over the state-of-the-art approaches. Compared with the best-performing baseline, LayoutCoder improves 10.14% in the BLEU score and 3.95% in the CLIP score on average across all datasets.
Jing Huang, Jie Song | Journal of King Saud University - Computer and Information Sciences
Literature mirrors a nation’s ideology and language system, with British and American literature (BALW) differing greatly from Chinese literature due to distinct historical, linguistic, and cultural backgrounds. The Chinese language system struggles to accurately retrieve and classify BALW. To tackle this, we propose a BALW appreciation model. Built upon a base model akin to DocBERT, it uses the BERT pre-trained model in the text representation module to capture contextual semantics. A heterogeneous graph attention network (HAN) with word, feature word, and label nodes is designed to extract local semantics in the text. These features are then integrated for multi-label classification. Experiments on a curated dataset show our model outperforms the base one, improving Hamming Loss, Macro-F1, and Micro-F1 by 0.009, 1.9%, and 5.38%, respectively. This enhances intelligent classification and retrieval of BALW, benefiting literary appreciation and cross-cultural literary exchange.
This study compares the performance of Progressive Web Apps (PWA) and traditional web applications using a custom Chrome extension and Google Lighthouse, focusing on Tokopedia's e-commerce platform. The research employs a quantitative approach with controlled testing environments across three viewports for the custom extension (desktop, tablet, mobile) and two viewports for Google Lighthouse (desktop, mobile). The custom extension measures eleven metrics, including Core Web Vitals, PWA features, and resource usage, while Google Lighthouse provides five core metrics. Results show PWA implementation improves performance with 9.9% better First Contentful Paint on desktop and significant memory efficiency (29-33MB vs 59-62MB). The comparison between testing tools reveals methodology differences, with the custom extension showing optimistic results in real-world conditions and Lighthouse providing more conservative measurements under throttled conditions. This research contributes to PWA performance measurement methodology by combining real-world and standardized testing approaches.
Creating collections of societally impactful events is a challenging task given the sheer amount of information about such events covering a large variety of aspects and perspectives in web archives and the live web. The automatic creation of such collections from web archives typically does not live up to the high standards of web archivists, who put lots of manual effort into carefully curating collections. Furthermore, the lack of engaging presentation methods creates a burden for users aiming to interact effectively with event collections in order to explore an event in its entirety. Therefore, we (i) conduct expert interviews to determine the requirements for building and utilising event collections from the perspectives of web archivists, (ii) introduce EventExplorer – a retrieval-augmented generation (RAG) approach to create event collections through efficient retrieval and diversified ranking – and make it available in an interactive web system, (iii) apply EventExplorer on different sources including a web archive and the live web, and (iv) discuss which requirements are met by EventExplorer as well as the challenges that remain for future work, with a specific emphasis on the distinctive characteristics of both archived web and live web environments. We demonstrate the effectiveness of EventExplorer applied on web archives through a user study of our interactive system. Then, we transfer our lessons learned to the live web by creating event collections of 166 elections in Europe. Our evaluation results show the effectiveness of EventExplorer in addressing the requirements identified in our expert interviews. Further, we derive a set of challenges and potential future steps for bringing together the automatic creation of web archive collections and manual curation. Finally, we discuss how to make web archives ready for their use in RAG systems.
Zhengbing Hu, Dmytro Uhryn, Artem Kalancha | Vìsnik Nacìonalʹnogo unìversitetu Lʹvìvsʹka polìtehnìka Serìâ Ìnformacìjnì sistemi ta merežì
This article presents a study aimed at developing an optimal concept for analyzing and comparing information sources based on large amounts of text information using natural language processing (NLP) methods. The object of the study was Telegram news channels, which are used as sources of text data. Pre-processing of texts was carried out, including cleaning, tokenization and lemmatization, to form a global dictionary consisting of unique words from all information sources. For each source, a vector representation of texts was constructed, the dimension of which corresponds to the number of unique words in the global dictionary. The frequency of use of each word in the channel texts was displayed in the corresponding positions of the vector. By applying the cosine similarity algorithm to pairs of vectors, a square matrix was obtained that demonstrates the degree of similarity between different sources. An analysis of the similarity of channels in limited time intervals was conducted, which allowed us to identify trends in changes in their information policies. The model parameters were optimized to ensure maximum channel differentiation, which increased the efficiency of the analysis. Clustering algorithms were applied, which divided the channels into groups according to the degree of lexical similarity. The results of the study demonstrate the effectiveness of the proposed approach for quantitatively assessing the similarity and clustering text data from different sources. The proposed method can be used to analyze information sources, identify relationships between sources, study the dynamics of changes in their activities, and assess the socio-cultural impact of media content.
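The described pipeline (global dictionary, per-channel term-frequency vectors, pairwise cosine similarity) reduces to a few lines of linear algebra. The sketch below uses invented channel texts purely to show the shape of the computation; the real study adds cleaning, tokenization, and lemmatization before counting.

```python
# Build a global vocabulary, represent each channel as a term-frequency vector,
# and compute the square cosine-similarity matrix between channels.
import numpy as np

channels = {
    "channel_a": "election results election turnout economy",
    "channel_b": "economy inflation election",
    "channel_c": "sports football match",
}

vocab = sorted({w for text in channels.values() for w in text.split()})
vectors = np.array([[text.split().count(w) for w in vocab]
                    for text in channels.values()], dtype=float)

normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
similarity = normed @ normed.T          # similarity[i, j] = cosine(channel_i, channel_j)
print(np.round(similarity, 2))
```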
The rapid growth of large-scale information and the dynamic nature of user behaviors pose significant challenges for modern information retrieval systems, which often struggle to adapt to non-stationary environments and fail to fully utilize multimodal data, leading to suboptimal performance. To address these issues, this study proposes an adaptive deep reinforcement learning (RL) framework for information retrieval and management, which combines RL, multimodal data fusion, and an adaptive update mechanism to dynamically adjust to evolving user preferences and document collections. The framework employs an RL-based policy network to optimize retrieval strategies, a multimodal encoder to integrate diverse data sources, and an adaptive mechanism to maintain robustness in dynamic scenarios.
José Alfonso Aguilar-Calderón, Carolina Tripp-Barba, Pedro Alfonso Aguilar-Calderón +2 more | Advances in Computational Intelligence and Robotics book series
Generative Artificial Intelligence has emerged as a transformative force across multiple domains of computer science, including Web Engineering. By leveraging models such as Generative Pretrained Transformers, Diffusion Models, and Large Language Models, developers can now automate, augment, and innovate traditional processes involved in the design, development, and maintenance of web systems. This chapter explores the integration of generative AI in Web Engineering, with a focus on current applications, architectural enhancements, tooling implications, ethical concerns, and emerging research directions. It critically analyzes how generative AI reshapes paradigms within the field and offers a roadmap for its responsible and effective use in the web ecosystem.
During crisis events, digital information volume can increase by over 500% within hours, with social media platforms alone generating millions of crisis-related posts. This volume creates critical challenges for emergency responders who require timely access to the concise subset of accurate information they are interested in. Existing approaches rely heavily on the power of large language models. However, the use of large language models limits the scalability of the retrieval procedure and may introduce hallucinations. This paper introduces a novel multi-stage text retrieval framework to enhance information retrieval during crises. Our framework employs a novel three-stage extractive pipeline where (1) a topic modeling component filters candidates based on thematic relevance, (2) an initial high-recall lexical retriever identifies a broad candidate set, and (3) a dense retriever reranks the remaining documents. This architecture balances computational efficiency with retrieval effectiveness, prioritizing high recall in early stages while refining precision in later stages. The framework avoids the introduction of hallucinations, achieving a 15% improvement in BERT-Score compared to existing solutions without requiring any costly abstractive model. Moreover, our sequential approach accelerates the search process by 5% compared to a single-stage dense retrieval approach, with minimal effect on performance in terms of BERT-Score.
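The three-stage architecture can be sketched as a simple function composition: a topic filter, a cheap high-recall lexical retriever, and a dense re-ranker. In the sketch below the scorers are deliberately naive stand-ins (keyword overlap, raw term frequency, and a dot product over a caller-supplied embed function); the paper's actual components are a topic model, a lexical retriever, and a dense retriever.

```python
# Schematic three-stage extractive pipeline: topic filter -> lexical recall -> dense rerank.
def topic_filter(docs, topic_terms):
    return [d for d in docs if any(t in d.lower() for t in topic_terms)]

def lexical_retrieve(docs, query, k):
    score = lambda d: sum(d.lower().count(q) for q in query.lower().split())
    return sorted(docs, key=score, reverse=True)[:k]

def dense_rerank(docs, query, embed):
    sim = lambda d: sum(a * b for a, b in zip(embed(d), embed(query)))
    return sorted(docs, key=sim, reverse=True)

def retrieve(docs, query, topic_terms, embed, k=50):
    candidates = topic_filter(docs, topic_terms)          # stage 1: thematic relevance
    candidates = lexical_retrieve(candidates, query, k)   # stage 2: cheap, high recall
    return dense_rerank(candidates, query, embed)         # stage 3: precise re-ranking
```

The point of the ordering is cost: the expensive dense scorer only ever sees the small candidate set that survives the two cheap stages.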
In this article, we present Stanford Named Entity Recognition and Classification (SNERC), an intelligent system designed to enhance knowledge management through named entity recognition (NER) and document classification (DC) in the field of Applied Gaming. In this domain, the effective application of NER and DC is essential for addressing information overload (IO), enabling software developers to efficiently search, filter, and retrieve large volumes of textual data from web sources. SNERC streamlines the management and deployment of machine learning (ML)-based NER models, supporting the accurate extraction of named entities (NEs) and the classification of heterogeneous textual documents. The system tackles key challenges in NER, such as the impact of language and domain specificity on model performance, domain adaptation, and the complexity of handling diverse NE types. We demonstrate SNERC’s capabilities through real-world use cases, highlighting improvements in DC and information retrieval (IR) within applied gaming scenarios. The system provides core functionalities for training, evaluating, and managing NER models using the Stanford CoreNLP framework. Additionally, SNERC integrates with a rule-based expert system (RBES) to enable the automatic categorization of documents into predefined taxonomies within a knowledge management system. We present results from comprehensive qualitative and quantitative evaluations—measured through precision, recall, and F-score—to assess the system’s effectiveness and identify areas for further optimization, supporting seamless integration into real-world operational environments.
The exponential growth in web content has fueled an increased demand for smart, scalable, and responsible website content filters. This review explores the origins, classification, and comparison of content filtering algorithms, from traditional rule-based and keyword-based systems to modern machine learning-based and hybrid approaches. Particular attention is paid to their working methods, evaluation processes, deployment complexities, and behaviour in practical deployments. The article outlines how static filters are becoming inadequate for combating dynamic and encrypted threats, and describes how more recent innovations - visual phishing detection, context-aware systems and federated learning - are changing the filtering landscape. Drawing on 30 recent academic works on the theory and practice of content filtering systems, it compares them along the axes of accuracy, precision, recall, and scalability. It also identifies major challenges, including privacy compromise, false positives, and the need for explainable AI. The review ends with directions for future research, based on personalization, transparency, and the wider adoption of deep learning architectures, and offers useful insights to researchers, cybersecurity professionals, and platform developers seeking to build safer and more trustworthy web spaces.
Large Language Models (LLMs) are highly effective at replicating human tasks and boosting productivity but face challenges in accurate data extraction due to prioritizing fluency over factual precision. Researchers are addressing these limitations by combining LLMs with Retrieval-Augmented Generation (RAG) models. This approach utilizes chunking, searching, and ranking algorithms to streamline data retrieval from unstructured text, improving LLMs’ precision and processing. The findings provide key insights into optimizing chunking strategies and set the stage for the advancement and broader application of RAG-enhanced systems.
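The chunking-searching-ranking loop mentioned above can be illustrated in a few lines: split the source text into overlapping chunks, rank them by overlap with the query, and place the top chunks into the prompt. The chunk sizes and the term-overlap scorer below are arbitrary illustrative choices, not the settings recommended by the study.

```python
# Simplified RAG retrieval sketch: overlapping chunks, toy term-overlap ranking,
# and prompt assembly from the top-ranked chunks.
def chunk(text, size=200, overlap=50):
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def rank_chunks(chunks, query, top_k=3):
    q_terms = set(query.lower().split())
    score = lambda c: len(q_terms & set(c.lower().split()))
    return sorted(chunks, key=score, reverse=True)[:top_k]

def build_prompt(document, question):
    context = "\n\n".join(rank_chunks(chunk(document), question))
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}"
```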
Machine learning techniques have generated significant interest in the development of recommender systems. Given the vast array of direct and indirect variables that can be used to predict user preferences, there is a growing need for scalable, reliable algorithms and systems that offer high availability and scalability. In today’s technologically advanced era, people have become more open-minded and increasingly depend on modern applications for daily needs such as purchasing accessories, watching movies, and more. The rising demand for online shopping and media consumption has led businesses to adopt machine learning-driven technologies to efficiently identify the most relevant products for users, with less effort compared to traditional marketing methods. Recommender systems (RS), particularly content-based filtering systems, play a vital role in both personal and professional contexts. These systems act as intermediaries between content providers—including social media platforms, e-commerce websites, and streaming services—and end users by suggesting items that match user preferences and past behaviors. Such personalized solutions are especially valuable when users are uncertain about what they want. Clustering is a key method in this space, which involves organizing a population or dataset into distinct groups, ensuring that data points within a single group are more similar to each other than to those in other groups. The goal is to identify users with similar characteristics and group them together in clusters associated with specific products. This research introduces a Double Token Weighted Clustering Model (DTWCM) designed to analyze and group relevant product recommendations sourced from multiple online recommendation systems. The model efficiently delivers high-quality suggestions to users. When compared with the traditional Adaptive Weights Clustering model, the proposed DTWCM demonstrates improved accuracy and scalability.
This research explores the translation of a website (web page). This research is library research. As the internet expands in non-English-speaking countries, multilingual websites have become essential for businesses and accessibility. Translating websites into English enhances global reach but presents challenges in technical, content, and linguistic aspects. Technical issues include file formats and encoding, while content considerations involve text, images, and symbols. Linguistically, style, meaning, and idioms require careful handling. Website translation methods include manual and machine translation. Manual translation ensures accuracy, requiring expertise, contextual understanding, and cultural sensitivity. Machine translation, using AI, offers speed, cost-effectiveness, and consistency but may lack nuance and context. Browsers like Google Chrome, Mozilla Firefox, and Microsoft Edge provide built-in translation features, while extensions further enhance accessibility. However, machine translation can negatively impact SEO rankings and fail to consider reader-specific vocabulary. Moreover, translation influences website design and user experience (UX), affecting layout, menus, and usability. Translators must balance linguistic accuracy with usability, ensuring readability without compromising design. Ultimately, effective website translation requires careful planning to maintain accessibility, accuracy, and functionality. The implications of this research indicate the need for a more holistic approach to translating websites, taking into account readability, user experience, and cultural fit. Translators must understand the target audience well and adapt content to suit the local context while maintaining global communication goals.