Computer Science Information Systems

Data Mining Algorithms and Applications

Description

This cluster of papers covers a wide range of topics in data mining, including frequent pattern mining, association rule mining, sequential pattern mining, machine learning, decision trees, interestingness measures, high utility itemsets, temporal data mining, and knowledge discovery.

Keywords

Data Mining; Frequent Patterns; Association Rules; Sequential Patterns; Machine Learning; Decision Trees; Interestingness Measures; High Utility Itemsets; Temporal Data Mining; Knowledge Discovery

We consider the problem of discovering association rules between items in a large database of sales transactions. We present two new algorithms for solving this problem that are fundamentally different from the known algorithms. Empirical evaluation shows that these algorithms outperform the known algorithms by factors ranging from three for small problems to more than an order of magnitude for large problems. We also show how the best features of the two proposed algorithms can be combined into a hybrid algorithm, called AprioriHybrid. Scale-up experiments show that AprioriHybrid scales linearly with the number of transactions. AprioriHybrid also has excellent scale-up properties with respect to the transaction size and the number of items in the database.
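The level-wise idea behind this family of algorithms can be sketched in a few lines: count candidate itemsets against the transactions, keep the frequent ones, and join them to form the next level's candidates, pruning any candidate with an infrequent subset. A minimal illustrative sketch only, not the paper's exact implementations (which add hashing and buffer-management refinements); the toy transactions are invented:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return every itemset whose absolute support meets min_support."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_support}
    result = {s: c for s, c in counts.items() if c >= min_support}
    k = 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent (downward closure)
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c for c, n in counts.items() if n >= min_support}
        result.update((c, n) for c, n in counts.items() if n >= min_support)
        k += 1
    return result

freq = apriori([{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'b', 'c'}], min_support=2)
```

With these four transactions, all three items and all three pairs are frequent at support 2, while {a, b, c} appears only once and is pruned.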
1 Introduction: 1.1 What is Data Mining? 1.2 Motivating Challenges 1.3 The Origins of Data Mining 1.4 Data Mining Tasks 1.5 Scope and Organization of the Book 1.6 Bibliographic Notes 1.7 Exercises
2 Data: 2.1 Types of Data 2.2 Data Quality 2.3 Data Preprocessing 2.4 Measures of Similarity and Dissimilarity 2.5 Bibliographic Notes 2.6 Exercises
3 Exploring Data: 3.1 The Iris Data Set 3.2 Summary Statistics 3.3 Visualization 3.4 OLAP and Multidimensional Data Analysis 3.5 Bibliographic Notes 3.6 Exercises
4 Classification: Basic Concepts, Decision Trees, and Model Evaluation: 4.1 Preliminaries 4.2 General Approach to Solving a Classification Problem 4.3 Decision Tree Induction 4.4 Model Overfitting 4.5 Evaluating the Performance of a Classifier 4.6 Methods for Comparing Classifiers 4.7 Bibliographic Notes 4.8 Exercises
5 Classification: Alternative Techniques: 5.1 Rule-Based Classifier 5.2 Nearest-Neighbor Classifiers 5.3 Bayesian Classifiers 5.4 Artificial Neural Network (ANN) 5.5 Support Vector Machine (SVM) 5.6 Ensemble Methods 5.7 Class Imbalance Problem 5.8 Multiclass Problem 5.9 Bibliographic Notes 5.10 Exercises
6 Association Analysis: Basic Concepts and Algorithms: 6.1 Problem Definition 6.2 Frequent Itemset Generation 6.3 Rule Generation 6.4 Compact Representation of Frequent Itemsets 6.5 Alternative Methods for Generating Frequent Itemsets 6.6 FP-Growth Algorithm 6.7 Evaluation of Association Patterns 6.8 Effect of Skewed Support Distribution 6.9 Bibliographic Notes 6.10 Exercises
7 Association Analysis: Advanced Concepts: 7.1 Handling Categorical Attributes 7.2 Handling Continuous Attributes 7.3 Handling a Concept Hierarchy 7.4 Sequential Patterns 7.5 Subgraph Patterns 7.6 Infrequent Patterns 7.7 Bibliographic Notes 7.8 Exercises
8 Cluster Analysis: Basic Concepts and Algorithms: 8.1 Overview 8.2 K-means 8.3 Agglomerative Hierarchical Clustering 8.4 DBSCAN 8.5 Cluster Evaluation 8.6 Bibliographic Notes 8.7 Exercises
9 Cluster Analysis: Additional Issues and Algorithms: 9.1 Characteristics of Data, Clusters, and Clustering Algorithms 9.2 Prototype-Based Clustering 9.3 Density-Based Clustering 9.4 Graph-Based Clustering 9.5 Scalable Clustering Algorithms 9.6 Which Clustering Algorithm? 9.7 Bibliographic Notes 9.8 Exercises
10 Anomaly Detection: 10.1 Preliminaries 10.2 Statistical Approaches 10.3 Proximity-Based Outlier Detection 10.4 Density-Based Outlier Detection 10.5 Clustering-Based Techniques 10.6 Bibliographic Notes 10.7 Exercises
Appendix A Linear Algebra; Appendix B Dimensionality Reduction; Appendix C Probability and Statistics; Appendix D Regression; Appendix E Optimization; Author Index; Subject Index
Since most real-world applications of classification learning involve continuous-valued attributes, properly addressing the discretization process is an important problem. This paper addresses the use of the entropy minimization heuristic for discretizing the range of a continuous-valued attribute into multiple intervals.
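The core of the entropy-minimization heuristic is choosing the cut point that minimizes the class entropy of the induced partition; the paper applies this recursively, with an MDL stopping criterion, to obtain multiple intervals. A toy sketch of a single binary cut (invented data; the recursion and stopping rule are omitted):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n)
                for c in (labels.count(l) for l in set(labels)) if c)

def best_cut(values, labels):
    """Pick the boundary minimizing the weighted class entropy of the two sides."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cuts only between distinct attribute values
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint boundary
        if best is None or e < best[0]:
            best = (e, cut)
    return best  # (weighted entropy, cut point)

e, cut = best_cut([1, 2, 3, 10, 11, 12], ['A', 'A', 'A', 'B', 'B', 'B'])
```

On this perfectly separable toy data, the chosen cut is 6.5 with zero residual entropy.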
Data mining applications place special requirements on clustering algorithms, including: the ability to find clusters embedded in subspaces of high-dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high-dimensional datasets.
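CLIQUE's bottom-up search starts from one-dimensional dense units: each dimension is partitioned into equal-width units, and units holding more than a threshold number of points are kept; dense units in higher-dimensional subspaces are then assembled Apriori-style from these. A minimal one-dimensional sketch only (the grid resolution `xi`, density threshold `tau`, and points are illustrative assumptions, with data in the unit interval):

```python
def dense_units_1d(points, dim, xi=4, tau=2):
    """Count points in each of xi equal-width units of dimension `dim` over
    [0, 1] and return the indices of units holding more than tau points."""
    counts = [0] * xi
    for p in points:
        u = min(int(p[dim] * xi), xi - 1)  # clamp 1.0 into the last unit
        counts[u] += 1
    return {u for u, c in enumerate(counts) if c > tau}

pts = [(0.05, 0.9), (0.1, 0.8), (0.12, 0.85), (0.9, 0.1), (0.95, 0.2)]
dense_x = dense_units_1d(pts, dim=0)
```

Here only the first unit of the x-dimension (three points) exceeds the threshold; the higher-dimensional candidate step would then intersect such units across dimensions.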
Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist prolific patterns and/or long patterns. In this study, we propose a novel frequent pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a highly condensed, much smaller data structure, which avoids costly, repeated database scans, (2) our FP-tree-based mining adopts a pattern fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported new frequent pattern mining methods.
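The compression at the heart of FP-growth comes from inserting each transaction, with its infrequent items dropped and the rest reordered by descending global frequency, into a shared prefix tree, so transactions with common prefixes share nodes. A minimal sketch of the FP-tree construction step only (the header-table links and the conditional-pattern-base mining of full FP-growth are omitted; the transactions are invented):

```python
from collections import Counter

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_support):
    """Build an FP-tree: frequent items only, ordered by descending frequency."""
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_support}
    root = Node(None)
    for t in transactions:
        # Keep frequent items; order by global frequency, alphabetic tiebreak
        items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item))
            node.count += 1  # shared prefixes accumulate counts
    return root, freq

root, freq = build_fp_tree([['a', 'b', 'c'], ['a', 'b'], ['a', 'c'], ['a']],
                           min_support=2)
```

All four transactions share the single 'a' node (count 4), which is exactly the prefix sharing that keeps the tree much smaller than the database.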
This paper introduces concepts and algorithms of feature selection, surveys existing feature selection algorithms for classification and clustering, groups and compares different algorithms with a categorizing framework based on search strategies, evaluation criteria, and data mining tasks, reveals unattempted combinations, and provides guidelines for selecting feature selection algorithms. With the categorizing framework, we continue our efforts toward building an integrated system for intelligent feature selection. A unifying platform is proposed as an intermediate step. An illustrative example is presented to show how existing feature selection algorithms can be integrated into a meta algorithm that can take advantage of individual algorithms. An added advantage of doing so is to help a user employ a suitable algorithm without knowing details of each algorithm. Some real-world applications are included to demonstrate the use of feature selection in data mining. We conclude this work by identifying trends and challenges of feature selection research and development.
Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. Retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over LSI. In particular, the combination of models with different dimensionalities has proven to be advantageous.
Building a Data Warehouse: With Examples in SQL Server describes how to build a data warehouse completely from scratch and shows practical examples of how to do it. Author Vincent Rainardi also describes …
No abstract available.
Provides a systematic account of the subject area, concentrating on the most recent advances in the field. While the focus is on practical considerations, both theoretical and practical issues are explored. Among the advances covered are: regularized discriminant analysis and bootstrap-based assessment of the performance of a sample-based discriminant rule and extensions of discriminant analysis motivated by problems in statistical image analysis. Includes over 1,200 references in the bibliography.
Feature selection has been the focus of interest for quite some time and much work has been done. With the creation of huge databases and the consequent requirements for good machine learning techniques, new problems arise and novel approaches to feature selection are in demand. This survey is a comprehensive overview of many existing methods from the 1970s to the present. It identifies four steps of a typical feature selection method, categorizes the different existing methods in terms of generation procedures and evaluation functions, and reveals hitherto unattempted combinations of generation procedures and evaluation functions. Representative methods are chosen from each category for detailed explanation and discussion via example. Benchmark datasets with different characteristics are used for comparative study. The strengths and weaknesses of different methods are explained. Guidelines for applying feature selection methods are given based on data types and domain characteristics. This survey identifies the future research areas in feature selection, introduces newcomers to this field, and paves the way for practitioners who search for suitable methods for solving domain-specific real-world applications.
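The survey's two-part decomposition, a generation procedure that proposes candidate subsets and an evaluation function that scores them, can be sketched with greedy forward generation driven by a caller-supplied scorer. Both the scorer and the data below are illustrative stand-ins, not any method from the survey:

```python
def forward_select(features, evaluate, k):
    """Greedy forward generation: grow the subset one feature at a time,
    always adding the feature the evaluation function scores highest."""
    selected = []
    while len(selected) < k:
        candidates = [f for f in features if f not in selected]
        best = max(candidates, key=lambda f: evaluate(selected + [f]))
        selected.append(best)
    return selected

# Illustrative evaluation function: additive per-feature relevance scores
relevance = {'f1': 0.9, 'f2': 0.1, 'f3': 0.5}
chosen = forward_select(list(relevance),
                        lambda subset: sum(relevance[f] for f in subset), k=2)
```

Swapping in a backward or random generation procedure, or a wrapper-style evaluation function, recovers other cells of the survey's categorization.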
More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
From the Publisher: For over 25 years, C. J. Date's An Introduction to Database Systems has been the authoritative resource for readers interested in gaining insight into and understanding of the principles of database systems. This revision continues to provide a solid grounding in the foundations of database technology and to provide some ideas as to how the field is likely to develop in the future. Readers of this book will gain a strong working knowledge of the overall structure, concepts, and objectives of database systems and will become familiar with the theoretical principles underlying the construction of such systems.
Current computing and storage technology is rapidly outstripping society's ability to make meaningful use of the torrent of available data. Without a concerted effort to develop knowledge discovery techniques, organizations stand to forfeit much of the value from the data they currently collect and store.
Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity of major revenues. Researchers in many different fields have shown great interest in data mining. Several emerging applications in information-providing services, such as data warehousing and online services over the Internet, also call for various data mining techniques to better understand user behavior, to improve the service provided and to increase business opportunities. In response to such a demand, this article provides a survey, from a database researcher's point of view, on the data mining techniques developed recently. A classification of the available data mining techniques is provided and a comparative study of such techniques is presented.
Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges involved in real-world applications of knowledge discovery, and current and future research directions in the field.
We are given a large database of customer transactions. Each transaction consists of items purchased by a customer in a visit. We present an efficient algorithm that generates all significant association rules between items in the database. The algorithm incorporates buffer management and novel estimation and pruning techniques. We also present results of applying this algorithm to sales data obtained from a large retailing company, which show the effectiveness of the algorithm.
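Once frequent-itemset supports are known, a rule X → Y is kept when its confidence, support(X ∪ Y) / support(X), clears a threshold. A minimal sketch of this rule-generation step (the supports are hard-coded toy values for illustration; a real run would take them from the mining phase, and the paper's significance machinery is richer than a bare confidence cut):

```python
from itertools import combinations

def gen_rules(supports, min_conf):
    """Return (antecedent, consequent, confidence) for every qualifying split
    of each frequent itemset with at least two items."""
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                ante = frozenset(ante)
                conf = sup / supports[ante]  # support(X ∪ Y) / support(X)
                if conf >= min_conf:
                    rules.append((ante, itemset - ante, conf))
    return rules

supports = {frozenset('a'): 4, frozenset('b'): 3, frozenset('ab'): 3}
rules = gen_rules(supports, min_conf=0.8)
```

Here {a} → {b} has confidence 3/4 and is rejected, while {b} → {a} has confidence 3/3 = 1.0 and survives.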
The World Wide Web as a global information system has flooded us with a tremendous amount of data and information. Our capabilities for generating and collecting data have been increasing rapidly in this age of information technology. This explosive growth in stored data has generated an urgent need for new technologies and automated tools to assist us in transforming the data into useful information and knowledge. Data mining, popularly known as knowledge discovery in databases, is the automated or convenient extraction of patterns representing knowledge implicitly stored in large databases, and it addresses this problem. This article explains what data mining is and why it is important, and covers the basic concepts of data mining, data clusters, and data mining rules. How to choose a data mining system is also discussed, with some examples.
Subhash Padhyé | International Journal of Innovative Research in Engineering & Multidisciplinary Physical Sciences
This paper presents a high-level design for grouping data on its own and its children's attributes using the Java Persistence API (JPA) and the CriteriaBuilder feature. A grouping of the data set is defined as a view of the data set based on unique values of one or more attributes. This paper describes the design and approach for constructing grouping criteria dynamically. Grouping criteria can be defined in JSON format, and the JSON is then parsed to create the grouping query at run time. JPA is implemented by many leading OR-mapping tools and represents data as object-oriented programming models. The CriteriaBuilder API provides a way to construct SQL queries dynamically against data objects.
Anjali Mathur | International Journal for Research in Applied Science and Engineering Technology
Data mining is used to extract knowledge from huge amounts of data. Today, data mining helps different organizations focus on customers' behavior patterns. The research scope of data mining has extended into various fields. This paper discusses the concept of data mining, important issues, and applications. Hence arises the need for powerful and, most importantly, automatic tools for uncovering valuable slots of organized information from tremendous amounts of data. Consider a social networking site or a search engine: each receives millions of queries every day. At first, database management systems evolved to handle queries of similar types. The approach was then extended to advanced database management systems, data warehousing, and data mining for advanced data analysis and web-based databases. Data mining has penetrated deeply into every field of day-to-day life.
Purpose: The extraction of actionable insights is critical for intelligent systems and recommendation engines. However, traditional methods for action rule discovery face challenges in scalability and efficiency when applied to large datasets. This study introduces a correlation-based vertical partitioning method to improve the consistency and interpretability of action rules while addressing the limitations of random partitioning and unstructured approaches. Methods: The proposed method clusters flexible attributes using correlations, enabling structured partitions for parallel rule generation via hierarchical clustering. Comparative experiments evaluated its precision, runtime, lightness, and coverage against random and baseline partitioning approaches. Results: The correlation-based method outperformed random partitioning and significantly improved runtime efficiency over the baseline. It generates interpretable rules in a single iteration, avoiding variability and repeated runs, though challenges in rule combination efficiency suggest areas for improvement. Conclusion: The correlation-based vertical partitioning method strikes a balance between computational efficiency and rule quality, making it a promising solution for large-scale action rule discovery. Future work could enhance scalability further by improving the rule combination process and exploring hybrid or adaptive partitioning strategies to extend the method's applicability across diverse domains.
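The attribute-grouping step described above, clustering flexible attributes by their correlations before partitioning, can be illustrated with a toy stand-in. The greedy threshold grouping below only approximates the paper's hierarchical clustering, and the column names, data, and threshold are all invented for illustration:

```python
def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def partition_attributes(columns, threshold):
    """Greedy stand-in for correlation-based partitioning: place an attribute
    into an existing group when |correlation| with any member clears the
    threshold, otherwise start a new group."""
    groups = []
    for name in columns:
        for g in groups:
            if any(abs(pearson(columns[name], columns[m])) >= threshold
                   for m in g):
                g.append(name)
                break
        else:
            groups.append([name])
    return groups

cols = {'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [5, 1, 4, 2]}
parts = partition_attributes(cols, threshold=0.9)
```

Columns a and b are perfectly correlated and land in one partition, while c (|r| ≈ 0.42 with a) forms its own; each partition would then feed a parallel rule-generation worker.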
Sunny Kesireddy | World Journal of Advanced Engineering Technology and Sciences
The integration of large language models (LLMs) into enterprise workflows has opened new frontiers in cloud data engineering. This article presents a comprehensive evaluation of AI copilots in the development of scalable data pipelines across regulated environments. The article benchmarks LLMs on key engineering tasks including pipeline scaffolding, SQL optimization, IAM policy generation, and compliance rule encoding, providing insights into their capabilities and limitations in specialized technical contexts. It measures improvements in developer velocity, reduction in syntax errors, and overall impact on quality assurance cycles. Beyond automation, the article assesses how LLMs learn and generalize patterns from metadata-driven frameworks, making intelligent suggestions aligned with domain rules and architectural best practices. Special attention is given to the risks of hallucination, governance gaps, and security considerations that organizations must actively manage. It contributes to a deeper understanding of human-AI pair programming in high-stakes data systems, offering a framework for safely scaling AI-augmented development across data teams while preserving auditability, trust, and compliance.
Hi-C and single cell Hi-C (scHi-C) data are now routinely generated for studying an array of biological questions of interest, including whole genome chromatin organization to gain a better understanding of the chromosome three-dimensional hierarchical structure: compartments, Topologically Associated Domains (TADs), and long-range interactions. Due to concerns about data quality, especially for scHi-C because of its sparsity, data quality improvement is seen as a necessary step before performing analyses to answer biological questions. As such, methods have been developed accordingly, among them a set of methods that are "random walk"-based, including random walk with a limited number of steps (RWS) and random walk with restart (RWR). Nevertheless, there is little justification for the use of such methods, nor quantification of their performance success. Taking correct identification of TADs as the end point, in this paper, we describe the characteristics of random-walk-based approaches and carry out empirical investigation for identifying TADs before and after random walks. Due to the lack of practical guidelines for choosing the tuning parameters necessary for performing random walks, it is difficult to know how many steps of random walk for RWS, or how small a restart probability for RWR, one should choose to achieve good performance. Even in the unrealistic scenario where one has the hindsight of using the optimal parameter values, little improvement in downstream studies from first performing random walk was observed. This conclusion was based on extensive analytical analyses, simulation study, and real data applications.
Therefore, the current study provides a cautionary note to researchers who may consider using random-walk-based approaches prior to downstream analyses.
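Random walk with restart (RWR), one of the two smoothing schemes examined, iterates S ← (1 − r) · P · S + r · I on a normalized contact matrix P, so each locus's profile mixes diffused neighborhood signal with its own identity; the restart probability r is exactly the tuning parameter whose choice the paper flags as difficult. A minimal sketch on an invented 3-locus toy matrix (pure-Python nested lists, not an scHi-C pipeline):

```python
def rwr(P, restart, iters=100):
    """Iterate S = (1 - r) * P @ S + r * I to (near) its fixed point.
    P is a column-stochastic transition matrix as nested lists."""
    n = len(P)
    S = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(iters):
        S = [[(1 - restart) * sum(P[i][k] * S[k][j] for k in range(n))
              + (restart if i == j else 0.0)
              for j in range(n)] for i in range(n)]
    return S

# Toy contact structure: loci 0 and 1 interact, locus 2 is isolated
P = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
S = rwr(P, restart=0.5)
```

At the fixed point the interacting block smooths to 2/3 on the diagonal and 1/3 off it, while the isolated locus keeps all its mass, illustrating how RWR redistributes contacts within connected neighborhoods without leaking across them.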
Hong Lin , Wensheng Gan , Gengsen Huang +1 more | International Journal of Machine Learning and Cybernetics
Jakob Bach | Proceedings of the ACM on Management of Data
Subgroup-discovery methods find interesting regions in a dataset. In this article, we analyze two constraint types to enhance the interpretability of subgroups: First, we make subgroup descriptions small by limiting the number of features used. Second, we propose the novel problem of finding alternative subgroup descriptions, which cover a similar set of data objects as a given subgroup but use different features. We describe how to integrate both constraint types into heuristic subgroup-discovery methods as well as a novel Satisfiability Modulo Theories (SMT) formulation, which enables a solver-based search for subgroups. Further, we prove NP-hardness of optimization with either constraint type. Finally, we evaluate unconstrained and constrained subgroup discovery with 27 binary-classification datasets. We observe that heuristic search methods often yield high-quality subgroups fast, even with constraints.
This paper studies incremental rule discovery. Given a dataset D, rule discovery is to mine the set of rules on D such that their supports and confidences are above thresholds 𝜎 and 𝛅, respectively. We formulate incremental problems in response to updates Δ𝜎 and/or Δ𝛅, to compute the rules added and/or removed with respect to 𝜎 + Δ𝜎 and 𝛅 + Δ𝛅. The need for studying these problems is evident, since practitioners often want to adjust their support and confidence thresholds during discovery. The objective is to minimize unnecessary recomputation during the adjustments, not to restart the costly discovery process from scratch. As a testbed, we consider entity enhancing rules, which subsume popular data quality rules as special cases. We develop three incremental algorithms, in response to Δ𝜎, Δ𝛅, and both. We show that relative to a batch discovery algorithm, these algorithms are bounded, i.e., they incur the minimum cost among all incrementalizations of the batch one, and parallelly scalable, i.e., they guarantee to reduce runtime when given more processors. Using real-life data, we empirically verify that the incremental algorithms outperform the batch counterpart by up to 658× when Δ𝜎 and Δ𝛅 are either positive or negative.
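The simplest instance of this incremental setting can be sketched as a set difference over cached rule statistics: if each discovered rule's exact support and confidence are retained, moving the thresholds reduces to filtering, with no re-mining. This toy sketch only covers rules whose statistics were already computed; handling rules that were never materialized is where the paper's bounded algorithms do the real work, and the rule identifiers and numbers below are invented:

```python
def adjust(rules, old_sigma, old_delta, new_sigma, new_delta):
    """rules: {rule_id: (support, confidence)} with cached exact statistics.
    Return (added, removed) rule-id sets when thresholds move old -> new."""
    old = {r for r, (s, c) in rules.items() if s >= old_sigma and c >= old_delta}
    new = {r for r, (s, c) in rules.items() if s >= new_sigma and c >= new_delta}
    return new - old, old - new

stats = {'r1': (0.30, 0.90), 'r2': (0.20, 0.95), 'r3': (0.40, 0.70)}
added, removed = adjust(stats, old_sigma=0.25, old_delta=0.80,
                        new_sigma=0.15, new_delta=0.80)
```

Loosening the support threshold from 0.25 to 0.15 admits r2 without recomputing anything for r1, which is the recomputation-avoidance the paper aims to guarantee with provable bounds.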
Motivation: HAP-SAMPLE2 extends the functionality of the original HAP-SAMPLE tool for simulating genotype-phenotype data, now with features to handle population admixture and rare variant analysis. It allows users to define parameters such as disease prevalence and allele effect sizes for both common and rare variant simulations. Application: HAP-SAMPLE2 provides an efficient means for simulating complex datasets, suitable for large-scale projects like the 1000 Genomes Project. Its capabilities for population admixture allow users to create admixed populations or preserve substructures, while introducing novel variation through artificial recombination. Additionally, the tool supports burden testing for rare variants using fixed and Madsen-Browning weighting schemes. Availability: The software, along with a detailed vignette, is available on GitHub: https://github.com/M3dical/HAPSAMPLE2. Supplementary information: A supplemental material file and software vignette are available at Bioinformatics online.
This article aims to limit the rule explosion problem affecting market basket analysis (MBA) algorithms. More specifically, it shows how, if the minimum support threshold is specified not explicitly but in terms of the number of items to consider, it is possible to compute an upper bound on the number of generated association rules. Moreover, if the results of previous analyses (with different thresholds) are available, this information can also be taken into account, refining the upper bound and making it possible to compute lower bounds as well. The support determination technique is implemented as an extension of the Apriori algorithm but may be applied to any other MBA technique. Tests are executed on benchmarks and on a real problem provided by one of the major Italian supermarket chains, covering more than 500,000 transactions. On these benchmarks, experiments show that the rate of growth in the number of rules between tests with increasingly permissive thresholds ranges from 21.4 to 31.8 with the proposed method, while it would range from 39.6 to 3994.3 with the traditional thresholding method.
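Bounding rule counts by the number of items considered has a classical starting point: with d items, every rule is a split of a k-itemset (k ≥ 2) into non-empty antecedent and consequent, which gives at most 3^d − 2^(d+1) + 1 rules, a standard counting identity. The sketch below shows only this generic worst-case bound and how it grows with d; it is not the article's refined, history-aware bound:

```python
def max_rules(d):
    """Worst-case number of association rules over d items:
    3^d - 2^(d+1) + 1 (every ordered split of every k-itemset, k >= 2)."""
    return 3 ** d - 2 ** (d + 1) + 1

# How fast the worst case explodes as more items pass the support threshold
growth = [max_rules(d) for d in range(2, 7)]
```

For d = 2 the bound is 2 (a → b and b → a); by d = 6 it is already 602, which is why capping the number of items considered is an effective lever on rule explosion.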
Numerical Association Rule Mining (NARM), which simultaneously handles both numerical and categorical attributes, is a powerful approach for uncovering meaningful associations in heterogeneous datasets. However, designing effective NARM solutions is a complex task involving multiple sequential steps, such as data preprocessing, algorithm selection, hyper-parameter tuning, and the definition of rule quality metrics, which together form a complete processing pipeline. In this paper, we introduce NiaAutoARM, a novel Automated Machine Learning (AutoML) framework that leverages stochastic population-based metaheuristics to automatically construct full association rule mining pipelines. Extensive experimental evaluation on ten benchmark datasets demonstrated that NiaAutoARM consistently identifies high-quality pipelines, improving both rule accuracy and interpretability compared to baseline configurations. Furthermore, NiaAutoARM achieves superior or comparable performance to the state-of-the-art VARDE algorithm while offering greater flexibility and automation. These results highlight the framework's practical value for automating NARM tasks, reducing the need for manual tuning, and enabling broader adoption of association rule mining in real-world applications.
Multiprocessor systems are frequently used to solve computational problems. In some of these systems, however, it is not enough to obtain the answer as quickly as possible: the answer must arrive within a predetermined maximum time. This work describes a design method for k-fault-tolerant multiprocessor systems in a circulant graph configuration, adding k processors so that, even if up to k processors fail, the system continues to deliver its answer within the time limit.
Gauri S. Lolage | INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
Abstract - Campus placements often demand significant manual effort, from matching student profiles to job requirements to tracking application statuses. This process is transformed with the Training and Placement Officer (TPO) portal, built using the MERN stack and AI-powered resume parsing. The portal automates candidate shortlisting by analyzing resumes and ensuring at least a 50% skill match with job descriptions, significantly reducing TPO workloads. Job postings created by TPOs are instantly visible on student dashboards, with automated email notifications keeping students updated and engaged. Beyond facilitating placements, the system empowers students by offering personalized skill improvement recommendations, guiding them to enhance their profiles and stay competitive in the job market. For TPOs, the portal provides robust features like report generation for shortlisted candidates, offering valuable insights for better decision-making. Its intuitive interface ensures seamless navigation for users, fostering a transparent and efficient placement process. By bridging the gap between student abilities and employer expectations, the TPO portal creates a streamlined, fair, and mutually beneficial ecosystem, enhancing the campus recruitment experience for all stakeholders. Key Words: Resume Parsing, Job Matching, Automation, Resume Builder, MERN Stack, TPO.
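The 50%-skill-match shortlisting step can be illustrated with a small sketch. The portal itself runs on the MERN (Node.js) stack, so this Python version and its function names are purely illustrative assumptions, not the portal's implementation:

```python
def skill_match(candidate_skills, required_skills):
    """Fraction of a posting's required skills found in the candidate's
    parsed resume, with case-insensitive comparison."""
    required = {s.strip().lower() for s in required_skills}
    have = {s.strip().lower() for s in candidate_skills}
    if not required:
        return 1.0
    return len(required & have) / len(required)

def shortlist(candidates, required_skills, threshold=0.5):
    """Keep candidates whose skill coverage meets the threshold
    (>= 50% by default, as in the portal's shortlisting rule)."""
    return [name for name, skills in candidates
            if skill_match(skills, required_skills) >= threshold]
```

A candidate matching two of four required skills scores exactly 0.5 and is therefore shortlisted under the default rule.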
Wenhao Jiang , Gaohui Peng | International Journal of Computer Science and Information Technology
Based on the water resource data of Zhengzhou City from 1999 to 2023, this study employs the time-weighted Apriori algorithm (with a decay factor of 0.85) combined with static-dynamic discretization preprocessing methods to mine dynamic association rules for three major water use types: integrated urban-rural water use, industrial water use, and agricultural water use. The results indicate that integrated urban-rural water use exhibits rigid growth, with its proportion rising from 18.4% to 67.8%; industrial water use has achieved "reduced volume with enhanced efficiency," with its proportion decreasing from 24.9% to 11.2%; and agricultural water use presents three coexisting patterns: stable, increasing, and decreasing. Water resource allocation strategies have shifted from "production-promotes-water-use" to "water-use-determines-production," with newly added water resources primarily allocated to urban-rural water use.
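The core idea of time-weighted support with a decay factor is that recent years count more than old ones. A minimal sketch under stated assumptions (the paper's static-dynamic discretization preprocessing is omitted, and the function name is hypothetical): each transaction's vote is discounted by decay^(age in years):

```python
def time_weighted_support(transactions, itemset, decay=0.85):
    """Time-weighted support of `itemset` over (year, items) pairs.
    A transaction aged `a` years relative to the latest year contributes
    weight decay**a; support is the weighted hit count over total weight."""
    itemset = set(itemset)
    latest = max(year for year, _ in transactions)
    total = 0.0
    hits = 0.0
    for year, items in transactions:
        w = decay ** (latest - year)
        total += w
        if itemset <= set(items):
            hits += w
    return hits / total
```

With decay 0.85, a rule observed only in 1999 contributes almost nothing by 2023 (0.85^24 ≈ 0.02), which is what lets the mined rules track the shift in allocation strategy over time.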
Periodic pattern mining, a branch of data mining, is expanding to provide insight into the occurrence behavior of large volumes of data. Recently, a variety of industries, including fraud detection, telecommunications, retail marketing, research, and medicine, have found applications for rare association rule mining, which uncovers unusual or unexpected combinations. Only a limited body of literature has demonstrated how periodicity is essential in mining low-support rare patterns. In addition, attention must be paid to temporal datasets, which capture crucial information about the timing of pattern occurrences, and to stream datasets, which must handle high-speed streaming data. Several algorithms have been developed that effectively track the cyclic behavior of patterns and identify patterns displaying complete or partial periodic behavior in temporal datasets. Numerous frameworks have been created to examine the periodic behavior of streaming data. Nevertheless, no method has yet been proposed that focuses on the temporal information in a data stream and extracts rare partial periodic patterns. With a focus on identifying rare partial periodic patterns from temporal data streams, this paper proposes two novel sliding window-based single-scan approaches called R3PStreamSW-Growth and R3PStreamSW-BitVectorMiner. The findings showed that when the dense dataset Accidents is considered, for different threshold variations R3PStreamSW-BitVectorMiner outperformed R3PStreamSW-Growth by about 93%. Similarly, when the sparse dataset T10I4D100K is taken into account, R3PStreamSW-BitVectorMiner exhibits a 90% boost in performance. This demonstrates that, on a range of synthetic, real-world, sparse, and dense datasets and for different thresholds, R3PStreamSW-BitVectorMiner is significantly faster than R3PStreamSW-Growth.
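The sliding-window bit-vector idea can be caricatured as follows. This is a deliberately simplified sketch for illustration (class and method names are assumptions), not the actual R3PStreamSW-BitVectorMiner algorithm: each item keeps one bit per transaction in the current window, so support is a popcount and periodicity is read off from gaps between set bits:

```python
from collections import deque

class SlidingWindowBits:
    """Toy sliding window over a transaction stream. Bit i of an item's
    vector is set iff the item occurred in the i-th transaction of the
    current window."""
    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()

    def add(self, transaction):
        """Append a transaction; evict the oldest once the window is full."""
        self.window.append(set(transaction))
        if len(self.window) > self.window_size:
            self.window.popleft()

    def bits(self, item):
        return [1 if item in t else 0 for t in self.window]

    def support(self, item):
        # support = popcount of the item's bit vector
        return sum(self.bits(item))

    def max_period(self, item):
        """Largest gap between consecutive occurrences in the window;
        a small value suggests (partial) periodic behavior."""
        pos = [i for i, b in enumerate(self.bits(item)) if b]
        if len(pos) < 2:
            return None
        return max(b - a for a, b in zip(pos, pos[1:]))
```

Because each transaction is touched once as it enters the window, this is a single-scan scheme; the bit-vector representation is what typically makes such miners much faster than tree-growth variants on dense data.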
<title>Abstract</title> Objective To develop a risk prediction model for discordant growth in dichorionic twins, enabling early identification and screening of high-risk cases. Methods Clinical data from 1,098 dichorionic twin pregnancies delivered at Anhui Maternal and Child Health Hospital between January 2016 and January 2024 were retrospectively analyzed. Based on the presence of discordant growth, the cohort was divided into two groups: 231 cases with discordant growth and 867 without. The dataset was randomly split into a training set (70%) and a validation set (30%). Predictive models were developed using the training set, and performance was evaluated using the validation set. Candidate predictors were selected through univariate and multivariate logistic regression analyses. Risk prediction models were built using five machine learning (ML) algorithms: logistic regression (LR), random forest (RF), Gaussian Naive Bayes (GNB), k-nearest neighbors (k-NN), and extreme gradient boosting (XGBoost), to assess the likelihood of discordant twin growth. Results Univariate LR identified birth time, pre-pregnancy benign hypertension, pre-pregnancy autoimmune disease, umbilical cord abnormalities, and placental abnormalities as significant risk factors for discordant growth (<italic>P</italic> &lt; 0.05). Multivariate analysis confirmed the independence of these variables, with placental abnormalities showing the highest adjusted odds ratio (OR), followed by umbilical cord abnormalities and pre-pregnancy benign hypertension. The significance of pre-pregnancy autoimmune disease was reduced in the multivariate model.
The logistic regression model achieved an area under the curve (AUC) of 0.710 in the training set and 0.711 in the validation set. Sensitivity and specificity were 0.665 and 0.679, respectively, in the training set, and 0.667 and 0.638 in the validation set. Positive predictive values (PPVs) were high in both sets (training: 0.886; validation: 0.874), while negative predictive values (NPVs) were lower (training: 0.351; validation: 0.336). The Hosmer-Lemeshow goodness-of-fit test indicated a satisfactory model fit (<italic>P</italic> = 0.456 for training; <italic>P</italic> = 0.338 for validation). Decision curve analysis showed that for threshold probabilities between 10% and 50%, the model provided substantial net clinical benefit in both sets. Among the ML models (MLMs), k-NN achieved the highest AUC (0.687) and specificity (0.881), indicating strong discrimination and a low false-positive rate. GNB showed the highest sensitivity (0.710), effectively identifying true positives. LR and RF demonstrated balanced but moderate performance. In clinical decision curve analysis, at a threshold probability of 0.5, RF and GNB retained a positive net benefit (0.4), while XGBoost showed a net loss (-0.1), indicating overconfident predictions. Overall, the k-NN model demonstrated the best predictive performance. Conclusion Prediction models developed using birth time, pre-pregnancy benign hypertension, pre-pregnancy autoimmune disease, umbilical cord abnormalities, and placental abnormalities showed good predictive value for discordant growth in dichorionic twins. These models can assist clinicians in risk assessment, clinical consultation, and targeted screening of high-risk groups, enabling more precise follow-up and intervention.
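The metrics reported above (AUC, sensitivity, specificity) can be computed directly from predicted probabilities. A self-contained sketch, independent of the study's actual pipeline, using the Mann-Whitney interpretation of AUC (the probability a random positive is scored above a random negative, with ties counting 0.5):

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity = TP/(TP+FN), specificity = TN/(TN+FP) at a cutoff."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)
```

An AUC of 0.710, as reported for the LR model, means a randomly chosen discordant-growth pregnancy receives a higher predicted risk than a randomly chosen concordant one about 71% of the time.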
| Periodicals of Engineering and Natural Sciences (PEN)
Data mining is slowly but surely making its way into the educational field after dominating the business fields. This paper focuses on the research completed in the area of data mining in the higher education sector: colleges and universities. We look at the different implementations of data mining and the extent to which they have been utilized and benefited from.