
Advanced Database Systems and Queries

Description

This cluster of papers focuses on data stream management systems and techniques, including continuous queries, XML and relational database systems, query optimization, stream processing, approximate query processing, column-oriented database systems, complex event processing, and main memory databases.

Keywords

Data Stream Management; Continuous Queries; XML; Relational Database Systems; Query Optimization; Stream Processing; Approximate Query Processing; Column-oriented Database Systems; Complex Event Processing; Main Memory Databases

Introduction. 1. Improving Analysis with Object-Oriented Techniques. 2. Experiencing an Object Perspective. 3. Identifying Objects. 4. Identifying Structures. 5. Identifying Subjects. 6. Defining Attributes. 7. Defining Services. 8. Moving to Object-Oriented Design.
In database systems, users access shared data under the assumption that the data satisfies certain consistency constraints. This paper defines the concepts of transaction, consistency and schedule and shows that consistency requires that a transaction cannot request new locks after releasing a lock. Then it is argued that a transaction needs to lock a logical rather than a physical subset of the database. These subsets may be specified by predicates. An implementation of predicate locks which satisfies the consistency condition is suggested.
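The "no new locks after releasing a lock" condition is the two-phase rule. As a rough illustration (not code from the paper; the class and method names are invented), a lock manager can enforce it by tracking which transactions have entered their shrinking phase:

```python
# Minimal sketch of the two-phase rule: once a transaction releases any lock,
# it may not acquire new ones. Names here are illustrative, not from the paper.

class TwoPhaseViolation(Exception):
    pass

class LockManager:
    def __init__(self):
        self.held = {}          # txn id -> set of locked items
        self.shrinking = set()  # txns that have already released a lock

    def acquire(self, txn, item):
        if txn in self.shrinking:
            raise TwoPhaseViolation(f"{txn} tried to lock {item!r} after releasing a lock")
        self.held.setdefault(txn, set()).add(item)

    def release(self, txn, item):
        self.held[txn].discard(item)
        self.shrinking.add(txn)  # txn enters its shrinking phase

if __name__ == "__main__":
    lm = LockManager()
    lm.acquire("T1", "accounts[42]")
    lm.release("T1", "accounts[42]")
    try:
        lm.acquire("T1", "accounts[7]")   # violates two-phase locking
    except TwoPhaseViolation as e:
        print(e)
```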
Parallel database systems: the future of high performance database systems. David DeWitt (Computer Sciences Department, University of Wisconsin, Madison, WI) and Jim Gray (San Francisco Systems Center, Digital Equipment Corporation, San Francisco, CA). Communications of the ACM, Volume 35, Issue 6, June 1992, pp. 85–98. https://doi.org/10.1145/129888.129894
A federated database system (FDBS) is a collection of cooperating database systems that are autonomous and possibly heterogeneous. In this paper, we define a reference architecture for distributed database management systems from system and schema viewpoints and show how various FDBS architectures can be developed. We then define a methodology for developing one of the popular architectures of an FDBS. Finally, we discuss critical issues related to developing and operating an FDBS.
Two kinds of abstraction that are fundamentally important in database design and usage are defined. Aggregation is an abstraction which turns a relationship between objects into an aggregate object. Generalization is an abstraction which turns a class of objects into a generic object. It is suggested that all objects (individual, aggregate, generic) should be given uniform treatment in models of the real world. A new data type, called generic, is developed as a primitive for defining such models. Models defined with this primitive are structured as a set of aggregation hierarchies intersecting with a set of generalization hierarchies. Abstract objects occur at the points of intersection. This high level structure provides a discipline for the organization of relational databases. In particular this discipline allows: (i) an important class of views to be integrated and maintained; (ii) stability of data and programs under certain evolutionary changes; (iii) easier understanding of complex models and more natural query formulation; (iv) a more systematic approach to database design; (v) more optimization to be performed at lower implementation levels. The generic type is formalized by a set of invariant properties. These properties should be satisfied by all relations in a database if abstractions are to be preserved. A triggering mechanism for automatically maintaining these invariants during update operations is proposed. A simple mapping of aggregation/generalization hierarchies onto owner-coupled set structures is given.
Data warehousing and on-line analytical processing (OLAP) are essential elements of decision support, which has increasingly become a focus of the database industry. Many commercial products and services are now available, and all of the principal database management system vendors now have offerings in these areas. Decision support places some rather different requirements on database technology compared to traditional on-line transaction processing applications. This paper provides an overview of data warehousing and OLAP technologies, with an emphasis on their new requirements. We describe back end tools for extracting, cleaning and loading data into a data warehouse; multidimensional data models typical of OLAP; front end client tools for querying and data analysis; server extensions for efficient query processing; and tools for metadata management and for managing the warehouse. In addition to surveying the state of the art, this paper also identifies some promising research issues, some of which are related to problems that the database research community has worked on for years, but others are only just beginning to be addressed. This overview is based on a tutorial that the authors presented at the VLDB Conference, 1996.
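To make the multidimensional data models mentioned above concrete, the sketch below computes aggregates over every subset of two dimensions, a miniature CUBE, using only the Python standard library; the sales rows and column names are invented:

```python
# Toy CUBE: aggregate a measure over every subset of the dimension columns.
# The data and column names are invented for illustration.
from itertools import combinations
from collections import defaultdict

rows = [
    {"region": "EU", "product": "A", "sales": 10},
    {"region": "EU", "product": "B", "sales": 5},
    {"region": "US", "product": "A", "sales": 7},
]
dimensions = ("region", "product")

for k in range(len(dimensions) + 1):
    for dims in combinations(dimensions, k):
        totals = defaultdict(int)
        for r in rows:
            key = tuple(r[d] for d in dims)
            totals[key] += r["sales"]
        for key, total in sorted(totals.items()):
            print(dict(zip(dims, key)) or "GRAND TOTAL", total)
```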
Fast Probabilistic Algorithms for Verification of Polynomial Identities. J. T. Schwartz (Computer Science Department, Courant Institute of Mathematical Sciences, New York University, New York, NY). Journal of the ACM, Volume 27, Issue 4, October 1980, pp. 701–717. https://doi.org/10.1145/322217.322225
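This is the paper behind the Schwartz-Zippel lemma: a nonzero polynomial of total degree d, evaluated at a point chosen uniformly at random from a set S, vanishes with probability at most d/|S|. A hedged sketch of the resulting identity test follows; the two example polynomials and the modulus are arbitrary choices, not taken from the paper:

```python
# Probabilistic polynomial identity test in the spirit of Schwartz-Zippel:
# evaluate both sides at random points; disagreement proves inequality,
# repeated agreement makes equality overwhelmingly likely.
import random

P = 2**61 - 1  # a large prime modulus; chosen here for illustration

def lhs(x, y):
    return (x + y) * (x - y) % P

def rhs(x, y):
    return (x * x - y * y) % P

def probably_equal(f, g, trials=20):
    for _ in range(trials):
        x, y = random.randrange(P), random.randrange(P)
        if f(x, y) != g(x, y):
            return False      # a witness: the polynomials differ
    return True               # equal with high probability

print(probably_equal(lhs, rhs))  # True: (x+y)(x-y) == x^2 - y^2
```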
One of the fundamental principles of the database approach is that a database allows a nonredundant, unified representation of all data managed in an organization. This is achieved only when methodologies are available to support integration across organizational and application boundaries. Methodologies for database design usually perform the design activity by separately producing several schemas, representing parts of the application, which are subsequently merged. Database schema integration is the activity of integrating the schemas of existing or proposed databases into a global, unified schema. The aim of the paper is to provide first a unifying framework for the problem of schema integration, then a comparative review of the work done thus far in this area. Such a framework, with the associated analysis of the existing approaches, provides a basis for identifying strengths and weaknesses of individual methodologies, as well as general guidelines for future improvements and extensions.
During the last three or four years several investigators have been exploring "semantic models" for formatted databases. The intent is to capture (in a more or less formal way) more of the meaning of the data so that database design can become more systematic and the database system itself can behave more intelligently. Two major thrusts are clear. In this paper we propose extensions to the relational model to support certain atomic and molecular semantics. These extensions represent a synthesis of many ideas from the published work in semantic modeling plus the introduction of new rules for insertion, update, and deletion, as well as new algebraic operators.
Schema matching is a critical step in many applications, such as XML message mapping, data warehouse loading, and schema integration. In this paper, we investigate algorithms for generic schema matching, outside of any particular data model or application. We first present a taxonomy for past solutions, showing that a rich range of techniques is available. We then propose a new algorithm, Cupid, that discovers mappings between schema elements based on their names, data types, constraints, and schema structure, using a broader set of techniques than past approaches. Some of our innovations are the integrated use of linguistic and structural matching, context-dependent matching of shared types, and a bias toward leaf structure where much of the schema content resides. After describing our algorithm, we present experimental results that compare Cupid to two other schema matching systems.
For single databases, primary hindrances for end-user access are the volume of data that is becoming available, the lack of abstraction, and the need to understand the representation of the data. When information is combined from multiple databases, the major concern is the mismatch encountered in information representation and structure. Intelligent and active use of information requires a class of software modules that mediate between the workstation applications and the databases. It is shown that mediation simplifies, abstracts, reduces, merges, and explains data. A mediator is a software module that exploits encoded knowledge about certain sets or subsets of data to create information for a higher layer of applications. A model of information processing and information system components is described. The mediator architecture, including mediator interfaces, sharing of mediator modules, distribution of mediators, and triggers for knowledge maintenance, are discussed.
"Direct Search" Solution of Numerical and Statistical Problems. Robert Hooke and T. A. Jeeves (Westinghouse Research Laboratories, Pittsburgh, Pennsylvania). Journal of the ACM, Volume 8, Issue 2, April 1961, pp. 212–229. https://doi.org/10.1145/321062.321069
The technique set out in the paper, CHAID, is an offshoot of AID (Automatic Interaction Detection) designed for a categorized dependent variable. Some important modifications which are relevant to standard AID include: built-in significance testing with the consequence of using the most significant predictor (rather than the most explanatory), multi-way splits (in contrast to binary) and a new type of predictor which is especially useful in handling missing information.
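In the spirit of CHAID's built-in significance testing, the sketch below picks the categorical predictor whose contingency table with the response yields the largest chi-square statistic; the tiny data set is invented, and a full implementation would convert statistics to p-values and apply the paper's category-merging and multi-way splitting rules:

```python
# Toy illustration of choosing the "most significant" categorical predictor
# by chi-square statistic, as CHAID-style splitting does. Data is invented.
from collections import Counter

data = [
    {"owns_car": "yes", "region": "north", "buys": "yes"},
    {"owns_car": "yes", "region": "south", "buys": "yes"},
    {"owns_car": "yes", "region": "north", "buys": "yes"},
    {"owns_car": "no",  "region": "south", "buys": "no"},
    {"owns_car": "no",  "region": "north", "buys": "no"},
    {"owns_car": "no",  "region": "south", "buys": "no"},
]

def chi_square(rows, predictor, target="buys"):
    obs = Counter((r[predictor], r[target]) for r in rows)
    row_tot = Counter(r[predictor] for r in rows)
    col_tot = Counter(r[target] for r in rows)
    n = len(rows)
    stat = 0.0
    for pv in row_tot:
        for tv in col_tot:
            expected = row_tot[pv] * col_tot[tv] / n
            stat += (obs[(pv, tv)] - expected) ** 2 / expected
    return stat

best = max(["owns_car", "region"], key=lambda p: chi_square(data, p))
print("split on:", best)   # owns_car separates the response perfectly here
```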
We discuss the design of an acquisitional query processor for data collection in sensor networks. Acquisitional issues are those that pertain to where, when, and how often data is physically acquired (sampled) and delivered to query processing operators. By focusing on the locations and costs of acquiring data, we are able to significantly reduce power consumption over traditional passive systems that assume the a priori existence of data. We discuss simple extensions to SQL for controlling data acquisition, and show how acquisitional issues influence query optimization, dissemination, and execution. We evaluate these issues in the context of TinyDB, a distributed query processor for smart sensor devices, and show how acquisitional techniques can provide significant reductions in power consumption on our sensor devices.
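One way to picture the acquisitional mindset is predicate ordering: a cheap, selective predicate should be evaluated before an expensive acquisition it may render unnecessary. The sketch below applies the classical cost/(1 - selectivity) ordering heuristic to invented per-sensor costs and selectivities; it illustrates the general idea rather than TinyDB's actual optimizer:

```python
# Sketch of acquisition ordering: evaluate cheap, selective predicates first so
# that expensive sensor acquisitions are skipped when an earlier predicate fails.
# Costs (per sample) and selectivities below are invented for illustration.

predicates = [
    # (name, acquisition cost, selectivity = fraction of tuples that pass)
    ("accel > 100",  25.0, 0.1),
    ("light < 200",   1.0, 0.5),
    ("temp > 30",     5.0, 0.2),
]

# Classic heuristic for independent predicates: ascending cost / (1 - selectivity).
ordered = sorted(predicates, key=lambda p: p[1] / (1.0 - p[2]))

expected_cost, pass_prob = 0.0, 1.0
for name, cost, sel in ordered:
    expected_cost += pass_prob * cost   # acquisition is only paid if we got this far
    pass_prob *= sel
print([p[0] for p in ordered], "expected cost per tuple:", round(expected_cost, 2))
```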
Matching elements of two data schemas or two data instances plays a key role in data warehousing, e-business, or even biochemical applications. In this paper we present a matching algorithm based on a fixpoint computation that is usable across different scenarios. The algorithm takes two graphs (schemas, catalogs, or other data structures) as input, and produces as output a mapping between corresponding nodes of the graphs. Depending on the matching goal, a subset of the mapping is chosen using filters. After our algorithm runs, we expect a human to check and if necessary adjust the results. As a matter of fact, we evaluate the 'accuracy' of the algorithm by counting the number of needed adjustments. We conducted a user study, in which our accuracy metric was used to estimate the labor savings that the users could obtain by utilizing our algorithm to obtain an initial matching. Finally, we illustrate how our matching algorithm is deployed as one of several high-level operators in an implemented testbed for managing information models and mappings.
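The fixpoint idea can be pictured with a toy version of similarity propagation: initial label-based scores are repeatedly reinforced by the scores of neighboring node pairs until they stabilize. The graphs, labels, and iteration formula below are invented for illustration and are much simpler than the published algorithm:

```python
# Toy fixpoint similarity propagation between two small schema graphs,
# in the spirit of (but much simpler than) the published algorithm.

g1 = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}   # e.g. Address - Person - City
g2 = {"x": ["y"], "y": ["x", "z"], "z": ["y"]}

label1 = {"a": "Address", "b": "Person", "c": "City"}
label2 = {"x": "Address", "y": "Person", "z": "City"}

# Initial similarity from label matching; a real matcher would use a
# string-similarity measure here.
init = {(m, n): (1.0 if label1[m] == label2[n] else 0.1) for m in g1 for n in g2}
sim = dict(init)

for _ in range(30):                         # iterate toward a fixpoint
    new = {}
    for (m, n) in sim:
        flow = sum(sim[(m2, n2)] for m2 in g1[m] for n2 in g2[n])
        new[(m, n)] = init[(m, n)] + flow   # keep the initial evidence in the mix
    norm = max(new.values())
    sim = {p: v / norm for p, v in new.items()}

for m in g1:                                # a simple "best match" filter
    best = max(g2, key=lambda n: sim[(m, n)])
    print(m, "->", best, round(sim[(m, best)], 3))
```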
This book goes into the details of database conception and use; it tells you everything about relational databases, from theory to the algorithms actually used.
Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation). A prompting service which supplies such information is not a satisfactory solution. Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed. Changes in data representation will often be needed as a result of changes in query, update, and report traffic and natural growth in the types of stored information. Existing noninferential, formatted data systems provide users with tree-structured files or slightly more general network models of the data. In Section 1, inadequacies of these models are discussed. A model based on n-ary relations, a normal form for data base relations, and the concept of a universal data sublanguage are introduced. In Section 2, certain operations on relations (other than logical inference) are discussed and applied to the problems of redundancy and consistency in the user's model.
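The relational view sketched here needs very little machinery to demonstrate: a relation is a set of tuples over named attributes, and operations such as projection and natural join are set-level operators. A minimal sketch with invented example relations:

```python
# A relation as a list of dicts keyed by attribute name, with projection and
# natural join as set-level operations. Example data is invented.

employee = [
    {"emp": "alice", "dept": "db"},
    {"emp": "bob",   "dept": "os"},
]
department = [
    {"dept": "db", "floor": 3},
    {"dept": "os", "floor": 1},
]

def project(rel, attrs):
    seen, out = set(), []
    for t in rel:
        key = tuple(t[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append({a: t[a] for a in attrs})
    return out

def natural_join(r, s):
    common = set(r[0]) & set(s[0])   # shared attribute names
    return [{**t, **u} for t in r for u in s
            if all(t[a] == u[a] for a in common)]

print(project(employee, ["dept"]))
print(natural_join(employee, department))
```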
A data model, called the entity-relationship model, is proposed. This model incorporates some of the important semantic information about the real world. A special diagrammatic technique is introduced as a tool for database design. An example of database design and description using the model and the diagrammatic technique is given. Some implications for data integrity, information retrieval, and data manipulation are discussed. The entity-relationship model can be used as a basis for unification of different views of data: the network model, the relational model, and the entity set model. Semantic ambiguities in these models are analyzed. Possible ways to derive their views of data from the entity-relationship model are presented.
Performance is crucial for database management systems (DBMSs), and they are always designed to handle ever-changing workloads efficiently. However, the complexity of the cost-based optimizer (CBO) and its interactions can introduce implementation errors, leading to data-sensitive performance anomalies. These anomalies may cause significant performance degradation compared to the expected design under certain datasets. To diagnose performance issues, DBMS developers often rely on intuitions or compare execution times to a baseline DBMS. These approaches overlook the impact of datasets on performance. As a result, only a subset of performance issues is identified and resolved. In this paper, we propose Hulk to automatically explore these data-sensitive performance anomalies via data-driven analysis. The key idea is to identify performance anomalies as the dataset evolves. Specifically, Hulk estimates a reasonable response time range for each data volume to pinpoint performance cliffs. Then, performance cliffs are checked for deviations from expected performance by finding a reasonable plan that aligns with performance expectations. We evaluate Hulk on six widely used DBMSs, namely MySQL, MariaDB, Percona, TiDB, PostgreSQL, and AntDB. In total, Hulk reports 135 anomalies, of which 129 have been confirmed as new bugs, including 14 CVEs. Among them, 94 are data-sensitive performance anomalies.
Recent studies have made it possible to integrate learning techniques into database systems for practical utilization. In particular, the state-of-the-art studies hook the conventional query optimizer to explore multiple execution plan candidates, then choose the optimal one with a learned model. This framework simplifies the integration of learning techniques into the database system. However, these methods still have room for improvement due to their limited plan exploration space and ineffective learning from execution plans. In this work, we propose Athena, an effective learning-based query optimizer enhancer. It consists of three key components: (i) an order-centric plan explorer, (ii) a Tree-Mamba plan comparator and (iii) a time-weighted loss function. We implement Athena on top of the open-source database PostgreSQL and demonstrate its superiority via extensive experiments. Specifically, we achieve 1.75x, 1.95x, 5.69x, and 2.74x speedups over the vanilla PostgreSQL on the JOB, STATS-CEB, TPC-DS, and DSB benchmarks, respectively. Athena is 1.74x, 1.87x, 1.66x, and 2.28x faster than the state-of-the-art competitor Lero on these benchmarks. Additionally, Athena is open-sourced and can be easily adapted to other relational database systems, as all of the proposed techniques in Athena are generic.
Acyclic conjunctive queries form the backbone of most analytical workloads, and have been extensively studied in the literature from both theoretical and practical angles. However, there is still a large divide between theory and practice. While the 40-year-old Yannakakis algorithm has strong theoretical running time guarantees, it has not been adopted in real systems due to its high hidden constant factor. In this paper, we strive to close this gap by proposing Yannakakis+, an improved version of the Yannakakis algorithm, which is more practically efficient while preserving its theoretical guarantees. Our experiments demonstrate that Yannakakis+ consistently outperforms the original Yannakakis algorithm by 2x to 5x across a wide range of queries and datasets. Another nice feature of our new algorithm is that it generates a traditional DAG query plan consisting of standard relational operators, allowing Yannakakis+ to be easily plugged into any standard SQL engine. Our system prototype currently supports four different SQL engines (DuckDB, PostgreSQL, SparkSQL, and AnalyticDB from Alibaba Cloud), and our experiments show that Yannakakis+ is able to deliver better performance than their native query plans on 160 out of the 162 queries tested, with an average speedup of 2.41x and a maximum speedup of 47,059x.
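For readers unfamiliar with the classical algorithm that Yannakakis+ improves on, the sketch below shows its key ingredient on a three-relation path join: semi-join passes remove dangling tuples before the final join. This is a toy rendering of the textbook algorithm with invented data, not the paper's Yannakakis+:

```python
# Toy semi-join reduction on a path join R(a,b) ⋈ S(b,c) ⋈ T(c,d),
# the core of the classical Yannakakis algorithm. Example data is invented.

R = [(1, 10), (2, 20), (3, 30)]          # (a, b)
S = [(10, 100), (20, 200), (40, 400)]    # (b, c)
T = [(100, "x"), (500, "y")]             # (c, d)

def semijoin(rel, other_keys, pos):
    """Keep tuples of rel whose attribute at position pos appears in other_keys."""
    return [t for t in rel if t[pos] in other_keys]

# Bottom-up pass: reduce S by T, then R by the reduced S.
S1 = semijoin(S, {c for (c, _) in T}, pos=1)        # S semi-joined with T
R1 = semijoin(R, {b for (b, _) in S1}, pos=1)       # R semi-joined with reduced S

# Top-down pass: reduce S and T again so every remaining tuple joins.
S2 = semijoin(S1, {b for (_, b) in R1}, pos=0)
T1 = semijoin(T, {c for (_, c) in S2}, pos=0)

# Final join over fully reduced relations: no dangling intermediate tuples.
result = [(a, b, c, d) for (a, b) in R1 for (b2, c) in S2 if b == b2
                       for (c2, d) in T1 if c == c2]
print(result)   # [(1, 10, 100, 'x')]
```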
State-of-the-art relational schema design generates a lossless, dependency-preserving decomposition into Third Normal Form (3NF), that is in Boyce-Codd Normal Form (BCNF) whenever possible. In particular, dependency-preservation ensures that data integrity can be maintained on individual relation schemata without having to join them, but may need to tolerate a priori unbounded levels of data redundancy and integrity faults. As our main contribution we parameterize 3NF schemata by the numbers of minimal keys and functional dependencies they exhibit. Conceptually, these parameters quantify, already at schema design time, the effort necessary to maintain data integrity, and allow us to break ties between 3NF schemata. Computationally, the parameters enable us to optimize normalization into 3NF according to different strategies. Operationally, we show through experiments that our optimizations translate from the logical level into significantly smaller update overheads during integrity maintenance. Hence, our framework provides access to parameters that guide the computation of logical schema designs which reduce operational overheads.
Wei Zhou, Y. Gao, Xuanhe Zhou, et al. | Proceedings of the ACM on Management of Data
Automatic dialect translation reduces the complexity of database migration, which is crucial for applications interacting with multiple database systems. However, rule-based translation tools (e.g., SQLGlot, jOOQ, SQLines) are labor-intensive to develop and often (1) fail to translate certain operations, (2) produce incorrect translations due to rule deficiencies, and (3) generate translations compatible with some database versions but not the others. In this paper, we investigate the problem of automating dialect translation with large language models (LLMs). There are three main challenges. First, queries often involve lengthy content (e.g., excessive column values) and multiple syntax elements that require translation, increasing the risk of LLM hallucination. Second, database dialects have diverse syntax trees and specifications, making it difficult for cross-dialect syntax matching. Third, dialect translation often involves complex many-to-one relationships between source and target operations, making it impractical to translate each operation in isolation. To address these challenges, we propose an automatic dialect translation system CrackSQL. First, we propose Functionality-based Query Processing that segments the query by functionality syntax trees and simplifies the query via (i) customized function normalization and (ii) translation-irrelevant query abstraction. Second, we design a Cross-Dialect Syntax Embedding Model to generate embeddings by the syntax trees and specifications (of certain version), enabling accurate query syntax matching. Third, we propose a Local-to-Global Dialect Translation strategy, which restricts LLM-based translation and validation on operations that cause local failures, iteratively extending these operations until translation succeeds. Experiments show CrackSQL significantly outperforms existing methods (e.g., by up to 77.42%). The code is available at https://github.com/weAIDB/CrackSQL.
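For comparison, a typical call to one of the rule-based baselines mentioned above looks like the following; this assumes the open-source sqlglot package is installed, uses an invented query, and is unrelated to CrackSQL itself:

```python
# Rule-based dialect translation with sqlglot (one of the baselines mentioned
# above); assumes `pip install sqlglot`. The query text is an invented example.
import sqlglot

mysql_query = "SELECT `name`, IFNULL(age, 0) AS age FROM users LIMIT 10"
translated = sqlglot.transpile(mysql_query, read="mysql", write="postgres")
print(translated[0])
# Expected flavor of output: IFNULL becomes COALESCE and backticks become
# double quotes; the exact output depends on the sqlglot version.
```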
Key-value databases with Log Structured Merge tree are increasingly favored by modern applications. Apart from supporting fast lookup on primary key, efficient queries on non-key attributes are also highly demanded by many of these applications. To enhance query performance, many auxiliary structures like secondary indexing and filters have been developed. However, existing auxiliary structures suffer from three limitations. First, creating filter for every disk component has low lookup efficiency as all components need to be searched during query processing. Second, current secondary index design requires primary table access to fetch the data entries for each output primary key from the index. This indirect entries fetching process involves significant point lookup overhead in the primary table and hence hinders the query performance. Last, maintaining the consistency between the secondary index and the primary table is challenging due to the out-of-place update mechanism of the LSM-tree. To overcome the limitations in existing auxiliary structures for non-key attributes queries, this paper proposes a novel secondary index framework, NEXT, for LSM-based key-value storage system. NEXT utilizes a two-level structure which is integrated with the primary table. In particular, NEXT proposes to create secondary index blocks on each LSM disk component to map the secondary attributes to their corresponding data blocks. In addition, NEXT introduces a global index component which is created on top of all secondary index blocks to direct the secondary index operation to the target secondary index blocks. Finally, NEXT adopts two optimization strategies to further improve the query performance. We implement NEXT on RocksDB and experimentally evaluate its performance against existing methods. Experiments on both static and mixed workloads demonstrate that NEXT outperforms existing methods for different types of non-key attributes.
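The "indirect entries fetching" limitation is easiest to see with a bare-bones secondary index: a map from a non-key attribute to primary keys, with every query detouring through the primary table. The dict-based toy below (invented data) only illustrates that lookup path, not the LSM-tree machinery or NEXT's block-level design:

```python
# Minimal picture of a secondary index over a key-value table: map a non-key
# attribute to the primary keys that hold it, then fetch records indirectly.

primary = {                        # primary key -> record (invented data)
    "user:1": {"name": "alice", "city": "oslo"},
    "user:2": {"name": "bob",   "city": "bergen"},
    "user:3": {"name": "carol", "city": "oslo"},
}

secondary = {}                     # city -> list of primary keys
for pk, rec in primary.items():
    secondary.setdefault(rec["city"], []).append(pk)

def query_by_city(city):
    # Indirect fetch: the secondary index yields keys, the primary table yields records.
    return [primary[pk] for pk in secondary.get(city, [])]

print(query_by_city("oslo"))
```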
Joris Nix, Jens Dittrich | Proceedings of the ACM on Management of Data
Every SQL statement is limited to return a single, possibly denormalized table. This approximately 50-year-old design decision has far-reaching consequences. The most apparent problem is the redundancy introduced through denormalization, which can result in long transfer times of query results and high memory usage for materializing intermediate results. Additionally, regardless of their goals, users are forced to fit query computations into one single result, mixing the data retrieval and transformation aspect of SQL. Moreover, both problems violate the principles and core ideas of normal forms. In this paper, we argue for eliminating the single-table limitation of SQL. We extend SQL's SELECT clause by the keyword 'RESULTDB' to support returning a result subdatabase. Our extension has clear semantics, i.e., by annotating any existing SQL statement with the RESULTDB keyword, the DBMS returns the tables participating in the query, each restricted to the relevant tuples that occur in the traditional single-table query result. Thus, we do not denormalize the query result in any way. Our approach has significant, far-reaching consequences, impacting the querying of hierarchical data, materialized views, and distributed databases, while maintaining backward compatibility. In addition, our proposal paves the way for a long list of exciting future research opportunities. We propose multiple algorithms to integrate our feature into both closed-source and open-source database systems. For closed-source systems, we provide several SQL-based rewrite methods. In addition, we present an efficient algorithm for cyclic and acyclic join graphs that we integrated into an open-source database system. We conduct a comprehensive experimental study. Our results show that returning multiple individual result sets can significantly decrease the result set size. Furthermore, our rewrite methods and algorithm introduce minimal overhead and can even outperform single-table execution in certain cases.
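The described semantics, returning each participating table restricted to its tuples that appear in the ordinary join result, can be emulated on any SQL engine with semi-join style rewrites. A small sketch against SQLite follows (schema and data invented; this mimics the semantics, it is not the paper's implementation):

```python
# Emulating a "result subdatabase": instead of one denormalized join result,
# return each base table restricted to tuples participating in the join.
# Uses the standard-library sqlite3 module; schema and data are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer(cid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders(oid INTEGER PRIMARY KEY, cid INTEGER, item TEXT);
    INSERT INTO customer VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1, 'book'), (11, 1, 'pen');
""")

# Traditional single-table result: customer data repeated once per order.
flat = con.execute("""
    SELECT c.cid, c.name, o.oid, o.item
    FROM customer c JOIN orders o ON c.cid = o.cid
""").fetchall()

# "Result subdatabase": each table restricted to its participating tuples.
customers = con.execute("""
    SELECT * FROM customer c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.cid = c.cid)
""").fetchall()
orders = con.execute("""
    SELECT * FROM orders o
    WHERE EXISTS (SELECT 1 FROM customer c WHERE c.cid = o.cid)
""").fetchall()

print("flat join:", flat)          # 'alice' appears twice
print("customer:", customers)      # 'alice' appears once, 'bob' is dropped
print("orders:  ", orders)
```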
Junyi Zhao, Kai Su, Yifei Yang, et al. | Proceedings of the ACM on Management of Data
Join order optimization is critical in achieving good query performance. Despite decades of research and practice, modern query optimizers could still generate inferior join plans that are orders of magnitude slower than optimal. Existing research on robust query processing often lacks theoretical guarantees on join-order robustness while sacrificing query performance. In this paper, we rediscover the recent Predicate Transfer technique from a robustness point of view. We introduce two new algorithms, LargestRoot and SafeSubjoin, and then propose Robust Predicate Transfer (RPT) that is provably robust against arbitrary join orders of an acyclic query. We integrated Robust Predicate Transfer with DuckDB, a state-of-the-art analytical database, and evaluated against all the queries in TPC-H, JOB, TPC-DS, and DSB benchmarks. Our experimental results show that RPT improves join-order robustness by orders of magnitude compared to the baseline. With RPT, the largest ratio between the maximum and minimum execution time out of random join orders for a single acyclic query is only 1.6x (the ratio is close to 1 for most evaluated queries). Meanwhile, applying RPT also improves the end-to-end query performance by ≈1.5x (per-query geometric mean). We hope that this work sheds light on solving the practical join ordering problem.
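The underlying Predicate Transfer idea can be pictured with an ordinary set standing in for the Bloom filter: join keys that survive a filter on one table are transferred to prune the other table before any join runs. The sketch below is a much simplified illustration with invented data, not RPT itself:

```python
# Simplified picture of predicate transfer: a filter of join keys that survive
# one table's predicate is used to prune the other table before joining.
# A Python set stands in for the Bloom filter; data is invented.

lineitem = [(1, 5.0), (2, 3.0), (3, 9.0), (3, 1.0)]                # (orderkey, price)
orders   = [(1, "1995"), (2, "1997"), (3, "1995"), (4, "1995")]    # (orderkey, year)

# Local predicate on orders: year = '1995'.
surviving_orders = [o for o in orders if o[1] == "1995"]

# "Transfer" the surviving join keys (a Bloom filter in the real technique).
key_filter = {o[0] for o in surviving_orders}

# Prune lineitem with the transferred filter before the join.
pruned_lineitem = [l for l in lineitem if l[0] in key_filter]

result = [(l[0], l[1], o[1]) for l in pruned_lineitem
                             for o in surviving_orders if l[0] == o[0]]
print(len(lineitem), "->", len(pruned_lineitem), "lineitem rows after transfer")
print(result)
```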
After decades of research in approximate query processing (AQP), its adoption in the industry remains limited. Existing methods struggle to simultaneously provide user-specified error guarantees, eliminate maintenance overheads, and avoid modifications to database management systems. To address these challenges, we introduce two novel techniques, TAQA and BSAP. TAQA is a two-stage online AQP algorithm that achieves all three properties for arbitrary queries. However, it can be slower than exact queries if we use standard row-level sampling. BSAP resolves this by enabling block-level sampling with statistical guarantees in TAQA. We implement TAQA and BSAP in a prototype middleware system, PilotDB, that is compatible with all DBMSs supporting efficient block-level sampling. We evaluate PilotDB on PostgreSQL, SQL Server, and DuckDB over real-world benchmarks, demonstrating up to 126X speedups when running with a 5% guaranteed error.
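The block-level sampling ingredient can be illustrated with a toy estimator: sample whole blocks uniformly, scale up the block totals, and use their spread for a normal-approximation error bound. The data is synthetic and the sketch shows only the general idea, not the TAQA or BSAP algorithms:

```python
# Toy block-level sampling estimate of SUM(x) with a normal-approximation error
# bound: sample whole blocks uniformly, scale the mean block total, and use the
# spread of block totals for a confidence interval. Synthetic data only.
import random, statistics

random.seed(0)
BLOCKS = 1000
blocks = [[random.gauss(100, 20) for _ in range(50)] for _ in range(BLOCKS)]
true_sum = sum(sum(b) for b in blocks)

m = 100                                        # number of sampled blocks
sampled = random.sample(blocks, m)
block_sums = [sum(b) for b in sampled]

estimate = BLOCKS * statistics.mean(block_sums)
stderr = BLOCKS * statistics.stdev(block_sums) / m ** 0.5
print(f"estimate {estimate:,.0f} ± {1.96 * stderr:,.0f} (true {true_sum:,.0f})")
```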
Regular expressions (RegEx) are an essential tool for pattern matching over streaming data, e.g., in network and security applications. The evaluation of RegEx queries becomes challenging, though, once subsequences are incorporated, i.e., characters in a sequence may be skipped during matching. Since the number of subsequence matches may grow exponentially in the input length, existing RegEx engines fall short in finding all subsequence matches, especially for queries including Kleene closure. In this paper, we argue that common applications for RegEx queries over streams do not require the enumeration of all distinct matches at any point in time. Rather, only an aggregate over the matches is typically fetched at specific, yet unknown time points. To cater for these scenarios, we present SuSe, a novel architecture for RegEx evaluation that is based on a query-specific summary of the stream. It employs a novel data structure, coined StateSummary, to capture aggregated information about subsequence matches. This structure is maintained by a summary selector, which aims at choosing the stream projections that minimize the loss in the aggregation result over time. Experiments on real-world and synthetic data demonstrate that SuSe is both effective and efficient, with the aggregates being based on several orders of magnitude more matches compared to baseline techniques.
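The observation that an aggregate over subsequence matches can be maintained without enumerating them is easy to see for a fixed string pattern: one counter per pattern prefix, updated as each stream character arrives. The toy below handles plain strings only, which is far simpler than the RegEx-with-Kleene-closure setting the paper targets:

```python
# Counting subsequence matches of a fixed pattern in a stream without
# enumerating them: one counter per pattern prefix, updated per character.

def count_subsequence_matches(stream, pattern):
    # counts[i] = number of ways to match the first i pattern characters so far
    counts = [1] + [0] * len(pattern)
    for ch in stream:
        # iterate prefixes right-to-left so each character extends each match once
        for i in range(len(pattern) - 1, -1, -1):
            if pattern[i] == ch:
                counts[i + 1] += counts[i]
    return counts[len(pattern)]

print(count_subsequence_matches("abcabc", "abc"))   # 4 distinct subsequence matches
```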
The workhorse of property graph query languages such as Cypher and GQL is pattern matching. The result of pattern matching is a collection of paths and mappings of variables to graph elements. To increase expressiveness of post-processing of pattern matching results, languages such as Cypher introduce the capability of creating lists of nodes and edges from matched paths, and provide users with standard list processing tools such as reduce. We show that on the one hand, this makes it possible to capture useful classes of queries that pattern matching alone cannot do. On the other hand, we show that this opens a backdoor to very high and unexpected expressiveness. In particular one can very easily express several classical NP-hard problems by simple queries that use reduce. This level of expressiveness appears to be beyond what query optimizers can handle, and indeed this is confirmed by an experimental evaluation, showing that such queries time out already on very small graphs. We conclude our analysis with a suggestion on the use of list processing in queries that, while retaining its usefulness, avoids the above pitfalls and prevents highly intractable queries.
The goal of classical normalization is to maintain data consistency under updates, with a minimum level of effort. Given functional dependencies (FDs) alone, this goal is only achievable in the special case an FD-preserving Boyce-Codd Normal Form (BCNF) decomposition exists. As we show, in all other cases the level of effort can be neither controlled nor quantified. In response, we establish the ℓ-Bounded Cardinality Normal Form, parameterized by a positive integer ℓ. For every ℓ, the normal form condition requires from every instance that every value combination over the left-hand side of every non-trivial FD does not occur in more than ℓ tuples. BCNF is captured when ℓ = 1. We show that schemata in ℓ-Bounded Cardinality Normal Form characterize instances in which updates to at most ℓ occurrences of any redundant data value are sufficient to maintain data consistency. In fact, for the smallest ℓ in which a schema is in ℓ-Bounded Cardinality Normal Form we capture an equilibrium between worst-case update inefficiency and best-case join efficiency, where some redundant data value can be joined with up to ℓ other data values. We then establish algorithms that compute schemata in ℓ-Bounded Cardinality Normal Form for the smallest level ℓ attainable across all lossless, FD-preserving decompositions. Additional algorithms i) attain even smaller levels of effort based on the loss of some FDs, and ii) decompose schemata based on prioritized FDs that cause high levels of effort. Our framework informs de-normalization already during logical design. In particular, every materialized view exhibits an equilibrium level ℓ that quantifies its worst-case incremental maintenance cost and its best-case support for join queries. Experiments with synthetic and real-world data illustrate which properties the schemata have that result from our algorithms, and how these properties predict the performance of update and query operations on instances over the schemata, without and with materialized views. We further demonstrate how our framework can automate the design of data warehouses by mining data for dimensions that exhibit high levels of data redundancy. In an effort to align data and the business rules that govern them, we use lattice theory to characterize ℓ-Bounded Cardinality Normal Form on database instances and schemata. As a consequence, any difference in constraints observed at instance and schema levels provides an opportunity to improve data quality, insight derived from analytics, and database performance.
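The normal form condition is straightforward to check on a concrete instance: for an FD X -> Y, count how many tuples share each value combination over X and take the maximum. A small sketch with an invented relation and FD:

```python
# Checking the smallest level at which an instance satisfies the
# ℓ-Bounded Cardinality condition for one FD X -> Y: the maximum number of
# tuples sharing a value combination over X. Relation and FD are invented.
from collections import Counter

relation = [
    {"course": "db", "lecturer": "smith", "room": "A1"},
    {"course": "db", "lecturer": "smith", "room": "A2"},
    {"course": "os", "lecturer": "jones", "room": "B1"},
]
fd_lhs = ("course",)          # FD: course -> lecturer

multiplicity = Counter(tuple(t[a] for a in fd_lhs) for t in relation)
level = max(multiplicity.values())
print("instance satisfies the condition for every level >=", level)   # 2
```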
Xinyi Ye, Xiangyang Gou, Lei Zou, et al. | Proceedings of the ACM on Management of Data
Multi-way join, which refers to the join operation among multiple tables, is widely used in database systems. With the development of the Internet and social networks, a new variant of the multi-way join query has emerged, requiring continuous monitoring of the query results as the database is updated. This variant is called continuous multi-way join. The join order of continuous multi-way join significantly impacts the operation's cost. However, existing methods for continuous multi-way join order selection are heuristic, which may fail to select the most efficient orders. On the other hand, the high-cost order computation will become a system bottleneck if we directly transfer join order selection algorithms for static multi-way join to the dynamic setting. In this paper, we propose a new Adaptive Join Order Selection algorithm for Continuous multi-way join queries named AJOSC. It uses dynamic programming to find the optimal join order with a new cost model specifically designed for continuous multi-way join. We further propose a lower-bound-based incremental re-optimization algorithm to restrict the search space and recompute the join order with low cost when data distribution changes. Experimental results show that AJOSC is up to two orders of magnitude faster than the state-of-the-art methods.
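The dynamic-programming core that such approaches adapt is the classic one: enumerate subsets of relations and build the cheapest plan for each subset from its cheapest sub-plans. A compact sketch with an invented cost model follows; it illustrates the textbook technique, not AJOSC's cost model or its incremental re-optimization:

```python
# Classic dynamic-programming join order selection over subsets of relations,
# with an invented toy cost model (sum of intermediate result cardinalities).
from itertools import combinations

card = {"R": 1000, "S": 100, "T": 10}          # base table cardinalities (invented)
selectivity = 0.01                              # uniform join selectivity (invented)

def est_card(tables):
    out = 1.0
    for t in tables:
        out *= card[t]
    return out * selectivity ** (len(tables) - 1)

best = {frozenset([t]): (0.0, t) for t in card}  # subset -> (cost, plan)

for size in range(2, len(card) + 1):
    for subset in map(frozenset, combinations(card, size)):
        for right in subset:                     # split into (subset - right) join right
            left = subset - {right}
            cost = best[left][0] + est_card(left)   # pay for the intermediate result
            if subset not in best or cost < best[subset][0]:
                best[subset] = (cost, f"({best[left][1]} JOIN {right})")

cost, plan = best[frozenset(card)]
print(plan, "estimated cost:", cost)   # joins the smallest tables first
```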
Metadata-driven data pipelines represent a transformative approach to cloud-native data engineering, addressing the limitations of traditional hand-coded solutions that struggle with complexity and scale. This architectural pattern decouples transformation logic from execution by storing pipeline definitions as structured metadata, which is dynamically interpreted at runtime. The resulting framework enables organizations to automate pipeline development, enforce consistent standards, and adapt rapidly to changing business requirements. In cloud environments characterized by distributed teams and evolving data schemas, this approach delivers significant advantages in development velocity, operational efficiency, and governance capabilities. By externalizing pipeline logic into configurable metadata, organizations can streamline source onboarding, ensure compliance, and establish the foundation for advanced data initiatives, including AI-driven analytics and self-service data access.
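A minimal picture of the pattern: pipeline steps live in a metadata document rather than in code, and a small generic runner interprets that metadata at runtime. The configuration keys, step names, and transformations below are invented for illustration:

```python
# Metadata-driven pipeline in miniature: the pipeline is data (a dict that could
# come from JSON/YAML or a metadata store), and a generic runner interprets it
# at runtime. Step names, keys, and transformations are invented.

pipeline_metadata = {
    "source": {"rows": [{"amount": "10"}, {"amount": "x"}, {"amount": "32"}]},
    "steps": [
        {"op": "cast", "column": "amount", "to": "int", "on_error": "drop"},
        {"op": "filter", "column": "amount", "min": 20},
    ],
}

CASTERS = {"int": int, "float": float, "str": str}

def run(meta):
    rows = list(meta["source"]["rows"])
    for step in meta["steps"]:                      # interpret metadata at runtime
        if step["op"] == "cast":
            cast, out = CASTERS[step["to"]], []
            for r in rows:
                try:
                    out.append({**r, step["column"]: cast(r[step["column"]])})
                except ValueError:
                    if step["on_error"] != "drop":
                        raise
            rows = out
        elif step["op"] == "filter":
            rows = [r for r in rows if r[step["column"]] >= step["min"]]
    return rows

print(run(pipeline_metadata))   # [{'amount': 32}]
```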
G. Venkataraman | World Journal of Advanced Engineering Technology and Sciences
This article presents a comprehensive framework for designing end-to-end real-time inference platforms that enable organizations to deliver personalized experiences and make intelligent decisions within milliseconds. It explores the architectural components essential for supporting hundreds of concurrent models while maintaining sub-second latency, from data pipelines and feature engineering to model serving and performance optimization. The discussion encompasses hybrid batch-stream processing, feature stores, Kubernetes orchestration, latency optimization techniques, and cross-functional collaboration practices. By addressing both technical infrastructure and organizational considerations, the article provides engineering leaders, MLOps practitioners, and platform architects with practical guidance for creating resilient AI systems that align with business objectives and deliver measurable value to end users across industries such as e-commerce, finance, media, and healthcare.