Key Innovations from CIDR '25 Full Papers


Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing

Contributions & Methods: Palimpzest introduces a declarative query framework for AI-powered analytics that lets users orchestrate ML models and prompts over unstructured data with database-style optimization. It addresses the complexity of current AI pipelines by letting users express AI analytics queries in a high-level language, which Palimpzest compiles into optimized execution plans. A cost-based optimizer explores different choices of models, prompting strategies, and data processing pipelines, balancing runtime, monetary cost (e.g. API calls), and result quality. Key techniques include representing model calls as query operators, estimating their costs and accuracy, and applying novel plan transformations that trade off quality against speed. The system implements concurrency control for model access and a new query language for specifying AI tasks. Palimpzest's optimizer can automatically choose cheaper or smaller models when appropriate and adjust prompting strategies to meet user-defined accuracy and cost goals.
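To make the cost-based model choice concrete, here is a minimal sketch of the idea, not Palimpzest's actual API: the model names, prices, and quality estimates below are all illustrative assumptions, and the "optimizer" is reduced to picking the highest-quality model whose total cost fits a user budget.

```python
# Hypothetical sketch (not Palimpzest's real interface): candidate models are
# scored by estimated cost and quality, and the plan chooser picks the best
# model that fits the user's monetary budget for the workload.
from dataclasses import dataclass

@dataclass
class ModelChoice:
    name: str
    cost_per_call: float   # dollars per call (assumed figures)
    est_quality: float     # expected F1 in [0, 1] (assumed figures)

def pick_plan(candidates, max_cost, n_calls):
    """Choose the highest-quality model whose total cost fits the budget."""
    feasible = [m for m in candidates if m.cost_per_call * n_calls <= max_cost]
    return max(feasible, key=lambda m: m.est_quality) if feasible else None

models = [
    ModelChoice("small-open-model", 0.0005, 0.78),
    ModelChoice("mid-tier-api",     0.002,  0.85),
    ModelChoice("frontier-api",     0.02,   0.92),
]

# With a $5 budget over 10,000 extraction calls, only the small open model
# is affordable, so the optimizer falls back to it automatically.
best = pick_plan(models, max_cost=5.0, n_calls=10_000)
print(best.name)  # → small-open-model
```

A real optimizer would of course also weigh runtime and per-operator accuracy estimates, but the budget-constrained selection above captures the basic trade-off the paper describes.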

Technical Details: Palimpzest’s prototype is built into a query engine that can call external AI models. It introduces new logical operators (e.g. AI-Scan to extract facts from text, AI-Join to combine textual and structured data) and uses a plan enumerator to consider alternatives such as using an embedding model vs. a regex for a task. A novel incremental planning technique defers certain decisions to runtime, using feedback on model performance. It also applies multi-query optimization by caching model outputs for reuse. Experiments show Palimpzest can achieve up to 3.3× faster runtime and 2.9× lower cost than naive pipelines, while even improving accuracy (F1-score) by combining models into ensembles intelligently. These gains come from avoiding redundant model calls and picking the right model for each subtask automatically.

Industry Impact: This work is significant for cloud analytics providers (like AWS, Azure, GCP) that incorporate AI into data platforms. Palimpzest’s approach could be integrated into cloud data warehouses to let users run complex AI analytics with simple SQL-like queries, automatically optimizing which AI services to use for cost-efficiency. Cloud providers could offer Palimpzest as a managed service to minimize customers’ spending on large models – e.g. using cheaper open-source models when possible and only calling expensive APIs when needed. By treating model inference as query operators, cloud systems can manage and monitor AI workloads much as they do SQL workloads, which simplifies deployment at scale. This declarative AI query paradigm could drive new products where data and AI pipelines converge, reducing the barrier for enterprises to apply AI on large document corpora cost-effectively.

Broader Implications: Palimpzest opens a new research direction in “AI-native” query optimizers that extend decades of DB optimization techniques to machine learning workflows. It demonstrates the benefit of fusing the ML and DB worlds – future research can extend this to more types of models (e.g., graph ML) and investigate cost modeling for model accuracy. It may inspire academic work on learned cost models for prompts or on integrating user feedback loops into query processing (to refine model outputs). By achieving better performance-cost trade-offs, Palimpzest’s ideas push the envelope on autonomous analytics and could influence future SQL standards to incorporate AI operators. Overall, it bridges a gap between AI and databases, showing a path toward systems that jointly optimize database operations and AI model invocations.

Adaptive data transformations for QaaS (Query-as-a-Service)

Contributions & Methods: This paper targets cloud Query-as-a-Service systems (like Athena, BigQuery) which let users run SQL on raw data without managing servers. It identifies that these systems suffer when querying semi-structured data (e.g. big CSV/JSON) because they lack indices or upfront optimization, causing high latency and cost. The key contribution is a framework that dynamically performs data transformations (format conversion, partitioning, sorting) on-the-fly using serverless functions to improve performance for subsequent queries. The authors propose an intelligent middleware that, given a query workload and user-provided budget (latency or cost constraints), decides when to automatically transcode raw data into columnar formats (like Parquet) or chunk and sort data to speed up future queries. These transformations are done adaptively – e.g. if a CSV is queried frequently, the system might trigger a one-time conversion to Parquet in the background using AWS Lambda functions.

Technical Details: The system performs “in situ” transformations: it monitors incoming queries, and if a query would benefit from a certain data layout, it invokes a serverless function to perform that conversion on the dataset (or relevant partition) just-in-time. It uses simple heuristics (e.g., convert a CSV once it has been scanned N times, or sort/partition the data by the filter column when a query is selective). It also supports parallel, incremental conversion – e.g. splitting a large file and converting chunks concurrently in cloud functions. Critically, it keeps track of data versioning so that transformed results remain consistent with the original data. The approach essentially treats transformation as another query processing step, scheduled adaptively. Experiments with TPC-H show that converting raw CSV to Parquet on-the-fly can reduce query runtime by 3–4× and cost by 50–80% for large-scale queries on Athena and BigQuery (since less data is scanned). They introduce strategies to decide which queries justify the overhead of transformation based on usage frequency and data reuse.
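The scan-count heuristic can be illustrated with a few lines of bookkeeping. This is a hedged sketch under stated assumptions: the threshold, the file path, and the idea of a conversion queue standing in for a serverless invocation are all illustrative, not taken from the paper.

```python
# Illustrative sketch of the "convert after N scans" heuristic: once a raw
# file has been scanned N times, it is queued for a one-time background
# conversion (e.g., CSV -> Parquet via a cloud function). All names/values
# here are assumptions for demonstration.
from collections import defaultdict

CONVERT_AFTER_N_SCANS = 3      # assumed threshold

scan_counts = defaultdict(int)
converted = set()
conversion_queue = []          # stand-in for invoking a serverless function

def record_scan(path: str) -> None:
    if path in converted:
        return                                  # already optimized; skip
    scan_counts[path] += 1
    if scan_counts[path] >= CONVERT_AFTER_N_SCANS:
        conversion_queue.append(path)           # e.g., trigger a Lambda job
        converted.add(path)

for _ in range(4):
    record_scan("s3://lake/events.csv")

print(conversion_queue)  # the hot CSV is queued for conversion exactly once
```

The `converted` set mirrors the versioning concern from the paper: once a file has a columnar copy, further scans should be routed to it rather than re-triggering conversions.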

Industry Impact: This work can directly impact cloud providers offering serverless query engines. By dynamically optimizing data layout under the hood, providers like AWS or Google can save infrastructure costs and pass on performance improvements to users. The use of serverless functions means transformations scale out easily and only incur cost when used, aligning with cloud’s pay-per-use model. Enterprise users benefit by getting near-warehouse performance on data lake files without manually prepping data. Cloud vendors could incorporate this as an “auto-optimize” feature – e.g., BigQuery could automatically convert hot JSON data to columnar format, improving user experience. Overall, it moves data lakes closer to warehouses in performance, while retaining schema-on-read flexibility, which is highly attractive in industry.

Broader Implications: Academically, this work merges database optimization with cloud elasticity – raising new questions on cost-based adaptive optimization in serverless environments. It encourages research into learning which transformations yield the best trade-offs under uncertain workloads (perhaps using reinforcement learning to schedule transformations). It also highlights the importance of format and physical design in serverless analytics, an area that will grow as data lakes and warehouses converge. Future research might extend this to more complex transformations (like creating summary statistics or synopses automatically). The concept of treating data format as fluid, with systems automatically refining it based on usage, could be extended to streaming data or applied to multi-tenant scenarios where a cloud service balances transformations across many users for overall efficiency. This work helps bridge the gap between raw data lakes and optimized query engines through adaptive, cloud-native techniques.

Trampoline-Style Queries for SQL

Contributions & Methods: Trampoline-Style Queries introduce a new iteration construct in SQL to perform complex recursive computations more flexibly than standard recursive CTEs. The paper’s high-level contribution is a proposal to extend SQL with a WITH TRAMPOLINE clause, enabling iterative query execution where each iteration’s output can influence control flow of subsequent iterations. Unlike traditional recursive CTEs, which have fixed seminaïve evaluation, trampoline-style queries allow fine-grained control of how intermediate results propagate between iterations (via branch queries). This makes it easier to express algorithms like graph traversals, iterative ML, or dynamic programming in SQL. The authors provide a formal design where a dispatcher query (the “trampoline”) routes rows to one of multiple branch queries in each loop, and branches can emit not only result rows but also directives for the next iteration. Key contributions include demonstrating that this covers a broader range of applications than current SQL:1999 recursion, and outlining execution strategies for parallel databases.

Technical Details: The “trampoline” syntax defines a set of branch sub-queries, each labeled, and on each iteration, the dispatcher sends each incoming tuple to a specific branch based on a label column. Developers can thereby implement state machines or complex loop logic in a single SQL query. The paper relates this to known concepts – showing it is essentially an explicit control flow mechanism akin to a program loop but optimized by the SQL engine. Internally, they discuss how a database engine can implement this efficiently: e.g., using in-memory buffers for branch outputs and coordinating termination when no branch produces new output. They also describe a massively parallel execution plan where each branch query can run on distributed data partitions, synchronized by the dispatcher. The prototype covers use cases like iterative graph algorithms (e.g., computing transitive closure, BFS), showing more natural expression and better performance than awkward encodings with recursive CTEs. A case study indicates a trampoline-style BFS query is easier to read and can execute with fewer iterations (hence lower overhead) than the equivalent recursive CTE, due to eliminating redundant work by controlling which tuples loop again.
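Since WITH TRAMPOLINE is a proposal rather than shipping SQL, the dispatcher/branch structure can be simulated in ordinary code. The sketch below is an interpretation of the mechanism described above, not the paper's implementation: each tuple carries a label column, the dispatcher routes it to a branch, and branches emit result rows plus tuples that loop again. Here a single "expand" branch runs BFS, with a visited set playing the role of suppressing redundant looping tuples.

```python
# Simulation of a trampoline-style query: a dispatcher loop routes labeled
# tuples to branch logic until no branch produces new tuples (termination).
# The graph and labels are illustrative.
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

def trampoline(seed):
    working, seen, results = list(seed), set(), []
    while working:                     # iterate until no branch emits output
        nxt = []
        for label, node, depth in working:
            if label == "expand":      # dispatcher: route by label column
                if node in seen:
                    continue           # drop tuples that would redo work
                seen.add(node)
                results.append((node, depth))                  # result row
                nxt += [("expand", m, depth + 1) for m in edges[node]]
        working = nxt
    return results

print(trampoline([("expand", "a", 0)]))
# → [('a', 0), ('b', 1), ('c', 1), ('d', 2)]
```

A real engine would run each branch as a set-oriented sub-query over buffered partitions rather than tuple-at-a-time, but the control structure – dispatch by label, emit results plus next-iteration tuples, terminate on empty working set – is the same.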

Industry Impact: For database vendors and cloud analytics providers, adopting trampoline-style queries could significantly improve support for advanced analytics and iterative algorithms in SQL. Large cloud warehouses (BigQuery, Snowflake, etc.) currently have limited recursion capabilities – this approach could let them offer powerful new SQL analytics (e.g., graph queries and iterative machine learning directly in SQL). In particular, enterprises could push more logic into their cloud DB instead of exporting data to Python for iterative processing, simplifying architecture. If implemented efficiently, it could also attract new workloads (like network analysis and hierarchical computations) to cloud databases. Database systems known for analytical prowess (Oracle, SQL Server, etc.) might incorporate this to differentiate with better support for recursive logic. Overall, it can bring the ease of high-level SQL to tasks that currently require custom procedural code, improving developer productivity in industry.

Broader Implications: This work opens a line of research into richer computational paradigms in declarative languages. It essentially merges ideas from programming language iteration (trampolines in functional programming) into SQL, which might spur further work on unifying database querying with control flow constructs. Future academic work might formalize optimization techniques for trampolines (ensuring minimal overhead loops) or explore static analysis of termination and complexity in these queries. It also invites discussion on standardization – trampoline-style iteration could become part of SQL standard proposals if broadly seen as useful. Furthermore, it provides a foundation for “active databases” that perform complex logic internally, possibly reducing the impedance mismatch between databases and application logic. By showing how to integrate such control flows, it may inspire other extensions (e.g., conditionals or better user-defined iteration mechanisms) in SQL, all while keeping the benefits of set-oriented execution.