AI and Storage Systems in OSDI ’25 – Trends, Contributions, and Emerging Directions
1. Overall Trends & Challenges
Hardware-Driven Scaling
A prominent trend is leveraging new memory and storage hardware to meet the scale of modern AI workloads. For example, distributed systems are exploiting memory attached via CXL (Compute Express Link) instead of traditional networks for cross-host data sharing. Similarly, wafer-scale accelerators with massive on-chip memory are emerging to host large models entirely on-chip, promising huge speedups in AI inference. The challenge is to redesign software to fully exploit these new hardware capabilities (e.g., overcoming the higher latencies or limited coherence of CXL memory).
Fast Model Data Access & Caching
As AI models and datasets grow, systems face a trade-off between storage/memory usage and latency. OSDI ’25 papers highlight the need for efficient data loading and caching for model serving. One key issue is that loading large model parameters can stall autoscaling; solutions like BlitzScale show that high-bandwidth GPU interconnects and multicast can deliver fast scaling with essentially O(1) host-caching cost. The general challenge is ensuring models scale out quickly across nodes without incurring prohibitive memory overhead or startup delays.
Adaptive Data Retrieval
Many AI applications rely on vector search (finding nearest embedding vectors for recommendations, retrieval-augmented generation, etc.), which stresses storage systems. A trend is toward adaptive indexes and hybrid memory/disk solutions that maintain low latency under dynamic, skewed workloads. For instance, Quake dynamically partitions and tunes indexes to handle evolving data distributions, yielding 1.5–38× lower query latency under updates compared to static indexes. Likewise, PipeANN aligns its search algorithm with SSD characteristics to shrink the performance gap between disk-based and in-memory vector search, achieving on-disk search latencies only ~1.14–2× those of in-memory methods. The challenge here is balancing speed and cost: keeping most data on cheaper storage (SSD) or adjusting indexes on the fly, yet approaching the performance of in-memory systems.
Quality vs. Efficiency in Model Storage
AI systems increasingly grapple with model compression (quantization, pruning) to fit models into limited storage or memory, which can degrade accuracy. A notable challenge is preserving model quality while enjoying the efficiency gains. DecDEC, for example, addresses aggressive low-bit quantization (3–4 bits) by offloading a small portion of weight data (the “residuals”) to CPU memory and fetching them on the fly to correct errors. This approach retains the speed and memory savings of quantization but significantly improves model accuracy (e.g., reducing a quantized model’s perplexity from 10.15 to 9.12 with negligible memory overhead). Such techniques highlight the ongoing trade-off between model size and fidelity, pushing systems to creatively use storage hierarchies (GPU memory plus slower memory) to get the best of both worlds.
End-to-End Throughput Bottlenecks
As AI deployments scale, new performance bottlenecks have surfaced in storage and data movement. Contrary to conventional wisdom that large-model inference is memory-bound, analysis showed that LLM serving can be compute-bound due to sequential execution of heterogeneous operations (compute, memory copy, networking). This insight is driving systems to seek more parallelism and overlap in the pipeline. For example, NanoFlow splits inference into “nano-batches” to overlap computation with data transfer, boosting throughput by ~1.9× and better utilizing available GPU resources. Another challenge in distributed training is the communication overhead of synchronizing huge models’ gradients; approaches like ZEN exploit sparsity in gradients to cut synchronization time by up to 5×, addressing network and I/O bottlenecks in multi-GPU training. Overall, OSDI ’25 reflects a push toward holistic optimization – ensuring that storage, memory, and network are not limiting factors as we scale up AI models and clusters.
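The gradient-sparsity idea behind ZEN can be illustrated with a toy encoder: instead of shipping a dense tensor, a worker sends only the indices and values of nonzero entries. ZEN’s actual balanced, hierarchical synchronization scheme is considerably more involved; this sketch only shows why sparsity shrinks synchronization traffic:

```python
import numpy as np

def sparsify(grad, threshold=0.0):
    """Encode a gradient tensor as (indices, values) of entries above threshold."""
    flat = grad.ravel()
    idx = np.flatnonzero(np.abs(flat) > threshold)
    return idx, flat[idx], grad.shape

def densify(idx, vals, shape):
    """Reconstruct the dense tensor from its sparse encoding."""
    flat = np.zeros(int(np.prod(shape)))
    flat[idx] = vals
    return flat.reshape(shape)

# A mostly-zero gradient: the sparse encoding moves far fewer bytes.
grad = np.zeros((1024, 1024))
grad[::100, ::100] = 0.5                      # only ~0.01% of entries are nonzero
idx, vals, shape = sparsify(grad)
dense_bytes = grad.nbytes
sparse_bytes = idx.nbytes + vals.nbytes
```

On this toy tensor the sparse form is orders of magnitude smaller, which is the headroom sparsity-aware synchronization exploits.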
2. Key Contributions and Breakthroughs
Several OSDI ‘25 papers made significant contributions at the intersection of AI and storage/database systems:
Fine-Grained Model Autoscaling (BlitzScale)
BlitzScale introduced a novel model-serving autoscaler that breaks the conventional instance-level scaling into layer-level scaling. By loading model weights directly over the GPU interconnect (avoiding slow host storage) and using multicast to share parameters, it achieves “fast and live” scaling with essentially O(1) caching cost. This eliminated the typical speed vs. memory trade-off in autoscaling – BlitzScale can rapidly spawn new model replicas without lengthy loading delays. The result was up to 94% lower tail inference latency compared to the state-of-the-art serverless LLM scaling system. This is a breakthrough in handling bursty AI service demand without over-provisioning memory.
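A back-of-envelope model makes the storage-vs-network contrast concrete. The bandwidth and model-size figures below are illustrative assumptions, not numbers from the paper:

```python
# Illustrative assumptions (not figures from the BlitzScale paper):
MODEL_GB    = 140   # e.g., a 70B-parameter model in fp16
DISK_GBPS   = 3     # per-host NVMe read bandwidth
FABRIC_GBPS = 50    # GPU interconnect bandwidth (RDMA/NVLink-class)

def scale_up_from_disk():
    """Instance-level scaling: each new replica reads the full model from
    its local disk, so startup is bound by disk bandwidth."""
    return MODEL_GB / DISK_GBPS          # seconds per new replica

def scale_up_via_multicast():
    """Layer-level scaling in the BlitzScale style: a live replica streams
    parameters over the fabric, and multicast feeds any number of new
    replicas at once, so the cost does not grow with replica count."""
    return MODEL_GB / FABRIC_GBPS        # seconds, independent of replica count

t_disk, t_net = scale_up_from_disk(), scale_up_via_multicast()
```

Under these assumed numbers a disk-bound replica takes ~47 s to come up, while fabric-fed scaling takes under 3 s, which is the gap layer-level multicast loading targets.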
Adaptive Vector Search Indexing (Quake)
Quake tackled the problem of dynamic vector databases. Unlike traditional ANN (Approximate Nearest Neighbor) indexes that degrade under changing data, Quake adjusts to evolving data and query patterns on the fly. It uses a multi-level partitioning index that self-tunes based on a cost model predicting query latency, and it even adjusts search parameters to maintain target recall. In evaluations on a Wikipedia embedding workload, Quake maintained high recall with drastically lower latency, achieving 1.5–38× speedups in query time and up to 126× faster updates over state-of-the-art indexes on dynamic workloads. This is a significant advance for AI applications that require real-time search on continuously updating vectors (e.g. personalization systems).
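Quake’s self-tuning rests on a cost model that predicts latency and recall. A heavily simplified planner in that spirit (the hit probabilities and costs here are invented for illustration, and Quake’s real model is far richer) might pick which partitions to probe like this:

```python
import numpy as np

def plan_probe(partition_sizes, hit_prob, target_recall=0.9, cost_per_vec=1.0):
    """Probe partitions in descending hit-probability order until the
    estimated recall target is met; return the plan, estimated recall,
    and estimated scan cost."""
    order = np.argsort(hit_prob)[::-1]
    plan, recall, cost = [], 0.0, 0.0
    for p in order:
        plan.append(int(p))
        recall += hit_prob[p]
        cost += partition_sizes[p] * cost_per_vec
        if recall >= target_recall:
            break
    return plan, recall, cost

sizes = np.array([100, 400, 50, 300])          # vectors per partition
probs = np.array([0.50, 0.20, 0.25, 0.05])     # est. P(true neighbor in partition)
plan, recall, cost = plan_probe(sizes, probs)
```

The planner stops scanning as soon as the recall estimate clears the target, so skewed query distributions automatically translate into fewer partitions probed and lower latency.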
Disk-Based ANN Performance (PipeANN)
Recognizing that storing billion-scale embeddings purely in memory is costly, PipeANN showed how to use SSD storage without sacrificing much speed. The key innovation is aligning the ANN search algorithm with the SSD’s strengths: it reorganizes the best-first search pattern to minimize unnecessary I/O stalls and leverage parallelism. This yielded an on-disk approximate nearest neighbor search that runs at only ~1.14–2.02× the latency of an in-memory solution while maintaining accuracy. Compared to a prior disk-based system (DiskANN), PipeANN cuts latency to about 35% of the baseline (roughly a 3× speedup). Such a result is a breakthrough for cost-effective AI data stores, enabling scalable vector search on commodity SSDs.
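The core trick, decoupling I/O issue from result consumption in best-first search, can be sketched with a toy graph and a thread pool standing in for asynchronous SSD reads. This is a simplification: PipeANN’s actual pipeline keeps many reads in flight and restructures the search around them.

```python
import time
from concurrent.futures import ThreadPoolExecutor

GRAPH = {0: [1, 2], 1: [3], 2: [3], 3: []}    # toy on-disk proximity graph

def read_neighbors(node):
    """Stand-in for an SSD read of one node's adjacency list."""
    time.sleep(0.005)
    return GRAPH[node]

def pipelined_search(start, max_hops):
    """Issue the read for the next hop while 'computing' on the current
    frontier, instead of strictly alternating read-then-compute."""
    visited = [start]
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending = pool.submit(read_neighbors, start)
        for _ in range(max_hops):
            neighbors = pending.result()
            if not neighbors:
                break
            nxt = neighbors[0]                      # stand-in for best-by-distance
            pending = pool.submit(read_neighbors, nxt)  # I/O overlaps the step below
            visited.append(nxt)                     # distance computation would go here
    return visited
```

Because the next read is already in flight while the current candidates are processed, the search spends its time computing rather than waiting on the SSD.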
Wafer-Scale AI Inference (WaferLLM)
WaferLLM demonstrated the first system to fully exploit a wafer-scale AI accelerator for large language model inference. It introduces a performance model (PLMR) tailored to the wafer’s unique hardware (with hundreds of thousands of cores and tens of PB/s on-chip bandwidth). Using this model, WaferLLM coordinates parallel execution across the giant chip, introducing custom matrix multiplication kernels optimized for the mesh architecture. The impact is dramatic: on a Cerebras WSE-2 wafer-scale engine, WaferLLM attained 200× higher hardware utilization than conventional GPU setups and delivered end-to-end LLM inference speedups of 10–20× over an NVIDIA A100 cluster. This work is a milestone in hardware-software co-design for AI, pointing to future systems where enormous on-chip memory can largely replace slower off-chip storage for model serving.
Low-Bit LLM Quantization with Residual Storage (DecDEC)
DecDEC offered a fresh systems-level solution to push quantization to extreme lows (3–4 bits) while keeping model quality high. The insight is to use a hybrid storage approach: keep a small fraction of model data – the residuals that compensate for quantization error – in CPU memory (outside the GPU) and fetch them on demand during inference. This retains quantization’s benefits (huge GPU memory savings and latency gains) but recovers lost accuracy by correcting outlier activations with high-precision data. Impressively, DecDEC improved a 3-bit quantized Llama model’s perplexity substantially (from 10.15 down to 9.12, even beating a 3.5-bit model) with <0.0003% memory overhead and only 1.7% slowdown. This contribution marries storage hierarchy design with ML model needs, showing a path forward for efficient yet accurate model deployment on resource-constrained devices.
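In numpy terms, the mechanism looks roughly like the sketch below: quantize the weights, park the quantization residuals for a few “hot” columns outside the (notional) GPU, and add their contribution back during the matmul. Both the uniform quantizer and the column-selection rule are toy stand-ins for DecDEC’s actual design:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(w, bits=3):
    """Uniform symmetric quantization (a toy stand-in for real low-bit schemes)."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

W = rng.normal(size=(64, 64))
Wq = quantize(W)
residual = W - Wq                       # the precision lost to quantization

# Keep residuals only for the k columns with the largest total error;
# in DecDEC these live in CPU memory and are fetched on demand.
k = 4
hot_cols = np.argsort(np.abs(residual).sum(axis=0))[-k:]
cpu_residuals = {int(c): residual[:, c].copy() for c in hot_cols}

def matmul_with_correction(x, residuals):
    """Low-bit matmul, then patch in residual columns for outlier channels."""
    y = Wq @ x
    for c, r in residuals.items():
        y += r * x[c]                   # add back the dropped precision
    return y

x = rng.normal(size=64)
err_quant = np.linalg.norm(W @ x - Wq @ x)
err_corrected = np.linalg.norm(W @ x - matmul_with_correction(x, cpu_residuals))
```

Correcting every column would recover the full-precision result exactly; correcting only a handful of the worst columns captures most of that benefit at a tiny fetch cost, which is the trade DecDEC exploits.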
3. Emerging Directions
OSDI ’25 papers hint at several emerging research directions at the crossroads of AI and storage systems:
Hybrid Memory Architectures for AI
The use of CXL-based memory pooling and wafer-scale integrated memory suggests that future AI systems will blur the line between local and remote or on-chip memory. We can expect research into new memory-management techniques that treat far memory, persistent memory, and accelerators as first-class storage tiers for training and inference. The early success of systems like Tigon (CXL memory DB) and WaferLLM indicates a broader hardware/software co-design trend, where system architects tailor data placement and movement to novel hardware capabilities. An open question is how to generalize these designs – future work may explore OS and database support to automatically place model data in the optimal tier (DRAM vs. NVRAM vs. CXL memory vs. on-chip SRAM) for both performance and cost efficiency.
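A minimal sketch of such tier-aware placement, using invented capacity and latency figures and a naive hottest-first policy (not how any of the cited systems actually decide):

```python
# Hypothetical tiers: (name, capacity in GB, access latency in ns); ordered fastest first.
TIERS = [("HBM", 80, 100), ("DRAM", 512, 300), ("CXL", 2048, 600), ("SSD", 16384, 80000)]

def place(tensors):
    """Greedy placement: hotter tensors claim the fastest tier with room."""
    remaining = {name: cap for name, cap, _ in TIERS}
    placement = {}
    for name, size_gb, accesses_per_s in sorted(tensors, key=lambda t: -t[2]):
        for tier, _, _ in TIERS:
            if remaining[tier] >= size_gb:
                remaining[tier] -= size_gb
                placement[name] = tier
                break
    return placement

# Hypothetical workload: (tensor name, size in GB, accesses per second).
demo = [("kv_cache", 60, 1000), ("weights", 140, 800), ("optimizer_state", 280, 5)]
assignment = place(demo)
```

Even this naive policy pushes cold optimizer state out of the fastest tier; the open research question is doing this automatically, online, and with cost as well as latency in the objective.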
AI-Native Data Management
Managing data for AI workloads (e.g. embeddings, model parameters, training data) is becoming a specialty of its own. We see the rise of AI-native databases and caches: vector databases that adapt in real-time (as in Quake), distributed caches that offer bounded staleness for eventual consistency, and file systems optimized for the I/O patterns of AI (such as handling large sequential writes or checkpointing). Future systems research will likely expand on these, for instance by integrating learned components or AI models to decide caching and indexing strategies dynamically. There’s also a push toward making traditionally “dumb” storage smarter – for example, embedding simple compute logic (filtering, aggregations, similarity search) near storage to reduce data movement for ML workloads.
Holistic Pipeline Optimization
A clear direction is optimizing end-to-end pipelines for both training and serving. Instead of treating computation, memory access, and communication separately, upcoming systems overlap and co-schedule them. NanoFlow and PipeThreader illustrate this by using software-defined scheduling to overlap GPU computation with memory transfers or to utilize specialized cores efficiently. Similarly, addressing straggler delays in multi-node training (as studied in a ByteDance trace analysis) and balancing load in complex parallelism schemes (WLB-LLM for 4D parallelism) point to more research on dynamic scheduling and load balancing for large-scale AI jobs. We anticipate new frameworks that can automatically reorganize or fine-tune the execution of AI tasks (both at micro-scale within a GPU and at macro-scale across clusters) to squeeze out inefficiencies – essentially treating the model, hardware, and data flow as one integrated system.
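The overlap principle can be demonstrated with a toy producer/consumer pipeline in which a background thread stands in for the copy engine. Real systems overlap on-GPU via streams and kernel scheduling; this only illustrates why splitting work into nano-batches helps:

```python
import queue
import threading
import time

def transfer(batch):
    """Stand-in for a host-to-device copy."""
    time.sleep(0.01)
    return batch

def compute(batch):
    """Stand-in for the GPU kernel."""
    time.sleep(0.01)
    return batch * 2

def serial(batches):
    """Copy then compute, one batch at a time: no overlap."""
    return [compute(transfer(b)) for b in batches]

def overlapped(batches):
    """Copy nano-batch i+1 while computing nano-batch i."""
    q = queue.Queue(maxsize=2)
    def copy_engine():
        for b in batches:
            q.put(transfer(b))
        q.put(None)
    threading.Thread(target=copy_engine, daemon=True).start()
    out = []
    while (b := q.get()) is not None:
        out.append(compute(b))
    return out

t0 = time.perf_counter(); serial(list(range(8)));           t_serial = time.perf_counter() - t0
t0 = time.perf_counter(); res = overlapped(list(range(8))); t_overlap = time.perf_counter() - t0
```

With equal copy and compute times, overlapping approaches a 2× speedup over the serial pipeline, the same headroom NanoFlow harvests inside a GPU.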
Intelligent Systems and Automation
Ironically, AI is also helping to build better systems. OSDI ’25 saw tools like QiMeng-Xpiler, which uses an LLM (Large Language Model) plus symbolic methods to automate generating tensor code across hardware backends. There’s also SysGPT, a GPT-4-based assistant fine-tuned on systems papers, which suggests performance optimizations comparable to expert strategies. This foreshadows a future where machine learning assists in systems optimization – from auto-tuning database configurations to verifying storage system consistency. We might see emerging research on using reinforcement learning or large models to manage storage caching policies, schedule data transfers, or even design new file formats optimized for ML. The combination of domain knowledge and learned guidance could yield systems that continuously self-optimize for the workload at hand.
Reliability and Correctness at Scale
Finally, as AI systems become infrastructure, fault tolerance and correctness are growing concerns. The difficulty of debugging large model training (with subtle “silent” errors) led to tools like TrainCheck for invariant detection during training. In storage, we see formal verification being applied to ensure crash consistency and corruption detection (e.g., the PoWER framework verifying storage systems with standard logic). An emerging direction is to bring these reliability techniques to AI data pipelines – for instance, verifying the integrity of data preprocessing, or automatically detecting anomalies in model checkpoints and gradients. As AI-driven services handle critical data, we expect future work on making AI storage systems not just fast, but also provably robust against bugs, crashes, and adversarial conditions.
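A hand-written flavor of such checks is sketched below. TrainCheck infers invariants automatically from correct runs; these hard-coded examples only convey the kind of property being monitored:

```python
import numpy as np

def check_invariants(step, loss, params, prev_params):
    """Return human-readable violations of simple training invariants."""
    violations = []
    if not np.isfinite(loss):
        violations.append(f"step {step}: non-finite loss")
    for name, p in params.items():
        if not np.all(np.isfinite(p)):
            violations.append(f"step {step}: non-finite values in {name}")
        elif prev_params and np.array_equal(p, prev_params[name]):
            violations.append(f"step {step}: {name} did not change after optimizer step")
    return violations

# A step where both the loss and one parameter have gone bad:
params = {"w": np.array([0.5, np.nan])}
prev = {"w": np.array([0.5, 0.1])}
found = check_invariants(3, float("inf"), params, prev)
```

Running checks like these every N steps catches "silent" corruption (NaNs, stalled updates) long before it surfaces as a mysteriously flat loss curve.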
4. Top Papers Table: AI & Storage-Related Papers from OSDI ’25
Below is a selection of influential OSDI ’25 papers focusing on AI and storage/database systems, highlighting their key findings and relevance:
| Paper Title (OSDI 2025) | Key Findings | Relevance to AI & Storage |
|---|---|---|
| BlitzScale: Fast and Live Large Model Autoscaling with O(1) Host Caching | Introduces fine-grained, layer-level autoscaling for LLM serving by loading model parameters via high-speed GPU networks (multicast) instead of relying on disk or host RAM. Achieves up to 94% lower tail latency and ~50% less GPU usage versus prior autoscaling systems. | Enables rapid scaling of AI services without heavy local caching, solving the speed vs. memory trade-off in serving large models. Shows how storage bottlenecks (parameter loading) can be overcome with network-assisted caching. |
| Quake: Adaptive Indexing for Vector Search | Develops a self-adapting ANN indexing scheme for high-dimensional vectors that adjusts to data evolution and access skew. Uses multi-level partitions and a cost model to maintain low latency and high recall as data changes. Improves query latency by 1.5–38× and update speed by up to 126× on dynamic workloads vs. state-of-the-art indexes. | Provides a storage backend for embeddings that meets the real-time demands of AI applications (e.g. recommendations, retrieval-augmented generation). Ensures fast vector search even as the dataset grows or shifts, highlighting a new direction for database indexing in ML systems. |
| Achieving Low-Latency Graph-Based Vector Search via Aligning Best-First Search with SSD (PipeANN) | Aligns a graph-based ANN search algorithm with SSD hardware characteristics to avoid I/O stalls. Bridges the gap between in-memory and disk-based vector search: on billion-scale data, achieves search latency about 1.14–2.02× that of an in-memory index and only ~35% of the latency of a conventional disk-based approach (no accuracy loss). | Makes large-scale vector databases feasible using commodity SSDs. It’s highly relevant for AI systems that need to store and query massive embedding sets without investing in gigantic RAM clusters – effectively democratizing vector search by using storage more intelligently. |
| WaferLLM: Large Language Model Inference at Wafer Scale | Presents the first inference system for wafer-scale chips (with 100k+ cores and >50 TB/s bandwidth). Introduces a new performance model (PLMR) and specialized parallel algorithms (MeshGEMM/GEMV) to fully utilize the hardware. Achieves 200× higher core utilization and 10–20× faster LLM inference compared to conventional GPU clusters. | Showcases a radical hardware-software co-design for AI: storage and computation on a single huge silicon wafer. Relevant as a blueprint for future AI accelerators – how to orchestrate on-chip memory and core resources for extreme-scale model serving, potentially reducing reliance on external memory or networks. |
| NanoFlow: Towards Optimal Large Language Model Serving Throughput | Proposes splitting inference into nano-batches and overlapping compute, memory transfers, and network operations within each GPU. This intra-device parallelism yields up to 1.91× throughput improvement in LLM serving, reaching ~50–72% of the theoretical optimal throughput on popular models. | Targets the throughput bottlenecks in serving big models by treating the GPU like a pipeline. It highlights how careful scheduling of data movement (a storage/memory concern) alongside computation can significantly improve utilization. This is crucial for large-scale AI services where maximizing existing hardware throughput reduces the need for additional servers. |
| DecDEC: A Systems Approach to Advancing Low-Bit LLM Quantization | Improves the accuracy of ultra-low-bit (3–4 bit) quantized LLMs by storing a small “residual” matrix in CPU memory and fetching it for outlier activations. Maintains nearly all benefits of quantization (memory savings, speed) while greatly reducing error. For instance, lowered a 3-bit LLaMA model’s perplexity from 10.15 to 9.12 (better than a 3.5-bit model) with <0.001% extra GPU memory and ~1.7% slowdown. | A novel hybrid storage approach that splits model data across GPU and CPU memory to push model compression further. Relevant to on-device AI and any scenario with limited GPU memory – it shows how offloading part of the model to cheaper storage can preserve quality, pointing toward more flexible memory management in AI frameworks. |
| FuseLink: Enabling Efficient GPU Communication over Multiple NICs | Utilizes idle NICs by having GPUs relay traffic across multiple network interfaces, avoiding static one-GPU-per-NIC bottlenecks. Integrated into the NCCL library, it achieved 212 GB/s GPU-to-GPU bandwidth and accelerated distributed ML tasks (e.g., reducing first-token LLM inference latency by 1.04–2.73× and improving mixture-of-experts training throughput by 1.3×). | Although focused on networking, this work touches on data transfer optimizations for AI clusters – effectively a part of the storage/memory hierarchy in distributed training. By overcoming network hotspots, it ensures faster data exchange for model shards and embeddings. This is highly relevant for scaling AI training and serving, where moving data quickly between GPUs (and thus between memory pools) is as important as local storage speed. |
Each of these papers exemplifies how cutting-edge research is addressing the intersection of AI workloads with storage and data management. From novel indexing structures and caching mechanisms to co-designing with new memory hardware, OSDI ’25 showcased solutions that push the boundaries of performance and efficiency for AI systems. The innovations not only tackle current bottlenecks but also open up new possibilities for future AI infrastructure, where storage systems are intelligent, adaptive, and deeply integrated with machine learning needs.