<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.2.2">Jekyll</generator><link href="https://pfzuo.github.io/feed.xml" rel="self" type="application/atom+xml"/><link href="https://pfzuo.github.io/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-03-27T11:41:58+00:00</updated><id>https://pfzuo.github.io/feed.xml</id><title type="html">blank</title><entry><title type="html">2025 Year-End Summary: Four Trends and Five Representative Works of Innovation in LLM Inference Systems</title><link href="https://pfzuo.github.io/blog/2026/2025-Year-End-Summary-Zhihu/" rel="alternate" type="text/html" title="2025 Year-End Summary: Four Trends and Five Representative Works of Innovation in LLM Inference Systems"/><published>2026-01-06T00:00:00+00:00</published><updated>2026-01-06T00:00:00+00:00</updated><id>https://pfzuo.github.io/blog/2026/2025-Year-End-Summary-Zhihu</id><content type="html" xml:base="https://pfzuo.github.io/blog/2026/2025-Year-End-Summary-Zhihu/"><![CDATA[<h3 id="introduction-the-first-year-of-inference-explosion-and-the-hundred-billion-cost-battle">Introduction: The First Year of Inference Explosion and the Hundred-Billion Cost Battle</h3> <p>Looking back at 2025, we not only experienced technological iterations, but also witnessed a dramatic shift in the industrial landscape. If the past few years were an arms race of “large-scale model training,” then 2025 was undoubtedly the first year of “inference business explosion.” As model capabilities matured and Agent applications landed, the balance of cloud computing power fundamentally shifted: currently, the vast majority of GPU/NPU resources on the cloud are occupied by inference workloads. The scale of accelerator cards serving inference is often several times larger than that of training cards, and in some companies even an order of magnitude larger.</p> <p>For any AI company, the largest cost center is often AI Infra. Within Infra, as user volume surges, inference costs take an overwhelming share. This means that optimizing the efficiency of inference systems is no longer just a challenge for tech enthusiasts; it is the “lifeline” directly tied to enterprise survival and development. Imagine if, through system architecture innovation, we could double inference throughput on the same hardware: for a tech company with massive compute investments, that could translate directly into cost savings of tens of billions of RMB.</p> <p>It is precisely this tremendous business value and technical challenge that drove our team to continuously explore the boundaries of inference systems over the past year.
In 2025, we witnessed DeepSeek-V3 pushing MoE to the extreme, saw Agents evolve from demos to complex production environments, and experienced Context Caching transforming from a niche technique into a “mainstream commodity.” As traditional inference system architectures (such as early vLLM and TGI) began to show strain and the marginal returns of single-point optimizations (Kernel Fusion, Quantization) declined, system-level architectural overhaul became the new growth engine.</p> <p>This article reviews the four key trends driving our inference technology innovation and provides a detailed interpretation of our five most representative works in 2025 (SparseServe, Adrenaline, TaiChi, DualMap, MemArt), in the hope of offering a substantial technical response to the community.</p> <h3 id="i-four-key-trends-driving-innovation-in-inference-systems">I. Four Key Trends Driving Innovation in Inference Systems</h3> <p>Before introducing specific works, it is necessary to first revisit the “wind direction” we observed this year. It is these underlying changes that determine why we pursued these research directions. As shown in Figure 1, I map these four trends to the four layers of cloud inference services.</p> <p><img src="/assets/img/2026-01-06-2025-Year-End-Summary-Zhihu/1.jpg" alt="Figure 1"/> <strong>Figure 1:</strong> Four Key Trends Driving Innovation in Inference Systems</p> <h4 id="trend-1-application-layer--from-simple-chatbots-to-complex-agents">Trend 1: Application Layer — From Simple Chatbots to Complex Agents</h4> <p><strong>Phenomenon:</strong> The form of LLM applications has undergone a qualitative change. In 2023–2024, the mainstream workload faced by inference was chatbots (e.g., ChatGPT). These applications feature relatively balanced inputs and outputs, usually in the same order of magnitude (e.g., between 3:1 and 1:3). But in 2025, LLM applications evolved into complex Agents (e.g., DeepResearch), featuring a closed loop of “planning — tool invocation — execution — reflection,” continuously reading environmental information, invoking tools, and iterating decisions across multi-step tasks. This causes the Input/Output Token Ratio to surge to 100:1 or even 1000:1.</p> <p><strong>Challenges:</strong> 1) Prefill dominates: end-to-end latency and cost are no longer determined by Decode, but by Prefill. 2) Memory becomes a key factor: Agents need to “remember” user preferences and state across tasks and sessions. How to effectively represent, manage, and retrieve these memories to improve inference accuracy and efficiency becomes critical.</p> <h4 id="trend-2-service-system-layer--context-caching-evolves-from-optional-to-standard">Trend 2: Service System Layer — Context Caching Evolves from “Optional” to “Standard”</h4> <p><strong>Phenomenon:</strong> In multi-turn dialogues and Agent workflows, the prefix repetition rate of online requests increases significantly: system prompts, tool descriptions, fixed templates, and session context repeatedly appear within the same task chain. Redoing Prefill from scratch each time directly translates the high input/output ratio of Agents into repeated computation and high cost. Therefore, Context Caching (prefix caching) has become a standard feature among major vendors (OpenAI, Anthropic, DeepSeek, Google, etc.).</p> <p><strong>Challenges:</strong> Caching introduces “state.” In the era without caching, requests were stateless; schedulers could freely dispatch requests to any idle node (Round Robin or Least Loaded sufficed).
However, after introducing caching, to achieve cache hits, the scheduler needs to send requests to the specific node holding the prefix data. This is called Cache Affinity. This leads to the most thorny “affinity vs. balance” contradiction in distributed systems: 1) Pursuing affinity can lead to overload of nodes holding popular prefixes (Hotspots), creating long-tail latencies. 2) Pursuing balance by forcibly scattering requests prevents cache reuse, wasting compute resources.</p> <h4 id="trend-3-inference-engine-layer--optimization-after-prefilldecode-separation-reaches-a-watershed">Trend 3: Inference Engine Layer — Optimization After Prefill–Decode Separation Reaches a Watershed</h4> <p><strong>Phenomenon:</strong> Early LLM serving often adopted PD coupling, placing Prefill and Decode in the same resource pool and maximizing resource utilization through batching, scheduling, and operator optimization (e.g., Orca). As inference services began simultaneously constraining TTFT/TPOT, the industry gradually shifted to PD separation to reduce latency interference between Prefill and Decode.</p> <p><strong>Challenges:</strong> After PD separation, a “double waste” of resources emerged. The Prefill side keeps compute busy but underutilizes memory capacity and bandwidth; the Decode side keeps KV caches resident, straining memory capacity and bandwidth while leaving compute underutilized. Hence low resource utilization becomes the key contradiction. At this point, inference engine optimization reaches a crossroads: 1) Heterogeneous deployment: e.g., run Prefill on cards with strong compute but weak memory, and Decode on cards with weaker compute but stronger memory. Along this line, one can even further split Decode’s Attention and FFN (AF separation) and deploy them onto heterogeneous cards to further improve elasticity and utilization (e.g., MegaScale-Infer). However, in cloud data centers, heterogeneous deployment often encounters issues such as the unavailability of heterogeneous resources, difficulties in high-performance network interconnects, and reduced elasticity. 2) Separation with mixed colocation: logically still separated, but with cross-stage mixing along the execution path, e.g., co-deploying compute-intensive and memory-intensive inference sub-stages on the same card to improve resource utilization.</p> <h4 id="trend-4-model-layer--optimization-focus-shifts-from-ffn-to-attention">Trend 4: Model Layer — Optimization Focus Shifts from FFN to Attention</h4> <p><strong>Phenomenon:</strong> On the FFN side, after DeepSeek successfully ran the large-scale small-expert MoE route, base models gradually converged to highly sparse MoE architectures (such as DeepSeek-R1/V3, Qwen3-MoE, Kimi-K2), substantially reducing inference computation. However, as the context length breaks through 1M, the computational and storage complexity of Attention becomes the new bottleneck, making Attention optimization the new main battlefield.</p> <p>Attention optimization is mainly KV cache compression. DeepSeek’s MLA compresses per-token KV along the head dimension to the extreme; as a result, compression along the token dimension becomes the new battleground. Currently, there are two main routes for token-dimension compression: 1) Sparse attention: KV is still stored, but access is compressed, i.e., each token only interacts with a small subset of the most important KV, reducing computation and memory bandwidth (e.g., DeepSeek-V3.2’s DSA).
2) Linear attention: compress the dependency on history into a recurrent state, making Decode closer to constant per-token overhead (e.g., Qwen3-Next, Kimi Linear). These two routes can be mixed to create hybrid attention. Other token-dimension compression techniques exist, such as compressing the KV of consecutive tokens together, but we have not seen mature models using them yet.</p> <p><strong>Challenges:</strong> When the model’s Attention structure changes, it inevitably impacts the implementation of the inference system; system bottlenecks may shift, and some modules may need redesign. For example, we found that after dynamic sparse attention eliminates most Attention computation, the bottleneck of Decode throughput shifts from HBM bandwidth to HBM capacity (KV must remain resident; batch size is more easily bound by capacity). Another example is that when using linear attention, the model state is no longer a complete per-token KV cache but an SSM maintained per layer; therefore, traditional “prefix KV reuse-style Context Caching” needs to evolve into “SSM checkpoint/restore.” In hybrid structures, the system must manage two types of state objects simultaneously: KV cache + SSM. Caching, routing, and memory orchestration all become more complex.</p> <h3 id="ii-systematic-layout-of-inference-technology-innovation-overview">II. Systematic Layout of Inference Technology Innovation (Overview)</h3> <p>To transform the above “trends” into concrete research topics, we organized the problem space of LLM inference systems into a three-layer architecture covering the full chain from bottom-level cache management through mid-level engine optimization to top-level distributed scheduling. In 2025, our team conducted a comprehensive and systematic layout across these three levels.</p> <p><img src="/assets/img/2026-01-06-2025-Year-End-Summary-Zhihu/2.jpg" alt="Figure 2"/> <strong>Figure 2:</strong> Our Team’s Layout of Inference System Technology Innovations</p> <p>As shown in Figure 2, our research covers the complete chain:</p> <p>Distributed Scheduling Layer: responsible for global request scheduling. The core challenge is how, in a stateful service system, to balance locality and load balance. We proposed DualMap (Achieving Both Cache Affinity and Load Balance), which breaks the traditional either-or dilemma through double hashing and state-aware routing inspired by The Power of Two Choices.</p> <p>Inference Engine Layer: responsible for efficient execution of model computation. In response to the resource utilization mismatch after PD separation, we introduced Adrenaline (Attention Disaggregation), which simultaneously improves resource utilization of P and D through attention disaggregation and mixed colocation; and TaiChi, which unifies the architectural debate between PD aggregation and separation and squeezes slack from SLO-over-satisfied requests to improve overall throughput. In addition, for the HBM capacity bottleneck of dynamic sparse attention (DSA) models, we introduced SparseServe, scaling batch size to improve Decode throughput.</p> <p>Caching System Layer: responsible for the storage and reuse of KV cache. For this layer, we published CachedAttention at USENIX ATC ‘24, which is likely the first top-conference paper on Context Caching systems, and we pioneered the use of hierarchical storage for KV cache. In 2025, for the Agent scenario, we further evolved this into KV Cache-Centric Agent Memory (MemArt), revolutionizing the representation of Agent memory.
These three layers, through tight vertical co-design, together constitute our answer to inference-system technical innovation.</p> <h3 id="iii-five-representative-works-in-2025">III. Five Representative Works in 2025</h3> <p>Next, I will briefly introduce these five representative works; interested readers can refer to the original papers for details.</p> <h4 id="1-sparseserve-breaking-the-capacity-wall-of-dynamic-sparse-attention">1) SparseServe: Breaking the “Capacity Wall” of Dynamic Sparse Attention</h4> <p><strong>Paper link:</strong> <a href="https://arxiv.org/abs/2509.24626v1">https://arxiv.org/abs/2509.24626v1</a></p> <p><strong>Motivation:</strong> Introducing dynamic sparse attention (DSA) dramatically reduces Attention computation and per-step memory access. However, the system faces a severe “storage efficiency paradox”: to guarantee low decoding latency, a large number of KV caches corresponding to unselected “cold” tokens must still reside in HBM. This directly shifts the system bottleneck from “compute/bandwidth” to “memory capacity.” As shown in Figure 3, for full attention, which is limited by the memory-bandwidth wall, simply increasing batch size yields saturating decoding throughput (the curve flattens); for DSA, thanks to its low bandwidth demands, increasing batch size should deliver near-linear gains in end-to-end throughput. Unfortunately, in reality, batch size is often prematurely hard-limited by HBM physical capacity, preventing DSA’s extremely high theoretical throughput ceiling from being realized in production.</p> <p><img src="/assets/img/2026-01-06-2025-Year-End-Summary-Zhihu/3.jpg" alt="Figure 3"/> <strong>Figure 3:</strong> Impact of Increasing Batch Size on Throughput of Full Attention and DSA (two HBM capacity lines for 40GB and 80GB A100)</p> <p><strong>Core idea:</strong> Offloading these underutilized KV caches to DRAM (system memory) can free up HBM capacity, thereby allowing larger parallel batch sizes. However, implementing such hierarchical HBM–DRAM storage brings new challenges, including fragmented KV cache access, HBM cache contention, and the high HBM demands of hybrid batching—all of which remain unresolved in prior work.</p> <p>To address these challenges, we propose SparseServe, an LLM inference technique designed to unleash the parallel potential of DSA through efficient hierarchical HBM–DRAM management. SparseServe introduces three key innovations: 1) Fragmentation-aware KV cache transfer: accelerating data movement between HBM and DRAM via GPU-direct loading (FlashH2D) and CPU-assisted saving (FlashD2H); 2) Working-set-aware batch size control: adjusting batch sizes based on real-time working-set estimation to minimize HBM cache thrashing; 3) Layer-segmented Prefill: bounding HBM usage during Prefill to a single layer, enabling efficient execution even for long prompts.</p> <p><strong>Results:</strong> By breaking the capacity wall, SparseServe reduces TTFT (time-to-first-token) latency by 9.26× and increases throughput by 3.14×.</p> <h4 id="2-adrenaline-injecting-adrenaline-into-inference-systems">2) Adrenaline: Injecting “Adrenaline” into Inference Systems</h4> <p><strong>Paper link:</strong> <a href="https://arxiv.org/abs/2503.20552">https://arxiv.org/abs/2503.20552</a></p> <p><strong>Motivation:</strong> In the mainstream PD separation architecture, we face a severe resource mismatch.
As shown in Figure 4, the Decode node, limited by HBM bandwidth, often has its expensive compute resources “starved” and cannot be fully utilized; meanwhile, the Prefill node, handling compute-intensive Prefill tasks, leaves its abundant HBM bandwidth largely idle. However, physical separation prevents Decode from “borrowing” the Prefill node’s HBM bandwidth, and Prefill cannot “support” the compute needs of Decode. This divide forms “resource islands” between Prefill and Decode nodes, making it difficult to raise overall cluster resource utilization.</p> <p><img src="/assets/img/2026-01-06-2025-Year-End-Summary-Zhihu/4.jpg" alt="Figure 4"/> <strong>Figure 4:</strong> Resource Utilization of Prefill and Decode Nodes Under PD Separation</p> <p><strong>Core idea:</strong> To solve this structural imbalance, we propose Adrenaline, an inference service system that realizes “fluid resource pooling.” Inspired by osmosis in biology—solvent naturally passes through a semi-permeable membrane to balance concentration—Adrenaline allows memory-intensive Decode attention computation (and its associated KV cache) to permeate across the physical boundary between Prefill and Decode instances.</p> <p>By letting Decode attention tasks naturally flow from resource-constrained Decode nodes to HBM-rich Prefill nodes, Adrenaline effectively transforms underutilized HBM on Prefill GPUs into an extended resource pool for Decode tasks. This mechanism successfully balances resource pressure within the cluster: it puts the idle HBM capacity and bandwidth on Prefill nodes to use while unlocking larger batch sizes on Decode nodes. In addition, Adrenaline overcomes cross-instance latency and interference via low-latency decoding synchronization, resource-efficient Prefill colocation, and SLO-aware offloading.</p> <p><img src="/assets/img/2026-01-06-2025-Year-End-Summary-Zhihu/5.jpg" alt="Figure 5"/> <strong>Figure 5:</strong> Comparison of Adrenaline with Original PD Separation Workflow (increasing Decode batch size from M to M+N)</p> <p><strong>Results:</strong> Compared to state-of-the-art PD separation systems, Adrenaline increases utilization across different resources by 1.05× to 6.66×, and improves overall inference throughput by 2.04× under SLO constraints.</p> <h4 id="3-taichi-the-tai-chi-way-of-unifying-architectures">3) TaiChi: The Tai Chi Way of Unifying Architectures</h4> <p><strong>Paper link:</strong> <a href="https://arxiv.org/abs/2508.01989">https://arxiv.org/abs/2508.01989</a></p> <p><strong>Motivation:</strong> In the LLM inference field, there is an architectural debate: one side is PD aggregation (placing Prefill and Decode on the same GPU), while the other is PD separation (deploying them on different GPUs). We systematically compared their performance under different TTFT and TPOT SLOs and found that when the TTFT SLO is strict and the TPOT SLO is relaxed, aggregation wins; under the opposite conditions, separation is better.
However, under balanced TTFT/TPOT SLOs, both show significant suboptimality in terms of Goodput (effective throughput under SLO), revealing a previously uncharacterized “Goodput Gap,” as shown in Figure 6.</p> <p><img src="/assets/img/2026-01-06-2025-Year-End-Summary-Zhihu/6.jpg" alt="Figure 6"/> <strong>Figure 6:</strong> TTFT and TPOT distributions under different scheduling strategies, with the same number of compute nodes and QPS</p> <p><strong>Core idea:</strong> We attribute this Goodput Gap to underutilized Latency Slack: many requests finish far below their TTFT/TPOT SLOs while other requests face SLO-violation risk; yet current systems only expose unified, stage-level knobs and cannot reallocate slack across requests and stages. Therefore, we propose Latency Shifting as a design principle for LLM serving: treat TTFT/TPOT SLO slack as a core resource and strategically reassign it to maximize Goodput.</p> <p>To realize this idea, we propose TaiChi, an LLM serving system that achieves Latency Shifting via a hybrid-mode inference architecture and two request-level schedulers. Hybrid mode, operating on heterogeneous Prefill-heavy and Decode-heavy instances, combines “aggregated batching” with “per-request stage decoupling” to fill the gaps in the 2D PD design space. On this basis, Flowing Decode shapes TPOT under the constraints of batched decoding and unknown output lengths, while Length-aware Prefill selectively “downgrades” Prefill on requests with abundant TTFT slack according to TTFT prediction. Finally, through the design of three sliders, TaiChi unifies PD aggregation, PD separation, and hybrid-mode inference under a single architecture, as shown in Figure 7.</p> <p><img src="/assets/img/2026-01-06-2025-Year-End-Summary-Zhihu/7.jpg" alt="Figure 7"/> <strong>Figure 7:</strong> TaiChi architecture unifying PD aggregation, PD separation, and hybrid-mode inference</p> <p><strong>Results:</strong> Compared to state-of-the-art PD aggregation and separation systems, TaiChi improves Goodput by up to 40%, reduces P90 TTFT by up to 5.3×, and reduces P90 TPOT by up to 1.6×.</p> <h4 id="4-dualmap-achieving-both-affinity-and-balance-in-distributed-scheduling">4) DualMap: Achieving Both Affinity and Balance in Distributed Scheduling</h4> <p><strong>Paper link:</strong> To be supplemented</p> <p><strong>Motivation:</strong> In LLM inference services, reusing prompt KV cache across requests is key to reducing TTFT and service cost. Cache-affinity scheduling aims to colocate requests with the same prompt prefix to maximize KV reuse; however, this often conflicts with load-balancing scheduling, which aims to evenly distribute requests across instances. Existing schedulers struggle to reconcile this trade-off because they typically operate in a single mapping space, applying affinity routing to some requests and load balancing to others, lacking a unified scheme for achieving both goals simultaneously.</p> <p><img src="/assets/img/2026-01-06-2025-Year-End-Summary-Zhihu/8.jpg" alt="Figure 8"/> <strong>Figure 8:</strong> DualMap vs. existing work on cache affinity and load balancing</p> <p><strong>Core idea:</strong> To overcome this limitation, we propose DualMap, a dual-mapping scheduling strategy for distributed LLM serving that simultaneously achieves cache affinity and load balancing, as shown in Figure 8.
The core idea is: based on the request’s prompt, use two independent hash functions to map each request to two candidate instances, then intelligently select the better one based on the current system state. This design leverages The Power of Two Choices, increasing the probability of colocating requests with shared prefixes while also ensuring that requests with different prefixes are evenly spread across the cluster.</p> <p>To keep DualMap robust under dynamic and skewed real-world workloads, we introduce three techniques: 1) SLO-aware request routing: prioritize cache affinity, but switch to load-aware scheduling when TTFT exceeds the SLO, enhancing load balancing without sacrificing cache reuse; 2) Hotspot-aware rebalancing: dynamically migrate requests from overloaded instances to lightly loaded ones to eliminate hotspots and rebalance the system; 3) Lightweight dual-hash-ring scaling: support fast, low-overhead instance scaling using dual-hash-ring mapping, avoiding expensive global remapping.</p> <p><strong>Results:</strong> Compared to state-of-the-art work, under the same TTFT SLO constraint, DualMap increases the system’s Effective Request Capacity by up to 2.25×.</p> <h4 id="5-kvcache-centric-memory-memart-native-memory-for-agents">5) KVCache-Centric Memory (MemArt): Native Memory for Agents</h4> <p><strong>Paper link:</strong> To be supplemented</p> <p><strong>Motivation:</strong> LLM agents are becoming a new paradigm for applying base models to complex real-world workflows, such as scientific exploration, programming assistants, and automated task planning. Unlike single-prompt or short-dialogue bots, agents often run for hours to days, involve dozens to hundreds of iterative calls, and quickly accumulate context that exceeds the model’s context window. To address this scalability bottleneck, the industry has begun introducing external memory systems to store and retrieve historical information on demand, maintaining efficiency, accuracy, and robustness in long-horizon tasks.</p> <p>Currently, most mainstream memory systems adopt “plaintext memory”: they segment/summarize historical dialogues into entries, then retrieve using a vector database or graph structure. This approach has two fundamental problems: (1) summarization and similarity-based retrieval struggle to preserve the complete semantic dependencies of multi-turn interactions, easily missing key information or introducing noise, and often perform worse than full-context reasoning; (2) discrete memory entries break the continuous structure of the prompt prefix, undermining the prefix caching the inference engine relies on and thereby weakening its performance and efficiency benefits.</p> <p><img src="/assets/img/2026-01-06-2025-Year-End-Summary-Zhihu/9.jpg" alt="Figure 9"/> <strong>Figure 9:</strong> Workflow comparison between MemArt’s KVCache-centric memory and plaintext memory</p> <p><strong>Core idea:</strong> We propose MemArt, a new memory paradigm: shifting from plaintext memory to KVCache-centric memory, improving both inference effectiveness and efficiency. MemArt stores historical context directly as reusable KV blocks and computes attention scores between the current prompt and each KV block in the latent space to retrieve relevant memory, as shown in Figure 9.
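</p> <p>As a concrete illustration of this retrieval step, below is our own minimal C sketch (hypothetical names, a single attention head, no softmax or top-k bookkeeping); it is an illustration of the idea, not MemArt’s actual implementation:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Hypothetical sketch (not MemArt's real code): latent-space retrieval of
 * memory KV blocks. Each block keeps a compressed representative key; the
 * relevance of a block is the attention-style score between that key and
 * every query vector of the current prompt, aggregated over all tokens. */
#include &lt;stddef.h&gt;

#define DIM 128                 /* head dimension -- an assumption */

typedef struct {
    float rep_key[DIM];         /* compressed representative key of the block */
    int   block_id;
} MemBlock;

static float dot(const float *a, const float *b) {
    float s = 0.0f;
    for (size_t i = 0; i &lt; DIM; i++) s += a[i] * b[i];
    return s;
}

/* Multi-token aggregation: sum the scores over all prompt tokens so a block
 * relevant to any part of the prompt ranks high. Returns the id of the best
 * block; a real system would keep the top-k and then verify positions. */
int retrieve_best_block(const float prompt_q[][DIM], size_t n_tokens,
                        const MemBlock *blocks, size_t n_blocks) {
    int best = -1;
    float best_score = -1.0e30f;
    for (size_t b = 0; b &lt; n_blocks; b++) {
        float score = 0.0f;
        for (size_t t = 0; t &lt; n_tokens; t++)
            score += dot(prompt_q[t], blocks[b].rep_key);
        if (score &gt; best_score) { best_score = score; best = blocks[b].block_id; }
    }
    return best;
}
</code></pre></div></div> <p>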
It offers three advantages: (1) High-fidelity retrieval: aligned with the model’s attention mechanism, yielding more accurate semantics; (2) High efficiency: KV blocks that hit can be reused directly in Prefill, avoiding token reprocessing and reducing latency and cost; (3) Easy integration: plug-and-play without changing model weights or structure.</p> <p>There remain two major challenges to implementing KVCache-centric memory: First, as the memory repository grows, how to avoid scanning everything while still retrieving accurately; second, retrieved KV blocks are usually non-contiguous and carry original positional information—direct concatenation causes positional inconsistency, affecting output quality. To this end, MemArt constructs a compressed representative key for each KV block for fast screening, then uses a multi-token aggregation strategy to combine attention scores across all prompt tokens, improving relevance. Finally, it verifies and adjusts the positions of retrieved blocks via a decoupled positional encoding mechanism, enabling safe, coherent reuse in the current context.</p> <p><strong>Results:</strong> Compared to state-of-the-art plaintext memory methods, MemArt improves inference accuracy by 11.8%–39.4%, approaching full-context reasoning performance. More importantly, compared to plaintext memory methods, MemArt reduces the number of Prefill tokens by 91–135×.</p> <p>These results suggest that KVCache-centric memory may become the key foundation for building high-accuracy, high-efficiency, long-context LLM agents. Meanwhile, it introduces a series of system implementation challenges, especially storage capacity and cost pressure as memory scale increases: hierarchical storage (HBM/DRAM/SSD), efficient cache management strategies, and KV cache compression and eviction are needed for sustainable engineering deployment. For those interested in this direction, a promising path is “lower-cost KV-level memory management.”</p> <h3 id="iv-conclusion">IV. Conclusion</h3> <p>In 2025, our team’s main research line was very clear: no longer confined to single-operator or single-model optimization, but advancing end-to-end, full-stack system architecture innovation to break down the “resource walls” and “efficiency walls” obstructing inference efficiency.</p> <p>From SparseServe’s breakthrough on the memory capacity bottleneck of dynamic sparse attention (DSA), to Adrenaline’s ingenious bridging of “resource islands” created by PD separation; from TaiChi putting an end to the architectural route debate between aggregation and separation, to DualMap mathematically reconciling the contradiction between cache affinity and load balancing, and finally to MemArt sinking agent memory from application-layer plaintext into system-level primitives—these works together form a new infrastructure for large-model inference.</p> <p><strong>BTW:</strong> The innovative research results introduced in this article mainly come from my intern team, and I thank them for their outstanding work. We also welcome more excellent students to contact us for internships!
In addition, below the waterline of this iceberg, our engineering team has accumulated many more highly valuable system practices in production environments, which are not elaborated here due to compliance and space limitations.</p> <h4 id="looking-forward-to-2026">Looking Forward to 2026:</h4> <p>If 2025 was a year of system architecture reshaping, then 2026 will be a year where “model evolution and application explosion force a paradigm shift in systems.”</p> <p>We foresee that model architectures will move toward extreme dynamic sparsity: the deep fusion of MoE with sparse/linear attention will yield heterogeneous compute graphs that drive inference workloads to unprecedented levels of dynamism. The full explosion of multimodal capabilities means input streams will extend from text to audio-video streams, bringing heterogeneous context pressure several times larger than today. The further boom of Agent applications will push inference from single-shot interactions to long-horizon, complex-state task orchestration. Another direction of note is Continual Learning, which is likely to become mainstream in the future but in the short term still awaits algorithmic breakthroughs before commercial adoption.</p> <p>Facing these changes, we will continue to “seek certainty amidst uncertainty”: turn extreme sparsity on the model side into system-side SLO-goodput that is predictable and deliverable; converge multimodal heterogeneous inputs into unified, tunable scheduling and resource-orchestration primitives; internalize complex Agent states into reusable, low-cost, governable system-level memory. We look forward to pushing inference infrastructure from “high performance” toward “high adaptability, high reliability, and sustainable scalability,” providing a truly scalable foundation for the next generation of AI applications.</p>]]></content><author><name></name></author><category term="LLM"/><category term="Inference"/><category term="System"/><summary type="html"><![CDATA[Introduction: The First Year of Inference Explosion and the Hundred-Billion Cost Battle Looking back at 2025, we not only experienced technological iterations, but also witnessed a dramatic shift in the industrial landscape. If the past few years were an arms race of “large-scale model training,” then 2025 was undoubtedly the first year of “inference business explosion.” As model capabilities matured and Agent applications landed, the balance of cloud computing power fundamentally shifted: currently, the vast majority of GPU/NPU resources on the cloud are occupied by inference workloads. The scale of accelerator cards serving inference is often several times larger than that of training cards, and in some companies even an order of magnitude larger.]]></summary></entry><entry><title type="html">Does NVIDIA Dynamo’s PD Disaggregation Have Issues? Our Proposed “Adrenaline” Is The Remedy!</title><link href="https://pfzuo.github.io/blog/2025/Adrenaline/" rel="alternate" type="text/html" title="Does NVIDIA Dynamo’s PD Disaggregation Have Issues? Our Proposed “Adrenaline” Is The Remedy!"/><published>2025-03-15T00:00:00+00:00</published><updated>2025-03-15T00:00:00+00:00</updated><id>https://pfzuo.github.io/blog/2025/Adrenaline</id><content type="html" xml:base="https://pfzuo.github.io/blog/2025/Adrenaline/"><![CDATA[<p><a href="https://zhuanlan.zhihu.com/p/1888519961636487325?utm_psn=1895963156649574869">Does NVIDIA Dynamo’s PD Disaggregation have issues?
Our proposed “Adrenaline” is the remedy!</a></p>]]></content><author><name></name></author><category term="AI"/><category term="LLM"/><category term="Machine-Learning"/><summary type="html"><![CDATA[Does NVIDIA Dynamo’s PD Disaggregation have issues? Our proposed “Adrenaline” is the remedy!]]></summary></entry><entry><title type="html">DeepSeek Has NSA (Native Sparse Attention), While We Have PSA (Progressive Sparse Attention)</title><link href="https://pfzuo.github.io/blog/2025/DeepSeek-has-NSA-(Native-Sparse-Attention),-while-we-have-PSA-(Progressive-Sparse-Attention)/" rel="alternate" type="text/html" title="DeepSeek Has NSA (Native Sparse Attention), While We Have PSA (Progressive Sparse Attention)"/><published>2025-03-01T00:00:00+00:00</published><updated>2025-03-01T00:00:00+00:00</updated><id>https://pfzuo.github.io/blog/2025/DeepSeek%20has%20NSA%20(Native%20Sparse%20Attention),%20while%20we%20have%20PSA%20(Progressive%20Sparse%20Attention)</id><content type="html" xml:base="https://pfzuo.github.io/blog/2025/DeepSeek-has-NSA-(Native-Sparse-Attention),-while-we-have-PSA-(Progressive-Sparse-Attention)/"><![CDATA[<p><a href="https://zhuanlan.zhihu.com/p/28475636063?utm_psn=1895961819992023976">DeepSeek has NSA (Native Sparse Attention), while we have PSA (Progressive Sparse Attention).</a></p>]]></content><author><name></name></author><category term="AI"/><category term="LLM"/><category term="Machine-Learning"/><summary type="html"><![CDATA[DeepSeek has NSA (Native Sparse Attention), while we have PSA (Progressive Sparse Attention).]]></summary></entry><entry><title type="html">In The Era of AI, Where Are The Opportunities for The Storage Industry?</title><link href="https://pfzuo.github.io/blog/2024/In-the-era-of-AI,-where-are-the-opportunities-for-the-storage-industry/" rel="alternate" type="text/html" title="In The Era of AI, Where Are The Opportunities for The Storage Industry?"/><published>2024-10-20T00:00:00+00:00</published><updated>2024-10-20T00:00:00+00:00</updated><id>https://pfzuo.github.io/blog/2024/In%20the%20era%20of%20AI,%20where%20are%20the%20opportunities%20for%20the%20storage%20industry</id><content type="html" xml:base="https://pfzuo.github.io/blog/2024/In-the-era-of-AI,-where-are-the-opportunities-for-the-storage-industry/"><![CDATA[<p><a href="https://zhuanlan.zhihu.com/p/3462257980?utm_psn=1895960067028791774">In the era of AI, where are the opportunities for the storage industry?</a></p>]]></content><author><name></name></author><category term="AI"/><category term="LLM"/><category term="Machine-Learning"/><summary type="html"><![CDATA[In the era of AI, where are the opportunities for the storage industry?]]></summary></entry><entry><title type="html">Install and Run ISPASS2009-benchmarks on GPGPU-Sim</title><link href="https://pfzuo.github.io/blog/2019/Install-and-Run-ISPASS2009-Benchmarks-on-GPGPUSim/" rel="alternate" type="text/html" title="Install and Run ISPASS2009-benchmarks on GPGPU-Sim"/><published>2019-01-10T00:00:00+00:00</published><updated>2019-01-10T00:00:00+00:00</updated><id>https://pfzuo.github.io/blog/2019/Install%20and%20Run%20ISPASS2009%20Benchmarks%20on%20GPGPUSim</id><content type="html" xml:base="https://pfzuo.github.io/blog/2019/Install-and-Run-ISPASS2009-Benchmarks-on-GPGPUSim/"><![CDATA[<p><a href="https://github.com/gpgpu-sim/ispass2009-benchmarks">ISPASS2009-Benchmarks</a> are used in the ISPASS 2009 paper on GPGPU-Sim for evaluation. The benchmark suite includes 11 benchmarks, i.e., AES, BFS, CP, LPS, LIB, MUM, NN, NQU, RAY, STO, and WP. 
Please follow the steps below to install and run the ISPASS2009 benchmarks.</p> <blockquote> <h4 id="1-build-the-nvidia-cuda-sdk-benchmarks">1 Build the NVIDIA CUDA SDK benchmarks</h4> </blockquote> <p>1) Install the NVIDIA driver if you do not have one:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-get install nvidia-340
</code></pre></div></div> <p>Some errors may occur during installation; they can be safely ignored.</p> <p>2) We already installed the NVIDIA CUDA SDK benchmarks (i.e., the GPU Computing SDK code samples) when installing GPGPU-Sim. We now build them:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd ~/NVIDIA_GPU_Computing_SDK
make
</code></pre></div></div> <p>During the build, if the error <code class="language-plaintext highlighter-rouge">/usr/bin/ld: cannot find -lOpenCL collect2: ld returned 1 exit status ../../common/common_opencl.mk:254: recipe for target '../../..//OpenCL//bin//linux/release/oclPostprocessGL' failed</code> occurs, make the following modifications:</p> <p>     (a) Edit <code class="language-plaintext highlighter-rouge">./C/common/common.mk</code>: in lines like <code class="language-plaintext highlighter-rouge">LIB += … ${OPENGLLIB} …. $(RENDERCHECKGLLIB) …</code>, move <code class="language-plaintext highlighter-rouge">$(RENDERCHECKGLLIB)</code> before <code class="language-plaintext highlighter-rouge">${OPENGLLIB}</code>. There are 3 such lines.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LIB += $(RENDERCHECKGLLIB) ${OPENGLLIB} $(PARAMGLLIB) $(CUDPPLIB) ${LIB} -ldl -rdynamic
LIB += -lcuda   $(RENDERCHECKGLLIB) ${OPENGLLIB} $(PARAMGLLIB) $(CUDPPLIB) ${LIB}
LIB += $(RENDERCHECKGLLIB) ${OPENGLLIB} $(PARAMGLLIB) $(CUDPPLIB) ${LIB}
</code></pre></div></div> <p>     (b) Similarly, edit <code class="language-plaintext highlighter-rouge">./CUDALibraries/common/common.mk</code></p> <p>     (c) <code class="language-plaintext highlighter-rouge">cd ~/NVIDIA_GPU_Computing_SDK</code></p> <p>     (d) Edit <code class="language-plaintext highlighter-rouge">Makefile</code>. Comment out all lines containing <code class="language-plaintext highlighter-rouge">CUDALibraries</code> and <code class="language-plaintext highlighter-rouge">OpenCL</code>, as we only want the application binaries. Comment a line by placing <code class="language-plaintext highlighter-rouge">#</code> at its front.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># GPU Computing SDK Version 4.0.8
all:
    @$(MAKE) -C ./shared
    @$(MAKE) -C ./C
    #@$(MAKE) -C ./CUDALibraries
    #@$(MAKE) -C ./OpenCL

clean:
    @$(MAKE) -C ./shared clean
    @$(MAKE) -C ./C clean
    #@$(MAKE) -C ./CUDALibraries clean
    #@$(MAKE) -C ./OpenCL clean

clobber:
    @$(MAKE) -C ./shared clobber
    @$(MAKE) -C ./C clobber
    #@$(MAKE) -C ./CUDALibraries clobber
    #@$(MAKE) -C ./OpenCL clobber
</code></pre></div></div> <p>     (e) <code class="language-plaintext highlighter-rouge">make</code></p> <p>The NVIDIA CUDA SDK benchmarks are now installed. All executable files are located in the folder <code class="language-plaintext highlighter-rouge">~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/</code>.</p> <p>3) Test GPGPU-Sim using one of the NVIDIA CUDA SDK benchmarks:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd /home/gpgpu-sim_distribution/test
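# run one of the SDK sample binaries (substitute a real sample name below):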
~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/&lt;benchmark_name&gt;
</code></pre></div></div> <blockquote> <h4 id="2-build-the-ispass2009-benchmarks">2 Build the ISPASS2009-Benchmarks</h4> </blockquote> <p>1) Download ISPASS2009-Benchmarks</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd /home/gpgpu-sim_distribution
git clone https://github.com/gpgpu-sim/ispass2009-benchmarks.git
cd ispass2009-benchmarks/
</code></pre></div></div> <p>2) Define the following environment variables at the top of <code class="language-plaintext highlighter-rouge">Makefile.ispass-2009</code>:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>export CUDA_INSTALL_PATH=/usr/local/cuda
NVIDIA_COMPUTE_SDK_LOCATION=/root/NVIDIA_GPU_Computing_SDK
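# note: adjust this path if your GPU Computing SDK is installed under a different home directory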
</code></pre></div></div> <p>3) Comment out the benchmarks that fail to build, e.g., AES, DG, and WP, in <code class="language-plaintext highlighter-rouge">Makefile.ispass-2009</code>:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#$(SETENV) make noinline=$(noinline) -C AES
#$(SETENV) make noinline=$(noinline) -C DG/3rdParty/ParMetis-3.1
#$(SETENV) make noinline=$(noinline) -C DG
#$(SETENV) make noinline=$(noinline) -C WP
</code></pre></div></div> <p>4) Build the benchmarks</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"make -f Makefile.ispass-2009
</code></pre></div></div> <p>The generated binaries are in the <code class="language-plaintext highlighter-rouge">./bin/release/</code> folder.</p> <p>5) Source <code class="language-plaintext highlighter-rouge">setup_environment</code> and place links to the GPU configuration files:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd /home/gpgpu-sim_distribution
source setup_environment 
cd ispass2009-benchmarks/
./setup_config.sh GTX480
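# this links the chosen GPU configuration files into the benchmark directories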
</code></pre></div></div> <p>You can also change the GPU type (e.g., to <code class="language-plaintext highlighter-rouge">TeslaC2050</code>) with the following commands:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./setup_config.sh --cleanup
./setup_config.sh TeslaC2050
</code></pre></div></div> <p>6) Run a benchmark such as <code class="language-plaintext highlighter-rouge">NN</code>:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd NN/
sh README.GPGPU-Sim
</code></pre></div></div>]]></content><author><name></name></author><category term="GPU"/><category term="Machine-Learning"/><category term="Benchmark"/><summary type="html"><![CDATA[ISPASS2009-Benchmarks are used in the ISPASS 2009 paper on GPGPU-Sim for evaluation. The benchmark suite includes 11 benchmarks, i.e., AES, BFS, CP, LPS, LIB, MUM, NN, NQU, RAY, STO, and WP. Please follow the steps below to install and run the ISPASS2009 benchmarks.]]></summary></entry><entry><title type="html">Install and Run GPGPU-Sim</title><link href="https://pfzuo.github.io/blog/2019/Install-and-Run-GPGPUSim/" rel="alternate" type="text/html" title="Install and Run GPGPU-Sim"/><published>2019-01-09T00:00:00+00:00</published><updated>2019-01-09T00:00:00+00:00</updated><id>https://pfzuo.github.io/blog/2019/Install%20and%20Run%20GPGPUSim</id><content type="html" xml:base="https://pfzuo.github.io/blog/2019/Install-and-Run-GPGPUSim/"><![CDATA[<p><a href="http://www.gpgpu-sim.org/">GPGPU-Sim</a> is a cycle-level simulator for modeling contemporary GPUs running CUDA and OpenCL workloads. GPGPU-Sim currently supports simulating four GPU architectures: GTX480, QuadroFX5600, QuadroFX5800, and TeslaC2050. This blog introduces the detailed steps to install and run GPGPU-Sim.</p> <blockquote> <h4 id="1-download-and-install-nvdia-cuda-40">1 Download and Install NVIDIA CUDA 4.0</h4> </blockquote> <p>GPGPU-Sim has to be run with NVIDIA CUDA and does not support CUDA versions newer than 4.0. Hence, we should first install NVIDIA CUDA 4.0. My machine runs Ubuntu 18.04 with gcc 7.3.0. To install NVIDIA CUDA 4.0, please follow the steps below.</p> <p><strong>1)</strong> Download the <a href="https://developer.nvidia.com/cuda-toolkit-40">CUDA Toolkit for Ubuntu Linux 10.10</a> and <a href="https://developer.nvidia.com/cuda-toolkit-40">GPU Computing SDK code samples</a> from the NVIDIA website.</p> <p><strong>2)</strong> Install the CUDA Toolkit for Ubuntu Linux 10.10 first:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>chmod +x cudatoolkit_4.0.17_linux_64_ubuntu10.10.run
sudo ./cudatoolkit_4.0.17_linux_64_ubuntu10.10.run
</code></pre></div></div> <p>The CUDA Toolkit is installed under <code class="language-plaintext highlighter-rouge">/usr/local/cuda</code> by default.</p> <p>3) Add the CUDA Toolkit path to the <code class="language-plaintext highlighter-rouge">~/.bashrc</code> file:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo 'export PATH=$PATH:/usr/local/cuda/bin' &gt;&gt; ~/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib:/usr/local/cuda/lib64' &gt;&gt; ~/.bashrc
source ~/.bashrc
</code></pre></div></div> <p>4) Install GPU Computing SDK code samples:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>chmod +x gpucomputingsdk_4.0.17_linux.run
sudo ./gpucomputingsdk_4.0.17_linux.run
</code></pre></div></div> <p>The GPU Computing SDK is installed under <code class="language-plaintext highlighter-rouge">~/NVIDIA_GPU_Computing_SDK</code> by default.</p> <p>5) Install gcc-4.4 and g++-4.4 (since CUDA 4.0 supports gcc only up to version 4.4):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> apt-get install gcc-4.4 g++-4.4
</code></pre></div></div> <p>If the error <code class="language-plaintext highlighter-rouge">package gcc-4.4 is not available, but is referred to by another package</code> occurs, follow these steps to address it:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vim /etc/apt/sources.list
</code></pre></div></div> <p>Add the following two lines to the opened file:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>deb http://dk.archive.ubuntu.com/ubuntu/ trusty main universe
deb http://dk.archive.ubuntu.com/ubuntu/ trusty-updates main universe 
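# the trusty (Ubuntu 14.04) repositories still provide the gcc-4.4 and g++-4.4 packages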
</code></pre></div></div> <p>Then, update the apt source:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-get update
</code></pre></div></div> <p>gcc-4.4 and g++-4.4 are now installed.</p> <p>6) Switch the system gcc/g++ to gcc-4.4/g++-4.4:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 150
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.4 100
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-7 150
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.4 100
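# the higher priority (150) keeps gcc-7 as the automatic default; use --config below to select 4.4 manually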
</code></pre></div></div> <p>Select the 4.4 version by using <code class="language-plaintext highlighter-rouge">update-alternatives</code>:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo update-alternatives --config gcc
sudo update-alternatives --config g++
</code></pre></div></div> <blockquote> <h4 id="2-download-and-install-gpgpu-sim">2 Download and Install GPGPU-Sim</h4> </blockquote> <p>1) Download GPGPU-Sim from GitHub</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/gpgpu-sim/gpgpu-sim_distribution.git
</code></pre></div></div> <p>2) Install dependencies</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo apt-get install build-essential xutils-dev bison zlib1g-dev flex libglu1-mesa-dev
sudo apt-get install doxygen graphviz
sudo apt-get install python-pmw python-ply python-numpy libpng12-dev python-matplotlib
sudo apt-get install libxi-dev libxmu-dev freeglut3-dev
</code></pre></div></div> <p>3) Add the CUDA_INSTALL_PATH into the <code class="language-plaintext highlighter-rouge">~/.bashrc</code> file:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo 'export CUDA_INSTALL_PATH=/usr/local/cuda' &gt;&gt; ~/.bashrc
source ~/.bashrc
</code></pre></div></div> <p>4) Build GPGPU-Sim:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make
</code></pre></div></div> <p>During the build, if the error <code class="language-plaintext highlighter-rouge">cuobjdump.l:110: error: unterminated comment cuobjdump.l:108: error: expected declaration or statement at end of input</code> occurs, remove the comments at lines 108–109 of cuobjdump.l.</p> <p>5) Run GPGPU-Sim:</p> <p>Copy the contents of a GPU config, e.g., <code class="language-plaintext highlighter-rouge">configs/GTX480/*</code>, to your application’s working directory, and then run a CUDA application.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mkdir test
cd test/
cp ../configs/GTX480/* ./
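# now run a CUDA binary from this directory so it picks up the copied GPGPU-Sim config files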
</code></pre></div></div>]]></content><author><name></name></author><category term="GPU"/><category term="Machine-Learning"/><category term="Simulator"/><summary type="html"><![CDATA[GPGPU-Sim is a cycle-level simulator for modeling contemporary GPUs running CUDA and OpenCL workloads. GPGPU-Sim currently supports simulating four GPU architectures: GTX480, QuadroFX5600, QuadroFX5800, and TeslaC2050. This blog introduces the detailed steps to install and run GPGPU-Sim.]]></summary></entry><entry><title type="html">Using Quartz to Simulate Persistent Memory</title><link href="https://pfzuo.github.io/blog/2017/Using-Quartz-to-simulate-Persistent-Memory/" rel="alternate" type="text/html" title="Using Quartz to Simulate Persistent Memory"/><published>2017-07-22T00:00:00+00:00</published><updated>2017-07-22T00:00:00+00:00</updated><id>https://pfzuo.github.io/blog/2017/Using%20Quartz%20to%20simulate%20Persistent%20Memory</id><content type="html" xml:base="https://pfzuo.github.io/blog/2017/Using-Quartz-to-simulate-Persistent-Memory/"><![CDATA[<p><a href="http://wiki.nvmain.org/">NVMain</a>, introduced in previous posts, is an architecture-level non-volatile memory simulator intended mainly for computer-architecture researchers. NVMain suits hardware-level NVM research, such as NVM write policies, wear-leveling strategies, and memory-controller design. Because it must model hardware-level characteristics of NVM, including timing, energy consumption, and write endurance, workloads run on NVMain far more slowly than on a real DRAM system.</p> <p>Systems-software researchers, however, mainly care about the performance (latency/throughput) of system software on NVM and do not need to modify NVM hardware mechanisms. Most features of architecture-level simulators like NVMain are unnecessary for systems-software research, and the slow simulation speed rules out large-scale workloads. Hewlett Packard therefore developed a lightweight DRAM-based NVM emulator for systems-software researchers: <a href="https://github.com/HewlettPackard/quartz">Quartz</a>. Workloads running on Quartz achieve speeds close to those on a real DRAM system. Quartz supports only three CPU architectures: Sandy Bridge, Ivy Bridge, and Haswell (note that CPUs of other architectures cannot use Quartz). The usage of Quartz is described below:</p> <blockquote> <h4 id="1-下载和安装quartz">1 Download and Install Quartz</h4> </blockquote> <p>The Quartz source code is open-sourced on GitHub and can be downloaded directly:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/HewlettPackard/quartz.git
</code></pre></div></div> <p>Installing Quartz requires some dependency libraries; run the provided <code class="language-plaintext highlighter-rouge">install.sh</code> script to install all of them automatically:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo scripts/install.sh
</code></pre></div></div> <p>Compile the Quartz source code with the following commands:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mkdir build
cd build
cmake ..
make clean all
</code></pre></div></div> <p>After compilation, a shared library <code class="language-plaintext highlighter-rouge">libnvmemul.so</code> is generated under <code class="language-plaintext highlighter-rouge">./build/lib/</code>.</p> <blockquote> <h4 id="2-运行quartz">2 Run Quartz</h4> </blockquote> <p>First, load the emulator kernel module by running the following command in the Quartz root directory:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo scripts/setupdev.sh load
</code></pre></div></div> <p>Set the CPU to run at its maximum frequency:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
</code></pre></div></div> <p>If the machine’s Linux kernel version is 4.0 or above, the following command is also required:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo 2 | sudo tee /sys/devices/cpu/rdpmc
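# writing 2 permits unrestricted user-space rdpmc access, which the emulator relies on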
</code></pre></div></div> <p>Run your own program:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>scripts/runenv.sh &lt;your_app&gt;
</code></pre></div></div> <blockquote> <h4 id="2-模拟nvm延迟">2 模拟NVM延迟</h4> </blockquote> <p>Quartz目前的版本不能同时模拟NVM的延迟和带宽。只能让带宽不变模拟不同的延迟，或让延迟不变模拟不同的带宽。模拟带宽我们一般用不到，这里主要介绍怎样模拟NVM延迟。</p> <p>模拟读延迟：NVM的读延迟可以直接在根目录下的<code class="language-plaintext highlighter-rouge">./nvmemul.ini</code>文件中配置（里面的写延时配置好像并没有用）：</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>read = 200;
<p>Simulating write latency: the current version of Quartz does not support write-latency emulation, so we have to implement it ourselves. Since NVM is generally used as persistent memory, every CPU write to NVM needs the CLFLUSH instruction (cache-line flush) to flush dirty data from the CPU cache back to NVM, with the MFENCE instruction (memory fence) guaranteeing the ordering of cache-line flushes. To emulate NVM write latency, we inject an extra delay after each CLFLUSH instruction.</p> <p>For reference code that injects delays after the MFENCE and CLFLUSH instructions, see the <code class="language-plaintext highlighter-rouge">pflush.c</code> file under <code class="language-plaintext highlighter-rouge">./quartz-master/src/lib/</code>.</p> <blockquote> <h4 id="3-编写基于persistent-memory的程序">4 Write Persistent-Memory-Based Programs</h4> </blockquote> <p>In programs based on persistent memory, memory allocation and deallocation must use the corresponding Quartz functions <code class="language-plaintext highlighter-rouge">pmalloc</code> and <code class="language-plaintext highlighter-rouge">pfree</code>. The program therefore needs to include the header <code class="language-plaintext highlighter-rouge">./quartz-master/src/lib/pmalloc.h</code> and link against the <code class="language-plaintext highlighter-rouge">libnvmemul.so</code> shared library at compile time.</p> <p>Writes to persistent memory must be flushed back to NVM with CLFLUSH, with MFENCE guaranteeing the ordering of multiple CLFLUSH instructions.</p> <p>For data larger than an atomic write (usually 8 bytes), logging or copy-on-write (CoW) is additionally required to guarantee consistency.</p>]]></content><author><name></name></author><category term="NVM"/><category term="Simulator"/><summary type="html"><![CDATA[NVMain, introduced in previous posts, is an architecture-level non-volatile memory simulator intended mainly for computer-architecture researchers. NVMain suits hardware-level NVM research, such as NVM write policies, wear-leveling strategies, and memory-controller design. Because it must model hardware-level characteristics of NVM, including timing, energy consumption, and write endurance, workloads run on NVMain far more slowly than on a real DRAM system.]]></summary></entry><entry><title type="html">Configure Gem5 with NVMain to Simulate Non-volatile Memory</title><link href="https://pfzuo.github.io/blog/2017/Configure-GEM5-with-NVMain-to-simulate-Non-valotile-Memories/" rel="alternate" type="text/html" title="Configure Gem5 with NVMain to Simulate Non-volatile Memory"/><published>2017-01-12T00:00:00+00:00</published><updated>2017-01-12T00:00:00+00:00</updated><id>https://pfzuo.github.io/blog/2017/Configure%20GEM5%20with%20NVMain%20to%20simulate%20Non%20valotile%20Memories</id><content type="html" xml:base="https://pfzuo.github.io/blog/2017/Configure-GEM5-with-NVMain-to-simulate-Non-valotile-Memories/"><![CDATA[<p><a href="http://wiki.nvmain.org/">NVMain</a> is an architecture-level non-volatile memory simulator that accurately models the timing and energy consumption of the memory system. NVMain needs to run inside the <a href="http://www.m5sim.org/Main_Page">GEM5</a> full-system simulator.</p> <blockquote> <h4 id="1-安装mercurial">1 Install Mercurial</h4> </blockquote> <p>Integrating NVMain into GEM5 requires a source-control management tool, <a href="https://www.mercurial-scm.org/">Mercurial</a>; please install it yourself and learn its basic usage.</p> <blockquote> <h4 id="2-安装gem5">2 Install GEM5</h4> </blockquote> <p>Download GEM5 with the <code class="language-plaintext highlighter-rouge">hg clone</code> command (the latest version of GEM5 is recommended):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hg clone http://repo.gem5.org/gem5
</code></pre></div></div> <p>To configure the GEM5 runtime environment, refer to this <a href="http://pfzuo.github.io/2016/04/30/Install-and-Run-GEM5-in-Unbuntu-14.04/">tutorial</a>.</p> <blockquote> <h4 id="3-配置hgrc文件">3 Configure the hgrc File</h4> </blockquote> <p>3.1 Open the hgrc file:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vim ~/.hgrc
</code></pre></div></div> <p>3.2 Add the following content to the hgrc file, and change the relevant settings (e.g., <code class="language-plaintext highlighter-rouge">username</code>, <code class="language-plaintext highlighter-rouge">style</code>, <code class="language-plaintext highlighter-rouge">from</code>) to your own information:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[ui]
# Set the username you will commit code with
username=Your Name &lt;your@email.address&gt;
ssh = ssh -C
# Always use git diffs since they contain permission changes and rename info
[defaults]
qrefresh = --git
email = --git
diff = --git
[extensions]
# These are various extensions we find useful
# Mercurial Queues -- allows managing of changes as a series of patches
hgext.mq =
# PatchBomb -- send a series of changesets as e-mailed patches
hgext.patchbomb =
# External Diff tool (e.g. kdiff3, meld, vimdiff, etc)
hgext.extdiff =
# Fetch allows for a pull/update operation to be done with one command and automatically commits a merge changeset
hgext.fetch =
# Path to the style file for the M5 repository
# This file enforces our coding style requirements
style = /path/to/your/m5/util/style.py
[email]
method = smtp
from = Your Name &lt;your@email.address&gt;
[smtp]
host = your.smtp.server.here
</code></pre></div></div> <blockquote> <h4 id="4-download-nvmain">4 Download NVMain</h4> </blockquote> <p>4.1 Register a <a href="https://bitbucket.org/">bitbucket</a> account;</p> <p>4.2 Follow the instructions on the <a href="http://wiki.nvmain.org/index.php?n=Site.GettingNVMain">NVMain website</a> to obtain access to NVMain;</p> <p>4.3 Enter the GEM5 root directory and download NVMain with the <code class="language-plaintext highlighter-rouge">hg clone</code> command;</p> <blockquote> <h4 id="5-install-the-nvmain-patch">5 Install the NVMain Patch</h4> </blockquote> <p>5.1 Enter the GEM5 root directory;</p> <p>5.2 Initialize queues in gem5:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hg qinit
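# (qinit creates the mq patch queue under .hg/patches)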
</code></pre></div></div> <p>5.3 Import the NVMain patch:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hg qimport -f ./nvmain/patches/gem5/nvmain2-gem5-10688+
</code></pre></div></div> <p>5.4 Apply the patch:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hg qpush
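# optional sanity check: list the patches currently applied (mq extension)
hg qapplied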
</code></pre></div></div> <blockquote> <h4 id="6-compile-gem5-with-nvmain">6 Compile GEM5 with NVMain</h4> </blockquote> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>scons EXTRAS=nvmain ./build/X86/gem5.opt
</code></pre></div></div>]]></content><author><name></name></author><category term="GEM5"/><category term="NVMain"/><category term="Simulator"/><category term="NVM"/><summary type="html"><![CDATA[NVMain is an architecture-level non-volatile memory simulator that accurately models the timing and energy consumption of the memory system. NVMain runs inside the GEM5 full-system simulator.]]></summary></entry><entry><title type="html">Compile and Debug SPEC CPU2006 in Linux</title><link href="https://pfzuo.github.io/blog/2016/Compile-and-debug-spec-cpu-2006-in-linux/" rel="alternate" type="text/html" title="Compile and Debug SPEC CPU2006 in Linux"/><published>2016-06-12T00:00:00+00:00</published><updated>2016-06-12T00:00:00+00:00</updated><id>https://pfzuo.github.io/blog/2016/Compile%20and%20debug%20spec%20cpu%202006%20in%20linux</id><content type="html" xml:base="https://pfzuo.github.io/blog/2016/Compile-and-debug-spec-cpu-2006-in-linux/"><![CDATA[<p>SPEC CPU 2006 is a fairly old benchmark suite, so compiling it on newer Linux systems runs into compatibility problems. During compilation, a few changes to the SPEC CPU 2006 source code are needed to make it build on a newer Linux system. Using CentOS 7 as an example, this post walks through compiling SPEC CPU 2006 on Linux.</p> <h4 id="compile">Compile</h4> <p>First, the <code class="language-plaintext highlighter-rouge">install.sh</code> script shipped with SPEC CPU 2006 fails to run because of these compatibility issues, so we need to rebuild the tools from source. Enter the <code class="language-plaintext highlighter-rouge">./tools/src</code> directory and run the <code class="language-plaintext highlighter-rouge">buildtools</code> script:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./buildtools
</code></pre></div></div> <h4 id="debug">Debug</h4> <p>Several errors come up during the build. They are listed below, each with its fix.</p> <ol> <li> <p>error building specmd5sum</p> <p>Compiling specmd5sum fails with the following error:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> gcc -DHAVE_CONFIG_H    -I/home/gem5/cpu2006/tools/output/include   -I. -Ilib  -c -o md5sum.o md5sum.c
 In file included from md5sum.c:38:0:
 lib/getline.h:31:1: error: conflicting types for 'getline'
  getline PARAMS ((char **_lineptr, size_t *_n, FILE *_stream));
  ^
 In file included from md5sum.c:26:0:
 /usr/include/stdio.h:678:20: note: previous declaration of 'getline' was here
  extern _IO_ssize_t getline (char **__restrict __lineptr,
             ^
 In file included from md5sum.c:38:0:
 lib/getline.h:34:1: error: conflicting types for 'getdelim'
  getdelim PARAMS ((char **_lineptr, size_t *_n, int _delimiter, FILE *_stream));
  ^
 In file included from md5sum.c:26:0:
 /usr/include/stdio.h:668:20: note: previous declaration of 'getdelim' was here
   extern _IO_ssize_t getdelim (char **__restrict __lineptr,
                      ^
 make: *** [md5sum.o] Error 1
 + testordie 'error building specmd5sum'
 + test 2 -ne 0
 + echo '!!! error building specmd5sum'
 !!! error building specmd5sum
 + kill -TERM 1299
 + exit 1
 !!!!! buildtools killed
</code></pre></div> </div> <p>The root cause is a function conflict: the stdio.h library already declares the getline and getdelim functions, and SPEC CPU 2006's getline.h declares them again.</p> <p>Fix: open the <code class="language-plaintext highlighter-rouge">./tools/src/specmd5sum/md5sum.c</code> file and comment out the <code class="language-plaintext highlighter-rouge">getline.h</code> header (line 38):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> //#include "getline.h"
</code></pre></div> </div> </li> <li> <p>error building Perl</p> <p>Compiling Perl fails with the following two errors.</p> <p>ERROR 1:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> collect2: error: ld returned 1 exit status
 make: *** [miniperl] Error 1
 + testordie 'error building Perl'
 + test 2 -ne 0
 + echo '!!! error building Perl'
 !!! error building Perl
 + kill -TERM 15173
 + exit 1
 !!!!! buildtools killed
</code></pre></div> </div> <p>ERROR 2:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> t/op/sprintf..............................FAILED--no leader found
 t/op/sprintf2.............................FAILED--expected 263 tests, saw 3
</code></pre></div> </div> <p>Causes:</p> <p>1) Newer Linux kernels have removed the <code class="language-plaintext highlighter-rouge">asm/page.h</code> header;</p> <p>2) Configuring perl requires the math library;</p> <p>Fixes:</p> <p>1) Open the <code class="language-plaintext highlighter-rouge">./tools/src/perl-5.8.8/ext/IPC/SysV/SysV.xs</code> file and comment out the <code class="language-plaintext highlighter-rouge">asm/page.h</code> header (line 7):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> //#   include &lt;asm/page.h&gt;
</code></pre></div> </div> <p>2) Open the <code class="language-plaintext highlighter-rouge">./tools/src/buildtools</code> file and change the perl build section (lines 333 and 334) as follows.</p> <p>Before:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> LD_LIBRARY_PATH=`pwd`
 DYLD_LIBRARY_PATH=`pwd`
 export LD_LIBRARY_PATH DYLD_LIBRARY_PATH
 ./Configure -dOes -Ud_flock $PERLFLAGS -Ddosuid=undef -Dprefix=$INSTALLDIR -Dd_bincompat3=undef -A ldflags=-L${INSTALLDIR}/lib -A ccflags=-I${INSTALLDIR}/include -Ui_db -Ui_gdbm -Ui_ndbm -Ui_dbm -Uuse5005threads ; testordie "error configuring perl"
</code></pre></div> </div> <p>After:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> LD_LIBRARY_PATH=`pwd`
 DYLD_LIBRARY_PATH=`pwd`
 ./Configure -Dcc="gcc -lm" -Dlibpth='/usr/local/lib64 /lib64 /usr/lib64' -dOes -Ud_flock $PERLFLAGS -Ddosuid=undef -Dprefix=$INSTALLDIR -Dd_bincompat3=undef -A ldflags=-L${INSTALLDIR}/lib -A ccflags=-I${INSTALLDIR}/include -Ui_db -Ui_gdbm -Ui_ndbm -Ui_dbm -Uuse5005threads ; testordie "error configuring perl"	
</code></pre></div> </div> </li> </ol>]]></content><author><name></name></author><category term="Benchmark"/><category term="Simulator"/><summary type="html"><![CDATA[SPEC CPU 2006 is a fairly old benchmark suite, so compiling it on newer Linux systems runs into compatibility problems. A few changes to the SPEC CPU 2006 source code are needed to make it build on a newer Linux system. Using CentOS 7 as an example, this post walks through compiling SPEC CPU 2006 on Linux.]]></summary></entry><entry><title type="html">Configure and Run PARSEC-2.1 Benchmark in Gem5</title><link href="https://pfzuo.github.io/blog/2016/Configure-and-run-parsec-2.1-benchmark-in-GEM5/" rel="alternate" type="text/html" title="Configure and Run PARSEC-2.1 Benchmark in Gem5"/><published>2016-06-06T00:00:00+00:00</published><updated>2016-06-06T00:00:00+00:00</updated><id>https://pfzuo.github.io/blog/2016/Configure%20and%20run%20parsec%202.1%20benchmark%20in%20GEM5</id><content type="html" xml:base="https://pfzuo.github.io/blog/2016/Configure-and-run-parsec-2.1-benchmark-in-GEM5/"><![CDATA[<p>The previous post covered running the PARSEC Benchmark standalone on Linux; this one describes how to configure and run the PARSEC Benchmark inside the GEM5 simulator (using the ALPHA architecture as the example). The PARSEC Benchmark must run in GEM5's full-system mode, and the setup is similar to Section 2.4 of the post before last. A related tutorial is available at <a href="http://www.m5sim.org/PARSEC_benchmarks">http://www.m5sim.org/PARSEC_benchmarks</a> .</p> <ol> <li> <p>First create a directory to store the PARSEC Benchmark disk image:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> mkdir full_system_images
 cd full_system_images
</code></pre></div> </div> </li> <li> <p>Download the initial system files, extract them, and rename the directory (renaming is optional):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> wget http://www.m5sim.org/dist/current/m5_system_2.0b3.tar.bz2
 tar jxvf m5_system_2.0b3.tar.bz2
 mv m5_system_2.0b3 system
</code></pre></div> </div> <p>After extraction, the directory structure looks like this:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> system/
     binaries/
          console
          ts_osfpal
          vmlinux
     disks/
          linux-bigswap2.img
          linux-latest.img
</code></pre></div> </div> </li> <li> <p>Download the PARSEC Benchmark files and replace the corresponding files in the system directory.</p> <p>Download the PARSEC linux kernel image and replace 'system/binaries/vmlinux':</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> cd ./system/binaries/
 wget http://www.cs.utexas.edu/~parsec_m5/vmlinux_2.6.27-gcc_4.3.4
 rm vmlinux
 mv vmlinux_2.6.27-gcc_4.3.4 vmlinux
</code></pre></div> </div> <p>Download the PARSEC PAL code file and replace 'system/binaries/ts_osfpal':</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> wget http://www.cs.utexas.edu/~parsec_m5/tsb_osfpal
 rm ts_osfpal
 mv tsb_osfpal ts_osfpal
</code></pre></div> </div> <p>Download the PARSEC-2.1 disk image and decompress it:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> cd ../disks/
 wget http://www.cs.utexas.edu/~parsec_m5/linux-parsec-2-1-m5-with-test-inputs.img.bz2
 bzip2 -d linux-parsec-2-1-m5-with-test-inputs.img.bz2
</code></pre></div> </div> </li> <li> <p>Enter the gem5 directory and edit two files (SysPaths.py and Benchmarks.py) to configure the parsec disk-image path and file name.</p> <p>Open SysPaths.py and set the full path to the parsec disk image:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> vim ./configs/common/SysPaths.py
</code></pre></div> </div> <p>Before:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> path = [ '/dist/m5/system', '/n/poolfs/z/dist/m5/system' ]
</code></pre></div> </div> <p>After:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> path = [ '/dist/m5/system', '/home/full_system_images/system' ]
</code></pre></div> </div> <p>Open Benchmarks.py and change the image file name:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> vim ./configs/common/Benchmarks.py
</code></pre></div> </div> <p>Before:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> elif buildEnv['TARGET_ISA'] == 'alpha':
     return env.get('LINUX_IMAGE', disk('linux-latest.img'))
</code></pre></div> </div> <p>After:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> elif buildEnv['TARGET_ISA'] == 'alpha':
     return env.get('LINUX_IMAGE', disk('linux-parsec-2-1-m5-with-test-inputs.img'))
</code></pre></div> </div> </li> <li> <p>Generate the script files used to run the benchmarks.</p> <p>Download the PARSEC script-generation package and extract it:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> wget http://www.cs.utexas.edu/~parsec_m5/TR-09-32-parsec-2.1-alpha-files.tar.gz
 tar zxvf TR-09-32-parsec-2.1-alpha-files.tar.gz
</code></pre></div> </div> <p>Command to generate a script:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ./writescripts.pl &lt;benchmark&gt; &lt;nthreads&gt;
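 # e.g., generate a run script for blackscholes with 4 threads (thread count is just an example)
 ./writescripts.pl blackscholes 4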
</code></pre></div> </div> <p>The following 13 benchmarks are available:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> blackscholes
 bodytrack
 canneal
 dedup
 facesim
 ferret
 fluidanimate
 freqmine
 streamcluster
 swaptions
 vips
 x264
 rtview
</code></pre></div> </div> </li> <li> <p>Run gem5 with the generated script:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ./build/ALPHA/gem5.opt ./configs/example/fs.py -n &lt;number&gt; --script=./path/to/runScript.rcS --caches --l2cache -F 5000000000
</code></pre></div> </div> </li> <li> <p>Open a new terminal and interact with the simulated system via telnet:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> telnet localhost 3456
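 # alternative to telnet: gem5's m5term client (build it first with make in util/term)
 ./util/term/m5term localhost 3456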
</code></pre></div> </div> </li> </ol>]]></content><author><name></name></author><category term="GEM5"/><category term="Benchmark"/><category term="Simulator"/><summary type="html"><![CDATA[The previous post covered running the PARSEC Benchmark standalone on Linux; this one describes how to configure and run the PARSEC Benchmark inside the GEM5 simulator (using the ALPHA architecture as the example). The PARSEC Benchmark must run in GEM5's full-system mode, and the setup is similar to Section 2.4 of the post before last. A related tutorial is available at http://www.m5sim.org/PARSEC_benchmarks .]]></summary></entry></feed>