Research Manuscript


Security
SEC3: Hardware Security: Attack & Defense
Description: Prior work shows that remote power attacks on Intel processors are possible through two Model Specific Registers (MSRs): MSR_PKG_ENERGY_STATUS and MSR_PP0_ENERGY_STATUS. In response, Intel introduced a defense: a bit in the MSR IA32_MISC_PACKAGE_CTLS lets users enable or disable a "filtering" mechanism that adds noise to energy measurements, making remote power attacks infeasible.
In this work, we demonstrate that "filtering" does not cover all possible avenues of measuring power. Intel server-grade platforms include components such as the out-of-band management interface (OOB), which also exposes telemetry such as in-band energy consumption. We first reverse engineer the protocol structure over which the OOB interface communicates with in-band components. We then show how the OOB interface allows read-only access to the Package Configuration Space (PCS) and note that energy readings through the PCS are outside the scope of filtering.
Using this, we re-enable remote power side channels on Intel SGX and TDX running on Intel Sapphire Rapids. We first construct a synchronization mechanism that aligns in-band execution with out-of-band measurements by leveraging deliberately disabled MSRs. We then use energy readings through the OOB PCS to recover 2048-bit RSA keys from MbedTLS running within in-band Intel SGX and TDX (under a generic single-stepping assumption). Finally, we also leak AES-NI keys from within in-band Intel SGX and TDX (without any single-stepping assumption).
Prior to our work, the side-channel literature has focused on attacks leveraging in-band interfaces. Our work establishes the importance of evaluating confidential-computing architectures against attack vectors that combine the capabilities of in-band and out-of-band interfaces to achieve adversarial objectives that neither interface can achieve independently.
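For context on the in-band interface that Intel's filtering defense targets, the sketch below samples the package-level RAPL energy counter from the Linux MSR device files. This is a generic illustration, not code from the paper; the MSR addresses used (0x606 for MSR_RAPL_POWER_UNIT, 0x611 for MSR_PKG_ENERGY_STATUS) are the publicly documented RAPL offsets, and it assumes root privileges with the msr kernel module loaded.

```python
# Minimal sketch: sample the in-band RAPL package-energy MSR on Linux.
# Assumes root privileges and the 'msr' kernel module (/dev/cpu/0/msr).
import os, struct, time

MSR_RAPL_POWER_UNIT = 0x606      # documented RAPL unit register
MSR_PKG_ENERGY_STATUS = 0x611    # documented package energy counter

def read_msr(addr, cpu=0):
    """Read a 64-bit MSR value via the Linux msr character device."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, addr)
    finally:
        os.close(fd)
    return struct.unpack("<Q", raw)[0]

# Energy status unit: bits 12:8 of MSR_RAPL_POWER_UNIT, energy in 1/2^ESU joules.
esu = (read_msr(MSR_RAPL_POWER_UNIT) >> 8) & 0x1F
joules_per_tick = 1.0 / (1 << esu)

e0 = read_msr(MSR_PKG_ENERGY_STATUS) & 0xFFFFFFFF   # 32-bit wrapping counter
time.sleep(0.1)
e1 = read_msr(MSR_PKG_ENERGY_STATUS) & 0xFFFFFFFF
delta = (e1 - e0) & 0xFFFFFFFF                       # handle counter wraparound
print(f"Package energy over 100 ms: {delta * joules_per_tick:.4f} J")
```

When the filtering bit in IA32_MISC_PACKAGE_CTLS is enabled, readings obtained through this MSR path are deliberately noised; the abstract's observation is that the equivalent energy counters exposed through the out-of-band PCS path are not covered by that filtering.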
Engineering Poster
Networking


Description: This work discusses an optimized placement methodology for achieving faster timing convergence and enhanced synchronization in a PCIe subsystem. The latency of data transmission plays a crucial role, and data transfer rates have improved significantly with standards like PCI Express. The physical implementation of PCI Express presents significant challenges, particularly due to the aspect ratio, limited routing resources, and the number of clocks in the design. We present an innovative
approach that simplifies the implementation of multilane PCIe systems. The challenges at each stage are identified, and corresponding solutions are proposed. First, we employ a source-synchronous approach to achieve synchronization between the data and clock nets, followed by a structure-based placement methodology that enhances the implementation process. The proposed methodology systematically eliminates crosstalk issues, reduces skew by 33%, decreases the depth of data and clock paths by 30%, achieves better timing convergence, and reduces turnaround time by 41%, thereby optimizing overall design performance.
Engineering Poster


Description: The quest for miniaturization in chip design for consumer and industrial SoC applications, such as low-end MCUs, is fraught with challenges that can inflate die size and compromise the PPA targets and efficiency aligned to the product specification. On an entry-level general-purpose MCU SoC, any die-size reduction has huge potential to save cost on every device and increase profit.
Improvements in padring design, IO channel estimates, CR overheads, and power delivery network (PDN) design resulted in better floorplan efficiency. Clock tree development addressed cell-density hot spots and skew reduction. We addressed early routability of the design in the reduced die area, including fewer net detours; early IR/EM, custom-route, and DRC max-density signoff checks; and various STA techniques to reduce timing ECO cycles and meet the schedule.
Technical challenges faced during floorplan and clock tree optimization led to a timely tapeout with better-than-expected quality of results. Several of the methods and techniques were reused in projects of similar complexity and size.
Engineering Poster


Description: The quest for miniaturization in chip design for automotive SoC applications, such as UWB smart key fobs, is fraught with challenges that can inflate die size and compromise the PPA targets and efficiency aligned to the product specification.
Challenge 1: synthesis methodology gaps in reducing logic, since timing optimization was not physically aware, was done only in functional mode (not in test modes), and boundary optimization was disabled.
Challenge 2: reducing floorplan overheads with respect to custom routing and DRV channel estimation in a multi-power-domain, mixed-signal SoC with 35 analog IPs, in which 100 custom routes had to be implemented.
Challenge 3: implementing an optimal power domain to avoid long routed signal nets through the secondary domain, while also mitigating crosstalk issues by allowing load splitting.
Challenge 4: custom placement of clock-module logic coupled with refined modular placement, which helped in robust clock tree development.
Challenge 5: higher insertion delay due to higher logic depth at the clock sources, resulting in high cell density due to over-buffering.
Challenge 6: meeting stringent metal tapeout (MTO/BEOL) timelines.
Research Manuscript


Design
DES5: Emerging Device and Interconnect Technologies
Description: The memory wall is a major bottleneck for continuing to improve the energy efficiency of computing systems. To overcome this challenge, various nanomaterials, devices, circuits, architectures, and three-dimensional (3D) integration techniques are under development for future memory solutions. However, major trade-offs must be made when designing memories to achieve high on-chip memory capacity, high retention time, high endurance, low access times, low access energy, and low static leakage power. We present an energy- and area-efficient embedded DRAM memory architecture (quantified by EDAP: the product of total energy consumption, circuit area footprint, and application execution time) that leverages monolithic 3D integration of three types of transistors: (i) Indium Gallium Zinc Oxide (IGZO) FETs for ultra-low off-state leakage currents, enabling high-retention-time DRAM; (ii) Carbon Nanotube FETs (CNFETs) for high on-state drive currents, leading to fast access times; and (iii) silicon CMOS for its combined energy efficiency and low off-state leakage current (for memory peripheral circuits implemented on the bottom physical circuit layer). Our resulting 333-eDRAM achieves each of the following simultaneously, which we quantify and describe how to co-optimize in this paper: high density, high retention time, high endurance, low access times, low access energy, and low static leakage power. We show full physical layout designs detailing how to implement 333-eDRAM and quantify EDAP for an ARM Cortex-M0 processor + on-chip 333-eDRAM implemented at a 7 nm technology node, running applications from the Embench benchmark suite. Using cycle-accurate simulations, SPICE circuit simulations, compact models calibrated to experimental data, and detailed full physical layout designs of 333-eDRAM memories, we show that on average (across 16 Embench benchmarks), ARM Cortex-M0 + IGZO/CNT/Si 333-eDRAM offers 1.7× better EDP and 4.47× better EDAP than ARM Cortex-M0 + silicon eDRAM.
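For clarity on the two figures of merit quoted in this abstract, the standard definitions (matching the in-text definition above, with E the total energy, t the application execution time, and A the circuit area footprint) are:

```latex
\mathrm{EDP} = E \cdot t,
\qquad
\mathrm{EDAP} = E \cdot t \cdot A
```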
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
Description: The design space for edge AI hardware supporting large language model (LLM) inference and continual learning is underexplored. We present 3D-CIMlet, a thermal-aware modeling and co-design framework for 2.5D/3D edge-LLM engines exploiting heterogeneous computing-in-memory (CIM) chiplets, adaptable for both inference and continual learning. We develop memory-reliability-aware chiplet mapping strategies for a case study of an edge LLM system integrating RRAM, capacitor-less eDRAM, and hybrid chiplets in mixed technology nodes. Compared to 2D baselines, 2.5D/3D designs improve energy efficiency by up to 9.3× and 12×, with up to 90.2% and 92.5% energy-delay product (EDP) reduction respectively, on edge LLM continual learning. The framework is open-sourced anonymously.
Research Manuscript


EDA
EDA7: Physical Design and Verification
Description: Standard-cell placement legalization is a critical step in physical design.
Emerging 3D ICs pose new efficiency and effectiveness challenges for traditional legalizers.
In this work, we present a fast flow-based legalization algorithm, 3D-Flow, that minimizes cell displacement in a 3D solution space.
Our legalizer resolves overflowed bins by finding the shortest augmenting path on a 3D grid graph, utilizing an effective branch-and-bound algorithm.
Moreover, a post-optimization with a cycle-canceling algorithm is proposed to minimize the maximum displacement.
Our approach leverages the global perspective inherent in network flow methods, considering multiple dies in 3D ICs to minimize cell displacement.
Experimental results on the ICCAD 2022 and 2023 contest benchmarks demonstrate that our proposed algorithm achieves up to 13% less average and 43% less maximum cell displacement than state-of-the-art legalizers at similar runtime.
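As a rough intuition for the flow formulation (a toy sketch under assumptions, not the paper's 3D-Flow algorithm or its branch-and-bound augmenting-path search), the snippet below treats the bins of a small two-die grid as nodes of a min-cost flow problem: overflowed bins supply excess cells, bins with free capacity absorb them through a zero-cost edge to a sink, and moving a cell between adjacent bins costs its displacement (with an arbitrary higher cost for die crossings). It assumes the networkx package and made-up capacities.

```python
# Toy flow-based legalization sketch: clear bin overflow on a 2-die grid
# by routing excess cells to nearby free bins at minimum total displacement.
import networkx as nx

DIES, ROWS, COLS = 2, 4, 4        # small 3D grid of placement bins (toy values)
CAP = 3                           # assumed cell capacity per bin
usage = {(d, r, c): 1 for d in range(DIES) for r in range(ROWS) for c in range(COLS)}
usage[(0, 1, 1)] = 6              # one overflowed bin

excess = sum(max(u - CAP, 0) for u in usage.values())
G = nx.DiGraph()
G.add_node("sink", demand=excess)                 # total excess must be absorbed
for b, u in usage.items():
    G.add_node(b, demand=-max(u - CAP, 0))        # overflowed bins supply cells
    G.add_edge(b, "sink", capacity=max(CAP - u, 0), weight=0)  # free slots absorb

for (d, r, c) in usage:
    for dd, dr, dc in [(0, 0, 1), (0, 0, -1), (0, 1, 0), (0, -1, 0), (1, 0, 0), (-1, 0, 0)]:
        nb = (d + dd, r + dr, c + dc)
        if nb in usage:
            # displacement cost: 1 per bin step, 4 (arbitrary) for a die crossing
            G.add_edge((d, r, c), nb, weight=4 if dd else 1)

flow = nx.min_cost_flow(G)                        # solved via augmenting paths internally
moves = [(src, dst, f) for src, out in flow.items()
         for dst, f in out.items() if f and dst != "sink"]
print(moves)                                      # per-edge cell moves that clear overflow
```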
Engineering Poster
Networking


Description: Analog and photonic ICs are implemented in Virtuoso, where designers co-design stacks, perform EM analysis, and use packageless technology with technology-independent abstracts for IC footprints. These stacks are then integrated into larger systems in Integrity 3DIC, which may involve additional subsystems such as other IC stacks, interposers, and packaging. This requires seamless transfer of the analog subsystem between Virtuoso and Integrity 3DIC, ensuring that changes made in one are reflected in the other. This includes tasks such as die positioning and optimal bump placement within the system context. However, challenges arise due to differing die-footprint formats, stack handling methods, and the complexity of advanced stacking (e.g., multiple ICs with cavities). Additionally, connectivity updates and bi-directional ECOs must be supported.
To address these challenges, we propose a mechanism that enables complete interoperability between Virtuoso and Integrity 3DIC. This flow ensures the seamless transfer of IC footprint, stacking, and connectivity information without data loss. Designers can implement, stack, edit, and analyze ICs in Virtuoso, developing subsystems that can be integrated into the full system in Integrity 3DIC. As photonic and complex stacking systems evolve, the proposed solution provides a robust framework for planning, implementation, bump management, and final sign-off.
Networking
Work-in-Progress Poster


Description: 3D-IC stacking and floorplanning play crucial roles in determining the performance of 3D-IC designs. The performance difference between optimal and non-optimal thermal results can exceed 10%. Traditional design methodologies consider macro placement for either performance or thermal behavior. This paper is the first work to present a 3D-IC stacking and floorplan design methodology that co-optimizes performance and thermal behavior by simultaneously considering the relationships among I/O ports, macros, and standard cells. The proposed approach first decomposes thermal effects into power aggregation and 3D-IC stacking characteristics and converts them into floorplan power design constraints. With these constraints, our method then builds pre-trained models using a machine-learning (ML) algorithm, and finally applies the pre-trained models in a practical two-phase macro placement framework that analyzes 3D-IC stacking and floorplans efficiently and rapidly to co-optimize performance and thermal behavior. Experimental results show that the proposed power design constraints efficiently identify critical hotspots above a user-defined temperature threshold on the power map, and that the accuracy of the ML-based pre-trained models reaches up to 98%. Using the pre-trained models, our approach can accurately estimate a thermal- and performance-co-optimal macro placement within 0.1 s.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
Description: Subgraph Graph Neural Networks (GNNs) are emerging as a promising approach to enhance GNN expressiveness, but their more complex graph structures with numerous independent and irregular subgraphs pose significant hardware deployment challenges. In this work, we propose 3D-SubG, a 3D stacked hybrid processing-near/in-memory accelerator for subgraph GNNs. With hybrid bonding packaging technology, a logic die is 3D stacked with a DRAM die for highly parallel memory accesses. The logic die employs digital SRAM-based processing-in-memory (PIM) macros to boost computation density and minimize data transfer. We further propose a bit-level non-zero gathering method to exploit graph sparsity for PIM, a workload-balanced mapping strategy for subgraph allocation onto different logic-to-DRAM blocks, and a distributed global pooling approach to reduce inter-block data movements. Experimental results show that 3D-SubG achieves average improvements of 146.11× in performance, 934.18× in area efficiency, and 1171.80× in energy efficiency compared to RTX 3090Ti.
Research Manuscript


AI
AI3: AI/ML Architecture Design
Description: The LLM decoding process poses a significant challenge for memory bandwidth due to its autoregressive nature. Prior 2D memory solutions fail to overcome this memory bottleneck due to limited memory-to-logic bandwidth. In this work, we propose 3D-TokSIM, a cross-stack solution by stacking 3D memory on logic die with a specially designed token-stationary compute-in-memory (CIM) to efficiently accelerate speculative decoding. Our CIM is developed with novel token-stationary dataflow to reduce data movement on logic die to save power and balance computation and memory access. To further reduce the buffer requirements, we perform architecture exploration and allocate notable CIM resources for achieving higher decoding parallelism. Compared to RTX 3090 GPU, 3D-TokSIM achieves 15.1× throughput and 324× energy efficiency improvements on speculative Llama2-7B decoding.
Engineering Presentation


Back-End Design
Chiplet
Description: 3DIC technology nowadays unveils a new avenue for expanding design capacity and integration density. It is beneficial for faster performance, lower power consumption, miniaturization, lower cost, etc. Nevertheless, it comes with new challenges that need to be addressed to make 3DIC solutions successful. Multiphysics effects are among the critical challenges, especially thermal dissipation. Various technologies aim to resolve thermal issues at a late design stage, but converging fatal thermal issues there can be very time-consuming. However, most commercial EDA tools do not support thermal-aware design planning at the early design stage. This paper presents a thermal-aware design optimization methodology that resolves thermal issues at the early design stage. We propose to create thermal constraints at the early P&R stage, and these constraints guide module floorplanning during early design optimization. Iteratively verifying and fixing thermal-constraint violations at the P&R stage, without running thermal simulation, quickly converges the thermal impact and the module floorplan. The solution models the thermal impact for early design optimization and shortens the design cycle between implementation and thermal verification.
Workshop


AI
Security
Sunday Program
Description: Security vulnerabilities in hardware designs are catastrophic because, once fabricated, they are nearly impossible to patch. Modern SoCs (systems-on-chip) face threats such as side-channel leakage, information leakage, access-control violations, and malicious functionality, jeopardizing the foundational integrity of SoCs. These vulnerabilities circumvent software-level defenses, creating urgent challenges for hardware security. Ensuring the security of hardware designs is challenging due to their huge complexity, aggressive time-to-market schedules, and the variety of attacks against hardware designs. Moreover, it is very costly for a design house to retain many security experts with in-depth design knowledge across diverse security implications. So, the semiconductor industry looks for a set of metrics, reusable security solutions, and automatic computer-aided design (CAD) tools to aid in analyzing, identifying, root-causing, and mitigating SoC security problems. Artificial intelligence (AI) is revolutionizing the landscape of CAD, providing unprecedented opportunities to tackle these challenges. AI-driven tools have the potential to analyze complex SoC designs at multiple abstraction levels, automatically detect vulnerabilities, and even predict potential attack vectors. By leveraging advanced AI models, including large language models (LLMs) and machine learning algorithms, we can now accelerate the identification of root causes, assess risks, and recommend security
countermeasures. The inclusion of AI in CAD/EDA for security addresses these issues in innovative ways, e.g., (1) Enhanced Vulnerability Detection, (2) Contextual Adaptability, and (3) Proactive Security.
Building on the resounding success of the 1st (inaugural) and 2nd CAD4Sec workshops, the 3rd iteration aims to embrace the transformative intersection of AI, CAD, and hardware security. Now rebranded as AICAD4Sec, this workshop aims to drive innovation at the nexus of AI-driven solutions and hardware design security. The ultimate vision of AICAD4Sec is to establish a cutting-edge platform that showcases advancements and sets the roadmap for secure, AI-enabled hardware design, specifically by (i) engaging experts from industry leaders like Google, Microsoft, Synopsys, and ARM, alongside academia and government agencies such as DARPA and AFRL; (ii) showcasing the latest breakthroughs in AI-enhanced CAD for security; (iii) facilitating practical demonstrations of AI-driven
solutions in hardware security by both industry and academia; and (iv) hosting a dynamic panel discussion on the evolving role of AI, with a particular focus on large language models and their implications for secure SoC design.
Building on the foundation of its predecessors, the 3rd AICAD4Sec workshop will feature several technical talks on metrics and CAD, including:
● CAD Tools for Side-Channel Vulnerability Assessment (Power, Timing, and Electromagnetic Leakage)
● Security-Oriented Equivalency Checking and Property Validation
● Fault Injection Analysis and Countermeasure Integration in CAD
● CAD for Secure Packaging and Heterogeneous Integration
● Assessment of Physical Probing and Reverse Engineering Risks
● AI-Powered Tools for Pre-Silicon Vulnerability Mitigation and Countermeasure Suggestions
● Large Language Models for Security-Aware Design Automation
● ML-Enhanced Threat Detection Across Design Abstractions
● AI-Augmented Detection of Malicious Functionality in Hardware Designs
● AI-Enabled Security Verification for Emerging SoC Architectures.
Research Manuscript


EDA
EDA3: Timing Analysis and Optimization
Description: Clock skew scheduling (CSS) is a well-known technique that improves design timing slack by adjusting clock latency to flip-flops.
CSS requires obtaining timing path information between sequential elements (including flip-flops and I/O ports), known as sequential graph extraction, which is the most time-consuming part of advanced CSS.
In this paper, to quickly identify the potential of clock skew in slack optimization, we propose an iterative CSS algorithm that leverages timing propagation to facilitate sequential graph extraction.
Then, we provide a comprehensive skew calculation method that considers multiple clock latency constraints, obtaining the target latency of each flip-flop.
Finally, we present slack optimization techniques to achieve the target latencies.
Our algorithm achieves a 49.11x speedup over the advanced CSS algorithm based on partial graph extraction, reducing the extracted edges by 90.05%.
Compared to a state-of-the-art CSS-based slack optimization methodology, our algorithm delivers a 27.01x speedup with superior slack improvement.
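To give a feel for what skew scheduling does (a generic, textbook-style slack-balancing iteration on a toy sequential graph with assumed delays and clock period, not the iterative algorithm, latency constraints, or extraction method proposed in this paper), the sketch below nudges each flip-flop's clock latency until its worst incoming and outgoing setup slacks equalize:

```python
# Toy clock-skew-scheduling sketch: iteratively balance setup slack across a
# sequential graph by shifting flip-flop clock latencies (I/O ports stay fixed).
T = 1.0  # assumed clock period (ns)

# Edges of a toy sequential graph: (launch, capture, combinational delay in ns)
edges = [("IN", "A", 0.6), ("A", "B", 1.3), ("B", "C", 0.7), ("C", "OUT", 0.9)]
fixed = {"IN", "OUT"}                       # ports cannot receive useful skew
latency = {ff: 0.0 for e in edges for ff in (e[0], e[1])}

def setup_slack(launch, capture, delay):
    # data launched at latency[launch], captured one period later at latency[capture]
    return (T + latency[capture]) - (latency[launch] + delay)

for _ in range(200):                        # simple fixed-point iteration
    for ff in latency:
        if ff in fixed:
            continue
        s_in = min(setup_slack(u, v, d) for u, v, d in edges if v == ff)
        s_out = min(setup_slack(u, v, d) for u, v, d in edges if u == ff)
        # increasing latency[ff] helps incoming paths and hurts outgoing ones,
        # so move halfway toward the point where the two worst slacks meet
        latency[ff] += 0.5 * (s_out - s_in)

print({ff: round(l, 3) for ff, l in latency.items()})
print("worst slack:", round(min(setup_slack(u, v, d) for u, v, d in edges), 3))
```

On this toy chain the worst slack improves from -0.3 ns (zero skew) to +0.125 ns once the latencies settle, which is the basic benefit CSS exploits at scale.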
Engineering Presentation


AI
IP
Chiplet
Description: When multiple divided clocks are derived from the same oscillator and distributed across the system-on-chip (SoC) using different transistor families, it becomes very hard to balance the multiple clock trees.
All the divided clocks can then be considered as mesochronous, i.e. the ratio between the clocks is known and the phase is unknown but constant.
By treating the two clock domains as mesochronous instead of asynchronous, we can easily increase the throughput of the data crossing the mesochronous domains.
This work proposes a phase synchronization architecture for mesochronous clock domains, which has been developed from ground up and is completely digital.
The proposed architecture is very different from the existing literature, caters to low-power, low-area SoCs, and is independent of any gate-level custom analog design or technology node.
The proposed phase synchronization scheme can be adjusted for any clock ratio on the fly and is also suitable for sporadic data transfers and not just burst data, thus providing a versatile solution.
The principle of phase-detect synchronization is described with respect to ARM's AHB-APB mesochronous bridge, but it can easily be extended to sample any data signal crossing mesochronous domains.
Engineering Poster


Description: This poster reports a high-efficiency, high-precision EM-analysis flow for joint die-and-package modeling and simulation of RF chip designs, demonstrated on the simulation of an output matching balun merged with its corresponding package. In this flow, we combine the on-chip transformer and ground/power-line metal with the package model. Using this method, we build a comprehensive EM model to obtain the scattering parameters, which corrects for the previously ignored coupling capacitors and the incorrect electric field around the bonding surface, and avoids extra ports at the internal terminals between die and package, or between the main circuit and the ground or power nets, together with their additional parasitic inductors. With the proposed flow, we reduce the number of ports by half and obtain a more compact EM model. The simulation results show correction of an error in the input-port resistance and inductance exceeding 15% around the mesh frequency of 5 GHz, and correction of an abnormal spur in insertion loss at 29.3 GHz.
Exhibitor Forum


Description: In today's industry, component providers and semiconductor companies supply components to customers along with specification datasheets in PDF format. Product design and manufacturing companies (ODMs and OEMs) need to create their own component libraries to meet specific design and manufacturing requirements. Recognizing this need, many component suppliers have started offering generic ECAD libraries to facilitate faster design integration.
However, creating ready-to-use ECAD libraries for everyone remains a challenge. Different ODMs and OEMs have varying design requirements, and manufacturers' capabilities and limitations also differ, which affects Design for Manufacturing (DFM) standards. As a result, engineers must customize the libraries to accommodate DFM adjustments to ensure that designs are both manufacturable and reliable.
Moreover, even within a single company, ideal component libraries must be easily updated whenever design rules or contract manufacturing conditions change. Companies often employ dedicated engineering teams to handle such challenges.
Last year, when we presented at the DAC Exhibitor Forum, we discussed key technologies for building an accurate and reusable ECAD library platform. Since then, we have made exciting advancements.
As the industry starts to adopt these technologies, a major challenge in getting engineers to adopt automated library design tools is ensuring that these tools feel as intuitive and adaptable as human designers. The ability to accept specific requirements and generate customized libraries accordingly is crucial. Yet how to design an automated library platform with efficient yet flexible human-machine interaction that meets all design needs is a topic that has been largely overlooked in the industry.
Over the past year, we have made significant progress in this area and want to share our insights with the industry. We believe this is one of the most relevant topics for engineers today, as using AI and automation tools in their work will become a norm. We've developed an integrated ECAD library system that, at its core, digitizes the entire design-for-manufacturing knowledge base and library creation process. This system opens APIs that allow users to flexibly embed any chosen DFM aspects into their libraries. It enables users to set their own DFM rule parameters and instantly receive library files that adhere to their specific design requirements in their preferred EDA format.
The goal is to share our experience, shed light on the real challenges the industry faces in adopting AI and automation tools, and help alleviate the burden of the cumbersome library-building process. By doing so, we aim to transform the industry, enabling companies to allocate engineering resources to the more creative aspects of design. Additionally, we aim to encourage component suppliers to provide digital datasheets or ready-to-use digital libraries, which will not only benefit the industry but also align with the broader trend toward digitization.
Engineering Poster
Networking


Description: With a myriad of complex SoCs, verification challenges have increased manifold. One of the challenging aspects of any SoC/IP verification is identifying the uninitialized flops in a design and analyzing them across all corner-case scenarios; if not taken care of, they can lead to catastrophic silicon issues. Most of these, when caught on silicon, lead to a re-spin of the device.
Hence it becomes extremely important to identify such cases in the design, either in the RTL phase or in gate-level simulations, and analyze their impact on the design with due diligence. There have been multiple cases where potential issues were masked by random deposits in gate-level simulations and therefore escaped to silicon.
With SoC integration levels approaching a billion transistors per chip, tremendous pressure to shrink the verification cycle, and the need for power minimization, it becomes important to identify and resolve such potential bugs due to uninitialized/non-resettable flops at the RTL verification stage (early enough in the design cycle).
This paper proposes a complete and practical methodology, along with case studies over various SoC silicon findings, for early identification and left-shift of potential silicon bugs that could easily escape due to uninitialized flops in any SoC.
Here we leverage different tool support from Cadence and build a methodology which helps identify potential bugs.
With case studies of various silicon bugs, this methodology has proved to effectively catch uninitialized flops at the RTL design stage itself.
Research Manuscript


AI
AI4: AI/ML System and Platform Design
Description: Modern deep learning models, such as Relational Graph Convolutional Networks (RGCNs), Sparse Convolutional Networks (SpConv), and Mixture-of-Experts networks (MoE), depend significantly on the (gather - matrix multiplication - scatter)s paradigm (abbreviated as (g-mm-s)s) as their fundamental computational pattern. While existing works have made optimization attempts, several critical challenges remain unsolved: (1) Suboptimal operator fusion due to incomplete dataflow analysis. Current approaches lack systematic analysis of fusion strategies within (g-mm-s)s, resulting in up to 35% performance degradation due to suboptimal operator fusion and inefficient computation patterns. (2) Time-consuming exploration within a large configuration space. Finding optimal configurations in the complete solution space (often exceeding 10,000 configurations) can take up to 2,000 seconds, while experience-based configurations often lead to suboptimal performance. (3) Inefficient static dataflow with dynamic inputs. Dataflow performance is significantly affected by input dynamism; fixed dataflow patterns can cause up to 1.7× performance degradation when input characteristics vary significantly. To address these challenges, we introduce Efficient-GMS, a comprehensive framework that enhances the (g-mm-s)s paradigm across diverse input scenarios. Our framework introduces: (1) Complete dataflow analysis enabling efficient operator fusion strategies. We systematically analyze the efficiency of operator fusion and propose four optimal dataflow patterns with segment-GEMM optimization, specifically designed for unbalanced inputs. (2) Performance-model-guided configuration space reduction. We develop a performance model to predict the relative execution efficiency across configurations, thereby reducing the search space and minimizing search time while ensuring optimal configuration selection. (3) Adaptive dataflow selection mechanism. We implement a lightweight heuristic model that dynamically selects the optimal dataflow pattern based on characteristics of the input and the hardware. Experimental results demonstrate that Efficient-GMS achieves significant performance improvements across various applications: up to 3× speedup in RGCN computations, 1.23-1.59× acceleration in sparse convolution operations, and 1.17× improvement in MoE computations compared to state-of-the-art implementations.
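For readers unfamiliar with the pattern, the following is a minimal NumPy sketch of one (g-mm-s) step, the building block this abstract refers to: rows are gathered by index, multiplied by a per-group weight matrix, and scatter-added into the output. The shapes, group structure, and two-group example are invented purely for illustration and are not from the paper.

```python
# Minimal (gather - matmul - scatter) sketch with NumPy: each group gathers its
# own rows, multiplies by its own weight matrix, and scatter-adds the results.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))            # 8 input rows, feature dim 4
W = {0: rng.standard_normal((4, 5)),       # per-group weights (e.g., per relation/expert)
     1: rng.standard_normal((4, 5))}
groups = {0: ([0, 2, 5], [1, 3, 4]),       # group 0: gather rows 0,2,5 -> scatter to 1,3,4
          1: ([1, 3], [0, 2])}             # group 1: gather rows 1,3   -> scatter to 0,2

out = np.zeros((8, 5))
for g, (gather_idx, scatter_idx) in groups.items():
    gathered = X[gather_idx]               # gather
    partial = gathered @ W[g]              # matrix multiplication
    np.add.at(out, scatter_idx, partial)   # scatter (accumulating)

print(out.shape)                            # (8, 5)
```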
Research Manuscript


EDA
EDA4: Power Analysis and Optimization
Description: As transistor scaling approaches sub-5nm technologies, power distribution networks (PDNs) in integrated circuits have grown increasingly complex, with billions to trillions of nodes. Simultaneously, reduced noise margins and increased power density necessitate more accurate and efficient power grid analysis. Traditional methods for solving large-scale PDNs, especially those requiring the solution of sparse linear systems, face significant challenges due to high computational costs. Although domain decomposition methods (DDM) allow for efficient parallel computation, the size of the dense global Schur complement grows excessively large as the number of partitions increases, limiting scalability and imposing substantial computational burdens. This paper introduces an efficient parallel nested domain decomposition solver that incorporates a parallel Schur complement computation strategy and intermediate Schur complement to address these challenges. Experimental results demonstrate that by introducing an intermediate Schur complement, the size of the global Schur complement is reduced, achieving an average 1.70× speedup in computation. Furthermore, our approach outperforms existing solvers, with average speedups of 1.30× over the conventional DDM parallel solver, 8.44× over the CKTSO parallel solver, and 8.36× over the CHOLMOD-based serial direct solver with 32 threads.
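As background for the Schur complement referred to above (standard domain-decomposition notation, not this paper's specific nested scheme), partitioning the power-grid system so that the interior unknowns x_i of each subdomain are eliminated first leaves a reduced system on the boundary unknowns x_b:

```latex
\begin{bmatrix} A_{ii} & A_{ib} \\ A_{bi} & A_{bb} \end{bmatrix}
\begin{bmatrix} x_i \\ x_b \end{bmatrix}
=
\begin{bmatrix} f_i \\ f_b \end{bmatrix},
\qquad
S \, x_b = f_b - A_{bi} A_{ii}^{-1} f_i,
\quad
S = A_{bb} - A_{bi} A_{ii}^{-1} A_{ib}.
```

The intermediate Schur complement described in the abstract applies this elimination hierarchically, which is why the final global S stays smaller as the number of partitions grows.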
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
Description: The Adaptive Radix Tree (ART) is a widely used tree index structure prevalent in domains such as databases and key-value stores. Although many solutions have been proposed to improve the performance of ART, they still suffer from significant redundant tree traversals and serious synchronization costs when concurrently performing operations (e.g., reads/writes) over ART. In this work, we observe that most operations of real-world workloads tend to target a small subset of ART nodes frequently, exhibiting strong temporal and spatial similarities among the operations. Based on this observation, we propose a data-centric hardware accelerator, called DCART, to support operations over ART efficiently. Specifically, DCART introduces a novel data-centric processing model into the accelerator design to coalesce operations associated with the same ART nodes and adaptively cache the frequently traversed ART nodes and their search results, thereby fully exploiting the similarities among operations for lower tree-traversal and synchronization overhead. We implemented DCART on the Xilinx Alveo U280 FPGA card and compared it with cutting-edge solutions; DCART achieves 21.1×-44.2× speedups and 71.1×-148.9× energy savings.
Engineering Poster
Networking


Description: Facing the challenge of efficiently testing large-scale integrated circuit chips, this paper proposes a DFT parallel test technology that can effectively save test time, reduce the idle time of ATE machines, improve test efficiency, and save test costs. The general DFT technologies include SCAN, MBIST, and IPTEST. This paper proposes two specific parallel test methods: (1) parallel test between IP test patterns; and (2) parallel test between the IP test and the DFT SCAN test. If the IP parallel test technology is used, the overall ATE test time for IPs is optimized to ~40% of the original time. If the parallel test with SCAN is further implemented on top of the IP parallel test technology, the overall ATE test time will be optimized to ~60% of the original time.
Networking
Work-in-Progress Poster


Description: As power densities and operating frequencies continue to rise, thermal throttling (commonly known as dynamic power thermal management) has become increasingly critical. Elevated temperatures can lead to performance degradation and system failures, making effective thermal management essential. To address this challenge, we have developed a rapid and precise thermal solver for analyzing transient thermal responses with complex control mechanisms. This solver enables detailed chip benchmarking and temperature sensor optimization, and it also allows for the application of a PID controller during the simulation phase. Furthermore, it can be used alongside our reduced thermal circuit model, which incorporates chip design details and thermal cooling system effects. This thermal solver empowers users to efficiently assess and understand thermal throttling behavior, particularly when transient simulations require extended runtimes and intricate control mechanisms for large-scale applications.
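To illustrate the kind of control loop such a solver is meant to co-simulate (a generic sketch with an assumed one-node lumped RC thermal model, made-up coefficients, and a textbook discrete PID; it is not the authors' solver, reduced model, or controller), the snippet below throttles chip power whenever the simulated temperature approaches a limit:

```python
# Generic thermal-throttling sketch: a discrete PID controller scales chip power
# against a one-node RC thermal model (all constants are illustrative only).
DT = 0.01                      # time step (s)
R_TH, C_TH = 0.5, 2.0          # junction-to-ambient resistance (K/W), capacitance (J/K)
T_AMB, T_LIMIT = 45.0, 95.0    # ambient and throttle target temperatures (deg C)
P_MAX = 120.0                  # unthrottled power demand (W)
KP, KI, KD = 0.05, 0.02, 0.0   # PID gains (toy values)

temp, integ, prev_err = T_AMB, 0.0, 0.0
scale = 1.0                    # fraction of P_MAX the chip is allowed to draw

for step in range(int(60 / DT)):           # simulate one minute
    err = T_LIMIT - temp                   # positive while below the limit
    integ += err * DT
    deriv = (err - prev_err) / DT
    prev_err = err
    # PID output drives the allowed power fraction, clamped to [0.2, 1.0]
    scale = min(1.0, max(0.2, KP * err + KI * integ + KD * deriv))
    power = P_MAX * scale
    # lumped RC model: C dT/dt = P - (T - T_amb) / R
    temp += DT * (power - (temp - T_AMB) / R_TH) / C_TH
    if step % 1000 == 0:
        print(f"t={step*DT:5.1f}s  T={temp:6.2f}C  scale={scale:.2f}")
```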
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
Description: In-memory computing (IMC) has established itself as an attractive alternative to hardware accelerators in addressing the memory wall problem for artificial intelligence (AI) workloads. However, designing programmable IMC-based computing platforms for today's large generative AI models, such as large language models (LLMs) and diffusion transformers (DiTs), is hindered by the absence of a simulator that is able to address the associated scalability challenges while simultaneously incorporating device and circuit-level behaviors intrinsic to IMCs. To address this challenge, we present IMCsim, a versatile full-system IMC simulation framework. IMCsim integrates software run-time libraries for AI models, introduces a new set of ISA extensions to express common tensor operators, and provides flexibility in mapping these operators to various IMC architectures. As such, IMCsim enables designers to explore trade-offs between performance, energy, area, and computational accuracy for various IMC design choices. To demonstrate the functionality, efficiency, and versatility of IMCsim, we model three types of IMCs: (1) embedded non-volatile memory (eNVM)-based, (2) SRAM-based, and (3) digital IMCs. We validate IMCsim using measured data from two laboratory-tested IMC prototype ICs (a 22 nm MRAM-based IMC and a 28 nm SRAM-based IMC) and a digital IMC design in 28 nm. Next, we demonstrate the utility of IMCsim by exploring the architectural design space to obtain insights for maximizing utilization of IMC-based processors for diverse workloads (ResNet-18, Llama, and a DiT) using the three IMC types. Finally, we employ IMCsim as a design tool to obtain an efficient chip architecture and layout in 28 nm for a lightweight DiT.
Networking
Work-in-Progress Poster


Description: This paper addresses the challenges associated with standard cell synthesis for Nanosheet FET technology, particularly the constraints on M2 layer usage and the need to consider M0 and M1 layers in block-level routing. We propose a flexible synthesis flow that can dynamically switch between single-row and multi-row cell structures. To improve pin accessibility, we introduce a method for dynamic pin allocation on M0 and M1 layers, along with techniques to limit M0 pin length and mitigate vertical pin access conflicts. Experimental results demonstrate that our method outperforms both hand-crafted and existing automated cell libraries in terms of area (3.2%), DRV count (77.1% to 97.6%), and wirelength (16.5%) in block-level physical synthesis.
Research Manuscript


Design
DES4: Digital and Analog Circuits
Description: The introduction of multiple transform types into the Versatile Video Coding (VVC) standard has yielded notable encoding gains but also imposed considerable computational burdens. Existing transform circuits of different types are typically implemented separately due to their independence, leading to substantial hardware overhead. To address this, we explore the relationship between Discrete Cosine Transform Type-2 (DCT2) and Discrete Sine Transform Type-7 (DST7) matrices and reveal a prominent diagonal aggregation phenomenon in the elements of the transfer matrix. Based on this insight, the least-squares method is applied to optimize the transfer matrix sparsity, achieving a high-precision, low-cost approximate conversion from DCT2 to DST7. Furthermore, we optimize DCT2 computation by proposing an elaborate matrix decomposition approach that allows a lightweight shift-adder unit to efficiently generate all required product terms across varying sizes. Leveraging these algorithmic optimizations, we implement a highly reusable and area-efficient approximate transform accelerator that supports sizes from 4 to 32 points and accommodates three types in VVC. Experimental results demonstrate that the proposed accelerator achieves over 44% reduction in circuit resource consumption with minimal BD-BR performance loss of just 0.57%, maintaining processing capabilities up to 8K@57 fps.
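The conversion idea above can be written down directly: if C and S denote the N-point DCT2 and DST7 matrices, then DST7 coefficients can be obtained from DCT2 coefficients through a transfer matrix T = S C^-1, which is then sparsified. The sketch below uses textbook real-valued orthonormal kernels (not the VVC integer kernels, whose scaling differs, and not the paper's specific least-squares formulation) to show the threshold-then-refit idea for a 4-point case.

```python
# Sketch: approximate DST-7 from DCT-2 coefficients via a sparsified transfer
# matrix T (y_dst7 ~= T @ y_dct2). Uses textbook orthonormal kernels, so the
# scaling differs from the VVC integer transforms; the structure is what matters.
import numpy as np

N = 4
n = np.arange(N)

# Orthonormal DCT-2: C[k, i] = a_k * sqrt(2/N) * cos(pi*k*(2i+1)/(2N))
C = np.sqrt(2.0 / N) * np.cos(np.pi * np.outer(n, 2 * n + 1) / (2 * N))
C[0] /= np.sqrt(2.0)

# Orthonormal DST-7: S[k, i] = sqrt(4/(2N+1)) * sin(pi*(2k+1)*(i+1)/(2N+1))
S = np.sqrt(4.0 / (2 * N + 1)) * np.sin(np.pi * np.outer(2 * n + 1, n + 1) / (2 * N + 1))

T = S @ np.linalg.inv(C)                 # exact transfer matrix: T @ C == S

# Sparsify: drop small entries, then refit the kept entries row-by-row in a
# least-squares sense so that T_sparse @ C stays as close to S as possible.
T_sparse = np.where(np.abs(T) > 0.1, T, 0.0)
for k in range(N):
    keep = np.nonzero(T_sparse[k])[0]
    coef, *_ = np.linalg.lstsq(C.T[:, keep], S[k], rcond=None)
    T_sparse[k, keep] = coef

err = np.linalg.norm(T_sparse @ C - S) / np.linalg.norm(S)
print(f"nonzeros: {np.count_nonzero(T_sparse)}/{N*N}, relative error: {err:.4f}")
```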
Networking
Work-in-Progress Poster


Description: Transformer-based large language models impose significant bandwidth and computation challenges when deployed on edge devices. SRAM-based compute-in-memory (CIM) accelerators offer a promising solution to reduce data movement but are still limited by model size. This work develops a ternary weight splitting (TWS) binarization to obtain BF16×1-b transformers that exhibit competitive accuracy while significantly reducing model size compared to full-precision counterparts. Then, a fully digital SRAM-based CIM accelerator is designed, incorporating a bit-parallel SRAM macro within a highly efficient group-vector systolic architecture, which can store one column of the BERT-Tiny model with stationary systolic data reuse. The design in a 28 nm technology requires only 2 KB of SRAM with an area of 2 mm². It achieves a throughput of 6.55 TOPS and consumes a total power of 419.74 mW, resulting in the highest area efficiency of 3.3 TOPS/mm² and normalized energy efficiency of 20.98 TOPS/W for the BERT-Tiny model.
Engineering Presentation
A Hybrid Simulation Technique for High-Speed and Accurate System-Level Side-Channel Leakage Analysis
3:30pm - 3:45pm PDT Tuesday, June 24, 2025, Level 2

Back-End Design
Chiplet
Description: Evaluating the tolerance of cryptographic modules in application-specific ICs (ASICs) against side-channel (SC) attacks is typically done after silicon manufacturing. However, this post-silicon approach faces two major challenges: the high cost and time required for ASIC production, and the inability to pinpoint the sources of unexpected leakage. Simulation-based SC leakage assessments address these issues by enabling evaluations before manufacturing, allowing for immediate design adjustments if the required SC leakage tolerance is not met.
This paper presents a hybrid simulation method that integrates logic-based and transistor-level simulations to overcome the limitations of traditional approaches. The proposed method offers high accuracy in assessing SC leakage at the cryptographic core level while also estimating the signal-to-noise ratio (SNR) across the entire chip. Furthermore, it achieves significantly improved efficiency, generating 1,000 waveforms in 300 hours, which is 282 times higher efficiency compared to conventional chip-level transistor simulations. This hybrid approach enables rapid and precise SC leakage evaluation, facilitating the development of secure cryptographic ASICs with reduced design iteration times and costs.
Engineering Presentation


AI
Front-End Design
Chiplet
Description: Generative AI has transformed various fields, including design verification, and particularly formal verification (FV). FV is challenging due to its steep learning curve, especially in formal property verification (FPV), which involves generating properties such as assertions and assumptions in SystemVerilog Assertions (SVA). This process requires a deep understanding of the design and of SVA syntax.
Generative AI can simplify this by creating SVA from prompts and RTL. It uses large language models (LLMs) and vector databases for retrieval-augmented generation (RAG) to streamline property generation. Ideally, an LLM fine-tuned with sufficient data can translate English to SVA in a design context, significantly easing the process. Even without fine-tuning, a robust RAG system with ample local data can provide relevant examples, working effectively with a generic LLM.
However, LLM responses can be error-prone, necessitating guardrails like compiler checks, linters, user feedback, and testbenches. The challenge lies in managing the numerous components: selecting the best LLM, the most supportive RAG system, effective prompts, and generating and updating the testbench. An incubator must evaluate and choose the right components for this experimentation.
Engineering Special Session


Front-End Design
DescriptionThe verification landscape for integrated circuits continues to evolve rapidly as design complexity grows exponentially. This session brings together industry experts to explore emerging trends and technologies that are reshaping functional verification. Four distinguished speakers will present their insights on key developments including Portable Stimulus Standard (PSS), multi-language verification environments, formal verification methodologies, and cloud-based verification solutions.
Through these presentations, attendees will gain valuable insights into how these complementary approaches are converging to address the verification challenges of next-generation IC designs. Join us for a forward-looking discussion on how the verification landscape will evolve to meet the demands of tomorrow's semiconductor industry.
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionAttention-based LLMs excel in text generation but face redundant computations in autoregressive token generation. While KV cache mitigates this, it introduces increased memory access overhead as sequences grow. We propose Sella, a hardware-software co-design using cluster-based associative arrays to predict Q-K correlations, enabling selective KV cache access and reducing memory access without retraining. Sella includes a specialized accelerator featuring a prediction engine to improve performance and energy efficiency. Experiments show Sella achieves 2.1x, 93.8x, 31.4x, and 53.5x speedup over SpAtten, Sanger, TITAN RTX GPU, and Xeon CPU, respectively, reducing off-chip memory access by up to 66% with negligible accuracy loss.
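The selective KV-cache idea can be pictured with a small numpy sketch (software only, not Sella's accelerator or its associative arrays): cached keys are grouped into clusters, the query is scored cheaply against cluster centroids, and full attention is computed only over keys in the top-scoring clusters. All sizes and the clustering are illustrative.

```python
import numpy as np

def selective_attention(q, K, V, labels, centroids, top_c=2):
    """Attention that only touches the KV rows of the most relevant clusters."""
    scores_c = centroids @ q                     # cheap query-centroid correlation
    keep = np.argsort(scores_c)[-top_c:]         # predicted relevant clusters
    mask = np.isin(labels, keep)                 # only these cached rows are read
    s = (K[mask] @ q) / np.sqrt(K.shape[1])
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V[mask]

rng = np.random.default_rng(1)
K, V, q = rng.normal(size=(256, 64)), rng.normal(size=(256, 64)), rng.normal(size=64)
labels = np.arange(256) % 8                      # toy clustering of the cached keys
centroids = np.stack([K[labels == c].mean(axis=0) for c in range(8)])
print(selective_attention(q, K, V, labels, centroids).shape)   # (64,)
```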
Research Manuscript


Design
DES2A: In-memory and Near-memory Computing Circuits
DescriptionThe rapid surge in data generated by Internet of Things (IoT), artificial intelligence (AI), and machine learning (ML) applications demands ultra-fast, scalable, and energy-efficient hardware, as traditional von Neumann architectures face significant latency and power challenges due to data transfer bottlenecks between memory and processing units. Furthermore, conventional electrical memory technologies are increasingly constrained by rising bitline and wordline capacitance, as well as the resistance of compact and long interconnects, as technology scales. In contrast, photonics-based in-memory computing systems offer substantial speed and energy improvements over traditional transistor-based systems, owing to their ultra-fast operating frequencies, low crosstalk, and high data bandwidth. Hence, we present a novel differential photonic SRAM (pSRAM) bitcell-augmented scalable mixed-signal multi-bit photonic tensor core, enabling high-speed, energy-efficient matrix multiplication operations using fabrication-friendly integrated photonic components. Additionally, we propose a novel 1-hot encoding electro-optic analog-to-digital converter (eoADC) architecture to convert the multiplication outputs into digital bitstreams, supporting processing in the electrical domain. Our designed photonic tensor core, utilizing GlobalFoundries' monolithic 45SPCLO technology node, achieves computation speeds of 4.10 tera-operations per second (TOPS) and a power efficiency of 3.02 TOPS/W.
Engineering Poster
Networking


DescriptionWithin Arm, a new unit has been built to provide SoC solutions to its customers. One of the biggest problems we faced when designing SoCs was the need to rearrange the hierarchy to match physical implementation needs. As the targeted applications are highly complex, changing the hierarchy manually or with traditional methodologies was not feasible.
We have developed a new flow based on Defacto's SoC Compiler that can generate multiple configurations within one day. The presented flow generates a new SoC hierarchy for an extremely large Arm-based SoC design in just one hour, compared with more than 24 hours using the original approach.
This flow drastically reduces the overall turnaround time (TAT) to generate new RTL, allowing many more RTL configurations to be explored and compared for better PPA.
Engineering Poster
Networking


DescriptionA new virtual platform model is needed to integrate various types of models or VPs. This model consists of a common model and a wrapper to manage the read/write access of each virtual prototype (VP). This new modeling methodology reduced the turnaround time for IP model development in our company by three months by eliminating unnecessary redundant tasks. By working on one platform with IPC communications, the simulation reproducibility is maintained while preventing simulation performance degradation. This methodology allows IP models to be reused in various VPs.
Engineering Poster
Networking


DescriptionIn high-speed designs, manual ECOs are needed to close the last-mile timing paths. Pipeline retiming is an optimization technique that splits and repositions combinational logic across sequential elements without changing the logical functionality.
Current EDA logical equivalence checking (LEC) tools are limited in verifying equivalence after sequential retiming because the required information is not saved in the JSON file.
To overcome this limitation of the EDA vendor tool, we propose a method to verify logical equivalence after pipeline retiming, which helped improve the frequency-limiting paths by 2% of cycle time.
Engineering Presentation


Front-End Design
Chiplet
DescriptionModern GPUs often include hardware features that remain underutilized by key workloads, leading to unnecessary power overhead. To mitigate the power cost of one such unused feature, we implemented a series of targeted optimizations. Redundant flops were removed, and enable conditions were rewritten to consolidate flops under fewer integrated clock gates (ICGs), reducing dynamic and leakage power. Additional qualifying signals were introduced to gate larger registers, minimizing clock and sequential power consumption. Finally, critical data path computations were shifted to a less-utilized bypass path, reducing primary data path mux widths and lowering combinational logic power, as some of the calculation logic was relocated to the appropriate feature hierarchy.
These optimizations yielded significant improvements, including a 61.5% reduction in ICG count and a 19.1% reduction in flop bits within the feature hierarchy. At the top level, total power consumption decreased by 1.87%, and standard cell area was reduced by 0.35%. This approach highlights the importance of feature-specific power optimizations in improving the overall performance per watt of GPU hardware.
Engineering Presentation


AI
IP
Chiplet
DescriptionVerifying a System on Chip (SoC) with embedded non-volatile memory (NVM) requires numerous tests to cover a wide range of scenarios. Traditional methods involve manually setting NVM content or using custom scripts, which are labor-intensive and inflexible.
We developed an automated flow that generates random or constrained NVM content to verify all possible values, and allows easy modification based on specific tests.
Our approach uses a spreadsheet to define the NVM structure, parsed by an open-source scripting language to generate a UVM Verification IP (VIP) with the expected memory structure and covergroups. This VIP automatically constructs NVM content sector by sector, eliminating manual bit-by-bit modifications. Additional files handle data constraints and randomization, and are used by the simulator in specific UVM phases to generate the NVM data.
This method provides control over NVM content within the UVM framework, independent of simulation tools and programming languages. It achieves comprehensive coverage and drastically reduces the time required for generating NVM content. The innovative approach enhances silicon quality and prevents bugs, making it a versatile solution for any project.
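As a rough sketch of the generation step described above (field names and the spreadsheet layout are illustrative, not the authors' actual format), a small script can read the NVM map and emit a SystemVerilog covergroup skeleton:

```python
import csv
import io

# Illustrative NVM map; in the real flow this would come from the spreadsheet.
NVM_CSV = """sector,field,msb,lsb
0,trim_osc,7,0
0,lock_bits,15,8
1,user_id,31,0
"""

def emit_covergroup(csv_text: str) -> str:
    """Emit a SystemVerilog covergroup skeleton from the NVM field definitions."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    lines = ["covergroup nvm_cg;"]
    for r in rows:
        width = int(r["msb"]) - int(r["lsb"]) + 1
        lines.append(f"  cp_{r['field']}: coverpoint nvm_if.sector{r['sector']}"
                     f"[{r['msb']}:{r['lsb']}]; // {width}-bit field")
    lines.append("endgroup")
    return "\n".join(lines)

print(emit_covergroup(NVM_CSV))
```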
Research Manuscript


Security
SEC4: Embedded and Cross-Layer Security
DescriptionThis paper presents a novel covert timing channel (CTC) that enables a malicious entity to exfiltrate data from a benign cloud FPGA user without requiring dedicated outgoing messages from the cloud FPGA, minimizing the detection risk by both the victim and the cloud service provider. The proposed CTC exploits the handshake signals of the Advanced eXtensible Interface (AXI) protocol and inter-packet delay of the Internet to establish the CTC from a cloud Field-Programmable Gate Array (FPGA) to an off-cloud computer. This paper analyzes the bit-error rate (BER) of the AXI-based CTC under varying conditions and demonstrates its effectiveness in truly enabling remote power analysis attacks on cloud services, such as Amazon Web Services Elastic Compute Cloud (AWS EC2). The proposed CTC achieves a BER as low as 0.01988%.
Engineering Presentation


AI
Back-End Design
DescriptionThe reduction of die size has been a primary goal in systems-on-chip (SoC) designs. While previous approaches focused on reducing congestion of the chip as a whole, there has been little effort to specifically alleviate congestion in the top channel. In channel-based hierarchical non-IO limited SoC designs, inserting feedthroughs can be a crucial step during the floorplan stage to enhance routing resource utilization, alleviate congestion in the top channels, and ultimately save die area. Traditional feedthrough insertion approaches are usually carried out manually or managed in a flattened design. However, it is challenging to handle the large number of connections at the top, the high dependency on RTL, the integration of new feedthrough ports at the subchip level, and the verification of logical equivalence. We propose a method to address these challenges by employing a novel feedthrough insertion approach
Research Manuscript


EDA
EDA6: Analog CAD, Simulation, Verification and Test
DescriptionIR drop analysis is crucial for ensuring the reliability and performance of integrated circuits (ICs) but poses computational challenges as IC designs grow larger, especially for ultra deep-submicron VLSI designs. Deep learning (DL) methods, as efficiency-promising solutions, mainly employ various CNN-based networks to achieve image-to-image IR drop predictions. However, they neglect and lose the power delivery network (PDN) global spatial features and cell-instance topological information. This paper proposes a novel image-graph heterogeneous fusion framework (IGHF), which integrates the effectiveness and complementarity of dual branches (CNN and GNN) for higher prediction performance. In the CNN-based Power ScaleFusion Unet branch, the proposed long-range and local-detail encoder (LLE) integrates seamlessly with the hierarchical and adjacent compensation group (HACG) module. This design facilitates effective multi-scale global-to-local spatial power feature extraction within the PDN and enables adaptive high-to-low-level feature fusion and compensation in the decoder.
Moreover, a cell voltage aware (CVA) module in the GNN branch is designed to adaptively aggregate PDN topological features of heterogeneous neighbors of different orders. Comparative experiments demonstrate that the proposed IGHF achieves significant accuracy improvements, outperforming the state-of-the-art MAUnet and the widely used IREDGe by considerable margins of 24.6% and 55.0% reduction in prediction error, while the prediction maps possess higher structural fidelity. Transfer experiments indicate that IGHF with transfer learning can improve accuracy on real circuits with only few-shot real-circuit test cases.
Networking
Work-in-Progress Poster


DescriptionThe shrinking of transistor feature sizes and supply voltages in nanoscale CMOS technology makes integrated circuits increasingly susceptible to soft errors in radiation environments. To protect against soft errors, researchers have proposed various single-node-upset (SNU) and/or multi-node-upset (MNU) mitigated latches using the radiation hardening by design (RHBD) approach, whereas node-upset recoverability verification of the existing latches is highly dependent on Electronic Design Automation (EDA) tools to account for the complex combinations of error injections. In this paper, an automatic verification method for the MNU-recovery of latches is first proposed, which abstracts from the latch structure a directed graph expressing the control relationships among all nodes. The MNU-recovery of a latch can then be verified on the constructed directed graph, significantly simplifying the traditional verification process. Moreover, a generalized latch model is proposed, and it is proven, using the proposed verification method, that any latch satisfying this model can recover from an n-node upset.
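To make the graph-based idea concrete, the toy sketch below builds a control graph and checks a deliberately simplified recoverability criterion (every upset node must be reachable, via control edges, from some non-upset node); this criterion is an assumption for illustration and is not the paper's exact rule.

```python
import networkx as nx

def recovers_from_upset(control_graph: nx.DiGraph, upset_nodes: set) -> bool:
    """Simplified recovery check on the node-control graph (illustrative criterion).

    Edge u -> v means node u drives (controls) node v. The assumed criterion:
    every upset node must be reachable from at least one non-upset node, so a
    correct value can propagate back into the upset set.
    """
    healthy = set(control_graph.nodes) - upset_nodes
    reachable = set()
    for h in healthy:
        reachable |= nx.descendants(control_graph, h)
    return upset_nodes <= reachable

# Toy 4-node, loop-style control graph and a double-node upset.
g = nx.DiGraph([("n1", "n2"), ("n2", "n3"), ("n3", "n4"), ("n4", "n1")])
print(recovers_from_upset(g, {"n2", "n3"}))   # True under the simplified criterion
```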
Networking
Work-in-Progress Poster


DescriptionThermal effects degrade performance by 10%-20% in 3D-IC designs due to dense stacking, making the thermal superposition effect crucial for all dies. Without early thermal considerations during cell placement, later modifications are limited. Traditional thermal-aware cell placement is time-consuming and inaccurate, as it iteratively performs thermal simulations to fix hotspots without considering the 3D-IC stacking thermal boundary or timing performance. This paper presents a novel approach that breaks the thermal superposition loop by redistributing the power distribution and inserting appropriately sized white spaces, taking 3D-IC stacking thermal boundary conditions into account while minimizing timing impact. Runtime is significantly improved by simply analyzing the power distribution instead of running thermal simulations. Experimental results demonstrate a minimum reduction in hotspot temperatures of 6%, a minimum runtime improvement of 100x, and a timing impact of less than 0.5%.
Engineering Poster


DescriptionLVS (Layout vs. Schematic) and PEX (Parasitic Extraction) verification must ensure accurate circuit-layout alignment and extraction of parasitic elements. Advanced LVS/PEX techniques, including parasitic elements, improve Model-to-Hardware Correlation (MHC) for accurate design closure and manufacturing control. However, at the GAA (Gate-All-Around) technology node, the increase in the number of derived FEOL (Front End of Line) and MOL (Middle End of Line) layers poses challenges for parasitic resistance and for ensuring PEX quality, thereby affecting Process Design Kit (PDK) release schedules. In response, we propose an innovative method to verify parasitic-resistance-related issues in layouts using the pin-to-pin functionality of PERC (Programmable Electrical Rule Check). This approach safeguards parasitic resistance benefits, reinforces manufacturing control, and adeptly addresses parasitic-resistance-related issues in advanced semiconductor designs.
Networking
Work-in-Progress Poster


DescriptionWe propose a novel standard cell structure and physical design methodology to enhance circuit performance and area efficiency in the face of increasing routing complexity. The proposed approach consists of two key techniques: (1) a standard cell structure that improves routing flexibility by increasing the degrees of freedom for pin routing, and (2) a physical design methodology that enables commercial tools to route the newly designed standard cells effectively. Using the ASAP7 PDK, our method demonstrates superior results in terms of via count and wirelength compared to conventional standard cell structures and physical design methodologies.
Engineering Presentation


AI
Back-End Design
DescriptionThe increasing area and complexity of system-on-chip (SoC) designs place growing demands on the efficiency of chip testing. To achieve better controllability and observability, the IEEE 1687 standard (IJTAG) has recently been widely integrated into designs. Most processor chips are large-scale and contain many duplicated cores, and the configurations are the same in most test cases. Test time would be reduced significantly if the IJTAG network could be broadcast-controlled. However, the TDRs in an IJTAG network cannot be broadcast-controlled, which means the same data must be shifted multiple times, with extra shift cycles for every TDR even when each TDR has the same configuration. This drawback seriously limits configuration efficiency.
In this paper, we propose a broadcastable IJTAG network.
Two new broadcast signals, BroadCast_TDI and BroadCast_Select, are used to control the SIBs. The BroadCast_TDI signal connects to TDI and acts as the data input when the IJTAG network is in broadcast mode. The BroadCast_Select signal is controlled by a dedicated instruction; the IJTAG network switches to broadcast mode when BroadCast_Select is asserted. In broadcast mode, the TDRs can be configured via broadcast.
The TDR configuration cycles are greatly reduced with the broadcast IJTAG network, and configuration efficiency is improved significantly.
Engineering Poster
Networking


DescriptionDevice test cost and DPPM are of paramount importance for gaining market share by offering quality devices at an optimal price point. ATPG is the proven, efficient structural technique for testing and diagnosing silicon failures. In large gate-count devices (5M+ design flip-flops), meeting scan coverage at entitlement has been challenging, resulting in higher scan pattern counts and increased test time. Random-resistant fault analysis (RRFA) based test point insertion (TPI) can improve scan coverage with a lower scan pattern count - a known methodology for improving device quality. However, because of sheer design size, even 2% of design flops as test points (TPs) leads to 100k+ more flip-flops in the design, which is a huge DFT area overhead. This paper focuses on a novel physical- and timing-aware TPI methodology that strategically re-purposes pre-existing functional/DFT flops in the design as TPs. Further, we propose observe-only TPs to gain scan coverage in DFT infrastructure blocks. This study evaluates the methodology across a range of sub-chips, utilizing the benefits of TPI and demonstrating substantial improvements in test quality and test time while overcoming the area-overhead, routing-congestion, and power concerns associated with TPs.
Networking
Work-in-Progress Poster


DescriptionLarge-scale integrated circuit simulations need statistical transistor compact models to reflect the influence of statistical variability when predicting circuit behavior. This paper proposes a novel compact-model parameter generation approach based on Principal Component Analysis (PCA) and Kernel Density Estimation (KDE). The approach can generate an unlimited number of BSIM-CMG compact model parameter sets on the fly while following the original parameter distributions and maintaining the correlations between parameters. We introduce the methodology and use 1000 nanosheet devices from TCAD simulation under the influence of RDF and WFV for the final compact-model parameter generation. First, 1000 BSIM-CMG compact models are extracted against the TCAD data using our automatic parameter extraction platform. Then 10000 compact model parameter sets are generated using the proposed method. The results show good retention of the TCAD-simulated electrical characteristics and give promising prospects for large-scale integrated circuit simulations.
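A minimal sketch of the PCA-plus-KDE generation step, using synthetic data in place of the extracted BSIM-CMG parameter sets: PCA decorrelates the parameters, a KDE is fitted per principal component, new scores are resampled, and the inverse transform restores the original correlations. The parameter values and dimensions are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for 1000 extracted parameter sets (rows) over three correlated parameters.
cov = [[1.0e-2, 3.0e-3, 0.0], [3.0e-3, 4.0e-3, 0.0], [0.0, 0.0, 2.0e-2]]
extracted = rng.multivariate_normal(mean=[0.25, 0.45, 1.0], cov=cov, size=1000)

# Decorrelate with PCA, fit one KDE per principal component, resample, invert.
pca = PCA().fit(extracted)
scores = pca.transform(extracted)
kdes = [gaussian_kde(scores[:, i]) for i in range(scores.shape[1])]
new_scores = np.column_stack([k.resample(10000, seed=1).ravel() for k in kdes])
generated = pca.inverse_transform(new_scores)

# Correlations of the generated sets track those of the extracted sets.
print(np.corrcoef(extracted, rowvar=False).round(3))
print(np.corrcoef(generated, rowvar=False).round(3))
```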
Research Manuscript


Systems
SYS4: Embedded System Design Tools and Methodologies
DescriptionHigh-Level Synthesis (HLS) offers various optimization directives that enable designers to flexibly adjust the hardware microarchitecture. However, existing HLS performance prediction methods typically rely on control data flow graphs (CDFGs) generated from the original HLS C/C++, which struggle to capture the complex interactions between directives and the resulting hardware resource reuse. To address these issues, this paper proposes a post-implementation performance prediction method tailored for directive-optimized circuit designs, which utilizes a Graph Builder to integrate directive optimization and resource-reuse information into the graph representation. In addition, the performance prediction model, which integrates a TransformerConv-based graph neural network (GNN) and an aggregation pooling module, effectively captures key features related to post-implementation performance. Experimental results show that our method reduces the prediction error for critical path delay (CP), power, and resource utilization to 3.87%∼8.08%, significantly outperforming existing state-of-the-art methods. It also demonstrates excellent generalization on unseen kernels, providing a more effective and accurate performance prediction tool for HLS.
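A toy sketch of a TransformerConv-based GNN with graph-level pooling and a multi-target regression head, standing in for the predictor described above; the two-layer depth, feature widths, and the single toy graph are assumptions, not the paper's configuration.

```python
import torch
from torch_geometric.nn import TransformerConv, global_mean_pool

class PostImplPredictor(torch.nn.Module):
    """Toy TransformerConv GNN regressing CP delay, power and utilization per graph."""
    def __init__(self, in_dim=16, hidden=64, n_targets=3):
        super().__init__()
        self.conv1 = TransformerConv(in_dim, hidden, heads=4, concat=False)
        self.conv2 = TransformerConv(hidden, hidden, heads=4, concat=False)
        self.head = torch.nn.Linear(hidden, n_targets)

    def forward(self, x, edge_index, batch):
        x = torch.relu(self.conv1(x, edge_index))
        x = torch.relu(self.conv2(x, edge_index))
        return self.head(global_mean_pool(x, batch))   # aggregate nodes per graph

# One toy graph: 5 nodes whose features would encode directive/resource-reuse info.
x = torch.randn(5, 16)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])
batch = torch.zeros(5, dtype=torch.long)
print(PostImplPredictor()(x, edge_index, batch))       # shape [1, 3]
```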
Research Manuscript


Design
DES2A: In-memory and Near-memory Computing Circuits
DescriptionTime-Domain In-Memory Computing (TD-IMC) has emerged as a promising analog computing architecture for edge AI applications. However, the lack of developed hardware operators, especially general nonlinear operators, necessitates frequent cross-domain data transmission in practical TD-IMC systems, significantly reducing energy efficiency.
In this work, we propose a PulseWidth-IN-PulseWidth-OUT Universal Nonlinear Processing Element (PIPO-UNPE) to address the challenges of nonlinear processing in analog computing. By implementing an RRAM-based two-layer ReLU network, the PIPO-UNPE performs universal nonlinear operations entirely in the time domain. Algorithmically, we introduce Dynamic Loss-Responsive Subset Enhancement (DLRSE) to boost the performance of this low-cost network in function approximation tasks. From a hardware perspective, we design an RRAM-based pulse-driven programmable current source and a low-latency dispersion comparator-based voltage-to-time converter (VTC) to enhance both the energy efficiency and precision of the PIPO-UNPE. Hybrid simulations reveal that the PIPO-UNPE consumes 912 uW of power while delivering a throughput of 10M NOPS (Nonlinear Operations Per Second). Incorporating the PIPO-UNPE into the TD-IMC accelerator can increase energy efficiency by a factor of 9.5 to 25, keeping the accuracy loss below 0.1%.
Networking
Work-in-Progress Poster


DescriptionEnergy-efficient, real-time motion prediction (MP) enables autonomous agents to swiftly track objects and adapt to unexpected trajectory changes, essential for tasks like escape, attack, and tracking. We present a retina-inspired neuromorphic framework capable of real-time, energy-efficient MP within image sensor pixels. Our hardware-algorithm co-design utilizes a biphasic filter, spike adder, non-linear circuit, and a 2D array for multi-directional MP, implemented on GlobalFoundries 22nm FDSOI technology. A 3D Cu-Cu hybrid bonding approach enables design compactness, reducing area and routing complexity. Validated on Berkeley DeepDrive dataset, the model provides efficient, low-latency MP for decision-making scenarios that depend on predictive visual computation.
Engineering Poster


DescriptionIn today's era of advanced applications, SoCs comprise a vast number of AMS IPs owing to advantages such as (1) enhanced performance: high-speed data converters, improved signal quality; (2) complex and diverse applications: IoT, automotive, industrial, etc.; (3) innovation and differentiation: new features, customization, and flexibility; (4) power efficiency: low power, PMU; and (5) sensors, among others.
Integrating such complex AMS IPs into the SoC flow is a challenging and cumbersome task, in which library CAD views are pivotal. Hardware Description Language (HDL) models are integral components of these CAD-view IP libraries and play an important role in the entire SoC (System on Chip) flow, from RTL to emulation, prototyping, and silicon debugging.
Given their importance, the verification of HDL models becomes a critical step in the overall CAD library development process. The diverse applications of HDL models in ASIC design flow necessitate robust and faster verification of all HDL views, including simulation models, test models, equivalence models, and timing models. Verifying these different views on separate platforms is a complex and time-consuming process. This work presents a verification suite for Analog Mixed Signal (AMS) IP's HDL models. It addresses the challenges associated with verifying all HDL views and introduces a robust and efficient approach. Through detailed test cases and comprehensive comparisons, the proposed methodology demonstrates significant improvements in the quality and efficiency of the verification process.
Engineering Poster
Networking


DescriptionASIL requirements for automotive chips have resulted in the implementation of functional safety features in several IPs. Dual Core Lock Step (DCLS) is one of the popular methods for achieving functional safety in medium to large IPs that require fine-grained control. With many IPs incorporating DCLS, the verification effort has increased exponentially over the years. This paper proposes a methodology to speed up the verification process by automating testbench generation and identifying potential issues early in the design cycle. The generated testbench components, viz. checker, error injector, scoreboard, and driver, are SV/UVM compatible, allowing easy integration into existing legacy testbenches. The Python-based automation script also provides valuable insights into the lockstep design by generating several reports for review with the designer and functional safety experts. A mathematical model is developed for evaluating the effectiveness of DCLS in detecting drift between the primary and redundant cores. The methodology has been evaluated for numerous styles of DCLS implementations, and enhancement knobs are introduced to reduce automation runtimes. The proposed methodology reduces verification effort and time several-fold.
Research Manuscript


Design
DES6: Quantum Computing
DescriptionQuantum graph states are critical resources for various quantum algorithms and also determine essential interconnections in distributed quantum computing. There are two schemes for generating graph states -- the probabilistic scheme and the deterministic scheme. While the all-photonic probabilistic scheme has garnered significant attention, the emitter-photonic deterministic scheme has been proven to be more scalable and feasible across several hardware platforms.
This paper studies the GraphState-to-Circuit compilation problem in the context of the deterministic scheme. Previous research has primarily focused on optimizing individual circuit parameters, often neglecting the characteristics of quantum hardware, which results in impractical implementations. Additionally, existing algorithms lack scalability for larger graph sizes. To bridge these gaps, we propose a novel compilation framework that partitions the target graph state into subgraphs, compiles them individually, and subsequently combines and schedules the circuits to maximize emitter resource utilization. Furthermore, we incorporate local complementation to transform graph states and minimize entanglement overhead. Evaluation of our framework on various graph types demonstrates significant reductions in CNOT gates and circuit duration, up to 52% and 56%, respectively. Moreover, it enhances the suppression of photon loss, achieving improvements of up to 1.9x.
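Local complementation, the graph transformation used above, toggles every edge among the neighbours of a chosen vertex; a small networkx sketch (illustrative only):

```python
from itertools import combinations

import networkx as nx

def local_complement(g: nx.Graph, v) -> nx.Graph:
    """Return the graph obtained by local complementation at vertex v:
    every edge among the neighbours of v is toggled (present <-> absent)."""
    h = g.copy()
    for a, b in combinations(g.neighbors(v), 2):
        if h.has_edge(a, b):
            h.remove_edge(a, b)
        else:
            h.add_edge(a, b)
    return h

# Star graph: local complementation at the centre completes the leaves into a clique.
star = nx.star_graph(3)            # centre 0, leaves 1..3
print(sorted(local_complement(star, 0).edges()))
```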
Networking
Work-in-Progress Poster


DescriptionAs energy demands in cloud computing surge, efficient task scheduling and carbon-free energy use in mini data centers are essential. Solving scheduling problems with mixed-integer linear programming is computationally intensive and scales poorly. This paper proposes a two-step approach combining a greedy algorithm with linear programming to optimize virtual machine scheduling and energy management. We show via simulations that our method preserves solution quality while accelerating the calculation process by 99.6%. Whereas the reference method from the literature requires 6 hours to solve a reference system design, our solution handles designs 18 times larger in only 20 minutes, demonstrating its scalability.
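A toy sketch of the two-step idea, not the paper's exact formulation: a greedy first-fit placement of VMs onto servers, followed by a linear program that splits each server's power between grid energy and a limited carbon-free supply. The loads, capacities, costs, and power model are all made up.

```python
import numpy as np
from scipy.optimize import linprog

vm_load = np.array([2.0, 3.5, 1.0, 4.0, 2.5])   # VM CPU demand (illustrative units)
cap = 6.0                                        # per-server capacity

# Step 1: greedy first-fit-decreasing placement of VMs onto servers.
servers = []
for load in sorted(vm_load, reverse=True):
    for s in servers:
        if sum(s) + load <= cap:
            s.append(load)
            break
    else:
        servers.append([load])
power = np.array([1.0 + 0.5 * sum(s) for s in servers])   # toy per-server power model

# Step 2: LP split of each server's power between grid (cost 1.0) and green energy
# (cost 0.2, limited total budget). Variables: [grid_1..grid_n, green_1..green_n].
n = len(servers)
c = np.concatenate([np.ones(n), 0.2 * np.ones(n)])
A_eq = np.hstack([np.eye(n), np.eye(n)])                   # grid_i + green_i = power_i
A_ub = np.concatenate([np.zeros(n), np.ones(n)])[None, :]  # total green <= budget
res = linprog(c, A_ub=A_ub, b_ub=[5.0], A_eq=A_eq, b_eq=power, bounds=(0, None))
print(len(servers), res.x.round(2), round(res.fun, 2))
```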
Engineering Presentation


Back-End Design
Chiplet
DescriptionThis study investigates the potential risks of thermal side-channel leakage in cryptographic circuits. Unlike current-based side-channel attacks, thermal energy presents unique challenges because it is inherently difficult to confine. Cryptographic operations generate thermal energy as a byproduct, which can leak sensitive information. Using a novel simulation technique, we combined circuit simulation and thermal analysis to evaluate temperature variations during AES operations. Our findings reveal that plaintext-dependent temperature differences, with T-values exceeding 4.5 in some cases, indicate a potential for thermal leakage.
To mitigate this, heat dissipation mechanisms were evaluated. While these mechanisms reduced temperature variations on the circuit side, new leakage pathways emerged in other areas, demonstrating a trade-off between thermal management and security. This highlights the need for balanced design strategies that address both thermal behavior and cryptographic robustness.
Overall, this stu
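The T-values quoted above correspond to the standard leakage-assessment statistic; a minimal sketch with Welch's t-test on two groups of simulated temperature samples (random stand-in data, with the usual 4.5 detection threshold noted in a comment):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Two groups of simulated peak-temperature samples, split by a plaintext-dependent bit.
group_a = rng.normal(loc=45.00, scale=0.02, size=500)   # e.g., bit = 0
group_b = rng.normal(loc=45.01, scale=0.02, size=500)   # e.g., bit = 1

t_stat, _ = ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test
print(abs(t_stat))            # |t| > 4.5 is the usual leakage-detection threshold
```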
Engineering Poster
Networking


DescriptionWith the continuous expansion of chip scale and the pursuit of enhanced performance, the memory repair time during the power-up process has become increasingly difficult to contain. This paper presents a novel structure designed to curtail the runtime of the memory hard-repair process and bolster the efficiency of the full-chip repair dispatcher, thereby facilitating high-speed full-chip repair.
This structure encompasses two principal components: the request scheduler and the Index repair mechanism of the STAR memory system provided by Synopsys. In the case jointly developed by Sanechips and Synopsys, within a repair clock frequency of 100 MHz, the largest subsystem can repair any of three memories within a mere 2.5 microseconds. Moreover, the full-chip repair, with out-of-order decoupling requests, can be accomplished within 5 microseconds.
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionAs the scaling of semiconductor devices nears its limits, utilizing the back-side space of silicon has emerged as a new trend for future integrated circuits. With intense interest, several works have hacked existing backend tools to explore the potential of synthesizing double-side clock trees via nano Through-Silicon-Vias (nTSVs). However, these works lack a systematic perspective on design resource allocation and multi-objective optimization. We propose a systematic methodology to design clock trees with double-side metal layers, including hierarchical clock routing, concurrent buffers and nTSVs insertion, and skew repairing. Compared with the state-of-the-art method, the widely-used open-source tool, our algorithm outperforms them in latency, skew, wirelength, and the number of buffers and nTSVs.
Networking
Work-in-Progress Poster


DescriptionLarge language models (LLMs) are both storage-intensive and computation-intensive, posing significant challenges when deployed on resource-constrained hardware. As the linear layers in LLMs are the main resource-consuming parts, this paper develops a tensor-train decomposition (TTD) for LLMs with a further hardware implementation on FPGA. TTD compression is applied to the linear layers of the ChatGLM3-6B and LLaMA2-7B models, with whole-network compression ratios (CRs) of 1.94× and 1.60×, respectively. The compressed LLMs are further implemented on FPGA hardware within a highly efficient group vector systolic array (GVSA) architecture, which has DSP-shared parallel vector PEs for TTD inference, as well as optimized data communication in the accelerator. Experimental results show that the corresponding TTD-based LLM accelerator implemented on FPGA achieves 1.45× and 1.57× reductions in first-token delay for the ChatGLM3-6B and LLaMA2-7B models, respectively.
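As a software illustration of the compression step (not the paper's FPGA flow), the sketch below applies TT-SVD to a single weight matrix reshaped into a 4-D tensor; the tensor shape and rank cap are arbitrary choices for the example.

```python
import numpy as np

def tt_decompose(tensor, max_rank):
    """TT-SVD: factor a d-way tensor into a train of 3-way cores with truncated ranks."""
    shape = tensor.shape
    cores, r_prev = [], 1
    mat = tensor.reshape(shape[0], -1)
    for k in range(len(shape) - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, s.size)
        cores.append(u[:, :r].reshape(r_prev, shape[k], r))     # k-th TT core
        mat = (s[:r, None] * vt[:r]).reshape(r * shape[k + 1], -1)
        r_prev = r
    cores.append(mat.reshape(r_prev, shape[-1], 1))             # last core
    return cores

# A 256x256 linear-layer weight matrix viewed as a (16, 16, 16, 16) tensor.
w = np.random.default_rng(0).normal(size=(256, 256))
cores = tt_decompose(w.reshape(16, 16, 16, 16), max_rank=8)
params = sum(c.size for c in cores)
print([c.shape for c in cores], f"compression ratio ~{w.size / params:.1f}x")
```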
Engineering Poster
Networking


DescriptionTraditional binning methods, which rely on observing raw test results, make binning analysis difficult. Many existing test solutions, such as Streaming Scan Network, use shared-bus technology and cannot quickly judge the pass or fail of any target circuit from the test result alone. Therefore, marking large numbers of labels or constructing special patterns has become the way to trace detailed test results; however, this remains complicated and unintuitive.
This work proposes a fast chip-binning technique. A test solution comprising a MISR, XORs, and dr_config is designed to easily obtain signatures of the test results in all target circuits. These signatures represent the test results, so the difficulty of obtaining the result of any target circuit is reduced.
Because of the X-clean requirement of the MISR, an X-mask test solution is designed to prevent any X from entering the MISR and to avoid the additional design overhead of making the circuits X-tolerant. The X-mask solution ensures that the MISR computes more reliable test results, reduces the design requirements on the circuits, and ultimately applies to more circuits.
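A behavioural Python sketch of the signature idea: an X-mask zeroes unknown bits before they reach a multiple-input signature register (MISR), whose final state is the per-bin signature. The register width, feedback taps, and response data are illustrative, not the design's actual configuration.

```python
def misr_step(state, masked_inputs, taps=(16, 14, 13, 11), width=16):
    """One MISR cycle: shift with LFSR feedback, then XOR in the masked responses."""
    fb = 0
    for t in taps:
        fb ^= (state >> (width - t)) & 1
    return (((state << 1) | fb) ^ masked_inputs) & ((1 << width) - 1)

def mask_x(raw_bits, x_flags):
    """X-mask: force every unknown (X) bit position to 0 before it reaches the MISR."""
    return raw_bits & ~x_flags

state = 0
responses = [(0b1010_0000_1111_0001, 0b0000_0000_0000_0001),   # (data, X positions)
             (0b0110_1100_0000_1010, 0b0000_1100_0000_0000)]
for data, x_bits in responses:
    state = misr_step(state, mask_x(data, x_bits))
print(f"signature = {state:04x}")   # compared against a golden signature per bin
```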
Engineering Presentation


Front-End Design
DescriptionWe introduce a scalable and efficient framework for enhancing the verification and tracking of analog IPs, focusing on comprehensive coverage across power modes. We present a first-of-its-kind automated, analog-property-centric verification quantification flow in a highly digital-dominated framework.
We present a robust framework ensuring the systematic creation of range, frequency, ramp, and glitch checks for mixed-signal designs tailored to their characteristics. For stimulus evaluation, the automatic addition of basic checkers helps identify gaps in the stimulus, ensuring that all critical scenarios are exercised.
The proposed flow is flexible and scalable, allowing user customization for fine-tuning verification parameters. The presented approach facilitates comprehensive validation of IP/design integration at higher abstraction levels and automates module verification within the System-on-Chip (SoC) context. The flow efficiently supports mixed-signal verification environments encompassing both SPICE circuits and analog behavioral models, enabling earlier interception of issues in mixed-signal design.
Research Manuscript


Systems
SYS6: Time-Critical and Fault-Tolerant System Design
DescriptionServerless computing is increasingly adopted for its ability to manage complex, event-driven workloads without the need for infrastructure provisioning. However, traditional resource allocation in serverless platforms couples CPU and memory, which may not be optimal for all functions. Existing decoupling approaches, while offering some flexibility, are not designed to handle the vast configuration space and complexity of serverless workflows. In this paper, we propose AARC, an innovative, automated framework that decouples CPU and memory resources to provide more flexible and efficient provisioning for serverless workloads. AARC is composed of two key components: Graph-Centric Scheduler, which identifies critical paths in workflows, and Priority Configurator, which applies priority scheduling techniques to optimize resource allocation. Our experimental evaluation demonstrates that AARC achieves substantial improvements over state-of-the-art methods, with total search time reductions of 85.8% and 89.6%, and cost savings of 49.6% and 61.7%, respectively, while maintaining SLO compliance.
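The critical-path step of the Graph-Centric Scheduler can be pictured with networkx's longest-weighted-path routine on a latency-annotated workflow DAG; the function names and latencies below are made up for illustration.

```python
import networkx as nx

# Hypothetical serverless workflow: nodes are functions, edge weights carry the
# upstream function's latency (ms), so the longest weighted path is the critical path.
wf = nx.DiGraph()
wf.add_weighted_edges_from([
    ("ingest", "resize", 120), ("ingest", "classify", 120),
    ("resize", "thumbnail", 80), ("classify", "store", 300),
    ("thumbnail", "store", 40),
])
critical = nx.dag_longest_path(wf, weight="weight")
latency = nx.dag_longest_path_length(wf, weight="weight")
print(critical, latency)   # functions on this path get resources prioritised first
```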
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionMultimodal Large Language Models (MLLMs) have achieved notable success in visual instruction tuning, yet their inference is time-consuming due to the auto-regressive decoding of Large Language Model (LLM) backbone. Traditional methods for accelerating inference, including model compression and adaptations from language model acceleration, often compromise output quality or face challenges in effectively integrating multimodal features.
To address these issues, we propose AASD, a novel framework for Accelerating inference with refined Key-Value (KV) Cache and Aligning speculative decoding in MLLMs. Our approach leverages the target model's cached KV pairs to extract vital information for generating draft tokens, enabling efficient speculative decoding.
To reduce the computational burden associated with long multimodal token sequences, we introduce a KV Projector to compress the KV Cache while maintaining representational fidelity. Additionally, we design a Target-Draft Attention mechanism that optimizes the alignment between the draft and target models, achieving the benefits of real inference scenarios with minimal computational overhead.
Extensive experiments on mainstream MLLMs demonstrate that our method achieves up to a 2x inference speedup without sacrificing accuracy. This study not only provides an effective and lightweight solution for accelerating MLLM inference but also introduces a novel alignment strategy for speculative decoding in multimodal contexts, laying a strong foundation for future research in efficient MLLMs.
Code is available at https://anonymous.4open.science/r/ASD-F571.
Research Manuscript


Security
SEC1: AI/ML Security/Privacy
DescriptionAs the demand for privacy-preserving computation continues to grow, fully homomorphic encryption (FHE)—which enables continuous computation on encrypted data—has become a critical solution. However, its adoption is hindered by significant computational overhead, requiring 10000-fold more computation than plaintext processing. Recent advancements in FHE accelerators have successfully improved server-side performance, but client-side computations remain a bottleneck, particularly under bootstrappable parameter configurations, which involve combinations of encoding, encryption, decoding, and decryption for large parameters. To address this challenge, we propose ABC-FHE, an area- and power-efficient FHE accelerator that supports bootstrappable parameters on the client side. ABC-FHE employs a streaming architecture to maximize performance density, minimize area usage, and reduce off-chip memory access. Key innovations include a reconfigurable Fourier engine capable of switching between NTT and FFT modes. Additionally, an on-chip pseudo-random number generator and a unified on-the-fly twiddle-factor generator significantly reduce memory demands, while optimized task scheduling enhances CKKS client-side processing, achieving reduced latency. Overall, ABC-FHE occupies a die area of 28.638 mm^2 and consumes 5.654 W of power in 28 nm technology. It delivers significant performance improvements, achieving a 1112x speed-up in encoding and encryption execution time compared to a CPU, and 214x over the state-of-the-art client-side accelerator. For decoding and decryption, it achieves a 963x speed-up over the CPU and 82x over the state-of-the-art accelerator.
Engineering Poster


DescriptionEarly detection of potential ESD issues is crucial for enhancing overall design robustness and ensuring a high-quality sign-off. Achieving this with full-chip-level analysis using the native Pathfinder tool is hindered by several factors, such as runtime and memory-usage demands, the complexity of parsing multiple inputs, and debuggability through the GUI.
In an attempt to mitigate these issues, the following approaches were adopted, each with its own caveats:
-Block-level ESD closure -> does not guarantee ESD protection at the full-chip level.
-GDS-based simulators for full-chip-level sign-off:
1. Require a clean LVS database, which is available late in the project cycle.
2. Flattening the design to the transistor level negatively impacts runtime.
This presentation unveils a solution to the challenges of current ESD sign-off processes using Pathfinder-SC technology, which helped mitigate ESD problems related to grid robustness, connectivity to bumps, and placement coverage early in the project cycle. The resistance computation was also found to be more accurate than that of the traditional extraction tools used. By leveraging this, we were able to extend the resistance checks to encompass not only ESD cells but also macros (analog IPs, IOs, PLLs, etc.). This resulted in broader coverage across the full chip, leading to greater confidence in the sign-off process.
Research Special Session


Design
DescriptionUseful quantum computing of the future will require the seamless integration of quantum and classical resources. While significant progress has been made in scaling quantum hardware, the role of classical computing remains essential for tasks such as error correction, pre/post-processing, and calibration—key workflows that enable practical quantum applications. NVIDIA's CUDA-Q platform addresses this need with an open-source, qubit-agnostic approach, allowing users to build high-performance, scalable hybrid quantum applications, while leveraging accelerated computing and AI. This talk will provide an overview of CUDA-Q and NVIDIA's broader quantum strategy.
Engineering Presentation


AI
Systems and Software
DescriptionSoC-level Android home-screen bring-up, system software development, and its validation are usually done at the post-silicon stage, which can affect a product's time to market and the cost of a silicon re-spin. It is also difficult to debug hardware/software design bugs at the post-silicon level because of limited RTL design visibility, so early software development and validation are crucial. Therefore, as a platform team, it is essential to have a high-speed model of the SoC available months before silicon readiness to initiate Android kernel and system software development effectively.
In this paper we discuss the available early pre-silicon platforms, their limitations, and how the hybrid emulation methodology effectively accelerates home-screen and system software development. It highlights techniques and customizations developed to accelerate Android home-screen boot, such as leveraging high-speed virtual memories and boot devices, migrating AFM to a QEMU-based virtual CPU SS, smart partitioning of SoC design components between the hardware (RTL) and virtual sides, advanced clocking techniques, hardware-software profiling, and enabling peripheral devices close to silicon.
This paper addresses the challenges encountered in porting silicon-level Android boot code during the pre-silicon phase, including silicon-like PHY characterization. It also presents the accelerated results achieved by incorporating various techniques and hardware-software optimizations, highlighting the performance improvements at each stage.
Engineering Poster
Networking


DescriptionEnsuring reliable connectivity between digital logic and analog IPs in ASIC design is a critical challenge in verification flows.
This paper introduces a novel approach leveraging formal verification with Jasper Connectivity to address the limitations of traditional mixed simulation methods. By automating the generation of connectivity assertions and utilizing a formal verification engine, we achieve comprehensive and exhaustive verification, significantly reducing setup and runtime.
Our methodology involves extracting netlists from analog schematic views, generating SystemVerilog assertions using Python scripts and proving the assertions with formal verification. This automated process not only accelerates verification but also enhances bug detection and ensures complete coverage. A comparative analysis shows that formal verification with Jasper is up to 1400 times faster in setup and 650 times faster in runtime compared to mixed simulations. Moreover, formal verification can identify bugs that mixed simulation could miss, highlighting its superior effectiveness.
The adoption of formal verification and Jasper Connectivity in our verification flow demonstrates substantial improvements in efficiency and quality assurance.
This approach ensures bug-free connectivity between the analog and digital domains, enhancing silicon quality. The paper discusses the implementation, benefits, and impact of this innovative verification methodology on the overall design process.
Engineering Presentation


AI
IP
DescriptionThis case study presents a detailed methodology for the verification of a bandgap reference circuit in an advanced FinFET process. The verification process leverages Additive AI technology to address the complexities inherent in analog circuits, particularly those involving a large number of elements. The verification analysis was conducted at a high-sigma target, and the circuit encompassed 17,000 devices. The application of this AI technology significantly reduced the simulation count by 40.4 times and the wall clock time by 19.4 times, converting a typical 6-hour verification job into an 18-minute task with the same accuracy. This AI-powered approach operates seamlessly, requiring no additional designer effort, and ensures accuracy by self-verifying and running full jobs when necessary. It is particularly effective for incremental and iterative workflows, such as PVT changes, sizing and layout revisions, toolchain updates, and PDK updates. Our analysis included several VDD changes, ranging from nominal to extreme variations, with a target sigma of 5 sigma. The results demonstrate the efficacy of the technology in significantly reducing verification time and effort, enabling multiple iterations per day, and ensuring robust verification of complex analog circuits in advanced technology nodes.
Engineering Poster
Networking


DescriptionAs technology advances to deep sub-micron nodes, the complexity of chip design increases, particularly for high-sigma verification of bandgap references and RC oscillators. Traditional verification workflows are time-consuming and require multiple iterations, significantly delaying project timelines. This paper presents an AI-powered Additive Learning methodology that accelerates verification by retaining and reusing AI models from previous simulation jobs. The proposed methodology, applied to Microchip's iterative workflows, demonstrates up to 20x simulation speedup and 22x wall-clock time reduction without compromising accuracy. By automating the iterative process, this approach enhances both efficiency and accuracy, providing a fast, reliable, and scalable solution for verification in advanced technology nodes.
Networking
Work-in-Progress Poster


DescriptionThis paper addresses the challenges in accelerating clustering algorithms for large-scale datasets, emphasizing memory capacity and bandwidth limitations. By introducing a GPU and CXL-Memory collaborative architecture, we enhance the efficiency of Approximate Nearest Neighbor (ANN) searches. Leveraging the recent CXL protocol advancements, this architecture splits compute-intensive tasks for the GPU and memory-intensive processes for the CXL memory expander. The design incorporates near-data processing (NDP) units within each CXL memory rank to handle local computations, reducing memory bottlenecks and improving throughput without the need for multi-GPU configurations. Experimental results demonstrate significant performance gains in ANN tasks, achieving cost-effective and scalable solutions for billion-scale datasets.
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
DescriptionThis paper proposes a new design-technology co-optimization framework that expedites circuit optimization by utilizing neural compact modeling (NCM) and data-driven SPICE simulation. An efficient NCM retargeting strategy and improved design capability through direct data-driven SPICE simulation were leveraged at the industry level in response to increasingly challenging development conditions. To facilitate rapid feedback for extensive trial and error in technology optimization, the NCM swiftly fine-tunes itself from a pre-trained model. Data interpolation and derating techniques are then utilized to provide the same design environment as before, including instance binning, process variations, and layout-dependent effects. Demonstrating the robustness of our framework, we achieved a 95% reduction in PDK release time while maintaining model consistency and performance on a mid-scale design of 15k transistors, with no SPICE runtime or accuracy loss. This solution allows process changes to be rapidly incorporated into the design, supporting quick path-finding during design-technology co-development.
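For orientation only, a neural compact model can be viewed as a small regression network mapping bias and device parameters to terminal currents, warm-started from a pre-trained checkpoint and briefly retargeted on a handful of new-process samples. The sketch below is a generic illustration of that fine-tuning step; the network shape, checkpoint name, and data are assumptions, not the paper's actual NCM.

```python
# Hypothetical sketch: retarget a pre-trained neural compact model (NCM) on a
# small set of new-process I-V points. Model shape, file name, and data are
# illustrative placeholders.
import torch
import torch.nn as nn

# Small MLP: (Vgs, Vds, W, L, T) -> Ids
ncm = nn.Sequential(nn.Linear(5, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))
# ncm.load_state_dict(torch.load("ncm_pretrained.pt"))  # warm start (assumed file)

# A few retargeting samples from the updated process (toy data here).
x_new = torch.rand(128, 5)
y_new = torch.rand(128, 1)

opt = torch.optim.Adam(ncm.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for _ in range(200):              # short fine-tune pass, not full retraining
    opt.zero_grad()
    loss = loss_fn(ncm(x_new), y_new)
    loss.backward()
    opt.step()
```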
Networking
Work-in-Progress Poster


DescriptionThis paper presents an Electronic Design Automation (EDA) methodology for synthesizing mixed-signal Binarized Neural Networks (BNNs) at the CMOS device level. Despite the advances in Convolutional Neural Network (CNN) high-level frameworks with quantization capabilities, and contrary to digital design, there are no tools to translate such networks into mixed-signal electronic circuits. Such circuits could, however, offer better power dissipation or computation resource trade-offs compared to digital ones. The proposed methodology translates high-level CNN descriptions to device-level designs within the mixed-signal design cycle, enabling efficient iteration across architectural variants. The methodology is validated through a SKILL language implementation, demonstrating synthesis of a compact MNIST neural network.
Networking
Work-in-Progress Poster


DescriptionRecent advances in data-driven approaches, such as the Neural Operator (NO), have shown substantial efficacy in reducing the solution time for integrated circuit (IC) thermal simulations. However, a limitation of these approaches is that they require a large amount of high-fidelity training data, such as chip parameters and temperature distributions, thereby incurring significant computational costs. To address this challenge, we propose a novel algorithm for the generation of IC thermal simulation data, named block Krylov and operator action (BlocKOA), which simultaneously accelerates the data generation process and enhances the precision of the generated data. BlocKOA is specifically designed for IC applications. Initially, we use the block Krylov algorithm, based on the structure of the heat equation, to quickly obtain a few basic solutions. Then we combine them to get numerous temperature distributions.
Finally, we apply heat operators on these functions to determine the heat source distributions, efficiently generating precise data points. Theoretical analysis shows that the time complexity of the BlocKOA method is one order lower than direct solution methods. Experimental results further validate its efficiency, showing that BlocKOA accelerates the generation of datasets with 5000 instances by a factor of 420.
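A minimal numerical sketch of the "solve a few, combine, then apply the operator" idea, assuming a toy 1-D finite-difference discretization of the steady-state heat equation (the paper's block Krylov solver and IC-specific operators are not reproduced here; grid size and sources are placeholders):

```python
# Toy illustration of BlocKOA-style data generation on A @ T = q.
import numpy as np

n = 100
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # discrete heat operator

# 1) A few basic solutions from a few representative source vectors (expensive step).
Q_basis = np.random.rand(n, 4)
T_basis = np.linalg.solve(A, Q_basis)

# 2) Many temperature fields as cheap linear combinations of the basic solutions.
C = np.random.randn(4, 5000)
T_samples = T_basis @ C                                 # n x 5000 temperature maps

# 3) Recover the matching heat sources by applying the operator (cheap matvec),
#    yielding exact (temperature, source) training pairs.
Q_samples = A @ T_samples
```

Because the operator is linear, the recovered sources are exactly consistent with the combined temperature fields, which is what makes the generated pairs precise despite the cheap generation.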
Engineering Poster
Networking


DescriptionIn IR analysis for digital designs, setting up simulations and performing quality checks (QC) are critical yet time-intensive tasks, often requiring proficiency in Unix commands and scripting.
This challenge becomes more pronounced when team members are new and lack prior experience in automation workflows.
The metrics report dumped by the EDA tools captures only the following run-specific details:
• IR run logfile path
• Runtime and memory usage
and only a few aspects of the IR analysis itself:
• Power/IV reports
• IV hotspot GIF
• Worst IR drop
• Average effective voltage
• Number of violations above the specified threshold
This information may not be sufficient for thorough IR-hotspot analysis and root-cause studies.
To address these challenges, we developed a GUI-based automation tool that simplifies the simulation setup process, reducing reliance on manual command-line operations.
Additionally, an interactive HTML dashboard was designed to provide an intuitive and efficient means of analyzing IR drop results, tracking simulation quality, and logging issues.
By abstracting complex Unix commands into a user-friendly interface, this solution significantly enhanced the team's productivity, reducing setup times and minimizing human errors.
The interactive dashboard enabled rapid identification of issues and streamlined the QC process, even for team members with limited technical expertise.
This paper outlines the development, implementation, and impact of these tools, demonstrating their effectiveness in increasing efficiency, reducing onboarding times for new team members, and ensuring robust design implementations.
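A minimal sketch of the dashboard-generation idea (the field names, file paths, and report format below are illustrative assumptions; the actual tool parses vendor-specific IR reports and adds issue logging and QC tracking):

```python
# Hypothetical sketch: collect per-run IR metrics and emit rows of an HTML dashboard.
import html

def dashboard_row(run):
    cells = [run["name"], run["worst_ir_drop_mv"], run["avg_eff_voltage_v"],
             run["violations"], run["logfile"]]
    return "<tr>" + "".join(f"<td>{html.escape(str(c))}</td>" for c in cells) + "</tr>"

runs = [
    {"name": "blockA_func_ss", "worst_ir_drop_mv": 48.2,
     "avg_eff_voltage_v": 0.702, "violations": 13, "logfile": "runs/blockA/ir.log"},
]

table = ("<table><tr><th>Run</th><th>Worst IR drop (mV)</th>"
         "<th>Avg eff. voltage (V)</th><th>#Violations</th><th>Logfile</th></tr>"
         + "".join(dashboard_row(r) for r in runs) + "</table>")

with open("ir_dashboard.html", "w") as fh:
    fh.write(table)
```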
Engineering Poster


DescriptionThis research explores CPU-GPU hybrid computing strategies for semiconductor test data processing, with a focus on leveraging the strengths of each processor type for distinct data conditions. Experimental results reveal that CPUs achieve optimal performance at 10K chunk size for 5G datasets, making them well-suited for specific data patterns with proper chunk size optimization. In contrast, GPUs maintain consistent performance across all chunk sizes, excelling in environments with larger datasets without requiring chunk size tuning. To capitalize on these insights, we propose a workload distribution framework that dynamically allocates tasks based on data characteristics and size. Our framework automatically determines the optimal processor allocation by analyzing data patterns and volume in real-time, ensuring maximum resource efficiency. Testing in production environments demonstrated a 75% reduction in processing time for 30G datasets, with processing time reduced from 600s to 150s when compared to CPU-only processing. The hybrid framework also showed exceptional scalability, maintaining consistent performance improvements across varying data sizes from 1G to 30G. These findings provide actionable guidance for semiconductor manufacturers in selecting and optimizing processing strategies tailored to diverse data characteristics, while offering a foundation for future extensions into multi-GPU environments and machine learning-based optimization systems.
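A minimal sketch of the dispatch idea under assumed cut-offs (the framework described above analyzes data patterns and volume at runtime; the 5G threshold and 10K chunk size below only echo the reported observations, and the helper names are placeholders):

```python
# Hypothetical CPU/GPU dispatch for semiconductor test-data processing.
CPU_CHUNK_ROWS = 10_000          # chunk size where CPU throughput peaked (5G data)
GPU_PREFERRED_BYTES = 5 * 2**30  # above this size, GPU performance dominates

def dispatch(dataset_bytes, rows):
    """Return a list of (processor, row-slice) work items."""
    if dataset_bytes <= GPU_PREFERRED_BYTES:
        # CPU path: chunk-size tuning matters, so split into ~10K-row chunks.
        return [("cpu", slice(i, i + CPU_CHUNK_ROWS))
                for i in range(0, rows, CPU_CHUNK_ROWS)]
    # GPU path: performance is insensitive to chunk size, process in one pass.
    return [("gpu", slice(0, rows))]
```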
Engineering Presentation


AI
IP
DescriptionAs mobile SOC designs increasingly incorporate SRAM, achieving high yield qualification becomes crucial for ensuring performance and reliability. However, the iterative nature of SRAM design cycles, driven by failure corrections, design updates, and technology revisions, often results in substantial computational overhead. This paper introduces an AI-based methodology that accelerates SRAM design verification by leveraging Additive Learning technology. By reusing AI models and results from previous simulations, the methodology significantly reduces the number of simulations required for iterative design changes, achieving simulation speedups of 20X to 67X while maintaining accuracy. The proposed methodology addresses critical challenges in SRAM yield qualification, providing a more efficient, automated, and accurate approach to the verification process, ensuring faster time-to-market without compromising accuracy and quality.
Engineering Poster
Networking


DescriptionFirmware development and validation is a critical aspect of the product development cycle. The growing complexity of firmware, along with its hardware counterpart, has put tremendous pressure on engineering teams to explore effective and efficient ways to validate firmware. Firmware validation is often gated by hardware availability and slow simulator speed. With time-to-market being the critical factor, waiting for the actual silicon to arrive before testing the firmware is simply not an option. Traditional FPGAs offer real-time speeds, but bring-up eats up months for complex projects, and debug remains a nightmare and resource-intensive as well. We wanted a strategic trade-off solution addressing both hardware and software engineers' needs in terms of bring-up time and debug. The "Dynamic Duo" Palladium Z2 and Protium X2 "Congruency flow" solution from Cadence addresses these challenges much more efficiently, and we will demonstrate how we achieved the best of both platforms in ADI's complex products.
Exhibitor Forum


DescriptionIn this session, we will present how we optimized SPICE simulation for a D flip-flop, reducing the transient analysis Monte Carlo simulation time for 5000 samples to under 1 hour from the typical 8-hour run, while simultaneously reducing EDA software license costs by 75% on the cloud. We then leveraged the same technique to scale to larger designs with more complex requirements, achieving similar results and accelerating speedup 2X to 52X across multiple flows such as timing and power signoff, formal verification, and DRC extraction. We will share specific real-world design scenarios and examples of successful tapeouts completed months ahead of schedule by scaling per-minute licensing usage across the entire chip design project.
Analyst Presentation


AI
DescriptionThis session will discuss the design of accelerator packages and systems tailored for the AI era, addressing the need for improved performance, scalability, and energy efficiency to support complex AI workloads.
Ancillary Meeting


DescriptionJoin us for an insightful luncheon panel sponsored by Accellera, where industry leaders will discuss the transformative role of AI in semiconductor design and verification. As AI rapidly evolves, its potential to reduce costs, shorten time-to-market and address impending talent shortages is becoming increasingly evident—but what are the real-world opportunities and challenges?
This panel will bring together industry experts to share their vision and experiences, examining:
• The impact of AI on design and verification flows
• Envisioned benefits of applying AI, including cost reduction
• Challenges in training and deploying Large Language Models (LLMs), including IP ownership, ethics, and security
• The role of industry standards to shape AI-driven methodologies
The Accellera-sponsored luncheon is free to DAC attendees, but registration is required. Please visit https://www.accellera.org/news/events to register and find out more.
Research Manuscript


Security
SEC2: Hardware Security: Primitives & Architecture, Design & Test
DescriptionMerkle Tree is a fundamental cryptographic primitive in Zero-Knowledge Proof (ZKP) protocols, sharing significant computational workloads with the Number Theoretic Transform (NTT) in zk-STARK schemes. Merkle Tree is a tree structure where nodes are primarily generated through hash computations. Among them, Poseidon Hash, as a ZK-friendly hash function, has emerged as one of the most widely adopted choices. Therefore, hardware acceleration of building Merkle Trees based on Poseidon Hash can significantly enhance the performance of ZKP protocols. We propose AcclMT, a highly resource-efficient and flexible Poseidon Hash-based Merkle Tree architecture. Our design employs hardware-software co-design and optimizes the hashing data flow, resulting in an area-efficient Poseidon Hash engine that improves modular multiplication resource utilization. Furthermore, AcclMT uses these engines alongside a hierarchical on-chip cache and optimized task scheduling for building large Merkle Trees. It also supports flexible parameter configurations for various requirements. Experimental results show that our proposed Poseidon Hash engine achieves a 14.3× speedup compared to the latest FPGA-based work. By improving resource utilization, it also reduces area usage by 14.8% compared to the unoptimized design. AcclMT achieves up to 1665× speedup over software implementations in building Merkle trees, with average utilization of 95.9% and 99.2% for the two hash engines.
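For orientation, a Merkle-tree build reduces to repeated pairwise hashing of the previous level, which is the workload AcclMT maps onto its Poseidon Hash engines. The sketch below uses SHA-256 from Python's standard library purely as a stand-in for Poseidon Hash (which operates over field elements), and duplicating the last node for odd level sizes is one common padding convention assumed here; it is not the paper's hardware/software partitioning.

```python
# Generic software Merkle-tree construction over a list of leaves.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()   # stand-in for a ZK-friendly hash

def merkle_root(leaves: list[bytes]) -> bytes:
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                  # odd node count: duplicate the last node
            level.append(level[-1])
        # each parent is the hash of the concatenation of its two children
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

root = merkle_root([b"tx0", b"tx1", b"tx2", b"tx3", b"tx4"])
print(root.hex())
```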
Research Manuscript


EDA
EDA8: Design for Manufacturing and Reliability
DescriptionIntegrated circuit yield optimization plays a vital role in ensuring reliable semiconductor manufacturing. Current approaches allocate equal computational resources across design candidates causing low efficiency. To this end, we introduce a novel precision-aware yield optimization framework that intelligently adapts computational resource allocation based on the design candidate's predicted performance. Our approach moves beyond simple simulation counting by incorporating a Figure of Merit as a continuous quality metric. By combining a continuous autoregression model to characterize the relationship between yield and precision levels with a sophisticated multi-fidelity acquisition strategy, our framework achieves optimal resource distribution, reducing simulation costs by over 10x.
Engineering Poster


DescriptionTechnology advancements at lower tech nodes have increased the complexity of Memory IR closure due to increasing power density and reduced Vmin headroom, causing a high risk of failure. This presentation explores the challenges and solutions for Memory IR sign-off, focusing on accurate modeling, realistic use-case scenarios, and efficient resource utilization.
The Memory IR sign-off process requires four pillars: IR Modeling, IR Simulation Scenario, IR Reporting, and IR Fixing.
Key aspects of our approach include:
1. Utilizing both abstract and detailed models (based on the memory risk factor) for accurate IR modeling.
2. Selecting realistic test cases for IR simulation by considering memory modes, load capacitance, and the number of toggling bits to ensure realistic IR analysis.
3. Refining the default reported IR drop by removing tool pessimism, and introducing a Bbox-based approach to identify the worst-case power and ground drop considering current charge/discharge locations.
4. Implementing effective IR fixing strategies including PG-based fixing methods and timing-based fixes.
By employing detailed models for critical memories, efficiently utilizing PG resources for dual-rail memories, and removing simulation/reporting pessimism, we achieve a seamless Memory IR sign-off process. This approach leads to greater confidence in the sign-off process and ensures robust chip performance at lower technology nodes.
Engineering Presentation


AI
Systems and Software
Chiplet
DescriptionCo-packaged optics (CPO) is a promising candidate for implementing low-power, high-bandwidth optical links in datacenters and in high-performance computing (HPC) systems. By bringing the optical transceiver into the 3DIC package, CPO can reduce the complexity and power loss of the overall system. These benefits come at the cost of increased thermal power density in the 3DIC from the tighter integration. The close proximity of the electrical integrated circuit (EIC) and the photonic integrated circuit (PIC) also results in thermal cross-talk through vertical die-to-die coupling. Accurate thermal simulation of the 3DIC package is needed to ensure thermal integrity of the package and to reduce performance degradation from thermal crosstalk. In this work, we demonstrate a highly efficient and accurate chip-centric thermal simulation workflow for CPO. By generating chip thermal models (CTMs) for the EIC and PIC, we enable efficient thermal simulation of the 3DIC package to profile the high-fidelity temperature distribution of all dies. We extend the traditional CTM for modelling the PIC which increases the overall accuracy of our simulation. The temperature map is imported into a photonic circuit simulator to capture the effects of thermal crosstalk from the package on the performance of the PIC.
Research Manuscript


Security
SEC2: Hardware Security: Primitives & Architecture, Design & Test
DescriptionQuasi-cyclic moderate density parity-check McEliece (QMM) cryptosystem is designed to mitigate the security threat posed by quantum computers and is considered a promising candidate for post-quantum cryptography (PQC). However, the growing requirement for data encryption brings severe challenges for QMM implementation in terms of latency and hardware overhead. In this work, we first propose ACIM-QMM, an analog computing-in-memory (CIM) accelerator design for the QMM cryptosystem. The combination of analog circuits and CIM lets the design efficiently generate keys and encrypt ciphertext while breaking the performance bottleneck of PQC constrained by the digital computing paradigm. In our experiments, ACIM-QMM operates with low relative error and achieves a 31.4×~288.1× speedup compared with SOTA hardware for the QMM cryptosystem. Furthermore, the results also demonstrate that ACIM-QMM can achieve up to 3.12× area and 20.32× energy efficiency improvements compared with other PQC hardware for 256-bit security.
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionModern GPUs typically segment Streaming Multiprocessors (SMs) into sub-cores (e.g. 4 sub-cores) to reduce power consumption and chip area. However, this partitioned design prevents potential task distributions across sub-cores, impairing overall execution efficiency. In this paper, we explore the performance benefit of sharing hardware resources among sub-cores and identify functional units (FUs) as critical components for compute-intensive applications. Moreover, our observations reveal that instructions residing in operand collectors can be obstructed by back-end FUs, but there is a high probability that unoccupied FUs are available in adjacent sub-cores during such blockages.
In response, we introduce the adjacent computation resource sharing (ACRS) framework to efficiently utilize these unoccupied units among sub-cores. ACRS has two key modules: Shared FU Issue (SF ISSUE) and Shared FU Write Back (SF WriteBack). SF ISSUE monitors the status of operand collectors and functional units, and offloads instructions from blocked sub-cores to unoccupied resources. Meanwhile, SF WriteBack routes results back to the original sub-core. To minimize wiring overhead, each sub-core is assigned a fixed target core for sharing. We design a series of matching policies and ultimately select the most effective sequential method. Evaluation results show that ACRS improves performance by up to 46.4%, with an average of 14.1% over the traditional partitioned architecture, while reducing energy consumption by 8.3%. Besides, ACRS achieves an additional 12.3% performance improvement compared with the SOTA method.
Engineering Poster
Networking


DescriptionAs the demand for high performance and energy efficiency becomes increasingly urgent, integrated circuits continue to evolve towards advanced nodes to meet application requirements. Typically, to avoid active region process defects and ensure product yield, factories provide relatively conservative design rules and broad model files. This leads to designs that cannot be based on the real margin, resulting in performance and power wastage.
To address these issues, we conduct research on testkey in active region. By designing testkey with specific sizes and structures and conducting simulation consistency analysis, we evaluate the DC performance shifts of active region devices under normal conditions, thereby enabling process monitoring. In addition, we analyze the performance of different types of devices under parameters such as high voltage and isolation, providing corresponding design insights for specific design scenarios, and meeting special customization requirements such as high performance computing and low power consumption. This research can effectively assess the consistency of Spice2Silicon and establish a process defect database, accurately reflecting the actual performance level and fluctuation range of the process platform, providing effective guidance for enhancing product competitiveness. Furthermore, it helps explore the performance limits of devices, laying a theoretical foundation for subsequent special designs and applications.
Engineering Poster
Networking


DescriptionThe need for high performance and increased feature addition in smaller areas is pushing the limits of physics in terms of process-node optimization across the metrics of power, performance, and area. While a higher number of transistors per unit area enables increased feature addition, it can create extreme power-density hotspots due to higher switching activity. The high-performance processor core is one of the major IPs affected by this trend of high power density as it tries to achieve the desired PPA goals of an advanced process node. Since turbo frequency is a crucial factor in meeting the performance benchmark, the core IP voltage limit is close to the process Vmax limit defined for reliability. These factors aggravate the issue of high power density, which leads to an increased temperature ramp (°C/ms) in the die and the creation of localized hotspots. Such localized hotspots may lead to permanent faults and reliability challenges. Advanced process nodes are required to meet power-density targets to avoid thermal runaway scenarios. In this paper, we discuss power-density mitigation techniques like module padding/instance padding and activity-based power-density optimization on high-switching designs.
This technique enables designers to reduce power density with minimal timing impact throughout the design cycle. The above techniques were tested on a complex, extremely high-switching-activity partition for next-generation cores, reducing power density by 40% in a thermally critical region.
Networking
Work-in-Progress Poster


DescriptionWith the increasing size of Large Language Models (LLMs), low-rank decompositions are being widely used for model compression. Although these methods achieve high compression ratios, they suffer from poor hardware utilization due to the irregular and low-rank nature of matrices, especially on conventional regular AI accelerators. This paper presents AdaMAP, a hybrid algorithm-to-hardware mapping strategy that optimizes low-rank matrix multiplications using input-stationary and output-stationary mappings. To fully leverage low-rank decomposition, we also propose hardware optimizations for efficient data-loading and output-flushing. Applied to a 92.5% compressed BERT model, our approach achieves up to 75× average layer-wise speed-up over the uncompressed model and 31× over the compressed model using weight-stationary mapping. Post-layout simulations on a 65 nm process show 15.9× higher hardware utilization and 77%–96% energy savings compared to weight-stationary mapping.
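As background for why the mapping matters, a low-rank layer replaces one large weight matrix W with two skinny factors, so the layer becomes two thin matrix multiplications. The NumPy sketch below only illustrates this arithmetic with placeholder shapes; AdaMAP's mapping strategies and hardware changes are not modeled here.

```python
# Low-rank layer arithmetic: W (d_out x d_in) is approximated as U @ V with
# rank r << min(d_out, d_in), so y = W @ x becomes y = U @ (V @ x).
import numpy as np

d_out, d_in, r = 768, 768, 32        # illustrative shapes
U = np.random.randn(d_out, r)
V = np.random.randn(r, d_in)
x = np.random.randn(d_in)

y = U @ (V @ x)                      # r*(d_in + d_out) MACs instead of d_out*d_in

full_macs    = d_out * d_in
lowrank_macs = r * d_in + d_out * r
print(f"MAC reduction: {full_macs / lowrank_macs:.1f}x")
```

The irregular, skinny shapes of U and V are exactly what makes a fixed weight-stationary mapping inefficient on a regular accelerator, motivating the hybrid input-/output-stationary mappings described above.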
Networking
Work-in-Progress Poster


DescriptionThe shift to advanced storage technologies like SMR HDDs and 3D QLC NAND has led to the adoption of zoned block devices, which partition storage into zones that enforce strict sequential write constraints. While log-structured file systems (LFS) are suited to sequential writes, existing designs like Sprite LFS and F2FS face compatibility issues with zoned storage requirements. Persimmon has attempted to address this by converting non-zone-compliant on-disk structures into logs. However, converting to a log does not inherently ensure full sequentiality, and managing multiple distinct log systems in one file system introduces complexity, particularly when reconstructing necessary in-memory data structures from the on-disk layout. We identify key issues in current zone-compliant file systems, particularly with the log conversion approach, and introduce the Collective Log-Structured File System (CLFS), which addresses zone compliance, confined invalidation, and dynamic anchoring using write pointers. CLFS's zone-compliant design enables efficient mounting, recovery, and metadata management on zoned block devices, overcoming limitations in existing LFS implementations.
Networking
Work-in-Progress Poster


DescriptionSparse matrix multiplication (SpMSpM) is a critical computational kernel in various scientific and machine learning applications. The diverse sparsity patterns of different SpMSpM workloads present a significant challenge for traditional fixed-dataflow accelerators. Recent approaches have attempted to leverage dynamic dataflow to capture varying memory access characteristics under different sparsity patterns; however, they still face challenges in maximizing data reuse, load balance, and efficient merging of partial sums. To address these issues, we propose ADEM, an adaptive dataflow-based SpMSpM accelerator that can dynamically adjust the matrix partitioning scheme at a fine-grained level. Additionally, ADEM incorporates a greedy-based scheduling algorithm to achieve load balancing. Finally, we employ heterogeneous merge units to handle two distinct types of merging tasks. Experimental results demonstrate that ADEM achieves average performance gains of 8.1x and 1.88x over the baseline.
Research Manuscript


EDA
EDA1: Design Methodologies for System-on-Chip and 3D/2.5D System-in Package
DescriptionTo fully harness emerging computing architectures, compilers must provide intuitive input handling alongside powerful code optimization to unlock maximum performance. Coarse-Grained Reconfigurable Arrays (CGRAs) — highly energy-efficient and ideal for nested-loop applications — have lacked a compiler capable of meeting these objectives. This paper introduces the Adora compiler, which effectively bridges user-friendly, lightweight coding inputs with high-performance acceleration on the CGRA SoC. Adora utilizes CGRA-target loop transforms to achieve efficient data-flow level execution while optimizing data communication and task pipelining at the task-flow level. Additionally, it incorporates a comprehensive automated algorithm with a thoughtfully designed optimization sequence. A series of comprehensive experiments highlights the exceptional efficiency and scalability of the Adora compiler, demonstrating its transformative impact in leveraging CGRA capabilities for acceleration in edge computing.
Networking
Work-in-Progress Poster
Adora: An Arithmetic and Dynamic Operation Reconfigurable Accelerator using In-Memory Look-Up Tables
6:00pm - 7:00pm PDT Sunday, June 22 Level 3 Lobby

DescriptionIn- and near-memory computing has been shown to offer significant performance benefits to data-centric applications because it alleviates the memory-wall problem. Existing processing-in-memory (PIM) works aim to accelerate the transformers within Large Language Models (LLMs), which are made up of purely arithmetic sequences. However, the majority of in-memory accelerators lack the ability to support branching operations. Here, an enhancement to a previous Look-Up Table (LUT) based PIM architecture to support branching instructions is demonstrated. The enhanced design allows the implementation of more complex branching kernels. While "tokens per second" has become the de facto unit of measurement for the performance of language-model accelerators, many accelerator designs are limited to implementing the arithmetic component of the network. Existing accelerators leave an inherent bottleneck at the general-purpose computing unit, which completes the initial stage of translating the input data into its symbolic representation for use in the network body, followed by the transfer of data to the accelerator. The presented design provides a solution to these restrictions, with power and throughput comparisons against conventional processors.
Research Manuscript


Design
DES4: Digital and Analog Circuits
DescriptionThis paper presents how we leverage AI-human collaboration to develop an end-to-end, automated design flow for digitally controlled oscillators (DCOs), a key radio-frequency (RF) integrated circuits (ICs) building block that dominates phase noise and jitter performance of RF systems. Specifically, we decompose the DCO design process into two steps and use AI to enhance productivity and optimize performance within each step. Additionally, we demonstrate how AI can assist RF IC designers in creating unconventional circuit components to tackle challenging design specifications. Overall, the proposed flow is capable of synthesizing the DCO design including the schematic and layout in 80 seconds after one-time training, and is frequency agile between 1 and 20 GHz. Moreover, it can select the most robust design under process variations when multiple design parameters meet target specifications under the nominal condition. The proposed automated DCO design flow is demonstrated using two silicon prototypes implemented in the GlobalFoundries 22-nm CMOS SOI process. In the measurements, they achieve >192.4-dBc/Hz figure-of-merit (FoM) and <1.5-kHz frequency resolution at 7.1 to 8.6 GHz and 3.8 to 4.6 GHz, outperforming existing manual designs at similar frequencies.
Engineering Poster
Networking


DescriptionThe paper "Innovative Approach for APL (Apache Power Library) Modeling for Complex IO Designs" presents a new methodology to improve IR drop analysis accuracy in complex IO designs. Traditional IR drop analysis in System-on-Chip (SoC) designs uses models like APL for standard cells, CMM/APL for macros, and NLPM .libs for IO cells. However, NLPM .libs are inadequate for complex IO cells, necessitating a novel APL modeling approach.
The challenges in APL modeling for IO cells stem from their complexity, making traditional methods like sim2iprof cumbersome. The proposed methodology uses a characterization engine to generate CCS-POWER .lib, simplifying the process and reducing generation time. The comprehensive APL views cover all Process, Voltage, and Temperature (PVT) conditions, arcs, and slope x load combinations, aligning with NLPM data.
Comparative analysis shows the APL-CCS method offers a simple setup, quick generation, and comprehensive data without additional simulations. Two analyses validate the method: one compares APL-CCS with sim2iprof and SPICE, and the other compares peak current values from NLPM and APL-CCS models against SPICE simulations. Results show APL-CCS and SPICE report peak currents 2X to 25X higher than NLPM.
The paper concludes that accurate SoC analysis for complex IO cells requires APL models for better accuracy than NLPM .libs. The APL-CCS approach simplifies data generation, making it accessible for industry adoption and enhancing IR drop analysis for complex IO designs.
Networking
Work-in-Progress Poster


DescriptionHardware Trojans (HTs) threaten integrated circuit (IC) security by introducing malicious components. Most detection approaches focus on register transfer and gate netlist levels due to the complexity of layout-level data. Layout-level detection is hindered by resource-intensive reverse engineering and difficulty accessing design details in GDSII files. This paper presents a computationally-efficient framework for detecting HTs at the GDSII level, leveraging layout versus schematic (LVS) and design rule checking (DRC) to extract netlists. Using a knowledge graph embedding (KGE) model, the framework identifies HT-suspicious nets, achieving high detection precision and recall, and outperforming existing GDSII-level methods.
Research Special Session


Design
DescriptionHeterogeneous integration has moved from a novel concept just a few years ago to a mainstay in the computer industry. Advanced packaging is a key part of any chiplet architecture strategy. The concept of heterogeneous integration is inviting, with the ability to mix and match chiplets from different design sources as well as foundry nodes. This enables scaled architectures to address high-computing needs for memory and enhanced compute with accelerators. The approach also promises cost optimization and faster time to market. All these benefits cannot be realized, however, without careful consideration of the advanced packaging design and the optimization of the package design within the system construct. This includes co-optimization all the way from silicon to the system, with packaging sitting in the middle and being critical to it. This talk covers the key role that advanced packaging plays in enabling optimized chiplet architectures and the important design considerations. Key trade-off studies that are needed early in the concept phase will be discussed, highlighting areas like power, thermal, I/O, mechanical and cost. The talk will touch upon co-design, early architecture studies, standards, and the need for interoperable tools/flows/methods as well as innovations in design and optimization approaches to ensure optimum performance while keeping cost in mind.
Research Special Session


Design
DescriptionAs cost-effectiveness of transistor scaling according to Moore's law slows, increasingly, future electronic systems for computing, mobile communications, automotive, defense and biological applications will rely on advanced integration of separately manufactured chiplets into a 2.5D/3D System-in-Package (SiP). Advanced Packaging with its promise of decreased cost through higher manufacturing yield, chiplet node optimization, functional modularity, and chiplets reuse will be a requisite foundation of future system design. Recently, sophisticated examples of advanced packages containing nearly 50 dies, many fabricated by different vendors on different technological nodes, have been demonstrated. In general, 2.5D integration of many dies on a common substrate, or their complex 3D integration is challenged by many factors including custom designs and tools with limited availability, computationally expensive multiscale and multiphysics models that make design exploration inefficient, unavailability of high-density substrates,
inefficient power delivery and thermal management solutions, as well as assembly, testing and reliability challenges. In this talk, I will provide a broad overview of Advanced Packaging and the ongoing roadmapping efforts at identifying its needs.
Engineering Poster


DescriptionOne of the most challenging elements of Formal Property Verification (FPV) is when the user encounters complexity: some subset of properties times out, neither proven nor disproven. While many advanced techniques are available to address complexity, this can be especially tricky because the FPV tool does not provide a debuggable waveform. However, the technique of State Space Tunneling (SST) allows a user to generate a waveform that gives insight into why the tool is encountering complexity issues, and the user can leverage it to create helper assertions that enable FPV to overcome those issues. While SST has been around in some form for over a decade, recent advances in the technology have made it much more usable for the typical validation engineer. Thus, for users who may have tried it unsuccessfully in the past, now is the perfect time to dive back into this powerful methodology. Benefits of recent improvements include better integration into core engines, integration of "soft constraints" to create better waveforms, and completion of proofs using "upper bounds". We will discuss the usage model and recent improvements to SST, and how they enabled excellent FPV results in a recent project at Tenstorrent.
Engineering Presentation


AI
IP
DescriptionTHine Electronics specializes in designing high-speed interfaces and timing controllers. Their newly developed IC features 8-bit x 8 phase interpolators, designed for high-speed IO interfaces and communication systems. To ensure the quality of this design, THine's design flow demands SPICE-accurate verification across a wide range of scenarios, including 64 process modes, multiple temperature and supply voltage values, and hundreds of phase interpolator settings. This results in tens of thousands of nominal corners, requiring a significant number of samples for achieving 4-sigma statistical accuracy. Traditional methods would need at least 10 million samples for verification.
To meet their verification goals within practical production timelines, THine Electronics utilized an AI-driven methodology incorporating Siemens' Solido Simulation Suite and Solido Design Environment. This approach allowed the design team to secure SPICE-accurate verification at 4-sigma with only thousands of samples, reducing the total verification cycle by 100x in terms of simulation numbers compared to conventional brute-force methods.
In this presentation, we will explore the complex verification challenges encountered, the methodology used to overcome them, and the achieved results, highlighting how the collaboration between THine Electronics and Siemens EDA leveraged an AI-powered design verification flow to achieve accurate results with remarkable efficiency gains.
Engineering Poster
Networking


DescriptionAchieving high yield in SRAM bitcells is a critical challenge in semiconductor manufacturing, where process variations and random defects can significantly impact performance and reliability. This paper presents a novel methodology for improving SRAM yield prediction by integrating rare, random defects into traditional simulation models. The approach specifically addresses the impact of defects in NFET devices on SRAM performance, which was previously unquantified. The AI-powered methodology incorporates exponentially distributed defect variables and adjusts for variations in threshold voltage (Vt), enabling more accurate predictions of SRAM yield and fail counts. The proposed method reduced the error margin in Vmin prediction to less than 5% and improved alignment between simulated and observed fail counts. With minimal additional computational overhead, this AI-powered approach enhances the accuracy and efficiency of yield prediction for SRAM bitcells, offering a powerful tool for managing yield loss in advanced process nodes.
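A toy Monte Carlo sketch of the modeling idea, assuming a simplistic margin criterion: Gaussian Vt variation plus a rare, exponentially distributed defect shift on the NFET Vt. All distribution parameters, the defect rate, and the fail criterion below are placeholders, not the paper's calibrated model.

```python
# Hypothetical sketch: SRAM fail-count estimation with a rare exponential defect term.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000                             # simulated bitcells

vt_nom, vt_sigma = 0.35, 0.02             # nominal NFET Vt and process sigma (V)
defect_rate, defect_scale = 1e-4, 0.05    # defect probability and mean shift (V)

vt = rng.normal(vt_nom, vt_sigma, n)
has_defect = rng.random(n) < defect_rate
vt += has_defect * rng.exponential(defect_scale, n)   # rare extra Vt shift

vmin = 0.60                               # supply being evaluated
read_margin = vmin - vt - 0.18            # stand-in margin model (placeholder)
fails = np.count_nonzero(read_margin < 0)
print(f"estimated fail count: {fails} / {n}")
```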
Research Manuscript


Security
SEC2: Hardware Security: Primitives & Architecture, Design & Test
DescriptionDeep learning models are now pervasive in the malware detection domain owing to their high accuracy and performance efficiency. However, it is critical to analyze the robustness of these models by introducing adversarial attacks that can expose their vulnerabilities. Nevertheless, the adversarial malware generation problem for Linux has not been well investigated. In this work, we propose a novel reinforcement learning framework, ADVeRL-ELF, to generate adversarial ELF malware by adding semantic NOPs within the executable region. Experimental results show that ADVeRL-ELF achieved an attack success rate of 59.5%. These adversarial malware samples can be leveraged to harden Linux-based malware detection systems.
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
DescriptionApproximate computing is a relatively new computing paradigm that allows trading off area/power against accuracy at the outputs. Another relatively new VLSI design trend is to raise the level of design abstraction from the Register-Transfer Level (RTL) to the behavioral level and use High-Level Synthesis (HLS) to synthesize these behavioral descriptions. HLS has one unique advantage over RT-level design: it completely decouples the functional description from the implementation. This allows the behavioral description to be designed and verified once, and then a large number of hardware implementations with unique area vs. performance trade-offs to be generated. This is typically achieved through synthesis directives in the form of pragmas that the HLS user annotates in the source code, mainly to control how arrays (RAM, registers, expand), loops (unroll, pipeline), and functions (inline or not) are synthesized.
In this work we leverage this uniqueness and build an automated HLS design space explorer to find the hardware circuit most amenable to approximate computing, that is, the one with the highest potential for area/power savings. We have coined this explorer ADVISOR. The main problem with traditional exploration processes is their long runtime, which is even more pronounced in this case because every new implementation needs to be fully approximated to understand its area/power vs. error trade-offs. Thus, to accelerate this exploration process, we propose to evaluate each new design based on an Approximation Friendliness Index (AFI) that can be computed statically and very fast, and to fully approximate only the designs recommended by our flow that have high AFI values. Experimental results show that this approach leads to essentially the same results as exhaustively approximating every new design, while being on average 68× faster.
Research Manuscript


Systems
SYS2: Design of Cyber-Physical Systems and IoT
DescriptionEnergy Harvesting (EH) technology has emerged to prolong the lifetime of Internet of Things (IoT) devices. However, in EH-IoTs, the reliance on external energy sources introduces challenges in maintaining up-to-date information. To quantify data freshness in such systems, researchers have introduced the Age-of-Information (AoI) metric, which measures the time elapsed since the generation of the most up-to-date information received by the user. Consequently, the problem of AoI minimization has been studied extensively in EH-IoTs to ensure timely data delivery. While data aggregation is a fundamental task for IoTs, existing works on AoI minimization in EH-IoTs have only considered scenarios where sensory data is updated by individual source nodes. The problem has not been investigated for data aggregation, in which the sensory data is aggregated from multiple source nodes. In this paper, we
study the problem of AoI minimization for Data Aggregation in EH-IoTs. To address this problem, we propose an energy-adaptive node scheduling algorithm consisting of both offline scheduling and online
adjustment. Extensive simulations and testbed experiments verify the high performance of our algorithm in terms of AoI minimization and energy efficiency.
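The AoI metric itself is simple to state in code. The following minimal sketch (with an invented trace of generation/reception times, not data from the paper) computes AoI as the time elapsed since the generation of the freshest update received so far:

# Minimal illustration of the Age-of-Information metric: at any time t, AoI is
# t minus the generation time of the freshest update received so far.
updates = [(0.0, 1.0), (2.0, 2.5), (3.0, 5.5)]  # (generated, received) -- invented

def aoi_at(t, updates):
    received = [g for g, r in updates if r <= t]
    if not received:
        return float("inf")           # nothing received yet
    return t - max(received)          # age of the freshest delivered sample

for t in [1.0, 2.0, 3.0, 4.0, 6.0]:
    print(f"t={t:>4}: AoI={aoi_at(t, updates):.1f}")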
Engineering Poster
Networking


DescriptionVerifying server-grade Hardware Description Languages (HDLs) is a complex task that requires sophisticated front-end Electronic Design Automation (EDA) tools. To optimize performance, these tools abstract the HDLs and utilize intermediate representations. During runtime, EDA tools may also append additional information to facilitate verification. However, this abstraction and annotation process can create a significant disconnect between the user's input and the tool's output, necessitating extensive manual intervention from a subject matter expert to debug the tool's errors.
Artificial Intelligence (AI) can help bridge this gap and accelerate error interpretation and debugging. Nevertheless, foundational AI models lack proprietary domain/design-specific information. To address this limitation, AI agents can be employed to leverage existing EDA tools and gather domain/design-specific information, thereby facilitating more accurate error interpretation and debugging.
This presentation will showcase the application of AI agents in IBM's static structural checking tool, which is intensively used to ensure HDL compliance with IBM's design methodology. The tool operates on abstracted HDLs represented as a graph of BOXES and NETS, utilizing ANTLR grammar to traverse the graph. During traversal, the tool appends additional labels and tags to facilitate checking. However, the resulting error traces are often cryptic and require expert interpretation, slowing down the debugging process. By integrating AI agents, we aim to bridge the gap between tool errors and user understanding of the HDL, thereby accelerating debugging efforts by at least 30%. This innovative approach has the potential to significantly improve the efficiency and productivity of HDL verification and debugging processes.
DAC Pavilion Panel


DescriptionThe recent AI revolution has dramatically reshaped computing, with VLSI technology playing a crucial role. This panel aims to unite industry and academic leaders to explore the symbiosis between AI and VLSI, highlighting how each fuels advancements in the other. We will discuss AI's role in driving innovations in VLSI and computer hardware, from specialized AI chips to advanced architectures, faster I/O, and increased memory capacities. Conversely, we will explore how VLSI advancements, such as GPUs and custom AI accelerators, power AI's rapid growth. As AI permeates EDA tools, we will examine its potential to revolutionize chip design, optimizing performance, power, and time-to-market. We will also debate whether hardware's slower development pace compared to software has hindered or paradoxically benefited AI's progress. The panel will address AI's sustainability challenges, questioning if next-gen hardware can meet AI's massive computational demands sustainably or exacerbate energy consumption and environmental impacts. Through this panel, we would like to inspire future VLSI pioneers, especially women and young engineers, by showcasing cutting-edge AI-VLSI synergies and igniting creativity to redefine technological frontiers. Join us to explore how these interconnected fields collaboratively shape computing's future.
SKYTalk


DescriptionArtificial Intelligence (AI) is transforming Research and Development (R&D) by accelerating discovery, optimizing resources, and enhancing decision-making. Edge AI, which processes data closer to its source, reduces latency and energy consumption, promoting sustainability. Chiplet technology, involving smaller integrated circuits combined into larger systems, offers improved performance, scalability, and cost-efficiency. STMicroelectronics is exploring chiplets to advance semiconductor solutions. Amidst climate change, STMicroelectronics is committed to sustainable technologies and responsible products that support decarbonization and digitalization. Integrating AI into R&D, deploying Edge AI, and utilizing chiplets, while addressing sustainability, forms a comprehensive strategy for future advancements, ensuring technological progress and environmental responsibility.
SKYTalk


DescriptionThe rapid evolution of artificial intelligence is fundamentally reshaping the semiconductor industry. AI workloads demand unprecedented computational power. At the same time, the explosive growth of AI inference is driving energy demands to new heights, forcing the industry to rethink power efficiency at every level – from silicon design to data-center scale optimization. This talk will explore how the demands of AI are reshaping semiconductor innovation, the balance between performance and efficiency, and the advancements that are needed in silicon engineering to power this new era of computing.
Engineering Presentation


AI
IP
Chiplet
DescriptionThis work describes the development and implementation of a verification flow aided by Artificial Intelligence (AI) and supported by Synopsys VSO.ai. A comparative analysis was made between the conventional flow and the developed flow on the verification of a multiprotocol SerDes PHY.
Coverage closure is fundamental in verification flows, despite being a resource-heavy task. Synopsys VSO.ai is a developing verification technology, based on AI, made to accelerate constrained-random and coverage-driven testbenches.
This is of special importance in automotive projects, where the required high quality of coverage metrics demands a complete analysis of both code and functional coverage. Adding AI to the verification flow contributes to reducing Time To Results (TTR), easier identification of corner cases, identification of redundancies and missing coverage definitions within the testbench, and full coverage closure in a shorter time.
With the introduction of the AI-aided flow, a productivity boost was observed compared to the conventional regression flow, namely a reduction of the required number of simulations and of total regression time by factors of 3 and 2, respectively. Such benefits give verification engineers increased flexibility to allocate time to debugging tasks.
Engineering Poster


DescriptionIn this case study, we present an innovative AI-based trimming and optimization methodology developed for voltage regulators. Traditional methods of optimizing analog circuits, such as voltage references and regulators, are time-consuming and often result in suboptimal performance due to manual trimming processes. Our approach leverages the AI-powered Solido Design Environment to automate the trimming process, significantly reducing the 3-sigma range of output voltages and achieving a remarkable 93% accuracy in wafer measurement data.
The methodology involves using an automated trimming technique to find optimal trimming codes for target voltages, followed by circuit optimization to provide the best netlist. This process not only enhances the accuracy of the output voltage but also accelerates the optimization process by 140 times compared to manual methods. The results demonstrate a substantial improvement in the performance of voltage regulators, making this approach a valuable contribution to the field of analog circuit design.
Key findings include:
• Automated trimming code identification for voltage references and regulators.
• Significant reduction in the 3-sigma range of output voltages.
• Achieving 93% accuracy in wafer measurement data.
• 140x speed improvement in the optimization process.
This case study highlights the potential of AI-driven methodologies to revolutionize analog circuit design, offering both enhanced performance and efficiency.
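For intuition only, trimming-code selection can be reduced to picking the code whose predicted output is closest to the target voltage; the linear "circuit" model and trim step below are made-up stand-ins, not the Solido-based flow:

# Simplified illustration of trimming-code selection: pick the code whose
# simulated output is closest to the target voltage. The linear model and
# 2.5 mV trim step are invented placeholders for real simulation data.
TARGET_V = 1.200          # volts
STEP_MV = 2.5             # assumed trim DAC step

def simulated_vout(code, untrimmed_v=1.182):
    return untrimmed_v + code * STEP_MV / 1000.0

def best_trim_code(codes=range(0, 32)):
    return min(codes, key=lambda c: abs(simulated_vout(c) - TARGET_V))

code = best_trim_code()
print(f"trim code {code} -> {simulated_vout(code):.4f} V")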
Engineering Presentation


AI
IP
DescriptionIn high-speed SerDes design, as data rates and line lengths increase, treating transmission lines merely as short interconnects in circuit simulation can lead to abnormal waveforms. To accurately address signal reflection, crosstalk, and attenuation, it becomes essential to adopt a precise transmission line model in high-speed channel simulation analysis. With process nodes shrinking and design complexity escalating, transmission line parameters, such as impedance, become increasingly sensitive to variations in width, spacing, shielding, etc. Additionally, the demand for space-sharing with other lines imposes stricter layout requirements. Constrained space makes routing more challenging, necessitating meticulous calculations and significantly increasing the workload for layout engineers.
In traditional design methods, layout engineers draw the transmission lines manually and flatly. As the design is not parameterized, any manual adjustment triggers a complete rework of the process, leading to low iteration efficiency. Furthermore, the relationship between transmission line design parameters and performance objectives is often unclear. When simulation results deviate from expectations, traditional design methods struggle to pinpoint the root cause efficiently, making it difficult to identify an effective optimization direction.
Here, we adopt AI-driven, ML-based multi-objective optimization (MOP) methods for high-speed transmission line design, enabling automatic multi-parameter, multi-objective optimization. With this approach, we can efficiently obtain multiple designs that meet target values and dramatically reduce layout iteration time from weeks to hours. Through this flow improvement, we make high-speed transmission line layout optimization easy and efficient, significantly improving design efficiency.
Engineering Special Session


AI
Front-End Design
DescriptionArtificial Intelligence is revolutionizing Electronic Design Automation (EDA), accelerating innovation in chip design and verification. This panel brings together seven technical experts - each a co-founder of their company - who are pioneering AI-driven solutions to tackle today's most complex semiconductor challenges. From automating circuit optimization to enhancing verification efficiency, our panelists will showcase cutting-edge tools that push the boundaries of performance, accuracy, and scalability. Attendees will gain deep technical insights into real-world AI applications in EDA, uncovering how machine learning is reshaping the design flow and redefining engineering productivity. Whether you're a design engineer, verification specialist, or EDA technology strategist, this discussion will provide valuable perspectives on the future of AI in semiconductor design.
Networking
Work-in-Progress Poster


DescriptionThe need for modern workloads in the latest innovations has brought 2.5D/3D stacking and advanced packaging technologies to the forefront. The requirements for integrating multiple chips, components, and materials to create an advanced IC package are becoming increasingly complicated and introduce new challenges to existing extraction and analysis methods. This paper presents a computational framework that allows comprehensive extraction of advanced IC package designs. It is based on a hybrid computational framework, combining different electromagnetic (EM) solvers and leveraging AI models and on-the-fly libraries based on 3D full-wave simulation, to extract different netlist models efficiently and accurately for system-level analysis and optimization. It is capable of complete extraction of the most intricate stacked-die systems for a variety of packaging styles and provides co-design automation flows with signoff extraction, static timing analysis (STA), and signoff with signal and power integrity (SI/PI). With exceptional performance and reliability, it enables users to meet tight schedules efficiently.
Engineering Poster


DescriptionA bandgap reference provides a stable voltage output that is relatively insensitive to temperature changes. This stability is crucial for maintaining consistent performance in radar systems. An increase in temperature drift of the bandgap output to 100 ppm/°C can significantly impact the performance of radar transmitters and receivers. It can lead to frequency instability, reduced power amplifier efficiency, signal processing errors, increased noise, and higher calibration requirements.
Ensuring minimal temperature drift in the bandgap reference is essential for maintaining the accuracy, reliability, and efficiency of radar systems. For this, high-sigma (4.2-sigma) Monte Carlo analysis is needed to study the statistical variation. The standard Monte Carlo flow involves many unnecessary simulations around the mean before reaching the worst-case tail of the Gaussian distribution.
In this paper, we propose an AI/ML-enabled statistical solution flow (Spectre-FMC) to accurately detect the worst-case tail samples with fewer simulations and with the advantages below.
• Turn-around time is reduced by 36X for 4.2-sigma accuracy
• A significant reduction in the required number of samples, from 1 million to just 2,600 runs, to achieve 4.2-sigma accuracy makes this a highly sustainable solution
• Seamless integration into the Virtuoso-ADE flow and an easy setup reduce the learning curve for designers
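For context, a back-of-the-envelope calculation (standard normal tail probability only, not the Spectre-FMC algorithm) shows why brute-force Monte Carlo needs on the order of a million samples at 4.2 sigma:

# A 4.2-sigma one-sided tail has probability p = 0.5*erfc(4.2/sqrt(2)) ~ 1.3e-5,
# so brute-force Monte Carlo needs roughly 1/p samples just to see a handful of
# tail events -- which is why tail-targeted sampling cuts runs so drastically.
import math

sigma = 4.2
p_tail = 0.5 * math.erfc(sigma / math.sqrt(2))
print(f"one-sided tail probability at {sigma} sigma: {p_tail:.3e}")
print(f"samples needed to expect ~10 tail hits: {10 / p_tail:,.0f}")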
Engineering Poster
Networking


DescriptionLSF Resource Optimization in Large-Scale Designs: Generating ATPG for complex designs (e.g., 5M+ flip-flops) is time-consuming, creating bottlenecks in meeting tight project timelines. AI/ML solutions can dynamically allocate resources during Distributed Automatic Test Pattern Generation (D-ATPG) to optimize compute farm usage, significantly reducing turnaround time (TAT).
Design Rule Check (DRC) and Simulation Dashboard Automation: In large gate-count designs, it becomes challenging to disposition scan ATPG Design Rule Checks (DRCs) and simulation results across corners and multiple handoffs (15+ releases). Once the DRCs are analyzed, manually dispositioning them repeatedly across multiple handoffs is time-consuming and a waste of resources. Identifying new violations is difficult with manual checks across thousands of violations in multiple ATPG testmode DRC log files (a minimal set-difference sketch is given after this description).
DFT simulation quality checks involve analyzing huge amounts of data, such as pass/fail results, failure pattern types, runtime, and memory usage, from thousands of simulation runs. This takes significant manual effort and is error-prone.
Low Power ATPG Efficiency: ATPG processes often lead to excessive switching activity, increasing IR drop and potentially causing pattern failures. AI/ML can predict and adapt test patterns for low-power scenarios, reducing power consumption while maintaining test quality and coverage.
Scan Coverage and Pattern Count Reduction: Test point insertion has been used in the past to improve controllability and observability in the design, thus improving scan coverage and optimizing pattern count. Traditional manual methods of identifying test points are resource-intensive and might not be optimal. AI/ML algorithms can intelligently identify the optimal test points, improving scan coverage while minimizing pattern count, thereby cutting down test time and cost.
Accelerated Decision-Making for Optimal Test Strategies: AI/ML models can analyze massive amounts of data from previous ATPG runs to derive insights, enabling faster and more accurate decisions for optimal test pattern generation and resource utilization, enhancing overall ATPG efficiency and reliability.
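As a minimal illustration of automating the new-violation check mentioned above (the log format, rule IDs, and instance paths are invented), violations from two handoffs can be normalized into signatures and diffed:

# Minimal sketch of highlighting *new* DRC violations between two handoffs:
# normalize each violation line into a signature and diff the sets.
def signatures(log_lines):
    sigs = set()
    for line in log_lines:
        line = line.strip()
        if line.startswith("DRC"):
            tokens = line.split()
            sigs.add((tokens[1], tokens[2]))   # (rule_id, instance path)
    return sigs

previous = ["DRC C16 core/u_alu/reg_12", "DRC S7 core/u_fpu/lat_3"]
current  = ["DRC C16 core/u_alu/reg_12", "DRC S7 core/u_fpu/lat_3",
            "DRC C16 core/u_dec/reg_90"]

new_violations = signatures(current) - signatures(previous)
print("newly introduced:", sorted(new_violations))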
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionAs embedded and edge devices demand efficient, low-power computing, traditional methods struggle with complex tasks. Stochastic computing (SC) offers a fault-tolerant alternative, approximating operations like addition and multiplication using simple bitwise logic on stochastic bit-streams (SBS). However, CMOS-based SBS generators dominate power and area consumption, and existing solutions overlook the cost of data movement. This work exploits ReRAM devices to implement SC entirely in-memory: generating low-cost true random numbers, performing SC operations, and converting SBS back to binary. Despite ReRAM variability, SC's robustness ensures reliability, achieving 1.39×-2.16× higher throughput and 1.15×-2.8× lower energy use with a 5% average quality drop.
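For readers unfamiliar with stochastic computing, the core multiplication trick is a bitwise AND of independent bit-streams; the software RNG below merely stands in for the ReRAM-based true random number generation described in the abstract:

# Stochastic-computing multiplication in a nutshell: encode values in [0, 1]
# as the ones-density of random bit-streams; a bitwise AND of two independent
# streams has density ~ a*b.
import random

def sbs(value, length=4096):
    return [1 if random.random() < value else 0 for _ in range(length)]

a, b = 0.6, 0.3
stream_a, stream_b = sbs(a), sbs(b)
product_stream = [x & y for x, y in zip(stream_a, stream_b)]
estimate = sum(product_stream) / len(product_stream)
print(f"exact {a*b:.3f}  vs  stochastic estimate {estimate:.3f}")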
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionModular arithmetic, particularly modular reduction, is widely used in cryptographic applications such as homomorphic encryption (HE) and zero-knowledge proofs (ZKP). High-bit-width operations are crucial for enhancing security; however, they are computationally intensive due to the large number of modular operations required. The lookup-table-based (LUT-based) approach, a ``space-for-time'' technique, reduces computational load by segmenting the input number into smaller bit groups, pre-computing modular reduction results for each segment, and storing these results in LUTs. While effective, this method incurs significant hardware overhead due to extensive LUT usage.
In this paper, we introduce ALLMod, a novel approach that improves the area efficiency of LUT-based large-number modular reduction by employing hybrid workloads. Inspired by the iterative method, ALLMod splits the bit groups into two distinct workloads, achieving lower area costs without compromising throughput. We first develop a template to facilitate workload splitting and ensure balanced distribution. Then, we conduct design space exploration to evaluate the optimal timing for fusing workload results, enabling us to identify the most efficient design under specific constraints. Extensive evaluations show that ALLMod achieves up to $1.65\times$ and $3\times$ improvements in area efficiency over conventional LUT-based methods for bit-widths of $128$ and $8,192$, respectively.
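For reference, the conventional LUT-based ("space-for-time") reduction that ALLMod improves on can be sketched as follows; this shows only the baseline segmentation-and-lookup idea, not ALLMod's hybrid workload split (the modulus and bit-widths are arbitrary examples):

# Baseline LUT-based modular reduction: split x into g-bit groups, look up the
# precomputed residue of (group_value << group_position) mod m, and add them up.
def build_luts(m, total_bits, g=8):
    groups = total_bits // g
    return [[(chunk << (i * g)) % m for chunk in range(1 << g)]
            for i in range(groups)]

def lut_mod(x, m, luts, g=8):
    acc, i = 0, 0
    while x:
        acc += luts[i][x & ((1 << g) - 1)]
        x >>= g
        i += 1
    return acc % m            # one small final reduction of the residue sum

m = (1 << 61) - 1             # example modulus
luts = build_luts(m, total_bits=128)
x = 0x1234_5678_9ABC_DEF0_0FED_CBA9_8765_4321
assert lut_mod(x, m, luts) == x % m
print(lut_mod(x, m, luts))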
Networking
Work-in-Progress Poster


DescriptionAs Large Language Models (LLMs) continue to expand in scale and complexity, numerous high-fidelity pruning methods have been proposed to counteract the exponential growth in parameters. However, translating these potential computational savings into practical performance gains faces significant hurdles, including limited GPU support for unstructured sparsity, as well as the mismatch between existing sparse kernels and the sparsity requirements of LLMs.
We introduce AlphaSparseTensor, a novel search methodology for efficient algorithms designed for arbitrary sparse matrix multiplication. Inspired by AlphaTensor's approach to matrix multiplication, AlphaSparseTensor reduces computational complexity by minimizing block multiplications. We propose a workflow that automates the generation of multiplication-addition orders for actual sparse matrix multiplication tasks. This workflow accommodates various sparsity and size loads, enabling the extraction of zero blocks and reducing block multiplications through dynamic planning. Furthermore, we optimize the GPU implementation of our matrix multiplication paradigm, enhancing performance through computation-storage overlap and optimized memory layouts.
In our experimental evaluation, AlphaSparseTensor demonstrates significant performance improvements. On the open-source Sparse Transformer dataset, it achieves up to 1.59x and 1.91x faster performance compared to cuBLAS and cuSPARSE. For the matrix tests of the 70% pruned LLaMA(7B, 13B, 65B), AlphaSparseTensor achieved an average acceleration of 4.05x, 3.77x, 3.37x, and 2.39x compared to CuBLAS, CuSPARSE, PyTorch, and Sputnik. In the end-to-end inference of LLaMA (7B, 13B), AlphaSparseTensor achieved an acceleration of 11.5x, 2.5x, 1.3x, and 1.1x compared to CuBLAS, CuSPARSE, PyTorch, and Sputnik.
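The core block-skipping idea (though not AlphaSparseTensor's searched multiplication-addition order or its GPU kernels) can be illustrated with a few lines of NumPy:

# Partition A into blocks, record which blocks are all-zero, and skip their
# multiplications. This is only the core idea, not the paper's method.
import numpy as np

def block_sparse_matmul(A, B, bs=4):
    n, k = A.shape
    C = np.zeros((n, B.shape[1]))
    for i in range(0, n, bs):
        for p in range(0, k, bs):
            a_blk = A[i:i+bs, p:p+bs]
            if not a_blk.any():            # skip all-zero blocks of A
                continue
            C[i:i+bs, :] += a_blk @ B[p:p+bs, :]
    return C

rng = np.random.default_rng(0)
A = rng.random((16, 16)) * (rng.random((16, 16)) > 0.7)   # ~70% zeros
B = rng.random((16, 8))
assert np.allclose(block_sparse_matmul(A, B), A @ B)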
Research Special Session


Design
DescriptionCompute requirements for machine learning (ML) models for AI training and inference applications are driving the AI supercycle. Heterogeneous integration via chiplet architectures is key to enabling economically feasible growth of power-efficient computing, particularly given the slowdown in Moore's law. AMD Instinct™ MI300 Series accelerators leverage multiple advanced packaging technologies to provide a heterogeneous integration solution for emerging AI/ML and high-performance computing (HPC) workloads.
In this presentation, we will highlight innovative advanced packaging technologies that directly enable the heterogeneous integration of multiple chiplets for AMD Instinct™ MI300. These technologies include microbump 3D memory stacks, 2.5D silicon interposers, and 3D hybrid bonding. The combination of these advanced packaging techniques facilitates architectural innovations and delivers generational performance uplifts that traditional technologies and Moore's Law scaling alone cannot achieve.
Research Manuscript


Security
SEC3: Hardware Security: Attack & Defense
DescriptionFPGAs offer superior energy efficiency and performance in parallel computing but are vulnerable to remote power side-channel attacks. Existing attacks rely on assumptions of co-resident crafted circuits and shared power delivery networks, limiting their practicality in real-world scenarios. In this paper, we present AmpereBleed, a novel current-based side-channel attack that exploits widely available INA226 sensors in ARM-FPGA SoCs, bypassing the aforementioned two assumptions. AmpereBleed achieves 261x greater measurement variation in response to victim activities than the popular ring oscillator (RO) circuit, fingerprints DNN models on the Xilinx Deep Learning Processor Unit (DPU) with 99.7% accuracy, and distinguishes Hamming weights of RSA-1024 keys.
Engineering Poster
Networking


DescriptionAs technology nodes shrink, the demand for energy-efficient designs increases. Improved power gain ensures that SoCs consume less energy, extending battery life in portable devices and reducing operational costs in high-performance systems. This is essential for applications like AI, IoT, and edge computing. Existing methods of defining switching activity often use worst-case or average-case scenarios, which can lead to overestimating or underestimating power.
Vector-based power analysis captures switching activity accurately at a granular level, leading to more precise dynamic and leakage power calculations than the static, generalized assumptions of traditional power analysis. Multiple vectors are generated by simulating the RTL; these are analyzed based on their active and idle timeframes and merged to predict accurate activity, achieving better power estimates than a single-vector approach. In addition, the superset of vectors is replayed and analyzed with a simulator to identify the areas without accurate activity and to generate a fully annotated design for accurate gate-level power. This paper presents a methodology that utilizes switching activity information from multiple vectors, merging power patterns and re-simulating the vectors to accurately annotate the power pattern during synthesis and Place & Route to optimize power performance in CPU designs.
Networking
Work-in-Progress Poster


DescriptionEnsuring data consistency in multi-core embedded systems with concurrent data read-write operations presents significant challenges, especially under real-time constraints. Instead of relying on traditional lock-based synchronization which often introduces extra blocking time, this paper adopts wait-free protocols to enhance efficiency. We extend both the Dynamic Buffering Protocol (DBP) and the Temporal Concurrency Control Protocol (TCCP) to multi-core settings under partitioned scheduling. We further propose a novel wait-free protocol, Partitioned Combined-DBP-TCCP Protocol (PCDT), for data communication through shared memory. This method enables different data consumers to selectively adopt either DBP or TCCP based on specific timing requirements to optimize resource allocation. Extensive numerical experiments demonstrate that PCDT achieves more than 36% reduction in memory footprint compared to PDBP or PTCCP while ensuring wait-free, non-blocking data access and timely updates for each data reader. A case study in an automotive advanced driver assistant system further underscores its practicality, achieving approximately 13.5% memory savings.
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionThe narrow-bit-width data format is crucial for reducing the computation and storage costs of modern deep learning applications, particularly in large language models (LLMs) based applications. Microscaling (MX) format has been proven as a drop-in replacement for the baseline FP32 in existing inference frameworks, with low user friction. However, deploying such a new format into existing hardware systems is still challenging, and the dominant solution for LLM inference at low precision is still low-bit quantization. This particularly limits the strategic applications of such LLMs in real deployment on a large scale. In this work, we propose an algorithm-hardware co-design that adopts a two-level Revised MX Format Quantization (RMFQ) and a Revised MX Format Accelerator (RMFA) architecture design. RMFQ proposes the revised MX (RMX) format and provides a novel quantization framework with innovative group direction. Also, RMFA provides an RMX adaptive hardware architecture and an RMX encoding scheme. As a result, RMFQ pushes the limit of 4-bit and 6-bit quantization to a new state-of-the-art, and RMFA surpasses the existing outlier-aware accelerator such as OliVe, achieving a 1.28× speedup and a 1.31× energy reduction.
Engineering Presentation


AI
Front-End Design
Chiplet
DescriptionMulti-die SoCs integrate chiplets with distinct roles, requiring seamless coordination as a single system, which significantly increases verification complexity. The ARM SMMU is critical for memory management in SoCs, demanding architecture-specific configurations for TBU, TCU, and DTI interconnects. These requirements complicate the detection of performance-related issues and inter-die interactions in multi-die systems. Traditional verification methods, such as partial page table checks or waveform-based ACE/AXI attribute analysis, often fail to ensure comprehensive coverage, overlooking critical issues like data integrity and TLB hit/miss accuracy. To address these challenges, we propose an automated and scalable Monitor-and-Checker solution for ARM SMMU verification. The Monitor tracks real-time operations, including page table walks and TLB activity, while the Checker validates ACE/AXI transformations, detects exceptions, and ensures DTI data integrity. Applied to the development of a multi-die SoC with four chiplets, the solution identified 16 critical bugs, including a performance issue previously detectable only in real-use scenarios. Despite the increased complexity of multi-die verification, the solution reduced debugging efforts and ensured successful project completion within the planned schedule.
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionThe reconfigurable platform offers possibilities for identifying bit-level unstructured redundancy during inference with different DNN models. Significant progress has been made on value-aware accelerators for ASICs, yet few studies target FPGAs. This paper observes the limitations of implementing bit-level sparsity optimizations on FPGAs and proposes a software/architecture co-design solution. Specifically, by introducing LUT-friendly encoding with adaptable granularity and a hardware structure supporting multiplication-time uncertainty, we achieve a better trade-off between potential redundancy and accuracy with compatibility and scalability. Experiments show that under accurate calculation, PEs are up to 2.2× smaller than bit-parallel ones, and our design boosts performance by 1.04× to 1.74× and 1.40× to 2.79× over bit-parallel and Booth-based designs, respectively.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionPoint-based point cloud neural networks (PNNs) have shown remarkable accuracy in various applications. However, the performance of PNNs is usually limited on resource-constrained edge devices. This paper presents Point-CIM, an efficient Compute-in-Memory (CIM) based accelerator for PNNs, with software and hardware co-optimization. A reconfigurable CIM unit, which exploits the inherent zero-bit-skipping capability of CIM array, is designed to efficiently process Multiply-Accumulate (MAC) operations on decomposed feature data with high bit sparsity. A Voxel-Morton-sorted Partitioning (VMP) method combined with Channel-wise Minimum (CM) base point selection is proposed to further improve the decomposition bit sparsity and reduce the hardware implementation overhead. Additionally, a detailed Bit-level Truncation Quantization (BTQ) method is proposed to directly compress the bit-width of the offset without incurring any additional hardware overhead. Based on this decomposition dataflow, a Pre-decomposition (PD) data movement strategy is employed to reduce data transfers of intermediate feature results between the on-chip buffer and CIM array. Extensive evaluation experiments on multiple datasets show that Point-CIM achieves an average speedup of 1.69× to 9.63×, and 3.11× to 17.32× improvement in energy efficiency, compared to state-of-the-art accelerators and general-purpose processors.
Engineering Poster


DescriptionThe increasing functionality and complexity of large automotive SoCs at 5nm FinFET has led to growing power grid complexity, creating challenges for efficient chip-top EMIR signoff. These automotive SoCs have billions of transistors, multiple power domains, several PVT corners, and many operation modes, requiring comprehensive full-chip signoff within tight tape-out timeframes. Current methodologies for full-chip flat EMIR analysis have extremely high runtimes, typically taking 1-3 days, hundreds of cores, and 50-100 GB of peak memory per core for a single iteration. Divide-and-conquer techniques do not apply here because of systematic inaccuracies, false EMIR violations, and lengthy results consolidation. Full-chip-level run failures result in a higher cost per run in terms of both TAT and compute resources.
We present here a hierarchical methodology utilizing reduced order modelling, and the results using this methodology to achieve significant improvements in runtime and memory. We also compared the results from this hierarchical methodology with the full-chip flat analysis to verify that this hierarchical methodology has little to no compromise on accuracy of chip-top results. Our results from a large automotive SoC show a 2.4X performance improvement, 1.55X complexity reduction, and 2X memory savings, with accuracy deviations within 5% of flat analysis.
Engineering Poster
Networking


DescriptionThe existing power integrity workflow, which uses Ansys's RedHawk_SC tool, enables users to launch many targets, such as IR-drop and electromigration analysis, resistance checks, SigmaDVD, and reporting of results. With this flow NVIDIA has closed many projects, but as design sizes grow, the flow exhibits issues such as high license, runtime, disk space, and machine usage, runtime stalls, and run crashes due to high peak memory. To address these run and flow issues, a New Flow (Multi-PVT/Multi-Target Flow) was developed, with optimized code based on Ansys tool features such as Seascape User View, Seascape Delayed Objects, and the MapReduce architecture. The New Flow enables target runs to be done in parallel, re-using the views generated by earlier targets, avoids duplicate collateral setup, and saves the minimum views needed for the analysis across all scenarios. These benefits improve the user experience and reduce the time to close projects, since more runs can be done in parallel thanks to improved runtime and efficient resource usage.
Engineering Poster
Networking


Description* Lumped-model-based estimation is faster, but it has accuracy issues.
* Alternatively, ramp-up analysis can be run on the complete design, which has multi-million nodes; this is more accurate, but the simulation then runs for days to weeks because it measures the voltage at every node.
* Running a simulation for weeks to completely turn on a power net can have a huge impact on paid LSF services; therefore, a voltage-threshold-based solution is required.
* In the early phase of a design, power-gate optimization plays an important and time-consuming role; the focus is to identify the accurate number of power gates within a given time frame.
* How much time does the voltage waveform take to reach 80/90% of the ideal voltage? (A minimal threshold check is sketched after this list.)
* Can we do adaptive-resolution runs, or cut down the simulation once the switched nets are above a user-defined threshold?
* In the early stage of the design implementation cycle, design scenarios such as isolated islands, power-gate triggering issues, and irregular turn-on times of power gates come into the picture.
* Otherwise, when doing early ramp-up analysis, debugging of steps in current waveforms, identification of different chains' turn-on patterns, and similar scenarios all have to be taken care of.
* Cluster-based debug reports would be helpful here.
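A minimal version of the voltage-threshold idea from the list above might report when a ramp-up waveform first reaches a user-defined fraction of the ideal supply, so the simulation of that net can be cut short; the RC-style waveform below is synthetic, not real ramp-up data:

# Report when a ramp-up waveform first reaches a user-defined fraction
# (e.g. 90%) of the ideal supply, so simulation of that net can stop early.
import math

VDD, TAU = 0.75, 4e-6                       # assumed supply and ramp time-constant
times = [i * 0.2e-6 for i in range(200)]
wave  = [VDD * (1 - math.exp(-t / TAU)) for t in times]

def time_to_fraction(times, wave, vdd, frac=0.9):
    for t, v in zip(times, wave):
        if v >= frac * vdd:
            return t
    return None                              # threshold never reached

print(f"90% of VDD reached at {time_to_fraction(times, wave, VDD) * 1e6:.2f} us")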
Networking
Work-in-Progress Poster


DescriptionExisting persistent memory file systems improve the lifespan of persistent memories (PMs) by designing wear-leveling-aware allocators. However, these allocators focus on achieving more balanced writes to PMs while neglecting the overhead, which can lead to serious performance degradation of persistent memory file systems, especially under parallel requests from multiple threads in modern multiprocessor systems. This paper proposes an efficient wear-leveling-aware parallel allocator, called WPAlloc, to achieve accurate wear-leveling and high parallel performance. WPAlloc adopts bucket sort to manage the wear range of unused blocks with low overhead and provides parallel allocation and deallocation to avoid request conflicts between multiple threads. We implement WPAlloc in the Linux kernel based on PMFS. Experimental results show that WPAlloc achieves 168.12% and 150.73% maximum write reduction and 16.09% and 16.02% performance improvements over PMFS and WASA, respectively, and reaches similar wear-leveling while delivering a 184.09% performance improvement in multi-thread tests over DWARM.
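The bucket-sort idea can be illustrated with a simplified user-space sketch (WPAlloc itself is a kernel allocator; the bucket width and wear accounting below are illustrative assumptions): free blocks are grouped into coarse wear-count buckets, allocation pops from the lowest-wear non-empty bucket, and deallocation re-buckets each block by its updated wear count.

# Simplified bucket-based wear-aware allocation, not the WPAlloc kernel code.
from collections import deque

class WearBucketAllocator:
    def __init__(self, num_blocks, bucket_width=16):
        self.width = bucket_width
        self.wear = [0] * num_blocks
        self.buckets = {0: deque(range(num_blocks))}   # bucket 0: wear 0..15

    def alloc(self):
        for b in sorted(self.buckets):
            if self.buckets[b]:
                blk = self.buckets[b].popleft()
                self.wear[blk] += 1                    # this allocation writes it
                return blk
        raise MemoryError("no free blocks")

    def free(self, blk):
        b = self.wear[blk] // self.width
        self.buckets.setdefault(b, deque()).append(blk)

a = WearBucketAllocator(8)
blocks = [a.alloc() for _ in range(4)]
for blk in blocks:
    a.free(blk)
print(a.wear)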
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionDesigning energy-efficient, high-speed accelerators for attention mechanisms, which dominate energy and latency in Transformers, is increasingly significant. We observe two limitations in previous work: First, existing systolic arrays (SAs) fail to balance data reuse, register saving, and utilization. Second, layer-by-layer operation ordering incurs high SRAM access overhead for intermediate results. To address the first, we propose the "Balanced Systolic Array," improving energy efficiency (EE) by 40% with a 99.5% utilization rate. To address the second, we propose "Multi-Row Interleaved" operation ordering, reducing SRAM energy by 31.7%. Combining these techniques, the proposed accelerator improves EE by 39% over previous works.
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionKalman Filter (KF) is the most prominent algorithm to predict motion from measurements of brain activity. However, little effort has been made to specialize KF hardware for the unique requirements of embedded brain-computer interfaces (BCIs). For this reason, we present the first configurable KF hardware architecture that enables fine-grained tuning of latency and accuracy, thereby facilitating specialization for neural data processing in BCI applications and supporting design-space exploration. Based on our architecture, we design KF hardware accelerators and integrate them into a heterogeneous system-on-chip (SoC). Through FPGA-based experiments, we demonstrate an energy-efficiency improvement of 15.3x and 10^3x better accuracy compared to state-of-the-art implementations.
Engineering Presentation


IP
DescriptionFor logic embedding in semi-dynamic flip-flops, this paper identifies two main issues: slow evaluation time in the dynamic stage and an unwanted glitch at the output node. To address these issues, this paper proposes novel scannable semi-dynamic flip-flops featuring a two-stack NMOS in the dynamic stage, whereas conventional semi-dynamic flip-flops typically use three or more stacks to incorporate a scan MUX. Moreover, the paper introduces a new scannable logic-embedded flip-flop that adopts a modified clock-delayed domino, which is specifically designed to suppress the unwanted glitch while also enhancing speed. The proposed scannable two-stack semi-dynamic flip-flop improves data-to-output latency by 56%, as demonstrated by 14nm FinFET silicon measurement results, compared to a conventional master-slave flip-flop with the scan MUX. Even when incorporating complex logic, the proposed scannable logic-embedded flip-flop achieves nearly the same D-to-Q latency as a conventional static pulse-based flip-flop, which only includes the scan MUX. The measurement results also show that the glitch width at the output node of the proposed scannable logic-embedded flip-flop is reduced by 3 to 5 times, attributable to the two-stack NMOS and modified clock-delayed domino structures.
Research Manuscript


Security
SEC4: Embedded and Cross-Layer Security
DescriptionGeneral Matrix Multiplication (GEMM) is the most prevalent operation in deep learning (DL). However, performing GEMM within Fully Homomorphic Encryption (FHE) is inefficient due to high computational demands and significant data migration constrained by limited bandwidth. Additionally, the inherent limitations of FHE schemes restrict the widespread application of DL, as standard activation functions are incompatible. This incompatibility necessitates alternative nonlinear functions, which lead to notable accuracy reductions. To address these challenges, we propose a polynomial encoding method for GEMM under the Brakerski/Fan-Vercauteren (BFV) scheme and extend the method to inference with packed inputs and weights. Furthermore, we design specialized hardware to accelerate the inference process through optimized scheduling between the hardware and the host system. In experiments, we implemented our hardware on an FPGA U250 platform. Compared to existing solutions, our method achieves superior performance, reaching speedups of up to 4.72x and 3.78x on MNIST and CIFAR-10, respectively.
Networking
Work-in-Progress Poster


DescriptionThe limited field of view of the endoscope and the loss of lesion position are often the most problematic issues faced by junior surgeons. A battery-operated wireless four-lens panoramic endoscope is proposed. This design provides panoramic vision and real-time lesion-guiding information during a surgical operation. The work combines several image-processing techniques: image stitching, view synthesis, and lesion guiding. Lesion guiding provides global positioning information and tracks the predefined lesion position during surgery. For the low-power hardware design, the image-processing encoder and decoder chips were designed using a clustered voltage-domain technique: the chip is separated into two/four voltage domains, and voltage scaling is applied to each domain. The high-voltage domain maintains the chip performance, and the low-voltage domain reduces the power consumption. This effective technique decreases the power consumption without reducing the performance of the chip. The entire system integrates a personal computer, an embedded system, and the image encoder and decoder chips. By applying the multi-Vdd technique, the multiple-Vdd encoder and decoder chips can be quickly redesigned based on power, delay-time, and gate-count optimization requirements. The power consumption of the encoder and decoder chips is effectively reduced to 50% and 24%, respectively, while the performance loss is kept within 5% for both designs. A wireless panoramic endoscopic system is successfully validated and demonstrated with the integrated encoder and decoder chips, and the whole system has been validated by animal in-vivo experiments. Experimental results show that the proposed system can expand the side-by-side image size to 155%. Based on our search, there is no similar work that can be used as a comparison.
Networking
Work-in-Progress Poster


DescriptionMemory usually needs to incorporate ECC circuits to protect against soft errors occurring within it. This paper presents a novel memory design with internal ECC functionality based on in-memory computing. Firstly, we propose an in-memory XNOR macro, which consists of 7T SRAM cells and an XNOR gate. Then, 32 such macros share the subsequent error-correction and detection logic to complete Hamming code encoding and decoding. This design eliminates external encoding and decoding circuits, reducing the area overhead. Simulation results show that the additional area for the ECC in this design is only one-third of that of the traditional method with external ECC logic. Moreover, the encoding, decoding, and error-correction power consumption are reduced by 10.3%, 73.4%, and 52.1%, respectively. Finally, in this design, the ECC is performed independently through internal ports and will not interfere with the normal read and write operations of the memory.
Research Manuscript


Systems
SYS3: Embedded Software
DescriptionSparsity is widely prevalent in real-world applications, yet existing compiler optimizations and code generation techniques for sparse computations remain underdeveloped. Sparse matrix-matrix multiplication (SpMM) is a representative operator in sparse computations, whose performance is often limited by the design of sparse formats and the extent of hardware architecture optimization. Most existing solutions achieve high-performance SpMM through two approaches: (1) meticulously designed kernels and specialized sparse formats, which require extensive manual effort, or (2) tensor compilers that support code generation, though these typically offer limited support for sparse patterns, making it challenging to adapt to complex sparsity patterns in practical applications. This paper presents SpMMTC, an input-aware sparse tensor compiler. Given a sparse matrix as input, SpMMTC analyzes its non-zero distribution and generates a vectorized kernel optimized for SpMM on that specific matrix. We evaluated SpMMTC on various workloads. It achieves speedups of 1.21x to 2.97x over state-of-the-art methods such as TACO, TVM, and ASpT on different multi-core processors. It also provides a speedup of up to 1.52x for sparse MobileNetV1 inference on an edge device.
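For reference, the computation being specialized, SpMM with a CSR-format sparse operand, can be written in a few lines of plain Python; SpMMTC's input-aware analysis and vectorized code generation are not shown here:

# Reference CSR SpMM (sparse A times dense B), just to pin down the computation.
def spmm_csr(indptr, indices, data, B, n_rows):
    n_cols = len(B[0])
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for nz in range(indptr[i], indptr[i + 1]):
            a, j = data[nz], indices[nz]
            for k in range(n_cols):
                C[i][k] += a * B[j][k]
    return C

# A = [[1, 0, 2],
#      [0, 3, 0]]   stored in CSR form below
indptr, indices, data = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(spmm_csr(indptr, indices, data, B, n_rows=2))   # [[11.0, 14.0], [9.0, 12.0]]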
Engineering Presentation


AI
Systems and Software
Chiplet
DescriptionElectrostatic Discharge (ESD) poses new challenges in multi-die (2.5D and 3D) systems. Traditional ESD analysis focuses on single-die systems, but multi-die systems introduce new discharge paths from die-to-die, necessitating a comprehensive assessment through simulation. This study applies well-established ESD simulation methodologies to the entire multi-die system, including package and interposer, using novel simulation tools. This approach involves Current Density (CD) and Resistance (R) checks to identify potential violations that single-die analysis might miss. The findings highlight the importance of assessing ESD vulnerability in 3DIC systems to ensure their safety and reliability, and may be critical to signing off on designs before production going forward.
Networking
Work-in-Progress Poster


DescriptionThis article presents a methodology to automatically and optimally design and size analog circuits through an analytically derived objective function and constraints based on given specifications, implement their layout, and produce the corresponding Graphic Data System (GDS) files. To implement this flow, a Python implementation of the C/ID methodology is used for technology characterization and lookup table (LUT) generation. An open-source layout generation platform is used to produce the circuit layout from an optimized design in an automated flow. To adequately implement the flow, a method to analyze the complexity of circuit networks and the concept of ``reducibility of problem dimension'' are provided. An analytical approach is proposed to formulate a general and technology-agnostic optimization process that is expandable to high degrees of dimensionality for a wide variety of circuit topologies, accounting for process and temperature variations while maintaining low computational complexity and high accuracy with no simulation iterations required. The methodology affords clear visualization of the design space and its optimums. From this analysis, technology-agnostic topology factors, compute complexity degree, and design scripts are derived. A case study on a Current Mirror Operational Transconductance Amplifier (CM OTA) illustrates the approach, showing optimal design points for power minimization. The process includes SPICE simulation, layout generation via the ALIGN analog layout generator, and an example using open-source Electronic Design Automation (EDA) tools and Skywater's 130 nm design kit, with results compared across analysis, schematic, and post-layout netlist simulations. All results, designs, and scripts are publicly available for transparency and easy reproducibility.
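One LUT-driven sizing step in such a flow can be sketched as follows, assuming a gm/ID-style lookup of current density versus inversion level; the table values are invented placeholders, not data from any real PDK or from the paper's C/ID characterization:

# Sketch of one LUT-driven sizing step: pick an inversion level (gm/ID), look
# up the current density ID/W from a characterized table, and solve
# W = ID / (ID/W). The table below is an invented placeholder.
import numpy as np

gm_over_id = np.array([5.0, 10.0, 15.0, 20.0, 25.0])     # 1/V
id_over_w  = np.array([80.0, 25.0, 8.0, 2.5, 0.8])        # uA/um (made up)

def width_for(gm_id_target, drain_current_uA):
    density = np.interp(gm_id_target, gm_over_id, id_over_w)
    return drain_current_uA / density                      # um

print(f"W ~ {width_for(gm_id_target=18.0, drain_current_uA=50.0):.2f} um")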
Networking
Work-in-Progress Poster


DescriptionThe difference in coefficients of thermal expansion between materials leads to uneven deformation within a package, known as warpage. This warpage effect has become a significant reliability concern in advanced packaging, and effective methods are still desirable. We present the first analytical multi-die floorplanning framework considering the warpage effect in advanced package designs. We apply a physical warpage model that can be embedded into a gradient-based floorplanner to optimize the warpage effect. We then present a warpage, wirelength, and area co-optimizer considering the overlapped region and outline barrier constraints. Experimental results show that our floorplanner outperforms TCG-based ones in warpage, wirelength, and area.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionSubstantial data movement caused by irregular graph topologies hinders the efficient processing of graph neural networks (GNNs). Although the emerging near-bank processing-in-memory (PIM) architecture offers a promising solution to reduce data transfer between memory and computing units, cross-bank communication remains a critical challenge, limiting the benefits of PIM architectures. Our findings indicate that only 35.6% of the data can stay stationary within PIM units on average, with the rest requiring movement due to graph dependencies. This situation worsens as the number of PIM units increases, reducing the ratio to 18.7%.
In this paper, we argue that to fully leverage PIM architectures, systems must maximize stationary data and minimize the movement of non-stationary data. Following this principle, we propose Anchor, a scalable PIM architecture that exploits stationary data for GNNs through a hardware-software co-design approach. To maximize stationary data, we introduce the graph partitioning algorithm Mastav, which carefully allocates vertices and edges to preserve data locality. To minimize the movement of non-stationary data, we employ a two-step strategy. First, a customized dataflow ensures that non-stationary data is accessed and distributed exactly once. Second, an optimized communication mechanism reduces redundant data transfers through critical paths. Our extensive experiments demonstrate that Anchor significantly reduces processing latency and data movement compared to representative schemes.
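A rough way to measure the kind of stationary-data ratio quoted above, for a given vertex-to-PIM-unit assignment, is sketched below; the edge list and the modulo assignment are placeholders, and the actual Mastav partitioner is not reproduced.

from collections import defaultdict

def stationary_ratio(edges, assignment):
    """Fraction of edge endpoints whose feature data can stay inside the
    PIM unit that owns the destination vertex (a rough locality proxy).

    edges: iterable of (src, dst) vertex pairs.
    assignment: dict mapping vertex id -> PIM unit id.
    """
    local, total = 0, 0
    for src, dst in edges:
        total += 1
        if assignment[src] == assignment[dst]:
            local += 1          # src feature already resides in dst's unit
    return local / max(total, 1)

# Example: 4 vertices, 2 PIM units, naive modulo assignment (hypothetical).
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
assignment = {v: v % 2 for v in range(4)}
print(f"stationary ratio: {stationary_ratio(edges, assignment):.2f}")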
Research Manuscript


Design
DES3: Emerging Models of ComputatioN
DescriptionDesign space exploration (DSE) through system-level simulation is essential for designing energy-efficient asynchronous neuromorphic hardware, which is increasingly promising in edge AI applications. However, there are significant mismatches between system-level predictions and gate-level simulations, resulting in low precision when predicting performance during the DSE process for asynchronous neuromorphic hardware. To address this issue, we put forward ANGraph, a graph neural network (GNN)-based performance prediction framework for asynchronous neuromorphic hardware. In the ANGraph framework, we transform the intermediate representation of system-level simulations into graphs, collect gate-level circuit simulation results to build benchmarks with over one million samples, and train a GNN model to predict hardware latency for asynchronous neuromorphic hardware. Additionally, we use a residual network (ResNet)-based method to predict the power consumption of asynchronous neuromorphic hardware. We evaluate these two models on additional datasets without extra training across different scales, process nodes, and traffic patterns of input data. Compared to the latency predictions from the state-of-the-art simulator, we improve the R-square score by 0.69 and reduce root mean square error (RMSE) by 76% on average across all datasets. We also achieve an R-square score of 0.98 and a mean absolute percentage error (MAPE) of 0.88% for the power consumption prediction task. The benchmarks and models will be publicly available at https://github.com/ if the paper is accepted.
Engineering Poster


DescriptionData path designs with algorithmic logic have their own challenges in verification. DV sign-off using simulation alone can have holes, as it is nearly impossible to cover all scenarios in a limited DV time frame. Formally verifying equivalence between the RTL and its reference C++ algorithm model provides an edge over the traditional simulation approach. The paper describes the advantages of using C2RTL equivalence for this verification.
Design Use Case-De-Compression IP:
The decompression IP uses the LZO RLE algorithm to decompress data up to 4KB. The input compressed data stream can have a size between 19 bytes and 4KB, which makes it difficult to cover all input data values with functional simulation. This can leave scenarios in the algorithm uncovered and hence lead to bug escapes.
Approach Taken:
1) Instruction-based formal verification
2) Reduced number of properties in the FV setup
3) Unique mapping of C variables to RTL cycle-based signals
Results:
Verified a sequence of 3 random instructions with input vector sizes ranging from 19 to 64 bytes. This resulted in a failure case.
Compressed input data, 60 bytes in size: 240'h00000000d09002000a360088004010111
Decompressed Output Data:
RTL Output: 128'h0000 0009 0020 6060 6060 6060 6008 8004
C++ Output: 128'h0020 6009 0020 6060 6060 6060 6008 8004
After analyzing the LZO algorithm encoding of the compressed instruction, the C++ decompressed output data is correct, but the RTL output data is incorrect. This case could not be covered by the simulation-based approach even though thousands of input vectors were simulated against the decompressor RTL for more than a month.
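The equivalence setup itself is tool-specific, but the failure triage step of lining up the RTL output against the C++ reference output word by word can be shown with a small sketch; the values are the ones reported above, and the 16-bit word width is an assumption for display only.

def diff_outputs(rtl_hex, ref_hex, word_bits=16):
    """Compare two equal-length hex outputs word by word and report mismatches."""
    width = word_bits // 4  # hex digits per word
    rtl = [rtl_hex[i:i + width] for i in range(0, len(rtl_hex), width)]
    ref = [ref_hex[i:i + width] for i in range(0, len(ref_hex), width)]
    return [(i, r, c) for i, (r, c) in enumerate(zip(rtl, ref)) if r != c]

# Outputs from the failing instruction sequence above (128 bits each).
rtl_out = "00000009002060606060606060088004"
cpp_out = "00206009002060606060606060088004"
for idx, rtl_word, ref_word in diff_outputs(rtl_out, cpp_out):
    print(f"word {idx}: RTL={rtl_word} C++={ref_word}")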
Research Manuscript


EDA
EDA2: Design Verification and Validation
DescriptionSatisfiability Modulo Theory (SMT) solvers have advanced automated reasoning, solving complex formulas across discrete and continuous domains. Recent progress in propositional model counting motivates extending SMT capabilities toward model counting, especially for hybrid SMT formulas. Existing approaches, like bit-blasting, are limited to discrete variables, highlighting the challenge of counting solutions projected onto the discrete domain in hybrid formulas.
We introduce pact, an SMT model counter for hybrid formulas that uses hashing-based approximate model counting to estimate solutions with theoretical guarantees. pact makes a logarithmic number of SMT solver calls relative to the projection variables, leveraging optimized hash functions. pact achieves significant performance improvements over baselines on a large suite of benchmarks. In particular, out of 14,202 instances, pact successfully finished on 603 instances, while the baseline could only finish on 13 instances.
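pact's hash functions and solver integration are its own, but the underlying hashing-based counting principle, intersecting the formula with random XOR (parity) constraints and scaling the surviving solution count by 2^k, can be illustrated on a toy propositional formula; the brute-force enumeration below stands in for the SMT solver calls, and the toy formula is an assumption.

import random
from itertools import product

def count_with_xor_hash(formula, n_vars, k, seed=0):
    """Estimate the model count of `formula` (a predicate over a bit tuple)
    by adding k random XOR constraints and counting surviving models.

    Brute force replaces the SMT oracle here; real counters query a solver.
    """
    rng = random.Random(seed)
    # Each XOR constraint: a random subset of variables and a random parity bit.
    xors = [([rng.randrange(2) for _ in range(n_vars)], rng.randrange(2))
            for _ in range(k)]

    def survives(assign):
        return all(sum(c * a for c, a in zip(coeffs, assign)) % 2 == parity
                   for coeffs, parity in xors)

    cell = sum(1 for assign in product((0, 1), repeat=n_vars)
               if formula(assign) and survives(assign))
    return cell * (2 ** k)   # scale the hashed cell back up

# Toy formula over 6 variables: "at least three of the bits are 1".
formula = lambda a: sum(a) >= 3
exact = sum(1 for a in product((0, 1), repeat=6) if formula(a))
print("exact:", exact, "estimate:", count_with_xor_hash(formula, 6, k=2))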
Networking
Work-in-Progress Poster


DescriptionWith the increasing size of images and the widespread use of image filtering for tasks like photo enhancement, noise reduction, and edge detection, rapidly handling extensive amounts of pixel data has become a requirement for image processing. However, limited computing resources and memory bandwidth constrain rapid image filtering. Processing-in-Memory (PIM) architectures can solve this issue by integrating parallel processing elements (PEs) within memory and minimizing data movement, but existing PIM-based image filtering approaches require data sharing between the PEs, causing huge communication overheads due to the lack of direct inter-PE communication support in commodity PIM systems.
To address this limitation, we propose a new approximation-based, inter-PE communication-free image filtering scheme for commodity PIM systems called ApplePIM. By approximating partitioned image boundaries in the filtering process, ApplePIM efficiently removes inter-PE communication across multiple PEs in a commodity PIM. Furthermore, ApplePIM analyzes the error sensitivity of the image filter and dynamically determines its approximation method depending on that sensitivity. If the image filter is too sensitive for all the approximation methods, ApplePIM conservatively duplicates the boundaries of image partitions, thus avoiding inter-PE communication at a small cost of additional memory usage.
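ApplePIM's dynamic sensitivity analysis is not reproduced here, but the basic trade-off, approximating a partition's boundary pixels instead of exchanging halo rows between PEs, can be shown for a 1-D 3-tap blur; the edge-replication approximation and the partition count are illustrative assumptions.

import numpy as np

def blur_rows(tile):
    """3-tap vertical box blur; needs one halo row above and below the tile."""
    padded = np.pad(tile, ((1, 1), (0, 0)), mode="edge")
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

image = np.arange(64, dtype=float).reshape(8, 8)

# Exact: filter the whole image (equivalent to exchanging halo rows between PEs).
exact = blur_rows(image)

# Approximate: each PE filters its own stripe and replicates its boundary rows,
# so no inter-PE communication is needed.
stripes = np.array_split(image, 4, axis=0)          # 4 "PEs"
approx = np.vstack([blur_rows(s) for s in stripes])

err = np.abs(exact - approx).max()
print(f"max boundary approximation error: {err:.3f}")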
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionNeural Networks (NNs) excel in vision and language tasks but face high computational costs, bottlenecked by floating-point operations. Approximation methods like Mitchell's logarithm enhance hardware efficiency but struggle with precision, integration, and accuracy-resource trade-offs. Building on a hardware-efficient down-sampling-based compensation method addressing precision loss and a flexible bias mechanism for diverse NN data distributions, this paper designs configurable systolic arrays optimized for NN accelerators. Supported by April, a co-design framework balancing accuracy-resource efficiency, FPGA evaluations show April-generated systolic arrays reduce RMSE by 96% and achieve 34%-52% area reduction compared to INT8 implementations, while maintaining or improving accuracy.
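The paper's down-sampling compensation and bias mechanism are its own contributions; the baseline it builds on, Mitchell's logarithmic approximation of multiplication, is standard and can be sketched directly to show where the precision loss comes from.

def mitchell_multiply(a: int, b: int) -> int:
    """Mitchell's approximate multiplication of two positive integers.

    Approximates log2(x) as k + m, where k = floor(log2 x) and m is the
    fractional mantissa, then adds the logs and converts back.
    """
    assert a > 0 and b > 0
    k1, k2 = a.bit_length() - 1, b.bit_length() - 1
    m1, m2 = a / (1 << k1) - 1.0, b / (1 << k2) - 1.0   # mantissas in [0, 1)
    if m1 + m2 < 1.0:
        approx = (1 << (k1 + k2)) * (1.0 + m1 + m2)
    else:
        approx = (1 << (k1 + k2 + 1)) * (m1 + m2)
    return int(round(approx))

for a, b in [(13, 27), (100, 200), (255, 255)]:
    exact, approx = a * b, mitchell_multiply(a, b)
    print(f"{a}*{b}: exact={exact} mitchell={approx} "
          f"err={(exact - approx) / exact:.1%}")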
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionDNN accelerators, significantly advanced by model compression and specialized dataflow techniques, have marked considerable progress. However, the frequent access of high-precision partial sums (PSUMs) leads to excessive memory demands in architectures utilizing input/weight stationary dataflows. Traditional compression strategies have typically overlooked PSUM quantization, which may account for 69% of power consumption. This study introduces a novel Additive Partial Sum Quantization (APSQ) method, seamlessly integrating PSUM accumulation into the quantization framework. A grouping strategy that combines APSQ with PSUM quantization enhanced by a reconfigurable architecture is further proposed. The APSQ performs nearly lossless on NLP and CV tasks across BERT, Segformer, and EfficientViT models while compressing PSUMs to INT8. This leads to a notable reduction in energy costs by 28-87%. Extended experiments on LLaMA2-7B demonstrate the potential of APSQ for large language models. The codes will be open-source upon publication.
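APSQ's grouping strategy and reconfigurable datapath are beyond a few lines, but the effect it targets, accumulating INT8-quantized partial sums instead of full-precision ones, can be illustrated for a tiled dot product; the symmetric fixed scale below is an illustrative assumption, not the paper's scheme.

import numpy as np

def quantize_int8(x, scale):
    return np.clip(np.round(x / scale), -128, 127).astype(np.int32)

rng = np.random.default_rng(0)
a = rng.standard_normal(256).astype(np.float32)
w = rng.standard_normal(256).astype(np.float32)

# Reference: full-precision accumulation of per-tile partial sums.
tiles = 8
ref = sum(float(a[i::tiles] @ w[i::tiles]) for i in range(tiles))

# PSUM quantization: each tile's partial sum is stored as INT8 before the
# running accumulation, mimicking a low-precision PSUM buffer.
psum_scale = 0.25                      # illustrative fixed scale
acc = 0
for i in range(tiles):
    psum = float(a[i::tiles] @ w[i::tiles])
    acc += int(quantize_int8(psum, psum_scale))
approx = acc * psum_scale

print(f"fp32 result={ref:.3f}  int8-PSUM result={approx:.3f}")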
Research Manuscript


Design
DES6: Quantum Computing
DescriptionIn the current NISQ era, the performance of QNN models is strictly hindered by the limited qubit number and inevitable noise. A natural idea to improve the robustness of QNN is to involve multiple quantum devices. Nevertheless, due to the heterogeneity and instability of quantum devices (e.g., noise, frequent online/offline), training and inference on distributed quantum devices may even destroy the accuracy.
In this paper, we propose ArbiterQ, a comprehensive QNN framework designed for efficient and high-accuracy training and inference on heterogeneous QPUs. The main innovation of ArbiterQ is that it applies personalized models for each QPU via two uniform QNN representations: a model vector and a behavioral vector. The model vector specifies the logical-level parameters in the QNN model, while the behavioral vector captures the hardware-level features when implementing the QNN circuit. In this manner, by sharing the gradient among QPUs with similar behavioral vectors, we can effectively leverage parallelism while considering heterogeneity. We also propose shot-oriented inference scheduling, a much more fine-grained scheduling approach that can improve accuracy and balance the workload. The experiments show that ArbiterQ accelerates the training process by 4.03X with 7.87% loss reduction, compared with the previous distributed QNN framework EQC.
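ArbiterQ's behavioral vectors come from real QPU calibration data; the grouping step, sharing gradients only among QPUs whose behavioral vectors are similar, can be sketched with a cosine-similarity threshold, where the feature vectors, threshold, and gradients below are all illustrative assumptions.

import numpy as np

def group_by_similarity(behavior, threshold=0.9995):
    """Greedily group QPUs whose behavioral vectors have cosine similarity
    above `threshold`; gradients are later averaged within each group."""
    norm = behavior / np.linalg.norm(behavior, axis=1, keepdims=True)
    groups, assigned = [], set()
    for i in range(len(behavior)):
        if i in assigned:
            continue
        members = [j for j in range(len(behavior))
                   if j not in assigned and norm[i] @ norm[j] >= threshold]
        assigned.update(members)
        groups.append(members)
    return groups

# 4 hypothetical QPUs described by (readout error, gate error, T1-derived) features.
behavior = np.array([[0.020, 0.0060, 0.90],
                     [0.021, 0.0062, 0.88],
                     [0.080, 0.0200, 0.50],
                     [0.079, 0.0190, 0.52]])
grads = np.array([[0.10, -0.20], [0.12, -0.18], [0.50, 0.40], [0.48, 0.45]])

for g in group_by_similarity(behavior):
    shared = grads[g].mean(axis=0)      # gradient shared within the group
    print(f"QPUs {g} share gradient {shared}")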
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionModern data-driven applications expose limitations of von Neumann architectures—extensive data movement, low throughput, and poor energy efficiency. Accelerators improve performance but lack flexibility and require data transfers. Existing compute in- and near-memory solutions mitigate these issues but face usability challenges due to data placement constraints. We propose a novel cache architecture that doubles as a tightly coupled compute-near-memory coprocessor. Our RISC-V cache controller executes custom instructions from the host CPU using vector operations dispatched to near-memory vector processing units within the cache memory subsystem. This architecture abstracts memory synchronization and data mapping from application software while offering software-based Instruction Set Architecture (ISA) extensibility. Our implementation shows 30× to 84× performance improvement when operating on 8-bit data over the same system with a traditional cache when executing a worst-case 32-bit CNN workload, with only 41.3% area overhead.
Research Special Session


AI
DescriptionGenerative AI models such as LLMs have emerged as the state-of-the-art in various machine learning applications, including vision, speech recognition, code generation, and machine translation. These large transformer-based models surpass traditional machine learning methods, albeit at the expense of hundreds of ExaOps in computation. Hardware specialization and acceleration play a crucial role in enhancing the operational efficiency of these models, in turn necessitating synergistic cross-layer design across algorithms, hardware, and software.
In this talk, I will focus on the challenges and opportunities introduced by these models (e.g., trade-offs between compute and memory bandwidth). Recent advances in AI algorithms and reduced-precision/quantization techniques have led to improvements in compute efficiency while maintaining the same level of accuracy. System architectures need to be tailor-made to mimic the communication patterns of LLMs, which the software can then leverage to feed data to the compute engines, achieving high sustained compute utilization. The talk will present this holistic approach adopted in the design of the recently announced IBM Spyre accelerator.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionNowadays, the importance of data privacy protection has grown significantly. Private Set Intersection (PSI) based on Fully Homomorphic Encryption (FHE) is widely applied in various privacy protection scenarios, such as federated learning and password verification. Nevertheless, the substantial computational demands of FHE and the vast scale of databases in PSI result in inefficient processing, thereby necessitating specialized accelerator architectures to enhance usability. Current general-purpose FHE accelerators do not adequately address the unique requirements of PSI applications, leading to suboptimal data handling and underutilization of hardware, which impedes their effective deployment for PSI acceleration. This paper introduces Ares, a practical hardware-software co-designed FHE-based PSI FPGA accelerator. We propose Lazy Relinearization to optimize redundant calculations in PSI and reduce computational complexity without changing the PSI protocol. At the same time, through the analysis and decoupling of the PSI computing pattern, we design an efficient hardware acceleration architecture that fully utilizes the bandwidth and computing resources of the hardware to achieve excellent acceleration performance. We highlight the following results: (1) a 47.99x speedup relative to CPU; (2) performance improvements of 1.79x and 1.93x over the state-of-the-art FPGA FHE accelerators, Poseidon and FAB, respectively; (3) 7.96x and 10.95x energy efficiency improvements compared to Poseidon and FAB, respectively.
Research Manuscript


EDA
EDA4: Power Analysis and Optimization
DescriptionThe physics-informed neural network (PINN) is a promising technology to accelerate thermal analysis of integrated circuits (ICs). However, the application of PINN to three-dimensional (3D) scenarios faces numerous challenges, including the requirement of a vast number of sampling points as well as the high-level fitting capability required of the network. This paper introduces an Adaptive Sub-regional Random Resampling-based PINN (ASRR-PINN) to accelerate the thermal analysis of 3D-ICs. The adaptive sub-regional random resampling (ASRR) algorithm is proposed for efficient sampling in PINN; it divides the solution domain into a series of sub-regions and further adaptively decomposes these sub-regions during the training process. In addition, the sampling points are randomly resampled within each sub-region. By this means, the solution accuracy is enhanced significantly, especially in cases with a limited number of sampling points. To accelerate the convergence of the network, the thermal boundary conditions are incorporated into the ASRR-PINN, ensuring that the network outputs automatically satisfy these constraints. Furthermore, parameterization of ASRR-PINN is achieved by training an additional reduced network. Numerical results show that the ASRR algorithm reduces the maximum absolute error (AE) by more than 56% compared with other non-uniform sampling methods while the runtime is shortened by at least 28%. Moreover, using the parameterized ASRR-PINN to explore the design space of 3D-ICs achieves speeds over 200 times faster than the original ASRR-PINN.
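The full ASRR-PINN couples this sampling loop to network training; the sampling idea on its own, split the sub-region with the worst residual and randomly resample points inside every sub-region, can be sketched in 1-D with a stand-in residual function (in the real framework the residual would come from the PINN's PDE loss).

import numpy as np

rng = np.random.default_rng(0)

def residual(x):
    """Stand-in for the PDE residual; sharply peaked near x = 0.7."""
    return np.exp(-200.0 * (x - 0.7) ** 2)

# Start with 4 equal sub-regions of [0, 1].
regions = [(i / 4, (i + 1) / 4) for i in range(4)]

for step in range(3):
    # Randomly resample points inside each sub-region and score it.
    scores = []
    for lo, hi in regions:
        pts = rng.uniform(lo, hi, size=32)
        scores.append(residual(pts).mean())
    # Adaptively decompose: split the worst-scoring sub-region in half.
    worst = int(np.argmax(scores))
    lo, hi = regions.pop(worst)
    mid = 0.5 * (lo + hi)
    regions += [(lo, mid), (mid, hi)]
    print(f"step {step}: split ({lo:.3f}, {hi:.3f}), {len(regions)} regions")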
Research Manuscript


Design
DES6: Quantum Computing
DescriptionQuantum layout synthesis (QLS) is a critical step in quantum program compilation for superconducting quantum computers, involving the insertion of SWAP gates to satisfy hardware connectivity constraints. While previous works have introduced SWAP-free benchmarks with known-optimal depths for evaluating QLS tools, these benchmarks overlook SWAP count—a key performance metric. Real-world applications often require SWAP gates, making SWAP-free benchmarks insufficient for fully assessing QLS tool performance. To address this limitation, we introduce QUBIKOS, a benchmark set with provably optimal SWAP counts and non-trivial circuit structures. For the first time, we are able to quantify the optimality gaps of SWAP gate usage of the leading QLS algorithms, which are surprisingly large: LightSABRE from IBM delivers the best performance with an optimality gap of 63x, followed by ML-QLS with an optimality gap of 117x. Similarly, QMAP and t|ket⟩ exhibit significantly larger gaps of 250x and 330x, respectively. This highlights the need for further advancements in QLS methodologies. Beyond evaluation, QUBIKOS offers valuable insights for guiding the development of future QLS tools, as demonstrated through an analysis of a suboptimal case in LightSABRE. This underscores QUBIKOS's utility as both an evaluation framework and a tool for advancing QLS research.
Research Manuscript


EDA
EDA9: Design for Test and Silicon Lifecycle Management
DescriptionTo avoid corruption of user data, predictive testing methods have been proposed to identify SRAMs likely to fail in the near future due to aging. These methods use aggressive operating conditions, e.g., adjustments to wordline voltages or the power supply voltage, that are calibrated to provide high coverage of SRAMs likely to fail in the near future, but end up with some over-testing, i.e., spuriously identifying some chips as likely to fail.
We first present our study which discovered that a large fraction of over-tested chips fail due to read faults triggered during read-1 operations. Our analysis identifies asymmetric aging in SRAM cells, which are more likely to store zeros, as the root cause for this. We build on this discovery to propose an asymmetric predictive testing method which performs writes using normal voltages, read-0 at aggressive voltages, and read-1 at less aggressive voltages. We demonstrate that this method significantly reduces over-testing by over 3x to 5x, for low limits on under-testing. We also propose and use a new statistical sampling and simulation method to enable fast convergence and accurate evaluation of asymmetric predictive testing.
Research Manuscript


EDA
EDA4: Power Analysis and Optimization
DescriptionAccurate power prediction in VLSI design is crucial for effective power optimization, especially as designs get transformed from gate-level netlist to layout stages. However, traditional accurate power simulation requires time-consuming back-end processing and simulation steps, which significantly impede design optimization. To address this, we propose ATLAS, which can predict the ultimate time-based layout power for any new design in the gate-level netlist. To the best of our knowledge, ATLAS is the first work that supports both time-based power simulation and general cross-design power modeling. It achieves such general time-based power modeling by proposing a new pre-training and fine-tuning paradigm customized for circuit power. Targeting golden per-cycle layout power from commercial tools, our ATLAS achieves the mean absolute percentage error (MAPE) of only 0.58%, 0.45%, and 5.12% for the clock tree, register, and combinational power groups, respectively, without any layout information. Overall, the MAPE for the total power of the entire design is <1%, and the inference speed of a workload is significantly faster than the standard flow of commercial tools.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionLarge Language Models (LLMs) have demonstrated unprecedented generative performance across a wide range of applications. While recent heterogeneous architectures attempt to address the memory-bound bottleneck from attention computations by processing-in-memory (PIM) offloading, they overlook two critical characteristics of attention GEMVs that distinguish them from traditional PIM scenarios: (1) dynamic matrix dimensions that scale with token length, and (2) distinct GEMV patterns between score computation (Q×K^T) and context computation (S×V). Existing PIM designs, employing either uniform or transposed computing modes, suffer from inefficiencies in newly generated element preparation or distinct GEMV execution. To address these limitations, we propose AttenPIM, a software-hardware co-design for efficient PIM-based attention acceleration. For bank-level execution, we propose dual-mode computing modes tailored for score and context computations with PIM-oriented data layouts and execution flows for KV storage, supported by a low-cost configurable per-bank PIM unit (PU). For system-level execution, we leverage token-level and head-level concurrency to ensure workload balance and maximize bank PU parallelism. Furthermore, dynamic allocation and kernel fusion methods are proposed to further minimize memory overhead. Experimental results demonstrate that AttenPIM achieves 1.13×-5.26× speedup and reduces energy consumption by 17%-49% compared to two state-of-the-art PIM baselines.
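The two GEMV patterns that motivate AttenPIM's dual computing modes can be written out directly for one head during decoding; the head dimension and token count below are arbitrary placeholders.

import numpy as np

d, t = 64, 10                       # head dim, tokens generated so far
rng = np.random.default_rng(0)
K = rng.standard_normal((t, d))     # cached keys grow with token length
V = rng.standard_normal((t, d))     # cached values grow with token length
q = rng.standard_normal(d)          # query for the newest token

# Score computation (Q x K^T): a length-d vector against a t x d matrix.
s = K @ q                           # shape (t,)
p = np.exp(s - s.max()); p /= p.sum()

# Context computation (S x V): a length-t vector against a t x d matrix.
o = p @ V                           # shape (d,)

print("scores:", s.shape, "output:", o.shape)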
Engineering Poster


DescriptionAs design complexity grows with shrinking technology nodes and larger design sizes, maintaining quality while ensuring execution speed is vital to meet time-to-market goals. Late detection of quality issues can disrupt project timelines. We introduce Audit, an intelligent tool that continuously monitors quality at every stage of design execution. By analyzing logs and reports, Audit identifies issues impacting QoR or later stages. It leverages predefined quality checks, provides a grading system for quick insights, and enables targeted issue resolution. Built with SQL, React, and Node.js, Audit is modular, scalable, and capable of handling high request volumes seamlessly.
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
DescriptionHigh-level synthesis (HLS) tools streamline FPGA design by enabling engineers to implement hardware using C/C++ languages. However, while clock management serves as a critical stage in the FPGA EDA flow that affects system-level performance, area, and especially power consumption, existing commercial HLS tools lack comprehensive solutions for clock management. Specifically, the diversity of clock resources creates a vast design space for finding the optimal configuration, and the insufficient analysis of multiple clock domain scenarios hinders effective clock-oriented optimizations in HLS. This work introduces AutoClock, an open-source integrated clock management framework that complements AMD Vitis HLS. AutoClock allocates resources for clock generation, assigns modules to appropriate clock domains, addresses metastability and time division multiplexing (TDM) malfunctioning introduced by multi-clock domain architectures, and hierarchically gates the clock of modules in a design. Experimental results demonstrate that AutoClock can fully utilize clock resources on FPGAs and help reduce dynamic power consumption by up to 74.38%.
Engineering Presentation


Front-End Design
DescriptionSystem-on-Chip (SoC) Design Verification (DV) traditionally relies on C-based verification methodologies, where test cases are executed using a default linker command file. These linker command files typically map code execution to predefined memory regions. While effective for general testing, this approach often leaves certain memory regions untested, potentially leading to missed corner cases and latent silicon bugs.
To address this gap, this work explores the creation of linker-agnostic test cases that dynamically target all executable memory regions applicable to the respective CPU architecture. The proposed methodology ensures comprehensive coverage by designing all test cases to run across diverse memory segments, reducing dependency on fixed linker configurations.
The implementation dynamically randomizes linker configurations by extracting executable regions from the memory map into a structured "executable sheet" that specifies the size and bounds of each region. Using this information, scripts parse the executable specifications to generate dynamic linker scripts that are used when executing the test cases in regression, ensuring comprehensive coverage and robustness in the verification process.
The paper emphasizes the significance of this methodology in achieving exhaustive memory coverage, minimizing validation escapes, and improving the overall quality of SoC verification. This approach aims to enhance first-pass success rates by uncovering potential silicon bugs earlier in the verification cycle, ultimately leading to more robust and reliable SoC designs.
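The flow's "executable sheet" and scripts are internal, but the basic step of emitting a linker memory map from a list of executable regions can be sketched; the region names, the GNU-LD-style output syntax, and the random choice of code region are illustrative assumptions.

import random

# Hypothetical "executable sheet": name, base address, size in bytes.
executable_sheet = [
    ("ITCM",  0x0000_0000, 0x0001_0000),
    ("SRAM0", 0x2000_0000, 0x0004_0000),
    ("SRAM1", 0x2004_0000, 0x0004_0000),
]

def emit_linker_script(regions, code_region):
    """Emit a minimal GNU-LD-style script placing .text in `code_region`."""
    lines = ["MEMORY", "{"]
    for name, base, size in regions:
        lines.append(f"  {name} (rwx) : ORIGIN = 0x{base:08X}, LENGTH = 0x{size:X}")
    lines += ["}", "SECTIONS", "{",
              f"  .text : {{ *(.text*) }} > {code_region}",
              "}"]
    return "\n".join(lines)

# Randomize which executable region hosts the test's code for this run.
chosen = random.choice(executable_sheet)[0]
print(emit_linker_script(executable_sheet, chosen))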
Engineering Poster
Networking


DescriptionMotivation:
The initial milestone of SoC verification includes the verification of integrity checks and register access checks for all IPs. To reduce verification closure time with improved quality and reduced resource effort, automation is deployed. This enables the backend team to start floorplanning and subsequent activities earlier in the project life cycle, resulting in early project closure.
Issues with the current approach:
Release-to-release test coding: specification changes due to the addition/removal of IPs as a result of architecture changes
Register addition/removal
Change of security policies for any given IP
Manual coding of tests is error prone and time consuming
Dealing with local changes (check-in/check-out issues) is prone to data loss
Ineffective disk space usage
Attribute sheet updated and tracked manually
Process and pay-off of automation:
Achieving efficient and effective closure of the initial verification phase by adopting maximum automation, centralized execution, and distributed debug involves:
Automated test generation from the specification for register access and integration checks (see the sketch after this list)
Centralized regression launch for all blocks on every label release
Failures dispatched to dedicated owners for timely debug
All generated code maintained in the repository via auto check-in
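A minimal sketch of spec-driven register-access test generation, assuming a simple register table with name, offset, access type, and reset value (the real flow derives this from the IP specification, and read32/write32/check_eq are hypothetical test-harness helpers):

# Hypothetical register spec rows: (name, offset, access, reset value).
REG_SPEC = [
    ("CTRL",   0x00, "RW", 0x0000_0000),
    ("STATUS", 0x04, "RO", 0x0000_0001),
    ("ID",     0x08, "RO", 0xDEAD_BEEF),
]

def generate_c_test(block_name, base_addr, spec):
    """Emit a C test that checks reset values and write/read-back for RW regs."""
    body = [f"// auto-generated register access test for {block_name}"]
    for name, offset, access, reset in spec:
        addr = f"0x{base_addr + offset:08X}"
        body.append(f'check_eq(read32({addr}), 0x{reset:08X}, "{name} reset");')
        if access == "RW":
            body.append(f"write32({addr}, 0xA5A5A5A5);")
            body.append(f'check_eq(read32({addr}), 0xA5A5A5A5, "{name} rw");')
    return "\n".join(body)

print(generate_c_test("uart0", 0x4000_1000, REG_SPEC))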
Engineering Poster
Networking


DescriptionThe paper proposes an automated flow, which combines AI/ML-based constraint generation and SV assertion-based constraint validation, to provide users with accurate CDC/RDC constraints without manual intervention. These constraints are formally validated. The approach targets assumptions in the design that are challenging to validate structurally and that users would otherwise have to provide as constraints.
Clock Domain Crossing (CDC) and Reset Domain Crossing (RDC) problems are common issues in modern digital design, especially in designs employing multiple clock domains. A modern SoC design often utilizes numerous underlying IPs and subsystems that operate asynchronously with clocks, resets, and interrupts. This gives rise to multiple CDC and RDC signals crossing clock and reset domain boundaries.
Advanced CDC/RDC tools provide smart structural analysis of RTL designs, which can be further validated formally for protocol verification. This analysis depends significantly on tool-specific design constraints such as clock/reset definitions, gray encoding, stable signals, and constant signals.
In the context of the ASIC design flow, static validation tools identify specific structures that may cause metastability or data coherence problems at runtime. With advances in artificial intelligence and machine learning, leading industry tools offer AI/ML-based utilities to generate or locate missing CDC/RDC constraints. However, the complexity of validating these generated constraints remains a challenge.
The proposed solution integrates the available technologies of AI/ML-based constraint generation and SV assertion-based constraint validation into an automated "push-button" flow. This reduces the need for human intervention in generating and validating CDC/RDC constraints. The paper focuses on targeting assumptions that are difficult to validate structurally and that users would otherwise be expected to provide.
When we discuss constraints, their sources can vary. Examples include:
1. Boundary assumptions that are beyond the designer's control and cannot be generated by the tool as SV assumptions, such as input clock port frequency and input port clock-domain association.
2. Boundary assumptions provided as design guidelines that can be converted into SV assumptions by available tools, such as constant values and signals static with respect to the clock.
3. Behavior on internal nets and registers that can be derived from structural analysis, such as clock domain association.
4. Assumptions on internal nets and registers based on design functionality, which are forced during runtime depending on the design modes and are challenging to validate structurally, such as gray-encoded address buses, internal functional constants, and reset ordering.
In this paper, the primary focus is on Category 4 above. The generated constraints will be supported by converting the SV assertions mentioned in Category 2 for complete closure.
We plan to analyze available solutions for AI/ML-based constraint generation and the types of constraints they generate. With machine-generated constraints, we will first explore the design-specific constraints that can be easily validated using SV assertions. Next, we will run static analysis with the automatically generated constraints and explore available tools to generate SV assertions for specific validation of these constraints. The process will be automated using TCL scripts for push-button functionality:
1. Run the optimized first-cut CDC/RDC structural analysis.
2. Automatically execute the AI/ML-based tool to generate constraints.
3. Filter out constraints satisfying Category 4 above.
4. Append these constraints to the constraints from the first-cut run and rerun the CDC/RDC analysis.
5. Automatically generate SV assertions for the added constraints and run formal analysis (a minimal sketch of such an assertion follows this description).
6. Filter out failing constraints and rerun the CDC/RDC analysis with passing constraints only. Vacuous formal properties can be explored for additional support of the Category 2 assumptions above.
7. Automate this entire flow with TCL scripts for a push-button methodology.
8. The solution will also explore support for a hierarchical design methodology to assist large SoC designs.
The results of this paper include:
• A set of constraints that are true by design and do not require manual application in CDC/RDC analysis, automating a significant part of the manual effort for CDC/RDC engineers.
• Integration of two available technologies to achieve faster closure time for projects already on critical paths.
• Opening up the possibility of automating more constraints in the future as AI/ML tools advance, making this a step in the right direction.
• Formal validation of constraints, which eliminates the risk of bad CDC/RDC constraints and redirects user effort toward addressing actual problems.
Future work will cover all available possibilities for tool-based constraint generation and highlight best-case scenarios readily amenable to formal validation.
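As a concrete example of step 5, a gray-encoded address bus (a Category 4 assumption) can be validated with a standard SystemVerilog assertion stating that at most one bit toggles per clock; the sketch below only generates that assertion text, and the signal and clock names are placeholders.

def gray_code_assertion(bus, clk, rst_n):
    """Return an SVA property checking that `bus` changes by at most one bit
    per clock, i.e. it is gray-encoded (placeholder signal names)."""
    return (
        f"property p_{bus}_gray;\n"
        f"  @(posedge {clk}) disable iff (!{rst_n})\n"
        f"  $countones({bus} ^ $past({bus})) <= 1;\n"
        f"endproperty\n"
        f"assert property (p_{bus}_gray);"
    )

print(gray_code_assertion("fifo_waddr_gray", "wr_clk", "wr_rst_n"))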
Research Manuscript


EDA
EDA1: Design Methodologies for System-on-Chip and 3D/2.5D System-in Package
DescriptionInstruction decoders are indispensable components of the System-on-Chip design flow and major constituents of instruction set simulators and processor toolchains. The complex and lengthy process of manual decoder design can be greatly alleviated by automated decoder generation tools based on high level instruction definitions. Unfortunately, automatic generation is challenged by the rising complexity of instruction sets as well as irregularities such as non-uniform opcodes, logic propositions on bit fields and multiple or nested specializations. The few available state-of-the-art decoder generation tools either cannot handle irregularities altogether or produce inadequate results, either functionally or w.r.t. performance. Moreover, they are largely ad hoc and do not bear on any of the well-established work on decision tree generation.
This paper presents a sophisticated decision-tree algorithm for the problem of generating decoders for irregular instruction sets. Our algorithm has produced fully automated, functionally correct and cost-aware decoders for the SPARC, MIPS32 and ARMv7 instruction sets. Our results show that applying information-theoretic concepts to decoder generation is a most promising approach.
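The paper's cost-aware algorithm is more elaborate, but the information-theoretic core, repeatedly picking the opcode bit that best splits the remaining instruction patterns, can be sketched for a toy encoding table; the three-instruction table below is illustrative, not from any real ISA.

import math

# Toy instruction patterns: fixed bits as '0'/'1', don't-care as 'x'.
PATTERNS = {"ADD": "000x", "SUB": "001x", "JMP": "1xxx"}

def entropy(groups):
    total = sum(groups.values())
    return -sum(n / total * math.log2(n / total) for n in groups.values() if n)

def best_bit(patterns):
    """Pick the bit position giving the largest expected entropy reduction."""
    width = len(next(iter(patterns.values())))
    best, best_gain = None, -1.0
    base = entropy({k: 1 for k in patterns})
    for i in range(width):
        # Skip bits that are don't-care for some remaining instruction.
        if any(p[i] == "x" for p in patterns.values()):
            continue
        split = {"0": {}, "1": {}}
        for name, p in patterns.items():
            split[p[i]][name] = 1
        gain = base - sum(len(s) / len(patterns) * entropy(s)
                          for s in split.values() if s)
        if gain > best_gain:
            best, best_gain = i, gain
    return best

print("first bit to test:", best_bit(PATTERNS))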
Engineering Poster
Networking


DescriptionThe ISO 21434 standard provides a structured approach to automotive cybersecurity to address the increasing risk of cyber attacks. Each cybersecurity goal is assigned a Cybersecurity Assurance Level (CAL) based on the potential impact of an attack. These goals are achieved through system requirements, which are associated with specific design or software modules. The CAL rating of a component is determined by the highest-rated goal that is mapped to it. A higher CAL rating requires more rigorous testing. However, manually generating the cybersecurity verification work products required by the standard can be challenging. Specifically, the challenges include: 1. Creating a mapping from components to CAL ratings, while ensuring that the highest-rated goal is selected, can be difficult when there are hundreds of goals mapped to a smaller number of requirements, which then map to 20-30 components. 2. It can be difficult to determine at a glance whether testing of a component is complete with respect to its CAL rating. 3. Reporting component-specific test details, along with their owners and mapped requirements, without missing any details and ensuring all requirements are covered, can be a complex task when test cases run into hundreds. These challenges highlight the need for a more efficient and automated approach to generating cybersecurity verification work products that meet the requirements of the ISO 21434 standard.
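A minimal sketch of challenge 1, deriving a component's CAL as the highest CAL among the goals that reach it through requirements, assuming simple goal-to-requirement and requirement-to-component mappings (all names and ratings below are hypothetical):

# Hypothetical traceability data.
goal_cal = {"G1": 2, "G2": 4, "G3": 3}                 # goal -> CAL (1..4)
goal_to_reqs = {"G1": ["R1"], "G2": ["R1", "R2"], "G3": ["R3"]}
req_to_components = {"R1": ["crypto_core"], "R2": ["key_store"],
                     "R3": ["crypto_core", "debug_if"]}

def component_cal():
    """Propagate each goal's CAL to its components and keep the maximum."""
    cal = {}
    for goal, reqs in goal_to_reqs.items():
        for req in reqs:
            for comp in req_to_components.get(req, []):
                cal[comp] = max(cal.get(comp, 0), goal_cal[goal])
    return cal

print(component_cal())   # {'crypto_core': 4, 'key_store': 4, 'debug_if': 3}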
Networking
Work-in-Progress Poster


DescriptionDesigning deep neural network (DNN) accelerator configurations is challenging due to the vast design space encompassing hardware resources and mapping strategies. We present CORE, a novel hardware and mapping CO-design methodology using single-step REinforcement learning to optimize spatial DNN accelerators for simulation-based metrics. CORE employs a policy neural network to generate near-optimal joint distributions for sampling design choices. A scaling graph-based decoding method captures dependencies between design choices and maps them into accelerator configurations. Guided by configuration simulation, the policy NN is updated with an adaptive reward mechanism to penalize invalid designs and accelerate convergence. Experimental results show that CORE improves latency and latency-area-sum by over 15× compared to baseline methods while reducing the number of sampled designs.
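CORE's scaling-graph decoder and simulator are not reproduced; the single-step REINFORCE loop at its core, sample design choices from a categorical policy, score them, and push the logits toward higher-reward choices with a baseline-adjusted gradient, can be sketched with a toy reward standing in for the configuration simulator.

import numpy as np

rng = np.random.default_rng(0)

# Two design choices (e.g. PE-array size, buffer size), 4 options each (toy).
logits = [np.zeros(4), np.zeros(4)]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reward(choice):
    """Stand-in for the configuration simulator; prefers options (2, 1)."""
    return -abs(choice[0] - 2) - abs(choice[1] - 1)

baseline, lr = 0.0, 0.5
for step in range(200):
    probs = [softmax(l) for l in logits]
    choice = [rng.choice(4, p=p) for p in probs]
    r = reward(choice)
    baseline = 0.9 * baseline + 0.1 * r            # running-average baseline
    for l, p, c in zip(logits, probs, choice):
        onehot = np.eye(4)[c]
        l += lr * (r - baseline) * (onehot - p)    # REINFORCE update

print("final choice:", [int(np.argmax(l)) for l in logits])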
Engineering Poster
Networking


DescriptionIn this work, a workflow of automated IR violation fixing based on Synopsys PrimeClosure IR-ECO is presented, which tries to address the "last-mile" IR convergence issue while minimizing manual effort from design engineers.
Redhawk-SC IR signoff tool analyzes the aggressor cells in layout and calculates IR-drop violations and information, which is then fed to PrimeClosure.
PrimeClosure IR-ECO can automatically identify the correct fixing by cell swapping, sizing or moving, while being timing and DRC aware.
The cell change list from PrimeClosure can be passed to the PnR construction tool to apply design changes, and can also be annotated back to Redhawk-SC to get updated IR-drop results for the fixed cells.
Experiment data from test blocks in advanced process node indicates that the proposed automated IR-ECO flow can achieve satisfactory IR violation reduction while being timing and DRC friendly.
Along with the "shift-left" early in-design IR analysis, as well as timing aware IR signoff analysis, the three components complete an end-to-end IR convergence flow which can boost the execution efficiency for physical design teams.
Engineering Poster
Networking


DescriptionAddressing Power Delivery Network (PDN) violations across multiple scenarios within tight tape-out timelines has been a significant challenge due to the extensive manual effort required. Identifying root causes, implementing engineering change orders (ECOs), and validating their impact on both timing and violations demand considerable time and resources, making it impractical to complete within deadlines. To overcome this, a new multi-scenario IR-ECO flow was developed, offering an automated solution to reduce PDN violations efficiently.
This flow employs voltage impact technology from IR/EM tool (RHSC) to automatically identify root cause aggressors for violations across various scenarios. An ECO tool (Tweaker) communicates with the IR/EM tools to perform quick what-if evaluations, enabling the seamless generation of ECOs which are validated for timing and effectiveness across all scenarios. The flow prioritizes aggressors common to multiple violations, ensuring an optimal fix rate while preventing negative impacts on timing.
When tested on a large design, the IR-ECO flow achieved a 40-50% reduction in violations across four scenarios with overnight run time. By automating previously manual and time-consuming processes, this solution eliminates the need for waivers and excessive routing resources, freeing up weeks of engineering effort and accelerating the tape-out process while ensuring timing integrity.
Engineering Poster
Networking


DescriptionA typical IO controller design (which is one of the most critical components of an SoC) is used to route peripheral information to the pad and vice versa. Typically, to offer end users flexibility, IO controllers are configurable to map any peripheral on the chip to any pad on the device. As the number of IOs in the SoC increases, the permutations of peripheral-to-pad connectivity increase exponentially. For "m" peripheral options and "n" IOs, we have mPn ways of mapping them, which explodes exponentially with m and n. For example, for an SoC with 64 peripherals and 64 IOs, the muxing possibilities that need to be checked number 64! (~= 10^89). These peripheral-to-pad mappings can be implemented either in the form of fixed networks or shuffle networks, both based on configurations. This becomes a hurdle for Design Verification to verify all such combinations. The motivation is to look for an efficient and scalable solution that does not incur too much run-time and effort, and can be generalized for any IO controller design concept.
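The scale of the mapping space quoted above can be checked directly; math.perm(m, n) counts the ordered ways to assign m peripheral options to n pads.

import math

m, n = 64, 64                       # peripherals, IOs
mappings = math.perm(m, n)          # mPn; equals 64! when m == n
print(f"{mappings:.3e}")            # ~1.269e+89, matching the ~10^89 estimate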
Exhibitor Forum


DescriptionThe requirement for extremely high bandwidth chiplet interconnectivity in 2.5D and 3D packaging technologies has surged, driven by exponential growth in computing performance required for Artificial Intelligence and Machine Learning (AI/ML) applications. These new chiplet architectures have created an urgent need for high-performance, customizable die-to-die (D2D) interconnect IP solutions. This presentation explores the application of an automated analog design platform, specifically the Blue Cheetah Adaptive Interconnect Platform (AIP), to accelerate the development and optimization of D2D interconnect IP for extremely high bandwidth chiplet ecosystems. We will discuss how AIP's parameterizable generator scripts can automate the creation of layout, schematic, and simulation environments for D2D interconnect circuits. The framework's ability to encode design methodologies and leverage computer resources for iterative design and verification processes will be highlighted. We'll demonstrate how this approach can significantly reduce turnaround time from circuit design to tape-out in advanced FinFET nodes. The presentation will cover key challenges in high-bandwidth D2D interconnect design, including bandwidth density, energy efficiency, and signal integrity. Case studies will illustrate the application of AIP to create UCIe compatible D2D PHYs, showcasing quick customization for various data rates, I/O configurations, and packaging types. By demonstrating the power of automated analog design tools in creating flexible, high-performance D2D interconnect IP, this presentation aims to highlight a path forward for accelerating chiplet ecosystem development and enabling more efficient heterogeneous integration solutions for extremely high bandwidth applications.
Engineering Poster


DescriptionLiberty file characterization is an arduous and time-consuming process. Liberty files contain millions of data points related to cell timing, power, and statistical variation dependent on complex settings within a Characterizer. With each new PDK, .lib files must be meticulously reviewed for errors that will result in STA failure and eventual production delays.
The proposed solution in this paper provides designers with a comprehensive characterization QA tool with differentiated debugging, automation and visualization capabilities.
Specifically, we describe an automated, custom QA workflow used by Impinj to find issues early in the design process, resulting in high-quality products and on-time deliveries. More than 150 structural and rule-based checks are run, along with industry-leading AI outlier detection on .libs, catching bugs early in the design process and yielding high-quality libraries at every PDK revision. Custom API scripts are used to perform design-specific checks and organize library information in user-specified formats.
Engineering Poster


DescriptionPin accessibility is crucial in semiconductor layout development. As process nodes shrink, achieving a compact layout with high yield is essential, involving tightly packed routing and device placement within the standard cell. Ensuring pins have adequate clearance for connections becomes challenging, especially with fewer metal layers. Issues with pin accessibility during placement and routing can significantly prolong the development cycle.
This paper introduces a novel technique to validate standard cells by placing them in topologies that mimic the SoC congestion environment, using the same router engine and rules as in actual Placement and Routing. The simple, repetitive topologies produce comparable, probabilistic routing results that indicate the quality of standard cells in various placement scenarios. These topologies can involve the same cell repeated in different orientations, or a central DUT with varying surrounding cells. The flow sweeps across selected cells to generate all such topologies, using fillers for uniform placement of the DUT. DRC results and the number of vias used are reported for comprehensive analysis. This method ensures that pin-accessibility issues are detected during the standard-cell layout development cycle, so that standard cells are free of pin-accessibility problems upon construction.
Engineering Poster


DescriptionConsiderable automation exists for placement from block to top level. For routing, however, the implementation requirements can be much more complex and design-specific. Depending on whether single-net, bus, or stacked-bus routing is required, substantial customization is needed to meet the specifications. In addition, multiple constraints must be considered, such as DRCs, parasitics, area, and current density. At the TestChip level, on the order of hundreds of bits need to be routed in a DRC-correct manner. Critical nets such as the CLK tree must be routed before other nets, and bus nets and symmetrical nets should be routed with similar topology.
An amalgamation of automatic and interactive routing is proposed in this paper to tackle TestChip routing of 100-200 nets efficiently. We introduce a new integrated methodology for bus routing in Cadence Virtuoso Studio Layout Design.
We have developed automated and interactive solutions, with flight lines to show the connectivity between blocks, achieving first-time-correct DRC routing of 100-150 nets with minimal double-via connections.
Networking
Work-in-Progress Poster


DescriptionWe explore the application of agentic Large Language Models (LLMs) for hardware design. Generation and optimization of Register-Transfer Level (RTL) code is a challenging task that requires significant human effort. An agentic flow consisting of multiple agents, each a combination of specialized LLMs and hardware simulation tools, allows them to work together to complete the complex task of hardware design, and may incorporate human feedback to improve the efficiency of these tools. The proposed agentic flow, built on the open-source AutoGen framework, leverages iterative error feedback to significantly improve efficiency. This approach improves test pass rates and ensures successful compilation in RTL code generation, all while keeping computational overhead minimal. Additionally, it streamlines high-level design specifications, guaranteeing syntactical accuracy, compilation reliability, and functional integrity of the generated RTL. A key feature of this flow is its self-correcting mechanism, where outputs from each stage are refined through iterative feedback loops and validated against test benches treated as black boxes. The study also investigates the trade-offs between open-source and closed-source LLMs as primary code generators in a zero-shot setting. To validate this adaptive approach to code generation, benchmarking is performed using two open-source natural language-to-RTL datasets. To the best of our knowledge, this is the first work to explore the feasibility of agentic flows for RTL generation.
Engineering Poster
Networking


DescriptionLow power microcontrollers are becoming increasingly complex, making it crucial to test corner cases to avoid unexpected issues in the field. Testing low power and security scenarios can be high-risk, as failures can lead to device power up issues and security vulnerabilities. To mitigate this, emulation platforms like Cadence's Palladium can be used to analyze waveforms and ensure comprehensive coverage of all possible time windows. However, manual testing and analysis can be time-consuming and prone to errors. The BDI (Break the Device Intent) methodology involves identifying a scenario of interest, sweeping events to cover a specific time window, and using a shmoo engine to generate events at desired delays. To perform BDI testing on the Palladium emulator, a shmoo engine needs to be hooked to the DUT (Device Under Test) via a testbench, and a test code needs to be loaded on the DUT. The shmoo engine must be configured to generate events at desired times, and this information is communicated to the Palladium via qel files. The manual creation of qel files and the process of capturing and saving waveforms can be time-consuming and repetitive, requiring significant effort. Additionally, failed scenarios can cause the DUT to get stuck, making it necessary for the tester to actively monitor the process.
Networking
Work-in-Progress Poster


DescriptionSDV centralizes control algorithms in high-performance ECUs. This increases end-to-end latency on paths between sensors, central compute, and actuators, where the latency is caused by software and network components. Our analysis shows that execution of the software communication protocol stack contributes significantly to latency and MCU utilization. We introduce ARDMA, inspired by techniques used in data centers, to minimize the latency of control traffic. ARDMA reduces latency by eliminating the software communication stack and instead reading data from, and writing data to, a remote ECU's memory directly. We implemented ARDMA both in software and in hardware (FPGA). Our measurements show a significant reduction in latency compared to UDP.
Engineering Poster
Networking


DescriptionAs the complexity of Application-Specific Integrated Circuit (ASIC) design continues to increase, traditional Electronic Design Automation (EDA) tools face significant challenges in efficiently managing the physical design process. Our work introduces a machine learning-driven framework aimed at accelerating the Placement and Routing (PnR) stages of the RTL-to-GDSII flow. By incorporating reinforcement learning and predictive models, the framework automates input recommendations, QoR predictions, and resource allocation, providing a more efficient and scalable approach to ASIC design. The use of machine learning enables the framework to optimize design parameters, reduce design time, and improve overall design quality. This research demonstrates the potential of machine learning to enhance the design process, addressing the growing demands of modern semiconductor development and enabling the creation of complex, high-performance ASICs.
Research Manuscript


Systems
SYS4: Embedded System Design Tools and Methodologies
DescriptionPower efficiency is a critical design objective in modern CPU design. Architects need a fast yet accurate architecture-level power evaluation tool to perform early-stage power estimation. However, traditional analytical architecture-level power models are inaccurate, and the recently proposed machine learning (ML)-based architecture-level power models require sufficient training data from known configurations, which is often unrealistic.
In this work, we propose AutoPower targeting fully automated architecture-level power modeling with limited known design configurations. We have two key observations: (1) The clock and SRAM dominate the power consumption of the processor, and (2) The clock and SRAM power correlate with structural information available at the architecture level. Based on these two observations, we propose the power group decoupling in AutoPower. First, AutoPower decouples across power groups to build individual power models for each group. Second, AutoPower designs power models by further decoupling the model into multiple sub-models within each power group.
In our experiments, AutoPower can achieve a low mean absolute percentage error (MAPE) of 4.36% and a high R^2 of 0.96 even with only two known configurations for training. This is 5% lower in MAPE and 0.09 higher in R^2 compared with McPAT-Calib, the representative ML-based power model.
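As background for the reported figures (these are the standard metric definitions, not specific to this paper): for measured power $y_i$, predicted power $\hat{y}_i$, and mean $\bar{y}$ over $n$ evaluated configurations, $\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i-\hat{y}_i}{y_i}\right|$ and $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$, so a lower MAPE and an $R^2$ closer to 1 both indicate a more accurate power model.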
Research Manuscript


Design
DES3: Emerging Models of ComputatioN
DescriptionAs an emerging platform for biochemical experiments, flow-based microfluidic biochips currently suffer from malfunctions caused by manufacturing defects and therefore have low yield. While many studies have been conducted and reliability quantification models have been published, layout optimization methods are still lacking. In this paper, we propose AutoRE, the first automatic tool to enhance reliability by optimizing layouts. AutoRE varies the layout within a certain range without changing its topology and adopts Bayesian optimization (BO) to identify more reliable variants. Experimental results demonstrate that AutoRE efficiently and effectively improves reliability across all testcases by around 40% on average.
Research Manuscript


Security
SEC2: Hardware Security: Primitives & Architecture, Design & Test
DescriptionAs domain-specific accelerators for deep neural network (DNN) inference gain popularity due to their performance and flexibility advantages over general-purpose systems, the security of accelerator data in memory has emerged as a significant concern. However, the overhead associated with standard memory security measures, such as encryption and integrity authentication, presents a major challenge for accelerators, particularly given the high throughput demands of typical DNN applications.
In this work, we present AutoSkewBMT, a security framework that autonomously generates optimized integrity system configurations to enhance the Bonsai Merkle Tree (BMT)-based integrity authentication workflow for DNN accelerators. The framework leverages a novel and efficient design space generation algorithm to optimally skew the BMT for specific workloads. Configurations generated by AutoSkewBMT outperform recent state-of-the-art solutions by up to 32% on general DNN workloads.
Engineering Presentation


IP
DescriptionBinary polynomial multipliers significantly influence the performance and cost efficiency of elliptic curve cryptography (ECC) systems. ECC hardware commonly uses multiplication algorithms with sub-quadratic complexity to minimize area usage and enhance speed. This research presents a new scalar point multiplication (SPM) processor for elliptic curves that uses a family of overlap-free multipliers well suited to Internet of Things (IoT) applications. We design these multipliers to reduce partial products and employ overlap-free reconstruction methods, resulting in improved computational recurrence and enhanced efficiency.
The designed ECC multiplier can be vulnerable to power side-channel leakage analysis if the RTL code has not been validated by thorough security analysis. Through an automated power side-channel leakage verification and root-causing flow, we demonstrate how to find the time and the RTL gate with side-channel leakage in an unprotected ECC design. This flow can help ECC designers assess the most secure implementation and fix any leaking gate at the early RTL design stage.
Networking
Work-in-Progress Poster


DescriptionThe rising popularity of Large Language Models (LLMs) has intensified the demand for efficient inference acceleration. While GPUs and NPUs are adept at handling General Matrix-Matrix (GEMM) operations, the memory-intensive tasks inherent in LLMs are better suited to Processing-In-Memory (PIM) architectures. However, integrating PIM into heterogeneous systems presents challenges, particularly in enabling concurrent PIM and standard memory operations, which can lead to significant bottlenecks and underutilization of PIM resources.
In this paper, we propose a novel PIM architecture that addresses these issues through two key innovations: 1) Bank-Split Architecture: Segregates memory banks and assigns independent I/O buffers to each, enabling the simultaneous execution of PIM and normal memory operations by decoupling their access patterns. 2) Partial Batch Offloading: Duplicates weight data to alternate I/O buffers during GEMM operations on Processing Units (e.g., GPUs or NPUs), enabling independent partial batch processing and significantly enhancing PIM utilization.
Experimental results demonstrate that our architecture achieves up to a 7.31× speedup compared to the NPU baseline, an average 20.3% performance improvement over the latest heterogeneous system PIM, NeuPIMs, and an overall 21% increase in PIM utilization.
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionLarge language models (LLMs), with their billions of parameters, pose substantial challenges for deployment on edge devices, straining both memory capacity and computational resources. Block floating-point (BFP) quantisation reduces memory and computational overhead by converting high-overhead floating-point operations into low-bit fixed-point operations. However, BFP requires aligning all data to the maximum exponent, which causes loss of small and moderate values, resulting in quantisation error and degradation in the accuracy of LLMs. To address this issue, we propose a Bidirectional Block Floating-Point (BBFP) data format, which reduces the probability of selecting the maximum as the shared exponent, thereby reducing quantisation error. By utilizing the features in BBFP, we present a full-stack Bidirectional Block Floating Point-Based Quantisation Accelerator for LLMs (BBAL), primarily comprising a PE array based on BBFP paired with our proposed cost-effective nonlinear computation unit. Experimental results show BBAL achieves a 22% improvement in accuracy compared to an outlier-aware accelerator at similar efficiency, and a 40% efficiency improvement over a vanilla BFP-based accelerator at similar accuracy.
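For readers unfamiliar with the underlying problem, here is a minimal, self-contained sketch of conventional block floating-point quantisation (my own toy illustration, not BBAL's BBFP implementation): aligning a block to the exponent of its largest value flushes small and moderate values toward zero, which is exactly the error source BBFP aims to reduce.

import numpy as np

def bfp_quantize(block, mantissa_bits=4):
    """Toy conventional BFP: every value shares the exponent of the largest
    magnitude in the block and keeps only a few mantissa bits.
    Illustrative sketch only; not the BBFP format proposed in the paper."""
    shared_exp = int(np.floor(np.log2(np.max(np.abs(block)))))
    step = 2.0 ** (shared_exp - (mantissa_bits - 1))  # fixed-point grid implied by the shared exponent
    return np.round(block / step) * step

block = np.array([0.011, -0.034, 0.052, 6.8])  # one large value dominates the block
print(bfp_quantize(block))                     # small/moderate values collapse to (+/-)0; only the outlier survives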
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionA bird's-eye-view (BEV) semantic segmentation accelerator (BEVSA) is proposed for real-time 3D space perception in multi-camera system (MCS). The view transformation from multi-camera-view to BEV obstructs the real-time operation on edge devices through 68.3 ms of time consumption during BEV pooling, which requires sorting and irregular memory access over wide searching space. Moreover, the 69.1% high average input activation sparsity of the segmentation process for the transformed BEV features results in excessive meaningless computations. For the real-time implementation of BEV semantic segmentation on edge platforms, this paper proposes two key features: 1) Block-decomposed hierarchical BEV pooling cluster that partitions the searching space in an MCS-suitable way and supports parallel pooling, achieving 43.2× speedup for BEV pooling over the edge computing platform; 2) Coarse-to-fine-grained zero skipping convolution cluster, which conducts coarse-grained zero skipping for all-zero channels and tile-wise fine-grained zero skipping, improving the convolution throughput by 1.61×. Implemented with 28nm technology, and evaluated on two representative MCS datasets, BEVSA finally achieves 23.1 frames-per-second of real-time BEV segmentation throughput with 167.4× higher energy-per-frame over the edge computing platform.
TechTalk


DescriptionSemiconductor innovation is at a critical juncture, demanding next-generational methods to overcome rising complexities, shorter design cycles, and intense competitive pressures. Traditional EDA tools are constrained by manual processes and limited intelligence, but what if we could transcend these limitations?
Enter AI Agents: AI solutions that leverage large language models and advanced algorithms to continuously improve themselves. In this talk, Prof. William Wang, Founder & CEO of ChipAgents, will introduce how AI agents go beyond traditional EDA automation, embedding agentic intelligence capable of independently handling hardware modeling, constraint solving, automated debugging, testbench generation, and even proactive design optimization. Highlights include Use Cases, Scalability & Reliability: case studies illustrating substantial productivity improvements, enhanced design quality, and accelerated time-to-market achieved by leading semiconductor enterprises deploying AI Agents; and AI Agents in Action: real-world scenarios demonstrating how AI agents autonomously identify critical bugs, optimize RTL designs, and significantly shorten verification cycles.
Exhibitor Forum


DescriptionVerification is increasingly becoming the defining constraint in semiconductor design, as complexity surges across software-defined architectures, massive 3D IC and chiplet-based designs, and exploding security and safety-critical requirements. Traditional approaches—relying on large regression suites, manual coverage analysis, and isolated debug—are struggling to keep up. This session explores how scalable, intelligent verification strategies are addressing these challenges through connected workflows, AI-enhanced automation, and data-driven insights. We’ll discuss how to shift from reactive debugging to proactive verification planning, and how to improve engineering throughput without scaling teams or compute linearly. Real-world examples will illustrate how teams are reducing debug effort, accelerating coverage closure, and unlocking new levels of productivity. Attendees will leave with practical ideas for building smarter verification flows that are engineered for modern complexity—not just more speed, but better focus resulting in improved productivity.
Research Panel


Systems
DescriptionAs computing systems push toward ever-increasing performance, thermal management becomes a fundamental bottleneck. Cooling is no longer just a support function—it is a necessity to ensure that modern and future computing architectures achieve their full potential. The question remains: what is the best approach to keep computing efficient while overcoming thermal constraints? Should we rely on conventional package-level cooling, embed advanced and exotic cooling solutions within the 3D stack itself, or shift our focus toward developing thermally resilient devices that can operate at much higher temperatures? This panel will explore these three primary thermal management strategies. We will bring together experts from semiconductor manufacturing, system architecture, and thermal engineering to debate the merits, feasibility, and trade-offs of each approach. While cryogenic cooling remains an interesting avenue, the discussion will focus on the dominant and most practical cooling methods that can be broadly applied to computing systems today.
Networking
Work-in-Progress Poster


DescriptionLarge Language Models (LLMs) are transforming the programming language landscape by facilitating learning for beginners, enabling code generation, and optimizing documentation workflows. Hardware Description Languages (HDLs), with their smaller user community, stand to benefit significantly from the application of LLMs as tools for learning new HDLs. This paper investigates the challenges and solutions of enabling LLMs for HDLs, particularly for HDLs that LLMs have not been previously trained on.
This work introduces HDLAgent, an AI agent optimized for LLMs with limited knowledge of various HDLs. It significantly enhances off-the-shelf LLMs. For example, PyRTL's success rate improves from zero to 35% with Mixtral 8x7B, and Chisel's success rate increases from zero to 59% with GPT-3.5-turbo-0125. HDLAgent offers an LLM-neutral framework to accelerate the adoption and growth of HDL user bases in the era of LLMs.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionReRAM is a promising non-volatile memory for neuromorphic accelerators, but challenges like high sensing power and accuracy loss persist. BiNeuroRAM, a novel SNN accelerator with ReRAM processing-in-memory (PIM), makes three key contributions: (1) It is the first to support higher-accuracy spike-tracing bipolar-integrate-and-fire (ST-BIF) neurons, achieving 80.9% accuracy on ImageNet, 8.4% higher than prior state-of-the-art. (2) A low-power voltage sense amplifier (LPVSA) reduces ReRAM read power by 14.7 - 58.2×, addressing energy efficiency. (3) The asynchronous micro-architecture in BiNeuroRAM fully leverages the event-driven nature of SNNs. Our experiments demonstrate that BiNeuroRAM improves throughput density and energy efficiency by 2.1× on ImageNet with ResNet-18, compared to traditional integrate-and-fire (IF) neuron-based SNN accelerators.
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionIn this paper, we propose BirdMoE, a load-aware communication compression technique with Bi-random quantization for MoE training. Specifically, BirdMoE employs a lightweight Random Quantization with an expectation-invariance property to efficiently map floating-point intermediate computing results into integers while maintaining MoE training quality. Additionally, BirdMoE utilizes a Mixed-Precision strategy to balance the communication load among expert nodes, significantly improving all-to-all communication efficiency for MoE training systems.
Experiments on typical MoE training tasks demonstrate that BirdMoE achieves 3.98x-10.44x higher total communication compression ratios and 1.18x-5.27x training speedup compared with state-of-the-art compression techniques while maintaining MoE training quality.
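The expectation-invariance property referred to above is the defining feature of stochastic (random) rounding; the short sketch below (my own illustration with assumed parameters, not BirdMoE's actual Bi-random quantizer) shows a quantizer whose dequantized output equals the input in expectation.

import numpy as np

def random_quantize(x, scale, rng):
    """Stochastically round x/scale so that E[q * scale] == x (expectation invariance).
    Illustrative only; BirdMoE's Bi-random scheme and load balancing are more involved."""
    y = x / scale
    low = np.floor(y)
    # Round up with probability equal to the fractional part -> unbiased quantization.
    q = low + (rng.random(y.shape) < (y - low))
    return q.astype(np.int32)

rng = np.random.default_rng(0)
x = rng.normal(size=100_000).astype(np.float32)
q = random_quantize(x, scale=0.05, rng=rng)
print(x.mean(), (q * 0.05).mean())  # the dequantized mean closely tracks the original mean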
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionBit-serial computation shows promise for accelerating deep neural networks (DNNs) by exploiting inherent bit sparsity. However, the original unstructured bit sparsity poses two major challenges for existing bit-serial accelerators (BSA): (1) workload imbalance from irregular bit distribution, and (2) inefficient memory access due to unpredictable non-zero bit locations.
To address these issues, this paper proposes BitPattern, an algorithm/hardware co-design to efficiently accelerate bit-serial computation through bit-pattern pruning.
At the algorithm level, we employ bit-pattern pruning to identify optimal combinations of predefined patterns and apply compression encoding to minimize weight storage. We further devise a pattern-similarity-based merging method to balance the bit-serial workload. At the hardware level, we co-design a bit-serial accelerator with a dedicated bit-pattern decoder and PE to leverage the potential of structured bit-pattern sparsity. The evaluation on several deep learning benchmarks shows that BitPattern can achieve $1.72\times$ memory reduction with negligible accuracy loss, and up to $2.11 \times$ speedup and $1.86 \times$ energy saving compared to state-of-the-art bit-serial accelerators.
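To illustrate why bit sparsity matters in this setting (a toy sketch under simplified assumptions, not the BitPattern accelerator itself): in bit-serial arithmetic every set bit of a weight costs one shift-and-add of the activation, so zero bits can be skipped, and irregular bit positions are what cause workload imbalance across lanes.

def bit_serial_dot(weights, activations, bits=8):
    """Toy bit-serial dot product: each non-zero weight bit triggers one
    shifted add of the activation; zero bits are skipped entirely.
    Illustrative only -- not BitPattern's pattern-pruned hardware."""
    acc, shift_adds = 0, 0
    for w, a in zip(weights, activations):
        for b in range(bits):
            if (w >> b) & 1:      # only set bits do useful work
                acc += a << b
                shift_adds += 1
    return acc, shift_adds

weights = [0b01010001, 0b00000011, 0b10000000]  # 81, 3, 128
activations = [3, 5, 2]
acc, ops = bit_serial_dot(weights, activations)
print(acc, ops)  # acc == 81*3 + 3*5 + 128*2 == 514, using only 6 shift-adds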
Networking
Work-in-Progress Poster


DescriptionDatacenters and enterprise servers demand high-performance, customized SSDs optimized for specific I/O patterns to meet stringent Quality of Service requirements. We present an automated tuning framework leveraging Bayesian Optimization to effectively optimize firmware parameters for such customized SSDs. Our method was validated across multiple mass-produced SSD products, satisfying 94% of the targeted performance metrics and achieving an average latency reduction of 30.43 times compared to manual tuning. Furthermore, our distributed optimization system reduces tuning time from 19 days to 1.3 days by performing parallel evaluations, significantly enhancing development efficiency. This study is the first to apply Bayesian Optimization to SSD firmware optimization.
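For readers unfamiliar with the approach, the sketch below shows the general shape of such a tuning loop using scikit-optimize's gp_minimize (this library choice, the parameter names, and the objective are all my own assumptions for illustration; the paper's framework, firmware parameters, and measurement setup are not public).

from skopt import gp_minimize
from skopt.space import Integer

def measure_tail_latency(params):
    """Hypothetical stand-in for the real evaluation step: flash the firmware
    parameters onto a drive, replay the customer I/O pattern, and return tail
    latency (lower is better). Replaced here by a toy analytic surrogate."""
    queue_depth_limit, gc_threshold = params
    return (queue_depth_limit - 48) ** 2 + 0.5 * (gc_threshold - 70) ** 2

search_space = [
    Integer(1, 128, name="queue_depth_limit"),  # assumed tunable knob
    Integer(10, 90, name="gc_threshold"),       # assumed tunable knob
]

result = gp_minimize(measure_tail_latency, search_space, n_calls=30, random_state=0)
print("best parameters:", result.x, "estimated latency:", result.fun)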
Networking
Work-in-Progress Poster


DescriptionState-of-the-art quantum circuit optimization (QCO) algorithms for T-count reduction often cause a significant increase in two-qubit gate count (2Q-count), a problem that current 2Q-count optimization techniques struggle to mitigate. We present Blaqsmith, a novel two-stage QCO flow that effectively counteracts the 2Q-gate surges in T-count-optimized Clifford+T circuits, achieving a 38.4% reduction in ancilla-free scenarios and 25.3% with ancillae. Furthermore, Blaqsmith scales to larger circuits by incorporating a perturbation-based heuristic that reduces runtime and memory usage. These advancements improve both the quality and scalability of Clifford+T circuit compilation.
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionBalanced hypergraph partitioning is a fundamental problem in applications like VLSI design, high-performance computing, etc. Nowadays, large-scale hypergraphs become more common due to the increasing complexity of modern systems. Thus, fast and high-quality deterministic partitioning algorithms are largely in demand. Regarding the quality of partitioning, balance is a critical concern when the number of partitions increases. In this paper, we propose BlasPart, a deterministic parallel algorithm for balanced large-scale hypergraph partitioning. BlasPart leverages a recursive multilevel bisection framework to achieve high-quality partitions while ensuring deterministic outcomes. A level-dependent balance constraint is also proposed to further improve the efficiency and effectiveness of the proposed partitioner. Extensive experiments, with comparisons to the state-of-the-art partitioners (hMETIS, BiPart, and Mt-KaHyPar-SDet), demonstrate that BlasPart achieves better balance and scalability while maintaining competitive partitioning quality and efficiency. BlasPart runs 3.33× faster than Mt-KaHyPar-SDet on average for a 4096-way partitioning task on six benchmarks.
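For context, the standard formulation from the hypergraph-partitioning literature (background only; the paper's level-dependent constraint refines it): a $k$-way partition $\{V_1,\dots,V_k\}$ of a hypergraph with vertex weights $c(\cdot)$ is $\varepsilon$-balanced if $c(V_i) \le (1+\varepsilon)\left\lceil \frac{c(V)}{k} \right\rceil$ for every block $V_i$, and the partitioner minimizes the cut metric subject to this constraint. Meeting the constraint becomes harder as $k$ grows, which is the regime BlasPart targets.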
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionThe attention mechanism is a key component in neural networks, essential for retrieving relevant information in Natural Language Processing (NLP). However, its high computational complexity and substantial power consumption limit the deployment of attention-based models. To overcome these issues, we introduce Blaze, an efficient attention architecture that utilizes both value- and bit-level sparsity with workload orchestration optimization. Our Approximate-Computing-Based (ACB) mechanism addresses workload imbalance in bit-sparse architectures, while the Leading-Booth mechanism further enhances the performance of attention computations. We also design a reconfigurable computing engine to support these innovations, improving performance in attention inference tasks.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionProcessing-In-Memory (PIM) architectures alleviate the memory bottleneck in the decode phase of large language model (LLM) inference by performing operations like GEMV and Softmax in memory. However, the fragmented data layout in current PIM architectures limits end-to-end acceleration for long-context LLMs. In this paper, we propose BlockPIM, a cross-channel block memory layout strategy that maximizes memory utilization and eliminates the context length constraint. Additionally, we introduce a cross-channel attention computation scheme that is compatible with the current architecture to support distributed attention operations on BlockPIM. Experimental results demonstrate that our approach achieves a 62% average throughput increase compared to existing state-of-the-art PIM solutions, enabling efficient and scalable deployment of large language models on PIM architectures.
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionDeep neural networks (DNNs) have revolutionized numerous AI applications, but their vast model sizes and limited hardware resources present significant deployment challenges. Model quantization offers a promising solution to bridge the gap between DNN size and hardware capacity. While INT8 quantization has been widely used, recent research has pushed for even lower precision, such as INT4. However, the presence of outliers—values with unusually large magnitudes—limits the effectiveness of current quantization techniques. Previous compression-based acceleration methods that incorporate outlier-aware encoding introduce complex logic. A critical issue we have identified is that serialization and deserialization dominate the encoding/decoding time in these compression workflows, leading to substantial performance penalties during workflow execution. To address this challenge, we introduce a novel computing approach and a compatible architecture design named ``BLOOM''. BLOOM leverages the strengths of the ``bit-slicing'' method, effectively combining structured mixed-precision and bit-level sparsity with adaptive dataflow techniques. The key insight of BLOOM is that outliers require higher precision, while normal values can be processed at lower precision. By interleaving 4-bit values, we efficiently exploit the inherent sparsity in the high-precision components. As a result, the BLOOM-based accelerator outperforms the existing outlier-aware accelerators by an average $1.2 \sim 4.0\times$ speedup and $24.6\% \sim 71.3\%$ energy reduction, respectively, without model accuracy loss.
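To make the bit-slicing idea concrete (an illustrative sketch on made-up data, not BLOOM's actual dataflow or encoding): splitting 8-bit magnitudes into a low and a high 4-bit slice leaves the high slice populated only by outliers, so most values can be processed at 4-bit precision while the sparse high slice covers the rest.

import numpy as np

rng = np.random.default_rng(0)
# Mostly small INT8 weight magnitudes with a handful of large outliers (toy data).
w = rng.integers(-8, 8, size=1024)
w[rng.choice(w.size, size=16, replace=False)] = rng.integers(64, 127, size=16)

mag = np.abs(w).astype(np.uint8)
low_slice = mag & 0x0F           # low 4 bits: needed for every value
high_slice = (mag >> 4) & 0x0F   # high 4 bits: non-zero only for the outliers

print("values needing the high slice:", int(np.count_nonzero(high_slice)), "of", w.size)
# The high slice is extremely sparse, which is the structured, bit-level
# sparsity an outlier-aware, bit-sliced design can exploit.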
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
DescriptionBoolean symbolic reasoning for gate-level netlists is a critical step in fields such as verification, logic and datapath synthesis, and hardware security. In particular, reasoning about datapaths and carry-chains in bit-blasted Boolean networks is crucial for verification and synthesis, and challenging. Conventional approaches either fail to exactly identify the carry-chain of designs in gate-level netlists when using structural hashing, or fail due to runtime complexity when using functional or formal methods. This paper introduces BoolE, an exact symbolic reasoning framework for Boolean netlists using equality saturation (e-graphs). BoolE optimizes scalability and performance through domain-specific ruleset pruning in e-graph rewriting, and incorporates a novel extraction algorithm to efficiently identify and capture multi-input multi-output high-level structures (e.g., full adders) in the constructed e-graph, enhancing its structural insight and computational efficiency.
Our experiments show that BoolE surpasses state-of-the-art symbolic reasoning baselines, including a structural/functional approach (ABC) and a machine learning-based method (Gamora). Specifically, we evaluated its performance on various multiplier architectures with different configurations. Our results show that BoolE identifies 3.53x and 3.01x more exact full adders than ABC in carry-save array (CSA) and Booth-encoded multipliers, respectively. Additionally, we integrated BoolE into multiplier formal verification tasks, where it significantly accelerates traditional formal verification tools based on computer algebra.
Networking
Work-in-Progress Poster


DescriptionUngrouping is a key step in design implementation, tasked with dissolving selected hierarchical boundaries in order to unlock more optimization opportunities and logic sharing. Determining the right amount of ungrouping is an important and challenging problem that requires new and innovative technologies in logic synthesis. On the one hand, ungrouping too much may impact the verifiability of the implemented netlist, reduce the scalability of synthesis, and degrade the Quality of Results (QoR). On the other hand, not ungrouping enough may fail to catch many optimization opportunities and degrade QoR even further. Existing solutions are mainly guided by comparing the size of the child and parent hierarchies to be ungrouped.
In this paper, we propose a novel ungrouping flow based on Boolean reasoning. We guide ungrouping with formal reasoning engines, such as Boolean SATisfiability (SAT), considering logic sharing, connectivity, and Boolean optimization opportunities identified across the design. We show how this flow unlocks more QoR, keeps verification complexity contained, and preserves the most advantageous hierarchical boundaries for synthesis. We integrate our novel ungrouping flow within an industrial synthesis tool, showing significant QoR improvement over state-of-the-art solutions.
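As a flavour of what guiding such decisions with a SAT engine can look like (a toy sketch using the Z3 solver under my own assumptions; it is not the paper's industrial engine or flow): if a logic cone inside a child hierarchy is functionally equivalent to a cone in the parent, the miter of the two is unsatisfiable, flagging a sharing opportunity that dissolving the boundary could expose.

from z3 import Bools, Solver, And, Or, Xor, unsat

a, b, c = Bools("a b c")

child_out = Or(And(a, b), And(a, c))   # cone inside a child hierarchy
parent_out = And(a, Or(b, c))          # cone in the parent, written differently

miter = Solver()
miter.add(Xor(child_out, parent_out))  # satisfiable iff the two cones ever differ

if miter.check() == unsat:
    print("equivalent logic across the boundary -> candidate for ungrouping and sharing")
else:
    print("cones differ -> the boundary does not hide this sharing opportunity")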
Engineering Poster
Networking


DescriptionThe increasing complexity of System-on-Chip (SoC) designs over the years has necessitated advanced verification techniques to detect hidden functional failures. This challenge becomes even tougher when low-power elements such as isolation cells and level shifters are present in the design.
The low-power integrated circuit market is experiencing significant growth, driven by demand for portable and battery-powered chips. In such applications, the low-power aspect is an essential component of the design cycle. Consequently, great care must be taken during the verification phase to guarantee the correct functionality of the low-power architecture. For example, a bug in the isolation protocol could potentially cause a complete system failure.
This paper proposes a methodology employing a formal approach to verify potential criticalities such as supply-network issues, isolation, and retention in low-power chips. The proposed strategy was applied to an ultra-low-power SoC (250k equivalent gates) based on an ARM Cortex-M0+ and featuring a compound power structure with five switchable domains, designed by ST. The verification time is dramatically reduced (~50%) compared to a dynamic approach, while also improving verification quality. During this work, a risky bug that was extremely difficult to reproduce in a simulation environment was identified early in the flow.
Research Manuscript


Security
SEC2: Hardware Security: Primitives & Architecture, Design & Test
DescriptionThis paper presents BPUFuzzer, a fuzz testing tool for detecting branching transient execution vulnerabilities in CPU RTL designs. BPUFuzzer addresses two key challenges: generating testcases that capture complex control flows, and extracting essential data from vast hardware states to guide testcase selection. Utilizing a control-flow-graph-based testcase generation strategy with anomaly detection, and employing fitness and coverage metrics, BPUFuzzer produces testcases that cover broader program flows and deliberately selects testcases to discover transient execution vulnerabilities effectively. When applied to RISC-V Boom v3, BPUFuzzer uncovered more Spectre variants than state-of-the-art tools, including a previously unidentified variant, named Spectre-LOOP.
DAC Pavilion Panel


DescriptionSilicon design is complex and expensive, with high penalties for iterations. From start to finish, chip implementation involves teams of hundreds to thousands of engineers, geographically distributed across multiple time zones, and spans multiple years of development time. A typical design process goes through hundreds of tools in the front end (from concept to tape-out) and an equally excruciating process in the back end (from silicon to product release). Silicon development costs have skyrocketed with shrinking geometries, from below $100 million to above $500 million.
The complexity and efficiency of Silicon design process has of course been constantly studied. Take for example this survey conducted by Siemens & Wilson Research, which presents a study on Functional Verification. Functional verification has been a key focus for the industry - EDA leaders such as Synopsys, Cadence and Siemens have produced state of the art verification tools.
More recently, Nvidia has demonstrated that use of GenAI can bring significant efficiencies to the Silicon design process.
While there is heavy focus on productivity improvements in specific areas of Silicon Design, the question that remains is what can be done to make the entire design process more efficient. Needless to say, improving efficiency has a significant impact on the industry as a whole - it can reduce the time to market and enable the industry to produce more silicon, faster and cheaper… and this can be a winner for the whole tech industry as Silicon is at the foundation of all technology driving the AI revolution!
Silicon & Hardware Systems designs have significant similarities in process and efficiency challenges. Most people find the HW Systems design to be an extension of the process used in Silicon design making solutions for efficiency mutually beneficial.
Research Manuscript


Design
DES4: Digital and Analog Circuits
DescriptionThe increasing demand to maximize PPA gains with backside metal layers has driven their function beyond power delivery. This study introduces a novel BS-PDN-last flow, crucial for leveraging multifunctional backside metal layers, by deferring power routing to the post-signal routing stage. This approach addresses the IR-drop and performance trade-offs inherent in conventional PDN-first flows. Experimental results show that the BS-PDN-last flow achieves a 90% reduction in Total Negative Slack and a 12% performance gain with BS-CDN. Additionally, our work presents the first comprehensive comparison of FS-PDN, BS-PDN, and multifunctional backside designs, evaluated on both physical design and workload metrics leveraging accurate vector-based analysis.
Engineering Poster
Networking


DescriptionThe ASTC VLAB version 2 simulator/virtual platform tool was designed in the late 2000s. Reflecting the requirements of the time, it is built on a SystemC simulator kernel together with fast instruction-set simulators. As the complexity and processor core count of the hardware being modeled in VLAB increased over the years, performance became an issue. To unlock performance for the future, it was necessary to move away from SystemC as the kernel.
A new simulator kernel, Hipersim, was created with the explicit goal of removing performance bottlenecks. It improves serial simulation performance and opens the door to parallel simulation (by removing the serial semantics imposed by SystemC). Along the way, the rest of the product was also redesigned to support performance and to clean out some old features. Still, the product had to remain compatible with existing ecosystem models and integrations.
This talk presents our experience implementing the new kernel and how it handles SystemC models and new models. We share some performance measurements and observations from the implementation process and porting of old models to the new kernel.
Submission category: SAS.04 Simulation and modeling environments
DAC Pavilion Panel


DescriptionThe importance of proactively securing semiconductor chips during the design phase has grown significantly over the past few years, driven by the rapidly increasing number of discovered chip security vulnerabilities, emerging industry standards, and new regulations and laws, among other factors. As a result, most would agree on the importance of a robust hardware security program and that security signoff should become another key checkbox before tape-out. Yet, translating this objective into reality is challenging due to tight tape-out schedules, lack of broad security knowledge, limited engineering resources, and the need for new cross-organizational coordination.
This panel of leading hardware security practitioners will discuss the various challenges of securing chips and share how to overcome them in practice, all while staying on schedule, within budget, and boosting competitiveness.
Exhibitor Forum


DescriptionThe global semiconductor sector is witnessing a surge in demand for proficient individuals with expertise in VLSI design and fabrication. Skill universities significantly contribute to bridging the divide between academia and industry by integrating advanced EDA tools into their curricula. This forum will showcase best practices, success narratives, and collaborative opportunities for EDA vendors and technical universities. Participants will discover how these tools can foster innovation, enhance student employability, and facilitate academic research that aligns with industry trends.
Exhibitor Forum


DescriptionThe semiconductor industry is on the cusp of an AI revolution, yet significant barriers persist, particularly around data provenance and liability. While Generative AI (GenAI) technologies have transformed software development, their adoption in semiconductor design remains hindered by the industry's unique challenges—such as highly protected intellectual property (IP) and the extraordinary costs associated with errors. Unlike the software domain, where open-source practices are common, the hardware space demands a more secure and traceable approach.
This presentation will explore the emerging role of GenAI in semiconductor and embedded systems design, focusing on the critical issues of data versioning, provenance, and traceability in training internal AI models. These factors are essential to ensure model reproducibility, reliability, and accountability, as well as to mitigate risk and foster trust in AI adoption. We will explore concerns around "data contamination" when using external or purchased IP as well as liability concerns over whether such data can lawfully and ethically be used to train AI.
Additionally, we will introduce IP Lifecycle Management (IPLM) as a robust framework for managing and tracking the IPs used in AI training. By leveraging an IPLM platform, organizations can establish a secure, compliant, and controlled approach to training AI models, paving the way for innovative applications in semiconductor design.
Research Manuscript


Security
SEC1: AI/ML Security/Privacy
DescriptionData-Free Knowledge Distillation (DFKD) enables the knowledge transfer from the given pre-trained teacher network to the target student model without access to the real training data.
Existing DFKD methods primarily focus on improving image recognition performance on associated datasets, often neglecting the crucial aspect of the transferability of learned representations.
In this paper, we propose Category-Aware Embedding Data-Free Knowledge Distillation (CAE-DFKD), which addresses, at the embedding level, the limitations of previous methods that rely on image-level techniques to improve model generalization but fail when directly applied to DFKD. The superiority and flexibility of CAE-DFKD are extensively evaluated, including:
1. Significant efficiency advantages resulting from altering the generator training paradigm;
2. Competitive performance with existing DFKD state-of-the-art methods on image recognition tasks;
3. Remarkable transferability of data-free learned representations demonstrated in downstream tasks.
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionWith the development of DNN applications, multi-tenant execution on a single SoC is becoming a prevailing trend. Although methods have been proposed to improve multi-tenant performance, the impact of shared cache is not well studied. This paper proposes CaMDN, an architecture-scheduling co-design to enhance cache efficiency. Specifically, a lightweight architecture is proposed to support NPU-controlled regions inside the shared cache to eliminate unexpected cache contention. A cache scheduling method is proposed to improve shared cache utilization, including cache-aware mapping and dynamic cache allocation. CaMDN reduces memory access by 33.4% and achieves a model speedup of up to 2.56x (1.88x on average).
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
DescriptionHyperdimensional computing (HDC) based GNNs are significantly advancing brain-like cognition in terms of mathematical rigor and computational tractability. However, research in this field seems to have reached a "long vector consensus": the length of HDC hypervectors must mimic that of the cerebellar cortex, i.e., tens of thousands of bits, to express humans' feature-rich memory. To system architects, this choice presents a formidable challenge: the combination of numerous nodes and ultra-long hypervectors can create a new memory bottleneck that undermines the operational brevity of HDC. To overcome this problem, in this work we shift our focus to rebuilding a set of more GNN-friendly HDC operations, with which short hypervectors are sufficient to encode rich features by leveraging the strong error tolerance of neural cognition. To achieve this, three behavioral incompatibilities of HDC with general GNNs, i.e., feature distortion, structural bias, and central-node vacancy, are identified and resolved for more efficient feature extraction in graphs. Taken as a whole, a memory-efficient HDC-based GNN framework, called CiliaGraph, is designed to drive one-shot graph classification tasks with only hundreds of bits in hypervector aggregation, offering 1 to 2 orders of magnitude in memory savings. The results show that, compared to SOTA GNNs, CiliaGraph reduces memory access and training latency by an average of 292× (up to 2341×) and 103× (up to 313×), respectively, while maintaining competitive accuracy.
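For readers unfamiliar with HDC's core operations, the following is a minimal, hypothetical sketch (not CiliaGraph's actual operators) of encoding a graph node's neighborhood with short bipolar hypervectors using standard bind and bundle operations. The dimension D, the toy graph, and the random encoding are illustrative assumptions.

```python
# Minimal sketch, assuming a short-hypervector HDC encoding of graph neighborhoods.
import numpy as np

D = 256  # short hypervector length, far below the usual ~10,000 bits
rng = np.random.default_rng(0)

def random_hv():
    """Random bipolar hypervector in {-1, +1}^D."""
    return rng.choice([-1, 1], size=D)

def bind(a, b):
    """Binding: element-wise multiplication associates two hypervectors."""
    return a * b

def bundle(hvs):
    """Bundling: element-wise majority (sign of the sum) superimposes hypervectors."""
    return np.sign(np.sum(hvs, axis=0))

# Hypothetical tiny graph: per-node feature hypervectors and an adjacency list.
node_hv = {n: random_hv() for n in range(5)}
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1, 4], 4: [3]}
role = random_hv()  # distinguishes "neighbor" information from the node itself

def encode_node(n):
    """One aggregation step: bundle a node with its role-bound neighbors."""
    neighbor_hvs = [bind(role, node_hv[m]) for m in adj[n]]
    return bundle([node_hv[n]] + neighbor_hvs)

graph_hv = bundle([encode_node(n) for n in adj])  # one-shot graph readout
print(graph_hv[:8])
```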
Networking
Work-in-Progress Poster


DescriptionWith increasingly complex manufacturing processes, the carbon footprint of the semiconductor industry has become an important consideration in integrated circuit (IC) design. Therefore, it is essential to develop models for quantifying the total carbon footprint associated with an IC throughout its entire lifetime. Total carbon footprint accounts for both embodied carbon from manufacturing and operational carbon from day-to-day use. Although tools for architectural-level analysis of carbon footprint are actively being developed, carbon modeling tools are not yet well integrated with circuit-level Electronic Design Automation tools, which typically optimize many design parameters in an iterative fashion. In this work, we present CarbonEDA, an open-source tool enabling closed-loop Carbon-aware Electronic Design Automation for integrated circuits, in which designers can evaluate trade-offs in energy efficiency and total carbon footprint at circuit design stages. CarbonEDA interfaces directly with industry standard formats for representing layouts, wafer cross-sections, design rules, etc., and enables quantification of carbon footprint in approximately 10 seconds per million polygons in an IC layout. Leveraging CarbonEDA, existing IC design infrastructure can be enhanced to perform operational and embodied carbon footprint analysis for various process technologies – from today's silicon CMOS to emerging transistor technologies, new memories, and 3D integration techniques (e.g., chiplets, interposers, monolithic 3D integration). CarbonEDA enables predictive analysis of carbon footprint during early design stages, which is essential for guiding future directions in carbon-aware IC design.
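To make the "total carbon footprint = embodied + operational" accounting concrete, here is an illustrative sketch in the spirit of carbon-aware design tools such as CarbonEDA; it is not the tool's actual model or API, and every constant below is a hypothetical placeholder.

```python
# Hedged sketch of a total-carbon estimate: embodied (manufacturing) plus operational (use-phase).
def embodied_carbon(die_area_mm2, cpa_kgco2_per_mm2):
    """Embodied carbon from manufacturing: die area times carbon-per-area (CPA)."""
    return die_area_mm2 * cpa_kgco2_per_mm2

def operational_carbon(avg_power_w, lifetime_hours, grid_kgco2_per_kwh):
    """Operational carbon: lifetime energy use times grid carbon intensity."""
    energy_kwh = avg_power_w * lifetime_hours / 1000.0
    return energy_kwh * grid_kgco2_per_kwh

# Hypothetical design point: 100 mm^2 die, 5 W average power, 3-year lifetime.
embodied = embodied_carbon(die_area_mm2=100, cpa_kgco2_per_mm2=0.3)
operational = operational_carbon(avg_power_w=5, lifetime_hours=3 * 365 * 24,
                                 grid_kgco2_per_kwh=0.4)
total = embodied + operational
print(f"embodied={embodied:.1f} kgCO2, operational={operational:.1f} kgCO2, total={total:.1f} kgCO2")
```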
Networking
Work-in-Progress Poster


DescriptionOver the years, the chip industry has consistently developed high-performance processors to address the increasing demands of general consumers and data centers across diverse applications. However, the rapid expansion of chip production has significantly increased carbon emissions, raising critical concerns about environmental sustainability. While researchers have previously modeled carbon footprint (CFP) at the system and processor levels, a holistic analysis of sustainability trends encompassing the entire chip lifecycle remains lacking. This paper presents CarbonSet, a comprehensive dataset that integrates sustainability and performance metrics for CPUs and GPUs over the past decade. CarbonSet serves as a benchmark to guide the design of next-generation processors. Leveraging this dataset, we conduct a detailed analysis of sustainability trends in flagship processors from the last ten years. Furthermore, CarbonSet provides insights into the lifecycle phases that contribute most to the CFP of these processors and examines the impact of the Artificial Intelligence (AI) boom on the computing industry's carbon footprint. This paper shows that modern processor architectures are not yet sustainably designed, with total carbon emissions increasing more than 50x in the past three years due to the surging demand driven by the AI boom. Power efficiency remains a significant architectural challenge, while advanced semiconductor processes place additional demands on designing for sustainable computing. Moreover, consumers need to maximize processor utilization to amortize the embodied carbon better.
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
DescriptionCustom accelerators enhance System-on-Chips' performance through hardware specialization. High-level synthesis (HLS) can automatically synthesize accelerators for given kernels but requires manual selection and extraction of kernels from applications. This paper proposes Cayman, the first end-to-end framework to synthesize high-performance custom accelerators with both control flow and data access optimization. Cayman automatically selects kernels for hardware acceleration based on a hierarchical program representation, which captures kernel candidates with general control flows. Besides, Cayman optimizes accelerators with specialized processor-accelerator interfaces for data access acceleration. Cayman further introduces a novel accelerator merging mechanism to synthesize reusable accelerators. Experiments on various benchmarks demonstrate that Cayman outperforms two state-of-the-art frameworks by 8.0× and 14.4×.
Networking
Work-in-Progress Poster


DescriptionEvery day, on average, 8 cybercrimes targeting IoT networks occur, leading to a cumulative loss of $10 million. The main reason for these attacks is the ability of unauthorized devices to gain access to IoT networks by replicating the hardware and software configurations of authorized devices. To tackle this pressing issue, cryptographic keys are used to authenticate devices in IoT networks. Given the extensive computational requirements of this process, authentication is a one-time step performed at the beginning. However, this makes devices susceptible to cyber-attacks like spoofing, Sybil attacks, DDoS, and Advanced Persistent Threats (APT). To address this, we propose a novel Continuous Device-to-Device Authentication (CD2A) framework based on two components: 1) Identity Establishment, and 2) Continuous Authentication. In the Identity Establishment phase, we use manufacturing imperfections to model unique dynamic device behaviours. A novel device fingerprint algorithm is proposed that uses crystal oscillator impurities in the central processing unit (CPU) and graphics processing unit (GPU) cores. In the Continuous Authentication phase, we implement a dynamic timeline under which a device is continuously re-authenticated at regular intervals using machine learning techniques. To protect CD2A from cyber-attacks like spoofing, Sybil attacks, DDoS, and APT, we track device legitimacy by calculating the Device Authentication Score (DAS) and the Device Risk Factor (DRF) in view of varying security risks. We evaluate the CD2A framework on an IoT system with 11 devices. The CD2A framework achieves an average authentication accuracy of 99.96% and 99.85% when used in tandem with the CatBoost and XGBoost machine learning algorithms, respectively.
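As a rough illustration of the two phases, the sketch below enrolls a device with a timing-based fingerprint and then updates a running authentication score at regular intervals. The feature, thresholds, and scoring rule are assumptions for illustration only, not CD2A's actual fingerprint or DAS computation.

```python
# Hedged sketch: a clock-skew-style timing fingerprint plus a running
# Device Authentication Score (DAS); all parameters are illustrative.
import time
import statistics

def clock_skew_sample(busy_loop_iters=200_000):
    """Crude timing fingerprint: wall-clock time of a fixed CPU-bound loop.
    Oscillator and microarchitectural differences shift this per device."""
    start = time.perf_counter()
    x = 0
    for i in range(busy_loop_iters):
        x += i * i
    return time.perf_counter() - start

def enroll(n_samples=10):
    """Identity establishment: record the device's baseline timing profile."""
    samples = [clock_skew_sample() for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)

def authenticate_interval(baseline, das, alpha=0.8):
    """Continuous authentication: compare a fresh sample against the baseline
    and update DAS as an exponential moving average of the match score."""
    mean, std = baseline
    sample = clock_skew_sample()
    match = 1.0 if abs(sample - mean) <= 3 * std else 0.0
    return alpha * das + (1 - alpha) * match

baseline = enroll()
das = 1.0
for _ in range(5):               # one check per interval of the dynamic timeline
    das = authenticate_interval(baseline, das)
print(f"Device Authentication Score: {das:.2f}")
```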
Engineering Poster
Networking


DescriptionCDC-RDC analysis has evolved into an inevitable stage of RTL quality signoff over the last two decades. Over this period, designs have grown exponentially to SOCs with 2 trillion+ transistors and chiplets with 7+ SOCs. Today, CDC verification has become a multifaceted effort across chips designed for clients, servers, mobile, automotive, memory, AI/ML, FPGA, etc., with focus on cleaning up thousands of clocks and constraints, integrating the SVAs for constraints into the validation environment to check for correctness, looking for power-domain- and DFT-logic-induced crossings, and finally signing off with netlist CDC to unearth any glitches and corrupted synchronizers introduced during synthesis.
As design sizes increased every generation, the EDA tools could not handle running flat, and the only way to handle design complexity was hierarchical CDC-RDC analysis consuming abstracts. Hierarchical analysis also enables teams across the globe to run the analysis in parallel. Even with all this significant progress in EDA tool capabilities, the major bottleneck in CDC-RDC analysis of complex SOCs and chiplets is consuming abstracts generated by different vendor tools. Abstracts from different vendor tools arise because of multiple IP vendors, and even in-house teams might deliver abstracts generated with different vendors' tools.
The Accellera CDC Working-Group aims to define a standard CDC-RDC IP-XACT / TCL model to be portable and reusable regardless of the involved verification tool.
As the industry moves from monolithic designs to IP/SOC flows with IPs sourced from a small, select set of providers to sourcing IPs globally (to create differentiated products), quality must be maintained while driving faster time-to-market. In areas where standards (SystemVerilog, OVM/UVM, LP/UPF) are present, integration is able to meet both quality and speed goals. However, in areas where standards are not available (in this case, CDC-RDC), most options trade off quality, time-to-market, or both. Creating a standard for interoperable collateral addresses this gap.
Engineering Special Session


Back-End Design
DescriptionCDC-RDC analysis has evolved into an inevitable stage of RTL quality signoff over the last two decades. Over this period, designs have grown exponentially to SOCs with 2 trillion+ transistors and chiplets with 7+ SOCs. Today, CDC verification has become a multifaceted effort across chips designed for clients, servers, mobile, automotive, memory, AI/ML, FPGA, etc., with focus on cleaning up thousands of clocks and constraints, integrating the SVAs for constraints into the validation environment to check for correctness, looking for power-domain- and DFT-logic-induced crossings, and finally signing off with netlist CDC to unearth any glitches and corrupted synchronizers introduced during synthesis.
As design sizes increased every generation, the EDA tools could not handle running flat, and the only way to handle design complexity was hierarchical CDC-RDC analysis consuming abstracts. Hierarchical analysis also enables teams across the globe to run the analysis in parallel. Even with all this significant progress in EDA tool capabilities, the major bottleneck in CDC-RDC analysis of complex SOCs and chiplets is consuming abstracts generated by different vendor tools. Abstracts from different vendor tools arise because of multiple IP vendors, and even in-house teams might deliver abstracts generated with different vendors' tools.
The Accellera CDC Working-Group aims to define a standard CDC-RDC IP-XACT / TCL model to be portable and reusable regardless of the involved verification tool.
As the industry moves from monolithic designs to IP/SOC flows with IPs sourced from a small, select set of providers to sourcing IPs globally (to create differentiated products), quality must be maintained while driving faster time-to-market. In areas where standards (SystemVerilog, OVM/UVM, LP/UPF) are present, integration is able to meet both quality and speed goals. However, in areas where standards are not available (in this case, CDC-RDC), most options trade off quality, time-to-market, or both. Creating a standard for interoperable collateral addresses this gap.
This special session aims to review the definitions of basic CDC-RDC concepts and constraints, describe the reference verification flow, and address the goals, scope, structure, and deliverables of the Accellera CDC Working Group in order to elaborate a specification of the standard abstract model.
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
DescriptionWhile distributed, neural-network-based resource controllers represent the state of the art for their ability to cope with the ever-expanding decision space, such approaches suffer from several limitations, like conflicting control decisions and partial observability. These effects can significantly impair the controllers' learning capabilities and the robustness of their control policies, causing substantial performance losses. We are the first to solve this problem by employing a centralized training and decentralized control regime to mitigate the aforementioned limitations. Specifically, we design a centralized neural network (critic) that evaluates the behavior of multiple decentralized, neural controllers (actors) in a system-wide context. The objective of our proposed technique is to maximize the performance under a temperature constraint through dynamic voltage frequency scaling. The evaluation of our technique on a real processor shows its superiority to the state-of-the-art technique, yielding average (peak) performance improvements of 20% (34%), which we consider a breakthrough as the gains are measured on a real-world platform.
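For orientation, the sketch below shows the structural shape of centralized training with decentralized control for per-core DVFS: each actor sees only its own core's state, while a centralized critic scores the joint state and joint action. The linear "networks", state layout, and frequency levels are illustrative assumptions, not the paper's architecture.

```python
# Minimal structural sketch of centralized-critic / decentralized-actor DVFS control.
import numpy as np

rng = np.random.default_rng(0)
N_CORES, STATE_DIM = 4, 3                       # per-core state: [utilization, temperature, ips]
FREQ_LEVELS = np.array([1.2, 1.8, 2.4, 3.0])    # GHz, hypothetical

# Decentralized actors: each sees only its own core's local state.
actor_w = [rng.normal(size=(STATE_DIM, len(FREQ_LEVELS))) for _ in range(N_CORES)]

def act(core, local_state):
    logits = local_state @ actor_w[core]
    return int(np.argmax(logits))               # index into FREQ_LEVELS

# Centralized critic: evaluates the joint state and joint action system-wide,
# so credit assignment can account for thermal coupling between cores.
critic_w = rng.normal(size=(N_CORES * STATE_DIM + N_CORES,))

def critic_value(joint_state, joint_action):
    x = np.concatenate([joint_state.ravel(), np.asarray(joint_action, float)])
    return float(x @ critic_w)

joint_state = rng.random((N_CORES, STATE_DIM))
joint_action = [act(c, joint_state[c]) for c in range(N_CORES)]
print("chosen frequencies (GHz):", FREQ_LEVELS[joint_action])
print("critic's system-wide value estimate:", critic_value(joint_state, joint_action))
```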
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionBoolean satisfiability (SAT), the first problem proven to be non-deterministic polynomial-time (NP)-complete, is crucial in data-intensive applications. Different applications present a wide spectrum of SAT problem sets (in scale and complexity) as well as varied solution requirements (algorithm completeness, speed). Current SAT solvers are insufficient to provide an optimal solution across these different scenarios.
In this work, we design Chameleon-SAT, an adaptive SAT accelerator using mixed-signal in-memory computing that supports local search, DPLL, and CDCL algorithms while leveraging an efficient mixed-signal computing architecture, achieving orders-of-magnitude improvement in speed compared to prior SAT solvers. By judiciously selecting the reconfiguration mode, Chameleon-SAT is able to solve a wide range of SAT problems, achieving $\geq 90\times$ speedup on small-scale, high-complexity cases (20 variables / 86 clauses, satisfiable problems), $\geq 19\times$ on medium-scale, structured cases (50 variables / 215 clauses, unsatisfiable problems), and $\geq 7\times$ on large-scale, high-complexity cases (100 variables / 430 clauses, satisfiable problems).
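For reference, here is a minimal software DPLL solver for CNF formulas, one of the algorithm classes the accelerator supports; the mixed-signal in-memory hardware mapping itself is not shown, and the clause encoding is the usual DIMACS-style convention rather than anything specific to Chameleon-SAT.

```python
# Minimal DPLL sketch: clauses are lists of non-zero ints (positive = variable, negative = negation).
def dpll(clauses, assignment=None):
    assignment = dict(assignment or {})

    def simplify(cls):
        out = []
        for clause in cls:
            if any(assignment.get(abs(l)) == (l > 0) for l in clause):
                continue                      # clause already satisfied
            reduced = [l for l in clause if abs(l) not in assignment]
            if not reduced:
                return None                   # empty clause: conflict
            out.append(reduced)
        return out

    clauses = simplify(clauses)
    if clauses is None:
        return None                           # UNSAT under this partial assignment
    if not clauses:
        return assignment                     # all clauses satisfied
    for clause in clauses:                    # unit propagation
        if len(clause) == 1:
            lit = clause[0]
            assignment[abs(lit)] = lit > 0
            return dpll(clauses, assignment)
    var = abs(clauses[0][0])                  # branch on the first unassigned variable
    for value in (True, False):
        result = dpll(clauses, {**assignment, var: value})
        if result is not None:
            return result
    return None

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
print(dpll([[1, 2], [-1, 3], [-2, -3]]))
```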
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
DescriptionLarge Language Models (LLMs) have demonstrated significant potential in automating the Electronic Design Automation (EDA) process through effective integration with EDA tools. This paper targets the customization of logic synthesis scripts, which is crucial for accommodating the unique characteristics of each design in the EDA workflow. The proposed framework, called ChatLS, integrates multimodal retrieval-augmented generation (RAG) and chain-of-thought (CoT) reasoning, enabling LLMs to collaboratively analyze design features and precisely customize synthesis scripts. Experimental results demonstrate that ChatLS achieves superior performance in customizing synthesis scripts with a commercial logic synthesis tool.
Engineering Poster
Networking
Chip package level thermal integrity analysis of high-power data center chips for hot spot detection
5:00pm - 6:00pm PDT Monday, June 23 Engineering Posters, Level 2 Exhibit Hall

DescriptionIn advanced node stacked chips, increasing power density within the chip results in reaching thermal limits of operation and a need for identifying thermal hotspots on the chip and designing appropriate cooling solutions. In this presentation, we demonstrate chip package level thermal analysis performed for a large Machine learning accelerator chip to model thermal gradient across the design for multiple distinct vectors.
The design consists of compute dies and high-bandwidth memory chips stacked on an organic interposer. The tile-based power map generated for each distinct vector includes leakage power as a function of temperature along with switching and internal power, improving accuracy by ~3.5°C. Thermal material properties of the on-die routing layers, along with package and system-level cooling details, were incorporated.
Nearly 96% of the power was estimated to be dissipated through the heat sink system, compared to just 4% through the package. The heat map and thermal gradient at each individual on-die routing layer were extracted. We identified the hot spots specific to each vector, and accurate thermal sensor placement at these locations enabled us to capture on-die temperature, avoid overheating, and minimize thermal failures by providing appropriate feedback to trigger mitigation techniques.
Engineering Poster


DescriptionIt is widely understood in the semiconductor design and manufacturing industry that antenna-related failure results from negative charge accumulated during metal interconnect etching discharging through the point of least resistance on the path. When the antenna discharge issue was discovered, the solution to avoid such failures was to edit the layout so that the ratio of metal area to the combined gate and diode area did not exceed the acceptable antenna ratio value determined by the foundry.
In deep sub-micron manufacturing, antenna analysis has become much more complicated. As such, a computation-based method is used to estimate the potential damage. Accurate antenna analysis on a very large design requires substantial computing resources. The antenna model is often simplified using design assumptions, compromising accuracy for engineering time and the cost of computing resources. To be fair, the foundry that creates the antenna models and rule checks cannot anticipate all possible design structures that cause antenna failures. However, designers may be able to improve the reliability of their chips by adding the proposed concept of customized antenna checking.
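As a toy illustration of the classic antenna-ratio check described above, the sketch below flags nets whose accumulated-charge proxy (metal area) relative to the discharge area (gate plus diode area) exceeds a foundry limit. The limit value and net data are hypothetical, and real checks are per-layer and far more detailed.

```python
# Hedged sketch of a per-net antenna ratio check; constants are hypothetical.
MAX_ANTENNA_RATIO = 400.0   # hypothetical foundry limit for one metal layer

def antenna_violations(nets):
    """nets: list of dicts with 'name', 'metal_area', 'gate_area', 'diode_area' (um^2)."""
    violations = []
    for net in nets:
        discharge_area = net["gate_area"] + net["diode_area"]
        if discharge_area == 0:
            continue                       # no gate attached on this layer
        ratio = net["metal_area"] / discharge_area
        if ratio > MAX_ANTENNA_RATIO:
            violations.append((net["name"], ratio))
    return violations

nets = [
    {"name": "net_clk", "metal_area": 90_000.0, "gate_area": 120.0, "diode_area": 0.0},
    {"name": "net_d0",  "metal_area": 20_000.0, "gate_area": 150.0, "diode_area": 80.0},
]
print(antenna_violations(nets))   # net_clk: ratio 750 > 400 -> flagged
```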
Research Manuscript
ChipAlign: Instruction Alignment in Large Language Models for Chip Design via Geodesic Interpolation
2:45pm - 3:00pm PDT Monday, June 23 3000, Level 3

AI
AI1: AI/ML Algorithms
DescriptionRecent advancements in large language models (LLMs) have expanded their application across various domains, including chip design, where domain-adapted chip models like ChipNeMo have emerged. However, these models often struggle with instruction alignment, a crucial capability for LLMs that involves following explicit human directives. This limitation impedes the practical application of chip LLMs, including serving as assistant chatbots for hardware design engineers. In this work, we introduce ChipAlign, a novel approach that utilizes a training-free model merging strategy, combining the strengths of a general instruction-aligned LLM with a chip-specific LLM. By considering the underlying manifold in the weight space, ChipAlign employs geodesic interpolation to effectively fuse the weights of input LLMs, producing a merged model that inherits strong instruction alignment and chip expertise from the respective instruction and chip LLMs. Our results demonstrate that ChipAlign significantly enhances instruction-following capabilities of existing chip LLMs, achieving up to a 26.6% improvement on the IFEval benchmark, while maintaining comparable expertise in the chip domain. This improvement in instruction alignment also translates to notable gains in instruction-involved QA tasks, delivering performance enhancements of 3.9% on the OpenROAD QA benchmark and 8.25% on production-level chip QA benchmarks, surpassing state-of-the-art baselines.
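A common way to realize geodesic interpolation between two weight tensors is spherical linear interpolation (slerp) of their flattened, normalized values. The sketch below shows slerp itself; whether ChipAlign flattens, normalizes, or weights layers in exactly this way is an assumption, and this is only an illustration of the merge primitive.

```python
# Hedged sketch of geodesic (spherical) interpolation between two weight tensors.
import numpy as np

def slerp(w_instruct, w_chip, t=0.5, eps=1e-8):
    """Spherical linear interpolation between two flattened weight vectors."""
    a, b = w_instruct.ravel(), w_chip.ravel()
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    ua, ub = a / (na + eps), b / (nb + eps)
    omega = np.arccos(np.clip(np.dot(ua, ub), -1.0, 1.0))   # angle on the unit sphere
    if omega < eps:                                          # nearly parallel: plain lerp
        merged_dir = (1 - t) * ua + t * ub
    else:
        merged_dir = (np.sin((1 - t) * omega) * ua + np.sin(t * omega) * ub) / np.sin(omega)
    scale = (1 - t) * na + t * nb                            # interpolate magnitude linearly
    return (scale * merged_dir).reshape(w_instruct.shape)

rng = np.random.default_rng(0)
w_general = rng.normal(size=(4, 4))    # stand-in for an instruction-aligned LLM layer
w_domain = rng.normal(size=(4, 4))     # stand-in for a chip-specific LLM layer
print(slerp(w_general, w_domain, t=0.5))
```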
Research Manuscript


EDA
EDA8: Design for Manufacturing and Reliability
DescriptionElectromigration (EM) has become one of the major challenges for 2.5D and 3D chiplet integration systems. However, most research on EM focuses on 2D power delivery networks and cannot take the vertical power supply structures and the non-uniform thermal distribution between dies into consideration. To mitigate this problem, this article proposes ChipletEM, a novel EM simulation tool for 2.5D and 3D chiplet integration systems. A finite volume method (FVM) based electrical-thermal co-simulation model is employed to obtain the initial temperature and current density inside the TSV. A finite difference time domain (FDTD) solver is employed for hydrostatic stress simulation in both the nucleation and post-voiding phases; the thermal migration effect is also considered in that solver. A compact TSV thermal solver is employed for temperature distribution simulation and thermally dependent current simulation. The FDTD EM solver and the TSV thermal solver are coupled at each time step so that the interaction among EM stress, thermal stress, void growth, resistance change, IR drop, and Joule heating effects can be simulated in a single simulation framework. The accuracy of the proposed tool is validated against a commercial finite element method (FEM) tool and published experimental data. Comparison results show that the proposed method has high accuracy and fast simulation speed: ChipletEM achieves a 10× speedup compared with the commercial FEM tool with only a 2% accuracy trade-off. Furthermore, compared with experimental results, the average error is less than 5%.
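To show only the shape of the per-time-step coupling described above, the sketch below exchanges state between a thermal update and a stress/resistance update each step. The update rules are toy linear placeholders, not the FVM/FDTD physics in ChipletEM, and every constant is a made-up assumption.

```python
# Structural sketch of coupled electro-thermal / EM-stress time stepping (toy models only).
AMBIENT_C, DT, STEPS = 45.0, 1.0, 5

def thermal_step(power_w):
    """Toy thermal solver: temperature rises linearly with dissipated power."""
    return AMBIENT_C + 8.0 * power_w

def stress_step(stress, temperature_c, current_a):
    """Toy EM stress update: grows with current and with elevated temperature."""
    return stress + DT * current_a * (1.0 + 0.01 * (temperature_c - AMBIENT_C))

def resistance_from_stress(stress, r0=0.5):
    """Toy post-voiding model: resistance increases once stress passes a threshold."""
    return r0 * (1.0 + max(0.0, stress - 10.0) * 0.05)

current, stress, resistance = 2.0, 0.0, 0.5
for step in range(STEPS):
    power = current**2 * resistance                       # Joule heating in the TSV segment
    temperature = thermal_step(power)                     # thermal solve
    stress = stress_step(stress, temperature, current)    # stress solve (toy stand-in for FDTD)
    resistance = resistance_from_stress(stress)           # void growth -> resistance change
    print(f"step {step}: T={temperature:.1f}C stress={stress:.2f} R={resistance:.3f} ohm")
```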
Research Panel


AI
DescriptionThe rapid evolution of artificial intelligence (AI) marks a transformative leap in technology, reshaping industries and influencing everyday life. AI has emerged as a cornerstone of innovation, enhancing productivity and unlocking new possibilities across diverse domains. The integration of advanced deep learning (DL) models, vast datasets, and powerful hardware is revolutionizing the computing industry. Over the past decade, advancements such as convolutional neural networks and transformers have broadened AI's applications, from vision and language to sophisticated generative tasks. Today, large language models (LLMs), equipped with trillions of parameters and trained on terabytes of data, exemplify this progress.
Accompanying these developments are significant hardware breakthroughs. For instance, modern GPUs achieve an astonishing performance of 40 Peta Ops, representing exponential improvements over the last decade. Energy efficiency has also seen significant progress, with cutting-edge research prototypes delivering over 100 TOPS/W.
We stand at a pivotal moment, where reflection on past achievements enables us to celebrate milestones and identify key contributors. At the same time, this understanding helps shape a roadmap for the future, highlighting challenges and exploring innovative solutions. Critical questions driving these discussions include:
1. Dual-edged nature of AI
o While LLMs and AI advancements have transformed our lives, have they genuinely boosted productivity, or have they primarily fueled the spread of misinformation?
o Have we sacrificed security and overlooked ethical biases in the rush for rapid progress? What critical lessons can we learn from these past missteps?
2. Have we reached the limits of AI scaling?
o Can models and hardware continue to grow at their current pace, or is the era of exponential scaling nearing its end?
o Are smaller models the answer to high computational demands?
3. Is hardware innovation keeping up?
o Can hardware performance sustain the rapid advancements in DL models?
o What emerging hardware technologies could disrupt the future? Are technologies like in-memory computing and neuromorphic computing promising, or just hype?
4. Specialization vs. Generalization
o Will the future belong to specialized models tailored to specific domains, or will general-purpose models dominate? How transformative are technologies such as Mixture of Experts (MoE)?
o Should future DL hardware prioritize bespoke solutions, or is flexibility key to serving diverse applications?
5. Economic Viability
o Can AI applications justify their soaring costs in both the short and long term?
o Are companies overinvesting in AI without clear paths to economic sustainability?
This panel discussion will convene experts from industry and academia with extensive experience in deep learning systems and product development. By reflecting on AI's priorities and lessons from the past decade, the panel will explore strategies to address pressing challenges in AI development. These insights aim to pave a roadmap for AI's future, fostering a balanced and innovative approach to technological advancement.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionThe Basic Linear Algebra Subprograms (BLAS) is a fundamental software library. Many operations in BLAS are data-intensive and are limited by the memory bandwidth of the CPU and GPU. The computing-in-memory (CIM) technology can effectively alleviate the memory wall bottleneck and is particularly suitable for accelerating BLAS. We propose the first CIM accelerator for BLAS, CIM-BLAS, based on non-volatile memory. CIM-BLAS includes a unified floating-point pipeline to support high-precision arithmetics. High efficiency of the accelerator is achieved by developing configurable data flows to support various BLAS functions. The evaluations demonstrate the significant potential of CIM for accelerating BLAS.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionDigital Compute-in-Memory (CIM) architectures have shown great promise in accelerating deep neural networks (DNNs) by addressing the memory wall bottleneck. However, their development and optimization are hindered by a lack of systematic tools encompassing both software and hardware design spaces. We present CIMFlow, an integrated framework that offers an out-of-the-box workflow for implementing and evaluating DNN workloads on digital CIM platforms. CIMFlow bridges compilation and simulation through a flexible ISA design and tackles digital CIM constraints with advanced optimization strategies in the compilation flow. Our evaluation demonstrates that CIMFlow enables systematic exploration and optimization across diverse configurations.
Engineering Poster
Networking


DescriptionI/O circuit design involving multiple standards often faces varied constraints which need to be handled appropriately at the architecture level to avoid a significant area penalty. Supporting a wide range of operating loads adds complexity to meeting timing and drive specifications across all scenarios. Further, achieving a tight spread of transition times across PVT and minimal loop delays in the worst PVT corner while ensuring design reliability is a key challenge. Traditional methods rely on repetitive manual iterations and post-design validations, leading to inefficiencies and longer development cycles. In this paper we illustrate a comprehensive design and optimization methodology aimed at addressing these challenges, resulting in an area-optimized, robust, and reliable I/O circuit supporting a wide range of applications. While the findings are illustrated based on development work in the STM 0.13u technology node for analog applications, the outcomes are relevant to I/O interface design in any semiconductor technology node.
Networking
Work-in-Progress Poster


DescriptionAnalog circuit synthesis is crucial to Electronic Design Automation (EDA), automating the creation of circuit structures tailored to design requirements. Addressing challenges in the vast design space and constraint adherence, we propose CIRCUITSYNTH-RL, an RL-based framework in two phases: instruction tuning and RL refinement. Instruction tuning adapts LLMs to generate initial circuit topologies based on input constraints like component pool and efficiency. RL refinement uses reward models to align designs with constraints. Experiments show superior performance in generating compliant circuits and highlight the framework's ability to generalize to more complex configurations with limited training data.
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionCircuit stability (sensitivity) analysis aims to estimate the overall performance impact resulting from variations in underlying design parameters, such as gate sizes and capacitance. This process is challenging because it often requires numerous time-consuming circuit simulations. In contrast, graph neural networks (GNNs) have shown remarkable effectiveness and efficiency in tackling several chip design automation issues, including circuit timing predictions, parasitics prediction, gate sizing, and device placement. This paper introduces a novel approach called CirSTAG, which utilizes GNNs to analyze the stability (robustness) of modern integrated circuits (ICs). CirSTAG is grounded in a spectral framework that examines the stability of GNNs by leveraging input/output graph-based manifolds. When two adjacent nodes on the input manifold are mapped (through a GNN model) to two remote nodes (data samples) on the output manifold, this indicates a significant mapping distortion (DMD) and consequently poor GNN stability. CirSTAG calculates a stability score equivalent to the local Lipschitz constant for each node and edge, taking into account both graph structure and node feature perturbations. This enables the identification of the most critical (sensitive) circuit elements that could significantly impact circuit performance. Our empirical evaluations across various timing prediction tasks with realistic circuit designs demonstrate that CirSTAG can accurately estimate the stability of each circuit element under diverse parameter variations. This offers a scalable method for assessing the stability of large integrated circuit designs.
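The sketch below illustrates the distance-mapping-distortion intuition behind such stability scores: for each edge, compare how far two neighboring nodes end up in the output embedding space relative to their distance in the input feature space, and keep the worst ratio per node as a local-Lipschitz-style sensitivity estimate. The toy graph, features, and stand-in "GNN" are illustrative assumptions, not CirSTAG's spectral framework.

```python
# Hedged sketch of a per-node output/input distance-ratio stability score.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))                         # input node features (toy)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
W = rng.normal(size=(4, 4))

def gnn_embed(X):
    """Stand-in for a trained GNN's node embeddings (one linear layer + ReLU)."""
    return np.maximum(X @ W, 0.0)

def node_stability_scores(X, edges):
    """Worst-case output/input distance ratio over incident edges, per node."""
    Z = gnn_embed(X)
    scores = np.zeros(X.shape[0])
    for u, v in edges:
        din = np.linalg.norm(X[u] - X[v]) + 1e-12
        dout = np.linalg.norm(Z[u] - Z[v])
        ratio = dout / din
        scores[u] = max(scores[u], ratio)
        scores[v] = max(scores[v], ratio)
    return scores

scores = node_stability_scores(X, edges)
print("most sensitive node:", int(np.argmax(scores)), "score:", round(float(scores.max()), 3))
```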
Networking
Work-in-Progress Poster


DescriptionModern data storage systems heavily rely on SSDs due to their high speed and efficiency. With the rise of data-intensive applications, particularly in deep learning, the demand for increased storage capacity is more pressing than ever. To address this challenge, we propose Compression in SSDs (CiS), an approach to enhance storage performance and reliability by applying data compression techniques within SSDs. By compressing user data and storing it in NAND flash, the number of bitlines selected within the NAND flash during a read operation can be effectively reduced. Reducing the number of selected bitlines during read operations helps mitigate read-retry occurrences that may arise due to read disturbance effects. This paper examines various compression algorithms for storing user data in NAND flash-based SSDs and evaluates their effects on compression ratio and storage reliability. Our findings demonstrate that strategic grouping of similar file types in data centers, coupled with the application of appropriate compression algorithms, can lead to significant improvements in storage read performance by reducing read-retry counts. The results show that CiS reduced the read-retry count by an average of 32%, 62%, and 88% compared to the baseline when the compression ratios were 4/3, 2, and 4, respectively.
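As a small host-side experiment in the spirit of the grouping-by-file-type idea above, the sketch below compresses sample payloads per group with zlib and reports the achieved compression ratio. The payloads and grouping are hypothetical, and the in-SSD, bitline-level mechanics are of course not modeled here.

```python
# Hedged sketch: per-group compression ratios with zlib at different levels.
import os
import zlib

groups = {
    "text_logs": b"GET /index.html HTTP/1.1 200 OK\n" * 500,   # highly redundant data
    "binary_blob": os.urandom(16_000),                          # essentially incompressible data
}

for name, payload in groups.items():
    for level in (1, 6, 9):
        compressed = zlib.compress(payload, level)
        ratio = len(payload) / len(compressed)
        print(f"{name:12s} level={level} ratio={ratio:.2f}")
```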
Networking
Work-in-Progress Poster


DescriptionThe ability to selectively forget learned information, a capability crucial for privacy, security, and dynamic adaptation, is unexplored in hyperdimensional computing (HDC) systems. In this paper, we show that unlearning in HDC is challenging due to its memorization nature, making it difficult to naturally forget specific information. We then present CLEAR-HD, a lightweight and effective framework for HDC unlearning. CLEAR-HD tracks the effect of encoded vectors in the model and offsets the impact of unlearned data with appropriate substitutes. CLEAR-HD also utilizes selective retraining to minimize accuracy loss. CLEAR-HD outperforms the baselines in unlearning quality, accuracy, and performance.
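The toy sketch below shows why tracking encoded vectors makes unlearning expressible in HDC's additive memory: a class prototype is the bundle (sum) of encoded sample hypervectors, so a tracked sample's contribution can be subtracted back out. CLEAR-HD's substitute-offset and selective-retraining steps are not reproduced here, and the encoder is an arbitrary random projection chosen for illustration.

```python
# Hedged sketch of subtracting a tracked sample's contribution from an HDC class prototype.
import numpy as np

D = 512
rng = np.random.default_rng(0)
projection = rng.normal(size=(8, D))           # random encoder (illustrative assumption)

def encode(x):
    return np.sign(x @ projection)             # bipolar hypervector for a sample

samples = rng.random((20, 8))
encoded = [encode(x) for x in samples]         # tracked per-sample contributions
prototype = np.sum(encoded, axis=0)            # class hypervector (additive memory)

def unlearn(prototype, indices):
    """Remove the tracked contributions of the samples being forgotten."""
    for i in indices:
        prototype = prototype - encoded[i]
    return prototype

forgotten = unlearn(prototype, indices=[3, 7])
removed = encode(samples[3])
print("similarity to forgotten sample before:", float(np.dot(prototype, removed)) / D)
print("similarity to forgotten sample after: ", float(np.dot(forgotten, removed)) / D)
```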
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionThe complexity of design rules and intense time-to-market demands have made auto-placement tools essential for advanced printed circuit board (PCB) designs. This paper presents a novel PCB placement framework to handle pad-to-pad clearance constraints and heterogeneous components to address these challenges. Unlike existing academic placers, our framework focuses on the following key features: a wire-area model to account for various routing resource needs between power and signal nets, a pad-to-pad clearance model to minimize spacing violations, and a two-sided, pad-type-aware density model to reduce component and pad overlap. We further develop a quadratic programming-based legalizer to resolve constraint violations among components of varying shapes. Experimental results show the effectiveness and efficiency of our framework, surpassing two state-of-the-art academic placers in post-routing quality on both academic and industrial benchmarks.
Engineering Poster
Networking


DescriptionAs node technology has evolved, routing congestion and net delay issues have become more severe.
BSPDN and BS clock routing solutions can resolve these challenges by segregating the routing area with very-low-resistance metal.
We propose BS H-tree CTS, which shows robust timing characteristics by fully exploiting back-side metal resources.
The low resistance of BS metal in BS CTS reduced clock latency and net delay, and the symmetric structure of the H-tree improved clock skew.
Based on our methodology, the CPU design achieved a 5.3% improvement in cell area and 9.7% in performance, and the GPU design achieved 9.6% in cell area and 10.7% in performance.
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionLarge Language Models (LLMs) have been widely deployed in a variety of applications, and the context length is rapidly increasing to handle tasks such as long-document QA and complex logical reasoning. However, long context poses significant challenges for inference efficiency, including high memory costs of key-value (KV) cache and increased latency due to extensive memory accesses. Recent works have proposed compressing KV cache to approximate computation, but these methods either evict tokens permanently, never recalling them for later inference, or recall previous tokens at the granularity of pages divided by textual positions. Both approaches degrade the model accuracy and output quality. To achieve efficient and accurate recallable KV cache compression, we introduce ClusterKV, which recalls tokens at the granularity of semantic clusters. We design and implement efficient algorithms and systems for clustering, selection, indexing and caching. Experiment results show that ClusterKV attains negligible accuracy loss across various tasks with 32k context lengths, using only a 1k to 2k KV cache budget, and achieves up to a 2× speedup in latency and a 2.5× improvement in decoding throughput. Compared to SoTA recallable KV compression methods, ClusterKV demonstrates higher model accuracy and output quality, while maintaining or exceeding inference efficiency.
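The sketch below illustrates cluster-granularity recall for a KV cache: cached keys are grouped into semantic clusters, and only tokens in the clusters most similar to the current query are brought back for attention. The tiny k-means, cluster count, and budget are illustrative assumptions, not ClusterKV's actual kernels or indexing structures.

```python
# Hedged sketch of semantic-cluster selection over cached keys.
import numpy as np

rng = np.random.default_rng(0)
keys = rng.normal(size=(1024, 64))             # cached key vectors (toy sizes)

def kmeans(x, k=16, iters=10):
    """Plain k-means on key vectors; each cluster is summarized by its centroid."""
    centroids = x[rng.choice(len(x), k, replace=False)].copy()
    for _ in range(iters):
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(dists, axis=1)
        for c in range(k):
            members = x[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids, labels

centroids, labels = kmeans(keys)

def recall_token_indices(query, centroids, labels, top_clusters=4):
    """Select clusters by similarity to the query, then recall all their tokens."""
    scores = centroids @ query
    selected = np.argsort(scores)[-top_clusters:]
    return np.flatnonzero(np.isin(labels, selected))

query = rng.normal(size=64)
idx = recall_token_indices(query, centroids, labels)
print(f"attending over {len(idx)} of {len(keys)} cached tokens")
```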
Research Manuscript


Security
SEC4: Embedded and Cross-Layer Security
DescriptionIoT protocols are essential for the communication among diverse devices.
In real-world scenarios, IoT protocols utilize flexible configurations to meet various use cases.
These configurations can significantly impact the protocols' execution paths, with many bugs emerging only under specific configurations. Fuzzing has become a prominent technique for uncovering vulnerabilities in IoT protocol implementations.
However, traditional fuzzing approaches are typically conducted using fixed or default configurations, overlooking potential issues that might arise in different settings.
This limitation can lead to missing critical bugs that appear only under alternative configurations.
In this paper, we propose CMFuzz, a parallel fuzzing framework designed to improve fuzzing effectiveness of IoT protocols through configuration identification and scheduling.
CMFuzz first constructs a generalized protocol configuration model by systematically extracting configuration items from protocol implementations.
Then, based on this model, CMFuzz defines the relations among configuration items and introduces a relation-aware allocation mechanism to distribute them across parallel fuzzing instances.
For evaluation, we implement CMFuzz on top of the widely-used protocol fuzzer Peach and conduct experiments on six popular IoT protocols.
Compared to the original parallel mode of Peach and state-of-the-art parallel protocol fuzzer SPFuzz, CMFuzz covers an average of 34.4% and 28.5% more branches within 24 hours.
Additionally, CMFuzz has detected 14 previously-unknown bugs in these real-world IoT protocols.
Research Manuscript


Security
SEC1: AI/ML Security/Privacy
DescriptionIntrusion detection systems (IDS) play a crucial role in IoT and network security by monitoring system data and alerting to suspicious activities. Machine learning (ML) has emerged as a promising solution for IDS, offering highly accurate intrusion detection. However, ML-IDS solutions often overlook two critical aspects needed to build reliable systems: continually changing data streams and a lack of attack labels. Streaming network traffic and associated cyber attacks are continually changing, which can degrade the performance of deployed ML models. Labeling attack data, such as zero-day attacks, in real-world intrusion scenarios may not be feasible, making the use of ML solutions that do not rely on attack labels necessary. To address both these challenges, we propose CND-IDS, a continual novelty detection IDS framework which consists of (i) a learning-based feature extractor that continuously updates new feature representations of the system data, and (ii) a novelty detector that identifies new cyber attacks by leveraging principal component analysis (PCA) reconstruction. Our results on realistic intrusion datasets show that CND-IDS achieves up to 6.1× F-score improvement, and up to 6.5× improved forward transfer over the SOTA unsupervised continual learning algorithm.
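A minimal sketch of the PCA-reconstruction novelty detection mentioned above, assuming feature vectors from the learned extractor are already available; the component count, threshold, and data are illustrative.

import numpy as np

def fit_pca(benign, n_components=8):
    # Fit a low-rank benign subspace from attack-free training features.
    mean = benign.mean(axis=0)
    _, _, vt = np.linalg.svd(benign - mean, full_matrices=False)
    return mean, vt[:n_components]

def novelty_score(x, mean, components):
    # Project onto the benign subspace and measure reconstruction error;
    # a large error suggests traffic unlike anything seen during training.
    proj = (x - mean) @ components.T @ components
    return np.linalg.norm((x - mean) - proj, axis=-1)

benign = np.random.randn(1000, 32)
mean, comps = fit_pca(benign)
threshold = np.percentile(novelty_score(benign, mean, comps), 99)
attack_like = np.random.randn(5, 32) * 4  # illustrative out-of-distribution samples
print(novelty_score(attack_like, mean, comps) > threshold)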
Research Special Session


Design
DescriptionWithin the noisy intermediate-scale quantum (NISQ) regime, quantum computers have the potential to perform tasks that cannot be achieved on classical hardware. A logical question to ask is how the architecture and design of quantum information processors impact their effectiveness in addressing problems of interest. In some architectures, such as those with atomic qubits, qubits are not fixed in space but can instead be dynamically reconfigured. Using this architectural flexibility, we explore the co-design of trapped-ion quantum computers with NISQ applications. By incorporating quantum error correction strategies and conducting detailed performance and resource estimations, we show that architectural decisions are impactful for practical utility, even when component performance metrics such as gate fidelity and speed remain fixed.
Research Manuscript


Systems
SYS1: Autonomous Systems (Automotive, Robotics, Drones)
DescriptionEfficient control of prosthetic limbs via non-invasive brain-computer interfaces (BCIs) requires advanced EEG processing capabilities—including pre-filtering, feature extraction, and action prediction—all performed in real-time on edge AI hardware. Achieving this level of real-time processing on resource-constrained edge devices presents significant challenges in balancing model complexity, computational efficiency, and latency. We present CognitiveArm, an EEG-driven, brain-controlled prosthetic system implemented on edge AI hardware, achieving real-time performance without compromising accuracy. The system integrates BrainFlow—an open-source library for EEG data acquisition and streaming—and optimized deep learning (DL) models for precise brain signal classification. By leveraging evolutionary search, we identify Pareto-optimal DL model configurations through hyper-parameter tuning, optimizer analysis, and window selection, analyzed individually and in ensemble configurations. We further apply model compression techniques such as pruning and quantization to optimize these models for embedded deployment, balancing computational efficiency and accuracy. We collected an EEG dataset and designed an annotation pipeline, enabling precise labeling of brain signals corresponding to specific intended actions, which forms the foundation for training our optimized DL models. CognitiveArm also supports voice commands for seamless mode switching, enabling control of the prosthetic arm's three degrees of freedom (DoF). Running independently on embedded hardware, CognitiveArm ensures low latency and facilitates real-time interaction. We developed a full-scale prototype of CognitiveArm, interfaced with the OpenBCI UltraCortex Mark IV EEG headset. Our evaluations demonstrate a significant improvement in accuracy, reaching up to 96% for classifying three core actions (left, right, and stay idle). The integration of voice commands allows for multiplexed, variable movement, enabling multi-action control for various everyday tasks (e.g., handshake, cup picking). This enhances CognitiveArm's real-world performance for prosthetic control, demonstrating its potential as a practical solution for individuals requiring advanced prosthetic limb control.
Networking
Work-in-Progress Poster


DES5: Emerging Device and Interconnect Technologies
DescriptionNon-volatile memory (NVM) technologies offer exciting opportunities for data-intensive computing, but the overwhelming range of available implementations can limit the ability to perform device-architecture co-design optimization. This work addresses these challenges by introducing a machine learning approach to model memristive devices that merges physics-informed and data-driven learning methodologies. We demonstrate that mimicking the switching dynamics with appropriate neural network architectures and incorporating physics modeling equations as constraints during the training phase facilitates the modeling task even with sparse experimental data. We validate this training approach against traditional data-driven solutions by comparing the respective modeling errors. Additionally, we investigate the ability of the proposed machine learning model to extrapolate high-level characteristics such as endurance and switching dynamics.
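The sketch below conveys the general shape of such a physics-informed training objective: a data-mismatch term plus a penalty on the residual of an assumed switching equation. The tiny recurrent surrogate and the placeholder physics law dw/dt = mu * v(t) are illustrative assumptions, not the paper's device model.

import numpy as np

def model(params, v, w0, dt):
    # Tiny recurrent surrogate: w_{t+1} = w_t + dt * tanh(a*v_t + b*w_t + c).
    a, b, c = params
    w = [w0]
    for vt in v:
        w.append(w[-1] + dt * np.tanh(a * vt + b * w[-1] + c))
    return np.array(w[1:])

def loss(params, v, w_meas, w0, dt, mu=1.0, lam=0.5):
    w_pred = model(params, v, w0, dt)
    data_term = np.mean((w_pred - w_meas) ** 2)          # fit to sparse measurements
    dw_dt = np.diff(np.concatenate([[w0], w_pred])) / dt
    physics_term = np.mean((dw_dt - mu * v) ** 2)        # residual of assumed physics
    return data_term + lam * physics_term

v = np.sin(np.linspace(0, 6, 50))                        # toy drive voltage
w_meas = np.clip(np.cumsum(v) * 0.1, 0, 1)               # toy "experimental" state trace
print(loss(np.array([1.0, -0.1, 0.0]), v, w_meas, w0=0.0, dt=0.1))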
Networking
Work-in-Progress Poster


DescriptionAs the semiconductor industry approaches the physical limits of Moore's Law, 3D integrated circuits (ICs)---which stack multiple dies vertically to improve performance and area efficiency---have emerged as a promising path forward.
However, this stacked architecture brings significant thermal challenges due to restricted heat dissipation.
Early-stage optimization during the placement stage is critical for alleviating these thermal issues.
Yet, existing 3D placement algorithms either fail to incorporate adequate thermal considerations or optimize each die in isolation, underscoring the necessity for a native-3D, thermal-aware, cross-die optimization solution.
In this paper, we propose T3DPlace, which incorporates multiple objectives---including wirelength, density, and thermal---into an analytical framework for cross-die optimization of 3D ICs.
To specifically tackle the unique thermal challenges in 3D ICs, we propose a compact thermal model (CTM)-based objective term, which is accurate and differentiable, seamlessly integrating hotspot temperature optimization into the analytical placement framework.
We implement the proposed framework with GPU acceleration.
Experiments demonstrate that T3DPlace achieves an average hotspot temperature reduction of $30.7\%$, while maintaining the quality of other critical design metrics.
The source code will be publicly accessible at https://github.com/OpenAfterReview.
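As a rough illustration of a multi-objective analytical placement step, the toy sketch below combines wirelength, density, and a thermal proxy into one objective and descends its (crude) numerical gradient; the Gaussian heat-spreading proxy stands in for the paper's compact thermal model and is purely an assumption.

import numpy as np

def objective(xy, nets, power, lam_d=1.0, lam_t=1.0):
    # Half-perimeter wirelength over toy nets.
    wl = sum(np.ptp(xy[list(net), 0]) + np.ptp(xy[list(net), 1]) for net in nets)
    # Density: penalise cells crowding the same unit bin.
    bins = np.floor(xy).astype(int)
    _, counts = np.unique(bins, axis=0, return_counts=True)
    density = np.sum((counts - 1.0) ** 2)
    # Thermal proxy: hotspot temperature from Gaussian-spread per-cell power.
    d2 = np.sum((xy[:, None, :] - xy[None, :, :]) ** 2, axis=-1)
    temp = (np.exp(-d2 / 2.0) * power[None, :]).sum(axis=1)
    return wl + lam_d * density + lam_t * temp.max()

def grad_step(xy, nets, power, lr=0.05, eps=1e-3):
    g = np.zeros_like(xy)
    for i in np.ndindex(xy.shape):            # crude numerical gradient, for brevity
        e = np.zeros_like(xy); e[i] = eps
        g[i] = (objective(xy + e, nets, power) - objective(xy - e, nets, power)) / (2 * eps)
    return xy - lr * g

xy = np.random.rand(8, 2) * 4.0               # 8 cells on a toy 4x4 die
nets = [(0, 1, 2), (3, 4), (5, 6, 7)]
power = np.array([1.0, 0.2, 0.2, 1.5, 0.2, 0.2, 0.2, 0.2])
for _ in range(20):
    xy = grad_step(xy, nets, power)
print(objective(xy, nets, power))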
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
DescriptionToday, unconventional hardware design techniques based on simple data representations are receiving more and more attention. Unary computing is one of these techniques; it processes data in the form of uniform bit-streams. The simplicity of implementing complex arithmetic operations and high tolerance to noise are the crucial advantages of unary systems. However, converting data from weighted binary radix to unary representation with existing comparator-based unary number generators is expensive in terms of footprint area and power consumption. The problem is aggravated as the number of inputs and the data precision increase. This work proposes a low-cost, comparison-free, unary number generation mechanism for efficient data conversion from binary radix to unary representation. We introduce a serial and two parallel (an exact and an approximate) unary number generators. Synthesis results show that the proposed method reduces the hardware area, power consumption, and area-delay product for both serial and parallel designs compared to the state-of-the-art converter. We evaluate the efficiency of the proposed converter in four use cases.
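For readers unfamiliar with unary representation, the sketch below shows the serial and parallel (thermometer-code) views of binary-to-unary conversion; it only models the conversion semantics and says nothing about the comparison-free hardware proposed in the paper.

def unary_serial(value, width):
    """Emit a uniform bit-stream of length 2**width containing `value` ones,
    one bit per cycle (behavioural view of a serial unary generator)."""
    remaining = value
    for _ in range(2 ** width):
        bit = 1 if remaining > 0 else 0
        remaining -= bit
        yield bit

def unary_parallel(value, width):
    """Parallel (thermometer-code) view of the same conversion."""
    length = 2 ** width
    return [1] * value + [0] * (length - value)

print(list(unary_serial(5, 3)))   # 1,1,1,1,1,0,0,0
print(unary_parallel(5, 3))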
Engineering Presentation


Front-End Design
DescriptionInterconnects are among the most critical IPs in any system-on-chip (SoC) design, directly impacting overall performance and functionality. The increasing complexity of interconnect architectures, coupled with stringent performance requirements, poses significant challenges to traditional verification methodologies. In this work, we present a comprehensive verification flow tailored for interconnects, designed to improve both design and verification quality.
Our flow integrates formal verification methodologies to ensure data integrity and leverages Universal Verification Methodology (UVM) components to generate detailed performance and arbitration reports. These reports provide actionable insights into interconnect behavior and performance metrics. The proposed solution is scalable, automated, and adaptable to various interconnect designs, making it a one-stop solution for holistic interconnect verification. Through this approach, we enable designers and verification engineers to meet the demands of modern SoC architectures efficiently and reliably.
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionThis paper develops a comprehensive placement and routing framework for synthesizing CFET cells to address scaling issues in advanced nodes. We first develop a BFS-based partitioning technique with a heuristic quality maintenance strategy to ensure scalability. Then, we propose a SMT-based placement method that incorporates partial routing for minimum-width placement and in-cell routability. Finally, a progressive metal routing method is proposed to address the challenges of routing resource scarcity. Compared with the state-of-the-art CFET cell generators, experimental results show that our algorithm achieves the optimal cell width for all cells, with 7 out of 30 cells exhibiting smaller widths.
Engineering Presentation


AI
Back-End Design
Chiplet
DescriptionUltra-large-scale 2.5DIC designs with multiple silicon bridges embedded on an organic interposer show excellent application prospects in various fields such as AI and graphics processing, but their development and application also face a series of challenges. Power integrity is one of the key challenges, including:
·High-density interconnect: Integration of multiple SoC/chip modules and memories in a compact space results in high transient current fluctuations, placing extremely high demands on the power delivery network (PDN)
·Noise suppression: Due to high-frequency operation, special attention needs to be paid to the control of power supply noise, especially AC noise and synchronous switching noise
·Dynamic load response: Rapidly changing workloads require PDN to be able to respond quickly and maintain voltage stability
Therefore, we need to consider a number of factors and adopt a series of measures to ensure a stable power supply and minimize noise. Silicon bridges not only help realize high-density interconnects but also provide significant advantages in power integrity.
By placing more Deep Trench Capacitors (DTCs) on the silicon bridge in close proximity to the point of load, the PDN can respond quickly to transient current changes. Silicon bridges shorten the PDN transfer path, allow the use of multiple power and ground planes, and extend those planes directly to each functional module, reducing voltage drop and enhancing power supply stability.
In contrast to the traditional approach of verifying PDN robustness only at the signoff phase, we use big-data-analytics EDA tools to perform early analysis of the PDN to establish whether the DTC configuration strategy on the silicon bridge is appropriate, whether the multilayer power and ground planes are effective, and whether the PDN network is at risk, predicting potential problems early in the design and adjusting the design plan in advance. In the mid-design and signoff phases, we perform power integrity analysis of the entire ultra-large-scale 2.5DIC system to ensure that the PDN design meets the target impedance requirements.
Keywords: 2.5DIC, silicon bridge, power integrity
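As context for the target-impedance requirement mentioned above, the snippet below shows a common rule-of-thumb estimate (allowed ripple fraction of Vdd divided by the worst-case transient current); the values are illustrative and are not taken from this work.

def pdn_target_impedance(vdd, ripple_fraction, transient_current):
    # Target impedance = allowed ripple voltage / worst-case current step.
    return vdd * ripple_fraction / transient_current

z_target = pdn_target_impedance(vdd=0.75, ripple_fraction=0.03, transient_current=50.0)
print(f"target impedance ~ {z_target * 1000:.3f} milliohm")   # ~0.450 milliohm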
Engineering Poster
Networking


DescriptionAt the forefront of digital design verification, Gate Level Simulation (GLS) is a critical technique for validating design accuracy at the most granular level.
However, the process comes with many challenges that make GLS a difficult task to complete, as detailed below.
This paper presents methodologies to optimize and accelerate GLS closure.
The challenges of GLS simulation include the following:
1. Simulation performance and run time,
2. Debug complexity due to initialization and x-propagation,
3. Resource involvement, including infrastructure and headcount at the SOC level, and
4. GLS closure within the deadline, as GLS activity starts at the end of the project cycle.
The opportunities explored in this paper include the following:
1. Drastic resource reduction in terms of headcount by adopting GLS regression automation,
2. Dynamic timing check on/off during simulation under user control to accelerate the simulation,
3. Enhanced MSIE (Multi-Snapshot Incremental Elaboration) tool flow to reduce compilation time and disk space consumption,
4. Common save-and-restore snapshot across the SOC to expedite GLS simulation,
5. Minimized hierarchical reference updates between RTL and GLS,
6. SDC (Standard Design Constraint) verification using the TCV tool for GLS scope reduction, and
7. Early detection of testbench and netlist floating-port issues through zero-delay simulation.
Research Manuscript


Design
DES6: Quantum Computing
DescriptionHybrid Quantum Neural Networks (HQNNs), under the umbrella of Quantum Machine Learning (QML), have garnered significant attention due to their potential to enhance computational performance by integrating quantum layers within traditional neural network (NN) architectures.
Despite numerous state-of-the-art applications, a fundamental question remains: Does the inclusion of quantum layers offer any computational advantage over purely classical models? If yes/no, how and why?
In this paper, we analyze how classical and hybrid models adapt their architectural complexity in response to increasing problem complexity. To this end, we select a multiclass classification problem and perform comprehensive benchmarking of classical models for increasing problem complexity, identifying those that optimize both accuracy and computational efficiency to establish a robust baseline for comparison. These baseline models are then systematically compared with HQNNs by evaluating the rate of increase in floating-point operations (FLOPs) and number of parameters, providing insights into how architectural complexity scales with problem complexity in both classical and hybrid networks.
We utilize classical machines to simulate the quantum layers in HQNNs, a common practice in the Noisy Intermediate-Scale Quantum (NISQ) era. Our analysis reveals that, as problem complexity increases, the architectural complexity of HQNNs, and consequently their FLOPs consumption, scales more efficiently despite the overhead of simulating quantum layers on classical hardware (an 88.5% increase in FLOPs from 10 features (low problem complexity) to 110 features (high problem complexity)), compared to classical networks (53.1%). Moreover, as the problem complexity increases, classical networks consistently exhibit a need for a larger number of parameters to accommodate the increasing problem complexity. Additionally, the rate of increase in the number of parameters is slower in HQNNs (81.4%) than in classical NNs (88.5%). These findings suggest that HQNNs provide a more scalable and resource-efficient solution, positioning them as a promising alternative for tackling complex computational problems.
Networking
Work-in-Progress Poster


DescriptionComputing-In-Memory (CIM) offers a potential solution to the memory wall issue and can achieve high energy efficiency by minimizing data movement, making it a promising architecture for edge AI devices. Lightweight models like MobileNet and EfficientNet, which utilize depthwise convolution for feature extraction, have been developed for these devices. However, CIM macros often face challenges in accelerating depthwise convolution, including underutilization of CIM memory and heavy buffer traffic. The latter, in particular, has been overlooked despite its significant impact on latency and energy consumption.
To address this, we introduce a novel CIM dataflow that significantly reduces buffer traffic by maximizing data reuse and improving memory utilization during depthwise convolution. The proposed dataflow is grounded in solid theoretical principles, fully demonstrated in this paper. When applied to MobileNet and EfficientNet models, our dataflow reduces buffer traffic by 77.4–87.0%, leading to a total reduction in data traffic energy and latency by 10.1–17.9% and 15.6–27.8%, respectively, compared to the baseline (conventional weight-stationary dataflow).
Engineering Poster
Networking


DescriptionConfidentiality is a crucial element of the security triad, which also includes Integrity and Availability. Spectre-like vulnerabilities arise from improper implementation and transient execution, while improper AES implementation can expose signals. Currently, there is no reliable single sign-off method in the industry, making verification complex and involving multiple stages such as Simulation, Formal, and post-Si validation. Signing off on confidentiality is particularly challenging without verification in unknown scenarios, and there is no single method for early detection at the RTL stage.
The proposed new technology addresses these challenges by capturing Secure Data Transaction Intent across User Defined Signals and performing static analysis to identify Illegal Data Flow causing Security Violations. This approach facilitates early detection with minimal constraints during the RTL design phase and is scalable to full SoC-sized designs, thereby saving time and effort.
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionContent-addressable memory (CAM) is a type of fast memory unique in its ability to perform parallel searches of stored data based on content rather than specific memory addresses. CAMs have been used in many domains, such as networking, databases, and graph processing.
Field-programmable gate arrays (FPGAs) are an attractive means to implement CAMs because of their low latency, reconfigurable, and energy-efficient nature.
However, such implementations also face significant challenges, including high resource utilization, limited scalability, and suboptimal performance due to the extensive use of look-up tables (LUTs) and block RAMs (BRAMs). These issues stem from the inherent limitations of FPGA architectures when handling the parallel operations required by CAMs, often leading to inefficient designs that cannot meet the demands of high-speed, data-intensive applications. To address these challenges, we propose a novel configurable CAM architecture that leverages the digital signal processing (DSP) blocks available in modern FPGAs as the core resource. By utilizing DSP blocks' data storage and logic capabilities, our approach enables configurable CAM architecture with efficient multi-query support while significantly reducing search and update latency for data-intensive applications. The DSP-based CAM architecture offers enhanced scalability, higher operating frequencies, and improved performance compared to traditional LUT and BRAM-based designs. In addition, we demonstrate the effectiveness of our proposed CAM architecture with a triangle counting application on real graphs. This innovative use of DSP blocks also opens up new possibilities for high-performance, data-intensive applications on FPGAs.
Our proposed design is open-sourced at: https://anonymous.4open.science/r/CAM-B9D1/
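The snippet below is a purely behavioural model of CAM search semantics (all stored entries matched against the key in parallel, with the matching addresses returned); it does not model the DSP-block mapping or multi-query support that the paper proposes.

import numpy as np

def cam_search(stored, key):
    # Compare every stored word against the key and return matching addresses.
    stored = np.asarray(stored)
    return np.flatnonzero(np.all(stored == np.asarray(key), axis=1))

table = [[1, 0, 1, 1],
         [0, 1, 1, 0],
         [1, 0, 1, 1]]
print(cam_search(table, [1, 0, 1, 1]))   # -> addresses [0 2]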
Engineering Poster
Networking


DescriptionAutomotive SoCs with body control applications require a current profile supported by the various power modes of the device. For body and gateway applications, power gating is a very commonly used technique to achieve the current profile.
Hence the MCU transitions to and from standby mode at regular intervals. Driven by the ultra-low-power consumption requirement of the MCU in standby mode, not all the IOs remain alive in standby mode, apart from the ones which would be used in the standby application.
Hence there is a need to configure the pad state of the device in standby mode based on the low power application requirements.
The conventional signal multiplexing approaches suffer from congestion and routing issues since the IOs which have any always-on domain functionality reside in always on domain, irrespective of other functions multiplexed on that IO.
The proposed signal multiplexing scheme resolves this by controlling the signal multiplexing in a modular fashion, thereby providing congestion-free, power-domain-aware signal multiplexing which reduces buffering and congestion in the signal routing. It also enables the software to govern the state of the pads in standby mode, providing the flexibility to achieve the lowest current profile driven by the application by controlling functionality on a per-pad basis.
Networking
Work-in-Progress Poster


DescriptionLow-latency machine learning (ML) on resource-limited FPGAs is critical for applications like high-energy physics and autonomous systems. However, balancing latency, accuracy, and hardware constraints remains a challenge. Existing methods often adopt a software-first approach, relying on proxy metrics and overlooking direct hardware considerations, which leads to suboptimal resource usage and costly deployment adjustments. We propose ConNAS4ML, a constraint-aware differentiable neural architecture search (DNAS) framework. By modeling resource usage as a continuous function with a novel quadratic exterior penalty loss, ConNAS4ML enables fast, single-stage, gradient-based optimization under hardware constraints. Experiments demonstrate that ConNAS4ML achieves an average of up to 59.15% FPGA resource reduction with minimal performance degradation, enabling practical deployment in resource-constrained environments.
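A minimal sketch of a quadratic exterior penalty of the kind described above: resource overshoot beyond the budget is squared and added to the task loss. Resource names, budgets, and the penalty weight are illustrative assumptions, not ConNAS4ML's exact formulation.

def penalty_loss(task_loss, usage, budget, mu=10.0):
    # Quadratic exterior penalty: only violations (usage above budget) are penalised.
    violation = {r: max(0.0, usage[r] - budget[r]) for r in budget}
    penalty = sum(v ** 2 for v in violation.values())
    return task_loss + mu * penalty, violation

usage = {"LUT": 0.82, "DSP": 1.10, "BRAM": 0.65}     # fraction of device resources used
budget = {"LUT": 1.00, "DSP": 1.00, "BRAM": 1.00}
loss, violation = penalty_loss(task_loss=0.31, usage=usage, budget=budget)
print(loss, violation)   # only the DSP overshoot is penalised, quadratically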
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionIn modern printed circuit board (PCB) designs, the increasing complexity poses more challenges for automatic placement. Existing PCB placement methods cannot handle complex constraints with heterogeneous, irregular-shaped, and any-oriented components for double-sided PCB designs well. This paper proposes the first constraint graph-based legalization approach for these constraints. We use a slicing technique to model a component more accurately with a set of rectangles instead of resorting to the naïve bounding box approximation. Unlike the commonly used linear programming method for macro placement in integrated circuit (IC) designs, we employ a mixed integer quadratic programming (MIQP) formulation to effectively expand the solution space for heterogeneous, irregular-shaped, and any-oriented components, particularly with high-density designs. Experimental results demonstrate the effectiveness and robustness of our work.
Research Manuscript


Systems
SYS6: Time-Critical and Fault-Tolerant System Design
DescriptionDirected Acyclic Graphs (DAGs) are widely deployed as task models in autonomous systems, including vehicles and drones, to capture functional dependency.
DAG scheduling has been extensively investigated by various communities
to shorten makespan, under the common assumption that the model itself is given a priori.
This work studies a rarely touched problem --- construction of DAG models --- and considers time-triggered blended task chains predominant in autonomous systems.
We report representation semantics and a topology optimization method.
Experiments show that the average end-to-end response time reduction is 4.8 times that of the conventional Floyd algorithm.
Our time complexity is $\mathcal{O}(n^2)$, making it suitable for handling dynamic tasks as well.
Engineering Poster
Networking


DescriptionToday, System on Chip (SoC) devices have stringent Power, Performance, Area and Schedule (PPAS) requirements. With the increase in design size and complexity, standard techniques and flow-options provided by state-of-the-art Placement and Routing (PnR) tools often prove to be ineffective. It is important to adopt a better design approach and optimization techniques while considering signoff constraints on timing closure and physical verification.
With each design revision there is a push to improve on the PPAS met by the previous generation. Although design related changes are the most effective for this, beyond a point to get the maximum out of a technology, we need to look at novel and automated strategies that can be used in physical design implementation.
In this paper, we explore various methodologies and optimization techniques which can be used across different implementation stages of the physical design cycle. We demonstrate end-to-end implementation of various optimization techniques for Turn-Around-Time (TAT) improvement, without any impact to Quality-of-Results (QOR). We also present a deep-dive analysis of a novel solution to the long-standing problem of multiple iterations required to close timing for Source Synchronous IO Interfaces.
In this paper, we will discuss about the correlation issues which are prevalent among implementation and sign-off techniques and the various methodologies that can help in reducing this miscorrelation which in turn can lead to PPAS improvements.
Research Manuscript


Systems
SYS4: Embedded System Design Tools and Methodologies
DescriptionWe present EffiCast, the first methodology for contention-aware energy efficiency forecasting in clustered heterogeneous processors using sequence-based models. Through extensive experimental analysis of energy efficiency sensitivities across core types, voltage/frequency (V/f) levels, application phases, and resource contention scenarios, EffiCast uncovers key factors driving energy efficiency variability in modern heterogeneous processors. Leveraging structured data generation and advanced LSTM- and Transformer-based models, EffiCast achieves unprecedented accuracy while outperforming state-of-the-art predictive techniques. Deployed on a real heterogeneous processor with Intel's oneDNN acceleration, EffiCast delivers inference latencies as low as 1.82 ms per sequence, enabling seamless integration into proactive resource management frameworks. With the ability to forecast future system states under dynamic workloads, EffiCast sets a new standard for energy efficiency optimization in energy-constrained application domains.
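The sketch below shows one plausible shape of such a sequence-based forecaster (an LSTM over recent telemetry windows, written in PyTorch); the feature count, window length, and layer sizes are assumptions, not EffiCast's actual architecture.

import torch
import torch.nn as nn

class EffForecaster(nn.Module):
    def __init__(self, n_features=8, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)   # predicted energy efficiency at t+1

    def forward(self, x):                  # x: (batch, window, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])       # forecast from the last hidden state

model = EffForecaster()
telemetry = torch.randn(32, 16, 8)         # 32 windows of 16 telemetry samples
pred = model(telemetry)
print(pred.shape)                          # torch.Size([32, 1])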
DAC Pavilion Panel


DescriptionCome watch the EDA troublemakers answer the edgy, user-submitted questions about this year's most controversial issues! It's an old-style open Q&A from the days before corporate marketing took over every aspect of EDA company images.
Engineering Presentation


AI
Back-End Design
Chiplet
DescriptionWith the increasing reliance on Artificial Intelligence (AI) and the growth of the automotive industry, the need for Power Delivery Network (PDN) analysis for subsystems that include GPUs, CPUs, DSPs and Modem becomes more pertinent than ever. Traditional PDN simulations are costly and require a lot of resources, which proves to be quite difficult for modern systems that have numerous power domains and billions of devices. This paper introduces a Reduced Order Model (ROM) for subsystems which provides the same physical and electrical impact as the detailed model while significantly reducing the node, resistor, and current sink counts. Flat-level analysis becomes cheaper and quicker with the ROM, without compromising accuracy. Simulations of CPU designs suggest that ROM-based analysis is far more resource-efficient than full-flat analysis, and can achieve up to 2.5 times improvements in runtime, memory, and disk space usage. Comparing ROM results of Static, Dynamic Vectorless, Package analysis and Grid Resistance checks with full-flat simulations further confirms the efficacy of ROM. Therefore, the methodology enhances design team productivity, shortens time-to-market, and reduces compute resource costs. Further developments hope to broaden the scope of ROM applications to Electromigration (EM), Electrostatic Discharge (ESD), and 3DIC/2.5D simulations.
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionThe cost-distance Steiner tree problem seeks a Steiner tree that minimizes the total congestion cost plus the weighted sum of source-sink delays.
This problem arises as a subroutine in timing-constrained global routing with a linear delay model, used before buffer insertion.
Here, the congestion cost and the delay of an edge are essentially uncorrelated unlike in most other algorithms for timing-driven Steiner trees.
We present a fast algorithm for the cost-distance Steiner tree problem.
Its running time is O(t(n log n + m)), where t, n, and m are the numbers of terminals, vertices, and edges in the global routing graph.
We also prove that our algorithm guarantees an approximation factor of O(log t).
This matches the best-known approximation factor for this problem, but with a much faster running time.
To account for increased capacitance and delays after buffering caused by bifurcations, we incorporate a delay penalty for each bifurcation without compromising the running time or approximation factor.
In our experimental results, we show that our algorithm outperforms previous methods that first compute a Steiner topology, e.g. based on shallow-light Steiner trees or the Prim-Dijkstra algorithm, and then embed this into the global routing graph.
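To make the cost-distance objective concrete, the simplified sketch below attaches one terminal to a growing tree via a path that minimizes congestion cost plus a delay-weighted path delay; this greedy illustration is not the paper's O(t(n log n + m)) algorithm and carries no approximation guarantee.

import heapq

def attach_terminal(graph, tree_nodes, terminal, delay_weight):
    # graph: {u: [(v, cong_cost, delay), ...]}; search backwards from the terminal
    # for the cheapest connection to any node already in the tree.
    dist = {terminal: 0.0}
    prev = {}
    heap = [(0.0, terminal)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in tree_nodes:
            path = [u]                      # reconstruct path back to the terminal
            while path[-1] != terminal:
                path.append(prev[path[-1]])
            return d, path
        if d > dist.get(u, float("inf")):
            continue
        for v, cong, delay in graph.get(u, []):
            nd = d + cong + delay_weight * delay
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return float("inf"), []

graph = {  # tiny symmetric toy routing graph: node -> [(neighbour, cong_cost, delay)]
    "s": [("a", 1, 1)], "a": [("s", 1, 1), ("t1", 1, 1), ("t2", 2, 1)],
    "t1": [("a", 1, 1)], "t2": [("a", 2, 1)],
}
print(attach_terminal(graph, {"s"}, "t1", delay_weight=0.5))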
Exhibitor Forum


DescriptionFunctional coverage closure remains one of the most persistent and resource-intensive challenges in RTL verification. Despite decades of EDA tool evolution, coverage gaps often require manual analysis, ad hoc scripting, and repeated testbench iterations. In this talk, we introduce CoverAgent, an agentic AI system purpose-built to identify, target, and close the last functional coverage gaps that conventional tools and workflows leave behind.
CoverAgent operates as an autonomous agent within your existing verification environment. It analyzes coverage reports, understands testbench structure, infers unreachable states, and autonomously proposes targeted stimuli and constraint adjustments — all without rewriting your entire environment. Built on a foundation of LLMs and agent-based reasoning, CoverAgent bridges the usability gap between design intent and simulation behavior.
We present real-world case studies demonstrating how CoverAgent accelerated closure by 80% in complex SoC environments, uncovered unreachable bins missed by traditional tools, and improved the productivity of design verification engineers without sacrificing control or interpretability.
Whether you're building CPUs, accelerators, or memory subsystems, CoverAgent fits seamlessly into your UVM or SystemVerilog flow. It complements existing commercial tools, providing a new dimension of intelligence to the verification loop.
Join us to see how agentic AI can supercharge your coverage strategy, reduce manual effort, and make coverage closure not just achievable — but efficient, scalable, and even enjoyable.
Networking
Work-in-Progress Poster


DescriptionInspired by software fuzz testing, current research applies fuzz testing techniques to hardware vulnerability detection in increasingly complex processor designs. However, existing fuzzing methods often prioritize global coverage efficiency over vulnerability detection efficiency. This paper proposes CPCRFUZZ, an innovative two-phase fuzz testing method for hardware vulnerabilities, which utilizes critical path coverage for targeted vulnerability detection and control register coverage for comprehensive state exploration. This approach aims to enhance both vulnerability discovery efficiency and state coverage. Experimental results indicate that CPCRFUZZ outperforms existing fuzz testing methods in terms of vulnerability discovery efficiency and achieves higher coverage.
Research Manuscript


Design
DES2A: In-memory and Near-memory Computing Circuits
DescriptionWe propose CREST-CiM, an STT-MRAM-based Computing-in-Memory (CiM) technique targeted for binary neural networks. To circumvent the low-distinguishability issue in standard MRAM-based CiM, CREST-CiM utilizes two magnetic tunnel junctions (MTJs) to store +1 and -1 weights in a bitcell and cross-couples the MTJs, achieving a high-to-low current ratio of up to 8100 for a bit-cell. Our analysis for 64x64 arrays shows up to 3.4x higher CiM sense-margin, 27.6% higher read-disturb-margin, and resilience to process variations and other hardware non-idealities, albeit at the cost of just 7.9% overall-area overhead and <1% energy and latency overhead compared to a 2T-2MTJ-CiM design. Our system-level analysis for ResNet-18 trained on CIFAR-10 shows near-software inference accuracy with CREST-CiM, with a 10.7% improvement over the 2T-2MTJ baseline.
Research Manuscript


Security
SEC3: Hardware Security: Attack & Defense
DescriptionPortability poses a significant challenge for Deep Learning (DL)-based profiling Side-Channel Analysis (SCA) on AES encryption, as attackers cannot always ensure that training and target samples use the same encryption mode. To address this, we propose an Unsupervised Domain Adaptation (UDA) DL-SCA framework for achieving effective and robust cross-encryption-mode attacks. By incorporating cross-attention and UDA techniques, our framework aligns high-dimensional input samples, reducing interference from encryption mode mismatches. Evaluation across five distinct AES modes demonstrates that our method achieves robust SCA performance without requiring prior knowledge or multiple labeled datasets for analysis.
Engineering Presentation


AI
IP
Chiplet
DescriptionAdoption of advanced packaging has accelerated multi-chip module (MCM) based heterogeneous SoC solutions. With the standardization of Die-to-Die (D2D) interfaces such as UCIe, the industry is aggressively designing chiplet dies involving multiple vendors, foundries, tech nodes and process corners. Due to the shorter interconnect distance, these dies can communicate with each other using a source-synchronous clock-forwarding PHY architecture, effectively creating a launch-and-capture cross-die timing path that originates at one die and ends at the other.
Due to the lack of a proper cross-die STA signoff methodology, we are forced to adopt a pessimistic guard-banding approach, which won't scale correctly with higher data rates. A new methodology for design and signoff at the package level is the need of the hour. Link budgeting based on channel ISI/cross-talk/jitter and eye-plots is still an integral part of the design flow. However, the silicon correlation data did show that a power-efficient source-synchronous short-reach PHY can open the possibility of an STA margining methodology accounting for all the error components of the link budget in the future, leading to a new era of shift-left automation in the custom workforce.
The proposed 3DIC STA solution is foundry, tech-node and process agnostic, thereby enabling universal interoperability at the chiplet package level along with silicon-proven accuracy.
Networking
Work-in-Progress Poster


DescriptionBy leveraging wavelength division multiplexing (WDM) technology, optical neural networks (ONNs) built with micro-ring resonator (MRR) arrays have recently emerged as a powerful solution for accelerating the energy-intensive matrix-vector multiplication operations in artificial intelligence applications. Despite their promise, the scalability of MRR-based computing systems is severely limited by adjacent channel crosstalk. This challenge arises from the inherent imperfections in MRR filtering, which lead to signal leakage between neighboring channels. As the number of WDM channels increases, this accumulated adjacent channel interference intensifies, posing a substantial obstacle to large-scale ONNs. In this work, we propose the Crosstalk-Aware Mapping (CAM) strategy that mitigates this crosstalk by optimizing the weight mapping scheme for MRR arrays. This is achieved through array-wise weight reallocation and the optimization of phase-shift combinations on both row and column levels. Extensive experimental results on MRR-ONN systems substantiate the effectiveness of CAM, with average performance improvements of 61.92%, 48.70%, and 63.08% on the MNIST, Fashion-MNIST, and CIFAR-10 datasets.
Engineering Poster


DescriptionCTS (Clock Tree Synthesis) is important for design optimization. A sound CTS methodology is needed to obtain a robust clock tree in terms of latency, skew, physical tracks, clock tree depth and clock power. There are many ways to synthesize and optimize the clock tree, such as changing the type of clock cells or placing the clock cells with fixed NDR rules for BEOL (an NDR is a non-default rule for clock-net routing which defines routing width, spacing and shielding). However, there have been only a few studies on the NDR itself.
It is necessary to derive an optimal NDR with machine learning ahead of CTS by applying a differentiated NDR to each clock net.
By using the delay variation that comes with different clock routing widths, CTS can be optimized further. However, since it is hard to change or choose a different NDR on every clock net manually, a machine learning algorithm is needed to obtain a better optimal clock tree.
The resulting gain is high without any side effects.
Research Manuscript


EDA
EDA8: Design for Manufacturing and Reliability
DescriptionThis paper presents a novel curvilinear optical proximity correction (OPC) framework.
The proposed approach involves representing mask patterns with control points, which are interconnected through cardinal splines.
Mask optimization is achieved by iteratively adjusting these control points, guided by lithography simulation.
To ensure compliance with mask rule checking (MRC) criteria, we develop comprehensive methods for checking width, space, area, and curvature.
Additionally, to match the performance of inverse lithography techniques (ILT), we design algorithms to fit ILT results and resolve MRC violations.
Extensive experiments demonstrate the effectiveness of our methodology, highlighting its potential as a viable OPC/ILT alternative.
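A minimal sketch of the contour representation named above, a closed chain of control points sampled through a cardinal (Hermite) spline; the tension value, sample count, and toy square contour are illustrative only:

# Illustrative sampling of a closed cardinal spline through mask control points.
import numpy as np

def cardinal_spline(points, samples_per_seg=16, tension=0.5):
    """Sample a closed cardinal spline through `points` (N x 2 array)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    s = (1.0 - tension) / 2.0                       # tangent scale
    out = []
    for i in range(n):                              # one segment per control point (closed curve)
        p0, p1 = pts[i], pts[(i + 1) % n]
        m0 = s * (pts[(i + 1) % n] - pts[(i - 1) % n])
        m1 = s * (pts[(i + 2) % n] - pts[i])
        t = np.linspace(0.0, 1.0, samples_per_seg, endpoint=False)[:, None]
        h00 = 2 * t**3 - 3 * t**2 + 1               # Hermite basis functions
        h10 = t**3 - 2 * t**2 + t
        h01 = -2 * t**3 + 3 * t**2
        h11 = t**3 - t**2
        out.append(h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1)
    return np.vstack(out)

square = [(0, 0), (10, 0), (10, 10), (0, 10)]       # toy control polygon
contour = cardinal_spline(square)
print(contour.shape)                                # (64, 2) sampled mask contour

Mask optimization would then move the control points and re-sample the contour for lithography simulation and the width/space/area/curvature MRC checks.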
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionComplex-Valued Neural Networks (CVNNs) have demonstrated high performance in applications where complex numbers are essential, but suffer from higher computational and memory overheads.
Since their target applications often operate in resource-constrained environments, optimizing CVNNs for energy and area efficiency is important for their acceleration.
To resolve these challenges, we present CVMAX, a software-hardware co-design for energy and area-efficient CVNN acceleration.
CVMAX introduces a specialized quantization technique based on polar form representation and shift quantization.
The technique significantly reduces the bit width and computational complexity of CVNNs compared to traditional quantization in rectangular form.
Moreover, shift quantization leverages the computational simplicity of multiplication in the polar form, reducing the complexity of complex number multiplication.
With the quantization technique, we designed a dedicated hardware accelerator that supports CVMAX data and its associated arithmetic operations.
In our evaluation, CVMAX achieves a 75% reduction in energy consumption and a 4.44x speedup compared to conventional accelerators.
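A minimal sketch of the polar-form shift-quantization idea (illustrative only; the phase bit width and encoding below are assumptions, not CVMAX's actual format): the magnitude is stored as a power-of-two shift and the phase as a small integer code, so complex multiplication reduces to two integer additions.

# Illustrative polar-form shift quantization of complex weights.
import numpy as np

PHASE_BITS = 4                                  # assumed phase resolution

def quantize(w):
    """Return (shift, phase_code) for complex weight w."""
    shift = int(np.round(np.log2(np.abs(w))))   # power-of-two magnitude -> shift amount
    phase = int(np.round(np.angle(w) / (2 * np.pi) * 2**PHASE_BITS)) % 2**PHASE_BITS
    return shift, phase

def dequantize(shift, phase):
    return (2.0**shift) * np.exp(1j * 2 * np.pi * phase / 2**PHASE_BITS)

def polar_mul(a, b):
    """Multiply two quantized values: shifts add, phase codes add (mod 2^b)."""
    return a[0] + b[0], (a[1] + b[1]) % 2**PHASE_BITS

w1, w2 = 0.9 + 0.4j, -1.7 + 0.2j
q1, q2 = quantize(w1), quantize(w2)
print(dequantize(*polar_mul(q1, q2)))           # approximate product
print(w1 * w2)                                  # reference

This is why multiplication in polar form is cheap: no complex multiplier is needed, only an adder for shift amounts and a modular adder for phase codes.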
Research Manuscript


Systems
SYS6: Time-Critical and Fault-Tolerant System Design
DescriptionCompute eXpress Link (CXL) offers an effective interface for connecting CPUs with external computing and memory devices. The CXL Memory eXpander Controller (CXL-MXC) is gaining attention for its ability to boost memory capacity and bandwidth more efficiently than traditional DDR DIMMs. Despite extensive research on MXC performance and adaptation, DRAM reliability in the CXL architecture remains underexplored. Traditional fault-tolerance mechanisms such as replica- or RAID-based systems would significantly increase bandwidth overhead in the CXL fabric, adversely affecting system performance. To address this, we propose the on-CXL-Memory-Expander-Controller ECC (CXL-ECC): by using Locally Recoverable Codes (LRC) as the Inter-Channel ECC (IC-ECC) and offloading their processing to the expander, we eliminate extra memory access requests in the CXL fabric.
Consequently, we conduct several experiments demonstrating that our approach enhances DRAM reliability by more than 10^9 compared to state-of-the-art ECC methods. Relative to a RAID-enabled CXL switch, it reduces additional bandwidth overhead from 63.5% to 3.4% and improves system performance by 12%.
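As a toy illustration of why local recoverability avoids extra CXL-fabric traffic (a single XOR parity group, not the paper's actual LRC construction), a failed symbol can be rebuilt entirely from its own channel group inside the expander:

# Toy local-repair example: one XOR parity per channel group.
from functools import reduce

def add_local_parity(group):
    """Append an XOR parity symbol to a group of data symbols."""
    return group + [reduce(lambda a, b: a ^ b, group)]

def recover(group_with_parity, lost_index):
    """Rebuild the symbol at lost_index from the remaining symbols in the same group."""
    rest = [s for i, s in enumerate(group_with_parity) if i != lost_index]
    return reduce(lambda a, b: a ^ b, rest)

data = [0x11, 0x22, 0x33, 0x44]            # symbols from one channel group
stored = add_local_parity(data)
assert recover(stored, 2) == data[2]       # channel 2 failed; rebuilt locally
print("recovered:", hex(recover(stored, 2)))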
Research Manuscript


Design
DES5: Emerging Device and Interconnect Technologies
DescriptionCompute Express Link (CXL) is a promising technology that addresses memory and storage challenges. Despite its advantages, CXL faces performance threats from external interference when co-existing with current memory and storage systems. This interference is under-explored in existing research. To address this, we develop CXL-Interplay, systematically characterizing and analyzing interference from memory and storage systems. To the best of our knowledge, we are the first to characterize CXL interference on real CXL hardware. We also provide reverse-reasoning analysis with performance counters and kernel functions. In the end, we propose and evaluate mitigating solutions.
Networking
Work-in-Progress Poster


DescriptionThe design of three-dimensional integrated circuits (3D-ICs) is at the forefront of cost-effective and high-performance computing systems. 3D systems enable the vertical stacking of multiple chiplets, reducing area and interconnect length. On the other hand, the floorplanning of such systems requires more computational resources. To address this problem, we propose Daedalus, a framework supporting the design of 3D-ICs. Daedalus provides an environment where multiple design constraints can be enforced, enabling the optimal configuration to be found by exploring the solution space. The experimental results show the effectiveness of our method in solving floorplanning problems of increasing complexity in a timely manner.
Research Manuscript


Design
DES5: Emerging Device and Interconnect Technologies
DescriptionAnalog machine learning hardware platforms, such as those using wave physics, present potential for edge artificial intelligence (AI) applications due to in-sensor computing architecture, offering superior energy efficiency compared to digital circuits. While the diffractive neural network has been implemented in optical systems, its deployment on integrated acoustic systems has not been achieved due to the challenges associated with hardware optimization. In this paper, we propose the Diffractive Acoustic Neural Network (DANN), a novel approach that applies diffractive neural network algorithms to surface acoustic wave (SAW) systems for in-sensor multi-biomarker diagnosis. To address the optimization challenges, we introduce a novel training methodology that combines Finite Element Analysis (FEA) with gradient descent. We validate our method on Major Depressive Disorder (MDD) and prostate cancer, achieving accuracies of 74.07% and 86.0%, respectively, which nearly reaches the accuracy levels of clinical diagnoses. By comparing the co-training method with the traditional gradient descent training method and direct training on the FEA model, the co-training method demonstrates its advantages in balancing training efficiency and accuracy. Furthermore, a comparison of power consumption between the traditional method and the in-sensor computing system is conducted, indicating 66% energy savings attributed to its high level of integration.
Networking
Work-in-Progress Poster


DescriptionHigh-Level Synthesis (HLS) tools have become increasingly popular for facilitating the design of domain-specific accelerators (DSAs). However, the quality of the final design produced by HLS tools highly depends on the optimization pass sequence during compilation. Finding an optimally tailored pass sequence for each design, commonly known as the pass ordering problem, is NP-hard. Traditional heuristic approaches such as Simulated Annealing and Greedy Search require substantial computational time and effort to yield satisfactory solutions for individual designs. Machine learning methods that employ reinforcement learning (RL) or graph neural networks (GNNs) often struggle with poor generalization due to inefficient representations of program features.
To address the aforementioned challenges, we first propose our pass-order-oriented graph---a heterogeneous graph that effectively captures both semantic and structural features, offering a more informative representation of the input design compared to traditional graphs. Next, we introduce a technique for generating representative program embeddings using contrastive learning, which enhances the GNN's ability to generalize across different designs by learning the distinctions between HLS programs. Since the middle-end of commercial tools is inaccessible to users, hindering pass-ordering exploration, we enhance Light-HLS, a lightweight open-source HLS tool that provides accurate and rapid latency results and synthesizes the input HLS design into Verilog code. By integrating these methods within a reinforcement learning flow, we propose the DAPO framework for pass order optimization, which achieves an average performance improvement of 1.8x compared to the SOTA academic tool AutoPhase and a 1.2x improvement w.r.t. the -O3 optimization level in LLVM-18.1.0 while saving 4.36x compilation time. We perform cross-validation on 80 real-world HLS designs, showcasing the generalizability of DAPO's inference model.
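For contrast with the learned approach, here is a hedged sketch of the kind of greedy pass-ordering baseline such work competes against; estimate_latency is a placeholder stub standing in for fast latency feedback from a tool such as Light-HLS, and the pass names are merely examples, not a recommended set:

# Hypothetical greedy pass-ordering baseline (cost model is a stub).
import random

PASSES = ["mem2reg", "loop-unroll", "gvn", "licm", "instcombine", "sroa"]

def estimate_latency(seq):
    """Placeholder cost model: replace with real HLS latency feedback."""
    random.seed(hash(tuple(seq)) & 0xFFFFFFFF)      # deterministic stub
    return 1000 * random.uniform(0.5, 1.0) / (1 + len(set(seq)))

def greedy_pass_order(max_len=6):
    seq, best = [], estimate_latency([])
    for _ in range(max_len):
        # Append whichever single pass most reduces the estimated latency.
        cand = min(PASSES, key=lambda p: estimate_latency(seq + [p]))
        cost = estimate_latency(seq + [cand])
        if cost >= best:                            # no further improvement
            break
        seq, best = seq + [cand], cost
    return seq, best

print(greedy_pass_order())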
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionThe widespread use of Deep Neural Networks (DNNs) is limited by high computational demands, especially in constrained environments. GPUs, though effective accelerators, often face underutilization and rely on coarse-grained scheduling. This paper introduces DARIS, a priority-based real-time DNN scheduler for GPUs, utilizing NVIDIA's MPS and CUDA streaming for spatial sharing, and a synchronization-based staging method for temporal partitioning. In particular, DARIS improves GPU utilization and uniquely analyzes GPU concurrency by oversubscribing computing resources. It also supports zero-delay DNN migration between GPU partitions. Experiments show DARIS improves throughput by 15% and 11.5% over batching and state-of-the-art schedulers, respectively, even without batching. All high-priority tasks meet deadlines, with low-priority tasks having under 2% deadline miss rate. High-priority response times are 33% better than those of low-priority tasks.
Research Manuscript


Security
SEC3: Hardware Security: Attack & Defense
DescriptionMitigating micro-architectural side channels remains a central challenge in hardware security. Despite substantial research efforts, current defenses are often narrowly tailored to specific vulnerabilities, leaving systems exposed to a broader spectrum of micro-architectural side-channel attacks. In this paper, we propose a generic approach to mitigate side-channel attacks with minimal architectural changes. Unlike traditional approaches that focus on mitigating specific side channels, we propose a dynamic strategy that alters the decoding of the instructions into secure (side-channel resilient) or performance versions of the instructions, based on the data it is processing. Specifically, to minimize performance overhead, decoding to a secure version is selectively applied only when sensitive data is being processed, ensuring optimal performance for instructions operating on non-sensitive data.
To evaluate our approach, we implement it on the RISC-V out-of-order BOOM processor. Our results demonstrate that the mechanism increases FPGA resource utilization by only 2% compared to the original design. Additionally, it imposes 0% performance overhead for unprotected applications, while limiting overhead to at most 25% for security-critical workloads. This work represents a scalable and efficient solution for defending against micro-architectural side-channel attacks without compromising system performance.
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionDeep Neural Networks (DNNs) have achieved remarkable success across various intelligent tasks but encounter performance and energy challenges in inference execution due to data movement bottlenecks. We introduce DataMaestro, a versatile and efficient data streaming unit that brings the decoupled access/execute architecture to DNN dataflow accelerators to address this issue.
DataMaestro supports flexible and programmable access patterns to accommodate diverse workload types and dataflows, incorporates fine-grained prefetch and addressing mode switching to mitigate bank conflicts, and enables customizable on-the-fly data manipulation to reduce memory footprints and access counts. We integrate five DataMaestros with a Tensor Core-like GeMM accelerator and a Quantization accelerator into a RISC-V host system for evaluation. The FPGA prototype and VLSI synthesis results demonstrate that DataMaestro helps the GeMM core achieve nearly 100% utilization, which is 1.05-21.39× better than state-of-the-art solutions, while minimizing area and energy consumption to merely 6.43% and 15.06% of the total system.
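A minimal sketch of a programmable affine access-pattern generator in the decoupled access/execute spirit described above (the parameter names and tile example are illustrative assumptions, not DataMaestro's actual configuration interface):

# Illustrative affine address streamer: nested loop bounds and strides.
from itertools import product

def affine_stream(base, bounds, strides):
    """Yield addresses base + sum(i_d * stride_d) over nested loop bounds.
    bounds/strides are ordered outermost-to-innermost."""
    for idx in product(*(range(b) for b in bounds)):
        yield base + sum(i * s for i, s in zip(idx, strides))

# Example: stream a 4x4 tile out of a row-major 16x16 int32 matrix.
tile = list(affine_stream(base=0x1000, bounds=(4, 4), strides=(16 * 4, 4)))
print([hex(a) for a in tile[:6]])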
Research Manuscript


Systems
SYS4: Embedded System Design Tools and Methodologies
DescriptionAs a fundamental perception task, 3D point cloud detection has become essential for autonomous systems. Point-based detection methods offer high accuracy but are computationally expensive, primarily due to sequential point processing operations, such as set abstraction. To address these challenges, we propose DAWN, an acceleration framework for point cloud object detection that identifies partial similarities via well-designed partitioning and filters redundant points. We leverage spatio-temporal information from consecutive frames to accelerate point cloud detection. Frame partitioning enables partial similarity identification, but naive partitioning can lead to object fragmentation and detection errors. Our dynamic object-aware partitioning leverages previous detection results to determine boundaries and prevent fragmentation errors. Axis-sorted point selection refines the partitioning for better similarity identification and an efficient 3D similarity algorithm accurately filters out redundant points. Our experiments demonstrate that DAWN enables flexible latency-accuracy trade-offs and achieves up to 1.7x detection speedup by filtering more than 50% of points, with negligible impact on accuracy.
Research Manuscript


Design
DES5: Emerging Device and Interconnect Technologies
DescriptionDeep neural networks (DNNs) have demonstrated outstanding performance across a wide range of applications.
However, their substantial number of weights necessitates scalable and efficient storage solutions.
Emerging non-volatile memory technologies, such as phase change memory (PCM) with multi-level cell (MLC) operation, are promising candidates due to their high scalability and non-volatility compared to conventional charge-based storage devices.
Despite these advantages, MLC PCM suffers from reliability issues, particularly conductance drift, where the conductance of a PCM cell changes over time.
This drift can lead to significant accuracy degradation in DNNs, as their weights are stored in PCM cells.
In this paper, we propose Drift-aware Binary Code (DBC), a novel binary code designed to improve the tolerance of DNNs to conductance drift.
DBC maps smaller decimal values to less error-prone MLC PCM cell levels and ensures that values shift to smaller magnitudes when conductance drift occurs.
This approach helps maintain the accuracy of the DNN over an extended period compared to conventional binary code, as dominant DNN weights are stored at levels less prone to errors and DNNs exhibit better tolerance when weight values decrease rather than increase due to drift.
Additionally, DBC requires no additional hardware overhead for auxiliary bits and can be combined with other fault-tolerant approaches, such as error correction code (ECC).
Experimental results based on the real PCM device developed by IBM Research demonstrate that DBC improves the drift tolerance of DNNs by up to 55.18x compared to conventional binary code.
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionJPEG is the most widely used image compression method on low-cost cameras, which cannot support learning-based compressors. One promising approach to enhancing JPEG drops DC coefficients at the camera's end (without extra computation) and reconstructs those DC coefficients after they are received. Such approaches all face the challenge that their DC reconstruction relies on a statistical property, which introduces deviation errors that propagate. In this paper, we propose DCDiff, a novel end-to-end DC estimation method to tackle the above challenge. Instead of using statistical methods to recover DC coefficients and then fix errors, we directly leverage a generative model to estimate DC coefficients in an end-to-end manner. In the meantime, we generate masks to correct image locations that do not satisfy the statistical distribution, suppressing error propagation. Extensive experiments show that DCDiff not only outperforms all baselines on compression performance but also has a negligible impact on downstream tasks and is fully compatible with 2 typical low-cost processors with JPEG support.
Workshop


AI
Sunday Program
DescriptionIn the rapidly evolving domain of computational technologies, the transformative impact of artificial intelligence (AI) continues to shape the future. DCgAA 2025 builds on the success of its inaugural edition by diving deeper into the frontier of deep learning (DL) and hardware co-design, with an amplified focus on real-world deployment challenges and next-generation innovations for generative AI applications. This second iteration
emphasizes expanding the scope beyond foundational discussions, addressing emerging paradigms in generative AI, including multimodal fusion, real-time adaptive processing, and decentralized edge applications. Acknowledging the growing role of foundation models, diffusion models, and large-scale generative systems, this workshop prioritizes optimizing these technologies for sustainable scalability—balancing performance, energy efficiency, and accessibility across diverse computing environments such as edge devices, AR/VR platforms, and ubiquitous IoT systems. Through a blended format of keynotes, paper presentations, interactive discussions, and novel program additions, DCgAA 2025 seeks to redefine the boundaries of DL-hardware integration. By engaging thought leaders, researchers, and practitioners across academia and industry, this workshop promises to
set new benchmarks for hardware-aware generative AI, driving innovation that is efficient, scalable, and impactful in the real world.
More information can be found at https://dcgaa.dk-lab.xyz/
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionState-of-the-art 3D IC flows fail to consider 3D congestion during earlier stages, leading to excessive use of end-of-flow ECO resources for routability correction that severely degrades full-chip Power, Performance, and Area metrics. We present DCO-3D, a Machine Learning based routability-aware 3D PD flow that performs early post-route congestion prediction using Siamese Networks and resolves the predicted hotspots using a fully differentiable 3D cell spreading with a Graph Neural Network. On 6 industrial designs in a commercial 3nm node, DCO-3D improves on Pin-3D, the best known 3D PD flow, by up to 47.2% in overflow, 86.2% in TNS and 5.1% in power at signoff.
Research Manuscript


Design
DES6: Quantum Computing
DescriptionIn the Noisy Intermediate-Scale Quantum (NISQ) era, the topological constraints present in many of the currently available quantum devices pose a physical limit on the feasible interactions between qubits. To comply with such limitations, the compilation of quantum circuits requires solving the Qubit Routing Problem (QRP), by inserting SWAP operations among qubits. The State of the Art provides heuristic algorithms addressing this task, yet the depth of the output circuits is often incompatible with the current limits of quantum hardware. Therefore, we propose DDRoute, a novel heuristic algorithm to solve QRP, designed to reduce the depth overhead introduced by the routing process in the compiled circuits. Our experimental evaluation proves the efficiency of our approach, with a depth reduction of up to 70% with respect to the state-of-the-art routing procedures.
Networking
Work-in-Progress Poster


DescriptionHybrid algorithms combine the strengths of quantum and classical computing. However, quantum circuits are executed entirely, even when only subsets of measurement outcomes contribute to subsequent calculations. In this manuscript, we propose a novel circuit optimization technique that identifies and removes dead gates. We prove that removing dead gates has no influence on the probability distribution of the measurement outcomes that contribute to the subsequent calculation result. The evaluation of our optimization on instances of variational quantum eigensolver and quantum phase estimation, and random circuits, confirms its capability to remove a non-trivial number of dead gates in real-world algorithms.
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
DescriptionAnalog IC design is mainly manual and implemented at the device level.
A major reason is circuit behavior-extraction.
Unlike its digital counterpart, analog IC design is strongly coupled with technology nodes and is difficult to represent by an abstract behavioral model.
The lack of accurate and efficient analog modeling has become a bottleneck in analog design automation.
This paper proposes a behavior-centric optimization framework for analog circuits that represents circuit behavior using transistor electrical properties instead of sizes, improving model generalization and reducing optimization complexity.
To characterize the process, we propose a method for mapping transistor electrical properties to sizes.
Moreover, we developed a radial basis functions-based Kolmogorov-Arnold network (RBF-KAN) to accurately approximate circuit nonlinear behavior with limited simulations.
Compared to black-box modeling, our approach enables constructing surrogate models via KAN under a set specification with just a few hundred simulations.
Experiments on the testing suite showed our framework achieved a 1.76 times to 2.64 times improvement in large signal figure of merit (FOM) and 1.73 times to 2.48 times in small signal FOM over state-of-the-art methods, while also enabling 3.5 times to 6.2 times acceleration in design porting.
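A minimal numpy sketch of a single RBF-parameterised KAN edge of the kind referenced above: a learnable 1-D function built from Gaussian radial basis functions and fitted by least squares to a toy nonlinear response (the centres, width, sample count, and target function are illustrative assumptions, not the paper's circuit models):

# Illustrative RBF edge: phi(x) = sum_k c_k * exp(-(x - mu_k)^2 / (2*sigma^2)).
import numpy as np

def rbf_features(x, centers, sigma):
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(1)
x_train = rng.uniform(-2, 2, 200)                   # "a few hundred simulations"
y_train = np.tanh(3 * x_train) + 0.1 * x_train**2   # toy nonlinear behaviour

centers = np.linspace(-2, 2, 12)
sigma = 0.4
Phi = rbf_features(x_train, centers, sigma)
coeffs, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)  # fit edge coefficients

x_test = np.linspace(-2, 2, 5)
y_hat = rbf_features(x_test, centers, sigma) @ coeffs
print(np.round(y_hat, 3))
print(np.round(np.tanh(3 * x_test) + 0.1 * x_test**2, 3))  # reference values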
Networking
Work-in-Progress Poster


DescriptionThis paper presents an energy-efficient coarse-grained reconfigurable array (CGRA) architecture tailored for perception applications at the edge, developed using a deep co-design methodology. With a focus on application, compiler, and hardware architecture co-optimization, the proposed CGRA architecture achieves significant energy efficiency per unit area, reaching 7.461 TOPS/W/mm² when implemented in a 22nm FD-SOI technology. Our methodology includes customizations at all levels of the CGRA subsystem, facilitating optimized data flow and multi-level parallelism for perception applications. Through strategic design adjustments, the CGRA reduces the computational energy consumption of kernels to as low as 9.386 nJ, making it ideal for edge-based perception tasks in energy-constrained environments. This work underscores the potential of deep co-design to push the boundaries of energy-efficient computing at the edge, enabling high-performance perception applications in compact, resource-limited systems.
Exhibitor Forum


DescriptionToday's PCB designers face significant challenges, including component shortages, rapidly evolving design constraints, and intense pressure to deliver results quickly. Traditional manual place-and-route processes frequently involve repetitive, time-consuming tasks that delay development.
In this talk, we introduce DeepPCB, an AI-based tool that learns PCB placement and routing through iterative trial and error, quickly generating DRC-compliant designs. We illustrate DeepPCB's capabilities through a practical use-case involving a complex multi-layer PCB design, demonstrating how this integrated approach significantly reduces design iteration cycles. This helps designers swiftly respond to demanding timelines and component shortages.
Participants will discover how AI-driven tools like DeepPCB enable engineers to spend less time on repetitive tasks and more on strategic, creative aspects of PCB development. The session emphasizes the collaborative potential between human designers and AI, highlighting tangible productivity gains and enhanced adaptability to industry challenges.
Research Manuscript


Security
SEC3: Hardware Security: Attack & Defense
DescriptionPhysical Unclonable Functions (PUFs) are vulnerable to machine learning attacks and side-channel attacks, which break their physical uniqueness and unpredictability. Hence, many works focus on enhancing PUF designs by introducing additional nonlinear modules that make it harder for an attacker to approximate the PUF's behavior. However, the security of these enhanced PUFs is still an open question and needs to be verified. In this paper, we propose DeepPUFSCA, a deep learning-based model that uniquely combines challenge features and side-channel information during training to attack PUFs. To gather the data, we implement an arbiter PUF on an FPGA and measure its power consumption. Our extensive experiments on this dataset demonstrate that DeepPUFSCA outperforms other machine learning-based methods in attack accuracy, even novel ensemble algorithms. Moreover, we also show that the combined side-channel information boosts model performance compared to attacking with challenge-response pairs only.
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionRecently, machine learning-based techniques have been applied for layout hotspot detection. However, existing methods encounter challenges in capturing the decision boundary across the entire dataset and ignore the geometric properties and topology of the polygons. In this paper, we introduce CLI-HD, a novel contrastive learning framework on layout sequences and images for hotspot detection. Our framework improves the ability to distinguish between hotspots and non-hotspots by similarity computations instead of a single decision boundary. To effectively incorporate geometric information into the model training process, we propose Layout2Seq, which encodes polygon shapes as vectors within sequences that are subsequently fed into the CLI-HD. Furthermore, to better represent topology information, we develop an absolute position embedding, replacing the standard position encoders used in Transformer architectures. Extensive evaluations on various benchmarks demonstrate that CLI-HD outperforms current state-of-the-art methods, with an accuracy improvement ranging from 0.82% to 4.77% and a reduction in false alarm rates by 4.9% to 23.18%.
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionIn this paper, we introduce DenSparSA, a balanced systolic array centralized architecture that can execute sparse matrix computations with minimal overhead to original dense matrix computations. DenSparSA supports both single-side and dual-side unstructured sparse matrix multiplications with high efficiency. Moreover, the additional hardware required for managing sparsity is compact and decoupled from the conventional systolic array, allowing for minimal power overhead when switched back to dense matrix operations via circuit gating. Compared with existing solutions, DenSparSA delivers competitive (0.82x-1.32x) efficiency in sparse scenarios and 1.17x-2.28x better efficiency for dense scenarios, indicating a better balance between both situations.
Research Manuscript


EDA
EDA8: Design for Manufacturing and Reliability
DescriptionWith the continued scaling of VLSI technology beyond 3nm, a consistent demand for layout reduction in standard cells has been made. CFET (Complementary-FET) has been accepted as a promising device technology, stacking N-FET on P-FET (or vice versa) to achieve this goal while providing metal interconnects on both the front and backsides of the wafer through BEOL (back-end-of-line) processing. However, the layout synthesis of CFET based standard cells and its use in physical design implementation are not fully compatible with the effective exploitation of backside interconnects. This is because of a considerable overhead on the allocation of special vias in FEOL (front-end-of-line) referred to as tap-cells, which are essential for the net routes using backside wire. To overcome this drawback of using CFET cells, a new technology called FFET (Flip-FET) has been proposed, which flips the lower FET in CFET to enable direct pin accessibility on both sides of BEOL with no tap cells. In this context, we propose an FFET cell based DTCO methodology to fully utilize backside wires with minimal tap
Engineering Poster
Networking


DescriptionThis work addresses the challenges of timing convergence and routing congestion in modern memory systems by introducing a localized approach to STAR (Self-Test and Repair) Memory System (SMS) processor placement. SMS processors are essential components that enable efficient testing, error detection, and repair in memory architectures. However, traditional centralized STAR architectures often face challenges in timing and design convergence due to high load and extensive interconnects, leading to routing complexity and performance bottlenecks.
To overcome these issues, we propose splitting and localizing STAR processors based on the design floorplan and memory arrangement, guided by feedback from structural designers. By strategically distributing processors closer to their associated memory blocks, our approach reduces interconnect lengths, alleviates routing congestion, and balances load across processors.
Quantitative analysis reveals a reduction in Worst Negative Slack (WNS) by ~85%, a 63% decrease in routing congestion density, and a 74% reduction in Design Rule Check (DRC) violations. Additionally, design turnaround time improved significantly due to fewer iterations and faster convergence.
This method provides a scalable solution for optimizing performance in complex memory architectures, such as SRAM and TCAM, and lays the foundation for future advancements in STAR processor design for emerging technologies.
Engineering Poster
Networking


DescriptionTime interleaving of data is needed across multiple domains, including wired/wireless links, data converters, bus consolidators, and direct digital frequency synthesis. It is deployed on the transmitting side of such systems to translate low-data-rate signals into a high-data-rate interleaved output.
In the absence of a high-speed clock, the transmitter's output data can be glitchy, which can put the receiver into an undesired state. It is highly desirable to generate robust high-speed data at the transmitter using sub-sampling clocks, avoiding the need for a high-speed clock. A system that uses slow clocks to time-interleave high-speed data yields a significantly lower-power design.
A new design is proposed that interleaves N data streams at N times the input data rate. It obviates the need for a final serializer multiplexer, its selection pins, and a high-speed clock.
Gray signaling at the final output ensures glitch-free, robust operation with relaxed timing. The pipelined, modular architecture can be realized across applications. The proposed data interleaving saves power and simplifies the design, and it is technology-independent and resilient to PVT variations.
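A short sketch of why Gray-coded signaling is glitch-resistant, which underpins the claim above (generic Gray-code arithmetic, not the proposed interleaver circuit): consecutive codes differ in exactly one bit, so only one output toggles per transition.

# Generic binary-to-Gray conversion and single-bit-change check.
def bin_to_gray(n: int) -> int:
    return n ^ (n >> 1)

N = 8                                             # N-way interleaver phases (power of two)
codes = [bin_to_gray(i) for i in range(N)]
for a, b in zip(codes, codes[1:] + codes[:1]):    # wrap around phase N-1 -> 0
    assert bin(a ^ b).count("1") == 1             # exactly one bit changes per step
print([format(c, "03b") for c in codes])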
Networking
Work-in-Progress Poster


DescriptionThis paper presents design guidelines for scalable architectures to control fixed-frequency transmon qubits.
While cost-effective architectures using simplified RF pulses for quantum gate operations have been proposed, prior works remain at a conceptual level and lack design guidelines to specify required circuit performance and optimize gate fidelity within reasonable circuit costs.
This paper provides minimum necessary circuit configurations that achieve a two-qubit gate fidelity of 99.9% in the absence of decoherence.
This gate fidelity is comparable to that of conventional cost-consuming AWG-based architectures.
Additionally, this paper analyzes the impact of circuit noise on gate fidelity and clarifies the circuit performance required to suppress noise-induced gate fidelity degradation to a negligible level.
Engineering Poster
Networking


DescriptionUsing an Intelligent Chip Explorer, we leverage massive, cloud-enabled distributed computing and AI-driven optimization to enhance parameter tuning for EDA tools. As modern ASIC SoC designs grow more complex and integrate more third-party IPs, macro placement becomes increasingly challenging for chip designers. The thousands of parameters in EDA tools further complicate decision-making. We first show the steps to optimize power, performance, and area (PPA) using AI-driven, cloud-compatible parameter tuning based on a fixed floorplan. Building on this, we demonstrate how AI-driven floorplanning optimization (FP-OPT), which adjusts the floorplan size and performs concurrent macro placement, can further improve the PPA. The optimized floorplan is then reintegrated into the initial optimization process, delivering even better results.
Engineering Poster


DescriptionDigital designs predominantly include Memory Mapped Registers (MMRs) to configure the design as per the application needs. They also contain multiple Finite State Machines (FSMs) and boundary ports. To verify the basic feature set of a design, such as MMR reads and writes, boundary-port-driven FSM state transitions, and event generation logic, the designer has to rely on design verification (DV) feedback, which may take a significant amount of time, especially with new designs. Hence, in order to improve design quality and reduce IP development cycle time, the automation tool named Elementary Sanity Testing Aid for Designers (ESTAD) is developed and proposed in this presentation. ESTAD enables designers to analyze their design quickly, both in waveforms and in source code, without needing DV feedback. The automation creates a sanity testbench SystemVerilog file from the inputs in an Excel sheet and a component XML file. The tool enables waveform generation and also includes basic checkers in the testbench file.
DAC Pavilion Panel


DescriptionThe CHIPS Act is a game-changer for the U.S. semiconductor industry, extending beyond manufacturing to fuel innovation in design enablement and IP development. Through strategic investments in programs like the National Semiconductor Technology Center (NSTC), the National Advanced Packaging Manufacturing Program (NAPMP), and the Microelectronics Commons, the Act supports cutting-edge EDA tools, semiconductor IP, and next-generation design infrastructure. These efforts are critical to strengthening U.S. leadership in semiconductor R&D and reducing reliance on foreign technologies.
This panel will explore how CHIPS Act investments foster next-generation design methodologies, AI-driven automation, and cloud-enabled collaboration platforms, ensuring that U.S. semiconductor design remains competitive. Experts from government and industry will discuss the role of Natcast /National Semiconductor Technology Center(NSTC) in accelerating research with prototype programs such as AIDRFIC, which is driving advancements in semiconductor innovation. The discussion will highlight how these initiatives strengthen domestic design capabilities, support startups, academic institutions, and industry partnerships, and reduce reliance on foreign technologies to create a resilient and globally competitive semiconductor ecosystem.
Networking
Work-in-Progress Poster


DescriptionConstrained Random Verification (CRV) is the standard methodology for ASIC design verification. Parametrized tests are used to limit the sampling space and activate specific design functionalities. Despite its widespread use, the vast input space poses significant challenges in achieving coverage closure. We introduce a design-aware Multi-Armed Bandit strategy to enhance test scheduling during regression, with the objective of maximizing coverage while minimizing the number of required simulations. This approach selects tests based on their estimated probability of improving coverage, leveraging design hierarchy and cyclomatic complexity information. By preempting non-contributing tests, our method achieves an average coverage improvement of 1.12% while reducing simulation efforts by 68% compared to traditional CRV methods.
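An illustrative sketch of a bandit test scheduler in the spirit described above, using Thompson sampling over a per-test Beta posterior (the sampling strategy, test names, and hit probabilities are assumptions; the paper's design-aware estimator additionally uses hierarchy and cyclomatic-complexity information):

# Illustrative Thompson-sampling test scheduler with simulated hit rates.
import random

true_hit_prob = {"smoke": 0.05, "dma_stress": 0.30, "irq_random": 0.15}
posterior = {t: [1.0, 1.0] for t in true_hit_prob}    # Beta(alpha, beta) per test

random.seed(0)
runs = {t: 0 for t in true_hit_prob}
for _ in range(300):                                  # simulation budget
    # Draw from each posterior, run the test with the highest sample.
    test = max(posterior, key=lambda t: random.betavariate(*posterior[t]))
    hit = random.random() < true_hit_prob[test]       # did coverage improve?
    posterior[test][0 if hit else 1] += 1
    runs[test] += 1

print(runs)   # most simulations go to the test most likely to add coverage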
Engineering Poster
Networking
Design-for-Verification architecture to shift-left TTM and align test codebase in multi-chiplet SoCs
12:15pm - 1:15pm PDT Wednesday, June 25 Engineering Posters, Level 2 Exhibit Hall

DescriptionThe rising complexity of chiplet-based SoCs has resulted in significant challenges in the areas of post-silicon test, SLT, bring-up, debug diagnostics, as well as pre-silicon verification. In this presentation, we describe a novel approach, applied to a series of AMD multi-chiplet SoCs, of introducing HW structures in several key areas of SoC design, to align Test, SLT, Debug and Verification in chiplet-based SoCs.
HW-based Design-for-Verification and Validation (DFV) is a novel SoC methodology, reducing silicon test and bring-up efforts by implementing minimal low-footprint architectural changes.
One of the most important outcomes of introducing DFV into multi-chiplet SoCs is the ability to finally share the code base between pre-silicon verification testcases and the code used for SoC bring-up, debug and diagnostics.
The presentation details the implementation of a number of DFV techniques that introduce HW-based architectural changes into the design. DFV involves selectively augmenting logic at strategic locations and interfaces of the design. Such logic may induce and observe various hard-to-hit scenarios and events to allow isolation of specific sections of multi-chiplet SoCs. System-level test time is thus significantly reduced due to the elimination of significant parts of the overall scenario leading up to the system-level event of interest. This process does not alter the original design intent and adds minimal overhead to the SoC area footprint.
Networking
Work-in-Progress Poster


DescriptionZero-Knowledge Proof (ZKP) cryptographic algorithms have garnered significant attention for their ability to enhance privacy. However, the efficiency of these algorithms remains a challenge due to the computational complexity of performing polynomial operations with large data in the Number Theoretic Transform (NTT). This work introduces an HBM-aware design for efficient 256-bit NTT accelerators, focusing on a proposed data reorganization strategy to optimize memory access and resource-efficient modular multipliers to reduce hardware utilization. Our findings, based on Vivado synthesis and cycle-accurate simulations, demonstrate the effectiveness of these strategies in addressing ZKP performance challenges and highlight the potential of HBM-aware designs for accelerating cryptographic operations.
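A toy 8-point Number Theoretic Transform showing the butterfly structure such accelerators parallelise (the real design targets 256-bit operands and HBM-resident data; here p = 17 and omega = 9, a primitive 8th root of unity modulo 17, keep the arithmetic readable):

# Toy radix-2 NTT and inverse over Z_17; constants chosen so the transform exists.
P, N = 17, 8
OMEGA, OMEGA_INV, N_INV = 9, 2, 15        # 9*2 = 1 mod 17, 8*15 = 1 mod 17

def ntt(a, omega, p=P):
    """Recursive radix-2 Cooley-Tukey NTT of a length-2^k list."""
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], omega * omega % p, p)
    odd = ntt(a[1::2], omega * omega % p, p)
    out, w = [0] * n, 1
    for k in range(n // 2):               # butterfly: (e + w*o, e - w*o)
        t = w * odd[k] % p
        out[k] = (even[k] + t) % p
        out[k + n // 2] = (even[k] - t) % p
        w = w * omega % p
    return out

def intt(a):
    return [x * N_INV % P for x in ntt(a, OMEGA_INV)]

coeffs = [1, 2, 3, 4, 0, 0, 0, 0]
assert intt(ntt(coeffs, OMEGA)) == coeffs  # forward then inverse round-trips
print(ntt(coeffs, OMEGA))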
Engineering Poster


DescriptionSystem-on-Chip (SoC) manufacturing at nanometer regimes is a very costly affair due to high mask cost. To avoid respins, most organizations employ an extensive digital and mixed-signal verification environment, running thousands of simulations before tapeout. However, digital and mixed-signal simulations may not uncover all the current-related bugs in the SoC. For example, a forward-biased diode during certain power ramp-up sequences can prevent the device from powering up, and such issues cannot be caught unless a Full-Chip-SPICE (FCS) simulation is run.
In this work, we discuss an advanced verification method to verify the full SoC at the SPICE level. A Verilog netlist dumped from the PnR tool and converted to CDL is used for the simulation. Many techniques are used to make the scale of the simulation manageable. PrimeSim XA from Synopsys is used as the simulator of choice, as it can simulate 100M+ transistors with reasonable runtime and good accuracy.
The following bugs were caught using this methodology: 1) excess supply current during a specific power-up sequence; 2) current beyond specs in low-power mode; 3) extra leakage in standby mode due to a wrong signal level between two interacting analog IPs; 4) high crowbar current because an isolation cell did not assert in standby mode; etc.
Engineering Presentation


AI
Back-End Design
DescriptionClock meshes are the preferred clock distribution method for high-frequency clocks because of their lower clock latency/skew and on-chip variation tolerance. Static timing analysis cannot accurately predict the on-chip variation effect in a clock mesh because of the multi-driven nets. In this submission, a silicon-proven custom statistical approach (based on SPICE Monte Carlo simulation) is used to accurately calculate the total on-chip variation effect due to process, voltage, interconnect, and temperature variations across the clock mesh. The total uncertainty calculated through this approach is significantly smaller than the typical on-chip variation penalty of a regular clock tree synthesis approach. The reduced clock uncertainty greatly simplifies timing convergence.
Research Manuscript


Design
DES5: Emerging Device and Interconnect Technologies
DescriptionCombinatorial optimization problems (COPs) are crucial in many applications but are computationally demanding. Traditional Ising annealers address COPs by directly converting them into Ising models (known as the direct-E transformation) and solving them through iterative annealing. However, these approaches require vector-matrix-vector (VMV) multiplications with O(n^2) complexity for Ising energy computation, as well as complex exponential annealing-factor calculations during the annealing process, thus significantly increasing hardware costs.
In this work, we propose a ferroelectric compute-in-memory (CiM) in-situ annealer to overcome the aforementioned challenges. The proposed device-algorithm co-design framework consists of (i) a novel transformation method (the first, to our knowledge) that converts COPs into an innovative incremental-E form, which reduces the complexity of VMV multiplication from O(n^2) to O(n) and approximates the exponential annealing factor with a much simpler fractional form; (ii) a double-gate ferroelectric FET (DG FeFET)-based CiM crossbar that efficiently computes the in-situ incremental-E form by leveraging the unique structure of DG FeFETs; (iii) a CiM annealer that approaches the solutions of COPs via iterative incremental-E computations within a tunable back-gate-based in-situ annealing flow.
Evaluation results show that our proposed CiM annealer significantly reduces hardware overhead, reducing energy consumption by 1503/1716× and time cost by 8.08/8.15× in solving 3000-node Max-Cut problems compared to two state-of-the-art annealers. It also exhibits high solving efficiency, achieving a remarkable average success rate of 98%, whereas other annealers show only 50% given the same iteration counts.
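For context on why an incremental energy form matters, the textbook Ising identity below shows how a single-spin update costs O(n) instead of the O(n²) full VMV evaluation; this is only the generic idea, not the paper's incremental-E transformation, whose exact construction is part of the contribution.
```latex
E(\mathbf{s}) = -\sum_{i<j} J_{ij}\, s_i s_j - \sum_i h_i s_i ,
\qquad s_i \in \{-1,+1\}.
% Flipping a single spin k (s_k \to -s_k) changes the energy by
\Delta E_k = 2\, s_k \Big( \sum_{j \neq k} J_{kj}\, s_j + h_k \Big),
% which touches only row k of J: O(n) work per annealing move, versus the
% O(n^2) vector-matrix-vector product needed to re-evaluate E from scratch.
```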
Research Special Session


EDA
DescriptionThis talk discusses a new test approach called Device-Aware Test (DAT) and applies it to industrial STT-MRAM designs. DAT is a new test approach that goes beyond Cell-Aware Test; it does not assume that a defect in a device can be modeled electrically as a linear resistor (as the state-of-the-art approach suggests), but rather incorporates the impact of the physical defect into the technology parameters of the device and thereafter into its electrical parameters. Once the defective electrical model is defined, a systematic fault analysis is performed to derive appropriate fault models and subsequently test solutions. DAT is demonstrated on real STT-MRAM chips, which suffer from unique defects such as pinhole, synthetic anti-ferromagnet flip, back-hopping, etc. The measurements show that DAT sensitizes realistic faults as well as new unique defects and faults that can never be caught with the traditional approaches.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionLong-context inference has become a central focus in recent autoregressive Transformer research. However, challenges remain in the decode stage due to the computational complexity of attention mechanisms and the substantial overhead associated with KV cache storage. Although attention sparsity has been proposed as a potential solution, conventional sparsity methods that rely on heuristic algorithms often suffer from accuracy degradation when applied to ultra-long sequences.
To address the dilemma between accuracy and performance, this work proposes DIAS, a distance-based irregular attention sparsity approach with a processing-in-memory (PIM) architecture. DIAS employs top-K approximate attention scores obtained through graph-based search to enhance inference efficiency while maintaining accuracy. Furthermore, a scalable gather-and-scatter-based PIM architecture is introduced to manage the storage demands of the large-scale KV cache and to facilitate efficient sparse attention computation. Various configurations of DIAS evaluated on LongBench with Mistral and Llama3 models show xx times speedup and xx times energy efficiency improvement, with an accuracy drop of less than 1%.
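For readers unfamiliar with top-K attention sparsity, a minimal NumPy sketch of the computation DIAS accelerates is shown below; it scores every key exactly rather than using the paper's graph-based approximate search, so it illustrates only the sparse-attention step itself, and all names are illustrative.
```python
import numpy as np

def topk_sparse_attention(q, K, V, k):
    """Single-query attention restricted to the k highest-scoring keys.

    q: (d,) query; K: (n, d) keys; V: (n, dv) values.
    Illustrative only: here the top-k set is found by scoring every key
    exactly, whereas DIAS obtains approximate top-K scores via a
    graph-based search before computing the sparse attention.
    """
    scores = K @ q / np.sqrt(q.shape[0])      # (n,) attention logits
    idx = np.argpartition(scores, -k)[-k:]    # indices of the top-k keys
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                              # softmax over the k keys only
    return w @ V[idx]                         # (dv,) attention output

# toy usage
rng = np.random.default_rng(0)
q = rng.normal(size=16)
K, V = rng.normal(size=(1024, 16)), rng.normal(size=(1024, 16))
out = topk_sparse_attention(q, K, V, k=32)
```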
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionRoutability-driven global placement is a major challenge in modern VLSI physical design, for which mitigating routing congestion is a critical approach. Cell inflation can effectively address local routing
congestion and is widely adopted, but it suffers from over-inflating cells or moving them back into congested areas. Minimizing the congestion within a net bounding box is effective for alleviating global routing congestion, but the bounding box may be too large and contain congestion not contributed by the net. To address the first issue, we propose a momentum-based cell inflation technique that considers historical inflation ratios for mitigating local routing congestion. Then, we construct a differentiable global congestion function, developed from Poisson's equation, and introduce virtual standard cells onto two-pin nets to accurately guide net movements for mitigating global routing congestion. Furthermore, to improve pin accessibility, we adjust placement density around power and ground rails according to the routing congestion in global placement. The proposed techniques are integrated into an electrostatic-based global placement framework. Experiments on the ISPD 2015 contest benchmarks show that our framework achieves better routability results, with an average 40% reduction in DRVs and comparable wirelength and via count, compared to the leading routability-driven placer.
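The update rule behind the momentum-based inflation is not given in the abstract; one plausible reading of "considers historical inflation ratios", sketched here purely as an assumption, is an exponential-moving-average of a congestion-driven target ratio.
```latex
r_c^{(t)} = \beta\, r_c^{(t-1)} + (1-\beta)\, f\!\big(\mathrm{cong}_c^{(t)}\big),
\qquad \beta \in [0,1),
% r_c^{(t)}: inflation ratio of cell c at placement iteration t;
% cong_c^{(t)}: local routing congestion observed around c;
% f: assumed map from congestion to a target inflation ratio.
% The history term damps oscillation between over-inflation and
% cells snapping back into congested regions.
```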
Networking
Work-in-Progress Poster


DescriptionHardware description languages (HDLs), like VHDL, pose challenges for large language models (LLMs) due to limited data, syntactic complexity, and mismatched vocabularies. To address these, we introduce the VHDL-IR dataset with diverse parallel data pairs and develop a custom retriever to align VHDL syntax, functionality, and natural language. Our Divide-Retrieve-Conquer (DiReC) strategy enhances LLM performance by modularizing tasks, retrieving relevant contexts, and integrating results for accurate outputs. Experiments show up to 20% improvement in code generation and 12% in summarization over standard RAG, demonstrating DiReC's effectiveness while identifying areas for further VHDL-focused research.
Networking
Work-in-Progress Poster


DescriptionRecent advancements in memory technologies and programming models have heightened the demand for sophisticated memory hierarchies. A critical aspect of these hierarchies is the implementation of cache coherence protocols, which play a pivotal role in maintaining data integrity across multi-core memory systems. Despite the availability of various validation techniques—such as random testing, constrained random testing, and formal verification—these approaches often prove to be either unreliable or resource-intensive, requiring significant time and memory to guarantee protocol correctness. Existing directed testing frameworks, while promising, frequently encounter challenges such as state space explosion and the potential omission of critical states if not meticulously applied. To address these limitations, we propose an enhanced on-the-fly directed validation method specifically designed for cache coherence protocols. This approach achieves comprehensive state and transition coverage with approximately 50% fewer test cases, all without increasing memory overheads, offering a more efficient and effective solution for validating coherence protocols in complex memory hierarchies.
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionIn large-scale DNN inference accelerators, the many-core architecture has emerged as a predominant design, with layer-pipeline (LP) mapping being a mainstream mapping approach. However, our experimental findings and theoretical justifications uncover a hardware-independent and prevalent flaw in employing layer-pipeline mapping on many-core accelerators: a significant underutilization of buffer space across numerous cores, indicating substantial potential for optimization. Building on this discovery, we develop a universal and efficient buffer allocation strategy, BufferProspector, which includes a Buffer Requirement Calculator and a Buffer Allocator, to capitalize on these unused buffers and address the timing mismatch challenge inherent in LP mapping. Compared to the state-of-the-art (SOTA) open-source LP mapping framework Tangram, BufferProspector simultaneously improves energy efficiency and performance by an average of 1.44x and 2.26x, respectively. BufferProspector will be open-sourced.
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
DescriptionDiffusion models have become essential generative tools for tasks such as image generation, video creation, and inpainting, but their high computational and memory demands pose challenges for efficient deployment. Contrary to the traditional belief that full-precision computation ensures optimal image quality, we demonstrate that a fine-grained mixed-precision strategy can surpass full-precision models in terms of image quality, diversity, and text-to-image alignment. However, directly implementing such strategies can lead to increased complexity and reduced runtime performance due to the overheads of managing multiple precision formats and casting operations. To address this, we introduce DM-Tune, which replaces complex mixed-precision quantization with a unified low-precision format, supplemented by noise-tuning, to improve both image generation quality and runtime efficiency. The proposed noise-tuning mechanism is a type of fine-tuning that reconstructs the mixed-precision output by learning adjustable noise through a parameterized nonlinear function consisting of Gaussian and linear components. Key steps in our framework include identifying sensitive layers for quantization, modeling quantization noise, and optimizing runtime with custom low-precision GPU kernels that support efficient noise-tuning. Experimental results across various diffusion models and datasets demonstrate that DM-Tune not only significantly improves runtime but also enhances diversity, quality, and text-to-image alignment compared to FP32, FP8, and state-of-the-art mixed-precision methods. Our approach is broadly applicable and lays a solid foundation for simplifying complex mixed-precision strategies at minimal cost.
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionTransformers have demonstrated outstanding performance across diverse applications recently, necessitating fine-tuning to optimize their performance for downstream tasks. However, fine-tuning remains challenging due to the substantial computational costs and storage overhead of backpropagation (BP). The existing fine-tuning techniques require the BP through the massive pre-trained backbone weights for computing the input gradient, resulting in significant computing overhead and memory footprint for resource-constrained edge devices. To address the challenge, this work proposes an algorithm-hardware co-design framework, DRAFT, for efficient Transformer fine-tuning by decoupling the BP from the backbone weights, thereby efficiently reducing the BP overhead. The framework employs Feedback Decoupling Approximation (FDA), an efficient fine-tuning algorithm that decouples BP into two low-complexity pathways: trainable adapter pathway and sparse ternary Bypass Network (BPN) pathway. The two pathways work collaboratively to approximate the conventional BP process. Further, a DRAFT accelerator is proposed, featuring a reconfigurable design with lightweight sparse gather networks and dynamic workflows to fully harness the sparsity and data parallelism inherent to the FDA. Experimental results demonstrate that DRAFT achieves a speedup of 4.9× and an energy efficiency improvement of 4.2× on average compared to baseline fine-tuning methods across multiple fine-tuning tasks with negligible accuracy loss.
Research Manuscript


Systems
SYS3: Embedded Software
DescriptionEmbedded Android devices have proliferated in many security-critical embedded scenarios, requiring sufficient testing to root out vulnerabilities. Because Android's architecture uses a Hardware Abstraction Layer (HAL) for vendor-specific driver implementations, traditional kernel testing techniques cannot detect bugs within the actual driver logic, which is commonly proprietary and vendor-specific. In this paper, we propose DroidFuzz, an embedded Android system fuzzer that targets such vendor-specific driver implementations to find these bugs. By leveraging pre-testing HAL driver probing, kernel-user relational payload generation, and cross-boundary execution state feedback, we effectively test proprietary drivers in both the kernel and the HAL layer. We implemented DroidFuzz, evaluated its effectiveness on 7 embedded Android devices, and found 12 security-critical, previously unknown bugs, all of which have been confirmed by the respective vendors.
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionDeploying convolutional neural networks (CNNs) on Field Programmable Gate Arrays (FPGAs) presents challenges in achieving optimal timing closure due to placement's impact on clock frequency and throughput. We propose DSPlacer, a novel framework for diverse CNN accelerator architectures, integrating techniques such as GCN-based DSP identification, DSP graph construction, min-cost-flow assignment, and ILP-based cascade legalization. DSPlacer ensures a compact layout while preserving direct datapath connections. Evaluated against AMD Xilinx Vivado 2020.2 and AMF-Placer 2.0, DSPlacer improves Worst Negative Slack (WNS) by 32% and 65%, demonstrating its effectiveness and scalability.
Research Manuscript


Design
DES4: Digital and Analog Circuits
DescriptionTo meet the computational requirements of modern workloads under tight energy constraints, general-purpose accelerator architectures have to integrate an ever-increasing number of extremely area- and energy-efficient processing elements (PEs).
In this context, single-issue in-order cores are commonplace, but lean dual-issue cores could boost PE IPC, especially for the common case of mixed integer and floating-point workloads.
We develop the COPIFT methodology and RISC-V ISA extensions to enable low-cost and flexible dual-issue execution of mixed integer and floating-point instruction sequences. On such kernels, our methodology achieves speedups of 1.47x, reaching a peak 1.75 instructions per cycle, and 1.37x energy improvements on average, over optimized RV32G baselines.
Networking
Work-in-Progress Poster


DescriptionShifted-and-Duplicated-Kernel (SDK) mapping methods have advanced CNN acceleration in Compute-in-Memory (CIM) architectures but face efficiency challenges with increasing XBar sizes and hardware throughput gaps. We propose the Extended SDK (eSDK) and Compacted Kernels (CK) methods to address these limitations. eSDK optimizes performance by combining inter- and intra-tile duplication, while CK improves resource utilization by compacting kernels onto idle XBars with minimal overhead. These methods are integrated into DualMap, an iterative framework that balances performance and power. Experimental results demonstrate up to 5.24× acceleration over previous methods, enabling efficient CNN inference on CIM hardware.
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionQuantization enables efficient deployment of large language models (LLMs) on FPGAs, but the presence of outliers affects the accuracy of the quantized model. Existing methods mainly deal with outliers through channel-wise or token-wise isolation and encoding, which leads to expensive dynamic quantization. To address this problem, we introduce DuoQ, an FPGA-oriented algorithm-hardware co-design framework. DuoQ effectively eliminates outliers through learnable equivalent transformations and low-semantic token awareness in its quantization scheme, facilitating per-tensor quantization with 4 bits. We co-design the quantization algorithm and the hardware architecture. Specifically, DuoQ accelerates end-to-end LLM inference through a novel DSP-aware PE unit design and encoder design. In addition, two types of post-processing units assist in the realization of nonlinear functions and dynamic token awareness. Experimental results show that, compared with platforms with different architectures, DuoQ improves computational efficiency and energy efficiency by up to 8.8x and 23.45x, respectively. In addition, DuoQ achieves accuracy improvements compared to other outlier-aware software and hardware works.
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionRecent parameter-efficient fine-tuning (PEFT) methods reduce trainable parameters while maintaining model performance, with Low-Rank Adaptation (LoRA) as a prominent approach. However, optimizing both accuracy and efficiency remains challenging. The Dual Quantized Tensor-Train Adaptation with Decoupling Magnitude-Direction framework (DuQTTA) addresses the need for efficient fine-tuning of Large Language Models (LLMs) by employing Tensor-Train decomposition and dual-stage quantization to minimize model size and memory use. Additionally, an adaptive optimization strategy and a decoupled update mechanism improve fine-tuning precision. DuQTTA consistently outperforms LoRA in fine-tuning LLaMA2-7B, LLaMA2-13B, and LLaMA3-8B models across various tasks, achieving a several-fold improvement in compression rate over LoRA.
Networking
Work-in-Progress Poster


DescriptionThis work introduces dyGRASS, an efficient dynamic algorithm designed for spectral sparsification of large undirected graphs that undergo streaming edge insertions and deletions. The core of dyGRASS is a random-walk-based method to efficiently estimate node-to-node distances in both the original graph and its sparsifier. This approach helps identify the most crucial edges from the updating edges, which are necessary to maintain distance relationships. It also aids in the recovery of essential edges from the original graph back into the sparsifier when edge deletions occur. To improve computational efficiency, we developed a GPU-based random walk kernel to allow multiple walkers to operate simultaneously across different targets. This parallelization significantly enhances the performance and effectiveness of the dyGRASS framework. Our comprehensive experimental evaluations demonstrate that dyGRASS achieves approximately a twofold speedup while eliminating setup overhead and improving solution quality compared to inGRASS in incremental spectral sparsification. Furthermore, dyGRASS achieves high efficiency and solution quality for fully dynamic graph sparsification involving both edge insertions and deletion operations for various graph instances derived from circuit simulations, finite element analysis, and social networks.
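As a rough sketch of the random-walk distance estimation at the core of dyGRASS, the following minimal Python example estimates a node-to-node distance on a weighted graph by taking the best accumulated length over a budget of walks; the GPU-parallel kernel and the sparsifier update logic are not represented, and all names are illustrative.
```python
import random

def walk_distance(adj, u, v, n_walks=64, max_steps=200):
    """Random-walk estimate of the u -> v distance in a weighted graph.

    adj maps each node to a list of (neighbor, edge_length) pairs.
    Each walk wanders from u until it hits v or runs out of steps; the
    estimate is the shortest accumulated length over successful walks,
    an upper bound on the true distance.  Minimal single-threaded sketch
    of the idea only; dyGRASS runs many walkers in parallel on the GPU
    and uses such estimates to decide which update edges to keep.
    """
    best = float("inf")
    for _ in range(n_walks):
        node, length = u, 0.0
        for _ in range(max_steps):
            if node == v:
                best = min(best, length)
                break
            nxt, w = random.choice(adj[node])
            node, length = nxt, length + w
    return best

# toy usage: a unit-weight 4-cycle, true distance 0 -> 2 is 2.0
adj = {0: [(1, 1.0), (3, 1.0)], 1: [(0, 1.0), (2, 1.0)],
       2: [(1, 1.0), (3, 1.0)], 3: [(2, 1.0), (0, 1.0)]}
print(walk_distance(adj, 0, 2))
```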
Networking
Work-in-Progress Poster


DescriptionThis paper presents a high-performance and energy-efficient near-sensor optical Deep Neural Network (DNN) accelerator---named Dyna-Optics---for dynamic inference in vision applications. Dyna-Optics leverages the efficiency of silicon photonic devices in an innovative real-time adjustable architecture supported by a novel channel-adaptive dynamic neural network algorithm to perform near-sensor granularity-controllable convolution operations for the first time. Dyna-Optics is co-designed to adjust its photonic device allocations and computing path through a novel device arm-dropping mechanism to best align varying workloads by eliminating the humongous energy consumption imposed by the weight tuning on photonic devices. Our device-to-architecture simulation results demonstrate that Dyna-Optics enables real-time trade-offs between speed, energy, and accuracy after model deployment. It can process ~84 Kilo FPS/W with slight accuracy degradation, reducing power consumption by a factor of up to ~6.1x and 52x on average compared with existing photonic accelerators and GPU baselines.
Networking
Work-in-Progress Poster


DescriptionHigh-Level Synthesis (HLS) has transformed FPGA programming by allowing developers to describe hardware functionality using high-level languages like C/C++. However, the long synthesis time of traditional HLS tools hinders the flexibility required to adapt hardware generation in real time, particularly in dynamic environments such as the cloud. In this paper, we propose Nimble, a novel Just-in-Time (JIT) compilation framework that brings runtime adaptability to HLS, enabling transparent FPGA acceleration of cloud workloads without requiring any hardware expertise. Our evaluation on real-world applications demonstrated substantial performance gains, with speedups of up to 3.2X in SQL query processing for MySQL.
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionTraditional global routing simplifies routing as a Steiner Tree packing problem on a grid graph, often neglecting the need to account for tile-internal wiring connections to pin shapes. Local wiring usage is typically pre-estimated, leading to inaccuracies. To address this gap, we introduce the Dynamic Local Usage (DLU) model, which evaluates wiring usage based on exact wire shapes and optimizes routes down to detailed pin shapes. Key innovations include tip-to-tip penalty extensions to accurately model packing density and new algorithms to optimize routes according to the DLU congestion model.
We demonstrate the superiority of the Dynamic Local Usage model by comparing results after global and detailed routing against an industrial global router on the 3nm and 5nm technology nodes. Our model achieves significant improvements in the number of vias, scenic routes, design rule violations, and timing metrics after detailed routing.
Furthermore, the DLU provides better support for incremental design changes, such as pin movements, by dynamically updating local wiring usage during routing.
Engineering Presentation


AI
Back-End Design
DescriptionSkew, the timing variation among signals, can severely impact the performance and functionality of complex designs if not handled appropriately. Traditional skew minimization techniques often focus on individual signals and treat one signal as the reference, leading to sub-optimal results when dealing with a large number of inter-related signals, especially in mixed-signal designs. Traditional techniques are also largely post-facto and hence iterative.
This paper introduces a novel "correct-by-construct" approach for skew balancing across multiple signals in Mixed-signal SoCs. Mixed-signal SoCs combine analog and digital components, presenting unique challenges in achieving precise timing synchronization.
Our proposed methodology leverages "correct-by-construct" optimization strategies to address the timing paths of multiple signals within a design, thus reducing global skew and minimizing congestion in the digital-analog interface channels. The solution therefore ensures first-pass STA timing closure even for complex skew requirements across PVT corners, enabling early SDF handoff. This in turn ensures faster time to market and avoids late design/spec changes and signoff timing distortion. This method is also beneficial for multi-core processor designs where bus skew balance is critical.
In summary, this comprehensive methodology offers a valuable tool for designers to meet complex timing requirements and enhance the reliability of interface timing in an era of increasing complexity and miniaturization.
Research Manuscript


Design
DES6: Quantum Computing
DescriptionQuantum readout error is the most significant source of error, substantially reducing measurement fidelity. Tensor-product-based readout error mitigation has been proposed to address this issue by approximating the mitigation matrix. However, this method inevitably requires dynamically generating the mitigation matrix, leading to long latency.
In this paper, we propose \papername, a software-hardware co-design approach that mitigates readout errors with an embedded accelerator. The main innovation lies in leveraging the inherent sparsity in the nonzero probability distribution of quantum states and calculating the tensor product on an embedded accelerator. Specifically, using the output sparsity, our dataflow dynamically downsamples the original mitigation matrix, which dramatically reduces the memory requirement. Then, we design \papername architecture that can flexibly gate the redundant computation of nonzero quantum states. Experiments demonstrate that \papername achieves an average speedup of 9.6×–2000× and fidelity improvements of 1.03×–1.15× compared to state-of-the-art readout error mitigation methods.
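For orientation, the sketch below applies standard tensor-product readout-error mitigation restricted to the observed bitstrings, which is the output-sparsity shortcut the abstract alludes to; the paper's downsampling dataflow and embedded accelerator are not modelled, and all names in the snippet are illustrative.
```python
import numpy as np

def mitigate_sparse(counts, A_list):
    """Tensor-product readout-error mitigation over observed bitstrings only.

    counts: {bitstring: probability}, typically with few nonzero entries.
    A_list: per-qubit 2x2 assignment matrices A_i[measured, prepared].
    With p_noisy = (A_1 x ... x A_n) p_ideal, each entry of the inverse
    factorises into a product of per-qubit scalars, so the 2^n x 2^n
    matrix never needs to be materialised.  Restricting the corrected
    support to the observed bitstrings is the output-sparsity shortcut;
    the paper's dataflow, downsampling and accelerator are not modelled.
    """
    Ainv = [np.linalg.inv(A) for A in A_list]
    n = len(A_list)
    mitigated = {}
    for x in counts:                    # candidate corrected bitstrings
        total = 0.0
        for y, p in counts.items():     # measured bitstrings
            w = 1.0
            for i in range(n):          # product of per-qubit inverse entries
                w *= Ainv[i][int(x[i]), int(y[i])]
            total += w * p
        mitigated[x] = total
    return mitigated

# toy usage: two qubits with 5% symmetric readout error each
A = np.array([[0.95, 0.05], [0.05, 0.95]])
print(mitigate_sparse({"00": 0.9, "11": 0.1}, [A, A]))
```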
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
DescriptionIn technology mapping, the quality of the final implementation heavily relies on the circuit structure after technology independent optimization. Recent studies have introduced equality saturation as a novel optimization approach. However, its efficiency remains a hurdle against its wide adoption in logic synthesis. This paper proposes a highly scalable and efficient framework named E-morphic. It is the first work that employs equality saturation for resynthesis after conventional technology-independent logic optimizations, enabling structure exploration before technology mapping. Powered by several key enhancements to the equality saturation framework, such as direct e-graph-circuit conversion, solution-space pruning, and simulated annealing for e-graph extraction, this approach not only improves the scalability and extraction efficiency of e-graph rewriting but also addresses the structural bias issue present in conventional logic synthesis flows through parallel structural exploration and resynthesis. Experiments show that, compared to the state-of-the-art delay optimization flow in ABC, E-morphic on average achieves 12.54% area saving and 7.29% delay reduction on the large-scale circuits in the EPFL benchmark.
Engineering Poster
Networking


DescriptionClock network jitter signoff simulation takes place at a very late phase of the design because simulation vectors, package models, and post-route, filled design data are unavailable at early design phases. As a result, estimated clock network jitter values are assumed at initial design phases. Using approximate modelling of power noise waveforms in clock network circuit SPICE simulations, we can predict clock network jitter at an early phase of the design (clock-routed database). Once accurate clock network device power waveforms (piecewise linear) are available from IR analysis, the actual waveforms can be used in clock network circuit SPICE simulation to derive a more accurate clock network jitter value. These methods for clock network jitter calculation are scalable and carry minimal dependency on design size.
Networking
Work-in-Progress Poster


DescriptionMatching of devices and circuit elements is of utmost importance in analog integrated circuit designs. The matching checks are mostly done manually and are at the discretion of analog layout designers. These mismatches impact the simulation results, leading to a long iterative process to close the design, and there is a high probability that some mismatches are missed. This paper proposes an innovative automated approach using standard tools (Cadence Virtuoso and a Python script) to detect and address potential device matching issues in analog layouts. This is a technology-independent approach. We validated it in 90nm technology using an eNVM test case. We observed potential matching issues in layout due to variations in extracted parameters like the SCI parameter (which is related to WPE) and the STI parameters (sa and sb). Mapping these variations to the layout helped in reducing potential device mismatches. With the help of the k-means clustering algorithm, Python scripting aids in identifying outlier devices. The proposed methodology will help the layout engineer ensure that the layout matches correctly before sending the netlist for simulation, which will shorten the time needed to complete the design.
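As a rough illustration of the clustering step described above, the following Python sketch flags outlier devices from a table of extracted parameters using scikit-learn's k-means; the column meanings (sa, sb, SCI) and the thresholds are assumptions, and the surrounding Virtuoso extraction flow is not shown.
```python
import numpy as np
from sklearn.cluster import KMeans

def flag_outlier_devices(params, n_clusters=3, z_thresh=3.0):
    """Flag devices whose extracted layout parameters sit far from any cluster.

    params: (n_devices, n_features) array of extracted values, e.g. columns
    for sa, sb and the WPE-related SCI parameter (column meaning is an
    assumption here).  Devices are clustered with k-means; a device is
    reported as a potential mismatch if its distance to its cluster centre
    exceeds z_thresh standard deviations of the distances in that cluster.
    Sketch of the clustering step only, not the full Virtuoso-based flow.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(params)
    dist = np.linalg.norm(params - km.cluster_centers_[km.labels_], axis=1)
    outliers = []
    for c in range(n_clusters):
        d = dist[km.labels_ == c]
        cutoff = d.mean() + z_thresh * d.std()
        outliers.extend(np.where((km.labels_ == c) & (dist > cutoff))[0].tolist())
    return sorted(outliers)

# toy usage: 100 well-matched devices plus one shifted device, single cluster
rng = np.random.default_rng(1)
devices = rng.normal([0.2, 0.2, 1.0], 0.01, size=(100, 3))
devices = np.vstack([devices, [0.45, 0.05, 1.0]])
print(flag_outlier_devices(devices, n_clusters=1))  # expected: [100]
```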
Engineering Poster
Networking


DescriptionIn the realm of STA (static timing analysis), our engineers dedicate much of their time to writing timing constraints and analyzing reports. While we can't yet command, "Computer, fix my timing", we can boost our efficiency with the AI tools available today.
This presentation will showcase various examples and encourage people to think in new ways. Rather than asking AI to solve problems for us, we can ask it to help automate our solutions.
Key areas of focus include debug of failing paths, writing scripts, and timing take down.
By smartly integrating AI into our workflows, we can streamline processes and enhance productivity.
Research Manuscript


Systems
SYS4: Embedded System Design Tools and Methodologies
DescriptionNeural image compression, necessary in various machine-to-machine communication scenarios, suffers from heavy encode-decode structures and inflexibility in switching between different compression levels. Consequently, applying neural image compression, which was developed for powerful servers with high computational and storage capacities, to edge devices raises significant challenges. We take a step toward solving these challenges by proposing a new transformer-based edge-compute-free image coding framework called Easz. Easz shifts the computational overhead to the server and hence avoids the heavy encoding and model-switching overhead on the edge. Easz utilizes a patch-erase algorithm to selectively remove image contents using a conditional uniform-based sampler. The erased pixels are reconstructed on the receiver side through a transformer-based framework. To further reduce the computational overhead on the receiver, we introduce a lightweight transformer-based reconstruction structure. Extensive evaluations conducted on a real-world testbed demonstrate multiple advantages of Easz over existing compression approaches in terms of adaptability to different compression levels, computational efficiency, and image reconstruction quality.
Research Special Session


Design
DescriptionThis talk will discuss the requirements for EDA tooling for heterogeneous integration (HI). Specific issues that will be addressed include (1) chiplet disaggregation, to map the design onto smaller chiplets, working in conjunction with system-technology co-optimization to determine the right substrate and chiplet technologies; (2) multiphysics analysis that incorporates thermomechanical aspects into performance analysis; (3) physical design of chiplets on the substrate, including the design of thermal solutions and power delivery solutions; and (4) the underlying infrastructure required to facilitate HI-based design, including the design and characterization of chiplet libraries. Given the complexities of these tasks, the talk will discuss the role of analysis and optimization at various stages of design, ranging from fast machine-learning-driven analyses in early stages to signoff-quality analysis.
Exhibitor Forum


DescriptionThe electronic design automation (EDA) industry is evolving rapidly, and cloud computing is unlocking unprecedented opportunities for chip design. This presentation explores how Google Cloud Platform (GCP) empowers a new era of scalable and efficient workflows, drawing on insights and best practices honed from Alphabet's own internal use. While traditional on-premises EDA environments have served the industry well, GCP offers a transformative leap forward, enabling capabilities previously unattainable.
We'll showcase how GCP services like Google Kubernetes Engine (GKE) and Cloud Batch can handle the dynamic demands of chip design, optimizing data storage and analysis with services like Google Cloud Storage (GCS) and BigQuery. These services, battle-tested within Alphabet, provide the foundation for agile and responsive chip development. Furthermore, we'll discuss how value-added services built on GCP, including data solutions, runtime optimizations, and fine-grained access control list (ACL) management, are critical for achieving next-generation chip design efficiency.
Networking
Work-in-Progress Poster


DescriptionContinual learning (CL) enables offline-trained models to adapt to new environments and unseen data, a critical feature for edge-deployed models. However, CL often suffers from significant data and hardware overhead or performance degradation, such as catastrophic forgetting (CF). To mitigate these challenges, this work proposes a hardware-algorithm co-design for Gaussian Mixture-based Bayesian Neural Networks (GM-BNNs). The proposed GM-BNN approach enables CL by identifying uncertain out-of-distribution (OOD) data to minimize retraining data volume and mitigates CF by integrating old and new knowledge within a unified GM framework of multiple distributions—each addressing a distinct task. To address the high computational overhead of Bayesian sampling, we design a custom in-memory Gaussian mixture computation circuit, enabling efficient and scalable CL. Leveraging shared Gaussian random number generation inside the multi-distribution memory words and near-memory distribution selection achieves a 10.9× and 1.97× improvement in energy and area respectively compared to a state-of-the-art baseline. Furthermore, the uncertainty-aware, minimal retraining GM-BNN algorithm reduces retraining data required to achieve iso-accuracy by 5×.
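To make the uncertainty-gated retraining idea concrete, the sketch below selects retraining candidates by Monte-Carlo-sampling a stochastic classifier and thresholding predictive entropy; it is a generic illustration under assumed names and thresholds, not the paper's Gaussian-mixture BNN or its in-memory circuit.
```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def select_for_retraining(sample_logits, X, n_draws=16, entropy_thresh=1.0):
    """Keep inputs whose Bayesian predictive distribution is most uncertain.

    sample_logits(x) is a placeholder for one stochastic forward pass
    (one weight sample of a Bayesian NN).  The predictive distribution is
    averaged over n_draws samples, and inputs whose predictive entropy
    exceeds entropy_thresh are kept as candidate retraining data.  This is
    the generic uncertainty-gated selection idea only; the paper's Gaussian
    mixture machinery and in-memory hardware are not represented.
    """
    keep = []
    for i, x in enumerate(X):
        probs = np.mean([softmax(sample_logits(x)) for _ in range(n_draws)], axis=0)
        entropy = -np.sum(probs * np.log(probs + 1e-12))
        if entropy > entropy_thresh:
            keep.append(i)
    return keep

# toy usage: a dummy 4-class stochastic "model" that just adds noise to logits
rng = np.random.default_rng(2)
noisy = lambda x: x + rng.normal(0.0, 0.5, size=4)
X = [np.array([4.0, 0.0, 0.0, 0.0]),   # confident prediction -> skipped
     np.array([0.1, 0.0, 0.1, 0.0])]   # ambiguous prediction -> selected
print(select_for_retraining(noisy, X))  # expected: [1]
```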
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
DescriptionBoolean decomposition is a powerful technique in logic synthesis that breaks down Boolean functions into simpler components. Decomposition-based logic synthesis produces high-quality results and is effective with small-window optimization methods in Gate-Inverter Graphs (GIG). However, the efficiency limitations of current methods have restricted their potential in handling large and complex logic. To address this challenge, we propose a framework, EDGE, which leverages modern database techniques to accelerate Boolean decomposition, achieving better synthesis results while maintaining high efficiency. Experimental results demonstrate a runtime speedup of up to 21× and an overall reduction in node count of at least 15% compared to state-of-the-art synthesis methods.
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionEmerging multimodal LLMs (MLLMs) exhibit strong cross-modal perception and reasoning capabilities and hold great potential for various applications at the edge. However, MLLMs normally consist of a compute-intensive modality encoder and a memory-bound LLM, leading to distinct performance bottlenecks for hardware designs. In this work, we present a multi-core CPU solution with heterogeneous AI extensions based on either compute-centric systolic-array or memory-centric digital compute-in-memory (CIM) coprocessors. Furthermore, dynamic activation-aware weight pruning and bandwidth management are developed to enhance bandwidth efficiency and core utilization, improving overall system performance. We implemented our solution in a commercial 22nm technology. For a representative MLLM, our solution achieves a 2.84x performance speedup compared to a laptop 3060 GPU, reaching a throughput of 138 tokens/s and an efficiency of 0.28 tokens/J.
Engineering Poster


DescriptionConsumer products demand low power consumption and small size. In the design discussed here, multiple storage dies and a logic die are stacked vertically to enhance the computing power per unit area of the chip. Each die contains tens of thousands of high-precision array units, storage array units, and digital units. The requirements for voltage variation on the power supply and VSS are extremely strict.
Due to limited bump resources, the VSS of the storage and logic dies is combined on the logic die and then fanned out to the VSS bumps. It is necessary to simultaneously simulate the voltage variation of the VSS TSVs (through-silicon vias) at each grounding point on the storage dies and the logic die, and then analyze the results to find an optimization scheme for the voltage variation.
In this presentation, we provide an effective analysis method for the full-chip voltage variation of a complex analog-digital hybrid storage 3DIC, which can accurately analyze the voltage variation of the VSS TSVs at the grounding point locations.
Then, two optimization methods are presented.
The first optimizes the voltage variation of the power ground of the storage array by means of controlled, delayed startup of the array units; its effectiveness is demonstrated through a comparison of simulation data.
For the second, a comparative analysis of simulation data demonstrates that the mutual inductance between TSV cells cannot be ignored in multi-die stacking, and an arrangement scheme is provided that effectively reduces this mutual inductance.
Research Manuscript


Design
DES6: Quantum Computing
DescriptionRealizing the full potential of quantum computing requires large-scale quantum computers capable of running quantum error correction (QEC) to mitigate hardware errors and maintain quantum data coherence. While quantum computers operate within a two-level computational subspace, many processor modalities are inherently multi-level systems. This leads to occasional leakage into energy levels outside the computational subspace, complicating error detection and undermining QEC protocols. The problem is particularly severe in engineered qubit devices like superconducting transmons, a leading technology for fault-tolerant quantum computing. Addressing this challenge requires effective multi-level quantum system readout to identify and mitigate leakage errors. We propose a scalable, high-fidelity three-level readout that reduces FPGA resource usage by 60x compared to the baseline while reducing readout time by 20%, enabling faster leakage detection. By employing matched filters to detect relaxation and excitation error patterns and integrating a modular lightweight neural network to correct crosstalk errors, the protocol significantly reduces hardware complexity, achieving a 100x reduction in neural network size. Our design supports efficient, real-time implementation on off-the-shelf FPGAs, delivering a 6.6% relative improvement in readout accuracy over the baseline. This innovation enables faster leakage mitigation, enhances QEC reliability, and accelerates the path toward fault-tolerant quantum computing.
Engineering Presentation


AI
Back-End Design
DescriptionExisting automatic routing tools often produce design rule violations and fail to meet the required routing completion rate when dealing with complex electrical connections and constraints. In this work, an algorithm for determining auxiliary points is introduced to guide the auto-routing tool in generating a preliminary layout for an FCBGA substrate, and the preliminary layout is then optimized by a Residual Neural Network and a Decision Transformer model. The auxiliary point corresponding to a die pad is placed in proximity to the die edge, and its coordinates are determined by the corresponding die pad coordinates, signal type, and design rules, such as the minimum trace width and spacing. Experimental results demonstrate that the proposed method achieves a high routing completion rate and reduces layout defects such as detours and dense routing.
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
DescriptionLogic synthesis involves applying a sequence of logic transformations to circuits that significantly impact their quality-of-result (QoR).
Recently, there has been a growing focus on optimizing synthesis sequences to improve QoR.
However, previous efforts treat sequence optimization as a discrete transformation-selection task and fall short in scalability due to the exponentially sized search space.
Continuous optimization, in contrast, offers favorable efficiency advantages thanks to mature gradient-based optimization techniques.
Still, it is non-trivial to directly apply it to the discrete sequence optimization task since the gradient may lead to out-of-distribution solutions.
In this paper, we propose an efficient approach to optimize synthesis sequences within a continuous latent space.
We map discrete transformations to and from a continuous representation, enabling gradient-based optimization.
To tackle the out-of-distribution issue, we incorporate a diffusion model to constrain the optimized transformations to be in-distribution.
Without the need to search the combinatorial solution space, our method is more than 5 times faster than previous methods.
Besides, the sequences found by our method achieve lower area and delay than baseline methods.
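To illustrate the overall embed-optimize-decode loop (without the learned encoder/decoder or the diffusion-based in-distribution constraint described above), a minimal sketch with an assumed toy embedding and surrogate objective might look as follows.
```python
import numpy as np

# toy transformation vocabulary with assumed 2-D embeddings (illustrative only)
TRANSFORMS = ["rewrite", "refactor", "balance", "resub"]
EMB = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]])

def surrogate_grad(z):
    """Gradient of an assumed differentiable QoR surrogate, sum((z - 0.4)^2)."""
    return 2.0 * (z - 0.4)

def optimize_sequence(seq, steps=100, lr=0.05):
    """Embed -> gradient descent on the surrogate -> decode to nearest transforms.

    Skeleton of continuous sequence optimization only.  The paper learns the
    encoder/decoder and additionally constrains the optimized latents with a
    diffusion model to keep them in-distribution, which is omitted here.
    """
    z = EMB[[TRANSFORMS.index(t) for t in seq]].astype(float)  # (L, 2) latents
    for _ in range(steps):
        z -= lr * surrogate_grad(z)                            # gradient step
    dists = np.linalg.norm(z[:, None, :] - EMB[None, :, :], axis=-1)
    return [TRANSFORMS[i] for i in dists.argmin(axis=1)]       # nearest decode

# toy usage
print(optimize_sequence(["rewrite", "balance", "resub"]))
```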
Networking
Work-in-Progress Poster


DescriptionThe rapid growth of smart devices and sensors has led to an overwhelming increase in data generation, pushing current network infrastructure to its limits and threatening the scalability of cloud-based processing. Edge machine learning, which processes data locally on devices, presents a viable solution to reduce network load and latency. However, deploying deep learning at the edge remains difficult due to the limited memory and computational capacity of these devices, which mostly precludes on-device/on-site training. Equilibrium propagation (EP) has emerged as a promising alternative to backpropagation, leveraging analog processing and device physics for energy-efficient learning. Yet, its practical implementation is hindered by challenges such as voltage variations and the need for energy-efficient circuits capable of gradient computation at a sufficient level of accuracy. Existing solutions rely on impractical idealized models. In this work, we introduce a novel method to address the wide dynamic range of the voltage variation and avoid the use of expensive low-noise amplifiers, and propose an innovative transistor-level switched-capacitor circuit to compute gradients in accordance with the EP rule. Additionally, our design supports batching, a key requirement for stable training that is often overlooked. We validate our approach on the MNIST dataset, demonstrating a practical, energy-efficient EP circuit that operates within real hardware constraints.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionVision Transformers (ViTs) are new foundation models for vision applications. Deploying ViTs at the edge to realize energy-saving, low-latency, and high-performance dense predictions has wide applications, such as autonomous driving and surveillance image analysis. However, the quadratic complexity of the self-attention mechanism renders ViTs slow and resource-intensive, particularly for pixel-level dense predictions that involve long contexts. Additionally, the pyramid-like architecture of modern ViT variants leads to an unbalanced workload, further reducing hardware utilization and decreasing the throughput of conventional edge devices. To this end, we propose an algorithm-hardware co-optimized edge ViT accelerator tailored for efficient dense predictions. At the algorithm level, we propose a decoupled chunk attention (DCA) mechanism implemented in a pipelined manner to reduce off-chip memory access, thereby enabling efficient dense predictions within limited on-chip memory. At the architecture level, we introduce a hybrid architecture that combines SRAM-based computing-in-memory (CIM) and nonvolatile RRAM storage to eliminate extensive off-chip memory access, with fusion scheduling to balance workloads and minimize intermediate on-chip memory access. At the circuit level, a bit/element two-way-reconfigurable CIM macro is proposed to improve hardware utilization across pyramidal ViT blocks with varied matrix sizes. The experimental results on object detection, semantic segmentation, and depth estimation tasks demonstrate that our design can efficiently process patch lengths up to 16384 with a speedup of 18.5×-217.1×, a reduction in memory accesses of 1.7×-7.4×, and an improvement in energy efficiency of 1.8×, under less than 1% performance degradation.
Engineering Presentation


Front-End Design
DescriptionAs design size and complexity grow, design verification becomes increasingly challenging. To meet this challenge, hardware fuzzing has emerged as a viable solution to improve coverage by injecting randomized test vectors. However, RTL simulation-based hardware fuzzing is constrained by slow simulation speeds, particularly for large designs. To address this, FPGA-based simulation acceleration for hardware fuzzing has been proposed, but it involves significant overhead in creating testbenches and porting designs to FPGAs. This work introduces a simulation-based hardware fuzzing approach based on SystemC to enhance simulation speed. The approach builds on a cycle-accurate SystemC model generated by Verilator. By leveraging transaction-level modeling, the method reduces simulation events at bus interfaces. Moreover, unnecessary events caused by clock toggles and cascading are eliminated without compromising accuracy through an adaptive clock gating mechanism. Experimental results support the effectiveness of the proposed approach, achieving a 7.14× speedup over the baseline at the same level of coverage.
Research Manuscript


EDA
EDA2: Design Verification and Validation
DescriptionSynthesis-based functional Engineering Change Order (ECO) algorithms, as classified in [1], are particularly effective for addressing functional bugs. These algorithms typically involve two primary steps: (1) identifying rectification signals to address functional mismatches, and (2) generating patch circuits based on these signals. While much of the existing research focuses on enhancing step (2), step (1) often remains somewhat ad hoc and inefficient.
In this paper, we propose a novel approach for systematically collecting and validating all possible sets of rectification signals from a given set of candidates. Leveraging a heuristic for grouping and ranking rectification signals, our ECO flow efficiently identifies minimal patches while achieving a highly competitive runtime. Our contributions include three key innovations: a heuristic for identifying high-quality rectification candidates, an efficient algorithm for validating all feasible sets of rectification signals, and a signal grouping and ranking technique that ensures minimal patch size. When integrated with an open-source patch generation tool, our method demonstrates an average reduction of 44% in patch sizes compared to a leading commercial ECO tool on benchmark circuits.
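As a rough illustration of step (1) as described above (a hypothetical Python sketch; is_valid_rectification stands in for the paper's formal check that a signal set can rectify all functional mismatches), candidate signals are ranked by a quality heuristic and small sets are validated first so that a minimal patch basis is found early:

from itertools import combinations

def find_rectification_sets(candidates, score, is_valid_rectification, max_size=3):
    """Enumerate candidate rectification-signal sets, smallest sets first.
    candidates: list of signal names
    score: heuristic quality estimate, higher is better (assumed helper)
    is_valid_rectification: callable(set) -> bool, placeholder for the
        formal validation (e.g. a SAT/ECO feasibility check)."""
    ranked = sorted(candidates, key=score, reverse=True)
    valid_sets = []
    for size in range(1, max_size + 1):            # prefer smaller sets
        for subset in combinations(ranked, size):  # higher-ranked signals first
            if is_valid_rectification(set(subset)):
                valid_sets.append(set(subset))
        if valid_sets:
            break                                  # minimal-size sets found
    return valid_sets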
Research Manuscript


EDA
EDA6: Analog CAD, Simulation, Verification and Test
DescriptionPeriodic small-signal analysis is crucial but time-consuming in RF simulation, since it may deal with many frequency points. While the Krylov subspace recycling method has greatly accelerated the simulation, the increasing memory cost in large-scale circuit simulation has become a new bottleneck. A remedy for memory shortage is to restart the recycling algorithm, but may cause excessive extra iterations. To address the issue, this paper outlines a framework of recycling subspace truncation method for periodic small-signal analysis, provided with an efficient initial guess choice method and a Floquet-based subspace truncation strategy. Numerical results show that compared to the existing methods, the proposed method achieves up to 2.5× speedup in the same memory cost.
Engineering Presentation


Front-End Design
DescriptionReset metastability signoff has become imperative as modern designs increasingly rely
on software resets. This introduces risks where parts of the design remain under reset
while others operate functionally, and improper handling of resets across power
domains can potentially lead to chip failures. Reset controllers are designed to ensure
no reset metastability issues occur, but they may miss metastability arising from
asynchronous crossings or inadequate synchronization of reset signals.
Traditional methodologies like STA, functional verification, and CDC do not catch all
reset metastability issues to ensure signoff. Primitive static solutions face capacity
limitations, prohibitively high noise, and convoluted flows. Linking static signoff to
simulation or formal methods results in slower turnaround times and missed issues.
This presentation introduces a novel Reset Domain Crossing (RDC) methodology that
overcomes these challenges. The approach is scalable to designs of any size and
significantly reduces time and effort, ensuring robust reset metastability signoff. It delivers
faster turnaround times, improved coverage, and a streamlined flow compared to traditional
methods.
Networking
Work-in-Progress Poster


DescriptionThe efficient execution of scientific workloads on contemporary architectures is challenged by energy-intensive computations, such as matrix-vector multiplications (MVM). This energy expenditure arises substantially from the transfer of data between the memory and processing units. Systolic arrays, including those based on in-memory path-based computing, have been proposed to expedite demanding MVM operations, leading to significant energy efficiency. However, the size of matrices in scientific workloads is often much larger than the size of the available systolic arrays. Different mappings of these large matrices to relatively smaller crossbar arrays lead to different switchings of non-volatile memory devices and, hence, differences in energy expenditures. Computing an energy-efficient runtime mapping of MVM computations onto systolic arrays to minimize switching of non-volatile memory devices is a hitherto unexplored challenge.
In this paper, we present a framework named Hamiltonian for efficiently scheduling computations on path-based computing systolic arrays when there are constraints on the number of processing elements. We achieve this by introducing a distance metric between different computations and solving the problem by finding a set of Hamiltonian cycles in a complete graph. We evaluate our framework using ten SuiteSparse matrices, and our experimental results demonstrate that Hamiltonian improves power efficiency by 30% and reduces latency by 30% on average compared with the previous state of the art.
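A toy sketch of the overall flow as described (hypothetical; the paper's distance metric and exact cycle construction are not reproduced here): build a pairwise switching-distance matrix between computations, then order them with a nearest-neighbour tour so that consecutive mappings reuse device states.

import numpy as np

def schedule_by_hamiltonian_cycle(dist):
    """Greedy nearest-neighbour tour over a complete graph.
    dist: (n, n) symmetric matrix where dist[i, j] approximates the number
    of non-volatile device switches incurred when computation j is mapped
    right after computation i (an assumed cost model). Returns a visiting
    order that heuristically minimises total switching."""
    n = len(dist)
    unvisited = set(range(1, n))
    order = [0]
    while unvisited:
        last = order[-1]
        nxt = min(unvisited, key=lambda j: dist[last, j])
        order.append(nxt)
        unvisited.remove(nxt)
    return order  # closing the cycle back to order[0] is implicit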
Engineering Poster


DescriptionA frequent and critical step in layout design and chip integration verification is the conversion of design data from Open Access, a format suitable for editors and construction tools, to OASIS®, the format used by design rule checking, logic-to-schematic verification, and manufacturing mask creation. We present a method for distributed generation and rapid merging of OASIS and show how we achieved translation times of under 10 minutes for full-chip OASIS generation, compared to 5 hours using the prior tool. This reduction in runtime enabled an effective doubling of final closure design throughput, reducing schedule pressure and increasing design integration productivity.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionCrossbar-based computing-in-memory (CIM) systems facilitate large-scale parallel multiply-and-accumulate (MAC) operations, while a domain-specific compiler (DSC) plays a pivotal role in optimizing the deployment of neural network algorithms on such systems. With the development of multi-core and large-core architectures, some key compiler problems such as highly parallel processing, resource utilization, and crossbar array assignment methods have not been solved. For low-latency application scenarios, we have designed a resource scheduling strategy for our hardware system based on stream data processing to reduce the latency caused by intra-core and inter-core communication. Additionally, a weight mapping strategy has been developed to maximize the potential of crossbar arrays in the deployment of convolutional neural networks (CNNs). Experimental results on our multi-core eFlash-based CIM system-on-chip (SoC) demonstrate that these two technologies help CNNs achieve a 76% reduction in latency, a 30% improvement in resource utilization, and a crossbar-array utilization rate of up to 94.7%.
Networking
Work-in-Progress Poster


DescriptionThe inherent parallelism of convolution neural network (CNN) inference enables efficient and flexible data processing. Consequently, highly parallel compute paradigms are widely adopted to achieve high performance for CNNs. Moreover, exploiting sparsity has become an indispensable technique for accelerating CNN inference. However, fully exploiting two-sided random sparsity (weights and input activations) can hinder the parallel processing in CNNs. The non-uniformity of sparse inputs makes synchronization overhead significant due to the requirement of input matching to ensure sufficient valid input pairs for subsequent parallel processing units. While various architectures have been proposed, the challenge remains inadequately addressed. In this paper, we introduce a stride-aware data compression method coupled with weight-stationary dataflow to fully leverage the parallel characteristics of CNNs for accelerated inference at low hardware cost and power consumption. Experimental results demonstrate that our technique achieves speedups of 1.17×, 1.16×, 1.32×, and 0.82× compared to the recent accelerator SparTen for VGG16, GoogLeNet, ResNet34, and MobileNetV1, respectively. Furthermore, FPGA implementation of our core reveals a notable 4.8× reduction in hardware size and a 5.25× enhancement in energy efficiency compared to SparTen.
Networking
Work-in-Progress Poster


DescriptionFloorplanning is a critical step in chip design, primarily aimed at minimizing overall area by optimally arranging modules. This task can be formulated as a rectangle-packing problem, which is known to be NP-hard. Traditional solutions, including meta-heuristics, analytical methods, and reinforcement learning, are constrained by the von Neumann architecture, resulting in high computational cost and low efficiency. Probabilistic computing has demonstrated significant potential for efficiently solving NP-hard problems. In this work, we propose an efficient heterogeneous probabilistic computing (EHPC) architecture based on volatile RRAM to accelerate floorplanning, by directly mapping probabilistic computations to devices and circuits. Furthermore, we embed the floorplanning problem into the algorithm of the EHPC architecture. Hardware simulation results of n300 (GSRC benchmark) demonstrate that our EHPC architecture significantly boosts computation speed, ranging from 20× to 15000×, with an area expansion <1.7% compared to the best-performing conventional methods, highlighting its advantages in speed, efficiency, and scalability for solving NP-hard problems.
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
DescriptionIn electronic design automation, logic optimization operators play a crucial role in minimizing the gate count of logic circuits. However, their computational demands are high. Operators such as refactor conventionally form iterative cuts for each node, striving for a more compact representation, a task that fails in 98% of attempts on average. Prior research has sought to mitigate computational cost through parallelization. In contrast, our approach leverages a classifier to prune unsuccessful cuts preemptively, thus eliminating unnecessary resynthesis operations. Experiments on the refactor operator using the EPFL benchmark suite and 10 large industrial designs demonstrate that this technique can speed up logic optimization by 3.9× on average compared with the state-of-the-art ABC implementation.
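A schematic sketch of the idea (hypothetical names; the actual features and model are the paper's): score each cut with a lightweight scikit-learn-style classifier and only invoke the expensive resynthesis routine on cuts predicted to succeed.

def refactor_with_pruning(nodes, extract_cut, cut_features, classifier,
                          resynthesize, threshold=0.5):
    """Skip resynthesis for cuts the classifier predicts will not improve.
    extract_cut(node)  -> cut object (assumed helper)
    cut_features(cut)  -> feature vector, e.g. cut size, level, fanout
    classifier         -> object with a predict_proba-style interface
    resynthesize(cut)  -> replacement subnetwork or None"""
    improved = 0
    for node in nodes:
        cut = extract_cut(node)
        p_success = classifier.predict_proba([cut_features(cut)])[0][1]
        if p_success < threshold:
            continue                     # prune: avoid wasted resynthesis
        if resynthesize(cut) is not None:
            improved += 1
    return improved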
Networking
Work-in-Progress Poster


DescriptionState Space Model (SSM)-based machine learning architectures have recently gained attention for processing sequential data. A recent sequence-to-sequence SSM, Mamba, offers competitive accuracy over state-of-the-art transformers with higher processing efficiency. Competitive performance with lower complexity makes Mamba a compelling choice for edge ML applications. However, no hardware accelerator design frameworks have been specifically optimized for Mamba in edge scenarios to date. To fill this gap, we propose eMamba, an end-to-end framework for designing and deploying Mamba hardware accelerators for edge applications. eMamba enhances efficiency by replacing complex normalization layers with hardware-friendly alternatives while also approximating SiLU activation and exponent calculations. It also employs an approximation-aware Neural Architecture Search (NAS) to identify the best hyperparameters for edge deployment. The entire design is quantized and evaluated on AMD-ZCU102 using a mmWave radar-based human pose estimation application. eMamba achieves 9.95× higher throughput and 5.62× lower latency using 63× fewer parameters while maintaining competitive accuracy with state-of-the-art solutions.
Tutorial


AI
Sunday Program
DescriptionEdge-based Large Language Models (edge LLMs) can preserve the promising abilities of LLM while ensuring user data privacy. Additionally, edge LLMs can be utilized in various fields without internet connectivity constraints. However, edge LLMs face significant challenges in training, deployment, and inference. Limitations in memory storage, computational power, and data I/O operations can hinder the deployment of advanced LLMs on edge devices. These constraints often result in poor performance in customization, real-time user interaction, and adaptation to novel situations. Traditional acceleration methods, primarily designed for advanced computation platforms, may not be optimal for all types of edge devices. As a complementary solution, Compute-in-Memory (CiM) architectures based on emerging non-volatile memory (NVM) devices offer promising opportunities. These architectures, having demonstrated numerous advantages in traditional neural networks, can help overcome the computational memory bottleneck of edge devices and reduce competition for core computational resources. Through the introduction of software-hardware co-design and co-optimization methods, NVCiM can significantly enhance edge LLM performance in resource-limited environments. Moreover, NVCiM-based edge LLM systems are more cost-effective compared to LLMs running on high-performance computing devices. This makes them suitable for various personalized applications, particularly in healthcare and medical fields.
Section 1: Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices (Jinjun Xiong)
Section 2: Enabling On-Device Large Language Model Personalization with Self-Supervised Data Selection and Synthesis (Yiyu Shi)
Section 3: Robust Implementation of Retrieval-Augmented Generation on Edge-based Computing-in-Memory Architectures (Yiyu Shi)
Section 4: NVCiM-PT: An NVCiM-assisted Prompt Tuning Framework for Edge LLMs (Yiyu Shi)
Section 5: Tiny-Alignment: Bridging Automatic Speech Recognition with LLM on Edge (Jinjun Xiong)
Section 6: Analysis: Do Edge Large Language Models Allow Fair Access to Healthcare for Language-Impaired Users? (Yiyu Shi)
Networking
Work-in-Progress Poster


DescriptionTransformer-based models deliver high performance, but this comes at the expense of substantial computational overhead and data movement, particularly in the attention layer. Although self-attention exhibits multi-level sparsity, effectively leveraging sparsity to design hardware and dataflow that optimize both matrix multiplications (MMs) and softmax computation remains a significant challenge,
including the need for flexibility to simultaneously support element-wise and bit-wise sparsity, as well as the difficulty in fully identifying the opportunities sparsity offers to optimize softmax computation. In this paper, we propose EMSTrans, a specialized Transformer accelerator that addresses these challenges through two innovations. First, we propose a power-of-2 (PO2) based fusion approach and a sparsity-aware bit-element-fusion (SABEF) systolic array. This approach utilizes the systolic-gating state shifting method to leverage both bit-level and element-level sparsity, significantly reducing energy consumption and latency in MM operations. Second, we propose the two-stage operation fusion (TSOF) softmax computation scheme. It allows the sparse score matrix (S) to be stored in a pruned and quantized form using the proposed subtraction-based quantization pruning method (SBQP). Through the full utilization of sparsity, the proposed accelerator achieves up to 93.34% energy saving and 1.98× acceleration for MM computation, along with up to 79% memory access reduction for softmax computation, achieving a 1.93× improvement in the accelerator's energy efficiency.
Research Manuscript


Systems
SYS3: Embedded Software
DescriptionInterlaced Magnetic Recording (IMR) drives have been regarded as promising hard disk drives (HDDs) to meet the ever-growing storage demand in the post-AI era. Among various technologies, IMR drives achieve increased storage capacity by overlapping tracks in an interlaced fashion; however, updating data on overlapped tracks necessitates rewriting up to two tracks, which can degrade performance. Previous strategies have attempted to reduce track rewrites by either delaying the timing of overlapping based on storage space usage or relocating frequently updated data to non-overlapped tracks through update-frequency-based hot/cold data separation. In contrast, this paper proposes a novel approach that leverages data deduplication as a natural solution to the rewrite challenge in IMR drives. Unlike conventional hot/cold separation, this paper utilizes deduplication metadata, specifically reference counts, as a direct indicator for data relocation. The rationale behind the proposed approach is that deduplicated data, which remains static unless reference counts drop to one, can be stored or migrated to overlapped tracks without inducing future track rewrites. The proposed approach diverges from traditional frequency tracking by combining data deduplication with IMR technology to optimize data placement and reduce rewriting. Evaluation results show that the proposed scheme effectively reduces the accumulated read/write latency by 37.57% on average compared with state-of-the-art data management on IMR drives with data deduplication enabled.
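A simplified placement rule in the spirit of the description above (hypothetical sketch): blocks whose deduplication reference count exceeds one are treated as stable and may be placed on overlapped (bottom) tracks, while unique blocks stay on non-overlapped (top) tracks to avoid future rewrites.

def choose_track(ref_count, top_track_has_space):
    """Pick a track class for a block based on its dedup reference count.
    ref_count > 1 means several logical addresses share this block, so it
    is not updated in place and is safe on an overlapped track."""
    if ref_count > 1:
        return "overlapped"          # static data: no rewrite penalty expected
    if top_track_has_space:
        return "non_overlapped"      # unique, possibly frequently updated data
    return "overlapped"              # fall back when top tracks are full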
Engineering Special Session


AI
Back-End Design
Chiplet
DescriptionAs chip complexity grows, multi-die 3DIC designs are becoming a pivotal solution to meet performance and efficiency demands. This talk explores how ecosystem collaboration—spanning EDA tools, IP providers, and foundries—is leveraging AI to tackle the challenges of integration, thermal management, and design verification. Highlighting case studies, it demonstrates how AI-driven automation transforms traditionally siloed workflows into cohesive, scalable solutions, enabling faster time-to-market and innovative system designs. Attendees will gain insights into practical strategies for harnessing ecosystem synergies to address the growing demands of heterogeneous integration.
Research Manuscript


Systems
SYS3: Embedded Software
DescriptionOn-device training enables the model to adapt to user-specific data by fine-tuning a pre-trained model locally. As embedded devices become ubiquitous, on-device training is increasingly essential since users can benefit from the personalized model without transmitting data and model parameters to the server. Despite significant efforts toward efficient training, on-device training still faces two major challenges: (1) the heavy and offline computation workload required to identify important parameters for updating; (2) the prohibitive cost of multi-layer backpropagation, which strains the limited resources of embedded devices. In this paper, we propose an algorithm-system co-optimization framework that enables self-adaptive and on-device model personalization for resource-constrained embedded devices. To address the challenge of parameter selection, we propose Alternant Partial Update, a method that locally identifies essential parameters without requiring retraining or offline evolutionary search. To mitigate backpropagation costs, we introduce Gradient Condensing to condense the gradient map structure, significantly reducing the computational complexity and memory consumption of backpropagation while preserving model performance. Our framework is evaluated through extensive experiments using various CNN models (e.g., MobileNet, MCUNet) on embedded devices with minimal resources (e.g., OpenMV-H7 with less than 1MB SRAM and 2MB Flash). Experimental results show that our framework achieves up to 2X speedup, 80% memory saving, and 30% accuracy improvement on downstream tasks, outperforming SOTA approaches.
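As a rough software analogue of the partial-update idea (hypothetical sketch; the paper's Alternant Partial Update and Gradient Condensing are more elaborate), only a small, locally chosen subset of parameters is updated each round, bounding both backpropagation cost and memory:

import numpy as np

def partial_update(params, grads, update_fraction=0.05, lr=1e-3):
    """Update only the fraction of parameters with the largest gradient magnitude.
    params, grads: dicts of name -> np.ndarray (assumed layout). The selection
    is recomputed on-device, so no offline search is required."""
    flat = np.concatenate([g.ravel() for g in grads.values()])
    k = max(1, int(update_fraction * flat.size))
    threshold = np.partition(np.abs(flat), -k)[-k]   # magnitude cut-off
    for name, g in grads.items():
        mask = np.abs(g) >= threshold
        params[name] -= lr * g * mask                # untouched weights stay frozen
    return params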
Hands-On Training Session


DescriptionAs semiconductor and system design workloads grow in complexity, enterprises require scalable, flexible, and secure cloud solutions to meet their dynamic compute demands. This workshop will showcase Cadence True Hybrid Cloud, a groundbreaking solution that enables customers to seamlessly transition their EDA and system design workloads to a hybrid cloud environment. Attendees will gain insights into how Cadence’s secure, high-performance hybrid cloud architecture integrates with on-prem infrastructure, providing on-demand scalability, optimized cost-efficiency, and cloud-native capabilities without disrupting existing workflows. Experts from Cadence will present real-world deployment strategies and best practices demonstrating how leading enterprises are accelerating design cycles and maximizing efficiency through Cadence True Hybrid cloud solution. Join us to explore how Cadence is revolutionizing hybrid cloud adoption for EDA and Systems Design workloads.
Networking
Work-in-Progress Poster


DescriptionA systolic array is a highly efficient architecture for executing regular and parallel computations, yet its simplicity reduces programmability. In contrast, an elastic Coarse-Grained Reconfigurable Array (CGRA) trades simplicity for a more flexible interconnect and execution mode that can run arbitrary dataflow computations. However, efficiently executing highly regular systolic-style computations on a CGRA is challenging due to mapping algorithm limitations, data reuse constraints, and mismatched execution controls between the systolic and elastic paradigms. This paper explores the tradeoffs of executing systolic-style matrix-matrix multiplication on an elastic CGRA synthesized with a state-of-the-art 3-nm FinFET node in terms of computational throughput and power, performance, and area (PPA).
Engineering Presentation


IP
DescriptionMIPI I3C, a next-generation interface, provides a high-performance, low-power, low-pin-count serial link and is ideal for numerous applications. There is an urgent need for a zero-power subsystem to power-gate the I3C target and related peripherals. The I3C subsystem should be able to handle remote communication events, which are random and unpredictable, so simple power duty-cycling is not efficient. Present solutions have high latency and are susceptible to data omissions.
An architecture to minimize power consumption of an I3C interface subsystem is proposed.
Circuit components of the I3C controller are powered up and down depending on the relevance of the bus activity to the system. The I3C target provides a wake-up signal to the SoC-side controller when needed and can independently decide to enable itself and the SoC. This eliminates the enable pin altogether and optimizes system performance by activating components only when they are needed.
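A behavioural sketch of the wake-up decision described above (hypothetical Python model, not the RTL): the always-on target monitor classifies bus activity and only asserts the wake signal toward the SoC-side controller for traffic addressed to this system, keeping everything else powered down.

def target_monitor(frame, own_address, soc_asleep):
    """Decide which domains to power up for an observed I3C bus event.
    frame: dict with at least 'address' and 'is_broadcast' (assumed fields).
    Returns a set of domains to enable; an empty set means stay in low power."""
    relevant = frame["is_broadcast"] or frame["address"] == own_address
    if not relevant:
        return set()                         # ignore traffic for other targets
    domains = {"target_datapath"}
    if soc_asleep:
        domains.add("wake_soc_controller")   # replaces a dedicated enable pin
    return domains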
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionVector similarity search (VSS) is crucial in many AI applications, such as few-shot learning (FSL) and approximate nearest-neighbor search (ANNS), but it demands significant memory capacity and incurs substantial energy costs for data transfers during large-scale comparisons. Various in-memory search technologies have been developed to improve energy efficiency, with NAND-based multi-bit content-addressable memory (MCAM) standing out as a promising solution for its high density and large capacity. MCAM can operate in exact-search (ES) mode, supporting only perfect matches with low energy cost, or in approximate-search (AS) mode, enabling flexible VSS. However, AS mode incurs significant energy waste when comparing queries with non-target stored vectors.
To address this issue, we propose Hybrid-M, a 3D NAND-based in-memory VSS architecture that integrates both modes into a single hybrid matching process, using ES mode as a filter to reduce redundant searches for AS mode. We apply three techniques to optimize this integration: range encoding for multi-level cells (MLC) to enhance filtering, search voltage shifts to mitigate the impact on AS accuracy and reduce matching currents, and a filtering-aware training method to further improve reliability and energy efficiency. Results show that Hybrid-M achieves comparable accuracy while reducing energy consumption by 67% to 83% compared to MCAM-based VSS using only AS mode, across various many-class FSL and ANNS workloads.
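The hybrid matching process can be pictured with a small software analogue (hypothetical sketch; the MCAM specifics and energy modelling are omitted): an exact-match filter over a coarse code first discards most stored vectors, and only the survivors go through the expensive approximate distance search.

import numpy as np

def hybrid_search(query, stored, coarse_code, k=5):
    """Two-stage search: exact filter, then approximate ranking.
    coarse_code(v) -> hashable code; equality stands in for ES mode,
    while the L2 ranking over survivors stands in for AS mode."""
    q_code = coarse_code(query)
    survivors = [i for i, v in enumerate(stored) if coarse_code(v) == q_code]
    if not survivors:                 # filter too strict: fall back to full AS
        survivors = list(range(len(stored)))
    dists = [np.linalg.norm(query - stored[i]) for i in survivors]
    order = np.argsort(dists)[:k]
    return [survivors[i] for i in order]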
Research Special Session


AI
DescriptionSilicon design for Augmented Reality (AR) presents a unique challenge of enabling high performance applications in a small form factor. AR products require machine learning algorithms, neural networks, image signal processing applications, and more to run on energy-efficient platforms that are heavily constrained in area footprint. The third dimension, however, usually has ample space. To enable these applications to be energy-efficient, memory accesses to off-accelerator and/or off-chip can be prohibitively expensive. This paper presents recent results in utilizing a combination of face-to-face stacking technology with hybrid bonding and circuit design techniques to enable more immersive applications such as group calling with Pixel Codec Avatars to be run on AR systems-on-chip (SoCs) at 2× better energy-efficiency compared to equivalent 2D accelerators at iso-footprint. Moreover, we show that this combination of 3D integration technology and circuit design techniques can be extended beyond just digital systems-on-chip to other AR chips such as display drivers, enabling higher bit-depth displays at lower power with smaller area footprint.
Networking
Work-in-Progress Poster


DescriptionReal-world path planning for Autonomous Mobile Robots (AMR) requires the ability to navigate safely and efficiently in stochastic environments. At the same time, AMRs face stringent power limitations. As such, fast and efficient motion and path-planning solvers are needed at the edge to enable the required long-duration performance of next-generation robotic systems. In this work, we explore the potential of hardware acceleration on Field Programmable Gate Arrays (FPGAs) to achieve this dual performance requirement, targeting the state-of-the-art sampling-based motion planning algorithm, Model Predictive Path Integral (MPPI). Our preliminary results indicate that FPGAs can provide both improved runtimes—up to 68% and 82%—and more energy-efficient operation—up to 74% and 49%—than GPUs and CPUs respectively, paving the way for future robotic applications.
Exhibitor Forum


DescriptionThe rise of software-defined systems has fundamentally reshaped the product development landscape. In this new paradigm, software architecture no longer follows hardware—it leads it.
As functionality and innovation become increasingly defined by code, semiconductor and systems development organizations must navigate unprecedented complexity: from the exponential growth of cross-domain data to the unique challenges of 3D IC and chiplet-based architectures, as well as the need for requirements traceability and collaboration across mechanical, electrical, and software domains. Insights must also be synthesized from disparate sources—a challenge compounded by the complex task of maintaining data links across various levels of IC and system integration.
This presentation explores the concept of the Semiconductor Digital Thread—a holistic, integrated approach to managing the complete semiconductor lifecycle in the era of software-defined design. Built on a seamless and fully traceable flow of data throughout the product development lifecycle and a unified environment for hardware/software co-design, the Semiconductor Digital Thread can unlock new levels of productivity, innovation, and quality.
By integrating IP lifecycle management and data management with EDA, PLM, and DevOps toolchains, semiconductor organizations can enable a data-driven digital thread that spans from initial system-level architecture through physical design and verification. This digital thread provides the foundation for AI-powered, real-time analytics, helping engineers and business leaders alike to make informed design decisions that reduce risk and accelerate time to market.
Whether your focus is optimizing silicon cost, meeting aggressive power/performance targets, ensuring compliance, or boosting efficiency through effective IP reuse, the Semiconductor Digital Thread offers a blueprint for competitive advantage in the era of software-defined, AI-powered, silicon-enabled product innovation.
Hands-On Training Session


DescriptionThis presentation will compare the process of creating an HDL developer workspace through Perforce P4 version control check-out and Synopsys VCS build methods against the use of Amazon FSx for NetApp ONTAP “Snapshot/FlexClone” to swiftly provide a pre-assembled, pre-validated HDL workspace. Users can anticipate a significant decrease in elapsed wall-clock time with the latter approach. Additionally, they can expect a reduction of 50% or more in compute, EDA license, and storage capacity usage.
Engineering Poster
Networking


DescriptionIn the realm of integrated circuit (IC) design, the efficiency and accuracy of Layout Versus Schematic (LVS) extraction are critical for ensuring design integrity and functionality. Handling LVS for cutting-edge automotive designs presents significant challenges due to the resource-intensive nature of "dirty" designs, which require high runtime and memory. The presence of shorts and high comparison times in these designs drives up physical verification (PV) cycles, leading to substantial delays in the turnaround time at the sign-off stage.
A recent case study highlights the stark contrast between LVS extraction and traditional comparison methods. The LVS extraction process was completed in a remarkable 5 hours, demonstrating its efficiency and reliability. In contrast, the traditional comparison method failed to complete even after running for 2 days, ultimately aborting due to excessive memory consumption exceeding 2TB. This significant disparity underscores the urgent need for more efficient LVS extraction techniques in IC design.
Several features have been implemented to improve performance and accuracy. These include generating an automatic list of hierarchical cells (hcells) for hierarchical LVS comparison and removing hcells causing false discrepancies. This approach reduces the list of hcells for the comparison stage, leading to shorter runtimes. Additionally, it is crucial to identify and isolate text shorts in the early stages of design. Techniques such as Detect Shorts Through High-Resistance Layers (LVS Softchk) and performing Electrical Rule Checks (ERC) ensure compliance with specified rules, reducing runtime, memory usage, and debugging complexity compared to normal LVS batch runs.
Engineering Presentation


AI
Systems and Software
DescriptionEmbedded devices are powered by batteries, so longer battery life is a critical factor in their commercial success. This paper proposes an optimization technique for system-on-chip (SOC) and software design to achieve power savings while the device needs periodic wake-ups from deep sleep, also known as warm boots. For an efficient warm boot, some components or domains are kept always-on for quick booting, while others are switched off or clock-gated, like the dynamic random-access memory (DRAM). Thus, after a warm boot, the software implementation must wait for DRAM initialization before starting operations, thereby increasing the system response time. To address this issue, the paper proposes utilizing accelerated RAM (XRAM) in the always-on domain, for performing selected operations before DRAM initialization completes, like initializing peripheral hardware components during the warm boot sequence. We propose a novel algorithm to identify scattered functions in the software that can operate without DRAM data access, gather them, and host them in the XRAM. The algorithm jointly optimizes the required size of the XRAM in a SOC, and the functions that can be hosted on the XRAM, based on the DRAM initialization time. These functions can be operated from XRAM during warm boot, while the DRAM is being initialized in parallel, thereby reducing the total response time and also reducing the power consumption during the periodic warm boot sequences. The paper presents experimental results for wake-up in a periodic workload running on a generic SOC architecture, showing power optimization up to 9%.
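The selection step can be summarised with a greedy sketch (hypothetical; the paper's joint XRAM-sizing optimization is more involved): among functions proven to be DRAM-independent, pack those with the largest boot-time contribution into the XRAM budget so they can run while DRAM initialises.

def select_xram_functions(functions, xram_budget_bytes, dram_init_time_us):
    """Greedy packing of DRAM-independent functions into XRAM.
    functions: list of dicts with 'name', 'size_bytes', 'exec_time_us',
    'needs_dram' (assumed fields from static analysis). Only as much work
    as fits in the DRAM initialisation window is useful."""
    eligible = [f for f in functions if not f["needs_dram"]]
    eligible.sort(key=lambda f: f["exec_time_us"] / f["size_bytes"], reverse=True)
    chosen, used, covered = [], 0, 0.0
    for f in eligible:
        if covered >= dram_init_time_us:
            break                      # extra work no longer hides behind DRAM init
        if used + f["size_bytes"] > xram_budget_bytes:
            continue
        chosen.append(f["name"])
        used += f["size_bytes"]
        covered += f["exec_time_us"]
    return chosen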
Engineering Poster
Networking


DescriptionThis paper presents an advanced approach to cell-aware testing, focusing on defect-oriented methodologies to enhance IC reliability. Traditional fault models, such as stuck-at and transition faults, are insufficient for detecting latent defects in complex standard cells. The study introduces a methodology based on Layout Parasitic Extracted (LPE) netlists, leveraging SPICE simulations to model defects like bridging and open faults at a finer granularity.
A key improvement is the visualization of LPE netlists using Siemens CMG tools to ensure accurate extraction, reducing false defect reports and improving debugging. Additionally, critical regions in layouts are identified using timing analysis, enabling the creation of User Defined Fault Models (UDFM) that capture latent defects under various PVT and RC conditions.
The paper also highlights the importance of Cell Neighborhood Bridge (CNB) analysis, which detects inter-cell defects that conventional models overlook. By optimizing CMG settings and collaborating closely with foundries, the approach significantly enhances defect coverage.
This novel methodology has led to a twofold improvement in productivity by reducing IC design cycle times and improving test accuracy. The enhanced defect identification process contributes to lower Defective Parts Per Million (DPPM) rates, particularly in automotive applications.
Research Special Session


Design
DescriptionThe demand for rapid, complex, and optimized chip designs for various applications requires enhancements in design automation. Although AI/ML can automate and optimize chip design processes, quantum algorithms have shown a significant gain in reducing area, power consumption, and nodes. Quantum or quantum-inspired algorithms (QIA) can be run either on standard processors or quantum computers, showing significant gains in optimizing a chip's power consumption as compared to pure AI/ML-based design approaches, even for optimizations in large solution spaces (>10^500). Research on these techniques is based on the concept of making semiconductor matrix design nodes equivalent to Quantum Hamiltonians with high complexity, which are then solved using QIA to search for the lowest possible energy-level states. We use microprocessors designed on a 7nm technology node to benchmark AI/ML tools as compared to those aided by QIA. The research also demonstrates that GPU designs using QIA can experience an exponential advantage over AI/ML, as the number of nodes increases.
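For intuition only, quantum-inspired optimization of this kind is commonly formulated as minimising an Ising/QUBO energy and searching for low-energy states; a toy simulated-annealing sketch over a QUBO matrix (not the benchmarked tools or flow) is shown below.

import numpy as np

def anneal_qubo(Q, steps=20000, t0=2.0, t1=0.01, seed=0):
    """Search for a low-energy bit assignment x minimising x^T Q x.
    Q: (n, n) QUBO matrix encoding the design-optimization objective
    (an assumed encoding; real flows derive Q from netlist/power models)."""
    rng = np.random.default_rng(seed)
    n = len(Q)
    x = rng.integers(0, 2, n)
    energy = x @ Q @ x
    for s in range(steps):
        t = t0 * (t1 / t0) ** (s / steps)            # geometric cooling schedule
        i = rng.integers(n)
        x_new = x.copy()
        x_new[i] ^= 1                                # flip one bit
        e_new = x_new @ Q @ x_new
        if e_new < energy or rng.random() < np.exp((energy - e_new) / t):
            x, energy = x_new, e_new
    return x, energy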
Engineering Poster
Networking


DescriptionHigh-Level Synthesis (HLS) offers significant advantages over traditional design methodologies in implementing complex digital systems. This work, conducted jointly with Politecnico di Milano, demonstrates these benefits through the design of a Keyword Spotting System (KWS) that recognizes eight short vocal commands using SystemC modeling along with the Cadence Stratus HLS tool.
The basic structure of this KWS design includes a Mel-Frequency Cepstral Coefficients (MFCC) module for feature extraction and a Deep Neural Network (DNN) module for command recognition.
Once the baseline design is ready, this flow makes it easy to implement different optimized versions of the design by adding Stratus HLS tool directives at key points inside the SystemC code and/or activating specific tool options (e.g., for low-power optimizations). Several versions of the KWS design were analyzed by selectively enabling power, area, and performance optimizations, and every optimized version of the design was successfully synthesized at clock frequencies between 100 and 400 MHz.
The proposed HLS flow not only accelerates the design cycle but also permits optimizing the design even at advanced stages with minimal impact on the schedule. Specifically, it enables the exploration of multiple architectural solutions, optimizing latency, throughput, area, and power consumption.
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
DescriptionMulti-agent frameworks with Large Language Models (LLMs) have become promising tools for generating code in general-purpose programming languages using test-driven development, allowing developers to create more accurate and robust code. However, their potential has not been fully unleashed for domain-specific programming languages, where each specific domain exhibits unique optimization opportunities for customized improvement. In this paper, we take the first step in exploring multi-agent code generation for quantum programs. We demonstrate examples of AI-assisted quantum error prediction and correction, showing the effectiveness of our multi-agent framework in reducing the quantum errors of generated quantum programs.
Networking
Work-in-Progress Poster


DescriptionOptimizing Hardware Description Language (HDL) code is essential for enhancing power, performance, and area metrics. Despite progress in HDL code generation, challenges remain in RTL optimization. We first introduce RTLOpt, a dataset featuring Verilog examples for pipelining and clock gating for evaluation. Additionally, we propose Mascot, a multi-agent framework that integrates domain knowledge into LLMs for RTL optimization. Mascot employs iterative feedback loops to refine HDL code based on syntax, functionality, and PPA metrics. Empirical results show Mascot improves PPA by 20% for larger LLMs and 10% for smaller ones, establishing it as a foundational approach for HDL code optimization.
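The iterative feedback loop can be outlined as follows (a hypothetical sketch; the agents, checkers, and metrics are placeholders for the framework's actual components):

def optimize_rtl(rtl, llm_rewrite, check_syntax, check_equivalence,
                 estimate_ppa, max_iters=5):
    """Refine RTL until it is syntactically clean, functionally equivalent,
    and improves the PPA estimate, feeding each failure back as guidance."""
    best, best_ppa = rtl, estimate_ppa(rtl)
    feedback = "apply pipelining/clock-gating style optimizations"
    for _ in range(max_iters):
        candidate = llm_rewrite(best, feedback)
        if not check_syntax(candidate):
            feedback = "fix syntax errors in the previous attempt"
            continue
        if not check_equivalence(rtl, candidate):
            feedback = "preserve functional equivalence with the original RTL"
            continue
        ppa = estimate_ppa(candidate)
        if ppa < best_ppa:                 # assume a lower composite score is better
            best, best_ppa = candidate, ppa
            feedback = "further reduce power and area without hurting timing"
    return best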
Research Manuscript


Design
DES5: Emerging Device and Interconnect Technologies
DescriptionComputational-In-Memory (CIM) architectures have emerged as energy-efficient solutions for Artificial
Intelligence (AI) applications, enabling data processing within memory arrays and reducing the bottleneck associated with data transfer. The rapid advancement of AI demands real-time on-chip learning but implementing this with CIM architectures poses significant challenges, such as limited parallelism and energy-efficiency during training and inference. In this paper, we propose a novel CIM architecture specifically designed for on-chip learning applications, which capitalizes on the unique properties of Spin Orbit Torque (SOT) technology to enhance both parallelism and energy-efficiency in computation. The proposed architecture incorporates a bulk-write mechanism for SOT-cell based arrays, enabling efficient weight updates during on-chip training. Additionally, we develop a scheme to process vector elements concurrently for vector-matrix multiplications during inference. To achieve this, we design multi-port bit-cell access capabilities along with their associated control mechanisms. Simulation results show a 5.82× reduction in latency and a 3.20× improvement in energy-efficiency compared to standard SOT-MRAM based CIM, with negligible overhead.
Engineering Presentation


AI
Back-End Design
DescriptionMaintaining the accuracy and consistency of Process Design Kits (PDKs) in the rapidly evolving semiconductor design industry is critical for ensuring high-quality integrated circuit (IC) production. Conventional techniques for PDK library comparisons, like rule-based checks and manual inspections, are time-consuming and prone to human error.
More specifically, PDK models are based on silicon data from "Golden GDS" layouts, which serve as the benchmark for model accuracy. As a device evolves, its physical layout (PCell) may need updates to accommodate model fine tuning or improve performance. Ensuring these updates remain consistent with the original Golden GDS is crucial for maintaining model accuracy.
This paper presents a novel method to improve the internal layout comparison of PDK libraries using machine learning. A regular XOR comparison between the golden GDS and the reference GDS yields many false errors, and manually reviewing layout variations over the lifecycle to categorize changes as either expected or unexpected is time-consuming and resource-intensive. Our approach achieves considerable improvements in efficiency and reliability by streamlining the discovery of inconsistencies within PDK libraries through multiple supervised machine-learning techniques.
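Conceptually, the flow resembles the following sketch (hypothetical feature set and model; the paper's supervised techniques and tooling are not reproduced): geometric features of each XOR difference region are fed to a trained classifier that separates expected PCell updates from genuine inconsistencies.

from sklearn.ensemble import RandomForestClassifier

def region_features(region):
    """Toy geometric features of one XOR difference polygon (assumed fields)."""
    return [region["area"], region["layer_index"], region["width"],
            region["height"], region["distance_to_device_edge"]]

def triage_xor_regions(labeled_regions, new_regions):
    """Train on previously reviewed regions, then flag only 'unexpected' ones."""
    X = [region_features(r) for r, _ in labeled_regions]
    y = [label for _, label in labeled_regions]      # 0 = expected, 1 = unexpected
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    preds = clf.predict([region_features(r) for r in new_regions])
    return [r for r, p in zip(new_regions, preds) if p == 1]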
Engineering Presentation


Front-End Design
Chiplet
DescriptionWith the increasing size and complexity of chip designs, the power budget has become increasingly important during the chip signoff stage. It can significantly impact architecture, layout, packaging, and production. Traditional evaluation methods rely on the netlist and are performed in the later stages of development, including synthesis, placement, routing, and post-simulation. These evaluations typically take 1–3 months or longer, leaving insufficient time to optimize power based on gate-level netlist results. This delay eliminates the opportunity for architectural or algorithmic adjustments.
RTL power estimation offers a fast, simple, and efficient approach to predict power consumption during the RTL design stage. However, there are substantial differences between the RTL code and the final gate-level netlist. Achieving a reasonably good correlation between RTL and netlist power is critical. Various complex backend implementations, such as high-fanout buffer trees and repeaters, significantly affect RTL power estimation. While good correlation has been achieved for registers, memory, and clock power, large discrepancies remain for combinational and buffer logic power in advanced nodes.
In this paper, we present an advanced technique to model complex buffers at the RTL stage. Using this approach, the power difference for combinational logic improved from 58.51% to 39.19%, and for buffer logic, from -86.70% to -28.57%. In another design, the power difference for combinational logic improved from 62.28% to 1.20%, and for buffer logic, from -89.81% to -67.78%.
Research Special Session


EDA
DescriptionIncreasing random process variations that impact device parameters in low-nanometer nodes are introducing unpredictable circuit delays and timing marginalities. These can cause failures under adverse operating conditions in some manufactured instances of a design. Such faulty circuits often escape manufacturing tests because current scan timing tests are generated under the assumption of a single localized delay fault in the circuit; accumulation of distributed delays in a circuit path from variations in multiple gates is not targeted because path delay tests have not proven practical. While the increasing use of at-speed functional tests does detect some variability failures, the coverage of functional tests is known to be limited. This talk examines extreme slow paths from process variations, extracting some unique characteristics that can be exploited by structural test methods to more effectively screen out many such failures. The aim is to improve test quality and DPPM levels from postproduction testing.
Engineering Poster
Networking


DescriptionDesigns with latches pose their own challenges during timing and power optimization in the place-and-route (PNR) flow. These challenges become more severe when latches are used at the interfaces of different blocks.
Latch transparency makes it difficult for PNR tools to balance optimization of internal and external timing points.
Many timing paths cross different partitions with latches at the interface. These interface latches must meet specific clock latency targets to close timing, yet EDA PNR tools struggle to perform clock tuning (CCD) on them. Controlling the clock on interface sequential elements is therefore essential to meeting timing requirements during clock tree synthesis (CTS).
To meet the above requirements, we propose two approaches in clock tree synthesis:
1) Precise latency landing on interface sequentials using skew groups and an enhanced tool algorithm
2) Timing- and placement-aware sink assignments in multi-source CTS
With these proposed approaches, we met the required insertion delay target on the interface sequentials (1% of the design's sequential elements) and achieved a 28% improvement in hold TNS and a 24% improvement in setup TNS.
Engineering Presentation


Front-End Design
DescriptionThe increasing complexity of modern VLSI designs, with high gate counts and intricate internal logic, presents significant challenges in verification. Traditional verification flows rely on lengthy regression simulations and manually defined stress scenarios, which are resource-intensive and often fall short in uncovering rare edge-case bugs. The Garbage Model (GM) methodology introduces an innovative approach to accelerate verification and enhance bug detection by injecting synthetic stress at key internal flow control signals.
This semi-random manipulation creates a wide range of corner-case scenarios that are difficult to generate using conventional methods. By targeting critical control points in the design, GM enables faster identification of bugs, achieving high functional coverage with less simulation time. Furthermore, the methodology is versatile, applicable at various levels of design, including block, cluster, and full-chip, and across different phases, such as simulation and emulation.
The GM methodology not only improves simulation efficiency but also reduces verification cycles, ensuring higher confidence in design quality while accelerating project timelines.
This presentation highlights the challenges of traditional verification flows, the innovative features of the GM methodology, and its tangible benefits in terms of productivity, scalability, and design robustness. GM represents a transformative step in verification, empowering teams to meet aggressive time-to-market goals with greater reliability.
Engineering Presentation


AI
Front-End Design
Chiplet
DescriptionVerification throughput in random test regressions is a critical challenge due to the extensive runtime required to stimulate and verify numerous operating scenarios. Traditional approaches, such as increasing parallel runs, often lead to higher resource consumption. This paper presents a novel solution leveraging Cadence SmartRun, a machine learning engine, to optimize regression runtime and resource usage.
Our approach involves analyzing the duration of initial test regressions to generate optimized execution orders and parallelism policies. This automated process reduces manual intervention, minimizes errors, and enhances resource management. By re-running regressions based on these new policies, we achieve significant reductions in runtime and resource usage.
Comparative analysis demonstrates that using Cadence SmartRun can reduce regression setup time by up to 11% and resource usage by 38%. Further, another test case showed a 17% reduction in runtime with 32% fewer resources. These improvements not only save valuable regression time but also allow for the reallocation of saved slots to other regressions, thereby enhancing overall verification throughput.
The adoption of Cadence SmartRun in verification workflows ensures efficient resource utilization, faster verification cycles, and improved time-to-market. This paper discusses the implementation, benefits, and impact of this innovative methodology on the verification process.
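Since SmartRun itself is proprietary, the sketch below only illustrates the general idea of duration-driven regression scheduling: tests are packed onto a fixed number of parallel slots longest-first, using runtimes measured in an earlier pass (the test names and durations are made up).

    # Hypothetical sketch: longest-first packing of regression tests onto a fixed
    # number of parallel slots, using durations measured in a prior run. This
    # illustrates the general idea only; it is not Cadence SmartRun's algorithm.
    import heapq

    def schedule(durations: dict[str, float], n_slots: int) -> dict[int, list[str]]:
        """Assign each test to the slot that currently finishes earliest."""
        slots = [(0.0, i) for i in range(n_slots)]          # (accumulated runtime, slot id)
        heapq.heapify(slots)
        plan: dict[int, list[str]] = {i: [] for i in range(n_slots)}
        for test, dur in sorted(durations.items(), key=lambda kv: -kv[1]):
            load, slot = heapq.heappop(slots)               # longest tests placed first
            plan[slot].append(test)
            heapq.heappush(slots, (load + dur, slot))
        return plan

    if __name__ == "__main__":
        measured = {"smoke": 120.0, "dma_stress": 5400.0, "pcie_cfg": 900.0,
                    "cache_rand": 3600.0, "boot_min": 300.0}   # seconds, made up
        print(schedule(measured, n_slots=2))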
Research Manuscript


Design
DES6: Quantum Computing
DescriptionAmplitude embedding (AE) is essential in quantum machine learning (QML) for encoding classical data onto quantum circuits. However, conventional AE methods suffer from deep, variable-length circuits that introduce high output error due to extensive gate usage and variable error rates across samples, resulting in noise-driven inconsistencies that degrade model accuracy. We introduce EnQode, a fast AE technique based on symbolic representation that addresses these limitations by clustering dataset samples and solving for cluster mean states through a low-depth, machine-specific ansatz. Optimized to reduce physical gates and SWAP operations, EnQode ensures all samples face consistent, low noise levels by standardizing circuit depth and composition. With over 94% fidelity in data mapping, EnQode enables robust, high-performance QML on noisy intermediate-scale quantum (NISQ) devices. Our open-source solution provides a scalable and efficient alternative for integrating classical data with quantum models.
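As a rough illustration of the clustering step described above (not EnQode's machine-specific ansatz optimization), the sketch below groups samples with k-means and L2-normalizes each cluster mean so it can serve as an amplitude-embedding target state.

    # Sketch of the classical preprocessing only: k-means clustering plus
    # L2 normalization of cluster means; the low-depth ansatz solve is omitted.
    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_amplitude_targets(X: np.ndarray, n_clusters: int):
        """Return unit-norm target state vectors (one per cluster) and sample labels."""
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
        centers = km.cluster_centers_
        norms = np.linalg.norm(centers, axis=1, keepdims=True)
        states = centers / np.where(norms == 0, 1.0, norms)   # valid amplitude vectors
        return states, km.labels_

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.normal(size=(256, 8))        # 8 features -> a 3-qubit amplitude register
        states, labels = cluster_amplitude_targets(X, n_clusters=4)
        print(states.shape, np.linalg.norm(states, axis=1))   # (4, 8), all ~1.0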
Research Manuscript


Security
SEC1: AI/ML Security/Privacy
DescriptionDuring collaborative inference with a cloud, it is sometimes essential for the client to shield its sensitive information. In this paper, we introduce Ensembler, an extensible framework designed to substantially increase the difficulty of conducting model inversion attacks for adversarial parties. Ensembler leverages selective model ensemble on the adversarial server to obfuscate its reconstruction. Our experiments demonstrate that Ensembler can effectively shield images from reconstruction attacks when the client keeps even just one layer, significantly outperforming baseline methods by up to 43.5% in structural similarity. At the same time, Ensembler only incurs 4.8% overhead during inference time.
Engineering Poster
Networking


DescriptionDRAM data integrity is a core requirement for any modern SoC, NoC, or PCB where DRAM memories are used anywhere in the system. It is also one of the most difficult problems to verify in today's complex memory subsystems. Beyond basic Refresh, Row Hammer and Per-Row Hammer Tracking (PRHT) are increasingly becoming important considerations for DRAM-based systems. In the latest generations of DRAM, such as DDR5 and LPDDR5, Refresh Management features have been added to help designers tackle Row Hammer challenges. This presentation discusses the innovative tools and solutions we have developed to help IP and SoC verification engineers not only achieve their verification goals for the Refresh requirements of DRAMs but also test the different aspects of Refresh Management and quantify their verification completeness by measuring what has been tested through intuitive Refresh/RFM-related functional coverage.
Engineering Poster


DescriptionMeeting project timelines and enabling quality RTM are key to entering, winning, and retaining key markets. Non-standardization of DFT flows and methodology during the design phase of an SoC leads to inefficient execution. IPs are sourced from different vendors, each following its own DFT methodology. When deployed in an SoC, they require tactical handling and additional effort. Human intervention also increases the likelihood of errors and, consequently, the number of iterations. Moreover, in large gate-count designs, it becomes challenging to disposition Design Rule Checks (DRCs) and track hierarchical scan coverage and simulation results across multiple corners and handoffs. The absence of standardized handoff procedures across teams and the misalignment of flow-related ideologies result in inefficient quality control. Hence, the need of the hour is a solution that effectively brings multiple devices to RTM on the same platform with the highest efficiency.
ENZO is a Python-based end-to-end methodology that uses custom scripts along with Electronic Design Automation (EDA) vendor solutions to enable faster debug from RTL to simulations. It enforces standard naming conventions and eases integration and hierarchical DFT constraint porting. ENZO is a scalable, adaptable, and well-documented architecture with a push-button end-to-end flow for all digital IPs and SoCs. This reduces cycle time and improves efficiency by focusing on the real design violations. The dashboards are intuitive and user-friendly, making it easy to track DFT QC progress throughout the design cycle. This work contributes to a more efficient and reliable design process, ultimately reducing time-to-market and enhancing product quality.
Research Manuscript


Design
DES4: Digital and Analog Circuits
DescriptionMatrix multiplication dominates the power consumption in compute-intensive applications such as deep neural networks (DNNs), spurring intensive investigations into power-efficient multiply-accumulate (MAC) units. Among the mainstream low-power design methodologies, voltage underscaling can achieve effective power savings yet induce timing errors that may lead to catastrophic accuracy loss. In this paper, we propose an error prediction and correction framework (denoted as EPIC) for an arbitrary MAC unit under voltage underscaling, which predicts the timing errors and samples the correct output by using a delay-tunable clock. A prediction-bit searching algorithm is proposed to enhance the prediction accuracy with low hardware cost, resulting in up to 100% accuracy. While preserving the accuracy, EPIC achieves up to 52% power savings over the corresponding MAC operating at nominal voltage. With transistor-level optimizations, EPIC incurs only 8% area and 1% power overheads, achieving 100% error correction under a voltage underscaling ratio of 0.74. Compared to state-of-the-art error resilient circuit designs, EPIC consumes 60%-88% less area. Additionally, to achieve the accuracy performance of EPIC in error-resilient applications, we propose a simulation workflow involving precise timing features, enabling an accurate simulation of voltage-underscaled MACs in large-scale applications. The experimental results show that, under voltage underscaling, the MAC with EPIC consumes 11% less power than the one without EPIC when the same accuracy as the exact implementation is required in a multi-layer perceptron (MLP).
Research Manuscript


EDA
EDA9: Design for Test and Silicon Lifecycle Management
DescriptionThe functional safety of electronic chips has become increasingly critical in sectors such as autonomous vehicles and aerospace. Standards like ISO 26262 mandate high diagnostic coverage for automotive-grade chips, necessitating extensive gate-level fault simulations. However, for large-scale industrial sequential circuits, these simulations are time-consuming, creating a significant bottleneck in chip development. Prior approaches have focused on reducing computational complexity and optimizing CPU hardware usage by minimizing redundant computations during fault propagation and leveraging bit-level parallel processing capabilities. Techniques like parallel-pattern and event-driven simulations have improved performance in combinational circuits but face limitations in sequential circuits due to timing dependencies within loops. The challenge lies in parallelizing simulations across different cycles without violating these dependencies, which is exacerbated by the complex feedback structures in strongly connected components (SCCs). In this work, we propose a novel parallel-pattern fault simulation framework that combines loop fusion with efficient event traversal to accelerate sequential circuit simulations. By compiling simple loops into larger nodes, we reduce the number of feedback events without introducing excessive redundancy. For larger SCCs, we develop specialized algorithms for selecting loop entrance nodes based on indegree analysis and implement lazy update strategies for internal nodes. This approach minimizes simulation events caused by inaccurate predictions and reduces overhead associated with false event propagation. We integrate these techniques into our simulation framework, EPICS, which strategically mixes compiled and event-driven simulations to optimize performance. Experimental results demonstrate that EPICS achieves a 5.94× speedup over state-of-the-art commercial tools while maintaining the same fault coverage.
Research Manuscript


Design
DES6: Quantum Computing
DescriptionIn this work, we aim to address the computational overhead challenge in quantum optimal control while reducing circuit latency. We propose a novel approach combining ZX-Calculus, circuit partitioning, and circuit synthesis for pulse generation. By implementing finer granularity in pulse generation and exploring equivalent circuit representations, we achieve increased parallelism and decreased latency. Our method demonstrates a 31.74% reduction in latency compared to previous work and a 76.80% reduction compared to gate-based pulse generation methods, while minimizing computational overhead.
Networking
Work-in-Progress Poster


DescriptionDeep neural network (DNN) compression methods help reduce model size and complexity while preserving performance. In this study, we present EqBaB, a branch-and-bound (BaB) based equivalence verification method for evaluating compressed DNNs. We propose a merge framework that computes the discrepancy between the reference and compressed DNNs and combines it with bound propagation to solve equivalence verification problems. Compared to the reachability-based method, EqBaB can effectively handle a larger input domain with higher efficiency: in our evaluation, the complexity of the input increased by 170.67 times, while the total time increased by only 11.43 times. We also evaluate eight different compression methods on two datasets using our approach, demonstrating analysis of compression capability and discrepancies.
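As a loose illustration of bound propagation for equivalence checking (not EqBaB's branch-and-bound refinement or merge framework), the sketch below propagates interval bounds through a reference and a compressed ReLU network and reports a coarse worst-case output discrepancy over a box input domain.

    # Coarse interval-bound-propagation sketch for output discrepancy between two
    # ReLU MLPs over a box domain; EqBaB's BaB refinement is not reproduced here.
    import numpy as np

    def ibp(weights, biases, lo, hi):
        """Propagate elementwise bounds [lo, hi] through affine + ReLU layers."""
        for i, (W, b) in enumerate(zip(weights, biases)):
            c, r = (lo + hi) / 2.0, (hi - lo) / 2.0
            c, r = W @ c + b, np.abs(W) @ r
            lo, hi = c - r, c + r
            if i < len(weights) - 1:                   # ReLU on hidden layers only
                lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)
        return lo, hi

    def discrepancy_bound(net_a, net_b, lo, hi):
        la, ha = ibp(*net_a, lo, hi)
        lb, hb = ibp(*net_b, lo, hi)
        return np.max(np.maximum(ha - lb, hb - la))    # worst-case |f_a - f_b| bound

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        W = [rng.normal(size=(16, 4)), rng.normal(size=(2, 16))]
        b = [np.zeros(16), np.zeros(2)]
        Wc = [w + 0.01 * rng.normal(size=w.shape) for w in W]   # "compressed" copy
        print(discrepancy_bound((W, b), (Wc, b), -np.ones(4), np.ones(4)))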
Engineering Poster
Networking


DescriptionThe ESD validation flow integrates multiple verification steps to ensure robustness in IC design, with a significant advancement brought by the EARLY VDROP tool. Unlike conventional methods that assess ESD compliance post-layout, EARLY VDROP operates at the schematic level, allowing early detection of possible ESD network problems. This proactive approach enables potential voltage drop issues to be detected and addressed early, which can significantly reduce downstream failures.
The flow, built upon Siemens EDA's Calibre tool suite, consists of three main stages: Schematic-Level Topology Checks, which validate the ESD network architecture across various hierarchical levels (from cell to domain to top level); Layout-Level Current Density Checks, which examine current density in the IC layout to confirm compliance with ESD standards; and finally, Schematic-Level Voltage Drop Checks, which identify weak paths inside protected circuitry that cannot sustain the voltage drop generated by the ESD protection network during an ESD event.
Together, these steps form a comprehensive ESD validation workflow, with EARLY VDROP enabling early-stage, schematic-based risk assessment. Preliminary results show a strong correlation with traditional methods, reinforcing its effectiveness and value in streamlining ESD compliance.
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionHardware-aware Neural Architecture Search (NAS) is one of the most promising techniques for designing efficient Deep Neural Networks (DNNs) for resource-constrained devices. Surrogate models play a crucial role in hardware-aware NAS as they enable efficient prediction of performance characteristics (e.g., inference latency and energy consumption) of different candidate models on the target hardware device. In this paper, we focus on building hardware-aware latency prediction models. We study different types of surrogate models and highlight their strengths and weaknesses. We perform a systematic analysis to understand the impact of different factors that can influence the prediction accuracy of these models, aiming to assess the importance of each stage involved in the model designing process and identify methods and policies necessary for designing/training an effective estimation model, specifically for GPU-powered devices. Based on the insights gained from the analysis, we present a holistic framework that enables reliable dataset generation and efficient model generation, considering the overall costs of different stages of the model generation pipeline.
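As a minimal illustration of one common surrogate type (not the paper's framework), the sketch below trains a tree-ensemble regressor on placeholder architecture feature vectors paired with synthetic latencies; in practice the labels would come from on-device latency measurements.

    # Minimal latency-surrogate sketch: architecture encodings and latencies are
    # synthetic placeholders standing in for measured data on the target GPU.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_percentage_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    features = rng.integers(1, 9, size=(500, 12)).astype(float)   # e.g. per-layer widths/kernels
    latency_ms = features @ rng.uniform(0.1, 1.0, 12) + rng.normal(0.0, 0.5, 500)

    X_tr, X_te, y_tr, y_te = train_test_split(features, latency_ms, random_state=0)
    surrogate = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    print("MAPE:", mean_absolute_percentage_error(y_te, surrogate.predict(X_te)))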
Research Manuscript


Systems
SYS1: Autonomous Systems (Automotive, Robotics, Drones)
DescriptionEvent-based vision sensors are novel cameras inspired by human eyes, capable of capturing the rapid motion of objects with a high-speed sparse event stream. However, it is challenging to efficiently stream-process event data without demolishing its sparsity. In this paper, we design Espresso, an architecture for event-based vision processing that can efficiently stream spatiotemporal events while preserving sparsity. We implement Espresso on an FPGA platform and design experiments to compare its performance with an embedded GPU and a line-buffer-based accelerator. In real-world scenarios, Espresso achieves throughput of up to 5000 fps, which is 5.1 times higher than the embedded GPU.
Research Manuscript


Design
DES4: Digital and Analog Circuits
DescriptionAnalog circuit design has traditionally depended on manual expertise, slowing the discovery of novel topologies essential for advanced technologies like AI, 5G/6G, and quantum computing. While AI-driven methods have accelerated hardware design workflows, most of them focus on topology synthesis, often reusing known structures to achieve specific goals. The challenge of discovering entirely new, high-performance topologies remains largely underexplored due to its abstract nature. In this work, we introduce EVA, an efficient and versatile generative engine for discovering novel analog circuit topologies. EVA employs a bottom-up generation framework, using a decoder-only transformer to sequentially predict device pin connections and create diverse circuits from scratch. Pretraining on unlabeled circuit topologies builds foundational knowledge about circuit connectivity, achieving baseline discovery efficiency by generating valid circuits and reducing performance-labeled samples needed in fine-tuning. For targeted discovery of high-performance designs, EVA leverages two fine-tuning strategies—proximal policy optimization (PPO) and direct preference optimization (DPO)—to further enhance discovery efficiency for relevant, high-performing topologies. Experimental results across various circuit types highlight EVA's strengths in validity, novelty, versatility, and both training sample and discovery efficiency.
Tutorial


AI
Sunday Program
DescriptionWith the rise of artificial intelligence, the popularization of deep learning, and a constantly evolving industry, the demand for flexible and efficient tools has never been greater. As algorithms grow more complex, their runtime and energy consumption increase exponentially. Customized hardware accelerators, long used for specific mathematical operations, remain essential for managing modern applications' computational and power demands. Hardware accelerators can speed up complex computations by orders of magnitude, but their manual design and verification processes are often challenging and time-consuming.
High-Level Synthesis (HLS) provides a solution by transforming high-level algorithm descriptions, typically written in C or C++, into synthesizable RTL suitable for hardware implementation. This approach reduces development time for RTL engineers while offering flexibility beyond what traditional handwritten RTL can provide. We extended this capability to the machine learning domain with the open-source framework hls4ml, which allows neural networks trained in Python frameworks like TensorFlow or PyTorch to be synthesized into efficient hardware representations for the traditional FPGA and ASIC flows. This breakthrough addresses the growing need for reduced design turnaround and easy verification of ML hardware accelerators with low latency and power efficiency constraints.
During this tutorial, we will demonstrate how Python complements HLS by simplifying the ML design process, bridging the gap between software and hardware development. Attendees will explore how we translate neural networks modeled in Python into fixed-point C++ models suitable for HLS workflows. We will dive into strategies like Value-Range Analysis and Quantization-Aware Training, which optimize these designs for deployment and evaluate their accuracy, power consumption, and energy efficiency.
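Based on hls4ml's public API, the sketch below shows a minimal Keras-to-HLS conversion; the FPGA part number and output directory are illustrative placeholders rather than tutorial material.

    # Minimal hls4ml conversion sketch (per hls4ml's public documentation); the
    # model, part number, and output directory are placeholders.
    import hls4ml
    from tensorflow.keras.layers import Dense, Input
    from tensorflow.keras.models import Sequential

    model = Sequential([Input(shape=(16,)),
                        Dense(32, activation="relu"),
                        Dense(5, activation="softmax")])

    # Generate a per-model configuration (fixed-point precision, reuse factors, ...).
    config = hls4ml.utils.config_from_keras_model(model, granularity="model")

    hls_model = hls4ml.converters.convert_from_keras_model(
        model, hls_config=config, output_dir="hls4ml_prj",
        part="xcu250-figd2104-2L-e")
    hls_model.compile()          # builds a C++ emulation library on the host
    # hls_model.predict(X) then runs bit-accurate fixed-point inference for validation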
To exemplify these concepts, experts from Fermilab will share their experiences applying this technology to high-energy physics experiments, where real-time, low-latency processing is critical. Over the years, Fermilab engineers have demonstrated how deep neural networks, optimized for hardware using hls4ml, can meet the stringent requirements of trigger systems at the CERN Large Hadron Collider. These systems rely on rapid decision-making to process immense data volumes while retaining only the most relevant events for further analysis. The application of hls4ml has also been extended to innovative technologies like smart pixel arrays. These smart pixels integrate ML inference capabilities directly into sensor devices, enabling localized data processing at the pixel level. This approach drastically reduces the need to transmit raw data to external processing units, significantly decreasing power consumption and latency. By embedding neural networks within the pixel architecture, the smart pixels can identify and prioritize relevant data in real time, providing a highly efficient solution for edge computing in scenarios such as particle detectors and imaging systems. Fermilab's work highlights the potential of hardware-accelerated ML in scenarios where both speed and power efficiency are mission-critical.
Through this tutorial, attendees will gain valuable insights into the challenges and solutions of deploying ML in hardware. Understanding how HLS and hls4ml streamline the development of neural network-based hardware accelerators is fundamental for the industry's future. Participants will learn how these technologies are shaping the future of AI and scientific computing.
Section 1: Developing Customized Accelerators (Cameron Villone)
Section 2: Introduction to HLS4ML. Even Higher Version of High-Level Synthesis (Cameron Villone)
Section 3: Example Design Description (Cameron Villone)
Section 4: How can we use HLS4ML to make our lives easier (Giuseppe Di Guglielmo)
Section 5: Design Exploration and Optimization (Giuseppe Di Guglielmo)
Section 6: Results and Conclusion (Giuseppe Di Guglielmo)
Networking
Work-in-Progress Poster


DescriptionLogic synthesis in computer-aided design flows uses sequences of transformation operators and associated parameters to optimize the quality of results (QoR) of circuits. Conventional methods rely on heuristic algorithms; recently, reinforcement learning (RL) approaches have achieved improved results. However, the expanding action space comprising diverse operators and continuously valued parameters increasingly challenges existing RL solutions. In this paper, we propose EvoSolo, an evolutionary sequence optimization framework for logic synthesis with cascaded proximal policy optimization (PPO). We adapt RL with an evolutionary algorithm (EA) to enhance global optimization in action selection, in which we fine-tune high-performance optimization sequences from prior iterations. We further consider the intrinsic correlation between operators and parameters by proposing a cascaded PPO architecture where two separate PPOs sequentially optimize the operators and parameters. When evaluated using the same action space on the EPFL and MCNC benchmarks, our proposed framework outperforms prior RL solutions in 9/10 and 10/10 circuit cases. Compared to resyn2, we reduce the LUT-6 count by up to 13.12% on average without level increases.
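As a toy illustration of the evolutionary half only (the cascaded PPO agents and real synthesis runs are not modeled), the sketch below mutates candidate operator sequences and keeps the best survivors, with a stub standing in for the QoR evaluator.

    # Toy evolutionary search over synthesis operator sequences. evaluate_qor() is
    # a stub for a real synthesis run (e.g. LUT-6 count); operator names are
    # illustrative and EvoSolo's cascaded PPO is not modeled.
    import random

    OPERATORS = ["rewrite", "refactor", "resub", "balance"]

    def evaluate_qor(seq: list[str]) -> float:
        rng = random.Random(",".join(seq))          # deterministic fake QoR per sequence
        return 1000 - 10 * len(set(seq)) + rng.uniform(0, 50)

    def mutate(seq: list[str]) -> list[str]:
        child = seq[:]
        child[random.randrange(len(child))] = random.choice(OPERATORS)
        return child

    def evolve(seq_len=8, pop=20, generations=30, survivors=5):
        population = [[random.choice(OPERATORS) for _ in range(seq_len)] for _ in range(pop)]
        for _ in range(generations):
            population.sort(key=evaluate_qor)        # lower metric = better QoR
            elite = population[:survivors]
            population = elite + [mutate(random.choice(elite)) for _ in range(pop - survivors)]
        return population[0], evaluate_qor(population[0])

    if __name__ == "__main__":
        best_seq, best_qor = evolve()
        print(best_seq, round(best_qor, 1))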
Research Manuscript


Systems
SYS5: Embedded Memory and Storage Systems
DescriptionCompressional SSDs can be a mixed blessing. While they offer users increased logical space beyond the physical capacity, they also complicate the internal design of the Flash Translation Layer (FTL) by requiring a larger Logical-to-Physical (L2P) address mapping.
In this paper, we propose a novel N-to-1 L2P mapping table design that consolidates multiple logical entries into a single entry. This approach eliminates the duplication of physical page numbers when a physical page contains several compressed logical pages, significantly reducing the memory footprint of compressional SSDs. To accommodate the dynamic compression ratios of real-world workloads, we introduce promotion and demotion schemes that enable mapping table entries to migrate between pages with different compression ratios. Additionally, to address the issue of partial invalidation—where some compressed logical pages within a physical page are invalid due to the N-to-1 mapping—we present a compression-aware garbage collection (GC) algorithm aimed at minimizing the number of copy operations for partially invalid physical pages.
We have implemented our design in MQsim, a widely used SSD simulator, and have conducted a series of experiments to evaluate the effectiveness of the proposed techniques. The results demonstrate that our approach significantly reduces the mapping table size in compressional SSDs, leading to an improved mapping table cache hit ratio and a reduced I/O latency compared to traditional compressional SSDs.
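As a simplified host-side illustration of the table shape only (promotion/demotion between compression ratios and the compression-aware GC are not modeled), the sketch below folds N consecutive logical pages into one entry holding a shared physical page number plus per-slot offsets.

    # Simplified N-to-1 L2P table sketch: N consecutive logical pages share one
    # entry with a single physical page number and per-slot offsets. N=4 is an
    # assumed fold ratio; the real design adapts to dynamic compression ratios.
    from dataclasses import dataclass, field

    N = 4

    @dataclass
    class L2PEntry:
        ppn: int = -1                                             # shared physical page number
        slot: list[int] = field(default_factory=lambda: [-1] * N) # in-page offset per logical page

    class NTo1Table:
        def __init__(self):
            self.entries: dict[int, L2PEntry] = {}

        def map(self, lpn: int, ppn: int, offset: int) -> None:
            entry = self.entries.setdefault(lpn // N, L2PEntry())
            entry.ppn, entry.slot[lpn % N] = ppn, offset

        def lookup(self, lpn: int) -> tuple[int, int]:
            entry = self.entries[lpn // N]
            return entry.ppn, entry.slot[lpn % N]

    table = NTo1Table()
    for i in range(4):                       # four compressed logical pages in one flash page
        table.map(lpn=100 + i, ppn=7, offset=i)
    print(table.lookup(102))                 # (7, 2): one table entry instead of four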
Engineering Poster
Networking


DescriptionWith the emergence of complex SoC designs that involve multiple instances of custom-core CPUs, the challenges in verification have increased manifold. Adherence to aggressive schedules holds utmost importance for platform devices. With no margin for error, first-pass silicon for custom CPU designs enables early breakthroughs in automotive, industrial, high-performance compute (HPC), and artificial intelligence applications. As the complex world of CPU clusters forays into paradigm-shifting capabilities in ASIC designs, hardware-software co-debug has become a ubiquitous need of the hour. In this regard, Cadence Embedded Software Debug (ESWD) helps expedite custom-core-based SoC verification closure through coverage-driven firmware signoff and easier debug of CPU execution, ensuring systematic traceability of test coverage compared with the erstwhile practice of manually mapping directed tests to covered features in the test plan. ESWD paved the way for faster debug and uncovering corner-case scenarios when compiler scripts and platform SoC development execute in parallel. The detailed paper will highlight the achieved results, debug capabilities, and impact for custom-core SoC platform developments.
Engineering Presentation


AI
Front-End Design
Chiplet
DescriptionWith the emergence of complex multi-die SoC designs, the challenges in verification have increased manifold. This involves integration of pre-verified IPs, sub-systems, and partially verified chiplets using multiple vendor simulator platforms and Verification IPs. Artificial Intelligence (AI) is at the forefront of ASIC breakthroughs. With time to market being a critical factor, adherence to aggressive schedules has become the new normal. Rebuilding these complex verification environments onto a multi-die SoC testbench in the given timelines is extremely challenging. Universal Chiplet Interconnect Express (UCIe) is an open industry standard, multi-protocol, high-bandwidth (up to 32GT/s per lane), die-to-die (chiplet) interconnect that standardizes inter-die communication on-package. The Simulator Independent Verification Platform Development (SIVPD), being vendor agnostic, not only addresses the issues at hand but also makes the most effective use of licenses, reducing development cycle time and saving precious license cost. Significant performance and capacity improvements are observed while verifying UCIe using SIVPD with NDIE distributed simulation technology on a 2.5DIC for High Performance Compute (HPC) AI training and inference applications. Compared to a traditional UCIe DUT back-to-back setup in a single-die testbench, the distributed simulation setup has drastically reduced the overall testbench development time and also helped achieve a 3x improvement in simulation speed, with the flexibility to make this solution ubiquitous for chiplet-based architectures. The current SoC consists of 4 homogeneous 4nm chiplets on a single interposer communicating with each other via the UCIe 1.1 protocol, QSPI, and GPIOs. All 4 dies run separate simulation threads on different LSF machines, thereby optimizing the memory requirements to simulate the whole quad-chiplet package together. Emulation further helped in achieving 1000x faster closure of multi-chiplet boot use-cases.
Engineering Poster
Networking


DescriptionNeural network architectures can analyze integrated circuit layout shapes to predict attributes such as local layout effects. In our previous work, M. D. Monkowski et al., "Analysis of Local Layout Effects in Field Effect Transistors Using Neural Networks", we presented a novel method to address the challenges faced when tackling local layout effects (LLEs) with compact models. We demonstrated that we can identify important features in a layout using a variational autoencoder (VAE). Although our findings are promising, further exploration of the reliability of our method's predictions was needed. In this work, we study the latent space of the VAE. The latent space is a lower-dimensional, continuous representation of the training data. As such, the latent space must capture the underlying structures and key features of a device and link those structures to its measured electrical properties. Our study allows us to demonstrate that our VAE develops a structured representation of FinFET device layout data. We demonstrate that the measured electrical property, for example threshold voltage, induces the VAE to learn the associated device layouts corresponding to it. Additionally, we introduce nearest neighbors (NNs) sampling as a method to assess the reliability of our methodology by using the NNs in latent space as a reference frame for our predictions.
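As a minimal illustration of the nearest-neighbor step (the latent codes and measured labels below are random placeholders, not device data), the sketch finds a query layout's neighbors in latent space and inspects their measured labels as a reference frame for the prediction.

    # Latent-space nearest-neighbor sampling sketch; Z and vth are placeholder
    # arrays standing in for VAE-encoded layouts and measured threshold voltages.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    Z = rng.normal(size=(1000, 16))        # latent codes of training layouts (placeholder)
    vth = rng.normal(0.35, 0.02, 1000)     # measured threshold voltages in volts (placeholder)
    z_query = rng.normal(size=(1, 16))     # latent code of the layout being predicted

    nn = NearestNeighbors(n_neighbors=5).fit(Z)
    dist, idx = nn.kneighbors(z_query)
    # Tight distances and consistent neighbor labels suggest the prediction for
    # z_query is well supported by the training data.
    print(dist.ravel(), vth[idx.ravel()])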
Networking
Work-in-Progress Poster


DescriptionVortex, a newly proposed open-source GPGPU platform based on the RISC-V ISA, offers a valid alternative for GPGPU research over the broadly used modeling platforms based on commercial GPUs. Similar to the push originating from the RISC-V movement for CPUs, Vortex can enable a myriad of fresh research directions for GPUs. However, as a young hardware platform, it lacks the performance competitiveness necessary for wide adoption. Particularly, it underperforms for regular, memory-intensive kernels like linear algebra routines, which form the basis of many applications, including Machine Learning. For such kernels, we identified the control flow (CF) management overhead and memory orchestration as the main causes of performance degradation in the open-source Vortex GPGPU. To overcome these problems, this paper proposes 1) a hardware CF manager to accelerate branching and predication in regular loop execution and 2) decoupled memory streaming lanes to further hide memory latency with useful computation. The evaluation results for different kernels show 8x faster execution, 10x reduction in dynamic instruction count, and a performance improvement from 0.35 to 1.63 GFLOP/s/mm². Thanks to the proposed enhancements, Vortex can become an ideal playground to enable GPGPU research for the next generation of Machine Learning.
Research Manuscript


Design
DES3: Emerging Models of ComputatioN
DescriptionNeuro-symbolic artificial intelligence (neuro-symbolic AI) excels in logical analysis and reasoning. Hyperdimensional Computing (HDC), a promising brain-inspired computational model, is integral to neuro-symbolic AI. Various HDC models have been proposed to represent class-instance and class-class relations, but when representing the more complex class-subclass relation, where multiple objects associate different levels of classes and subclasses, they face challenges for factorization, a crucial task for neuro-symbolic AI systems. In this article, we propose FactorHD, a novel HDC model capable of representing and factorizing the complex class-subclass relation efficiently. FactorHD features a symbolic encoding method that embeds an extra memorization clause, preserving more information for multiple objects. In addition, it employs an efficient factorization algorithm that selectively eliminates redundant classes by identifying the memorization clause of the target class. Such a model significantly enhances computing efficiency and accuracy in representing and factorizing multiple objects with class-subclass relations, overcoming limitations of existing HDC models such as "superposition catastrophe" and "the problem of 2". Evaluations show that FactorHD achieves approximately 5667× speedup at a representation size of 10^9 compared to existing HDC models. When integrated with the ResNet-18 neural network, FactorHD achieves 92.48% factorization accuracy on the Cifar-10 dataset.
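As generic background rather than FactorHD's encoding, the sketch below uses bipolar hypervectors to bind role-filler pairs, bundle them into one vector, and then recover a filler by unbinding plus a codebook cleanup search, which is the factorization step FactorHD accelerates.

    # Generic bipolar HDC sketch (not FactorHD's encoding): bind, bundle, then
    # factorize by unbinding and a cleanup search over the filler codebook.
    import numpy as np

    D = 10_000
    rng = np.random.default_rng(0)
    rand_hv = lambda: rng.choice([-1, 1], size=D)

    roles = {"class": rand_hv(), "subclass": rand_hv()}
    fillers = {"animal": rand_hv(), "dog": rand_hv(), "vehicle": rand_hv()}

    # Bind (elementwise multiply) each role with its filler, then bundle (sum + sign).
    record = np.sign(roles["class"] * fillers["animal"]
                     + roles["subclass"] * fillers["dog"])

    query = record * roles["subclass"]                    # unbind the "subclass" role
    best = max(fillers, key=lambda name: np.dot(query, fillers[name]))
    print(best)                                           # "dog" with high probability for large D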
Networking
Work-in-Progress Poster


DescriptionArtificial Intelligence (AI) has become omnipresent, influencing a multitude of applications in our daily lives and expanding into a wide range of sensitive applications like medical diagnosis, autonomous driving, and more. Many systems using Machine Learning (ML) models, e.g., neural networks, function as black boxes and lack interpretability. In contrast, users are increasingly seeking greater transparency in how AI-driven decisions are made. Random Forests (RF) have arisen as a key model offering this "interpretability", but their limited performance is far from the needs of time-critical applications.
In this work, we design a specialized hardware solution to accelerate inference tasks on RFs. We first identify data retrieval inefficiencies by analyzing several RF inference algorithms, then design a processing-in-memory (PIM) architecture called FARM. FARM performs key inference tasks within the hardware's HBM memory banks, and coalesces burst memory activity to reduce unnecessary data movements. Our evaluation shows that FARM delivers an average of 8.8x performance improvement, and an 87% reduction in energy, without any loss in predictive accuracy, when compared to a state-of-the-art GPU coupled with HBM memory.
Research Manuscript


Systems
SYS2: Design of Cyber-Physical Systems and IoT
DescriptionThe fast-rising demand for wireless bandwidth [1] requires rapid evolution of high-performance baseband processing infrastructure. Programmable many-core processors for software-defined radio (SDR) have emerged as high-performance baseband processing engines, offering the flexibility required to capture evolving wireless standards and technologies [2]–[4]. This trend must be supported by a design framework enabling functional validation and end-to-end performance analysis of SDR hardware within realistic radio environment models. We propose a static binary translation based simulator augmented with a fast, approximate timing model of the hardware and coupled to wireless channel models to simulate the most performance-critical physical layer functions implemented in software on a many-core cluster of 1024 RISC-V cores customized for SDR. Our framework simulates the detection of a 5G OFDM symbol on a server-class processor in 9.5s-3min on a single thread, depending on the input MIMO size (three orders of magnitude faster than RTL simulation). The simulation is easily parallelized to 128 threads with 73-121× speedup compared to a single thread.
Networking
Work-in-Progress Poster


DescriptionRandom walk is a prevalent statistical technique that has been widely used in network analysis. This requires a considerable number of walk samples for accurate estimation; thus, random walk does not work well on large networks with long walk lengths. This paper addresses fast random walk through the reduction of absorbing Markov chains (AMC). We first select some states in an AMC as additional absorbing states, and random walk is performed on this modified AMC. Next, we construct a reduced AMC retaining only the absorbing states, with additionally selected ones reverted to transient states, and perform random walk again. Based on the results of two random walks, the expected number of visits to each state is calculated in a stochastic manner. Experimental results on IR-drop analysis demonstrate significant speedup, achieving a 78% reduction in runtime compared to the state-of-the-art random walk approach.
Networking
Work-in-Progress Poster


DescriptionFinFET technology has been steadily replacing the traditional MOSFET in sub-20 nm IC devices, due to its low power consumption and excellent scaling characteristics. However, scaling FinFETs beyond 3 nm is challenging. To pursue better power efficiency and performance, the negative capacitance FinFET (NC-FinFET) has been introduced by adding an extra ferroelectric layer at the gate. In this paper, we propose a fast simulation algorithm for NC-FinFETs based on the Latency Insertion Method (LIM). By integrating the BSIM-CMG model and the Landau-Khalatnikov Ferroelectric model, the proposed algorithm achieves orders of magnitude in speedup over conventional circuit simulators for large-scale examples.
Engineering Poster
Networking


DescriptionIn the ever-changing and fast-paced semiconductor world, speeding up the processes involved is very important for getting a product to market faster than the competition. It is imperative to shift-left the processes involved at various stages of the chip life cycle. PCIe, one of the most commonly used high-speed peripherals in an SoC for meeting the demand for faster transfers, is also evolving and becoming more complex. This paper discusses fast-tracking PCIe verification in an SoC by automating the testbench using the Triple Check test suite.
This test suite contains all the necessary testcases to be covered for the Data Link Layer, PHY layer, Configuration classes, etc., for each generation of PCIe. A setup testcase runs through the RTL, determines the RTL configuration, and intelligently selects the list of suitable testcases and the configurations the VIP needs. With this information, we can precisely run the selected testcases and identify any protocol or design-integration bugs much earlier in the design cycle, thereby enabling faster fixes.
Research Manuscript


Security
SEC2: Hardware Security: Primitives & Architecture, Design & Test
DescriptionIn response to the surge of security breaches in hardware designs, many verification methods have been proposed to detect microarchitectural information leakage. These sophisticated efforts have gone a long way toward preventing attackers from breaking the confidentiality of the system. However, each approach comes with its own set of weaknesses: it may not be scalable enough, it may not be exhaustive, it may not be flexible enough to meet changing requirements, or it may not fit well into existing verification flows.
In this paper, we propose FastPath, a hybrid verification methodology that leverages the strengths of different approaches while compensating their weaknesses. In particular, it combines the efficiency of simulation with the exhaustive nature of formal verification. In addition, FastPath employs a structural analysis framework to automate the method further. Our experimental results compare FastPath to a state-of-the-art formal approach, showing a significant reduction in manual effort while achieving the same level of exhaustive confidence. We also discovered and contributed a fix for a previously unknown leak of internal operands in cv32e40s, a RISC-V processor intended for security applications.
Networking
Work-in-Progress Poster


DescriptionReal-time health-monitoring systems are generating very large amounts of data, challenging memory storage space, the computation capacity of processing units, and the power budget for transmission. This work proposes a new approximation method, FAxC, which exploits the features in biometric data to perform multi-dimensional approximation. The proposed feature-oriented approximation not only significantly reduces the size of sensing data without compromising accuracy but also addresses privacy preservation issues in healthcare applications. The FAxC method has been successfully deployed in a machine-learning (ML)-based human activity recognition (HAR) framework. The experimental results based on a public database, MotionSense, show that FAxC can achieve a HAR accuracy of over 94.85% with a 34% reduction in data size. More importantly, this work provides quantitative assessments of the trade-off between HAR accuracy, privacy-preserving rate, and approximation efficiency. Regarding privacy preservation, our case study indicates that FAxC outperforms the existing downsampling and single-feature approximation methods by up to 3.2x and 2.8x, respectively. A tinyML-based FPGA implementation for the HAR application shows that the use of FAxC reduces the classification latency by 36% compared to a non-approximation baseline.
Engineering Presentation


AI
Front-End Design
Chiplet
DescriptionThe adoption of the Portable Stimulus Standard (PSS) language in hardware verification remains constrained by the fear of engineers having to learn a new language. Although PSS holds promise for improving verification efficiency—from block-level components to full SoC integrations—its intricate constructs and relationships frequently hinder widespread deployment. Existing solutions, such as relying on consulting services or leveraging graphical interfaces, still demand deep PSS proficiency and can be both resource-intensive and time-consuming.
This paper introduces a novel approach that harnesses Large Language Models (LLMs) to significantly simplify PSS code creation and accelerate coverage closure. By translating high-level textual descriptions into fully formed, syntactically valid PSS code, our solution reduces the learning curve for verification engineers who are not PSS experts. This method also facilitates code compilation to multiple target environments, both industry-standard UVM and C-based verification environments, further streamlining the transition to PSS. As a result, engineering teams can rapidly adopt PSS without sacrificing thoroughness or correctness, leading to more efficient, consistent, and automated verification workflows.
Our findings demonstrate that LLM-assisted PSS code generation can bridge the gap between domain expertise and language proficiency, thus enabling broader, more effective use of the Portable Stimulus Standard in complex verification scenarios.
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionIn advanced nodes, the optimization of Power, Performance, and Area (PPA) is becoming increasingly complex, requiring significant resources and time for circuit design and optimization using Electronic Design Automation (EDA). As a key approach to overcome these challenges, Machine Learning (ML) techniques have been widely studied in the field of EDA. However, security concerns around Intellectual Property (IP) limit access to real-world circuit data, making it difficult to gather sufficient data for training ML models. This lack of available circuit benchmarks restricts progress in ML research. In this study, we propose FedEDA, which, to the best of our knowledge, is the first Federated Learning (FL) aggregation algorithm specifically designed for EDA. FedEDA addresses concerns about IP security by exchanging model weights among FL participants instead of sharing raw data. Furthermore, FedEDA leverages Rent's Rule and circuit size to capture the hierarchical structure of circuits, mitigating issues related to data imbalance among participants and improving the quality of weight aggregation on EDA data. We demonstrate the applicability of FedEDA across various EDA tasks, including routability, parasitic RC, and wirelength prediction. FedEDA outperforms existing FL algorithms across these EDA tasks.
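As a generic illustration of weighted federated averaging (not FedEDA's aggregation, which additionally weights by Rent's Rule and circuit size), the sketch below mixes client model weights in proportion to their sample counts, so no raw circuit data leaves a participant.

    # Generic weighted FedAvg sketch; weights are proportional to sample counts
    # only, whereas FedEDA's aggregation also uses Rent's Rule and circuit size.
    import numpy as np

    def aggregate(client_weights: list[dict[str, np.ndarray]],
                  client_sizes: list[int]) -> dict[str, np.ndarray]:
        total = float(sum(client_sizes))
        mix = [s / total for s in client_sizes]
        return {name: sum(m * w[name] for m, w in zip(mix, client_weights))
                for name in client_weights[0]}

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        shapes = {"fc.weight": (8, 4), "fc.bias": (8,)}
        clients = [{k: rng.normal(size=s) for k, s in shapes.items()} for _ in range(3)]
        global_model = aggregate(clients, client_sizes=[1200, 300, 500])
        print({k: v.shape for k, v in global_model.items()})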
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionKolmogorov-Arnold networks (KANs) have emerged as a promising alternative to MLP due to their adaptive learning capabilities for complex dependencies through B-spline basis activations (BBA). However, existing in-memory accelerators optimized for MLP-based DNNs are primarily designed for vector-matrix multiplication (VMM), making them inefficient for the dynamic and recursive B-spline interpolation (BSI) operations required by KANs. In this work, we propose FeKAN, an FeFET-based architecture designed to accelerate BBA operations. First, we develop a software-hardware co-optimized framework for mapping B-spline basis functions, leveraging a two-stage design space exploration (DSE) algorithm in combination with FeFET-based Look-Up Tables (LUT) and Content-Addressable Memory (CAM). This framework translates dynamic BSI operations into static codebook lookups, achieving a balanced trade-off between memory and computational efficiency. Second, we propose compress-sparsity-column (CSC) based encoding for B-spline basis function (BBF) and grouped-computation strategy for memory and energy reduction. Third, we propose a grouped-pipeline optimization strategy to mitigate data dependencies, significantly enhancing computation efficiency. Experimental results demonstrate that FeKAN achieves up to $150.68\text{K}\times$ and $4664\times$ higher throughput and up to $606.87\times$ and $11196\times$ greater energy efficiency over Intel Xeon Silver 4310 CPU and NVIDIA A6000 GPU, respectively.
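The codebook idea, turning recursive B-spline basis evaluation into a static lookup, can be sketched in a few lines: precompute every basis function on a quantized input grid offline, then replace the Cox-de Boor recursion with a table read at run time. The grid resolution, knot layout, and LUT/CAM mapping below are placeholder choices for illustration, not FeKAN's design.

```python
import numpy as np
from scipy.interpolate import BSpline  # used only offline to fill the codebook

def build_basis_codebook(knots: np.ndarray, degree: int, levels: int = 256):
    """Precompute all B-spline basis values on a quantized input grid (offline)."""
    n_basis = len(knots) - degree - 1
    grid = np.linspace(knots[degree], knots[-degree - 1], levels)
    # one column per basis function, evaluated at every quantized input level
    table = np.stack(
        [BSpline.basis_element(knots[i:i + degree + 2], extrapolate=False)(grid)
         for i in range(n_basis)], axis=1)
    return grid, np.nan_to_num(table)  # nan outside a basis element's support -> 0

def lookup_basis(x: float, grid: np.ndarray, table: np.ndarray) -> np.ndarray:
    """At run time the recursive evaluation becomes a single table read."""
    idx = np.clip(np.searchsorted(grid, x), 0, len(grid) - 1)
    return table[idx]

# Example: cubic basis functions over a uniform knot vector
# grid, table = build_basis_codebook(np.linspace(0, 1, 12), degree=3)
```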
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
DescriptionGraph representation learning is a powerful approach to extract features from graph-structured data such as analog/mixed-signal (AMS) circuits. However, the training of deep learning models for AMS design is severely limited by the scarcity of integrated circuit design data. In this work, we present CirGPS, a few-shot learning method for parasitic effect prediction in AMS circuits. The circuit netlist is modeled as a heterogeneous graph while the coupling capacitance is modeled as a link. CirGPS is pre-trained on link prediction and fine-tuned on edge regression. The proposed method starts with a small-hop sampling technique that converts a link or a node into a subgraph. Then, the subgraph embeddings are learned with a hybrid graph Transformer. Additionally, CirGPS integrates a low-cost positional encoding that summarizes the positional and structural information of the target link. CirGPS improves the accuracy of coupling existence by at least 20% and reduces the MAE of capacitance estimation by at least 0.067 compared to existing methods. Our method naturally has good scalability and can be applied with zero-shot learning to a wide variety of AMS designs. Through our ablation studies, we provide valuable insights into graph models for representation learning.
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionBackpropagation, while foundational to neural network training, is inefficient for resource-constrained edge devices due to its high time and energy consumption. While low-precision quantization has been explored for inference speed-up, its use in training remains underexplored. The Forward-Forward (FF) algorithm offers an alternative by replacing the backward pass with an additional forward pass, reducing memory and computation. This paper introduces an INT8 quantized training approach using FF's layer-by-layer strategy to stabilize gradient quantization and proposes a "look-afterward" scheme to improve accuracy. Experiments on an edge device show 4.6% faster training, 8.3% energy savings, and 27.0% reduced memory usage, with competitive accuracy.
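For context, the core of the Forward-Forward algorithm is a purely local, per-layer objective on a "goodness" score, which is what makes layer-by-layer quantized training possible. Below is a minimal PyTorch sketch of that local loss as commonly formulated; the paper's INT8 quantization and "look-afterward" scheme are not reproduced here.

```python
import torch
import torch.nn.functional as F

def ff_layer_loss(layer: torch.nn.Linear,
                  x_pos: torch.Tensor,
                  x_neg: torch.Tensor,
                  theta: float = 2.0) -> torch.Tensor:
    """Forward-Forward local objective for one layer: push the 'goodness'
    (mean squared activation) above theta for positive samples and below
    theta for negative samples. No backward pass through other layers is needed."""
    g_pos = F.relu(layer(x_pos)).pow(2).mean(dim=1)
    g_neg = F.relu(layer(x_neg)).pow(2).mean(dim=1)
    # logistic loss on the margin to theta
    return torch.log1p(torch.exp(torch.cat([theta - g_pos, g_neg - theta]))).mean()
```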
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionTo accelerate AI applications, numerous data formats and physical implementations of matrix multiplication have been proposed, creating a complex design space.
This paper studies the efficient MAC implementation of the integer, floating-point, posit, and logarithmic number system (LNS) data formats and Microscaling (MX) and Vector-Scaled Quantization (VSQ) block data formats.
We evaluate the area, power, and numerical accuracy of >25,000 MAC designs spanning each data format and several key design parameters.
We find that Pareto-optimal MAC designs with emerging data formats (LNS16, MXINT8, VSQINT4) achieve 1.8x, 2.2x, and 1.9x TOPS/W improvements compared to FP16, FP8, and FP4 implementations.
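As background on the block data formats being compared, an MXINT8-style element can be pictured as an INT8 value sharing a single power-of-two scale with the rest of its block. The sketch below shows only that encode/decode round trip; block size, scale encoding, and rounding details of the actual MX specification are simplified away.

```python
import numpy as np

def mxint8_quantize(block: np.ndarray):
    """Quantize a block to an MXINT8-like format: one shared power-of-two scale
    per block plus an INT8 value per element (illustrative only)."""
    max_abs = float(np.max(np.abs(block)))
    # shared exponent chosen so the largest magnitude fits in the int8 range
    shared_exp = 0 if max_abs == 0 else int(np.ceil(np.log2(max_abs / 127.0)))
    scale = 2.0 ** shared_exp
    ints = np.clip(np.round(block / scale), -128, 127).astype(np.int8)
    return shared_exp, ints

def mxint8_dequantize(shared_exp: int, ints: np.ndarray) -> np.ndarray:
    return ints.astype(np.float32) * (2.0 ** shared_exp)
```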
Research Manuscript


Systems
SYS5: Embedded Memory and Storage Systems
DescriptionZoned namespace (ZNS) SSDs are emerging storage devices offering low cost, high performance, and software definability. By adopting host-managed zone-based sequential programming, ZNS SSDs effectively eliminate the space overhead associated with on-board DRAM memory and garbage collection. However, while background read refreshing serves as a data protection mechanism in conventional block-interface SSDs, state-of-the-art ZNS SSDs lack read refreshing functionality to guarantee data reliability. Moreover, implementing zone-level read refreshing in ZNS SSDs incurs significant overhead due to the large volume of valid data movements in a zone, leading to degraded I/O performance.
To efficiently enable read refreshing for ZNS SSDs, this paper proposes FineRR-ZNS, a fine-granularity read refreshing mechanism for ZNS SSDs. FineRR-ZNS employs a host-controlled fine-granularity read refreshing scheme that selectively performs block-level read refreshing via metadata remapping. Additionally, a zone reconstruction method is designed to retrieve remapped data and form complete zone data during zone-level read refreshing (RR). Notably, the remapped data remain available after zone reconstruction and are prioritized for read accesses until their respective blocks require the next RR. Experimental evaluations with RocksDB benchmarks show that FineRR-ZNS significantly enhances read refreshing efficiency and I/O throughput compared to zone-level read refreshing implemented in the state-of-the-art ZenFS file system.
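The block-level remapping idea can be pictured as a small table consulted on the read path: once a block's read count crosses a disturb threshold, only that block is rewritten to a spare location and remapped, instead of rewriting the whole zone. The class below is purely illustrative; the real host/device split, thresholds, and metadata layout of FineRR-ZNS are not shown.

```python
class BlockRemapTable:
    """Toy sketch of block-level read refreshing via metadata remapping."""

    def __init__(self, read_threshold: int = 10_000):
        self.read_counts: dict[int, int] = {}
        self.remap: dict[int, int] = {}      # logical block -> refreshed physical copy
        self.read_threshold = read_threshold

    def on_read(self, block: int, copy_out, spare_alloc) -> int:
        """Return the physical block to serve, refreshing it first if needed."""
        phys = self.remap.get(block, block)
        self.read_counts[phys] = self.read_counts.get(phys, 0) + 1
        if self.read_counts[phys] >= self.read_threshold:
            new_block = spare_alloc()        # pick a spare block
            copy_out(phys, new_block)        # refresh only this block's data
            self.remap[block] = new_block
            self.read_counts[new_block] = 0
        return self.remap.get(block, block)
```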
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionHigh-performance security guarantees rely on hardware support. Generic programmable support for fine-grained instruction analysis has gained broad interest in the literature as a fundamental building block for the security of future processors. However, implementation in real out-of-order (OoO) superscalar processors presents tough challenges that cannot be explored in highly abstract simulators. We detail the challenges of implementing complex programmable pathways without critical paths or contention. We then introduce FireGuard, the first implementation of fine-grained instruction analysis on a real OoO superscalar processor. We establish an end-to-end system, including microarchitecture, SoC, ISA and programming model. Experiments show that our solution simultaneously ensures both security and performance of the system, with excellent parallel scalability. We also evaluate the feasibility of building FireGuard into modern SoCs: Apple's M1-Pro, Huawei's Kirin-960, and Intel's i7-12700F, introducing less than 1% additional silicon area.
Workshop


AI
Sunday Program
DescriptionHardware tape-outs are prohibitively expensive and time-consuming, making circuit and system (CAS) simulators crucial for verifying designs efficiently and cost-effectively prior to fabrication. An extensive array of simulators exists today, tailored for various CAS applications, such as Verilog simulators for digital integrated circuits (ICs), SPICE-based simulators for analog ICs, Verilog-AMS simulators for mixed-signal systems, and electromagnetic simulators for high-frequency circuits and antennas. Despite decades of development and the high degree of maturity achieved by CAS simulators, the recent surge of artificial intelligence (AI) is rekindling renewed interest from both software and hardware perspectives. On the hardware front, the exceptional parallelism capabilities of GPUs can be harnessed to expedite CAS simulations, such as GPU-accelerated SPICE simulations and logic gate simulation. On the software side, deep learning (DL) algorithms are being seamlessly integrated into CAS simulators serving as surrogate models or providing initial guesses, to reduce computational workloads and improve efficiency. Conversely, the principles of CAS simulation are catalyzing novel AI models. One prominent example is the use of ordinary differential equations (ODEs), which have long been a cornerstone of time-domain analog circuit simulations in SPICE, with the adjoint method used for gradient computations. In the DL community, these techniques have evolved into Neural ODEs, a class of models that parameterize ODE dynamics using neural networks. Neural ODEs have proven especially effective for time-series forecasting and are closely linked to the development of generative diffusion models. Similarly, state-space models (SSMs), once the bedrock of linear time-invariant systems, now underpin architectures such as Mamba, designed for efficient natural language processing. Another notable adaptation of classical circuit principles in modern AI is Kirchhoff's current law (KCL), which has been leveraged to construct analog neural networks, such as memristor crossbar arrays and KirchhoffNet. Furthermore, Fourier transforms, widely used in frequency-domain CAS simulations for signal processing, have been reimagined as neural operators. This adaptation has led to breakthroughs in AI-driven scientific applications, such as weather forecasting.
The similarities between CAS simulation and AI are profound, yet no dedicated platform exists for researchers, engineers, and practitioners to discuss this interdisciplinary topic. Recognizing this critical need, the First International Workshop on Synergizing AI and Circuit-System Simulation aims to bring together experts to explore innovative methodologies that leverage the synergies between these fields. The workshop will provide a platform to discuss recent advancements and foster interdisciplinary collaboration.
The workshop contains 5 talks; each is scheduled to be 45 mins.
Title: GPU Accelerated Simulation: From RTL to Gate-Level, From Opportunities to Success
Contributors: Yanqing Zhang, Mark (Haoxing) Ren, Nvidia
Abstract: In this talk, we will present a brief history of accelerated simulation and motivate why GPUs can be an attractive platform to accelerate this uber-important EDA application. We will go through several important types of simulation abstraction levels: RTL, gate-level, and re-simulation, as well as the unique challenges each type of simulation faces when attempting to accelerate them. Next, we go into a detailed discussion of some recent research work that aims to attack these challenges, centered around three projects: GEM (GPU-accelerated RTL simulation), GL0AM (GPU-accelerated gate-level simulation), and GATSPI (GPU-accelerated re-simulation). Finally, we provide some analysis and insight into where the remaining opportunities for improvement and research lie (and why), and which challenges have been successfully conquered.
Title: AI on functions
Contributors: Kamyar Azizzadenesheli, Nvidia
Abstract: Artificial intelligence is rapidly advancing, with neural networks powering breakthroughs in computer vision and natural language processing. Yet, many scientific and engineering challenges—such as material science, climate modeling, and quantum chemistry—rely on data that are not words or images, but functions. Traditional neural networks are not equipped to handle these functional data.
To overcome this limitation, we introduce neural operators, a new paradigm in AI that generalize neural networks to learn mappings between function spaces. Neural operators enable AI to process and reason about functional data directly, opening new frontiers for scientific discovery and technological innovation across diverse disciplines.
Title: Machine Learning for EDA, or EDA for Machine Learning?
Contributors: Zheng Zhang, University of California at Santa Barbara
Abstract: The rapid advancement of machine learning (especially deep learning) in the past decade has impacted, both positively and negatively, many research fields. Driven by the great success of machine learning in image and speech domains, there has been increasing interest in “Machine Learning for EDA”. In the first part of the talk, I will explain the main challenge of data sparsity when applying existing machine learning techniques to EDA. Then I will show how some data-efficient scientific machine learning techniques, specifically uncertainty quantification and physics-constrained operator learning, can be utilized to build high-fidelity surrogate models for variability analysis and for 3D-IC thermal analysis, respectively. These techniques can greatly reduce the number of required device/circuit simulation data samples.
Another important but highly ignored direction is “EDA for Machine Learning”. The five decades of EDA research has produced a huge body of solid theory and efficient algorithms for analyzing, modeling and optimizing complex electronic systems. Many of the white-box EDA ideas may be leveraged to solve black-box AI problems. In the second part of the talk, I will show how the self-healing idea and compact modeling idea from EDA can be utilized to improve the trustworthiness and sustainability of deep learning models (including large-language models).
Title: Optimization Meets Circuit Simulation
Contributors: Aayushya Agarwal, Larry Pileggi, Carnegie Mellon University
Abstract: Optimization is central to the design and analysis of modern engineering systems. But as systems scale in complexity, traditional optimization tools, which are often rooted in purely mathematical representations, can struggle to reliably find feasible solutions. In this talk, we explore a new approach that bridges mathematical optimization with circuit simulation. This approach maps optimization problems onto analog circuits, where optimization components are modeled as equivalent circuit devices connected through a network. This reframes the development of optimization algorithms as the design and simulation of circuits, which allows us to leverage principles from linear networks, nonlinear device physics, and solution techniques from SPICE and its many derivatives. The result is a class of physics-inspired methods tailored to the nonlinearities and structure of each optimization problem that would be far less intuitive without the view through a circuit-model lens. We demonstrate the efficacy of the equivalent circuit methods for real-world applications, including training machine-learning models and optimizing power grids.
Title: Oscillator Ising Machines: Principles to Working Hardware
Contributors: Jaijeet Roychowdhury, University of California at Berkeley
Abstract: Modern society has become increasingly reliant on rapid and routine solution of hard discrete optimization problems. Over the past decade, fascinating analog hardware approaches have arisen that combine principles of physics and computer science with optical, electronic and quantum engineering to solve combinatorial optimization problems in new ways---these have come to be known as Ising machines. Such approaches leverage analog dynamics and physics to find good solutions of discrete optimization problems, potentially with advantages over traditional algorithms. Underlying these approaches is the Ising model, a simple but powerful graph formulation with deep historical roots in physics. About eight years ago, we discovered that networks of analog electronic oscillators can solve Ising problems “naturally”. This talk will cover the principles and practical development of these oscillator Ising machines (OIMs). We will touch upon specialized EDA tools for oscillator based systems and note the role of novel nanodevices. Applied to the MU-MIMO detection problem in modern wireless communications, OIMs yield near-optimal symbol-error rates (SERs), improving over the industrial state of the art by 20x for some scenarios.
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionEnabling real-time GNN inference services requires low end-to-end latency to meet service level agreements. However, intensive preparation steps and the neighborhood explosion problem pose significant challenges to efficient GNN inference serving. In this paper, we propose FLAG, an FPGA-based GNN inference serving system using vector quantization. To reduce preparation overhead, we introduce offline preprocessing to precompute and compress hidden embeddings for serving. A dedicated FPGA accelerator leverages the precomputed data to enable lightweight aggregation. As a result, FLAG achieves average speedups of 154×, 176×, and 333× on three GNN models compared to the baseline system.
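The vector-quantization step can be pictured as replacing each precomputed hidden embedding with an index into a small codebook, so serving-time aggregation only touches compact codewords. The plain k-means codebook below is a stand-in for illustration; FLAG's actual quantizer, compression format, and FPGA datapath are not described here.

```python
import numpy as np

def build_codebook(embeddings: np.ndarray, k: int = 256, iters: int = 20):
    """Offline: vector-quantize precomputed node embeddings with plain k-means
    (fine for small N; a stand-in for whatever VQ scheme FLAG actually uses)."""
    rng = np.random.default_rng(0)
    codebook = embeddings[rng.choice(len(embeddings), k, replace=False)]
    for _ in range(iters):
        dists = ((embeddings[:, None, :] - codebook[None]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for c in range(k):
            members = embeddings[assign == c]
            if len(members):
                codebook[c] = members.mean(0)
    return codebook, assign

def serve_aggregate(neighbor_ids: np.ndarray, assign: np.ndarray, codebook: np.ndarray):
    """Online: neighborhood aggregation reduces to averaging small codewords."""
    return codebook[assign[neighbor_ids]].mean(axis=0)
```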
Research Manuscript


Systems
SYS6: Time-Critical and Fault-Tolerant System Design
DescriptionReliability and real-time responsiveness in safety-critical systems have traditionally been achieved using error detection mechanisms, such as LockStep, which require pre-configured checker cores, strict synchronisation between main and checker cores, static error detection regions, or limited preemption capabilities. However, these core-bound hardware mechanisms often lead to significant resource over-provisioning and diminished real-time responsiveness, particularly in modern systems where tasks with varying reliability requirements are consolidated on shared processors to improve efficiency, reduce costs, and save power. To address these challenges, this work presents FlexStep, a systematic solution that integrates hardware and software across the SoC, ISA, and OS scheduling layers. FlexStep features a novel microarchitecture that supports dynamic core configuration and asynchronous, preemptive error detection. The FlexStep architecture naturally allows for flexible task scheduling and error detection, enabling new scheduling algorithms that enhance both resource efficiency and real-time schedulability. We publicly release FlexStep's source code at https://anonymous.4open.science/r/FlexStep-DAC25-7B0C.
Networking
Work-in-Progress Poster


DescriptionWith the rapid adoption of Deep Neural Networks (DNNs) in safety-critical systems, ensuring the reliability of DNN hardware accelerators has become essential. Although many fault-tolerant techniques have been proposed for DNN accelerators, they rely on coarse-grained error detection that analyzes the final output of computations (e.g., output activation). This coarse-grained error detection approach delays responses to errors, making it particularly ineffective against propagation errors, which can cause multiple erroneous results with high probability in systolic array (SA)-based DNN accelerators. Furthermore, as these techniques depend on offline profiling, they are limited in accurately detecting and correcting errors during inference. In this paper, we propose Flowguard-SA, a fault-tolerant SA architecture that detects and mitigates propagation errors in a fine-grained manner through online profiling of input data. Flowguard-SA determines the maximum value of the input data online and leverages this value to detect and mitigate errors in the propagated input data within the SA. Experimental results demonstrate that Flowguard-SA achieves an average accuracy improvement of 2.54× at a BER of 1e-6 across various CNN models, compared to existing fault-tolerant techniques, with 1.95% area overhead and 0.66% power overhead compared to a conventional SA.
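The online-profiling idea reduces to a simple invariant: no value propagating through the systolic array should exceed the maximum magnitude observed at the array inputs for the current tile. A NumPy sketch of that check is below; the per-column hardware realization and the mitigation policy of Flowguard-SA are not captured here.

```python
import numpy as np

def check_propagated(inputs: np.ndarray, propagated: np.ndarray):
    """Flag and clamp propagated values whose magnitude exceeds the maximum
    magnitude seen at the array inputs (profiled online for this tile).
    Any such value cannot be legitimate input data, e.g. due to a bit flip."""
    bound = np.max(np.abs(inputs))
    faulty = np.abs(propagated) > bound
    corrected = np.where(faulty, np.clip(propagated, -bound, bound), propagated)
    return corrected, faulty
```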
Engineering Poster


DescriptionDrawing on practical experience with UCIe/Chiplet systems as a live testimony, this paper examines two verification approaches for such designs, formal verification and simulation/emulation verification, and proposes combining them for better results.
1. Formal Verification:
• Detects bugs early and allows reuse of tools across platforms.
• However, it struggles with large designs and deep-state exploration, making it less effective for complex scenarios.
2. Simulation/Emulation Verification:
• Handles large designs and deep-state simulations effectively.
• However, debugging issues can take days due to time-intensive failure analysis.
Hybrid Approach: Combining Both Methods
Running formal and simulation/emulation verification together offers the best of both worlds:
• Catch Early Bugs: Formal runs can quickly identify simple issues.
• Faster Debugging: Assertions from formal runs can be reused in simulations, stopping at failures to save time.
• Consistency: Using shared properties across both platforms ensures consistent checks and constraints.
• Accelerated Coverage: Merging formal coverage with simulation coverage helps achieve faster verification at the system-on-chip (SoC) level.
This hybrid approach delivers the benefits of both methods, speeding up and improving the overall verification process, ensuring more reliable results and hence improving turnaround time (TAT).
Engineering Presentation


Front-End Design
DescriptionIntel's next-generation Xeon Server/AI Accelerator SoCs integrate over 190 intellectual properties (IPs), with 26 key IPs developed in-house by the SIFG (Server IPs and Firmware Group) SoC Integration team. These IPs, which include both baseline and major derivatives, are crucial for SoC Integration validation. With limited IP-level validation on these IPs, discovering bugs during SoC integration often leads to prolonged debugging. To address this, we deployed Formal Property Verification (FPV) using Cadence's Jasper to validate end-to-end features of these Xeon SoC-owned IPs. This paper illustrates how FPV complemented traditional simulation methods by providing exhaustive testing and faster bug detection, thereby reducing debug cycles and preventing costly post-silicon steppings across multiple Xeon product lines. Achieving several milestones on multiple Xeon products and their IPs, we validated features on at least 17 SoC-owned IPs using FPV, uncovering 84 unique bugs. These bugs included hard-to-reach scenarios not detected by traditional simulation methods and critical path issues. FPV environments, along with Clock-Gating and Formal Coverage Apps, were instrumental in identifying these issues. These efforts resulted in a significant reduction of SoC Integration-level bugs, enabling the Xeon Integration team to accelerate the overall pre-silicon validation timeline.
Engineering Presentation


Front-End Design
DescriptionIn today's complex IP designs, features span multiple blocks to achieve a given functionality. This poses challenges for formal verification in yielding conclusive results and achieving coverage sign-off. Multiple clocks and resets, as well as constraint handling, are further concerns in formal verification. As a feature spans various blocks, the sequential depth of the cone increases significantly, leaving most properties undetermined. To overcome these challenges, this paper discusses techniques such as black-boxing, abstractions, advanced AI/ML features like Proof Master, and functional coverage merging. These techniques help in verifying a complex feature in an existing design or a new feature added to it.
The methodologies and techniques outlined above were successfully applied to qualify the Compliance feature of a USB 3.2 controller. The feature was verified with a minimal number of undetermined properties. The functional coverage from formal is merged with simulation coverage, and the formal vPlan is back-annotated into the master vPlan to provide comprehensive verification tracking. With this formal approach, we were able to expedite the verification process by a factor of two.
Research Manuscript


Security
SEC2: Hardware Security: Primitives & Architecture, Design & Test
DescriptionTo address the growing security issues faced by ARM-based mobile devices today, TrustZone was adopted to provide a trusted execution environment (TEE) to protect sensitive data.
Such TrustZone-based models have been proven to be effective, but they target CPU architectures and do not work for the security of widely used heterogeneous computing platforms such as FPGAs.
To solve this issue, we propose a comprehensive SoC-FPGA security framework, FPGA-TrustZone, to support FPGA TEE by extending the security of ARM TrustZone.
Experiments on real SoC-FPGA hardware development boards show that FPGA-TrustZone provides high security with low performance overhead.
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
DescriptionLimitations in Large Language Model (LLM) capabilities for hardware design tasks, such as generating functional Verilog codes, have motivated various fine-tuning optimizations utilizing curated hardware datasets from open-source repositories. However, these datasets remain limited in size and contain minimal checks on licensing for reuse, resulting in potential copyright violations by fine-tuned LLMs. Therefore, we propose an evaluation benchmark to estimate the risk of Verilog-trained LLMs to generate copyright-protected codes. To minimize this risk, we present an open-source Verilog dataset, FreeSet, containing over 220k files, along with the automated dataset curation framework utilized to provide additional guarantees of fair-use Verilog data. We then apply an LLM fine-tuning framework consisting of continual pre-training, resulting in FreeV, a fine-tuned Llama model for Verilog. Our results indicate that FreeV demonstrates the smallest risk of copyright infringement among prior works, with only a 3% violation rate. Furthermore, experimental results demonstrate improvements in Verilog generation functionality over its baseline model, improving VerilogEval pass@10 rates by over 10%.
Engineering Poster
Networking


DescriptionField Programmable Gate Array (FPGA) chips offer high flexibility and strong parallel processing capabilities. As FPGA chip scale increases, more and more configurable logic blocks (CLBs) are designed into them. Each CLB is composed of multiple look-up tables (LUTs), flip-flops, carry logic, and so on. These logic units can implement various combinational and sequential logic functions through programming. These analog modules are instantiated thousands of times within the chip, which poses great challenges for the dynamic power simulation of such chips. Getting so many analog units to toggle and generate accurate currents in the way they actually operate, without vectors, is a major challenge.
This article presents a dynamic simulation approach: build a model for each simulated module, including its physical information and its current information in multiple states, and bring these models into the digital circuit simulation environment. By setting the toggle time and state, like a conductor directing an orchestra, these modules can be directed to toggle the way they actually operate, so as to check the weak points of the chip under working conditions. Meanwhile, a chip power model can be generated for system-level power integrity analysis.
Research Manuscript


EDA
EDA1: Design Methodologies for System-on-Chip and 3D/2.5D System-in Package
DescriptionThe growing complexity of modern hardware has created vast design spaces that are difficult to explore efficiently. Current design space exploration (DSE) methods treat designs as flat parameter vectors, failing to leverage the rich structural information inherent in hardware architectures. This paper presents a novel RTL hierarchy aware approach to microarchitecture DSE that exploits the natural structure of hardware designs.
We propose an RTL hierarchy aware kernel that enables direct comparison of RTL hierarchies, preserving their structural characteristics. Our method incorporates module importance derived from hierarchical synthesis reports through a weighted kernel extension. Additionally, we introduce a clustering method that leverages the proposed kernel to identify distinct architectural patterns, enabling efficient parallel evaluation. Experimental results and ablation studies on a Gemmini-based RISC-V SoC demonstrate the superiority of our approach.
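To make the idea concrete, a hierarchy-aware kernel can be as simple as a weighted overlap of hierarchical instance paths, with weights taken from synthesis-report importance. The toy kernel below only illustrates that idea of comparing designs through their hierarchy rather than flat parameter vectors; it is not the kernel proposed in the paper.

```python
from collections import Counter

def hierarchy_paths(tree: dict, prefix: str = "") -> Counter:
    """Flatten an RTL module hierarchy (nested dict of instance -> children)
    into a multiset of hierarchical instance paths."""
    paths = Counter()
    for name, children in tree.items():
        path = f"{prefix}/{name}"
        paths[path] += 1
        paths.update(hierarchy_paths(children, path))
    return paths

def hierarchy_kernel(tree_a: dict, tree_b: dict, importance: dict[str, float]) -> float:
    """Toy weighted-overlap kernel: shared paths count more when synthesis
    reports mark the corresponding module as important (default weight 1.0)."""
    pa, pb = hierarchy_paths(tree_a), hierarchy_paths(tree_b)
    return sum(importance.get(p, 1.0) * min(pa[p], pb[p]) for p in pa.keys() & pb.keys())

# Example hierarchy: {"top": {"alu": {}, "cache": {"tag": {}, "data": {}}}}
```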
Research Manuscript


EDA
EDA9: Design for Test and Silicon Lifecycle Management
DescriptionContinuous-flow microfluidic chips are multilayered miniaturized platforms to manipulate small volumes of fluids with valves. There are two types of channels on a chip: flow channels for the reaction of fluids, and control channels for the actuation of valves. Multiplexers (MUXes) are essential microfluidic components for individually addressing many flow channels with few control channels. As the integration scale of microfluidic chips increases, the reliability of MUXes becomes a critical concern, as a single defective control channel in a MUX will affect a large part of the flow channels addressed by the MUX. This paper formally analyzes and identifies the design rules for a MUX to tolerate n defective control channels, and models the fault-tolerant MUX (FT-MUX) design problem as a binary constant-weight code problem to minimize resource overheads. We demonstrate
that FT-MUX improves resource efficiency by up to hundreds of times compared to the conventional fault-tolerant design method. Besides, given no less than 10 control channels, FT-MUX tolerates at least one defective control channel and addresses even more flow channels with equal or fewer resources than a standard MUX. The advantages become more significant as the integration scale increases.
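The coding-theoretic view can be made concrete with a toy greedy construction: each flow channel is assigned a constant-weight binary codeword over the control channels, and keeping the pairwise Hamming distance large is the kind of property the paper exploits for defect tolerance. The exact distance-to-defect relationship and the optimized construction are the paper's contribution and are not reproduced below.

```python
from itertools import combinations

def constant_weight_codewords(n_ctrl: int, weight: int, min_dist: int):
    """Greedily build a binary constant-weight code: each codeword tells a flow
    channel which control channels actuate it; a minimum pairwise Hamming
    distance leaves slack for defective control channels (toy illustration)."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    chosen = []
    for ones in combinations(range(n_ctrl), weight):
        cw = tuple(1 if i in ones else 0 for i in range(n_ctrl))
        if all(hamming(cw, c) >= min_dist for c in chosen):
            chosen.append(cw)
    return chosen

# e.g. 10 control channels, weight 5, distance 4: how many flow channels are addressable?
print(len(constant_weight_codewords(10, 5, 4)))
```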
Networking
Work-in-Progress Poster


DescriptionLearning expressive and generalizable representations from raw circuit graphs is crucial for advancing various tasks in electronic design automation (EDA). However, circuit representation learning is extremely challenging due to the intricate topological structure and rich functional semantics of circuits. To address this challenge, we propose a novel graph transformer architecture, called FuncFormer, which integrates the flow of functional propagation into the representation space to enable expressive and generalizable circuit representations. The major insight behind FuncFormer is the identification that capturing the flow of functional propagation is fundamental to circuit representation learning, as it inherently encapsulates both the circuit structure and functionality. Specifically, FuncFormer initializes input node features using functional input signals and propagates these features along the directions of signal flow based on the Boolean functionality of each logic gate (i.e., node). Subsequently, FuncFormer employs a self-attention module to produce final representations by attentively embedding the generated functional features. To demonstrate the effectiveness of FuncFormer, we evaluate FuncFormer on three typical EDA tasks: Quality of Result (QoR) prediction, functional reasoning, and formal property verification. The experiments show that (1) FuncFormer reduces the estimation error by 48.49% compared to the state-of-the-art (SOTA) methods for QoR prediction after logic synthesis; (2) FuncFormer improves the reasoning accuracy by 12.07% over the SOTA approach for identifying logically equivalent gates; (3) FuncFormer increases the number of solved properties by 4.77 and 2.80 times over manually designed heuristics and typical GNNs, respectively, on large-scale sequential circuits.
Research Manuscript


EDA
EDA6: Analog CAD, Simulation, Verification and Test
DescriptionThe increasing complexity of high-frequency circuits calls for efficient and accurate passive macromodeling techniques.
Existing passivity enforcement methods, including those in commercial tools, often encounter convergence issues or compromise accuracy.
The Domain-Alternated Optimization (DAO) framework seeks to restore accuracy through an additional optimization step but is hampered by high memory consumption and slow convergence, particularly for large-scale problems.
This paper presents \ours, a novel GPU-accelerated framework that recasts the passivity-enforced macromodeling problem as a neural network training task.
This approach significantly enhances both the speed and scalability of passivity enforcement.
Experimental results show that \ours achieves an average speedup of 7.63$\times$ in convergence compared to DAO, while reducing memory usage by two orders of magnitude.
This enables \ours to handle complex, high-port-count circuits with greater accuracy and efficiency, paving the way for robust high-frequency circuit simulations.
Networking
Work-in-Progress Poster


DescriptionThe integrity and reliability of the Landing Gear System (LGS) are crucial for aircraft safety.
However, the scarcity of real-world fault data hinders the creation of effective Predictive Maintenance (PdM) strategies, especially those relying on modern Machine Learning (ML) techniques.
As a result, this paper presents GAIA: the first Generative Artificial Intelligence (GenAI) approach for enabling the creation of digital twins to support PdM in the aviation domain.
Specifically, by leveraging multi-physics modeling and data-driven techniques, GAIA generates realistic in-distribution faulty samples to augment existing datasets.
As a use case, we consider the LGS and introduce DSLG D/R, a novel dataset specifically designed for LGS fault classification, created in collaboration with a partner whose name is omitted for blind review.
Our results demonstrate a significant 10.56% improvement in fault classification accuracy compared to other data augmentation methods.
To showcase the broader applicability of our method, we also evaluate it on the Electrical Faults dataset, a well-established benchmark for power system fault diagnosis.
Again, GAIA consistently outperforms pure physics-driven and other data augmentation methods, highlighting its versatility across critical safety domains.
The code and dataset will be released upon acceptance.
Research Manuscript


AI
AI3: AI/ML Architecture Design
Description3D intelligence leverages rich 3D features and stands as a promising frontier in AI, with 3D rendering fundamental to many downstream applications. 3D Gaussian Splatting (3DGS), an emerging high-quality 3D rendering method, requires significant computation, making real-time execution on existing GPU-equipped edge devices infeasible. Previous efforts to accelerate 3DGS rely on dedicated accelerators that require substantial integration overhead and hardware costs. This work proposes an acceleration strategy that leverages the similarities between the 3DGS pipeline and the highly optimized conventional graphics pipeline in modern GPUs. Instead of developing a dedicated accelerator, we enhance existing GPU rasterizer hardware to efficiently support 3DGS operations. Our results demonstrate a 23× increase in processing speed and a 24× reduction in energy consumption, with improvements yielding 6× faster end-to-end runtime for the original 3DGS algorithm and 4× for the latest efficiency-improved pipeline, achieving 24 FPS and 46 FPS respectively. These enhancements incur only a minimal area overhead of 0.2% relative to the entire SoC chip area, underscoring the practicality and efficiency of our approach for enabling 3DGS rendering on resource-constrained platforms.
Research Manuscript


EDA
EDA1: Design Methodologies for System-on-Chip and 3D/2.5D System-in Package
DescriptionThe growing demand for efficient, high-performance processing in machine learning (ML) and image processing has made hardware accelerators, such as GPUs and Data Streaming Accelerators (DSAs), increasingly essential. These accelerators enhance ML and image processing tasks by offloading computation from the CPU to dedicated hardware. These accelerators rely on interconnects for efficient data transfer, making interconnect design crucial for system-level performance. This paper introduces Gem5-AcceSys, an innovative framework for system-level exploration of standard interconnects and configurable memory hierarchies. Using a matrix multiplication accelerator tailored for transformer workloads as a case study, we evaluate PCIe performance across diverse memory types (DDR4, DDR5, GDDR6, HBM2) and configurations, including host-side and device-side memory. Our findings demonstrate that optimized interconnects can achieve up to 80% of device-side memory performance and, in some scenarios, even surpass it. These results offer actionable insights for system architects, enabling a balanced approach to performance and cost in next-generation accelerator design.
Research Manuscript


EDA
EDA2: Design Verification and Validation
DescriptionIn this paper, we present the first GPU-accelerated RTL simulator, addressing critical challenges in high-speed circuit verification. Traditional CPU-based RTL simulators struggle with scalability, and while FPGA-based emulators offer acceleration, they are costly and less accessible. Previous GPU-based attempts have failed to achieve speed-up on RTL simulation due to the irregular nature of circuit graphs, which conflicts with the SIMT (Single Instruction, Multiple Thread) paradigm of GPUs. Inspired by the design of emulators, our approach introduces a novel virtual Very Long Instruction Word (VLIW) architecture, designed for efficient CUDA execution, that maps circuit logic to the GPU in a process analogous to FPGA physical design. This architecture mitigates issues of irregular memory access and thread divergence, unlocking GPU potential for emulation. Our solution achieves remarkable speed-up over the best CPU simulators, democratizing high-speed RTL emulation with accessible hardware and establishing a new frontier for GPU-accelerated circuit verification.
Exhibitor Forum


DescriptionSemiconductor devices face increasing risks of attacks, exploits, and cyber vulnerabilities. Complex supply chains, distribution channels, and in-field deployments make it difficult to secure a device at every point in its lifespan. This session will focus on a new technique that hardens semiconductor designs to thwart malicious actors from embedding Trojans, introducing design flaws, or implementing manufacturing changes that compromise device functionality, reliability, or data integrity.
We will share a patented design hardening approach that not only supports thorough validation but also enables quantitative assessment of security improvements at the RTL level. Central to this approach is intelligent instrumentation, based on a sophisticated method that precisely identifies and characterizes chip vulnerabilities, assesses their severity and impact, and strategically implements countermeasures. Our method leverages supervised machine learning to make data-driven tradeoffs between security efficacy and cost, ensuring optimal design instrumentation.
Attendees will gain insight into advanced GenAI and ML-based design for security and trust methodology that takes a proactive approach to microelectronics security at the RTL level. The session will detail how semiconductor devices can be made secure - monitored for anomalous behavior - from design to fabrication to deployment.
Research Manuscript


EDA
EDA8: Design for Manufacturing and Reliability
DescriptionWith integrated circuits shrinking in feature size, layout printability has become increasingly challenging, making lithographic hotspot detection ever-crucial in computer-aided design (CAD) flows. In recent years, numerous studies have explored deep learning to detect lithographic hotspots, offering promising results. However, neural networks can easily be biased and overfit when lacking sufficient training data, especially in the CAD domain. A generalizable DL-based hotspot detector should learn the genuine lithography principle and ensure consistent accuracy across layouts from various designs at the same technology node, regardless of their varying design styles. However, we find that existing convolutional neural network (CNN)-based hotspot detectors fail to generalize to different circuit layouts other than the design it has been trained for. To this end, we propose a few-shot learning-based framework for generalizable CNN-based hotspot detection. We develop a meta-learning scheme that asynchronously updates the CNN feature extraction and classification component to obtain a meta-initialized model that can quickly adapt to new designs using as few as one training layout clip. We propose a layout topology-based sampling strategy for few-shot adaptation to enhance generalization stability. Experimental results on ICCAD 2012 and 2019 datasets show that our framework enables superior generalization capabilities than prior arts on unseen new designs.
Engineering Presentation


AI
Back-End Design
DescriptionThis paper presents an effective and efficient design method for inserting failure inspection patterns in BSPDN (Back-Side Power Delivery Network) designs. BSPDN designs provide more signal resources and mitigate IR-drop during front-side design, resulting in better power integrity and improved performance. However, since the conventional silicon failure inspection method for FSPDN (Front-Side Power Delivery Network) cannot be used, an appropriate method is required to inspect failures in silicon. In this paper, two pattern insertion methods for failure inspection are presented. One method uses inspection patterns on the topmost routing layer for electrical failure inspection (EFI). The topmost inspection patterns are connected to the target nets with the conventional auto-routing method without changing the existing signal routings of the nets. The other is a die-lens, an empty metal-free region, for optical failure inspection (OFI). These two efficient failure inspection patterns can be used at the wafer or package level to locate silicon failure points. We have designed a BSPDN test vehicle using the proposed method and obtained satisfactory results.
Engineering Poster
Networking


DescriptionComprehensive documentation and robust traceability play a critical role in analog and mixed-signal design flows especially from a compliance perspective.
In this abstract, we explore a strategy that enhances design flows by using Generative AI to handle the routine work of documentation and traceability, enabling design engineers to concentrate on creative and high-value tasks.
By leveraging the solution proposed, we demonstrate how to –
· Identify and track design changes effectively
· Contextualize designs by integrating relevant insights from specification documents, datasheets, and other collateral
· Utilize generative AI to generate summaries, add contextual comments, and produce clear, structured documentation.
Through the evidence provided, we show how this approach streamlines workflows, reduces manual overhead, and enhances collaboration, thereby driving innovation in design engineering.
DAC Pavilion Panel


DescriptionThe rapid rise of generative AI holds the promise of transforming semiconductor design and verification flows with unprecedented capabilities in automation, intelligent synthesis, and error detection. Enthusiasts envision a future where AI accelerates innovation cycles, reduces costs, and addresses the growing complexity of chip design with ease. However, skeptics question whether these expectations are grounded in reality, pointing to challenges like the "black box" nature of AI models, lack of explainability, and potential over-reliance on technology that may not yet be mature.
This panel brings together leading voices representing both ends of the spectrum to tackle a critical question: Are we innovating responsibly with generative AI, or are we risking costly illusions? Champions of generative AI will argue that the technology is already enabling significant advancements, from automating repetitive tasks to uncovering design optimizations that were previously unfeasible. On the other side, cautious experts will highlight the risks of blindly adopting AI-driven solutions, including trust deficits, bias in decision-making, and the potential to exacerbate the talent shortage by creating new skill demands that the current workforce may struggle to meet.
The panelists will debate key points of contention, such as:
• Accuracy vs. Reliability: Can generative AI provide the level of precision required for mission-critical designs, or do its inherent limitations in explainability make it a liability in high-stakes environments?
• Automation vs. Creativity: Does AI enable engineers to focus on higher-level problem solving, or does it risk stifling creativity by promoting over-automation and reliance on pre-trained models?
• Short-term Gains vs. Long-term Impacts: Are current generative AI applications genuinely reducing time-to-market, or are they introducing new complexities and risks that could offset these benefits in the long run?
• Workforce Transformation: Will generative AI alleviate the talent shortage by streamlining workflows, or will it deepen the skills gap by demanding expertise in both AI and traditional design methodologies?
• Trust and Governance: How do we ensure that AI-generated solutions are transparent, verifiable, and aligned with industry standards?
Join us for this dynamic and thought-provoking discussion as our panelists confront these challenges head-on and seek to separate innovation from illusion. Together, we'll explore the tangible opportunities, the realistic timelines for adoption, and the strategies needed to ensure that generative AI drives sustainable progress in semiconductor design and verification. Whether you're an optimist, a skeptic, or somewhere in between, this panel promises insights to inform your perspective on the future of design automation.
Research Panel


AI
EDA
DescriptionGenerative Artificial Intelligence (AI) is poised to transform Electronic Design Automation (EDA), offering groundbreaking opportunities to enhance how semiconductors are designed and optimized. By automating complex tasks such as circuit layout, synthesis, and verification, AI has the potential to drastically reduce design cycles, improve quality, and unlock innovative solutions that transcend traditional methodologies.
Despite its promise, integrating generative AI into EDA raises significant technical challenges. Concerns about the reliability, robustness, and interpretability of AI-generated designs remain central, particularly for safety-critical applications where the cost of errors is high. There is also the question of whether AI tools can consistently produce results that meet or exceed the quality standards of human-designed circuits. The computational cost and energy requirements for training and deploying AI models raise concerns about scalability and sustainability, especially as designs grow in complexity.
Additionally, AI-driven workflows introduce concerns about security vulnerabilities and intellectual property (IP) privacy. The use of large datasets for model training and the integration of AI in design processes could expose sensitive information or inadvertently introduce exploitable weaknesses into the final designs. Addressing these risks will be essential to ensure trust in AI-driven EDA tools. Furthermore, the adoption of these technologies is set to reshape job skills in the industry, demanding new expertise in AI, data science, and software engineering alongside traditional EDA competencies.
This panel brings together perspectives from leading industry practitioners, academic researchers, and technology innovators to delve into the technical implications of integrating generative AI into EDA workflows. Discussions will focus on the opportunities AI presents for improving efficiency and design quality, the challenges of deploying reliable and interpretable models, addressing security and IP risks, and the evolving skillsets required to work alongside AI-driven tools. Attendees will gain a nuanced understanding of the technical opportunities and hurdles at the intersection of AI and EDA, as well as insights into how the industry is preparing for this transformative shift.
Research Special Session


AI
DescriptionApplications of Generative AI can substantially improve upon human-driven silicon design methodologies. Solutions built on rich foundation models can bring forward a baseline of fundamental knowledge, specifications, manuals, and design examples tailored to each stage of common design methodologies. Tool developers and corporate CAD departments can create layers representing insights into their specific capabilities, and design organizations may add even more specific customizations. To enable this, an underlying agentic framework is needed: one with open interfaces that encapsulate common tool categories, offer reasoned orchestration of directives and design data as they flow between design stages, and analyze intermediate results to drive tool inputs in support of sequential and iterative progression toward desired outcomes.
Research Manuscript


EDA
EDA3: Timing Analysis and Optimization
DescriptionAccurate cell timing characterization is essential: static timing analysis relies on it to verify timing performance and ensure design robustness across various PVT conditions (corners). The corner explosion in modern designs amplifies the efficiency and scalability challenges of accurate characterization, and the conventional approach of SPICE simulation alone becomes prohibitively expensive due to the increasing computational complexity and the volume of characterized data. In this paper, we view the characterization problem from a generative modeling perspective to tackle this efficiency and scalability challenge. With a hybrid of a generative adversarial network (GAN) and an autoencoder, our generative model learns and generalizes across various timing arcs and corners. Experimental results demonstrate that the proposed framework achieves high accuracy and extensibility while significantly reducing runtime.
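As a rough illustration of the modeling idea (assuming PyTorch; the 7x7 LUT shape, feature dimensions, and module names are invented for this sketch and are not the paper's architecture), a conditional generator can map an arc/corner descriptor to an autoencoder latent that decodes into a delay table, with a critic judging latents given the same condition during adversarial training.

# Hedged sketch: autoencoder over delay LUTs plus a conditional GAN over its latents.
import torch
import torch.nn as nn

LUT, ARC, PVT, Z = 7 * 7, 12, 3, 16   # LUT entries, arc features, corner features, latent size

def mlp(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, d_out))

class TimingCharModel(nn.Module):
    """Illustrative hybrid: the encoder/decoder learn a latent LUT space; the
    generator is trained (adversarially, via the critic, plus reconstruction
    losses not shown here) to hit that space from an arc/corner condition."""
    def __init__(self):
        super().__init__()
        self.encoder = mlp(LUT, Z)            # real LUT -> latent code
        self.decoder = mlp(Z, LUT)            # latent code -> reconstructed LUT
        self.generator = mlp(ARC + PVT, Z)    # (arc, corner) condition -> latent code
        self.critic = mlp(Z + ARC + PVT, 1)   # real vs. generated latent, given condition

    def characterize(self, cond):
        """Predict a 7x7 delay table for an unseen arc/corner without SPICE."""
        return self.decoder(self.generator(cond)).view(-1, 7, 7)

model = TimingCharModel()
cond = torch.randn(2, ARC + PVT)              # two (arc descriptor, PVT corner) queries
print(model.characterize(cond).shape)         # torch.Size([2, 7, 7])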
Engineering Presentation


AI
Back-End Design
Chiplet
DescriptionPower integrity is a major design challenge at advanced nodes. Designs are becoming increasingly large and complex, even as more computing resources and innovative algorithms are added for EM-IR analysis. This results in an unmanageable number of IR-drop and EM violations that rely on manual fixing, and in missed PPA opportunities. One of the major bottlenecks of in-design EM-IR analysis is that it is computationally expensive due to the size and coupled nature of the power network. To overcome these issues, this submission demonstrates a novel AI-driven methodology that efficiently identifies root causes and categorizes the implementation of efficient IR-repair methods. State-of-the-art generative-AI technology automatically mitigates IR-drop issues early in the design cycle, improving productivity and time-to-market. A staggering >95% IR-drop fix rate was achieved on multiple diverse product-line designs (4nm and 5nm nodes) without compromising power, performance, and area. The methodology can further be applied to system-level designs and to multi-corner, multi-scenario analysis, making it compatible with all types of designs.
Acronyms: EM: Electromigration, IR: Current x Resistance(Voltage), AI: Artificial Intelligence, PPA: Power Performance Area
Networking
Work-in-Progress Poster


DescriptionDeep generative models enable the generation of diverse layout patterns. However, existing pattern generators perform poorly in meeting requirements on layout density and large scale, which are crucial parameters for process monitoring and evaluation steps such as etching and CMP. Recognizing that various application scenarios impose specific requirements on test patterns, we propose regression condition constraints based on multi-scale gradient generative adversarial networks (MSG-GAN), employing a vicinal risk minimization loss function and a novel label introduction method to facilitate layout generation for specific conditions. Experiments demonstrate that our generator is capable of producing layouts of desired densities while satisfying design rule constraints. Furthermore, large continuous layouts without stitches can also be generated thanks to the multi-scale gradient connections in MSG-GAN.
Networking
Work-in-Progress Poster


DescriptionContinual learning, the ability to acquire and transfer knowledge through a model's lifetime, is critical for artificial agents that interact in real-world environments. Biological brains inherently demonstrate these capabilities while operating within limited energy and resource budgets. Achieving continual learning capability in artificial systems considerably increases memory and computational demands, and even more so when deploying on platforms with limited resources. In this work, Genesis, a spiking continual learning accelerator, is proposed to address this gap. The architecture supports neurally inspired mechanisms, such as activity-dependent metaplasticity, to alleviate catastrophic forgetting. It integrates low-precision continual learning parameters and employs a custom data movement strategy to accommodate the sparsely distributed spikes. Furthermore, the architecture features a memory mapping technique that places metaplasticity parameters and synaptic weights in a single address location for faster memory access. Results show that the mean classification accuracy for Genesis is 74.6% on a task-agnostic split-MNIST benchmark with power consumption of 17.08 mW in a 65nm technology node.
Research Manuscript


EDA
EDA3: Timing Analysis and Optimization
DescriptionConventionally, glitch reduction is well-studied in digital design to improve power, efficiency, and security. In contrast, this paper combines the addition and removal of glitches to minimize the power side-channel leakage. Glitch Manipulation is achieved through gate sizing-based arrival time control, which is cast as a Geometric Programming formulation. We develop a framework, GLiTCH, for glitch manipulation that is guided by functional and timing simulations. The framework is evaluated on popular cipher designs like AES, CLEFIA, and SM4. Our findings illustrate up to 52.82% improvement in the Guessing Entropy for a 38.74% area overhead on average across the evaluated ciphers.
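To make the geometric-programming angle concrete, the following hedged sketch (using CVXPY's gp=True mode; the delay model, coefficients, and bounds are illustrative, not the paper's formulation) sizes a small gate chain so that the path arrival time is steered below a target while total gate size, a proxy for area, is minimized.

# Hedged sketch: gate sizing as a geometric program (not GLiTCH's exact model).
# Stage delay is the classic posynomial a_i/x_i + b_i*load_i (intrinsic delay
# plus load from the next stage); all coefficients here are made up.
import cvxpy as cp
import numpy as np

n = 4
a = np.array([1.0, 1.2, 0.9, 1.1])        # drive-strength coefficients
b = np.array([0.5, 0.6, 0.4, 0.3])        # load coefficients
x = cp.Variable(n, pos=True)              # gate sizes (positive, as GP requires)

load = cp.hstack([x[1:], cp.Constant([2.0])])              # last gate drives a fixed load
path_delay = cp.sum(cp.multiply(a, x ** -1) + cp.multiply(b, load))

# Keep the path arrival time under a target so the glitch condition is controlled,
# while minimizing total size (area proxy) subject to sizing bounds.
T_target = 6.0
prob = cp.Problem(cp.Minimize(cp.sum(x)),
                  [path_delay <= T_target, x >= 0.5, x <= 8.0])
prob.solve(gp=True)
print("gate sizes:", np.round(x.value, 3), "path delay:", round(float(path_delay.value), 3))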
Research Manuscript


EDA
EDA6: Analog CAD, Simulation, Verification and Test
DescriptionAnalog/mixed-signal circuit design encounters significant challenges due to performance degradation from process, voltage, and temperature (PVT) variations. To achieve commercial-grade reliability, iterative manual design revisions and extensive statistical simulations are required. While several studies have aimed to automate variation-aware analog design to reduce time-to-market, the substantial mismatches in real-world wafers have not been thoroughly addressed. In this paper, we present GLOVA, an analog circuit sizing framework robust to PVT variations that effectively manages the impact of diverse random mismatches. In the proposed approach, risk-sensitive reinforcement learning is leveraged to account for the reliability bound affected by PVT variations, and an ensemble-based critic is introduced to achieve sample-efficient learning. For design verification, we also propose a μ-σ evaluation and simulation reordering method to reduce the simulation cost of identifying failed designs. GLOVA supports verification through industrial-level PVT variation evaluation methods, including corner simulation as well as global and local Monte Carlo simulations.
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionNative 3D IC design promises commercially viable chips with improved performance and density. While pseudo-3D flows achieve manufacturability, they fall short of full 3D optimization due to reliance on 2D EDA tools, limiting cross-tier optimization. Metal Layer Sharing (MLS) offers a solution by enabling cross-tier routing but introduces two challenges: timing degradation from indiscriminate MLS and testability issues in hybrid-bonded 3D ICs. We propose GNN-MLS, a GNN-assisted framework for targeted cross-tier net optimization through MLS to improve timing globally, alongside a tailored design-for-test (DFT) solution to ensure robust testability with minimal power overhead. Experimental results show GNN-MLS reduces timing violations by up to 79% and improves WNS by 81% and TNS by 94%, advancing pseudo-3D flows toward fully optimized, commercially viable 3D ICs.
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionGPUs have been heavily utilized in diverse applications, and numerous approaches, including kernel fusion, have been proposed to boost GPU efficiency through concurrent kernel execution. However, these approaches generally overlook the opportunities to mitigate warp stalls and improve instruction-level parallelism (ILP) through inter-kernel resource sharing. To address this issue, we introduce GoPTX, a novel design for kernel fusion that improves ILP through deliberate weaving of instructions at the PTX level. GoPTX establishes a merged control flow graph (CFG) from the original kernels, enabling the interleaving of instructions that would otherwise execute sequentially and minimizing pipeline stalls on data hazards. We further propose a latency-aware instruction weaving algorithm for more efficient instruction scheduling and an adaptive code slicing method to enlarge the scheduling space. Experimental evaluation demonstrates that GoPTX achieves an average speedup of 11.2% over the baseline concurrent execution, with a maximum improvement of 23%. Hardware resource utilization statistics show significant improvements in eligible warps per cycle and resource use.
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionGPart is a scalable multilevel framework for graph partitioning that integrates GNN embeddings with efficient coarsening and refinement techniques. On the Titan23 benchmarks, GPart achieves a cut size reduction of 34.13% to 42.92% over METIS and improves cut size by 9.30% on selected DIMACS benchmarks compared to G-kway. Furthermore, experiments on the Titan23 benchmarks show that GPart reduces normalized memory usage by 24.6x compared to GAP and 12.4x compared to GenPart. Unlike existing GNN-based methods, which require large hidden layers and substantial memory, GPart's multilevel architecture reduces hidden layer sizes, significantly optimizing memory efficiency.
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionCoarse-grained reconfigurable architecture (CGRA) has emerged as a promising solution for accelerating computationally intensive applications, particularly in the field of artificial intelligence. One of the primary challenges for CGRA compilers is generating effective mapping results for complex applications within a limited timeframe. This paper presents an enhanced pre-scheduling method that integrates Integer Linear Programming (ILP) and Graph Neural Networks (GNN), along with a corresponding two-stage mapping approach. This combination significantly reduces the search space and accelerates the solution process for mapping problems. Experimental results demonstrate performance improvements ranging from 29.4% to 406.7%, along with compilation time reductions of up to 1106.8x compared to existing compilation techniques, as well as excellent scalability.
Networking
Work-in-Progress Poster


DescriptionDesigning integrated circuits involves substantial complexity, posing challenges in realizing their potential applications - from custom digital cells to analog circuits. Despite extensive research over the past decades in building versatile and automated frameworks, there remains open room to explore more computationally efficient AI-based solutions. This paper introduces the graph composer GraCo, a novel method for synthesizing integrated circuits using reinforcement learning (RL). GraCo learns to construct a graph step-by-step, which is then converted into a netlist and simulated with SPICE. We demonstrate that GraCo is highly configurable, enabling the incorporation of prior design knowledge into the framework. We formalize how this prior knowledge can be utilized and, in particular, show that applying consistency checks enhances the efficiency of the sampling process. To evaluate its performance, we compare GraCo to a random baseline, which is known to perform well for smaller design space problems. We demonstrate that GraCo can discover circuits for tasks such as generating standard cells, including the inverter and the two-input NAND (NAND2) gate. Compared to a random baseline, GraCo requires 5x fewer sampling steps to design an inverter and successfully synthesizes a NAND2 gate that is 2.5x faster.
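The consistency-check idea can be pictured with a small sketch (plain Python; the device model, rules, and random policy are hypothetical stand-ins, not GraCo's actual checks): before each construction step, candidate connections that would make the netlist graph un-simulatable are filtered out, so the policy only samples from legal actions and no sampling budget is wasted.

# Hedged sketch of consistency-checked, step-wise graph construction.
import random

def candidate_actions(graph, nets, terminals):
    """All (terminal, net) connections not yet in the graph."""
    return [(t, n) for t in terminals for n in nets if (t, n) not in graph]

def is_consistent(graph, action):
    terminal, net = action
    device, pin = terminal
    # Rule 1 (illustrative): a pin connects to exactly one net.
    if any(t == terminal for (t, _) in graph):
        return False
    # Rule 2 (illustrative): never tie a transistor's source and drain to the same net.
    partner = {"S": "D", "D": "S"}.get(pin)
    if partner and ((device, partner), net) in graph:
        return False
    return True

def grow_graph(steps, nets, terminals, policy=random.choice):
    graph = set()                                   # set of (terminal, net) edges
    for _ in range(steps):
        legal = [a for a in candidate_actions(graph, nets, terminals)
                 if is_consistent(graph, a)]
        if not legal:
            break
        graph.add(policy(legal))                    # an RL policy would score `legal`
    return graph

# Tiny example: wire up two transistors over three nets.
terminals = [(dev, pin) for dev in ("M1", "M2") for pin in ("G", "D", "S")]
print(grow_graph(6, ["IN", "OUT", "VSS"], terminals))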
Research Manuscript


Security
SEC1: AI/ML Security/Privacy
DescriptionWide deployment of machine learning models on edge devices has rendered model intellectual property (IP) and data privacy vulnerable. We propose GNNVault, the first secure Graph Neural Network (GNN) deployment strategy based on a Trusted Execution Environment (TEE).
GNNVault follows a *partition-before-training* design and includes a private GNN rectifier that complements a public backbone model. This way, both critical GNN model parameters and the private graph used during inference are protected within secure TEE compartments. Real-world implementations with Intel SGX demonstrate that GNNVault safeguards GNN inference against state-of-the-art link stealing attacks with negligible accuracy degradation (<2%).
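A minimal sketch of the partition-before-training split, assuming NumPy and using dense A_hat @ X @ W message passing as a stand-in for real GNN layers (all names, shapes, and random data are illustrative, not GNNVault's implementation): the public backbone runs outside the enclave on public structure, while the rectifier and the private graph stay inside the TEE.

# Hedged sketch of a public backbone + private in-TEE rectifier.
import numpy as np

def normalize_adj(A):
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    d = A_hat.sum(1)
    return A_hat / np.sqrt(np.outer(d, d))       # symmetric normalization

def public_backbone(X, A_public, W_pub):
    """Runs in the normal world: only public graph structure and weights."""
    return np.maximum(normalize_adj(A_public) @ X @ W_pub, 0.0)

def private_rectifier(H, A_private, W_priv):
    """Runs inside the TEE: private edges and critical weights never leave it."""
    return normalize_adj(A_private) @ H @ W_priv  # class logits

rng = np.random.default_rng(0)
n, f, h, c = 6, 8, 4, 3                           # nodes, features, hidden, classes
X = rng.normal(size=(n, f))
A_pub = (rng.random((n, n)) < 0.3).astype(float)  # public edges
A_priv = (rng.random((n, n)) < 0.3).astype(float) # private edges (kept in enclave)
H = public_backbone(X, A_pub, rng.normal(size=(f, h)))
logits = private_rectifier(H, A_priv, rng.normal(size=(h, c)))
print(logits.argmax(1))                           # predicted labels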
Research Manuscript


EDA
EDA6: Analog CAD, Simulation, Verification and Test
DescriptionThis paper introduces a novel graph-guided transfer learning approach to boost the efficiency of system-level optimization of analog/mixed-signal circuits. The system-level optimization is based on Reinforcement Learning (RL) in combination with Graph Attention Networks (GAT). The results surpass the state of the art in efficiency and optimality. The key innovation is a graph similarity detection method that leverages embedded design knowledge to identify electrical similarities and trade-offs, enhancing knowledge transferability even between dissimilar circuit architectures. Applied to the case study of 4th-order continuous-time Delta-Sigma analog-to-digital converters, the graph-based transfer learning framework enhances RL sampling efficiency, reducing the number of simulations by up to 11x, and improves the optimization results by 12.4% compared to optimization from scratch. As the framework accelerates knowledge transfer across different architectures, it can boost optimization efficiency and improve performance for a broad range of analog/mixed-signal systems.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionGraph-based search for approximate vector similarity is essential in AI applications, such as retrieval-augmented generation. To support large-scale searches, vector search graphs are often stored on storage devices like SSDs. In this paper, we introduce GraphAccel, an in-storage accelerator optimized for efficient graph-based vector similarity search. Our architecture incorporates an optimized page packing mechanism to reduce SSD page accesses per query, alongside a speculative search scheme that maximizes utilization of idle SSD chips and channels. Through these optimizations, GraphAccel achieves notable performance improvements over existing SSD-based graph search solutions, including DiskANN and DiskANN++.
Networking
Work-in-Progress Poster


DescriptionThe conservative nature of static timing analysis (STA) in delay estimation consistently results in overdesign, leading to suboptimal power efficiency and increased area overhead, particularly in compute-intensive AI applications. To address these limitations, this paper introduces GraphDTA, a learning-based dynamic timing analysis (DTA) framework tailored for functional units in AI accelerators. GraphDTA combines graph-based representation learning with a downstream model to predict workload-induced dynamic arrival times with high accuracy. We evaluate our approach on 10 benchmarks, including multiplier units (MULs), multiply-accumulate units (MACs), and matrix multiplier units (MMUs), across 45nm and 7nm technologies. Our framework surpasses existing machine learning methods on both technologies and provides a roughly 50X speedup compared to gate-level simulation.
Research Manuscript


Systems
SYS6: Time-Critical and Fault-Tolerant System Design
DescriptionAs graph tasks become pervasive in real-time and safety-critical domains (e.g., financial fraud detection and electrical power systems), it is essential to guarantee their reliable execution in addition to pursuing extraordinary performance. However, because existing Fault Injection (FI) reliability analysis methods neglect the graph-specific execution paradigm, they typically produce inaccurate characterizations of system error resilience, making it challenging to provide helpful guidance for efficient and reliable graph processing paradigm design.
This paper proposes GraphFI, an efficient Graph Fault Injection framework for the universal parallel task acceleration platform (i.e., GPGPUs). Our key insight is to progressively excavate graph-specific error propagation and effect mechanisms, thereby avoiding blind FI trials. First, observing that iterations with similar active vertex sets exhibit similar error behavior, we propose iteration-driven GraphFI (ID-GraphFI), which selects only representative iterations for fast error resilience profiling. Second, by detecting resilience-similar communities in the graph topology, we propose topology-driven GraphFI (TD-GraphFI), which selects only representative vertices to evaluate the overall reliability of each community. Third, by exploiting the graph-specific fault monotonicity property, we propose monotonicity-driven GraphFI (MD-GraphFI) to draw fine-grained boundaries for severe system errors and avoid predictable or unnecessary fault injections. Combining them all, GraphFI reduces the system fault site space by up to two orders of magnitude, achieving a 2.1-15.2X speedup compared to SOTA methods while providing better reliability assessment accuracy.
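A toy sketch of the iteration-selection idea in ID-GraphFI (plain Python; the Jaccard threshold and greedy grouping are illustrative, not the paper's exact criterion): iterations whose active vertex sets look alike share one fault-injection representative.

# Hedged sketch: pick representative iterations by active-vertex-set similarity.
def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0

def representative_iterations(active_sets, threshold=0.8):
    """active_sets: list of sets of active vertex ids, one per iteration."""
    reps = []                                  # (iteration index, active set)
    for it, vs in enumerate(active_sets):
        if all(jaccard(vs, rvs) < threshold for _, rvs in reps):
            reps.append((it, vs))              # behaves unlike any chosen rep
    return [it for it, _ in reps]

# Example: a run whose frontier shrinks over iterations.
active = [{0, 1, 2, 3, 4, 5}, {0, 1, 2, 3, 4}, {0, 1, 2, 3}, {2, 3}, {3}]
print(representative_iterations(active))       # inject faults only in these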
Networking
Work-in-Progress Poster


DescriptionDespite their greater energy efficiency, it remains challenging for high-level users to customize graph processing accelerators on FPGAs. This paper introduces Graphitron, a domain-specific language for graph processing that customizes FPGA-based accelerators for various algorithms while hiding the complexities of low-level FPGA design from users. Graphitron defines vertex and edge as primitive data types, which simplifies the description of operations on edges and vertices in general graph processing algorithms. Further, the Graphitron compiler leverages graph semantic information inherent to graph processing to apply typical hardware optimization techniques, including pipelining, data shuffling, caching, and burst accesses, to the generated accelerators for higher performance. Our experiments show that Graphitron offers exceptional flexibility in algorithm description and significantly enhances the productivity of graph processing accelerator design. Graphitron-crafted accelerators generally achieve performance comparable to those generated from pre-defined templates and even outperform them on benchmarks that can benefit from flexible graph processing architectures.
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionOptimizing LLM inference has become increasingly important as the demand for efficient on-device deployments grows. To reduce the computational overhead in the MLP components, which account for a significant portion of LLM inference, ReLU-fied LLMs have been introduced to maximize activation sparsity. Several sparsity prediction methods have been developed to efficiently skip unnecessary memory accesses and computations by predicting activation sparsity. In this paper, we propose a novel magnitude-based, training-free sparsity prediction technique called Grasp that builds on the existing sign-bit-based method for ReLU-fied LLMs. The proposed method enhances prediction accuracy by grouping values according to the distribution within vectors and explicitly accounting for statistical outliers. This allows us to estimate the impact of each element more accurately yet efficiently, improving both activation sparsity prediction accuracy and computational efficiency. Compared to the state-of-the-art technique, Grasp achieves higher sparsity prediction accuracy and 11% higher skipping efficiency, which corresponds to a 1.85× speedup over dense inference.
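As a hedged illustration of magnitude-aware sparsity prediction (NumPy; the grouping scheme, group size, and outlier fraction below are simplifications invented for this sketch, not Grasp's exact method), a ReLU layer's pre-activation can be cheaply approximated from sign bits scaled by per-group mean magnitudes plus a few exactly kept outlier weights, and only neurons predicted positive need to be computed.

# Hedged sketch of magnitude-aware activation-sparsity prediction.
import numpy as np

def build_predictor(W, group=16, outlier_frac=0.01):
    """W: (d_in, d_hidden) up-projection of a ReLU-fied MLP block."""
    mag = np.abs(W)
    cut = np.quantile(mag, 1.0 - outlier_frac)
    W_out = np.where(mag >= cut, W, 0.0)                 # outliers kept exact
    W_low = np.where(mag < cut, W, 0.0)                  # the rest: sign + scale
    d_in, d_hid = W.shape
    signs = np.sign(W_low).reshape(d_in // group, group, d_hid)
    scales = np.abs(W_low).reshape(d_in // group, group, d_hid).mean(axis=1)
    return signs, scales, W_out

def predict_active(x, signs, scales, W_out, group=16):
    xg = x.reshape(-1, group)                            # (n_groups, group)
    approx = np.einsum('gk,gkh->gh', xg, signs)          # per-group signed sums
    pre_act = (approx * scales).sum(axis=0) + x @ W_out  # cheap pre-activation estimate
    return pre_act > 0                                   # neurons worth computing

rng = np.random.default_rng(0)
d_in, d_hid = 64, 256
W = rng.normal(size=(d_in, d_hid)) * (rng.random((d_in, d_hid)) < 0.5)
x = rng.normal(size=d_in)
mask = predict_active(x, *build_predictor(W))
exact = np.maximum(x @ W, 0.0)
print("predicted active:", mask.sum(), "actually active:", (exact > 0).sum())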
Networking
Work-in-Progress Poster


DescriptionReinforcement learning is computationally intensive due to frequent data exchanges between learners and actors, making it hard to fully utilize the GPU. To address this, we propose an RL framework, GRL, marking the first time the complete RL process is deployed on a single GPU. Based on the features of the GPU, we design a lock-free model queue and fused actors to enhance the experience throughput of the framework. We also propose an auto-configurator that adjusts the runtime configuration to speed up the whole framework. We test GRL in various RL environments, where it improves throughput by 4x to 200x.
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
Description3D Gaussian Splatting (3D-GS) has emerged as a promising alternative to neural radiance fields (NeRF), as it offers high speed as well as high image quality in novel view synthesis. Despite these advancements, 3D-GS still struggles to meet the frames-per-second (FPS) demands of real-time applications. In this paper, we introduce GS-TG, a tile-grouping-based accelerator that enhances 3D-GS rendering speed by reducing redundant sorting operations while preserving rasterization efficiency. GS-TG addresses a critical trade-off in 3D-GS rendering: increasing the tile size effectively reduces redundant sorting operations, but it concurrently increases unnecessary rasterization computations. During sorting, GS-TG therefore groups small tiles (to form large tiles) so that sorting operations are shared across the tiles within each group, significantly reducing redundant computations. During rasterization, a bitmask assigned to each Gaussian identifies the relevant small tiles, enabling efficient sharing of the sorting results. Consequently, GS-TG performs sorting as if a large tile size were used (by grouping tiles in the sorting stage) while rasterization proceeds with the original small tiles (by using bitmasks in the rasterization stage). GS-TG is a lossless method requiring no retraining or fine-tuning, and it can be seamlessly integrated with previous 3D-GS optimization techniques. Experimental results show that GS-TG achieves an average speed-up of 1.71 times over state-of-the-art 3D-GS accelerators.
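The tile-grouping bitmask can be pictured with a small sketch (plain Python; tile sizes, the bounding-box overlap test, and the toy Gaussians are illustrative, not GS-TG's implementation): Gaussians are depth-sorted once per 4x4 group of small tiles, and each Gaussian carries a 16-bit mask of the small tiles it touches, so each small tile rasterizes from the shared sorted list after a cheap bit test.

# Hedged sketch of shared sorting with per-Gaussian small-tile bitmasks.
SMALL, GROUP = 16, 4                      # 16x16-pixel tiles, grouped 4x4

def tile_mask(bbox, group_origin):
    """bbox = (xmin, ymin, xmax, ymax) in pixels; returns a 16-bit tile mask."""
    gx, gy = group_origin
    mask = 0
    for ty in range(GROUP):
        for tx in range(GROUP):
            x0, y0 = gx + tx * SMALL, gy + ty * SMALL
            if not (bbox[2] < x0 or bbox[0] >= x0 + SMALL or
                    bbox[3] < y0 or bbox[1] >= y0 + SMALL):
                mask |= 1 << (ty * GROUP + tx)
    return mask

# Gaussians in one 64x64 group: (depth, bbox); sort once for the whole group.
gaussians = [(2.5, (0, 0, 20, 20)), (1.0, (30, 30, 70, 70)), (1.7, (10, 50, 25, 63))]
order = sorted(range(len(gaussians)), key=lambda i: gaussians[i][0])
masks = [tile_mask(g[1], (0, 0)) for g in gaussians]

# Rasterizing small tile (tx, ty) reuses the shared order, filtered by its bit.
tx, ty = 1, 1
bit = 1 << (ty * GROUP + tx)
relevant = [i for i in order if masks[i] & bit]
print("front-to-back Gaussians for tile (1,1):", relevant)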
Research Manuscript


AI
AI3: AI/ML Architecture Design
Description3D Gaussian Splatting (3DGS) has emerged as a promising real-time photorealistic radiance field rendering technique. Existing GPU and hardware accelerators face limitations due to insufficient parallelism in sequential rendering pipeline stages and the memory overhead associated with interim results. This paper presents GSAcc, a hardware accelerator co-designed with dataflow to render compressed 3DGS models on edge platforms efficiently. GSAcc enhances 3DGS rendering performance through several key innovations. First, it introduces Gaussian depth speculation, parallelizing preprocessing and sorting tasks. Second, GSAcc adopts a Gaussian-centric dataflow that interleaves preprocessing and rasterization, allowing all rendering steps to execute concurrently without storing intermediate results. Finally, it employs dedicated hardware acceleration to address sorting and rasterization bottlenecks within the optimized dataflow. We implemented and synthesized GSAcc using Intel16 PDK and evaluated its performance on real-world 3DGS scenes. Compared with desktop GPUs, GSAcc achieves up to 1.66x10^4x Power-Performance-Area (PPA) improvement as well as 48.7x energy savings. Additionally, GSAcc outperforms the state-of-the-art hardware accelerator GSCore with up to 2.3x PPA improvement and 2.9x energy savings.
Networking
Work-in-Progress Poster


DescriptionVarious graph neural network (GNN) models have emerged and outperformed previous methods. However, their implementation on general-purpose processors is inefficient because GNNs exhibit both sparse graph computation and dense neural network computation patterns. Previous customized GNN accelerators suffer from insufficient programmability, supporting only a limited set of models based on sparse matrix multiplication. Moreover, they fail to account for the different sparsity modes and data volumes across GNN tasks, leading to suboptimal resource utilization and computational efficiency. To address these challenges, this paper proposes GTA, a novel instruction-driven Graph Tensor Accelerator that supports general GNNs efficiently, together with general compiler optimization rules for optimal dataflow that leverage multi-mode sparsity. The key innovations include: 1) a message-passing-based instruction set architecture to describe general GNNs; 2) a graph tensor compiler with message-passing operator fusion and tiling, reducing intermediate data transfers; 3) a hardware architecture for the ISA in which the flexible computing array supports multiple sparsity modes to improve efficiency; and 4) an adaptive buffer management unit to fully utilize resources. Experiments show that GTA can execute various GNN models, achieving 1.8-24.6× speedup and 1.3-16.0× higher energy efficiency compared to existing accelerators.
Research Manuscript


EDA
EDA3: Timing Analysis and Optimization
DescriptionAs technology nodes shrink, static timing analysis (STA) must balance accuracy and efficiency to ensure circuit functionality. Graph-based analysis (GBA) is fast but pessimistic, while path-based analysis (PBA) offers higher accuracy with an expensive runtime cost. However, GBA and PBA rely on lookup table (LUT)-based standard cell libraries, introducing accuracy losses compared to accurate SPICE simulations at advanced technology nodes. This work presents GTN-Path, an efficient post-layout path timing prediction method based on waveform propagation and graph transformer network (GTN). GTN-Path captures structural information to accurately predict waveforms by modeling standard cells and interconnects as graphs. Compared to HSPICE simulations, GTN-Path predicts waveforms with 2.98% error and delay with 2.96% error, achieving a speedup of 3510x. Additionally, compared with the sign-off STA tool, the GTN-Path achieves a speedup of 12x.
Networking
Work-in-Progress Poster


DescriptionWe need proper guard rings to protect devices, especially at advanced nodes. Since guard rings significantly increase the overall layout area, this work introduces a novel analog circuit placement methodology that enables guard ring sharing with a transistor-array layout style. First, we construct a hierarchical module clustering tree based on layout constraints and then create a guard-ring-aware transistor placement. Next, our stochastic method applies diffusion sharing (via continuous OD) and transition-dummy insertion based on Euler path traversal to minimize the total area while satisfying symmetry, proximity, and matching constraints. Experimental results show that after applying diffusion sharing and transition-dummy insertion, the area can be reduced by 63.9%. Compared to a baseline method without guard-ring-aware placement, the proposed approach reduces the total area by up to 49.8% on several designs in a commercial 16nm process. Layouts with reduced area show almost no impact on overall performance.
Research Manuscript


Security
SEC1: AI/ML Security/Privacy
DescriptionDeploying deep neural networks (DNNs) on conventional digital edge devices faces significant challenges due to high energy consumption. A promising solution is the processing-in-memory (PIM) architecture with resistive random-access memory (RRAM), but RRAM-based systems suffer from imprecise weights due to programming stochasticity and cannot effectively utilize conventional weight encryption/decryption intellectual property (IP) protection schemes. To address these issues, we propose a novel software-hardware co-design.
On the hardware side, we introduce 3T2R cells to achieve reliable multiply-accumulate (MAC) operations and use reconfigurable inverter operating voltages to encode keys for encrypting DNNs on RRAM. On the software side, we implement a contrastive training method that ensures high model accuracy on authorized chips while degrading performance on unauthorized ones. This approach protects DNN IP with minimal hardware overhead while significantly mitigating the effects of RRAM programming stochasticity.
Extensive experiments on tasks such as image classification (using MLP, ResNet, and ViT), segmentation (using SegFormer), and image generation (using DiT) validate the effectiveness of our method. The proposed contrastive training ensures negligible performance degradation on authorized chips, while performance on unauthorized chips drops to random guessing or generation. Compared to traditional RRAM accelerators, the 3T2R-based accelerator achieves a 1.41× reduction in area overhead and a 2.28× reduction in energy consumption.
Engineering Poster


DescriptionIn the realm of power optimization, focusing significant effort on power enhancement for key blocks is essential as blocks follow the Pareto Rule. Traditional unidirectional PD flows often fall short, potentially trapping designers in local minima. However, leveraging designers' insights on these key blocks can significantly improve power Quality of Results (QoR) if, at every stage, PD designers analyze results to facilitate better trade-offs earlier in the flow. This presentation will discuss some approaches for guiding initial phases of the flow based on analysis of opportunities available in later stages of the flow and results thereof.
Engineering Presentation


AI
Back-End Design
Chiplet
DescriptionOptimizing power grids for modern high-performance chips is crucial for both performance and reliability, particularly with the increasing complexity of advanced technology nodes. While denser grids are ideal for managing voltage drops, they often necessitate additional routing tracks, creating layout space constraints and potential timing issues. Memory convergence is particularly critical, given its sensitivity to timing, DRC, and IR. The physical implementation of modern high-performance designs requires numerous time-consuming iterations involving PDN design, IR/timing analysis, floorplanning, and placement. Accurate identification of the correct switching scenario is vital to prevent over-designing the power grid specification. Utilizing VCD as a reliable source of scenarios for DvD simulations is common, but these simulations can be lengthy (around 1 ms to 100 ms), requiring weeks to complete a single iteration. Analyzing the VCD over a short duration around the peak power window can be optimistic for memories, since the worst memory scenarios can be missed. In this study, we developed a method to profile multiple long vectors to guide a vectorless engine, allowing us to mimic worst-case memory scenarios. This approach reduces the pessimism found in regular vectorless analysis (which typically activates memories with a full 100% toggle rate), while offering significant runtime improvements compared to full-length VCD-based simulations and ensuring 100% switching coverage.
Research Manuscript


EDA
EDA2: Design Verification and Validation
DescriptionSubcircuit matching is widely applied in logic synthesis, design verification, hardware security, etc. Previous works employ redundant circuit representations coupled with time-consuming enumerative search methods. Subsequent works use a hybrid "approximate filtering – exact verification" framework, but the numerous false negatives predicted by the graph neural network (GNN)-based filtering lead to severe matching failures. In this paper, an improved hybrid method named H3Match is proposed to achieve a better tradeoff between runtime, accuracy, and false negative rate. First, we model the circuits as hypergraphs to fully capture the topology and construct diverse heterogeneous hyperedge features to facilitate the learning of circuit topologies. Second, to reduce false negatives, we reformulate the subgraph matching problem as matching directed acyclic graphs (DAGs) with embedded circular structure information and develop a directed GNN-based approximate matching approach to identify potential matching subcircuits. Finally, we propose a general mixed integer nonlinear programming (MINLP) formulation for exact verification, with convergence accelerated by extracting the initial solution from the approximate matching results. Experimental results show that our approximate method outperforms state-of-the-art (SOTA) methods by 4.31% in accuracy while achieving virtually zero false negatives. Our exact verification is on average 4.16× faster than SOTA exact methods. Overall, the end-to-end flow achieves a 7.08× speedup compared to existing approaches.
Networking
Work-in-Progress Poster


DescriptionHardware security verification in modern electronic systems has been identified as a significant bottleneck due to increasing complexity and stringent time-to-market constraints. Assertion-Based Verification (ABV) is a recognized solution to this challenge; however, traditional assertion generation relies on engineers' expertise and manual effort. Formal verification and assertion generation methods are further limited by modeling complexity and a low tolerance for variations. While Large Language Models (LLMs) have emerged as promising automated tools, existing LLM-based approaches often depend on complex prompt engineering, requiring experienced labor to construct and validate prompts. The challenge also lies in identifying effective methods for constructing synthetic training datasets that enhance LLM quality while minimizing token biases. To solve these issues, we introduce HADA (Hardware Assertion through Data Augmentation), a novel framework that fine-tunes a general-purpose LLM by leveraging its ability to integrate knowledge from multiple data sources. We combine assertions generated through formal verification, hardware security knowledge from datasets like CWE, and version control data from hardware design iterations to construct a comprehensive hardware security assertion dataset. Our results demonstrate that integrating multi-source data significantly enhances the effectiveness of hardware security verification, with each addressing the limitations of the others.
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
DescriptionThe increasing complexity of integrated circuit design requires customizing Power, Performance, and Area (PPA) metrics according to different application demands. However, most engineers cannot anticipate requirements early in the design process, often discovering mismatches only after synthesis, necessitating iterative optimization or redesign. Some works have shown the promising capabilities of large language models (LLMs) in hardware design generation tasks, but they fail to tackle the PPA trade-off problem. In this work, we propose an LLM-based reinforcement learning framework, PPA-RTL, aiming to introduce LLMs as a cutting-edge automation tool by directly incorporating post-synthesis metrics PPA into the hardware design generation phase. We design PPA metrics as reward feedback to guide the model in producing designs aligned with specific optimization objectives across various scenarios. The experimental results demonstrate that PPA-RTL models, optimized for Power, Performance, Area, or their various combinations, significantly improve in achieving the desired trade-offs, making PPA-RTL applicable to a variety of application scenarios and project constraints.
Research Manuscript


Design
DES6: Quantum Computing
DescriptionDistributed quantum computing (DQC) offers a promising pathway for scaling up quantum computing. Entanglement is indispensable for implementing non-local operations in DQC, especially the teleportation of quantum states and gates. Practical remote entanglement generation is probabilistic, whose duration is not only longer than local operations but also nondeterministic. Therefore, the optimization of DQC architectures with probabilistic remote entanglement generation is critically important. In this paper, we study a new DQC architecture which combines (1) asynchronously attempted entanglement generation, (2) buffering of successfully generated entanglement, and (3) adaptive scheduling of remote gates based on entanglement generation pattern. Our hardware-software co-design improves both runtime and output fidelity for realistic DQC.
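A toy simulation of the three ingredients (plain Python; the success probability, buffer capacity, and gate sequence are made-up parameters, not a model of any real hardware, and gate reordering is not modeled) shows how asynchronously generated, buffered entanglement lets remote gates proceed without waiting, compared with generating pairs only on demand.

# Hedged toy: asynchronous entanglement attempts, an EPR buffer, and scheduling.
import random

def simulate(program, p_success=0.3, buffer_cap=2, eager=True, seed=7):
    """program: sequence of 'L' (local gate) and 'R' (remote gate) in order."""
    random.seed(seed)
    epr_buffer, t, pc = 0, 0, 0
    while pc < len(program):
        t += 1
        waiting = program[pc] == "R"
        # Attempt entanglement every cycle if eager (asynchronous generation),
        # or only when a remote gate is actually waiting (no buffering ahead).
        if (eager or waiting) and epr_buffer < buffer_cap and random.random() < p_success:
            epr_buffer += 1
        if program[pc] == "L":            # local gate: always ready
            pc += 1
        elif epr_buffer > 0:              # remote gate: consumes one buffered pair
            epr_buffer -= 1
            pc += 1
        # otherwise: stall this cycle waiting for entanglement

    return t

program = ["L", "L", "R", "L", "L", "R", "L", "R", "L", "L", "R", "R"]
print("cycles, on-demand generation:", simulate(program, eager=False, buffer_cap=1))
print("cycles, buffered + async    :", simulate(program, eager=True, buffer_cap=2))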
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionVideo Generation Models based on 3D full attention (3D-VGMs) have significantly enhanced video quality. However, their inference overhead remains substantial, primarily due to the high computational cost of the attention mechanism, which accounts for over 75% of computations. Inspired by the success of conventional video processing, where video compression exploits similarities among patches, we point out that the attention mechanism can also harness the benefits from similarities among tokens. Nonetheless, two critical problems arise: (1) How can similarities be efficiently acquired in real-time? (2) How can workload balance be maintained when similar tokens are randomly distributed?
To address these problems and leverage similarities for 3D-VGMs, we propose SIMPICKER, a comprehensive attention-aware algorithm-hardware co-design for 3D-VGMs. Our core methodology is to fully utilize similarities in attention through both coarse-grained and fine-grained approaches while adopting dynamic adaptive strategies to leverage them. From the algorithm perspective, we propose a speculation-based similarity exploitation algorithm, allowing real-time importance speculation on the frame level, which is coarse-grained, and token level, which is fine-grained. From the micro-architecture perspective, we propose a buffered lookup table-based (LUT-based) multiplication architecture for FP-INT multiplication and further eliminate potential bank conflicts to accelerate unimportant attention computation. From the mapping perspective, we propose an adaptive grouping strategy in speculation to tame workload imbalance caused by randomly distributed similar tokens and allow seamless integration of our algorithms. Extensive experiments show that SIMPICKER achieves an average of 5.21×, 1.45× speedup and 17.92×, 1.63× energy efficiency compared to the NVIDIA A100 GPU and the state-of-the-art accelerators.
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
DescriptionWith the advancement of high-speed and energy-efficient optical interconnect and computation, photonic integrated circuits (PICs) have become a promising alternative to traditional CMOS circuits. A PIC can be synthesized by mapping the binary decision diagram (BDD) of target functions to optical switches and combiners. However, excessive signal attenuation along the light propagation may require extra optical-electrical signal conversion, thus introducing unwanted delays. In this paper, we aim to overcome this deficiency during logic synthesis: First, we optimize the signal efficiency factor by applying the concept of harmonic means to optimize DC combiners. Second, we eliminate redundant combiners by integer partition. Moreover, we properly arrange these proposed techniques in an optimal sequence of operations to form our main framework. Experimental results show that our framework outperforms the state of the art in terms of efficiency factor.
Engineering Poster
Networking


DescriptionTiming OCV margin for mixed-signal IP blocks has been a single derate value during timing closure and signoff. This derate value must cover all critical paths, resulting in significant over-margin for the vast majority of the timing paths with much smaller OCV requirements, which in turn results in longer timing closure cycle and increased area and power. Liberty LVF has been widely adopted in standard cell design, but it has not crossed over into the modeling of large mixed-signal IP blocks due to the prohibitive Monte Carlo runtime of these large and complex simulations. We have collaborated with Empyrean to deploy their critical path based LVF characterization tool to generate LVF models for selected library arcs that are automatically merged into our in-house nominal Liberty files. This new characterization and timing methodology has been proven on the HBM design, and it resulted in expected and significant reduction in timing OCV margin. Success from this new methodology is being proliferated to other mixed-signal design IPs.
Networking
Work-in-Progress Poster


DescriptionLarge Language Models (LLMs) have gained popularity over the past two years, driven by their high performance achieved through rapid increases in the number of parameters. The inference process of LLMs consists of two distinct stages: prefill and decode, each with unique computational characteristics. While existing neural network inference platforms, such as Google's TPU, perform well during the prefill stage, they often suffer from poor resource utilization during the decode stage. To address this challenge, we propose the scalable Headtile architecture, specifically designed to improve hardware resource utilization. By analyzing the inference behavior of LLMs, we examine how each layer executes on TPUv3 and introduce the Maarg paradigm in Headtile for inter-layer scheduling and mapping. Experimental results show that Headtile can achieve up to 24× higher throughput in the decode stage compared to TPUv3. In addition, the Maarg paradigm reduces memory accesses by up to 60% during the prefill stage.
Engineering Poster
Networking


DescriptionCHAIR note: This presenter has requested not to present in person but instead to present a poster.
---
The increasing demand for high-performance and power-efficient systems presents critical challenges in optimizing power integrity (PI) and thermal integrity (TI) within advanced packages. The trade-off between PI and TI is exacerbated in multi-core xPU architectures, where thermal coupling and power domain constraints limit traditional engineer-driven methods. To address these issues, we propose HeatSync, a reinforcement learning (RL)-based framework for simultaneous optimization of PI and TI.
HeatSync employs an efficient thermal resistance matrix-based evaluator to replace traditional computational fluid dynamics tools, achieving over 500× faster computation with less than 2% error. The RL framework can optimize floorplan configurations and types of decoupling capacitors and their placement, balancing PI and TI objectives through user-defined weighting factors. In a simulation environment with 5536 floorplans and four decoupling capacitor configurations, the proposed method demonstrated up to 8.3% improvement in overall performance figure of merit. By minimizing maximum temperature, hot spot areas, and impedance, HeatSync alleviates the trade-offs between PI and TI, providing scalable and time-efficient optimization.
This work provides a foundation for PI and TI optimization in next-generation semiconductor designs. Future work will focus on a two-way coupled RL framework for detailed PI-TI analysis and extending to 3D packages to address nonlinear thermal behaviors.
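As a minimal sketch of the kind of evaluator described above (the matrix values and the exact formulation are assumptions, not the paper's), a thermal-resistance-matrix model approximates per-block temperature rise as a matrix-vector product, which is why it runs orders of magnitude faster than CFD:

import numpy as np

# R[i][j]: temperature rise at block i per watt dissipated in block j (K/W).
R = np.array([[2.0, 0.4, 0.1],
              [0.4, 1.8, 0.3],
              [0.1, 0.3, 2.2]])
P = np.array([5.0, 3.0, 1.5])     # W per block for one floorplan candidate
T_ambient = 45.0                  # deg C

T = T_ambient + R @ P             # steady-state estimate per block, deg C
print(T.max())                    # candidate's max temperature, usable in an RL reward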
Networking
Work-in-Progress Poster


DescriptionApproximate multipliers have been used in deep learning and AI models to enhance power efficiency while maintaining acceptable performance. However, a key drawback of this technique is the computational errors that can negatively affect the model accuracy. In this study, we propose a novel method for using approximate multipliers in deep learning models that not only improves power efficiency but also enhances accuracy. Our proposed method involves a layer-wise heterogeneous approach for quantized approximate multipliers (INT8) in Deep Neural Network (DNN) models (Convolutional Neural Network (CNN), Multilayer Perceptron (MLP), VGG11, and VGG13) with different levels of approximation across layers. Thanks to our method, we achieve average energy savings of approximately 70% for VGG11 and 67% for VGG13, while surpassing trained Top-1 accuracy by up to 2.07% in all case studies. Delay improvements were also notable, with VGG11 achieving up to 37% and VGG13 up to 34%. Our method, which can be considered a novel tuning or retraining approach, reduces the required number of MAC operations during retraining by 146× to 1125× compared to traditional retraining methods, introducing a new level of computational efficiency and power saving, beyond those of inference mentioned above. This study showcases the potential of heterogeneous approximate multipliers and our novel retraining methods to advance efficiency and performance of creating and using new deep learning models, taking a step towards energy-conscious artificial intelligence
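A minimal, purely illustrative sketch of the layer-wise heterogeneous idea (the sensitivity scores, variant names, and relative energies below are invented, not the paper's): layers that tolerate more error are mapped to more aggressive approximate INT8 multiplier variants.

# Accuracy sensitivity per layer (0..1, higher = less error-tolerant), illustrative.
layers = {"conv1": 0.9, "conv2": 0.5, "fc1": 0.2}
variants = [("exact", 1.00), ("approx_lo", 0.55), ("approx_hi", 0.30)]  # (name, relative MAC energy)

def pick_variant(sensitivity: float):
    # Simple threshold policy: the less sensitive the layer, the cheaper the multiplier.
    if sensitivity > 0.8:
        return variants[0]
    if sensitivity > 0.4:
        return variants[1]
    return variants[2]

plan = {name: pick_variant(s) for name, s in layers.items()}
avg_energy = sum(v[1] for v in plan.values()) / len(plan)
print(plan, f"relative MAC energy ~ {avg_energy:.2f}")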
Engineering Presentation


AI
Systems and Software
Chiplet
DescriptionIn recent years, 3D design disaggregation has become instrumental in improving wafer cost, yield, design flexibility, and PPA (power, performance, area). To fully realize the benefits of 3D disaggregation, a heterogeneous 3DIC system is desirable, where each die uses a different process technology whose unique advantages best suit the designs on that die. Historically, the selection of heterogeneous 3D disaggregation design boundaries, or "cutlines", has been determined holistically, which can be tedious and suboptimal, and each die was optimized separately because EDA tools generally do not support multiple process technologies during optimization. This typically results in multiple trials of cutline definition before a satisfactory heterogeneous 3DIC system is obtained. A better method is therefore needed to concurrently and automatically optimize the entire heterogeneous 3DIC design across multiple process technologies. In this work, we demonstrate an automated method to optimize heterogeneous 3DIC design PPA using Cadence Cerebrus to perform machine learning based design space exploration. With this methodology, concurrent optimization with multiple process technologies has been achieved, and hundreds of 3DIC cutline experiments can be performed automatically and simultaneously, greatly reducing the time and effort needed to find an optimized 3DIC cutline configuration. This Cerebrus-based methodology also considers all critical QOR metrics during optimization, such as areal density, macro placements, bump assignment, timing closure, power consumption, IR drop, and thermal dissipation. With this methodology, we are able to find highly optimized heterogeneous 3DIC designs with great efficiency and ease.
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionSingular value decomposition (SVD) is a matrix factorization technique widely used in signal processing, recommendation systems, and other domains. In general, the time complexity of SVD algorithms is cubic in the problem size, making it difficult for SVD algorithms to meet stringent real-time performance requirements. However, existing FPGA and GPU solutions fall short of jointly optimizing latency, throughput, and power consumption. To address this issue, this paper proposes HeteroSVD, a heterogeneous reconfigurable accelerator for SVD computation on the Versal ACAP platform. HeteroSVD introduces a system-level SVD decomposition mechanism and proposes an algorithm-hardware co-design method to jointly optimize SVD ordering and AI engine (AIE)-centric dataflow and placement with Versal. Furthermore, in order to improve the quality of results (QoR) and facilitate micro-architecture selection, we introduce an automatic optimization framework that performs accurate performance modeling and fast design space exploration. Experimental results demonstrate that HeteroSVD reduces the latency by 1.98× over existing FPGA accelerators and outperforms GPU solutions with an improvement of up to 7.22× in latency, 1.77× in throughput, and 13.18× in energy efficiency.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionProcessing-in-Memory (PIM) architectures offer promising solutions for efficiently handling AI applications in energy-constrained edge environments. While traditional PIM designs enhance performance and energy efficiency by reducing data movement between memory and processing units, they are limited in edge devices due to continuous power demands and the storage requirements of large neural network weights in SRAM and DRAM. Hybrid PIM architectures, incorporating non-volatile memories like MRAM and ReRAM, mitigate these limitations but struggle with a mismatch between fixed computing resources and dynamically changing inference workloads. To address these challenges, this study introduces a Heterogeneous-Hybrid PIM (HH-PIM) architecture, comprising high-performance MRAM-SRAM PIM modules and low-power MRAM-SRAM PIM modules. We further propose a data placement optimization algorithm that dynamically allocates data based on computational demand, maximizing energy efficiency. FPGA prototyping and power simulations with processors featuring HH-PIM and other PIM types demonstrate that the proposed HH-PIM achieves up to 60.43% average energy savings over conventional PIMs while meeting application latency requirements. These results confirm HH-PIM's suitability for adaptive, energy-efficient AI processing in edge devices.
Engineering Poster
Networking


DescriptionWhen integrating high voltage IOs into designs, areas around the IO driver have an increased exposure to latchup effects, due to higher concentrations of charge carriers injected into the chip by the IO. These exposed areas have requirements for denser welltap cell coverage to sink the additional carriers, and other macros that are victims to these latchup aggressors must be kept away. As chip content continues to grow, physical integration requires additional levels of hierarchy and these rules become more complicated, as they must be tracked across hierarchical boundaries. Latchup rules are checked as part of signoff DRC, but due to the need to cleanly assemble the full chip, it is often not possible to perform these checks until late in design phases. This presentation details an automated flow for pushing latchup shapes in a parent context down into child block context, considering multiple levels of hierarchy, re-use, orientations, as well as enhanced welltap insertion, LEF generation with latchup victim shapes, and hierarchical checking of latchup aggressor to victim interactions. This process is integrated into the PnR tool flow so that it can be seamlessly run directly as part of floorplanning to catch latchup related placement issues much earlier.
Engineering Poster
Networking


DescriptionAs technology nodes shrink and chips scale up in size and PDN node count, IR-drop simulation faces challenges in ensuring high coverage while optimizing infrastructure usage. IR-drop analysis at the full-chip level for large SoCs takes days of runtime even with a high number of cores. Meanwhile, the development of 2.5DIC and 3DIC technologies has introduced further complexities. We therefore require methodologies that enable concurrent IR-drop analysis of die, interposer, and package while optimizing runtime and computational resources.
In this work, we present a hierarchical IR-drop analysis for a 2.5DIC structure comprised of a compute die on an organic interposer. In this approach, we deployed Reduced Order Model (ROM) methodology which creates an abstract view of hierarchical blocks and consumes these models at top-level to run full-chip IR analysis. We simulated the Die+Interposer+Package structure by using only ROM for each block of the die (100% ROM) and instantiating them at the die top-level. Using this solution, design node count was reduced significantly (98.9%) compared to full-chip flat analysis, improving analysis runtime by 86%, and reducing total physical machine memory by 89%. The proposed methodology provides faster turnaround time and lower infrastructure cost addressing the demands of modern large SoC designs.
Research Manuscript


Design
DES2A: In-memory and Near-memory Computing Circuits
DescriptionThis article proposes a multi-bit in-memory computing array using energy/area-efficient (1146 TOPS/W, 27 TOPS/mm2) in-memory ADC (IMADC) with Cascoded bit-cell. It increases throughput (1.9X) and linearity (23X) compared to input pulse-width modulation by using bit-slicing (BS) with charge-sharing based analog accumulator. Compared to conventional BS with digital accumulation after ADC, this method has better energy-efficiency (1.7X) and throughput (6.6X). Our IMADC is robust to temperature variation and area overhead is merely 3%. This approach achieves accuracy of 97.1% for the MLP (3/2/3b) on MNIST, 91.9% for the VGG-8 (4/2/4b) on CIFAR-10 and 83.8% for GAT (7/3/7b) on Cora.
Engineering Presentation


IP
DescriptionDecimation filter cores running at high frequencies contribute significantly to subsystem area and dynamic power in multi-rate signal processing. Arithmetic computation logic and sequential registers are the primary contributors. Contemporary low-area and low-power techniques for digital filter optimization are customized on a case-by-case basis. There is a need for a universal architecture that significantly improves the figure of merit of all contemporary polyphase decimation digital filters. We propose a high-performance, low-area, low-power architecture with no trade-offs, highly optimized for multi-mode filter chain design and reuse across modes. A highly efficient coefficient compression method using n-th order differentiation is illustrated in a modulo CIC and FIR filter combo implementation. Techniques to reuse Modulo MAC (MMAC) computations across variable signal rates and coefficients are demonstrated.
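A minimal sketch, assuming the compression works along the lines of storing n-th order coefficient differences and reconstructing them with running accumulators (the paper's modulo arithmetic and word widths are not reproduced here):

import numpy as np

def nth_diff(coeffs, n):
    # Forward n-th order differencing; prepend=0 keeps the transform reversible.
    d = np.asarray(coeffs, dtype=np.int64)
    for _ in range(n):
        d = np.diff(d, prepend=0)
    return d

def reconstruct(diffs, n):
    # n cascaded accumulators (cumulative sums) undo the differencing.
    c = np.asarray(diffs, dtype=np.int64)
    for _ in range(n):
        c = np.cumsum(c)
    return c

coeffs = np.array([120, 118, 113, 105, 94, 80], dtype=np.int64)   # illustrative FIR taps
d2 = nth_diff(coeffs, 2)
assert np.array_equal(reconstruct(d2, 2), coeffs)
print(d2)   # second-order differences are small in magnitude, so they need fewer storage bits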
Engineering Presentation


AI
Systems and Software
Chiplet
DescriptionThe demands of modern workloads and the latest innovations have brought 2.5D/3D stacking and advanced IC packaging technologies to the forefront. The requirements for integrating multiple chips, components, and materials to create an advanced IC package are becoming increasingly complicated and introduce new challenges to existing extraction and analysis methods. Fast and accurate extraction and analysis methods are increasingly critical for complex advanced IC package design. We developed a hybrid computational electromagnetic framework, utilizing different basic electromagnetic methods for modeling different parts of the package and leveraging machine learning models that greatly simplify the extraction complexity, to extract advanced IC packages efficiently and accurately. It is capable of extracting an entire IC package within reasonably short time frames. It can generate various netlist models with standard die-to-package and board mapping headers for system-level analysis and validation. With high efficiency and reliability, the developed solver helps designers meet the compressed schedules of IC package design.
Engineering Poster
Networking


DescriptionAnsys PathFinder-SC is the next-generation SoC and analog mixed-signal design ESD reliability analysis platform designed to enable sub-16nm design success. This presentation shows how the PathFinder-SC flow verifies and signs off full-chip ESD checks for highly integrated multi-power/ground-domain SoCs. The flow provides a high-capacity solution for verifying the on-chip circuitry that protects production chips from electrostatic discharge (ESD), computing effective P2P resistance along ESD zap paths and performing current-density analysis on power/ground as well as signal nets, which is challenging at the scale of large SoC designs. This technology has become increasingly pivotal as silicon technology continues to shrink to 3nm and below, where these tiny transistors need to be protected by critical ESD circuitry that is checked, verified, and signed off with PathFinder-SC. PathFinder-SC is built on Ansys SeaScape, the world's first custom-designed big data platform for electronic system design and simulation. SeaScape provides per-core scalability, flexible design data access, instantaneous design bring-up, and many other revolutionary capabilities. SeaScape technology allows PathFinder-SC to deliver faster turnaround for ultra-large SoCs, which makes it ideal for today's large, high-speed semiconductor designs in cloud computing, artificial intelligence, imaging, networking, and 5G and 6G telecommunications.
Networking
Work-in-Progress Poster


DescriptionDue to noise sensitivity of, and cost of access to, quantum hardware, quantum circuit simulators are critical for experimentation and research. Most publicly available simulators are time-shared through cloud environments, while local simulators running on consumer-grade hardware suffer from poor execution times. Therefore, the acceleration and customization of quantum simulations using dedicated hardware accelerators such as FPGAs can provide faster design iterations, and cost-effective solutions for quantum computing research. FPGA hardware development is challenging due to complex and heterogeneous design flows among FPGA vendors, increasing entry barriers for quantum application development using FPGAs. In this work, we present a high-level framework for quantum circuit simulation on FPGA backends, leading to faster quantum application development and algorithm prototyping. The proposed framework can be integrated with state-of-the-art quantum simulators such as IBM Qiskit. We evaluate the proposed framework using the Alveo-U200 and Alveo-U250 FPGAs as simulation backends, and multidimensional-quantum-convolution and Grover's algorithm as case studies.
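For context on what such a simulation backend computes, here is a minimal NumPy state-vector sketch (not the proposed framework or its FPGA kernels): applying a single-qubit gate to qubit k of an n-qubit register is a batched 2x2 matrix multiply over the state vector.

import numpy as np

def apply_1q_gate(state, gate, k, n):
    psi = state.reshape([2] * n)                 # one tensor axis per qubit
    psi = np.moveaxis(psi, k, 0)                 # bring the target qubit to the front
    psi = np.tensordot(gate, psi, axes=([1], [0]))
    return np.moveaxis(psi, 0, k).reshape(-1)

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)     # Hadamard gate
n = 3
state = np.zeros(2 ** n, dtype=complex)
state[0] = 1.0                                   # |000>
state = apply_1q_gate(state, H, k=0, n=n)
print(np.round(np.abs(state) ** 2, 3))           # probability split evenly over qubit 0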
Research Manuscript


EDA
EDA1: Design Methodologies for System-on-Chip and 3D/2.5D System-in Package
DescriptionThe emergence of new applications in High-Performance Computing is driving the need for more efficient computing machines. As supercomputer architectures become increasingly complex, the combinatorial explosion of design space and time-consuming simulations lead to a challenging design space exploration problem. This work introduces an automated search framework to ease a power-performance-area efficient Arm Neoverse V1 processor design. Based on multi-objective Bayesian optimization, we propose a new exploration algorithm named SEBO by enhancing the three main stages of the optimization. The experimental results show that SEBO can not only compete with the top state-of-the-art baseline algorithms, but also outperform them in terms of the quality and diversity of the returned Pareto-optimal designs.
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionPoint cloud-based 3D sparse convolution networks are widely employed to process voxel features efficiently. However, the irregularity of voxel sparsity poses significant challenges, leading to increased hardware complexity and inefficiencies. We propose an algorithm-hardware co-design for sparse 3D convolution. At the algorithm level, on-the-fly threshold-based voxel skipping is adopted, enhancing efficiency. At the hardware level, a hierarchical 3-stage Voxel Search and Skipping is developed to systematically narrow down the non-zero search space, enhancing both performance and hardware utilization. We implemented the proposed accelerator in a 65 nm process to demonstrate a 77.7% reduction in delay compared to a baseline design without the proposed sparsity adaptations. The proposed system also achieves 1.34× higher energy efficiency and 2.22× higher throughput compared to the state of the art.
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionSparse Triangular Solve (SpTRSV) is a critical level-2 kernel in sparse Basic Linear Algebra Subprograms (BLAS). While Field-Programmable Gate Array (FPGA) accelerators for SpTRSV focus on optimizing individual tiles, they overlook inter-tile parallelism. Designing an inter-tile parallelism accelerator poses challenges, including constructing a fine-grained dependency graph, handling communication overhead, and balancing workloads. HiSpTRSV addresses these challenges through dependency graph parsing, a tile-based highly parallel algorithm, filtering mechanisms, and bidirectional matching with modular indexing. Experimental results show that HiSpTRSV outperforms the state-of-the-art SpTRSV accelerator with a 34.3% performance improvement. HiSpTRSV achieves a 3.58× speedup and 9.59× higher energy efficiency compared to GPUs.
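To make the dependency structure concrete, below is a minimal level-scheduled lower-triangular solve in NumPy (a textbook formulation, not HiSpTRSV's tile-based algorithm): rows in the same dependency level have no mutual dependencies and can be solved in parallel on hardware.

import numpy as np

def sptrsv_levels(L, b):
    n = len(b)
    # Assign each row to a dependency level: row i waits for all rows j < i it depends on.
    level = [0] * n
    for i in range(n):
        deps = [level[j] + 1 for j in range(i) if L[i, j] != 0]
        level[i] = max(deps, default=0)
    x = np.zeros(n)
    for lvl in range(max(level) + 1):
        for i in [r for r in range(n) if level[r] == lvl]:   # independent rows: parallel on hardware
            x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

L = np.array([[2., 0., 0.], [1., 3., 0.], [0., 4., 5.]])
b = np.array([2., 5., 13.])
print(sptrsv_levels(L, b))          # matches np.linalg.solve(L, b)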
Networking
Work-in-Progress Poster


DescriptionRecent advancements in large language models (LLMs) have shown significant potential in automating chip design, particularly in generating Verilog code from high-level specifications. While much of the existing research focuses on end-to-end generation from specifications to code, there has been limited exploration of the linguistic style of module descriptions and the structured representation of intermediate circuit forms. In this paper, we introduce HIVE, a novel framework that improves Verilog code generation and debugging accuracy by integrating semantic style transfer, graph-structured reasoning and compilation error modification with retrieval augmented generation. The HIVE framework leverages GPT-3.5Turbo-FT and GPT-4Turbo, achieving state-of-the-art performance on both the VerilogEvalv1 and Thakur's benchmarks. Notably, HIVE-GPT3.5Turbo-FT eliminates syntax errors across all test cases in Thakur's benchmark. Furthermore, HIVE-GPT4Turbo achieves a pass rate of 83.3% on the more challenging VerilogEvalv2 benchmark, demonstrating its robustness in generating high-quality Verilog code.
Research Manuscript


Systems
SYS5: Embedded Memory and Storage Systems
DescriptionThe victim cache was originally designed as a secondary cache to handle misses in the L1 data (L1D) cache in CPUs. However, this design is often sub-optimal for GPUs. Accessing the high-latency L1D cache and its victim cache can lead to significant latency overhead, severely degrading the performance of certain applications. We introduce HIVE, a high-priority victim cache designed to accelerate GPU memory accesses. HIVE handles memory requests first, before they reach the L1D cache. Our experimental results show that HIVE achieves an average performance improvement of 77.1% and 21.7% compared to the baseline and the state-of-the-art architecture, respectively.
Networking
Work-in-Progress Poster


DescriptionDesign space exploration (DSE) in high-level synthesis (HLS) aims at obtaining optimal combinations among a vast set of directive configurations to generate high-quality circuit designs. Due to challenges such as high-dimensional optimization and limited data, existing DSE methods perform poorly in handling complex configuration interactions. This work proposes a DSE framework, HLSRanker, based on preference Bayesian optimization (PBO), which utilizes pairwise comparisons between directive combinations to quickly and effectively explore the Pareto fronts. Firstly, a winner model based on graph neural networks (GNNs) is constructed to determine the winner between a pair of directive combinations. Importantly, a novel pairwise comparison-based PBO exploration engine has been proposed for sampling potentially better configurations. Experimental results show that our framework can explore higher-quality Pareto-optimal designs in a shorter runtime compared to state-of-the-art (SOTA) DSE methods.
Research Manuscript


Security
SEC4: Embedded and Cross-Layer Security
DescriptionThis paper introduces HoBBy, a compiler-based tool that hardens unbalanced branches at the instruction level, making parallel control flows indistinguishable to state-of-the-art attacks that bypass the source-code level balancing. To achieve this, we propose a single-step analysis method to identify unbalanced instructions in secret-dependent branches, and implement instruction shadowing, cogging, and spiraling techniques to protect them. We evaluate HoBBy by hardening secret-dependent branches in four real-world applications and validating its resilience against three state-of-the-art attacks targeting Intel SGX and AMD SEV. HoBBy achieves a runtime overhead of 2.8% for cryptographic libraries and a binary size overhead of 0.6%.
Research Manuscript


Design
DES3: Emerging Models of Computation
DescriptionClassification tasks on ultra-lightweight devices demand solutions that fit resource-constrained hardware and deliver swift responses. Binary Vector Symbolic Architecture (VSA) is a promising approach due to its minimal memory requirements and fast execution times compared to traditional machine learning (ML) methods. Nonetheless, binary VSA's practicality is limited by its inferior inference performance and a design that prioritizes algorithmic over hardware optimization. This paper introduces UniVSA, a co-optimized binary VSA framework for both algorithm and hardware. UniVSA not only significantly enhances inference accuracy beyond current state-of-the-art binary VSA models but also reduces memory footprints. It incorporates novel, lightweight modules and a design flow tailored for optimal hardware performance. Experimental results show that UniVSA surpasses traditional ML methods in terms of performance on resource-limited devices, achieving smaller memory usage, lower latency, reduced resource demand, and decreased power consumption.
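For readers unfamiliar with binary VSA, the sketch below shows the standard primitives such frameworks build on (binding as XOR, bundling as a bitwise majority vote, similarity as Hamming distance). It is generic background, not UniVSA's specific modules.

import numpy as np

rng = np.random.default_rng(0)
D = 4096                                          # hypervector dimensionality

def random_hv():
    return rng.integers(0, 2, D, dtype=np.uint8)

def bind(a, b):                                   # XOR binding is its own inverse
    return a ^ b

def bundle(hvs):                                  # bitwise majority vote
    return (np.sum(hvs, axis=0) > len(hvs) / 2).astype(np.uint8)

def hamming(a, b):
    return int(np.count_nonzero(a != b))

key, value = random_hv(), random_hv()
record = bind(key, value)
print(hamming(bind(record, key), value))          # 0: unbinding with the key recovers the value

keys = [random_hv() for _ in range(3)]
vals = [random_hv() for _ in range(3)]
memory = bundle([bind(k, v) for k, v in zip(keys, vals)])
print(hamming(bind(memory, keys[0]), vals[0]) / D)  # well below 0.5, so the stored value is still recognizable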
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionNetwork-on-Chip (NoC) accelerators with heterogeneous Processing-in-Memory (PIM) cores achieve superior performance compared to homogeneous ones for neural networks. Dedicated simulators and architecture search frameworks are pivotal for obtaining performance, power, and area (PPA) metrics, as well as guiding the design process. However, existing simulators are primarily designed for homogeneous NoC and lack support for simulating heterogeneous PIM-based NoC architectures. Besides, current search frameworks for heterogeneous NoC architectures only focus on workload allocation and mapping strategies, failing to explore heterogeneous PIM configurations in a larger design space. In this work, we propose HPIM-NoC, a joint simulation and search framework for heterogeneous PIM-based NoC architectures. HPIM-NoC not only supports the simulation of heterogeneous PIM cores, but also provides more accurate latency results by introducing NoC transmission delays and pipelines in co-simulation. HPIM-NoC implements a three-stage heterogeneous search process based on prior knowledge and employs a specific simulated annealing algorithm tailored for heterogeneous architecture search. The search process is accelerated by precomputing core PPA metrics and reducing NoC simulation frequency. In addition, the framework integrates a customized layout algorithm to optimize the placement of heterogeneous NoC, minimizing communication latency and overall area. Experimental results on various neural networks demonstrate that HPIM-NoC can quickly find near-optimal configurations within a limited time. The proposed acceleration method reduces the search time of HPIM-NoC by 2.12×, 2.17×, and 2.96×, respectively. Compared to homogeneous architectures, the Fusions of Metrics (FoMs) of heterogeneous PIM-based NoC architectures found by HPIM-NoC are reduced by 1.18%, 16.94%, and 37.41% for ResNet-18 under three settings, respectively.
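As a generic reference for the search procedure mentioned above (the objective, neighbourhood move, and cooling schedule below are illustrative placeholders, not HPIM-NoC's tailored version), a simulated annealing loop looks like this:

import math
import random

random.seed(0)

def anneal(init, neighbor, cost, T0=1.0, alpha=0.95, steps=2000):
    cur, cur_cost = init, cost(init)
    best, best_cost, T = cur, cur_cost, T0
    for _ in range(steps):
        cand = neighbor(cur)
        cand_cost = cost(cand)
        delta = cand_cost - cur_cost
        # Always accept improvements; accept worse moves with a temperature-dependent probability.
        if delta < 0 or random.random() < math.exp(-delta / T):
            cur, cur_cost = cand, cand_cost
            if cur_cost < best_cost:
                best, best_cost = cur, cur_cost
        T *= alpha                                   # geometric cooling
    return best, best_cost

# Toy search: pick a "PIM core type" (0..3) per tile to minimize an invented figure of merit.
cost = lambda cfg: sum((c - 1.5) ** 2 for c in cfg) + 0.1 * len(set(cfg))
neighbor = lambda cfg: [c if random.random() > 0.2 else random.randint(0, 3) for c in cfg]
print(anneal([0] * 8, neighbor, cost))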
Engineering Poster
Networking


DescriptionWith the growing complexity of integrated circuits, traditional flat STA approaches struggle to manage the scale and granularity required for timing verification. HSTAF (Hierarchical Static Timing Analysis Flow) addresses this challenge by partitioning the design into manageable blocks or modules, enabling a focused and modular timing analysis. HSTAF has emerged as a critical methodology in achieving efficient and precise timing closure in modern chip designs.
Hierarchical Static Timing Analysis Flow (HSTAF) technique leverages block-level abstractions, interface timing models (ITMs), and accurate timing budgeting to ensure that each block meets its timing constraints independently while seamlessly integrating into the full-chip timing architecture. By identifying bottlenecks early in the design process and enabling faster iteration cycles, Hierarchical STA significantly reduces the time and computational resources required for timing signoff.
This presentation explains how this HSTAF streamlines timing closure by offering scalability, reusability of timing models, and robust analysis of inter-block dependencies. It also highlights best practices, challenges, and the key role of hierarchical analysis in ensuring first-pass success for tape-out, ultimately contributing to more efficient design flow.
Engineering Presentation


IP
DescriptionIndustry is moving towards crystal-less system-on-chip (SoC) designs to save cost and board area.
With this move, the low-frequency crystal oscillator (LFXT) is replaced by an on-chip oscillator (LFOSC). However, the LFOSC is poorer in terms of jitter and Random Telegraph Noise (RTN) performance.
Moreover, LFOSC frequency characteristics such as period jitter and frequency standard deviation can change significantly with the oscillator architecture, process node, or other oscillator peripheral circuitry.
Uncharacterized oscillator behavior makes the design of frequency-sensitive digital logic, such as the real-time clock (RTC), very challenging, especially in an ultra-low-power (ULP) MCU.
To keep the RTC running, we measure the time period of the low-frequency clock (~32 kHz) using a high-frequency clock (which is very accurate but consumes high power), typically hundreds of MHz.
However, in a ULP MCU the high-frequency clock is available for less than 1% of the MCU run time, which makes timekeeping challenging.
The Bluetooth Low Energy (BLE) standard demands that over 1 s the time drift by no more than 500 ppm (i.e., 500 us); failing to do so results in a loss of the BLE connection.
This work proposes a digital filter architecture that is agnostic of the analog oscillator architecture, can be adapted to any kind of oscillator noise characteristics, and always meets the 1 s BLE spec of 500 ppm, thus enabling a crystal-less solution for BLE MCUs.
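To put the 500 ppm budget in perspective, the back-of-the-envelope sketch below uses illustrative clock frequencies (the actual HF reference frequency and the proposed filter are not specified here):

# Measuring the ~32 kHz LFOSC period against a fast reference clock.
f_lf = 32_000            # Hz, on-chip low-frequency oscillator
f_hf = 96_000_000        # Hz, accurate high-frequency reference (illustrative value)

counts_per_lf_period = f_hf / f_lf          # 3000 HF cycles per LF period
quantization_err = 1 / f_hf                 # ~10.4 ns uncertainty per single-period measurement
ppm_single = quantization_err * f_lf * 1e6  # drift if only one LF period were measured
print(counts_per_lf_period, round(ppm_single, 1), "ppm")   # ~333 ppm

# Averaging N period measurements (roughly what a digital filter does) shrinks the random
# component about as 1/sqrt(N): N = 64 brings ~333 ppm down to ~42 ppm, inside the 500 ppm
# budget, before accounting for correlated noise such as RTN, which the proposed filter targets.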
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionThis study introduces a memory-efficient mixed representation for deep learning recommendation models (DLRM), addressing the embedding table memory bottleneck from growing data scale. By distinguishing between frequently accessed (hot) and infrequently accessed (cold) embeddings, we store hot embeddings in a compact table while representing cold embeddings using a deep hash embedding (DHE) network, significantly reducing memory usage. This hybrid approach performs table lookups for hot embeddings and parallelized computations for cold embeddings, minimizing training time while maintaining accuracy. Experimental results demonstrate that our method outperforms other embedding reduction techniques in memory efficiency, accuracy, and training speed in CPU-GPU hybrid environments.
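A minimal sketch of the hybrid lookup path described above, with made-up dimensions, hash parameters, and a one-layer stand-in for the deep hash embedding (DHE) network:

import numpy as np

rng = np.random.default_rng(0)
D, HOT = 16, 1000                                 # embedding dim, number of hot IDs (illustrative)
hot_table = rng.normal(size=(HOT, D)).astype(np.float32)
W1 = rng.normal(size=(4, D)).astype(np.float32)   # tiny stand-in for the DHE network

def embed(item_id: int, hot_ids: set) -> np.ndarray:
    if item_id in hot_ids:
        return hot_table[item_id]                 # hot path: O(1) table lookup
    # Cold path: derive k hash features and compute the embedding instead of storing it.
    hashes = [(item_id * p) % 997 for p in (31, 37, 41, 43)]
    return (np.array(hashes, dtype=np.float32) / 997.0) @ W1

hot_ids = set(range(HOT))
print(embed(42, hot_ids).shape, embed(123_456_789, hot_ids).shape)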
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionMixture of Experts (MoE) models enable efficient scaling of Large Language Models (LLMs) but demand substantial memory, necessitating offloading and on-demand loading to manage constraints. CPUs are often leveraged to compute expert layers during cache misses, reducing the need for costly GPU loading. However, unpredictable activation patterns in MoE models make task-to-hardware mapping in CPU-GPU hybrid systems highly complex.
We propose HybriMoE, a system that addresses these challenges with (i) dynamic intra-layer scheduling, (ii) impact-driven prefetching, and (iii) score-based caching. Evaluated on kTransformers and Llama.cpp, HybriMoE achieves 1.33x and 1.70x speedups in prefill and decode stages, respectively.
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionThe Mixture-of-Expert (MoE) mechanism has been widely adopted in Transformer-based large language models (LLMs) to enhance generalization and enable model scaling. However, the increasing size of MoE models imposes significant memory demands, leading to suboptimal hardware performance. The emerging multi-chiplet system, with its inherent scalability, offers a potential solution. However, deploying MoE models on chiplet-based architectures introduces new challenges of extensive all-to-all communication and model computational inefficiencies. To alleviate the above issues, this paper presents Hydra, a software/hardware co-design aimed at accelerating MoE inference on chiplet-based architectures. In software, Hydra employs a popularity-aware expert mapping strategy to optimize inter-chiplet communication. In hardware, it incorporates Content Addressable Memory (CAM) to eliminate expensive explicit token (un)-permutation based on sparse matrix multiplications and a redundant-calculation-skipping softmax engine to bypass unnecessary division and exponential operations. Evaluated in 22 nm technology, Hydra achieves latency reductions of 14.2× and 3.5× and power reductions of 169.1× and 18.9× over GPU and state-of-the-art MoE accelerator, respectively, thereby offering a scalable and efficient solution for MoE model deployment.
Networking
Work-in-Progress Poster


DescriptionHyperdimensional computing (HDC) is a brain-inspired paradigm valued for its noise robustness, parallelism, energy efficiency, and low computational overhead. Hardware accelerators are being explored to further enhance its performance, but current solutions are often limited by application specificity and the latency of encoding and similarity search. This paper presents a generalized, reconfigurable on-chip training and inference architecture for HDC, utilizing spin-orbit-torque magnetic (SOT-MRAM) content-addressable memory (CAM). The proposed SOT-CAM array integrates storage and computation, enabling in-memory execution of key HDC operations: binding (bitwise multiplication), permutation (bit rotation), and efficient similarity search. To mitigate interconnect parasitic effect in similarity search, a four-stage voltage scaling scheme has been proposed to ensure accurate Hamming distance representation. Additionally, a novel bit drop method replaces bit rotation during read operations, and an HDC-specific adder reduces energy and area by 1.51× and 1.43×, respectively. Benchmarked at 7nm, the architecture achieves energy reductions of 21.5×, 552.74×, 1.45×, and 282.57× for addition, permutation, multiplication, and search operations, respectively, compared to CMOS-based HDC. Against state-of-the-art HD accelerators, it achieves a 2.27× lower energy consumption and outperforms CPU and eGPU implementations by 2702× and 23161×, respectively, with less than 3% drop in accuracy.
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionFully Homomorphic Encryption (FHE) introduces a novel paradigm in privacy-preserving computation, extending its applicability to various scenarios. Operating on encrypted data, however, imposes significant computational challenges, particularly elevating data transmission and memory access demands. Consequently, developing an efficient system storage architecture becomes vital for FHE-specific architectures. Traditional FHE accelerators use a Host+ACC topology, often focusing on enhancing computational performance and efficient use of on-chip caches, with the assumption that very large volumes of encrypted data are already present in the accelerator's memory while neglecting the inevitable high cost of data transfer. In this paper, we propose Hypnos—a memory-efficient homomorphic encryption processing unit. In Hypnos, we abstract operators from FHE schemes into commands suitable for memory-efficient processing units and combine them with a homomorphic encryption paged memory management system designed for memory access, significantly reducing the memory access and execution time of homomorphic encryption applications. We implement Hypnos on the QianKun FPGA Card and highlight the following results: (1) outperforms SOTA ASIC and FPGA solutions by 2.58x and 4.43x; (2) the communication overhead is reduced by 3.78x compared to traditional architectures; (3) up to 27.6x and 19.06x energy efficiency improvement compared to ASIC-based CraterLake and FPGA-based Poseidon for ResNet-20 respectively.
Networking
Work-in-Progress Poster


DescriptionProcessing-in-Memory (PIM) accelerates memory-intensive workloads by exploiting high memory bandwidth and parallelism. However, PIM performance is maximized only when data placement matches computation; otherwise, the overhead of data movement dominates any performance benefits of PIM. In this work, we explore the challenges of adding PIM to modern GPUs, which use address hashing to load-balance the multiple memory channels. We first demonstrate how hashing in modern GPUs can be reverse-engineered. Based on our understanding of address hashing, we propose I-PIM, a hardware-software co-design architecture to optimize data placement for PIM on GPUs that have address hashing.
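As a purely illustrative example of the kind of XOR-based channel hash being reverse-engineered (the bit masks below are invented, not the recovered ones), each channel-select bit is the parity of a subset of physical address bits:

MASKS = [0x0F00, 0x30C0, 0x5050]        # one hypothetical bit-mask per channel-select bit

def channel_of(addr: int) -> int:
    ch = 0
    for bit, mask in enumerate(MASKS):
        parity = bin(addr & mask).count("1") & 1   # XOR of the masked address bits
        ch |= parity << bit
    return ch

# Placing PIM operands in the same channel then amounts to choosing addresses that
# produce equal parities under every mask.
print(channel_of(0x1000), channel_of(0x2340))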
Engineering Poster
Networking


DescriptionOn modern integrated circuits there are many I/O devices. These I/O devices drive off-chip capacitive loads at various frequencies, which causes switching noise on the power/ground rails. As designs get more complex, it becomes critical to analyze the I/O power grid so that decisions regarding chip I/O planning and package/board design can be made. Traditionally, the I/O power grid analysis is done when the die design is nearly frozen. However, with the shortening design cycles required for modern designs, this is too late. To get an early analysis, Cadence Sigrity can quickly generate "what if" results based on different operating scenarios before the design is fully matured. To model the power grid, the resulting waveforms generated from Sigrity are used in Cadence Voltus to generate a PGV model for the I/O. Because the PGV is generated from the Sigrity results, the operating conditions previously modelled are now part of the PGV. The PGV can then be used in a dynamic IR/EM analysis with a specified switching pattern. The Voltus analysis accounts for the power grid RC and is also timing aware. This presentation will detail how an early design is modelled in Sigrity and Voltus.
Engineering Poster


DescriptionThree-dimensional (3D) ICs have transitioned from academic research to commercial production, primarily driven by advancements in memory technologies. However, the integration of 3D architectures within the realm of analog and mixed-signal (AMS) circuits presents unique challenges. A critical analysis of current Electronic Design Automation (EDA) tools and packaging technologies reveals significant limitations in effectively designing and manufacturing complex 3D AMS systems. These limitations become particularly pronounced in applications demanding high-frequency performance, such as radio frequency (RF) circuits, where the integration of multiple layers of homogeneous materials is essential.
This presentation showcases a cutting-edge 3D RF design solution, powered by our innovative IC Folding technology. Specifically tailored for leading-edge foundry RF processes, this solution addresses the critical challenges of 3D AMS integration. By leveraging these advanced tools, we enable the optimization of high-performance, power-efficient RF systems. Our technology has achieved area reduction in chip size by 35-40%, making it a unique value proposition to our customers.
Engineering Presentation


Front-End Design
DescriptionProblem Statement:
Software resets can drive multiple flops sequentially, so finding the exact point where a reset crossing occurs between a software reset and an asynchronous reset is difficult. It is also hard to identify which software resets actually cause reset violations, because the tool provides no way to view only the crossings from software resets to asynchronous resets.
Methodology:
In the RDC tool, we provide primary inputs such as design-specific reset definitions/constraints and a grouping of all the resets coming from a common reset generator module. We then enable a Setup goal, find all soft-resets in the design under a specific tag, and dump them as constraints using the tool's Tcl-query-based "Custom Report Generation" feature.
These newly generated constraints from the sets of soft-resets are used as inputs when we run the RDC tool with the advanced goal enabled. This helps us find the relevant violations for the reset crossings between the soft-resets and the asynchronous resets in the design under a particular set of tags. The reset grouping for the resets coming from a common reset generator module reduces the number of asynchronous reset crossings and provides only the relevant crossing data.
Results & Conclusions:
Using this methodology, we are able to catch all soft-resets reported by the software team and successfully reduce the effort and time taken to manually review all soft-resets present in the designs.
In this method, we produce only the violations related to soft-resets, which makes the results less noisy relative to general industry standards.
This solution is applicable to all the users who are currently doing or planning to do RDC analysis for the soft-resets, in particular using the RDC tool.
This is a common issue faced across the industry while trying to enable/perform RDC checks for the soft-resets as part of the Sign-Off.
Research Manuscript


Security
SEC2: Hardware Security: Primitives & Architecture, Design & Test
DescriptionIn the evolving field of hardware design, ensuring the security of System-on-Chips (SoCs) has become increasingly vital. As SoCs grow in complexity, integrating components from various sources, the identification and protection of security assets are crucial to prevent vulnerabilities. Traditional methods of identifying these assets are manual and time-intensive. To address this challenge, automated tools for security asset identification are essential, enabling faster and more accurate detection of critical assets early in the design process.
In this paper, we propose a framework for the automated identification of security assets within SoCs. By transforming register-transfer level (RTL) code into graphs and leveraging deep neural networks (DNNs) to classify assets based on their structural patterns, our approach can effectively differentiate between security and non-security assets. Experimental results show that the proposed method achieves high classification accuracy, with the model reaching up to 99% accuracy in identifying security assets, significantly reducing the need for manual intervention.
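A minimal sketch of the RTL-to-graph step such a pipeline relies on (the toy netlist is invented and networkx is an assumed dependency; the paper's graph construction and DNN are not reproduced): nodes are signals and edges follow driver-to-load connectivity, giving the structural patterns a classifier can learn from.

import networkx as nx   # assumed available

# Toy flattened netlist: (cell, output, inputs).
netlist = [
    ("and1", "n1", ["key", "en"]),
    ("xor1", "n2", ["n1", "data"]),
    ("dff1", "q",  ["n2", "clk"]),
]

g = nx.DiGraph()
for cell, out, ins in netlist:
    for i in ins:
        g.add_edge(i, out, cell=cell)     # edge from driver signal to driven signal

print(sorted(g.predecessors("q")), g.number_of_edges())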
Ancillary Meeting


DescriptionIn the current, dynamically changing landscape of computing, the growth of artificial intelligence (AI) applications has caused an exponential increase in energy consumption, re-emphasizing the need to manage the power footprint in chip design. To manage this escalating energy footprint and enable true system-level low-power design, modeling standards play a key role in facilitating interoperability and reuse. IEEE 2416, the "IEEE Standard for Power Modeling to Enable System Level Analysis", introduced in 2019, offers a unified framework spanning system-level to detailed design, facilitating comprehensive low-power design for entire systems. This standard also enables efficiency through contributor-based Process, Voltage, and Temperature (PVT) independent power modeling.
IEEE 2416-2025 will include production industry-driven extensions in analog/mixed-signal, system modeling, and multi-voltage scenarios. Join this open meeting of the IEEE 2416 Working Group to learn how the upcoming release will enhance system power modeling productivity for you and your company. Attendees will include system designers and architects, logic and circuit designers, validation engineers, CAD managers, researchers, and academicians.
Discussions include
o New, advanced features focused on system power modeling in IEEE 2416-2025
o Discussion of IEEE 2416-2025 ballot pool feedback and high-priority actions
o Status and final steps for standardization
Register now! For more details go to si2.org
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionRecently, researchers have leveraged GPU to accelerate graph partitioning to a new performance milestone. However, existing GPU-accelerated graph partitioners are limited to full graph partitioning and do not anticipate incremental update. Incremental partitioning is integral to many optimization-driven CAD applications, where a circuit graph is re-partitioned iteratively as it undergoes incremental modifications during the evaluation of optimization transforms. To unlock the full potential of GPU-accelerated graph partitioning, we introduce iG-kway, an incremental k-way graph partitioner on GPU. iG-kway introduces an incrementality-aware data structure to support graph modifications directly on GPU. Atop this data structure, iG-kway introduces an incremental refinement kernel that can efficiently refine affected vertices after the graph is incrementally modified, with minimal impact on partitioning quality. Experimental results show that iG-kway achieves an average speedup of 84x over the state-of-the-art G-kway, with comparable cut sizes.
Networking
Work-in-Progress Poster


DescriptionBinary Neural Networks (BNNs) are highly effective for image classification and recognition tasks, particularly on power-constrained FPGAs, which are commonly deployed on edge platforms. The FINN framework, a widely used solution, leverages a streaming architecture and a set of novel optimizations to map BNNs onto FPGAs efficiently. However, its design space exploration capabilities remain limited, often leading to suboptimal configurations. To address this, we propose the IM-DSE strategy, a novel multi-target design exploration framework. IM-DSE introduces a multi-objective optimization scheme to identify superior design points by constraining both throughput and Look-Up Table (LUT) consumption. It incorporates an accurate LUT model and a Transferring-Computation (TC) model, which predict LUT usage and processing cycles as functions of the number of SIMDs and PEs per layer. Additionally, IM-DSE employs an intelligent search strategy to efficiently explore optimal accelerator configurations under given constraints. Experimental results demonstrate that, at a target cycle of 1000, IM-DSE achieves an average improvement of 61.72% (up to 87.67%) in energy efficiency and 60.05% (up to 84.42%) in LUT utilization efficiency (FPS/LUT) compared to the state-of-the-art FINN framework across varying LUT constraints.
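The following toy Python sketch illustrates the constrained-exploration idea described above: simple analytical models give LUTs and cycles as functions of per-layer SIMD/PE folding, and the search keeps the cheapest configuration meeting a cycle target. All workloads and coefficients are invented and do not reflect IM-DSE's actual LUT or TC models.

    # Hypothetical sketch of constrained folding-factor search (coefficients are invented).
    import itertools

    layers = [{"ops": 36864, "lut_per_pe_simd": 14},   # toy layer workloads
              {"ops": 65536, "lut_per_pe_simd": 14}]

    def cycles(layer, simd, pe):
        return layer["ops"] // (simd * pe)              # simplified cycle model

    def luts(layer, simd, pe):
        return layer["lut_per_pe_simd"] * simd * pe     # simplified linear LUT model

    TARGET_CYCLES = 1000
    choices = [1, 2, 4, 8, 16]
    best = None
    for cfg in itertools.product(itertools.product(choices, choices), repeat=len(layers)):
        total_cycles = max(cycles(l, s, p) for l, (s, p) in zip(layers, cfg))
        total_luts   = sum(luts(l, s, p) for l, (s, p) in zip(layers, cfg))
        if total_cycles <= TARGET_CYCLES and (best is None or total_luts < best[0]):
            best = (total_luts, cfg, total_cycles)

    print("best LUTs:", best[0], "(SIMD, PE) per layer:", best[1], "cycles:", best[2])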
Engineering Poster
Networking


DescriptionWith the increase in the performance of devices like smartphones and wearables, the need to reduce their size also grows. A crucial aspect of developing the digital portion of these chips is physical synthesis. The integration of artificial intelligence into the back-end flow has been considered, aiming to achieve improvements in performance, power, and area.
The Design Space Optimization (DSO.ai) tool by Synopsys is a machine-learning application that explores the design search space and evaluates outcomes based on user-defined metrics. This AI application increases designer productivity while reducing the need for human and computational resources.
The tool has been embedded in a back-end flow to explore different design solutions for the STPMIC25 device by STMicroelectronics. Due to the limited routing resources, the critical point of the design is congestion. This results in a relatively low utilization ratio with a large amount of empty area in the floorplan. The tool finalizes the design without DRC violations, with congestion values that allow for possible future metal fixes and a potential area reduction of 7%.
The proposed flow may be adopted in the development of a device to explore new design solutions achieving improvements in the metrics of interest.
Engineering Presentation


AI
Systems and Software
DescriptionIn conventional embedded SoC development, software development is tightly coupled with the release plan of the hardware prototype. A Virtual Platform (VP) enables software developers to kick off their development at an earlier stage than before. However, in the early development stage, a VP does not guarantee functional accuracy. Previously, the Test-FW for the HW prototype was reused for VP verification, so VP release timing was delayed by Test-FW preparation. In this paper, we propose the Coreless Test Framework (CTF), a simple and lightweight verification method using a custom Verification IP. Verification utilizing Test-FW is complex and time consuming, as it requires many essential IPs for operating the processor, such as the ARM core, memory, and their corresponding interfaces. The CTF is significantly simplified as it does not require implementing the core model's complexities, and it increases the firmware verification coverage of the early-stage VP by up to 2.4.
Workshop


Design
Sunday Program
DescriptionModern computer architectures and the device technologies used to manufacture them are facing significant challenges, limiting their ability to meet the performance demands of complex applications such as Big Data processing and Artificial Intelligence (AI). The In-Memory Architectures and Computing Applications Workshop (iMACAW) seeks to provide a platform for discussing In-Memory Computing (IMC) as an alternative architectural approach and its potential applications. Adopting a cross-layer and cross-technology perspective, the workshop will cover state-of-the-art research utilizing various memory technologies, including SRAM, DRAM, FLASH, RRAM, PCM, MRAM, and FeFET. Additionally, the workshop aims to strengthen the IMC community and offer a comprehensive view of this emerging computing paradigm to design automation professionals. Attendees will have the opportunity to engage with invited speakers, who are pioneers in the field, learn from their expertise, ask questions, and participate in panel discussions.
Learn more at https://nima.eclectx.org/iMACAW/
Research Special Session


EDA
DescriptionRecent publications have reported that the root-causes of SDEs (silent data errors) include defects that escape manufacturing testing. An escaped defect is due to its behavior deviating from what is predicted by the models and metrics utilized for test generation. In order to reduce escapes, the first step must involve understanding how often and in what manner defect behavior deviates from the models/metrics used for ATPG (automatic test pattern generation). In this work, we describe and demonstrate a methodology for precisely deriving defect behavior from ATE (automatic test equipment) data collected from a failing logic circuit. The gap measured between models/metrics and actual defect behavior for a 14nm industrial test chip is so substantial that we conclude that test quality can only be maintained and improved if fortuitous detection is reduced. In other words, understanding and minimizing the deviations between predicted behavior and actual defect behavior are crucial for enhancing test quality in the context of SDEs.
Exhibitor Forum


DescriptionDesigners of multi-die 2.5D/3D-ICs have faced many new multiphysics issues that were not a concern for single-die design. These include thermal prototyping, electromagnetic coupling of digital signals, mechanical warpage of interposers, and more. This session is a moderated discussion that assembles a panel of experienced 3D-IC design engineers from leading silicon vendors to talk about their experiences with production projects. The adoption of multi-die technology is spreading to more design teams, and this is a valuable opportunity to understand what issues really matter, what lessons were learned, and which techniques have been proven to work.
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
DescriptionNowadays, there is a growing trend to deploy machine learning (ML) models on edge devices. To cope with the increasing resource requirements of current ML models, multi-accelerator edge devices that integrate CPU, GPU, NPU, or TPU in a single SoC gain popularity. However, we observe that existing ML inference serving frameworks are poor in utilizing the unique hardware architecture of these edge devices. In this paper, we present InfScaler, an efficient ML inference serving framework tailored for multi-accelerator edge devices. InfScaler discovers the architectural bottleneck of ML models and designs a bottleneck-aware asymmetric auto-scaling technique to facilitate efficient resource allocation for ML models on the edge. Furthermore, InfScaler capitalizes on the hardware's unified memory feature inherent to edge devices, ensuring efficient data sharing between the asymmetrically scaled model partitions. Our experimental results show that InfScaler achieves up to 126.59% throughput improvement and 27.32% resource reduction while satisfying the latency requirements compared with the state-of-the-art inference serving approaches.
Networking
Work-in-Progress Poster


DescriptionThe most attacked operation in elliptic curve based cryptographic protocols is the scalar multiplication kP. As a defense against simple side-channel analysis (SCA), the atomicity principle and several atomic blocks were proposed. In this paper, we demonstrate that kP algorithms based on atomic patterns are vulnerable to SCA due to clear distinctions between field squaring and multiplication operations. The primary SCA leakage source is the handling of the second operand by the multiplier, creating a visible, one-clock-cycle-long marker. We demonstrated this vulnerability by experimenting with Longa's atomic patterns. This undermines the SCA resistance of many atomic patterns, enabling key extraction.
Engineering Poster
Networking


DescriptionAs semiconductor designs grow increasingly complex, optimizing reliability verification processes, particularly for Electrostatic Discharge (ESD), is critical to managing costs and improving efficiency in cloud environments. Traditional approaches, reliant on increasing hardware resources, often result in marginal performance improvements while significantly escalating costs.
This presentation explores innovative strategies for optimizing ESD verification workflows, emphasizing advanced parallelization techniques, hardware optimization, and predictive resource allocation using machine learning. By adopting these methods, substantial improvements in runtime and cost efficiency were achieved. Key highlights include improved hardware utilization, faster simulations, and reduced resource wastage, all while maintaining scalability and flexibility.
The proposed methodologies effectively address challenges associated with inefficient resource allocation and escalating license costs, presenting a cost-effective and high-performance solution for ESD reliability verification in cloud environments. These advancements mark a significant step forward in aligning verification processes with the demands of modern semiconductor design.
Engineering Poster


DescriptionExtracting the parasitic effects of packaging layers, such as redistribution layers (RDL), is important for enhancing design performance: these layers have been seen to account for more than 10% of the total parasitic capacitance on nets near RDL layers, impacting circuit performance. The Calibre nmPlatform introduces an innovative RDL calibration and extraction flow (with xRC and xACT) that enables both design houses and foundries to optimize their PEX evaluations, taking into account the RDL effect, while relying on the Foundry-qualified decks for all base layers.
Designers can use the Calibre xRC or Calibre xACT tools to perform PEX on layouts containing RDL layers using RDL Addon rule decks that provide highly consistent extraction results on base layers and very high accuracy correlation on nets near RDL layers. The new flow expedites the calibration and extraction process through a simple configuration of the RDL process file, relying entirely on the qualified foundry decks in evaluating parasitics on the base layers underneath the packaging.
In collaboration with GlobalFoundries, the results show that the new robust RDL calibration and extraction approach is supported for most versions of Foundry-qualified decks. When the flow was tested on layouts formerly exhibiting inaccuracies with conventional flows and the results were evaluated against qualified deck-based extraction, the results showed high accuracy versus the reference solver solutions (within 3%) along with a more accurate insight into the final performance of the circuit after adding the RDL layers.
The Calibre RDL PEX flow enabled GlobalFoundries and design houses using GF rule decks for extraction to augment their parasitic extraction evaluations, with minor setup changes and very high accuracy.
Research Manuscript
INSIGHT: A Universal Neural Simulator Framework for Analog Circuits with Autoregressive Transformers
11:00am - 11:15am PDT Monday, June 23 3006, Level 3

EDA
EDA6: Analog CAD, Simulation, Verification and Test
DescriptionThis paper introduces INSIGHT, a data-efficient, adaptive, high-fidelity, technology-agnostic universal neural simulator framework that formulates analog performance prediction as an autoregressive sequence generation task to accurately predict performance across diverse circuits. INSIGHT achieves test R2 scores ≥0.95, outperforming existing neural surrogates. Cross-technology transfer learning experiments show that INSIGHT can preserve model performance with ~60% less training data. Low-Rank Adaptation (LoRA) integration further reduces memory footprint by ~42% and training time by ~25%, maintaining high performance. Our experiments show that INSIGHT-based RL sizing framework achieves ~100-1000X lower simulation costs over existing sizing methods for identical benchmarks and target specifications.
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
DescriptionPhysical design tools have complex workflows with many different ways of optimizing power, performance and area (PPA) out of a large number of options and hyperparameters in different engines and functionalities. Black box optimization techniques are widely adapted to automate quality-of-result (QoR) exploration. Such exploration often proves impractical in real-world customer environments due to high computational demands, lengthy exploration cycles, and the need for large parallel jobs. To reduce the exploration space for viable compute resource requirement, we propose a novel design methodology to enable transferable learning by incorporating design insights crafted on top of physical design experts' experience, and streamlining QoR exploration as a sequence generation task for best recipe selection, where we apply language model-inspired alignment techniques to learn the ranking of different recipe sets, enabling our model to generalize beyond known-good manually tuned expert design recipes. Extensive evaluations demonstrate our method's superior QoRs and runtime performance on unseen industrial designs and rigorous benchmarks.
Research Manuscript
Insights from Rights and Wrongs: A Large Language Model for Solving Assertion Failures in RTL Design
4:15pm - 4:30pm PDT Wednesday, June 25 3003, Level 3

EDA
EDA2: Design Verification and Validation
DescriptionSystemVerilog Assertions (SVAs) are essential for verifying Register Transfer Level (RTL) designs, as they can be embedded into key functional paths to detect unintended behaviours. During simulation, assertion failures occur when the design's behaviour deviates from expectations. Solving these failures, i.e., identifying and fixing the issues causing the deviation, requires analysing complex logical and timing relationships between multiple signals. This process heavily relies on human expertise, and there is currently no automatic tool available to assist with it. Here, we present AssertSolver, an open-source Large Language Model (LLM) specifically designed for solving assertion failures. By leveraging synthetic training data and learning from error responses to challenging cases, AssertSolver achieves a bug-fixing pass@1 metric of 88.54% on our testbench, significantly outperforming OpenAI's o1-preview by up to 11.97%. We release our model and testbench for public access to encourage further research: https://anonymous.4open.science/r/AssertSolver-9022.
Research Manuscript


EDA
EDA3: Timing Analysis and Optimization
DescriptionExisting GPU-accelerated Static Timing Analysis (GPU-STA) efforts aim to build standalone engines from scratch but result in poor correlation with commercial tools, limiting their industrial applicability.
In this paper, we present INSTA, a tool-accurate, differentiable, GPU-STA framework that overcomes these limitations by initializing timing graphs directly from reference STA tools (e.g., Synopsys PrimeTime).
INSTA's core engine utilizes two custom CUDA kernels: a forward kernel for statistical arrival time propagation, and a backward kernel for gradient backpropagation from timing endpoints, enabling two unprecedented capabilities: (1) high-fidelity, rapid timing analysis for incremental netlist updates (e.g., gate sizing), and (2) gradient-based, global timing optimization at scale (e.g., timing-driven placement).
Notably, INSTA demonstrates a near-perfect 0.999 correlation with PrimeTime on a 15-million-pin design in a commercial 3nm node with runtime under 0.1 seconds.
In the experiments, we showcase INSTA's power through three applications: (1) serving as a fast evaluator in a commercial gate sizing flow, achieving 25x faster incremental update_timing runtime with almost no accuracy loss; (2) INSTA-Size, a gradient-based gate sizer that achieves up to 15% better Total Negative Slack (TNS) than PrimeTime's default engine while sizing 68% fewer cells; and (3) INSTA-Place, a differentiable timing-driven placer that outperforms the state-of-the-art net-weighting placer by up to 16% in Half-Perimeter Wirelength (HPWL) and 59.4% in TNS.
We will open-source INSTA upon acceptance.
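As a conceptual sketch of differentiable timing propagation (not INSTA's CUDA kernels or its statistical formulation), the Python snippet below propagates arrival times through a tiny timing graph with a smooth max and uses autograd to obtain endpoint-slack gradients with respect to arc delays; the graph and delay values are invented.

    # Conceptual sketch of differentiable arrival-time propagation (not INSTA's kernels).
    import torch

    # Tiny timing graph in topological order: node -> list of (fan-in node, arc delay index).
    arcs = {"n2": [("n0", 0), ("n1", 1)], "n3": [("n2", 2)], "out": [("n2", 3), ("n3", 4)]}
    delays = torch.tensor([0.30, 0.50, 0.20, 0.40, 0.10], requires_grad=True)
    arrival = {"n0": torch.tensor(0.0), "n1": torch.tensor(0.1)}

    def smooth_max(x, beta=50.0):
        return torch.logsumexp(beta * x, dim=0) / beta   # differentiable surrogate of max()

    for node, fanins in arcs.items():                    # forward propagation
        cands = torch.stack([arrival[src] + delays[i] for src, i in fanins])
        arrival[node] = smooth_max(cands)

    clock_period = 1.0
    slack = clock_period - arrival["out"]
    slack.backward()                                     # backward pass: d(slack)/d(delay)
    print("endpoint slack:", float(slack))
    print("gradients w.r.t. arc delays:", delays.grad.tolist())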
Engineering Poster
Networking


DescriptionAdvancements in semiconductor technology are pushing the boundaries of chip design, but they also bring challenges like power losses and thermal impacts. Addressing these thermal issues early in the design process is crucial. We propose a Multiphysics simulation flow that integrates self-heat analysis with power integrity simulations, fitting into conventional chip design sign-off methodologies. This flow requires thermal resistance inputs, which can be obtained from the foundry or simulated separately. Our detailed simulation setup includes all necessary inputs and flows. In our example, the ambient temperature is set to 110°C. Self-heating causes the power grid net temperature to rise by about 3°C and the signal net temperature by about 8°C, increasing electromigration (EM) limits. We provide heatmaps and tables showing these temperature increases and their impact on EM limits. The entire simulation completes within 24 hours and meets existing chip design requirements. Early self-heat analysis helps identify electromigration variations and hot-spots, mitigating thermal issues before chip tape-out. This methodology will be used for advanced technology nodes.
Networking
Work-in-Progress Poster


DescriptionLarge language models (LLMs) have been widely used in software and hardware areas to help developers generate high-quality code quickly. While existing solutions often rely on commercial LLMs, such as the popular ChatGPT, the trend of training a local model for code generation is growing due to the security concerns of releasing proprietary data to third-party service providers. Still, in the hardware domain, due to the lack of high-quality training datasets, researchers have to rely on commercial LLMs, facing the issue of private training data leakage. This paper adheres to the principle of zero data upload to address data privacy concerns. Instead of commercial LLMs, we propose a localized and transparent solution leveraging local LLMs to synthesize data and eliminate data leakage risks. To overcome offline LLMs' low-performance issues, we propose an innovative approach to constructing code descriptions based on code interpretation. This approach addresses the challenge that even third-party high-performance LLMs, despite their capabilities, still require manually crafted prompts and cannot ensure the generation of high-quality hardware designs. The proposed training process and the new dataset structure help us locally train a hardware design assistant LLM named PrivacyGen. The generated PrivacyGen performs similarly to GPT-4 in complex hardware design generation but has a much smaller size and low total cost of ownership.
Exhibitor Forum


DescriptionThe need for modern workloads in the latest innovations has brought 2.5D/3D stacking and advanced packaging technologies to the forefront. The requirements for integrating multiple chips, components, and materials to create an advanced IC package are becoming increasingly complicated and introduce new challenges to existing extraction and analysis methods. A computational framework which allows for comprehensive extraction of advanced IC package designs is proposed. It is based on a hybrid computational framework, combining different electromagnetic (EM) solvers, and leveraging AI models based on 3D full-wave simulation, to extract different netlist models efficiently and accurately for system-level analysis and optimization. It can perform complete extraction of the most intricate stacked-die systems for a variety of packaging styles and provides co-design automation flows with signoff extraction, static timing analysis (STA) and signoff with signal and power integrity (SI/PI). With exceptional performance and reliability, it enables users to meet tight schedules efficiently.
Research Manuscript


EDA
EDA2: Design Verification and Validation
DescriptionSecurity verification of Network-on-Chip (NoC) systems is essential due to their intricate and high-concurrency structures. Traditional methods often fail to cover all scenarios or scale effectively, leading to prolonged verification and overlooked vulnerabilities. Our proposed solution, InterConFuzz, a hybrid hardware fuzzing technique, uses symbolic execution for extensive coverage. Developed on Universal Verification Methodology (UVM), InterConFuzz discovered five security flaws in the NoC architecture of the OpenTitan SoC—surpassing existing techniques by three—while reducing memory and computational needs by 24.4% and 29.5%, respectively. Furthermore, InterConFuzz furnished comparable functional coverage compared to existing NoC fuzzing approaches, proving its efficiency and robustness.
Engineering Poster
Networking


DescriptionAs ICs continue to evolve towards advanced nodes, the metal layer manufacturing processes and the parasitic resistance and capacitance between interconnects increasingly impact the overall performance of the design. Typically, to ensure the universality and yield of metal layer manufacturing processes, fabs provide conservative design rules and rough model files. This leads to a large real margin and low accuracy of parasitic parameters in actual circuits.
To address these issues, we conduct research on interconnect testkeys in the back-end metal layers. By designing testkeys with specific dimensions and structures, we evaluate potential risks in the manufacturing process, such as metal-line bridges and opens. Additionally, we further analyze the consistency between testkey simulation and test results to determine the deviation levels of parasitic parameters under different structures. This helps differentiate the advantages and disadvantages of different design solutions, providing references for subsequent similar designs. This research effectively identifies risks between interconnects and fully monitors the process platform's state. It also allows for the selection of different interconnect design solutions and improves the accuracy of parasitic parameter extraction. Through this work, we can thoroughly explore the real margin and provide an optimal design, thereby enhancing the competitiveness of the product.
Research Manuscript


Systems
SYS2: Design of Cyber-Physical Systems and IoT
DescriptionIntermittent systems require software support to execute tasks amid frequent power failures. In designing such techniques, software designers rely on execution models that abstract hardware-level operations. In this paper, we propose an execution model that more accurately describes emerging intermittent systems with small energy storage. Our evaluation shows that systems designed based on the traditional models can be up to 5.62x less power-efficient than expected and may result in unsafe checkpoint operations. Our design guidelines enhance the performance of existing static and dynamic checkpoint techniques by 3.04x and 2.85x on average, respectively.
Research Manuscript


Security
SEC2: Hardware Security: Primitives & Architecture, Design & Test
DescriptionIntel SGX is susceptible to intra-enclave software vulnerabilities. Existing automated bug-finding methods primarily focus on fuzzing enclave boundaries for SGX applications in simulated, rather than actual hardware-protected enclaves. This limits the ability to identify potential security violations originating from within SGX application code. This paper presents IntraFuzz, a system that enables efficient fuzzing of SGX applications inside actual hardware enclaves. We evaluated IntraFuzz with 21 real-world SGX applications, running on Intel Xeon scalable processors with up to 256 GB of enclave page cache. IntraFuzz successfully detected all vulnerabilities in SGX application code previously identified by the state-of-the-art tool EnclaveFuzz, as well as 6 previously undiscovered vulnerabilities. These results highlight the importance of hardware-based fuzzing in securing SGX applications.
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionAccelerating Machine Learning (ML) workloads requires efficient methods due to their large optimization space. Autotuning has emerged as an effective approach for systematically evaluating variations of implementations. Traditionally, autotuning requires the workloads to be executed on the target hardware (HW). We present an interface that allows executing autotuning workloads on simulators. This approach offers high scalability when the availability of the target HW is limited, as many simulations can be run in parallel on any accessible HW.
Additionally, we evaluate the feasibility of using fast instruction-accurate simulators for autotuning. We train various predictors to forecast the performance of ML workload implementations on the target HW based on simulation statistics. Our results demonstrate that the tuned predictors are highly effective. The best workload implementation in terms of actual run time on the target HW is always within the top 3% of predictions for the tested x86, ARM, and RISC-V-based architectures. In the best case, this approach outperforms native execution on the target HW for embedded architectures when running as few as three samples on three simulators in parallel.
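A minimal Python sketch of the predictor idea, with synthetic data standing in for simulation statistics and measured run times: fit a simple model on a few measured samples, then rank the remaining candidate implementations by predicted run time. The feature set and model are assumptions, not the paper's predictors.

    # Hypothetical sketch: predict target-HW run time from simulator statistics, then rank.
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic "simulation statistics" per candidate implementation:
    # columns = [instruction count, memory accesses, branch count]
    stats = rng.uniform(1e5, 1e7, size=(40, 3))
    true_w = np.array([2e-9, 8e-9, 1e-9])
    runtime = stats @ true_w + rng.normal(0, 1e-4, size=40)   # pretend-measured run time (s)

    # Fit a linear predictor on a few measured samples, predict the rest.
    train, test = slice(0, 10), slice(10, None)
    w, *_ = np.linalg.lstsq(stats[train], runtime[train], rcond=None)
    pred = stats[test] @ w

    ranking = np.argsort(pred)                 # candidates ranked by predicted run time
    print("predicted-best candidate:", ranking[0],
          "actual-best:", np.argmin(runtime[test]))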
Tutorial


Sunday Program
DescriptionThe objectives of this tutorial are to provide a solid foundation in understanding large language models and their applications, equip participants with the trending AI knowledge to apply self-supervised learning techniques effectively in their own target applications, demonstrate the integration of multimodal data for enhanced AI capabilities, and discuss strategies to improve the efficiency of large-scale models. This tutorial content is designed for researchers, industry practitioners, and students interested in the latest advancements in AI model development and deployment. Our target audience may work in different backgrounds, including but not limited to: EDA researchers or engineers, especially those interested in AI for EDA; computer architecture researchers or engineers, especially those working on AI accelerator design; algorithm researchers or engineers, especially those working on AI algorithms, applications, and products. The tutorial will cover basic large language model (LLM) techniques, including transformer and RAG, self-supervised pretraining techniques, such as contrastive learning, multimodal representation learning, the efficiency of large foundation models, and foundation AI model’s applications in EDA.
Section 1: Basic large language model (LLM) techniques (Zhiyao Xie)
Section 2: Self-supervised pre-training techniques (Wei Wen)
Section 3: Multimodal representation techniques (Wei Wen)
Section 4: Efficiency of large foundation models (Ang Li)
Section 5: Application of foundation models in EDA (Zhiyao Xie)
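As background for the self-supervised pretraining portion (Section 2 above), here is a generic InfoNCE-style contrastive loss in Python; it is standard material, not the tutorial's own code.

    # Generic InfoNCE-style contrastive loss (illustrative, not the tutorial's material).
    import torch
    import torch.nn.functional as F

    def info_nce(z1, z2, temperature=0.1):
        """z1, z2: (batch, dim) embeddings of two augmented views of the same samples."""
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
        logits = z1 @ z2.t() / temperature          # similarity of every pair
        targets = torch.arange(z1.size(0))          # positives sit on the diagonal
        return F.cross_entropy(logits, targets)

    z1, z2 = torch.randn(8, 32), torch.randn(8, 32)
    print("InfoNCE loss:", float(info_nce(z1, z2)))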
Keynote


DescriptionIntroductions & Awards
Keynote


AI
Design
DescriptionIntroductions & Awards
Keynote


DescriptionIntroductions & Awards
Networking
Work-in-Progress Poster


DescriptionIn-vehicle infotainment systems provide a convenient and safe interface for accessing a host of useful features while driving and form an integral part of the internet-of-vehicles (IoV) ecosystem. Previous research has highlighted vulnerabilities of various components within an IoV network to cyber-attacks, particularly in automotive sensor communication channels and Electronic Control Units (ECUs), where breaches enable attackers to gain operation control over vehicles. However, beyond these communication and control interfaces, vulnerabilities in other components of an IoV network, especially in vehicle infotainment systems and web services, remain largely unexplored despite their potential to cause similarly serious consequences. In this work, we design and implement an evaluation framework to uncover security vulnerabilities in in-vehicle infotainment systems and web services, emphasizing that inadequate protection of these systems allows widespread escalation from an isolated vehicle attack to all connected vehicles within the IoV network. Our analysis of representative infotainment systems from several major car manufacturers, including Mercedes and VW, reveals several new vulnerabilities with significant ramifications, such as enabling an attacker to gain back-end control of all connected vehicles in a web service, access to vehicle peripherals (locks, cameras), and privacy information about anyone registered on the IoV network. We found that 7 of the manufacturers are vulnerable, affecting approximately 7 million consumers worldwide. We have responsibly disclosed the vulnerabilities to all parties and requested 6 CVEs which have all been assigned.
Engineering Poster
Networking


DescriptionRTL low power design methodologies focusing on observability and stability-based clock gating schemes are central to almost all power tools available in the market today, but is that sufficient? Despite iterating these schemes over multiple revisions of the RTL, non-ideal power still slips through the design and shows up in silicon.
Engineering Presentation


AI
Back-End Design
Chiplet
DescriptionAs the number of AI parameters increases, the need for 2.5D packaging, including multi-HBM, grows. There are more than 12,000 bumps in a single HBM, increasing the design complexity of the 2.5D Si-interposer and also affecting PDN quality. With manual routing, the existing design methodology, it is difficult to optimize the design after multi-physics analysis due to the large PDN design turnaround time (TAT).
In this paper, we propose a new methodology for optimizing the Si-Interposer PDN design based on IR drop results with 3DIC platform. Firstly, we used the PDN Auto Routing feature to design the PDN. Secondly, we established the flow of analyzing System Level IR drop. Lastly, by integrating these two flows into Cadence Integrity 3D-IC Platform, PDN design could be efficiently performed from C4Bump assign to C4Bump fix.
Through PDN Auto-Routing, PDN design time was reduced to less than 3 hours. Compared to HBM3E, PDN Quality improved by 68% and also total TAT of PDN Design from C4Bump assign to fix decreased by 53%. By analyzing IR drop results in the early design stage, we were able to determine the number of layers, effectively reducing manufacturing TAT and cost.
Engineering Poster
Networking


DescriptionIR-drop is a significant problem in advanced technology nodes due to increases in power density, clock frequency, cell density, and toggle rates, coupled with technology-specific challenges like higher via and metal resistance. IR-drop sign-off therefore becomes a critical issue for design closure. Usually, the low voltage corner or the setup corner is chosen to account for the IR drop, but this approach is highly pessimistic: by applying a guard band (usually equal to the worst-case IR drop within the block), we assume that all the instances in the design are operating at this reduced supply voltage. This is not the case, as the IR drop is different for each instance in the design. To reduce this pessimism, the industry moved to an IR-drop-aware timing signoff flow with native dynamic voltage drop (DvD) support and instance-specific VDD and VSS. Here, dynamic IR drop analysis is performed by a rail analysis tool based on the project specification, and an instance-based customized rail voltage file is generated, which is read by the PrimeTime tool to perform timing analysis and calculate instance-wise delays based on the actual voltage value (after considering the IR drop) at that specific instance. This approach helped remove the pessimism, resulting in earlier timing closure and reductions in area and power. Going further into advanced technology nodes, however, IR drop spikes have become deeper while spanning very narrow time windows. The native DvD-based IR-drop-aware timing signoff flow considers the minimum voltage seen at any time stamp within the cell switching window as the IR-drop voltage used for that instance's delay calculation, whereas a voltage drop of very short duration may not impact cell delay that strongly. To remove this further pessimism, we worked on a more advanced IR-drop-aware timing signoff flow based on accurate piecewise-linear VDD and VSS waveforms. Here, the timing signoff tool (PrimeTime) reads instance-specific piecewise-linear VDD and VSS waveforms generated by a rail analysis tool and performs more accurate timing analysis by picking the realistic worst voltage drops. This presentation gives complete details of how we converged on the advanced IR-drop-aware timing signoff flow, the problems it solves, and the methodology and flow implemented in a lower technology node for an automotive design.
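To illustrate why a very short voltage dip need not dictate the delay voltage, the toy Python calculation below contrasts the single worst sample of a piecewise-linear VDD waveform with a simple window average over a cell's switching window. The numbers are invented, and the averaging is only a crude stand-in for the tool's waveform-aware delay calculation.

    # Toy illustration: worst-sample voltage vs. average of a piecewise-linear VDD waveform
    # over a cell's switching window (numbers are invented; not a tool interface).
    import numpy as np

    t_ps  = np.array([0, 100, 120, 140, 400])        # PWL time points (ps)
    v_vdd = np.array([0.75, 0.75, 0.68, 0.75, 0.75]) # PWL VDD values: a short 20 ps dip

    window = (80, 200)                               # cell switching window (ps)
    tt = np.linspace(*window, 1000)
    vv = np.interp(tt, t_ps, v_vdd)

    worst = vv.min()                                 # pessimistic: single worst sample
    effective = vv.mean()                            # waveform-aware surrogate: window average
    print(f"worst-sample VDD = {worst:.3f} V, window-average VDD = {effective:.3f} V")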
Research Manuscript


EDA
EDA4: Power Analysis and Optimization
DescriptionWith the continued scaling of integrated circuits (ICs), IR drop analysis for on-chip power grids (PGs) is crucial but increasingly computationally demanding.
Traditional numerical methods deliver high accuracy but are prohibitively time-intensive, while various machine learning (ML) methods have been introduced to alleviate these computational burdens.
However, most CNN-based methods ignore the fine structure and topological information of PGs, and face interpretability or scalability issues.
In this work, we propose a novel graph-based framework, IRGNN, leveraging the PG topology with the integration of numerical solutions and point clouds.
Our framework applies a numerical solver, AMG-PCG, to generate rough numerical solutions as a reliable interpretability foundation for ML.
Then, to capture the PG topology, we regard the nodes of the PG as a point cloud and extract point cloud features, and we introduce a novel graph structure, IRGraph.
Furthermore, a novel graph-based model, IRGNN, is designed, incorporating a neighbor distance attention (NDA) layer for distance-aware PG feature aggregation and a graph transformer (GT) layer to capture global information.
It should be noted that our framework can analyze the IR drop of each node in PG, which CNN-based methods cannot do.
Experimental evaluations demonstrate that our framework achieves significantly higher accuracy than previous CNN-based approaches and numerical solvers while substantially reducing computation time.
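For intuition, here is a tiny Python sketch of distance-aware neighbour aggregation on a power-grid-like graph, where closer neighbours receive larger weights; it is illustrative only and not the IRGNN/NDA layer itself.

    # Hypothetical sketch of distance-aware neighbour aggregation (not IRGNN itself).
    import numpy as np

    # Toy power-grid nodes: (x, y) position and a scalar feature (e.g., rough IR-drop estimate).
    pos  = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
    feat = np.array([0.02, 0.05, 0.01, 0.08])
    edges = [(0, 1), (0, 2), (0, 3)]                 # neighbours of node 0

    def aggregate(node, neighbours, alpha=1.0):
        """Weight each neighbour by exp(-alpha * distance), softmax-normalised."""
        d = np.array([np.linalg.norm(pos[node] - pos[n]) for n in neighbours])
        w = np.exp(-alpha * d)
        w = w / w.sum()
        return float(w @ feat[list(neighbours)])

    neighbours = [v for u, v in edges if u == 0]
    print("aggregated feature at node 0:", aggregate(0, neighbours))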
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionTask-oriented object detection is increasingly essential for intelligent sensing applications, enabling AI systems to operate autonomously in complex, real-world environments such as autonomous driving, healthcare, and industrial automation. Conventional models often struggle with generalization, requiring vast datasets to accurately detect objects within diverse contexts. In this work, we introduce iTaskSense, a task-oriented object detection framework that leverages large language models (LLMs) to generalize efficiently from limited samples by generating an abstract knowledge graph. This graph encapsulates essential task attributes, allowing iTaskSense to identify objects based on high-level characteristics rather than extensive data, making it possible to adapt to complex mission requirements with minimal samples.
iTaskSense addresses the challenges of high computational cost and resource limitations in vision-language models by offering two configuration models: a distilled, task-specific vision transformer optimized for high accuracy in defined tasks, and a quantized version of the model for broader applicability across multiple tasks. Additionally, we designed a hardware acceleration circuit to support real-time processing, essential for edge devices that require low latency and efficient task execution. Our evaluations show that the task-specific configuration achieves a 15% higher accuracy over the quantized configuration in specific scenarios, while the quantized model provides robust multi-task performance. The hardware-accelerated iTaskSense system achieves a 3.5x speedup and a 40% reduction in energy consumption compared to GPU-based implementations. These results demonstrate that iTaskSense's dual-configuration approach and situational adaptability offer a scalable solution for task-specific object detection, providing robust and efficient performance in resource-constrained environments.
Engineering Poster


DescriptionWe present our experience with a custom flow named Janus, which enhances the debugging process by converting formal verification traces into UVM tests, effectively bridging the gap between formal verification and dynamic simulation. This dual approach improves bug detection and reduces debug efforts. By leveraging formal verification (FV) to generate focused, concise trace data, Janus translates these traces into UVM-based simulation tests, enabling enhanced debugging and comprehensive coverage generation.
This technique is demonstrated in two key applications: security controller IP with taint verification and LPDDR PHY controllers with timing exception handling. For the security controller, Janus uses formal verification to identify potential taint propagation vulnerabilities, which are then validated through simulation to ensure secure data handling. For the LPDDR PHY controller, timing exceptions—such as false path and multi-cycle path constraints from the user's SDC file—are verified through formal methods, while the translated UVM tests simulate realistic operational conditions to validate the system.
By integrating formal verification with UVM simulation, Janus significantly improves debugging efficiency (by a factor of 4 on an IP), while also providing detailed coverage generation. For the SDC verification, Janus successfully identified and verified 250+ failing exceptions in simulation.
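As a rough illustration of the trace-to-test translation step, the Python sketch below turns a hypothetical (cycle, signal, value) counterexample trace into a directed stimulus snippet that could be replayed from a UVM sequence or driver. The trace format and emitted code are assumptions, not Janus's actual implementation.

    # Toy sketch: turn a (cycle, signal, value) counterexample trace into a directed stimulus
    # snippet. Trace format and output are invented; not Janus's actual flow.
    trace = [(0, "rst_n", 0), (1, "rst_n", 1), (1, "req", 1), (3, "addr", 0x40), (4, "req", 0)]

    lines = []
    prev_cycle = 0
    for cycle, signal, value in trace:
        if cycle > prev_cycle:
            lines.append(f"    repeat ({cycle - prev_cycle}) @(posedge clk);")
            prev_cycle = cycle
        lines.append(f"    vif.{signal} <= 'h{value:x};")

    body = "task drive_counterexample();\n" + "\n".join(lines) + "\nendtask\n"
    print(body)   # paste into a UVM sequence/driver as a directed replay of the formal trace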
Research Manuscript


Design
DES6: Quantum Computing
DescriptionPushing classical simulation methods to their limit is crucial given their exponential complexity. Besides traditional Schrödinger-style simulations, Hybrid Schrödinger-Feynman (HSF) approaches have shown promise by "cutting" circuits into smaller parts to reduce execution times, though this incurs exponential overhead with the number of cuts. We propose "joint cutting" in HSF, where gates are grouped into blocks and cut simultaneously, significantly lowering the aforementioned overhead. Experimental results show that joint cutting can outperform standard HSF by up to a factor of 4000× and Schrödinger-style simulations by up to a factor of 200× in suitable cases.
Engineering Special Session


Back-End Design
DescriptionIn today's fully connected world, attacks on computing and non-computing connected systems are becoming increasingly common. Often, those attacks severely impact the victims. To address such problems, the research community is aggressively working on various aspects of security. In the first talk of the session, the speaker will discuss utilizing 3D heterogeneous integration technology to address semiconductor supply-chain security issues. The second talk will be about the problems and solutions in the context of enhanced assurance for FPGA-centric EDA tools. The third presenter will discuss the challenges of adopting quantum-resistant cryptography at the hardware level. In the fourth topic, the speaker will discuss secure deep-learning-based EDA flows, because ML-assisted EDA tools are also susceptible to security issues.
Research Manuscript


Design
DES6: Quantum Computing
DescriptionSuperconducting qubits are among the most promising candidates for building quantum information processors. Yet, they are often limited by slow and error-prone qubit readout—a critical factor in achieving high-fidelity operations. While current methods, including deep neural networks, enhance readout accuracy, they typically lack support for mid-circuit measurements essential for quantum error correction and usually rely on large, resource-intensive network models. This paper presents KLiNQ, a novel qubit readout architecture leveraging lightweight neural networks optimized via knowledge distillation. Our approach achieves around a 99% reduction in model size compared to the baseline while maintaining a qubit-state discrimination accuracy of 91%. By assigning a dedicated, compact neural network for each qubit, KLiNQ facilitates rapid, independent qubit-state readouts that enable mid-circuit measurements. Implemented on the Xilinx UltraScale+ FPGA, our design is able to perform the discrimination with an average of 32 ns. The results demonstrate that compressed neural networks can maintain high-fidelity independent readout while enabling efficient hardware implementation, advancing practical quantum computing.
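For reference, here is a generic knowledge-distillation loss in Python (soft teacher targets with temperature plus a hard-label term); this is the standard recipe, not KLiNQ's exact training setup, and the per-qubit discriminator shapes are invented.

    # Generic knowledge-distillation loss sketch (standard recipe, not KLiNQ's exact setup).
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
        soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                        F.softmax(teacher_logits / T, dim=1),
                        reduction="batchmean") * (T * T)
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

    student = torch.randn(16, 2)          # tiny per-qubit discriminator: |0> vs |1>
    teacher = torch.randn(16, 2)          # large reference model's logits
    labels  = torch.randint(0, 2, (16,))
    print("KD loss:", float(distillation_loss(student, teacher, labels)))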
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionWith the widespread deployment of long-context large language models (LLMs), efficient and high-quality generation is becoming increasingly important. Modern LLMs employ batching and key-value (KV) cache to improve generation throughput and quality. However, as the context length and batch size rise drastically, the KV cache incurs extreme external memory access (EMA) issues. Recent LLM accelerators face substantial processing element (PE) under-utilization due to the low arithmetic intensity of attention with KV cache, while existing KV cache compression algorithms struggle with hardware inefficiency or significant accuracy degradation. To address these issues, an algorithm-architecture co-optimization, KVO-LLM, is proposed for long-context batched LLM generation. At the algorithm level, we propose a KV cache quantization-aware pruning method that first adopts salient-token-aware quantization and then prunes KV channels and tokens by attention guided pruning based on salient tokens identified during quantization. Achieving substantial savings on hardware overhead, our algorithm reduces the EMA of KV cache over 91% with significant accuracy advantages compared to previous KV cache compression algorithms. At the architecture level, we propose a multi-core jointly optimized accelerator that adopts operator fusion and cross-batch interleaving strategy, maximizing PE and DRAM bandwidth utilization. Compared to the state-of-the-art LLM accelerators, KVO-LLM improves generation throughput by up to 7.32x, and attains 5.52~8.38x better energy efficiency.
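A schematic Python sketch of the two compression ideas mentioned above, quantising cached keys and pruning tokens that receive negligible attention; shapes, thresholds, and the salience criterion are invented and do not reproduce the paper's algorithm.

    # Schematic sketch of KV-cache compression ideas (quantise, then prune low-attention
    # tokens). Thresholds and shapes are invented; this is not the paper's algorithm.
    import numpy as np

    rng = np.random.default_rng(0)
    seq_len, d = 16, 8
    K = rng.normal(size=(seq_len, d)).astype(np.float32)
    q = rng.normal(size=(d,)).astype(np.float32)

    def quantize_int4(x):
        scale = np.abs(x).max() / 7.0
        return np.clip(np.round(x / scale), -8, 7).astype(np.int8), scale

    Kq, k_scale = quantize_int4(K)                    # 4-bit-style quantisation

    attn = np.exp((Kq * k_scale) @ q / np.sqrt(d))
    attn /= attn.sum()
    keep = attn >= (0.5 / seq_len)                    # prune tokens with negligible attention
    print(f"kept {keep.sum()}/{seq_len} tokens; attention mass kept = {attn[keep].sum():.3f}")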
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionMulti-Task Learning (MTL) unifies various tasks into a single network for improved training and inference efficiency, crucial for real-time applications in resource-constrained environments. Most MTL approaches enhance parameter efficiency and task metrics but lack explicit inference latency awareness. We propose LA-MTL, an automated layer-level MTL policy search incorporating a novel analytical latency factor (ALF), balancing task metrics, parameter efficiency, and latency constraints. LA-MTL on ResNet34 achieves up to 50% lower latency on Jetson AGX Orin with competitive metrics for semantic segmentation and depth estimation (+/-2 p.p.) on CityScapes, and surpasses state-of-the-art MTL parameter efficiency by 20 p.p. Code will be published upon acceptance.
Late Breaking Results


DescriptionWe propose DiTTO, a novel diffusion-based framework for generating realistic, precisely configurable, and diverse multi-device storage traces. Leveraging advanced diffusion techniques, DiTTO enables the synthesis of high-fidelity continuous traces that capture temporal dynamics and inter-device dependencies with user-defined configurations. Our experimental results demonstrate that DiTTO can generate traces with high fidelity and diversity while aligning closely with the guided configurations, with only 8% error.
Late Breaking Results


DescriptionThis paper presents FastNN, a novel accelerator architecture for efficient K-Nearest Neighbors (KNN) search in point clouds. FastNN leverages a locality-sensitive E2LSH partitioning method and a pre-comparator module to significantly reduce the candidate search space and minimize the number of Euclidean distance calculations. Compared to octree-based partitioning methods, our approach reduces candidate points by 58.57% to 86.17% and achieves a 10.04× acceleration in processing throughput relative to the BitNN comparator subsystem. The proposed design effectively enhances search throughput, resource utilization, and precision, highlighting its potential for accelerating KNN search on FPGA platforms.
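The paper's exact partitioning scheme is not described here; as a rough illustration of the E2LSH idea it builds on, each point is hashed with floor((a·p + b) / w) for Gaussian a and uniform b, and a query compares Euclidean distances only against points that land in its own bucket, which is what shrinks the candidate set. Multi-probe of neighboring buckets, which practical E2LSH schemes add, is omitted; all names and parameters below are illustrative.

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def make_e2lsh(dim, n_hashes=4, w=0.5):
    a = rng.normal(size=(n_hashes, dim))    # Gaussian projection vectors
    b = rng.uniform(0.0, w, size=n_hashes)  # random offsets in [0, w)
    return lambda p: tuple(np.floor((a @ p + b) / w).astype(int))

def build_buckets(points, h):
    buckets = defaultdict(list)
    for i, p in enumerate(points):
        buckets[h(p)].append(i)             # group point indices by hash key
    return buckets

def knn_candidates(query, points, buckets, h, k=8):
    cand = buckets.get(h(query), [])        # only points sharing the query's bucket
    cand = sorted(cand, key=lambda i: np.linalg.norm(points[i] - query))
    return cand[:k]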
Late Breaking Results


DescriptionMacro placement is crucial in VLSI design, directly impacting circuit performance. We introduce MacroDiff, a diffusion-based macro placement generative model that captures wirelength relationships instead of directly predicting macro coordinates. By leveraging wirelength as an intermediate representation, MacroDiff naturally preserves circuit connectivity, reduces placement constraints, and enhances solution flexibility while inherently handling rotational and translational invariances. Experiments on ISPD2005 benchmarks show that MacroDiff reduces macro overlap by 91.6%, lowers macro legalization displacement by 74.4%, and improves half-perimeter wirelength (HPWL) by 7.0%. While maintaining the efficiency of generative approaches, MacroDiff generates high-quality placements more reliably, narrowing the gap with state-of-the-art methods. The source code for this work is available at https://anonymous.4open.science/r/MacroDiff-E806.
Late Breaking Results


DescriptionThis paper introduces an automated placement framework to optimize component positioning on modern printed circuit boards (PCBs), addressing challenges posed by heterogeneous components, irregular geometries, and complex design rules. The framework employs three key techniques to enhance placement quality and efficiency: (1) a global placement approach integrating collision detection via the Separating Axis Theorem to handle exact component contours and board shapes, (2) a multi-stage force-directed method that dynamically adjusts attractive and repulsive forces to meet clearance and routability constraints, and (3) a scanline-based legalization technique to resolve overlaps and enforce spacing requirements. Our methodology effectively adapts to diverse design limitations and accelerates the process while preserving placement quality. Experimental results demonstrate that our placer significantly improves routability over state-of-the-art solutions, demonstrating robust performance on industrial PCB designs with complex and irregular constraints.
Late Breaking Results


DescriptionTrack assignment is a stage introduced between global routing and detailed routing. Based on the independence and divisibility of track assignment, we propose a GPU-accelerated parallel track assignment algorithm. To estimate routability more accurately, the algorithm adopts a track assignment strategy that considers both global and local nets and incorporates several optimization techniques. Moreover, an asynchronous parallelism strategy is proposed to divide the computation of routing resources and the track assignment into fine-grained tasks. Experimental results show that, compared to related works, our algorithm achieves an overall speedup with better routability estimation.
Late Breaking Results


DescriptionThis work presents an automated methodology for optimizing power amplifier (PA) design by predicting the most suitable circuit topology. A bidirectional long short-term memory (BiLSTM) deep neural network (DNN) is trained to determine the optimal PA topology, while multi-objective Pareto front optimization refines the network hyperparameters. The proposed approach is validated through high-performance PAs using lumped elements and transmission lines at a 1–2 GHz frequency range. The method is demonstrated using the Cree CGH40010 GaN HEMT on a Rogers RO4350B substrate, achieving power output of ∼40 dBm, power-added efficiency of at least 50%, and power gain exceeding 10 dB.
Late Breaking Results


DescriptionStatistical timing characterization for standard cells faces significant computational challenges due to the laborious bisection analysis for the setup/hold constraints of sequential cells. To address this issue, we propose a Bisection-Free Learning Approach for Statistical Timing Characterization (BLAST) by extracting the inherent delays of the data path and clock path in sequential cells as specific features. Multi-task learning is implemented with a multi-gate mixture-of-experts (MMoE) model to exploit the profound interdependency between setup and hold constraints for different timing arcs, where an active learning strategy is incorporated to improve learning efficiency. Experimental results under 135 PVT corners with a TSMC 12nm process demonstrate that the proposed BLAST achieves considerable acceleration by avoiding the iterative bisection search for statistical constraint prediction, with a 76.9% runtime reduction compared to the commercial tool. Excellent prediction accuracy is achieved for various flip-flops by BLAST, with a relative root mean square error (rRMSE) of 2.21% and a worst-case absolute error (WCAE) of 0.82 ps.
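For context, the bisection analysis that BLAST avoids is, in its generic form, a binary search for the smallest data-to-clock offset at which a sequential cell still passes a capture (or delay-pushout) criterion, with one circuit simulation per step. The sketch below is that generic loop; simulate_ff is a hypothetical stand-in for a SPICE run and is not part of the paper.

def bisect_setup(simulate_ff, lo, hi, tol=1e-12):
    # simulate_ff(offset) -> True if the cell still captures correctly (one SPICE run).
    # Precondition: the cell fails at offset `lo` and passes at offset `hi`.
    assert simulate_ff(hi) and not simulate_ff(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if simulate_ff(mid):
            hi = mid   # mid still passes: the setup constraint is at or below mid
        else:
            lo = mid   # mid fails: the constraint lies above mid
    return hi          # smallest passing offset found within tolerance

Because every iteration costs a full simulation, repeating this search across 135 PVT corners and many timing arcs is what makes the conventional flow expensive, and what a learned predictor sidesteps.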
Late Breaking Results


DescriptionLayout-dependent effects (LDEs) significantly impact analog circuit performance. Traditionally, designers have relied on symmetric placement of circuit components to mitigate variations caused by LDEs. However, due to the non-linear nature of these effects, conventional methods often fall short. We propose an objective-driven, multi-level, multi-agent Q-learning framework to explore the unconventional design space of analog layout, opening new avenues for optimizing analog circuit performance. Our approach achieves better variation performance than the state-of-the-art layout techniques. Notably, this is the first application of multi-agent RL in analog layout automation. The proposed approach is compared with a non-ML approach based on simulated annealing.
Late Breaking Results


DescriptionIn this paper, we propose a customized diffusion model to directly generate high-quality initial floorplans.
By leveraging a classical analytical-based floorplanner on top of this initial floorplan, the final floorplanning results are significantly improved.
To enhance feature extraction, a heterogeneous graph neural network (HGNN) is developed to explicitly incorporate block-to-block and pin-to-block relationships from the netlist during the diffusion process.
Additionally, a novel guidance sampling function is introduced to optimize both wirelength and overlap, effectively reducing the required sampling steps while maintaining competitive initial solutions.
Experimental results demonstrate that integrating our proposed diffusion model with an advanced analytical-based floorplanner achieves at least 4.8% reduction in runtime and 3.0% reduction in HPWL compared to the original floorplanner and other diffusion-based methods.
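The guidance sampling function is described only at a high level; the sketch below shows the standard cost-guided form such a step usually takes, where the gradient of a differentiable wirelength-plus-overlap cost nudges the denoised placement. It is a generic illustration, not the paper's exact update rule; denoiser, wirelength, and overlap are assumed callables supplied by the caller.

import torch

def guided_sampling_step(x_t, t, denoiser, wirelength, overlap,
                         guidance_scale=0.1, lam=(1.0, 1.0)):
    x = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x, t)                         # model's denoised placement estimate
    cost = lam[0] * wirelength(x0_hat) + lam[1] * overlap(x0_hat)
    grad = torch.autograd.grad(cost, x)[0]          # sensitivity of the cost to the noisy sample
    return x0_hat - guidance_scale * grad           # steer the step toward lower cost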
Late Breaking Results


DescriptionRemote Attestation (RA) has become a valuable security service for Internet of Things (IoT) devices, as the security of these devices is often not prioritized during the manufacturing process. However, traditional RA schemes suffer from a single point of failure because they rely on a trusted verifier. To address this issue, we propose a voting-based blockchain attestation protocol that provides a reliable solution by eliminating the single point of failure through distributed verification across all nodes. In addition, it offers a traceable and immutable public history of the attestation results, which can be verified by external auditors at any time. Finally, we verify our proposed protocol on three NVIDIA Jetson embedded devices hosting up to 15 attestation nodes.
Late Breaking Results


DescriptionIn this paper, disruptive research using generative diffusion models (DMs) with an attention-based encoder-decoder backbone is conducted to automate the sizing of analog integrated circuits (ICs). Unlike time-consuming optimization-based methods, the encoder-decoder DM is able to sample accurate solutions at push-button speed by solving the inverse sizing problem. Experimental results show that the proposed model outperforms the most recent deep learning-based techniques, presenting higher generalization capabilities to performance targets not seen during training.
Late Breaking Results


DescriptionThe understanding and reasoning capabilities of large language models (LLMs) with text data have made them widely used for test stimuli generation. Existing studies have primarily focused on methods such as prompt engineering or providing feedback to the LLMs' generated outputs to improve test stimuli generation. However, these approaches have not been successful in enhancing the LLMs' domain-specific performance in generating test stimuli. In this paper, we introduce a framework for fine-tuning LLMs for test stimuli generation through dataset generation and reinforcement learning (RL). Our dataset generation approach creates a table-shaped test stimuli dataset, which helps ensure that the LLM produces consistent outputs. Additionally, our two-stage fine-tuning process involves training the LLMs on domain-specific data and using RL to provide feedback on the generated outputs, further enhancing the LLMs' performance in test stimuli generation. Experimental results confirm that our framework improves syntax correctness and code coverage of test stimuli, outperforming commercial models.
Late Breaking Results


DescriptionIn this work, we propose FPGen-3D, an automated framework for 3D field-programmable gate array (FPGA) architecture generation and exploration. FPGen-3D generates custom 3D FPGA fabrics based on user-defined architectural parameters, producing synthesizable register-transfer level (RTL) code, routing results, and programming bitstreams, enabling efficient navigation of the 3D FPGA design space. We demonstrate its capabilities through a case study of generating and exploring various vertical connection strategies via 3D switch blocks (SBs) and 3D configurable logic blocks (CLBs). Results show that 3D FPGAs can achieve lower wire length than traditional 2D FPGAs, with a decrease of up to 18.5%.
Late Breaking Results


DescriptionHybrid optimization is an emerging approach in logic synthesis, focusing on applying diverse optimization methods to different parts of a logic circuit. This paper analyzes the relationship between each vertex and its corresponding optimization method. We extract a subgraph centered on each vertex and quantify the logic optimization results of these subgraphs as vertex features. Based on these features, we propose a circuit partitioning method to cluster the logic circuit, enabling the final optimized circuit to be constructed by merging clusters optimized with their respective methods. Additionally, we introduce a self-supervised prediction model to efficiently obtain vertex features. The experimental results targeting LUT mapping demonstrate that our method achieves improvements of 8.48% in area and 9.81% in delay compared to the state-of-the-art.
Late Breaking Results


DescriptionDesigning an efficient arithmetic division circuit has long been a significant challenge. Traditional binary computation methods rely on complex algorithms that require multiple cycles, complex control logic, and substantial hardware resources. Implementing division with emerging in-memory computing technologies is even more challenging due to susceptibility to noise, process variation, and the complexity of binary division. In this work, we propose an in-memory division architecture leveraging stochastic computing (SC), an emerging technology known for its high fault tolerance and low-cost design. Our approach utilizes a magnetic tunnel junction (MTJ)-based memory architecture to efficiently execute logic-in-memory operations. Experimental results across various process variation conditions demonstrate the robustness of our method against hardware variations. To assess its practical effectiveness, we apply our approach to the Retinex Algorithm for image enhancement, demonstrating its viability in real-world applications.
Late Breaking Results


DescriptionIntegrating deep learning and image sensors has significantly transformed machine vision applications. Yet, conventional high-resolution image acquisition schemes enabled by imagers are energy-inefficient for deep learning, as they involve excessive data quantization and transmission overhead. To address this challenge, we propose a lightweight in-sensor compressive learning framework that integrates a compressive learning-based encoder within image sensors for task-specific feature extraction. Our framework encodes raw images into adaptive low-dimensional representations using only a 1-bit encoder by joint optimization with downstream machine vision tasks. It achieves 10× data compression, a minimal 1.6% accuracy loss on the task, and 3.93× energy savings at the sensor end, outperforming prior arts.
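As a rough picture of what a 1-bit in-sensor encoder computes, the sketch below binarizes a learned linear projection of the raw pixel values; in the proposed framework the projection would be trained jointly with the downstream task, whereas here the weights are random purely for illustration and the dimensions are made up.

import numpy as np

rng = np.random.default_rng(0)

def one_bit_encode(image, W):
    x = image.reshape(-1).astype(np.float32)   # flatten raw pixel values
    return (W @ x >= 0.0).astype(np.uint8)     # 1-bit measurements leave the sensor

H, Wd, M = 128, 128, 1024                      # illustrative image size and code length
W = rng.normal(size=(M, H * Wd))               # stand-in for the learned encoder weights
code = one_bit_encode(rng.integers(0, 256, size=(H, Wd)), W)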
Late Breaking Results


DescriptionClustering single-bit flip-flops (SBFFs) into multi-bit flip-flops (MBFFs) effectively reduces power and area. However, excessive displacement during the clustering and legalization process may incur significant timing degradation. To address this issue, we propose the first comprehensive MBFF placement methodology that handles the excessive displacement caused by pre-placed cells during clustering and legalization while simultaneously optimizing timing, power, area, and bin utilization. Our methodology includes three main features: (1) a force model to relocate flip-flops and reduce timing violations, (2) a clustering and legalization process to reduce timing degradation caused by displacement, and (3) a multi-objective function to identify flip-flop candidates suitable for MBFF clustering. Our methodology outperforms all participating teams in the 2024 CAD Contest at ICCAD on Power and Timing Optimization Using Multi-Bit Flip-Flops, under exactly the same settings.
Late Breaking Results


DescriptionOwing to their higher energy and hardware efficiency compared to deep neural networks, spiking neural networks (SNNs) have attracted considerable attention. However, their security must be investigated, given that they have access to private and confidential data. Physically unclonable functions (PUFs) are a class of circuits with security applications like device authentication, embedded licensing, device-specific cryptographic key generation, and anti-counterfeiting. Therefore, PUFs can be used to enhance the security of SNNs. Accordingly, in this paper, an MTJ-based LIF neuron/PUF is proposed. The proposed design can function as both a LIF neuron and a PUF. The results of the Monte Carlo simulation show that the proposed design has better uniqueness and uniformity values compared to its counterparts. These values for the proposed design are 50.07% and 49.66%, which are close to their ideal value of 50%. Also, the mean value of Shannon entropy for the 128-bit PUF response of the proposed design is 0.9974, which is close to its ideal value of 1.
Late Breaking Results


DescriptionInspired by the human brain, Hyperdimensional Computing (HDC) processes information efficiently by operating in high-dimensional space using hypervectors. While previous works focus on optimizing pre-generated hypervectors in software, this study introduces a novel on-the-fly vector generation method in hardware with O(1) complexity, compared to the O(N) iterative search used in conventional approaches to find the best orthogonal hypervectors. Our approach leverages Hadamard binary coefficients and unary computing to simplify encoding into addition-only operations after the generation stage in ASIC, implemented using in-memory computing. The proposed design significantly improves accuracy and computational efficiency across multiple benchmark datasets.
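One plausible reading of the O(1) claim is that each element of a Hadamard-based hypervector can be computed directly from its indices, with no stored codebook or search: for a Sylvester-ordered Hadamard matrix, entry (i, j) is +1 exactly when popcount(i AND j) is even, and distinct rows are mutually orthogonal by construction. The sketch below illustrates that property; it is an interpretation, not necessarily the paper's exact construction.

def hadamard_bit(i: int, j: int) -> int:
    # Entry (i, j) of a Sylvester-ordered Hadamard matrix, computed on the fly.
    return 1 - 2 * (bin(i & j).count("1") & 1)   # +1 if popcount(i & j) is even, else -1

def hypervector(i: int, dim: int):
    # The i-th Hadamard row as a bipolar hypervector (dim must be a power of two).
    return [hadamard_bit(i, j) for j in range(dim)]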
Late Breaking Results


DescriptionThe front-end synthesis of analog circuits has been a long-standing challenge since the advent of integrated circuits. Many methods, ranging from conventional optimization-based techniques to emerging learning-based approaches, have been extensively explored to address this challenge. Yet, these methods are data-driven and often suffer from low design efficiency, due to their heavy reliance on time-consuming circuit simulators, which are frequently used in the synthesis loop for real-time evaluation of the evolving circuit design. In addition, benchmarking these methods is also largely unachievable due to their exclusive use of commercial semiconductor technology for evaluation. This "Late Breaking Results" introduces Opera, an open and efficient platform for the data-driven synthesis of analog circuits. Specifically, Opera develops efficient surrogate models for various circuits and integrates them into open-source OpenAI Gym-like environments to enable efficient synthesis. Case studies on exemplary circuits show that this platform can accelerate the conventional data-driven synthesis flow by up to 40×. It also enables the benchmarking of various synthesis methods with standardized environments built upon an open-source semiconductor process.
Late Breaking Results


DescriptionGlobal routing is a critical stage in the VLSI design flow, aiming to provide a robust guide for detailed routing and serve as early design feedback for placement. Many approaches have leveraged GPU parallelization to achieve significant acceleration and reduce runtime. However, with the increasing size and complexity of modern large-scale designs, recent GPU-accelerated maze routing approaches, driven by the sweep operation, struggle to find solutions efficiently within limited GPU memory resources. In order to address this issue, this paper proposes a scalable, GPU-friendly sweep-based maze routing methodology that requires significantly less memory and kernel function calls while accelerating overall runtime. We introduce a sweep-sharing technique that allows multiple nets to be routed simultaneously within a single sweeping process, significantly enhancing memory efficiency and reducing kernel launching overhead. We further propose an edge-level rip-up-and-reroute technique that selectively reroutes only overflowed segments, preserving feasible parts of the solution and substantially reducing runtime. Experimental results on the latest ISPD'24 Contest benchmarks demonstrate that our GPU-friendly maze routing with sweep-sharing technique can significantly improve the efficiency over the state-of-the-art GPU-accelerated maze router.
Late Breaking Results


DescriptionDynamic workloads running on multiple hosts will bring changing access patterns on CXL-enabled shared disaggregated memory. Existing works often un-traceably cache multi-source accesses, making it hard to exploit each host's access behavior and assure service quality. Our solution Alchemy jointly optimizes cache replacement and bypassing and runs as an online reinforcement learning agent with source-aware adaptivity. It gives rewards derived from sampling-based action effectiveness and per-host macro performance. The multi-host prototype-based results on FPGAs show 8.71%-14.56% reduction in average access latency over LRU policy and 44x faster than the hardware-efficient ICGMM method in decision-making with comparable overhead.
Late Breaking Results


DescriptionStatistical Static Timing Analysis (SSTA) is a crucial technique in digital circuit design because it addresses on-chip variations (OCV) by propagating timing distributions instead of fixed delays. However, the computational complexity of SSTA demands significant memory and long runtimes. While GPUs offer opportunities to accelerate SSTA, their limited memory capacity makes it challenging to handle large-scale SSTA workloads. To address this challenge, we propose a statistical timing graph (STG) scheduling algorithm combined with a GPU memory management strategy. We have shown up to 4.9x speedup on a GPU with 16 GB memory compared to a 20-thread CPU baseline when solving an 18.2 GB STG.
Late Breaking Results


DescriptionThe security of Internet of Things (IoT) devices is crucial to protect the vast amounts of data exposed due to their widespread adoption. Authentication is one of the key aspects of IoT security, but it becomes increasingly challenging, especially for resource-constrained devices that require lightweight and efficient solutions. Physical Unclonable Functions (PUFs) have emerged as a promising lightweight solution by using the unique physical properties of Integrated Circuits (ICs). The Pseudo Linear Feedback Shift Register PUF (PLPUF) is one of the state-of-the-art implementations, known for its flexibility in altering the challenge-response space by changing the activation duration. In this work, we demonstrate that selecting an appropriate activation duration for PLPUF is critical, as improper choices can compromise security. By analyzing the linear dependency between the responses of different PLPUF pairs, our results reveal that predictability can reach up to 96% when an unsuitable activation duration is chosen.
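The analysis behind the 96% figure is not detailed in the abstract; as a very rough illustration of measuring linear dependency, one can fit a least-squares linear map from one PLPUF instance's responses to another's over a shared challenge set and report how many bits it predicts correctly. The +/-1 encoding and fitting choices below are assumptions for illustration only.

import numpy as np

def linear_predictability(R_a, R_b):
    # R_a, R_b: (n_challenges, n_bits) 0/1 response matrices of two PLPUF instances.
    Xa = 2.0 * R_a - 1.0                          # map bits to +/-1
    Xb = 2.0 * R_b - 1.0
    W, *_ = np.linalg.lstsq(Xa, Xb, rcond=None)   # linear map: A-responses -> B-responses
    pred = np.sign(Xa @ W)
    return float(np.mean(pred == np.sign(Xb)))    # fraction of correctly predicted bits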
Late Breaking Results


DescriptionAs process technology advances, reducing leakage power as much as possible is one of the most challenging tasks in chip implementation. Utilizing cells with multiple threshold voltages (multi-VT) is known to be a very effective method for optimizing leakage power under timing constraints. However, for the sequential cells on the timing-critical paths in circuits, there is no easy way for the multi-VT method to reduce the leakage power without sacrificing timing. To overcome this barrier, we introduce a set of new standard cells called hybrid-VT flip-flop cells, each of which is implemented with two different VT types, one implanted onto its master latch and the other onto its slave latch, so that the setup time and clock-to-Q delay can be controlled individually and independently. We confirm that applying our power recovery method utilizing hybrid-VT flip-flop cells to the benchmark circuits, which have already been optimized by conventional multi-VT cells, further reduces the leakage power by 8.97% with no timing degradation.
Late Breaking Results


DescriptionThis paper presents a high-speed NAND-based 4:1 multiplexer (MUX) for various logic operations in a resistive random-access memory (RRAM) crossbar structure. Compared to existing 4:1 MUX designs that require 10 steps, our proposed method completes execution in 3 steps. Additionally, the proposed 4:1 MUX-based full adder, full subtractor, and 1-bit comparator reduce clock cycles by 75% compared to the state-of-the-art. Despite an increase in memristor count, the reduction in computational delay makes this approach promising for fast in-memory computing.
Late Breaking Results


DescriptionThis paper presents the first warpage-aware generative learning-based floorplanning algorithm to effectively model the warpage effect and optimize the die floorplan on a fixed outlined substrate. With more heterogeneous materials and dense interconnects in advanced packaging, warpage is a main reliability concern and may degrade system performance. We present a novel transformer-based encoding scheme to learn node and edge representations, followed by parallel decoding and warpage-aware legalization to jointly minimize die displacement and warpage. Experimental results show that our algorithm improves warpage by 9.9% and wirelength by 8.3% on average, compared with the state-of-the-art work.
Research Manuscript


Security
SEC2: Hardware Security: Primitives & Architecture, Design & Test
DescriptionIn recent years, cloud providers have been working to enable FPGA multi-tenancy to improve resource utilization, but this new sharing model introduces power side-channel threats, where attackers detect voltage fluctuations from co-located circuits. This paper proposes LeakyDSP, a novel on-chip sensor that maliciously configures DSP blocks, a resource overlooked by existing studies, to sense fine-grained voltage fluctuations. Our experimental results show that LeakyDSP achieves high sensitivity to voltage fluctuations and strong robustness to different placements. Besides, we apply LeakyDSP to extract full AES keys with 25k-78k traces and build covert channels with a high transmission rate of 247.94 bit/s.
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
DescriptionDynamic graph processing systems using conventional array-based architectures face significant throughput limitations due to inefficient memory access and index management. While learned indexes improve data structure access, they struggle with interconnected graph data. We present LearnGraph, a novel architecture with an adaptive tree-based memory manager that dynamically optimizes for graph topology and access patterns. Our design integrates two key components: a hierarchical learned index optimized for graph topology to predict vertex and edge locations, and an adaptive tree structure that automatically reorganizes memory regions based on access patterns. Evaluation results demonstrate that LearnGraph outperforms state-of-the-art dynamic graph systems, achieving 3.4× higher throughput on average and reducing processing time by 1.7× to 11× across standard graph workloads.
Research Manuscript


Systems
SYS1: Autonomous Systems (Automotive, Robotics, Drones)
SYS2: Design of Cyber-Physical Systems and IoT
DescriptionThe design of controllers for safety-critical systems is an important research issue. In particular, the generation of controllers with formal safety guarantees is a challenging problem. Recently, for safety objectives of various system control tasks, machine learning technologies have been used to achieve strong training and simulation performance, but formal guarantees are still lacking. This paper takes advantage of learning technology to assist safe controller synthesis with formal guarantees. On the one hand, the generation of verifiable safe controllers is aided by reinforcement learning; on the other hand, a set of barrier certificates (BCs), i.e., a vector BC, is synthesized with the aid of deep learning to certify the safety of the synthesized controllers. A vector BC is more expressive than a conventional single BC for safety verification. Compared with existing work on vector BC generation, our method has two advantages: first, it verifies a learned candidate vector BC rather than directly generating a verified one, and thus has low computational complexity; second, the existing method relaxes the non-convex vector BC constraints, which reduces the feasible region of solutions, whereas our method can handle the original constraints. Furthermore, experiments demonstrate the effectiveness of the proposed method on a series of benchmarks.
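For readers unfamiliar with barrier certificates, the standard scalar conditions for a system dx/dt = f(x, u(x)) with initial set X0 and unsafe set Xu are shown below; a vector BC replaces the scalar B with a map into R^m whose components satisfy coupled, component-wise versions of these conditions under a comparison system. This is the common textbook formulation, not necessarily the exact conditions used in the paper.

\[
\begin{aligned}
& B(x) \le 0 && \forall x \in X_0 \quad \text{(all initial states lie inside the barrier)} \\
& B(x) > 0 && \forall x \in X_u \quad \text{(all unsafe states lie outside)} \\
& \nabla B(x) \cdot f(x, u(x)) \le 0 && \forall x \ \text{with} \ B(x) = 0 \quad \text{(trajectories never cross the zero level set)}
\end{aligned}
\]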
Networking
Work-in-Progress Poster


DescriptionTraditional approaches for designing analog circuits are time-consuming and require significant human expertise. Existing automation efforts using methods like Bayesian Optimization (BO) and Reinforcement Learning (RL) are sub-optimal and costly to generalize across different topologies and technology nodes. In our work, we introduce a novel approach, LEDRO, utilizing Large Language Models (LLMs) in conjunction with optimization techniques to iteratively refine the design space for analog circuit sizing. LEDRO is highly generalizable compared to other RL and BO baselines, eliminating the need for design annotation or model training for different topologies or technology nodes. We conduct a comprehensive evaluation of our proposed framework and baseline on 22 different Op-Amp topologies across four FinFET technology nodes. Results demonstrate the superior performance of LEDRO as it outperforms our best baseline by an average of 13% FoM improvement with 2.15x speed-up on low complexity Op-Amps and 48% FoM improvement with 1.7x speed-up on high complexity Op-Amps. This highlights LEDRO's effective performance, efficiency, and generalizability.
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionDesigning processor microarchitectures is increasingly challenging due to a vast design space and the need to balance multiple performance metrics. Traditional algorithm-driven design space exploration (DSE) approaches often struggle to incorporate the extensive domain knowledge of expert architects. To address this, we introduce LEMOE, a multi-objective microarchitecture optimization framework that leverages large language model (LLM) to enhance an implicit Bayesian model. LEMOE features a program-aware warm-up phase utilizing LLM and LLVM to produce an initial design set with rich prior knowledge. By harnessing LLM's contextual learning, our approach improves surrogate modeling and sampling under sparse data conditions. Experiment results show that LEMOE achieves a 22.8% improvement in energy efficiency with the same number of iterations and a 2.9× runtime speedup for the same target compared to prior works.
Research Manuscript


Systems
SYS5: Embedded Memory and Storage Systems
DescriptionEdge intelligent workstations load (store) massive empirical data from (to) remote cloud storage due to limited local storage. However, current remote storage access frameworks are complex. They use expensive computing resources to manipulate multiple concurrent request queues in modern high-speed storage devices for saturating performance. Complex software stacks and limited CPUs on edge intelligent workstations hinder saturating concurrent request queues, thus resulting in up to 75% performance degradation for remote storage. We propose Leopard, a hardware pass-through remote storage access framework with queue concurrency, which provides lossless remote storage access for edge intelligence. Leopard proposes a custom NVMe controller using a SmartNIC's FPGA core to emulate it as an NVMe device, which eliminates complex remote storage stacks for edge workstations. Operations for remote storage access are implemented as hardware circuits inside the controller to eliminate CPU cycles. Parallelized and pipelined workflows are proposed for hardware circuits to accelerate remote storage access operations. Our evaluation shows that Leopard achieves 1.09×~6.04× lower remote storage access latency than SOTA solutions for realistic workloads in edge intelligent workstations.
Exhibitor Forum


DescriptionChip development schedules and costs have risen significantly, especially for complex digital designs. Engineering teams face the challenge of achieving aggressive power, performance, area (PPA), and quality metrics on shorter schedules and with smaller teams.
AI has come a long way in EDA, from the initial machine learning applications to reinforcement learning, generative AI, and beyond. Technologies are in a breakthrough position in semiconductor design creation today, where AI-assisted EDA workflows are expected to lead to significantly higher productivity and better quality of results. AI can be prevalent in the digital design creation flow, providing tangible benefits to engineering in seamless and intuitive enhancements to the flow.
In this session, we will explore AI methods in the digital design creation flow. From enhancing High-Level Synthesis for accelerated design exploration, quantization analysis and PPA prediction, to accelerated RTL-to-GDS implementation flows that also produce better results, and employing AI techniques for improved fault isolation, we will discuss how AI is revolutionizing digital flows today, and how we can further leverage this exciting technology going forward.
Networking
Work-in-Progress Poster


DescriptionWith continuous technology scaling, accurate and efficient glitch modeling plays a critical role in designing high-performance, low-power, and reliable ICs. In this work, we introduce a new gate-level approach for glitch propagation modeling, utilizing efficient Artificial Neural Networks (ANNs) to accurately estimate the glitch shape characteristics, propagation delay, and power consumption. Moreover, we propose an iterative workflow that seamlessly integrates our models into standard cell libraries, while exploiting the available accuracy and size trade-off. Experimental results on gates implemented in 7 nm FinFET technology demonstrate that our ANNs exhibit a strong correlation with SPICE (R2 over 0.99). Thus, our approach could enable accurate full-chip glitch analysis and effectively guide glitch reduction techniques.
Research Manuscript


EDA
EDA2: Design Verification and Validation
DescriptionIC3 and its variants are SAT-based model-checking methods that play a critical role in hardware verification. Efficient management of proof obligations, which track states that need to be proven unreachable, is essential for improving verification performance. This paper presents a novel approach that utilizes Critical Proof Obligations (CPOs) to improve proof obligation management. We propose two techniques, CPO-Driven UNSAT Core Generation and CPO-Driven Proof Obligation Propagation, to promote lemma propagation and frame refinement. Experimental results on HWMCC benchmarks demonstrate significant improvements in CPO discovery and lemma propagation, resulting in notable performance gains.
Engineering Poster
Networking


DescriptionModern static linting tools are indispensable for ensuring high-quality RTL designs by identifying syntactic, structural, and coding-style issues. However, these tools often generate an overwhelming number of violations, many of which are false positives that require manual filtering and waiver creation. This time-consuming process not only burdens RTL designers but also introduces risk of human error. In response, we propose a novel Machine Learning (ML) based framework that automatically learns from historically waived violations and applies similar waivers to newly flagged issues. By representing RTL snippets and violations as graph structures, we employ Graph Convolution Networks (GCNs) and similarity-compute (Graph2Vec) techniques to identify patterns that warrant waiver recommendations. Experimental results show high accuracy and recall in predicting new waivers, as well as substantial time savings and productivity gains. This approach significantly reduces the manual effort required to handle static lint outputs and paves the way for more intelligent and scalable verification flows in the semiconductor design process.
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionTransformers have delivered exceptional performance and are widely used across various natural language processing (NLP) tasks, owing to their powerful attention mechanism. However, the high computational complexity and substantial memory usage pose significant challenges to inference efficiency. Numerous quantization and value-level sparsification methods have been proposed to overcome these challenges. Since higher sparsity leads to greater acceleration efficiency, leveraging both value-level and bit-level sparsity (hybrid sparsity) can effectively exploit the acceleration potential of the attention mechanism. However, increased sparsity exacerbates load imbalance across compute units, potentially limiting the extent of acceleration benefits. To fully exploit the acceleration potential of hybrid sparsity, we propose Libra, an attention accelerator developed through algorithm-hardware co-design. At the algorithm level, we design the bit-group-based algorithm consisting of filtered bit-group sparsification (FBS) and dynamic bit-group quantization (DBQ) to maximize the utilization of sparsity in attention. FBS imposes structured sparsity on weights, while DBQ introduces dynamic sparsification during the computation of activations. At the hardware level, we design the task pool to achieve multi-level workload balance, effectively mitigating the load imbalance among computational units induced by hybrid sparsity. Additionally, we develop an adaptive bit-width architecture to support all stages of computation in the proposed bit-group-based algorithm. Our experiments demonstrate that, compared to state-of-the-art (SOTA) attention accelerators, Libra achieves up to 1.49x~5.89x speedup and 2.65x~10.82x enhancement in energy efficiency.
Networking
Work-in-Progress Poster


DescriptionCROSS is a code-based post-quantum digital signature scheme based on a zero-knowledge (ZK) framework. It is a second-round candidate of the National Institute of Standards and Technology's additional call for standardizing post-quantum digital signatures. The memory footprint of this scheme is prohibitively large, especially for small embedded devices. In this work, we propose various techniques to reduce the memory footprint of the key generation, signature generation, and verification by as much as 50%, 52%, and 74%, respectively, on an ARM Cortex-M4 device. Moreover, our memory-optimized implementations adapt the countermeasure against the recently proposed (ASIACRYPT-24) fault attacks against the ZK-based signature schemes.
Research Manuscript


Systems
SYS4: Embedded System Design Tools and Methodologies
DescriptionLinux kernels are being widely deployed in embedded applications, such as increasingly automated vehicles and robots, due to their robust ecosystem. Security modules have been developed to enhance the integrity of Linux kernels, a critical system component. However, these modules consume substantial computational resources, making them unsuitable for embedded domains. We introduce LightRIM, a lightweight method to measure the Linux kernel's integrity during runtime, ideal for resource-limited embedded applications. We focus on major attack types and extract objects for monitoring. Our approach includes a two-stage hashing process and an event-triggered measurement algorithm tied to the security value. To mitigate Time-of-Check-to-Time-of-Use (TOCTOU) attacks, we introduce a heuristic algorithm that maximizes the attack detection rate within a CPU usage constraint and randomizes the measurement intervals. Experimental results indicate that LightRIM incurs less than 0.7% performance overhead while providing extensive attack coverage.
Research Manuscript


Systems
SYS1: Autonomous Systems (Automotive, Robotics, Drones)
DescriptionLiDAR-inertial odometry is widely used in robotics navigation, autonomous driving, and drone operation to provide precise, low-latency motion estimation. Filter-based methods are fast but suffer from significant cumulative errors. Graph optimization methods reduce cumulative errors through loop closure detection but are computationally expensive. In this work, we propose LIO-DPC, a framework that combines the benefits of the filter-based approach and graph-based approach. First, we propose a dynamic pose chain optimization method. It generates an initial pose chain using the fast filter. This is followed by applying computationally efficient local graph optimization to a set of local pose chains to generate refined relative poses, which are then used to update the motion estimation. Second, we propose a loop sparsification approach to select representative loops that are both temporally and spatially proximate, to reduce the computational complexity in graph optimization and minimize loop errors. Extensive experiments demonstrate that LIO-DPC achieves real-time performance and outperforms state-of-the-art methods in accuracy.
Research Manuscript


Systems
SYS1: Autonomous Systems (Automotive, Robotics, Drones)
SYS2: Design of Cyber-Physical Systems and IoT
DescriptionMathWorks Simulink, a commercial CPS development tool chain, is widely used as an industry standard for designing and analyzing system behavior and generating embedded code for deployment. However, bugs in Simulink can cause unexpected behaviors during model compilation, making their elimination critical. Existing methods face two key challenges: generating equivalent models with varied data flows (data flow equivalence) and creating diverse block types to comprehensively test the compiler (mutation diversity). To address these, we propose LION, a differential testing approach. LION ensures data flow equivalence by inserting "store-revert" block pairs between existing blocks and tackles mutation diversity by employing Markov Chain Monte Carlo (MCMC) sampling to generate diverse new blocks. Differential testing is then used to identify bugs. Experiments show LION outperforms state-of-the-art approaches like SLforge, SLEMI, and COMBAT, detecting 6-10 additional compiler bugs in two weeks. Over two months, LION uncovered and reported 16 valid bugs in the widely used stable version of Simulink.
Engineering Special Session


AI
Systems and Software
DescriptionThe revolution brought about by LLM-based AI has disrupted our daily routines and significantly altered our expectations. The rapid pace of innovation in the world of LLMs has continually reshaped our outlook. Each time our applications hit a limitation, a new iteration quickly emerged, reigniting our AI aspirations and raising our expectations once more.
Are we merely chasing the AI dream like the proverbial 'chasing the carrot'? Have we already begun to reap the benefits, are we nearing the fulfillment of its promise, or are we simply engaging in a healthy exercise by pursuing it?
Drawing from real hands-on experience in this industry, where do you envision its future? What are your expectations, and how confident are you that it will meet them?
Our invited speakers will provide updates from the industry, showcasing clear examples of which investments have started to pay off, which were misguided, and how they foresee our world evolving in the near and distant future.
Networking
Work-in-Progress Poster


DescriptionRecent advancements in digital circuit manufacturing have enabled the widespread use of specialized chips, such as Deep Neural Network (DNN) accelerators, in safety-critical applications like autonomous driving and healthcare. Despite these advances, the reliability of these chips remains a concern due to soft errors from fabrication defects and radiation. Traditional soft error analysis approaches involve developing custom fault injection simulators for different data flows and hardware configurations, followed by extensive error injection experiments to inform fault-tolerant design strategies. Even with Electronic Design Automation (EDA) tools, achieving comprehensive coverage across models and hardware components is time- and resource-intensive.
The advent of Large Language Models (LLMs) presents a promising alternative. In this paper, we propose an iterative LLM-based framework to address soft errors in DNN accelerators. The framework integrates fault injection, Algorithm-Based Fault Tolerance (ABFT), and fault-tolerant hardware architecture. Specifically, we introduce LEGA, a heuristic algorithm designed to enhance fault tolerance using LLM insights. The effectiveness of our framework is validated through extensive testing with 143 operators and various DNN workloads on DNN accelerators.
Networking
Work-in-Progress Poster


DescriptionThe increasing complexity and scale of Deep Neural Networks (DNNs) necessitate specialized tensor accelerators, such as Tensor Processing Units (TPUs), to meet various computational and energy efficiency requirements. Nevertheless, designing optimal TPUs remains challenging due to the high level of domain expertise required, considerable manual design time, and lack of high-quality, domain-specific datasets. This paper introduces APTPU-Gen, the first Large Language Model (LLM) based framework designed to automate the approximate TPU generation process, focusing on systolic array architectures. APTPU-Gen is supported with a meticulously curated, comprehensive, and open-source dataset that covers a wide range of spatial array designs and approximate multiply-and-accumulate units, enabling design reuse, adaptation, and customization for different DNN workloads. The proposed framework leverages Retrieval-Augmented Generation (RAG) as an effective solution for a data-scarce hardware domain in building LLMs, addressing the most intriguing issue, hallucinations. APTPU-Gen transforms high-level architectural specifications into optimized low-level implementations through an effective hardware generation pipeline. Our extensive experimental evaluations demonstrate superior performance, power, and area efficiency of the designs generated with minimal deviation from the user's reference values, setting a new benchmark to drive advancements in next-generation design automation tools powered by LLMs.
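The retrieval step named in the abstract follows the usual RAG pattern: embed the user's specification, pull the most similar reference designs from the curated dataset, and prepend them to the generation prompt so the LLM is grounded in real spatial-array examples instead of hallucinating structure. The sketch below is that generic pattern with illustrative names; it is not APTPU-Gen's actual pipeline.

import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=3):
    # Cosine-similarity retrieval over embedded reference TPU designs.
    q = query_vec / np.linalg.norm(query_vec)
    D = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    top = np.argsort(D @ q)[::-1][:k]
    return [docs[i] for i in top]

def build_prompt(spec, retrieved):
    # Ground the LLM with retrieved examples before asking for a systolic-array design.
    context = "\n\n".join(retrieved)
    return ("Reference designs:\n" + context +
            "\n\nGenerate RTL for a systolic-array TPU meeting this spec:\n" + spec)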
Research Special Session


AI
DescriptionThe transformative power of Large Language Models (LLMs) is reshaping the role of AI in post-silicon test engineering. Recent advancements in LLMs showcase their remarkable ability to engage in diverse dialogues, reason about tasks, and generate code, unlocking new possibilities for automation and efficiency. This talk explores our experience in leveraging LLMs to develop an AI agent tailored for post-silicon test engineering. Central to our approach is the adoption of a natural language programming paradigm, enabling the LLM to reason effectively within a specific domain context. At the core of our AI agent lies a novel two-stage grounding process. First, we utilize the in-context learning capabilities of the LLM to interpret tasks, and second, we validate and refine its responses using a pre-defined knowledge graph. This grounding ensures seamless integration of the LLM with existing test engineering infrastructure, empowering the AI agent to autonomously execute tasks within the established framework. Using the Intelligent Engineering Assistant (IEA) as a case study, we demonstrate how LLM-powered domain-specific AI agents can automate key aspects of test engineering. We will share experimental results from multiple product lines, illustrating the feasibility and impact of deploying IEA in an industrial environment. This talk aims to highlight the potential of LLMs to revolutionize post-silicon test engineering by enabling intelligent, context-aware automation.
Research Special Session


AI
DescriptionInnovations in generative artificial intelligence (GenAI), particularly large language models (LLMs), are poised to revolutionize silicon design automation. This paper explores the transformative potential of LLMs in automating and enhancing various tasks within the silicon design process. It reviews the current applications of LLMs and their potential to automate silicon design tasks, proposing applications, providing a qualitative analysis of the readiness of the technology to support these applications and setting directions for future research.
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionLarge Language Models (LLMs) have revolutionized language tasks but pose significant deployment challenges due to their substantial computational demands during inference. The hardware configurations of existing LLM serving systems do not optimize for the different computational and bandwidth needs of the prefill and decoding phases in LLM inference, leading to inefficient resource use and increased costs. In this paper, we systematically investigate promising hardware configurations for LLM inference serving. We develop a simulator that models the performance and cost across different hardware solutions and introduce a customized design space exploration framework to identify optimal setups efficiently. By aligning hardware capabilities with the specific demands of the prefill and decoding phases, we achieve 13% cost savings and over 4x throughput improvements compared to conventional serving system setups.
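The intuition behind phase-specific hardware is easiest to see with a roofline-style back-of-the-envelope model: prefill does on the order of 2*P FLOPs per prompt token against a single pass over the weights (compute-heavy), while decode does 2*P FLOPs per generated token but re-reads the weights and a growing KV cache every step (bandwidth-heavy). The sketch below is such a toy model with made-up parameter names, not the simulator described in the paper.

def phase_time(flops, bytes_moved, peak_flops, peak_bw):
    # A phase is limited either by arithmetic throughput or by memory traffic.
    return max(flops / peak_flops, bytes_moved / peak_bw)

def serve_time(n_params, prompt_len, gen_len, kv_bytes_per_token,
               peak_flops, peak_bw, bytes_per_param=2):
    # Prefill: ~2*P FLOPs per prompt token, weights streamed once.
    prefill = phase_time(2.0 * n_params * prompt_len,
                         n_params * bytes_per_param,
                         peak_flops, peak_bw)
    # Decode: ~2*P FLOPs per token, but weights plus the growing KV cache
    # are re-read for every generated token.
    decode = 0.0
    for t in range(gen_len):
        kv_bytes = (prompt_len + t) * kv_bytes_per_token
        decode += phase_time(2.0 * n_params,
                             n_params * bytes_per_param + kv_bytes,
                             peak_flops, peak_bw)
    return prefill, decode

Running this with typical accelerator numbers shows prefill hugging the compute limit and decode hugging the bandwidth limit, which is why pairing different hardware with each phase can cut cost.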
Research Manuscript


EDA
EDA4: Power Analysis and Optimization
DescriptionStatic IR drop analysis is a fundamental and critical task in the field of chip design. Nevertheless, this process can be quite time-consuming, potentially requiring several hours. Moreover, addressing IR drop violations frequently demands iterative analysis, increasing the computational burden. Therefore, fast and accurate IR drop prediction is vital for reducing the overall time invested in chip design. In this paper, we first propose a novel multimodal approach that efficiently processes SPICE files through a large-scale netlist transformer (LNT). Our key innovation is representing and processing netlist topology as 3D point cloud representations, enabling efficient handling of netlists with hundreds of thousands to millions of nodes. All types of data, including netlist files and image data, are encoded into latent space as features and fed into the model for static voltage drop prediction. This enables the integration of data from multiple modalities for complementary predictions. Experimental results demonstrate that our proposed algorithm achieves the best F1 score and the lowest MAE among the winning teams of the ICCAD 2023 contest and the state-of-the-art algorithms.
Research Manuscript


AI
AI3: AI/ML Architecture Design
Description3D Gaussian Splatting recently emerged as the new SOTA approach for 3D representation and view synthesis. While Gaussian Splatting has demonstrated impressive training capability and rendering quality on desktop GPUs, achieving on-demand training on resource-constrained edge devices is still challenging. In this work, we identify that the training bottleneck stems from several factors, including under-utilized redundant rendering threads and insufficient shared memory. To address these problems, we present Local-GS, a compact 3D Gaussian Splatting training accelerator utilizing order-independent rendering to break the depth-wise data dependency between overlapping Gaussians. We further incorporate a highly parallel pixel intersection unit to reschedule thread workload and improve hardware utilization based on Gaussian locality. A set of compact unified training-rendering cores is also designed to achieve efficient splat-level parallel rendering and gradient propagation. Local-GS is implemented and evaluated in 7nm technology with several real-world scenes, achieving training speed improvements of 26.9-53× across different scenarios compared to a Jetson Xavier NX mobile GPU.
Research Manuscript


Systems
SYS3: Embedded Software
DescriptionIn Verilog code design, identifying and locating functional bugs is an important yet challenging task. Existing automatic bug localization methods have limited capabilities; they only suggest a set of potential buggy lines rather than precisely identifying the bug. Moreover, they depend on verification tools like testbenches and reference models, which require expert input and are time-consuming to develop. This paper introduces LiK (Location is Key), an open-source Large Language Model (LLM) to precisely locate functional bugs in Verilog code without the need for expert-written verification tools. LiK is developed from the open-source coding LLM Deepseek-Coder-Lite-Base-16B through a three-step training process: continuous pre-training to enhance foundational knowledge, supervised fine-tuning to learn how to output localization results, and reinforcement learning to reduce output errors. Experiment results demonstrate that LiK achieves superior functional bug localization accuracy, outperforming both the SOTA traditional method Strider and SOTA closed-source LLMs like GPT-o1-preview and Claude-3.5-Sonnet. Moreover, integrating LiK into the SOTA LLM-based Verilog debugging tool significantly boosts its functional bug fixing success rate from 76.47% to 90.54%. This underscores LiK's potential to enhance the performance of end-to-end automatic Verilog debugging tools.
Engineering Presentation


IP
DescriptionLiberty (.lib) files are the universally accepted format for representing digital circuits early in chip design flows and form the initial building blocks in implementation and sign-off cycles with EDA software. One such application for Liberty files is in the design process for mobile SoCs.
Mobile SoCs present a plethora of unique challenges. First, annual re-designs and tape-outs are necessary with the newest cutting-edge technology nodes. With these advanced nodes, foundry PDKs often evolve rapidly, requiring in-depth verification to identify any discrepancies (or errors) and maintain compliance with initial design requirements.
Second is the wide range of voltage operations required to push the boundaries of performance and power efficiency. This often requires large sets of PVT libraries to meet design closure, which in some cases necessitates full library characterization, a frequently cost- and compute-intensive task.
This study details the various approaches explored to address these SoC design challenges. Advanced Liberty analysis tools are utilized for robust and complete verification of new designs, while an AI-driven Liberty generation tool is used to accelerate the characterization process without the need for full-flow characterization.
Research Manuscript


EDA
EDA2: Design Verification and Validation
DescriptionThe Circuit Satisfiability (CSAT) problem, a variant of the Boolean Satisfiability (SAT) problem, plays a critical role in integrated circuit design and verification. However, existing SAT solvers, optimized for Conjunctive Normal Form (CNF), often struggle with the intrinsic complexity of circuit structures when directly applied to CSAT instances. To address this challenge, we propose a novel preprocessing framework that leverages advanced logic synthesis techniques and a reinforcement learning (RL) agent to optimize CSAT problem instances. The framework introduces a cost-customized Look-Up Table (LUT) mapping strategy that prioritizes solving efficiency, effectively transforming circuits into simplified forms tailored for SAT solvers. Our method achieves significant runtime reductions across diverse industrial-scale CSAT benchmarks, seamlessly integrating with state-of-the-art SAT solvers. Extensive experimental evaluations demonstrate up to a 63% reduction in solving time compared to conventional approaches, highlighting the potential of EDA-driven innovations to advance SAT-solving capabilities.
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
DescriptionDuring technology mapping, complex cells such as adders and multiplexers are often available in the standard cell library, which helps improve the final PPA results. However, technology-independent optimization tends to over-optimize towards the cost metrics measurable with technology-independent representations (e.g., AIG size) by decomposing blocks of logic that could have been mapped into complex cells. Besides, it is also of practical interest to model and preserve some special logic components, such as the enable logic of flops, during optimization. This paper studies "boxing" logic blocks before technology-independent optimization and preserving them during optimization. A wire-based resubstitution is proposed to optimize boxed networks. Experiments with academic and industrial benchmarks show that transparent-boxing, i.e., preserving boxes while utilizing the information on their logic, achieves better results than black-boxing, i.e., completely ignoring the logic in boxes. Specifically, in the technology-independent evaluation, transparent-boxing reduces the mapped network size by 4% more than black-boxing; in the full-flow evaluation, transparent-boxing achieves 1.8% more improvement on timing and similar improvements in area, power, and wire length compared to black-boxing.
Research Manuscript


EDA
EDA1: Design Methodologies for System-on-Chip and 3D/2.5D System-in Package
DescriptionThe parameterizable and synthesizable RISC-V processors enable the automatic generation of customized CPU cores through EDA tools. However, current methods often explore the extensive design space with significant model errors while neglecting design constraints, which are critical for practical implementations. To address these limitations, we propose a Self-Review Bayesian Optimization method (SRBO). This method integrates a teacher-student paradigm within a local Bayesian optimization framework to reduce model errors and enhance exploration efficiency. Additionally, it employs deep ensembles for effective constraint handling. Experimental results demonstrate that our approach outperforms state-of-the-art methods within a limited time budget, significantly enhancing exploration efficiency.
Research Manuscript


Design
DES2A: In-memory and Near-memory Computing Circuits
DescriptionDeep neural networks (DNNs) have been widely applied in our society, yet reducing power consumption due to large-scale matrix computations remains a critical challenge. MADDNESS is a known approach to improving energy efficiency by substituting matrix multiplication with table lookup operations. Previous research has employed large analog computing circuits to convert inputs into LUT addresses, which presents challenges to area-efficiency and computational accuracy. This paper proposes a novel MADDNESS-based all-digital accelerator featuring a self-synchronous pipeline accumulator, resulting in a compact, energy-efficient, and PVT-invariant computation. Post-layout simulation using a commercial 22nm process showed that 2.5x higher energy efficiency (174 TOPS/W) and 5x higher area efficiency (2.01 TOPS/mm2) can be achieved compared to the conventional accelerator.
Networking
Work-in-Progress Poster


DescriptionPolynomial multiplication, a core component of lattice-based cryptography, has demonstrated impressive performance on lattice cryptography chips. However, the costly Number Theoretic Transform (NTT) makes efficient and flexible hardware design extremely challenging, particularly for ASIC/FPGA-based acceleration solutions, which often face high resource consumption and low computational efficiency. In this paper we propose LPA-NTT, which adheres to principles applicable to various post-quantum cryptography (PQC) algorithms and operates under very strict power constraints. Despite these limitations, our approach demonstrates state-of-the-art computing performance for complex kernels like the NTT by eliminating pre-computation operations and adopting a near-memory mapping scheme, thereby reducing computational dimensions and data relocation. Our experimental results show that LPA-NTT outperforms the previous best NTT accelerators by 14.8%~36.7% in terms of resource consumption, while also significantly reducing area and power overhead.
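For readers unfamiliar with the kernel being accelerated, the sketch below shows a textbook O(n²) NTT and inverse over Z_p used to compute a cyclic convolution; the toy modulus and root of unity are illustrative assumptions and bear no relation to the LPA-NTT hardware itself.

def ntt(a, p, w):
    # Naive number theoretic transform over Z_p, with w a primitive n-th root of unity mod p.
    n = len(a)
    return [sum(a[j] * pow(w, i * j, p) for j in range(n)) % p for i in range(n)]

def intt(A, p, w):
    # Inverse transform: evaluate with w^{-1} and scale by n^{-1} mod p (pow(x, -1, p) needs Python 3.8+).
    n = len(A)
    n_inv = pow(n, -1, p)
    return [(x * n_inv) % p for x in ntt(A, p, pow(w, -1, p))]

def cyclic_convolution(a, b, p, w):
    # Transform both operands, multiply pointwise, transform back.
    A, B = ntt(a, p, w), ntt(b, p, w)
    return intt([(x * y) % p for x, y in zip(A, B)], p, w)

# Toy parameters: n = 4, p = 17, and 4 is a primitive 4th root of unity mod 17.
print(cyclic_convolution([1, 2, 3, 4], [1, 0, 0, 0], p=17, w=4))   # -> [1, 2, 3, 4]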
Networking
Work-in-Progress Poster


DescriptionDue to their ability to keep data available yet invisible during processing, privacy-preserving computing techniques have attracted wide attention from both academia and industry. However, performance bottlenecks have been restricting their large-scale application. Although the protocols and algorithms of privacy-preserving computing may vary, underlying fundamental operations such as modular multiplication and modular addition are widely and frequently used. The efficiency of modular multiplication has a significant impact on the overall performance of privacy-preserving computations. Existing implementations of modular multiplication are based on either Montgomery's or Barrett's algorithm. However, both methods require at least three full-word multiplications to compute the modular result, making it difficult to balance throughput and chip area in hardware accelerator design. In this paper, we present a novel architecture for modular multiplication called LUT-MM that seeks an optimal tradeoff between throughput and area. LUT-MM can also achieve better performance if a larger area is acceptable. LUT-MM integrates a full-word multiplier based on Karatsuba's algorithm, along with a 3-stage reduction module based on multiple look-up tables. Experimental results show that the proposed LUT-MM achieves a throughput of 28700 Mbps while consuming 34938 LUTs, 13479 FFs, and 105 BRAMs on the Xilinx Virtex-7 FPGA, demonstrating a superior tradeoff between throughput and area.
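As background for the Montgomery/Barrett comparison, here is a minimal software sketch of Barrett reduction, which trades the division in x mod m for multiplications by a precomputed constant; this is only a reference illustration, not the LUT-MM reduction datapath.

def barrett_setup(m):
    # Precompute k and mu = floor(2^(2k) / m) once per modulus.
    k = m.bit_length()
    return k, (1 << (2 * k)) // m

def barrett_reduce(x, m, k, mu):
    # Valid for 0 <= x < m*m, e.g. the product of two residues.
    q = (x * mu) >> (2 * k)   # approximate quotient; never exceeds x // m
    r = x - q * m
    while r >= m:             # at most a couple of correction subtractions
        r -= m
    return r

m = 0xFFFFFFFB                # example 32-bit modulus
k, mu = barrett_setup(m)
a, b = 0x12345678, 0x9ABCDEF0
assert barrett_reduce(a * b, m, k, mu) == (a * b) % m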
Research Manuscript


EDA
EDA8: Design for Manufacturing and Reliability
DescriptionMoving toward the post-Moore era, full-chip mask optimization (MO) has become a pivotal step for semiconductor designers and manufacturers in extending current resolution enhancement techniques. The majority of recent research efforts have focused on clip-level restoration, employing a divide-and-conquer approach to mitigate the impacts of optical proximity and process bias across entire chips.
Nevertheless, when confronted with industrial full-chip mask optimization challenges, these works exhibit limited correction capabilities, struggle with generalization, and are time-inefficient. In this paper, we propose a novel full-chip mask optimization paradigm based on a large vision model driven by massive lithography data. Central to our approach is a foundation layout feature extractor that captures the mutual influence of polygons through long-range pattern perception as well as the optical-physics and chemical characteristics of lithography. Compared with state-of-the-art (SOTA) works, our work demonstrates significant advantages in terms of resolution fidelity, correction speed, and the ability to handle full-chip-scale layouts.
Engineering Presentation


AI
Back-End Design
DescriptionWe demonstrate the first machine learning assisted solution and workflow for generating reduced-order models of electrical-optical-electrical (EOE) links, aiming to enable standard IBIS-AMI simulation to facilitate signal integrity analysis for optical module involved Serializer/Deserializer (SerDes) system designs.
Engineering Poster
Networking


DescriptionDigital design optimization is a crucial aspect of modern design flows, particularly in electronic design automation (EDA), as designs are becoming increasingly complex. By leveraging ML techniques and tools, digital design optimization can be significantly improved, leading to better power, performance, and area (PPA). However, ML-based automation flows have their own challenges: ML algorithms require large, high-quality datasets to train effectively, can introduce arbitrariness and uncertainty, and may need more iterations to converge. Moreover, these algorithms vary design parameters in the implementation tools, yielding diverse results and forcing trade-offs between power and performance. In this paper we present an ML-accelerated implementation framework that trains models to predict optimal design parameters and then uses ensemble methods to combine models trained from different initializations, improving overall performance and robustness and ultimately leading to more efficient and effective machine learning workflows. Using the Cadence ML tool Cerebrus integrated with implementation tools such as Genus, Innovus, and Tempus, we achieved a 5% power gain, a 20% timing gain, and a 30% improvement in overall design cycle time through a one-time investment in distributed computing, which increased the exploration space by a factor of 10 and boosted PPA and productivity with less manual intervention.
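As a toy illustration of the ensembling idea (not the Cerebrus flow itself), the sketch below trains the same regressor from several random initializations on hypothetical parameter/QoR data and averages the predictions to rank candidate parameter sets; all feature names and models are assumptions.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 4))                        # hypothetical tool-parameter vectors
y = X @ np.array([0.5, -1.0, 2.0, 0.1]) + rng.normal(scale=0.05, size=200)   # stand-in QoR metric

# Same model, different random initializations.
models = [MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=s).fit(X, y)
          for s in range(5)]

candidates = rng.uniform(size=(50, 4))
ensemble_pred = np.mean([m.predict(candidates) for m in models], axis=0)
print(candidates[np.argmax(ensemble_pred)])           # parameter set predicted to give the best QoR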
Engineering Presentation


AI
Back-End Design
Chiplet
DescriptionThis presentation introduces a machine learning (ML) model for rapid and accurate dynamic IR drop estimation in SoC designs. Traditional dynamic IR estimation methods are computationally expensive, with O(N²) runtime complexity, hindering timely design closure with good PPA metrics.
This work proposes an XGBoost regression-based ML model to predict vector-less dynamic IR drop using power and timing features.
Tested on two industrial SoCs in recent process nodes, with over 1.5 million instances, the ML model achieved a 15x speedup. The model maintained accuracy with less than 1 mV mean squared error and a correlation coefficient of ~0.85. An ROC accuracy of ~90% indicates close approximation of predicted vs. actual IR drop.
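Purely for illustration, the sketch below shows how an XGBoost regressor of the kind described above might be trained on per-instance power/timing features to predict dynamic IR drop and evaluated with MSE and correlation; the synthetic features and hyperparameters are assumptions, not the authors' configuration.

import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Hypothetical per-instance features: internal power, switching power, toggle rate, slack, load cap.
X = rng.uniform(size=(10000, 5))
y = 0.8 * X[:, 0] + 0.6 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(scale=0.02, size=10000)  # stand-in IR drop

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.1).fit(X_tr, y_tr)

pred = model.predict(X_te)
print("MSE:", float(np.mean((pred - y_te) ** 2)))
print("correlation:", float(np.corrcoef(pred, y_te)[0, 1]))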
Engineering Poster
Networking


DescriptionThe evolution of new technologies and applications are driving increasing chip complexity with reduced design cycles. For example, the increasing density of artificial intelligence (AI) and central processing unit (CPU) die are fueling an increase in the number of high-speed inputs and outputs (IOs). These IOs require electrostatic discharge (ESD) protection circuits where the ESD capacitance will limit signal speed if not properly compensated. The present design flow to compensate for the ESD capacitance is a very manual, time-consuming, and error-prone process.
Our new methodology automates this workflow with an adaptive metamodel of optimal prognosis (AMOP) optimizer that relies on a high-capacity electromagnetic modeling engine coupled with a circuit simulator to automatically size and place spiral devices that compensate for the ESD capacitance and achieve optimal circuit performance. This reduces design cycle time (from weeks to hours) and manual effort, and increases confidence that an optimal solution has been found. The flow also allows the structure to be revisited at all design milestones to validate early assumptions and ensure the optimal layout has been identified before tapeout. We use a 9.6 Gbps high-bandwidth memory read/write channel requiring the use of T-coils to demonstrate our approach.
Networking
Work-in-Progress Poster


DescriptionThis paper presents a new approach to the problem of allocating multi-bit flip-flops (MBFFs) for low power. Prior approaches have solved the MBFF allocation problem sequentially by addressing two sub-problems: (1) placing individual flip-flops considering the circuit timing and wirelength constraints and (2) clustering the flip-flops to form MBFFs so as to minimize standard-cell area and clock network power consumption under physical proximity constraints. Yet, there is no easy way to reliably predict the solution quality of sub-problem 2 while solving sub-problem 1. In our approach, we place primary importance on minimizing power consumption. Consequently, we minimize power consumption by clustering flip-flops first, using a graph neural network (GNN) based prediction model, and then placing the resulting MBFFs. Experiments with benchmark circuits show that our early consideration of flip-flop clustering is very effective, reducing power consumption by about 21% over the conventional approach while satisfying all timing constraints.
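To make sub-problem (2) concrete, here is a naive greedy proximity-based banking baseline in Python; it only illustrates the physical proximity constraint and is deliberately simpler than the paper's GNN-guided, power-aware clustering.

import numpy as np

def greedy_mbff_banking(coords, bank_size, max_radius):
    # Greedily group nearby single-bit flip-flops into multi-bit banks.
    # coords: (N, 2) flip-flop placements; returns a list of index groups.
    unassigned = list(range(len(coords)))
    banks = []
    while unassigned:
        seed = unassigned.pop(0)
        bank = [seed]
        if unassigned:
            d = np.linalg.norm(coords[unassigned] - coords[seed], axis=1)
            for j in np.argsort(d):
                if len(bank) == bank_size or d[j] > max_radius:
                    break
                bank.append(unassigned[j])
        unassigned = [i for i in unassigned if i not in bank]
        banks.append(bank)
    return banks

ffs = np.random.default_rng(0).uniform(0, 100, size=(40, 2))     # toy placement, in um
banks = greedy_mbff_banking(ffs, bank_size=4, max_radius=15.0)
print(len(banks), "banks with sizes", [len(b) for b in banks])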
Research Manuscript


EDA
EDA9: Design for Test and Silicon Lifecycle Management
DescriptionThe increasing complexity of safety-critical hardware systems demands advanced methods for ensuring functional safety (FuSa). Traditional techniques like ATPG and BIST are intrusive, requiring additional hardware and disrupting operations, making them unsuitable for in-field testing. To address this, for the first time, we propose a machine learning (ML)-driven automated Self-Test Library (STL) generation for seamless in-field testing during idle periods, ensuring uninterrupted fault detection and high system performance. Utilizing reinforcement learning, the STL generates design-specific test patterns, achieving up to 57.57% improvement in fault coverage and up to 85% efficiency compared to existing pattern-based testing, enhancing FuSa in mission-critical applications.
Research Manuscript


EDA
EDA2: Design Verification and Validation
DescriptionLogic synthesis tools are crucial to translate high-level descriptions into optimized gate-level netlists. However, complex optimization operations and operation configurations can cause synthesis faults. To address this, we propose MAGCS, a fault detection method using multi-agent reinforcement learning to dynamically refine optimization sequences. MAGCS consists of three components: a test program selector that applies feature extraction and cosine similarity to curate diverse test programs, an optimization selector using the A2C algorithm to adaptively adjust operations and configurations, and an optimization fault verifier performing equivalence checks to pinpoint optimization-induced faults. Using MAGCS, we identified 32 confirmed faults on Vivado and Yosys, all of which are resolved. MAGCS received recognition from the Vivado community for its significant contributions to tool improvement.
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
DescriptionThe automatic generation of RTL code (e.g., Verilog) from natural language instructions has emerged as a promising direction with the advancement of large language models (LLMs). However, producing RTL code that is both syntactically and functionally correct remains a significant challenge. Existing single-LLM-agent approaches face substantial limitations because they must navigate between various programming languages and handle intricate generation, verification, and modification tasks. To address these challenges, this paper introduces MAGE, the first open-source multi-agent AI system designed for robust and accurate Verilog RTL code generation. We propose a novel high-temperature RTL candidate sampling and debugging system that effectively explores the space of code candidates and significantly improves their quality. Furthermore, we design a novel Verilog-state checkpoint checking mechanism that enables early detection of functional errors and delivers precise feedback for targeted fixes, significantly enhancing the functional correctness of the generated RTL code. MAGE achieves a 95.7% rate of syntactically and functionally correct code generation on the VerilogEval-Human v2 benchmark, surpassing the state-of-the-art Claude-3.5-Sonnet by 23.3%, demonstrating a robust and reliable approach for AI-driven RTL design workflows. MAGE is open-sourced at https://anonymous.4open.science/r/MAGE-A-Multi-Agent-Engine-for-Automated-RTL-Code-Generation-25D1.
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionState-space models (SSMs), such as Mamba, have emerged as a promising alternative to Transformers. However, the recently developed Mamba2, based on state space duality (SSD), is highly memory-bound and suffers from limited computation efficiency. This inefficiency arises from its irregular broadcast element-wise multiplications and structured sparse computations. In this work, we propose MambaOPU, an FPGA overlay processor, to accelerate SSD. First, to reduce memory overhead, we introduce a software-hardware co-optimized operator fusion framework. Specifically, operator merging combines adjacent broadcast multiplication and summation operations into a single descriptor, while operator backward shifting embeds segment multiplication into subsequent operations. Both techniques shorten the computation path and improve computation efficiency. Second, to enhance sparse computation efficiency, we skip zero-region computations using a tensor-reorder-and-group algorithm combined with a sparse-predefined data fetcher. Additionally, since Mamba integrates linear operations with SSD, we develop a reconfigurable systolic array to improve data reuse across different computation modes. Extensive experimental results demonstrate that MambaOPU achieves up to 1812× and 880.79× higher normalized throughput and up to 12908× and 24.27× higher energy efficiency over an Intel Xeon Gold 6348 CPU and an NVIDIA A100 GPU, respectively.
Research Manuscript
MARIO: A Superadditive Multi-Algorithm Interworking Optimization Framework for Analog Circuit Sizing
2:45pm - 3:00pm PDT Wednesday, June 25 3003, Level 3

EDA
EDA6: Analog CAD, Simulation, Verification and Test
DescriptionNumeric optimization methods are widely utilized to tackle complex analog circuit sizing problems, where the challenges include expensive simulations, non-linearity, and high parameter dimensionality.
However, the diverse characteristics exhibited by different circuits result in varied optimization landscapes, making it difficult to identify a single algorithm that consistently outperforms others across all problems. In this paper, we introduce a multi-algorithm interworking optimization framework, which achieves optimization superadditivity based on a pool of member algorithms and a powerful algorithm-interworking protocol. We propose a computing-resource reallocation method, which employs multi-task Gaussian process regression and portfolio optimization techniques, leading to flexible and prudent online adaptation of member algorithms. To efficiently utilize the computing resources for local exploitation, an evaluation data broadcast strategy enables cooperation across member algorithms. Besides, algorithms with different modeling overheads are integrated time-adaptively via an asynchronous parallelization mechanism. Comparative experiments against state-of-the-art algorithm-combining tools and optimization algorithms demonstrate the superiority of the proposed optimization framework.
Research Manuscript


Systems
SYS1: Autonomous Systems (Automotive, Robotics, Drones)
DescriptionThe rapid advancement of visual autonomous systems, especially in autonomous driving, underscores the critical role of Image Signal Processors (ISPs) as they convert RAW sensor data into RGB images suited for visual interpretation.
Traditional ISPs rely on tuning hyperparameters to adapt to varying imaging conditions; however, the vast parameter space and intricate tuning process pose significant challenges for real-time autonomous applications. Existing autonomous ISP hyperparameter optimization methods rely largely on offline or proxy-based online tuning, limiting their accuracy and responsiveness to real-time environmental changes. In response, we propose an online ISP hyperparameter optimization framework based on Deep Reinforcement Learning (DRL), marking the first proxy-free, real-time optimization approach. Our design features a master-slave Multi-Agent System (MAS), enabling rapid and cooperative parameter optimization with improved inter-frame consistency. Furthermore, we design the MAS-ISP automated visual system, incorporating innovative hardware designs such as a Strip Conv Kernel and a Stride-Aware Dual-Buffer Memory, which drastically reduce resource consumption in the CNN hardware. MAS-ISP achieves 1080P@75FPS/240FPS on FPGA/ASIC platforms, supporting real-time and reliable visual systems.
Exhibitor Forum


DescriptionThis session dives into advanced data management strategies using Keysight's Design Data Management (SOS), backed by real-world case studies. We'll showcase how Keysight SOS delivers unmatched flexibility in managing any kind of design or engineering data—structured or unstructured—across today's most demanding, multi-site environments.
Learn how Keysight SOS unifies fragmented toolchains by integrating with third-party SCM systems like Git, enabling seamless hybrid workflows while maintaining a single source of truth. We'll demonstrate how to harness Keysight SOS triggers to automate workflows, enforce governance, and accelerate development cycles. The presentation also highlights performance-driven features such as links-to-cache, sparse populate, and reference-based reuse—powerful tools that dramatically reduce storage overhead, boost speed, and scale collaboration without compromise.
Whether you're managing IP, source code, test data, or complex hardware designs, this session will show how Keysight SOS transforms design data management into a streamlined, automated, and scalable operation—built for the realities of modern engineering.
Research Special Session


Design
DescriptionExponential progress in Artificial Intelligence (AI) owes much to remarkable advancements in semiconductor technology. The semiconductor roadmap is guided by the PPACt metrics: low Power, high Performance, reduced Area, low Cost, and rapid time to market. Typically, it involves years or even decades of meticulous refinement of semiconductor innovations traversing from Concept & Feasibility (CnF) stage to High Volume Manufacturing (HVM). This multifaceted endeavor is a sequence of four key steps - Materials discovery, Process optimization, Device engineering and Chip design. In this tutorial, we will explore transformative shifts in semiconductor industry driven by cutting edge AI/Machine Learning (ML) methodologies which can accelerate PPACt.
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionExisting pruning methods for Spiking Neural Networks (SNNs) focus on a single form of sparsity, overlooking the importance of joint pruning, which is critical for minimizing synaptic operations (SOPs) and enhancing energy efficiency. This paper presents a novel dynamic joint pruning framework that leverages both spatiotemporal spike sparsity and weight sparsity to minimize SOPs in SNN inference. The proposed framework integrates a multi-stage masking mechanism for fine-grained neuron firing threshold control, a Temporal Attention Batch Normalization (TABN) module with learnable time scaling factors, and a dynamic sparse strategy that adjusts importance coefficients based on real-time computational impact. Experimental results demonstrate that our method achieves maximum SOP reduction with minimal accuracy loss, establishing a new state-of-the-art in energy-efficient SNN pruning.
Engineering Poster


DescriptionWe present a new approach to IR drop analysis using Sigma-AV that addresses the shortcomings of traditional dynamic IR sign-off methods for mobile SoCs. Traditional methods suffer from limited coverage, high compute requirements, difficulties in outlier handling, and the complexities of advanced technology nodes. These limitations can lead to over-designed power grids, impacting power, performance, and area (PPA).
Sigma-AV overcomes these limitations by analyzing self-drop, aggressor drop, and regional drop to achieve comprehensive IR drop coverage. It reports self-drop and aggressor drop for every instance, ensuring high confidence in local switching-noise coverage. The self-drop metric helps uncover power-grid issues, and the aggressor-drop metric enhances the dynamic IR profile. Sigma-AV enables power grid optimization and cell profiling without requiring a complete PNR cycle. Design IR profiles can be enhanced via selective swaps using an ECO-tool-based flow guided by Sigma-AV's aggressor list. Sigma-AV also identifies issues such as ICG/AON buffer clustering and multi-bit cell IR drops, which traditional methods often miss. Compared to vectorless dynamic IR analysis, Sigma-AV shows over 40% runtime improvement and a 3-4x faster voltage impact view.
Sigma-AV enables faster, efficient IR analysis, improving PPA and reducing design cycle times.
Engineering Presentation


AI
Back-End Design
DescriptionModern Multi-Processor Systems on Chip (MPSoCs) comprise multiple clock domains involving synchronous and asynchronous data paths. Glitches in asynchronous clock domain crossings are very well known and there are state-of-the-art industry signoff tools to cover them.
However, synchronous logic is also susceptible to glitches due to Multi-Cycle Paths (MCPs) in the design. MCP verification and detection of invalid or incorrectly implemented MCPs is not feasible in gate-level simulations. Industry-available tools target MCP verification at the RTL level, and netlist restructuring limits their applicability; running these tools on the netlist causes an exponential explosion of reported violations, making it infeasible to complete within tight SoC design timelines. Incorrectly implemented or invalid MCPs can cause unintended design behavior, or the device might end up stuck, hung, or even dead. The methodology presented targets the issues caused by glitches in MCPs. The flow makes use of an STA shell to check the correctness of MCPs. It can be deployed to identify invalid and incorrectly implemented MCPs at the gate-level netlist. Additionally, the flow can be used to functionally verify the accuracy of MCPs. The flow is scalable and has been tested across SoCs of various sizes.
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionLarge language models (LLMs) have gained significant attention recently. However, executing LLM is memory-bound due to the extensive memory accesses. Process-in-memory (PIM) emerges as an energy-efficient solution for LLMs, delivering high memory bandwidth and compute parallelism. Nevertheless, the trend towards larger LLMs introduces escalating memory footprint challenges for monolithic PIM chips. This paper proposes McPAL, which tackles this challenge by emphasizing unstructured sparse compute within PIM and hierarchical multi-chiplet scaling. McPAL decomposes arbitrary sparse weight matrix into multiple irregular sparse vectors. The non-skipped computations in each vector are then routed via an in-memory butterfly network to the standard PIM array, enhancing the PIM utilization. In addition, we scale McPAL vertically by strategically organizing the 3D-HBM hierarchy to minimize the internal long-distance data travel. Meanwhile, a 2.5D IO chiplet scales McPAL horizontally, reducing die-to-die (D2D) data transfer and ensuring sparse workload balance. We conducted extensive experiments from Llama-7B to Llama-70B. The results show that McPAL achieves 1.57× to 3.12× speedup and 10.43× to 35.66× energy efficiency over Nvidia A100 GPU. Compared to SOTAs, McPAL also achieves 1.08× to 2.15× speedup and 1.65× to 5.14× energy efficiency.
Research Manuscript


EDA
EDA6: Analog CAD, Simulation, Verification and Test
DescriptionThis paper focuses on optimizing the Matrix-Power Kernel (MPK), which relies on a series of Sparse Matrix-Vector multiplications (SpMVs) using the same sparse matrix. MPK is a crucial component of Krylov subspace methods for solving large sparse linear systems in various fields, including circuit simulations. MPK offers a potential for matrix reuse in cache, which can accelerate memory-bound sparse solvers. Additionally, many sparse matrices encountered in applications are symmetric, allowing us to reduce the memory footprint for SpMVs by half. However, reusing the matrix introduces data dependencies between subsequent SpMVs, and symmetric SpMVs can result in data conflicts during shared-memory parallelization. Previous research has often focused on either matrix reuse or symmetry, failing to leverage both aspects effectively.
This paper proposes a unified, memory-efficient approach called Me-MPK that takes advantage of both cache reuse and matrix symmetry for MPK on shared-memory multi-core systems. We first introduce a unified dependency graph for a sparse matrix, which represents all potential data dependencies and conflicts. Next, we perform architecture-aware recursive partitioning on this graph to create subgraphs and formulate a separating subgraph that decouples all dependencies and conflicts among the subgraphs. These independent subgraphs are then scheduled for parallel execution of SpMV or symmetric SpMV in a specified order to optimize cache reuse. We apply Me-MPK in two s-Step Krylov subspace solvers, and our evaluations show that Me-MPK significantly outperforms the current state-of-the-art solutions, delivering an average speedup of up to 2.00X and 1.86X on X86 and ARM CPUs, respectively. As a result, we achieve overall speedup in the sparse solvers of up to 1.65X and 1.58X.
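To make the kernel concrete, below is a minimal CSR-based matrix-power kernel that computes [x, Ax, ..., A^s x] as a chain of SpMVs, the dependency chain Me-MPK reorders for cache reuse; this plain-Python reference has none of the paper's partitioning or symmetry optimizations.

import numpy as np
from scipy.sparse import random as sparse_random

def spmv_csr(indptr, indices, data, x):
    # One CSR sparse matrix-vector product.
    y = np.zeros(len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y

def matrix_power_kernel(A, x, s):
    # Each SpMV consumes the previous result, so y_k depends on y_{k-1}.
    vecs = [x]
    for _ in range(s):
        vecs.append(spmv_csr(A.indptr, A.indices, A.data, vecs[-1]))
    return vecs            # [x, Ax, A^2 x, ..., A^s x]

A = sparse_random(500, 500, density=0.01, format="csr", random_state=0)
x = np.ones(500)
basis = matrix_power_kernel(A, x, s=4)
ref = x.copy()
for _ in range(4):
    ref = A @ ref          # reference chain of library SpMVs
print(np.allclose(basis[4], ref))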
Research Manuscript


Design
DES6: Quantum Computing
DescriptionMeasurement-based uncomputation (MBU) is a technique used to perform probabilistic uncomputation of quantum circuits. We formalize this technique for the case of single-qubit registers, and we show applications to modular arithmetic. Using MBU, we reduce Toffoli count and depth by 10% to 15% for modular adders based on the architecture of [VBE96], and by almost 25% for modular adders based on the architecture of [Beau02]. Our results have the potential to improve other circuits for modular arithmetic, such as modular multiplication and modular exponentiation, and can find applications in quantum cryptanalysis.
Research Manuscript


Systems
SYS2: Design of Cyber-Physical Systems and IoT
DescriptionHeterogeneous parallel error detection is an approach to achieving fault-tolerant processors, leveraging multiple power-efficient cores to re-execute software originally run on a high-performance core.
Yet, its complex components, gathering data cross-chip from many parts of the core, raise questions of how to build it into commodity cores without heavy design invasion and extensive re-engineering.
We build the first full-RTL design, MEEK, into an open-source SoC, from microarchitecture and ISA to the OS and programming model.
We identify and solve bottlenecks and bugs overlooked in previous work, and demonstrate that MEEK offers microsecond-level detection capacity with affordable overheads.
By trading off architectural functionalities across codesigned hardware-software layers, MEEK features only light changes to a mature out-of-order superscalar core, simple coordinating software layers, and a few lines of operating-system code.
Research Manuscript


Systems
SYS6: Time-Critical and Fault-Tolerant System Design
DescriptionTime-Sensitive Networking (TSN) provides bounded latency and low jitter for cyber-physical systems, such as industrial control. As a key component of TSN, the Time-Aware Shaper (TAS) applies gate control rules to control the transmission time of frames in critical flows. TAS stores the gate control rules for each frame in the gate control table. However, in typical industrial setups, the memory usage of the table could reach over tens of megabits and even exceed the total memory capacity of TSN switches.
To address this issue, we propose a memory-efficient TAS design named METAS. It transitions from a per-frame to a per-flow approach. METAS stores one persistent rule for a flow and dynamically generates a temporary rule for a frame only when the frame arrives. We prototyped METAS on an FPGA, and experimental results show that METAS reduces memory usage from 14.34 Mbits to 288 Kbits when supporting 1,024 flows, using just 1.56% of the FPGA's logic resources while maintaining microsecond-level latency and nanosecond-level jitter for critical flows.
Networking
Work-in-Progress Poster


DescriptionIn-memory search has emerged as a promising solution for efficient discovery of nearest-neighbor vectors in general-purpose vector databases. However, the templated storage-in-array structure and VMM-based computational form of in-memory search pose challenges in supporting generic distance computations. In this work, for the first time, we introduce a novel memristive in-memory similarity measure engine, MemSearch, for configurable distance calculations, including dot distance, Euclidean distance (ED), and cosine distance (CD). MemSearch highlights two aspects: data storage and distance computing. For data storage, we propose a Unified Similarity Element Mapping (USEM) scheme based on a pair array to accommodate various similarity calculations. For distance computing, we introduce a Reconfigurable Current Computing (RCC) circuit designed to process the multiple arithmetic rules in similarity calculations, with slight increases of 4.4% and 9.9% in energy consumption for ED and CD, respectively. We have tested various datasets of different modalities, including images, voice, human activity, and text. Experimental results demonstrate that the MemSearch engine achieves improvements of 864×, 802×, and 1474× in energy efficiency over CMOS-based engines for dot distance, ED, and CD calculations, respectively. The MemSearch engine highlights its potential for future highly efficient general-purpose in-memory vector databases.
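As a plain-software reference for the three distance modes the engine reconfigures between, here is a small numpy sketch of dot distance, ED, and CD for nearest-neighbor search; it illustrates the arithmetic only, not the memristive RCC circuit or USEM mapping.

import numpy as np

def dot_distance(q, x):
    return float(np.dot(q, x))                 # larger means more similar

def euclidean_distance(q, x):
    return float(np.linalg.norm(q - x))        # smaller means more similar

def cosine_distance(q, x):
    cos_sim = np.dot(q, x) / (np.linalg.norm(q) * np.linalg.norm(x) + 1e-12)
    return float(1.0 - cos_sim)                # smaller means more similar

def nearest_neighbor(query, database, metric):
    scores = [metric(query, row) for row in database]
    # Dot similarity is maximized; the two distances are minimized.
    return int(np.argmax(scores)) if metric is dot_distance else int(np.argmin(scores))

db = np.random.default_rng(0).normal(size=(1000, 64))
print(nearest_neighbor(db[123] + 0.01, db, euclidean_distance))   # expected: 123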
Research Manuscript


Systems
SYS6: Time-Critical and Fault-Tolerant System Design
DescriptionIn high-performance ultra-scale cloud computing, heterogeneous clusters consisting of x86 and ARM architecture platforms have become increasingly common to boost performance and energy efficiency. Ensuring high availability in these environments is crucial for meeting service-level agreements. However, DRAM failures, a primary cause of server downtimes, present significant challenges to reliability, availability, and serviceability. This paper provides an in-depth analysis of memory failure characteristics across cross-architecture platforms in large-scale heterogeneous clusters. We introduce MemSeer, an AIOps-integrated tool that utilizes a multi-grained memory failure prediction approach for x86/ARM heterogeneous clusters. MemSeer improves the F1-score by 17.3% and increases recall by an average of 27% across different lead times compared to state-of-the-art methods. These advancements show great promise in reducing memory failures in cluster environments, decreasing VM interruptions by up to 42.7% and averaging 24.2% in real-world implementations.
Research Manuscript


EDA
EDA6: Analog CAD, Simulation, Verification and Test
DescriptionAdjoint sensitivity analysis is an exceptionally efficient method for computing the gradient of an objective function with respect to given parameters, playing a crucial role in modern circuit design and verification. According to the principles of the adjoint method, it is necessary to store all essential system state information, such as state vectors and Jacobian matrices, at each time step during the forward integration process in order to construct the adjoint equations during the backward integration. Therefore, the memory overhead of the adjoint method is proportional to the system size and the number of time steps, resulting in prohibitive memory costs for solving large-scale dynamic systems.
In this paper, we propose a novel, memory-efficient adjoint sensitivity analysis method that significantly reduces the memory overhead of storing system state information by employing error-bounded lossy compression techniques. Our compression algorithm effectively utilizes the spatiotemporal characteristics of data in circuit simulations and incorporates stringent error control mechanisms. This approach achieves a two-order-of-magnitude reduction in memory overhead during simulation while ensuring that the accuracy of the adjoint solution remains unaffected.
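A minimal picture of error-bounded lossy compression of checkpointed state vectors (a generic uniform scalar quantizer, not the paper's compressor): every reconstructed entry is guaranteed to lie within a user-chosen absolute error bound of the original.

import numpy as np

def compress(state, err_bound):
    # Uniform quantization with step 2*err_bound guarantees |recon - orig| <= err_bound.
    step = 2.0 * err_bound
    return np.round(state / step).astype(np.int32), step

def decompress(codes, step):
    return codes.astype(np.float64) * step

state = np.random.default_rng(0).normal(size=100_000)   # stand-in for a stored state vector
codes, step = compress(state, err_bound=1e-4)
recon = decompress(codes, step)
print(np.max(np.abs(recon - state)) <= 1e-4)            # True: error bound honored
print(state.nbytes / codes.nbytes)                      # 2x from int32 storage alone, before entropy coding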
Networking
Work-in-Progress Poster


DescriptionRecovering underlying governing equations from data, i.e., model recovery (MR), which is crucial for runtime monitoring, is one of the key solutions for assuring safe and explainable operation of mission-critical autonomous systems (MCAS). MCAS often operate under strict constraints on time, computational resources, and power, potentially requiring edge artificial intelligence (edge-AI) for accelerated safety monitoring. Field Programmable Gate Arrays (FPGAs) have emerged as an ideal solution to meet these constraints due to their reconfigurability and capacity for hardware-optimized performance. MR approaches such as EMILY or physics-informed neural networks with sparse regression (PINN-SR) use continuous-depth residual networks and continuous-time latent variable models, e.g., Neural ODEs (NODEs), as core components. A key challenge in accelerating these components comes from the iterative integration of ordinary differential equations (ODEs) in the forward pass. This paper introduces a novel FPGA-based accelerated model recovery in dynamic architecture (MERINDA) approach that utilizes neural architectures equivalent to the NODE core but amenable to acceleration. MERINDA has four components: (a) a Gated Recurrent Unit (GRU) layer that generates an approximate discretized solution of the original NODE layer in EMILY or PINN-SR, (b) a dense layer that solves the inverse ODE problem to obtain approximate ODE model coefficients from the discretized solutions, (c) a sparsity-driven dropout layer to reduce model order, and (d) a standard ODE solver that solves the reduced-order ODE and regenerates the continuous-time output of the original NODE layer. Components (a) and (b) can be accelerated using an FPGA, while (c) and (d) have much lower computational complexity than the original NODE layer. We first theoretically prove that MERINDA is equivalent to EMILY and empirically compare their MR performance on four benchmark examples. We then assess the accuracy, processing speed, energy consumption, and DRAM accesses of MERINDA and compare them with baseline accelerated machine learning approaches, namely non-physics-informed machine learning (ML) and learning with only physics-guided loss functions (ML-PG). We also compare MERINDA with GPU-based implementations of all three approaches to evaluate the advantage of acceleration. Finally, we apply mixed-integer programming to identify the optimal approach to runtime monitoring and its hyper-parameters under various resource constraints. Our results demonstrate substantial improvements in energy efficiency, training time, and DRAM accesses with minimal compromises in accuracy, underscoring the viability of MERINDA for resource-constrained autonomous systems.
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionCross-workload design space exploration (DSE) is crucial in CPU architecture design. Existing DSE methods typically employ the transfer learning technique to leverage knowledge from source workloads, aiming to minimize the requirement of target workload evaluation. However, these methods struggle with overfitting, data ambiguity, and workload dissimilarity.
To address these challenges, we reframe the cross-workload CPU DSE task as a few-shot meta-learning problem and further introduce MetaDSE.
By leveraging model agnostic meta-learning, MetaDSE swiftly adapts to new target workloads, greatly enhancing the efficiency of cross-workload CPU DSE. Additionally, MetaDSE introduces a novel knowledge transfer method called the workload-adaptive architectural mask algorithm, which uncovers the inherent properties of the architecture.
Experiments on SPEC CPU 2017 demonstrate that MetaDSE significantly reduces prediction error by 44.3% compared to the state-of-the-art.
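To make the meta-learning formulation concrete, here is a generic MAML-style sketch in PyTorch for a performance-prediction regressor: the inner loop adapts to a few samples from one workload, and the outer loop updates the shared initialization so that adaptation generalizes; the model, synthetic workloads, and hyperparameters are placeholders, not MetaDSE's architecture.

import torch
import torch.nn as nn
from torch.func import functional_call       # assumes PyTorch >= 2.0

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))   # placeholder predictor
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
inner_lr, inner_steps = 1e-2, 3

def sample_workload():
    # Stand-in for one workload: a few evaluated design points (support) plus held-out ones (query).
    w = torch.randn(16, 1)
    X_s, X_q = torch.randn(8, 16), torch.randn(8, 16)
    return (X_s, X_s @ w), (X_q, X_q @ w)

for _ in range(1000):
    (X_s, y_s), (X_q, y_q) = sample_workload()
    fast = dict(model.named_parameters())
    # Inner loop: a few gradient steps adapt the shared initialization to this workload.
    for _ in range(inner_steps):
        loss = loss_fn(functional_call(model, fast, (X_s,)), y_s)
        grads = torch.autograd.grad(loss, list(fast.values()), create_graph=True)
        fast = {k: v - inner_lr * g for (k, v), g in zip(fast.items(), grads)}
    # Outer loop: update the initialization so adaptation performs well on the query set.
    meta_opt.zero_grad()
    loss_fn(functional_call(model, fast, (X_q,)), y_q).backward()
    meta_opt.step()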
Networking
Work-in-Progress Poster


DescriptionDetecting Hardware Trojans (HTs) at run-time presents significant challenges due to the increasing complexity of modern integrated circuits and the dynamic nature of streaming data in IoT-connected systems. Current detection methods often rely on specific benchmarks or focus on limited, predefined Trojan signatures, making it difficult to adapt to new, zero-day (unknown) HTs. Additionally, traditional machine learning-based methods struggle to cope with the variability of side-channel data sources and run-time constraints. In response, we explore the potential of meta-reinforcement learning (meta-RL) as a promising solution. We propose MetaGuard, an effective two-step meta-RL framework for adaptive, run-time hardware Trojan detection. In the first step, we leverage meta-learning to incrementally learn from new, unknown data, effectively modeling reinforcement learning environments as multi-armed bandits. In the second step, a Thompson Sampling agent is incorporated to handle the multi-task environment by utilizing priors from recent relative working memory to compute Bayesian posterior distributions. This allows for optimized decision-making across multiple benchmarks, overcoming the limitations of approaches that focus on a single benchmark. MetaGuard is designed to monitor and detect Trojans across uncertain and evolving benchmark variants on run-time streaming data from IoT systems. Experimental results demonstrate that MetaGuard improves detection performance by 13% in F1-score compared to traditional methods, providing a robust and adaptive solution for run-time zero-day HT detection.
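The bandit component can be pictured with a textbook Thompson Sampling loop over Bernoulli-reward arms with Beta posteriors (a generic illustration, not MetaGuard's working-memory priors or its meta-RL front end):

import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.2, 0.5, 0.8]             # hidden detection success rate of each hypothetical arm
alpha = np.ones(len(true_rates))          # Beta posterior: successes + 1
beta = np.ones(len(true_rates))           # Beta posterior: failures + 1

for _ in range(2000):
    samples = rng.beta(alpha, beta)       # draw one plausible rate per arm from its posterior
    arm = int(np.argmax(samples))         # play the arm that looks best under this draw
    reward = rng.random() < true_rates[arm]
    alpha[arm] += reward                  # Bayesian update of the chosen arm only
    beta[arm] += 1 - reward

print("posterior means:", (alpha / (alpha + beta)).round(2))   # concentrates on the best arm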
Engineering Poster
Networking


DescriptionIn today's world, the benefits of cloud computing are well-established. However, there remains significant hesitation within the semiconductor industry regarding the migration of data to the cloud. Why is this the case?
Out of the many possible reasons, we delve into two main technical reasons for this hesitation and provide a robust solution to overcome them. The two technical reasons are user downtime during data synchronization to the cloud and data integrity after cloud migration. We compare and contrast traditional data copy methods with modern database/repository replication methods and show, through concrete calculated numbers, why the latter is superior and how it can alleviate concerns about migrating data to the cloud from a technical standpoint in the semiconductor industry.
Engineering Poster
Networking


DescriptionWhen moving to new chip designs or new clocking schemas, constraint creation is tedious, time-consuming, and prone to human error. Typically, designers will reuse previous design constraints and manually modify them when timing results do not provide the correct output. This can be an iterative process.
A smart tool is needed that supports the designer through the complex and error-prone task of creating translated constraints. We propose a novel tool that automatically creates timing constraints based upon clock schemas and design topology. We show how our novel tool uses techniques for constraint transformation in the presence of dual edge clocking and improves the accuracy and efficiency of STA. We demonstrate the effectiveness of our techniques through the implementation of constraint transformation from single edge clocking to dual edge clocking.
Jack DiLullo: EDA, IBM USA; Eric Foreman: EDA, IBM USA; Manish Verma: EDA, IBM India
Engineering Poster
Networking


DescriptionAs the demand for high-performance servers continues to grow, the use of dual edge clocking and high-performance latches has become increasingly prevalent. However, the presence of these components poses significant challenges for static timing analysis (STA), which is a critical step in the design of digital circuits. In this paper, we present novel techniques for STA in the presence of dual edge clocking and high-performance latches, enabling accurate analysis of both pulsed and transparent latches. Our approach leverages the unique properties of these components to improve the accuracy of STA by reducing pessimism and to improve its efficiency, allowing for more reliable and high-performance designs. We demonstrate the effectiveness of our techniques through a series of case studies and compare them to existing methods, showing significant improvements in accuracy and performance.
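To make the dual-edge timing arithmetic concrete, the sketch below performs a toy half-cycle setup check (launch on the rising edge, capture on the falling edge); the delay values and the pessimism-credit term are illustrative and do not reflect the paper's actual analysis engine.

```python
# Toy setup check for a dual-edge (half-cycle) path: launch on the rising edge,
# capture on the falling edge of the same clock. All numbers are illustrative.
def setup_slack_half_cycle(period_ns, clk_to_q, comb_delay, setup_time,
                           skew=0.0, pessimism_credit=0.0):
    """Available time is T/2 (plus skew); pessimism_credit models removing
    overly conservative margin, as the paper's techniques aim to do."""
    required = period_ns / 2 + skew - setup_time + pessimism_credit
    arrival = clk_to_q + comb_delay
    return required - arrival

# 1 GHz clock, rising-to-falling path; slack reported in ns
print(setup_slack_half_cycle(period_ns=1.0, clk_to_q=0.08,
                             comb_delay=0.33, setup_time=0.05))
```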
Engineering Poster
Networking


DescriptionProblem Statement:
During synthesis, hundreds of thousands of registers are reported as optimized away, but there is no way to know the exact reason for the optimization beyond its type, such as constant optimization (C0r/C1r), unloaded (ULR), or merged. There can be multiple root causes: direct RTL constants, the toggling behavior of the driving logic, an optimized register that in turn indirectly optimizes hundreds of registers in its path, unloaded registers whose outputs are blocked or unused, and registers whose input sources are constant or whose outputs are blocked or hanging. It can take several weeks of back and forth between RTL engineers and synthesis engineers to identify and validate the root causes and to fix, constrain, or waive them before proceeding to implementation. This hurts productivity and efficiency and can stretch design cycles.
Generation and Debug:
The proposed solution is to shift-left this process. VC SpyGlass IDC (Implementation Design Checks) uses the lightweight synthesis engine of Fusion Compiler to determine which registers are likely to be optimized away, along with root-cause-analysis (RCA) debug information. This lets RTL designers quickly validate both unintended and intended register optimizations, fix the RTL using the Lint Advisor fix flow, generate constraints to preserve design elements from constant propagation, and generate a waiver database for the intended optimized registers that can be migrated and used for signoff analysis.
Applications:
1. Improve power correlation: The signed-off optimized-registers database generated by VC SpyGlass IDC can feed RTL power analysis tools to improve power correlation between RTL and gate level.
2. Improve coverage metrics: The same database can be used with dynamic verification tools to improve coverage metrics.
3. Reduce signoff effort: The generated Waiver Data Format (WDF) database can be used with the Fusion Compiler flow to avoid duplicating the effort of validating issues that have already been validated and signed off at the RTL level.
4. Improve PPA: The generated constraints database can be used with Fusion Compiler to obtain accurate PPA results.
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionDiffusion models have demonstrated superior performance in image generation tasks, thus becoming the mainstream model for generative visual tasks. Diffusion models need to execute multiple timesteps sequentially, resulting in a dramatic increase in workload. Existing accelerators leverage the data similarity between adjacent timesteps and perform mixed-precision differential quantization to accelerate diffusion models. However, merging differential values with raw inputs in each layer of each timestep to ensure computational correctness requires significant memory access for loading raw inputs, which creates a heavy memory burden. Moreover, mixed-precision computations may lead to low hardware utilization if not well designed. Unlike these works, we propose MHDiff, a tailored framework that identifies the focal pixels at the first layer and fine-tunes them to fit all layers, then represents focal pixels with high precision while using low precision for others, thereby accelerating diffusion models while minimizing memory burden. To improve hardware utilization, MHDiff employs a packing module that merges low-precision values into high-precision values to create full high-precision matrices and designs a processing element (PE) array to efficiently process the packed matrices. Extensive experimental results demonstrate that MHDiff can achieve satisfactory performance with negligible quality loss.
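As a toy illustration of the packing idea, the snippet below merges four unsigned 4-bit values into a single 16-bit word and unpacks them again; the bit layout and signedness are assumptions, not MHDiff's actual packing module.

```python
# Toy packing of four unsigned 4-bit values into one 16-bit word, in the spirit of
# MHDiff's packing module (the actual bit layout and signedness are assumptions).
import numpy as np

def pack_int4(vals):
    """vals: array of shape (..., 4) with entries in [0, 15] -> uint16 words."""
    v = np.asarray(vals, dtype=np.uint16)
    return (v[..., 0] | (v[..., 1] << 4) | (v[..., 2] << 8) | (v[..., 3] << 12)).astype(np.uint16)

def unpack_int4(words):
    w = np.asarray(words, dtype=np.uint16)
    return np.stack([(w >> s) & 0xF for s in (0, 4, 8, 12)], axis=-1)

x = np.array([[1, 7, 0, 15], [3, 3, 2, 9]])
packed = pack_int4(x)
assert np.array_equal(unpack_int4(packed), x)
print(packed)
```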
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionTSMC's FinFlex standard cells for sub-3 nm nodes provide flexibility and an optimal balance of performance, power, and area. The FinFlex technology offers multiple versions per cell type with varied heights and threshold voltages, allowing cell substitution for placement optimization. This paper introduces the first approach to standard cell legalization for the FinFlex technology, addressing cell substitution, power efficiency, and minimum implant area (MIA) constraints. The proposed algorithm includes three stages: intra-row preprocessing with cell substitution, DAG-based initial legalization with MIA violation prediction, and dynamic programming for inter-row violation removal. Experimental results show superior performance over state-of-the-art techniques.
Keynote


DescriptionTopic to be announced
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionLarge language models (LLMs) are increasingly utilized for complex tasks requiring longer context lengths, with some models supporting up to 128K or 1M tokens. This trend, however, presents significant challenges in inference speed and memory management.
The primary bottleneck in long-context LLM inference is the quadratic computational complexity of attention mechanisms, causing substantial slowdowns as sequence length increases. The KV cache mechanism alleviates this issue by storing pre-computed data, but it introduces memory requirements that scale linearly with context length, hindering efficient LLM deployment. Quantization emerges as a promising approach to address the widening gap between LLM size and memory capacity. However, traditional quantization schemes often yield suboptimal compression results for KV caches due to two key factors:
i) On-the-fly quantization and de-quantization, causing significant performance overhead;
ii) Prevalence of outliers in KV values, challenging low-bitwidth uniform quantization.
To this end, we propose MILLION, a novel quantization framework achieving low-bitwidth KV cache through product quantization. First, we conduct a thorough analysis of KV cache distribution, revealing the limitations of existing quantization schemes. Second, we introduce a non-uniform quantization algorithm based on product quantization, which efficiently compresses data while preserving accuracy. Third, we develop a high-performance GPU inference framework for MILLION that leverages sparse computation and asynchronous quantization, significantly enhancing inference speed.
Comprehensive evaluation results demonstrate that MILLION achieves 4-bit quantization with trivial perplexity and accuracy loss.
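For readers unfamiliar with product quantization, the sketch below compresses cached vectors by splitting them into subspaces and coding each with a small k-means codebook; the subspace count, codebook size, and training loop are illustrative assumptions, not MILLION's implementation.

```python
# Minimal product-quantization sketch for cached key/value vectors (illustrative).
# Codebook sizes, subspace count, and training procedure are assumptions, not MILLION's.
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 64, 8, 16          # head dim, subspaces, centroids per subspace (4-bit codes)
sub = d // m

def train_codebooks(X, iters=10):
    books = []
    for j in range(m):
        Xj = X[:, j*sub:(j+1)*sub]
        C = Xj[rng.choice(len(Xj), k, replace=False)].copy()
        for _ in range(iters):                      # plain Lloyd iterations
            dist = ((Xj[:, None, :] - C[None]) ** 2).sum(-1)
            assign = dist.argmin(1)
            for c in range(k):
                pts = Xj[assign == c]
                if len(pts):
                    C[c] = pts.mean(0)
        books.append(C)
    return books

def encode(X, books):
    codes = np.empty((len(X), m), dtype=np.uint8)
    for j, C in enumerate(books):
        Xj = X[:, j*sub:(j+1)*sub]
        codes[:, j] = ((Xj[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
    return codes

def decode(codes, books):
    return np.concatenate([books[j][codes[:, j]] for j in range(m)], axis=1)

kv = rng.normal(size=(1024, d)).astype(np.float32)
books = train_codebooks(kv)
codes = encode(kv, books)                 # 8 bytes per vector instead of 256
print("reconstruction MSE:", float(((decode(codes, books) - kv) ** 2).mean()))
```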
Research Manuscript


Systems
SYS5: Embedded Memory and Storage Systems
DescriptionAs the large discrepancy between traffic and capacity persists, flash lifetime in EF-SMR systems has become a serious issue.
The EF-SMR system, which combines NAND flash with Shingled Magnetic Recording (SMR) disks, offers high performance and density.
Therefore, studying the endurance of EF-SMR systems is crucial for the development of future high-performance, low-cost storage solutions.
This paper presents MiniWear, which combines disk media with flash to form a hybrid architecture, and proposes a customized scheduling strategy.
MiniWear reduces flash wear through fine-grained scheduling while minimizing impact on EF-SMR system performance.
Experimental results show that, compared to existing methods, MiniWear reduces flash wear by up to 66.67%.
Research Manuscript


Systems
SYS5: Embedded Memory and Storage Systems
DescriptionThe rapid advancement of information technology has brought multimodal information retrieval into the research spotlight. Neural networks, particularly Transformers, have emerged as the dominant solution for extracting multimodal feature vectors. While neural network acceleration has been extensively explored, the subsequent retrieval stage in multimodal scenarios remains under-optimized. Conventional retrieval approaches, such as cosine similarity sorting on von Neumann architectures, suffer from significant data migration and computational inefficiencies. Hashing methods enhance storage and computation efficiency but encounter challenges in energy-efficient implementation and mitigating accuracy losses due to modal heterogeneity. This paper presents a hybrid architecture that integrates processing-in-memory (PIM) and content-addressable memory (CAM) to address these challenges. Transformer-extracted features are processed via in-memory random hashing leveraging device-intrinsic properties, with CAM facilitating parallel search space reduction. A final cosine similarity reranking stage refines the results while balancing accuracy with energy efficiency. Experimental evaluations validate that the proposed method, when compared to the baseline traditional CPU-based cosine similarity retrieval, 1) achieves an almost identical level of accuracy, dramatically outperforming other pure CAM-based Hamming distance retrieval approaches; and 2) reduces latency by 9.45× and energy consumption by 30.20×.
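A rough software analogue of the pipeline is sketched below: random-hyperplane hashing produces binary signatures, a Hamming-distance filter plays the role of the CAM-based search-space reduction, and exact cosine similarity reranks the survivors; the hash length, candidate threshold, and data are invented for illustration.

```python
# Illustrative two-stage retrieval: random-hyperplane hashing to shrink the candidate
# set (the role played by CAM in the paper), then cosine-similarity reranking.
# Hash length, candidate threshold, and data are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, d, bits = 10_000, 128, 64
db = rng.normal(size=(n, d)).astype(np.float32)
planes = rng.normal(size=(d, bits)).astype(np.float32)

def hash_codes(X):
    return (X @ planes > 0)                      # boolean (n, bits) signatures

db_codes = hash_codes(db)

def search(query, top_k=5, max_hamming=20):
    q_code = hash_codes(query[None])[0]
    ham = (db_codes != q_code).sum(axis=1)       # Hamming distance to every signature
    cand = np.where(ham <= max_hamming)[0]       # parallel filtering stage
    if len(cand) == 0:
        cand = np.argsort(ham)[:100]             # fall back to the closest signatures
    sims = db[cand] @ query / (np.linalg.norm(db[cand], axis=1) * np.linalg.norm(query))
    return cand[np.argsort(-sims)[:top_k]]       # exact cosine reranking

q = db[42] + 0.05 * rng.normal(size=d).astype(np.float32)
print(search(q))                                 # index 42 should rank near the top
```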
Networking
Work-in-Progress Poster


DescriptionOn-device machine learning applications are increasingly deployed in dynamic and open environments, where resource availability can fluctuate unpredictably. This variability, combined with limited computing resources, poses significant challenges to achieving high responsiveness. Existing frameworks like TensorFlowLite rely on static and coarse-grained resource allocation, leading to performance degradation under contention. To address this, we propose FlexOn, a framework that combines fine-grained model segmentation and dynamic resource selection to adapt to highly dynamic runtime conditions. A prototype built on TensorFlowLite demonstrates significant improvements of up to 54% in average latency and 58% in tail latency across three embedded devices under resource constraints.
Research Manuscript


EDA
EDA8: Design for Manufacturing and Reliability
DescriptionAs semiconductor technology scales beyond 5nm, complementary FET (CFET) that stacks P-FET and N-FET enables extreme cell area scaling. However, reduced standard cell height with CFET presents routability challenges due to limited back-end-of-line (BEOL) routing resources. In this paper, we address two key routability problems, i.e., pin accessibility and routing congestion, by employing various pin-extended cells on demand. Moreover, we introduce an end-to-end VLSI design framework that further alleviates routing congestion using partial placement blockages. Experimental results demonstrate that our framework eliminates most design rule violations (DRVs) related to routability while maintaining the area advantages of CFET technology.
Engineering Poster
Networking


DescriptionWith the increasing design complexity at lower FinFET technology nodes and decreasing noise margins, optimizing the power grid (PG) is crucial for power integrity (PI) signoff in System-on-Chip (SoC) designs. Traditionally, designers rely on place-and-route (P&R) tools and perform power integrity analysis with PG grid adjustments. However, challenges such as limited parameter selection for Electromigration and Voltage drop (EMIR) sensitivity analysis, long iteration cycles, and insufficient data statistics hinder effective PG grid optimization. This paper proposes a machine learning (ML)-based approach for exhaustive PDN (Power Delivery Network) parameter sensitivity analysis, which aims to address these challenges. By automating parameter sampling and utilizing metamodeling techniques, we can accelerate the exploration of PG grid parameters (e.g., metal pitch and width), enabling a more comprehensive and efficient sensitivity analysis. This methodology eliminates the need for repetitive simulations after the metamodel generation, thus reducing the iteration cycles traditionally required for power grid optimization. Our results demonstrate that ML-driven exploration helps identify key dependencies, such as the effect of metal pitch on dynamic voltage drop (DVD), and allows for signal routing congestion reduction. This approach also holds potential for optimizing PG grids with respect to other parameters, including frequency, activity, and power switch pitch, making it a versatile tool for SoC design optimization.
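The sketch below illustrates the metamodeling idea on a toy problem: a handful of sampled "simulation" points are fitted with a quadratic response surface of voltage drop versus metal pitch and width, which is then swept without further simulation; the synthetic ground-truth function, sample counts, and model form are assumptions, not the paper's actual flow.

```python
# Toy metamodel flow: fit a quadratic response surface of voltage drop vs. PG metal
# pitch/width from a few sampled "simulations", then sweep it without re-simulating.
# The synthetic ground-truth function and sample counts are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def simulate_dvd(pitch_um, width_um):
    """Stand-in for an EMIR simulation run (not a real model)."""
    return 40 + 8.0 * pitch_um - 25.0 * width_um + 3.0 * pitch_um * width_um \
           + rng.normal(scale=0.3)

# 1) Sample the parameter space (would be EMIR runs in practice).
pitch = rng.uniform(2.0, 10.0, size=30)
width = rng.uniform(0.2, 1.0, size=30)
dvd = np.array([simulate_dvd(p, w) for p, w in zip(pitch, width)])

# 2) Fit a quadratic metamodel by least squares.
def features(p, w):
    return np.column_stack([np.ones_like(p), p, w, p*w, p**2, w**2])

coef, *_ = np.linalg.lstsq(features(pitch, width), dvd, rcond=None)

# 3) Explore the full grid using only the metamodel (no further simulation).
P, W = np.meshgrid(np.linspace(2, 10, 50), np.linspace(0.2, 1.0, 50))
pred = features(P.ravel(), W.ravel()) @ coef
best = np.argmin(pred)
print("predicted best (pitch, width):",
      (round(float(P.ravel()[best]), 2), round(float(W.ravel()[best]), 2)))
```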
Engineering Presentation


Back-End Design
Chiplet
DescriptionPower optimization is a critical consideration in modern semiconductor design. Recovering even a fraction of a percent of chip power can significantly impact the chip's cost, feasibility, and overall viability. Buffers inserted to fix hold time violations in scan-test shift paths switch in both functional and scan operation modes. They stay connected to functional paths and consume power for the lifetime of the chip. The problem becomes worse in multi-clock domain designs. Functional power dissipation in these parasitic scan-test shift buffers is a waste. Existing EDA tools' ability to identify these parasitic buffers is leveraged to isolate them in functional mode.
An effective technique and automation flow for reducing the dynamic power in contemporary digital designs is presented. Dynamic power saving is achieved without any design or implementation trade-offs. The power recovery is moderate for large, medium-frequency designs and higher for smaller, high-frequency blocks. All present and future digital circuits can easily adopt the proposed buffer gating technique without any ramifications. The proposed method is an implementation-based technique that is independent of the chip architecture, design size, and technology node. The proposed methodology can be implemented using existing EDA tools and will not impact the design cycle time.
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
DescriptionThe independence of logic optimization and technology mapping poses a significant challenge in achieving high-quality synthesis results. This paper proposes a scalable and efficient framework based on Mixed Structural Choices (MCH), a novel heterogeneous mapping method that combines multiple logic representations with technology-aware optimization. MCH flexibly integrates different logic representations and stores candidates for various optimization strategies. It enhances technology mapping and addresses structural bias issues.
The MCH-based LUT mapping algorithm sets new records in the EPFL Best Results Challenge. Additionally, MCH-based ASIC technology mapping outperforms single-representation mapping, and MCH-based logic optimization overcomes local optima and improves results.
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionQuantization is a widely used technique to compress neural networks. Assigning uniform bit-widths across all layers can result in significant accuracy degradation at low precision and inefficiency at high precision. Mixed-precision quantization (MPQ) addresses this by assigning varied bit-widths to layers, optimizing the accuracy-efficiency trade-off. Existing sensitivity-based methods for MPQ assume that quantization errors across layers are independent, which leads to suboptimal choices. We introduce CLADO, a practical sensitivity-based MPQ algorithm that captures cross-layer dependency of quantization error. CLADO approximates pairwise cross-layer errors using linear equations on a small data subset. Layerwise bit-widths are assigned by optimizing a new MPQ formulation based on cross-layer quantization errors using an Integer Quadratic Program. Experiments with CNN and transformer models on ImageNet demonstrate that CLADO achieves state-of-the-art mixed-precision quantization performance. Code is available in the CLADO anonymous repository at https://anonymous.4open.science/r/CLADO_DAC2025-BC55/README.md.
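The miniature below conveys the shape of the optimization CLADO solves: choose per-layer bit-widths to minimize a quadratic cross-layer error model under a size budget. The error matrix, parameter counts, and budget are invented, and a real instance would use an Integer Quadratic Program solver rather than enumeration.

```python
# Miniature of cross-layer mixed-precision assignment: minimize a quadratic error
# objective over per-layer bit-widths subject to a size budget. CLADO solves this as
# an Integer Quadratic Program; here a toy 4-layer case is solved by enumeration.
# The error terms, interaction matrix, and budget are invented for illustration.
import itertools
import numpy as np

layers = 4
choices = [4, 8]                         # candidate bit-widths per layer
params = np.array([2.0, 1.0, 4.0, 1.5])  # millions of parameters per layer (assumed)

# Q[i, j]: error contribution when layers i and j are both quantized to low precision;
# the diagonal holds per-layer sensitivities (assumed numbers).
Q = np.array([[0.30, 0.05, 0.02, 0.00],
              [0.05, 0.10, 0.04, 0.01],
              [0.02, 0.04, 0.50, 0.06],
              [0.00, 0.01, 0.06, 0.08]])

budget_bits = 50.0                       # total model size budget in megabits (assumed)

best = None
for assign in itertools.product(choices, repeat=layers):
    x = np.array([1.0 if b == 4 else 0.0 for b in assign])   # 1 = low precision
    size = float(np.dot(params, assign))
    if size > budget_bits:
        continue
    err = float(x @ Q @ x)               # quadratic cross-layer error model
    if best is None or err < best[0]:
        best = (err, assign, size)

print("best assignment (bits per layer):", best[1], "error:", round(best[0], 3))
```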
Engineering Presentation


AI
Back-End Design
DescriptionAs the technology node scales down continuously, the complexity of the chip design has increased. The electronic design automation (EDA) tools also need to be flexible to handle the design complexity.
Advanced EDA tools offer numerous tunable parameters that can greatly affect physical design quality.
As the demand to achieve improved results faster (time to market) grows, physical design engineers face challenges that conventional approaches cannot keep pace with, taking months of manual parameter tuning across hundreds of trials. Moreover, even if engineers manage to find the best recipe for a given design through vast design space exploration, it is likely to be a one-time solution that is difficult to apply to another design unless the influence of the various parameters is sufficiently understood.
Therefore, we propose an ML-based PPA push workflow using explainable AI (XAI) that not only obtains the golden recipe for a given design but also explains the influence of the parameters used.
Research Manuscript


Systems
SYS4: Embedded System Design Tools and Methodologies
DescriptionAlong with the prosperity of Artificial Intelligence (AI) techniques, more and more Artificial Intelligence of Things (AIoT) applications adopt Federated Learning (FL) to enable collaborative learning without compromising the privacy of devices. Since existing centralized FL methods suffer from the problems of single-point-of-failure and communication bottleneck caused by the parameter server, we are witnessing an increasing use of Decentralized Federated Learning (DFL), which is based on Peer-to-Peer (P2P) communication without using a global model. However, DFL still faces three major challenges, i.e., limited computing power and network bandwidth of resource-constrained devices, non-Independent and Identically Distributed (non-IID) device data, and all-neighbor-dependent knowledge aggregation operations, all of which greatly suppress the learning potential of existing DFL methods. To address these problems, this paper presents an efficient DFL framework named MMDFL based on our proposed multi-model-based learning and knowledge aggregation mechanism. Specifically, MMDFL adopts multiple traveler models, which perform local training individually along their traversed devices, accelerating and maximizing knowledge learning and sharing among devices. Moreover, based on our proposed device selection strategy, MMDFL enables each traveler to adaptively explore its next best neighboring device to further enhance the DFL training performance, taking into account issues of data heterogeneity, limited resources and catastrophic forgetting phenomenon. Experimental results from simulation and a real testbed show that, compared with state-of-the-art DFL methods, MMDFL can not only significantly reduce the communication overhead but also achieve better overall classification performance for both IID and non-IID scenarios.
Networking
Work-in-Progress Poster


DescriptionThe electronics and semiconductor industry is a prominent consumer of per- and poly-fluoroalkyl substances (PFAS), also known as forever chemicals. PFAS are persistent in the environment and can bioaccumulate to ecological and human toxic levels. Computer designers have an opportunity to reduce the use of PFAS in semiconductors and electronics manufacturing, including integrated circuits (IC), batteries, displays, etc., which currently account for a staggering 10% of the total PFAS fluoropolymers usage in Europe alone. In this paper, we present a framework where we (1) quantify the environmental impact of PFAS in computing systems manufacturing with granular consideration of the metal layer stack and patterning complexities in IC manufacturing at the design phase, (2) identify contending trends between embodied carbon (the carbon footprint due to hardware manufacturing) and PFAS. For example, manufacturing an IC at a 7 nm technology node using EUV lithography uses 18% fewer PFAS-containing layers than manufacturing the same IC at the same node using DUV immersion lithography, unlike embodied carbon trends, and (3) conduct case studies to illustrate how to optimize and trade off designs with lower PFAS, while meeting power-performance-area constraints. We show that optimizing designs to use fewer back-end-of-line (BEOL) metal stack layers can save 1.7× PFAS-containing layers in systolic arrays.
Research Manuscript


Design
DES5: Emerging Device and Interconnect Technologies
DescriptionThis work presents a novel Monolithic 3D (M3D) FPGA architecture that leverages stackable back-end-of-line (BEOL) transistors to implement configuration memory and pass gates. By integrating BEOL-compatible n-type (W-doped In₂O₃) and p-type (SnO) amorphous oxide semiconductor (AOS) transistors, Si SRAM configuration bits are substituted with a less leaky equivalent, programmable at logic-compatible voltages. Using AOS pass gates reduces the overhead of reconfigurable circuits by mapping FPGA switch block and connection block matrices above configurable logic blocks, thereby increasing the proximity of logic elements. Modifying the Verilog-to-Routing suite demonstrates that an AOS-based M3D FPGA can dramatically improve FPGA power, performance, and footprint.
Networking
Work-in-Progress Poster


DescriptionIn this paper, we propose MOOPSE (Matrix multiply Operation Optimization with Partially-Shared high-radix booth Encoder), a novel matrix multiply unit (MMU) used in processor cores. A key feature of MOOPSE is the use of high-radix Booth encoders, which reduces the area of fused multiply add (FMA) units, improving the area-efficiency of MMUs. Although high-radix Booth encoders are rarely used in conventional FMA units due to performance, power, and area overheads, our MOOPSE solves this issue by sharing main internal components of Booth encoders between multiple FMA units. Our experimental results show that MOOPSE shows respective improvements of up to 9%, 21%, 19%, and 16% in performance per area, dynamic power efficiency, leakage power efficiency, and area overhead compared to the state-of-the-art MMU.
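To make the "high-radix" idea concrete, the snippet below performs standard radix-4 Booth recoding, in which each digit covers two multiplier bits and so halves the number of partial products; MOOPSE's encoder-sharing scheme itself is not shown here.

```python
# Standard radix-4 Booth recoding: each digit in {-2,-1,0,1,2} covers two multiplier
# bits, halving the number of partial products versus radix-2. MOOPSE's encoder-sharing
# scheme is not shown; this only illustrates the recoding a high-radix encoder performs.
def booth_radix4_digits(multiplier, bits):
    """Return Booth digits (LSB first) for a signed 'bits'-wide multiplier."""
    # Two's-complement view, with an appended 0 below the LSB.
    m = multiplier & ((1 << bits) - 1)
    padded = m << 1
    digits = []
    for i in range(0, bits, 2):
        window = (padded >> i) & 0b111                  # bits i+1, i, i-1
        digit = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
                 0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}[window]
        digits.append(digit)
    return digits

def value_from_digits(digits):
    return sum(d << (2 * i) for i, d in enumerate(digits))

for x in (-7, -1, 0, 5, 13, -128, 127):
    d = booth_radix4_digits(x, bits=8)
    assert value_from_digits(d) == x, (x, d)
print("radix-4 Booth digits of 13:", booth_radix4_digits(13, bits=8))
```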
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
DescriptionDeep learning has significantly advanced Electronic Design Automation (EDA), with circuit representation learning emerging as a key area for modeling the relationship between a circuit's structure and functionality. Existing methods primarily use either Large Language Models (LLMs) for Register Transfer Level (RTL) code analysis or Graph Neural Networks (GNNs) for netlist modeling. While LLMs excel at high-level functional understanding, they struggle with detailed netlist behavior. GNNs, however, face challenges when scaling to larger sequential circuits due to long-range information dependencies and insufficient functional supervision, leading to decreased accuracy and limited generalization.
To address these challenges, we propose MOSS, a multimodal framework that integrates GNNs with LLMs for sequential circuit modeling. By enhancing D-type Flip-Flop (DFF) node features with embeddings from fine-tuned LLMs on RTL code, we focus the GNN on critical anchor points, reducing reliance on long-range dependencies. The LLM also provides global circuit embeddings, offering efficient supervision for functionality-related tasks. Additionally, MOSS introduces an adaptive aggregation method and a two-phase propagation mechanism in the GNN to better model signal propagation and sequential feedback within the circuit.
Experimental results demonstrate that MOSS significantly improves the accuracy of functionality and performance predictions for sequential circuits compared to existing methods, particularly in larger circuits where previous models struggle. Specifically, MOSS achieves a 95.2% accuracy in arrival time prediction.
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionRetrieval-augmented language models (RALMs) have attracted widespread attention for addressing the limitations of traditional large language models. However, challenges involved in retrieval, including substantial data movement and irregular access patterns, seriously impact the efficiency and deployment of RALMs. The emerging 3D-stacked processing-in-memory (PIM) architecture, characterized by its high memory bandwidth and near-data computing capabilities, presents a promising solution for efficient retrieval. To support large-scale retrieval in RALMs, the PIM architecture should be carefully designed with joint software and hardware optimization.
This paper presents Rimast, a retrieval-in-memory architecture for fast retrieval in RALMs. The objective is to minimize data movement and improve overall performance through hardware-software co-design. At the hardware level, a hierarchical PIM architecture with a retrieval-in-memory dataflow is designed to reduce unnecessary data transfer. At the software level, skew-free data mapping and adaptive offloading strategies are proposed to address the irregular access patterns associated with retrieval in RALMs. We demonstrate the effectiveness of the proposed Rimast using extensive experiments. The experimental results demonstrate that Rimast effectively reduces data movement, achieving average speedups of 273×, 55×, and 2.41× over CPUs, GPUs, and prior art accelerators, respectively.
Networking
Work-in-Progress Poster


DescriptionAs one of the most representative AI technologies, the Mamba architecture has enabled many advanced models. This paper proposes an energy-efficient Mamba inference processor, called the Mamba Processing Element (MPE). Firstly, MPE uses a recurrent framework to find Low-correlation Assignment Pruning Optimization (LAPO) schemes; secondly, MPE uses a Spatial Multi-head Attention Similarity (SMAS) mechanism; thirdly, MPE adopts a Dynamic Parallel Compression Quantization (DPCQ) architecture. Synthesized with 28nm CMOS tools, the proposed MPE processor has an area of 9.14 mm2 and a peak energy efficiency of 93.51 TOPS/W, which is 16.3 times that of the H100 graphics processing unit (GPU).
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionTriple patterning lithography (TPL) has been recognized as one of the most promising solutions to print critical features in advanced technology nodes. A critical challenge within TPL is the effective assignment of the layout to masks. Recently, various layout decomposition methods and TPL-aware routing methods have been proposed to consider TPL. However, these methods typically result in numerous conflicts and stitches, and are mainly designed for 2-pin nets. This paper proposes a multi-pin net routing method in triple patterning lithography, called Mr.TPL. Experimental results demonstrate that Mr.TPL reduces color conflicts by 81.17%, decreases stitches by 76.89%, and achieves up to 5.4× speed improvement compared to the state-of-the-art TPL-aware routing method.
Networking
Work-in-Progress Poster


DescriptionData logging is crucial for forensic analysis and diagnostics in embedded systems. However, it must remain efficient and reliable despite resource constraints and vulnerabilities. This paper introduces MTrace, a framework utilizing the Data Watchpoint and Trace (DWT) unit, universally available in the ARMv7-M architecture, to build a secure logging system. By monitoring memory-mapped I/O (MMIO) interactions via watchpoints, MTrace collects logs and prevents tampering. We implemented MTrace on a drone using NuttX OS and demonstrated that it enables trustworthy logging of MMIO activities while ensuring reliable flight.
Research Manuscript


EDA
EDA8: Design for Manufacturing and Reliability
DescriptionYield estimation is crucial in semiconductor manufacturing, impacting production costs and competitiveness. Traditional methods like Monte Carlo simulation are reliable but computationally intensive, while alternative approaches face challenges in consistency and validation. We introduce YieldAgent, a Large Language Model (LLM)-powered multi-agent framework for yield estimation. YieldAgent dynamically integrates multiple strategies through a three-layer architecture, optimizing estimation methods based on circuit characteristics and historical data. Experiments across 12nm and 40nm nodes demonstrate a 2.9× reduction in computational overhead while maintaining state-of-the-art accuracy, establishing a new paradigm for intelligent yield estimation in electronic design automation.
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionDiffractive optical neural networks (DONNs), leveraging free-space light wave propagation for ultra-parallel, high-efficiency computing, have emerged as promising artificial intelligence (AI) accelerators. However, their inherent lack of reconfigurability due to fixed optical structures post-fabrication hinders practical deployment in the face of dynamic AI workloads and evolving applications. To overcome this challenge, we introduce, for the first time, a multi-dimensional reconfigurable hybrid diffractive ONN system (MDR-HDONN), a physically composable architecture that unlocks a new degree of freedom and unprecedented versatility in DONNs. By leveraging full-system learnability, MDR-HDONN repurposes fixed fabricated optical hardware, achieving exponentially expanded functionality and superior task adaptability through the differentiable learning of system variables. Furthermore, MDR-HDONN adopts a hybrid optical/photonic design, combining the reconfigurability of integrated photonics with the ultra-parallelism of free-space diffractive systems. Extensive evaluations demonstrate that MDR-HDONN has digital-comparable accuracy on various task adaptations with 74x faster speed and 194x lower energy. Compared to prior DONNs, MDR-HDONN shows an exponentially larger functional space with 5x faster training speed, paving the way for a new paradigm of versatile, composable, hybrid optical/photonic AI computing architecture design. We will open-source our codes.
Research Special Session


Design
DescriptionAs silicon transistor scaling slows, advancements in packaging and 3DIC technologies have become essential for improving semiconductor system performance. However, these technologies introduce significant thermal and mechanical design complexities, which can drastically reduce reliability and yields. To overcome thermal-stress challenges, multi-scale physics simulations are needed to identify design flaws before mass production. However, existing modeling tools struggle to scale to complex systems. In this presentation, we showcase an AI-accelerated, multi-scale thermal-stress simulator that enables highly accurate simulations with up to 1000x speedups over traditional finite element method solvers. This framework utilizes specialized AI models to simulate thermal stress across systems, from centimeter-scale interposers down to nanometer-scale interconnects and vias, all within a single, comprehensive simulation. With this approach, engineers can explore, identify, and resolve thermal-stress design challenges early, ultimately shortening development cycles and reducing time-to-market for advanced semiconductor chips.
Research Manuscript


EDA
EDA2: Design Verification and Validation
DescriptionA crucial step in the design of multicore systems is to validate the interaction between cores. This involves test program generation and runtime analysis. We propose a novel reinforcement learning approach to directed test generation, where an agent induces a suite of programs, which are executed in a simulation environment for a multicore. It focuses on how to recover state information from raw observations of the environment such that the agent can learn from interaction how to improve coverage for any verification task. We evaluated our state representation for different verification tasks involving 16- and 32-core ARMv8 2-level MOESI designs.
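A very rough caricature of coverage-directed generation is sketched below: an epsilon-greedy agent picks test-program templates and is rewarded by newly covered bins reported by a stand-in simulator; the templates, coverage model, and update rule are entirely synthetic and are not the paper's method.

```python
# Toy epsilon-greedy loop in the spirit of coverage-directed test generation: the agent
# picks an instruction-mix "template", a stand-in simulator reports newly hit coverage
# bins, and the reward is the coverage gain. The environment is entirely synthetic.
import random

random.seed(0)
templates = ["load_heavy", "store_heavy", "atomic_mix", "barrier_mix"]
# Hypothetical: which coverage bins each template tends to exercise.
bin_pools = {t: set(range(i * 25, i * 25 + 40)) for i, t in enumerate(templates)}

covered = set()
q = {t: 0.0 for t in templates}
counts = {t: 0 for t in templates}

for episode in range(200):
    t = random.choice(templates) if random.random() < 0.1 else max(q, key=q.get)
    hits = {b for b in bin_pools[t] if random.random() < 0.2}   # "run" the test program
    reward = len(hits - covered)                                # newly covered bins
    covered |= hits
    counts[t] += 1
    q[t] += (reward - q[t]) / counts[t]                         # running-average update

print("coverage bins hit:", len(covered),
      "| value estimates:", {t: round(v, 2) for t, v in q.items()})
```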
Networking
Work-in-Progress Poster


DescriptionDRAM chips contain multiple banks to handle several memory requests in parallel. When two memory requests try to access different rows of the same bank, it results in row buffer conflicts. Our objective in this work is to reduce row buffer conflicts by introducing multiple row buffers in each bank. We propose multiple row buffer DRAM (MRB-DRAM), which employs a primary global row buffer along with multiple secondary SRAM row buffers (SRBs) in the DRAM chip. These SRBs act as a cache, exploiting the spatial and temporal locality of memory requests to enhance performance. Our evaluation shows that MRB-DRAM achieves an average improvement in Instructions per Cycle (IPC) of 27% for single-core systems and a weighted speedup increase of 24% for multicore systems in DDR4 memory configurations with a modest area overhead of approximately 2%.
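The toy model below illustrates why secondary row buffers help: recently opened rows are kept in a small LRU-managed set, so re-references hit a buffer instead of causing a row conflict. The trace, buffer counts, and replacement policy are assumptions, not the proposed MRB-DRAM design.

```python
# Toy hit-rate model for a bank with one primary row buffer plus a small set of
# secondary row buffers managed as an LRU cache of recently opened rows. Trace,
# buffer count, and policy are assumptions used only to illustrate the idea.
from collections import OrderedDict
import random

random.seed(0)

def run(trace, num_secondary):
    open_row = None
    secondary = OrderedDict()             # row id -> None, kept in LRU order
    hits = 0
    for row in trace:
        if row == open_row or row in secondary:
            hits += 1
            if row in secondary:
                secondary.move_to_end(row)
        else:
            if open_row is not None:      # spill the evicted primary row to an SRB
                secondary[open_row] = None
                secondary.move_to_end(open_row)
                if len(secondary) > num_secondary:
                    secondary.popitem(last=False)
            open_row = row
    return hits / len(trace)

# Synthetic trace with temporal locality: mostly revisit a small working set of rows.
trace = [random.choice([random.randrange(8), random.randrange(1 << 14)])
         for _ in range(50_000)]
for n in (0, 2, 4, 8):
    print(f"{n} secondary buffers -> row hit rate {run(trace, n):.2%}")
```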
Research Panel


Design
DescriptionFrom chips to systems, design and design automation are embracing extraordinary opportunities, propelled by a myriad of exciting advances including (i) emerging technologies such as beyond-CMOS devices and 3D heterogeneous integration, (ii) alternative computation paradigms such as in/near-memory computing and domain-specific computing, and (iii) innovative computational methods such as large language models. At the same time, the field faces significant challenges, spurred by demanding applications like AI, autonomous systems, quantum computing and healthcare. Compounding these challenges is a growing workforce gap, which underscores the urgent need to attract and cultivate talent across the design and design automation community, spanning from chips to complete systems.
In response to these opportunities and challenges, governments and industry across the globe are making substantial investments in research and workforce development in this field. However, navigating the complexities of various funding programs for research and workforce development can be daunting even for seasoned researchers and educators. This panel brings together representatives from government funding agencies as well as academia to share their perspectives on these critical issues. Some questions to be discussed include (i) what the existing representative funding opportunities are and what specific aims these opportunities try to address, (ii) whether the funding opportunities have covered the needs well, and if not, what other areas would need more funding, and (iii) what common fallacies and pitfalls to avoid when preparing successful proposals.
Networking
Work-in-Progress Poster


DescriptionBluetooth Low Energy (BLE) technology plays a crucial role in IoT and wearable devices due to its energy-efficient design. This paper explores the intricate interplay between BLE's security configurations, power consumption, and performance. Through comprehensive experiments, we analyze the cost of security on power efficiency and transmission speed, revealing trade-offs essential for BLE protocol design. The experimental results indicate that security settings can impact both energy consumption and performance. Moreover, our findings reveal that more performant protocol configurations can paradoxically lead to reduced energy consumption. This highlights the paramount importance of the security-power-performance trade-off in BLE protocol design.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionLinear-response time-dependent Density Functional Theory (LR-TDDFT) is a widely used method for accurately predicting the excited-state properties of physical systems.
Previous works have attempted to accelerate LR-TDDFT using heterogeneous systems such as GPUs, FPGAs, and the Sunway architecture.
However, a major drawback of these approaches is the constant data movement between host memory and the memory of the heterogeneous systems, which results in substantial data movement overhead.
Moreover, these works focus primarily on optimizing the compute-intensive portions of LR-TDDFT, even though the calculation steps are fundamentally memory-bound.
To address these challenges, we propose NDFT, a Near-Data Density Functional Theory framework.
Specifically, we design a novel task partitioning and scheduling mechanism to offload each part of LR-TDDFT to the most suitable computing units within a CPU-NDP system.
Additionally, we implement a hardware/software co-optimization of a critical kernel in LR-TDDFT to further enhance performance on the CPU-NDP system.
Our results show that NDFT achieves performance improvements of 5.2x and 2.5x over CPU and GPU baselines, respectively, on a large physical system.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionLarge language model (LLM) inference poses dual challenges, demanding substantial memory bandwidth and computing resources. Recent advancements in near-memory accelerators leveraging 3D DRAM-to-logic hybrid-bonding (HB) interconnects have gained attention due to their highly parallel data transfer capabilities. We address limitations in previous HB-DRAM accelerators, such as those stemming from distributed controller designs, by introducing an architecture with a centralized controller and dual-IO scheme. This approach not only reduces the chip area overhead but also enables reconfigurable GEMV/GEMM operations, boosting the performance. Simulations for the OPT 66B model show that our proposed accelerator achieves 2.9X, 3.5X, and 2.5X higher performance compared to NPU, DRAM-PIM, and heterogeneous designs (DRAM-PIM + NPU), respectively.
Engineering Presentation


Front-End Design
DescriptionThe conventional RTL-based verification process provides valuable insights into SoC designs but often fails to address critical physical implementation and switching phenomena. While Gate Level Simulation (GLS) offers better accuracy, it is labor-intensive and time-consuming, creating an urgent need for innovative solutions that deliver both speed and precision.
We introduce a novel Netlist Powered Emulation paradigm that combines custom hardware, such as ASICs or FPGAs, with a robust software testbench to deliver high-speed, precise validation. Operating at MHz-level clock frequencies, this approach drastically reduces runtime while maintaining reliability, scalability and debugging efficiency.
Our methodology has been rigorously tested across diverse emulators and optimized through advanced techniques. A comparative analysis on two complex SoCs—an automotive chipset and an Exynos premium mobile processor—demonstrates its superior performance over traditional GLS by a factor of 40x.
The proposed paradigm not only addresses the inefficiencies of GLS but also redefines SoC design gate level verification by enabling faster, more accurate debugging and validation. This transformative breakthrough establishes a new benchmark in Gate Level Verification, offering a clear pathway to industry-wide adoption ensuring silicon ready solutions that meet the demands of next-generation designs.
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
DescriptionCircuit representation learning has shown promise in advancing Electronic Design Automation (EDA) by capturing structural and functional circuit properties for various tasks. Existing pre-trained solutions rely on graph learning with complex functional supervision, such as truth table simulation. However, they only handle simple and-inverter graphs (AIGs), struggling to fully encode other complex gate functionalities. While large language models (LLMs) excel at functional understanding, they lack the structural awareness for flattened netlists. To advance netlist representation learning, we present NetTAG, a netlist foundation model that fuses gate semantics with graph structure, handling diverse gate types and supporting a variety of functional and physical tasks. Moving beyond existing graph-only methods, NetTAG formulates netlists as text-attributed graphs, with gates annotated by symbolic logic expressions and physical characteristics as text attributes. Its multimodal architecture combines an LLM-based text encoder for gate semantics and a graph transformer for global structure. Pre-trained with gate and graph self-supervised objectives and aligned with RTL and layout stages, NetTAG captures comprehensive circuit intrinsics. Experimental results show that NetTAG consistently outperforms each task-specific method on four largely different functional and physical tasks and surpasses state-of-the-art AIG encoders, demonstrating its versatility.
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
DescriptionAtomistic materials modeling is a critical task with wide-ranging applications, from drug discovery to materials science, where accurate predictions of the target material property can lead to significant advancements in scientific discovery. Graph Neural Networks (GNNs) represent the state-of-the-art approach for modeling atomistic material data thanks to their capacity to capture complex relational structures. While machine learning performance has historically improved with larger models and datasets, GNNs for atomistic materials modeling remain relatively small compared to large language models (LLMs), which leverage billions of parameters and terabyte-scale datasets to achieve remarkable performance in their respective domains. To address this gap, we explore the scaling limits of GNNs for atomistic materials modeling by developing a foundational model with billions of parameters, trained on extensive datasets in terabyte-scale. Our approach incorporates techniques from LLM libraries to efficiently manage large-scale data and models, enabling both effective training and deployment of these large-scale GNN models. This work addresses three fundamental questions in scaling GNNs: the potential for scaling GNN model architectures, the effect of dataset size on model accuracy, and the applicability of LLM-inspired techniques to GNN architectures. Specifically, the outcomes of this study include (1) insights into the scaling laws for GNNs, highlighting the relationship between model size, dataset volume, and accuracy, (2) a foundational GNN model optimized for atomistic materials modeling, and (3) a GNN codebase enhanced with advanced LLM-based training techniques. Our findings lay the groundwork for large-scale GNNs with billions of parameters and terabyte-scale datasets, establishing a scalable pathway for future advancements in atomistic materials modeling.
Research Manuscript


EDA
EDA4: Power Analysis and Optimization
DescriptionAdvanced integrated circuit (IC) systems increasingly utilize chiplet-based packaging with complex 2.5D/3D structures and dense Through-Silicon Via (TSV) arrays. While the Finite Element Method (FEM) provides high-fidelity thermal simulation for these systems, its computational efficiency degrades significantly when generating and optimizing meshes for intricate geometries. To address these performance limitations while preserving simulation accuracy, we present NeuralMesh, a novel framework that accelerates thermal analysis of chiplet-based ICs. Our approach integrates deep learning and geometric analysis to optimize mesh generation without the need for iterative refinement steps. NeuralMesh first employs an enhanced segmentation model to predict thermal distributions based on geometric, material, and power parameters. These predictions, combined with key geometric features, guide the optimization of an initial coarse FEM mesh. By eliminating traditional iterative mesh refinement, our framework achieves up to 45.00X mesh generation speedup while maintaining thermal accuracy within 0.8% of commercial COMSOL simulations. It also reduces the number of mesh elements in non-critical areas, which further speeds up the subsequent thermal simulation. This advancement enables rapid yet precise thermal analysis essential for modern IC package design.
Engineering Special Session


IP
DescriptionNeuromorphic computing is an emerging paradigm that emulates the computational architecture of the brain with significant performance savings compared to conventional digital architectures. Innovative hardware design will be at the forefront of future computing systems that utilize new and increasingly diverse components. Neuromorphic computing holds transformative potential for revolutionizing computing across various applications, including artificial intelligence, edge computing, and scientific computing, with high energy efficiency. Large-scale neuromorphic testbeds are essential to engage a larger community and enable rapid prototyping and testing of novel algorithms and systems.
The proposed special session includes five short talks from experts from academia and industry that will highlight latest research efforts to design and implement diverse neuromorphic testbeds, from digital, mixed-signal, analog to beyond-CMOS approaches. The brief presentations will be succeeded by a panel discussion addressing the challenges and opportunities associated with next-generation neuromorphic systems.
SKYTalk


DescriptionAs complex SoCs become prevalent in virtually all systems, these devices also present a primary attack surface. The risks of cyberattacks are real, and AI is making them more sophisticated. As we also deploy AI into the SoC design process, it is imperative that secure design practices are incorporated as well.
Existing security solutions are inadequate to provide effective verification of complex SoC designs due to their limitations in scalability, comprehensiveness, and adaptability. Large Language Models (LLMs) are celebrated for their remarkable success in natural language understanding, advanced reasoning, and program synthesis tasks.
Recognizing this opportunity, we propose leveraging the emergent capabilities of Generative Pre-trained Transformers (GPTs) to address the existing gaps in SoC security, aiming for a more efficient, scalable, and adaptable methodology. In this presentation we offer an in-depth analysis of existing work, showcasing achievements, prospects, and challenges of employing LLMs in SoC security design and verification tasks.
Research Manuscript


EDA
EDA6: Analog CAD, Simulation, Verification and Test
DescriptionThe harmonic balance (HB) method is a powerful frequency-domain method used in RF circuit simulations. The key to the HB method is efficiently solving the Jacobian system in Newton's method. In this paper, we first introduce a new time-domain preconditioner for the HB Jacobian. Unlike existing time-domain preconditioners, which cannot balance the efficiency of solving the linear system corresponding to the preconditioner with the reduction in iteration count for strongly nonlinear circuits, the proposed preconditioner successfully addresses both aspects. We also present a new preconditioning method that extends time-domain preconditioners to circuits with distributed devices, which was previously unattainable. Finally, a matrix norm-based metric is proposed to measure the strength of circuit nonlinearity, which can help us a priori choose the appropriate preconditioner.
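For context, the Jacobian system in question can be written in generic notation (not taken from the paper); a preconditioner M ≈ J applied inside the Krylov solve must be cheap to factor and apply, yet keep the iteration count low for strongly nonlinear circuits, which is exactly the balance described above.

    F(V) = I(V) + \Omega\, Q(V) + Y V + B = 0, \qquad
    J\,\Delta V = -F(V), \qquad
    J = \frac{\partial I}{\partial V} + \Omega\,\frac{\partial Q}{\partial V} + Y

Here V collects the harmonic coefficients of the node voltages, I and Q the nonlinear currents and charges, \Omega the frequency-domain differentiation operator, and Y the linear (possibly distributed) admittance contribution.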
Engineering Special Session


Front-End Design
Chiplet
DescriptionThe promise of an open industry standard that offers high-bandwidth, low-latency, power-efficient, cost-effective on-package connectivity between chiplets continues to evolve at blazing speed. Ever since UCIe – or Universal Chiplet Interconnect Express 1.0 was first released in March 2022, it has seen significant industry adoption. Version 1.0 was followed by Version 1.1 (Released August 2023) with key features such as runtime health monitoring for automotive and high-reliability applications. Recently, in August 2024, the latest standard, 2.0 was released.
This session aims to cover the evolution, usage and impact of UCIe through three presentations: First, Dr. Das Sharma, Chair of the UCIe Consortium, will provide an overview of the evolution and future of UCIe. Then Dr. Zorian will delve into the details of one of the key modules in UCIe: health monitoring. Finally, Mr. Jani will cover how customer usage is both benefiting from existing standards and driving new versions of the standard itself, highlighting its significant advancements in System-in-Package (SiP) design and paving the way for high-density systems with improved performance and reduced power consumption.
Networking
Work-in-Progress Poster


DescriptionThis paper introduces Natural-Level Synthesis (NLS), an innovative approach for generating system-level hardware descriptions using generative artificial intelligence (Gen AI). NLS bridges a gap in current hardware development processes, where algorithm engineers' involvement typically ends at the requirements stage. With NLS, engineers can participate more deeply in the development, synthesis, and test stages by using Gen AI models to convert natural language descriptions directly into Hardware Description Language (HDL) code. This approach not only streamlines hardware development but also improves accessibility, fostering a collaborative workflow between hardware and algorithm engineers. We developed the NLS tool to facilitate natural language-driven HDL synthesis, enabling rapid generation of system-level HDL designs while significantly reducing development complexity. Evaluated through case studies and benchmarks using Performance, Power, and Area (PPA) metrics, NLS shows its potential to enhance resource efficiency in hardware development. This work provides a scalable, efficient solution for hardware synthesis and establishes a Visual Studio Code (VS Code) extension to assess Gen AI-driven HDL generation, laying a foundation for future advancements in electronic design automation (EDA).
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionEmerging efficient deep neural network (DNN) models, such as AdderNet, have shown great promise in significantly improving hardware efficiency compared to traditional convolutional neural networks (CNNs). However, achieving low bit-width quantization and effective model compression remains a major challenge. To this end, we introduce Nonnegative AdderNet (NN-AdderNet), a quantization- and compression-friendly AdderNet variant that enables model compression down to 4 bits or even lower. We begin by proposing an equivalent transformation of the sum-of-absolute-difference (SAD) kernel in AdderNet, which allows for the formulation of nonnegative weights. This transformation effectively eliminates the need for a sign bit, thus saving 1 bit per weight. Next, we propose to exploit the dual-sparsity pattern in the weights of the activation-oriented NN-AdderNet quantized model. This inherent sparsity enhances the lossless compression performance over NN-AdderNet. Experimental results show that NN-AdderNet can achieve an average compressed weight bitwidth down to 4 bits or even lower, while achieving negligible accuracy loss as compared to full-precision AdderNet models. Such benefits are further illustrated with hardware-level energy and latency improvements in designing DNN inference accelerators. Consequently, the NN-AdderNet model exhibits both algorithmic and hardware efficiency, thus making it a promising candidate for resource-limited applications.
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionCompute-in-memory using emerging resistive random-access memory (RRAM) demonstrates significant potential for building energy-efficient deep neural networks. However, RRAM-based network training faces challenges from computational noise and gradient calculation overhead. In this study, we introduce NoiseZO, a forward-only training framework that leverages intrinsic RRAM noise to estimate gradients via zeroth-order (ZO) optimization. The framework maps neural networks onto dual RRAM arrays, utilizing their inherent write noise as ZO perturbations for training. This enables network updates through only two forward computations. A fine-grained perturbation control strategy is further developed to enhance training accuracy. Extensive experiments on vowel and image datasets, implemented with typical networks, showcase the effectiveness of our framework. Compared to conventional complementary metal-oxide-semiconductor (CMOS) implementations, our approach achieves a 21-fold reduction in energy consumption.
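For readers unfamiliar with zeroth-order training, the two-forward-pass gradient estimate at the heart of such schemes can be sketched as follows. This is a generic Gaussian-smoothing (SPSA-style) estimator in plain Python, with synthetic perturbations standing in for RRAM write noise; it is illustrative only and is not NoiseZO's dual-array mechanism.

    import numpy as np

    def two_point_zo_gradient(loss_fn, theta, sigma=1e-2):
        """Estimate the gradient of loss_fn at theta from two forward passes.
        The shared random perturbation plays the role that intrinsic RRAM write
        noise plays in the dual-array setup (an illustrative assumption)."""
        delta = np.random.randn(*theta.shape)      # random perturbation direction
        f_plus = loss_fn(theta + sigma * delta)    # first forward computation
        f_minus = loss_fn(theta - sigma * delta)   # second forward computation
        return (f_plus - f_minus) / (2.0 * sigma) * delta

    # Toy usage: minimise a quadratic using forward passes only.
    target = np.array([1.0, -2.0, 0.5])
    loss = lambda w: float(np.sum((w - target) ** 2))
    w = np.zeros(3)
    for _ in range(1000):
        w -= 0.02 * two_point_zo_gradient(loss, w)
    print(w)  # approaches `target` without any analytic gradient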
Networking
Work-in-Progress Poster


DescriptionIn the vision domain, AdderNet emerges as a hardware-efficient alternative to traditional Convolutional Neural Networks (CNNs) by replacing multiplications with additions. However, the security aspect of AdderNet remains under-explored. To this end, we consider a pivotal question: does AdderNet trade model security for hardware efficiency? In this paper, we extend the investigation of AdderNet vulnerabilities to adversarial weight perturbation attacks, e.g., the Bit-Flip Attack (BFA), for the first time, and empirically demonstrate that AdderNet is indeed more susceptible to BFA. To preserve the hardware efficiency of AdderNet while defending against BFA, we propose a novel Secure Non-Negative AdderNet (NNAN) model incorporating a lightweight non-positive weight encoding technique. NNAN enables lightweight, real-time detection and correction by securing (a) the Most Significant Bits (MSBs) of all weights and (b) the second MSBs of the non-positive encoded weights. To further test the resilience of our proposed defense, we perform an advanced BFA (BFA+) where the attacker has sufficient knowledge about the proposed encoding scheme. Our results indicate that NNAN defends against BFA successfully and requires a higher number of attack cycles to be compromised by BFA+, demonstrating that robustness can be achieved through multiplication-free additive architectures.
Engineering Presentation


AI
IP
Chiplet
DescriptionFunctional modeling is vital for early System-on-Chip (SoC) validation, enabling architects to accurately represent design intent. However, modeling the Clock and Reset (CAR) unit—a critical component for synchronizing diverse SoC elements—faces significant challenges. These include managing multiple reset and clock domains, aligning them with initialization requirements, ensuring precise synchronization, and integrating power, safety, and reliability features while adhering to platform-specific constraints.
Despite the availability of detailed architecture specifications, these are not directly translatable into functional models. This results in manual, error-prone processes, misalignment between functional models and architecture intent, and delays in achieving crucial design and validation milestones.
We propose an automated approach to create a self-checking, executable model directly derived from architecture specifications. This approach ensures seamless inheritance of clock and reset requirements, bridging the gap between architecture and functional modeling. By automating this process, we reduce errors, accelerate development, and maintain alignment across architecture and design stages. The solution is scalable and efficient, with applications extending to SystemC model development, power architecture specification, physical design timing analysis, RTL design, and verification domains.
Engineering Presentation


IP
DescriptionWith technology evolution and increased complexity of Macros, IPs and Standard cells, the impact of layout parasitics on designs has become dominant.
Layout parasitic analysis is absolutely necessary to guarantee correct design behavior.
Conventional signoff verification tools are slow, difficult to set up, and don't sufficiently address the challenges of custom Macro/IP/Std. cell library development flows.
The main pain points in our development flow include visualizing third-party Layout Parasitic Extraction (LPE) netlists, ensuring the correctness of extraction, and locating the parasitics that most impact design behavior.
We incorporated novel IC layout analysis techniques into our flow, which resulted in a 2x productivity improvement. The total design cycle time has been significantly reduced, from weeks to minutes or hours.
This unique approach lets us visualize, locate and optimize parasitics by quickly and easily running multiple iterations during analysis and debugging. It was a game changer as it helped us compare different LPE netlists, including those from different foundries, enabling us to decide on the right foundry for our products.
This innovative approach to parasitic analysis effectively addresses our major reliability concerns by identifying probable latent defects, leading to improvements in DFT (Design for Testability). Thus, the LPE netlist is no longer a black box.
Engineering Poster
Networking


DescriptionAs the adoption of 2.5D/3D-ICs increases, it is essential to address new challenges that differ from those in conventional 2D-ICs.
There are two primary issues in 2.5D/3D-IC ESD verification: increased verification complexity and cost. It is necessary to check paths not only within a single die but also between dies, considering the integration of dies with different specifications, such as technology nodes and ESD immunity levels (HBM/CDM).
The number of bumps is increasing and is expected to reach millions with hybrid bond technology, leading to a dramatic increase in check paths, verification time, and debugging/feedback costs.
To address these challenges, we developed an ESD verification environment that reduces verification complexity in large-scale 2.5D-ICs and improves cost efficiency through two key processes.
For a large number of resistance checks between micro bumps, we programmatically reduced verification time by extracting optimal check regions based on cell placements and bump positions. Additionally, by mapping and visualizing the resistance check results between micro bumps, we clarified layout weaknesses and facilitated easier layout fixes.
Engineering Poster


DescriptionIn complex systems such as Multi-Chip Modules, where high-speed digital and analog ICs are interconnected, Power Integrity (PI) analysis is a key step for design sign-off. Chip, package and board back-annotation optimization may have a dramatic impact on the project schedule and cost if the power integrity constraints are not anticipated.
The Chip Power Model (CPM) is one of the key contributors along with the package and board models that enable the system PI analysis. It is generally generated when the design is almost frozen, and any modifications would result in significant impact on the product design. Thus, anticipation is key for the design success.
This paper highlights an approach to enable PI analysis at a very early design stage, allowing design feedback to be anticipated. It is based on the generation of a CPM from an array of blocks which have a representative behavior for the device. The block activity is tuned with respect to the expected power profile, and the number of block instances is sized according to the design area. Simulation results show the need to create the complete CPM at the top level instead of combining the standalone block CPMs.
Engineering Presentation


AI
IP
DescriptionTrue Random Number Generators (TRNGs) are a fundamental component of hardware-level security. They generate random numbers from a physical source, such as noise, providing high-quality randomness that makes them nearly impossible to predict. In today's digital landscape, TRNGs provide the foundation for encryption systems that protect everything from financial transactions to personal communications.
Ring-oscillator-based TRNGs are popular as their structure relies on elements commonly used in analog circuits, making them easy to implement. However, the time-domain simulations needed to ensure their genuine randomness and data variability are extremely time-consuming and resource intensive. Traditional verification methods can take over a year, making the design flow inefficient.
Microsoft has researched various types of ring-oscillator-based TRNGs for hardware encryption, including Free-Running Ring-Oscillator TRNGs and Fibonacci-Galois Ring-Oscillator TRNGs. These different TRNG architectures present distinct verification challenges, each requiring specialized simulation approaches.
In this presentation, we will explore the complex verification challenges associated with these two TRNGs and highlight how the collaboration between Microsoft and Siemens EDA has enabled a breakthrough simulation approach. We will demonstrate a new simulation workflow using the Siemens Solido Simulation Suite that dramatically enhances both accuracy and efficiency in TRNG verification, while maintaining the highest standards of randomness validation.
Workshop


Security
Sunday Program
DescriptionThe Workshop on Hardware Attack Artifacts, Analysis, and Metrics (WHAAAM) aims to promote open and practical contributions that improve our ability to reason about offensive hardware security. Over the past decade, we've observed the repeated discovery of real-world hardware vulnerabilities. No longer a theoretical exercise, hardware attacks are developed by multinational corporations and nation states with devastating consequences. Responses from academia, industry, and government agencies have grown as a result. However, we still fall short in our ability to defend against creative malicious actors. This is in part due to an existing gap between academic threat modelling and PoCs versus end-to-end attacks. WHAAAM seeks to bridge this gap in the hardware security research community by seeking open and artifact-driven submissions that grow our understanding of practical attacker capabilities, as well as robust responses based on empirical root-cause analysis and quantitative metrics. Importantly, as opposed to competition-based hardware security events (Hack@DAC, IEEE HOST) that have limited focus, WHAAAM encourages a diverse and creative outlet for student researchers to demonstrate a range of cutting-edge work in a hands-on environment.
Learn more at https://sites.google.com/view/whaaam
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionNeuro-Symbolic AI (NSAI) is an emerging paradigm that integrates neural networks with symbolic reasoning to enhance the transparency, reasoning capabilities, and data efficiency of AI systems. Recent NSAI systems have gained traction due to their exceptional performance in reasoning tasks and human-AI collaborative scenarios. Despite these algorithmic advancements, executing NSAI tasks on existing hardware (e.g., CPUs, GPUs, TPUs) remains challenging, due to their heterogeneous computing kernels, high memory intensity, and unique memory access patterns. Moreover, current NSAI algorithms exhibit significant variation in operation types and scales, making them incompatible with existing ML accelerators. These challenges highlight the need for a versatile and flexible acceleration framework tailored to NSAI workloads.
In this paper, we propose NSFlow, an FPGA-based acceleration framework designed to achieve high efficiency, scalability, and versatility across NSAI systems. NSFlow features a design architecture generator that identifies workload data dependencies and creates optimized dataflow architectures, as well as a reconfigurable array with flexible compute units, re-organizable memory, and mixed-precision capabilities. Evaluating across NSAI workloads, NSFlow achieves 31× speedup over Jetson TX2, more than 2× over GPU, 8× speedup over a TPU-like systolic array, and more than 3× over Xilinx DPU. NSFlow also demonstrates enhanced scalability, with only a 4× runtime increase when symbolic workloads scale by 150×. To the best of our knowledge, NSFlow is the first framework to enable real-time acceleration of generalizable NSAI algorithms, demonstrating a promising solution for next-generation cognitive systems.
Research Manuscript


Systems
SYS5: Embedded Memory and Storage Systems
DescriptionDeep Neural Networks are increasingly leveraging sparsity to reduce the scaling up of model parameter size. However, reducing wall-clock time through sparsity and pruning remains challenging due to irregular memory access patterns, leading to frequent cache misses. In this paper, we present NPU Vector Runahead (NVR), a prefetching mechanism tailored for NPUs to address cache miss problems in sparse DNN workloads. Rather than optimising memory patterns with high overhead and poor portability, NVR adapts runahead execution to the unique architecture of NPUs. NVR provides a general micro-architectural solution for sparse DNN workloads without requiring compiler or algorithmic support, operating as a decoupled, speculative, lightweight hardware sub-thread alongside the NPU, with minimal hardware overhead (under 5%). NVR achieves an average 90% reduction in cache misses compared to SOTA prefetching in general-purpose processors, delivering 4x average speedup on sparse workloads versus NPUs without prefetching. Moreover, we investigate the advantages of incorporating a small cache (16KB) into the NPU combined with NVR. Our evaluation shows that expanding this modest cache delivers 5x higher performance benefits than increasing the L2 cache size by the same amount.
Research Manuscript


Security
SEC4: Embedded and Cross-Layer Security
DescriptionThe Controller Area Network (CAN) bus is a cornerstone of modern vehicles, orchestrating functions from engine control to auxiliary systems. However, its lack of inherent security measures makes it vulnerable to cyberattacks. Accurately mapping CAN signals with car-control actions is critical for detecting security breaches, as it allows pinpointing potential vulnerabilities exploited to compromise vehicular functions. Despite this, existing CAN reverse engineering methods struggle to achieve bit-level resolution due to the huge search space of unique IDs and payload combinations. To address this challenge, we propose a systematic framework for reverse engineering CAN bus messages, achieving precise mapping of control bits in CAN frames to car-control actions. The framework was validated on Tesla Model 3, Leapmotor C10 and C11, demonstrating its versatility across different vehicle platforms. In particular, it successfully identified 43 car-control actions on the Tesla Model 3, showcasing its extensive coverage. Furthermore, its low resource consumption enables seamless integration into compact platforms like the Raspberry Pi, supporting practical deployment in real-world automotive systems.
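The core difficulty is localising, at bit granularity, which payload bits encode a given car-control action. One way to build intuition (a toy differential analysis, not the authors' framework) is to compare frames for one CAN ID captured while the action is active against an idle baseline, keeping only the bits that are stable in both captures but take different values.

    def candidate_control_bits(idle_frames, action_frames):
        """Flag payload bits that are constant at idle but differ when the
        car-control action is active (toy analysis over 64-bit payloads)."""
        def stable_bits(frames):
            base, mask = frames[0], 0xFFFFFFFFFFFFFFFF
            for f in frames[1:]:
                mask &= ~(f ^ base) & 0xFFFFFFFFFFFFFFFF   # keep bits that never change
            return base, mask

        idle_val, idle_mask = stable_bits(idle_frames)
        act_val, act_mask = stable_bits(action_frames)
        return idle_mask & act_mask & (idle_val ^ act_val)  # stable in both, value differs

    # Hypothetical captured payloads for one CAN ID.
    idle = [0x00000000000000A0, 0x00000000000000A0]
    action = [0x00000000000000A4, 0x00000000000000A4]
    print(hex(candidate_control_bits(idle, action)))        # -> 0x4: bit 2 is a candidate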
Research Manuscript


EDA
EDA1: Design Methodologies for System-on-Chip and 3D/2.5D System-in Package
DescriptionWhile multi-chiplet based many-core systems have emerged as a viable solution for heterogeneous integration and addressing manufacturing and technological challenges in the post-Moore's Law era, their design and optimization remain highly complex and challenging. Among the various subsystems, the cache hierarchy has significant implications for overall system performance, yet its vast design space presents substantial optimization challenges. This complexity arises from factors such as the large number of chiplets in the system, the number of cores per chiplet, memory hierarchy variations, cache size variability, caching strategies, and inter-chiplet interconnection networks. Existing design space exploration methods, such as NN-Baton and IntLP, fail to optimize the cache subsystem performance or thoroughly explore the design space. To address these limitations, we propose a novel design space exploration method for cache subsystem optimization. Our approach models cache miss rates and network latency as functions of cache hierarchy and inter-/intra-chiplet interconnection network parameters. We then define an optimization problem to minimize the concurrent average memory access time (C-AMAT) under cost and power consumption constraints. This problem is addressed using a bilevel optimization algorithm, which iteratively solves two independent subproblems: (1) cache subsystem optimization, and (2) inter-chiplet interconnection network optimization. Experimental results show that our method reduces the application execution time by 39.7% and 39.2%, on average, compared to architectures similar to AMD Zen 4 and Intel Sapphire Rapids, respectively, and by 25.91% over IntLP. These results underscore the potential of the proposed method for optimizing cache subsystems in future multi-chiplet-based many-core systems.
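As a reminder of the objective being minimised, C-AMAT generalises the classical average memory access time with concurrency terms; the formulation below follows the commonly cited C-AMAT definition, and the authors' exact variant may differ.

    \mathrm{AMAT} = H + \mathrm{MR}\times\mathrm{AMP}, \qquad
    \text{C-AMAT} = \frac{H}{C_H} + \mathrm{pMR}\cdot\frac{\mathrm{pAMP}}{C_M}

where H is the hit time, MR and AMP the miss rate and average miss penalty, C_H and C_M the hit and pure-miss concurrency, and pMR and pAMP the pure miss rate and pure average miss penalty.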
Engineering Poster
Networking


DescriptionWith the development of new high-speed interconnect technologies and applications, along with the introduction of 3D packaging, chip design and simulation have become increasingly complex. In high-performance optoelectronic chips, the quality of electrical signals directly determines the modulation performance of optical signals, and the impact of power noise from chiplets, packaging, and PCBs on signal integrity has become increasingly significant.
Traditional IBIS workflows have limitations in addressing complex issues such as on-chip power decoupling. To address this, we propose a new simulation workflow based on the IBIS Model and Chip Power Model (CPM) to analyze the coupled effects of signals and power. By integrating a TSV SPICE model and a Chip Macro Model (CMM) for analog circuits, we construct a comprehensive CPM. Combined with the extracted S-parameters of the packaging and PCB, this enables system-level simulations that holistically consider the sign-off factors affecting signal integrity.
This workflow is faster and more practical. Experimental results demonstrate that when CPM-based power noise simulation is kept within a certain range, the eye diagram of signal integrity shows significant improvement and exhibits stronger correlation with the actual chip performance. This underscores the effectiveness and importance of this method in identifying and resolving signal integrity issues.
Networking
Work-in-Progress Poster


DescriptionGray code, a voltage-level-to-data-bit translation scheme, is widely used in QLC SSDs. However, it causes the four data bits in QLC to exhibit significantly different read and write performance with up to 8x latency variation, severely impacting the worst-case performance of QLC SSDs. Although the state-of-the-art approach combines multiple Gray codes to address this, it requires additional circuit overhead and introduces extra programming latency. Our preliminary experiments have identified the root cause of the performance degradation as the unidirectional programming method. This method always programs data into fast bits first then slow bits, leading to poor performance when hot data arrives after cold data.
In this paper, we propose BDP, a novel Bi-Directional Programming scheme that combines both the normal (forward) and reverse programming directions. To avoid extra circuit overhead introduced by multiple Gray codes, we first conduct a theoretical analysis of the ideal performance of various Gray codes to select the most suitable one for the BDP scheme. Second, we introduce a hotness-aware data allocation scheme to judiciously manage hot and cold data by assigning them to the fast and slow bits of QLC, respectively. Third, we propose a background data migration strategy to prevent sharp performance declines when data temperature changes. BDP is integrated into FTL (Flash Translation Layer) and allows the flash controller to enable runtime programming direction arbitration. Experimental results show that BDP outperforms the state-of-the-art solution, achieving an average 26.2% reduction in read latency and an average 50.3% reduction in program latency.
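For background, the Gray-code mapping that creates the fast/slow page-bit asymmetry can be reproduced in a few lines: adjacent voltage levels differ in exactly one bit, so the chosen code dictates which of the four QLC page bits flips at which level transition, and bits involved in more transitions need more read reference voltages and are therefore slower. The sketch below uses the standard reflected binary code, which is not necessarily the code BDP selects.

    def reflected_gray(n_bits=4):
        """Reflected binary Gray code: voltage level i -> codeword i ^ (i >> 1)."""
        return [i ^ (i >> 1) for i in range(1 << n_bits)]

    codes = reflected_gray(4)                 # 16 QLC voltage levels, 4 page bits each
    for lvl in range(1, 16):
        flipped = codes[lvl] ^ codes[lvl - 1]
        assert bin(flipped).count("1") == 1   # neighbouring levels differ in one bit
        print(f"level {lvl - 1:2d} -> {lvl:2d}: page bit {flipped.bit_length() - 1} flips")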
Ancillary Meeting


DescriptionThe OpenAccess Coalition Forum features technical innovations from established industry experts, ranging from curvilinear and 3D design to multi-threading performance improvements enabled by the OpenAccess API. Presenters from coalition member companies, representing successful start-ups, international design companies, and academia, will discuss the strengths of OpenAccess technology and the value of membership. The lunch forum targets system designers and architects, logic and circuit designers, validation engineers, CAD managers, researchers, and academicians.
OpenAccess is an extensible API on top of a managed multi-user design database that enables the interoperability required by hybrid design flows. This API and database are kept synchronized with other EDA tools by the OpenAccess Coalition through ongoing collaboration among the top EDA suppliers, IP providers, design companies, and the Integrator Company.
Register now! Space is limited, and registration is required for complimentary lunch. For more details go to si2.org
How It All Got Started, Andy Graham, Founder of Si2
Curvilinear Design and Application, Aki Fujimura (CEO), Design 2 Silicon
How to Improve Performance and Connect with EDA SW Using the Open-Access API, Yong-Hwan Jeon (Jason), SK-hynix
A New AI-based Area Router, Ed Gernert (Owner), Frontier Design
Fast Connectivity Highlighting with Polygon Operators Extension, Larg Weiland, PDF Solutions
3D Design with Your Present Tools, Rhett Davis, NC State
Moderator, Marshall Tiner, Senior Director of OpenAccess, Si2
Networking
Work-in-Progress Poster


DescriptionAssertions are essential for hardware verification but are typically generated manually, leading to long development cycles. While commercial Large Language Models (LLMs) like GPT-4 show promise for automating assertion generation, they raise concerns about IP privacy and data confidentiality. This paper proposes OpenAssert, an approach for generating assertions locally using open-source LLMs. We enhance these models with Retrieval-Augmented Generation (RAG) to reduce errors and hallucinations. OpenAssert improves ROUGE-1 score by up to 44% and cosine similarity by 49%, reduces word error rate by 43.4%, and outperforms GPT-4 by 23.7%, with 100% line coverage in our evaluation.
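At a high level, the retrieval-augmented flow prepends the most similar previously verified RTL/assertion pairs to the prompt before calling the locally hosted model. The sketch below uses a toy bag-of-words retriever and a stubbed generate() call; the function names, prompt format, and corpus are illustrative assumptions, not OpenAssert's actual interface.

    from collections import Counter
    import math

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def retrieve(query, corpus, k=2):
        """Return the k (rtl, assertion) pairs most similar to the query RTL."""
        q = Counter(query.split())
        return sorted(corpus, key=lambda ex: cosine(q, Counter(ex[0].split())), reverse=True)[:k]

    def generate(prompt):
        """Placeholder for a locally hosted open-source LLM call (assumption)."""
        return "assert property (@(posedge clk) req |-> ##[1:3] ack);"

    corpus = [
        ("always @(posedge clk) if (req) ack <= 1;", "assert property (@(posedge clk) req |-> ack);"),
        ("always @(posedge clk) cnt <= cnt + 1;", "assert property (@(posedge clk) !$isunknown(cnt));"),
    ]
    rtl = "always @(posedge clk) if (req) ack <= 1;"
    examples = "\n".join(f"// RTL: {r}\n// SVA: {a}" for r, a in retrieve(rtl, corpus))
    print(generate(f"{examples}\n// RTL: {rtl}\n// SVA:"))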
Networking
Work-in-Progress Poster


DES5: Emerging Device and Interconnect Technologies
DescriptionGain Cell (GC) memory offers higher density and lower power than SRAM, making it promising for on-chip cache applications. GC memory supports a wide range of retention times, adjustable through transistor design (e.g., threshold voltage) and operating voltage. Designing GC memory subsystems, however, is time-intensive. This paper introduces OpenGC, an open-source GC memory compiler that generates DRC- and LVS-clean GC memory layouts and provides area, delay, and power simulations based on user-specified configurations. OpenGC enables fast, accurate, and optimized GC memory block generation, reducing design time, ensuring process compliance, and delivering performance-tailored solutions for diverse applications.
Networking
Work-in-Progress Poster


DescriptionThe design of macro cells is a highly manual and time-consuming process. It represents a significant challenge, as it requires the consideration of numerous design variables and constraints, as well as the exploration of trade-off relationships to achieve an optimal design. Over the past decade, a lot of research has been conducted to develop automated and optimized macro cell designs; however, these approaches have shown limited performance in terms of unit area and routing success rate. In this paper, we propose a novel graph-based placement and routing methodology for designing macro cells. In order to design macro cells with optimal area, we introduce a Recursive Multi-Level Steiner Tree methodology, as well as a placement methodology based on hypergraph partitioning and combinatorial optimization. The proposed method resulted in an average area reduction of 7.8 percent compared to the manual results obtained by layout experts, with a maximum reduction of 30.8 percent. With regard to routing quality, the proposed method achieved LVS- and DRC-clean results for all macro cells and demonstrated a significant average reduction of 40.2 percent in Metal1 track usage, with a maximum reduction of 87.3 percent. In conclusion, the implementation of the proposed optimization process for the macro cell can reduce the design time from 2.96 person-months to 1 person-month.
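For intuition about the wiring objective, the Steiner-tree component can be approximated in a few lines by a rectilinear minimum spanning tree over the pin locations; classically, the MST length is at most 1.5x the optimal rectilinear Steiner tree length. The paper's Recursive Multi-Level Steiner Tree construction is considerably more sophisticated, so the sketch below is only the textbook baseline.

    def manhattan_mst(pins):
        """Prim's algorithm under Manhattan distance; returns edges and total wirelength.
        Used as a stand-in baseline: the MST is a classical 1.5x approximation of the
        optimal rectilinear Steiner tree over the same pins."""
        dist = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
        in_tree, edges, total = {0}, [], 0
        while len(in_tree) < len(pins):
            u, v = min(((i, j) for i in in_tree for j in range(len(pins)) if j not in in_tree),
                       key=lambda e: dist(pins[e[0]], pins[e[1]]))
            in_tree.add(v)
            edges.append((u, v))
            total += dist(pins[u], pins[v])
        return edges, total

    pins = [(0, 0), (4, 1), (1, 5), (5, 5)]
    print(manhattan_mst(pins))   # ([(0, 1), (1, 3), (3, 2)], 14)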
Networking
Work-in-Progress Poster


DescriptionOpti-SpiSSL is a reconfigurable hardware framework that optimizes Spiking Self-Supervised Learning (SSL) on heterogeneous System-on-Chip (SoC) platforms, addressing the challenges of asynchronous, event-driven processing. Leveraging spiking-SSL-based optimizations such as neuron density, memory flow, and operator fusion with automated code generation and adaptive acceleration, Opti-SpiSSL efficiently balances performance, power, and area through design space exploration (DSE) across FPGA and ASIC. It achieves a 28.7% FPS improvement, 37.5% gate reduction, 57.75% area reduction, and 38% resource utilization enhancement, with a 0.41s reconfiguration time. Compared to the state of the art, Opti-SpiSSL improves FPS by 31.9% and reduces energy consumption by 28.8% with low resource usage, offering a scalable, power-efficient solution for next-generation architectures.
Networking
Work-in-Progress Poster


DES5: Emerging Device and Interconnect Technologies
DescriptionThis study explores the potential of Back-side Contact (BSC) technology for enhancing Power, Performance, and Area (PPA) in integrated circuits. Through Design-Technology-Circuit co-optimization (DTCO) experiments on a 32-bit RISC-V core, the research demonstrates significant PPA gains by optimizing front and back-side metal stack and layer purpose allocation. Key findings include an average 2.7% power reduction per added back-side layer for clock routing, a 5.5% frequency improvement with combined front and back-side clock routing, and up to 17% power reduction or 10% frequency increase through DTCO-based metal stack optimization for low-power and high-performance targets, respectively.
Engineering Presentation


Front-End Design
Chiplet
DescriptionPower estimation at the Register-Transfer Level (RTL) during System-on-Chip (SoC) design is a fast way to estimate power early in the chip design process, but it has suffered from inaccuracies. There have been many proposals to improve this, but they are limited in that they do not provide a specific method for generating the input vectors, which are a large part of the power prediction. In order to improve the accuracy of RTL-based power prediction, this paper presents an optimal methodology to generate the input vectors required for power measurement, and power prediction experiments confirm an error rate within 5% between gate-level and RTL results. The highly correlated RTL-based power estimation results relative to the gate level can ensure the accuracy of average power and IR-drop measurements later in the chip design cycle and shorten the power estimation period for the entire chip by 1-1.5 months compared to gate-level power estimation.
Engineering Presentation


AI
IP
Chiplet
DescriptionIntegrating algorithms into sensors poses a significant challenge for designers. Technology nodes are chosen for analog requirements, leaving limited space for the digital portion, which cannot take advantage of technology scaling. Digital designers are challenged to implement high-accuracy algorithms on ASIC while minimizing area and power consumption. Our proposal introduces a flow based on modular blocks called "Bricks", which have both SystemC and Python views. This approach allows the analysis of how to best connect and configure these Bricks using Python libraries, while also providing feedback to the High-Level Synthesis (HLS) tool. This Python-HLS loop optimizes power, area, latency, and accuracy in a multi-objective optimization process. The proposed flow reduces time to market and enables the best choice among various options. By leveraging Python, the flow offers flexibility and configurability, allowing for the utilization of a wide range of libraries and tools. Continuous feedback to the HLS tool facilitates rapid iterations and incremental improvements.
The results in a specific case for a people counter on an infrared image sensor demonstrate the effectiveness of the flow in providing excellent solutions in terms of both accuracy and area, significantly reducing the IP development time.
Engineering Poster
Networking


DescriptionNearly all industries are striving to implement AI-powered solutions to significantly enhance performance across various workflows and functional groups. However, this shift brings a new, emerging issue. AI-driven EDA tools and workflows, while promising to enhance design processes, will substantially increase data volume and provisioning needs due to their dependence on large datasets for training and operation. These tools often magnify requirements by several orders of magnitude. Therefore, an effective method to optimize network storage is essential to manage the data explosion caused by AI-enabled EDA workflows.
In this proposal, we go over the challenges and demonstrate how a smart caching agent solution can provide maximum network storage optimization and the best performance when it comes to managing large datasets generated through such AI-powered EDA deployments.
Engineering Poster
Networking


DescriptionThe rapid pace of product development necessitates minimizing time to market, often constraining the time available for Physical Design and Signoff (PDN, STA, and Physical Verification). This challenge is exacerbated as the semiconductor industry transitions to advanced technology nodes such as 5nm and 3nm, where robust PDN is essential for ensuring chip reliability in diverse applications, including laptops, cloud servers, AI, mobile devices, automotive systems, medical equipment, and wearables. The increasing design complexity and stringent performance requirements necessitate extensive PDN checks across multiple Voltage-Controlled Domains (VCDs), functional and test modes, and process-voltage-temperature (PVT) corners, significantly extending simulation and debugging times. Addressing Electromigration (EM) and IR drop issues within compressed timelines, while managing the high volume of intricate designs, presents substantial challenges, particularly given the specialized nature of PDN and the limited availability of skilled engineers in this domain.
To mitigate these challenges, the "Smart PDN Framework" has been developed and implemented. This advanced solution enhances efficiency by facilitating comprehensive PDN checks, ensuring the integrity of simulation runs, and continuously monitoring the PDN status of each design block across all checks. By streamlining the debugging process and expediting fixes, the Smart PDN Framework significantly improves turnaround time and signoff quality, thereby meeting the critical demands of contemporary product development cycles.
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
DescriptionHigh-Level Synthesis (HLS) excels at handling compute-intensive loops with straightforward control but struggles to identify parallelism in kernels with complex and irregular control-flow. To address this, novel scheduling techniques based on speculation have been introduced. While these methods outperform traditional static scheduling, they also introduce significant area overhead, particularly in the rollback control logic. Optimizing the cost of this rollback control logic remains an open challenge. In this work, we show how it is possible to simplify and/or eliminate rollback logic using a combination of static analysis and linear programming. Our results show improvements in both execution throughput and area cost.
Research Manuscript


Design
DES6: Quantum Computing
DescriptionWindowed arithmetic [Gid19] uses precomputed lookup tables (LUTs) to reduce quantum-classical multiplication costs, achieving state-of-the-art resource estimates for integer factoring [GE21]. We introduce four optimizations to enhance this approach, focusing on efficient uncomputation-of-lookups (unlookups), minimizing lookups, and reducing table entries. These improvements reduce unlookup costs by ~50% for factoring, yielding a 16% runtime reduction with only a 12% qubit increase for RSA-2048. Our techniques provide broad utility in improving depth, runtime, and gate costs of quantum LUTs and windowed arithmetic, enabling significant performance gains across diverse quantum algorithms.
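As background, windowed multiplication of an n-bit quantum register x by a classical constant k modulo N replaces single-bit controlled additions with table lookups over w-bit windows. In generic notation (following the standard presentation rather than the paper's exact conventions):

    k \cdot x \bmod N \;=\; \Big( \sum_{j=0}^{\lceil n/w \rceil - 1} \big( k\, x_j\, 2^{wj} \bmod N \big) \Big) \bmod N,
    \qquad x = \sum_{j} x_j\, 2^{wj}, \quad 0 \le x_j < 2^{w}

Each window x_j addresses a precomputed table T_j[v] = k v 2^{wj} mod N with 2^w entries, whose output is added into the accumulator and must later be uncomputed; these unlookups are the cost component the optimizations above target.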
Networking
Work-in-Progress Poster


DescriptionResource-constrained scheduling is a fundamental and NP-hard problem in high-level synthesis, crucial for optimizing the performance and efficiency of hardware designs. Despite significant advances, existing exact methods, such as holistic modeling approaches (e.g., time-indexed ILP/MILP) and iterative search techniques (e.g., SDC-SAT), continue to struggle with scalability, limiting their applicability to large-scale problems.
This paper proposes leveraging the inherent structural properties of resource binding, such as symmetry and redundancy, to segment the NP-hard RCS problem into three subproblems: unconstrained scheduling, resource sharing, and operation ordering. Additionally, we design and implement a new scheduling algorithm centered on operation order to address these subproblems. This algorithm optimizes the search process by focusing on operation order issues and strategically bypasses resource sharing calculations through the use of symmetry and redundancy, thus significantly improving the search efficiency.
Experimental results validate the proposed method's superiority, achieving at least average speedups of 71.96x and 19.22x over two state-of-the-art methods. This work presents a novel perspective on exact scheduling methodologies, offering a scalable and efficient solution for High-Level Synthesis (HLS) challenges.
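For orientation, the underlying resource-constrained scheduling problem can be illustrated with the classical list-scheduling heuristic: issue ready operations cycle by cycle, subject to per-cycle resource limits. The paper's exact, operation-order-centred algorithm goes well beyond this greedy baseline; the sketch (unit latencies, alphabetical priority) is purely illustrative.

    def list_schedule(ops, deps, resource, limits):
        """Greedy resource-constrained list scheduling with unit latencies.
        ops: operation names; deps: op -> predecessors; resource: op -> unit type;
        limits: unit type -> instances available per cycle. Returns op -> start cycle."""
        start, remaining, cycle = {}, set(ops), 0
        while remaining:
            used = {r: 0 for r in limits}
            for op in sorted(remaining):                   # toy priority: name order
                ready = all(p in start and start[p] < cycle for p in deps.get(op, []))
                if ready and used[resource[op]] < limits[resource[op]]:
                    start[op] = cycle
                    used[resource[op]] += 1
            remaining -= set(start)
            cycle += 1
        return start

    deps = {"c": ["a", "b"], "d": ["c"]}
    res = {"a": "mul", "b": "mul", "c": "add", "d": "add"}
    print(list_schedule("abcd", deps, res, {"mul": 1, "add": 1}))
    # -> {'a': 0, 'b': 1, 'c': 2, 'd': 3} with one multiplier and one adder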
Networking
Work-in-Progress Poster


DescriptionTailoring memory controller policies to user tasks using reinforcement learning tuners has shown significant potential for improving energy efficiency. However, learning inefficiency and catastrophic forgetting remain key challenges, introducing overhead and limiting effectiveness. We propose OT-CRL, a continual reinforcement learning framework for online memory controller tuning. OT-CRL integrates continual learning (CL) to prevent forgetting and leverages independent reinforcement learning, assigning each parameter to a separate agent to enhance efficiency. Evaluations on various workloads show OT-CRL achieves up to 33% better performance than non-CL methods and improves learning efficiency by over an order of magnitude.
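As a toy illustration of the independent-agent idea (one tuner per parameter), not the OT-CRL implementation: each hypothetical memory-controller parameter gets its own epsilon-greedy bandit, and all agents are updated from a shared reward.

    import random

    class ParamAgent:
        # One epsilon-greedy bandit per tunable parameter (illustrative only).
        def __init__(self, values, eps=0.1):
            self.values = values
            self.q = {v: 0.0 for v in values}
            self.n = {v: 0 for v in values}
            self.eps = eps
        def act(self):
            if random.random() < self.eps:
                return random.choice(self.values)
            return max(self.values, key=lambda v: self.q[v])
        def update(self, v, reward):
            self.n[v] += 1
            self.q[v] += (reward - self.q[v]) / self.n[v]

    # Hypothetical parameters; each is assigned to a separate agent.
    agents = {"page_policy": ParamAgent(["open", "closed"]),
              "queue_depth": ParamAgent([16, 32, 64])}
    config = {name: a.act() for name, a in agents.items()}
    reward = 1.0   # in practice, derived from measured performance/energy
    for name, a in agents.items():
        a.update(config[name], reward)
    print(config)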
Networking
Work-in-Progress Poster


DescriptionArchitects and designers write C/C++ models to test out their algorithms at the early stages, before the RTL design or the testbench to verify it is ready. Verification engineers therefore need to prove equivalence between the design and the golden C/C++ model. The criticality of this equivalence verification of data-path designs (RTL) against their reference high-level C/C++ models is well accepted. Simulation approaches like DPI or scoreboarding can achieve verification completeness only to some extent. The means of verifying this equivalence is gradually shifting from simulation to formal, thanks to new solver capabilities available in formal tools and the exhaustive nature of formal analysis.
However, complex algorithms often run into the challenge of compiling, i.e., building a formal model from C++, in reasonable time. Even when the C++ models compile, achieving proof convergence on the equivalence-check targets is often a struggle. While formal methods can be effective, the right methodology is required to overcome these challenges and scale to larger, more complex designs. In this paper, we propose several generic techniques that have helped us generate formal models for complex image-processing algorithms, such as FFT and decompression units, in a couple of minutes and converge on properties that were impossible to prove out of the box. In the process, we also uncovered bugs that traditional verification methods could not catch. All of these techniques are reusable, and this paper details them using the FFT and decompression algorithms as examples.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionActivation outliers in Large Language Models (LLMs), which exhibit large magnitudes but small quantities, significantly affect model performance and pose challenges for the acceleration of LLMs. To address this bottleneck, researchers have proposed several co-design frameworks with outlier-aware algorithms and dedicated hardware. However, they face challenges balancing model accuracy with hardware efficiency when accelerating LLMs in a low bit-width manner. To this end, we propose OutlierCIM, the first algorithm and hardware co-design framework for a compute-in-memory (CIM) accelerator with an outlier-aware quantization algorithm. The key contributions of OutlierCIM are 1) an outlier-clustered tiling strategy that regulates memory access and reduces inefficient workloads, both of which are introduced by outliers, 2) a hybrid-strategy quantization and a reconfigurable double-bit CIM macro array that overcome the low storage utilization and high latency of outlier-based LLM quantization, and 3) a quantization factor post-processing strategy and a dedicated quantizer that efficiently unify the multiplication and accumulation of outlier-caused FP-INT workloads. Implemented in a 28nm CMOS technology, OutlierCIM occupies an area of 2.25 mm². When evaluated on comprehensive benchmarks, OutlierCIM achieves up to 4.54× energy efficiency improvement and 3.91× speedup compared to state-of-the-art outlier-aware accelerators.
Research Manuscript


Design
DES2A: In-memory and Near-memory Computing Circuits
DescriptionAs traditional electronic hardware encounters the limitations of Moore's Law, optical computing is emerging as a promising alternative, delivering high data transmission rates, especially beneficial for big data and AI applications. Photonic accelerators, such as the Lightening-Transformer, utilize optical analog signals to accelerate Transformer-based models, achieving exceptional speed and low energy consumption. However, controlling modern optical intensity modulators (e.g., Mach-Zehnder Modulators) requires using electrical analog signals (e.g., voltage values) to adjust the optical signal intensity for realizing optical-based vector inner product calculations. Managing this modulation consumes significant power, as it involves selecting optimal electrical values through an electrical controller and converting digital signals to analog using digital-to-analog converters (DACs). In this work, we introduce P-DAC, a solution designed to reduce DAC power consumption, significantly enhancing the energy efficiency of optical accelerators for Transformer models.
Networking
Work-in-Progress Poster


DescriptionControl-flow speculation enables out-of-order processors to utilize large scheduling windows but wastes energy on wrong-path instructions when mispredictions occur.
Prior speculation control methods suffer significant performance loss in modern architectures.
We propose Pacemaker, a microarchitecture for energy-efficient scheduling window management with negligible performance impact.
Pacemaker dynamically adjusts confidence thresholds for each branch and preemptively restores scheduling windows to maintain performance.
It reduces energy by 18.5% with only 2.0% performance loss, offering 7.5% better energy efficiency than the latest methods.
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionWeight-only quantization has been widely explored in large language models (LLMs) to reduce memory storage and data loading overhead. During deployment on single-instruction-multiple-threads (SIMT) architectures, weights are stored in low-precision integer (INT) format, while activations remain in full-precision floating-point (FP) format to preserve inference accuracy. Although memory footprint and data loading requirements for weight matrices are reduced, computation performance gains remain limited due to the need to convert weights back to FP format through unpacking and dequantization before GEMM operations. In this work, we investigate methods to accelerate GEMM operations involving packed low-precision INT weights and high-precision FP activations, defining this as the hyper-asymmetric GEMM problem. Our approach co-optimizes tile-level packing and dataflow strategies for INT weight matrices. We further design a specialized FP-INT multiplier unit tailored to our packing and dataflow strategies, enabling parallel processing of multiple INT weights. Finally, we integrate the packing, dataflow, and multiplier unit into PacQ, a SIMT microarchitecture designed to efficiently accelerate hyper-asymmetric GEMMs. We show that PacQ can achieve up to 1.99x speedup and 81.4% reduction in EDP compared to weight-only quantized LLM workloads running on conventional SIMT baselines.
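For context only, here is a small Python/NumPy sketch of the baseline path the abstract describes: INT4 weights packed two per byte must be unpacked and dequantized back to floating point before a conventional GEMM. The packing layout and scale are illustrative assumptions, not PacQ's tile-level scheme.

    import numpy as np

    def pack_int4(q):                      # q: int16 values in [-8, 7], even length
        u = (q + 8).astype(np.uint8)
        return (u[0::2] << 4) | u[1::2]    # two 4-bit weights per byte

    def unpack_dequant(packed, scale):     # unpack, then dequantize to FP32
        hi = (packed >> 4).astype(np.int16) - 8
        lo = (packed & 0xF).astype(np.int16) - 8
        q = np.empty(hi.size + lo.size, dtype=np.int16)
        q[0::2], q[1::2] = hi, lo
        return q.astype(np.float32) * scale

    q = np.array([-8, 7, 3, -2], dtype=np.int16)
    print(unpack_dequant(pack_int4(q), scale=0.05))   # approx [-0.4, 0.35, 0.15, -0.1]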
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionLarge-scale deep neural networks (DNNs) exhibit excellent performance for various tasks. As DNNs and datasets grow, distributed training becomes extremely time-consuming and demands larger clusters. A main bottleneck is the resulting gradient aggregation overhead. While gradient compression and sparse collective communication techniques are commonly employed to alleviate network load, many gradient compression schemes do not achieve acceleration of the training process while also preserving accuracy. This paper introduces PacTrain, a novel framework that accelerates distributed training by combining pruning with sparse gradient compression. Active pruning of the neural network makes the model weights and gradients sparse. By ensuring global knowledge of the gradient sparsity among all distributed training workers, we can perform lightweight compression communication without harming accuracy. We show that the PacTrain compression scheme achieves a near-optimal compression strategy while remaining compatible with the all-reduce primitive. Experimental evaluations show that PacTrain improves training throughput by 1.25 to 8.72$\times$ compared to state-of-the-art compression-enabled systems for representative vision and language model training tasks under bandwidth-constrained conditions.
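A minimal NumPy sketch of the key property the abstract relies on, assuming all workers share one global pruning mask: surviving gradient entries can be packed into a short dense vector and summed with an ordinary all-reduce, with no index exchange. The mask and gradients are made up, and the plain sum below stands in for the all-reduce primitive.

    import numpy as np

    mask = np.array([1, 0, 0, 1, 1, 0, 0, 1], dtype=bool)   # global pruning mask

    def pack(grad):                  # per-worker compression: keep surviving entries
        return grad[mask]

    def unpack(packed):              # restore full shape after the reduction
        full = np.zeros(mask.shape, dtype=packed.dtype)
        full[mask] = packed
        return full

    worker_grads = [np.random.randn(8) for _ in range(4)]
    reduced = sum(pack(g) for g in worker_grads)   # stands in for all-reduce
    print(unpack(reduced))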
Networking
Work-in-Progress Poster


DescriptionIn Multi-Objective Goal Attainment Optimization, balancing exploration and exploitation is critical, as focusing only on improving current solutions can lead to local optima. This study introduces a new approach that combines goal-oriented search with Pareto optimization to systematically explore candidates that may not immediately improve the current solution but could help achieve the goal. This Pareto-assisted exploration is combined with a transformer neural network to provide strong problem-solving capabilities while enabling sample-efficient search in high-dimensional spaces.
Tested on a complex analog class AB amplifier circuit (Huijing amplifier) with 44 parameters, goals were derived from a Pareto front generated via large-scale Monte Carlo simulations. The results show that the approach can efficiently find solutions in high-dimensional spaces without the pre-processing that other methods require.
Research Manuscript


Design
DES3: Emerging Models of ComputatioN
DescriptionPairwise queries have been widely used in various applications. However, existing methods still face challenges in handling Concurrent Pairwise Queries (CPQ) due to irregular memory access and fragmented data sharing. To address this, we propose PairGraph, an accelerator that identifies frequently traversed graph structure data and fully reuses the data worth sharing to minimize off-chip communications. Experimental results show that PairGraph achieves 5.59×∼14.25× and 3.76×∼7.58× speedups over the software systems Gemini and Gunrock, respectively. It also outperforms cutting-edge accelerators (LCCG, ScalaGraph, and ReGraph), achieving speedups of 1.67×∼2.72×, 1.93×∼4.26×, and 2.66×∼4.28×, respectively.
Networking
Work-in-Progress Poster


DescriptionBlockchain database systems, such as Bitcoin and Ethereum, suffer from limited memory bandwidth and high memory access latency when retrieving user-requested data. Processing-in-memory (PIM) is promising to accelerate users' queries, by enabling low-latency memory access and aggregated memory bandwidth scaling with the number of PIM modules. We present Panther, the first PIM-based blockchain database system supporting efficient verifiable queries. Blocks are distributed to PIM modules for high parallelism with low inter-PIM communication cost, managed by a learning-based model. For load balance across PIM modules, data are adaptively promoted and demoted between the host and PIM sides. In multiple datasets, Panther achieves up to 23.6× speedup for verifiable queries and reduces metadata storage by orders of magnitude compared to state-of-the-art designs.
Research Manuscript


EDA
EDA2: Design Verification and Validation
DescriptionCombinational Equivalence Checking (CEC) is a crucial technique in electronic design automation for verifying the functional equivalence of combinational circuits. Recently, combinational circuit design increasingly incorporates more complex arithmetic structures, commonly known as datapath circuits. However, existing state-of-the-art tools often exhibit subpar performance in solving datapath CEC problems. To further advance the exploration of datapath CEC, this study introduces PDP-CEC (Parallel Dynamic Partitioning Combinational Equivalence Checking), a novel parallel CEC approach integrating circuit partitioning and dynamic task scheduling into the CEC process, enhancing the efficiency of CEC for datapath circuits. PDP-CEC introduces an innovative method for selecting critical nodes to split the search space of the CEC problem, facilitating the efficient generation of numerous independent subproblems. Meanwhile, a dynamic task scheduling strategy is implemented in PDP-CEC to ensure load balancing and prevent hard-to-solve subproblems from stalling the entire process. Compared to the most advanced tools such as ABC and Hybrid-CEC, PDP-CEC significantly accelerates the CEC process, achieving speedups ranging from 5.11x to 125.27x, while effectively solving approximately three times more datapath CEC problems. With excellent scalability, PDP-CEC shows substantial improvements in combinational equivalence checking for datapath circuits, offering an efficient parallel approach to meet the demands of large-scale datapath CEC tasks.
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionFull-batch Graph Neural Networks (GNN) training is indispensable for interdisciplinary applications. Although full-batch training has advantages in convergence accuracy and speed, it still faces challenges such as severe load imbalance and high communication traffic overhead. In order to address these challenges, we propose ParGNN, an efficient full-batch training system for GNNs, which adopts a profiler-guided adaptive load balancing method in conjunction with graph over-partition to alleviate load imbalance. Based on the over-partition results, we present a subgraph pipeline algorithm to overlap communication and computation while maintaining the accuracy of GNN training. Extensive experiments demonstrate that ParGNN can not only obtain the highest accuracy but also reach the preset accuracy in the shortest time.
In end-to-end experiments on four datasets, ParGNN outperforms two state-of-the-art full-batch GNN systems, PipeGCN and DGL, achieving speedups of up to 2.7$\times$ and 21.8$\times$, respectively.
Networking
Work-in-Progress Poster


DescriptionLogic synthesis is an important step in digital circuit design, where optimization operators are applied iteratively to improve the quality of results (QoR).
While recent work has made progress in optimizing sequences of operators, these methods typically apply optimization operators directly to the entire circuit, which may lead to suboptimal results due to the heterogeneous nature of different circuit regions.
To address this problem, we propose ParLS, a novel framework that integrates circuit partitioning and reinforcement learning (RL) for finer-grained and efficient optimization.
Specifically, we convert and-inverter graphs (AIGs) into hypergraphs for partitioning, and then use RL to select the most suitable operator for each subgraph based on its characteristics.
Moreover, parallel optimization between subgraphs is utilized to achieve overall acceleration.
Experimental results show that our partition-based optimization framework achieves superior performance across various benchmarks compared to existing optimization methods.
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionTransformer-based video generation models have demonstrated significant potential in content creation. However, the current state-of-the-art model employing "3D full attention" encounters substantial computation and storage challenges. For instance, the attention map size for CogVideoX-5B requires 56.50 GB, and generating a video of 49 frames takes approximately 1 minute on an NVIDIA A100 GPU under FP16. Although model quantization has proven effective in reducing both memory and computational costs, applying it to video generation models still faces challenges in preserving algorithmic performance while ensuring efficient hardware processing. To address these issues, we introduce PARO, a video generation accelerator with pattern-aware reorder-based attention quantization. PARO investigates the diverse attention patterns of 3D full attention and proposes a novel reorder technique to unify these patterns into a single "block diagonal" structure. Block-wise mixed precision quantization is further applied to achieve lossless compression under an average bitwidth of 4.80 bits. In terms of hardware, to overcome the limitation that existing mixed-precision computing units cannot fully utilize the attention map bitwidth to accelerate QK multiplication, PARO designs an output-bitwidth aware mixed-precision processing element (PE) array through hardware-software co-design. This approach ensures that the mixed-precision characteristics are fully utilized to enhance hardware efficiency in the bottleneck attention computation. Experiments demonstrate that PARO delivers up to 2.71× improvement in end-to-end performance compared to an NVIDIA A100 GPU and achieves up to 6.38∼7.05× speedup over state-of-the-art ASIC-based accelerators on the CogVideoX-2B and 5B models.
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionToday, DNN inference is widely adopted, with numerous inference services being spawned from scratch across instance scenarios such as spot serving, serverless scaling, and edge computing, where frequent start-stops are required. In this work, we first delve into the inference workflow and uncover the origins of cold start when invoking a DNN model. Specifically, DNN execution is blocked by the kernel loading process that prepares the code object to execute on the GPU at the DL primitive library (e.g., cuDNN and MIOpen). To tackle this, we propose PASK, a kernel loading and reusing middleware to mitigate the widespread cold start issue. Unlike the reactive kernel scheduling policy used by existing frameworks, PASK adopts a proactive strategy to interleave code loading, kernel issuing, and GPU computation to achieve higher hardware utilization. To further reduce the loading overhead, PASK recycles existing loaded kernels to accomplish the DNN operator, rather than loading new kernels for every layer. Meanwhile, PASK categorically organizes the cached kernels to efficiently find the applicable kernel for reuse and thus minimize incurred runtime overhead. We implement and evaluate PASK atop an open-source DNN inference engine and primitive library on off-the-shelf GPUs. Experiments demonstrate that PASK is capable of alleviating the cold start overhead of popular DNN models with a 5.62x speedup on average.
Research Manuscript


EDA
EDA9: Design for Test and Silicon Lifecycle Management
DescriptionIn automatic test pattern generation (ATPG), SAT-based methods are typically used to complement structural approaches, especially for addressing hard-to-detect faults. However, as the size and complexity of circuits grow, SAT-based ATPG faces challenges like pattern inflation and excessive runtime, limiting its overall performance.
The key problem lies in the fact that current mainstream SAT solvers perform complete assignments for all primary inputs of the fault's transitive fanin cone without considering the detection of other faults, making test compaction extremely difficult and time consuming.
In this paper, a novel SAT solver PA-MiniSat is proposed, which is capable of generating partial assignments for solving variables and significantly reduces the number of specified bits in test cubes.
As an extension of MiniSat, it employs a full-literal watching technique and a circuit-adapted heuristic branching strategy, achieving overall improved performance in ATPG.
Based on PA-MiniSat, a hybrid ATPG framework PastATPG is proposed for better test compaction, which tightly integrates structural algorithms with the SAT solver into the unified test compaction flow.
Experimental results demonstrate that our method outperforms other SAT solvers in pattern compaction and, in some cases, even surpasses commercial ATPG tools in terms of speed.
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionWirelength is the fundamental metric for VLSI routing. With the advancement of new technologies, wire delay has also become a significant factor for timing performances. It is thus necessary to consider both wirelength and delay in routing tree construction, i.e., timing-driven routing trees. Prior methods propose various heuristics to balance wirelength and delay with a tunable parameter, which cannot compute the full Pareto frontier. In this work, we propose PatLabor, a practical method for timing-driven routing. PatLabor directly optimizes the Pareto set, which obtains tighter Pareto curves than prior methods and does not require parameter tuning. PatLabor obtains all Pareto-optimal solutions on small-degree nets up to 9 pins and is theoretically guaranteed by provable time complexity and approximation bounds. Experimental results verify our theoretical findings and show that PatLabor obtains tighter Pareto curves than state-of-the-art methods on ICCAD-15 benchmarks. For example, PatLabor obtains up to 58.5% more Pareto-optimal solutions than prior methods for degree-9 nets.
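For illustration (not the paper's construction algorithm), here is a small Python sketch of the basic operation behind maintaining a Pareto set of routing-tree candidates: keeping only (wirelength, delay) points that no other point dominates. The candidate values are invented.

    def pareto_front(points):
        # points: (wirelength, delay) pairs, both to be minimized
        pts = sorted(points)                 # by wirelength, then delay
        front, best_delay = [], float("inf")
        for wl, d in pts:
            if d < best_delay:               # strictly better delay than anything
                front.append((wl, d))        # with shorter-or-equal wirelength
                best_delay = d
        return front

    cands = [(10, 9.0), (11, 7.5), (12, 7.6), (13, 6.0), (10, 8.0)]
    print(pareto_front(cands))   # -> [(10, 8.0), (11, 7.5), (13, 6.0)]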
Engineering Poster


DescriptionTransistor-level static timing analysis plays a vital role in IBM high-performance processors' design, allowing custom circuits to meet aggressive frequency and power goals. It enables designers to robustly determine timing critical paths based on simulation of transistor-level netlist incorporating logic and layout parasitic effects in advanced process nodes.
Previously, due to the analog nature of parts of the design or capacity concerns combined with extractor limitations, designers had to abstract large portions of designs. Such "grey boxing" could easily miss real layout problems or mis-estimate effects inside the abstracts.
This talk will present a pattern-based abstraction timing methodology for mixed transistor-level and abstract-based static timing analysis, which has been successfully leveraged by IBM in multiple microprocessor generations. This novel approach ensures truly layout-based timing analysis and rule generation by providing flexibility to precisely control timing abstraction on subsets of a netlist while modelling parasitic effects at transistor level, achieving higher accuracy than prior methods. This methodology has enhanced both design and EDA productivity with early enablement of circuits and easier support of complicated subcircuits; and it has reduced analysis runtime by 3.7X to 9.6X, leading to faster timing signoff and earlier delivery of rules to integrators.
Research Manuscript


EDA
EDA8: Design for Manufacturing and Reliability
DescriptionGeneration of diverse VLSI layout patterns is crucial for various downstream tasks in design for manufacturing (DFM) studies. However, the lengthy design cycles often hinder the creation of a comprehensive layout pattern library, and new detrimental patterns may be discovered late in the product development process. Existing training-based ML pattern generation approaches struggle to produce legal layout patterns in the early stages of technology node development due to the limited availability of training samples.
To address this challenge, we propose PatternPaint, a few-shot learning framework capable of generating legal patterns with limited DRC Clean training samples. PatternPaint simplifies complex layout pattern generation into a series of inpainting processes with a template-based denoising scheme.
Our framework enables even a general pre-trained image foundation model (stable-diffusion), to generate valuable pattern variations, thereby enhancing the library. Notably, PatternPaint can operate with any input size. Furthermore, we explore fine-tuning a pre-trained model with VLSI layout images, resulting in a 2x generation efficiency compared to the base model.
Our results show that the proposed model can generate legal patterns in complex 2D metal interconnect design rule settings and achieves a high diversity score. The designed system, with its flexible settings, supports pattern generation with localized changes and design rule violation correction. Validated on a sub-3nm technology node (Intel 18A), PatternPaint is the first framework to generate a complex 2D layout pattern library using only 20 design rule clean layout patterns as input.
Engineering Poster


DescriptionIn the contemporary chip design process, both average and peak power are important metrics. With more focus on average power, the optimization techniques that save peak power are often neglected. This leads to critical issues in the downstream flows and increases the cost of packaging for thermal management. Currently, no automated solution exists that can reduce peak power in the chip design process.
In this paper, we propose a peak power optimization technique that re-schedules data-path operators across cycles in the RTL2GDS flow. The technique is similar to retiming; however, guidance from the RTL power optimization tool, based on the data-path operator activity profile in the peak-power region, is consumed in the RTL2GDS flow. To enable this technique, a new optimization was added in the RTL2GDS flow that takes the guidance from the RTL tool as an additional input. Using the existing and the modified RTL2GDS flows, two netlists were generated and compared to validate the impact on peak power.
The results indicate that the design can be closed with lower peak power. Also, there was no noticeable impact on other PPA metrics of the design.
Networking
Work-in-Progress Poster


DescriptionCoarse-grained reconfigurable architectures (CGRAs) strike a fine balance between flexibility and efficiency by incorporating reconfigurable functional units and interconnectivity patterns tailored to specific application domains. However, to fully harness the potential of CGRAs, sophisticated compilation techniques are essential to effectively exploit their architectural features. This paper proposes a comprehensive CGRA compilation framework based on MLIR, which integrates effective computation graph optimization and polyhedral-based tensor optimization methods. Experimental results demonstrate an 82.5% performance improvement on neural network models and a 53.7% reduction in mapping time compared to existing compilation techniques, as well as excellent adaptability for AI applications.
Research Panel


Systems
DescriptionIn today's state-of-the-art technology, high-performance computing (HPC) is paramount for solving complex scientific and business problems. Some of the key challenges of HPC are power consumption, scalability limitations, and the rapid pace of hardware innovation, which makes it difficult to keep systems up to date. All these factors lead to a potentially unsustainable situation.
With that, the sustainable and energy-efficient computing paradigm has become extremely critical. Using technological advancements, we need to maximize energy efficiency, reduce resource consumption, and promote recycling of electronic waste throughout the product lifecycle.
In this panel, we will discuss various ways to achieve sustainability and energy-efficient computing. While all the panelists believe that energy efficiency and sustainability need to be achieved, there is a wide variety of opinions and disagreements about how to achieve them. A significant amount of research needs to be performed along the various angles raised by the panelists.
Research Manuscript


Design
DES6: Quantum Computing
DescriptionVariational quantum algorithms (VQA) based on Hamiltonian simulation represent a specialized class of quantum programs well-suited for near-term quantum computing applications due to its modest resource requirements in terms of qubits and circuit depth. Unlike the conventional single-qubit (1Q) and two-qubit (2Q) gate sequence representation, Hamiltonian simulation programs are essentially composed of disciplined subroutines known as Pauli exponentiations (Pauli strings with coefficients) that are variably arranged. To capitalize on these distinct program features, this study introduces Phoenix, a highly effective compilation framework that primarily operates at the high-level Pauli-based intermediate representation (IR) for generic Hamiltonian simulation programs. Phoenix exploits global program optimization opportunities to the greatest extent, compared to existing SOTA methods despite some of them also utilizing similar IRs. Phoenix employs the binary symplectic form (BSF) to formally describe Pauli strings and reformulates IR synthesis as reducing the column weights of BSF by appropriate Clifford transformations. It comes with a heuristic BSF simplification algorithm that searches for the most appropriate 2Q Clifford operators in sequence to maximally simplify the BSF at each step, until the BSF can be directly synthesized by basic 1Q and 2Q gates. Phoenix further performs a global ordering strategy in a Tetris-like fashion for these simplified IR groups, carefully balancing optimization opportunities for gate cancellation, minimizing circuit depth, and managing qubit routing overhead. Experimental results demonstrate that Phoenix outperforms SOTA VQA compilers across diverse program categories, backend ISAs, and hardware topologies.
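As a brief illustration of the binary symplectic form mentioned above (the standard Pauli-to-bits encoding, not Phoenix's simplification algorithm): each Pauli operator maps to an (x, z) bit pair, and a per-qubit column weight counts how many strings act non-trivially on that qubit. The toy strings below are hypothetical.

    def to_bsf(pauli):
        # X -> (1,0), Z -> (0,1), Y -> (1,1), I -> (0,0) for each qubit
        x = [1 if p in "XY" else 0 for p in pauli]
        z = [1 if p in "ZY" else 0 for p in pauli]
        return x, z

    strings = ["XXI", "IYZ", "ZIZ"]         # a toy Hamiltonian's Pauli strings
    bsf = [to_bsf(s) for s in strings]
    for q in range(3):                      # column weight per qubit
        weight = sum(x[q] | z[q] for x, z in bsf)
        print(f"qubit {q}: column weight {weight}")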
Engineering Poster


DescriptionRAM sequential ATPG may be needed for designs with certain requirements, but it can lead to IR-drop issues in advanced process nodes because ATPG cannot control the memories well. In advanced processes, scan IR-drop issues become more serious. A memory control mechanism is provided, involving configurable memory-control circuit insertion and physically aware ATPG constraint generation. Results show that memories are well controlled in the patterns generated by ATPG, and IR analysis shows good results, without hot spots, for the most critical patterns.
Engineering Poster


DescriptionTraditional SoC Power Grid (PG) design is iterative, requiring time-consuming analysis with commercial tools and multiple PnR cycles. This work presents a fast, automated PG synthesis and IR drop solver using a Python framework, enabling early-stage SoC analysis. The framework takes a PG specification as input, constructs a resistor-via mesh, generates a SPICE netlist, and employs a sparse matrix solver for rapid IR drop evaluation. This allows quick evaluation of various PG specifications, simulating different current distributions and mixed-PG regions.
Comparison with post-PnR sign-off analysis shows close approximation in IR histograms and heat-maps, validating the accuracy of the proposed method. The solver achieves a significant speedup of at least 10X in runtime and user effort compared to the conventional PnR-dependent flow.
The maximum error is approximately 1.5 mV for always-on and 10 mV for switchable grids, demonstrating acceptable accuracy for early-stage analysis.
This approach enables early assessment of PG grid templates, significantly accelerating PG design and optimization before PnR implementation.
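A minimal Python sketch in the same spirit (nodal analysis of a tiny resistor mesh solved with a sparse solver). The mesh size, segment resistance, pad location, and load currents are invented, and the real flow's PG-specification parsing and SPICE netlist generation are not shown.

    import numpy as np
    from scipy.sparse import lil_matrix, csr_matrix
    from scipy.sparse.linalg import spsolve

    N = 3                             # 3x3 toy mesh
    R = 0.05                          # ohms per mesh segment (assumed)
    VDD, I_LOAD = 0.75, 1e-3          # volts, amps drawn per node (assumed)
    n = N * N
    idx = lambda r, c: r * N + c

    G = lil_matrix((n, n))            # conductance (Laplacian) matrix
    for r in range(N):
        for c in range(N):
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < N and cc < N:
                    a, b, g = idx(r, c), idx(rr, cc), 1.0 / R
                    G[a, a] += g; G[b, b] += g
                    G[a, b] -= g; G[b, a] -= g

    I = np.full(n, -I_LOAD)           # loads pull current out of every node
    pad = idx(0, 0)                   # pin one node to VDD as the supply pad
    G[pad, :] = 0; G[pad, pad] = 1.0; I[pad] = VDD

    v = spsolve(csr_matrix(G), I)     # node voltages
    print("worst IR drop (mV):", (VDD - v.min()) * 1e3)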
Networking
Work-in-Progress Poster


DescriptionThree important challenges must be addressed for resource-constrained ASR models on the edge, i.e., adaptivity, incrementality, and inclusivity. We propose a novel ASR system, PI-Whisper, in this work and show how it can improve an ASR's recognition capabilities adaptively by identifying different speakers' characteristics, how such an adaption can be performed incrementally without repetitive retraining, and how it can improve the equity and fairness for diverse speaker groups. The proposed system attains all of these nice properties while achieving state-of-the-art accuracy with up to 13.7% reduction of the word error rate (WER) with linear scalability with respect to computing resources.
Networking
Work-in-Progress Poster


DescriptionFloorplanning is the initial stage of physical design. However, most existing algorithms focus solely on optimizing half-perimeter wire length (HPWL). This focus often neglects pin assignments for signal transmission and fails to accommodate multiple constraints such as pre-placed modules (PPM). Consequently, this leads to suboptimal power, performance, and area (PPA) metrics, and misaligns with real-world design requirements. To address these challenges, we introduce Piano, a multi-constraint, pin assignment-aware floorplanner that serves as an incremental optimizer for any floorplanning algorithm. Piano constructs a graph based on pin-to-pin connections, enabling effective pin assignments and the calculation of feedthrough paths for long-distance connections. Additionally, it offers a method for whitespace removal and incorporates three operators to enhance pin assignments, while adhering to complex design constraints. Experimental results demonstrate that Piano significantly outperforms recent state-of-the-art approaches in floorplanning, achieving an average reduction of 8% in HPWL and a 23% improvement in unplaced pins.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionIn this work, we present PICK, an efficient processing-in-memory (PIM) architecture designed to accelerate kNN search in point cloud applications. We exploit bit-serial-based PIM (BS-PIM) and customized computing modules to efficiently perform the elementary operations in kNN search, i.e., distance calculation and top-k search. The in-situ, bit-serial computing approach of BS-PIM significantly reduces on-chip data movement and simplifies circuit design, enabling an expanded on-chip memory that fully eliminates runtime off-chip memory access. For the distance calculation, we propose a bit-width clipping method to reduce the high latency typically associated with the bit-serial algorithm, with negligible accuracy degradation. This optimization also allows flexible trade-offs between performance and accuracy for scenarios with different priorities. For the top-k search, we propose an efficient filtering-and-selection search strategy to handle arbitrary values of k with approximately constant time complexity. Additionally, a two-stage pipeline is applied to parallelize the execution of distance calculation and top-k search, hiding latency and enhancing system performance.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionData deduplication enhances storage efficiency through non-destructive compression but is often hindered by the chunking process, which requires scanning the entire dataset. While traditional methods leveraging conventional architectures and hardware accelerators (e.g., GPUs and FPGAs) have been developed to address this issue, they continue to face challenges related to excessive data movement and associated performance degradation. These limitations stem from the von Neumann architecture, where computation and storage are separated in a processor-centric design, necessitating multiple memory hierarchy traversals and causing inefficiencies. To overcome these challenges, we explore UPMEM's DPU, a processing-in-memory (PIM) technology that reduces data movement by performing computations directly within memory. However, designing a deduplication system for DPUs presents unique obstacles, including restricted inter-DPU data sharing, the absence of native multiplication support, and significant DPU-CPU communication overhead. In response, we propose PIMDup, a DPU-optimized deduplication system that addresses these constraints through efficient parallelization, DPU-friendly chunking techniques, and reduced data transfer volumes. Experimental results demonstrate that PIMDup improves chunking performance without compromising deduplication accuracy, achieving a 1.67× speedup over CPU-based systems while maintaining 100% result consistency.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionMixture-of-experts (MoE) technique holds significant promise for scaling up Transformer models.
However, the data transfer overhead and imbalanced workload hinder efficient deployment.
This work presents PIMoE, a heterogeneous system combining processing-in-memory (PIM) and neural-processing-unit (NPU) to facilitate efficient MoE Transformer inference.
We propose a throttle-aware task offloading method that addresses workload imbalance between NPU and PIM, achieving optimal task distribution.
Furthermore, we design a near-memory-controller data condenser to address the mismatch of sparse data layout between NPU and PIM, enhancing data transfer efficiency.
Experimental results demonstrate that PIMoE achieves 4.5× speedup and 13.7× greater energy efficiency compared to the A100, and 1.4× speedup over a state-of-the-art MoE platform.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionDeploying Large Language Models (LLMs) on edge devices poses significant challenges due to their high computational and memory demands. In particular, General Matrix-Vector Multiplication (GEMV), a key operation in LLM inference, is highly memory-intensive, making it difficult to accelerate using conventional edge computing systems. While Processing-in-memory (PIM) architectures have emerged as a promising solution to this challenge, they often suffer from high area overhead or restricted computational precision.
This paper proposes PIMPAL (Process-In-Memory architecture with Parallel Arithmetic Lookup), a cost-effective PIM architecture leveraging LookUp Table (LUT)-based computation for GEMV acceleration in sLLMs (small LLMs). By replacing traditional arithmetic operations with parallel in-DRAM LUT lookups, PIMPAL significantly reduces area overhead while maintaining high performance. PIMPAL introduces three key innovations: (1) it divides DRAM bank subarrays into compute blocks for parallel LUT processing; (2) it employs Locality-aware Compute Mapping (LCM) to reduce DRAM row activations by maximizing LUT access locality; and (3) it enables multi-precision computations through a LUT Aggregation (LAG) mechanism that combines results from multiple small LUTs. Experimental results show that PIMPAL achieves up to 17x higher performance than previous LUT-based PIM designs and reduces area overhead by 40% compared to conventional processing unit-based PIM designs.
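For intuition only, here is a tiny Python sketch of LUT-based multiplication with aggregation: an 8-bit product is assembled from four lookups into a 4x4-bit product table, combined with shifts and adds. This mirrors the flavor of aggregating small LUTs for wider precision; PIMPAL's in-DRAM table organization and mapping are not reproduced.

    # 4-bit x 4-bit product table (256 entries)
    LUT4 = [[a * b for b in range(16)] for a in range(16)]

    def mul8_via_lut(x, y):
        xh, xl = x >> 4, x & 0xF
        yh, yl = y >> 4, y & 0xF
        # aggregate four small lookups with shifts (partial-product alignment)
        return ((LUT4[xh][yh] << 8) + (LUT4[xh][yl] << 4)
                + (LUT4[xl][yh] << 4) + LUT4[xl][yl])

    assert mul8_via_lut(173, 94) == 173 * 94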
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
DescriptionDynamically scheduled high-level synthesis (HLS) is an approach to HLS that maps programs into dataflow circuits. These circuits use distributed control for communication and can therefore be automatically pipelined. However, pipelined function calls are challenging due to the absence of centralized control, so general software programs cannot be well supported by these HLS tools. Traditional solutions to this problem impose restrictions on pipelining when accessing shared functions. We present PipeLink, a modular synthesis method that decomposes synthesis into a compilation stage and a linking stage, enabling fully-pipelined access to shared functions in dataflow circuits. We develop a complete HLS engine using this approach and use asynchronous circuits as the target implementation. Our engine supports pipelined access to shared function units, as well as pipelined memory access, in a unified fashion. Compared to existing HLS tools, PipeLink results in a 20X reduction in energy and a 1.56X improvement in throughput.
Networking
Work-in-Progress Poster


DescriptionSpeculative decoding accelerates LLM inference by using smaller draft models to generate candidates for verification by larger models. However, existing approaches are limited by sequential dependencies between stages. We present PipeSpec, a novel framework that breaks these dependencies through asynchronous execution across a hierarchical pipeline of k models, enabling continuous parallel execution. Our approach achieves up to 2.54x speedup over conventional autoregressive decoding while outperforming state-of-the-art speculative methods. Results across text summarization and code generation tasks demonstrate that pipeline efficiency increases with model depth, providing a scalable solution for multi-device systems.
Research Manuscript


Design
DES3: Emerging Models of ComputatioN
DescriptionWhile sparsity, a feature of data in many applications, provides optimization opportunities such as reducing unnecessary computations, data transfers, and storage, it also causes several challenges. For instance, even in state-of-the-art sparse accelerators, sparsity can result in load imbalance, a performance bottleneck. To solve such challenges, our key insight is that if, while reading/streaming compressed sparse matrices, we can quickly anticipate the locations of the non-zero values in a sparse matrix, we can leverage this knowledge to accelerate the processing of sparse matrices. To enable this, we propose Pipirima, a lightweight prediction-based sparse accelerator. Inspired by traditional branch predictors, Pipirima uses resource-friendly simple counters to predict the patterns of non-zero values in sparse matrices. We evaluate Pipirima on sparse matrix-vector multiplication (SpMV) and sparse matrix-dense matrix multiplication (SpMM) kernels on CSR-compressed matrices derived from both scientific computing and transformer models. On average, our experiments show 6× and 4× speedups over Tensaurus for SpMM and SpMV, respectively, on the SuiteSparse workload. Pipirima also shows a 40× speedup over ExTensor for SpMM, and achieves 8.3× and 48.2× speedups over Tensaurus and ExTensor on less sparse transformer workloads. Pipirima consumes 5.621 mm² area and 544.93 mW power in 45nm technology, with predictor-related components being the least expensive ones.
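A toy Python sketch of the branch-predictor analogy, assuming one 2-bit saturating counter per column position that predicts whether the next CSR row has a non-zero there; Pipirima's actual counter organization is not reproduced.

    class NZPredictor:
        def __init__(self, n_cols):
            self.ctr = [1] * n_cols          # 2-bit counters (0..3), weak "no" start
        def predict(self, col):
            return self.ctr[col] >= 2        # predict "non-zero at this column"
        def train(self, col, was_nonzero):
            if was_nonzero:
                self.ctr[col] = min(3, self.ctr[col] + 1)
            else:
                self.ctr[col] = max(0, self.ctr[col] - 1)

    rows = [[0, 3, 5], [0, 5], [0, 3, 5], [3, 5]]   # CSR column indices per row
    p = NZPredictor(n_cols=8)
    hits = total = 0
    for cols in rows:
        nz = set(cols)
        for c in range(8):
            total += 1
            hits += (p.predict(c) == (c in nz))
            p.train(c, c in nz)
    print(f"prediction accuracy: {hits/total:.0%}")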
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionLarge language models (LLMs) have transformed numerous AI applications, with on-device deployment becoming increasingly important for reducing cloud computing costs and protecting user privacy. However, the astronomical model size and limited hardware resources pose significant deployment challenges. Model quantization is a promising approach to mitigate this gap, but the presence of outliers in LLMs reduces its effectiveness.
Previous efforts addressed this issue by employing compression-based encoding for mixed-precision quantization. These approaches struggle to balance model accuracy with hardware efficiency due to their value-wise outlier granularity and complex encoding/decoding hardware logic.
To address this, we propose PISA, an acceleration framework that exploits massive sparsity in the higher-order part of LLMs by splitting 16-bit values into a 4-bit/12-bit format. Crucially, PISA introduces an early bird mechanism that leverages the high-order 4-bit computation to predict the importance of the full calculation result. This mechanism enables efficient computational skips by continuing execution only for important computations and using preset values for less significant ones. This scheme can be efficiently integrated with existing hardware accelerators like systolic arrays without complex encoding/decoding. As a result, PISA outperforms state-of-the-art outlier-aware accelerators, achieving a $1.3-4.3\times$ performance boost and $14.3-66.7\%$ greater energy efficiency, with minimal model accuracy loss. This approach enables more efficient on-device LLM deployment, effectively balancing computational efficiency and model accuracy.
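To illustrate the arithmetic behind the 4-bit/12-bit split and the early-bird skip, the following NumPy sketch decomposes signed 16-bit operands, forms the high-order partial product first, and completes the remaining cross terms only when that partial result exceeds a threshold; the threshold, the preset value, and the random test vectors are placeholders, and PISA realizes this decision in accelerator hardware rather than software.

import numpy as np

def split_4_12(x):
    """Split signed 16-bit integers into a high 4-bit and a low 12-bit part."""
    x = x.astype(np.int64)
    return x >> 12, x & 0x0FFF              # x == hi * 4096 + lo

def early_bird_dot(x, w, threshold=1 << 20, preset=0):
    """Finish the full dot product only if the 4-bit partial result looks important."""
    xh, xl = split_4_12(x)
    wh, wl = split_4_12(w)
    partial_hi = np.dot(xh, wh) << 24       # early-bird estimate from the 4-bit parts
    if abs(partial_hi) < threshold:
        return preset                       # deemed unimportant: skip the rest
    return (partial_hi                      # otherwise add the remaining cross terms
            + (np.dot(xh, wl) << 12)
            + (np.dot(xl, wh) << 12)
            + np.dot(xl, wl))

rng = np.random.default_rng(0)
x = rng.integers(-30000, 30000, size=64, dtype=np.int16)
w = rng.integers(-30000, 30000, size=64, dtype=np.int16)
exact = int(np.dot(x.astype(np.int64), w.astype(np.int64)))
print(exact, early_bird_dot(x, w))          # identical whenever the skip is not taken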
Research Manuscript


EDA
EDA6: Analog CAD, Simulation, Verification and Test
DescriptionAs process nodes advance to sub-5nm technologies, post-layout simulations for integrated circuits have become increasingly complex, with billions to trillions of nodes. The growing design complexity and transistor integration require more accurate and efficient post-layout SPICE simulations. However, existing methods for solving large-scale post-layout circuits face significant challenges due to high computational costs. In this paper, we propose a new approach, PiSPICE, which utilizes adjoint sensitivity analysis to identify critical parasitics and eliminate non-critical ones, effectively reducing the simulation scale and improving speed. By modeling parasitics and performing sensitivity analysis on pre-layout circuits, we significantly reduce the computational burden and avoid the overhead of directly analyzing sensitivities in large-scale post-layout circuits. By retaining only the critical parasitics and applying model order reduction to minimize their impact, while eliminating non-critical parasitics, PiSPICE achieves a maximum 17.27× speedup in simulation with an error margin of less than 0.78% compared to the commercial simulator Spectre.
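For intuition, adjoint sensitivity analysis on a linear DC network needs only one forward solve and one adjoint solve to obtain the sensitivity of the output voltage to every branch conductance, after which negligible parasitics can be pruned. The sketch below is a toy illustration under that linear, DC assumption; the two-node circuit, the branch list, and the pruning threshold are made up, and PiSPICE additionally applies model order reduction to the retained parasitics.

import numpy as np

def parasitic_sensitivities(G, I, out_node, branches):
    """One forward and one adjoint solve give dV_out/dg for every branch."""
    v = np.linalg.solve(G, I)                 # forward solve:  G v = I
    e = np.zeros_like(I); e[out_node] = 1.0
    lam = np.linalg.solve(G.T, e)             # adjoint solve:  G^T lam = e_out
    scores = {}
    for name, (i, j, g) in branches.items():  # j is None for a branch to ground
        li, vi = lam[i], v[i]
        lj, vj = (lam[j], v[j]) if j is not None else (0.0, 0.0)
        dv_dg = -(li - lj) * (vi - vj)        # adjoint sensitivity of V_out to g
        scores[name] = abs(dv_dg * g)         # first-order impact of this parasitic
    return v[out_node], scores

# Toy divider: 1 mA source into node 0, 1 mS load at node 1, plus two parasitics.
g_main, g_load, gp_cpl, gp_leak = 1e-3, 1e-3, 1e-6, 1e-6
g01 = g_main + gp_cpl
G = np.array([[ g01, -g01],
              [-g01,  g01 + g_load + gp_leak]])
I = np.array([1e-3, 0.0])
vout, scores = parasitic_sensitivities(
    G, I, out_node=1,
    branches={"cpl_0_1": (0, 1, gp_cpl), "leak_1_gnd": (1, None, gp_leak)})
keep = [n for n, s in scores.items() if s > 1e-6 * abs(vout)]
print(f"Vout = {vout:.4f} V, impact = {scores}, keep = {keep}")

In this toy case the series coupling has zero first-order impact on the output (the node is driven by an ideal current source) and would be pruned, while the leakage branch at the output node is retained.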
Networking
Work-in-Progress Poster


DescriptionRapid advances of Electro-Photonic Integrated Circuits (EPICs), across various material platforms, are driving the integration of numerous devices into complex heterogeneous systems. A key application is high-bandwidth communications for distributed computing. While manufacturing Photonic Integrated Circuits (PICs) leverages standard VLSI fabrication processes for electronic ICs, the PIC design ecosystem still lags behind the maturity of Electronic Design Automation (EDA) tools.
To address these challenges, we introduce a publicly available, fully automated Photonic Place-and-Route (PnR) flow that operates across different material platforms and integrates seamlessly with industry-standard EDA tools. Our contributions include:
(1) Photonic-to-electronic PnR mapping: Techniques to accommodate the distinct characteristics of photonic routing, such as waveguide bending constraints and optical metric mapping for performance optimization.
(2) Comprehensive verification suite: A suite comprising Design Rule Check (DRC), Layout-vs-Schematic (LVS), and Parasitic Extraction (PEX) to ensure manufacturing compliance and accurate simulation using compact models (VerilogA/SPICE) of Electro-Optical devices.
(3) Reference designs for multiple applications: Demonstrations include (a) high-bandwidth optical switch fabrics and (b) timing-optimized optical trees that showcase the flexibility and performance of our Photonic PnR solution (completing 1000 devices in 20 and 50 minutes for the unconstrained and constrained cases, respectively).
This integrated approach enables more efficient design and optimization of EPICs, supporting future advances in electro-photonic systems.
Networking
Work-in-Progress Poster


DescriptionThis paper presents a power management integrated circuit optimization (PMICO) framework that integrates expert knowledge with multi-agent deep reinforcement learning. The framework incorporates a systematic methodology for quantifying objective weights in multi-objective optimization, a transistor clustering strategy, and single-step iteration mechanisms derived from analog circuit expertise. PMICO demonstrates exceptional proficiency in optimizing circuit performance, achieving results that rival or surpass those of expert designers, even in scenarios lacking appropriate initial parameters. The well-trained policies exhibit cross-process transferability, facilitating optimization across diverse manufacturing processes and operational scenarios. PMICO was verified on a practical low-dropout regulator comprising 149 design parameters and 23 performance metrics. It consistently matches or exceeds manual design performance while achieving up to 5.2× acceleration in training efficiency compared to state-of-the-art multi-agent reinforcement learning methods.
Networking
Work-in-Progress Poster


DescriptionThis paper proposes POG, a novel surrogate model for analog circuit parameter optimization using reinforcement learning (RL). POG embeds circuit features using separate voltage and current graph
neural networks (GNNs) that consider different topologies and directed edges, reflecting the physical behavior of circuits more accurately. These embeddings are used in policy and value networks to determine circuit parameters. Our experiments with realistic circuits demonstrate that POG achieves faster convergence and improved parameter optimization compared to other configurations.
Research Manuscript


Security
SEC3: Hardware Security: Attack & Defense
DescriptionMicroelectronic systems are widely used in many sensitive applications (e.g., manufacturing, energy, defense). These systems increasingly handle sensitive data (e.g., encryption keys) and are vulnerable to threats such as power side-channel attacks, which infer sensitive data from power leakage. We present a novel framework, POLARIS, for mitigating power side-channel leakage using an Explainable Artificial Intelligence (XAI) guided masking approach. POLARIS uses an unsupervised process to automatically build a tailored training dataset and utilizes it to train a masking model. The POLARIS framework outperforms VALIANT (state-of-the-art) in terms of leakage reduction, execution time, and overhead across large designs.
Research Manuscript


Design
DES4: Digital and Analog Circuits
DescriptionDeep Neural Networks (DNNs) in safety-critical systems require high reliability. Many systems deploy Error Correction Codes (ECCs) to protect DNNs from memory errors. However, continuous process scaling increases memory errors in severity and frequency, necessitating strong protection against Multi-Bit Upsets (MBUs). This paper proposes Parities of Parities ECC (PoP-ECC), a novel two-tier memory protection scheme designed to provide robust, efficient, and flexible protection against MBUs. PoP-ECC generates Virtual Parities (VPs), which are used to compute second-level parities called Parities of Parities (PPs). This two-level ECC structure allows for dynamic error correction tailored to varying error patterns, ensuring system reliability with minimal memory overhead. Our evaluation demonstrates that PoP-ECC can tolerate significantly higher MBU ratios compared to state-of-the-art solutions, with negligible delay, area, and power overhead.
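As a purely illustrative, heavily simplified view of the two-level parity idea (the actual VP construction, PP encoding, and correction algorithm in PoP-ECC are more elaborate), the Python sketch below derives per-word parities on the fly, stores only their parity-of-parities, and detects an injected bit flip by recomputing both levels:

import numpy as np

def virtual_parities(block):
    """First level: one (virtual, not stored) parity bit per data word."""
    return np.bitwise_xor.reduce(block, axis=1)

def parity_of_parities(vps):
    """Second level: a single stored parity over the per-word parities."""
    return np.bitwise_xor.reduce(vps)

rng = np.random.default_rng(0)
block = rng.integers(0, 2, size=(8, 64), dtype=np.uint8)   # 8 words x 64 bits
pp_stored = parity_of_parities(virtual_parities(block))    # only the PP is kept

block[3, 17] ^= 1                                           # inject a single bit flip
pp_now = parity_of_parities(virtual_parities(block))
print("error detected:", bool(pp_now != pp_stored))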
Engineering Poster


DescriptionCell library characterization is often a difficult and time-consuming task, due to the breadth of calculations and simulations required in its process. In typical characterization runs, simulations can number in the billions, resulting in days to weeks for completion. This is exacerbated further for standard cell libraries with large sets of PVTs and thousands of cells, increasing the engineering resources, computational consumption, and project time.
Recently, advances and developments in EDA technology have allowed for the use of AI algorithms to reduce the overall time committed to a characterization flow. In this paper, we'll discuss a 2-step methodology known as Portfolio Re-Characterization to characterize and generate data in .libs using AI.
The methodology in question first identifies seed PVTs among the total set of required PVTs for use in full scale characterization through reinforcement learning. Next, the same seed PVTs are used as training data in an AI-enabled workflow to produce new .libs without the need for full characterization. Through this approach, standard cell library characterization runtime and resources are reduced.
Engineering Poster


DescriptionDue to enhanced circuit gating and redundant architectures, automotive-grade low-power integrated circuits exhibit increased susceptibility to IR-drop. The stringent reliability and stability requirements imposed by automotive applications across a wide range of voltage and temperature conditions, including extreme operating conditions, necessitate robust Design-for-Test (DFT) methodologies. Consequently, IR-drop analysis during DFT has become increasingly challenging and difficult to converge.
This paper provides a systematic overview of DFT optimization methods in low-power design. Leveraging power-aware DFT-driven EM/IR analysis, it proposes enhancement strategies to address IR hotspots. Focusing on DFT, the paper examines techniques like Q-gating, clock staggering, partitioning, one-hot scan chains, and memory bypass signals for X-state management, analyzing their effectiveness in reducing IR-drop and comprehensively discussing the challenges associated with low-power automotive chip sign-off.
Research Manuscript


Security
SEC1: AI/ML Security/Privacy
DescriptionXGBoost (eXtreme Gradient Boosting), a widely used decision tree algorithm, plays a crucial role in applications such as ransomware and fraud detection. While its performance is well-established, its security against model extraction on hardware platforms like Field Programmable Gate Arrays (FPGAs) has not been fully explored. In this paper, we demonstrate a significant vulnerability where sensitive model data can be leaked from an XGBoost implementation through side-channel attacks (SCAs). By analyzing variations in power consumption, we show how an attacker can infer node features within the XGBoost model, leading to the extraction of critical data. We conduct an experiment using the XGBoost accelerator FAXID on the Sakura-X platform, demonstrating a method to deduce model decisions by monitoring power consumption. The results show that on average 367k tests are sufficient to leak sensitive values. Our findings underscore the need for improved hardware and algorithmic protections to safeguard machine learning models from these types of attacks.
Research Manuscript


Systems
SYS2: Design of Cyber-Physical Systems and IoT
DescriptionWith the rising demand for ultra-low-cost and flexible electronics in applications like smart packaging and wearable health monitoring, printed electronics provide an affordable, adaptable, and customizable alternative to conventional silicon. However, these systems often rely on printed batteries or energy harvesters with limited power capacity, making strict power budgets critical. Printed neuromorphic circuits (pNCs) are promising for their analog signal processing, reduced circuit complexity, and energy efficiency in low-power environments. Nonetheless, maintaining robust performance under strict power constraints remains challenging, necessitating advanced optimization techniques. In this work, we propose an augmented Lagrangian approach to enforce task-specific power constraints in pNCs, validated across 13 benchmark datasets. Our method preserves accuracy within strict power budgets while achieving Pareto-optimal power-accuracy trade-offs in a single training run. In contrast, the penalty-based method, which serves as the baseline, requires up to 150 runs per dataset to generate the Pareto front. For low-power scenarios (≈ 20% of the original power), our method demonstrates a 52× improvement in accuracy-to-power ratio over the baseline. At higher power budgets (≈ 80% of the original power), it achieves a 59× improvement, maintaining competitive performance. Experimental results demonstrate that our approach achieves 81.82% accuracy with p-tanh activation function (AF) at high power budgets and excels with p-Clipped ReLU AF under low power constraints. This highlights the computational efficiency and effectiveness of our approach for power-constrained circuit design.
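For readers unfamiliar with the technique, a generic augmented-Lagrangian training loop for an inequality power constraint is sketched below in PyTorch; the differentiable power surrogate, the budget, the penalty weight rho, and the optimizer settings are illustrative placeholders rather than the paper's pNC-specific formulation.

import torch

def train_power_constrained(model, loader, power, budget, epochs=20, lr=1e-3, rho=10.0):
    """Enforce power(model) <= budget with a PHR augmented Lagrangian."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    lam = torch.tensor(0.0)                          # Lagrange multiplier
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            task_loss = torch.nn.functional.cross_entropy(model(x), y)
            g = power(model) - budget                # constraint violation (> 0 if over budget)
            aug = task_loss + (torch.clamp(lam + rho * g, min=0.0) ** 2 - lam ** 2) / (2 * rho)
            aug.backward()
            opt.step()
        with torch.no_grad():                        # dual ascent on the multiplier
            lam = torch.clamp(lam + rho * (power(model) - budget), min=0.0)
    return model

# `power` is assumed to be a differentiable surrogate of circuit power consumption,
# e.g. a function of the printed-resistor conductances implied by the weights.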
Research Manuscript


EDA
EDA4: Power Analysis and Optimization
DescriptionAs the gap between device and metal pitch scaling widens at advanced nodes, the power-grid (PG) structure plays a critical role in circuit performance when considering power integrity. Initial PG structures, as the foundation of the implementation, dominate the optimization space. To address initial structures in an industrial flow, we propose the first unified framework, fueled by a novel sequence-based structure generator and a Transformer-based predictor, providing accurate static voltage-drop estimation. The learned embeddings are then adopted to determine promising candidates on the Pareto frontier during structure optimization. The predictor is demonstrated in different scenarios under 3nm and 2nm technology with an average 0.011% maximum absolute error of drop percentage, while the optimized structures reduce PG utilization by 15% with a 34% worst timing-slack improvement on industrial designs.
Exhibitor Forum


DescriptionDriven by the frantic pace of Large Language Model adoption, AI is dramatically redefining Data Center Infrastructure requirements. For instance, Meta's Llama 3 needed 16,000 Nvidia Hopper GPUs and 70 days to train, negotiating 405 Billion parameters and using 15.6 Trillion tokens. Such massive workloads demand not only optimally fast interconnects and innovations in Power delivery, but they also demand materially superior design and reliability performance from Compute's eternal execution twin - Memory, both on and off chip. Memory requirements for High Performance compute are an increasing challenge as new versions of LLMs (eg. Llama 3.1) are pushing per-model-instance memory requirements to nearly a Terabyte (854 GB to be precise), which in turn means subsequent generations of GPUs (like the H200) and other general purpose SoCs will have to support significantly larger on-die Memory clusters. Designing, characterizing and delivering reliable Memory banks within acceptable time-to-market windows thus becomes a key competency that deserves increasing focus. Our design team's track record of delivering multiple generations of silicon proven Memory IP in the most advanced process nodes across multiple foundries and customers, enables us to be a significant part of that focus. Our design experience is enhanced by addressing advanced testability demands of modern multi-die SOCs, including proven history of designing on-chip memory diagnostics (to enable real-time fault monitoring without need for external testers), supporting pattern diagnosis, debug, flexible repair hierarchies and Quality of results optimization, all techniques for non-intrusive fault tolerance and optimal system performance essential for AI/ML applications.
Double-click on our expertise and you will find Infosys employing robust AI/ML powered algorithms in the Memory design and characterization process. Using these algorithms, we are identifying critical paths within large memory instances, predicting PPA metrics across different process, voltage, temperature, and aging corners efficiently. These techniques dramatically reduce memory design times by eliminating the need to run actual simulations across all corners. Any design/feature change requires re-characterization and model retraining ONLY over the corresponding leaf cell(s), while a compiler range change does not require any further adjustments. Such innovations, using AI to power the development of future AI platforms, are one of many reasons why Infosys is a dependable partner in delivering core (Silicon) elements of AI infrastructure.
Ancillary Meeting


DescriptionAs AI workloads scale dramatically, power, thermal management, and reliability have emerged as critical concerns. Industry standards such as IEEE 2416 and IEEE 1801 have attempted to address these issues through system-level power and thermal modeling. But are these standards accelerating innovation, or have they become a bottleneck for rapid technological advancement?
Powering the Future of AI features panelists from industry leaders in design and methodology for power and thermal optimization, including IBM and Cadence Design Systems. The lunch forum includes experienced leaders in the semiconductor industry and EDA standardization and targets system designers and architects, logic and circuit designers, validation engineers, CAD managers, researchers, and academicians.
Key discussion points include:
o Standards: Foundation or Constraint to Innovation in AI and system-level power management?
o Practical impacts and industry adoption: Real-world experiences from semiconductor foundries, AI hardware developers, and system integrators.
o Bridging Gaps: The role of emerging IEEE 2416 extensions (A/MS, thermal, multi-voltage) in addressing AI workloads.
o Cross-industry perspectives from EDA tool vendors, system integrators, and IP developers.
o How should standards evolve to effectively enable next-gen AI applications?
Register now! Space is limited, and registration is required for complimentary lunch. For more details go to si2.org
Tutorial


AI
Sunday Program
DescriptionThis tutorial will provide attendees with a comprehensive understanding of the IEEE 2416 standard, used for system-level power modeling in the design and analysis of integrated circuits and systems. Participants will gain the practical knowledge necessary to implement and utilize the standard effectively. The tutorial will highlight the pressing need for low-power design methodologies, particularly in cutting-edge fields like AI, where computational demands are high. By getting a clear understanding of the IEEE 2416 standard, attendees will be equipped to make decisions on how the standard can be incorporated into their design flow to deliver the efficiencies needed to build their cutting-edge low-power designs. The presenters, who are experts from different industry segments (EDA, Foundry, SoC and IP) and academia, will use the IEEE 2416-2025 version of the standard that is being released at DAC 2025 to explain concepts presented in the tutorial.
This tutorial is tailored for:
• IP Developers: Engineers responsible for designing and characterizing IP blocks who need to create accurate and efficient power models.
• SoC Architects and Designers: Professionals involved in system-level design and integration who require a deep understanding of power analysis and optimization using the 2416 standard.
• EDA Tool Providers and Users: Developers and users of EDA tools who need to integrate and leverage the capabilities of the 2416 standard in their workflows.
Section 1: Introduction to IEEE 2416 and Power Modeling Evolution (Nagu Dhanwada)
Section 2: Core Concepts of IEEE 2416 – Digital and AMS Highlights (Akil Sutton)
Section 3: Real-World Applications – Industry Deep Dives (Eunju Hwang, Pritesh Johari)
Section 4: System-Level Example: AI Accelerator with AMS Blocks (Daniel Cross, Rhett Davis)
Engineering Presentation


Back-End Design
Chiplet
DescriptionFS-PDN (traditional power delivery network) delivers power to the silicon MOS devices, implemented by metal routing and via connections from bump to device. BS-PDN (advanced PDN) moves the PDN to the backside of the silicon substrate, freeing routing resources for signal routing and providing a lower-resistance PDN to deliver power to the silicon. It is critical to obtain Power, Performance, Area (PPA) and IR-drop improvement numbers for process node selection or process DTCO. This work provides a methodology and a case study to evaluate PPA and IR-drop gains at an early stage, while the BS-PDN PDK and EDA flow are not yet 100% ready.
Engineering Poster
Networking


DescriptionWhen EDA tools insert a sub-optimal number of repeaters in timing-critical paths, it can lead to significant performance issues in the design. This sub-optimal insertion often arises from either improper timing constraints or overly stringent transition limits. Even if the EDA tool manages to insert an optimal number of repeaters to meet the delay specifications, achieving consistency across multiple runs can still be a challenge. Variability in the results can stem from several factors, including tool algorithms, design changes, and environmental conditions. To achieve more predictable delay requirements in very high-speed designs, constructing a custom tree can be an effective strategy. This approach allows for a more tailored optimization of the signal paths, reducing path delay and ensuring that setup requirements are consistently met.
Networking
Work-in-Progress Poster


DescriptionThe clustering-based placement framework has demonstrated promising potential in improving the efficiency and quality of very-large-scale integration (VLSI) placement.
However, existing methods typically impose unified and rule-based constraints on different clusters, overlooking the unique intra- and inter-cluster connection properties that vary across clusters, which leads to suboptimal results. To address this challenge and promote effective PPA optimization, we introduce an innovative PPA-driven placement paradigm with mixed-grained Adaptive Cluster Constraints Optimization (ACCO), which applies constraints with tailored tightness to different clusters, balancing local and global interactions for improved placement performance. Specifically, we propose a novel eBound model with quantified constraint tightness, combined with a Bayesian optimizer to dynamically adjust the constraints for each cluster based on PPA outcomes, which are ultimately passed on to the final flat placement. Experimental results on benchmarks across various domains show that our methods can achieve up to 62%, 97% and 25% improvements in post-route WNS, TNS and power compared to existing methods.
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionFederating heterogeneous models on edge devices with diverse resource constraints has been a notable trend in recent years. Compared to traditional federated learning (FL) that assumes an identical model architecture to cooperate, model-heterogeneous FL is more practical and flexible since the model can be customized to satisfy the deployment requirement. Unfortunately, no prior work ever dives into the existing model-heterogeneous FL algorithms under the practical edge device constraints and provides quantitative analysis on various data scenarios and metrics, which motivates us to rethink and re-evaluate this paradigm. In our work, we construct the first system platform PracMHBench to evaluate model-heterogeneous FL on practical constraints of edge devices, where diverse model heterogeneity algorithms are classified and tested on multiple data tasks and metrics. Based on the platform, we perform extensive experiments on these algorithms under the different edge constraints to observe their applicability and the corresponding heterogeneity pattern.
Engineering Poster
Networking


DescriptionThe pre-validation tool addresses critical challenges in test case execution by implementing a robust validation framework to ensure system readiness and reduce inefficiencies. Key issues such as lack of validation checks, high rates of defect rejections, and manual interventions are mitigated through automated pre-execution validations. The tool identifies and resolves common blockers, including hardware malfunctions, user errors, configuration mismatches, environmental inconsistencies, network issues, unsupported software, and overlooked setup prerequisites. By streamlining defect identification and reducing the time spent on opening defects, the tool enhances the efficiency of test workflows, minimizes manual effort, and ensures a higher success rate for test case execution. This innovation significantly improves productivity and the reliability of the overall validation process.
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionThe sensitivity of LLMs to quantization has driven the development of hardware accelerators tailored for specific low-precision configurations such as weight-only quantization and mixed-precision, which can introduce inefficiencies in dedicated hardware architecture. In this work, we propose Precon, a precision-convertible architecture designed to accelerate various quantized deep learning models, particularly LLMs, through a unified processing unit. By enabling on-the-fly switching between half-float (FP16) decoding and integer (INT) decomposition, the design effectively supports INT4-FP16, INT4-INT4, and INT4-INT8 arithmetic within shared logic. Precon achieves up to 4.1x speedup and 81.4% reduction in energy consumption compared to the baseline across various domains, including the support of both accurate and efficient acceleration of quantized LLMs.
Research Manuscript


Design
DES3: Emerging Models of ComputatioN
DescriptionApproximate computing has emerged as a promising solution in energy-efficiency applications. Recently, attention has shifted from approximate components to Design Space Exploration (DSE) algorithms. However, traditional DSE algorithms face challenges in efficiently obtaining optimal solutions within large and complex design spaces. This paper introduces a pre-refining enhanced design space exploration framework that provides customized design space and cost-performance formula for applications. Experimental results demonstrate that integrating this pre-refining step into various DSE algorithms leads to substantial performance gains, including up to 87× speedup and a 23% improvement in hardware overhead. Moreover, the innovative cost-performance-based DSE algorithm attains a 7.7× acceleration and further optimizes hardware metrics by an additional 8.8% compared to advanced frameworks employing the same pre-refinement.
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionAlthough design space exploration (DSE) is good at finding dataflows for optimal memory access in tensor accelerators, it is very time-consuming and lacks architectural insight. In this study, we for the first time propose several principles for dataflow optimization that provide a lower bound on memory communication for tensor operators such as matrix multiplication. Through these principles we can calculate the best tiling, scheduling and mapping for both intra- and inter-operator dataflow. In addition, we can identify all the tensor-wise operator fusions that are profitable in memory communication, so we propose FuseCU, a new architecture that supports these profitable fusions and can be applied to existing spatial architectures for data movement saving. Experimental results show that FuseCU delivers 63.6%, 62.4% and 38.7% data movement savings and 1.33×, 1.25× and 1.14× speedups compared to the TPUv4i, Gemmini and Planaria designs without increasing buffer size or bandwidth. Additionally, FuseCU will be open-sourced.
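The kind of closed-form dataflow reasoning described above can be illustrated with a toy off-chip-traffic model for an output-stationary tiled matrix multiplication: enumerate tile shapes that fit the on-chip buffer and keep the one minimizing DRAM traffic. The cost expressions, the candidate tile sizes, and the buffer budget below are simplified assumptions for illustration; the paper's principles further cover scheduling, mapping, and inter-operator fusion.

import itertools

def best_tiling(M, N, K, buffer_words):
    """Pick (Tm, Tn, Tk) minimizing off-chip traffic for C[M,N] = A[M,K] @ B[K,N]."""
    best = None
    for Tm, Tn, Tk in itertools.product([16, 32, 64, 128, 256], repeat=3):
        if Tm > M or Tn > N or Tk > K:
            continue
        if Tm * Tk + Tk * Tn + Tm * Tn > buffer_words:
            continue                        # A, B and C tiles must fit on chip together
        traffic = (M * K * (N // Tn)        # A is re-read once per column tile of C
                   + N * K * (M // Tm)      # B is re-read once per row tile of C
                   + M * N)                 # C is written once (output-stationary)
        if best is None or traffic < best[0]:
            best = (traffic, (Tm, Tn, Tk))
    return best

print(best_tiling(M=1024, N=1024, K=1024, buffer_words=256 * 1024))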
Networking
Work-in-Progress Poster


DescriptionAttacks that exploit side-channel information inevitably emitted during hardware encryption pose a significant threat to hardware security, as they can extract sensitive user information. Additionally, fabless companies often rely on specific foundries for chip manufacturing, which increases the risk of information leakage from untrusted fabs. Circuit design is also vulnerable to replication threats due to sophisticated reverse engineering techniques. To address these challenges, this paper proposes Fast Post-Manufacturing Programmable Camouflaged (FP2C) Logic. Implemented as a flip-flop with embedded FP2C logic, this design exclusively utilizes regular threshold voltage (RVT) transistors, allowing integration into standard cell libraries and enabling automated digital circuit design through EDA tools. By applying a post-programming code after manufacturing, FP2C logic safeguards against information leakage from untrusted fabs and strengthens security by imposing an additional overhead for attackers attempting to obtain the programming code.
The proposed FP2C Logic-embedded Flip-Flop (FP2C Logic-eFF) was designed in a 28nm CMOS process, achieving a 67% reduction in cell area compared to a Post-Manufacturing Programmable (PMP)-TVD logic cell on the same technology node. To demonstrate the feasibility of implementing digital systems with FP2C Logic, the Advanced Encryption Standard (AES) was designed as a representative cryptographic IP. Tests on the AES-128 module demonstrated that a standard cell-based AES design exposes the secret key at 21k power traces, while the FP2C Logic-eFF-based AES reveals only one key byte at 100k traces and achieves an additional 81.3-bit security strength.
Networking
Work-in-Progress Poster


DescriptionElectronic-Photonic Integrated Circuits (EPICs) are evolving rapidly for high-performance communication, computation, imaging, and more. However, optimizing performance for EPICs is challenging due to the separate tools used for designing photonic and electronic components, while the interconnect between the two often limits the overall EPIC performance. Thus, co-designing, simulating, and verifying electronics, photonics, and their interconnect together in a unified design environment is essential.
To address this, we present Process Design Kits (PDKs) for co-designing electronics, photonics, and interconnect parasitics using industry-standard VLSI design flows. Key characteristics of our PDKs include:
(1) SPICE-compatible compact models for photonic devices, enabling circuit simulations for ICs comprising both electronic and photonic devices as well as their interconnection.
(2) Techniques to account for broad optical spectrums and non-linear interactions between wavelengths, e.g., non-linear optical processes in semiconductor optical amplifiers and second harmonic generation.
(3) Physical design and verification, with parameterized cells for layouts, and tools for automatic Design Rule Check (DRC) and Layout vs. Schematic (LVS).
We demonstrate the effectiveness of our PDKs with detailed physical designs and simulations for applications including: (a) high-speed optical datalinks leveraging resonators to implement wavelength-division multiplexing (WDM); and (b) high-bandwidth computation kernels for matrix multiplication in the optical domain.
Research Manuscript


EDA
EDA9: Design for Test and Silicon Lifecycle Management
DescriptionWavelength-routed optical networks-on-chip (WRONoC) offer low latency and collision-free communication, meeting growing multi-core communication demands. Microring resonators (MRRs), the key components in WRONoC, are susceptible to process variation, causing transmission spectrum shifts, reduced signal power, and increased crosstalk. However, current WRONoC designs have overlooked the impacts of process variation. To counter process variation, we propose a methodology to optimize the MRR radii and signal wavelengths. Specifically, we quantify expected signal power under process variation and propose optimization methods to maximize the expected signal power. Results show up to 7.51 dB improvement in worst-case signal power over designs neglecting process variation.
Networking
Work-in-Progress Poster


DescriptionWith increasing design complexity and human resource constraints, large language models (LLMs) have emerged as a promising solution for electronic design automation (EDA) tasks, particularly in hardware description language (HDL) code generation. Recent advances in agentic LLMs have demonstrated remarkable capabilities in automated Verilog code generation. However, existing approaches either demand substantial computational resources or rely on LLM-assisted single-agent prompt learning techniques, which we identify for the first time as susceptible to a degeneration phenomenon, characterized by deteriorating generative performance and diminished error detection and correction capabilities. In this paper, we propose a multi-agent prompt learning framework to address these limitations and enhance code generation quality. Our key contribution is the empirical demonstration that multi-agent architectures can effectively mitigate the degeneration risk while improving code error correction capabilities, resulting in higher-quality Verilog code generation. The effectiveness of our approach is validated through comprehensive evaluations: achieving 96.4% and 96.5% pass@10 scores on the VerilogEval Machine and Human benchmarks, respectively, while attaining 100% Syntax and 99.9% Functionality pass@5 metrics on the RTLLM benchmark.
Research Manuscript


EDA
EDA2: Design Verification and Validation
DescriptionModel checking is an automated method used to formally verify systems by checking them against properties. However, a major problem in model checking is the state explosion. To overcome this challenge, one approach is to utilize parallel processing capabilities to either speed up computations or handle larger-scale problems. Explicit model checking has lower computational complexity and can be easily parallelized. There are numerous parallel explicit model checking algorithms available in the literature. Symbolic model checking offers significant advantages over explicit model checking in terms of problem scalability and verification speed. However, treating states encountered during the search as sets poses a challenge in devising efficient parallel algorithms. As a result, current research on parallelizing symbolic model checking has primarily focused on reachability analysis or safety properties, rather than attempting to parallelize the nested fixpoint calculations. In this paper, we propose a novel property-driven approach for parallel symbolic model checking of full LTL. Our algorithm introduces a fair model state labelling function that forms a partition of the nested fixpoint across the product combining the model and the property Büchi automaton. The experimental results demonstrate significant speedup, ranging from 2.81 to 17.19 times compared to sequential approaches on a 32-core machine. Moreover, in comparison to existing parallel model checking methods, our approach not only surpasses those relying on BDD libraries with a maximum improvement of up to 134% and an average improvement of 33.1% but also demonstrates significant superiority over the state-of-the-art parallel explicit model checker.
Research Manuscript


Design
DES2A: In-memory and Near-memory Computing Circuits
DescriptionCompute-in-memory (CiM) has become a promising candidate for edge AI by reducing data movements through in-situ operations. However, this emerging computational paradigm also poses the vulnerability of model leakage as the weights are stored in plaintext for computing. While prior works have explored lightweight encryption methods, CiM is usually considered a separate module instead of a system component, leaving the origin of keys unclear and unprotected. Physical unclonable functions (PUFs) offer a potential origin of keys, but a comprehensive framework for securing key generation and delivery remains lacking. Besides, the complementary ciphertext storage incurs substantial costs and degrades the performance.
This work proposes PUFiM, a robust and efficient security solution for edge computing based on ferroelectric FETs (FeFETs). For the first time, a strong PUF is synergized with CiM to enable authentication, key generation, and encrypted computations within a unified array for comprehensive protection. To achieve this synergization, a high-density hybrid storage and computation approach combining PUF and weight bits via multi-level cell (MLC) FeFETs is proposed. Besides, two PUF enhancement techniques and a novel mapping scheme are developed to improve security and efficiency further. Results show that PUFiM withstands PUF modeling attacks with up to 10M samples. Moreover, PUFiM reduces the inference accuracy by >60% under 95% key leakage and achieves >9.7× compute density and >1.2× energy efficiency improvement compared with the state-of-the-art SRAM/NVM secure CiMs.
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
DescriptionRecently, there has been a growing interest in leveraging Large Language Models for Verilog code generation.
However, the current quality of the generated Verilog code remains suboptimal.
This is largely due to the absence of well-defined, well-organized datasets with high-quality samples, as well as a lack of innovative fine-tuning methods and models specifically trained on Verilog.
In this paper, we introduce a novel open-source dataset and a corresponding fine-tuning technique, which utilizes a multi-layered structure that we refer to as PyraNet.
Our experiments demonstrate that employing the proposed dataset and fine-tuning approach leads to a more accurate fine-tuned model, producing syntactically and functionally correct Verilog code.
The evaluation results show improvements of up to 32.6% in comparison to the CodeLlama-7B baseline model and up to 16.7% in comparison to the state-of-the-art models using the VerilogEval evaluation platform.
Research Special Session


Design
DescriptionAs quantum computing technology continues to advance, the quantum research community is finding new ways to utilize them as experimental and computational platforms for physics, chemistry, materials science, and biology. Significant progress has been made in quantum hardware, software, algorithms, and quantum error correction, enabling more reliable and scalable quantum computations. These advancements allow researchers to probe scientific questions using both analog and digital quantum simulators with increasing
accuracy. I will discuss recent efforts to harness quantum technological progress, including error mitigation strategies, to address complex problems in chemical and materials physics.
Tutorial


Design
Sunday Program
DescriptionThis tutorial provides a comprehensive exploration of design automation for quantum computing, structured into three focused sections, each addressing critical challenges and advancements in the field. With quantum computing poised to revolutionize technology, the efficient design and optimization of quantum systems are imperative for scaling, performance enhancement, as well as sustainability. The first section, Physical Design of Quantum Computers, introduces a frequency aware placement framework for superconducting quantum chips that mitigates crosstalk and enhances fidelity while reducing spatial violations and substrate size. In addition, this section describes pulse-level noise mitigation techniques, including dynamical decoupling (DD) and hardware-native pulse-efficient gates, demonstrating how these methods improve quantum circuit fidelity on noisy intermediate-scale quantum (NISQ) hardware. The second section, Automated Configuration of Quantum Computer States, presents a peephole optimization algorithm that identifies "don't care" conditions in quantum state preparation circuits, enabling a significant reduction in two-qubit gates while maintaining unitary equivalence. This section additionally introduces innovative methods for efficient quantum amplitude encoding of polynomial functions, highlighting approaches that improve state preparation complexity while managing controllable errors. The third section, Sustainability of Quantum Computer Design Automation, examines the environmental impact of quantum circuit simulations. It introduces a framework to quantify carbon emissions associated with large-scale simulations, emphasizing the need for sustainable practices in quantum design automation.
Through these focused sections, this tutorial aims to bridge theoretical research and practical applications, offering attendees a thorough understanding of the current challenges and innovative solutions in quantum design automation. By addressing physical design, computational optimization, and sustainability, this tutorial provides a holistic perspective on advancing quantum technologies for scalable and impactful real-world applications.
Section 1: Physical Design of Quantum Computers (Yiran Chen, Siyuan Niu)
Section 2: Automated Configuration of Quantum Computer States (Daniel Tan, Thomas W. Watts)
Section 3: Sustainability of Quantum Computer Design Automation (Weiwen Jiang)
Networking
Work-in-Progress Poster


DescriptionQuantum neural networks (QNN) hold immense potential for the future of quantum machine learning (QML). However, QNN security and robustness remain largely unexplored. In this work, we propose novel trojan attacks based on quantum computing properties in a QNN-based binary classifier. Our proposed Quantum Properties Trojans (QuPTs) are based on the unitary property of quantum gates to insert noise and Hadamard gates to enable superposition to develop trojans and attack QNNs. We showed that the proposed QuPTs are significantly stealthier and exert an immensely high impact on quantum circuits' performance, specifically QNNs. To the best of our knowledge, this is the first work on a trojan attack on a fully quantum neural network independent of any hybrid classical-quantum architecture.
Networking
Work-in-Progress Poster


DescriptionQuantum Secure Hash Oracle (QSHO) is probably one of the first attempts at bringing a new approach to the secure computation of cryptographic functions. The QSHO starts with Kyber key encapsulation and AES encryption. Its quantum-resistant hashes resist Grover's search algorithm as well as hybrid classical-quantum algorithms. Detailed quantum simulations of the impact of Grover's search on both pre-image and collision resistance are made available by the QSHO, offering clear and detailed assessment of quantum resistance. This oracle uses quantum randomness through QRNG and analyzes resistance against possible quantum attacks. QSHO gives a key milestone in post-quantum cryptography. It can provide secure quantum-era systems for any service.
Research Special Session


Design
Description"What is your PQC Readiness" - in this talk, we will explore this topic, why is it important, and how to address this question for any enterprise, and for the entire digital world. Quantum computing capabilities are evolving fast and are getting more powerful. With that several classical cryptography protocols such as RSA are poised to be broken in the next few years. NIST has already announced four cryptography protocols as part of the first batch of post-quantum cryptography standards (in August '24). However, implementation and adoption of quantum-resistant cryptography is a hard problem given the complexities of today's internet and computing stack. Our research work has led to development of algorithms and a system - Quartz (Quantum Risk and Threat Analyzer) for observability for quantum vulnerabilities for cryptography suites, where they are used, and analyzing their risks. We developed the concept of cryptography supply chain, and cryptography bill of materials (CBOM). In this talk, we will delve into the following subtopics: (1) Analyze the problem of PQC-readiness of an enterprise, of an individual entity and of the digital world at scale. (2) present Quartz - a state of the art system to analyze and measure the quantum vulnerabilities and associated risks across the computing stack, (3) present our latest analysis of deployment and usage of PQC, and whether these implementations are ready to be adopted in the real-world. (4) A demo of our Quartz system.
Research Manuscript


Systems
SYS2: Design of Cyber-Physical Systems and IoT
DescriptionWe study the vulnerability of Cyber-physical systems (CPS) under stealthy sensor attacks in black-box scenarios. "Black-box" refers to scenarios where the attacker has minimal knowledge of the target system. Designing a stealthy sensor attack sequence under this scenario has two main challenges.
The first one lies in ensuring the stealthiness of the sensor attack, meaning that applying the generated sensor attack sequence to the CPS does not trigger an alert. The second one is maintaining stealthiness throughout the attack generation process, i.e., limiting the alarm frequency while generating the attack sequence. To address the above challenges, we develop a query-based black-box stealthy attack framework to violate the safety of the CPS. To maintain stealthiness during training, an active learning method is introduced to distill the detector's information into a time series model. The stealthy attack sequence is then generated from that model. Experiments on four numerical simulations and a high-fidelity simulator demonstrate the effectiveness of the proposed framework.
Research Manuscript


Security
SEC1: AI/ML Security/Privacy
DescriptionDetecting mission-critical anomalous events and data is a crucial challenge across various industries, including finance, healthcare, and energy. Quantum computing has recently emerged as a powerful tool for tackling several machine learning tasks, but training quantum machine learning models remains challenging, particularly due to the difficulty of gradient calculation. The challenge is even greater for anomaly detection, where unsupervised learning methods are essential to ensure practical applicability. To address these issues, we propose Quorum, the first quantum anomaly detection framework designed for unsupervised learning that operates without requiring any training.
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionDiffusion Transformers (DiTs) have demonstrated unprecedented performance across various generative tasks including image and video generation.
However, the large amount of computation in the inference process and the iterative sampling steps of DiT models result in high computational costs, leading to substantial latency and energy consumption challenges.
To address these issues, we propose a redundancy-aware DiT (RADiT), a novel software-hardware co-optimization accelerator for DiTs that minimizes redundant operations in the iterative sampling stages.
We identify data redundancy by evaluating blockwise input features and skip redundant computations by reusing results from consecutive timesteps.
Furthermore, to minimize accuracy degradation and maximize computational efficiency, the Dynamic Threshold Scaling Module (DTSM) and Compress and Compare Unit (CCU) are employed in the redundancy detection process.
This approach enables DiTs to achieve up to 1.8x and 1.7x faster speeds for image and video generation, respectively, without compromising quality, along with 41% and 45.5% reductions in energy consumption.
Our RADiT scheme improves throughput by 1.67x and 1.76x for image and video generation tasks, respectively, while maintaining output quality and significantly reducing energy consumption.
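A minimal Python sketch of the timestep-to-timestep redundancy check described above; the block granularity, relative-change metric, and fixed threshold are illustrative assumptions (RADiT's DTSM adapts the threshold dynamically and the CCU implements the comparison in hardware):

import numpy as np

def blockwise_skip(prev_feats, curr_feats, cached_out, compute_block, threshold=0.05):
    """Per-block redundancy check across consecutive sampling timesteps: if a
    block's input barely changed, reuse the cached output instead of recomputing."""
    outputs = []
    for prev, curr, cached in zip(prev_feats, curr_feats, cached_out):
        change = np.abs(curr - prev).mean() / (np.abs(prev).mean() + 1e-8)
        if change < threshold:               # redundant: skip this block
            outputs.append(cached)
        else:                                # recompute and refresh the cache
            outputs.append(compute_block(curr))
    return outputs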
Research Manuscript


Security
SEC3: Hardware Security: Attack & Defense
DescriptionWith the surge in data computation, Remote Direct Memory Access (RDMA) becomes crucial to offering low-latency and high-throughput communication for data centers, but it faces new security threats. This paper presents Ragnar, a comprehensive suite of hardware-contention-based volatile-channel attacks leveraging the under-explored security vulnerabilities in RDMA hardware. Through comprehensive microbenchmark reverse engineering, we analyze RDMA NICs at multiple granularity levels and then construct covert-channel attacks, achieving 3.2x the bandwidth of state-of-the-art RDMA-targeted attacks on CX-5. We apply side-channel attacks on real-world distributed databases and disaggregated memory, where we successfully fingerprint operations and recover sensitive address data with 95.6% accuracy.
Networking
Work-in-Progress Poster


DescriptionLiDAR is a technology that uses laser pulses to measure the distance to an object and is important for Advanced Driver Assistance Systems (ADAS). However, it can be affected by adverse weather, which may reduce the safety of ADAS. This paper proposes a convolutional neural network that uses lightweight network nodes repeated multiple times instead of a traditional large-scale model. The proposed approach reduces the parameter size, and a consistent pre-processing method is designed to control the input parameters of the network. This pre-processing reduces the data size while retaining sufficient features for neural network training. The method was tested on a LiDAR system, demonstrating its ability to run on a simple embedded system and be deployed in heavy-rain environments for real-time processing. Testing on the WADS dataset verified 98.53% accuracy and a 96.31% F1 score.
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
DescriptionApproximate Logic Synthesis (ALS) is an automated technique designed for error-tolerant applications, optimizing delay, area, and power under specified error constraints.
However, existing methods typically focus on either delay reduction or area minimization, often leading to local optima in multi-objective optimization.
This paper proposes a rank-based multi-objective ALS framework using Monte Carlo Tree Search (MCTS).
It develops a non-dominated circuit ranking to guide MCTS in exploring local approximate changes (LACs) across the entire circuit and to generate sets of approximate circuits with strong optimization potential.
Additionally, a Rank-Transformer model is introduced to predict path-domain ranks, enhancing the application of high-quality LACs within circuit paths.
Experimental results show that our framework achieves faster and more efficient optimization in delay and area simultaneously compared to state-of-the-art methods.
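For illustration, a minimal Python sketch of non-dominated (Pareto) ranking, the kind of ranking the framework uses to guide MCTS; the objective tuples and the simple ranking loop are assumptions for exposition, not the paper's implementation:

def dominates(a, b):
    """a dominates b if it is no worse in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_rank(candidates):
    """Assign Pareto ranks to candidate approximate circuits, each given as an
    objective tuple such as (delay, area, error); rank 0 is the Pareto frontier."""
    remaining = dict(enumerate(candidates))
    ranks, rank = {}, 0
    while remaining:
        frontier = [i for i in remaining
                    if not any(dominates(remaining[j], remaining[i])
                               for j in remaining if j != i)]
        for i in frontier:
            ranks[i] = rank
            del remaining[i]
        rank += 1
    return ranks

# Example: the first two candidates are mutually non-dominated (rank 0), the third is rank 1.
print(non_dominated_rank([(5, 10, 0.01), (7, 8, 0.01), (8, 12, 0.02)]))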
Research Manuscript


Security
SEC4: Embedded and Cross-Layer Security
DescriptionControl Flow Attestation (CFA) has emerged as an important service to enable remote verification of control flow paths in safety-critical embedded systems. However, current CFA for commodity devices suffers performance penalties due to code instrumentation and frequent context switches required to securely log control flow paths at runtime. Our work introduces a technique that leverages commodity hardware extensions, namely the Micro Trace Buffer (MTB) and the Data Watchpoint and Trace Unit (DWT), to track control flow paths in parallel with the execution of the attested program, thus avoiding the aforementioned overheads present in state-of-the-art CFA. Our evaluation (based on an open-source prototype) demonstrates substantial performance gains, enhancing the practicality and security of CFA.
Research Manuscript


EDA
EDA2: Design Verification and Validation
DescriptionA refinement relation captures the state equivalence between two sequential circuits. It finds applications in various tasks of VLSI design automation, including regression verification, behavioral model synthesis, assertion synthesis, and design space exploration. However, manually constructing a refinement relation requires an engineer to have both domain knowledge and expertise in formal methods, which is especially challenging for complex designs after significant transformations. This paper presents a rigorous and efficient sequential equivalence checking algorithm for non-cycle-accurate designs. The algorithm can automatically find a concise and human-comprehensible refinement relation between two designs, helping engineers understand the essence of design transformations.
We demonstrate the usefulness and efficiency of the proposed algorithm with experiments and case studies. In particular, we showcase how refinement relations can facilitate error detection and correction for LLM-generated RTL designs.
Research Manuscript


Security
SEC2: Hardware Security: Primitives & Architecture, Design & Test
DescriptionResistive random-access memory (ReRAM) based Physical Unclonable Functions (PUFs) have emerged as an attractive hardware security primitive due to their low energy consumption and compact footprint. However, the reliability of existing ReRAM-based PUFs is challenged by read noise and temperature variations, as well as their resistance to Deep Neural Network (DNN) modeling attacks and Side Channel Attacks (SCAs). In this paper, we propose a novel 3T2R ReRAM-based reconfigurable PUF to address these challenges. By adopting the digital 3T2R voltage division cell design, we improve its reliability against ReRAM read noise and temperature variations, while the adjustable analog supply voltage of inverters enables quick, low-cost reconfigurability without reprogramming ReRAMs, effectively mitigating DNN modeling and SCA vulnerabilities. Our Re4PUF chip has been experimentally validated, achieving a low Bit Error Rate (BER) of 1% at 85°C, a 7.59-fold reduction compared to existing ReRAM-based PUFs. It also demonstrates robust resistance to both DNN modeling attacks and SCAs, with success rates of approximately 50% and less than 70%, respectively.
Research Manuscript


EDA
EDA4: Power Analysis and Optimization
DescriptionDuring the IR Engineering Change Order (ECO) stage, cell moving leads to uncertain IR-drop results, requiring designers to explore multiple ECO candidates in each iteration to find a solution that effectively mitigates IR-drop, resulting in long evaluation times. Although machine learning (ML)-based predictors have been proposed to expedite IR-drop evaluation, partial simulations are still needed to update features after ECO, taking over an hour and delaying IR-drop results. In this work, we propose a real-time dynamic IR-drop estimation method based on an XGBoost model with a global view of a cell's surroundings. After ECO, our method provides dynamic IR-drop results in minutes without running any simulations, thus achieving real-time estimation. This allows designers to evaluate multiple ECO candidates concurrently in a single iteration. We conducted experiments on five ECO candidates of an industrial design in a 3nm technology. The results show that the proposed model can effectively predict the IR-drop variations of moved cells after ECO, with over 93% of fixed cells detected and an average MAE of 8.75mV. Furthermore, our method achieves an 88X speedup over Voltus (a commercial tool) and a 64X speedup over traditional ML predictors when evaluating a single ECO candidate. The speedup is expected to increase as the number of ECO candidates increases.
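A hedged Python sketch of the general pattern of training a gradient-boosted regressor for per-cell IR-drop prediction; the feature set, data, and hyperparameters are hypothetical placeholders and not the authors' model (assumes the open-source xgboost package is installed):

import numpy as np
import xgboost as xgb

# Hypothetical per-cell features giving a "global view" of the surroundings:
# local power density, neighborhood toggle activity, distance to the nearest
# power strap, pre-ECO IR drop, etc. Training data here is synthetic.
X_train = np.random.rand(10000, 6)
y_train = np.random.rand(10000) * 0.05            # IR drop in volts (synthetic)

model = xgb.XGBRegressor(n_estimators=300, max_depth=8, learning_rate=0.1)
model.fit(X_train, y_train)

# After an ECO move, only the features of moved cells and their neighbors
# change, so predictions return in seconds without re-running any simulation.
X_after_eco = np.random.rand(200, 6)
ir_drop_pred = model.predict(X_after_eco)
print(ir_drop_pred[:5])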
Engineering Poster
Networking


DescriptionCreation of new special constructs, exploratory layouts, and design optimization is highly time- and resource-consuming. No existing layout design method makes all relevant design information readily available. We present a layout design method that utilizes design information and design rule values, with the option to consider Si-validated margin, AI-predicted margin, and PEX/parasitics information. Design information is available to the user in a layout overlay, updating in real time with design/shape modifications, allowing for design optimization and creation of new special constructs. Margin information is obtained through Si validation experiments, as well as default (non-experimental) values determined by process assumptions and DR calculations. PEX information is obtained from existing PEX tables and interpolated values. Exploratory design and margin prediction may reveal blind spots in design rule interactions, leading to PDK improvement. Also discussed is a methodology to predict design-rule margin compaction opportunities, using machine learning applied to the silicon validation dataset.
Research Manuscript


Systems
SYS6: Time-Critical and Fault-Tolerant System Design
DescriptionThe demand for efficient large language model (LLM) inference has propelled the development of dedicated accelerators. As accelerators are vulnerable to hardware faults due to aging, variation, etc., existing accelerator designs often reserve a large voltage margin or leverage algorithm-based fault tolerance (ABFT) techniques to ensure LLM inference correctness. However, previous methods often overlook the inherent fault tolerance of LLMs, leading to high computation and energy overhead. To enable reliable yet efficient LLM inference, in this paper, we propose a novel algorithm/circuit co-design framework, dubbed ReaLM. For the first time, we systematically characterize the fault tolerance of LLMs by performing a large-scale error injection study of representative LLMs and natural language understanding tasks. Then, we propose a statistical ABFT algorithm that fully leverages
the error robustness to minimize error recovery as much as possible. We also customize the error detection circuits to enable a low-cost online collection of error statistics. Extensive experiments show that with only 1.42% circuit area and 1.79% power overhead, our ReaLM can reduce perplexity degradation from 18.54 to 0.29. Compared to existing methods, ReaLM consistently reduces recovery costs across different operating voltages and improves energy efficiency by up to 35.83% without compromising LLM performance.
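As background, a minimal Python sketch of classic checksum-based ABFT for a matrix multiply, the detection primitive that statistical ABFT builds on; the tolerance and the matmul setting are illustrative assumptions, and ReaLM's contribution, deciding statistically when a flagged error is benign enough to skip recovery, is not shown here:

import numpy as np

def abft_matmul(A, B, tol=1e-6):
    """Checksum-based ABFT for C = A @ B: the column sums of C must equal the
    checksum row of A propagated through B; a mismatch flags a faulty column."""
    C = A @ B
    checksum_direct = C.sum(axis=0)            # column sums of the computed result
    checksum_encoded = A.sum(axis=0) @ B       # checksum row carried through the multiply
    faulty_cols = np.nonzero(np.abs(checksum_direct - checksum_encoded) > tol)[0]
    return C, faulty_cols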
Engineering Presentation


AI
Back-End Design
DescriptionFoundry Process Design Kit (PDK) defines a set of semiconductor components for integrated circuit design. High-performance circuit design drives the need for MOSFET PDK compact model recalibration, which can tune multiple model parameters to hit multiple targets beyond the most typical I-V and C-V measurements. However, there is often a discrepancy between chip-level hardware performance and the predictions of SPICE simulation using MOSFET compact models. With the scaling of advanced semiconductor technology, the complexity of device structures, physics, and models increases dramatically. Matching circuit-level hardware measurements demands accuracy from SPICE simulations.
In this work, we propose a new approach that drives recalibration by matching product circuit targets to minimize the gap between chip level product performance and SPICE model prediction. In addition, our approach integrates accurate PEX netlists into the SPICE simulation used for model recalibration. PEX netlist-based model simulation becomes extremely time consuming for large circuits. To improve the overall efficiency and accuracy, we utilize a parallel Bayesian optimization algorithm and associated software infrastructure to solve this multi-circuit optimization. Our experiments show a massive turn-around-time reduction from weeks to a few days for better model quality as measured by predictions of circuit level metrics.
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
DescriptionCoding with hardware description languages (HDLs) such as Verilog is a time-intensive and laborious task. With the rapid advancement of large language models (LLMs), there is increasing interest in applying LLMs to assist with HDL coding. Recent efforts have demonstrated the potential of LLMs in translating natural language to traditional HDL Verilog.
Chisel, a next-generation HDL based on Scala, introduces higher-level abstractions, facilitating more concise, maintainable, and scalable hardware designs.
However, the potential of using LLMs for Chisel code generation
remains largely unexplored.
This work proposes ReChisel, an LLM-based agentic system designed to enhance the effectiveness of Chisel code generation.
ReChisel incorporates a reflection mechanism to iteratively refine
the quality of generated code using feedback from compilation and
simulation processes, and introduces an escape mechanism to break
free from non-progress loops.
Experiments demonstrate that ReChisel improves the success rate of Chisel code generation, achieving performance comparable to state-of-the-art LLM-based agentic systems for Verilog code generation.
Engineering Poster


DescriptionHighly customized block-specific recipes to achieve the best Quality of Results (QoR) are a reality; there is no universal solution that fits all scenarios. Recipes that work for specific blocks often become obsolete quickly and lack broad applicability. Integrating these recipes into workflows to improve productivity can have the opposite effect, resulting in complex, ever-changing flows without adequate regression coverage. A PD flow architecture was designed to maintain a crowdsourced repository of recipes outside the workflow and automate their inclusion and evaluation at runtime with simple command-line arguments.
Engineering Presentation


IP
DescriptionWe propose an architecture for accelerating floating point operations through a novel reconfigurable vector floating point design. This includes support for multiple precisions, including the standard IEEE 754 Single (SP-32), Double (DP-64), Tensorfloat (TF-32), Bfloat (BF-16), and custom configurations like Quarter precision (QP-8). The architecture also introduces vector lane reconfiguration, allowing for efficient parallelization through packing and unpacking techniques. Each vector lane can be adjusted at runtime on an FPGA, providing flexibility in supporting different unrolling factors for loop optimization. The design is implemented on AMD-Xilinx ZCU104 FPGA and integrates DSPs, optimizing LUT usage, power consumption, and performance. For example, SP-32 with DSP achieves a 31.6% reduction in LUT usage, a 9.3% increase in operating frequency, and 24.7% lower power consumption. We recommend DSP usage at higher bit precisions. This flexibility in configuring precision levels allows for efficient utilization of FPGA resources and energy-efficient design, especially for higher precision operations. Incorporating this design into AI/ML workflows, signal processing, and scientific computing accelerates performance, providing a balance between throughput, power efficiency, and computational complexity.
Engineering Poster
Networking


DescriptionIncreasing functionality of automotive multiprocessor SoCs has resulted in increasing power grid complexity, leading to high voltage-ripple noise caused by simultaneous switching of multiple processor blocks in the SoC. Meeting chip-package-system (CPS) performance targets becomes daunting due to this issue. Designers grapple with the lack of accurate chip models for chip-package-system co-analysis for power integrity signoff involving microsecond-long simulations. The conventional Chip Power Model (CPM) falls short in addressing low-frequency noise (0.1 – 50 MHz) caused during chip mode changes over longer durations. Multiprocessor chips have high demand currents that require techniques like clock and power gating to deal with excessive power requirements. However, Dynamic Voltage and Frequency Scaling (DVFS) and clock gating can induce significant simultaneous switching noise (SSN) on VDD. We present here the results of our study, which utilized advanced chip power models involving time extensions, stitching of multiple models, and modulation of high-frequency chip currents over a mode-changing low-frequency current envelope, to help detect and mitigate high peak-to-peak voltage variations in our chip-package-system transient analysis with a faster turn-around time.
Engineering Poster
Networking


DescriptionA High Frequency Clock Generator is an integral part of modern electronic circuits and plays a crucial role in defining the performance of the overall system. Today, designers face the following challenges when verifying high-frequency PLL designs:
1. For critical RF analog applications targeting frequencies up to 20GHz, simulation runtime with a SPICE solver can stretch to a month. This motivates designers to explore other techniques that are prone to modelling and interpolation errors, for which extra design margins are taken into account, eventually leading to over-design.
2. Top-level cross-corner simulations to cover various PVTs further increase verification time.
3. Jitter measurement becomes extremely complex, typically in the sub-ps range. This forces a tight constraint on the maximum step size that the simulator must ensure during threshold crossings.
4. Incorporating the supply (RLC) network to mimic exact SoC behavior, especially inductors, can be tricky.
This paper extensively studies the use of the new enhanced Fast-SPICE simulator (Spectre-FX) on a charge-pump-based PLL design, which shows close-to-SPICE accuracy and can also perform clock jitter and phase noise verification in the presence of a supply inductive network without compromising accuracy, delivering an impressive 15.3X performance gain.
Engineering Presentation


Front-End Design
DescriptionHardware Description Language (HDL) environments and simulators have always leveraged non-HDL languages to expand beyond the capabilities of native language features. Since Python is currently the most popular programming language, it is no surprise that there is interest in leveraging it to extend the capabilities of UVM testbench environments. Python can quickly and easily bring a range of new capabilities to your testbench by leveraging a broad ecosystem of existing libraries for tasks as varied as reading files, such as JSON or ELF, and performing numeric manipulation. Capturing test sequences in Python can shorten development iteration time and make use of existing Python language knowledge. There are several challenges to achieving a full-featured, easy-to-use integration. The two languages use conflicting approaches to concurrent programming, the Python C API is verbose and complex, and the most common integration mechanism involves generating code that is tied to a specific Python version. This paper describes the PyHDL-IF package that implements a bi-directional method-calling interface between Python and SystemVerilog, bridges concurrent-programming differences between the two languages, and removes the need to generate and compile C wrapper code.
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionWith increasingly complex design rules and pin density in advanced technology nodes, achieving a violation-free layout has become more challenging, making rip-up and reroute (RUR) the most runtime-intensive component of detailed routing. We propose a novel reinforcement learning (RL)-based approach to enhance the window-based RUR process. Our method features a dynamic window generation strategy that adjusts window size and position based on the distribution of design rule violations (DRVs), enabling efficient targeting of congested areas. By leveraging the predictive capabilities of RL, our approach aims to minimize DRVs and achieve high-quality routing results. Experimental results demonstrate that our method outperforms the state-of-the-art detailed router TritonRoute, achieving a DRV-free solution while improving wirelength by 0.07% and via count by 2.42% on average and consuming almost the same average runtime.
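A rough Python sketch of the bookkeeping behind DRV-driven dynamic windows: clustering violations on a coarse grid and shrinking the proposed window as local DRV density grows. The grid size, bounds, and inverse-density rule are hypothetical heuristics standing in for the learned RL policy:

import numpy as np

def propose_windows(drv_coords, die_w, die_h, cell=50, min_size=20, max_size=120):
    """Bucket design-rule violations onto a coarse grid and propose rip-up-and-
    reroute windows whose size shrinks as the local DRV count grows."""
    grid = np.zeros((die_w // cell + 1, die_h // cell + 1), dtype=int)
    for x, y in drv_coords:
        grid[x // cell, y // cell] += 1
    windows = []
    for (gx, gy), count in np.ndenumerate(grid):
        if count == 0:
            continue
        size = int(np.clip(max_size / (1 + count), min_size, max_size))
        cx, cy = gx * cell + cell // 2, gy * cell + cell // 2
        windows.append((cx - size // 2, cy - size // 2, size, size))  # (x, y, w, h)
    return windows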
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionWe introduce the ReMaP framework, which generates expert-quality macro placements through recursively prototyping and periphery-guided relocating. A key innovation is ABPlace, an angle-based analytical method that arranges macros along an ellipse to facilitate a rough distribution near the periphery, while optimizing dataflow, minimizing overlap, and ensuring convergence. Based on the results of ABPlace, an efficient heuristic is proposed to position macros along the chip's periphery, mirroring practices often employed by experts. Our framework outperforms three leading macro placers in both WNS and TNS across eight test cases, achieving improvements up to 34.15% in WNS and 65.39% in TNS, as tested on the popular OpenROAD-flow-scripts infrastructure. Additionally, our parameter auto-tuning method further improves timing by 8.75%.
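A minimal Python sketch of the angle-based intuition: seeding macro centers along an ellipse inscribed in the die so they start near the periphery. The margin parameter and uniform angular spacing are illustrative assumptions; ABPlace optimizes these angles analytically for dataflow, overlap, and convergence:

import math

def ellipse_seed_positions(num_macros, die_w, die_h, margin=0.1):
    """Seed macro centers along an ellipse inscribed in the die, parameterized by
    angle, giving a rough periphery-oriented starting distribution."""
    a = die_w * (0.5 - margin)        # semi-axis along x
    b = die_h * (0.5 - margin)        # semi-axis along y
    cx, cy = die_w / 2.0, die_h / 2.0
    positions = []
    for i in range(num_macros):
        theta = 2 * math.pi * i / num_macros
        positions.append((cx + a * math.cos(theta), cy + b * math.sin(theta)))
    return positions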
Research Manuscript


Systems
SYS5: Embedded Memory and Storage Systems
DescriptionThe deployment of commercial-off-the-shelf (COTS) GPUs in space has emerged as a promising approach for supporting in-orbit deep neural network (DNN) inference. However, unlike in terrestrial environments, understanding the impact of space radiation on COTS GPU-enabled DNNs is critical. This is challenging because existing methods, such as real-world radiation testing and software emulation, fail to link radiation-induced memory errors to runtime DNN behaviors. In this paper, we propose REMU, a memory-aware Radiation EMUlator, to fill this gap. REMU introduces a dual addressing mechanism across virtual, physical, and DRAM memory spaces, enabling precise mapping and efficient injection of radiation-induced errors from DRAM into runtime DNN inference. Extensive evaluations across 10 well-known DNN models and 2 typical in-orbit computing tasks demonstrate the effectiveness of REMU, providing valuable insights into the resilience of runtime DNN inference to space radiation.
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionThe Neuromorphic Continual Learning (NCL) paradigm leverages Spiking Neural Networks (SNNs) to give AI systems continual learning (CL) capabilities so they can adapt to dynamically changing environments. Currently, the state-of-the-art employs a memory replay-based method to maintain old knowledge. However, this technique relies on long timesteps and compression-decompression steps, thereby incurring significant latency and energy overheads, which are not suitable for tightly-constrained embedded AI systems (e.g., mobile agents/robotics). To address this, we propose Replay4NCL, a novel efficient memory replay-based methodology for enabling NCL in embedded AI systems. Specifically, Replay4NCL compresses the latent data (old knowledge), then replays it during the NCL training phase with small timesteps, to minimize processing latency and energy consumption. To compensate for the information loss from reduced spikes, we adjust the neuron threshold potential and learning rate settings. Experimental results on the class-incremental scenario with the Spiking Heidelberg Digits (SHD) dataset show that Replay4NCL can preserve old knowledge with a Top-1 accuracy of 90.43%, compared to 86.22% for the state-of-the-art, while effectively learning new tasks, achieving a 4.88x latency speed-up, 20% latent memory saving, and 36.43% energy saving. These results highlight the potential of our Replay4NCL methodology to further advance NCL capabilities for embedded AI systems.
Research Manuscript


Security
SEC1: AI/ML Security/Privacy
DescriptionFederated learning enables decentralized model training while preserving data privacy. However, since the learning process overlays the physical network infrastructure, learning efficiency can be impacted by network connectivity. In this work, we conducted extensive experiments to empirically characterize these impacts and leverage the insights to propose an adaptive federation protocol, where clients with limited bandwidth are prompted to transmit adaptively compressed gradient updates only when warranted by the gradient similarity score between the local and global models. Our evaluation in simulated environments and on real hardware devices shows bandwidth savings of 60% to 78% compared to state-of-the-art methods.
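A hedged Python sketch of one way such a gate could look: compare local and global gradient directions and upload a top-k-compressed update only when they diverge. The cosine metric, threshold, gating direction, and top-k compression are assumptions for illustration, not the paper's exact protocol:

import numpy as np

def should_transmit(local_grad, global_grad, threshold=0.9):
    """Gate uploads on the cosine similarity between local and global gradients;
    transmit only when the local update appears to add new information."""
    cos = np.dot(local_grad, global_grad) / (
        np.linalg.norm(local_grad) * np.linalg.norm(global_grad) + 1e-12)
    return cos < threshold

def compress(grad, keep_ratio=0.1):
    """Top-k sparsification as one possible form of adaptive compression."""
    k = max(1, int(len(grad) * keep_ratio))
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    sparse = np.zeros_like(grad)
    sparse[idx] = grad[idx]
    return sparse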
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionThis paper presents ResISC, an RNS-based integrated sensing and computing architecture enabling efficient edge AI. ResISC platform features (i) an in-sensor residue encoder converting images directly to RNS in the analog domain, (ii) an energy-efficient RNS-based processing-near-sensor CNN accelerator utilizing SOT-MRAM, and (iii) an innovative mixed-radix unit for efficient activation operations. By employing selective channel deactivation, ResISC reduces computation overhead by up to 89%, while achieving a 3.4x improvement in power efficiency and up to a 71x reduction in execution time compared to processing-in-MRAM platforms. Experiments on various datasets demonstrate that ResISC achieves competitive accuracy levels (up to 94.63% on CIFAR-10) with minimal degradation, making it an ideal solution for power-constrained, real-time edge applications.
Research Manuscript


Design
DES2A: In-memory and Near-memory Computing Circuits
DescriptionThe solution of sparse matrix equations is essential in scientific computing. However, traditional solvers on digital computing platforms are limited by memory bottlenecks in large-scale sparse matrix storage and computation. Resistive Random Access Memory (ReRAM)-based computing-in-memory (CIM) offers a promising solution to this challenge but faces constraints in achieving high solution precision and energy efficiency simultaneously in sparse matrix computations. In this work, we propose ReSMiPS, a ReRAM-accelerated sparse mixed-precision solver. ReSMiPS incorporates a novel Fast Sparse Matrix Reordering (FSMR) algorithm and introduces an In-memory float64 (IF64) data format, enabling efficient floating-point sparse matrix computation directly within the analog ReRAM array. By combining our floating-point CIM macro design with a hybrid-domain solution framework, ReSMiPS achieves precision comparable to CPU and GPU-based BiCGSTAB solvers (with errors below 10^−15) on real-world workloads, while demonstrating over two orders of magnitude improvement in both computational speed and energy efficiency.
Networking
Work-in-Progress Poster


DescriptionOutliers in large language models (LLMs) significantly impact performance, particularly in quantization and compression. These outliers often cause substantial quantization errors, degrading model accuracy and limiting deployment on edge devices or specialized hardware. Two common types, massive activations and channel-wise outliers, pose significant challenges. While various quantization algorithms aim to mitigate their effects, few studies deeply explore their root causes.
This paper investigates the formation mechanisms of these outliers and introduces strategies to address them. We propose efficient methods to eliminate most massive activations and channel-wise outliers, enhancing the quantization process and facilitating more effective and accurate model deployment.
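For illustration, a small Python sketch of one way to flag channel-wise outliers before quantization, using a robust z-score on per-channel peak magnitudes; the statistic and threshold are assumptions, not the paper's method:

import numpy as np

def find_outlier_channels(activations, z_thresh=6.0):
    """Flag channels whose peak magnitude is far above the typical channel scale,
    a common symptom of the channel-wise outliers discussed above."""
    per_channel_max = np.abs(activations).max(axis=0)       # shape: [channels]
    median = np.median(per_channel_max)
    mad = np.median(np.abs(per_channel_max - median)) + 1e-8  # robust spread estimate
    z_scores = (per_channel_max - median) / mad
    return np.nonzero(z_scores > z_thresh)[0]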
Networking
Work-in-Progress Poster


DescriptionObject deformations can significantly impair the performance of pre-trained computer vision networks in real-world conditions, which is a crucial aspect of building reliable and stable practical systems.
In encoder-decoder semantic segmentation architecture, low-level features from skip-layer provide positional context to the decoder, but translational distortions can still adversely affect the network's function.
This paper quantitatively and qualitatively analyzes weak-stationarity theory and introduces four inference modes to simulate anomalous object translation scenarios and quantify the network's impairment.
We rethink the deep-supervision strategy and skip-layer connection variants, demonstrating their effectiveness in mitigating the relative degree of impairment, beyond their benefit to the network's absolute performance.
Furthermore, we show that weak data augmentation, combined with these strategies, can achieve competitive performance compared to strong data augmentation, providing guidance for training semantic segmentation networks with limited resources.
In summary, this paper focuses on analyzing and improving the reliable translation robustness to anomalous inputs in encoder-decoder paradigm-based networks, rather than only focusing on further improving the absolute performance.
Engineering Poster
Networking


DescriptionScoreboard is an integral Universal Verification Component (UVC) in a Testbench infrastructure which helps to decide whether the DUT is functioning correctly or not. There are primarily two functionalities within a scoreboard: prediction and evaluation through which it determines the DUT correctness. But with the rising chip complexity, shrinking time to market demands, and the need to verify more features, coding an efficient and reusable scoreboard is the urgent need of the hour since it is one of the difficult blocks within a testbench.
The motivation behind this paper is to present an efficient, reusable, and time-critical scoreboard that can be used across IP and subsystem levels with minimal code modification. The proposed scoreboard architecture uses a hybrid approach that first separates the predictor, or reference model, from the scoreboard to make it reusable. It then uses a mix of uvm_analysis_imp declaration macros and a combination of queues and TLM analysis FIFOs to handle in-order, out-of-order, and in-order-producer comparisons, making it efficient.
Finally, the proposed architecture also draws on complexity theory, breaking a problem into P, NP, and NP-hard classes and applying greedy and brute-force algorithms to quickly verify stimulus, which makes it time-critical and helps meet time-to-market needs.
Networking
Work-in-Progress Poster


DescriptionArithmetic circuits, particularly optimized multipliers following logic synthesis, pose significant challenges for formal verification due to the difficulty in reconstructing word-level functional
components. Traditional algebraic reasoning methods struggle with these circuits, as optimizations obscure the original word-level components, making reverse engineering difficult. This hampers the
reconstruction of pre-optimized structures, leading to an inability to effectively mitigate the explosion of vanishing monomials, which is a critical issue for Computer Algebra (CA)-based approaches. Although recent methods integrating SAT sweeping and CA through analytical techniques for adder boundary detection have shown improved performance, they still face computational limitations. To address this, we propose ReVEAL, a novel approach that combines Graph Neural Networks (GNNs) and GPU acceleration with CA to facilitate the reverse engineering of optimized multiplier netlists back to their abstraction-level architectures. ReVEAL enables formal verification by concurrently validating reference templates against the under-test multipliers using modern SAT solvers and SAT sweeping techniques. Experimental results show ReVEAL achieves a 4.90x speedup and 19.97x memory reduction compared to state-of-the-art CA tools, and a 13.39x performance increase over combined SAT + CA methods.
Research Manuscript


Security
SEC1: AI/ML Security/Privacy
DescriptionBackdoor attacks embed hidden functionalities in deep neural networks (DNN), triggering malicious behavior with specific inputs. Advanced defenses monitor anomalous DNN inferences to detect such attacks. However, concealed backdoors evade detection by maintaining a low pre-deployment attack success rate (ASR) and restoring high ASR post-deployment via machine unlearning. Existing concealed backdoors are often constrained by requiring white-box or black-box access or auxiliary data, limiting their practicality when such access or data is unavailable. This paper introduces ReVeil, a concealed backdoor attack targeting the data collection phase of the DNN training pipeline, requiring no model access or auxiliary data. ReVeil maintains low pre-deployment ASR across four datasets and four trigger patterns, successfully evades three popular backdoor detection methods, and restores high ASR post-deployment through machine unlearning.
Engineering Poster
Networking


DescriptionOur work presents a novel SV EEnet methodology leveraging the Cadence EEnet net type to model current and voltage on a single net, enabling accurate transistor-level modeling and efficient identification of non-idealities such as loading and coupling. By accurately representing the Analog Test Bus (ATB) path, our approach enhances Digital Mixed-Signal (DMS) co-simulations for ATB verification.
This advancement significantly reduces the reliance on Analog Mixed-Signal (AMS) co-simulations, simplifying the verification process.
The methodology can be used and implemented by all mixed-signal SoCs to expedite behavioural model generation with better accuracy than conventional VAMS models. It reduces execution cycle time and achieves higher coverage in AMS co-simulations by covering scenarios that were earlier incomprehensible in DMS co-sim.
Engineering Presentation


AI
Front-End Design
Chiplet
DescriptionThis work introduces a digital design flow advisor, a groundbreaking AI-based approach that assists designers throughout the intricate stages of digital ASIC design. The focus is initially on CDC analysis and quality checks, often underestimated and time-consuming aspects. Using AI agents, this flow advisor provides users with actionable recommendations and fosters discussions via a chatbot interface. It seamlessly integrates with existing CI/CD flows and processes regression logfiles offline using an AI agent. By prioritizing tasks, providing recommendations, and facilitating user interaction via a GUI, this flow advisor significantly enhances efficiency, productivity, and quality in the industry. This presentation will explore the challenges in digital ASIC design, the benefits of the digital design flow advisor, and its transformative impact on the industry.
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionCoarse-Grained Reconfigurable Arrays (CGRAs) balance performance and power efficiency in computing systems. Effective compilers play a crucial role in fully realizing their potential. The compiler maps Data Flow Graphs (DFGs), which represent compute-intensive loop kernels, onto CGRAs. However, existing compilers often tackle DFG nodes individually, neglecting their intricate inter-dependencies. We introduce a novel mapping paradigm called Rewire that can place and route multiple nodes in one shot.
Rewire first generates routing information that is shareable among multiple nodes via propagation. Then, Rewire intersects the routing information to generate individual placement candidates for each node.
Finally, Rewire innovatively utilizes data dependencies as constraints to quickly find suitable placement for multiple nodes together. Our evaluation demonstrates that Rewire can generate more near-optimal mappings than prior works. Rewire achieves 2.1x and 1.3x performance improvement and 13.5x and 4.7x compilation time reduction, respectively, compared to two popular mappers.
Engineering Presentation


IP
DescriptionImplementing designs with complex hard macros, like PHY subsystems, creates challenges for timing closure. Liberty models detail the timing intent, requiring correct arcs for all interface data and clock ports. At the integration level, understanding clock and data paths in these hard macros can be overwhelming. Constraints linting and timing analysis tools often miss issues like missing arcs, risking late detection of top-level timing problems or silicon failure if left unaddressed. For instance, a missing arc on a clock port for a maximum frequency clock can lead to under-constrained designs, while a missing data port arc may leave synchronous paths untimed.
This paper proposes a utility for validating hard macro Liberty models against design goals and a dashboard to aid integrators in establishing top-level constraints. The process involves designers generating a preliminary design intent in XLS format from the STA session, followed by manual review to create a golden reference. Subsequent Liberty revisions are compared to this reference, highlighting discrepancies. Additionally, the methodology generates a clock tree trace from the root to output clock ports, pinpointing missing arcs. This approach leverages Synopsys PrimeTime along with Python and Perl scripts and is applicable to any hard macro design.
Engineering Presentation


IP
DescriptionFoundational IP design of standard cells, memories, IOs, and more is a crucial component of SoC design. These foundational IPs can have numerous views, one of which is Liberty, a representation of timing, power, and noise compiled within a cell library. With the advancement of technology, Liberty data has increased in complexity, along with its representation within cell libraries; this added complexity can account for data such as noise, waveforms, and statistical variation.
Some of the aforementioned data can be difficult to interpret and analyze, which is often a bottleneck in the first stages of the design. As such, ease of use and accuracy in validating this data are crucial to minimize the time spent in the QA process and reduce overall cycle time.
This presentation elaborates on a solution embedded into NXP's QA flow to alleviate the most difficult aspects of verification, which include:
1. QA of complex cells in the most advanced tech nodes
2. Analysis of intricate EDA data such as LVF, moments, and CCS
3. Detection of errors, specifically outliers in liberty data with AI
Engineering Presentation


AI
Systems and Software
DescriptionA Software-Defined Vehicle is any vehicle that manages its operations, adds functionality, and enables new features primarily or entirely through software. Such a vehicle makes use of aggregation units: multi-subsystem SoCs that combine real-time tasks, such as vehicle bus protocols, with feature-rich, high-performance applications. Typically, at least one sub-system manages safety-critical and cybersecurity duties, besides implementing the early ROM-based bootstrap procedure, in compliance with the standards. Therefore, validating and verifying boot-ROM code peculiarities becomes a challenge, since it must be treated as a piece of silicon rather than a piece of volatile SW that can be upgraded anytime.
Networking
Work-in-Progress Poster


DescriptionMachine Learning (ML) applications are exponentially scaling their model parameter sizes and complexity. Inference and training access memory in unpredictable orders, making them hard to optimize. This work presents an architecture for ML acceleration with several key features. First, it reduces unique memory page accesses by using the data in the open page to compute and store partial results asynchronously. Second, it reduces memory latency for different memory organizations by asynchronously fetching and interleaving data from independent banks. This work achieves an area efficiency of 1896.30 GFLOPS/mm2 and a power efficiency of 17.07 GFLOPS/W.
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionNonvolatile field-programmable gate arrays (NVFPGAs) can use multi-level cell (MLC) nonvolatile memories (NVMs) to enhance their logic density. However, the high-density design of NVFPGAs degrades the intra-routability of configurable logic blocks (CLBs), which significantly prolongs the time consumed by the packing process in the computer-aided design (CAD) flow. To relieve the efficiency degradation, in this paper, we propose a routability-aware re-pair stage to adjust the logical-physical look-up table (LUT) assignments to mitigate the congestion and improve their intra-routability, thereby reducing the packing time. In addition, exploiting the structural equivalence of MLC LUTs, we remove unnecessary intra-routing attempts from packing to further improve efficiency. Evaluation shows the proposed strategies reduce packing time by 41.48% on average.
Engineering Presentation


AI
Back-End Design
DescriptionChallenging high-performance design schedules with competitive PPA targets may lead to congested or unrouteable designs during the EDA backend process. To address these challenges, we propose multiple novel routing congestion mitigation methods, including: (1) finer integration of an estimated-global-routing-congestion-based cell spreading method into optimization flows to mitigate congestion, (2) an incremental-routing-based cell movement transformation to further mitigate congestion, and (3) temporary removal of polarity inverters to eliminate their influence on cell placement and reduce congestion. The proposed enhancements showed up to 3% congestion reduction on high-performance industrial designs, leading to their expedited adoption by our design teams. They turned several initially unrouteable dense designs routable, and our optimization steps have gone from typically increasing congestion to decreasing congestion!
Networking
Work-in-Progress Poster


DescriptionLarge Language Models (LLMs) show promise in assisting with Register Transfer Level (RTL) design tasks, including code summarization, documentation, and question answering. However, real-world RTL projects often involve extensive codebases that exceed the prompt length limitations of current LLMs, making it difficult for these models to fully comprehend the designs when only partial code snippets are provided. To overcome this challenge, we propose RTLExplain, a bottom-up, data-dependency-aware approach that processes structurally truncated code along with summaries of relevant signals and modules, presented as comments, to generate comprehensive summaries of RTL components. Our method does not require further training or fine-tuning. Experiments on code summarization demonstrate consistent improvements across various medium-to-large RTL projects, even when variable names are obfuscated. Furthermore, we generate documentation from the produced summaries and leverage project code and documentation for Retrieval-Augmented Generation (RAG) in question answering tasks. Experiments show that our enhanced database, when combined with RAG, improves question-answering accuracy by 37% compared to naïve prompting and 27% compared to conventional RAG.
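A minimal Python sketch of a bottom-up, dependency-aware summarization flow of the kind described above: modules are visited in post-order so each prompt contains only the module's code plus its children's summaries as comments. The data structures and the summarize_llm callable are hypothetical placeholders, and the dependency graph is assumed acyclic:

def summarize_project(modules, deps, summarize_llm):
    """Summarize RTL modules bottom-up: leaves first, then parents with their
    children's summaries injected as comments, so each prompt stays within the
    context limit. modules maps name -> code; deps maps name -> child names."""
    summaries, order, visited = {}, [], set()

    def visit(mod):                      # post-order walk of the dependency graph
        if mod in visited:
            return
        visited.add(mod)
        for child in deps.get(mod, []):
            visit(child)
        order.append(mod)

    for mod in modules:
        visit(mod)
    for mod in order:
        context = "\n".join(f"// {c}: {summaries[c]}" for c in deps.get(mod, []))
        summaries[mod] = summarize_llm(modules[mod], context)
    return summaries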
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionPlacement plays a critical role in VLSI physical design, particularly in optimizing routability. With continuous advancements in semiconductor manufacturing technology, increased integration, and growing design complexity, managing routing congestion during placement has become increasingly challenging. Despite the wide range of techniques for improving routability, these methods often lack theoretical guidance or sever the intrinsic connection between placement and routing optimization.
In this paper, we present an ADMM-based framework for unified optimization of placement and routing. Leveraging Wasserstein distance and bilevel optimization, our approach provides a unified framework for congestion optimization by alternately running global routing and incremental placement. Furthermore, we introduce a simple yet effective model for node inflation-based global placement, where convex programming is employed to determine the optimal inflation ratio.
Experimental results on a diverse set of open-source industrial benchmarks from CircuitNet and Chipyard demonstrate that our method achieves superior congestion reduction compared to widely used tools such as OpenROAD, Xplace 2.0, and DREAMPlace 4.1, while maintaining competitive wirelength and runtime.
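For readers unfamiliar with the alternating scheme referenced above, the standard scaled-form ADMM iteration for $\min_{x,z} f(x) + g(z)$ subject to $Ax + Bz = c$ alternates two subproblem solves with a dual update; this is the generic template only, not the paper's exact placement/routing formulation:

\[
\begin{aligned}
x^{k+1} &= \arg\min_{x}\; f(x) + \tfrac{\rho}{2}\,\lVert Ax + Bz^{k} - c + u^{k}\rVert_2^2,\\
z^{k+1} &= \arg\min_{z}\; g(z) + \tfrac{\rho}{2}\,\lVert Ax^{k+1} + Bz - c + u^{k}\rVert_2^2,\\
u^{k+1} &= u^{k} + Ax^{k+1} + Bz^{k+1} - c.
\end{aligned}
\]

Loosely, the two subproblem solves can be read as the incremental-placement and global-routing passes that the framework alternates.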
Networking
Work-in-Progress Poster


DescriptionData security is critical for protecting valuable data, especially in environments that demand strict data integrity and confidentiality protection. While memory security has been extensively studied, SSD security remains underexplored despite its growing importance. Recent approaches often adapt memory-focused techniques, such as Merkle tree-based protection, to SSDs, leading to significant overhead from metadata transfers between SSD controllers and NAND Flash devices. Additionally, existing solutions often focus solely on data-at-rest encryption, neglecting the equally critical issue of securing data in transit.
This paper introduces a treeless SSD security mechanism that exploits the out-of-place update property of NVMe-based SSDs, where the physical address (PA) changes with each write. By leveraging the PA as a timestamp, our approach eliminates the need for counters and Merkle trees, significantly reducing metadata storage requirements by 1.2x and triggered flash commands by 0.52x. This includes a notable 1.8x reduction in program commands, which extends flash cell endurance. These improvements result in a 2.2x reduction in latency, which directly accelerates execution time by 6.7x. Additionally, we show a 0.1x increase in throughput over the closest competing work.
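A minimal sketch of the core idea, binding integrity metadata to the ever-changing physical address instead of a dedicated counter, is shown below using HMAC as a stand-in authentication primitive. The key handling, block granularity, and MAC construction are illustrative assumptions, not the paper's design.

import hmac, hashlib

KEY = b"\x00" * 32   # device-unique key; illustrative only

def mac_page(data: bytes, lpa: int, pa: int) -> bytes:
    # The physical address changes on every out-of-place write, so it can play
    # the role of a per-write timestamp/nonce without a separate counter tree.
    msg = lpa.to_bytes(8, "little") + pa.to_bytes(8, "little") + data
    return hmac.new(KEY, msg, hashlib.sha256).digest()

def write_page(flash, l2p, lpa, data, next_free_pa):
    pa = next_free_pa                      # out-of-place update: always a fresh PA
    flash[pa] = (data, mac_page(data, lpa, pa))
    l2p[lpa] = pa                          # logical-to-physical mapping update

def read_page(flash, l2p, lpa):
    pa = l2p[lpa]
    data, tag = flash[pa]
    assert hmac.compare_digest(tag, mac_page(data, lpa, pa)), "integrity violation"
    return data

flash, l2p = {}, {}
write_page(flash, l2p, lpa=7, data=b"secret block", next_free_pa=1001)
print(read_page(flash, l2p, lpa=7))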
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionGraph-traversal-based Approximate Nearest Neighbor (GANN) search and construction have become key retrieval techniques in various domains, such as recommendation systems and social networks. However, deploying GANN in real-world scenarios faces significant challenges, as high-dimensional vertices within the graph can lead to intensive memory demands. Although architectures like NDSearch have been proposed to accelerate GANN search, they are hard to deploy for GANN construction, as their pre-processing methods introduce massive overhead in dynamic graphs.
In this paper, given the observation that neighboring vertices in a dynamic graph exhibit feature similarity, we propose SAGA, the first accelerator that alleviates the memory bottleneck in GANN construction.
To capture this similarity, we directly leverage the first step of construction to gather vertices with the same starting point into a cluster to minimize the similarity detection overhead.
Next, we decompose vertices into key and non-key ones, whose deltas fall in a narrow range and are therefore suitable for quantization to lower bit widths.
Building upon this approach, we design a specialized architecture, which efficiently implements the GANN construction by two-level scheduling and a mixed-precision supported bit-serial unit.
Through comprehensive evaluation, we demonstrate that SAGA can achieve an average speedup of 9.30$\times$, 4.87$\times$, 4.15$\times$ and 35.46$\times$, 7.60$\times$, 5.15$\times$ energy savings over CPU, GPU and NDSearch, respectively, while retaining task accuracy.
Research Special Session


AI
DescriptionAgentic AI involves multiple specialized language models working in concert to perform complex tasks. Open-source large language models (LLMs) have enabled the machine learning community to build agentic systems from smaller models that exceed the capabilities of monolithic LLMs. Techniques like chain-of-thought reasoning and prompt caching accomplish complex tasks during inference, shifting the performance bottleneck to the autoregressive decode phase of token generation. However, token generation is inefficient on GPUs for two main reasons: (1) GPUs utilize only 20% of their peak memory bandwidth due to inadequate operator fusion coupled with synchronization overheads at kernel boundaries, and (2) hosting and dynamically switching between a large number of models can be prohibitively expensive and slow.
This talk describes SambaNova's approach to address the challenges above with the SambaNova SN40L Reconfigurable Dataflow Unit (RDU) chip. The SN40L RDU is a 2.5D CoWoS chiplet-based design containing two RDU dies on a silicon interposer. The chip introduces a new three-tier memory system with on-chip distributed SRAM, on-package HBM, and off-package DDR DRAM. Model parameters reside in DDR, and actively used models are cached and served from high bandwidth memory. On-chip streaming dataflow enables an unprecedented level of operator fusion: entire decoder blocks can be automatically fused into a single kernel call. Furthermore, streaming dataflow eliminates synchronization costs between consecutive calls to the same kernel by transforming these calls into a single call to a modified kernel containing a pipelined outer loop, achieving over 75% of peak performance during token generation on 8 and 16 RDUs. At the time of writing, a single rack of 16 SN40L RDUs serves the Deepseek-R1 671B model at 198 tokens/s, the fastest deployment in the world and the first hosted by a non-GPU vendor. Techniques described in this talk are deployed in production in a commercial AI inference cloud at cloud.sambanova.ai.
Research Manuscript


Design
DES6: Quantum Computing
DescriptionThe portfolio optimization problem is one of the most important financial services, yet it suffers from heavy computational pressure due to its arithmetic complexity. Quantum computing has developed rapidly over the last few decades, offering polynomial or even exponential speedups that make it a promising approach. However, existing quantum methods are fundamentally limited by either poor scalability or insufficient accuracy.
In this paper, we propose SAPO, which formally articulates the quantum circuit that seamlessly integrates financial theory and historical data characteristics with quantum algebra.
The circuit design is extended from the HHL algorithm incorporating mean-variance theory, which promotes scalability by equivalent transformation. Then, we present a min-max eigenvalue model that leverages historical financial information to refine parameter settings with high accuracy.
Experiments conducted on stock market data demonstrate that SAPO can effectively reduce the complexity by 36.94% compared to basic HHL and improve the accuracy by 1.46X compared to hybrid HHL.
Networking
Work-in-Progress Poster


DescriptionMicro base stations, with limited antennas and extensive deployment, require scaled-down hardware. Traditional Software Defined Radio (SDR) solutions (e.g., CPU, manycore systems, GPU) offer flexibility but incur high area and power costs, while DSP lacks efficient acceleration for smaller configurations. The key challenge is achieving minimal area and power overhead while meeting 5G requirements. This paper presents a hardware-software co-designed architecture, Sayram, which minimizes overhead for 5G physical layer processing. Sayram integrates an instruction fusion mechanism, along with the compiler for simplified programming, and a Vector Indirect Addressing Memory (VIAM) to minimize memory access cycles, boosting overall processor efficiency. Operating at 1 GHz, Sayram achieves 158 Gops with a 1.18 mm² area, supporting 2T2R and 4T4R PUSCH processing in single-core and dual-core modes, respectively. Evaluations show that Sayram's area efficiency is 48×, 180×, and 2591× higher than manycore, DSP, and CPU architectures, with power efficiency improvements of 54×, 362×, and 4400×, respectively.
Research Manuscript


Design
DES6: Quantum Computing
DescriptionWe present a quantum-inspired algorithm that utilizes Quantum Hamiltonian Descent (QHD) for efficient community detection. Our approach reformulates the community detection task as a Quadratic Unconstrained Binary Optimization (QUBO) problem, and QHD is deployed to identify optimal community structures. We implement a multi-level algorithm that iteratively refines community assignments by alternating between the QUBO problem setup and QHD-based optimization. Benchmarking shows our method achieves up to 5.49% better modularity scores while requiring less computational time compared to classical optimization approaches. This work demonstrates the potential of hybrid quantum-inspired solutions for advancing community detection in large-scale graph data.
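As a concrete reference for the QUBO step, the sketch below builds the standard modularity matrix for a two-way community split and evaluates the resulting quadratic objective for candidate binary assignments. The QHD solver is replaced by brute force, and the two-community restriction and toy graph are illustrative simplifications, not the paper's multi-level algorithm.

import numpy as np
from itertools import product

def modularity_qubo(A):
    """Return B, m such that the modularity of a 2-way split x in {0,1}^n is x^T B x / m.
    (B's rows sum to zero, which makes this binary form exact for two communities.)"""
    k = A.sum(axis=1)
    m = A.sum() / 2.0
    return A - np.outer(k, k) / (2.0 * m), m

# Toy graph: two triangles (nodes 0-2 and 3-5) joined by one bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

B, m = modularity_qubo(A)

# Brute-force stand-in for the QHD/QUBO solver (feasible only for tiny graphs).
best = max(product([0, 1], repeat=n),
           key=lambda x: np.array(x) @ B @ np.array(x))
print("split:", best, "modularity:", (np.array(best) @ B @ np.array(best)) / m)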
Engineering Poster


DescriptionAs data centers and hyper-scalers continue to expand in size and complexity, ensuring reliability has become a formidable challenge. Diagnosing failures is often hindered by limited debugging capabilities that fail to address the entire system. This presentation underscores the pressing need for a scalable, modular debugging solution that supports the full Product Life Cycle (PLC) of data centers, from pre-silicon design and validation to in-field operations.
Networking
Work-in-Progress Poster


DescriptionThis paper presents a scalable framework for enforcing traffic regulations in autonomous driving, integrating CARLA simulations with edge deployments on Raspberry Pi 5 and Jetson Nano p3541. Fine-tuned language models, including GPT-2, GPT-Neo, OPT, and BLOOM, classify driving behaviors. Results highlight platform-specific trade-offs, with Raspberry Pi achieving consistent performance for smaller models in 5–10 seconds, while Jetson Nano, despite CUDA issues, showed faster processing for BLOOM and OPT, reducing detection times by up to 20%. The framework emphasizes adaptability in edge-based compliance monitoring and opportunities for optimization in resource-constrained environments.
Research Special Session


Design
DescriptionThe rapid advancement of AI necessitates increasingly powerful computing platforms, driving the demand for high-bandwidth photonic interconnects. However, designing and optimizing photonic devices remains a complex and time-consuming process, hindering the pace of innovation. We demonstrate how critical a high-throughput simulation engine is for future applications such as heterogeneously integrated photonic devices. We then present our breakthrough simulation platform, which leverages the machine learning hardware and software stack to enable the optimization and design of the next generation of photonic devices needed to power the data fabrics of tomorrow's datacenters.
Networking
Work-in-Progress Poster


DescriptionRunahead execution is a technique to mask memory latency caused by irregular memory accesses. Runahead pre-executes the application code to achieve high prefetch accuracy; however, this technique has been limited to Out-of-Order (OoO) and Superscalar In-Order (Super-InO) cores. For implementation in Scalar In-Order (Scalar-InO) cores, the challenges of tight area and energy constraints remain.
Here, we build the first Scalar-InO processor featuring runahead, SR, into an open-source SoC, from the microarchitecture to the ISA. Through this deployment, we establish that implementing SR in Scalar-InO cores is possible, with minimal area and power overheads, while still achieving high performance. We also present an adaptive prefetch method to further improve performance. In the evaluation, we demonstrate the performance benefits and give the power and area overheads of SR.
Networking
Work-in-Progress Poster


DescriptionGraph Neural Networks (GNNs) are among the most popular machine learning models, driven by the need to process relational information embedded in graph data for numerous learning tasks. These tasks span diverse fields, from social science and physical systems to molecular medicine. However, accelerating these tasks is challenging due to significant variations in workload dimensions and sparsity. For example, molecular medicine requires inductive inference on multiple small molecular graphs, while social science involves transductive inference on a single large social network. Prior works typically optimize for one inference type, failing to efficiently support the other due to two fundamental design limitations: (a) limited scalability, resulting in high latency when processing large workloads or energy inefficiency when processing small workloads, and (b) limited flexibility, which leads to a high number of off-chip memory accesses, resulting in energy inefficiency.
To address these limitations, we propose ScaleX, a spatially scalable accelerator architecture with flexible processing elements (PEs) that efficiently speeds up the inference of a wide range of GNN workloads. To support scalability, we introduce a lightweight and dynamic load-balancing technique that uniformly distributes sparse workloads, achieving high speedup as the design scales. To improve flexibility and reduce off-chip memory accesses, each PE is equipped with an elastic on-chip memory allocator, enabling dynamic memory allocation based on workload size. Furthermore, each PE can be configured to optimize the dataflow used, allowing it to adapt to diverse workloads and improve data reuse. Our results show that for GCN, ScaleX is 5.25× more energy efficient and 2.25× faster than prior works. For GIN and GraphSage, ScaleX achieves 2.2× to 432× speedup over the A100 GPU. Scalability evaluation shows that ScaleX uniformly balances the workloads among PEs as the design scales, achieving superlinear speedup.
Engineering Presentation


IP
DescriptionThis paper proposes a Scenario-Based Mixed Signal Layout Generator to address the challenges of high turnaround time (TAT) and limited exploration capabilities in memory analog IP layout design. The proposed method integrates an interactive Abstract UI, an Information Form, a Scenario Constructor, and a placement and routing (P&R) Scenario utilizing Generation APIs. These components enable efficient layout generation by allowing engineers to interactively create desired P&R patterns while handling complex patterns such as optional routing, advanced configurations, and analog constraints.
The work significantly reduces TAT, as demonstrated by experimental results: layout time for a latch-up prevention circuit decreased from 2 hours to 25 seconds, and for a differential amplifier from 8 hours to 34 seconds. Additionally, pattern modifications for exploration were completed in under 10 minutes, allowing engineers to compare characteristics and select optimal layouts with ease. The proposed methodology enhances productivity, simplifies the design process, and improves the quality of layout design while addressing the growing demand for efficient and scalable analog IP solutions. The approach supports not only manual interaction for tailored pattern generation but also lays the foundation for future automation of layout exploration and characteristic comparison processes.
Networking
Work-in-Progress Poster


DescriptionMachine learning models are advancing circuit design, particularly in analog circuits. They typically generate netlists that lack human interpretability. This is a problem as human designers heavily rely on the interpretability of circuit diagrams or schematics to intuitively understand, troubleshoot, and develop designs. Hence, to integrate domain knowledge effectively, it is crucial to translate ML-generated netlists into interpretable schematics quickly and accurately. We propose Schemato, a large language model (LLM) for netlist-to-schematic conversion. In particular, we consider our approach in the two settings of converting netlists to .asc files for LTSpice and LaTeX files for CircuiTikz schematics. Experiments on our circuit dataset show that Schemato achieves up to 93% compilation success rate for the netlist-to-LaTeX conversion task, surpassing the 26% rate scored by the state-of-the-art LLMs. Furthermore, our experiments show that Schemato generates schematics with a mean structural similarity index measure that is 3x higher than the best performing LLMs, therefore closer to the reference human design.
Networking
Work-in-Progress Poster


DescriptionHigh-performance multipliers are critical components across numerous applications. While FPGAs utilize DSP blocks for efficient multiplication, their limited quantity, fixed placement, and potential for timing closure issues necessitate the exploration of alternative solutions. Logic-based softcore multipliers offer a promising solution, but existing FPGA-oriented implementations predominantly focus on heuristic small-scale multipliers, lacking the scalability and holistic optimization required for larger instances. Moreover, prior designs neglect the tunable balance between latency and resource utilization, employing fixed logic depth and cost, thereby limiting their adaptability to different scenarios. This paper introduces SCMG, a Scalable and Configurable FPGA-based Multiplier Generator that employs Integer Linear Programming (ILP) to overcome these limitations. SCMG offers flexible configuration of multiplier size and logic levels, enabling application-specific performance-prioritized or cost-prioritized optimization. The ILP-based approach efficiently generates optimal designs tailored to specific resource constraints and performance targets. Furthermore, SCMG supports the efficient generation for approximations, providing a valuable option for fault-tolerant applications. Experimental results demonstrate SCMG's significant advantages over AMD Vivado's built-in multiplier IP, achieving up to 23.3% area reduction, 15.6% latency improvement, and 24.5% energy efficiency enhancement. The tool is open-sourced to promote further research and development in this area.
Research Manuscript


Security
SEC3: Hardware Security: Attack & Defense
Research Manuscript


Design
DES3: Emerging Models of Computation
DescriptionFeature extraction and classification of bio-signals are crucial in human-machine interfaces (HMIs), yet they suffer from high delay and limited energy efficiency on conventional hardware. To mitigate this challenge, we propose the SDISC architecture, a neuromorphic HMI with innovations spanning signal encoding, computing-in-memory (CIM) hardware, and algorithm-hardware co-optimization. The following strategies are implemented: (1) a spike-driven feature extractor, achieving >10× sparser dataflow than frame-based methods; (2) in-situ computing based on resistive random-access memory (RRAM), enabling an energy-efficient (4.09 TOPS/W) spiking neural network (SNN) classifier; (3) a Spike-Activity-Distillation algorithm and an Aid-Loser-Only recovery scheme that alleviate the non-ideality of RRAM devices, ensuring SDISC maintains high accuracy (∼98.0%) in long-term inference (>15 days). We further develop an end-to-end SDISC system for real-time EMG-based robot control, achieving low-latency (34 μs) and low-power (39.72 μW/sample) interaction at the edge.
Research Manuscript


EDA
EDA8: Design for Manufacturing and Reliability
DescriptionThe post-exposure bake (PEB) process is a critical step in semiconductor lithography, directly impacting resist profile accuracy and circuit pattern fidelity. Precise modeling of PEB is essential for controlling photoacid diffusion and inhibitor reactions. In this paper, we introduce SDM-PEB, an advanced simulation framework designed to enhance the accuracy of PEB simulations by capturing both intra-layer spatial dependencies and inter-layer depthwise interactions. Leveraging a unique hierarchical feature extractor with overlapped patch merging and efficient self-attention, our approach effectively captures both coarse and fine features at multiple scales. The spatial-depthwise Mamba-based attention unit, centered on a customized selective scan and structured state space model, efficiently captures spatial and depthwise dependencies, enabling precise 3D PEB simulation. Additionally, a PEB focal loss and differential depth divergence regularization term improve the sensitivity to both spatial and depthwise variations, addressing inherent data imbalances in 3D PEB simulations. Our framework is validated with commercial rigorous model, and experimental results demonstrate that the SDM-PEB outperforms previous methods in accuracy and efficiency.
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionUsing multiple power domains is crucial for power savings in modern designs, but advanced nodes face challenges in powering cross-domain cells due to IR drop constraints and resource competition between secondary power and signal routing. This paper presents a detailed placement method that integrates secondary power routing considerations. Leveraging an integer linear programming (ILP) model and targeted cross-domain cell placement adjustments, our approach optimizes secondary power routing efficiency. Validation with commercial design tools demonstrates improvements in secondary power routing performance by reducing the secondary power distribution network (PDN) wirelength and decreasing PDN constraint violations.
Networking
Work-in-Progress Poster


DES5: Emerging Device and Interconnect Technologies
DescriptionSilicon Photonics-based AI Accelerators (SPAAs) have been considered promising AI accelerators that achieve high energy efficiency and low latency. While many researchers focus on improving SPAAs' energy efficiency and latency, their physical security has not been sufficiently studied. This paper first identifies the threat of thermal fault injection attacks on SPAAs based on Vector-Matrix Multipliers (VMMs) utilizing Mach-Zehnder Interferometers. This paper then proposes SecONN, an optical neural network framework that is capable of not only inference but also concurrent detection of such attacks. In addition, this paper introduces a concept of Wavelength Division Perturbation (WDP), where wavelength-dependent VMM results are utilized to increase detection accuracy. Simulation results show that the proposed method achieves 88.7% attack-caused average misprediction recall.
Engineering Presentation


AI
Systems and Software
Chiplet
DescriptionChiplets enable a new design methodology where monolithic System-on-Chips (SoCs) are disaggregated into several chip(let)s integrated together within a System-in-Package (SiP).
The ultimate achievement in this respect is the ability to mix and match heterogeneous chiplets (in different technologies, but most importantly from different vendors). For this vision to happen, the different chiplets must be able to authenticate each other securely and subsequently ensure trustworthy services. This results in three security functions: chiplet hardware bill of materials (HBOM) mutual verification; chiplet software bill of materials (SBOM) aggregation, verification, and reporting (together referred to as "remote attestation"); and key management / payload isolation by cryptography (together referred to as "data protection").
These three functions can leverage known protocols, typically implemented in software. But to run properly, they require the preliminary creation of dedicated key pairs allocated for security services in each chiplet. This occurs at the hardware level, ideally within a root of trust (which can leverage injected keys or intrinsic keys diversified from a PUF).
This presentation will cover the different enrollment steps (a hierarchy of certificates chained in a PKI) throughout the lifecycle. We will also detail the challenges of key renewal after compromise or expiration of the crypto-period, transitioning to post-quantum cryptography, and so on.
Altogether, this approach allows us to formalize the security problem and inventory the underlying assets. It stems from a preliminary protection profile (PP) that will be disclosed and made available for comments. Version 1.0 of the PP is due in June 2025.
Research Panel


Security
DescriptionWith the rise in interconnected devices and the digitization of systems ranging from consumer products to critical infrastructure, there are increased demands for hardware as the foundation for system security. Trust is critical. Recent efforts include the Hardware Common Weakness Enumerations (HW-CWEs), the IEEE P3164 effort (for security annotation of IP), and others. Handling security throughout the design lifecycle, including security design, implementation, verification, and validation, seems a daunting task. So, where are we in the state of the art of hardware security? What are the barriers that prevent research innovations from making it to prime time? What problems haven't we solved yet? Why are we witnessing hardware vulnerabilities that keep increasing in both number and sophistication?
Research Manuscript


Security
SEC3: Hardware Security: Attack & Defense
DescriptionEmerging low-energy computing technologies, in particular approximate computing, are becoming increasingly relevant in key applications. A significant use case for these technologies is reduced energy consumption in Artificial Neural Networks (ANNs), an increasingly pressing concern with the rapid growth of AI deployments. It is essential we understand the security implications of approximate computing in an ANN context before this practice becomes commonplace. In this work, we examine the test case of approximate ANN processing elements (PE) in terms of information leakage via the power side channel. We perform a weight extraction Differential Power Analysis (DPA) attack under three approximation scenarios: overclocking, voltage scaling, and circuit-level bitwise approximation. We demonstrate that as the degree of approximation increases, the Signal to Noise Ratio (SNR) of power traces rapidly degrades. We show that the Measurement to Disclosure (MTD) increases for all approximate techniques. An MTD of 48 under precise computing is increased to at minimum 200 (bitwise approximate circuit at 25% approximation), and under some approximation scenarios >1024, i.e. an increase in attack difficulty of at least 4x and potentially over 20x. A relative Security-Power-Delay (SPD) analysis reveals that, in addition to the across-the-board improvement vs precise computing, voltage and clock scaling both significantly outperform approximate circuits, with voltage scaling as the highest performing technique.
Research Special Session


Design
DescriptionDisaggregated computer architectures are emerging as an interesting paradigm according to which the components of a traditional monolithic server, such as CPU, memory, storage, and networking, are separated into distinct, often independently managed units that communicate over a network. Disaggregation can not only offer benefits such as greater flexibility, scalability, and resource optimization, but it can also enhance security. For example, in the context of enterprise routing, it can offer fine-grained control over the network in that it allows one to deploy security policies, access control rules, and threat detection mechanisms more precisely, ensuring that only authorized traffic flows through the enterprise environment, thus enabling the zero-trust paradigm. It makes patch management easier, because its modularity allows different components to be patched independently. The same benefits translate also to cellular networks. Disaggregation is a key feature of the open radio access network (O-RAN) paradigm, whose goal is to make the radio access network intelligent, virtualized, and fully interoperable. A disaggregated architecture has been proposed for post-quantum security for optical and packet transport equipment. However, disaggregation also introduces several unique security risks, such as increased attack surfaces, increased sensitive data exposure and data corruption, increased difficulty in tracing data provenance, insecure isolation among different components, and insecure APIs. Also, well-known security technologies, such as trusted execution environments, may have to be redesigned in the context of disaggregated architectures. In this talk, after an overview of those benefits and concerns, we focus on research approaches proposed to address some of these concerns in the context of O-RAN and in trusted execution environments.
Research Manuscript


Security
SEC1: AI/ML Security/Privacy
DescriptionEnsuring the confidentiality and integrity of DNN accelerators is paramount across various scenarios spanning autonomous driving, healthcare, and finance. However, current security approaches typically require extensive hardware resources, and incur significant off-chip memory access overheads. This paper introduces SeDA, which utilizes 1) a bandwidth-aware encryption mechanism to improve hardware resource efficiency, 2) optimal block granularity through intra-layer and inter-layer tiling patterns, and 3) a multi-level integrity verification mechanism that minimizes, or even eliminates, memory access overheads. Experimental results show that SeDA reduces performance overhead by 12.26% and 12.29% on server and edge NPUs, respectively, while also providing robust scalability.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionVector similarity search (VSS) is a fundamental operation in modern AI applications, including few-shot learning (FSL) and approximate-nearest neighbor search (ANNS). However, VSS incurs substantial energy and computational overhead, primarily due to frequent vector transfers and the complexity of cosine similarity calculations in high-dimensional spaces. Prior research has explored the use of ternary content addressable memories (TCAMs) for parallel in-memory VSS to enhance energy efficiency. However, existing TCAM-based approaches, such as Exact-Match TCAM (EX-TCAM) with range encoding and Best-Match TCAM (Best-TCAM), are limited to distance-based metrics like the L∞ and L1 norms. These spatial-domain metrics suffer from a significant accuracy gap compared to angular-domain cosine similarity. Cosine similarity is widely regarded as the optimal metric for software-based VSS. To address this limitation, we introduce Seg-Cos, a TCAM-based angular VSS framework with its corresponding pre-processing technique and encoding scheme. The Seg-Cos divides vectors into segments and encodes them as circular ranges based on magnitudes to approximate cosine similarity directly within TCAM. It is the first angular VSS framework compatible with both EX-TCAM and Best-TCAM, enabling accurate and energy-efficient VSS in the angular domain. Our approach encompasses angular quantization, weighted range generation, and circular range encoding design. Simulation results show that Seg-Cos improves energy efficiency by 1.41X and achieves up to 2.2% higher accuracy over prior EX-TCAM-based methods in FSL. In ANNS, Seg-Cos enhances recall by 10% to 52% compared to previous Best-TCAM approaches.
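The segment-wise view that Seg-Cos builds on can be made concrete with exact arithmetic: cosine similarity decomposes into per-segment partial dot products and squared norms, which is the structure a per-segment magnitude encoding then approximates in TCAM. The sketch below shows only this exact decomposition; the circular range encoding and TCAM mapping themselves are not reproduced.

import numpy as np

def segmented_cosine(a, b, num_segments):
    """Exact cosine similarity assembled from per-segment partial results."""
    a_segs = np.array_split(a, num_segments)
    b_segs = np.array_split(b, num_segments)
    dot = sum(np.dot(sa, sb) for sa, sb in zip(a_segs, b_segs))
    na_sq = sum(np.dot(sa, sa) for sa in a_segs)
    nb_sq = sum(np.dot(sb, sb) for sb in b_segs)
    return dot / np.sqrt(na_sq * nb_sq)

rng = np.random.default_rng(0)
a, b = rng.normal(size=128), rng.normal(size=128)
# The segment-wise assembly matches the direct angular-domain computation.
assert np.isclose(segmented_cosine(a, b, 8),
                  np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))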
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionApproximate nearest neighbor search (ANNS) is crucial in many applications to find semantically similar matches for user queries. Especially with the development of large language models (LLMs), ANNS is becoming increasingly important in retrieval-augmented generation (RAG) technique. An in-depth analysis of ANNS reveals that its diverse operations, from extensive memory access to intensive sorting, are key performance bottlenecks, imposing significant strain on both the memory system and computing resources.
Based on these observations, we present SeIM, a hierarchical in-memory architecture to accelerate ANNS. SeIM is designed to accommodate the diverse operational characteristics of ANNS. Specifically, SeIM offloads highly parallel memory-bound operations to the memory bank level and introduces a unified execution model to reuse hardware units, requiring only lightweight modifications to the standard DRAM architecture. Additionally, SeIM places compute-bound sorting operations, which require cross-unit data access, at the memory controller level and employs an adaptive transmission filtering technique to reduce unnecessary data transfers and processing during sorting. Our evaluation shows that SeIM achieves 268×, 22×, and 5× higher throughput, 306×, 59×, and 4× lower latency, and 3081×, 287×, and 2× higher power efficiency than state-of-the-art CPU-, GPU-, and ASIC-based ANNS solutions.
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
DescriptionThermal management in 3D ICs faces challenges due to higher power densities. Traditional PDE-based simulation methods are accurate but too slow for iterative design. Machine learning methods like FNO offer alternatives but struggle with high-frequency information loss and reliance on high-fidelity training data. We propose SAU-FNO, integrating self-attention and U-Net with FNO to capture long-range dependencies and local features, enhancing high-frequency modeling. Transfer learning further reduces high-fidelity data needs and accelerates training. SAU-FNO achieves state-of-the-art thermal prediction and an 842× speedup over conventional FEM methods, offering an efficient solution for advanced 3D IC thermal management.
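For context on the FNO backbone mentioned above, a Fourier layer mixes channels only on a truncated set of low-frequency modes, which is exactly where high-frequency information can be lost. The NumPy sketch below shows this generic 1D spectral convolution (forward pass only, random untrained weights); it is not the SAU-FNO architecture itself.

import numpy as np

def spectral_conv_1d(x, weights, modes):
    """x: (channels, n) real signal; weights: (modes, channels, channels) complex."""
    x_ft = np.fft.rfft(x, axis=-1)
    out_ft = np.zeros_like(x_ft)
    # Channel mixing is applied only to the lowest `modes` frequencies; everything
    # above is truncated, which is the usual source of high-frequency loss in FNO.
    out_ft[:, :modes] = np.einsum("mio,im->om", weights, x_ft[:, :modes])
    return np.fft.irfft(out_ft, n=x.shape[-1], axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))                       # 4 channels, 64 grid points
w = rng.normal(size=(12, 4, 4)) + 1j * rng.normal(size=(12, 4, 4))
print(spectral_conv_1d(x, w, modes=12).shape)      # (4, 64)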
Networking
Work-in-Progress Poster


DescriptionAnalog circuit optimization remains challenging due to its high-dimensional design space and the prohibitive cost of SPICE simulations. To improve sample efficiency, we propose an RL framework that uses intrinsic rewards, enabling agents to explore novel circuit regions. Furthermore, by leveraging an autoencoder-based novelty estimator, our approach enhances exploration and accelerates convergence, outperforming conventional methods. Experimental results on practical circuits demonstrate significant performance improvements over baselines.
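A minimal stand-in for the novelty bonus described above uses a linear autoencoder (PCA) fit on previously visited design points: the reconstruction error of a new sizing vector serves as the intrinsic reward added to the simulated figure of merit. The RL agent, the SPICE loop, and the real (nonlinear) autoencoder are all omitted; the weight beta and dimensions are assumptions for illustration only.

import numpy as np

class LinearAutoencoderNovelty:
    """PCA-based stand-in for an autoencoder novelty estimator."""
    def __init__(self, latent_dim=2):
        self.latent_dim = latent_dim

    def fit(self, visited):                      # visited: (n_points, n_params)
        self.mean = visited.mean(axis=0)
        _, _, vt = np.linalg.svd(visited - self.mean, full_matrices=False)
        self.components = vt[:self.latent_dim]   # shared encoder/decoder weights

    def novelty(self, x):
        z = (x - self.mean) @ self.components.T          # encode
        recon = z @ self.components + self.mean          # decode
        return float(np.sum((x - recon) ** 2))           # reconstruction error

rng = np.random.default_rng(0)
visited = rng.normal(size=(200, 8))              # previously explored sizing vectors
estimator = LinearAutoencoderNovelty(latent_dim=2)
estimator.fit(visited)

beta = 0.1                                       # intrinsic-reward weight (assumed)
extrinsic = -1.3                                 # e.g. negated figure of merit from a simulation
candidate = rng.normal(size=8)
total_reward = extrinsic + beta * estimator.novelty(candidate)
print(round(total_reward, 3))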
Networking
Work-in-Progress Poster


DescriptionIn the integrated circuit industry, precise etching process control is critical for realizing ever-scaling new devices. Etching process development faces new challenges and demands efficient etching simulation to gain insight into the etching mechanisms. This paper introduces a cross convolutional neural network (CCNN) tailored for predicting etching profiles, which incorporates profile-oriented lateral convolution and temporal longitudinal convolution. By integrating an autoencoder as an auxiliary task, we conduct self-supervised pre-training of the lateral convolution layers using simulated data, resulting in comprehensive feature extraction and representation of the profile. Fine-tuning is carried out on the temporal longitudinal convolution layers and subsequent fully connected layers with experimental data, to achieve precise prediction of experimental data. Experimental results validate the effectiveness of the novel neural network architecture and the self-supervised learning framework, yielding a reduction in average prediction error on experimental data from 7.9028 nm to 6.3822 nm.
Engineering Poster
Networking


DescriptionSigmaAV is a unique solution that provides complete power grid noise coverage for 100% of the instances in a design. This novel simulation technique uses comprehensive aggressor knowledge and local noise coverage from SigmaDVD technology (verified in the past) and global noise coverage from Vectorless analysis (a widely accepted industry standard for time-based IR analysis). It is an efficient technique to compute the worst-case, but statistically relevant, voltage drop for every instance. This analysis bridges the gap in power grid noise coverage observed in traditional methods, such as local noise coverage in Vectorless and global noise impact in SigmaDVD. With SigmaAV's comprehensive local and global noise coverage, we performed voltage-annotated timing analysis to identify relevant timing paths. We quantified the hotspot coverage capabilities of SigmaAV through heatmap comparisons and its timing path coverage through scatterplots. In this presentation, we will discuss the theory of SigmaAV and the various trials conducted to compare SigmaAV with other available IR-drop methods.
Engineering Poster
Networking


DescriptionIn the realm of semiconductor design, ensuring power integrity and reliability in interposer design presents a unique set of challenges. The passive nature of interposers, combined with the absence of SOC data at early design stages, complicates power integrity and reliability analysis. Lacking SOC data, performing independent simulations for the interposer becomes crucial. We ensure the sign-off safety of the interposer-only design and improve sign-off efficiency through power grid robustness checks, layer drop analysis, electromigration (EM) assessment, and checks for ESD resistance and current density, all performed without SOC data.
If the interposer design is led by the packaging team, the delay in SOC data from the digital backend team significantly reduces the optimization efficiency of the interposer design. To address this, we propose a new simulation workflow based on creating probes using micro bumps and providing constant/PWL currents for static/dynamic simulation to check layer drop, PG grid robustness, and power EM. The foundry has ESD rules for the interposer and can separately conduct verification on the resistance and current density between bumps in an "interposer only" manner.
Engineering Poster
Networking


DescriptionThere has never been an effective way to know exactly what the timing margin of a signal is on silicon. Many engineers leverage Static Timing Analysis (STA) results to predict the timing margin, or use characterization data that only vaguely estimates silicon quality through process-detector Ring Oscillators (ROs). However, a Slack Monitor can show the timing margin on functional paths and how it degrades over time, because of aging and other impacts, throughout the silicon lifecycle. It is remarkable to be able to measure signal margin on silicon, something that previously could only be estimated. This means we can know whether there are problems, what differs from the design, and what needs to be improved. Through this presentation, we will demonstrate what the Slack Monitor solution is, the complete flow followed for its integration into an automotive design flow, and the challenges faced. We will also describe how it enhances reliability and safety through predictive maintenance in complex SoC designs, and how it is used to monitor the health and performance of silicon, along with analytics, at every stage of the silicon lifecycle.
Ancillary Meeting


DescriptionAn exclusive afternoon with silicon startup leaders, venture funds, incubators, and EDA ecosystem experts to discuss access to funding, tools, and infrastructure.
Panel Discussion Topic: Funding Strategies for Semiconductor Startups
Panel Discussion Abstract: What enables a silicon startup to get seed funding, and/or successfully deliver their prototype design and advance to next stage of funding? Hear from venture funds and incubators on what differentiates a startup from its peers, and from startup founders who have successfully steered their ship across the stormy waters of funding, operations, and technical execution to deliver next generation chips for AI, network, edge, and space applications.
Research Manuscript


EDA
EDA2: Design Verification and Validation
DescriptionRegister Transfer Level (RTL) simulation is a crucial tool in hardware design, widely used in design space exploration, verification, debugging, and preliminary performance evaluation. Among various RTL simulation approaches, software simulation is the most commonly used due to its flexibility, low cost, and ease of debugging. However, the slow speed of simulation has become the bottleneck in verification, due to the extensive overhead required to simulate complex designs.
In this work, we explore the core of RTL simulation and divide it into four stages. For each stage, we propose several techniques to improve performance. Finally, we implement these techniques in a novel RTL simulator, SIMAX. SIMAX successfully simulates XiangShan, the state-of-the-art open-source RISC-V processor. Compared to Verilator, SIMAX achieves a speedup of 7.34x for booting Linux on XiangShan and 19.94x for running CoreMark on RocketChip.
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionElectronic-photonic integrated circuits (EPICs) present a transformative solution for next-generation high-performance artificial intelligence (AI). The advancement of EPIC AI systems, however, requires extensive interdisciplinary research across devices, circuits, architecture, and design automation. The complexity of hybrid systems makes it challenging even for domain experts to understand distinct behaviors and interactions across design stacks. The lack of a flexible, accurate, fast, and easy-to-use EPIC AI system simulation framework significantly hinders researchers from exploring their hardware innovations at different sub-areas and evaluating the system impacts with common benchmarks. To address this gap, we propose SimPhony, a device-circuit-architecture cross-layer modeling and simulation framework for heterogeneous electronic-photonic AI systems. SimPhony offers a platform that enables (1) generic, extensible hardware topology representation that supports heterogeneous multi-core architectures with diverse photonic tensor core designs; (2) optics-specific dataflow modeling with unique multi-dimensional parallelism and reuse beyond spatial/temporal dimensions; (3) data-aware energy modeling with realistic device responses, layout-aware area estimation, link budget analysis, and bandwidth-adaptive memory modeling; and (4) seamless integration with model training framework for hardware/software co-simulation. By providing a unified, versatile, and high-fidelity simulation platform, SimPhony enables researchers to innovate and evaluate EPIC AI hardware across multiple domains, facilitating the next leap in emerging AI hardware.
Engineering Poster


DescriptionA novel simulation-aware MOS gate resistance modeling approach is developed in this work, which advances state-of-the-art RC extraction in the layout dependence of gate resistance and makes several major contributions. First, gate resistance topologies and corresponding solutions in both layout scenarios, contact on field poly and contact on gate poly, are explored. In addition, an efficient and accurate diamond-shape gate resistance network is proposed specifically for the layout style of contact on gate poly. Second, the difference in effective gate resistance between extraction and simulation conditions is illustrated, and the corresponding performance and layout optimization for gate resistance is shown. Third, from the perspective of circuit performance, the comparison of gate resistance between the native network provided by commercial RC extraction tools and the diamond-shape network is discussed.
Research Manuscript


EDA
EDA2: Design Verification and Validation
DescriptionCombinational equivalence checking (CEC) is a fundamental task in the realization of digital designs, and it is unlikely to have universally efficient algorithms due to its co-NP-completeness. Recent research on CEC has focused on SAT sweeping. This paper provides a new perspective other than SAT for tackling CEC, namely exhaustive simulation, and presents a simulation-based CEC engine constructed with fast GPU-parallel algorithms. The proposed engine can solve 4 of the 9 large cases in the experiments on its own, with up to 88.11x speed-up compared with the checker in ABC. Moreover, a combination of the proposed engine with the ABC checker achieves average accelerations of 4.89x and 4.88x over the standalone ABC checker and a commercial checker, respectively.
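To make the exhaustive-simulation idea concrete, the sketch below packs all input patterns of a tiny combinational block into NumPy boolean vectors and compares two implementations bit-parallel on the CPU. The paper's GPU-parallel engine and its combination with SAT-based checking go far beyond this toy; the two mux netlists here are invented examples.

import numpy as np

def exhaustive_equiv(f, g, n_inputs):
    """Compare two combinational functions on all 2^n_inputs patterns at once."""
    patterns = np.arange(1 << n_inputs, dtype=np.uint64)
    # One boolean vector per input pin, covering every input pattern in parallel.
    bits = [((patterns >> i) & 1).astype(bool) for i in range(n_inputs)]
    return np.array_equal(f(*bits), g(*bits))

# Two structurally different but equivalent descriptions of a 2:1 mux.
f = lambda a, b, s: (a & ~s) | (b & s)
g = lambda a, b, s: np.where(s, b, a)
print(exhaustive_equiv(f, g, 3))   # True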
Engineering Presentation


IP
DescriptionPre-Silicon side-channel analysis (SCA) helps to identify implementation issues in cryptographic algorithms early in the product life cycle and supports the shift-left of security verification for IPs on SoCs. The pre-Silicon simulation-based SCA process involves defining proper test benches and generating power simulation traces, followed by SCA using a pre-Silicon simulation-based SCA security tool. One of the IPs that we evaluated using our pre-Silicon SCA approach was an AES hardware implementation in Galois Counter Mode (GCM), which is quite prevalent in use-cases such as memory and link encryption and authentication. Different SCA attack models targeting the Hamming Weight (HW) and Hamming Distance (HD) of the first or last AES round operations were considered for a comprehensive evaluation. We used a modified open-source AES-GCM implementation in sequential and parallel mode as a target to prove the effectiveness of our SCA approach. We employed the simulated minimum number of traces to disclose (SMTD) the key and the TVLA t-score as the metrics for our SCA. For the AES-GCM test case that we used, the HD of the S-Box in the last AES round was found to be the most effective attack model. Our results show that an unprotected AES-GCM implementation is vulnerable to pre-Silicon SCA using our methodology, with full 16-key-byte disclosure for the last-round S-Box HD attack model and partial key byte disclosure for the last-round Add Round Key HW attack model.
Engineering Poster
Networking


DescriptionFunctional or glitch noise analysis determines whether coupling between switching "aggressor" wires and a steady-state "victim" wire could induce incorrect switching on the victim. This analysis is sensitive to the relative voltages of all victim and aggressor wires involved. Historically, most wires within the chip operate at a single common "worst-case" voltage level, making library characterization and simulation setup (i.e., voltage levels for victim and aggressor wires) relatively straightforward for a single-corner sign-off functional noise run. However, modern chips with multiple co-located voltage regions along with Dynamic Voltage/Frequency Scaling (DVFS) can complicate the analysis by having wires in different variable voltage domains (i.e., VDD1 and VDD2, each with independent min/max voltage ranges) next to each other. This creates multiple "worst-case" voltage conditions depending on the victim-aggressor voltage domains and greatly challenges the ability to sign off with a single-corner analysis. We present a method for carefully adjusting aggressor wire voltages (relative to the victim voltage level) to enable safe and complete functional noise analysis using a single-corner noise library in a single analysis run.
Engineering Special Session


Back-End Design
DescriptionIt's Monday morning, and you get pulled into an ad-hoc meeting to discuss your latest SoC's performance and power targets. The chip is missing performance targets by 15%, and now your team needs to raise the power budget to meet the spec. The entire process devolves into a flurry of finger-pointing without concrete evidence of what went wrong. When it comes to your $200M SoC project, is it better to guess or know what went wrong?
Most modern SoCs mitigate the guesswork by leveraging DFT (Design for Test) techniques, like adding more memory BIST or improving functional coverage. However, these tests were meant for verifying connectivity and basic functionality. What happens when you need the next level of observability and analytics to improve power, performance, yield, and reliability? These next-level analytics are driving the adoption of silicon lifecycle management (SLM) platforms.
For those unfamiliar, SLM platforms combine a variety of specialized on-die sensors with an analytics engine to improve power margins, manufacturing yield, silicon longevity, failure analysis, and enable predictive maintenance. The targeted analytics enable design optimizations at each stage of the design lifecycle, including pre-silicon through in-field operations.
As SoCs grow in size, complexity, and cost, expanding visibility becomes increasingly important. SLM is not yet broadly adopted in the industry, but just as DFT went from a concept to a norm, SLM is expected to follow the same path.
Our discussion will briefly explore the current state of silicon testing and its evolution from bench characterization and ATE to in-field testing. It will also delve into different silicon lifecycle solutions and how they fit in each design phase. Together, we will answer some of the following questions from an IP, analytics platform, and testing perspective:
• How is testing done today?
• What are the limitations, and how can they be overcome?
• If the new test capabilities include the ability to test in the field, what benefits does that bring/how can that capability be leveraged?
• What is required to enable this capability, and how does it affect system architecture?
• How does this impact test at the ATE, chiplet, SLT, and in-field stages?
• What is the adoption path for this technology?
Networking
Work-in-Progress Poster


DescriptionRendering is critical in fields like 3D modeling, AR/VR, and autonomous driving, where high-quality, real-time output is essential. Point-based neural rendering (PBNR) offers an efficient, photorealistic alternative to traditional methods, yet it is still challenging to achieve real-time PBNR on mobile platforms. We pinpoint the LOD search stage as a key bottleneck due to workload imbalance and irregular memory access. To address this, we propose SLTarch, an algorithm-architecture co-designed system including SLTree, a novel subtree-based data structure along with LTCore, a co-designed architecture. Compared to a mobile GPU, SLTarch achieves 2.2x speedup and 56% energy savings with negligible architecture overhead. With additional hardware support, SLTarch further boosts the performance to 3.4x with 99% energy savings.
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
DescriptionThis paper proposes smaRTLy: a new optimization technique for multiplexers in Register-Transfer Level (RTL) logic synthesis. Multiplexer trees are very common in RTL designs, and traditional tools like Yosys optimize them by traversing the tree and monitoring control port values. However, this method does not fully exploit the intrinsic logical relationships among signals or the potential for structural optimization. To address these limitations, we develop innovative strategies to remove redundant multiplexer trees and restructure the remaining ones, significantly reducing the overall gate count. We evaluate smaRTLy on the IWLS-2005 and RISC-V benchmarks, achieving an additional 8.95% reduction in AIG area compared to Yosys. We also evaluate smaRTLy on an industrial benchmark; the results show that smaRTLy removes 47.2% more AIG area than Yosys. These results demonstrate the effectiveness of our logic inferencing and structural rebuilding techniques in enhancing RTL optimization processes, leading to more efficient hardware designs.
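The sketch below illustrates one kind of multiplexer redundancy alluded to above, using a toy nested-tuple netlist and two folk rules (constant select, identical data inputs); it is not smaRTLy's inference or restructuring algorithm, just a minimal example of why redundant mux trees can be removed.

# Toy illustration of multiplexer-redundancy removal (not smaRTLy's algorithm):
# mux(1, a, b) -> a, mux(0, a, b) -> b, and mux(s, a, a) -> a.
def simplify_mux(node):
    if not (isinstance(node, tuple) and node and node[0] == "mux"):
        return node  # leaf: signal name or constant 0/1
    _, sel, a, b = node
    sel, a, b = simplify_mux(sel), simplify_mux(a), simplify_mux(b)
    if sel == 1:
        return a
    if sel == 0:
        return b
    if a == b:               # both data inputs identical: the mux is redundant
        return a
    return ("mux", sel, a, b)

tree = ("mux", "s2", ("mux", "s1", "x", "x"), ("mux", 0, "y", "x"))
print(simplify_mux(tree))   # both arms collapse to "x", so the whole tree folds to "x"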
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionEnergy-efficient image acquisition on the edge is crucial for enabling remote sensing applications where the sensor node has weak compute capabilities and must transmit data to a remote server/cloud for processing. To reduce the edge energy consumption, this paper proposes a sensor-algorithm co-designed system called SnapPix, which compresses raw pixels in the analog domain inside the sensor. We use coded exposure (CE) as the in-sensor compression strategy as it offers the flexibility to sample, i.e., selectively expose pixels, both spatially and temporally. SnapPix has three contributions. First, we propose a task-agnostic strategy to learn the sampling/exposure pattern based on the classic theory of efficient coding. Second, we co-design the downstream vision model with the exposure pattern to address the pixel-level non-uniformity unique to CE-compressed images. Finally, we propose lightweight augmentations to the image sensor hardware to support our in-sensor CE compression. Evaluating on action recognition and video reconstruction, SnapPix outperforms state-of-the-art video-based methods at the same speed while reducing the energy by up to 15.4×.
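To make the coded-exposure idea concrete, here is a toy software emulation in which a binary spatio-temporal mask selects which frames each pixel integrates, so a window of T frames is read out as a single coded frame; the shapes, the random mask, and the random "video" are placeholders, not SnapPix's learned pattern or sensor circuit.

# Toy emulation of coded-exposure (CE) compression: each pixel is exposed only
# in the frames selected by a binary spatio-temporal mask, and the exposed
# samples are summed into a single readout frame.
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 8, 64, 64                        # frames per exposure window, height, width
video = rng.random((T, H, W))              # stand-in for the incident light signal
mask = rng.integers(0, 2, size=(T, H, W))  # 0/1 exposure pattern (learned in the paper; random here)

coded_frame = (video * mask).sum(axis=0)   # one frame leaves the sensor instead of T
print(video.nbytes / coded_frame.nbytes)   # 8x fewer raw samples transmitted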
Engineering Poster


DescriptionHigh Liability Industries, such as Automotive, Medical, HPC, and Mil/Aero, face stringent reliability and safety requirements.
As these applications become increasingly reliant on advanced electronics for autonomous decision-making functions, the need for robust soft error analysis and mitigation pre-Tape Out becomes critical.
In this presentation, we will review an EDA tool suite that allows designers to tremendously increase design reliability, both at the cell and at the SoC level.
Networking
Work-in-Progress Poster


DescriptionRecent advancements in Silicon Photonics (SiPh)-based AI accelerators present promising solutions to address the energy bottlenecks of executing large models, like Transformers. However, existing SiPh solutions focus primarily on accelerating matrix multiplication (MM) in the photonic domain, while the Softmax Activation (SMA) function—an operation that accounts for 30-40% of total computation in Transformers—remains on conventional digital platforms. This reliance leads to significant energy and latency overheads due to frequent data conversions between photonic MM and digital SMA. Several electro-optic and all-optical methods have been developed to implement nonlinear activation functions (e.g. ReLU, Sigmoid, Tanh, and Softplus) using Optical Amplifiers alongside Mach-Zehnder Modulators (MZMs) or Microring Resonators (MRRs). However, similar approaches are unsuitable for SMA due to the excessive area and energy overhead introduced by optical amplifiers. Additionally, devices like MRRs and MZMs alone are insufficient for SMA's nonlinear operations (exponential and division functions), as they are bound by Maxwell's equations. Consequently, a photonics-compatible architecture for efficient implementation of SMA remains unachieved due to its intricate nature. To address this, we propose SOFTONIC—a first-of-its-kind photonic SMA architecture designed to enable all-photonic acceleration of Transformers for ultra-high energy efficiency and speedup. Our approach leverages range reduction techniques to adjust input domains and applies Chebyshev polynomial approximations for efficient computation. Simulations of SOFTONIC using industry-standard CAD tools and AI workloads demonstrate a 109x improvement in latency and an 80% reduction in power consumption compared to leading digital and analog Softmax hardware solutions, with minimal area overhead.
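A minimal numerical sketch of the "range reduction plus Chebyshev approximation" recipe mentioned above, done in software with invented interval and degree choices; it only illustrates the math, not the photonic implementation.

# Softmax via range reduction + Chebyshev approximation of exp() on a bounded
# interval (illustrative of the idea, not SOFTONIC's hardware design).
import numpy as np
from numpy.polynomial import Chebyshev

R = 10.0                                   # reduced input range: x - max(x) clipped to [-R, 0]
xs = np.linspace(-R, 0.0, 257)
cheb_exp = Chebyshev.fit(xs, np.exp(xs), deg=8, domain=[-R, 0.0])

def softmax_cheb(x):
    z = np.clip(x - x.max(), -R, 0.0)      # range reduction keeps the polynomial valid
    e = cheb_exp(z)                        # polynomial evaluation replaces exp()
    return e / e.sum()                     # division completes the nonlinearity

x = np.array([2.0, 1.0, 0.1, -3.0])
print(softmax_cheb(x))
print(np.exp(x - x.max()) / np.exp(x - x.max()).sum())   # reference softmax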
Engineering Poster
Networking


DescriptionCadence Memory Models use configuration files that describe the unique characterization attributes of every real memory part. The Memory Model performs timing checks and protocol responses appropriate to the characterization attributes described by each configuration. As the number of configuration files for a targeted Memory Model grows, it becomes a significant challenge for the EDA memory provider to accurately maintain and update these files for all varieties of real part configurations. Similarly, it becomes challenging for users of the memory models to keep their repository snapshot of these configurations up to date while simultaneously selecting from and covering all these configurations with each simulation to ensure compatibility. This is especially true as part offerings within a single protocol have grown into the tens of thousands, while the number of characterization attributes needed to describe a memory, and the dependencies between them, increases with each protocol generation. When verifying memory sub-systems, the configuration process becomes lengthy and repeatedly ties up resources. The existing configuration flow uses one part at a time per simulation. The new SVRAND flow provides the flexibility to represent all valid parts in a single SystemVerilog class that resolves evenly across the user's application scope of required configurations while simultaneously providing compatibility closure over that same scoped set of parts.
Networking
Work-in-Progress Poster


DescriptionQuantum computing has the potential to revolutionize computational problem-solving, including the solution of complex partial differential equations (PDEs). However, currently used quantum computing techniques suffer from low accuracy, limited scalability, and high execution times. In this work, we propose an innovative algorithm to solve PDEs by combining classical discretization techniques with classical-to-quantum (C2Q) encoding and unitary synthesis. Using a multidimensional Poisson equation as a case study, we validated our approach through experiments on noise-free and noisy simulators, hardware emulators, and quantum hardware from IBM. Our experiments demonstrated favorable results in terms of accuracy, scalability, and execution time compared to quantum variational solvers.
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionLarge language model (LLM)-based automatic speech recognition (ASR) has recently attracted a lot of attention due to its high recognition accuracy and enhanced multi-dialect support. However, the high decoding latency of LLMs challenges the real-time ASR requirements. Although speculative decoding has been explored for better decoding efficiency, they usually ignore the key characteristics of the ASR task and achieve limited speedup. To further reduce the real-time ASR latency, in this paper, we propose a novel speculative decoding framework specialized for ASR, dubbed SpecASR. SpecASR is developed based on our core observation that ASR decoding is audio-conditioned, which results in high output alignment between small and large ASR models, even given output mismatches in intermediate decoding steps. Therefore, SpecASR features an adaptive draft sequence generation process that dynamically modifies the draft sequence length to maximize the token acceptance length. SpecASR further proposes a draft sequence recycling strategy that reuses the previously generated draft sequence to reduce the draft ASR model latency. Moreover, a two-pass sparse token tree generation algorithm is also proposed to balance the latency of draft and target ASR models. With extensive experimental results, we demonstrate SpecASR achieves 3.04×–3.79× and 1.25×–1.84× speedup over the baseline autoregressive decoding and speculative decoding, respectively, without any loss in recognition accuracy.
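For readers unfamiliar with speculative decoding, the following is a generic greedy acceptance loop, with hypothetical draft_next and target_argmax callables standing in for the two ASR models; SpecASR's adaptive draft length, draft recycling, and sparse token-tree generation are not reproduced here.

# Generic greedy speculative-decoding loop (a simplified sketch, not SpecASR):
# a small draft model proposes k tokens, the large target model verifies them
# in one pass, and the longest matching prefix is accepted.
def speculative_decode(prefix, draft_next, target_argmax, k, max_len, eos):
    # draft_next(seq) -> next token from the draft model (hypothetical callable)
    # target_argmax(seq, n) -> the target model's greedy token at each of the
    #   n positions following seq (hypothetical batched-verification callable)
    out = list(prefix)
    while len(out) < max_len and (not out or out[-1] != eos):
        draft = []
        for _ in range(k):                       # a fixed k here; adaptive in the paper
            draft.append(draft_next(out + draft))
        verified = target_argmax(out, len(draft) + 1)
        accepted = 0
        while accepted < len(draft) and draft[accepted] == verified[accepted]:
            accepted += 1
        out += draft[:accepted] + [verified[accepted]]   # always gain at least one token
    return out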
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionThe rapid advancement of large language models (LLMs) has revolutionized code generation tasks across various programming languages. However, the unique characteristics of programming languages, particularly those like Verilog with specific syntax and lower representation in training datasets, pose significant challenges for conventional tokenization and decoding approaches. In this paper, we introduce a novel application of speculative decoding for Verilog code generation, showing that it can improve both inference speed and output quality, effectively achieving speed and quality all in one.
Unlike standard LLM tokenization schemes, which often fragment meaningful code structures, our approach aligns decoding stops with syntactically significant tokens, making it easier for models to learn the token distribution. This refinement addresses inherent tokenization issues and enhances the model's ability to capture Verilog's logical constructs more effectively. Our experimental results show that our method achieves up to a 5.05× speedup in Verilog code generation and increases pass@10 functional accuracy on RTLLM by up to 17.19% compared to conventional training strategies. These findings highlight speculative decoding as a promising approach to bridge the quality gap in code generation for specialized programming languages.
Networking
Work-in-Progress Poster


DescriptionGlobal placement is essential for high-quality and efficient circuit placement, particularly in complex modern VLSI designs. Recent advancements, such as electrostatics-based analytic placement, have improved scalability and solution quality. This work demonstrates that using the precorrected FFT technique for electric field computation significantly reduces runtime. Experimental results on standard benchmarks show a 2.73x speedup in FFT computation and a 29% total runtime improvement against a conventional-FFT-based approach, and a 1.0% reduction of scaled half-perimeter wirelength after detailed placement, paving the way for efficient global VLSI placement.
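As background for the FFT-accelerated field computation, a minimal periodic-FFT Poisson solve over a cell-density grid is sketched below; electrostatics-based placers typically use DCT/DST boundary handling and the precorrected-FFT technique goes further, so this is only a conceptual baseline with invented grid sizes.

# Conceptual sketch: solve laplacian(phi) = -rho on a periodic grid in the
# frequency domain, then take the field as the negative gradient of phi.
import numpy as np

def field_from_density(rho):
    n, m = rho.shape
    kx = 2 * np.pi * np.fft.fftfreq(n)[:, None]
    ky = 2 * np.pi * np.fft.fftfreq(m)[None, :]
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0                      # avoid division by zero for the DC term
    phi_hat = np.fft.fft2(rho) / k2
    phi_hat[0, 0] = 0.0                 # potential is defined only up to a constant
    phi = np.fft.ifft2(phi_hat).real
    ex, ey = np.gradient(-phi)          # the field pushes cells from dense to sparse bins
    return ex, ey

rho = np.zeros((64, 64)); rho[30:34, 30:34] = 1.0   # a small cluster of cells
ex, ey = field_from_density(rho)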
Research Manuscript


Systems
SYS5: Embedded Memory and Storage Systems
DescriptionDisaggregated memory (DM) architecture physically separates computing and memory resources into distinct pools interconnected via high-speed networks within data centers, with the aim of improving resource utilization compared to traditional architectures. Most existing range indexes for DM that support variable-length keys are based on adaptive radix trees. However, these indexes exhibit suboptimal performance on DM due to excessive network round trips during tree traversal and inefficient node-based caching mechanisms.
To address these issues, we propose Sphinx, a novel hybrid index for DM. Sphinx introduces an Inner Node Hash Table to minimize the network round trips during index operations by replacing the sequential tree traversal with parallel hash reads. Sphinx incorporates a Succinct Filter Cache to further minimize network overhead while keeping the computing-side cache small and coherent. Experimental results show that Sphinx outperforms state-of-the-art counterparts by up to 7.3× in the YCSB benchmark.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionProcessing in Memory (PIM) architectures enhance memory bandwidth by utilizing bank-level parallelism, typically implemented with a SIMD structure where all banks operate simultaneously under a single command.
However, this synchronous approach requires the activation of all banks before computation, leading to activation times that exceed computation times, limiting performance gain.
Recently, asynchronous execution PIM has been proposed as an alternative, allowing banks to operate asynchronously and overlap activation with processing to hide the row activation overhead. While effective at reducing row activation overhead, the independent operation requires large shared accumulators for each bank group, increasing area overhead.
To address the issues, we propose bank group (BG)-level split synchronization DRAM PIM, where each bank group operates asynchronously to hide row activation overhead while operating synchronously within the bank group to eliminate the need for shared accumulators. Evaluation results show that our proposed design achieves an average throughput improvement of 1.70x and 1.06x compared to conventional PIM and asynchronous execution PIM.
Furthermore, the area overhead per processing unit (PU) increases by only 1.5% compared to conventional PIM and is significantly lower than that of asynchronous execution PIM.
Engineering Poster
Networking


DescriptionState machines have become increasingly complex, making their manual implementation prone to misinterpretation. We present a novel approach to automate the translation of the architectural specification of state machines from spreadsheets into an intermediate file that integrates seamlessly with our SystemC performance model infrastructure. Our system utilizes a Python script to convert .xlsx files from Microsoft Excel into a JSON file, which is then fed into our C++/SystemC models. This method significantly reduces development time, minimizes the risk of human error, and enables early bug detection. By leveraging this automation, our team has realized notable benefits in performance and maintenance efficiency. Unlike traditional manual coding methods, this approach allows the architecture team to specify state machines in an intuitive and preferred tool like Excel while the performance-modeling team focuses on the C++ infrastructure, enabling earlier performance-model execution and timely identification of architectural issues.
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionDiffusion models have gained significant popularity in image generation tasks. However, generating high-quality content remains notably slow because it requires running model inference over many time steps. To accelerate these models, we propose to aggressively quantize both weights and activations, while simultaneously promoting significant activation sparsity. We further observe that the stated sparsity pattern varies among different channels and evolves across time steps. To support this quantization and sparsity scheme, we present a novel diffusion model accelerator featuring a heterogeneous mixed-precision dense-sparse architecture, channel-last address mapping, and a time-step-aware sparsity detector for efficient handling of the sparsity pattern. Our 4-bit quantization technique demonstrates superior generation quality compared to existing 4-bit methods. Our custom accelerator achieves 6.91x speed-up and 51.5% energy reduction compared to traditional dense accelerators.
Networking
Work-in-Progress Poster


DescriptionVoltage drop is always one of the most serious concerns in a design, as it may degrade performance or lead to unexpected functional failures. To mitigate the risk from voltage drop, the most common and intuitive method is to reduce the current demand of IR-hotspot instances by downsizing the driving strength or swapping to a slower device with a higher threshold voltage. However, reducing the current demand of IR-hotspot instances relies heavily on the positive timing slack remaining on the affected paths. In other words, if the timing slack is exhausted, such an IR-hotspot instance cannot be sized down and becomes a hard-to-solve IR violation. In this paper, the Extracting Timing Slack (ETS) methodology is proposed to squeeze out hidden timing slack for an IR-hotspot instance from its fan-in and fan-out cones. Experimental results show that ETS provides a 34% higher IR fixing rate compared to traditional methods.
Research Manuscript


EDA
EDA8: Design for Manufacturing and Reliability
DescriptionInverse lithography technology (ILT) is an advanced resolution enhancement technique that achieves mask optimization at the pixel level. However, application of ILT is hindered by time-intensive physical simulation. Herein, we propose an efficient ILT algorithm leveraging a deep learning model. A novel loss function is constructed to guide the model training in a self-supervised manner, eliminating the requirement of labelled data that might be nontrivial to acquire. The trained model outputs final mask patterns without further ILT optimization. Sub-resolution assist features (SRAFs) are generated automatically, the complexity of which can be adjusted during the training process to control mask manufacturability. The model was trained and validated on ICCAD-2013 CAD contest dataset. Better pattern fidelity and up to 12,000 times speedup are observed compared to other SOTA models. The trained model also shows good generalization ability to geometrically-different design patterns from another dataset, via a few-shot learning approach.
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionThe growth rate of the GPU memory capacity has not been able to keep up with that of the size of large language models (LLMs), hindering the model training process. In particular, activations—the intermediate tensors produced during forward propagation and reused in backward propagation—dominate the GPU memory use. This leads to high training overhead such as high weight update cost due to small micro-batch size. To address this challenge, we propose SSDTrain, an adaptive activation offloading framework to high-capacity NVMe SSDs. SSDTrain reduces GPU memory usage without impacting performance by fully overlapping data transfers with computation. SSDTrain is compatible with popular deep learning frameworks like PyTorch, Megatron, and DeepSpeed, and it employs techniques such as tensor deduplication and forwarding to further enhance efficiency. We extensively experimented with popular LLMs like GPT, BERT, and T5. Results demonstrate that SSDTrain effectively reduces 47% of the activation peak memory usage. At the same time, SSDTrain perfectly overlaps the I/O with the computation and incurs negligible performance overhead. Compared with keeping activations in GPU memory and layerwise full recomputation, SSDTrain achieves the best memory savings with negligible throughput loss. We further analyze how the reduced activation memory use may be leveraged to increase throughput by increasing micro-batch size and reducing pipeline parallelism bubbles.
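The core idea of hiding offload latency behind compute can be illustrated with a tiny host-side sketch that writes an activation to disk on a background thread and reloads it later; the file-based I/O, thread pool, and tensor here are stand-ins for illustration, not SSDTrain's NVMe path or its framework integration.

# Illustrative sketch of overlapping activation offload/reload with compute
# using a background thread (the idea behind SSDTrain, not its implementation).
import numpy as np, tempfile, os
from concurrent.futures import ThreadPoolExecutor

io = ThreadPoolExecutor(max_workers=1)
spill_dir = tempfile.mkdtemp()

def offload(name, tensor):
    path = os.path.join(spill_dir, f"{name}.npy")
    return io.submit(np.save, path, tensor), path    # the write proceeds while compute continues

def reload(future, path):
    future.result()                                  # make sure the write has finished
    return np.load(path)                             # in practice: prefetch before backward

act = np.random.rand(1024, 1024).astype(np.float32)  # stand-in for a forward activation
fut, path = offload("layer0", act)
# ... the forward pass of later layers would run here, overlapped with the write ...
restored = reload(fut, path)
assert np.array_equal(act, restored)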
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionA significant number of users depend on Large Language Models (LLMs) for downstream tasks, but training LLMs from scratch remains prohibitively expensive. Sparse fine-tuning (SFT) has emerged as an effective strategy to reduce both the time and memory requirements of fine-tuning LLMs, achieving accuracy on par with fully fine-tuned models. Although SFT has the potential to achieve superior performance by minimizing computational requirements, SFT on GPUs often underperforms compared to dense algorithms like LoRA due to sparse data accesses that modern GPUs cannot handle efficiently. To address these issues, we propose Structured Sparse Fine-Tuning (SSFT). It comprises a novel algorithm, SSFT-Alg, which introduces predictable sparsity patterns to reduce memory access overhead and enhance regularity in the SFT process. To support SSFT-Alg, an accelerator, SSFT-Hw, is proposed to optimize SSFT-Alg through an innovative sparsity-aware design, avoiding the overhead of sparsity operations on GPUs and optimizing latency and energy efficiency. Experiments with relevant models and benchmarks demonstrate that SSFT achieves accuracy comparable to state-of-the-art models on BERT, LLaMA 2 7B, and LLaMA 2 13B. Moreover, SSFT-Hw outperforms both GPUs and state-of-the-art sparsity-aware transformer accelerators in throughput by 51.0× and 1.32×, respectively, while improving energy efficiency by 19.0× and 1.48×.
Research Manuscript


Systems
SYS3: Embedded Software
DescriptionSparse Matrix-Vector Multiplication (SpMV) is an essential sparse operation in scientific computing and artificial intelligence.
Efficiently adapting SpMV algorithms to diverse matrices and architectures requires a framework capable of accurately recognizing sparse patterns and selecting the optimal implementation.
In this work, we introduce Sparsity-aware SpMV (SSpMV), a framework that integrates expert-designed features with multimodal representations to adaptively predict the best-performing algorithm and parameters.
For this purpose, we design a multimodal neural network, MM-Adapter, that captures diverse modalities to represent the computational features of SpMV.
Experimental results demonstrate that MM-Adapter achieves the highest accuracy of 81.05%, outperforming existing SpMV prediction models.
Furthermore, SSpMV consistently delivers substantial performance improvements over state-of-the-art sparse libraries across various multi-core platforms.
Engineering Poster
Networking


DescriptionWith the increase in design complexity and technology scaling, timing closure of SoCs is becoming extremely critical to achieving the timing quality required for a workable silicon. A significant amount of effort is spent analysing timing runs across multiple scenarios, known as corners, and coming up with timing fixes to close setup and hold paths. This requires a lot of manual analysis and effort. This work presents a dashboard that captures the timing of designs across multiple corners and enables quick analysis. A significant improvement in timing analysis and closure was achieved. The app is developed using SQL, React, and Node.js, and is equipped to handle a high number of requests per second.
Exhibitor Forum


DescriptionEngineering teams are increasingly prioritizing early functional verification, achieving targeted verification and sign-off across more domains during RTL design—well before simulation. This early sign-off approach dramatically reduces downstream engineering changes and iterations.
Successfully deploying sign-off during RTL design requires both tool speed and the scalability to handle IPs and SoCs, along with complete coverage that detects all targeted errors.
Because static sign-off leverages abstract checking methods rather than the Boolean analysis used by simulation and formal verification, it delivers 10–100X faster runtimes, multi-billion-gate capacity, and a more efficient setup process. Additionally, its support for user-defined rules enables in-depth analysis for emerging applications where design requirements continue to evolve.
Multiple experts will share production-proven methodology advances and best practices across key static sign-off applications, including: 1) RTL linting, 2) clock domain crossing, 3) reset domain crossing, 4) design-for-testability, 5) connectivity and glitch detection, and 6) hardware security sign-off.
Attendees will gain a deeper understanding of static sign-off methodologies, along with practical insights tailored to specific applications.
Engineering Presentation


AI
IP
DescriptionContinuous Time Delta-Sigma Modulators are integral components in various RF and audio applications. They must achieve high linearity while maintaining efficiency in area usage and power consumption. Multi-bit quantization with dynamic element matching (DEM) techniques is generally employed to achieve linearity while limiting power consumption.
With the ever-increasing demand for linearity and stringent area requirements, it is desirable to use minimally sized DAC elements to reduce area and maximize the benefits of the DEM technique.
Traditional brute-force Monte Carlo methods for high sigma analysis are inefficient for obtaining valuable tail information of the Gaussian distribution, as they involve numerous simulations around the mean.
To address these challenges, a tool is needed that can accurately estimate yield and detect the worst-case tail samples with fewer simulations.
In this paper, we propose an ML-enabled statistical analysis to estimate the worst-case tail samples which:
1. Successfully executed the seemingly impossible task of capturing the exact worst tail sample that would have been captured by standard brute-force Monte Carlo (BFMC).
2. Achieved the target linearity spec (SFDR = 90dB) with a 4X reduction in DAC element area.
3. Achieved a 9X reduction in the number of samples required to capture the worst tail sample compared to BFMC.
Research Manuscript


Systems
SYS3: Embedded Software
DescriptionThe median (MED) is a crucial statistic for measuring the central tendency. However, exact MED computation remains costly, with even state-of-the-art (SOTA) algorithms failing to meet (near) real-time processing demands. While approximate MED algorithm has arisen as a promising candidate, existing approaches ignore the potential opportunity of spatiotemporal similarity within the application and fail to provide application-specific trade-offs between execution time and accuracy. Our goal is to design an enhanced approximate MED algorithm STREAM, which is capable of exploiting the spatiotemporal similarity to achieve bucket reuse and establish a tunable-grained bucket mechanism to meet the accuracy of application-specific requirements. Experimental results show that while maintaining nearly identical accuracy, STREAM outperforms the SOTA approximate methods DDSketch (up to 10x, 4.7x on average) and KLL (up to 71.2x, 10.1x on average).
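For context on the bucket mechanism this abstract builds on, below is a tiny log-bucket median sketch in the spirit of DDSketch; the relative-error parameter and midpoint estimate are standard choices, and STREAM's spatiotemporal bucket reuse and tunable granularity are not modelled here.

# Tiny log-bucket approximate-median sketch (DDSketch-flavoured illustration).
import math
from collections import Counter

class BucketMedian:
    def __init__(self, rel_err=0.01):
        self.gamma = (1 + rel_err) / (1 - rel_err)
        self.buckets = Counter()
        self.n = 0
    def add(self, x):                       # positive values only, for brevity
        self.buckets[math.ceil(math.log(x, self.gamma))] += 1
        self.n += 1
    def median(self):
        rank, seen = self.n // 2, 0
        for key in sorted(self.buckets):
            seen += self.buckets[key]
            if seen > rank:
                return 2 * self.gamma**key / (self.gamma + 1)   # bucket midpoint estimate

sk = BucketMedian()
for v in range(1, 10001):
    sk.add(v)
print(sk.median())   # close to the exact median 5000, within ~1% relative error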
Research Manuscript


Systems
SYS5: Embedded Memory and Storage Systems
DescriptionWrite amplification (WA) from migrating valid pages during garbage collection (GC) degrades SSD performance and lifespan. Although stream management based on high-level software semantics reduces WA, existing solutions require host modifications, hindering their adoption. We introduce StreamCSD, an SSD-autonomous stream management approach using in-storage content learning, eliminating host-side changes. Leveraging compression ratios from embedded compressors in computational storage drives (CSDs), StreamCSD employs a streaming K-means algorithm to cost-efficiently cluster data into streams. Evaluations show that StreamCSD reduces WA from 1.7 to 1.06 under multimodal generative AI workloads, matching state-of-the-art methods with minimal impact on bandwidth. StreamCSD operates without host modifications, promoting broader adoption of multi-stream SSDs.
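A minimal sketch of stream assignment via 1-D streaming k-means over compression ratios follows; the initial centroids and incoming ratios are invented, and the real system runs this clustering inside the CSD rather than on the host.

# Sketch of SSD stream assignment via 1-D streaming k-means over compression
# ratios (illustrates the clustering idea; StreamCSD's in-CSD design differs).
class StreamKMeans:
    def __init__(self, centroids):
        self.centroids = list(centroids)    # one centroid per SSD stream
        self.counts = [1] * len(centroids)
    def assign(self, ratio):
        k = min(range(len(self.centroids)), key=lambda i: abs(ratio - self.centroids[i]))
        self.counts[k] += 1
        # incremental mean update keeps the centroid tracking its stream's data
        self.centroids[k] += (ratio - self.centroids[k]) / self.counts[k]
        return k                            # stream ID for this write

km = StreamKMeans([0.2, 0.5, 0.9])          # illustrative initial guesses
for r in [0.18, 0.95, 0.52, 0.21, 0.88]:    # compression ratios from the embedded compressor
    print(km.assign(r), end=" ")            # prints: 0 2 1 0 2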
Research Manuscript


AI
AI3: AI/ML Architecture Design
Description3D Gaussian splatting (3DGS) has gained popularity for its efficiency and sparse Gaussian-based representation. However, 3DGS struggles to meet the real-time requirement of 90 frames per second (FPS) on resource-constrained mobile devices, achieving only 2 to 9 FPS. Existing accelerators focus on compute efficiency but overlook memory efficiency, leading to redundant DRAM traffic. We introduce StreamingGS, a fully streaming 3DGS algorithm-architecture co-design that achieves fine-grained pipelining and reduces DRAM traffic by transforming from a tile-centric rendering to a memory-centric rendering. Results show that our design achieves up to 45.7× speedup and 62.9× energy savings over mobile Ampere GPUs.
Engineering Poster
Networking


DescriptionThis paper addresses the challenges of managing clock specifications, RTL implementation, and timing constraints in the development of subchip/block-based SoC designs.
Traditionally, discrepancies between clock specifications, RTL, and timing constraints are identified manually post-synthesis or even late in the development phase, leading to inefficient feedback loops.
In a parallel development environment, subchip physical design (PD) owners manually generate clock architecture diagrams and timing constraints based on the top-level clock spec and RTL.
This paper presents a Python-based framework that automates the generation of subchip-level clock architecture diagrams and clock timing constraints through RTL tracing.
By enabling early detection of discrepancies, this solution ensures better quality control across specifications, RTL, and timing constraints, and accelerates subchip PD execution by providing clock-related constraints even before subchip execution begins.
Engineering Presentation


Front-End Design
Chiplet
DescriptionAs multi-die SoCs become increasingly complex, ensuring reliable and efficient verification of PMU-driven low-power features is a growing challenge. This work introduces a formalized and automated verification framework that scales seamlessly from single-die to multi-die environments, addressing issues such as missed test cases, manual inefficiencies, and growing design complexity. The framework standardizes low-power specifications into structured metadata, enabling the automated generation of checkers, coverage models, and simulation scripts. By integrating cross-checking mechanisms and automated coverage analysis, it improves test accuracy and identifies previously undetected bugs.
The framework's scalability was demonstrated in a quad-die SoC project, where single-die results of 788 checkers and 270 coverage models expanded to 3,152 checkers and 1,080 coverage models for the multi-die configuration. Development time was reduced by up to 90% for key stages, with multi-die turn-around time cut from 39 weeks to 10 weeks. Additionally, auto-generated checkers detected subtle issues missed by manual methods, while enhanced test coverage identified gaps, enabling further bug detection. This scalable, efficient, and consistent approach addresses the growing demands of multi-die verification, accelerating timelines while improving reliability and product quality.
Research Manuscript


Systems
SYS5: Embedded Memory and Storage Systems
DescriptionNAND flash-based SSDs have emerged as a critical storage solution due to their exceptional performance and cost-effectiveness. However, the sequential write limitation of NAND flash blocks necessitates garbage collection (GC) to reclaim space occupied by stale data. Nevertheless, the extensive data migration involved in GC significantly impacts the performance and Quality of Service (QoS) of SSDs. To mitigate this issue, copyback has been proposed as a means to accelerate GC by eliminating off-chip data movements. Specifically, copyback reads data into on-plane latches and immediately re-writes it into another page on the same plane.
However, in the case of modern high-performance SSDs, copyback is rarely utilized due to the following challenges: (1) Copyback operates at the page-level and thus fails to effectively reclaim invalid data within the context of subpage-level mapping; (2) Copyback eliminates off-chip data movements, preventing data pages from being read out for Redundant Array of Independent NAND (RAIN) parity computation, thereby compromising SSD reliability. In this study, we introduce SuperCopyback as a solution that efficiently addresses these issues for modern SSDs. Firstly, we propose a Multiple-Read-One-Write (MROW) copyback approach through lightweight latch circuit modifications to enable subpage-level copyback implementation; additionally, we propose an orchestrated GC method to effectively utilize MROW copyback. Furthermore, we present a novel copyback-based RAIN scheme that conceals data pages readout latency in write operations and relocates parities to support efficient copyback. The experimental results on both synthetic and real traces demonstrate that SuperCopyback achieves a performance comparable to an ideal scenario where data movement of 4KB takes only 1ns.
Engineering Presentation


AI
Systems and Software
Chiplet
DescriptionTransient temperature response approximately follows an exponential function of power density. As transistors keep shrinking with Moore's Law, turbo workloads in client processors experience increased localized power density, resulting in faster transient temperature ramp rates and large, workload-dependent on-die thermal gradients. Each workload exhibits unique thermal response characteristics, and the variety and volume of workloads are vast and continuously growing. Accurate measurement of actual hot-spot temperatures is crucial for effective processor dynamic power and thermal control, as well as for ensuring thermal reliability. To address this challenge, we propose SuperCoverage, a novel approach that significantly enhances workload coverage for power and thermal analysis by a factor of 100x while drastically accelerating the generation of power and thermal maps by a factor of 1000x.
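The "exponential transient" statement corresponds to the textbook first-order RC thermal model sketched below; the thermal resistance, capacitance, and power values are illustrative placeholders, not calibrated silicon data or SuperCoverage's model.

# First-order RC thermal model: T(t) = T_amb + R_th * P * (1 - exp(-t / (R_th * C_th)))
import numpy as np

def hotspot_temp(t, power, t_amb=45.0, r_th=0.3, c_th=0.02):
    tau = r_th * c_th                     # thermal time constant [s]
    return t_amb + r_th * power * (1.0 - np.exp(-t / tau))

t = np.linspace(0, 0.05, 6)               # 0-50 ms after a turbo power step
print(hotspot_temp(t, power=150.0))       # ramps toward a T_amb + 45 degC steady state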
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionOnce-for-all NAS trains a supernet once and extracts specialized subnets from it for efficient multi-target deployment. However, the training cost is high, with the SOTA ElasticViT and NASViT taking over 72 and 83 GPU days, respectively. While prior approaches accelerated training using supernet warm-up, we argue that this is suboptimal because knowledge is more easily scaled upward than downward. Hence, we propose SuperFast, a simple workflow that (I.) pretrains a subnet of the supernet and (II.) distributes its knowledge within the supernet before training. Using SuperFast on the ElasticViT and NASViT supernets achieves the baselines' accuracy 1.4x and 1.8x faster on ImageNet.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionProcessing-in-Memory (PIM) architecture presents a promising solution to alleviate the data movement bottleneck that arises from transferring data between memory and compute units in traditional processor-centric systems, particularly for DNN applications. However, this architecture introduces two inherent overheads: PIM code offloading and data transfer between the CPU and memory. To address these issues, we propose two register-based addressing modes, indexed and base-offset addressing, for DMA descriptor-based in-DRAM PIM ISAs. Our full-system performance evaluation demonstrates that the approach significantly reduces these overheads, resulting in up to 1.94x speedup compared to the baseline PIM, with only 4.65% additional area and 8.61% additional power consumption.
Research Manuscript
Swift or Exact? Boosting Efficient Microarchitecture DSE via Multi-fidelity Partial Order Prediction
4:30pm - 4:45pm PDT Wednesday, June 25 3002, Level 3

Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionA significant challenge in microarchitecture design space exploration (DSE) lies in the time-intensive synthesis and simulation process, making rapid design exploration infeasible. While the simulation tools offer reports on performance, power, and area (PPA) in the different stages, the PPA reports at early stages may fail to reflect the true relative qualities for various designs, i.e., with low fidelities. To address these limitations, we propose a novel multi-fidelity optimization algorithm tailored for multi-stage optimization problems. The proposed method employs a non-linear Gaussian process model to effectively fuse data from different stages with different fidelities, minimizing the need for expensive high-fidelity data while maximizing accuracy. Furthermore, a logical regression function and a multi-objective partial order relation are introduced to evaluate the reliability of low-fidelity data, mitigating their potential inaccuracies. Experiments demonstrate that our proposed multi-fidelity optimization algorithm can approximate the Pareto front of the direct design space in a shorter time with better performance.
Networking
Work-in-Progress Poster


DescriptionSoftmax is a critical yet memory-intensive operation in the Self-Attention mechanism of Transformers. Previous methods fixed Softmax parameters to minimize computation but required retraining the entire model. We propose SwiftMax, a learnable Softmax alternative that reduces training time by employing layer-wise replacement and fine-tuning of pre-trained models. SwiftMax reduces training time by up to 2,250× compared to retraining, while maintaining up to 97% of the original model's accuracy in most NLP tasks. Our approach achieves up to 23× performance improvement during inference on the AMD ACAP platform, alleviating the Softmax bottleneck in Self-Attention and enabling efficient hardware deployment without extensive retraining.
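A rough sketch of layer-wise replacement plus targeted fine-tuning is shown below; the learnable form used here (a per-layer temperature on a standard softmax) is only a placeholder for illustration and is not SwiftMax's actual approximation.

# Sketch: replace Softmax modules in a pre-trained model, then fine-tune only
# the new modules (layer-wise replacement; placeholder learnable function).
import torch
import torch.nn as nn

class LearnableSoftmax(nn.Module):
    def __init__(self, dim=-1):
        super().__init__()
        self.dim = dim
        self.temperature = nn.Parameter(torch.ones(1))    # the learnable part (placeholder)
    def forward(self, x):
        return torch.softmax(x / self.temperature, dim=self.dim)

def replace_softmax(model):
    for name, child in model.named_children():
        if isinstance(child, nn.Softmax):
            setattr(model, name, LearnableSoftmax(dim=child.dim))
        else:
            replace_softmax(child)

model = nn.Sequential(nn.Linear(16, 8), nn.Softmax(dim=-1))   # stand-in for a pre-trained net
replace_softmax(model)
for p in model.parameters():
    p.requires_grad = False
for m in model.modules():
    if isinstance(m, LearnableSoftmax):
        m.temperature.requires_grad = True                    # fine-tune only the replacements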
Networking
Work-in-Progress Poster


DescriptionIn current chip design processes, multiple tools are often used to obtain a gate-level netlist, resulting in the loss of source code correlation. This paper proposes SynAlign, a tool that addresses the challenges of manual netlist-to-source-code tracing by automating the alignment process. SynAlign simplifies the iterative design process, reduces overhead, and maintains correlation across multiple tools, ultimately enhancing the efficiency and effectiveness of chip design workflows. SynAlign can tolerate up to 61% design net changes without impacting alignment accuracy.
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
DescriptionIn recent years, AI-assisted IC design methods have demonstrated great potential, but the availability of circuit design data is extremely limited, especially in the public domain. The lack of circuit data has become the primary bottleneck in developing AI-assisted IC design methods. In this work, we make the first attempt, SynCircuit, to generate new synthetic circuits with valid functionalities in the HDL format.
SynCircuit automatically generates synthetic data using a framework with three innovative steps: 1) We propose a customized diffusion-based generative model to resolve the Directed Cyclic Graph (DCG) generation task, which has not been well explored in the AI community. 2) To ensure our circuit is valid, we enforce the circuit constraints by refining the initial graph generation outputs.
3) The Monte Carlo tree search (MCTS) method further optimizes the logic redundancy in the generated graph. Experimental results demonstrate that our proposed SynCircuit can generate more realistic synthetic circuits and enhance ML model performance in downstream circuit design tasks.
Research Manuscript


EDA
EDA1: Design Methodologies for System-on-Chip and 3D/2.5D System-in Package
DescriptionAs modern field-programmable gate arrays (FPGAs) continue to grow in complexity, systems featuring multi-die devices connected through time-division multiplexing (TDM) techniques have become increasingly common for implementing large-scale designs. FPGA designs are meticulously partitioned at the die-level for prototyping in modern emulation systems. A die-level router for multi-FPGA systems aims to find routing paths between dies according to the partitioning results. Conventional FPGA-level routers often result in a large critical connection delay that impacts the whole design's frequency. Additionally, the excessive use of super long lines (SLLs) between neighboring dies leads to substantial routing congestion, causing the routing process to fail. To tackle these issues, we propose an effective and efficient die-level router for multi-FPGA systems, optimizing routing topology and the TDM ratio. Experimental results on the benchmarks from the die-level routing contest 2023 demonstrate 7.6% better critical connection delay with a 5.761x speed-up compared to the state-of-the-art router.
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionVision Transformers (ViTs) have demonstrated remarkable performance in computer vision tasks by effectively extracting global features. However, their self-attention mechanism suffers from quadratic time and memory complexity as image resolution or video duration increases, leading to inefficiency on GPUs. To accelerate ViTs, existing works mainly focus on pruning tokens based on value-level sparsity. However, they miss the chance to achieve peak performance as they overlook the bit-level sparsity. Instead, we propose the Inter-token Bit-sparsity Awareness (IBA) algorithm to accelerate ViTs by exploring bit-sparsity from similar tokens. Next, we implement IBA on GPUs that synergize CUDA Cores and Tensor Cores by addressing two issues: firstly, the bandwidth congestion of the Register File hinders the parallel ability of CUDA Cores and Tensor Cores. Secondly, due to the varying exponent of floating-point vectors, it is hard to accelerate bit-sparse matrix multiplication and accumulation (MMA) in Tensor Core through fixed-point-based bit-level circuits. Therefore, we present SynGPU, an algorithm-hardware co-design framework, to accelerate ViTs. SynGPU enhances data reuse by a novel data mapping to enable full parallelism of CUDA Cores and Tensor Core. Moreover, it introduces Bit-Serial Tensor Core (BSTC) that supports fixed- and floating-point MMA by combining the fixed-point Bit-Serial Dot Product (BSDP) and exponent alignment techniques. Extensive experiments show that SynGPU achieves an average of 2.15×~3.95× speedup and 2.49×~3.81× compute density over the A100 GPU.
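A toy numerical view of the inter-token similarity idea: if a token x is close to a reference token r, then W @ x = W @ r + W @ (x - r), and the residual carries far fewer significant bits for a bit-serial datapath to process. The values and integer quantization below are invented for illustration and do not reflect SynGPU's actual Bit-Serial Tensor Core.

# Toy demonstration: reusing a similar token's result leaves a low-magnitude
# residual, i.e., high bit-level sparsity for bit-serial processing.
import numpy as np

def significant_bits(v):
    # Total number of significant magnitude bits across a vector.
    return int(sum(int(abs(int(t))).bit_length() for t in v))

rng = np.random.default_rng(0)
W = rng.integers(-8, 8, size=(4, 6))      # quantized weights
r = rng.integers(0, 128, size=6)          # reference token (int8-like)
x = r + rng.integers(-2, 3, size=6)       # similar neighboring token

residual = x - r
assert np.array_equal(W @ x, W @ r + W @ residual)  # exact reuse identity
print("bits in x:", significant_bits(x), "| bits in residual:", significant_bits(residual))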
Networking
Work-in-Progress Poster


DES5: Emerging Device and Interconnect Technologies
DescriptionMemristor, a passive fundamental circuit component, is a promising candidate for implementing logic circuits. Memristors have extremely low areas and are compatible with MOS devices. Recently, exhaustive enumeration was used to identify extremely low-area memristor-transistor logic cells. In this research, we propose the first constructive method for synthesizing such cells with arbitrary numbers of inputs and devices. We also propose methods for cascading these cells to further improve area efficiency and to carry out device-cell co-optimization to improve performance. We use these methods to create a comprehensive library of memristor-transistor logic cells and use this library to synthesize benchmark circuits using ABC. Logic synthesis results show that our cells dramatically reduce the area for large logic blocks --- around 53% over CMOS cells and 25% over MRL (a logic style that uses memristor-only cell+CMOS inverters) and provide moderate power consumption.
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionAs technology nodes continue to shrink, Complementary FET (CFET) structures, which stack PMOS and NMOS together, have emerged as a promising candidate for next-generation technology. Due to the reduction in routing tracks, the insertion of dummy polys to increase routing resources and the use of M2 during the synthesis of CFET standard cells have become more inevitable. These two factors make block-level routing significantly more challenging. To address this, we introduce the methods to utilize the backside (BS) routing resources at the CFET standard cell synthesis stage and efficiently handle CFET transistor folding. To the best of our knowledge, this is the first work to consider BS routing resources at the cell synthesis stage and efficiently address transistor folding in the CFET stacked structure. In the transistor folding and placement stages, we utilize Euler paths to estimate the lower bound of contacted poly pitch (CPP) and apply dynamic programming (DP) to calculate the frontside minimum required tracks (FMRT) of BS-resource-aware placements. The subsequent satisfiability modulo theories (SMT) approach determines which tracks the devices occupy and completes cell routing. Experimental results show that compared to previous work [11], we achieve reductions of 1%, 45%, and 19% in #CPP, #M2 tracks, and runtime, respectively.
Networking
Work-in-Progress Poster


DescriptionAutomating syntax correction in SystemVerilog assertions (SVAs) is essential to streamline hardware verification workflows, reducing the need for laborious, time-consuming manual error correction. However, using proprietary large language models (LLMs) for this task poses challenges due to high costs, privacy concerns, limited access, and slow inference times. Additionally, a substantial performance gap exists between large proprietary models and smaller open-source alternatives in correcting syntax errors in SVAs. To address these challenges, we propose a knowledge distillation (KD)-based approach to transfer the syntax correction capabilities of a larger model to a smaller, open-source model. Our fine-tuned model achieves success rates of 97.77% and 95.69% across two benchmarks, proving it to be a precise, fast, cost-effective, and secure alternative to large proprietary models.
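For readers unfamiliar with the distillation step, the sketch below shows a generic knowledge-distillation loss (soft teacher targets plus hard cross-entropy on the corrected SVA tokens); the temperature, weighting, and model choices are assumptions rather than the paper's exact recipe.

# Generic knowledge-distillation loss: blend the teacher's softened token
# distribution with the usual cross-entropy on the ground-truth correction.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target_ids,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soft targets: match the teacher's distribution at temperature T.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: cross-entropy against the corrected assertion tokens.
    hard = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                           target_ids.view(-1))
    return alpha * soft + (1 - alpha) * hard

# Usage (hypothetical models): loss = distillation_loss(student_out, teacher_out.detach(), y)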
Engineering Presentation


Front-End Design
DescriptionFormal verification of designs is very challenging, as the properties verifying design functionality do not guarantee consistency, correctness, or completeness, because properties are sometimes written incorrectly. Moreover, proof convergence for all the properties is difficult to achieve using available state-of-the-art formal techniques. As designs get more complex, proof convergence becomes more challenging, calling verification quality into question. Our novel technique handles both cases: it converges the properties and qualifies them for the given design. Evidence suggests that the proposed technique is very effective in verifying designs with increased levels of complexity.
Exhibitor Forum


DescriptionModern chip design verification generates massive waveform dumps—often terabytes in size—making it nearly impossible for engineers to manually debug and trace root causes efficiently. Traditional waveform viewers and rule-based scripts fall short when faced with the scale, complexity, and subtlety of today's RTL designs. In this Exhibitor Forum session, we introduce ChipAgents, an agentic AI system purpose-built to tackle waveform analysis at scale. By combining structured reasoning, semantic search, and interactive agents, ChipAgents empowers verification engineers to ask natural-language questions, trace failure propagation across time and modules, and receive contextualized explanations grounded in the RTL and waveform data.
Our approach moves beyond static signal inspection. Agents dynamically explore the design hierarchy, generate hypotheses, and interpret causality—turning waveform dumps into actionable insights. Whether you're facing race conditions, signal glitches, or protocol violations, our system helps you find the needle in the haystack—faster and more reliably than ever before.
We'll showcase real-world case studies where ChipAgents successfully identified root causes in minutes—debugs that previously took days. The demo includes multi-agent workflows that collaborate across testbench logs, waveform traces, and RTL to deliver end-to-end failure analysis.
This talk will appeal to EDA tool developers, DV engineers, and design leads seeking scalable, AI-powered solutions for debugging in the era of increasingly complex SoCs. Join us to see how agentic AI is redefining RTL understanding—one waveform at a time.
Research Manuscript


Design
DES2A: In-memory and Near-memory Computing Circuits
DescriptionIsing solvers with hierarchical clustering have shown promise for large-scale Traveling Salesman Problems (TSPs), in terms of latency and energy. However, most of these methods still face unacceptable quality degradation as the problem size increases beyond a certain extent. Additionally, their hardware-agnostic adoptions limit their ability to fully exploit available hardware resources.
In this work, we introduce TAXI, an in-memory computing-based TSP accelerator with crossbar (Xbar)-based Ising macros. Each macro independently solves a TSP sub-problem, obtained by hierarchical clustering, without the need for any off-macro data movement, leading to massive parallelism. Within the macro, Spin-Orbit-Torque (SOT) devices serve as compact energy-efficient random number generators enabling rapid "natural annealing". By leveraging hardware-algorithm co-design, TAXI offers improvements in solution quality, speed, and energy-efficiency on TSPs up to 85,900 cities (the largest TSPLIB instance). TAXI produces solutions that are only 22% and 20% longer than the Concorde solver's exact solution on 33,810 and 85,900 city TSPs, respectively. TAXI outperforms a current state-of-the-art clustering-based Ising solver, being 8X faster on average across 20 benchmark problems from TSPLIB.
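As a rough software analogue of the per-macro solving style, the sketch below anneals one clustered TSP sub-problem with 2-opt style moves; it is a generic simulated-annealing stand-in, not TAXI's SOT-based hardware annealer.

# Generic simulated annealing on one TSP sub-cluster (placeholder data).
import math, random

def tour_length(tour, dist):
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def anneal_subproblem(dist, steps=20000, t0=1.0, cooling=0.9995, seed=0):
    rng = random.Random(seed)
    n = len(dist)
    tour = list(range(n))
    rng.shuffle(tour)
    cur_len = tour_length(tour, dist)
    best, best_len = tour[:], cur_len
    t = t0
    for _ in range(steps):
        i, j = sorted(rng.sample(range(n), 2))
        cand = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]  # 2-opt segment reversal
        cand_len = tour_length(cand, dist)
        # Always accept improvements; accept uphill moves with Boltzmann probability.
        if cand_len < cur_len or rng.random() < math.exp((cur_len - cand_len) / max(t, 1e-9)):
            tour, cur_len = cand, cand_len
            if cur_len < best_len:
                best, best_len = tour[:], cur_len
        t *= cooling
    return best, best_len

# Usage on a tiny random sub-cluster of cities:
rng = random.Random(1)
pts = [(rng.random(), rng.random()) for _ in range(12)]
dist = [[math.dist(a, b) for b in pts] for a in pts]
tour, length = anneal_subproblem(dist)
print(f"sub-tour length: {length:.3f}")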
Networking
Work-in-Progress Poster


DescriptionModern designs demand stringent performance metrics to sustain innovation. Advanced technological nodes offer potential enhancements, but the relationship between node size, speed, and power consumption is intricate, making node selection challenging. To address these challenges, we propose TEA-GNN, a framework leveraging Graph Neural Networks (GNNs) and adapters for rapid assessment of power and slack performance across multiple technology nodes. TEA-GNN uses a two-stage prediction process: first predicting timing metrics, and then using these inputs for power prediction, enhancing accuracy. Experimental results on real-world designs demonstrate significant runtime improvement and lower error compared to commercial tools, reducing design iteration time and enabling faster time-to-market.
Research Manuscript


Security
SEC3: Hardware Security: Attack & Defense
DescriptionIn quantum computing, quantum circuits are fundamental representations of quantum algorithms, which are compiled into executable functions for quantum solutions. Quantum compilers transform algorithmic quantum circuits into ones compatible with the target quantum computer, bridging quantum software and hardware. However, untrusted quantum compilers pose significant risks. They can lead to the theft of quantum circuit designs and compromise sensitive intellectual property (IP).
In this paper, we propose TetrisLock, a split compilation method for quantum circuit obfuscation that uses an interlocking splitting pattern to effectively protect IP with minimal resource overhead. Our approach divides the quantum circuit into two interdependent segments, ensuring that reconstructing the original circuit functionality is possible only by combining both segments and eliminating redundancies. This method makes reverse engineering by an untrusted compiler unrealizable, as the original circuit is never fully shared with any single entity.
Also, our approach eliminates the need for a trusted compiler to process the inserted random circuit, thereby relaxing the requirements. Additionally, it defends against colluding attackers while imposing low overhead by preserving the original depth of the quantum circuit. We demonstrate our method by using established RevLib benchmarks, showing that it achieves a minimal impact on functional accuracy (less than 1%) while significantly reducing the likelihood of IP inference.
Networking
Work-in-Progress Poster


DescriptionThis work introduces an innovative method for improving combinational digital circuits through random exploration in MIG-based synthesis. High-quality circuits are crucial for performance, power, and cost, making this a critical area of active research. Our approach incorporates next-state prediction and iterative selection, significantly accelerating the synthesis process. This novel method achieves up to 14× synthesis speedup and up to 20.94% better MIG minimization on the EPFL Combinational Benchmark Suite compared to state-of-the-art techniques. We further explore various predictor models and show that increased prediction accuracy does not guarantee an equivalent increase in synthesis quality of results or speedup, observing that randomness remains a desirable factor.
Engineering Special Session


Systems and Software
DescriptionWith the evolution and establishment of open source software, open hardware is aspiring to a similar stature of dependability and reliability in the industry. Over the past years, the chip design industry has witnessed the growth of open source and collaborative efforts across the entire spectrum of hardware development. The community of open hardware comprises several key ingredients necessary to design and build a chip, and those ingredients are served to a chip designer via a design environment.
These include:
-hardware instruction sets such as RISC-V, OpenPower
-Process Design Kits (PDKs) representing the manufacturing ingredients to design a chip
-Electronic Design Automation (EDA) software used for the construction, functional, electrical, and design verification of the design
-Cloud based design enablement platform
-Collaborative design of SoC components using a mix of new and off the shelf open source IP
This session will provide the audience with the latest developments in open source PDKs, EDA, cloud based design environments, and a collaborative chip design project known as Caliptra.
Keynote


AI
Design
DescriptionThe Design Automation Conference (DAC) has long been a beacon for technological foresight and innovation in the semiconductor industry. As we look beyond 2025, the landscape of chip design is poised for another transformative leap with the advent of reasoning agents. This evolution builds upon the foundational milestones set by the Electronics Resurgence Initiative (ERI), which revitalized U.S. semiconductor research, and the integration of cloud computing for silicon. The emergence of Generative AI (GenAI) heralds a new era of creativity and efficiency in design processes across multiple domains.
In this keynote, we will explore how reasoning agents are set to revolutionize the semiconductor industry by offering unprecedented capabilities in problem-solving and decision-making. These agents, drawing inspiration from scientific methodologies in other domains, promise to enhance the precision and speed of design, automate manual tasks, while also fostering a collaborative environment between human designers and AI systems. We will delve into the practical applications of these agents, showcasing their potential to streamline complex design challenges, drive innovation, and increase productivity.
The DAC continues to play a crucial role in this journey, serving as a platform for sharing insights, fostering collaboration, and setting the stage for the next wave of technological advancements. By embracing the synergy between AI and human expertise, we are not only shaping the future of microelectronics but also redefining the boundaries of what is possible in chip design. Join us as we navigate this exciting frontier and explore the opportunities that lie ahead.
Analyst Presentation


DescriptionThe way we transport humans and cargo is evolving at a pace not experienced since the beginning of the industrial revolution.
This session will provide a glimpse into the future of mobility, including ground transportation, marine, aviation and space.
We'll not only discuss vehicle autonomy, electrification and connectivity, but explore how new technologies are impacting other modes of transportation.
Exhibitor Forum


DescriptionAI is poised to transform our world just like the Internet did. Generative AI will dramatically improve chip design efficiency and productivity, addressing the semiconductor workforce gap as chip demand rises. Discover Microsoft's vision for generative AI-driven solutions in revolutionizing semiconductor development through new efficiencies, next-gen tools, and a transformative designer experience.
Exhibitor Forum


DescriptionFor decades, the EDA industry has been dominated by a few major players, creating high barriers to entry. Yet, a remarkable shift is underway. In the past two years, a wave of EDA startups has emerged, securing significant venture funding and gaining customer traction. This resurgence is often driven by AI/ML advancements, new semiconductor design challenges, and evolving market dynamics.
This panel brings together key stakeholders to discuss:
• Market drivers: What conditions and breakthroughs are fuelling this startup boom?
• Innovation focus: How are startups identifying market opportunities and leveraging AI to solve complex design problems?
• Funding trends: How have investor attitudes toward deep tech and EDA evolved?
• Industry dynamics: How are startups navigating relationships with established players and potential acquirers?
• Success strategies: What’s working in customer acquisition, differentiation, and scaling?
Our panel features EDA entrepreneurs, venture capitalists, and industry leaders shaping this transformation. Join us for an insightful discussion on the future of EDA innovation and the opportunities ahead.
Research Manuscript


Systems
SYS5: Embedded Memory and Storage Systems
DescriptionElastic block storage (EBS) with the storage-compute disaggregated architecture stands as a pivotal piece in today's cloud.
EBS furnishes users with storage capabilities through the elastic solid-state drive (ESSD).
Nevertheless, despite its widespread integration into cloud services, the absence of a thorough ESSD performance characterization raises a critical question: as more and more services are shifted onto the cloud, can the ESSD satisfactorily take over the storage responsibilities of the local SSD and offer comparable performance?
In this paper, we for the first time target this question by characterizing two ESSDs from Amazon AWS and Alibaba Cloud.
We present an unwritten contract of cloud-based ESSDs, encapsulating four observations and five implications for cloud storage users.
Specifically, the observations are counter-intuitive and contrary to the conventional perceptions of what one would expect from the local SSD.
We hope the implications could guide users in revisiting the designs of their deployed cloud software, i.e., harnessing the distinct characteristics of ESSDs for better system performance.
Engineering Poster
Networking


DescriptionEmerging technologies like AI/ML demand high compute power, increasing the temperature in high-activity regions of the chip. The effect is even more significant in multichip system architectures, as heat gets trapped between chip interfaces. High temperature impacts the performance and reliability of the product and can cause thermal runaway if cooling solutions cannot bring the overall system temperature down. The methodology and flow proposed in this work accurately model and analyze multiple chip-package thermal scenarios and capture tile-based thermal profiles for each die in reasonable runtime. This enables thermal-aware optimization of different design metrics and product signoff with high confidence.
Engineering Presentation


Back-End Design
Chiplet
DescriptionThe BSPDN (back-side power delivery network) is a technology introduced to improve performance and power efficiency in semiconductor processes. It is gaining attention as a next-generation design due to its advantages, such as improved power delivery efficiency, reduced signal/power routing interference, and enhanced performance. However, there are thermal management risks because the thickness of the silicon substrate where the transistors are located is thin, making heat transfer less effective compared to the FSPDN (front-side power delivery network). In this study, we analyzed the thermal characteristics of BSPDN based on its structure and evaluated the effect of thermal fillers/vias in reducing localized heating, such as self-heating.
Networking
Work-in-Progress Poster


DescriptionIn modern chips, much of the power is wasted on the power delivery network, causing thermal issues and voltage drops, which reduces the system's reliability. The voltage stacking technique addresses this problem by reusing charge in similar activity circuits. Still, it suffers from stack voltage ripple and power delivery loss due to the current imbalance caused by the activity factor variation in CMOS logic. This work presents a balancing technique that uses clocked differential logic with constant activity factor to eliminate current imbalance on a gate level in an entire SIMD processor datapath. Applied to a voltage stacked vector processor called TickTockStack, it achieves a 7.6x reduction of stack voltage ripple and a 33.5% smaller power delivery loss compared to the CMOS baseline with an 18.4% area overhead and without a significant impact on the performance.
Networking
Work-in-Progress Poster


DescriptionTiming attacks exploit variations in a program's execution times to extract sensitive information from the program (e.g. encryption keys, additive manufacturing pathways).
Typical solutions to timing side-channel vulnerabilities attempt to balance the execution time of the sensitive code for different control flow paths to eliminate the timing leakage (without much consideration given to the underlying hardware).
We propose TiLeR, a novel joint hardware-software methodology for mitigating timing side-channel vulnerabilities that utilizes timing values from real embedded devices. We implement and evaluate TiLeR on four embedded devices using six software benchmarks and observe a significant performance advantage for the fixed code compared to constant-time programming.
Engineering Poster


DescriptionIn this engineering track submission, we present a groundbreaking application of machine learning in the realm of timing triage. Our novel approach leverages the power of artificial intelligence to enable designers to perform what-if analysis and efficiently predict the outcome of design and logic changes on timing slack.
Our system learns from a vast corpus of data collected through the ingestion of multiple terabytes of project data into the timing visualizer tool. This enables the AI assistant to provide accurate and reliable predictions, allowing designers to make informed decisions and optimize their designs for performance.
With the Timing Visualizer AI Assistant, designers can quickly and easily explore different design scenarios, identify potential timing issues, and optimize their designs accordingly. This results in faster time-to-market, reduced costs, and improved product quality.
We demonstrate the effectiveness of our approach through a series of case studies and experiments, showing significant improvements in timing prediction accuracy and design optimization. Our novel application of machine learning in timing triage has the potential to revolutionize the field of design automation and transform the way designers work.
Engineering Poster
Networking


DescriptionThis paper discusses a "timing-aware smart PG fill" technique that allows higher insertion of PG fill without significantly affecting timing. Increased PG fill insertions lead to reduced IR drop, thereby decreasing the iterations of "PnR, STA, and DRC/LVS" needed to meet the IR drop target.
Research Manuscript


EDA
EDA7: Physical Design and Verification
DescriptionIn chip design, skew is a pivotal factor that significantly influences the overall performance for routing. A major challenge is how to achieve an appropriate trade-off between the total wire-length cost and skew. Selecting hub nodes is an effective method to improve this cost-skew trade-off. In this paper, we propose a novel reinforcement learning-based method for hub node selection, where our key idea is leveraging an effective adaptive learning strategy. Moreover, our approach is particularly suitable for solving large-scale routing instances. The empirical results suggest that our method can achieve promising performance on both small-scale and large-scale clock nets, implying its potential practical significance in EDA.
Networking
Work-in-Progress Poster


DescriptionAttention monitoring is crucial in various fields, especially those involving prolonged periods of passive observation. Electroencephalography (EEG) offers a cost-effective, portable, and non-invasive solution; however, the signal's intrinsic complexity necessitates advanced analysis techniques. This research proposes a novel approach for accurate attention classification. We selected a representative channel and applied Recursive Feature Elimination (RFE) to identify the most discriminating features, enhancing detection accuracy and minimizing energy consumption. On a public dataset, our optimized XGBoost algorithm achieved 98.29% and 94.25% accuracy for binary and three-class attention classification, respectively. The refined feature set reduces execution time by 9%, memory usage by 38%, and energy consumption by 8% on an Intel i9-13900 platform. On a Raspberry Pi 5, the improvements are 15%, 14%, and 14%, respectively. By addressing these challenges, our approach facilitates the adoption of wearable devices for attention monitoring in various settings.
Index Terms: EEG signal, attention detection, machine learning, feature engineering, wearable devices.
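The channel/feature-selection idea can be sketched generically as recursive feature elimination wrapped around an XGBoost classifier; the data, feature counts, and hyperparameters below are placeholders, not the study's actual pipeline.

# RFE around XGBoost on placeholder EEG-derived features.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))       # placeholder: 40 EEG-derived features
y = rng.integers(0, 2, size=500)     # placeholder: binary attention labels

clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
selector = RFE(clf, n_features_to_select=12, step=2).fit(X, y)

X_reduced = selector.transform(X)    # keep only the most discriminative features
acc = cross_val_score(XGBClassifier(n_estimators=200, max_depth=4,
                                    eval_metric="logloss"),
                      X_reduced, y, cv=5).mean()
print(f"CV accuracy on reduced feature set: {acc:.3f}")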
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
DescriptionEvent-based cameras, with their unique event stream representation, effectively mitigate motion blur in high-speed, high-exposure scenarios but suffer from low spatial resolution. To address this, we propose a super-resolution hardware accelerator for event streams based on Spiking Neural Networks (SNNs).
In terms of network architecture, we incorporate hardware-friendly algorithmic designs by simplifying neuron models and optimizing convolution operations. On the hardware side, the design adopts a hierarchical structure featuring a highly parallel computational array. Additionally, by proposing a Kernel-Channel-Timestamp-Row (KCTR) dataflow and dual-pipeline structure, the design achieves in-situ computing, eliminating intermediate storage within layers and significantly reducing inter-layer spike storage. Evaluations on the N-MNIST and ASL-DVS datasets demonstrate root mean square errors (RMSE) of 1.296 and 0.121 for reconstructed super-resolution event streams. In downstream applications, the classification accuracies reach 98.84% and 99.73%, respectively. The proposed accelerator, designed using a 28nm CMOS process, improves reconstruction speed by 95.6% compared to a GPU, operates at 500 MHz, and consumes only 0.546 pJ per synaptic operation.
Networking
Work-in-Progress Poster


DescriptionRouting is a key stage in current IC design. Because it is hard to optimize, several machine learning algorithms have been developed for it recently, but they often suffer from issues such as high training time complexity and a large demand for training data. Moreover, modern routing usually needs to consider various optimization objectives (e.g., the total wirelength or the maximum time delay), and it is prohibitively expensive to train a new model from scratch for each newly encountered routing objective. In this paper, we introduce a novel transfer learning framework that can significantly reduce the training time and the amount of newly added training data. The key part of our framework relies on a novel sampling idea called "model-guided coreset", which yields a 5-8X reduction in training complexity while preserving routing quality comparable to state-of-the-art learning-based approaches.
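One plausible reading of a model-guided coreset is to let a pretrained routing model pick the new-objective samples it is least certain about and fine-tune only on those; the uncertainty criterion and names below are assumptions for illustration, not the paper's exact selection rule.

# Illustrative "model-guided" coreset selection via predictive entropy.
import numpy as np

def select_coreset(features: np.ndarray, predict_proba, budget: int) -> np.ndarray:
    """Return indices of the `budget` samples with the highest predictive entropy."""
    probs = predict_proba(features)                      # shape (N, num_classes)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.argsort(entropy)[-budget:]

# Usage with any model exposing predict_proba (e.g., the source-objective model):
# idx = select_coreset(new_task_features, source_model.predict_proba, budget=2000)
# fine_tune(source_model, new_task_features[idx], new_task_labels[idx])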
Research Special Session


Design
DescriptionDespite significant progress in cryptographic approaches (e.g., homomorphic encryption, multi-party computation) and trusted execution environments such as secure enclaves, secure data management has remained elusive. Existing techniques exhibit trade-offs between the types of queries they can support efficiently and the security the solution offers. Practical solutions today mix multiple techniques into a single system to enable users to explore such trade-offs. Based on the sensitivity of the data and performance requirements, users can encrypt different parts of the data using different encryption mechanisms. Such an approach of combining multiple cryptographic techniques can, however, result in additional unintended leakage due to inferences based on the semantics of the underlying data. Our work develops a formal framework to define and reason about such leakage. It draws inspiration from database design theory to store data in a normalized form that mitigates such leakage. Furthermore, we explore hardware-oriented approaches for data processing over the normalized representation that prevent additional leakage. Using the proposed framework, one can build multi-cryptographic data processing solutions that allow users to explore trade-offs between security and execution efficiency without any additional unintended leakage.
Research Manuscript


AI
AI1: AI/ML Algorithms
DescriptionQuantum machine learning, crucial in the noisy intermediate-scale quantum (NISQ) era, confronts challenges in error mitigation. Current noise-aware training (NAT) methods often assume static error rates in quantum neural networks (QNNs), overlooking the dynamic nature of quantum noise. Our work highlights how error rates fluctuate over time and across different qubits, affecting QNN performance even when overall error rates are similar. We introduce a novel NAT strategy that dynamically adjusts to standard and fatal error conditions, incorporating a low-complexity search method to identify fatal errors during optimization. This strategy significantly improves robustness, maintaining competitive performance with leading NAT methods across varying error scenarios.
Research Manuscript


Design
DES4: Digital and Analog Circuits
DescriptionIntegrating deep learning with environmental perception enhances robotic adaptability to complex tasks. However, its "black-box" nature, such as the lack of uncertainty quantification, poses challenges for safety-critical applications, particularly in unstructured and noisy environments. Bayesian neural networks (BNNs) offer uncertainty quantification but are limited by high hardware overhead, restricting real-time implementation on resource-constrained robots. This paper presents a mixed-signal hardware accelerator for BNNs, utilizing probabilistic quantum tunneling in fully depleted silicon-on-insulator (FD-SOI) transistors to enable efficient, real-time uncertainty quantification. Device measurements indicate high-quality Gaussian random variable generation, validated through quantile-quantile plot analysis, with a high correlation coefficient (r = 0.997) at 200 fJ/sample. Leveraging such compact randomness, the parallel architecture achieved a 10^3-10^4× latency reduction at less than 2× area cost. Finally, in an uncertainty-aware visual localization application for autonomous underwater vehicles, the BNN model effectively distinguishes data noise from model uncertainty, yielding significant information gain and enhancing the resampling efficiency by 4.5× at the same accuracy.
Networking
Work-in-Progress Poster


DescriptionStandard cells are the core of digital Very Large Scale Integration designs, but advancements beyond the 7nm node have made layout design more challenging due to complex rules and limited routing resources. Effective placement is crucial, as minor transistor order changes can affect routability. Sequential place-and-route frameworks often expend significant resources on unroutable placements. To enhance efficiency, predicting placement routability before routing is vital. This paper proposes a transistor placement routability predictor using image recognition. Tested on 690 cells, it achieves prediction accuracies of 96.32% for single-height and 93.51% for double-height configurations.
Research Manuscript
TransRoute: A Novel Hierarchical Transistor-Level Routing Framework Beyond Standard-Cell Methodology
2:45pm - 3:00pm PDT Tuesday, June 24 3004, Level 3

EDA
EDA7: Physical Design and Verification
DescriptionIn advanced technology nodes, benefits from scaling have become limited in power, performance, and area, necessitating improvements through design-technology co-optimization (DTCO). However, standard-cell methodology, widely used in VLSI design, restricts the potential of DTCO due to its logic-level abstraction, hindering physical-level optimization. To address this, a transistor-level design approach is needed to break the abstraction of standard cells. While large-scale transistor placement has been explored, efficient routing at the transistor level remains a challenge. This paper introduces a novel transistor-level routing framework with a CP-SAT formulation for lower-layer routing, significantly reducing wirelength and design area compared to traditional standard-cell-based designs.
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
DescriptionTo guarantee service quality in transformer based large language model (LLM) serving, it is essential to meet the latency constraints of both the prefill phase (measured by Time-to-First-Token, TTFT) and the decode phase (measured by Time-per-Output-Token, TPOT). Nondisaggregated serving places prefill and decode on the same worker, while disaggregated serving places the prefill and decode on isolated workers. However, no single architecture excels in both TTFT and TPOT. Our analysis of the underlying reasons reveals that in disaggregated LLM serving, prefills have minimal interference with decodes but result in high queuing times. In contrast, nondisaggregated LLM serving effectively reduces queuing times but introduces significant interference between prefills and decodes.
In order to leverage the best aspects of both nondisaggregated and disaggregated LLM serving, we have designed and implemented Tropical. Tropical introduces a service-level objective (SLO)-aware multiplexing strategy that balances queuing time and interference, enabling LLM serving to achieve high TTFT and TPOT SLO attainment simultaneously. Our evaluation on real-world datasets reveals that Tropical outperforms both state-of-the-art nondisaggregated and disaggregated LLM serving systems, serving up to 2.09x more requests within 90% SLO attainment. Specifically, compared to the disaggregated LLM serving system, Tropical improves P90 TTFT performance by 9x with only a 15% reduction in P90 TPOT. Against the nondisaggregated LLM serving systems, Tropical delivers a 2.8x improvement in P90 TPOT performance while maintaining the same P90 TTFT.
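A toy decision rule conveys the flavor of SLO-aware multiplexing: colocate a prefill with ongoing decodes only if the estimated interference still meets the TPOT SLO, otherwise queue it on a dedicated prefill worker. The cost models and thresholds below are invented placeholders, not Tropical's scheduler.

# Toy SLO-aware placement decision for an incoming prefill request.
def choose_placement(prefill_tokens: int, queue_wait_ms: float,
                     ttft_slo_ms: float, tpot_slo_ms: float) -> str:
    # Invented linear cost models; a real system would profile these.
    est_prefill_ms = 0.05 * prefill_tokens
    est_tpot_if_colocated_ms = 8.0 + 0.02 * prefill_tokens  # decode slowdown if colocated
    ttft_colocated_ms = est_prefill_ms
    ttft_disaggregated_ms = queue_wait_ms + est_prefill_ms

    if est_tpot_if_colocated_ms <= tpot_slo_ms and ttft_colocated_ms <= ttft_slo_ms:
        return "colocate"        # interference small enough: best TTFT
    if ttft_disaggregated_ms <= ttft_slo_ms:
        return "disaggregate"    # protect decode TPOT; queuing still meets TTFT
    return "colocate"            # otherwise favor TTFT and accept TPOT risk

print(choose_placement(prefill_tokens=2048, queue_wait_ms=400.0,
                       ttft_slo_ms=500.0, tpot_slo_ms=50.0))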
Research Manuscript


EDA
EDA3: Timing Analysis and Optimization
DescriptionFast and accurate pre-routing timing prediction is essential in the chip design flow. However, existing machine learning (ML)-assisted pre-routing timing methods often overlook the impact of power delivery networks (PDN), which contribute to IR drop and routing congestion.
This limitation can make these methods less practical for real-world circuit design flows.
To address this, we propose two specialized encoders—an IR drop-aware encoder and a routing congestion-aware encoder—that effectively capture PDN effects through multimodal fusion of netlist, layout, and PDN data.
To mitigate the challenges of imbalanced multimodal fusion, we further develop a Pareto optimization approach to ensure balanced utilization of all modalities, enhancing timing prediction accuracy.
Comprehensive experiments on large-scale open-source designs using TSMC's 16nm technology node validate the superiority of our model over state-of-the-art pre-routing timing prediction methods.
Networking
Work-in-Progress Poster


DescriptionThis paper introduces TrustChain AI, a decentralized architecture designed for privacy-preserving aggregation of large language models (LLMs) across multiple institutions. By integrating federated learning with knowledge distillation, TrustChain AI enables collaborative model training without the need to share raw data, addressing critical privacy concerns inherent in multi-institutional collaborations. The architecture leverages the IOTA Tangle for decentralized trust mechanisms, ensuring immutable and verifiable logging of model updates to foster transparency and trust among participants. A consensus voting mechanism ensures that only validated updates contribute to the global model, enhancing the integrity of the aggregation process. Secure communication is facilitated through the Matrix protocol, providing end-to-end encryption for all exchanges. Additionally, computational overhead is decentralized, as institutions focus on fine-tuning specific tasks while benefiting from the shared knowledge of the aggregated global model. The architecture is designed with automation in mind, enabling seamless integration and operation across diverse environments. Experimental results indicate that TrustChain AI achieves high task performance while maintaining scalability and adaptability across heterogeneous environments. This work presents a viable solution for secure, scalable, and privacy-preserving collaborative LLM development, with potential extensions to various domains requiring data integrity, confidentiality, and automated operations
Engineering Presentation


AI
Systems and Software
DescriptionCollaboration in artificial intelligence (AI) faces significant challenges, including privacy concerns, trust deficits, and limitations in handling diverse tasks. Existing centralized and federated frameworks suffer from scalability issues, reliance on single points of failure, and a lack of transparent mechanisms for validating model updates. To address these gaps, we propose TrustChain, a decentralized AI framework that leverages IOTA Tangle for tamper-proof, immutable update logs and Matrix for secure, encrypted communication. TrustChain integrates task-specific models (e.g., BioBERT for NER, LLAMA for QA, and BioGPT) into a unified Global Model using knowledge distillation, enabling multitasking while preserving data privacy. Experimental results demonstrate that TrustChain achieves balanced performance across tasks, providing scalability and transparency through consensus-driven trust. By ensuring privacy, adaptability, and decentralized collaboration, TrustChain offers a robust foundation for secure, multi-institutional AI development.
Networking
Work-in-Progress Poster


DescriptionTraining large-scale models has become increasingly challenging because of the GPU memory wall problem.
Rematerialization in graph mode lacks universality, while eager mode is not efficient.
Moreover, existing methods are difficult to build on for further research because secondary development is cumbersome.
In this paper, we propose TSO, a unified and efficient training framework that boosts large-scale model rematerialization training via optimal tensor scheduling optimization, achieving universality and efficiency at the same time.
TSO is a non-intrusive modification to current frameworks and can be enabled with a single line of code.
Experimental results demonstrate that TSO achieves better training efficiency than state-of-the-art methods, with up to a 7.99× speedup even compared to intrusive methods.
Networking
Work-in-Progress Poster


DescriptionModern computer architectures face challenges in balancing hardware area and performance. Binary computing, known for its compactness, requires hardware area that scales quadratically with precision, while unary computing, despite its simplicity, suffers from exponentially increasing computation time. This paper introduces the Unary Positional System, a paradigm that combines spatial and temporal characteristics to address this trade-off. A key component, CMem, is developed for UPS-based arithmetic hardware and applied to GEMM. Experimental results show that UPS bridges the gap between binary and unary computing, with its flexibility offering potential for further optimization.
Research Manuscript


Systems
SYS2: Design of Cyber-Physical Systems and IoT
DescriptionWearable internet of things (IoT) devices are transforming various healthcare applications, such as rehabilitation, vital symptom monitoring, and activity recognition. However, the small form factor of wearable devices constrains the battery capacity and operating lifetime, thus requiring frequent recharging or battery replacements. Frequent recharging and battery replacement lead to lower quality of service and user satisfaction. Harvesting energy from ambient sources to augment the battery has emerged as an effective technique to improve the operating lifetime. However, ambient energy sources are highly stochastic, making energy management challenging. Prior approaches typically use point predictions to estimate future energy, which does not account for uncertainty. In strong contrast to prior approaches, this paper presents a conformal prediction-based method for predicting future energy harvest. The proposed method provides tight prediction regions while ensuring coverage guarantees. The predictions are then leveraged in an energy management algorithm that employs Monte Carlo sampling to evaluate multiple trajectories of decisions with varying energy harvest. The decisions from each trajectory are combined using a lightweight machine learning model to make an energy management decision that follows an optimal trajectory. Experiments with two diverse datasets with about 10 users show that the proposed approach achieves more than 90% coverage with tight prediction intervals. The energy management algorithm makes decisions that are within 2 J of an optimal Oracle, thus showing its effectiveness in improving the quality of service.
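The coverage-guaranteed interval can be sketched with split conformal prediction: calibrate on held-out harvest measurements, then widen a point prediction by the calibrated residual quantile. The point predictor and numbers below are placeholders, not the paper's model or datasets.

# Split conformal prediction interval for the next energy-harvest slot.
import numpy as np

def conformal_interval(point_pred, cal_preds, cal_truth, alpha=0.1):
    """Return a (lo, hi) interval with roughly (1 - alpha) marginal coverage."""
    residuals = np.abs(cal_truth - cal_preds)                 # calibration errors
    n = len(residuals)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)      # finite-sample correction
    q = np.quantile(residuals, level)
    return point_pred - q, point_pred + q

# Usage: calibrate on held-out harvest measurements, then bound the next slot.
cal_preds = np.array([1.2, 0.8, 1.5, 0.9, 1.1])               # predicted energy (J)
cal_truth = np.array([1.0, 0.9, 1.7, 0.7, 1.2])               # measured energy (J)
lo, hi = conformal_interval(point_pred=1.3, cal_preds=cal_preds,
                            cal_truth=cal_truth, alpha=0.1)
print(f"next-slot harvest in [{lo:.2f}, {hi:.2f}] J with ~90% coverage")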
Research Manuscript


Design
DES2A: In-memory and Near-memory Computing Circuits
DescriptionTransformer-based large language models (LLMs) have achieved impressive performance in various natural language processing (NLP) applications. However, the high memory and computation cost induced by the KV cache limits the inference efficiency, especially for long input sequences. Compute-in-memory (CIM)-based accelerators have been proposed for LLM acceleration with KV cache pruning. However, as existing accelerators only support static pruning with a fixed pattern or dynamic pruning with primitive implementations, they suffer from either high accuracy degradation or low efficiency. In this paper, we propose a ferroelectric FET (FeFET)-based unified content addressable memory (CAM) and CIM architecture, dubbed as UniCAIM. UniCAIM features simultaneous support for static and dynamic pruning with 3 computation modes: 1) in the CAM mode, UniCAIM enables approximate similarity measurement in O(1) time for dynamic KV cache pruning with high energy efficiency; 2) in the charge-domain CIM mode, static pruning can be supported based on accumulative similarity score, which is much more flexible compared to fixed patterns; 3) in the current-domain mode, exact attention computation can be conducted with a subset of selected KV cache. We further propose a novel CAM/CIM cell design that leverages the multi-level characteristics of FeFETs for signed multi-bit storage of the KV cache and in-place attention computation. With extensive experimental results, we demonstrate UniCAIM can reduce the area-energy-delay product (AEDP) by 8.2×~831× over the state-of-the-art CIM-based LLM accelerators at the circuit level, along with high accuracy comparable with dense attention at the application level, showing its great potential for efficient long-context LLM inference.
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionCurrent algorithm-hardware co-search works often suffer from lengthy training times and inadequate exploration of hardware design spaces, leading to suboptimal performance. This work introduces UniCoS, a unified framework for co-optimizing neural networks and accelerators for CNNs and Vision Transformers (ViTs). By introducing a novel training-free proxy that evaluates accuracy within seconds and a clustering-based algorithm for exploring heterogeneous dataflows, UniCoS efficiently navigates the design spaces of both architectures. Experimental results demonstrate that the solutions generated by UniCoS consistently surpass state-of-the-art (SOTA) methods (e.g., 3.54x energy-delay product (EDP) improvement with a 1.76% higher accuracy on ImageNet) while requiring notably reduced search time (up to 48x, ~3 hours). The code will be open-sourced.
Networking
Work-in-Progress Poster


DescriptionSkyrmion Racetrack Memory is a promising non-volatile memory solution, known for high access performance, density, and energy efficiency. However, issues arise with excessive shift and insert operations, especially for large-scale data like hyperdimensional computing. Additionally, the von-Neumann architecture faces memory bandwidth bottlenecks in such contexts. To address this, the proposed architecture enhances Skyrmion Racetrack Memory by enabling in-memory hyperdimensional computing and minimizing unnecessary operations. It employs a bit-interleaved-like data layout to reduce shift operations and a paired nanotrack counter to transform insert/delete tasks into faster shift operations, achieving an 8x shift reduction compared to traditional designs.
Research Manuscript


Systems
SYS3: Embedded Software
DescriptionTo provide flexibility and low-level interaction capabilities, the ''unsafe'' keyword in Rust is essential in many projects, but it undermines memory safety and introduces Undefined Behaviors (UBs). Eliminating these UBs requires a deep understanding of Rust's safety rules and strong typing. Traditional methods require in-depth code analysis, which is laborious and depends on hand-crafted knowledge. The semantic understanding capabilities of LLMs offer new opportunities to solve this problem, but existing LLM-based debugging frameworks, while strong on semantic tasks, are limited by fixed processes and lack adaptive, dynamic adjustment capabilities. Inspired by the dual-process theory of decision-making (''Fast and Slow Thinking''), we present an LLM-based framework called RustBrain that automatically and flexibly minimizes UBs in Rust projects. Fast thinking extracts features to generate solutions, while slow thinking decomposes, verifies, and generalizes them abstractly. To feed verification and generalization results back into solution generation, enabling dynamic adjustment and precise outputs, RustBrain integrates the two thinking modes through a feedback mechanism. Experimental results on the Miri dataset show a 94.3% pass rate and an 80.4% execution rate, improving both flexibility and the safety of Rust projects.
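As a rough, hypothetical illustration of such a dual-process loop (not RustBrain's actual implementation), the sketch below wires a fast fixing step to slow decomposition, verification, and generalization through a feedback list; all four helper functions are toy stand-ins for the framework's LLM prompting, goal decomposition, Miri-based checking, and abstraction steps.

def llm_fast_fix(project, feedback):
    # Toy placeholder for the fast-thinking LLM call that proposes a patch.
    return {"patch": "replace raw pointer cast", "hints": list(feedback)}

def decompose(patch):
    # Toy placeholder for slow thinking: split the patch into checkable sub-goals.
    return ["compiles", "passes Miri", "preserves behavior"]

def verify_with_miri(project, sub_goal):
    # Toy placeholder verifier; a real check would run the toolchain.
    return True, f"{sub_goal}: ok"

def generalize(report):
    # Toy placeholder that abstracts a failure into a reusable lesson.
    return f"lesson learned from: {report}"

def repair_undefined_behavior(project, max_rounds=5):
    feedback = []                                    # lessons fed back to fast thinking
    for _ in range(max_rounds):
        patch = llm_fast_fix(project, feedback)      # fast thinking: propose a fix
        for sub_goal in decompose(patch):            # slow thinking: decompose and verify
            ok, report = verify_with_miri(project, sub_goal)
            if not ok:
                feedback.append(generalize(report))  # abstract the failure for reuse
                break
        else:
            return patch                             # every sub-goal verified
    return None

print(repair_undefined_behavior("toy_rust_project"))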
TechTalk


DescriptionThe semiconductor industry is experiencing unprecedented growth, and this growth comes with significant challenges—more design starts, rising design complexities, shorter time-to-market, and a shrinking talent pool. To address these challenges, semiconductor companies are turning to AI-powered EDA solutions. While mainstream AI & GenAI technologies have seen rapid consumer adoption, adapting these AI technologies for EDA use cases is not straightforward due to stringent quality requirements for semiconductor design.
Ideally, EDA AI solutions that provide productivity boosts to chip designers and engineers should (a) seamlessly analyze design and verification data, (b) optimize complex processes, and (c) generate better designs. Across these functional areas, we will discuss illustrative ML, GenAI, and Agentic approaches. Additionally, we will also discuss the challenges associated with AI adoption, including data availability, model interpretability, and computational demands.
Further, we will discuss the grand vision of having a purpose-built centralized EDA AI platform. Such a platform framework can be very powerful by combining sophisticated foundational models or even IC domain-specific foundational models with a multimodal data lake to bring GenAI capabilities to push the boundaries of semiconductor innovation, paving the way for more efficient, scalable, and intelligent design processes.
Join us to explore the capabilities of EDA AI and see what the future holds!
Engineering Poster
Networking


DescriptionDigital verification presents a critical challenge in achieving maximum device coverage in the shortest possible time while using the minimum amount of resources. However, balancing these objectives often seems impossible. Increasing the number of random runs enhances coverage but extends verification runtime, while increasing parallel runs reduces runtime but consumes more resources.
In this context, we propose a novel digital verification flow that synergizes two AI engines to maximize benefits. Our methodology initially employs an AI engine to create new randomization constraints by correlating random actions to coverage targets, thereby achieving maximum coverage with fewer tests and reducing regression runtime. Subsequently, a second AI engine reshuffles the test execution sequence and parallelism by estimating the duration of each test, further reducing regression runtime and decreasing the number of resources used.
This innovative combination optimizes coverage, reduces verification time, decreases the overall number of required tests, and improves resource efficiency.
Engineering Presentation


Front-End Design
Chiplet
DescriptionGlitch power is a critical concern in digital design. When the signal timing paths in a combinational circuit are imbalanced, race conditions arise, causing glitches along the paths. Research indicates that glitch power can account for up to 40% of total power consumption, posing a significant challenge for designs involving large-scale combinational logic chips such as CPUs, GPUs, and AI processors.
To optimize glitch power, two key factors must be addressed: the magnitude of glitch power within the design and its distribution across functional blocks. While existing tools can identify glitches and measure their power impact using accurate delay data and gate-level netlists, this analysis typically occurs during the placement and routing (P&R) stage—too late for effective design optimization. Current EDA flows attempt to address this by estimating wire delays during the RTL stage using P&R engines. However, this approach is time-consuming, requires RTL designers to have in-depth P&R knowledge, and may still yield discrepancies when compared to the final tape-out netlist.
In this paper, we propose an alternative approach to address these challenges at the RTL design stage. Our methodology leverages a uniform delay-aware engine to estimate glitch power caused by imbalances in combinational logic. Additionally, a statistical scaling factor is applied to account for delay effects. We validated this approach across eight different design blocks and three technical corners. The results demonstrate less than 10% variance in total power compared to gate-level netlist power, with a well-matched glitch power distribution.
This level of accuracy is sufficient for identifying glitch risks and optimizing critical combinational logic blocks. Furthermore, our solution is faster and more accessible for RTL designers, eliminating the need for extensive P&R expertise.
Engineering Presentation


Front-End Design
Chiplet
DescriptionAt RTL, VC LP verifies that the design is correct with respect to the RTL logic and the UPF power format. Electrical issues introduced after multi-voltage cell (isolation/level-shifter) insertion are caught only post-synthesis.
Traditional low-power verification performs UPF syntax and semantic checks at the RTL stage, while structural and integration checks are performed at the netlist stage. Catching low-power issues at later stages is costly, so there is strong demand to catch post-synthesis issues at the RTL stage itself and to improve accuracy with respect to the netlist stage.
VC LP already offers a Virtual Instrumentation based Predictive Flow that virtually instruments isolation/level-shifter cells in the design, but it has limitations: heavy processing in the internal crossover database, no GUI or TCL support, and no netlist-level checks.
Predictive analysis using Design Editing is introduced to catch most issues upfront at RTL rather than after synthesis or P&R. Low-power elements (isolation/level-shifter cells) are instrumented in the RTL design to resemble the synthesized netlist, and netlist-level checks are performed on these instrumented cells. All VC LP checks work seamlessly with the instrumented design without any special handling, and GUI and TCL are supported by default.
On a customer's RTL design, the Design Editing flow delivered excellent results (~99% accuracy) and netlist-level checking capability, reducing noise and producing a cleaner design for synthesis with fewer costly late-stage bugs.
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
DescriptionVector similarity search plays a pivotal role in modern applications, including recommendation systems, image search, large language models (LLMs), and high-dimensional data retrieval. As data size scales, our research reveals that the search phase imposes substantial demands on DRAM bandwidth, leading to performance limitations in conventional von Neumann architecture with shared memory buses. This data movement bottleneck restricts the efficiency and scalability of vector similarity search due to insufficient memory bandwidth. To mitigate this issue, we leverage UPMEM, an off-the-shelf near-memory processing (NMP) system, to minimize the data movement between memory and compute units. However, UPMEM's computing engine has certain limitations and requires thorough application integration to unleash its high-parallelism capabilities. In this work, we introduce UPMEM-aware Vector Similarity Search (UPVSS), an architecture-aware system that jointly manages vector similarity search and UPMEM's NMP technology. UPVSS prioritizes offloading operations based on their strengths and capabilities, effectively alleviating the data movement bottleneck and improving overall system performance.
Engineering Poster


DescriptionUSB4 Router MAC verification faces challenges such as managing complex protocol interactions, maintaining signal integrity at high data rates (up to 80 Gbps), error handling, effective link training, and creating accurate environments for comprehensive testing. Traditionally, a USB4 MAC DUT is verified using a PHY interface designed for USB4, or a PIPE-based PHY that handles multiple protocols. Creating a reliable, high-quality USB4 PHY requires intensive verification across different technologies, which places PHY readiness on the critical path of the overall USB4 verification. In addition, the serial PHY interface makes simulation time very high and hurts performance, so separating PHY verification from MAC verification can help. To address these issues, we introduce a solution in which USB4 Version 1 and Version 2.0 MAC DUTs can be verified without using a serial interface.
Engineering Poster
Networking


DescriptionValidation of Liberty (.lib) files received from external sources can be challenging and require significant resources. Several factors contribute to discrepancies in the accuracy of .lib files, such as the characterization settings used, the extracted netlist, SPICE models, the versions of simulators or tools employed, and the margins incorporated into the .lib file modeling process. Many of these factors may not be visible to end-users which adds complexity.
The proposed methodology enables users to plug the missing data for a dataset that is starved of it, thereby providing more clarity when comparing with the golden dataset. It leverages AI to allow end-users to overcome the challenge of limited information on how the IP was created. Using this, potential issues with external .lib files where incorrect process models were utilized during the characterization process were identified. By detecting this discrepancy much earlier in the overall development workflow, corrective actions were taken based on the deviations that were identified.
In summary, the solution enabled the early identification of a problem with the process models used for external library characterization. This granted the team ample time to address the issue proactively, avoiding downstream complications and negative impacts to project timeline and resource requirements.
Engineering Poster
Networking


DescriptionIn the design automation industry, triaging timing violations is typically performed by categorizing violations based on known path attributes such as startpoint, endpoint, slack magnitude, clocks, data/clock path, and pin names. This approach, while effective, often results in an overwhelming number of violations within the same category, especially in early design stages and dirty designs, making efficient triage challenging. I propose a novel method for triaging violations using machine learning, specifically clustering: instead of sorting paths by predefined attributes, paths are grouped by their similarity across these attributes. By changing the metrics considered by the clustering algorithm and configuring the algorithm itself, we can adjust the size and breadth of the clusters, giving control over cluster generation and providing better insights for the timing team. This clustering approach allows the identification of relationships between paths that are not easily discernible through traditional methods or visual inspection.
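A minimal sketch of this idea, assuming path attributes are one-hot encoded and grouped with k-means (the poster's metrics and algorithm choices may differ), is shown below; changing the feature set or the cluster count is exactly the knob that makes clusters wider or narrower.

import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.cluster import KMeans

# Hypothetical violation records; real flows would export these from the timing tool.
paths = [
    {"startpoint": "core/regA", "endpoint": "core/regB", "clock": "clk_fast", "slack": -0.12},
    {"startpoint": "core/regA", "endpoint": "core/regC", "clock": "clk_fast", "slack": -0.10},
    {"startpoint": "io/regX",   "endpoint": "io/regY",   "clock": "clk_slow", "slack": -0.45},
]

cat = OneHotEncoder(handle_unknown="ignore").fit_transform(
    [[p["startpoint"], p["endpoint"], p["clock"]] for p in paths]).toarray()
num = StandardScaler().fit_transform([[p["slack"]] for p in paths])
features = np.hstack([cat, num])

# The cluster count (and the algorithm itself) controls how coarse or fine the groups are.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)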
Research Manuscript


Systems
SYS4: Embedded System Design Tools and Methodologies
DescriptionVerifying hardware designs in embedded systems is crucial but often labor-intensive and time-consuming. While existing solutions have improved automation, they frequently rely on unrealistic assumptions. To address these challenges, we introduce a novel framework, UVLLM, which combines Large Language Models (LLMs) with the Universal Verification Methodology (UVM) to relax these assumptions. UVLLM significantly enhances the automation of testing and repairing error-prone Register Transfer Level (RTL) codes, a critical aspect of verification development. Unlike existing methods, UVLLM ensures that all errors are triggered during verification, achieving a syntax error fix rate of 86.99% and a functional error fix rate of 71.92% on our proposed benchmark. These results demonstrate a substantial improvement in verification efficiency. Additionally, our study highlights the current limitations of LLM applications, particularly their reliance on extensive training data. We emphasize the transformative potential of LLMs in hardware design verification and suggest promising directions for future research in AI-driven hardware design methodologies.
The Repo. of dataset and code: https://anonymous.4open.science/r/UVLLM/.
Networking
Work-in-Progress Poster


DescriptionEnsuring fault tolerance in Cyber-Physical Systems (CPSs) is challenging due to their complexity and stringent safety requirements. Modern fault-tolerant approaches guarantee fault detection, isolation, and mitigation, but lack systematic approaches to prove their effectiveness and correctness. This paper presents a simulation framework integrating fault injection and contract-based monitoring to validate fault tolerance under diverse conditions. Unlike nominal behavior-based methods, it refines contract specifications through fault-driven scenarios, defining acceptable fault severity and enhancing trust in detection mechanisms. This approach enables early fault detection and precise assessment of critical components by supporting continuous monitoring and allowing prompt corrective actions, improving fault management in dynamic environments. A proof-of-concept implementation demonstrates the framework's effectiveness in assessing fault impacts both in multi-physics components and their controller modules, highlighting its potential to enhance the reliability and resilience of complex CPSs.
Research Manuscript


AI
AI4: AI/ML System and Platform Design
DescriptionLarge Language Models (LLMs) excel in natural language processing tasks but pose significant computational and memory challenges for edge deployment due to their intensive resource demands. This work addresses the efficiency of LLM inference by algorithm-hardware-dataflow tri-optimizations. We propose a novel voting-based KV cache eviction algorithm, balancing hardware efficiency and algorithm accuracy by adaptively identifying unimportant kv vectors. From a dataflow perspective, we introduce a flexible-product dataflow and a runtime reconfigurable PE array for matrix-vector multiplication. The proposed approach effectively handles the diverse dimensional requirements and solves the challenges of incrementally varying sequence lengths. Additionally, an element-serial scheduling scheme is proposed for nonlinear operations, such as softmax and layer normalization. Results demonstrate a substantial reduction in latency, accompanied by a significant decrease in hardware complexity, from $O(N)$ to $O(1)$. The proposed solution is realized in a custom-designed accelerator, VEDA, which outperforms existing hardware platforms. This research represents a significant advancement in LLM inference on resource-constrained edge devices, facilitating real-time processing, enhancing data privacy, and enabling model customization.
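The eviction idea can be illustrated with a toy score-based policy that keeps the KV vectors receiving the most accumulated attention; VEDA's actual voting rule and its adaptive thresholding may differ, and every shape below is made up.

import numpy as np

def evict_kv(keys, values, attn_history, keep):
    # keys/values: (T, d) caches; attn_history: (queries, T) attention weights.
    votes = attn_history.sum(axis=0)                 # accumulated attention per cached token
    keep_idx = np.sort(np.argsort(votes)[-keep:])    # retain top-`keep`, preserve time order
    return keys[keep_idx], values[keep_idx]

T, d = 16, 8
rng = np.random.default_rng(1)
K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d))
A = rng.random((4, T)); A /= A.sum(axis=1, keepdims=True)   # 4 recent queries' attention
K2, V2 = evict_kv(K, V, A, keep=8)
print(K2.shape, V2.shape)                            # (8, 8) (8, 8)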
Research Manuscript


Systems
SYS2: Design of Cyber-Physical Systems and IoT
DescriptionAs FPGAs gain popularity for on-demand application acceleration in data center computing, dynamic partial reconfiguration (DPR) has become an effective fine-grained sharing technique for FPGA multiplexing. However, current FPGA sharing encounters partial reconfiguration contention and task execution blocking problems introduced by the DPR, which significantly degrade application performance. In this paper, we propose VersaSlot, an efficient spatio-temporal FPGA sharing system with novel Big.Little slot architecture that can effectively resolve the contention and task blocking while improving resource utilization. For the heterogeneous Big.Little architecture, we introduce an efficient slot allocation and scheduling algorithm, along with a seamless cross-board switching and live migration mechanism, to maximize FPGA multiplexing across the cluster. We evaluate the VersaSlot system on an FPGA cluster composed of the latest Xilinx UltraScale+ FPGAs (ZCU216) and compare its performance against four existing scheduling algorithms. The results demonstrate that VersaSlot achieves up to 13.66x lower average response time than the traditional temporal FPGA multiplexing, and up to 2.19x average response time improvement over the state-of-the-art spatio-temporal sharing systems. Furthermore, VersaSlot enhances the LUT and FF resource utilization by 35% and 29% on average, respectively.
Research Manuscript


Design
DES6: Quantum Computing
DescriptionWhile existing quantum hardware resources have limited availability and reliability, there is a growing demand for exploring and verifying quantum algorithms. Efficient classical simulators for high-performance quantum simulation are critical to meeting this demand. However, due to the vastly varied characteristics of classical hardware, implementing hardware-specific optimizations for different hardware platforms is challenging.
To address these needs, we propose CAST (Cross-platform Adaptive Schrödinger-style Simulation Toolchain), a novel compilation toolchain with cross-platform (CPU and Nvidia GPU) optimization and high-performance backend support. CAST exploits a novel sparsity-aware gate fusion algorithm that automatically selects the best fusion strategy and backend configuration for the targeted hardware platform. CAST also aims to offer a versatile, high-performance backend for different hardware platforms. To this end, CAST provides LLVM IR-based vectorization optimization for various CPU architectures and instruction sets, as well as a PTX-based code generator for Nvidia GPU support.
Analyst Presentation


DescriptionWe will examine the financial performance and key business metrics of the EDA industry through 2024, the further consolidation of EDA (the combination of Synopsys-Ansys), as well as the material technical and market trends and requirements that have influenced EDA business performance and strategies. Among the trends, we will again examine the progression of semiconductor R&D spending and how the market values of the publicly-held EDA companies have evolved. Lastly, we will provide our updated financial projections for the EDA industry for 2025 and 2026.
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
DescriptionGraphics Processing Units (GPUs) play a pivotal role in high-performance computing by leveraging concurrent thread execution to enhance processing efficiency, known as thread-level parallelism (TLP). However, effectively managing the substantial memory access generated by numerous concurrent threads presents a significant challenge, impacting performance and energy consumption. While existing scheduling methods are recognized solutions for memory challenges in GPUs, they often overlook the significant potential of data sharing among non-adjacent warps or Cooperative Thread Arrays (CTAs), thereby limiting their effectiveness in reducing unnecessary memory accesses through supporting data sharing. Our observation underscores the untapped potential of non-adjacent data sharing, motivating our work to optimize schedulers more effectively. In this paper, we present VISTA, a smart locality-aware scheduler for GPUs that dynamically identifies data locality patterns at both the CTA and warp levels using intelligent scheduling decisions to enhance memory efficiency and overall performance. VISTA significantly reduces unnecessary memory accesses. Simulation results demonstrate a 48.1% and 51.8% improvement in performance and energy consumption compared to the baseline for memory-intensive GPGPU applications, respectively, with negligible hardware overhead.
Research Manuscript


Design
DES5: Emerging Device and Interconnect Technologies
DescriptionTransformer models have achieved state-of-the-art performance in various natural language processing (NLP) and computer vision (CV) tasks.
To meet their substantial computational demands, the computing-in-memory (CiM) architectures, which alleviate the memory wall problem and enable efficient vector-matrix multiplication (VMM), have been adopted for transformer accelerators.
However, the dynamic VMM involved in the attention mechanism, which necessitates runtime write operations, presents significant challenges for non-volatile memory (NVM)-based CiM designs.
High write overhead, complex compute-write-compute (CWC) dependencies, and limited endurance reduce their effectiveness.
In this paper, we propose VQT-CiM, a ferroelectric FET (FeFET)-based CiM design that accelerates vector quantization (VQ) enhanced transformers by eliminating the runtime write operations.
VQT-CiM quantizes keys and values in self-attention to convert dynamic VMMs in inner-product and weighted-sum into static VMMs with the codebooks, enabling efficient calculations with CiM crossbars.
However, directly applying VQ hinders the accuracy of the transformer model due to its limited representation capability.
To address this, we introduce a vector quantization scheme that integrates residual VQ (RVQ) and product VQ (PVQ) for enhanced representation space.
We present an efficient hardware implementation for the proposed VQT-CiM with optimized dataflow in RVQ, which incorporates the FeFET-based CiM crossbars and peripheral digital circuits.
Evaluation results suggest that VQT-CiM achieves the 3.54$\times$ and 4.53$\times$ improvements in energy efficiency and throughput, respectively, compared to state-of-the-art NVM-based CiM transformer designs.
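For intuition, the toy sketch below encodes key vectors with two-stage residual VQ against k-means codebooks; VQT-CiM's codebook construction, product-VQ split, and FeFET crossbar mapping are more involved, so treat this only as an illustration of how static codebooks replace runtime writes.

import numpy as np
from sklearn.cluster import KMeans

def train_rvq(X, stages=2, codebook_size=16, seed=0):
    # Each RVQ stage quantizes the residual left by the previous stage.
    codebooks, residual = [], X.copy()
    for _ in range(stages):
        km = KMeans(n_clusters=codebook_size, n_init=10, random_state=seed).fit(residual)
        codebooks.append(km.cluster_centers_)
        residual = residual - km.cluster_centers_[km.labels_]
    return codebooks

def rvq_encode(x, codebooks):
    # Encode one vector as a list of codeword indices, one per stage.
    codes, residual = [], x.copy()
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes                         # static codebooks stand in for dynamic KV writes

keys = np.random.default_rng(2).normal(size=(256, 32))   # made-up key vectors
cbs = train_rvq(keys)
print(rvq_encode(keys[0], cbs))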
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
DescriptionSparse general matrix-matrix multiplication (SpGEMM) serves as a fundamental operation in real-world applications such as deep learning. Different from general matrix multiplication, matrices in SpGEMM are highly sparse and therefore require a compact representation. This places an additional burden on data preprocessing and exchanging and also causes irregular memory access patterns, which can in turn lead to communication and computation bottlenecks. To break these bottlenecks, we present VSpGEMM, a hardware accelerator for SpGEMM that is tailored and optimized on Versal ACAP. Firstly, a new storage format called BCSX is proposed in VSpGEMM, which offers a unified and block-wise compression strategy to deal with both row-major and column-major representation of non-zero data, enabling fixed-pattern memory accesses and effective data preloading. Secondly, a multi-level tiling mechanism is introduced to decompose the holistic SpGEMM into multiple computation granularities that fit into the AI Engines (AIEs) on Versal in a hierarchical manner, enhancing data reuse. Thirdly, a hybrid partitioning scheme is presented to orchestrate both the AIEs and programmable logic (PL) for intermediate product merging, which together resolve the issues of high memory utilization and communication demand. Experimental results demonstrate a 2.65× speedup over state-of-the-art (SOTA) GEMM design on Versal and an average 33.62× improvement in energy efficiency compared to cuSPARSE on RTX 4090 GPU, showing the efficacy of VSpGEMM.
Research Manuscript


Design
DES6: Quantum Computing
DescriptionIsing model-based Quantum Error Correction decoders reduce topological complexity compared to classical decoders. However, the SOTA Ising decoder has a higher time complexity than union-find (UF) and a lower threshold than minimum-weight perfect-matching (MWPM). We propose the Weighted Range-Constrained Ising Model-Based (WRIM) decoder. WRIM uses a polygonal region to enclose flipped syndromes, ensuring the coverage of all potential error chains while optimizing coupling and external field coefficients. WRIM reduces the variable count by 97.8x, achieves microsecond-level decoding, and has a worst-case time complexity of O(n), outperforming UF. WRIM exhibits threshold behavior up to 10.7-11.0%, surpassing the MWPM's highest reported threshold.
Research Manuscript


Design
DES2A: In-memory and Near-memory Computing Circuits
DescriptionProcessing-in-Memory (PIM) aims to address the costly data movement between processing elements and memory subsystem, by computing simple operations inside DRAM in parallel.
The large capacity, wide activation size during cell access, and the maturity of DRAM technology, make this technology a great choice for PIM techniques. Nonetheless, vulnerability to process variation and noises, internal leakage of the cells, and high latency in cell access, limit the utilization of processing in DRAMs for real-world applications.
This work proposes a fast PIM technique, called WISEDRAM, which leverages one row of special cells, called X-cells, to enable in-DRAM bulk-bitwise operations. Unlike previous approaches, WISEDRAM retains the conventional DRAM cell access procedure, thereby ensuring the reliability of cell access for reads and writes at a level equivalent to that of conventional DRAMs. Compared with the state-of-the-art, WISEDRAM exhibits 22% reduction in average bitwise computation latency and a 71% improvement in XOR operation execution speed, while imposing an area overhead of 1.6%.
Workshop


Design
Sunday Program
DescriptionContemporary microelectronic design is facing tremendous challenges in memory bandwidth, processing speed and power consumption. Although recent advances in monolithic design (e.g. near-memory and in-memory computing) help relieve some issues, the scaling trend is still lagging behind the ever-increasing demand of AI, HPC and other applications. In this context, technological innovations beyond a monolithic chip, such as 2.5D and 3D packaging at the macro and micro levels, are critical to enabling heterogeneous integration with various types of chiplets and bringing significant performance and cost benefits for future systems. Such a paradigm shift further drives new innovations on chiplet IPs, heterogeneous architectures and system mapping.
This workshop is designed to be a highly interactive, timely, and informative forum on related topics:
● Roadmap and technology perspectives of heterogeneous integration
● IP definition for chiplets
● Signaling interfaces across chiplets
● Network topology for data movement
● Design solutions for power delivery
● Thermal management
● Testing in a heterogeneous system
● High-level synthesis for the chiplet system
● Architectural innovations
● Ecosystems of IPs and EDA tools
The format of the workshop will consist of multiple invited presentations from industry, academia, and government funding agencies. We will also organize a panel for discussions. The intended audience includes industry and academic researchers, funding agencies, IP providers, EDA tool vendors, and foundry engineers.
Learn more at https://nanocad.ee.ucla.edu/?page_id=1307
Research Manuscript


EDA
EDA2: Design Verification and Validation
DescriptionIn modern digital circuit design, verifying the equivalence of arithmetic circuits is a significant and challenging task. This paper introduces X-SAT, a new circuit solver based on the Conflict-Driven Clause Learning (CDCL) algorithm, which integrates structural elimination techniques to reduce the number of variables and clauses while maintaining the circuit structure. Additionally, branching heuristics have been enhanced specifically for the structure of arithmetic circuits. Experimental results demonstrate that X-SAT significantly outperforms the best previously available circuit solver on all benchmarks. Further, X-SAT performs better than state-of-the-art CNF-based SAT solvers on complex arithmetic circuits, underscoring its significant potential in circuit design verification.
Research Manuscript


AI
AI3: AI/ML Architecture Design
DescriptionBinarization is a promising approach to significantly reduce computational complexity by replacing multiplications with hardware-efficient XNOR operations. However, the binarization of LLM activations often leads to severe accuracy degradation, while weight-only binarization fails to eliminate multipliers due to the Self-Attention mechanism. Furthermore, LLMs exhibit distinctive channel-level data distribution characteristics and differing computational and memory requirements between the Pre-fill and Decoding stages, necessitating a specialized inference framework.
In response, we introduce XShift, an algorithm-hardware co-design framework optimized for efficient binarized LLM inference on FPGAs. XShift incorporates three key contributions: (1) a hardware-friendly XNOR-Shift Encoding (XSE) format that transforms traditional multiplications into XNOR and shift operations, ensuring scalability and precision; (2) Hardware Adaptive Outlier and Sparsity (HAOS) techniques, which exploit channel-level data distribution and systolic array architectures for optimized quantization and sparsification; and (3) a dedicated hardware accelerator featuring an XNOR-Shift Systolic Array (XSSA) and an enhanced Base-2 SoftMax Converter (BSMC), designed to address the specific computational demands of binarized LLMs.
Experimental evaluations on Alveo U280 and U50 FPGAs demonstrate that XShift achieves a 10-15× reduction in DSP resource usage while surpassing existing accelerators and GPUs in inference performance. Specifically, XShift delivers an average speedup of 4.17-4.76× and a 6.95-14.29× improvement in energy efficiency, alongside lower perplexity compared to other low-precision LLM techniques. These results underscore the potential of XShift for edge deployment of LLMs.
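The arithmetic identity behind XNOR-based binarization is easy to check: for vectors with entries in {-1, +1}, the dot product equals 2·popcount(XNOR(a, w)) − n once each vector is packed into a bit mask. The sketch below verifies only this identity; XShift's shift-based scaling and FPGA datapath are not modeled.

import numpy as np

def binarized_dot(a_bits, w_bits, n):
    # XNOR marks the positions where the two sign bits agree.
    xnor = ~(a_bits ^ w_bits)
    matches = bin(xnor & ((1 << n) - 1)).count("1")
    return 2 * matches - n               # matches minus mismatches

rng = np.random.default_rng(3)
a = rng.choice([-1, 1], size=16)
w = rng.choice([-1, 1], size=16)
pack = lambda v: int("".join("1" if x > 0 else "0" for x in v), 2)   # +1 -> bit 1, -1 -> bit 0
assert binarized_dot(pack(a), pack(w), 16) == int(np.dot(a, w))
print(binarized_dot(pack(a), pack(w), 16))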
Research Manuscript


EDA
EDA8: Design for Manufacturing and Reliability
DescriptionThree-dimensional integration technologies present a promising path forward for extending Moore's law, facilitating high-density interconnects between chips and supporting multi-tier architectural designs. Cu-Cu hybrid bonding has emerged as a favored technique for the integration of chiplets at high interconnect density. This paper introduces YAP, a yield model for wafer-to-wafer (W2W) and die-to-wafer (D2W) hybrid bonding process. The model accounts for key failure mechanisms that contribute to yield loss, including overlay errors, particle defects, Cu recess variations, excessive wafer surface roughness, and Cu density. We also develop an open-source yield simulator and compare the accuracy of the near-analytical yield model with the simulation results. The results demonstrate that YAP achieves virtually identical accuracy while offering over 10,000x faster runtime. YAP enables the co-optimization of packaging technologies, assembly design rules, and overall design methodologies. We used YAP to examine the impact of bonding pitch, compare W2W and D2W hybrid bonding for varying chiplet sizes, and explore the benefits of tighter process controls, such as improved particle defect density.
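As a generic illustration of multiplicative yield modeling (not YAP's near-analytical formulation), the sketch below combines a Poisson particle-defect term with a Gaussian overlay-error term raised to the number of bond pads; every parameter value is invented for the example.

import math

def particle_yield(area_cm2, defect_density_per_cm2):
    # Classic Poisson yield model for random particle defects.
    return math.exp(-area_cm2 * defect_density_per_cm2)

def overlay_yield_per_pad(pad_pitch_um, overlay_sigma_um, margin_frac=0.25):
    # A pad connects if |overlay| stays within a fraction of the pitch on each axis,
    # with overlay modeled as zero-mean Gaussian per axis (illustrative only).
    margin = margin_frac * pad_pitch_um
    p_axis = math.erf(margin / (overlay_sigma_um * math.sqrt(2)))
    return p_axis ** 2

pads = 10_000                                        # invented pad count
die_yield = particle_yield(1.0, 0.05) * overlay_yield_per_pad(10.0, 0.3) ** pads
print(f"illustrative bonded-die yield: {die_yield:.3f}")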
Research Manuscript


Design
DES2A: In-memory and Near-memory Computing Circuits
DescriptionIn this paper, we further explore the potential of analog in-memory computing (AiMC) and introduce an innovative artificial intelligence (AI) accelerator architecture named YOCO, featuring three key proposals: (1) YOCO proposes an innovative 8-bit in-situ multiply arithmetic (IMA) achieving 123.8 TOPS/W energy efficiency and 34.9 TOPS throughput through efficient charge-domain computation and a time-domain accumulation mechanism. (2) YOCO employs a hybrid ReRAM-SRAM memory structure to balance computational efficiency and storage density. (3) YOCO tailors an IMC-friendly attention computing flow with an efficient pipeline to accelerate the inference of transformer-based AI models. Compared to three SOTA baselines, YOCO on average improves energy efficiency by 3.9×~19.9× and throughput by 6.8×~33.6× across 10 CNN/transformer models.
Research Manuscript


Security
SEC3: Hardware Security: Attack & Defense
DescriptionWhile Last-Level Cache (LLC) side-channel attacks often target inclusive caches, directory-based attacks on non-inclusive caches have been demonstrated on Intel and ARM processors. However, the vulnerability of AMD's non-inclusive caches to such attacks has remained uncertain, primarily due to challenges in reverse-engineering cache addressing, constructing eviction sets, and evicting private cache lines.
This paper addresses these challenges and demonstrates the feasibility of conducting LLC side-channel attacks on AMD's non-inclusive caches. We first reverse-engineer the cache addressing functions for the L2 set index, L3 slice, and L3 set index. Leveraging this insight, we construct the first eviction sets on AMD processors. We then introduce the first LLC side-channel attack on AMD's Zen series CPUs. The effectiveness of our approach is validated by attacking OpenSSL's AES T-table.
Research Manuscript


Security
SEC2: Hardware Security: Primitives & Architecture, Design & Test
DescriptionTrusted Execution Environments (TEEs) provide robust hardware-based isolation to mitigate data breaches and privacy risks. Confidential Virtual Machines (VM) extend these capabilities by using VM as their execution abstraction, offering superior compatibility over process-based TEEs like Intel SGX. The rising demand for Confidential VMs has spurred innovations from major chip manufacturers, such as AMD SEV, Intel TDX, and Arm CCA, and their integration into leading cloud platforms, including AWS, Azure, and Google Cloud. On the RISC-V platform, however, existing TEE architectures rely on process-level abstractions or custom hardware, leading to limited compatibility and scalability.
This paper presents Zion, a confidential VM architecture for commodity RISC-V hardware that operates without custom extensions. Zion ensures security, flexibility, and efficiency through a short-path confidential VM mode and a secure vCPU mechanism for protecting and efficiently updating vCPU states, enhancing context-switching performance. It combines Physical Memory Protection (PMP) with paging for scalable memory isolation, employs a hierarchical memory structure for efficient management, and introduces a split-page-table-based mechanism for secure memory sharing with virtio devices. Evaluations show Zion achieves under 5% overhead in real-world applications, demonstrating its practicality.
Research Manuscript


Security
SEC3: Hardware Security: Attack & Defense
DescriptionZero-knowledge succinct non-interactive arguments of knowledge (zk-SNARK) schemes have become a promising technique for verified computation. Zk-SNARK schemes were designed to be mathematically secure against cryptographic attacks, but it remains unclear whether they are vulnerable to fault injection attacks. In this work, we provide a positive answer by presenting ZK-Hammer, which leaks secrets from zk-SNARK schemes via Rowhammer. We induce faults in the exponentiation variables of the Quadratic Arithmetic Program (QAP), then analyze the faulty proof using the bilinear pairing technique and recover the secret. We perform a Rowhammer fault evaluation on libsnark and identify 3 CVEs.
Research Manuscript


Security
SEC1: AI/ML Security/Privacy
DescriptionIn cloud computing, services are hosted on cloud servers: clients send their data to the server and obtain the results it returns. However, the computation, data, and results are prone to tampering due to vulnerabilities on the server side, so verifying the integrity of computation is important in the client-server setting. Zero-Knowledge Proof (ZKP) is a cryptographic method renowned for enabling private and verifiable computing: it allows the client to validate that the results from the server are computed correctly without violating the privacy of the server's intellectual property. Zero-Knowledge Succinct Non-Interactive Arguments of Knowledge (zkSNARKs), in particular, have been widely applied in applications like blockchain and verifiable machine learning. Despite their popularity, existing zkSNARK approaches remain highly computationally intensive; even basic operations like matrix multiplication require an extensive number of constraints, resulting in significant overhead. To address this challenge, we introduce zkVC, which optimizes the ZKP computation for matrix multiplication, enabling rapid proof generation on the server side and efficient verification on the client side. zkVC integrates optimized ZKP modules, such as Constraint-reduced Polynomial Circuit (CRPC) and Prefix-Sum Query (PSQ), collectively yielding a more than 12-fold increase in proof speed over prior methods. The code is available at https://anonymous.4open.science/r/zkformer-5E69/
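To see why dedicated matrix-multiplication checks pay off, the sketch below runs a classic Freivalds-style randomized verification, which confirms C = A·B with O(n²) work per trial instead of recomputing the O(n³) product; this is a generic textbook check, not zkVC's CRPC or PSQ construction.

import numpy as np

def freivalds(A, B, C, trials=20, seed=4):
    # Each trial tests A @ (B @ r) == C @ r for a random 0/1 vector r.
    # A wrong C slips through a single trial with probability at most 1/2.
    rng = np.random.default_rng(seed)
    n = C.shape[1]
    for _ in range(trials):
        r = rng.integers(0, 2, size=n)
        if not np.array_equal(A @ (B @ r), C @ r):
            return False                     # definitely not equal
    return True                              # equal with high probability

n = 64
A = np.random.default_rng(5).integers(0, 10, (n, n))
B = np.random.default_rng(6).integers(0, 10, (n, n))
C = A @ B
print(freivalds(A, B, C), freivalds(A, B, C + 1))    # expected: True False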
Research Manuscript


Design
DES6: Quantum Computing
DescriptionQuantum circuit execution often requires transpilation into hardware-compatible instructions, which can significantly alter the original design, making equivalence checking essential. However, existing approaches struggle with scalability and computational overhead. In this paper, we present ZXNet, a transformative framework for quantum circuit equivalence checking using ZX calculus-based graph abstractions. Leveraging graph neural networks, ZXNet captures complex equivalence patterns by integrating critical local and global circuit features. ZXNet achieves 99.4% validation accuracy, and up to 62× speedup over state-of-the-art methods, furnishing improvements of 45.83% in scalability, 42.22% in per-qubit verification time, and 5.94% in accuracy, significantly outperforming state-of-the-art approaches.
Sessions
Research Special Session


AI
Research Special Session


Design
Research Manuscript


EDA
EDA7: Physical Design and Verification
Engineering Presentation


AI
Front-End Design
Chiplet
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
Research Manuscript


Design
DES2A: In-memory and Near-memory Computing Circuits
Research Manuscript


Security
SEC1: AI/ML Security/Privacy
Research Manuscript


Security
SEC2: Hardware Security: Primitives & Architecture, Design & Test
Research Manuscript


EDA
EDA7: Physical Design and Verification
Exhibitor Forum


Exhibitor Forum


Engineering Presentation


AI
Back-End Design
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
Engineering Presentation


AI
Systems and Software
Chiplet
Research Manuscript


AI
AI4: AI/ML System and Platform Design
Exhibitor Forum


Research Manuscript


Security
SEC4: Embedded and Cross-Layer Security
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
Research Manuscript


Design
DES2A: In-memory and Near-memory Computing Circuits
Research Manuscript


EDA
EDA3: Timing Analysis and Optimization
Research Manuscript


Design
DES4: Digital and Analog Circuits
Research Manuscript


Design
DES6: Quantum Computing
Exhibitor Forum


Exhibitor Forum


Exhibitor Forum


Exhibitor Forum


Research Special Session


Design
Research Manuscript


Security
SEC3: Hardware Security: Attack & Defense
Engineering Presentation


Back-End Design
Chiplet
Engineering Presentation


IP
Exhibitor Forum


Ancillary Meeting
Early Career Workshop
9:00am - 6:00pm PDT Sunday, June 22 3006, Level 3

Research Manuscript


AI
AI3: AI/ML Architecture Design
Research Manuscript


EDA
EDA2: Design Verification and Validation
Research Manuscript


EDA
EDA1: Design Methodologies for System-on-Chip and 3D/2.5D System-in Package
Research Manuscript


Systems
SYS4: Embedded System Design Tools and Methodologies
Research Manuscript


AI
AI3: AI/ML Architecture Design
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
Engineering Presentation


Front-End Design
Engineering Presentation


AI
Back-End Design
Research Manuscript


EDA
EDA8: Design for Manufacturing and Reliability
Research Manuscript


Systems
SYS2: Design of Cyber-Physical Systems and IoT
Research Manuscript


EDA
EDA9: Design for Test and Silicon Lifecycle Management
Exhibitor Forum


Research Manuscript


AI
AI1: AI/ML Algorithms
Engineering Presentation


AI
Back-End Design
Chiplet
Ancillary Meeting
HACK @ DAC
9:00am - 6:00pm PDT Monday, June 23 Level 2 Lobby

Ancillary Meeting
HACK @ DAC
9:00am - 6:00pm PDT Sunday, June 22 Level 2 Lobby

Ancillary Meeting
HACK @ DAC Awards
3:30pm - 5:30pm PDT Tuesday, June 24 3012, Level 3

Exhibitor Forum


Research Manuscript


EDA
EDA6: Analog CAD, Simulation, Verification and Test
Exhibitor Forum


Engineering Presentation


IP
Exhibitor Forum


Research Special Session


AI
Research Manuscript


Systems
SYS5: Embedded Memory and Storage Systems
Research Manuscript


AI
AI4: AI/ML System and Platform Design
Research Manuscript


EDA
EDA6: Analog CAD, Simulation, Verification and Test
Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
Research Special Session


Design
Exhibitor Forum


Research Manuscript


EDA
EDA5: RTL/Logic Level and High-level Synthesis
Research Manuscript


Design
DES3: Emerging Models of Computation
Engineering Poster
Networking


Engineering Poster
Monday Poster Gladiator Battle
5:00pm - 6:00pm PDT Monday, June 23 DAC Pavilion, Level 2 Exhibit Hall

Back-End Design
Front-End Design
IP
Systems and Software
Networking
Work-in-Progress Poster


Research Manuscript


EDA
EDA7: Physical Design and Verification
Research Manuscript


AI
AI4: AI/ML System and Platform Design
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
Networking
Networking Reception
6:00pm - 7:00pm PDT Tuesday, June 24 Level 2 Lobby

Research Manuscript


Security
SEC3: Hardware Security: Attack & Defense
Engineering Presentation


AI
IP
Chiplet
Research Manuscript


Security
SEC2: Hardware Security: Primitives & Architecture, Design & Test
Research Special Session


EDA
Research Manuscript


AI
AI1: AI/ML Algorithms
Exhibitor Forum


Exhibitor Forum


Ancillary Meeting
PhD Forum & University Demo
7:00pm - 9:00pm PDT Tuesday, June 24 Level 2 Lobby

Research Manuscript


Systems
SYS3: Embedded Software
Research Manuscript


Design
DES6: Quantum Computing
Research Manuscript


Design
DES6: Quantum Computing
Research Special Session


Design
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
Exhibitor Forum
Real Intent: Static Sign-Off Methodologies: Liberating Functional Verification from Boolean Shackles
1:45pm - 2:15pm PDT Monday, June 23 Exhibitor Forum, Level 1 Exhibit Hall

Research Manuscript


Systems
SYS6: Time-Critical and Fault-Tolerant System Design
Research Manuscript


EDA
EDA2: Design Verification and Validation
Research Manuscript


Design
DES5: Emerging Device and Interconnect Technologies
Exhibitor Forum


Exhibitor Forum


Exhibitor Forum


Research Manuscript


AI
AI3: AI/ML Architecture Design
Research Manuscript


AI
AI2: AI/ML Application and Infrastructure
Research Manuscript


AI
AI4: AI/ML System and Platform Design
Research Manuscript


AI
AI1: AI/ML Algorithms
Engineering Presentation


AI
Systems and Software
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
Research Manuscript


EDA
EDA7: Physical Design and Verification
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
Networking
Work-in-Progress Poster


Engineering Presentation


AI
IP
Ancillary Meeting


Exhibitor Forum


TechTalk
TechTalk Session - To Be Announced
11:15am - 12:00pm PDT Tuesday, June 24 DAC Pavilion, Level 2 Exhibit Hall

Research Manuscript


Systems
SYS1: Autonomous Systems (Automotive, Robotics, Drones)
Research Manuscript


Design
DES2B: In-memory and Near-memory Computing Architectures, Applications and Systems
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
Research Manuscript


Systems
SYS5: Embedded Memory and Storage Systems
Research Manuscript


AI
AI3: AI/ML Architecture Design
Research Manuscript


Security
SEC1: AI/ML Security/Privacy
Engineering Poster


Engineering Poster
Tuesday Poster Gladiator Battle
5:00pm - 6:00pm PDT Tuesday, June 24 DAC Pavilion, Level 2 Exhibit Hall

Back-End Design
Front-End Design
IP
Systems and Software
Research Manuscript


AI
AI1: AI/ML Algorithms
Research Manuscript


Design
DES1: SoC, Heterogeneous, and Reconfigurable Architectures
Engineering Presentation


Front-End Design
Engineering Presentation


Front-End Design
Chiplet
Research Manuscript


EDA
EDA4: Power Analysis and Optimization
Research Manuscript


AI
AI3: AI/ML Architecture Design
Engineering Poster
Networking


Engineering Poster
Wednesday Poster Gladiator Battle
3:00pm - 4:00pm PDT Wednesday, June 25 DAC Pavilion, Level 2 Exhibit Hall

Back-End Design
Front-End Design
IP
Systems and Software
Research Manuscript


EDA
EDA8: Design for Manufacturing and Reliability
Ancillary Meeting
Young Fellows Closing Ceremony
3:30pm - 5:30pm PDT Wednesday, June 25 3012, Level 3

Ancillary Meeting
Young Fellows Kick-Off and All-Day Activities
9:00am - 6:00pm PDT Sunday, June 22 3018, Level 3

Ancillary Meeting
Young Fellows Posters
4:00pm - 6:00pm PDT Tuesday, June 24 Level 2 Lobby
