publications
2026
- TVLSI. Hardware Acceleration of Kolmogorov–Arnold Network (KAN) in Large-Scale Systems. Wei-Hsing Huang*, Jianwei Jia*, Yuyao Kong, Faaiq Waqar, Tai-Hao Wen, Meng-Fan Chang, and Shimeng Yu. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Apr 2026
Recent developments have introduced Kolmogorov–Arnold networks (KANs), an innovative architectural paradigm capable of replicating conventional deep neural network (DNN) capabilities with significantly fewer parameters by using parameterized B-spline functions with trainable coefficients. Nevertheless, the B-spline functions inherent to KAN architectures introduce distinct hardware acceleration complexities. While B-spline function evaluation can be accomplished through lookup table (LUT) implementations that directly encode the functional mappings, thus minimizing computational overhead, such approaches still demand considerable circuit infrastructure, including LUTs, multiplexers, decoders, and associated components. This work presents an algorithm-hardware co-design approach for KAN acceleration. The algorithm-level techniques include alignment–symmetry and PowerGap KAN hardware-aware quantization and a KAN sparsity-aware mapping strategy; the circuit-level techniques include an N:1 time modulation dynamic voltage input generator with analog compute-in-memory (ACIM) circuits. Furthermore, this work conducts comprehensive evaluations on large-scale KAN networks to validate the proposed methodologies. Nonideality factors, including partial sum deviations arising from process variations, have been evaluated with the statistics measured from the TSMC 22-nm RRAM-ACIM prototype chips. With optimally determined KAN hyperparameters and circuit optimizations implemented and evaluated at the 22-nm technology node, the model sizes for large-scale tasks in this work increase by 435K× to 756K× compared to tiny-scale tasks in previous work, yet the area overhead increases by only 26K× to 40K× and power consumption by only 48× to 93×, while accuracy degradation remains minimal at 0.11%–0.22%, demonstrating the scaling potential of the proposed architecture.
- TED. Cryogenic Characterization of Ferroelectric Nonvolatile Capacitors. Madhav Vadlamani, Dyutimoy Chakraborty, Jianwei Jia, Halid Mulaosmanovic, Stefan Duenkel, Sven Beyer, Suman Datta, and Shimeng Yu. IEEE Transactions on Electron Devices, Apr 2026
Ferroelectric-based capacitive crossbar arrays have been proposed for energy-efficient in-memory computing in the charge domain. They avoid challenges such as sneak paths and high static power faced by resistive crossbar arrays, but they are susceptible to thermal noise, which limits the effective number of bits (ENOB) for the weighted sum. A direct way to reduce this thermal noise is to lower the temperature, as thermal noise is proportional to temperature. In this work, we first characterize the nonvolatile capacitors (nvCAPs) on a foundry 28-nm platform at cryogenic temperatures to evaluate the memory window (MW) and on-state retention as a function of temperature down to 77 K. We then use the calibrated device models to simulate the capacitive crossbar arrays in SPICE at lower temperatures and demonstrate higher ENOB (5 bits) for 128×128 multiply-and-accumulate (MAC) operations.
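As a rough illustration of the temperature argument in the abstract above, the sketch below assumes that the dominant noise is the textbook sampled kT/C noise on a capacitor and that the signal swing stays fixed as the temperature drops. The 1 fF capacitance is a hypothetical value chosen for illustration, not a figure from the paper.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K


def ktc_noise_rms(capacitance_f: float, temperature_k: float) -> float:
    """RMS thermal (kT/C) noise voltage sampled on a capacitor, in volts."""
    return math.sqrt(K_B * temperature_k / capacitance_f)


def enob_gain(t_hot_k: float, t_cold_k: float) -> float:
    """Bits gained by cooling, assuming thermal noise is the only limit.

    Noise power scales with T, so noise voltage scales with sqrt(T),
    and each halving of noise voltage is worth one bit.
    """
    return 0.5 * math.log2(t_hot_k / t_cold_k)


C_EXAMPLE = 1e-15  # hypothetical 1 fF capacitor (assumption)

for temp in (300.0, 77.0):
    v_n = ktc_noise_rms(C_EXAMPLE, temp)
    print(f"T = {temp:5.1f} K: v_n = {v_n * 1e3:.2f} mV rms")

print(f"ENOB gain, 300 K -> 77 K: {enob_gain(300.0, 77.0):.2f} bits")
```

Under these assumptions, cooling from room temperature to 77 K buys roughly one extra bit. The measured devices may behave differently, since device parameters such as the memory window also change with temperature.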
- OJSSCS. Non-Volatile Digital Compute-in-Memory Macro with Ferroelectric FET-based Voltage Divider Weight Cells Featuring Power-Gating. Matthew Chen, Vaidehi Garg, Jianwei Jia, Jay Sonawane, Omkar Phadke, and Shimeng Yu. IEEE Open Journal of the Solid-State Circuits Society, Mar 2026
Digital compute-in-memory (DCIM) architectures have shown great potential for AI/ML inference by minimizing energy-intensive data movement and increasing processing parallelism while providing lossless computations in hardware. Prior works have utilized conventional memories such as SRAM for weight storage, which are volatile and may consume excess standby power due to high cell leakage. In this work, we present a non-volatile DCIM (nvDCIM) macro with embedded ferroelectric field-effect transistors (FeFETs) used as weight storage for low-power inference and applications at the edge. Our proposed dual n-type FeFET bitcell provides a voltage-domain output through the voltage divider effect, enabling it to directly drive the DCIM architecture’s CMOS adder trees with a binary voltage output. The FeFET-based bitcell’s non-volatility further enables power-gating to eliminate idle power in the design with instant on/off operations, removing the need for time-consuming weight reloading. When powered, the bitcell demonstrates low standby power (6.25 pW at 0.8 V supply), allowing for active energy efficiency competitive with conventional DCIM designs. For 4b×4b multiply-accumulate (MAC) operations, our 4 Kb macro taped out in the GlobalFoundries 28 nm process achieves 106.6 TOPS/W at 0.8 V supply with minimal accuracy loss (-0.11%) and enables a 77.7% total power reduction at low activity (e.g., 1% activity factor). Additionally, a dual p-type FeFET cell is proposed to further boost area efficiency. Further analysis evaluates sources of inaccuracy in the taped-out bitcell, tradeoffs between various bitcell structures and implementations for FeFET-based DCIM, and comparisons to existing analog and digital CIM designs and future architectures.
2025
- SSCL. A 28-nm FeFET Compute-in-Memory Macro With 64×64 Array Size and On-Chip 4-Bit Flash ADC. Vaidehi Garg, Jianwei Jia, Omkar Phadke, and Shimeng Yu. IEEE Solid-State Circuits Letters, Dec 2025
Compute-in-memory (CIM) using emerging nonvolatile memory devices is a promising candidate for energy-efficient deep neural network (DNN) inference at the edge. Ferroelectric field-effect transistors (FeFETs) have recently gained attention as nonvolatile, CMOS-compatible devices with a higher on/off ratio and lower read and write energy compared to resistive random-access memory (RRAM). This work demonstrates a 4-kb FeFET-CIM macro fabricated in the GlobalFoundries 28-nm high-k metal gate (HKMG) process. The macro consists of a 64×64 FeFET array with peripheral circuits for program, erase, and current-mode CIM operations and eight 4-bit Flash ADCs to quantize the analog partial sums. The proposed design achieves an energy efficiency of 346.6 TOPS/W for 1×1 b MAC, an inference accuracy of 85.2% for 16-row parallel compute with 4-bit ADC resolution, and 89.1% for 8-row parallel compute with 3-bit resolution, compared to a software baseline of 89.7% on the VGG-8 model for CIFAR-10.
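To make the interplay between row parallelism and ADC resolution in the abstract above more concrete, here is a minimal behavioral sketch of a binary dot product whose partial sums are quantized before accumulation. It assumes an ideal ADC with uniformly spaced levels and no device variation; the function names and the 64-element random vectors are hypothetical and do not model the fabricated macro.

```python
import random


def quantize_partial_sum(psum: int, rows: int, adc_bits: int) -> int:
    """Quantize a partial sum in [0, rows] with an ideal uniform ADC."""
    levels = 2 ** adc_bits - 1          # highest ADC output code
    code = round(psum * levels / rows)  # analog-to-digital conversion
    return round(code * rows / levels)  # rescale the code back to a count


def cim_dot(inputs, weights, rows_parallel: int, adc_bits: int) -> int:
    """1b x 1b dot product computed rows_parallel rows at a time."""
    total = 0
    for start in range(0, len(inputs), rows_parallel):
        chunk_in = inputs[start:start + rows_parallel]
        chunk_w = weights[start:start + rows_parallel]
        psum = sum(i & w for i, w in zip(chunk_in, chunk_w))
        total += quantize_partial_sum(psum, rows_parallel, adc_bits)
    return total


rng = random.Random(0)
inputs = [rng.randint(0, 1) for _ in range(64)]   # hypothetical activations
weights = [rng.randint(0, 1) for _ in range(64)]  # hypothetical weights
exact = sum(i & w for i, w in zip(inputs, weights))

print("exact dot product:   ", exact)
print("16 rows, 4-bit ADC:  ", cim_dot(inputs, weights, 16, 4))
print("8 rows, 3-bit ADC:   ", cim_dot(inputs, weights, 8, 3))
```

In this idealized model each quantized partial sum is off by at most one count. On real hardware, the error also depends on analog nonidealities that this sketch leaves out.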
- JXCDC. Reconfigurable Ferroelectric Bandpass Filter With Low-Frequency Noise Analysis for Intracardiac Electrogram Monitoring. Jianwei Jia, Zhenge Jia, Omkar Phadke, Yiyu Shi, and Shimeng Yu. IEEE Journal on Exploratory Solid-State Computational Devices and Circuits, Jun 2025
Implantable cardioverter defibrillators (ICDs) provide real-time monitoring and immediate defibrillation for life-threatening arrhythmias. However, the intracardiac electrogram (IEGM) acquisition of ICDs faces stringent constraints, including power consumption, low-frequency noise, and patient-specific physiological variability. This article introduces an ultralow-power, high-resolution, reconfigurable three-stage bandpass filter designed specifically for IEGM, utilizing ferroelectric field-effect transistor (FeFET) technology provided by a foundry platform. By employing the adjustable threshold voltage V_th and gate capacitance of the FeFET as programmable pseudo-high-value resistors (PHVRs) and capacitor structures, the filter enables personalized cardiac signal isolation tailored to individual patient needs. In addition, this work incorporates, for the first time, a comprehensive low-frequency noise model covering the entire operational region of the FeFET into circuit-level analysis. Based on the GlobalFoundries (GF) 28-nm SLPe FeFET-enabled process, the proposed filter achieves a wide gain tuning range (17–77 dB) and a flexible bandwidth tuning range (0.5–19 Hz for low cutoff frequency and 23–138 Hz for high cutoff frequency), with an average power consumption of 257 nW and a minimum resolution of 11 μV.
- ISCAS. Digital Compute-in-Memory Ising Annealer with Ferroelectric Capacitor-Based nvSRAM for Combinatorial Optimization Problems. Yuyao Kong, Jianwei Jia, Anni Lu, Faaiq Waqar, Yuan-Chun Luo, Hai Li, Ian Young, and Shimeng Yu. In 2025 IEEE International Symposium on Circuits and Systems (ISCAS), May 2025
Combinatorial optimization problems (COPs) have a wide range of applications. The Ising model-based annealer is gaining attention for its efficiency and speed in finding approximate solutions. However, building an Ising machine in CMOS that is area- and energy-efficient, scalable, and low in compute latency is challenging. In this paper, we present a digital compute-in-memory (DCIM) Ising annealer that uses ferroelectric capacitor (FeCap)-based nvSRAM to solve COPs such as the Traveling Salesman Problem (TSP). By using weak recall operations, our design eliminates the need to reload weights, significantly reducing energy consumption and speeding up processing compared to other approaches. Simulations using a 16-nm PDK demonstrate that our nvSRAM-based DCIM array maintains accuracy while reducing latency by up to 55.0% and energy by 49.6% compared to prior work implemented with a conventional SRAM DCIM array. Algorithm validation further shows that the random noise introduced by weak recall can be effectively utilized in the annealing process.
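For readers unfamiliar with Ising annealing, the following sketch shows a generic software simulated-annealing loop on a small Ising model. It is not the paper's algorithm or hardware flow: the Metropolis acceptance rule stands in for the random noise that the abstract attributes to weak recall, and the four-spin ferromagnetic ring, the temperature schedule, and all names are assumptions chosen for illustration.

```python
import math
import random


def ising_energy(spins, coupling):
    """Ising energy E = -sum_{i<j} J[i][j] * s_i * s_j (no external field)."""
    n = len(spins)
    return -sum(coupling[i][j] * spins[i] * spins[j]
                for i in range(n) for j in range(i + 1, n))


def anneal(coupling, steps=2000, t_start=2.0, t_end=0.05, seed=0):
    """Single-spin-flip simulated annealing with a geometric cooling schedule."""
    rng = random.Random(seed)
    n = len(coupling)
    spins = [rng.choice((-1, 1)) for _ in range(n)]
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / (steps - 1))
        i = rng.randrange(n)
        # Local field seen by spin i from all other spins.
        field = sum(coupling[i][j] * spins[j] for j in range(n) if j != i)
        delta_e = 2 * spins[i] * field  # energy change if spin i flips
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta_e <= 0 or rng.random() < math.exp(-delta_e / temp):
            spins[i] = -spins[i]
    return spins


# Hypothetical 4-spin ferromagnetic ring: neighbours are coupled with J = 1.
N = 4
J = [[1 if (i - j) % N in (1, N - 1) else 0 for j in range(N)] for i in range(N)]

final = anneal(J)
print("final spins:", final, "energy:", ising_energy(final, J))
```

Early in the schedule the noise lets the system escape poor configurations; as the temperature falls, uphill moves become rare and the spins settle into a low-energy state.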
- ASPDAC. Hardware Acceleration of Kolmogorov-Arnold Network (KAN) for Lightweight Edge Inference. Wei-Hsing Huang*, Jianwei Jia*, Yuyao Kong, Faaiq Waqar, Tai-Hao Wen, Meng-Fan Chang, and Shimeng Yu. In Proceedings of the 30th Asia and South Pacific Design Automation Conference, Tokyo, Japan, Jan 2025
Recently, a novel model named Kolmogorov-Arnold Networks (KAN) has been proposed with the potential to achieve the functionality of traditional deep neural networks (DNNs) using orders of magnitude fewer parameters through parameterized B-spline functions with trainable coefficients. However, the B-spline functions in KAN present new challenges for hardware acceleration. The B-spline functions can be evaluated by using lookup tables (LUTs) to directly map them, thereby reducing computational resource requirements. However, this method still requires substantial circuit resources (LUTs, MUXs, decoders, etc.). For the first time, this paper employs an algorithm-hardware co-design methodology to accelerate KAN. The proposed algorithm-level techniques include Alignment-Symmetry and PowerGap KAN hardware-aware quantization and a KAN sparsity-aware mapping strategy; the circuit-level techniques include an N:1 Time Modulation Dynamic Voltage input generator with analog-CIM (ACIM) circuits. The impact of non-ideal effects, such as partial sum errors caused by process variations, has been evaluated with the statistics measured from the TSMC 22-nm RRAM-ACIM prototype chips. With the best searched hyperparameters of KAN and the optimized circuits implemented in the 22-nm node, we can reduce hardware area by 41.78× and energy by 77.97× with a 3.03% accuracy boost compared to traditional DNN hardware.
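The LUT-based B-spline evaluation mentioned in the abstract above can be sketched in software as follows. This is a minimal illustration, assuming cubic B-splines on uniform integer knots, a 64-entry table, and the standard Cox-de Boor recursion for building the table. The paper's quantization and mapping techniques are not reproduced, and all constants and function names are hypothetical.

```python
def bspline_basis(i, k, t, knots):
    """Cox-de Boor recursion: i-th B-spline basis function of degree k at t."""
    if k == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left_den = knots[i + k] - knots[i]
    right_den = knots[i + k + 1] - knots[i + 1]
    left = ((t - knots[i]) / left_den * bspline_basis(i, k - 1, t, knots)
            if left_den else 0.0)
    right = ((knots[i + k + 1] - t) / right_den
             * bspline_basis(i + 1, k - 1, t, knots) if right_den else 0.0)
    return left + right


DEGREE = 3                                      # cubic splines (assumption)
NUM_BASIS = 6                                   # basis functions per edge (assumption)
KNOTS = list(range(NUM_BASIS + DEGREE + 1))     # uniform knots 0..9
X_MIN, X_MAX = float(DEGREE), float(NUM_BASIS)  # domain where the bases sum to 1
LUT_SIZE = 64                                   # table entries (assumption)


def build_lut():
    """Precompute all basis values at LUT_SIZE uniformly spaced inputs."""
    step = (X_MAX - X_MIN) / LUT_SIZE
    return [[bspline_basis(i, DEGREE, X_MIN + q * step, KNOTS)
             for i in range(NUM_BASIS)]
            for q in range(LUT_SIZE)]


def kan_edge(x, coeffs, lut):
    """Evaluate phi(x) = sum_i c_i * B_i(x) with a table lookup instead of recursion."""
    q = int((x - X_MIN) / (X_MAX - X_MIN) * LUT_SIZE)
    q = max(0, min(q, LUT_SIZE - 1))            # clamp out-of-range inputs
    return sum(c * b for c, b in zip(coeffs, lut[q]))


lut = build_lut()
coeffs = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]       # hypothetical trained coefficients
print("phi(4.2) ~", kan_edge(4.2, coeffs, lut))
```

At inference time only the lookup and a short multiply-accumulate remain. The table itself, along with its addressing logic, is the circuit cost that the abstract refers to.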
- EDL. Capacitive Crossbar Array for Solving Matrix Equations in One-Shot. Madhav Vadlamani, Jianwei Jia, Tian Xie, Yuan-Chun Luo, Junmo Lee, Shaolan Li, and Shimeng Yu. IEEE Electron Device Letters, Jan 2025
The resistive crossbar with a feedback loop has been proposed for solving matrix equations in a linear system with current-domain computation. However, the resistive approach suffers from high static power, especially when the resistance is low. To overcome this challenge, we leverage the C-V asymmetry in the ferroelectric capacitors of a crossbar array for energy-efficient charge-domain computation. In this work, we demonstrate that such a capacitive crossbar, when operated in negative feedback, can solve the matrix problem Ax = b, where x is the unknown vector. A comparative study shows much lower power consumption (1000×) for such a matrix solver compared to its resistive crossbar counterpart.
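The idea of a feedback loop settling to the solution of Ax = b can be illustrated with a purely behavioral model. The sketch below assumes idealized first-order dynamics dx/dt = -(Ax - b), integrated with forward Euler steps, and a matrix whose eigenvalues are positive so that the loop is stable. It is not a circuit simulation of the capacitive crossbar, and the matrix, step size, and function name are hypothetical.

```python
def solve_by_feedback(a, b, rate=0.1, steps=2000):
    """Drive x until the residual Ax - b vanishes (idealized negative feedback)."""
    n = len(b)
    x = [0.0] * n
    for _ in range(steps):
        residual = [sum(a[i][j] * x[j] for j in range(n)) - b[i]
                    for i in range(n)]
        # Negative feedback: push each output against its own error signal.
        x = [x[i] - rate * residual[i] for i in range(n)]
    return x


A = [[2.0, 1.0],
     [1.0, 3.0]]    # hypothetical symmetric positive-definite matrix
B = [3.0, 5.0]

x = solve_by_feedback(A, B)
print("x ~", x)
```

In the analog solver the settling happens continuously in hardware rather than through iterations, which is where the "one-shot" terminology in the title comes from. The loop here only mimics that settling behavior in discrete time.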
2024
- MWSCAS. A Reconfigurable Bandpass Filter with Ferroelectric Devices for Intracardiac Electrograms Monitoring. Jianwei Jia, Zhenge Jia, Omkar Phadke, Gihun Choe, Yiyu Shi, and Shimeng Yu. In 2024 IEEE 67th International Midwest Symposium on Circuits and Systems (MWSCAS), Aug 2024
This paper introduces a novel three-stage bandpass filter for intracardiac electrogram (IEGM) monitoring, employing ferroelectric field-effect transistor (FeFET) technology to allow bandwidth adaptation for personalized medicine. By utilizing the FeFET’s channel and gate stack as a programmable resistor and capacitor, respectively, the filter achieves precise cardiac signal isolation tailored to an individual’s physiological needs. Based on the GlobalFoundries (GF) 28 nm SLPe process that features FeFETs, the design offers a broad continuous gain tuning range (22 dB to 82 dB) and bandwidth tuning range (0.1 to 25 Hz for the low cut-off frequency and 10 to 120 Hz for the high cut-off frequency), with an average power consumption of 393 nW. Moreover, input sensitivity to FeFET threshold voltage mismatch and noise characteristics are also evaluated.