new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Dec 31

A reconfigurable neural network ASIC for detector front-end data compression at the HL-LHC

Despite advances in the programmable logic capabilities of modern trigger systems, a significant bottleneck remains in the amount of data to be transported from the detector to off-detector logic where trigger decisions are made. We demonstrate that a neural network autoencoder model can be implemented in a radiation tolerant ASIC to perform lossy data compression alleviating the data transmission problem while preserving critical information of the detector energy profile. For our application, we consider the high-granularity calorimeter from the CMS experiment at the CERN Large Hadron Collider. The advantage of the machine learning approach is in the flexibility and configurability of the algorithm. By changing the neural network weights, a unique data compression algorithm can be deployed for each sensor in different detector regions, and changing detector or collider conditions. To meet area, performance, and power constraints, we perform a quantization-aware training to create an optimized neural network hardware implementation. The design is achieved through the use of high-level synthesis tools and the hls4ml framework, and was processed through synthesis and physical layout flows based on a LP CMOS 65 nm technology node. The flow anticipates 200 Mrad of ionizing radiation to select gates, and reports a total area of 3.6 mm^2 and consumes 95 mW of power. The simulated energy consumption per inference is 2.4 nJ. This is the first radiation tolerant on-detector ASIC implementation of a neural network that has been designed for particle physics applications.

  • 18 authors
·
May 4, 2021

Transferable Parasitic Estimation via Graph Contrastive Learning and Label Rebalancing in AMS Circuits

Graph representation learning on Analog-Mixed Signal (AMS) circuits is crucial for various downstream tasks, e.g., parasitic estimation. However, the scarcity of design data, the unbalanced distribution of labels, and the inherent diversity of circuit implementations pose significant challenges to learning robust and transferable circuit representations. To address these limitations, we propose CircuitGCL, a novel graph contrastive learning framework that integrates representation scattering and label rebalancing to enhance transferability across heterogeneous circuit graphs. CircuitGCL employs a self-supervised strategy to learn topology-invariant node embeddings through hyperspherical representation scattering, eliminating dependency on large-scale data. Simultaneously, balanced mean squared error (BMSE) and balanced softmax cross-entropy (BSCE) losses are introduced to mitigate label distribution disparities between circuits, enabling robust and transferable parasitic estimation. Evaluated on parasitic capacitance estimation (edge-level task) and ground capacitance classification (node-level task) across TSMC 28nm AMS designs, CircuitGCL outperforms all state-of-the-art (SOTA) methods, with the R^2 improvement of 33.64% sim 44.20% for edge regression and F1-score gain of 0.9times sim 2.1times for node classification. Our code is available at https://github.com/ShenShan123/CircuitGCL.

  • 7 authors
·
Jul 9

Hardwired-Neurons Language Processing Units as General-Purpose Cognitive Substrates

The rapid advancement of Large Language Models (LLMs) has established language as a core general-purpose cognitive substrate, driving the demand for specialized Language Processing Units (LPUs) tailored for LLM inference. To overcome the growing energy consumption of LLM inference systems, this paper proposes a Hardwired-Neurons Language Processing Unit (HNLPU), which physically hardwires LLM weight parameters into the computational fabric, achieving several orders of magnitude computational efficiency improvement by extreme specialization. However, a significant challenge still lies in the scale of modern LLMs. An ideal estimation on hardwiring gpt-oss 120 B requires fabricating at least 6 billion dollars of photomask sets, rendering the straightforward solution economically impractical. Addressing this challenge, we propose the novel Metal-Embedding methodology. Instead of embedding weights in a 2D grid of silicon device cells, Metal-Embedding embeds weight parameters into the 3D topology of metal wires. This brings two benefits: (1) a 15x increase in density, and (2) 60 out of 70 layers of photomasks are made homogeneous across chips, including all EUV photomasks. In total, Metal-Embedding reduced the photomask cost by 112x, bringing the Non-Recurring Engineering (NRE) cost of HNLPU into an economically viable range. Experimental results show that HNLPU achieved 249,960 tokens/s (5,555x/85x of GPU/WSE), 36 tokens/J (1,047x/283x of GPU/WSE), 13,232 mm2 total die area (29% inscribed rectangular area in a 300 mm wafer), \$184M estimated NRE at 5 nm technology. Analysis shows that HNLPU achieved 8.57x cost-effectiveness and 230x carbon footprint reduction compared to H100 clusters, under an annual weight updating assumption.

  • 27 authors
·
Aug 22

Potential and Limitation of High-Frequency Cores and Caches

This paper explores the potential of cryogenic semiconductor computing and superconductor electronics as promising alternatives to traditional semiconductor devices. As semiconductor devices face challenges such as increased leakage currents and reduced performance at higher temperatures, these novel technologies offer high performance and low power computation. Conventional semiconductor electronics operating at cryogenic temperatures (below -150{\deg}C or 123.15 K) can benefit from reduced leakage currents and improved electron mobility. On the other hand, superconductor electronics, operating below 10 K, allow electrons to flow without resistance, offering the potential for ultra-low-power, high-speed computation. This study presents a comprehensive performance modeling and analysis of these technologies and provides insights into their potential benefits and limitations. We implement models of in-order and out-of-order cores operating at high clock frequencies associated with superconductor electronics and cryogenic semiconductor computing in gem5. We evaluate the performance of these components using workloads representative of real-world applications like NPB, SPEC CPU2006, and GAPBS. Our results show the potential speedups achievable by these components and the limitations posed by cache bandwidth. This work provides valuable insights into the performance implications and design trade-offs associated with cryogenic and superconductor technologies, laying the foundation for future research in this field using gem5.

  • 3 authors
·
Aug 6, 2024

ViTAD: Timing Violation-Aware Debugging of RTL Code using Large Language Models

In modern Very Large Scale Integrated (VLSI) circuit design flow, the Register-Transfer Level (RTL) stage presents a critical opportunity for timing optimization. Addressing timing violations at this early stage is essential, as modern systems demand higher speeds, where even minor timing violations can lead to functional failures or system crashes. However, traditional timing optimization heavily relies on manual expertise, requiring engineers to iteratively analyze timing reports and debug. To automate this process, this paper proposes ViTAD, a method that efficiently analyzes the root causes of timing violations and dynamically generates targeted repair strategies. Specifically, we first parse Verilog code and timing reports to construct a Signal Timing Dependency Graph (STDG). Based on the STDG, we perform violation path analysis and use large language models (LLMs) to infer the root causes of violations. Finally, by analyzing the causes of violations, we selectively retrieve relevant debugging knowledge from a domain-specific knowledge base to generate customized repair solutions. To evaluate the effectiveness of our method, we construct a timing violation dataset based on real-world open-source projects. This dataset contains 54 cases of violations. Experimental results show that our method achieves a 73.68% success rate in repairing timing violations, while the baseline using only LLM is 54.38%. Our method improves the success rate by 19.30%.

  • 4 authors
·
Aug 18

New Solutions on LLM Acceleration, Optimization, and Application

Large Language Models (LLMs) have become extremely potent instruments with exceptional capacities for comprehending and producing human-like text in a wide range of applications. However, the increasing size and complexity of LLMs present significant challenges in both training and deployment, leading to substantial computational and storage costs as well as heightened energy consumption. In this paper, we provide a review of recent advancements and research directions aimed at addressing these challenges and enhancing the efficiency of LLM-based systems. We begin by discussing algorithm-level acceleration techniques focused on optimizing LLM inference speed and resource utilization. We also explore LLM-hardware co-design strategies with a vision to improve system efficiency by tailoring hardware architectures to LLM requirements. Further, we delve into LLM-to-accelerator compilation approaches, which involve customizing hardware accelerators for efficient LLM deployment. Finally, as a case study to leverage LLMs for assisting circuit design, we examine LLM-aided design methodologies for an important task: High-Level Synthesis (HLS) functional verification, by creating a new dataset that contains a large number of buggy and bug-free codes, which can be essential for training LLMs to specialize on HLS verification and debugging. For each aspect mentioned above, we begin with a detailed background study, followed by the presentation of several novel solutions proposed to overcome specific challenges. We then outline future research directions to drive further advancements. Through these efforts, we aim to pave the way for more efficient and scalable deployment of LLMs across a diverse range of applications.

  • 8 authors
·
Jun 16, 2024

PulseDL-II: A System-on-Chip Neural Network Accelerator for Timing and Energy Extraction of Nuclear Detector Signals

Front-end electronics equipped with high-speed digitizers are being used and proposed for future nuclear detectors. Recent literature reveals that deep learning models, especially one-dimensional convolutional neural networks, are promising when dealing with digital signals from nuclear detectors. Simulations and experiments demonstrate the satisfactory accuracy and additional benefits of neural networks in this area. However, specific hardware accelerating such models for online operations still needs to be studied. In this work, we introduce PulseDL-II, a system-on-chip (SoC) specially designed for applications of event feature (time, energy, etc.) extraction from pulses with deep learning. Based on the previous version, PulseDL-II incorporates a RISC CPU into the system structure for better functional flexibility and integrity. The neural network accelerator in the SoC adopts a three-level (arithmetic unit, processing element, neural network) hierarchical architecture and facilitates parameter optimization of the digital design. Furthermore, we devise a quantization scheme compatible with deep learning frameworks (e.g., TensorFlow) within a selected subset of layer types. We validate the correct operations of PulseDL-II on field programmable gate arrays (FPGA) alone and with an experimental setup comprising a direct digital synthesis (DDS) and analog-to-digital converters (ADC). The proposed system achieved 60 ps time resolution and 0.40% energy resolution at signal to noise ratio (SNR) of 47.4 dB.

  • 6 authors
·
Sep 2, 2022

LLMPirate: LLMs for Black-box Hardware IP Piracy

The rapid advancement of large language models (LLMs) has enabled the ability to effectively analyze and generate code nearly instantaneously, resulting in their widespread adoption in software development. Following this advancement, researchers and companies have begun integrating LLMs across the hardware design and verification process. However, these highly potent LLMs can also induce new attack scenarios upon security vulnerabilities across the hardware development process. One such attack vector that has not been explored is intellectual property (IP) piracy. Given that this attack can manifest as rewriting hardware designs to evade piracy detection, it is essential to thoroughly evaluate LLM capabilities in performing this task and assess the mitigation abilities of current IP piracy detection tools. Therefore, in this work, we propose LLMPirate, the first LLM-based technique able to generate pirated variations of circuit designs that successfully evade detection across multiple state-of-the-art piracy detection tools. We devise three solutions to overcome challenges related to integration of LLMs for hardware circuit designs, scalability to large circuits, and effectiveness, resulting in an end-to-end automated, efficient, and practical formulation. We perform an extensive experimental evaluation of LLMPirate using eight LLMs of varying sizes and capabilities and assess their performance in pirating various circuit designs against four state-of-the-art, widely-used piracy detection tools. Our experiments demonstrate that LLMPirate is able to consistently evade detection on 100% of tested circuits across every detection tool. Additionally, we showcase the ramifications of LLMPirate using case studies on IBEX and MOR1KX processors and a GPS module, that we successfully pirate. We envision that our work motivates and fosters the development of better IP piracy detection tools.

  • 5 authors
·
Nov 25, 2024

DNN is not all you need: Parallelizing Non-Neural ML Algorithms on Ultra-Low-Power IoT Processors

Machine Learning (ML) functions are becoming ubiquitous in latency- and privacy-sensitive IoT applications, prompting a shift toward near-sensor processing at the extreme edge and the consequent increasing adoption of Parallel Ultra-Low Power (PULP) IoT processors. These compute- and memory-constrained parallel architectures need to run efficiently a wide range of algorithms, including key Non-Neural ML kernels that compete favorably with Deep Neural Networks (DNNs) in terms of accuracy under severe resource constraints. In this paper, we focus on enabling efficient parallel execution of Non-Neural ML algorithms on two RISCV-based PULP platforms, namely GAP8, a commercial chip, and PULP-OPEN, a research platform running on an FPGA emulator. We optimized the parallel algorithms through a fine-grained analysis and intensive optimization to maximize the speedup, considering two alternative Floating-Point (FP) emulation libraries on GAP8 and the native FPU support on PULP-OPEN. Experimental results show that a target-optimized emulation library can lead to an average 1.61x runtime improvement and 37% energy reduction compared to a standard emulation library, while the native FPU support reaches up to 32.09x and 99%, respectively. In terms of parallel speedup, our design improves the sequential execution by 7.04x on average on the targeted octa-core platforms leading to energy and latency decrease up to 87%. Lastly, we present a comparison with the ARM Cortex-M4 microcontroller (MCU), a widely adopted commercial solution for edge deployments, which is 12.87x slower and 98% less energy-efficient than PULP-OPEN.

  • 3 authors
·
Jul 16, 2021

Learning to Design Circuits

Analog IC design relies on human experts to search for parameters that satisfy circuit specifications with their experience and intuitions, which is highly labor intensive, time consuming and suboptimal. Machine learning is a promising tool to automate this process. However, supervised learning is difficult for this task due to the low availability of training data: 1) Circuit simulation is slow, thus generating large-scale dataset is time-consuming; 2) Most circuit designs are propitiatory IPs within individual IC companies, making it expensive to collect large-scale datasets. We propose Learning to Design Circuits (L2DC) to leverage reinforcement learning that learns to efficiently generate new circuits data and to optimize circuits. We fix the schematic, and optimize the parameters of the transistors automatically by training an RL agent with no prior knowledge about optimizing circuits. After iteratively getting observations, generating a new set of transistor parameters, getting a reward, and adjusting the model, L2DC is able to optimize circuits. We evaluate L2DC on two transimpedance amplifiers. Trained for a day, our RL agent can achieve comparable or better performance than human experts trained for a quarter. It first learns to meet hard-constraints (eg. gain, bandwidth), and then learns to optimize good-to-have targets (eg. area, power). Compared with grid search-aided human design, L2DC can achieve 250times higher sample efficiency with comparable performance. Under the same runtime constraint, the performance of L2DC is also better than Bayesian Optimization.

  • 4 authors
·
Dec 5, 2018

A Tale of Two Sides of Wafer: Physical Implementation and Block-Level PPA on Flip FET with Dual-sided Signals

As the conventional scaling of logic devices comes to an end, functional wafer backside and 3D transistor stacking are consensus for next-generation logic technology, offering considerable design space extension for powers, signals or even devices on the wafer backside. The Flip FET (FFET), a novel transistor architecture combining 3D transistor stacking and fully functional wafer backside, was recently proposed. With symmetric dual-sided standard cell design, the FFET can deliver around 12.5% cell area scaling and faster but more energy-efficient libraries beyond other stacked transistor technologies such as CFET. Besides, thanks to the novel cell design with dual-sided pins, the FFET supports dual-sided signal routing, delivering better routability and larger backside design space. In this work, we demonstrated a comprehensive FFET evaluation framework considering physical implementation and block-level power-performance-area (PPA) assessment for the first time, in which key functions are dual-sided routing and dual-sided RC extraction. A 32-bit RISC-V core was used for the evaluation here. Compared to the CFET with single-sided signals, the FFET with single-sided signals achieved 23.3% post-P&R core area reduction, 25.0% higher frequency and 11.9% lower power at the same utilization, and 16.0 % higher frequency at the same core area. Meanwhile, the FFET supports dual-sided signals, which can further benefit more from flexible allocation of cell input pins on both sides. By optimizing the input pin density and BEOL routing layer number on each side, 10.6% frequency gain was realized without power degradation compared to the one with single-sided signal routing. Moreover, the routability and power efficiency of FFET barely degrades even with the routing layer number reduced from 12 to 5 on each side, validating the great space for cost-friendly design enabled by FFET.

  • 10 authors
·
Jan 25

Retrieval-Guided Reinforcement Learning for Boolean Circuit Minimization

Logic synthesis, a pivotal stage in chip design, entails optimizing chip specifications encoded in hardware description languages like Verilog into highly efficient implementations using Boolean logic gates. The process involves a sequential application of logic minimization heuristics (``synthesis recipe"), with their arrangement significantly impacting crucial metrics such as area and delay. Addressing the challenge posed by the broad spectrum of design complexities - from variations of past designs (e.g., adders and multipliers) to entirely novel configurations (e.g., innovative processor instructions) - requires a nuanced `synthesis recipe` guided by human expertise and intuition. This study conducts a thorough examination of learning and search techniques for logic synthesis, unearthing a surprising revelation: pre-trained agents, when confronted with entirely novel designs, may veer off course, detrimentally affecting the search trajectory. We present ABC-RL, a meticulously tuned alpha parameter that adeptly adjusts recommendations from pre-trained agents during the search process. Computed based on similarity scores through nearest neighbor retrieval from the training dataset, ABC-RL yields superior synthesis recipes tailored for a wide array of hardware designs. Our findings showcase substantial enhancements in the Quality-of-result (QoR) of synthesized circuits, boasting improvements of up to 24.8% compared to state-of-the-art techniques. Furthermore, ABC-RL achieves an impressive up to 9x reduction in runtime (iso-QoR) when compared to current state-of-the-art methodologies.

  • 5 authors
·
Jan 22, 2024

Customizing a Large Language Model for VHDL Design of High-Performance Microprocessors

The use of Large Language Models (LLMs) in hardware design has taken off in recent years, principally through its incorporation in tools that increase chip designer productivity. There has been considerable discussion about the use of LLMs in RTL specifications of chip designs, for which the two most popular languages are Verilog and VHDL. LLMs and their use in Verilog design has received significant attention due to the higher popularity of the language, but little attention so far has been given to VHDL despite its continued popularity in the industry. There has also been little discussion about the unique needs of organizations that engage in high-performance processor design, and techniques to deploy AI solutions in these settings. In this paper, we describe our journey in developing a Large Language Model (LLM) specifically for the purpose of explaining VHDL code, a task that has particular importance in an organization with decades of experience and assets in high-performance processor design. We show how we developed test sets specific to our needs and used them for evaluating models as we performed extended pretraining (EPT) of a base LLM. Expert evaluation of the code explanations produced by the EPT model increased to 69% compared to a base model rating of 43%. We further show how we developed an LLM-as-a-judge to gauge models similar to expert evaluators. This led us to deriving and evaluating a host of new models, including an instruction-tuned version of the EPT model with an expected expert evaluator rating of 71%. Our experiments also indicate that with the potential use of newer base models, this rating can be pushed to 85% and beyond. We conclude with a discussion on further improving the quality of hardware design LLMs using exciting new developments in the Generative AI world.

  • 10 authors
·
May 14

XR-NPE: High-Throughput Mixed-precision SIMD Neural Processing Engine for Extended Reality Perception Workloads

This work proposes XR-NPE, a high-throughput Mixed-precision SIMD Neural Processing Engine, designed for extended reality (XR) perception workloads like visual inertial odometry (VIO), object classification, and eye gaze extraction. XR-NPE is first to support FP4, Posit (4,1), Posit (8,0), and Posit (16,1) formats, with layer adaptive hybrid-algorithmic implementation supporting ultra-low bit precision to significantly reduce memory bandwidth requirements, and accompanied by quantization-aware training for minimal accuracy loss. The proposed Reconfigurable Mantissa Multiplication and Exponent processing Circuitry (RMMEC) reduces dark silicon in the SIMD MAC compute engine, assisted by selective power gating to reduce energy consumption, providing 2.85x improved arithmetic intensity. XR-NPE achieves a maximum operating frequency of 1.72 GHz, area 0.016 mm2 , and arithmetic intensity 14 pJ at CMOS 28nm, reducing 42% area, 38% power compared to the best of state-of-the-art MAC approaches. The proposed XR-NPE based AXI-enabled Matrix-multiplication co-processor consumes 1.4x fewer LUTs, 1.77x fewer FFs, and provides 1.2x better energy efficiency compared to SoTA accelerators on VCU129. The proposed co-processor provides 23% better energy efficiency and 4% better compute density for VIO workloads. XR-NPE establishes itself as a scalable, precision-adaptive compute engine for future resource-constrained XR devices. The complete set for codes for results reproducibility are released publicly, enabling designers and researchers to readily adopt and build upon them. https://github.com/mukullokhande99/XR-NPE.

  • 5 authors
·
Aug 18 1

Physics-based Noise Modeling for Extreme Low-light Photography

Enhancing the visibility in extreme low-light environments is a challenging task. Under nearly lightless condition, existing image denoising methods could easily break down due to significantly low SNR. In this paper, we systematically study the noise statistics in the imaging pipeline of CMOS photosensors, and formulate a comprehensive noise model that can accurately characterize the real noise structures. Our novel model considers the noise sources caused by digital camera electronics which are largely overlooked by existing methods yet have significant influence on raw measurement in the dark. It provides a way to decouple the intricate noise structure into different statistical distributions with physical interpretations. Moreover, our noise model can be used to synthesize realistic training data for learning-based low-light denoising algorithms. In this regard, although promising results have been shown recently with deep convolutional neural networks, the success heavily depends on abundant noisy clean image pairs for training, which are tremendously difficult to obtain in practice. Generalizing their trained models to images from new devices is also problematic. Extensive experiments on multiple low-light denoising datasets -- including a newly collected one in this work covering various devices -- show that a deep neural network trained with our proposed noise formation model can reach surprisingly-high accuracy. The results are on par with or sometimes even outperform training with paired real data, opening a new door to real-world extreme low-light photography.

  • 4 authors
·
Aug 4, 2021

PreRoutGNN for Timing Prediction with Order Preserving Partition: Global Circuit Pre-training, Local Delay Learning and Attentional Cell Modeling

Pre-routing timing prediction has been recently studied for evaluating the quality of a candidate cell placement in chip design. It involves directly estimating the timing metrics for both pin-level (slack, slew) and edge-level (net delay, cell delay), without time-consuming routing. However, it often suffers from signal decay and error accumulation due to the long timing paths in large-scale industrial circuits. To address these challenges, we propose a two-stage approach. First, we propose global circuit training to pre-train a graph auto-encoder that learns the global graph embedding from circuit netlist. Second, we use a novel node updating scheme for message passing on GCN, following the topological sorting sequence of the learned graph embedding and circuit graph. This scheme residually models the local time delay between two adjacent pins in the updating sequence, and extracts the lookup table information inside each cell via a new attention mechanism. To handle large-scale circuits efficiently, we introduce an order preserving partition scheme that reduces memory consumption while maintaining the topological dependencies. Experiments on 21 real world circuits achieve a new SOTA R2 of 0.93 for slack prediction, which is significantly surpasses 0.59 by previous SOTA method. Code will be available at: https://github.com/Thinklab-SJTU/EDA-AI.

  • 7 authors
·
Feb 26, 2024

Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching

Large language models (LLMs) have emerged due to their capability to generate high-quality content across diverse contexts. To reduce their explosively increasing demands for computing resources, a mixture of experts (MoE) has emerged. The MoE layer enables exploiting a huge number of parameters with less computation. Applying state-of-the-art continuous batching increases throughput; however, it leads to frequent DRAM access in the MoE and attention layers. We observe that conventional computing devices have limitations when processing the MoE and attention layers, which dominate the total execution time and exhibit low arithmetic intensity (Op/B). Processing MoE layers only with devices targeting low-Op/B such as processing-in-memory (PIM) architectures is challenging due to the fluctuating Op/B in the MoE layer caused by continuous batching. To address these challenges, we propose Duplex, which comprises xPU tailored for high-Op/B and Logic-PIM to effectively perform low-Op/B operation within a single device. Duplex selects the most suitable processor based on the Op/B of each layer within LLMs. As the Op/B of the MoE layer is at least 1 and that of the attention layer has a value of 4-8 for grouped query attention, prior PIM architectures are not efficient, which place processing units inside DRAM dies and only target extremely low-Op/B (under one) operations. Based on recent trends, Logic-PIM adds more through-silicon vias (TSVs) to enable high-bandwidth communication between the DRAM die and the logic die and place powerful processing units on the logic die, which is best suited for handling low-Op/B operations ranging from few to a few dozens. To maximally utilize the xPU and Logic-PIM, we propose expert and attention co-processing.

  • 9 authors
·
Sep 2, 2024

Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM

Deploying advanced large language models on edge devices, such as smartphones and robotics, is a growing trend that enhances user data privacy and network connectivity resilience while preserving intelligent capabilities. However, such a task exhibits single-batch computing with incredibly low arithmetic intensity, which poses the significant challenges of huge memory footprint and bandwidth demands on limited edge resources. To address these issues, we introduce Cambricon-LLM, a chiplet-based hybrid architecture with NPU and a dedicated NAND flash chip to enable efficient on-device inference of 70B LLMs. Such a hybrid architecture utilizes both the high computing capability of NPU and the data capacity of the NAND flash chip, with the proposed hardware-tiling strategy that minimizes the data movement overhead between NPU and NAND flash chip. Specifically, the NAND flash chip, enhanced by our innovative in-flash computing and on-die ECC techniques, excels at performing precise lightweight on-die processing. Simultaneously, the NPU collaborates with the flash chip for matrix operations and handles special function computations beyond the flash's on-die processing capabilities. Overall, Cambricon-LLM enables the on-device inference of 70B LLMs at a speed of 3.44 token/s, and 7B LLMs at a speed of 36.34 token/s, which is over 22X to 45X faster than existing flash-offloading technologies, showing the potentiality of deploying powerful LLMs in edge devices.

  • 15 authors
·
Sep 23, 2024

Characterizing Soft-Error Resiliency in Arm's Ethos-U55 Embedded Machine Learning Accelerator

As Neural Processing Units (NPU) or accelerators are increasingly deployed in a variety of applications including safety critical applications such as autonomous vehicle, and medical imaging, it is critical to understand the fault-tolerance nature of the NPUs. We present a reliability study of Arm's Ethos-U55, an important industrial-scale NPU being utilised in embedded and IoT applications. We perform large scale RTL-level fault injections to characterize Ethos-U55 against the Automotive Safety Integrity Level D (ASIL-D) resiliency standard commonly used for safety-critical applications such as autonomous vehicles. We show that, under soft errors, all four configurations of the NPU fall short of the required level of resiliency for a variety of neural networks running on the NPU. We show that it is possible to meet the ASIL-D level resiliency without resorting to conventional strategies like Dual Core Lock Step (DCLS) that has an area overhead of 100%. We achieve so through selective protection, where hardware structures are selectively protected (e.g., duplicated, hardened) based on their sensitivity to soft errors and their silicon areas. To identify the optimal configuration that minimizes the area overhead while meeting the ASIL-D standard, the main challenge is the large search space associated with the time-consuming RTL simulation. To address this challenge, we present a statistical analysis tool that is validated against Arm silicon and that allows us to quickly navigate hundreds of billions of fault sites without exhaustive RTL fault injections. We show that by carefully duplicating a small fraction of the functional blocks and hardening the Flops in other blocks meets the ASIL-D safety standard while introducing an area overhead of only 38%.

  • 5 authors
·
Apr 14, 2024

Resistive memory-based zero-shot liquid state machine for multimodal event data learning

The human brain is a complex spiking neural network (SNN) that learns multimodal signals in a zero-shot manner by generalizing existing knowledge. Remarkably, the brain achieves this with minimal power consumption, using event-based signals that propagate within its structure. However, mimicking the human brain in neuromorphic hardware presents both hardware and software challenges. Hardware limitations, such as the slowdown of Moore's law and the von Neumann bottleneck, hinder the efficiency of digital computers. On the software side, SNNs are known for their difficult training, especially when learning multimodal signals. To overcome these challenges, we propose a hardware-software co-design that combines a fixed and random liquid state machine (LSM) SNN encoder with trainable artificial neural network (ANN) projections. The LSM is physically implemented using analogue resistive memory, leveraging the inherent stochasticity of resistive switching to generate random weights. This highly efficient and nanoscale in-memory computing approach effectively addresses the von Neumann bottleneck and the slowdown of Moore's law. The ANN projections are implemented digitally, allowing for easy optimization using contrastive loss, which helps to overcome the difficulties associated with SNN training. We experimentally implement this co-design on a 40nm 256Kb in-memory computing macro. We first demonstrate LSM-based event encoding through supervised classification and linear probing on the N-MNIST and N-TIDIGITS datasets.

  • 19 authors
·
Jul 3, 2023

D2S-FLOW: Automated Parameter Extraction from Datasheets for SPICE Model Generation Using Large Language Models

In electronic design, engineers often manually search through extensive documents to retrieve component parameters required for constructing SPICE models, a process that is both labor-intensive and time-consuming. To address this challenge, we present an automated framework called D2S-FLOW that leverages large language models (LLMs) to extract electrical parameters from datasheets and generate SPICE models with high precision and efficiency, significantly reducing the need for manual intervention. Unlike traditional RAG systems, D2S-FLOW employs a workflow to enhance precision in handling unstructured documents and inconsistent naming conventions through three innovative mechanisms: Attention-Guided Document Focusing (AGDF), Hierarchical Document-Enhanced Retrieval (HDER), and Heterogeneous Named Entity Normalization (HNEN). AGDF narrows retrieval to user-selected documents, HDER utilizes document structure for precise parameter localization, and HNEN standardizes terminology via semantic inference. Experimental results demonstrate that the framework achieves an Exact Match (EM) of 0.86, an F1 score of 0.92, and an Exact Correctness (EC) of 0.96, outperforming the strongest baseline by 19.4%, 5.7%, and 13.1%, respectively. Additionally, it reduces API token consumption by 38% and minimizes the irrelevant information ratio to 4%, showcasing substantial improvements in resource efficiency. This research provides an effective automated solution for circuit design.

  • 3 authors
·
Feb 23

AttackGNN: Red-Teaming GNNs in Hardware Security Using Reinforcement Learning

Machine learning has shown great promise in addressing several critical hardware security problems. In particular, researchers have developed novel graph neural network (GNN)-based techniques for detecting intellectual property (IP) piracy, detecting hardware Trojans (HTs), and reverse engineering circuits, to name a few. These techniques have demonstrated outstanding accuracy and have received much attention in the community. However, since these techniques are used for security applications, it is imperative to evaluate them thoroughly and ensure they are robust and do not compromise the security of integrated circuits. In this work, we propose AttackGNN, the first red-team attack on GNN-based techniques in hardware security. To this end, we devise a novel reinforcement learning (RL) agent that generates adversarial examples, i.e., circuits, against the GNN-based techniques. We overcome three challenges related to effectiveness, scalability, and generality to devise a potent RL agent. We target five GNN-based techniques for four crucial classes of problems in hardware security: IP piracy, detecting/localizing HTs, reverse engineering, and hardware obfuscation. Through our approach, we craft circuits that fool all GNNs considered in this work. For instance, to evade IP piracy detection, we generate adversarial pirated circuits that fool the GNN-based defense into classifying our crafted circuits as not pirated. For attacking HT localization GNN, our attack generates HT-infested circuits that fool the defense on all tested circuits. We obtain a similar 100% success rate against GNNs for all classes of problems.

  • 4 authors
·
Feb 21, 2024

Reduced Precision Floating-Point Optimization for Deep Neural Network On-Device Learning on MicroControllers

Enabling On-Device Learning (ODL) for Ultra-Low-Power Micro-Controller Units (MCUs) is a key step for post-deployment adaptation and fine-tuning of Deep Neural Network (DNN) models in future TinyML applications. This paper tackles this challenge by introducing a novel reduced precision optimization technique for ODL primitives on MCU-class devices, leveraging the State-of-Art advancements in RISC-V RV32 architectures with support for vectorized 16-bit floating-point (FP16) Single-Instruction Multiple-Data (SIMD) operations. Our approach for the Forward and Backward steps of the Back-Propagation training algorithm is composed of specialized shape transform operators and Matrix Multiplication (MM) kernels, accelerated with parallelization and loop unrolling. When evaluated on a single training step of a 2D Convolution layer, the SIMD-optimized FP16 primitives result up to 1.72times faster than the FP32 baseline on a RISC-V-based 8+1-core MCU. An average computing efficiency of 3.11 Multiply and Accumulate operations per clock cycle (MAC/clk) and 0.81 MAC/clk is measured for the end-to-end training tasks of a ResNet8 and a DS-CNN for Image Classification and Keyword Spotting, respectively -- requiring 17.1 ms and 6.4 ms on the target platform to compute a training step on a single sample. Overall, our approach results more than two orders of magnitude faster than existing ODL software frameworks for single-core MCUs and outperforms by 1.6 times previous FP32 parallel implementations on a Continual Learning setup.

  • 4 authors
·
May 30, 2023

Modeling Performance of Data Collection Systems for High-Energy Physics

Exponential increases in scientific experimental data are outstripping the rate of progress in silicon technology. As a result, heterogeneous combinations of architectures and process or device technologies are increasingly important to meet the computing demands of future scientific experiments. However, the complexity of heterogeneous computing systems requires systematic modeling to understand performance. We present a model which addresses this need by framing key aspects of data collection pipelines and constraints, and combines them with the important vectors of technology that shape alternatives, computing metrics that allow complex alternatives to be compared. For instance, a data collection pipeline may be characterized by parameters such as sensor sampling rates, amount of data collected, and the overall relevancy of retrieved samples. Alternatives to this pipeline are enabled by hardware development vectors including advancing CMOS, GPUs, neuromorphic computing, and edge computing. By calculating metrics for each alternative such as overall F1 score, power, hardware cost, and energy expended per relevant sample, this model allows alternate data collection systems to be rigorously compared. To demonstrate this model's capability, we apply it to the CMS experiment (and planned HL-LHC upgrade) to evaluate and compare the application of novel technologies in the data acquisition system (DAQ). We demonstrate that improvements to early stages in the DAQ are highly beneficial, greatly reducing the resources required at later stages of processing (such as a 60% power reduction) and increasing the amount of relevant data retrieved from the experiment per unit power (improving from 0.065 to 0.31 samples/kJ) However, we predict further advances will be required in order to meet overall power and cost constraints for the DAQ.

  • 3 authors
·
Jun 27, 2024

wa-hls4ml: A Benchmark and Surrogate Models for hls4ml Resource and Latency Estimation

As machine learning (ML) is increasingly implemented in hardware to address real-time challenges in scientific applications, the development of advanced toolchains has significantly reduced the time required to iterate on various designs. These advancements have solved major obstacles, but also exposed new challenges. For example, processes that were not previously considered bottlenecks, such as hardware synthesis, are becoming limiting factors in the rapid iteration of designs. To mitigate these emerging constraints, multiple efforts have been undertaken to develop an ML-based surrogate model that estimates resource usage of ML accelerator architectures. We introduce wa-hls4ml, a benchmark for ML accelerator resource and latency estimation, and its corresponding initial dataset of over 680,000 fully connected and convolutional neural networks, all synthesized using hls4ml and targeting Xilinx FPGAs. The benchmark evaluates the performance of resource and latency predictors against several common ML model architectures, primarily originating from scientific domains, as exemplar models, and the average performance across a subset of the dataset. Additionally, we introduce GNN- and transformer-based surrogate models that predict latency and resources for ML accelerators. We present the architecture and performance of the models and find that the models generally predict latency and resources for the 75% percentile within several percent of the synthesized resources on the synthetic test dataset.

  • 16 authors
·
Nov 6