1. Processing-in-Memory for Data-Intensive Applications, From Device to Algorithm
Over the past decades, the volume of data that computing systems must process and analyze has grown dramatically, reaching the exascale. However, the inability of modern computing platforms to deliver both energy-efficient and high-performance computing leads to a gap between what systems can provide and what applications need.
Unfortunately, this gap will keep widening, mainly due to limitations at both the device and architecture levels. First, at the device level, the computing efficiency and performance of CMOS Boolean systems are beginning to stall as Moore's law approaches its end and technology hits the power wall (i.e., large leakage power consumption limits performance growth as technology scales down). Second, at the architecture level, today's computers are based on the Von Neumann architecture, with separate computing and memory units connected via buses. This leads to the memory wall (long memory access latency, limited memory bandwidth, and energy-hungry data transfer) and large leakage power for holding data in volatile memory.

Motivated by these concerns, our group's research focuses on hardware and software co-design of energy-efficient, high-performance Processing-in-Memory (PIM) platforms based on volatile and non-volatile memories. We leverage innovations across the device, circuit, architecture, and algorithm levels to integrate memory and logic, break through the existing memory and power walls, and dramatically increase the computing efficiency of non-Von Neumann computing systems.
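To make the PIM idea concrete, here is a minimal software sketch, with a hypothetical `PIMArray` class and row width that are purely illustrative (not our circuit or the ESSCIRC'22 prototype): bulk bitwise operations are performed row-wise inside the memory array, so operands never have to cross a bus to a CPU.

```python
# Toy model of processing-in-memory bulk bitwise operations.
# Each memory row is modeled as one wide bit-vector; a logic operation
# between two rows is computed "inside" the array and written to a
# third row, with no data movement to a separate processor.

class PIMArray:
    def __init__(self, rows, row_bits):
        self.row_bits = row_bits
        self.mem = [0] * rows  # each entry models one memory row

    def write_row(self, r, value):
        self.mem[r] = value & ((1 << self.row_bits) - 1)

    def row_op(self, op, src_a, src_b, dst):
        # One row-wide operation counts as a single in-memory step,
        # regardless of row width -- the source of PIM's throughput win.
        a, b = self.mem[src_a], self.mem[src_b]
        if op == "and":
            self.mem[dst] = a & b
        elif op == "or":
            self.mem[dst] = a | b
        elif op == "xor":
            self.mem[dst] = a ^ b
        else:
            raise ValueError(op)
        return self.mem[dst]

pim = PIMArray(rows=4, row_bits=16)
pim.write_row(0, 0b1100110011001100)
pim.write_row(1, 0b1010101010101010)
pim.row_op("and", 0, 1, 2)  # row 2 = row 0 AND row 1, computed in place
```

In a real PIM substrate the `row_op` step maps to analog or digital circuitry inside the array (e.g., multi-row activation), which is what removes the per-bit data-transfer cost this model only hints at.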

Our 1.23-GHz 16-kb Processing-in-Memory Prototype [ESSCIRC'22]

2. Integrated Sensing and Normally-off Computing for Edge Imaging Systems
Internet of Things (IoT) devices are projected to form a market exceeding $1,000B by 2025, with an interconnected web of approximately 75+ billion devices. A large fraction of these IoT devices are sensory imaging systems that enable massive data collection from the environment and from people. However, a considerable portion of the captured sensory data is redundant and unstructured. Converting such large volumes of raw data, storing it in volatile memories, transmitting it, and computing on it in on-/off-chip processors impose high energy consumption, high latency, and a memory bottleneck at the edge. Moreover, because replacing batteries in IoT devices is costly and sometimes impracticable, energy-harvesting devices powered by ambient sources with low maintenance requirements have become attractive for a wide range of IoT applications such as wearable devices, smart cities, and intelligent industry. This project explores and designs new high-speed, low-power, normally-off computing architectures for resource-limited sensory nodes by exploiting cross-layer CMOS/post-CMOS approaches to overcome these issues.
Our group focuses on two main research directions:
(1) design and analysis of a Processing-In-Sensor Unit (PISU) co-integrating always-on sensing and processing capabilities in conjunction with a Processing-Near-Sensor Unit (PNSU). The hybrid platform will feature real-time programmable, granularity-configurable arithmetic operations to balance the accuracy, speed, and power-efficiency trade-offs under both continuously powered and energy-harvesting-powered imaging scenarios. This platform will enable resource-limited edge devices to locally perform data- and compute-intensive applications such as machine learning tasks while consuming much less power than the present state of the art; and 
(2) realizing the concept of normally-off computing and rollback by integrating the non-volatility and instant wake-up features of non-volatile memories. This allows PISU to first keep data available and safe, even after a power-off, and then return to a previously valid state in the case of an execution error or a power failure.
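As a toy illustration of what granularity-configurable arithmetic could mean in practice (the `quantize` helper and the specific bit-widths are hypothetical, not our PISU design), the same sensed value can be quantized at a bit-width selected at run time, trading accuracy for energy when the harvested power budget shrinks:

```python
# Hypothetical sketch of granularity-configurable arithmetic:
# a sensed value is quantized to 2**bits levels, where `bits` is
# chosen at run time based on the available (harvested) energy.

def quantize(x, bits, lo=0.0, hi=1.0):
    """Uniform quantization of x in [lo, hi] to 2**bits levels."""
    levels = (1 << bits) - 1
    code = round((x - lo) / (hi - lo) * levels)  # integer code stored/processed
    return lo + code * (hi - lo) / levels        # reconstructed value

x = 0.37                       # normalized sensor reading
coarse = quantize(x, bits=2)   # harvesting mode: only 4 levels
fine = quantize(x, bits=8)     # full-power mode: 256 levels
```

The coarse mode needs fewer bit-cells and cheaper arithmetic per sample, at the cost of larger reconstruction error; a run-time controller would pick `bits` per frame or per region.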
3. Memory Security Solutions for Large-scale Deep Learning Applications
With deep learning deployed in many security-sensitive applications, the vulnerability of large-scale models stored in main memory (e.g., DRAM) is a major concern for safe model deployment. Recent studies demonstrate that attackers can exploit system-level techniques to expose the vulnerability of DRAM to memory fault injection, including bit-flips induced by RowHammer. Unfortunately, as DRAM chips scale down in modern manufacturing, DRAM becomes increasingly vulnerable to a wide range of fault-injection techniques. Recent large language models, with billions of weight parameters resident in DRAM, will further exacerbate the threat. While existing defense mechanisms have improved DRAM security against some specific RowHammer attacks, they remain vulnerable to more advanced techniques. A systematic algorithm-, hardware-, and system-level analysis that establishes low-cost, efficient design strategies for safeguarding Deep Neural Networks (DNNs) against memory bit-flip attacks on DRAM is still missing. In our lab, we investigate novel attack surfaces and design cost-efficient defensive measures to ensure the security and robustness of large-scale DNN applications on modern DRAM.
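To see why a single DRAM bit-flip matters for DNNs, consider this illustrative sketch (the `flip_bit` helper is hypothetical, not an attack tool): flipping one exponent bit of an IEEE-754 float32 weight changes its magnitude by many orders of magnitude, which is precisely the kind of corruption RowHammer-style faults can inject.

```python
# Why one bit-flip can break a model: a benign float32 weight becomes
# an astronomically large value when its most significant exponent bit
# is flipped in memory.
import struct

def flip_bit(value, bit):
    """Return `value` (as float32) with bit `bit` of its encoding flipped."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))   # float32 -> bits
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped

w = 0.5                        # a benign-looking weight
w_attacked = flip_bit(w, 30)   # flip the top exponent bit: 0.5 -> 2**127
```

A weight of that magnitude saturates downstream activations, so even a single well-targeted flip among billions of parameters can collapse model accuracy; this asymmetry is what makes bit-flip attacks on DRAM-resident models so potent.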
4. Toward Large Language Model-Driven AI Accelerator Generation
Artificial Intelligence (AI) has shown remarkable potential to address complex design problems in a wide variety of fields, such as software development. A key advantage of AI in this role is that it significantly reduces manual effort and expertise requirements. In hardware design for Deep Neural Network (DNN) accelerators, complexity and the need for expert knowledge have been major limitations, making the design process time-consuming and complicated. Among several hardware accelerator options, the systolic array architecture is central to efficient DNN processing because of its ability to handle the matrix operations at the heart of AI computation. Its design promotes high throughput and energy savings, essential for demanding AI workloads. Furthermore, these accelerators are adaptable and can be customized for diverse AI needs, making them integral to optimizing AI hardware for various tasks. At the frontier of AI models, Large Language Models (LLMs) are an appealing choice for alleviating the challenges of hardware accelerator design. While there have been a few efforts to use LLMs to automate the hardware design process, from conceptual design to synthesis and fabrication, the absence of prompt optimization, tailored datasets, and model fine-tuning poses a barrier to fully harnessing their potential.
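The dataflow a systolic array exploits can be sketched in a few lines (a hypothetical cycle-level model, not any particular accelerator): a grid of multiply-accumulate cells computes C = A x B as skewed wavefronts of operands stream through, so cell (i, j) sees A[i][k] and B[k][j] at cycle i + j + k and only ever communicates with its neighbors.

```python
# Cycle-level sketch of an output-stationary systolic array:
# each cell (i, j) accumulates C[i][j] as the matching A and B
# operands arrive along its row and column, one diagonal per cycle.

def systolic_matmul(A, B):
    n = len(A)  # assume square n x n matrices for simplicity
    C = [[0] * n for _ in range(n)]
    # All (i, j, k) products with i + j + k == cycle fire in parallel
    # on real hardware; iterating cycles reproduces that wavefront.
    for cycle in range(3 * n - 2):
        for i in range(n):
            for j in range(n):
                k = cycle - i - j
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = systolic_matmul(A, B)  # same result as an ordinary matrix multiply
```

The scheduling regularity visible here (fixed per-cell work, neighbor-only communication) is exactly what makes systolic designs parameterizable, and hence an attractive target for LLM-driven generation.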