poster - digital & system


PD01 An area-efficient implementation of softmax activation function for deep neural networks Muhammad Awais Hussain, Tsung-Han Tsai National Central University

Deep neural networks (DNNs) are widely used in computer vision applications due to their high performance. However, DNNs involve a large number of computations in the training and inference phases. Among the different layers of a DNN, the softmax layer has one of the most complex computations, as it involves exponent and division operations. A hardware-efficient implementation is therefore required to reduce the on-chip resources. In this paper, we propose a new hardware-efficient and fast implementation of the softmax activation function. The proposed hardware implementation consumes fewer hardware resources and runs faster than state-of-the-art techniques.
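For readers unfamiliar with why softmax is costly, the following is a minimal software sketch of the function itself, not the authors' hardware design. It shows the one exponent and one division per input that the abstract refers to; the max-subtraction step is a standard numerical-stability measure and is an assumption here, not something the abstract describes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)                            # subtracting the max avoids overflow in exp
    exps = [math.exp(x - m) for x in logits]   # one exponent per input
    total = sum(exps)
    return [e / total for e in exps]           # one division per input

probs = softmax([1.0, 2.0, 3.0])
```

The outputs sum to one and preserve the ordering of the inputs, which is why hardware designs can trade exactness of the exponent for area as long as that ordering survives.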

PD02 FPGA-based Robot System Design for Search and Rescue Applications Chun-Hsian Huang, Cheng-Yi Hsu, Yu-Chen Chen, Shao-Yu Yang, Wei-Ting Huang and Pei-Rong Wu Department of Computer Science and Information Engineering, National Taitung University

In this work, we focus on the main system design of a search and rescue (SAR) robot rather than on its mechanical mechanism. Unlike most existing robot designs, an FPGA device is adopted as the main computing architecture instead of a microprocessor. For SAR tasks, in addition to multiple different sensors, a resource-efficient quantized neural network (QNN) is implemented as a hardware module and integrated into our robot system for real-time survivor detection. To reduce the effects of occlusion by obstacles and of poor lighting, an image fusion method for survivor detection based on both the RGB image and the thermal infrared image is also presented. According to our experiments, the proposed robot system can operate at up to 101.41 MHz while achieving 2.404 fps for survivor detection.

PD03 AN-RNSNN: AN-Coded Redundant Residue Number System for Neural Network Acceleration and Reliability Hsiao-Wen Fu, Meng-Wei Shen and Tsung-Chu Huang Department of Electronics Engineering, National Changhua University of Education

Residue Number Systems (RNS) can simultaneously improve computing speed, area, and power. For high-reliability applications such as automotive electronics, partitionability gives the redundant RNS (RRNS) its fault tolerance. However, parallel multiple modular redundancy requires a huge number of converters. This is the first paper to incorporate AN codes into the RRNS for high-reliability neural networks. Experimental results show that the k-modulus redundancy can be reduced from two paths to only one, and that the number of residue-to-binary converters can be reduced from (k+2)(k+1) to only k+1 for the external-coding structure, and even to only one for the internal-coding structure, in single-residue arithmetic-weight error correction. BLER simulations show that reliability improves by about 130 times for a 16-bit 47N-coded MAC.
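As background on AN codes, here is a minimal sketch of plain AN coding with A = 47 (the constant named in the abstract). It does not model the RRNS structure or the authors' circuitry. The 23-bit codeword width, the function names, and the syndrome-table approach are assumptions made for illustration; the width is chosen because the residues of +2^i and -2^i modulo 47 are all distinct for i < 23, so every single-bit arithmetic error has a unique syndrome.

```python
A = 47          # check constant, taken from the "47N-coded" MAC in the abstract
WIDTH = 23      # assumed codeword width: +/-2**i mod 47 are all distinct for i < 23

def encode(n):
    """An AN codeword is simply the value multiplied by A."""
    return A * n

# Map each nonzero syndrome to the single-bit arithmetic error that causes it.
SYNDROME_TABLE = {}
for i in range(WIDTH):
    SYNDROME_TABLE[(1 << i) % A] = (1 << i)        # bit i flipped 0 -> 1
    SYNDROME_TABLE[(-(1 << i)) % A] = -(1 << i)    # bit i flipped 1 -> 0

def correct(codeword):
    """Remove a single-bit error, if any, using the residue modulo A."""
    s = codeword % A
    if s == 0:
        return codeword                            # valid codeword, nothing to fix
    return codeword - SYNDROME_TABLE[s]

def decode(codeword):
    return correct(codeword) // A

def mac(acc_code, x_code, w):
    """Multiply-accumulate on codewords: A*acc + (A*x)*w = A*(acc + x*w)."""
    return acc_code + x_code * w
```

Because the code is preserved by addition and by multiplication with an uncoded operand, the check can be applied after the MAC rather than around each operand, which is the property that makes AN codes attractive for arithmetic datapaths.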

PD04 Error Correctable Range-Addressable Lookup for Any Activation Function of Neural Networks Ting-Yu Chen, Cheng-Di Tsai, Hsiao-Wen Fu, Yung-Chun Yang and Tsung-Chu Huang Department of Electronics Engineering, National Changhua University of Education

In this paper, we generalize the lightweight-slope piecewise-linear range-addressable lookup table for efficient approximate computing of any activation/quantization function, and propose an error-correcting algorithm and circuitry using AN codes to enhance reliability. A BLER/SNR simulation demonstrates the SEC/DEC capability of the AN-coded neuron, which is presented here for the first time. Comparisons with similar state-of-the-art works show that the proposed technique is the most efficient error-correctable lookup table for any function at a medium resolution of 8-12 bits.
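The idea of a range-addressable piecewise-linear lookup can be sketched in software as follows. This is an illustrative floating-point model only: the breakpoints, the choice of sigmoid, and the function names are hypothetical, and the sketch does not model the lightweight slopes, the fixed-point resolution, or the AN-coded error correction described in the abstract.

```python
import bisect
import math

def build_table(func, breakpoints):
    """One (slope, intercept) pair per range [breakpoints[k], breakpoints[k+1])."""
    table = []
    for lo, hi in zip(breakpoints, breakpoints[1:]):
        slope = (func(hi) - func(lo)) / (hi - lo)
        table.append((slope, func(lo) - slope * lo))
    return table

def lookup(x, breakpoints, table):
    """Find the range containing x, then evaluate that range's line."""
    x = min(max(x, breakpoints[0]), breakpoints[-1])    # clamp to the covered range
    k = min(bisect.bisect_right(breakpoints, x) - 1, len(table) - 1)
    slope, intercept = table[k]
    return slope * x + intercept

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical non-uniform breakpoints: narrow ranges where the curve bends most.
BREAKS = [-8.0, -4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0, 8.0]
TABLE = build_table(sigmoid, BREAKS)
approx = lookup(0.5, BREAKS, TABLE)
```

Addressing by range rather than by every input value is what keeps the table small: eight entries cover the whole input span here, at the cost of a bounded approximation error inside each range.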

PD05 Image recognition with the dynamical fixed-point representation of weight tables for reducing CNN computational complexity Jih-Ching Chiu, Wei-Yi Lin, Zhe-You Yan Department of Electrical Engineering, National Sun Yat-sen University

Currently, the demand is increasing for deep learning, real-time recognition, and edge computing. These technologies require not only significant computing resources but also real-time computing performance. Therefore, the computing efficiency, energy consumption, and storage space of embedded systems have increased in importance.

This paper builds on the convolutional neural network (CNN) mechanism to construct a dynamic fixed-point representation algorithm, which uses quantized weight tables for the distinguishable fixed-point representation values. The algorithm includes a 16-bit fixed-point quantization architecture and a dynamic adjustment algorithm for the ratio of integer to fractional bits. This architecture spares users from retraining the weight tables with a dedicated CNN model, and it reduces the number of data bits and thus the required data bandwidth and storage space. In addition, the computational complexity is much lower than with the traditional floating-point CNN architecture. Through the construction and analysis of a simulation environment, this paper obtains nearly lossless recognition results while reducing the consumed computational resources by more than 50%.
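One common way to realize a dynamic integer/fraction split is to choose it per weight table from the largest magnitude in that table. The sketch below illustrates that idea under stated assumptions: a signed 16-bit word, round-to-nearest, and saturation on overflow. The function names and the selection rule are hypothetical and are not taken from the paper.

```python
import math

def choose_fraction_bits(weights, total_bits=16):
    """Pick the integer/fraction split so the largest weight still fits."""
    max_abs = max(abs(w) for w in weights)
    int_bits = 0
    if max_abs > 0:
        int_bits = max(0, math.floor(math.log2(max_abs)) + 1)
    return total_bits - 1 - int_bits             # one bit is reserved for the sign

def quantize(weights, total_bits=16):
    """Return integer codes plus the number of fractional bits they use."""
    frac_bits = choose_fraction_bits(weights, total_bits)
    scale = 2.0 ** frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    codes = [min(hi, max(lo, round(w * scale))) for w in weights]   # saturate
    return codes, frac_bits

def dequantize(codes, frac_bits):
    return [c / (2.0 ** frac_bits) for c in codes]

codes, frac_bits = quantize([3.7, -1.2, 0.05])
restored = dequantize(codes, frac_bits)
```

A table with small weights gets more fractional bits and therefore finer resolution, while a table with large weights gives up resolution to avoid overflow; no retraining is involved because only the representation of existing weights changes.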

PD06 Scalable NPairLoss-based Deep-ECG for ECG Verification Yu-Shan Tai1, Yi-Ta Chen2, and An-Yeu (Andy) Wu2

1 Department of Electrical Engineering, National Taiwan University 2 Graduate Institute of Electronics Engineering, National Taiwan University

To protect sensitive ECG data from data breaches, ECG biometric systems have been proposed. Compared to traditional biometric systems, ECG biometrics are known to be ubiquitous and difficult to counterfeit. An ECG biometric system mainly comprises an identification task and a verification task, and Deep-ECG is the state-of-the-art work in both. However, Deep-ECG was trained on only one specific dataset, which ignores the intra-variability of ECG signals across different situations. Moreover, Deep-ECG used cross-entropy loss to train the deep convolutional neural network (CNN) model, which is not the most appropriate loss function for such an embedding-based problem. In this paper, we propose a scalable NPairLoss-based Deep-ECG (SNL-Deep-ECG) system for ECG verification on a hybrid dataset that mixes four public ECG datasets. We modify the preprocessing method and train the deep CNN model with NPairLoss. Compared with Deep-ECG, SNL-Deep-ECG reduces the signal collection time during inference by 90% with only a 0.9% drop in area under the ROC curve (AUC). Moreover, SNL-Deep-ECG outperforms Deep-ECG by approximately 3.5% in AUC on the hybrid dataset. SNL-Deep-ECG also maintains its verification performance as the number of subjects increases and is thus scalable in terms of subject number. The final performance of the proposed SNL-Deep-ECG is an AUC of 0.975/0.970 on the seen/unseen-subject task.

PD07 Instruction Analyzer with Nested Loop Unrolling Jih-Ching Chiu, Wei-Liang Yan, Yu-Quan Chen Department of Electrical Engineering, National Sun Yat-Sen University

To accelerate computation on a multicore processor, the key is to issue loop instructions more smoothly. Loops slow the processor down for two reasons. First, loops make the basic blocks in a program shorter. Second, each loop iteration adds many data dependencies. This paper shows how to issue programs with loops more smoothly in the hyper-scalar architecture, a multicore architecture, by improving the Instruction Analyzer (IA), one of the modules of the hyper-scalar architecture. Before execution on the hardware, software analyzes the loop information, labels and standardizes the loop tags, and mixes them into the other instructions. When the program executes, the Pre-Fetch stage, the first stage of the IA, uses a block-driven predictor to unroll the loop and passes multiple instructions to the next stage. The Tag-Generate stage, the second stage of the IA, renames the instructions and changes their immediate values according to the loop iteration number to alter some of the data dependencies between iterations, and then issues these instructions to the multiprocessor. With the improved IA, we obtain a 23% performance gain and 99.98% instruction saturation in each processor when using a simplified convolutional neural network (CNN) as the example program. With different combinations of loops, the architecture still achieves a 10% performance gain and 99% instruction saturation in each processor. Finally, we implement the design on an FPGA and complete functional verification.

PD08 Continual Learning for Environmental Sound Classification without Catastrophic Forgetting Said Karam, Shanq-Jang Ruan National Taiwan University of Science and Technology

A major obstacle for artificial intelligence is enabling a model to learn new knowledge without forgetting previously learned information. In this article, we focus on the problem of continual learning in sound classification. We implement ResNet-50 and gradient episodic memory (GEM) for classifying and observing a sequence of tasks. The results show that GEM can transfer knowledge successfully and mitigate catastrophic forgetting. The GEM-based approach shows strong performance on the ESC-50 and UrbanSound8K datasets compared to state-of-the-art methods, obtaining 93% and 90% accuracy, respectively.

PD09 Integrated Dynamic Memory Manager for an Embedded RISC-V Processor Chun-Wei Chao, Jin-You Wu, Sheng-Di Hong, and Chun-Jen Tsai Department of Computer Science, National Yang Ming Chiao Tung University

In this paper, we present an open-source RISC-V processor with an embedded dynamic memory manager. Traditionally, dynamic memory management is handled by a software library. However, the process involves searching and manipulating linked lists of memory blocks, which can be expensive when the memory becomes fragmented. In particular, for embedded systems that have to be online for a long duration, a static memory organization is often used to reduce the overhead of dynamic memory management at the cost of less flexibility and worse coding style. Modern VLSI technology allows efficient implementation of complex control behaviors, so it is time to investigate the option of integrating hardwired resource managers directly into the processor architecture for better performance. As the experiments in this paper show, a hardware memory manager embedded within the processor core can be more efficient than a software one. The proposed architecture is implemented and verified on a Xilinx FPGA development board.

PD10 Quantized Full Semantic Segmentation Neural Networks for image segmentation on PYNQ-FPGA Afaroj Ahamad, Chi-Chia Sun, Hoang Minh Nguyen Digital System Design Laboratory, National Formosa University

In the history of computer vision, semantic segmentation has been one of the most important challenges. Semantic segmentation is the ability to segment an image into different parts. Implementing semantic segmentation on an embedded platform is a fruitful idea, but low power budgets and limited memory make it a demanding task. In this article, we propose a novel and practical quantized full semantic segmentation architecture for an FPGA device, which reduces the parameter size of the original architecture and hence also the required power. The proposed architecture mirrors other semantic segmentation networks apart from its quantized weights and activations. This paper thus proposes a high-performance deep learning processor unit (DPU) based accelerator for semantic segmentation neural networks. This methodology is well suited to robot vision on an embedded platform, and the segmentation accuracy is up to 89.60% on average. Notably, the proposed faster architecture is ideal for low-power embedded devices that need to solve the shortest path problem, path searching, and motion planning in ADAS and robotics.

PD11 A High-Accuracy Approximate Multiplier with Dynamic Input Truncation Jia-Wei Lin, Ing-Chao Lin Department of Computer Science and Information Engineering, National Cheng Kung University

Multipliers are among the most critical arithmetic functional units in many applications, and those applications commonly require many multiplications, which results in significant power consumption. For error-tolerant applications, approximate multipliers are an emerging way to reduce critical path delay and power consumption, trading accuracy for lower energy and higher performance. In this paper, we propose an approximate 4-2 compressor with high accuracy. In addition, we propose a reconfigurable approximate multiplier that can dynamically truncate partial products to meet variable accuracy requirements. We also propose a simple error compensation circuit to reduce the error distance. The proposed approximate multiplier can adjust its accuracy at run-time based on the users' requirements. Compared to the traditional Wallace Tree multiplier, the critical path delay is reduced by 27% and the power consumption is between 18% and 72% under different accuracy requirements.
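The effect of truncating partial products can be illustrated with a simplified software model. This sketch is not the proposed design: it omits the approximate 4-2 compressor entirely, and both the truncation rule (dropping the lowest `cut` columns of every partial product) and the constant compensation term are assumptions chosen only to show how a run-time parameter trades accuracy for work.

```python
def truncated_multiply(a, b, width=8, cut=0, compensate=True):
    """Unsigned multiply that drops the `cut` least-significant product columns."""
    mask = ~((1 << cut) - 1)
    total = 0
    for j in range(width):
        if (b >> j) & 1:
            total += (a << j) & mask      # partial product with low columns removed
    if compensate and cut > 0:
        total += 1 << (cut - 1)           # assumed constant bias for the dropped bits
    return total

exact = truncated_multiply(200, 123)          # cut=0 keeps every column
rough = truncated_multiply(200, 123, cut=4)   # cheaper, slightly inaccurate
```

With `cut=0` the result is exact; raising `cut` removes more low-order columns from the adder tree, and the compensation constant re-centres the resulting error around zero.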

PD12 Neural Network Architecture Based on Wavelet Transform for Vehicle Detection Ching-Lung Su1,3, Wen-Cheng Lai1,2, and Zhong-Jun Yao1

1Dept. of Electronic Engineering, National Yunlin University of Science and Technology 2Bachelor Program in Industrial Projects, National Yunlin University of Science and Technology 3Research and Implementation Center of Intelligent Electronic Product, National Yunlin University of Science and Technology

This article discusses the development of a neural network model for detecting long-distance objects with a small computational budget. The proposed wavelet object detection architecture merges other frequency-domain components of the image into the reference. Experiments on long-distance scenes show higher accuracy as well as lower computational complexity. The feasibility of the proposed wavelet neural network has been evaluated and verified on vehicle detection. The wavelet object detection architecture achieves low latency and a high frame rate (fps) when ported to the NVIDIA Jetson AGX Xavier evaluation board.

PD13 Accuracy Improvement of the Abnormal ECG Detection Chip Hong-Wen Jian1 and Hsin-Tung Hua1, Yuan-Ho Chen1,2 1Dept. of Electronics Engineering, Chang Gung University 2Dept. of Radiation Oncology, Chang Gung Memorial Hospital

This paper proposes a method to detect six different kinds of electrocardiography (ECG) signals using a convolutional neural network (CNN), which can improve the efficiency of the original architecture. A smaller CNN architecture is used to meet high-efficiency expectations; it has 5 layers, composed of 3 convolution layers and 2 dense layers, and provides high performance with a lightweight network. To improve accuracy, we propose a data-shifting method that doubles the input signal and produces more information for training. Although this doubles the computing time, the VLSI chip achieves high accuracy, similar to that of the software simulation. This study used a TSMC 0.18-µm CMOS process, with an area of 790 µm × 783 µm and a power consumption of 0.75 mW at 20 MHz. The method increases accuracy by about 0.28%, which helps improve the accuracy of physiological signal analysis such as ECG. The small chip area also allows it to be combined with portable devices and applied in daily life.

PD14 High Resolution Pulse-Shrinking-Based FPGA DPWM Taisen Lin, Poki Chen, Satria Bhaskara Adinugraha Department of Electronic and Computer Engineering, National Taiwan University of Science and Technology

A high-resolution digital pulse width modulator (DPWM) implemented on a field-programmable gate array (FPGA) is proposed in this paper. A novel pulse-shrinking technique is used to achieve high resolution. A PLL and inverters are used to reduce resource utilization, so that only 6.7% of the otherwise required delay elements are needed to realize this resolution. The Altera Arria 10 FPGA uses the dedicated 50 MHz clock as the switching frequency. However, pulse shrinking cannot be simulated in the EDA tool. After calibration, only 12 control bits are required, and the programmable resolution is 6.098 ps, with an INL of -10.46 to 10.94 ps and a DNL of -16.29 to 16.34 ps.
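The reported figures can be cross-checked with a short calculation. The sketch below assumes that the control word must address every 6.098 ps step across one full 50 MHz switching period; that assumption is ours, not a statement from the abstract.

```python
import math

SWITCHING_FREQ_HZ = 50e6      # from the abstract
RESOLUTION_PS = 6.098         # from the abstract

period_ps = 1e12 / SWITCHING_FREQ_HZ           # length of one switching period in ps
steps = period_ps / RESOLUTION_PS              # distinct pulse-width positions
control_bits = math.ceil(math.log2(steps))     # bits needed to address every step
```

Under that assumption the period is 20 ns, which holds roughly 3280 steps, so the 12 control bits quoted in the abstract appear consistent with the stated resolution.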

PD15 A Low-Complexity Image Stitching Algorithm Based on SURF Shih-Ting Hung1, Ya-Yun Huang1, Fu-Jung Wen1, Yih-Shyh Chiou1, Wu-Chun Chung2, Ting-Lan Lin3, Chiung-An Chen4 and Shih-Lun Chen1

1Department of Electronic Engineering, Chung Yuan Christian University 2Department of Information and Computer Engineering, Chung Yuan Christian University 3Department of Electronic Engineering, National Taipei University of Technology 4Department of Electrical Engineering, Ming Chi University of Technology

In this paper, an image stitching algorithm with low complexity and computation is proposed. The proposed algorithm is developed using a feature point matching method based on the Speeded-Up Robust Features (SURF) algorithm. To simplify the calculation, candidate feature points are found by applying a Gaussian filter of standard deviation sigma to the image and convolving the filtered image with the Hessian template. In the image stitching part, an optimal seam algorithm is used to achieve fast image stitching and effectively eliminate ghosting. Moreover, the obvious lines at the stitching seam due to color differences are reduced by alpha blending. Experimental results show that the proposed feature point matching algorithm reduces computation without affecting matching accuracy or time. It has the benefits of low computational complexity and high extraction accuracy.

PD16 A Development and Implementation of a Smart IoT Sensing System in Long Term Care Wei-Da Chen1, En-Wei Syn2, Yu-Ming Xu2, Tao-Zhi Wang2, Huai-An Wang2 1Department of Electronic Engineering, Oriental Institute of Technology 2Department of Communication Engineering, Oriental Institute of Technology

People are living longer on average, and sub-replacement fertility has become widespread, especially in Taiwan and Japan. To maintain the quality of life of caretakers and the people they care for, a contactless thermal sensing system is presented that can detect people who have fallen or unsafe thermal sources in a space. All data sensed by the Wi-Fi smart sensing modules can be uploaded to an MQTT cloud broker, and caretakers can conveniently display these data on a customized graphical user interface. Additionally, an ARM-based hard processor system (HPS) can access the hardware accelerator of a bicubic interpolator implemented on an SoC FPGA running Linux. The proposed system is equipped with an FPGA to support the data rate of more than 30 sensors and to interpolate thermal images stored in cloud storage. After integration and verification, the proposed system can reduce manpower costs and detect unsafe thermal sources in real time.

PD17 Using Phase Portrait of Photoplethysmography Signals to analyze Arrhythmia Po-Lin Yao1, Shu-Yen Lin1, and Yu-Wei Chiu2,3

1Department of Electrical Engineering and 2Department of Computer Science and Engineering, Yuan Ze University, 3Cardiology Department, Far Eastern Memorial Hospital

To analyze arrhythmia using the photoplethysmography (PPG) signal, this study uses the phase portrait reconstructed from the time-delayed PPG signal. By changing two parameters of the phase portrait (signal delay and window size), the difference in waveform between normal and abnormal beats can be extracted. This method turns the PPG signal from one dimension into two dimensions and provides another angle from which to analyze arrhythmia. The phase portrait also provides a new method for detecting arrhythmia on wearable devices, and its low computational complexity helps reduce power consumption and extend the battery life of wearable devices.
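A time-delay phase portrait pairs each sample with a later sample of the same signal. The sketch below shows that construction on a synthetic sine wave standing in for a PPG trace (it is not real data). Interpreting "window size" as the number of samples plotted is an assumption, as are the function and variable names.

```python
import math

def phase_portrait(signal, delay, window):
    """Return the (x[t], x[t + delay]) points for the first `window` samples."""
    last = min(window, len(signal) - delay)
    return [(signal[t], signal[t + delay]) for t in range(last)]

# Synthetic stand-in for a PPG trace: a 1 Hz sine sampled at 100 Hz.
FS = 100
ppg = [math.sin(2 * math.pi * t / FS) for t in range(3 * FS)]
points = phase_portrait(ppg, delay=FS // 4, window=FS)
```

For this perfectly periodic input, a delay of a quarter period traces a circle; an irregular beat would distort the trajectory, which is the kind of difference the two parameters are tuned to expose. The cost is one memory read per point, with no multiplications.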

PD18 Object Detection Inference Based on NVDLA Accelerator Yeong-Kang Lai, Chuan-Wei Huang Department of Electrical Engineering, National Chung Hsing University

As AI (Artificial Intelligence) becomes more and more advanced, many applications follow in its footsteps. For example, in the fields of object recognition and speech processing, there are many different DNN (Deep Neural Network) structures. By combining these DNNs with large collections of images or audio and training on GPUs, high accuracy can be achieved to meet application requirements. This paper focuses on object identification in store applications and uses a DNN hardware accelerator to achieve real-time inference. YOLOv2-tiny is mapped onto the NVDLA hardware. The NVDLA can run YOLOv2-tiny using 256 MACs at 150 MHz and reaches up to 5 FPS.

PD19 IP Generator for Edge AI Inference Systems Yeong-Kang Lai and Chuan-Wei Huang Department of Electrical Engineering, National Chung Hsing University

With the rapid development of deep learning, many applications have also developed. People design hardware accelerators for their applications. For example, in the field of object detection, people pursue the accuracy of identifying objects. However, a well-designed hardware accelerator usually considers several performance evaluation factors. Latency, power, throughput, energy, and area are often used as performance indicators. A hardware accelerator usually needs to achieve a high utilization rate, which requires a well-designed dataflow. Therefore, this paper focuses on several DNNs (AlexNet, VGG, and MobileNet) and analyzes three different hardware accelerator architectures (Output Stationary, Weight Stationary, and Tree Architecture) proposed in this paper. Moreover, it also uses different numbers of PEs to analyze these models. The PE usage rate is found to be higher for more regular models. For most DNNs, the proposed hardware accelerators have a PE usage rate of at least 92%.

PD20 A 0.5V 16-Transistor Flip-Flop in 180-nm CMOS for Low Power Applications Shun-Fa Yang, Chun-Ting Wu, Cherng-Haw Tsai, Jun-Yang Hsieh and Jin-Fa Lin Department of Information and Communication Engineering, Chaoyang University of Technology

A low-voltage, low-power true-single-phase flip-flop (FF) design using only 16 transistors is proposed. It is adapted from the conventional master-slave based design and reduces layout area by using a hybrid logic scheme. Optimization measures have resulted in a new FF with better power and area performance. Based on simulation results using the TSMC 180-nm CMOS process, our design improves on the conventional TGFF design by 67.3% in energy and 30.8% in layout area.

PD21 Design of Low-power Wide-area Wireless Network System with Intelligent Terminal for Applied Agriculture Qi-Ting Lin, Trong-Yen Lee, Wei-Chen Hsueh, and Xin-Jie Lin Department of Electronic Engineering, National Taipei University of Technology

Because agricultural land covers large areas, sensors used in agriculture need to be connected through long-distance communication protocols. Commonly used communication protocols cannot cover such distances. LoRa technology combined with the Internet of Things can solve this problem. The terminal part of the neural network cannot rely on a complete computer system in this setting, so building a terminal neural network system chip is also a major trend in agricultural applications. Therefore, this paper proposes a terminal whose sensor works with a LoRa chip to send the sensing data to a remote gateway, designs a gateway that receives the LoRa signal and converts it to WiFi to connect to the IoT platform, and designs an Edge BNN system chip on an FPGA that can be used with a lens and other peripherals to identify mango symptoms.

PD22 An FPGA-based High-frequency Trading System on Taiwan Futures Market Yi-Chieh Kao, Hung-An Chen, and Hsi-Pin Ma Department of Electrical Engineering, National Tsing Hua University

High-frequency trading (HFT) systems require extremely low response latency to make profits. Therefore, reducing the overall system latency and increasing the throughput can increase the daily net profit of high-frequency traders. In this paper, an FPGA-based high-frequency trading system is designed and implemented. The proposed system implements a 10G Ethernet physical interface, customized network stack parsing and packaging, partial financial protocol decoding and encoding, futures market order book handling, and a customized trading strategy. A 156.25 MHz clock drives the 64-bit datapath of the Ethernet physical transceiver and receiver. For network stack decoding, the system can identify and analyze Ethernet packets of the address resolution protocol (ARP), user datagram protocol (UDP), and transmission control protocol (TCP), and provides TCP connection functionality. The proposed system has been connected to the real futures trading environment to verify the correctness of market data parsing and order management processing. Based on verification and evaluation on a field-programmable gate array (FPGA), the proposed system can accurately analyze the market information and obtain the fifth-order of a specific product regardless of the trading time, and issue an order when the trading strategy is triggered. An aggregated latency of 500 ns is measured. This result is a hundred times more efficient than a typical software-based HFT system.
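The clock and datapath figures in the abstract can be related to the line rate and the latency budget with simple arithmetic. The sketch below uses only numbers quoted above; the variable names are ours.

```python
CLOCK_HZ = 156.25e6       # datapath clock from the abstract
DATAPATH_BITS = 64        # datapath width from the abstract
LATENCY_NS = 500          # measured aggregated latency from the abstract

line_rate_bps = CLOCK_HZ * DATAPATH_BITS     # bits moved per second by the datapath
cycle_ns = 1e9 / CLOCK_HZ                    # duration of one clock cycle
latency_cycles = LATENCY_NS / cycle_ns       # latency expressed in datapath cycles
```

A 64-bit word per 156.25 MHz cycle gives exactly 10 Gb/s, matching the 10G Ethernet interface, and at 6.4 ns per cycle the 500 ns latency corresponds to roughly 78 cycles from packet arrival to order issue.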

PD23 A Study on Complexity-Constrained Model Scaling for Convolutional Neural Networks Jia-Han Liu, Kai-Ping Lin, and Chao-Tsung Huang Department of Electrical Engineering, National Tsing Hua University

Convolutional Neural Networks (CNNs) are deployed on hardware platforms with fixed computation resources. While model scaling is commonly used to trade off computation complexity and model performance of a CNN model, in this paper, we study model scaling under a fixed complexity and indicate that trading off model width, depth, and resolution can lead to better performance. We present a case study on ResNet-18 on the ImageNet dataset, achieving accuracy gain up to 1.6% compared to the baseline model without scaling. We demonstrate model depth is the dominant dimension that affects model performance the most, and show that additional accuracy gain can be obtained by choosing input images of size 192 × 192.
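The trade-off under a fixed complexity can be illustrated with the commonly used approximation that convolutional cost grows linearly with depth and quadratically with width and input resolution. The sketch below applies that approximation; it is not the paper's complexity model, and the 224 × 224 baseline resolution is an assumption (the usual ImageNet input size), as are the function names.

```python
def relative_cost(depth, width, resolution):
    """Approximate conv-net cost relative to a baseline of (1, 1, 1)."""
    return depth * width ** 2 * resolution ** 2

def depth_for_budget(width, resolution, budget=1.0):
    """Depth multiplier that keeps the cost at `budget` for a given width/resolution."""
    return budget / (width ** 2 * resolution ** 2)

r = 192 / 224                                    # smaller input relative to 224 x 224
d = depth_for_budget(width=1.0, resolution=r)    # depth the saved cost can pay for
```

Under these assumptions, shrinking the input to 192 × 192 frees enough computation for about 36% more depth at the same total cost, which is in line with the abstract's observation that depth is the dimension most worth spending on.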

PD24 Analysis and Improvement of the Optimal Path Planning Algorithm Applied to Multiple Vehicles Bo-Xun Peng, Jing-Chen Tu, and Kuang-Hao Lin Department of Electrical Engineering, National Formosa University

This paper improves the A* and Dijkstra navigation algorithms to make them more suitable for multiple vehicles, so that each vehicle reaches its destination more quickly and accurately. The paper details how to adjust the vehicle body parameters for self-development, introduces ROS processing, the underlying theory, and each part of the code, and analyzes the pros and cons of the Robot Operating System (ROS). If more than 4 vehicles are in the local map, the runtime becomes longer because the robots' paths cross. Therefore, the Dijkstra algorithm is improved by adding a condition that limits the vehicle's direction, letting the vehicle escape the traffic jam. Finally, the A* and modified Dijkstra algorithms are simulated on a large-scale map. In conclusion, if the number of vehicles exceeds the field limit, a traffic jam is triggered and efficiency drops. However, if there are not many vehicles in the field, the A* algorithm is the best solution. The two algorithms are therefore integrated, and the system judges the number of vehicles and chooses the algorithm accordingly on a large-scale map.
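For reference, the two baseline algorithms differ only in whether a heuristic guides the search. The sketch below is a textbook single-vehicle grid planner, not the modified multi-vehicle algorithm from the paper; the grid encoding, 4-connected moves, unit step cost, and the `use_heuristic` switch are all assumptions made for illustration.

```python
import heapq

def a_star(grid, start, goal, use_heuristic=True):
    """Shortest 4-connected path on a grid of 0 (free) / 1 (blocked) cells.

    With use_heuristic=False the heuristic is zero and the search is Dijkstra's.
    """
    def h(cell):
        if not use_heuristic:
            return 0
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])   # Manhattan distance

    frontier = [(h(start), 0, start)]
    came_from = {start: None}
    cost = {start: 0}
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:
            path = []
            while cell is not None:          # walk parents back to the start
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if g > cost[cell]:
            continue                         # stale queue entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] == 0:
                new_cost = g + 1
                if new_cost < cost.get(nxt, float("inf")):
                    cost[nxt] = new_cost
                    came_from[nxt] = cell
                    heapq.heappush(frontier, (new_cost + h(nxt), new_cost, nxt))
    return None                              # goal unreachable

GRID = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
path = a_star(GRID, (0, 0), (3, 0))
```

Both variants return a shortest path; A* typically expands fewer cells because the heuristic pulls the search toward the goal, which is why it suits uncongested maps, while a direction constraint of the kind the paper describes would be added where neighbours are generated.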

PD25 Hardware Design of Contour Image Generation for Label Cutting Machines Chu Yu, You-Sheng Huang, Ting-Wei Hsu, and Jia-Hong Tang Department of Electronic Engineering, National Ilan University

In this paper, we present the hardware design of contour image generation for label cutting machines. Generally, a label image has a simple background, which allows low-complexity image processing for contour generation. Therefore, we select the Sobel edge detector to generate the image edges as the first step of feature extraction. However, the edge detector is prone to producing discontinuous, fractured edges, which affects the integrity of the contour image. Thus, morphological dilation is employed to fill in the broken edges. Finally, based on the proposed scheme, we designed dedicated hardware to accelerate the generation of label contour images. For a 512 × 512 label image, contour generation takes 500 ms in software, whereas our hardware takes only 30 ms at 100 MHz. Therefore, the proposed hardware is better suited to producing large volumes of contour images.
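The two processing steps named above, Sobel edge detection followed by dilation, can be sketched in plain software as follows. This is an illustrative model on nested lists, not the hardware pipeline; the |Gx| + |Gy| magnitude approximation, the threshold, and the 3 × 3 structuring element are assumptions.

```python
def sobel_edges(img, threshold):
    """Binary edge map from |Gx| + |Gy| on a 2-D list of grey levels."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(1, h - 1):                # border pixels are left at 0
        for c in range(1, w - 1):
            gx = (img[r - 1][c + 1] + 2 * img[r][c + 1] + img[r + 1][c + 1]
                  - img[r - 1][c - 1] - 2 * img[r][c - 1] - img[r + 1][c - 1])
            gy = (img[r + 1][c - 1] + 2 * img[r + 1][c] + img[r + 1][c + 1]
                  - img[r - 1][c - 1] - 2 * img[r - 1][c] - img[r - 1][c + 1])
            out[r][c] = 1 if abs(gx) + abs(gy) >= threshold else 0
    return out

def dilate(mask):
    """3x3 binary dilation: a pixel is set if it or any neighbour is set."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            out[r][c] = int(any(
                mask[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))))
    return out

# Tiny test image: dark left half, bright right half.
image = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
edges = sobel_edges(image, threshold=255)
contour = dilate(edges)
```

Dilation closes one-pixel gaps in the edge map, as in `dilate([[1, 0, 1]])`, at the price of a slightly thicker contour. Both steps use only fixed 3 × 3 neighbourhoods, which is what makes them natural candidates for a streaming hardware design.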

PD26 A Noise-Tolerant RSSI Positioning Technique by Combining Weighted Circular and Genetic Algorithms Kuo-Wei Huang, Rui-Ming Yang and Hongchin Lin Dept. of Electrical Engineering, National Chung Hsing University

In recent years, the Internet of Things (IoT) has become popular in various electronic products. One of its important features is determining the location of an IoT device. Among the various signals used to sense position, the Received Signal Strength Indication (RSSI) technique is easy to implement. However, accurate positioning is difficult to achieve owing to noise. To enhance accuracy, the proposed Progressive Group-Based Symbiotic Genetic Algorithm (PGBS-GA) and the Weighted Circular Algorithm (WCA) are combined to reduce the positioning errors significantly. Simulation and experimental results using the LoRa (Long Range) wireless system show that the errors are less than 30 m on average over an outdoor area of about 80,000 m^2 using 8 anchors, for noise with a mean of -8 dB to 4 dB and a standard deviation below 3 dB.

PD27 PSCNN: Programmable SRAM-based Computation-In-Memory CNN Processor for Binary Keyword Spotting Model Shu-Hung Kuo, Tian-Sheuan Chang Institute of Electronics, National Yang Ming Chiao Tung University

The enormous amount of computation in CNNs limits their use on low-power devices. To solve this problem, computation-in-memory (CIM), which has the advantages of low power and high throughput, has been proposed. However, being restricted by physical limits, CIM has low dataflow flexibility and consequently low hardware utilization. To address this issue, we propose a programmable SRAM-based CIM CNN processor, PSCNN, that achieves high throughput with low power consumption while maintaining high control flexibility. PSCNN, implemented in 28 nm technology, achieves 1.52 TOPS at 100 MHz with 16K gate counts (7.78 mm^2 in area), one 512 Kb CIM macro, and twelve 64 Kb SRAM macros while implementing our binary keyword spotting model.

PD28 AI Crowd Control Detection System Implemented on FPGA Hardware Development Platform Yu-Hu Wu, Chung-Bin Wu National Chung Hsing University

This paper proposes a detection application for crowd control. The main goal is to identify pedestrians at close range. The deep learning network is Yolo-like [1], modified into a network architecture that recognizes only pedestrians at close distances. The hardware uses Convolution and Detection Layer IP hardware accelerators and implements the Maxpooling and Shortcut functions on the FPGA platform, a Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit. Finally, the FPGA platform displays the recognition results over HDMI, showing only pedestrians detected at close distances and thereby achieving the detection needed for crowd control.

PD29 Design of Distributed Sorting Accelerator on Multiple FPGAs Yi-Ta Hsin and Bo-Cheng Lai Institute of Electronics, National Yang Ming Chiao Tung University

Database analysis is the key to revealing the important information hidden in sheer amounts of data. Sorting is one of the pivotal operations used extensively in database analysis and applications. Growing data volumes pose great challenges to achieving timely and scalable sorting in modern databases. FPGAs (Field Programmable Gate Arrays) have demonstrated promising performance in high-throughput data sorting. However, the limited in-system memory of an FPGA causes extensive off-chip data communication, which becomes the main bottleneck of sorting operations. A standalone design on a single FPGA further limits the scalability needed to handle the growing data volumes of emerging applications. This paper proposes the design of a distributed sorting accelerator. The implementation can be scaled to multiple FPGAs to handle large datasets. Experiments show that the proposed design attains up to a 10.1x throughput improvement over the previous FPGA design.
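The general pattern behind distributed sorting, which is to partition the data, sort each part independently, and then merge the sorted runs, can be sketched as follows. This is a generic software illustration and makes no claim about how the proposed accelerator partitions data or merges across FPGAs; the function name and the even chunking are assumptions.

```python
import heapq

def distributed_sort(data, num_workers):
    """Split, sort each chunk independently, then k-way merge the sorted runs."""
    if not data:
        return []
    chunk = -(-len(data) // num_workers)            # ceiling division
    # Each chunk stands in for the share of data one device would sort locally.
    runs = [sorted(data[i:i + chunk]) for i in range(0, len(data), chunk)]
    return list(heapq.merge(*runs))                 # streaming k-way merge

result = distributed_sort([5, 3, 9, 1, 7, 2, 8], num_workers=3)
```

The local sorts are independent and can run in parallel, so only the final merge depends on traffic between devices; keeping that traffic low is the bottleneck the abstract identifies.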

PD32 Exploration of Efficient Bluestein’s FFT Hardware Architecture for Fully Homomorphic Encryption Shi-Yong Wu, Kuan-Yu Chen and Ming-Der Shieh Department of Electrical Engineering, National Cheng Kung University

Fully homomorphic encryption (FHE) is a powerful cryptographic system that allows computations to be performed on encrypted data. To reduce the computational complexity of FHE operations, the double-CRT representation has been adopted in BGV-FHE cryptosystems, in which the second CRT, namely the polynomial CRT, can be viewed as a Discrete Fourier Transform (DFT). Since the point size of the DFT is usually not a power of two, traditional FFT algorithms cannot be applied directly to reduce the complexity. This paper explores an efficient VLSI architecture of Bluestein’s FFT for FHE applications. A mixed-radix single-port merged-bank memory addressing algorithm is presented to increase the effective memory bandwidth and reduce the required memory area at the same time. Moreover, a reconfigurable design that can process different numbers of data points, ranging from 4095 to 32767, is adopted to cope with different security levels in FHE applications.
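Bluestein's algorithm rewrites a DFT of arbitrary length N as a convolution with a chirp sequence, which can then be computed with power-of-two FFTs. The sketch below is a plain floating-point software version of that textbook algorithm; it does not model the memory addressing scheme or the reconfigurable architecture in the paper, and the helper names are ours.

```python
import cmath

def fft_pow2(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_pow2(x[0::2], inverse)
    odd = fft_pow2(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def bluestein_dft(x):
    """DFT of arbitrary length N via a power-of-two circular convolution."""
    n = len(x)
    m = 1
    while m < 2 * n - 1:                 # padded length: power of two >= 2N - 1
        m *= 2
    # Chirp w[k] = exp(-i*pi*k^2/N); k*k is reduced mod 2N to keep angles small.
    chirp = [cmath.exp(-1j * cmath.pi * ((k * k) % (2 * n)) / n) for k in range(n)]
    a = [x[k] * chirp[k] for k in range(n)] + [0j] * (m - n)
    b = [0j] * m
    b[0] = chirp[0].conjugate()
    for k in range(1, n):                # conjugate chirp, mirrored for negative lags
        b[k] = b[m - k] = chirp[k].conjugate()
    fa, fb = fft_pow2(a), fft_pow2(b)
    conv = fft_pow2([p * q for p, q in zip(fa, fb)], inverse=True)
    return [chirp[k] * conv[k] / m for k in range(n)]

spectrum = bluestein_dft([1, 2, 3, 4, 5, 6, 7])   # length 7 is not a power of two
```

The padded length is the smallest power of two that is at least 2N - 1, so a 4095-point transform uses 8192-point FFTs and a 32767-point transform uses 65536-point FFTs; this is the cost Bluestein's method pays to handle lengths that are not powers of two.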

