Slide 1

Slide 2
Motivation
Mobile embedded systems are present in:
- Cell phones
- PDAs
- MP3 players
- GPS units

Slide 3
Mobile Computing Design Considerations
- Low power
- Real-time data processing
- Small size
- Low cost
- Quick time to market

Slide 4
Metric Introduction
- Processor specialization
- Instruction set
- Interconnect
- Memory specialization
- Functional & data path units
- Power specialization

Slide 5
Metric: Processor Specialization
- Central controlling point of the embedded system
- Examples:
  - VLIW to perform multiple instructions in parallel
  - RISC architecture

Slide 6
Metric: Instruction Set Specialization
- Introduction of new instructions to extract optimal performance from the processor
- Examples:
  - Multiply-accumulate
  - Vector operations

Slide 7
Metric: Interconnect
- Provides the means for different modules to communicate
- Optimizations can lead to reduced complexity, cost, and power consumption

Slide 8
Metric: Memory Specialization
- Specialization is achieved by optimizing the number and size of memory banks and the number and size of access ports
- Optimizations can improve performance, power consumption, and chip area

Slide 9
Metric: Functional & Data Path Units
- Functional units are often specialized hardware units implementing a frequently used software algorithm
- Examples: DSP co-processors, interrupt priority co-processors, memory access modules, and timer modules

Slide 10
Metric: Power Specialization
- Major concern in mobile systems
- Kept under control by:
  - Using low voltage
  - Slow clock speed
  - Custom circuit solutions

Slide 11
Architectures to be discussed
- M*CORE
- D30V/MPEG
- SuperENC
- 1.3-GOPS Parallel DSP
- IA-32 w/ Enhanced Data Streaming

Slide 12
M*CORE
- Low power embedded applications
  - Wireless mobile devices
  - Cellular phones

Slide 13
M*CORE Processor Specialization
- Simple RISC architecture
- 4 stage pipeline
- 16-bit instruction word length
- Compiler designed in parallel with architecture
- Barrel shifter built into ALU

Slide 14
M*CORE Instruction Set Specialization
- Multimedia instructions
- Multiple data transfers from memory to register and register to memory
  - Fast register saves
- FF1 (Find First 1)
  - Finding highest priority interrupt in hardware

Slide 15
M*CORE Interconnect Specialization
- 16-bit data bus to match 16-bit word length
  - Reduces memory bandwidth, complexity, chip area layout, and power consumption
- MDI (MCU-to-DSP Interface)
  - Dual access memory messaging unit
- General I/O bus for peripherals

Slide 16
M*CORE Memory Specialization
- Alternate register bank
  - Fast register saves for context switches

Slide 17
M*CORE Functional & Data Path Units
- 32 channel programmable interrupt controller
- Protocol timer
- DSP core

Slide 18
M*CORE Power Specialization
- 1.8 Volts
- Uses 0.5 Watts
- Power aware pipeline
- Programmable power states
  - Stop
  - Wait
  - Doze
  - Normal

Slide 19
M*CORE Summary
- Low power and programmable power states make it ideal for mobile devices
- Interface to built-in DSP core makes it ideal for cell phone applications

Slide 20
650 MHz IA-32
- Microprocessor designed to accelerate data-streaming applications
  - Three-dimensional graphics
  - Video encode/decode

Slide 21
650 MHz IA-32 Processor Specialization
- IA-32 architecture
- 70 new instructions
- SIMD floating point data type
- Improvements in regard to circuit implementation

Slide 22
650 MHz IA-32 Instruction Set Specialization
- 70 new instructions
- SIMD FP operations
- Control for new 8-entry register file
- Multimedia extension
- 12 new integer instructions

Slide 23
650 MHz IA-32 Interconnect Specialization
- Front Side Bus of 66, 100, 133 MHz
- Back Side Bus
  - Half the clock frequency for mobile and desktop applications
  - Full clock frequency for server/workstation applications

Slide 24
650 MHz IA-32 Memory Specialization
- 3 new non-temporal store instructions with write combining buffers
  - Burst write protocol
  - Write data throughput of 1.066 Gbytes/sec on a 133 MHz bus
- 4 new data pre-fetch instructions
  - Overlap, reduces cache miss penalties

Slide 25
650 MHz IA-32 Functional Specialization
- 8 entry register file
  - Reduces register starvation for SIMD unit
  - 128 bits wide: four independent single precision elements packed in parallel
- Dedicated table based lookup unit for reciprocal operations
  - Completes reciprocal operations in one clock cycle
  - Error of 1.5 * 2^-12

Slide 26
650 MHz IA-32 Low Power Usage
- 1.4 V ~ 2.2 V at 650 MHz, close to room temperature

Slide 27
650 MHz IA-32 Performance
- 1.5X to 2.0X performance boost for 3-D transform and lighting kernels
- Real-time MPEG-2 video/audio encoding at 30 frames per second
- Achieved through improvement to SIMD unit, at a cost of only 2% increase of unit area size

Slide 28
D30V/MPEG
- Multimedia applications
  - Decoding MPEG-2

Slide 29
D30V/MPEG Processor Specialization
- 2 way VLIW
- Dual issue RISC pipeline
- 2 way assigned SIMD module
- Pipeline has ability to re-route data through execution path

Slide 30
D30V/MPEG Instruction Set Specialization
- Saturate and Add
- DSP instructions built in
  - Modulo addressing
  - Block repeat
  - Multiply accumulate
- Half word instructions
  - Effectively double number of usable registers

Slide 31
D30V/MPEG Interconnect Specialization
- Chip layout specialized for decoding streaming MPEG data

Slide 32
D30V/MPEG Memory Specialization
- 32 Kbyte data RAM
- 64 Kbyte instruction RAM
- 4 Kbyte RAM for Variable Length Encoder/Decoder (VLC/VLD) tables
- Special registers
  - MOD_S & MOD_E for modulo addressing
  - RPT_S, RPT_E, and RPT_C for looping

Slide 33
D30V/MPEG Functional Specialization
- VLC/VLD: Variable Length Encoding/Decoding units

Slide 34
D30V/MPEG Low Power Usage
- 2.5 Volts at 243 MHz
- Uses 2.0 Watts

Slide 35
D30V/MPEG Performance
- 12% speedup from inter-pipe bypasses
- Special VLC/VLD functional blocks speed up MPEG decoding

Slide 36
1.3 GOPS Parallel DSP
- Achieve real-time image processing capability
- Employ data parallelism to achieve goal
  - High level algorithms, non-parallelizable
    - Arithmetic encoding
  - Medium level algorithms, medium parallelizable
    - Contour tracking of binary images
  - Low level algorithms, highly parallelizable
    - Filters and transforms
    - Data independent control and data flow
    - 80% of MPEG-2, 60% of MPEG-4

Slide 37
1.3 GOPS Parallel DSP Processor Specialization
- Central control unit
  - RISC based
  - Controls multiple SIMD units

Slide 38
1.3 GOPS Parallel DSP Instruction Set Specialization
- VLIW instructions
  - 3 instructions per issue
    - 1 load/store of 16-bit data
    - 2 arithmetic operations on 16/32-bit data

Slide 39
1.3 GOPS Parallel DSP Interconnect Specialization
- DMA/MCU (Direct Memory Access/Memory Control Unit)
  - Handles cache misses
  - Performs prefetch operations from matrix memory
  - Interfaces with external 64-bit data bus and 32-bit address bus for SRAM and DRAM modules

Slide 40
1.3 GOPS Parallel DSP Memory Specialization
- Memory tailored to image processing needs
  - Provides parallel high bandwidth access to shared data with matrix shaped access patterns
- Individual cache memory
  - Services irregular memory requests

Slide 41
1.3 GOPS Parallel DSP Functional Specialization
- Multiple SIMD units
  - Currently 4 units for prototype
  - 16 units planned for future versions
- SIMD approach has been extended with ASIMD, an autonomous instruction selection capability
  - Improves handling of conditional branches

Slide 42
1.3 GOPS Parallel DSP Low Power Usage
- 3.3 Volts
- Using 650 milliwatts

Slide 43
1.3 GOPS Summary
- Sustained performance of 380 MIPS
- Around 90% utilization

Slide 44
SuperENC
- MPEG-2 video encoder

Slide 45
SuperENC Processor Specialization
- Software implemented RISC architecture
  - 5 stage pipeline
  - 81 MHz, 32-bit wide data/instruction path
- Software implemented SIMD/SDIF (SDRAM Interface) modules

Slide 46
SuperENC Instruction Set Specialization
- There is no instruction set specialization mentioned in the paper.
Slide 47
SuperENC Interconnect Specialization
- SDIF
  - All memory access goes through SDIF
  - Relays data without going to external memory
  - Reduces memory bandwidth and power consumption

Slide 48
SuperENC Memory Specialization
- Uses external RAM
  - Can access two 16 Mbit SDRAMs or one 64 Mbit SDRAM

Slide 49
SuperENC Functional Specialization
- MPEG algorithm is broken up into hardware functional blocks
- Examples:
  - DCT, Discrete Cosine Transform
  - IDCT, Inverse Discrete Cosine Transform
  - ME, Motion Estimation
  - MC, Motion Compensation

Slide 50
SuperENC Low Power Usage
- 2.5 Volts internal
- 3.3 Volts I/O
- 1.5 Watts

Slide 51
SuperENC Summary
- SuperENC makes use of many hardware functional blocks to implement the MPEG encoding algorithm

Slide 52
Metric Results
- D30V/MPEG highest rated
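
Supplementary note (not part of the original slides)
Slide 49 lists the DCT and IDCT among the SuperENC hardware functional blocks. As a rough software illustration of what such a block computes, the sketch below applies a textbook orthonormal 2-D DCT-II to an 8x8 block (the block size MPEG-2 uses) and inverts it. It assumes nothing about the actual SuperENC circuit; the function names (dct_1d, idct_1d, dct_2d, idct_2d) are hypothetical and chosen only for this example.

```python
import math

N = 8  # MPEG-2 applies the DCT to 8x8 blocks


def _scale(k, n):
    # Orthonormal scaling factor for coefficient index k.
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct_1d(x):
    """Orthonormal DCT-II of a sequence (direct O(n^2) form)."""
    n = len(x)
    return [
        _scale(k, n)
        * sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        for k in range(n)
    ]


def idct_1d(c):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(c)
    return [
        sum(
            _scale(k, n) * c[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k in range(n)
        )
        for i in range(n)
    ]


def _transpose(block):
    return [list(col) for col in zip(*block)]


def dct_2d(block):
    """Separable 2-D DCT: transform every row, then every column."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(c) for c in _transpose(rows)]
    return _transpose(cols)


def idct_2d(coeffs):
    """Separable 2-D inverse DCT: invert every column, then every row."""
    cols = [idct_1d(c) for c in _transpose(coeffs)]
    return [idct_1d(r) for r in _transpose(cols)]


if __name__ == "__main__":
    flat = [[128.0] * N for _ in range(N)]  # a uniformly grey block
    coeffs = dct_2d(flat)
    # All of the block's energy lands in the DC coefficient coeffs[0][0].
    print(round(coeffs[0][0], 6))
```

A uniform block collapses to a single nonzero coefficient, which is the property an encoder exploits before quantization. The direct form above costs O(n^2) multiply-accumulates per 1-D transform, which gives a sense of why encoders such as the one described here move the transform into a dedicated hardware block.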