"Designing and Selecting Instruction Sets for Vision," a Presentation From Cadence
1. Designing and Selecting Instruction Sets for Vision
Chris Rowen
12 May 2015
Copyright 2015 Cadence Design Systems

2. Cadence in a Nutshell
A top design automation supplier:
  analog, digital and system verification tools
  interface, processor and protocol verification IP
Leading supplier of DSP and other data-rich embedded processing cores and software: Xtensa Innovation Platform
IVP family: advanced imaging and vision DSP cores with almost 1000 library functions and applications
Massively parallel SIMD/VLIW processors with automated configurability and extensibility of ISA, memory, and interface
One of Fortune Magazine's Top 100 Places to Work

3. Outline
The Vision Performance Challenge
The Vision Instruction Set Puzzle
Application Diversity Drives ISA Flexibility
The Hardwired Accelerator Problem
Examples:
  Pedestrian Detection
  Lane Departure Warning
  Convolutional Neural Network
Wrap-up

4. The Vision Performance Challenge
ADAS processing requirements are high
VGA: approaching 100 GOPs
Source: SoC for car navigation systems with a 53.3 GOPS image recognition engine, Hot Chips 21 (2009)

5. The Vision Performance Challenge
1080p60 ADAS is a teraOp problem
Complexity grows an order of magnitude for full HD processing
Accelerating algorithmic sophistication
Scaling best addressed by:
  more parallelism
  application-specific optimizations
  architectural enhancements
  move to advanced process nodes
A good architecture:
  accelerates core functions
  supports a wide range of application-specific optimizations
[Chart: Computation increase with resolution (brute force approach), for QVGA, VGA, HD, and Full HD]

6.
The Vision Instruction Set Puzzle
Key dimensions:
  Local memory bandwidth
  Memory hierarchy for data streaming
  Data types
  SIMD/vector organization
  Scalar operation bandwidth
  Instruction issue parallelism (VLIW)
  Vision-specific operations
  Multi-processor support
What to look for:
  1. High local memory bandwidth
  2. Effective latency hiding for DDR access
  3. Data types: 8b, 16b, 32b fixed-point, floating point
  4. Sustained ops/cycle from combination of VLIW and SIMD
  5. Vision-specific operations: 2D data access, histogram, convolution, search, non-linear functions
  6. Automatic compiler inference of vectors, complex operations
  7. Scale-up with custom operations
  8. Scale-up with parallel cores

7. Application Diversity Drives ISA Flexibility
Real design is full of trade-offs:
  Memory reference vs. ALU ops
  Multiplies vs. other ALU ops
  Mix of scalar vs. vector ops
  Vector computation vs. data reorganization
Measured a set of 45 major kernels and applications in vision and imaging
Look at key ratios to assess trends
Functions include:
  Face detection
  Fast9
  SURF
  Oriented FAST and Rotated BRIEF feature detector (ORB)
  Harris Corners
  H.265 Motion Compensation
  Haar Cascade and Classifiers
  Optical Flow
  Affine transform
  Perspective Warp
  Various filters: bilateral, denoising
  High Dynamic Range
  Color space and format conversions
  Histogram equalization

8. Application Diversity Drives ISA Flexibility
Typically several ALU ops per load operation
Wide range of ALU : Load/store ratio (1:2 to 5:1)
Many important functions don't do multiplies
A fraction have very heavy multiply usage, e.g. convolutions
ISA should handle wide range of ratios efficiently
[Chart: Memory ops vs ALU ops; x-axis: Vector ALU Ops per Instruction; y-axis: Vector Load/Store Ops per Instruction]
[Chart: Multiply ops vs Other ALU ops; x-axis: Multiply Vector Ops per Instruction; y-axis: Other ALU Vector Ops per Instruction]

9. Application Diversity Drives ISA Flexibility
A successful architecture maximizes the fraction of kernels that can be vectorized
A small number of functions may still use scalar ops heavily
On-the-fly data reorganization may be important in a few kernels
ALU : Reorg ratio varies from 10:1 to 1:1
Efficient data reorganization boosts benefit of vectorization
[Chart: Scalar ops vs Vector ops; x-axis: Scalar Ops per Instruction; y-axis: Vector Ops per Instruction]
[Chart: ALU ops vs Data Reorganization ops; x-axis: ALU Vector Ops per Instruction; y-axis: Data Reorg Vector Ops per Instruction]

10. The Hardwired Accelerator Problem
Certain tasks beg for immense performance: hardwired functions are tempting
Issue 1: Changes in algorithms hard to anticipate. Hardwired functions often under-used on deployed systems
Issue 2: Hardwired functions difficult to control from software: operation start/stop, memory management, context switching
Techniques to improve hardwired functions:
  Flexible chaining of interface to hardwired blocks: more reusable primitives
  Direct incorporation into processor ISA: instruction-mapped instead of memory-mapped
[Diagram: Processor, Accelerator]

11.
Pedestrian Detection Application Example
Key functions (% of processing), with the operations and precisions each one uses:
Pyramid generation (10%)
  Fractional coordinate calculations (16b coordinates)
  Pixel interpolations (8b values)
Gradient magnitude and orientation calculation (25%)
  Finite differences or Sobel (8b pixels)
  Sum of squares (8/16b gradients)
  Square root (16/32b values)
  Divide (8/16b values)
  Arctan (8/16b values)
Histogram of Gradients calculation (25%)
  Magnitude projection on bins (16b values)
  Weighted histograms (16b values)
Histogram normalization (5%)
  L1 (sum) or L2 (sum of squares) (16b values)
  Square root (32b values)
  Divide (16b values)
SVM Classifier (35%)
  Multiply accumulate (16b values)
A good architecture supports a wide variety of operations and precisions

12. Pedestrian Detection Application Example
Camera system parameters (resolution, field of view, focal length) determine person height vs. distance
Dynamically trade off detection latency for far-away pedestrians based on vehicle speed: higher resolution levels may not need high frame rate!
[Figure: pinhole camera geometry with labels h, f, D, H, fov, and detection resolution]
Using pinhole camera model: $\frac{h}{f} = \frac{H}{D}$, where $f$ follows from the field of view and the detection resolution through $\tan(\mathrm{fov}/2)$
Ref: Pedestrian Detection: An Evaluation of the State of the Art, IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume: 34, Issue: 4

13. Lane Departure Warning Processing Functions
Pipeline stages: Camera input, Pre-processing, Feature extraction, Post-processing, Tracking, Road and vehicle model
Functions used (in slide order; the diagram marks the stages with high computations):
  Color conversion
  Noise removal
  Contrast enhancement
  Steerable/Gabor filters
  Image segmentation (intensity, color)
  Pyramid generation
  Perspective warp
  Edge detection (Sobel, Canny)
  Edge magnitude and orientation
  Edge directional response
  Thresholding
  Morphology
  Corner detection (Harris, Fast, ..)
  Hough transform
  Neural network
  Template matching and updating
  Road model fitting
  Outlier removal
  Connected components
  Vehicle data (speed, steering)
  Constant curvature, Parabolic
  Kalman filter
A wide variety of functions are used and must be well supported

14. Cascade Classifier Application Example
Parallelism drops quickly in traditional SIMD implementation after early stages of cascade
Need architectural approach to exploit available parallelism:
  Distributed detection windows
  Distributed features within a detection window
Switch type of parallelism as you progress through the cascade:
  parallelize over pixels in window
  parallelize over windows
  parallelize over features
A good architecture supports many types of parallelism
[Chart: Parallelism in conventional SIMD processor in different cascade stages; x-axis: Cascade Stages (1 to 22); y-axis: Conventional SIMD parallelism (0 to 1)]

15. Convolutional Neural Network (CNN) Example
Key computational kernels in CNN are:
  Convolution (highest cost)
  Subsampling (box filter, max pooling)
  Non-linear function (Tanh, Sigmoid)
For practical implementations a range of tradeoffs are possible for convolutions:
  Precision tradeoffs
  Separable kernels
  Symmetric kernels
A good architecture supports a range of options for fast convolutions
[Diagram: Input -> Convolution -> Non-linearity -> Sub-sampling -> Repeat -> Classifier -> Result (face identified)]
  Convolutions model locally receptive visual cortex cells by sampling a small region and generating features
  Non-linearity like tanh function models on-off behavior of neurons
  Subsampling models cells with larger receptive fields (provide local invariance)
  Repeat previous steps for neural network layers
  Final classifier stage

16.
Wrap-Up
How to Choose a Vision Processor ISA:
  Measure on your real application; don't just look at paper feeds-and-speeds
  Expect massive parallelism
  Look for balance and versatility in available operations
  Consider not just raw ops rate, but also ability to handle complex data organization and on-the-fly reorganization
  The compiler is part of the ISA: look at efficiency, robustness and analysis tools
  Judge hardwired accelerators by reusability on possible future applications
  Look for multi-processor support in hardware and software

17. Wish List
More readily-available imaging/vision source code, including OpenVX graphs
Open reference video streams for testing vision apps
More substance and less hype around CNN and ADAS:
  Standard input data sets
  Standard description of neural networks
  Reference trained parameters

18. Resources
Cadence Imaging/Vision Products: http://ip.cadence.com/ipportfolio/tensilica-ip/image-video-processing
Some Cadence Vision Partners:
  Morpho: http://www.morphoinc.com/en/
  Almalence: http://www.almalence.com/
  Irida Labs: http://www.iridalabs.gr/
  Ittiam: http://www.ittiam.com/
  Dream Chip: http://www.dreamchip.de/
OpenVX: https://www.khronos.org/openvx/
Cadence, Xtensa and Tensilica are registered trademarks of Cadence Design Systems, Inc. All other trademarks and logos are the property of their respective holders.
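
The pedestrian detection breakdown lists finite differences, sum of squares, square root, arctan, magnitude projection on bins, weighted histograms, and L2 normalization. The sketch below shows one way these operations might chain together for a single image patch. It assumes 9 unsigned orientation bins over 0 to 180 degrees and a linear split of each vote between two adjacent bins; both choices come from common HOG practice rather than from the slides, and the function names are hypothetical. It uses floating point for readability, whereas the presentation refers to 8b/16b/32b fixed-point values.

```python
import math


def cell_histogram(patch, num_bins=9):
    """Weighted orientation histogram for one grayscale patch (list of rows).

    num_bins=9 and the unsigned 0..180 degree range are assumptions.
    """
    bin_width = 180.0 / num_bins
    hist = [0.0] * num_bins
    rows, cols = len(patch), len(patch[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            # Finite differences (central, not divided by 2).
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            # Sum of squares followed by square root.
            magnitude = math.sqrt(gx * gx + gy * gy)
            # Arctan, folded to an unsigned angle in degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # Split the magnitude between bin k (reference angle
            # k * bin_width) and the next bin, wrapping at 180 degrees.
            position = angle / bin_width
            lower = int(position) % num_bins
            upper = (lower + 1) % num_bins
            weight_upper = position - int(position)
            hist[lower] += magnitude * (1.0 - weight_upper)
            hist[upper] += magnitude * weight_upper
    return hist


def l2_normalize(hist, eps=1e-6):
    """L2 normalization: sum of squares, square root, then divide."""
    norm = math.sqrt(sum(v * v for v in hist) + eps)
    return [v / norm for v in hist]


if __name__ == "__main__":
    # Horizontal intensity ramp: every gradient points along x.
    ramp = [[2 * x for x in range(4)] for _ in range(4)]
    print(l2_normalize(cell_histogram(ramp)))
```

Even this small loop mixes multiplies, a square root, an arctangent, a divide, and scattered histogram updates, which is the variety of operations the slide attributes to the workload.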
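
The pedestrian detection slides state that resolution, field of view, and focal length determine person height versus distance under the pinhole camera model. A minimal sketch of that relationship follows. It assumes h is the person's height in pixels, H the physical height, D the distance, and f the focal length in pixels obtained from the vertical field of view and the image height. The function names, the 1.7 m person height, the 40 degree field of view, and the 64 pixel minimum window are illustrative assumptions, not values from the presentation.

```python
import math


def focal_length_pixels(image_height_px, vertical_fov_deg):
    """Pinhole focal length in pixels: f = (rows / 2) / tan(fov / 2)."""
    half_fov = math.radians(vertical_fov_deg) / 2.0
    return (image_height_px / 2.0) / math.tan(half_fov)


def person_height_pixels(person_height_m, distance_m,
                         image_height_px, vertical_fov_deg):
    """Similar triangles: h / f = H / D, so h = f * H / D."""
    f = focal_length_pixels(image_height_px, vertical_fov_deg)
    return f * person_height_m / distance_m


def max_detection_distance(person_height_m, min_window_px,
                           image_height_px, vertical_fov_deg):
    """Largest distance at which a person still spans min_window_px rows."""
    f = focal_length_pixels(image_height_px, vertical_fov_deg)
    return f * person_height_m / min_window_px


if __name__ == "__main__":
    # Assumed camera: 40 degree vertical field of view, 64 px minimum window.
    for name, rows in [("VGA", 480), ("HD", 720), ("Full HD", 1080)]:
        reach = max_detection_distance(1.7, 64, rows, 40.0)
        print(f"{name}: person detectable out to about {reach:.1f} m")
```

Under these assumptions the detection range grows linearly with the number of image rows, which is one way to read the slide's point that the highest resolution levels serve far-away pedestrians and may tolerate a lower frame rate.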
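
The CNN slide names separable kernels as one of the tradeoffs available for convolutions. The sketch below illustrates the idea in plain Python: when a K x K kernel is the outer product of a column vector and a row vector, two 1D passes give the same result as the direct 2D sum, with roughly 2K multiplies per output instead of K squared. The helper names and the [1, 2, 1] smoothing kernel are assumptions chosen for illustration. As in most CNN code, the kernel is applied without flipping (correlation), and only the "valid" region without padding is computed.

```python
def convolve2d_valid(image, kernel):
    """Direct 2D sliding-window sum over the valid region (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def convolve_separable_valid(image, col_vec, row_vec):
    """Same result for kernel[i][j] = col_vec[i] * row_vec[j], in two passes."""
    kh, kw = len(col_vec), len(row_vec)
    # Horizontal pass: filter every row with row_vec.
    tmp = [
        [sum(row[x + j] * row_vec[j] for j in range(kw))
         for x in range(len(row) - kw + 1)]
        for row in image
    ]
    # Vertical pass: filter every column of the intermediate with col_vec.
    out_h = len(tmp) - kh + 1
    return [
        [sum(tmp[y + i][x] * col_vec[i] for i in range(kh))
         for x in range(len(tmp[0]))]
        for y in range(out_h)
    ]


# Assumed example: 3x3 binomial smoothing kernel built from [1, 2, 1].
smooth = [1, 2, 1]
kernel = [[a * b for b in smooth] for a in smooth]
image = [[(3 * y + 5 * x) % 7 for x in range(6)] for y in range(5)]

direct = convolve2d_valid(image, kernel)
separable = convolve_separable_valid(image, smooth, smooth)

if __name__ == "__main__":
    print("results match:", direct == separable)
    print("multiplies per output:", len(smooth) ** 2, "direct vs",
          2 * len(smooth), "separable")
```

The saving grows with kernel size, but it only applies when the kernel actually factors this way, which is why the slide presents it as one option among several rather than a general replacement.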