Saad Ashfaq

AI Compiler Engineer at Deeplite
Contact Information
us****@****om
(386) 825-5501
Location
Greater Toronto Area, Canada
Languages
  • English: Native or bilingual proficiency
  • Urdu: Native or bilingual proficiency

Experience

    • Canada
    • Software Development
    • 1 - 100 Employees
    • AI Compiler Engineer
      • Mar 2021 - Present

      • Developed Deeplite Runtime (DLRT) as an inference engine based on the TVM compiler stack enabling deployment of ultra-low precision quantized deep learning models on Arm Cortex-A based platforms
      • Implemented and optimized bitserial convolution kernels for Armv7 and Armv8 architectures to accelerate the execution of quantized convolution layers with extremely low bitwidth (1-2 bits) weights and activations
      • Defined transformation passes to convert trained fake quantized models to ultra low-bit representations that can be executed efficiently on Arm target devices
      • Designed a mixed precision approach to selectively offload convolution layers to 32-bit full-precision, 8-bit integer or ultra-low precision bitserial kernels to minimize the accuracy drop of quantized models
      • Profiled and compared the end-to-end model execution time of ultra-low bit quantized models on DLRT against open-source inference engine and runtime frameworks achieving up to 7x improvement in latency over the full-precision baseline

    • China
    • Telecommunications
    • 700 & Above Employees
    • Senior Software Engineer - DaVinci AI Compilers
      • Mar 2020 - Mar 2021

      • Implemented compiler optimizations in the LLVM framework with a focus on loop optimizations
      • Developed middle end and back end passes to improve low-level performance of the vector execution unit in Huawei's AI processor
      • Defined passes and runtime libraries to enable support of tensor data structure in the C+T programming language
      • Created regression and end-to-end tests to validate functionality and performance
      • Collaborated with teams across geographic locations on a regular basis

    • United States
    • Semiconductor Manufacturing
    • 700 & Above Employees
    • Senior Software Development Engineer - Platform Security
      • Jun 2018 - Mar 2020

      • Developed the Secure OS and Bootloader components in a Trusted Execution Environment (TEE) for the ARM Cortex-A5 security coprocessor on AMD dGPUs
      • Enabled virtualization security features based on the SR-IOV specification for key customers including Google Stadia, Microsoft Project xCloud and Amazon Web Services
      • Implemented low-level features such as single-GPU and multi-GPU ASIC reset, secure firmware loading, power saving and ECC error handling on server dGPU products
      • Resolved issues in the security software stack during pre-silicon, ASIC bring-up and post-release stages

    • Software Development Engineer - Multimedia
      • Aug 2016 - Jun 2018

      • Developed user and kernel mode layers of the multimedia driver stack for dGPU and APU projects in accordance with the Windows Display Driver Model
      • Optimized power management and job scheduling for the firmware executing on the embedded multimedia microprocessor
      • Participated in the bring-up phase of future multimedia IPs on FPGA and in simulation
      • Designed internal tools to validate software functionality of the driver stack
      • Debugged and resolved issues reported by customers and internal validation teams

    • United States
    • Semiconductor Manufacturing
    • 700 & Above Employees
    • System Validation Intern
      • May 2014 - Jul 2015

      • Executed test procedures to validate features of dGPU products
      • Created scripts to automate manual validation test cases using Ruby, bash and batch
      • Performed clock profiling experiments reducing the typical runtime duration from 3 days to less than 5 hours
      • Debugged post-silicon issues in collaboration with cross-functional teams

Education

  • University of Toronto
    Master of Engineering - MEng, Computer Engineering
    2017 - 2020
  • University of Toronto
    Bachelor of Applied Science - BASc, Computer Engineering
    2011 - 2016
