Prototyping Next-Gen Tegra SoC
DVCon India
Sivarama Prasad Valluri & Ramanan Sanjeevi Krishnan
© Accellera Systems Initiative
Agenda
Introduction
Prototyping flow overview
RTL Conversion Challenges
Partitioning Challenges
Pin Multiplexing Challenges
Other Challenges
Results & Conclusion
INTRODUCTION
Project Overview
• Requirements for Prototyping Tegra SoC
– Prototype RTL close to ASIC RTL
– Kernel boot on multi-processor setup
– Faster time to prototype, for early SW development
– Faster turnaround of bit-streams for incremental RTL drops
– Achieve FPGA prototyping an “order of magnitude” faster than emulation
– Support all HSIOs (HDMI, SATA & PCIe)
PROTOTYPING FLOW OVERVIEW
Prototyping Flow
• Inputs: Converted RTL + FDC + Board Files
• Tools: Certify, Synplify Premier + Xilinx Vivado
• Steps
– Partitioning
– Pin Multiplexing
– Trace Assignment
– Project creation per FPGA & time budgeting
– Synthesis and P&R per FPGA (FPGA1, FPGA2, … FPGAn)
• Prototyping Platform: Synopsys HAPS
RTL CONVERSION CHALLENGES
Handling Clocks
• Large number of clocks in ASIC, compared to limited FPGA global clock resources
• Clock generation, gating & mux-ing logic in clock paths
• Clock skew introduced due to FPGA partitioning
Approach
• Merged related clocks and reduced to 6 global clocks
• Used global clocks to generate necessary clocks in each FPGA
• Replaced logic in clock path with equivalent FPGA blocks
• Used synthesis tool to convert remaining clock gates
Reset Synchronization
• Global SoC reset drives sequential elements across the entire design
• Critical to ensure that reset is released at the same time across all FPGAs
Approach
• Modify RTL to add pipeline stages to reset signal
• Use the pipeline tree to achieve the reset synchronization
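Reset Synchronization Tree
[Figure: reset synchronization tree. An asynchronous reset fans out through pipeline stages (nodes labeled 1; 3, 3; 3, 4, 4) and continues to the next system.]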
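The pipelined reset tree can be viewed as a latency-balancing problem. The sketch below is illustrative only and is not the authors' RTL change: the FPGA names and stage counts are hypothetical, and it assumes every pipeline stage costs one cycle of a common clock. It computes how many extra stages each FPGA would need so that reset is released on the same cycle everywhere.

```python
def balance_reset_stages(stages_from_source):
    """Return extra pipeline stages per FPGA so all see reset release together.

    stages_from_source maps an FPGA name to the number of register stages the
    reset already passes through before reaching it (hypothetical values).
    """
    deepest = max(stages_from_source.values())
    return {fpga: deepest - n for fpga, n in stages_from_source.items()}


# Hypothetical depths, for illustration only.
stages = {"FPGA_A": 1, "FPGA_B": 3, "FPGA_C": 4}
padding = balance_reset_stages(stages)
for fpga, extra in sorted(padding.items()):
    print(f"{fpga}: add {extra} stage(s), total latency {stages[fpga] + extra}")
```

Padding every branch to the depth of the deepest one is the simplest balancing rule; a real design would also have to account for inter-FPGA trace delay.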
PARTITIONING CHALLENGES
Initial Partition Approach
First-time partition
• Ran area estimation
• Partitioned design based on
– Design hierarchies & IP area/size
– External interface proximity
– Layout of the multi-HAPS system
Interconnect Problem
Huge number of interconnects (ICs) between FPGAs
---------------------------
@W: CU603 |Actual I/O count(1558) after CPM exceeds the total I/O count(1200) for device <>
@W: CU603 |Actual I/O count(3794) after CPM exceeds the total I/O count(1200) for device <>
@W: CU603 |Actual I/O count(14677) after CPM exceeds the total I/O count(1200) for device <>
…
@W: CU603 |Actual I/O count(13876) after CPM exceeds the total I/O count(1200) for device <>
@W: CU603 |Actual I/O count(20724) after CPM exceeds the total I/O count(1200) for device <>
@W: CU603 |Actual I/O count(26143) after CPM exceeds the total I/O count(1200) for device <>
---------------------------
Approach
• Change partition to reduce interconnects
• Additionally used pin-multiplexing techniques to address this
• Used HSTDM, the Synopsys Certify pin-multiplexing scheme
• Only pin-multiplexed the flop-to-flop signals
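To get a feel for how far each partition is over budget, the CU603 warnings above can be parsed and turned into a rough lower bound on the multiplexing ratio. This is a sketch under simplifying assumptions: it treats every signal as multiplexable at one uniform ratio and ignores any pins the multiplexing scheme itself consumes, so the real requirement will be higher. The regular expression matches only the warning format shown in the log excerpt.

```python
import math
import re

# Matches the CU603 warning text as it appears in the log excerpt.
CU603 = re.compile(
    r"@W: CU603 \|Actual I/O count\((\d+)\) after CPM exceeds "
    r"the total I/O count\((\d+)\)"
)


def min_mux_ratios(log_text):
    """Return (actual, limit, ceil(actual / limit)) for each CU603 warning."""
    results = []
    for match in CU603.finditer(log_text):
        actual, limit = int(match.group(1)), int(match.group(2))
        results.append((actual, limit, math.ceil(actual / limit)))
    return results


sample_log = """\
@W: CU603 |Actual I/O count(1558) after CPM exceeds the total I/O count(1200) for device <>
@W: CU603 |Actual I/O count(26143) after CPM exceeds the total I/O count(1200) for device <>
"""
ratios = min_mux_ratios(sample_log)
for actual, limit, ratio in ratios:
    print(f"{actual} signals on {limit} pins: at least {ratio}:1 multiplexing")
```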
Partition attempts to reduce the ICs
• Moving blocks with a large number of inputs and fewer outputs into the source FPGAs
• Moving blocks whose signals go from one FPGA to another and come back
[Figure: partition examples. Modules M1 and M2 placed across FPGA A and FPGA B, connected by 256-bit and 300-bit buses, with some buses continuing to FPGA C; panels show the interconnects before and after moving a module.]
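The effect of moving a block can be estimated before committing to a new partition. The following is a minimal sketch, assuming the design is summarized as a list of point-to-point buses; module names, widths and placements are illustrative and not taken from the actual design.

```python
def cut_width(nets, placement):
    """Sum the widths of nets whose driver and load sit in different FPGAs."""
    return sum(width for src, dst, width in nets
               if placement[src] != placement[dst])


# (driver, load, bus width in bits); all names and widths are illustrative.
nets = [
    ("M1", "M2", 256),
    ("M2", "M1", 300),
    ("M2", "M3", 300),
]
before = {"M1": "FPGA_A", "M2": "FPGA_B", "M3": "FPGA_A"}
after = dict(before, M2="FPGA_A")  # move M2 next to the blocks it talks to

print("interconnects before:", cut_width(nets, before))
print("interconnects after: ", cut_width(nets, after))
```

Here M2 only exchanges data with blocks in FPGA A, so every one of its buses leaves FPGA A and comes back; moving it removes those crossings.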
Partition attempts to reduce the ICs (2)
• Huge number of non-F2F (non-flop-to-flop) ICs going across multiple FPGAs, which cannot be pin-muxed
Approach
• Design insight from IP team
• Identified combinational buses running across multiple IPs across FPGAs
• Moved all the logic into a single FPGA
Addressing FPGA Clock Crossings
Inter-FPGA clock crossings
• Introduce clock skew in the destination FPGA
• Require more clock-capable I/O pins
Approach
Used automation to address the following
• Used HDL Analyst (find/expand commands) to analyze the partitioned netlist
• Populated the list of clock crossings and their loads into logs
Fixed them by replicating the clock generation logic
[Figure: clock crossing example. Full Design: IP 1 registers (R1 … Rmn) clocked by CLK_ip1, which clkgen derives from CLK. After partitioning: FPGA 1 holds IP 1 and clkgen, FPGA 2 holds IP 1-Part2 (Rm1 … Rmn), and CLK_ip1 crosses between the FPGAs. Fix: clkgen is replicated so that it is driven from CLK.]
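The clock-crossing check can be sketched as a lookup over the partitioned design. This is not the authors' HDL Analyst script; it assumes the netlist has already been reduced to three hypothetical tables (where each clock is generated, which registers it drives, and where each register is placed) and reports the FPGAs that would need a replicated clkgen.

```python
from collections import defaultdict


def find_clock_crossings(clock_source_fpga, clock_loads, placement):
    """Map each clock to the foreign FPGAs that hold some of its loads.

    clock_source_fpga: clock name -> FPGA holding its generation logic
    clock_loads:       clock name -> list of register instance names
    placement:         register instance name -> FPGA
    """
    crossings = defaultdict(lambda: defaultdict(list))
    for clock, loads in clock_loads.items():
        home = clock_source_fpga[clock]
        for reg in loads:
            if placement[reg] != home:
                crossings[clock][placement[reg]].append(reg)
    return {clk: dict(by_fpga) for clk, by_fpga in crossings.items()}


# Illustrative data loosely following the figure's labels.
clock_source_fpga = {"CLK_ip1": "FPGA1"}
clock_loads = {"CLK_ip1": ["R1", "R2", "Rm1", "Rm2"]}
placement = {"R1": "FPGA1", "R2": "FPGA1", "Rm1": "FPGA2", "Rm2": "FPGA2"}

crossings = find_clock_crossings(clock_source_fpga, clock_loads, placement)
for clock, by_fpga in crossings.items():
    for fpga, regs in by_fpga.items():
        print(f"{clock}: replicate clkgen in {fpga} for loads {regs}")
```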
So many partition trials – Any simpler way?
Each partitioning change takes time to apply and to check for its impact on the ICs and clock crossings
• Iterative process: more than one run needed
• Manual runs: not efficient and prone to human error
• UI: not the most efficient way, as human intervention is needed
• Batch mode: how to check the impact?
Approach
Used automation to do the following
• Apply partition file on the design
• Generate an Excel sheet with the interconnection matrix & calculated connector count info
• Check clock crossings & analyze the partitioned netlist file
• E-mail report
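The interconnection-matrix step can be approximated in a few lines. This is a minimal sketch rather than the authors' flow: it assumes the design is available as a list of (driver, load, width) buses plus a block-to-FPGA placement, writes CSV text that a spreadsheet can open, and uses a made-up `pins_per_connector` value with one signal per pin (no multiplexing).

```python
import csv
import io
import math
from collections import Counter


def interconnect_matrix(nets, placement):
    """Count signals crossing between each unordered pair of FPGAs."""
    matrix = Counter()
    for src, dst, width in nets:
        a, b = placement[src], placement[dst]
        if a != b:
            matrix[tuple(sorted((a, b)))] += width
    return matrix


def report_csv(matrix, pins_per_connector):
    """Render the matrix as CSV text, one row per FPGA pair."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["fpga_a", "fpga_b", "signals", "connectors"])
    for (a, b), signals in sorted(matrix.items()):
        writer.writerow([a, b, signals, math.ceil(signals / pins_per_connector)])
    return out.getvalue()


# Hypothetical blocks, bus widths and placement.
nets = [("ip_a", "ip_b", 120), ("ip_b", "ip_c", 70),
        ("ip_c", "ip_a", 40), ("ip_a", "ip_d", 64)]
placement = {"ip_a": "F1", "ip_b": "F2", "ip_c": "F2", "ip_d": "F3"}

matrix = interconnect_matrix(nets, placement)
report = report_csv(matrix, pins_per_connector=50)  # 50 is an assumed value
print(report)
```

Running this once per candidate partition file gives a quick batch-mode comparison without opening the UI.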
PIN MULTIPLEXING CHALLENGES
Slack-Based HSTDM
Selection of appropriate HSTDM ratios
• No single-button flow for HSTDM selection based on slack
• Not possible to hand-pick the HSTDM ratios based on slack for 50k+ signals
Approach
• Developed a TCL script to do the slack-based HSTDM placements
• Script applies the HSTDM ratio based on slack
• Applies higher HSTDM ratios for slow signals and lower HSTDM ratios for fast signals
• Optimizes the number of ICs with clean timing
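The selection rule itself is simple to state in code. The sketch below is written in Python rather than TCL and is not the authors' script: the ratio/delay table, signal names and slack values are all hypothetical, and real delays depend on the tool, board and HSTDM bit rate. It picks, for each signal, the highest ratio whose added delay still fits in the available slack.

```python
# Hypothetical (ratio, added delay in ns) pairs, highest ratio first.
RATIO_DELAYS_NS = [(32, 80.0), (16, 50.0), (8, 35.0), (4, 25.0)]


def pick_ratio(slack_ns):
    """Return the highest ratio whose added delay fits in the slack, else None."""
    for ratio, delay_ns in RATIO_DELAYS_NS:
        if delay_ns <= slack_ns:
            return ratio
    return None  # too fast to multiplex under this table


def assign_ratios(slacks_ns):
    """Apply pick_ratio to every signal in a {signal: slack} mapping."""
    return {sig: pick_ratio(slack) for sig, slack in slacks_ns.items()}


# Hypothetical signals and slack values in ns.
slacks = {"dbg_bus[0]": 120.0, "axi_data[7]": 40.0, "irq": 10.0}
assignment = assign_ratios(slacks)
for sig, ratio in assignment.items():
    print(sig, "->", f"{ratio}:1" if ratio else "no multiplexing")
```

Slow signals end up sharing a pin with many others, while signals with little slack keep a low ratio or a dedicated trace, which is what keeps the timing clean.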
TIME BUDGETING CHALLENGES
Time-Budgeting Issues
Multiple issues seen in SLP time-budgeting steps
• Zero/negative values seen in tool-generated FDC
• Slack not accurately evaluated
• Missing constraints
Approach
• Slack-based HSTDM
• Used automation to address these issues
– Certify HDL Analyst to analyze the partitioned netlist, plus TCL commands (find, expand, etc.) to analyze paths in batch mode and write out the evaluated constraints
– Created incremental FDCs containing the missing, zero and negative constraints, and added them to the flow to fix the problem constraints in the original FDCs
Bit-stream Generation/Turnaround Time
Turnaround time
• Complete RTL going into all the individual FPGA projects
• All modified modules going into all the individual FPGA projects
Approach
• Developed scripts to generate the RTL list per FPGA project
• Developed scripts to identify and split the modified modules per partition
• Reduces the compilation time per partition from 3-4 hours to 30-40 minutes
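The two scripts can be pictured as simple groupings over the partition result. This is a sketch under strong simplifications, not the authors' implementation: it assumes a flat one-module-per-file mapping and a module-to-FPGA placement, and all module names and paths are hypothetical.

```python
def files_per_fpga(module_files, placement):
    """Group RTL files by the FPGA their module is partitioned into."""
    per_fpga = {}
    for module, path in module_files.items():
        per_fpga.setdefault(placement[module], []).append(path)
    return {fpga: sorted(paths) for fpga, paths in per_fpga.items()}


def modified_per_fpga(modified_modules, placement):
    """Split an RTL drop's modified modules by the partition that owns them."""
    per_fpga = {}
    for module in modified_modules:
        per_fpga.setdefault(placement[module], []).append(module)
    return per_fpga


# Hypothetical modules, file paths and placement.
module_files = {
    "cpu_cluster": "rtl/cpu_cluster.v",
    "gpu_core": "rtl/gpu_core.v",
    "pcie_ctrl": "rtl/pcie_ctrl.v",
}
placement = {"cpu_cluster": "FPGA1", "gpu_core": "FPGA2", "pcie_ctrl": "FPGA2"}

file_lists = files_per_fpga(module_files, placement)
changed = modified_per_fpga(["pcie_ctrl"], placement)
print(file_lists)
print(changed)
```

With these lists, each FPGA project compiles only the RTL it owns, and only the partitions that received a modified module need to recompile it.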
RESULTS
Results
Project results
• Kernel booted well ahead of tape-out
– Enabled early SW development
• Kernel booted on multi-processor setup
– SW able to execute inter-cluster tests such as cache-coherency tests
• “Order of magnitude” faster than emulation
• Able to run the interfaces at speed for driver development
Thank You
Questions