

### New SSD Interfaces and Their Impact on SSD Controller Architecture

Tim Canepa, Director of Architecture, LSI Flash Components Division, an Avago Company



## Legacy SSD Controller Architecture Anatomy

#### **Traditional SOC Design**

- Cached CPU architecture with TCM
- External DDR with multi-port memory controller

#### Interface is the main constraint

- Host limited vs. Flash limited
- 550MBps vs. 400MT/s per channel
- One transaction at a time on the wire, 32 max outstanding

#### DDR used for buffer and FTL

- Circa 2GBps of buffer BW required for 100% sequential and random corners

#### Not a lot of HW assists

- CPU(s) manage the FTL directly in DDR via write-through cache.


Flash Memory Summit 2014 Santa Clara, CA





- Scalable Interface
  - 500MBps (x1 Gen2) to 16GBps (x16 Gen3)
- Full Duplex
- New, lightweight queuing interfaces (NVMe)
- Transaction interleaving
  - Enables ability to access all the flash bandwidth
- New command semantics
  - Fused commands
  - Hinting
  - Namespaces
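The scalability range quoted above can be reproduced from the per-lane signalling rates. The following is a minimal sketch, assuming the standard line rates and encodings (5 GT/s with 8b/10b for Gen2, 8 GT/s with 128b/130b for Gen3) and ignoring TLP/DLLP protocol overhead; the function and table names are illustrative, not from the slides.

```python
# Per-lane signalling rate (GT/s) and line-code efficiency per PCIe generation.
# Gen1/Gen2 use 8b/10b encoding; Gen3 uses 128b/130b.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
}


def pcie_link_mbps(gen: int, lanes: int) -> float:
    """Raw one-direction link bandwidth in MB/s, before TLP/DLLP overhead."""
    gt_per_s, efficiency = PCIE_GENERATIONS[gen]
    # GT/s * efficiency is usable Gbit/s per lane; /8 converts to GB/s.
    return gt_per_s * efficiency / 8 * 1000 * lanes


if __name__ == "__main__":
    for gen, lanes in [(2, 1), (2, 4), (3, 16)]:
        mbps = pcie_link_mbps(gen, lanes)
        print(f"Gen{gen} x{lanes}: {mbps:,.0f} MB/s per direction")
```

Because the link is full duplex, these figures apply to each direction independently.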



Cut out the middle man and get to all the flash bandwidth!




## Legacy SSD Controller Architectures (Bottlenecks)

8 channels of ONFI 3.0 can source 3.2GBps

Bottleneck 1: DDR buffer BW

- Need well in excess of 6.4GBps to service flash alone.
- 32-bit DDR3-1600 (6.4GBps peak) won't even cut it
- Wider or faster DDR I/F is not attractive (power & cost)
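The arithmetic behind Bottleneck 1 can be sketched as follows. This assumes an 8-bit ONFI data bus (one byte per transfer) and that every byte of flash data is written into the DDR buffer once and read out once, which is where the 2x factor comes from. The function names are illustrative.

```python
def flash_bw_gbps(channels: int, mt_per_s: int, bus_bytes: int = 1) -> float:
    """Aggregate flash channel bandwidth in GB/s (8-bit ONFI bus assumed)."""
    return channels * mt_per_s * bus_bytes / 1000


def ddr_peak_gbps(mt_per_s: int, bus_bits: int) -> float:
    """Theoretical peak DDR bandwidth in GB/s, ignoring refresh and turnaround."""
    return mt_per_s * (bus_bits // 8) / 1000


flash = flash_bw_gbps(channels=8, mt_per_s=400)    # 8 channels of ONFI 3.0
required = 2 * flash                               # one buffer write + one buffer read
ddr = ddr_peak_gbps(mt_per_s=1600, bus_bits=32)    # 32-bit DDR3-1600
headroom = ddr - required

if __name__ == "__main__":
    print(f"flash: {flash:.1f} GB/s, DDR required: {required:.1f} GB/s")
    print(f"DDR peak: {ddr:.1f} GB/s, headroom: {headroom:.1f} GB/s")
```

Even at theoretical peak the DDR interface has no headroom left for FTL accesses, and real DDR efficiency is below peak.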

Bottleneck 2: Flash Data path

- DMA needs to be multi-channel to keep all channels active.
- Bus BW needs to increase to >3.2GBps

[Figure: legacy SSD controller block diagram. SATA PHY and SATA controller with DMA; CPU with cache, I-TCM and D-TCM; RAID; multi-port memory controller and bus fabric; DRAM controller and DRAM; per-channel ECC, DMA, PHY and NAND controller for channels 1 through N.]




## Legacy SSD Controller Architectures (Bottlenecks, Continued)





#### **Legacy Architecture: Dealing with Full Duplex**

80% Reads / 20% Writes Mix

Design Max: ~1,440MB/s

SF3700 architected for bi-directional PCIe traffic

Legacy Architecture bottlenecks capping performance with mixed workload.

Samsung XP941: ~290MB/s

Plextor M6e: ~160MB/s

PCIe Gen 2 x4

Bathtub curve - not optimized for bi-directional traffic




Source: Plextor and Samsung results from TweakTown; SF3700 results from LSI @ 50% entropy, 480GB, 7% OP

## How is Throughput Determined?

- Throughput is influenced by many factors
  - Flash die limits: a 1TB drive can produce 16GBps of read BW
  - Flash channel limits
  - Internal BW limits (e.g., buffers, compression engines, etc.)
  - Tenures: how long resources are held before they can be reused
  - Host I/F limits
  - CPU limits
- The key is to understand what the limiting factor(s) is (are)
  - And at what range of parameters, as the limiting factor(s) may vary
  - And how to strike a balance to get the best performance in all corners
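One way to reason about the factors listed above is to treat delivered throughput as the minimum of the per-stage limits and see which stage wins at each I/O size. The sketch below does that for a hypothetical controller. Only the 16GBps die figure and the 3.2GBps channel figure come from the slides; the internal BW, host I/F and CPU command-rate numbers are illustrative assumptions.

```python
def find_bottleneck(limits_mbps: dict) -> tuple:
    """Return (name, MB/s) of the stage with the lowest throughput limit."""
    name = min(limits_mbps, key=limits_mbps.get)
    return name, limits_mbps[name]


def cpu_limit_mbps(max_iops: int, io_bytes: int) -> float:
    """Throughput ceiling implied by a CPU command-rate limit."""
    return max_iops * io_bytes / 1e6


def limits_for(io_bytes: int) -> dict:
    """Per-stage limits in MB/s for a hypothetical controller."""
    return {
        "flash_die": 16000.0,      # 1TB drive read BW (from the slide)
        "flash_channel": 3200.0,   # 8 channels of ONFI 3.0 (from the slide)
        "internal_bw": 4800.0,     # assumed
        "host_if": 2000.0,         # assumed
        "cpu": cpu_limit_mbps(200_000, io_bytes),  # assumed 200K IOPS ceiling
    }


if __name__ == "__main__":
    for io_bytes in (4096, 131072):
        name, mbps = find_bottleneck(limits_for(io_bytes))
        print(f"{io_bytes:>6} B transfers: limited by {name} at {mbps:.1f} MB/s")
```

With these assumed numbers the CPU limits small transfers and the host interface limits large ones, which illustrates how the limiting factor moves across the parameter range.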



## Scaling Up and Striking a Balance

- Host Interface
- Data Paths
- Flash bandwidth
- Processing Power

What does it take to build a scalable architecture?



- Scalability and Interleaving are critical
- Start with a configurable PCIe controller
  - Multi-ported & cut-thru for CplD
- Add separate, multi-layer bus fabric with arbitration at TLP frame level.
- Individual DMAs and Control path logic connected to separate ports
  - Must have local buffering
- Clock domain crossings can convolute the design
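Arbitrating at TLP granularity means no single DMA port holds the link for a whole transfer, so traffic from several ports interleaves on the wire. The sketch below is a simple round-robin model of that idea. The port contents and function name are hypothetical, and a real fabric would likely use weighted or priority arbitration rather than plain round robin.

```python
from collections import deque


def arbitrate_round_robin(ports):
    """Interleave queued TLPs from several ports, granting one TLP per turn.

    `ports` is a list of per-port TLP lists; each port's local buffer is
    modelled as a FIFO. Ports with nothing queued are skipped.
    """
    queues = [deque(port) for port in ports]
    wire_order = []
    while any(queues):
        for queue in queues:
            if queue:
                wire_order.append(queue.popleft())
    return wire_order


if __name__ == "__main__":
    read_dma = ["r0", "r1", "r2"]
    write_dma = ["w0"]
    control = ["c0", "c1"]
    print(arbitrate_round_robin([read_dma, write_dma, control]))
```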





- **Bus Bandwidth** 
  - Generally good to have internal BW at least 1.5x the data BW you want to deliver to the host

- **Internal Buffers** 
  - Write staging buffers
  - Flash transfer buffers
  - Size is based on tenure
  - Mixed workload complicates sizing
  - Multi-banking required to achieve BW requirements (multiple GBps)
- DMAs
  - Multiple Queued Requests
- Inline modules complicate things
  - Pulling data through compression engines requires pipelining
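The link between tenure and buffer size can be sketched with the standard queueing relation (occupancy = throughput x residence time). The tenure and per-bank bandwidth used below are illustrative assumptions, not figures from the slides.

```python
import math


def buffer_bytes(bandwidth_mbps: int, tenure_us: int) -> int:
    """Bytes in flight for a stream held for `tenure_us` microseconds.

    MB/s * microseconds = bytes, since the 1e6 factors cancel.
    """
    return bandwidth_mbps * tenure_us


def banks_needed(required_mbps: int, per_bank_mbps: int) -> int:
    """Number of independent buffer banks needed to reach the required BW."""
    return math.ceil(required_mbps / per_bank_mbps)


if __name__ == "__main__":
    # Assumed: 3.2 GB/s of flash traffic, each buffer held for 100 us.
    print(buffer_bytes(3200, 100), "bytes of transfer buffer")
    # Assumed: 4.8 GB/s internal target (1.5x of 3.2 GB/s), 2 GB/s per bank.
    print(banks_needed(4800, 2000), "banks")
```

A mixed workload complicates this because read and write streams have different tenures and compete for the same banks.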






- Multi-Channel DMA
  - One read, one write is sufficient with balanced BW on buses
- Local Buffering in each channel
- Data bus frequency decoupled from flash clock





- Hardware assists needed to relieve processors from expensive tasks
  - Map Lookup assists
  - Recycling Assists
  - Buffer allocation assists
  - Even add assists for background operations
- Resize cache to hold working set
  - Elevated miss rates will crater MIPS
- Scale with asymmetric MP
  - Group activities into multiple asymmetric processor groups
  - Easiest way to scale "run to completion" architectures
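The effect of cache miss rate on processor throughput can be estimated with the usual effective-CPI model. The clock rate, base CPI and miss penalty below are illustrative assumptions chosen only to show the shape of the effect.

```python
def effective_mips(clock_mhz: float, base_cpi: float,
                   misses_per_instr: float, miss_penalty_cycles: float) -> float:
    """MIPS after adding the average stall cycles per instruction from cache misses."""
    cpi = base_cpi + misses_per_instr * miss_penalty_cycles
    return clock_mhz / cpi


if __name__ == "__main__":
    # Assumed: 400 MHz core, base CPI of 1.0, 50-cycle miss penalty to DDR.
    for miss_rate in (0.0, 0.001, 0.01, 0.05):
        mips = effective_mips(400.0, 1.0, miss_rate, 50.0)
        print(f"miss rate {miss_rate:.3f}: {mips:6.1f} MIPS")
```

Under these assumptions, a working set that no longer fits in cache can cut delivered MIPS to a fraction of nominal, which is why the cache is sized to hold the working set.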



Asymmetric processor Groups



## Adding Up All the Architectural Changes

- When you sum up the required architectural changes...
- You realize that you practically have to throw your old SATA architecture out and start from scratch



- Bounding recovery time requires new FTL techniques
  - Journaling FTL is essential to manage flush rates and to bound the recovery interval.
- High IOPS needs FTL HW assists
  - Map lookups
  - Recycling
  - Extensible structures for storing and managing hints
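One way a journaling FTL can bound recovery time is to checkpoint the map periodically and log only the updates made since the last checkpoint, so recovery replays at most a fixed number of entries. The sketch below illustrates that general idea and is not the specific design presented here. The class and its fields are hypothetical, and the checkpoint and journal are kept in memory instead of being persisted to flash.

```python
class JournalingFTL:
    """Toy logical-to-physical map with a checkpoint plus a bounded journal."""

    def __init__(self, checkpoint_interval: int):
        self.map = {}            # live LBA -> physical address map (volatile)
        self.checkpoint = {}     # last full map snapshot (would live in flash)
        self.journal = []        # updates since the checkpoint (would live in flash)
        self.checkpoint_interval = checkpoint_interval

    def write(self, lba: int, ppa: int) -> None:
        self.map[lba] = ppa
        self.journal.append((lba, ppa))
        # Flushing on a fixed interval caps how much must be replayed later.
        if len(self.journal) >= self.checkpoint_interval:
            self.flush_checkpoint()

    def flush_checkpoint(self) -> None:
        self.checkpoint = dict(self.map)
        self.journal.clear()

    def recover(self) -> dict:
        """Rebuild the map after power loss: snapshot plus journal replay."""
        recovered = dict(self.checkpoint)
        for lba, ppa in self.journal:
            recovered[lba] = ppa
        return recovered


if __name__ == "__main__":
    ftl = JournalingFTL(checkpoint_interval=4)
    for i in range(10):
        ftl.write(lba=i % 3, ppa=100 + i)
    print(len(ftl.journal), "journal entries to replay")
    print(ftl.recover() == ftl.map)
```

The checkpoint interval trades flush traffic against worst-case replay length, which is the balance between flush rate and recovery interval that the slide refers to.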



- More commands in flight
  - More resources required to track/manage commands
  - Algorithmic changes: sub-scalar algorithms don't show up as problems at low IOPS (data coherency, searches, etc.)
  - Scheduling becomes more challenging to keep the pipe full
  - Minimizing Tenure is critical
- Flash scheduling considerations
  - Same amount of flash as SATA SSD, but substantially more IOPS
  - On-chip buffering requires more robust resource management
  - Requires entire scheduling layer redesign
- Background task management
  - Shorten background task segments on "run to completion" architectures
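On a run-to-completion architecture nothing preempts a running task, so the longest background segment sets the worst-case delay a host command can see. The sketch below models that by splitting a background job into bounded segments and running at most one segment between host commands. The task, segment size and scheduling policy are hypothetical.

```python
from collections import deque


def recycle_block(pages: int, segment_size: int):
    """Background job as a generator that yields after each bounded segment."""
    for start in range(0, pages, segment_size):
        end = min(start + segment_size, pages)
        # A real implementation would relocate pages [start, end) here.
        yield end - start  # pages handled in this segment


def run(host_commands, background):
    """Run-to-completion loop: one host command, then one background segment."""
    host = deque(host_commands)
    log = []
    while host or background is not None:
        if host:
            log.append(("host", host.popleft()))
        if background is not None:
            try:
                log.append(("bg", next(background)))
            except StopIteration:
                background = None
    return log


if __name__ == "__main__":
    for entry in run(["c0", "c1"], recycle_block(pages=10, segment_size=4)):
        print(entry)
```

Shrinking `segment_size` lowers the worst-case wait for a host command at the cost of more scheduling overhead per page recycled.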



## **Questions?**