



### ASIC/Merchant Chip-Based Flash Controllers

Jeff Yang, Silicon Motion

Flash Memory Summit 2016 Santa Clara, CA



- Basic controller architecture.
- The challenges for a merchant chip-based flash controller.
- Flash selection combinations and the performance requirement.
- Throughput analysis of the flash write channel and read channel.
- Hard-decoding-only BCH-based controller: throughput requirement for SATA applications.
- From a hard-decoding-only controller to a soft-decoding controller.
- Correction capability.
- 3D vs. 2D NAND architecture.
- Vth tracking and the data-retention issue.
- RAID.






### Challenge: Support all combinations and cost efficiency

- 2D / 3D
- One-pass / Two-pass / Multi-pass
- SLC usage: SLC caching / TLC direct / SLC/TLC dynamic buffer
- Full DRAM / Non-DRAM / External / Partial DRAM


© Copyright Silicon Motion, Inc., 2012. All Rights Reserved.

### Traditional Write/Read channel with BCH

[Figure: block diagram of the write and read channels. Blocks: Host, DMA, System data buffer, RMW, Encoder, Randomizer, Flash channel, Detector, Key-equation, Chien-search, Corrector, De-randomizer. The signals A to G and A' are defined in the list that follows.]

- A: 1024B.
- B: 1024B, randomized.
- C: 1024B + 126B parity.
- D: C + errors from flash.
- A': 1024B with error bits.
- E: syndrome (126B).
- F: error polynomial (128B).
- G: error locations and error magnitudes.
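The 126B parity figure is consistent with the usual upper bound for a binary BCH code, which needs at most m x t parity bits over GF(2^m). The sketch below is an illustration of that bound, not the controller's actual code construction; the function name `bch_parity_bytes` is hypothetical, and the 2KB/134-bit case is the other configuration quoted in these slides.

```python
import math

def bch_parity_bytes(data_bytes: int, t: int) -> tuple[int, int]:
    """Return (m, parity_bytes) for a binary BCH code correcting t bit errors.

    Assumes the standard bound of m*t parity bits, with GF(2^m) chosen as the
    smallest field whose natural code length 2^m - 1 holds data plus parity.
    """
    data_bits = data_bytes * 8
    m = 1
    while (2 ** m - 1) < data_bits + m * t:
        m += 1
    parity_bits = m * t
    return m, math.ceil(parity_bits / 8)

for data_bytes, t in [(1024, 72), (2048, 134)]:
    m, parity = bch_parity_bytes(data_bytes, t)
    print(f"{data_bytes}B data, t={t}: GF(2^{m}), {parity}B parity")
```

Under these assumptions the 1KB chunk lands on GF(2^14) and the 2KB chunk on GF(2^15).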



- Share the decoder hardware across multiple channels.
- A channel never encodes and decodes at the same time, so the encoder is shared with the detector.
- The decoder's output must satisfy the host's maximum read throughput.

### 4-stage pipeline BCH

Single key-equation with a 4-stage pipeline

T = 72-bit mode, error bits = 72



- BCH 72-bit mode with 72 error bits; the chunk size is 1024B + 126B = 1150B.
- DMA runs at 100MHz with 16-bit parallelism → 576 cycles (200MB/sec per channel).
- Chien-search runs at 330MHz with a 16-way parallel circuit → 576 cycles.
- Chien-search throughput is 1024B/(576 x 3ns) = 592MB/sec.
- The key-equation cycle count is proportional to the number of error bits (it is the throughput and power-consumption bottleneck).
- The key-equation's execution time should stay under 576 cycles.
  - This requires a highly parallel key-equation hardware implementation.
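The stage budget above can be reproduced with a few lines of arithmetic. This is a sketch that takes the slide's figures as given: 576 cycles per chunk, and 3ns as the rounded period of one 330MHz cycle. Note that 1150B at 2B per cycle is 575 cycles, so the 576 presumably includes rounding or alignment; that reading is an assumption. The constant and function names are illustrative.

```python
DMA_CLOCK_HZ = 100e6         # DMA clock from the slide
DMA_BYTES_PER_CYCLE = 2      # 16-bit parallel datapath
CHIEN_CYCLE_NS = 3.0         # slide's rounding of one 330 MHz cycle
CYCLES_PER_CHUNK = 576       # per-stage cycle budget, taken from the slide
DATA_BYTES = 1024            # user data per chunk (parity excluded)

def dma_throughput_mb_s() -> float:
    """Bytes moved per second by one DMA channel, in MB/s."""
    return DMA_CLOCK_HZ * DMA_BYTES_PER_CYCLE / 1e6

def chien_throughput_mb_s() -> float:
    """User data delivered per second by the Chien-search stage, in MB/s."""
    stage_time_s = CYCLES_PER_CHUNK * CHIEN_CYCLE_NS * 1e-9
    return DATA_BYTES / stage_time_s / 1e6

print(f"DMA:          {dma_throughput_mb_s():.0f} MB/s per channel")
print(f"Chien-search: {chien_throughput_mb_s():.1f} MB/s")
```

Because every pipeline stage must fit the same 576-cycle window, the key-equation stage sets the throughput whenever its cycle count exceeds that window.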

### Key-equation operation efficiency

- 1KB + 126B with 72-bit protection: covers up to RBER = 3.1e-3 at UBER ~1e-15; average error bits = 28 per chunk.
- 2KB + 252B with 134-bit protection: covers up to RBER = 3.9e-3 at UBER ~1e-15; average error bits = 71 per chunk.
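The average error counts and the UBER figures can be sanity-checked with a simple binomial model. The sketch below assumes independent bit errors at the stated RBER and defines UBER as the probability of more than t errors in a chunk, divided by the number of user data bits. Both the error model and that UBER definition are assumptions, and the function names are illustrative.

```python
import math

def log_binom_pmf(n: int, k: int, p: float) -> float:
    """Natural log of the binomial probability of exactly k errors in n bits."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def chunk_failure_prob(n_bits: int, t: int, rber: float) -> float:
    """P(more than t of n_bits are wrong), assuming i.i.d. bit errors."""
    return sum(math.exp(log_binom_pmf(n_bits, k, rber))
               for k in range(t + 1, n_bits + 1))

def uber(data_bytes: int, parity_bytes: int, t: int, rber: float) -> float:
    """Uncorrectable bit error rate, normalised by user data bits (assumption)."""
    n_bits = (data_bytes + parity_bytes) * 8
    return chunk_failure_prob(n_bits, t, rber) / (data_bytes * 8)

for data, parity, t, rber in [(1024, 126, 72, 3.1e-3), (2048, 252, 134, 3.9e-3)]:
    mean_errors = (data + parity) * 8 * rber
    print(f"{data}B+{parity}B, t={t}: mean errors {mean_errors:.1f}, "
          f"UBER {uber(data, parity, t, rber):.1e}")
```

The mean error counts truncate to the 28 and 71 bits quoted above, and under this model both configurations come out within about an order of magnitude of 1e-15.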



- A very efficient, simplified, inversion-free Berlekamp-Massey (BM) algorithm was used as the original baseline.
- Further reduction provides much better efficiency:
  - 1KB, 10 error bits: 288 → 42 cycles, ~85% improvement. (BOL)
  - 2KB, 20 error bits: 654 → 87 cycles, ~87% improvement. (BOL)
  - 1KB, 28 error bits: 200 → 127 cycles, ~55% improvement. (EOL)
  - 2KB, 71 error bits: 912 → 414 cycles, ~55% improvement. (EOL)
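The improvement percentages appear to be relative cycle reductions, (before - after) / before. The short sketch below recomputes them from the cycle counts as listed; `reduction_pct` is an illustrative name. With the figures as transcribed, the 1KB EOL row (200 → 127) works out to about 36% rather than ~55%, so one of that row's numbers may have been garbled in the source.

```python
def reduction_pct(before_cycles: int, after_cycles: int) -> float:
    """Relative reduction in key-equation cycles, in percent."""
    return 100.0 * (before_cycles - after_cycles) / before_cycles

cases = [
    ("1KB, 10 error bits (BOL)", 288, 42),
    ("2KB, 20 error bits (BOL)", 654, 87),
    ("1KB, 28 error bits (EOL)", 200, 127),
    ("2KB, 71 error bits (EOL)", 912, 414),
]

for label, before, after in cases:
    print(f"{label}: {before} -> {after} cycles, "
          f"{reduction_pct(before, after):.1f}% fewer cycles")
```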



- To improve the decoder's correction capability, soft information is used to obtain more reliability bits.
- NAND interface support:
  - Traditional read-retry interface.
  - Direct soft-info interface.

### DSP engine's buffer size



- The buffer size determines how many chunks' worth of soft bits the engine can hold.
- Accessing additional soft-info from NAND may require additional read busy time.
- Reading the soft bits within the same busy time is more efficient, but the buffer size requirement is huge.

### Soft-decoding throughput limitation



- One transfer time = 2.5ns/B x 18432B = 46us (400MT/s).
- Assume a DSP buffer size of 16KB.
  - 9 tR + 12 transfer times = 9 x (100us) + 12 x (46us) = 1452us.
  - Throughput = 64KB/1452us = 44MB/sec.
- Assume a DSP buffer size of 64KB.
  - 3 tR + 8 transfer times = 3 x (100us) + 8 x (46us) = 668us.
  - Throughput = 64KB/668us = 95MB/sec.
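The two buffer-size cases can be recomputed as follows. This sketch uses the slide's tR of 100us and its rounded 46us transfer time, and it reads "64KB" as 64,000 bytes, which is an assumption chosen because it reproduces the quoted 44MB/sec and 95MB/sec. The function names are illustrative.

```python
T_R_US = 100.0               # NAND read busy time (tR) from the slide
BYTES_PER_TRANSFER = 18432   # bytes moved per transfer
NS_PER_BYTE = 2.5            # 400 MT/s interface
READ_BYTES = 64_000          # "64KB", read as decimal kilobytes (assumption)

def transfer_time_us() -> float:
    """Exact time for one transfer, in microseconds (the slide rounds to 46)."""
    return BYTES_PER_TRANSFER * NS_PER_BYTE / 1000.0

def recovery_time_us(n_tr: int, n_transfers: int, xfer_us: float = 46.0) -> float:
    """Total soft-read time: busy periods plus data transfers."""
    return n_tr * T_R_US + n_transfers * xfer_us

def throughput_mb_s(total_us: float) -> float:
    """Bytes per microsecond is numerically equal to MB/s."""
    return READ_BYTES / total_us

for label, n_tr, n_xfer in [("16KB buffer", 9, 12), ("64KB buffer", 3, 8)]:
    total = recovery_time_us(n_tr, n_xfer)
    print(f"{label}: {total:.0f} us, {throughput_mb_s(total):.1f} MB/s")
```

The larger buffer helps mainly by cutting the number of tR busy periods from 9 to 3, which is the dominant term in both totals.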

In client SSD applications, soft-decoding is regarded as the error-recovery flow. We do not set a throughput requirement for recovery mode, but we do care about how often recovery mode is triggered.








### **ECC Chunk**

- Fixed code rate: around 0.9, ECC chunk size: 1KB/ 2KB/ 4KB
- Hard-decoding is based on BCH, and soft-decoding is based on LDPC with less than 3-bit channel reliability values.
  - Correction Performance: 4KB better than 1KB
  - Decoding Latency: 1KB better than 4KB
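As a quick check of the "around 0.9" code rate, the two BCH chunk formats quoted in these slides (1KB + 126B and 2KB + 252B) can be plugged into rate = data / (data + parity). The helper name is illustrative, and no parity size is given here for the 4KB chunk, so it is left out.

```python
def code_rate(data_bytes: int, parity_bytes: int) -> float:
    """Fraction of the stored chunk that is user data."""
    return data_bytes / (data_bytes + parity_bytes)

for data, parity in [(1024, 126), (2048, 252)]:
    print(f"{data}B + {parity}B parity: code rate {code_rate(data, parity):.3f}")
```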



### Failure range from 2D to 3D.

[Figure: a 2D block and a 3D block side by side, each showing the program order and the pair-pages (L/M/U) on each word line (WL). Marked damage ranges: program fail, word-line open, word-line short, and two-word-line short.]







- Both 2D and 3D NAND have the data-retention problem.
  - 1Znm MLC needs 6~10 read-retry tables, but TLC needs 40~45 tables and has lower endurance and retention.
  - 3D will have a more severe data-retention issue.
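To make the role of the read-retry tables concrete, the sketch below walks a table of read-voltage offsets until a read decodes. Everything in it is a toy model: the offsets, the linear error model in `make_toy_page`, and the function names are hypothetical and not taken from any NAND datasheet; only the 72-bit correction limit comes from the slides.

```python
from typing import Callable, Optional, Sequence

def read_with_retry(read_and_decode: Callable[[int], bool],
                    retry_table: Sequence[int]) -> Optional[int]:
    """Try each read-voltage offset in order; return the first that decodes."""
    for offset_mv in retry_table:
        if read_and_decode(offset_mv):
            return offset_mv
    return None  # every entry failed: escalate to soft-decoding or RAID

def make_toy_page(vth_shift_mv: int, t: int = 72,
                  base_errors: int = 20, errors_per_mv: int = 2):
    """Toy page: raw errors grow with the distance between the read offset
    and the true Vth shift; decoding succeeds when errors <= t."""
    def read_and_decode(offset_mv: int) -> bool:
        raw_errors = base_errors + errors_per_mv * abs(offset_mv - vth_shift_mv)
        return raw_errors <= t
    return read_and_decode

retry_table = [0, -20, -40, -60, -80, -100]   # hypothetical 6-entry table
page = make_toy_page(vth_shift_mv=-70)        # retention moved Vth downward
print(read_with_retry(page, retry_table))
```

Each failed entry costs another read, so a longer table (40~45 entries for TLC) raises worst-case latency, which is one motivation for tracking Vth directly.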

[ref]: E.S. Choi, S.K. Park, "Device Considerations for High Density and Highly reliable 3D NAND Flash Cell in Near Future". IEDM 2012






The 3D flash is good!! What are we waiting for? COST, COST, COST!!!

|                | 2D                                                          | 3D                                                          |
|----------------|-------------------------------------------------------------|-------------------------------------------------------------|
| Endurance      | After cycling: keeps the same error distribution at low RBER | After cycling: keeps the same error distribution at low RBER |
| Data retention | The RBER becomes worse and the Vth also shifts               | Only the Vth shifts; the RBER is still good                  |

- HDD trend: RS $\rightarrow$ LDPC $\rightarrow$ NB-LDPC
- SSD trend: HM $\rightarrow$ RS $\rightarrow$ BCH $\rightarrow$ LDPC
- We always need a stronger ECC.
- Why target RBER = 3e-3? BCH 72bit/1KB will provide UBER < 1e-15 with RBER = 3e-3.

| Mode                         | Target                                                                 | RBER requirement     |                                                             |
|------------------------------|------------------------------------------------------------------------|----------------------|-------------------------------------------------------------|
| Normal operation (base-line) | Extreme low power. Keep the host throughput.                            | RBER = <b>3e-3</b>   | Need a ECC stronger th                                      |
| Reliability extension        | Vth-tracking to lower the RBER. Soft-info to have stronger correction. | RBER = 1e-2 ~ 1.2e-2 | MI Provide the most-cost efficiency satisfy the reliability |

### ECC design loop related to NAND characteristics.

- We already have a 6<sup>th</sup>-generation LDPC decoder.
- We keep improving the LDPC performance.
- For higher throughput (~8GB/sec), we may need to go back to step 1.
- From the 28nm process onward, the design iteration loop extends from code construction to trial APR.
- Example: if a routing congestion issue is found in step 4, it may need to be solved starting from step 1.





### Before the RAID protect flow.....

All the issues combine together:

- DRAM / DRAM-less / Small DRAM
- SLC-first / TLC-direct write / Dynamic SLC
- One-pass / Multi-pass / Pair-page mapping
- WL-to-WL short failure range
- WL open failure range
- Capacity (RAID overhead): binary / arbitrary
- Program failure: DRAM-backup / Flash-cache / RAID recover
- Recovery latency

BUT THE SAME CONCEPT IS......



- If you don't want to use RAID, what is the alternative?
  - Read-back check after program.
