

## Increasing NAND Flash Endurance Using Refresh Techniques

Yu Cai<sup>1</sup>, Gulay Yalcin<sup>2</sup>, Onur Mutlu<sup>1</sup>, Erich F. Haratsch<sup>3</sup>, Adrian Cristal<sup>2</sup>, Osman S. Unsal<sup>2</sup> and Ken Mai<sup>1</sup>

> <sup>1</sup>DSSC, Carnegie Mellon University; <sup>2</sup>Barcelona Supercomputing Center; <sup>3</sup>LSI Corporation





- Error rate increases with P/E cycles
- Retention errors are the dominant error type
- Retention error rates increase as retention time increases

Santa Clara, CA August 2012




## BER and ECC Analysis for NAND Flash Memory

### • ECC (n, k, t) selection guidelines

- Efficiency: >0.89 coding rate (k/n)
- Reliability: <10<sup>-15</sup> uncorrectable error rate after ECC
- Code length: a segment of one flash page (e.g., 4 KB)

### • Characteristics of various error correction codes (BCH)

| Code length<br>(n) | Correctable<br>Errors (t) | Acceptable<br>Raw BER        | Norm.<br>Power | Norm. Area |
|--------------------|---------------------------|------------------------------|----------------|------------|
| 512                | 7                         | 1.3x10 <sup>-4</sup> (1x)    | 1              | 1          |
| 1024               | 12                        | 5.5x10 <sup>-4</sup> (4x)    | 2              | 2.1        |
| 2048               | 22                        | 1.0x10 <sup>-3</sup> (7.7x)  | 4.1            | 3.9        |
| 4096               | 40                        | 1.1x10 <sup>-3</sup> (8.5x)  | 8.6            | 10.3       |
| 8192               | 74                        | 1.2x10 <sup>-3</sup> (9.2x)  | 17.8           | 21.3       |
| 32768              | 259                       | 2.0x10 <sup>-3</sup> (15.4x) | 71             | 85         |
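The sketch below relates the "Acceptable Raw BER" column to the reliability target. It estimates the probability that a codeword contains more than t bit errors, which is more than a t-error-correcting BCH code can fix. It assumes that n counts bits and that bit errors are independent and identically distributed (a binomial model). The slides do not say exactly how the column was derived, and `codeword_failure_prob` is a hypothetical helper name.

```python
import math


def codeword_failure_prob(n: int, t: int, raw_ber: float) -> float:
    """P(more than t of n bits are in error) under an i.i.d. binomial error model."""
    if raw_ber <= 0.0 or t >= n:
        return 0.0
    if raw_ber >= 1.0:
        return 1.0
    k = t + 1
    # Log of the first tail term C(n, k) p^k (1-p)^(n-k); lgamma avoids overflowing
    # the very large binomial coefficients that appear for long codes.
    log_term = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                + k * math.log(raw_ber) + (n - k) * math.log1p(-raw_ber))
    term = math.exp(log_term)
    ratio = raw_ber / (1.0 - raw_ber)
    total = 0.0
    for i in range(k, n + 1):
        total += term
        # Past the mode the terms shrink rapidly; stop once they no longer affect the sum.
        if term <= total * 1e-17:
            break
        # Binomial recurrence: P(i+1) = P(i) * (n - i) / (i + 1) * p / (1 - p)
        term *= (n - i) / (i + 1) * ratio
    return total


# Compare several (n, t) configurations from the table at the 1x raw BER.
for n, t in [(512, 7), (1024, 12), (2048, 22), (4096, 40)]:
    print(n, t, codeword_failure_prob(n, t, 1.3e-4))
```

The tail is summed directly, not computed as `1 - CDF`, because targets near 10<sup>-15</sup> are lost to floating-point cancellation when subtracted from 1.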



## BER & ECC Analysis for NAND Flash Memory (cont)

• Lifetime improvement comparison of various BCH codes



- Summary
  - Raw BER of NAND flash increases exponentially with P/E cycles
  - Stronger ECC improves flash lifetime with diminishing returns
  - Raw BER MUST be decreased to achieve lifetime improvement
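A short derivation shows why stronger ECC yields diminishing returns. It rests on a simplifying assumption: raw BER grows exponentially with the P/E cycle count $c$, with fitted constants $A$ and $B$. These symbols are hypothetical; the slides do not give fitted values. Under that model, the tolerable cycle count depends only logarithmically on the acceptable raw BER $\mathrm{BER}_{\mathrm{acc}}$ that a code can handle:

```latex
\mathrm{BER}(c) = A\,e^{B c}
\;\Longrightarrow\;
c_{\max} = \frac{1}{B}\,\ln\frac{\mathrm{BER}_{\mathrm{acc}}}{A},
\qquad
c_{\max}\!\left(r\,\mathrm{BER}_{\mathrm{acc}}\right) - c_{\max}\!\left(\mathrm{BER}_{\mathrm{acc}}\right) = \frac{\ln r}{B}
```

Moving from the 7-error code to the 259-error code in the table, for example, raises the acceptable raw BER by $r = 15.4$. Under this model that adds only about $\ln 15.4 / B \approx 2.73 / B$ cycles, while normalized power rises 71x.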




## Using Refresh Techniques to Improve the Flash Lifetime



- Flash Correct-and-Refresh (FCR)
  - Read, correct, and refresh the stored data before flash accumulates more retention errors than can be corrected by simple ECC
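A minimal sketch of one remapping-based FCR pass follows. It uses a toy in-memory model: each physical block is a list of pages, `ecc_correct` stands in for the ECC decoder, and a refresh writes the corrected data into an erased spare block and then erases the old one. All names (`Block`, `refresh_pass`, `ecc_correct`) are hypothetical illustrations, not an FTL API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Block:
    pages: List[bytes] = field(default_factory=list)
    pe_cycles: int = 0  # program/erase count of this physical block

    def erase(self) -> None:
        self.pages = []
        self.pe_cycles += 1


def refresh_pass(mapping: Dict[int, Block], free: List[Block],
                 ecc_correct: Callable[[bytes], bytes]) -> int:
    """Read, correct, and rewrite every mapped block into an erased free block.

    Returns the number of extra erase operations the pass introduced.
    """
    extra_erases = 0
    for logical_id, old in list(mapping.items()):
        new = free.pop(0)                      # assumes at least one erased spare block
        # Read each page and correct it before retention errors outgrow the ECC.
        new.pages = [ecc_correct(page) for page in old.pages]
        mapping[logical_id] = new              # remap the logical block to the fresh copy
        old.erase()                            # reclaiming the old block costs one erase
        free.append(old)
        extra_erases += 1
    return extra_erases
```

The returned count makes the cost visible: every refreshed block adds one erase, which is the wear overhead that the hybrid and adaptive-rate variants try to reduce.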



## Raw flash errors under refresh and no-refresh





### • Work flow





### • Summary

- Overall error rate can be decreased by increasing the refresh rate
- Periodic remapping of blocks introduces additional erase operations
  - The more frequent the remapping, the more erase operations and the more wear-out
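The refresh-rate trade-off can be illustrated with a toy model. It assumes, purely for illustration, that retention errors grow linearly with time since the last refresh and that each remapping refresh costs one erase per block. The function name and all rates are hypothetical placeholders, not measured data.

```python
def refresh_tradeoff(interval_days: float, program_errs: float,
                     retention_errs_per_day: float, horizon_days: float = 365.0):
    """Return (worst-case errors per page, extra erases per block over the horizon)."""
    # Errors peak just before each refresh, after interval_days of retention loss.
    peak_errors = program_errs + retention_errs_per_day * interval_days
    # Each remapping refresh erases the old block once.
    extra_erases = horizon_days / interval_days
    return peak_errors, extra_erases


for interval in (1, 7, 30, 90):
    peak, erases = refresh_tradeoff(interval, program_errs=5.0, retention_errs_per_day=0.5)
    print(f"every {interval:>3} days: peak errors {peak:6.1f}, extra erases/year {erases:6.1f}")
```

Shorter intervals lower the peak error count but multiply the extra erases. This tension motivates refreshing without erasing, or refreshing only when needed.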

## In-place Refresh Without Erase



- Retention errors are caused by the threshold voltage shifting to the left
- ISPP (incremental step pulse programming) shifts the threshold voltage to the right and can therefore fix retention errors
- A basic FCR mechanism based on in-place reprogramming can thus be implemented without remapping data to new blocks

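The toy model below shows why in-place reprogramming can repair retention errors but not right-shifted cells. ISPP applies small pulses that only raise a cell's threshold voltage until it passes the target state's verify level. States are indexed 0 (lowest Vth) to 3 (highest) for a 2-bit MLC cell. The voltages, step size, and function names are hypothetical.

```python
VERIFY = [None, 1.0, 2.0, 3.0]  # hypothetical verify level per programmed state (V)
READ_REF = [0.5, 1.5, 2.5]      # hypothetical read reference voltages between states
STEP = 0.1                      # hypothetical ISPP step size (V)


def read_state(vth: float) -> int:
    """Sense the state by counting read references below the threshold voltage."""
    return sum(vth > ref for ref in READ_REF)


def ispp_reprogram(vth: float, target_state: int) -> float:
    """Raise vth in small steps until it reaches the target verify level.

    ISPP can only add charge, so a cell already above its target is left unchanged.
    """
    if target_state == 0:
        return vth  # erased state: reprogramming cannot lower Vth
    while vth < VERIFY[target_state]:
        vth += STEP
    return vth


# A state-2 cell that leaked down to 1.4 V (reads as state 1) is restored,
# but a state-1 cell pushed up to 1.7 V (reads as state 2) stays wrong.
print(read_state(ispp_reprogram(1.4, 2)), read_state(ispp_reprogram(1.7, 1)))
```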

## Problems of In-place Refreshing

• Program interference causes the threshold voltage to shift to the right, which in-place reprogramming cannot undo



### • Example of in-place reprogramming


| Stage                                    | Cell 1              | Cell 2 | Cell 3          | Cell 4 | Cell 5          | Cell 6 | Cell 7              |
|------------------------------------------|---------------------|--------|-----------------|--------|-----------------|--------|---------------------|
| Original data to be programmed           | 00                  | 11     | 01              | 00     | 10              | 11     | 00                  |
| Program errors after initial programming | 00                  | 10     | 01              | 00     | 10              | 11     | 00                  |
| Retention errors after some time         | (<mark>01</mark>)   | 10     | (<u>10</u>)     | 00     | (<u>11</u>)     | 11     | (<mark>01</mark>)   |
| Errors after in-place reprogramming      | 00                  | 10     | 01              | 00     | 10              | 10     | 00                  |



### • Hybrid FCR work flow

- Choose a block to be refreshed
- Read an LSB and MSB page pair
- Perform error correction
- Compare cell threshold voltages to count right-shift errors
- If the right-shift error count is below the threshold, reprogram in place; otherwise, remap to a new block
- Increment the page-pair number and repeat until the last LSB/MSB page pair has been processed

Error model

- Basic in-place refresh (N uncontrolled):

  $E_{\text{total}}^{\text{in-place}} = E_{\text{retention}}(T) + E_{\text{program}} + E_{\text{read}} + E_{\text{erase}} + N \times E_{\text{reprogram}}$

- Hybrid FCR (N kept below a controlled threshold):

  $E_{\text{total}}^{\text{hybrid}} = E_{\text{retention}}(T) + E_{\text{program}} + E_{\text{read}} + E_{\text{erase}} + N \times E_{\text{reprogram}}$

### • Hybrid FCR

• If the right-shift program error count is below a threshold, reprogram the block in place; otherwise, remap it to a new block

• Greatly reduces the additional erase operations caused by remapping
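The hybrid decision can be sketched as follows. The sketch assumes the controller knows, for each cell, both the state it currently reads and its ECC-corrected state. A right-shift error is a cell that reads in a higher-Vth state than its corrected value. Since ISPP cannot undo such errors, the block is remapped once they reach the threshold. The function names and threshold value are hypothetical.

```python
from typing import Sequence


def count_right_shift_errors(read_states: Sequence[int],
                             corrected_states: Sequence[int]) -> int:
    """Cells whose sensed state is above their ECC-corrected state (Vth shifted right)."""
    return sum(r > c for r, c in zip(read_states, corrected_states))


def hybrid_fcr_action(read_states: Sequence[int],
                      corrected_states: Sequence[int],
                      threshold: int) -> str:
    """Reprogram in place if few right-shift errors remain, else remap to a new block."""
    if count_right_shift_errors(read_states, corrected_states) < threshold:
        return "reprogram_in_place"
    return "remap_to_new_block"


# One left-shifted cell (fixable by ISPP) and one right-shifted cell.
print(hybrid_fcr_action([1, 2, 0, 3], [2, 1, 0, 3], threshold=2))
```

Left-shifted cells are ignored by the count because in-place reprogramming repairs them. Only errors that accumulate across reprogrammings push a block toward remapping.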





- Adaptive-rate FCR: trigger refresh operations only when necessary, by tracking the number of P/E cycles of each block

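A per-block P/E count could select the refresh interval along the following lines. Lightly worn blocks with low raw BER are refreshed rarely or not at all, and worn blocks more often. The slides state only that refreshes are triggered by tracking each block's P/E cycles and that the refresh period is at least a day. The breakpoints, intervals, and function names below are hypothetical placeholders.

```python
import bisect
from typing import Optional

# Hypothetical schedule: P/E counts at which the refresh interval tightens,
# and the interval (days) used in each range. None means no refresh is needed yet.
PE_BREAKPOINTS = [3000, 6000, 9000]
INTERVALS_DAYS = [None, 90, 30, 7]


def refresh_interval_days(pe_cycles: int) -> Optional[int]:
    """Pick a refresh interval from the block's P/E count."""
    return INTERVALS_DAYS[bisect.bisect_right(PE_BREAKPOINTS, pe_cycles)]


def needs_refresh(pe_cycles: int, days_since_refresh: float) -> bool:
    """Refresh only when the block's wear-dependent interval has elapsed."""
    interval = refresh_interval_days(pe_cycles)
    return interval is not None and days_since_refresh >= interval
```

Because young blocks skip refreshes entirely, this scheme avoids the extra erases that a fixed-rate schedule would spend on blocks whose raw BER is still low.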


- Implementation cost
  - Requires no hardware changes; only the FTL software changes
  - Per-block P/E cycle information is already maintained in existing flash systems
- Power supply continuity
  - Proposed for enterprise storage systems, which are continuously powered on
- Response time impact
  - Trigger refresh operations when the device is idle
  - Refresh operations can be interrupted by normal operations
  - The refresh period is at least one day, so refreshes can be finished within the period
- Additional erase operations
  - Hybrid FCR and adaptive-rate FCR can greatly reduce additional erase operations
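The response-time points can be sketched as a refresh task that advances only while no host request is pending. The task yields after each block, so normal operations preempt it and it resumes later. This is a hypothetical scheduling sketch built on a Python generator, not the FTL's actual scheduler; `refresh_task`, `run`, and the request strings are illustrative.

```python
from collections import deque
from typing import Callable, Deque, Iterator, List


def refresh_task(blocks: List[int], refresh_one: Callable[[int], None]) -> Iterator[int]:
    """Refresh blocks one at a time, yielding after each so the task can be paused."""
    for block_id in blocks:
        refresh_one(block_id)
        yield block_id


def run(host_queue: Deque[str], task: Iterator[int], serve: Callable[[str], None]) -> None:
    """Serve host requests first; advance the refresh task only when idle."""
    while True:
        if host_queue:
            serve(host_queue.popleft())  # normal operations take priority
            continue
        if next(task, None) is None:     # idle: refresh one more block, stop when done
            break


# Example: a host request arrives while block 0 is being refreshed.
log: List[str] = []
queue: Deque[str] = deque(["read A"])


def refresh_one(block_id: int) -> None:
    log.append(f"refresh {block_id}")
    if block_id == 0:
        queue.append("write B")  # simulated host arrival


run(queue, refresh_task([0, 1], refresh_one), lambda req: log.append(f"serve {req}"))
print(log)  # ['serve read A', 'refresh 0', 'serve write B', 'refresh 1']
```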



- Simulation framework
  - DiskSim with SSD extensions
- Workloads with various write ratios
  - File system applications: iozone (>99%), postmark (17%), cello99 (62%), MSR-Cambridge (20%)
  - Database application: oltp (48%)
  - Web search application: websearch (<1%)
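- Lifetime evaluation
  - $\text{Lifetime} = \dfrac{\text{Maximum full-disk P/E cycles}}{\text{Total full-disk P/E cycles of the given application}} \times \text{Number of days of the given application}$
  - Maximum full-disk P/E cycles: from experimental testing data
  - Total full-disk P/E cycles of the given application: from simulated data
  - Number of days of the given application: the announced time of each benchmark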
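The lifetime estimate scales a benchmark's duration by how many times the disk could sustain the full-disk P/E cycles that the benchmark consumed. A direct transcription follows. The function name and the numbers in the example call are hypothetical placeholders, not results from the study.

```python
def lifetime_days(max_full_disk_pe: float, full_disk_pe_used: float,
                  benchmark_days: float) -> float:
    """Lifetime = (max full-disk P/E cycles / full-disk P/E cycles used) * benchmark days."""
    if full_disk_pe_used <= 0:
        raise ValueError("workload must consume some P/E cycles")
    return max_full_disk_pe / full_disk_pe_used * benchmark_days


# Hypothetical inputs: a workload that used 0.5 full-disk P/E cycles over 7 days,
# on flash that tolerates 3000 full-disk P/E cycles under some ECC/refresh scheme.
print(lifetime_days(3000, 0.5, 7))  # 42000.0
```

Refresh schemes enter through the numerator, since lowering raw BER raises the tolerable P/E cycles. Any extra erases they cause enter through the denominator.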



## Flash Lifetime with Remapping-Based FCR



- Given the same workload and the same refresh interval (or no-refresh), stronger ECC always provides a longer lifetime than weaker ECC
- Remapping-based FCR provides significant lifetime improvements for write-intensive applications
- For heavily read-intensive applications (e.g., websearch, MSR-Cambridge), remapping-based FCR reduces lifetime, because the additional erase cycles it introduces outweigh its benefit



## Flash Lifetime with Adaptive-Rate FCR



- Adaptive-rate FCR improves lifetime over both periodic FCR mechanisms (remapping-based and hybrid) for all workloads, as it avoids unnecessary refreshes
- Adaptive-rate FCR can improve lifetime even for read-intensive workloads
- Average lifetime improvement over no-refresh:
  - Adaptive-rate FCR 46x, hybrid FCR 30x, and remapping-based FCR 9x



- Stronger ECC has diminishing returns on improving endurance and lifetime of NAND flash based data storage
- Retention-aware error management needs to be applied to reduce the raw BER of NAND flash memory
  - Remapping-based FCR
  - Hybrid in-place and out-of-place FCR
  - Adaptive-rate FCR
- Adaptive FCR techniques can improve the lifetime of NAND flash memory by ~46x on average with minor overhead



## Thank You

## Questions?