Persistent L2ARC might be coming to ZFS on Linux

Intel's Optane persistent memory is widely considered the best choice for ZFS write buffer devices. But L2ARC is more forgiving than SLOG, and larger, slower devices like standard consumer M.2 SSDs should work well for it, too.
Jacek Halicki / CC BY-SA 4.0

Today, a call for code review came across the ZFS developers’ mailing list. Developer George Amanakis has ported and improved code that makes the L2ARC (OpenZFS’s read cache device feature) persistent across reboots. Amanakis explains:

The last couple of months I have been working on getting L2ARC persistence to work in ZFSonLinux.

This effort was based on previous work by Saso Kiselkov (@skiselkov) in Illumos (https://www.illumos.org/issues/3525), which was later ported by Yuxuan Shui (@yshui) to ZoL (https://github.com/zfsonlinux/zfs/pull/2672), subsequently modified by Jorgen Lundman (@lundman), and rebased to master with multiple additions and changes by me (@gamanakis).

The end result is in: https://github.com/zfsonlinux/zfs/pull/9582

For those unfamiliar with the nuts and bolts of ZFS, one of its distinguishing features is the use of the ARC (Adaptive Replacement Cache) algorithm for read cache. Standard filesystem LRU (Least Recently Used) caches, used in NTFS, ext4, XFS, HFS+, APFS, and pretty much anything else you’ve likely heard of, will readily evict “hot” (frequently accessed) storage blocks if large volumes of data are read once.
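To see why a plain LRU cache is vulnerable to a one-time scan, here is a minimal Python sketch. The `LRUCache` class and the block names are made up for illustration: a block read three times is still evicted as soon as four new blocks stream through a four-slot cache, because LRU only remembers recency, not frequency.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal LRU cache: a hit moves the key to the most-recently-used end;
    a miss on a full cache evicts the least-recently-used key."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries = OrderedDict()  # key -> None, ordered from LRU to MRU

    def access(self, key: str) -> bool:
        """Touch `key`; return True on a cache hit, False on a miss."""
        if key in self.entries:
            self.entries.move_to_end(key)  # refresh recency on a hit
            return True
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used key
        self.entries[key] = None
        return False

    def __contains__(self, key: str) -> bool:
        return key in self.entries


# A "hot" block read repeatedly, then a one-time scan of ten cold blocks.
lru = LRUCache(capacity=4)
for _ in range(3):
    lru.access("hot")
for i in range(10):
    lru.access(f"scan-{i}")
print("hot" in lru)  # False: the scan pushed the hot block out
```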

By contrast, each time a block is re-read within the ARC, it becomes more heavily prioritized and more difficult to push out of cache as new data is read in. The ARC also tracks recently evicted blocks, so if a block keeps getting read back into cache after eviction, this too will make it more difficult to push out. This leads to much higher cache hit rates (and therefore lower latencies, and more throughput and IOPS available from the actual disks) for many real-world workloads.
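The sketch below is the textbook ARC algorithm as published by Megiddo and Modha, written in Python purely for illustration. OpenZFS’s ARC is derived from it but differs in many details (for example, it accounts for buffers of varying sizes), so treat this as a model of the idea, not ZFS’s code; the class and method names are hypothetical. `T1` holds blocks seen once recently, `T2` blocks seen more than once, and the ghost lists `B1`/`B2` remember only the keys of recently evicted blocks. A hit in a ghost list nudges the adaptive target `p` toward recency or frequency.

```python
from collections import OrderedDict


class ARCCache:
    """Textbook ARC (Megiddo and Modha, 2003) over cache keys.

    t1: resident keys seen once recently; t2: resident keys seen at least twice.
    b1/b2: "ghost" lists holding only keys recently evicted from t1/t2.
    p: adaptive target size for t1, tuned by hits in the ghost lists.
    """

    def __init__(self, capacity: int) -> None:
        self.c = capacity
        self.p = 0.0
        self.t1, self.t2 = OrderedDict(), OrderedDict()
        self.b1, self.b2 = OrderedDict(), OrderedDict()

    def __contains__(self, key) -> bool:
        return key in self.t1 or key in self.t2

    def _replace(self, key) -> None:
        # Evict the LRU entry of t1 or t2 into the matching ghost list, steered by p.
        if self.t1 and ((key in self.b2 and len(self.t1) == self.p) or len(self.t1) > self.p):
            old, _ = self.t1.popitem(last=False)
            self.b1[old] = None
        else:
            old, _ = self.t2.popitem(last=False)
            self.b2[old] = None

    def access(self, key) -> bool:
        """Reference `key`; return True on a cache hit."""
        if key in self.t1:  # Case I: hit in t1, promote to the frequent list
            del self.t1[key]
            self.t2[key] = None
            return True
        if key in self.t2:  # Case I: hit in t2, refresh its position
            self.t2.move_to_end(key)
            return True
        if key in self.b1:  # Case II: ghost hit, grow the recency target
            self.p = min(self.c, self.p + max(len(self.b2) / len(self.b1), 1))
            self._replace(key)
            del self.b1[key]
            self.t2[key] = None
            return False
        if key in self.b2:  # Case III: ghost hit, shrink the recency target
            self.p = max(0.0, self.p - max(len(self.b1) / len(self.b2), 1))
            self._replace(key)
            del self.b2[key]
            self.t2[key] = None
            return False
        # Case IV: complete miss.
        if len(self.t1) + len(self.b1) == self.c:
            if len(self.t1) < self.c:
                self.b1.popitem(last=False)  # forget the oldest ghost
                self._replace(key)
            else:
                self.t1.popitem(last=False)  # t1 fills the cache: drop its LRU outright
        else:
            total = len(self.t1) + len(self.t2) + len(self.b1) + len(self.b2)
            if total >= self.c:
                if total == 2 * self.c:
                    self.b2.popitem(last=False)
                self._replace(key)
        self.t1[key] = None
        return False


arc = ARCCache(capacity=4)
for _ in range(3):
    arc.access("hot")
for i in range(10):
    arc.access(f"scan-{i}")
print("hot" in arc)  # True
```

With this access pattern (one block read three times, then a ten-block scan), the scanned blocks only ever cycle through `T1` and `B1`, so `hot` stays resident in `T2`. Re-reading a recently evicted scan block afterwards would be a ghost hit in `B1`, which raises `p` and gives recency a larger share of the cache.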

The primary ARC is kept in system RAM, but an L2ARC (Layer 2 Adaptive Replacement Cache) device can be created from one or more fast disks. In a ZFS pool with one or more L2ARC devices, when blocks are evicted from the primary ARC in RAM, they are moved down to L2ARC rather than being thrown away entirely. In the past, this feature has been of limited value, both because indexing a large L2ARC occupies system RAM that could have been better used for primary ARC and because L2ARC was not persistent across reboots.
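A toy Python model of the two-tier arrangement described above: a small RAM tier whose evictions are demoted into a larger L2 tier instead of being dropped. The class and method names are hypothetical, both tiers are plain LRU rather than ARC for brevity, and real OpenZFS populates the L2ARC asynchronously from a feed thread that writes out buffers nearing eviction, rather than at the exact moment of eviction.

```python
from collections import OrderedDict


class TwoTierCache:
    """Toy two-level cache: a small RAM tier backed by a larger "L2" tier.

    Simplification: both tiers are plain LRU, and a block evicted from RAM is
    demoted to L2 immediately instead of being discarded.
    """

    def __init__(self, ram_slots: int, l2_slots: int) -> None:
        self.ram_slots, self.l2_slots = ram_slots, l2_slots
        self.ram = OrderedDict()
        self.l2 = OrderedDict()

    def read(self, block: str) -> str:
        """Return which tier served the read: "ram", "l2", or "disk"."""
        if block in self.ram:
            self.ram.move_to_end(block)
            return "ram"
        source = "disk"
        if block in self.l2:
            del self.l2[block]  # in this toy model, an L2 hit moves the block back to RAM
            source = "l2"
        if len(self.ram) >= self.ram_slots:
            evicted, _ = self.ram.popitem(last=False)
            self._demote(evicted)
        self.ram[block] = None
        return source

    def _demote(self, block: str) -> None:
        # Blocks leaving RAM go to the L2 tier; L2 itself is LRU-bounded.
        if len(self.l2) >= self.l2_slots:
            self.l2.popitem(last=False)
        self.l2[block] = None


cache = TwoTierCache(ram_slots=2, l2_slots=8)
for block in ["a", "b", "c"]:
    cache.read(block)  # reading "c" pushes "a" out of RAM and into L2
print(cache.read("a"))  # l2
```

The second read of `a` is served from the L2 tier instead of the slow “disk”. Without persistence, that L2 tier would start out empty again after every reboot.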

The issue of indexing L2ARC consuming too much system RAM was largely mitigated several years ago, when the L2ARC header (the part for each cached record that must be stored in RAM) was reduced from 180 bytes to 70 bytes. For a 1TiB L2ARC serving only datasets with the default 128KiB recordsize, this works out to 560MiB of RAM (8,388,608 records at 70 bytes each) consumed to index the L2ARC.
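The RAM estimate is simply the number of records the L2ARC can hold multiplied by the per-record header size. A small Python helper (the function name is hypothetical) makes the arithmetic explicit, assuming every cached record is exactly the 128KiB default and ignoring any other overhead.

```python
def l2arc_header_ram(l2arc_bytes: int, recordsize: int, header_bytes: int) -> int:
    """Bytes of RAM needed to index an L2ARC full of `recordsize` records."""
    records = l2arc_bytes // recordsize  # one in-RAM header per cached record
    return records * header_bytes


TiB, MiB, KiB = 2**40, 2**20, 2**10
ram = l2arc_header_ram(1 * TiB, 128 * KiB, header_bytes=70)
print(ram / MiB)  # 560.0
```

Since 1TiB / 128KiB is 2^23 (8,388,608) records, each byte of header costs 8MiB of RAM: 560MiB at 70 bytes, versus 1,440MiB at the old 180 bytes. Datasets with smaller recordsizes need proportionally more headers, and therefore more RAM.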

Although the RAM constraint problem is largely solved, the value of a large, fast L2ARC was still sharply limited by a lack of persistence. After each system reboot (or other export of the pool), the L2ARC empties. Amanakis’ code fixes that, meaning that many gigabytes of data cached on fast solid state devices will still be available after a system reboot, thereby increasing the value of an L2ARC device. At first blush, this seems mostly notable for personal systems that get rebooted often, but it also means that many heavily loaded servers might need much less “babying” while they warm up their caches after a reboot.

This code has not yet been merged into master, but Brian Behlendorf, Linux project lead of the OpenZFS project, has signed off on it, and it’s awaiting another code review before merging into master, which is expected to happen sometime in the next few weeks if nothing bad comes up in further review or initial testing.
