Disadvantages of using ZFS recordsize 16k instead of 128k

I'm using Proxmox on a dedicated server. For production I'm still using ext4, but I decided to start messing around with ZFS.

So I have created two separate ZFS storage pools with different recordsizes (see the sketch after the list):

  • 128k for everything except MySQL/InnoDB
  • 16k for MySQL/InnoDB (because 16k is the default InnoDB page size, which is what I'm using)
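
For reference, a minimal sketch of how such a split could be set up from the shell (pool names here are hypothetical, not the ones actually used):

    # general-purpose pool keeps the 128K default
    zfs set recordsize=128K tank
    # pool dedicated to MySQL/InnoDB, matching InnoDB's 16K page size
    zfs set recordsize=16K dbpool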

I added that 16k pool to check whether it really makes a difference for MySQL/InnoDB database performance, and it really does: I get about 40% more transactions per second and 25% lower latency (I have tested this thoroughly with sysbench and tpcc).

For practical reasons, at this moment I would prefer to use one big pool with a 16k recordsize instead of two separate pools (16k and 128k). I know that I can create subvolumes (datasets) on a single ZFS pool and give them different recordsizes, but this is also something I want to avoid, as I prefer to keep this manageable from the Proxmox GUI.
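
A minimal sketch of that per-dataset approach, for completeness (pool/dataset names are hypothetical):

    # one pool: the top level keeps the 128K default,
    # a child dataset overrides it just for InnoDB data
    zfs create -o recordsize=16K tank/mysql
    zfs get recordsize tank tank/mysql   # verify inherited vs. overridden values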


My questions:

  1. What disadvantages can I encounter if I start using a small (16k) recordsize for everything instead of 128k (the default on Proxmox)?

  2. Does the QEMU disk image have an equivalent of innodb_page_size? If it does, what size is it?

    I have tried to check it with qemu-img info:

     $ qemu-img info vm-100-disk-0.raw
     image: vm-100-disk-0.raw
     file format: raw
     virtual size: 4 GiB (4294967296 bytes)
     disk size: 672 MiB
    
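
For reference, the ZFS-side block size backing the image can be checked separately; a sketch with hypothetical dataset/zvol names:

    # recordsize applies when the image is a plain file on a dataset
    zfs get recordsize rpool/data
    # volblocksize applies when the VM disk is a zvol instead
    zfs get volblocksize rpool/data/vm-100-disk-0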

Server usage is:

  • containers for www/php (tons of small files, but inside a container disk file)
  • containers for java/spring applications (they produce a lot of logs)
  • containers for mysql/innodb databases (no explanation required)
  • local backup/restore operations including compressing backups
  • messing around with large gzip files (not every day, low priority)

Answer 1

Short answer: It really depends on your expected use case. As a general rule, the default 128K recordsize is a good choice on mechanical disks (where access latency is dominated by seek time + rotational delay). For an all-SSD pool, I would probably use 16K or at most 32K (only if the latter provides a significant compression efficiency increase for your data).

Long answer: With an HDD pool, I recommend sticking with the default 128K recordsize for datasets and using a 128K volblocksize for zvols as well. The rationale is that access latency for a 7.2K RPM HDD is dominated by seek time, which does not scale with recordsize/volblocksize. Let's do some math: a 7.2K RPM HDD has an average seek time of 8.3 ms, while reading a 128K block only takes ~1 ms. So commanding a head seek (with an 8 ms+ delay) just to read a small 16K block seems wasteful, especially considering that for smaller reads/writes you are still impaired by read/modify/write latency. Moreover, a small recordsize means bigger metadata overhead and worse compression. So while InnoDB issues 16K I/Os, and on a dedicated dataset one can use a 16K recordsize to avoid read/modify/write and write amplification, for mixed-use datasets (i.e. ones used not only for the DB itself but also for more general workloads) I would suggest staying at 128K, especially considering the compression impact of a small recordsize.
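
To ground the ~1 ms figure, a quick back-of-the-envelope check (the ~130 MB/s sequential throughput is an assumed, typical value for a 7.2K RPM disk):

    # milliseconds needed to transfer one 128K record at ~130 MB/s
    echo 'scale=2; 128*1024*1000 / (130*10^6)' | bc   # ~1.00 ms, versus ~8.3 ms average seek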

However, for an SSD pool I would use a much smaller volblocksize/recordsize, possibly in the range of 16-32K. The rationale is that SSDs have much lower access times but limited endurance, so writing a full 128K block for smaller writes seems excessive. Moreover, the I/O bandwidth amplification caused by a large recordsize is much more concerning on a high-IOPS device such as a modern SSD (i.e. you risk saturating your bandwidth before reaching the IOPS limit).
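
If the disks are exposed to VMs as zvols (as Proxmox does for VM disks stored on ZFS), keep in mind that volblocksize can only be chosen at creation time; a hedged sketch with a hypothetical zvol name and size:

    # sparse 32G zvol created with a 16K block size
    zfs create -s -V 32G -o volblocksize=16K tank/vm-101-disk-0
    zfs get volblocksize tank/vm-101-disk-0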

Answer 2

I recommend tuning if and when you encounter a problem.

ZFS defaults to 128K record sizes, and that is acceptable and valid for most configurations and applications.

Exceptions to this include:

  • certain database applications; a smaller value may be appropriate.
    The tradeoff is that compression will be far less effective, which may have a greater impact on performance than the higher transaction count! (A quick way to check is sketched after this list.)
  • large media workloads (e.g. video editing); a larger value is useful
  • specific workloads that fall outside the normal ZFS use cases
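
Regarding the compression tradeoff mentioned in the first bullet, the achieved ratio is easy to inspect (dataset name is hypothetical):

    # see how well the current data actually compresses
    zfs get compression,compressratio tank/mysql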

If you feel that the performance of the database benchmark is better with a certain record size, use it!
But have you tested with a realistic non-benchmarking workload to make sure you're adjusting for the right thing?

Answer 3

For what it's worth, setting "recordsize=16K" is the recommendation of the ZFS documentation itself:

https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html#innodb

EDIT: I just reverted this setting after running with it for less than 12 hours on a Proxmox server for a virtual server with quite a big database (>60 GB of data). The server seriously fell behind in analyzing data. In fact, the 'z_rd_int_' processes jumped from low CPU usage to about 5% each, while the 'z_wr_int_' processes dropped in CPU usage - probably because less data was being processed.

However, changing the checksum algorithm to edonr (zfs set checksum=edonr vmpool) did have a positive impact: perf top no longer shows SHA256TransformBlocks as the top kernel function.

So the recommendation does not appear to be good in all cases - it can be reverted to the original setting.
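
For reference, reverting is just a matter of setting the property back or clearing the local override (pool name taken from the command above); note that recordsize changes only affect data written after the change:

    # either set the default value explicitly ...
    zfs set recordsize=128K vmpool
    # ... or drop the local setting so the default applies again
    zfs inherit recordsize vmpool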
