Disadvantages of using ZFS recordsize 16k instead of 128k

I'm using Proxmox on a dedicated server. For production I'm still using ext4, but I decided to start messing around with ZFS.

So I have created two separate ZFS storage pools with different recordsizes (see the sketch after the list):

  • 128k for everything except MySQL/InnoDB
  • 16k for MySQL/InnoDB (because 16k is the default InnoDB page size, which is what I'm using)
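
For reference, a minimal sketch of how such a split could be set up from the shell (pool names here are hypothetical, not the ones actually used):

    # general-purpose pool keeps the 128K default
    zfs set recordsize=128K tank
    # pool dedicated to MySQL/InnoDB, matching InnoDB's 16K page size
    zfs set recordsize=16K dbpool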

I added that 16k pool to check whether it really makes a difference for MySQL/InnoDB database performance, and it really does: I get about 40% more transactions per second and 25% lower latency (I have tested this thoroughly with sysbench and tpcc).

For practical reasons, at this moment I would prefer to use one big pool with a 16k recordsize instead of two separate pools (16k and 128k). I know that I can create subvolumes (datasets) on a single ZFS pool and give them different recordsizes, but this is also something I want to avoid, as I prefer to keep this manageable from the Proxmox GUI.
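
A minimal sketch of that per-dataset approach, for completeness (pool/dataset names are hypothetical):

    # one pool: the top level keeps the 128K default,
    # a child dataset overrides it just for InnoDB data
    zfs create -o recordsize=16K tank/mysql
    zfs get recordsize tank tank/mysql   # verify inherited vs. overridden values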


My questions:

  1. What disadvantages can I encounter if I start using a small (16k) recordsize for everything instead of 128k (the default on Proxmox)?

  2. Does the QEMU disk image have an equivalent of innodb_page_size? If it does, what size is it?

    I have tried to check it with qemu-img info:

     $ qemu-img info vm-100-disk-0.raw
     image: vm-100-disk-0.raw
     file format: raw
     virtual size: 4 GiB (4294967296 bytes)
     disk size: 672 MiB
    
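
For reference, the ZFS-side block size backing the image can be checked separately; a sketch with hypothetical dataset/zvol names:

    # recordsize applies when the image is a plain file on a dataset
    zfs get recordsize rpool/data
    # volblocksize applies when the VM disk is a zvol instead
    zfs get volblocksize rpool/data/vm-100-disk-0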

Server usage is:

  • containers for www/php (tons of small files, but inside a container disk file)
  • containers for java/spring applications (they produce a lot of logs)
  • containers for mysql/innodb databases (no explanation required)
  • local backup/restore operations including compressing backups
  • messing around with large gzip files (not every day, low priority)

Answer 1

Short answer: It really depends on your expected use case. As a general rule, the default 128K recordsize is a good choice on mechanical disks (where access latency is dominated by seek time + rotational delay). For an all-SSD pool, I would probably use 16K or at most 32K (only if the latter provides a significant compression efficiency increase for your data).

Long answer: With an HDD pool, I recommend sticking with the default 128K recordsize for datasets and using a 128K volblocksize for zvols as well. The rationale is that access latency for a 7.2K RPM HDD is dominated by seek time, which does not scale with recordsize/volblocksize. Let's do some math: a 7.2K RPM HDD has an average seek time of 8.3 ms, while reading a 128K block only takes ~1 ms. So commanding a head seek (with an 8 ms+ delay) just to read a small 16K block seems wasteful, especially considering that for smaller reads/writes you are still impaired by read/modify/write latency. Moreover, a small recordsize means bigger metadata overhead and worse compression. So while InnoDB issues 16K I/Os, and on a dedicated dataset one can use a 16K recordsize to avoid read/modify/write and write amplification, for mixed-use datasets (i.e. ones used not only for the DB itself but also for more general workloads) I would suggest staying at 128K, especially considering the compression impact of a small recordsize.
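
To ground the ~1 ms figure, a quick back-of-the-envelope check (the ~130 MB/s sequential throughput is an assumed, typical value for a 7.2K RPM disk):

    # milliseconds needed to transfer one 128K record at ~130 MB/s
    echo 'scale=2; 128*1024*1000 / (130*10^6)' | bc   # ~1.00 ms, versus ~8.3 ms average seek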

However, for an SSD pool I would use a much smaller volblocksize/recordsize, possibly in the range of 16-32K. The rationale is that SSDs have much lower access times but limited endurance, so writing a full 128K block for smaller writes seems excessive. Moreover, the I/O bandwidth amplification caused by a large recordsize is much more concerning on a high-IOPS device such as a modern SSD (i.e. you risk saturating your bandwidth before reaching the IOPS limit).
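
If the disks are exposed to VMs as zvols (as Proxmox does for VM disks stored on ZFS), keep in mind that volblocksize can only be chosen at creation time; a hedged sketch with a hypothetical zvol name and size:

    # sparse 32G zvol created with a 16K block size
    zfs create -s -V 32G -o volblocksize=16K tank/vm-101-disk-0
    zfs get volblocksize tank/vm-101-disk-0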

Answer 2

I recommend tuning if and when you encounter a problem.

ZFS defaults to 128K record sizes, and that is acceptable and valid for most configurations and applications.

Exceptions to this include:

  • certain database applications; a smaller value may be appropriate.
    The tradeoff is that compression will be far less effective, which may have a greater impact on performance than the higher transaction count! (A quick way to check is sketched after this list.)
  • large media workloads (e.g. video editing); a larger value is useful
  • specific workloads that fall outside the normal ZFS use cases
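
Regarding the compression tradeoff mentioned in the first bullet, the achieved ratio is easy to inspect (dataset name is hypothetical):

    # see how well the current data actually compresses
    zfs get compression,compressratio tank/mysql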

If you feel that the performance of the database benchmark is better with a certain record size, use it!
But have you tested with a realistic non-benchmarking workload to make sure you're adjusting for the right thing?

Answer 3

For what it's worth, setting "recordsize=16K" is the recommendation of the ZFS documentation itself:

https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html#innodb

EDIT: I just reverted this setting after running with it for less than 12 hours on a Proxmox server for a virtual server with quite a big database (>60 GB of data). The server seriously fell behind in analyzing data. In fact, the 'z_rd_int_' processes jumped from low CPU usage to about 5% each, while the 'z_wr_int_' processes dropped in CPU usage - probably because less data was being processed.

However, changing the checksum algorithm to edonr (zfs set checksum=edonr vmpool) did have a positive impact: perf top no longer shows SHA256TransformBlocks as the top kernel function.

So the recommendation does not appear to be good in all cases - it can be reverted to the original setting.
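
For reference, reverting is just a matter of setting the property back or clearing the local override (pool name taken from the command above); note that recordsize changes only affect data written after the change:

    # either set the default value explicitly ...
    zfs set recordsize=128K vmpool
    # ... or drop the local setting so the default applies again
    zfs inherit recordsize vmpool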
