I'm using Proxmox on a dedicated server. For production I'm still using ext4, but I decided to start messing around with ZFS.
So I have created two separate ZFS storage pools with different recordsizes:
- 128k for everything except MySQL/InnoDB
- 16k for MySQL/InnoDB (because 16k is the default InnoDB page size, which is what I'm using)
I added the 16k pool to check whether it really makes a difference for MySQL/InnoDB database performance, and it really does. I get about 40% more transactions per second and 25% lower latency (tested thoroughly with sysbench and tpcc).
For practical reasons, at the moment I would prefer to use one big pool with a 16k recordsize instead of two separate pools (16k and 128k). I know that I can create subvolumes (datasets) on a single ZFS pool and give them different recordsizes, but that is also something I want to avoid, since I prefer to keep this manageable from the Proxmox GUI.
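For reference, the per-dataset variant I would like to avoid looks roughly like this (pool and dataset names below are just placeholders, not my real layout):

```
# one pool, different recordsize per dataset (placeholder names)
zfs create -o recordsize=128k tank/general
zfs create -o recordsize=16k  tank/mysql
zfs get recordsize tank/general tank/mysql
```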
My questions:
What disadvantages can I encounter if I start using a small (16k) recordsize for everything instead of 128k (which was the default on Proxmox)?
Does the QEMU disk image have an equivalent of innodb_page_size? If it does, what size is it?
I have tried to check it with qemu-img info:

```
$ qemu-img info vm-100-disk-0.raw
image: vm-100-disk-0.raw
file format: raw
virtual size: 4 GiB (4294967296 bytes)
disk size: 672 MiB
```
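For completeness, the ZFS-side block sizes of whatever backs the image can be queried like this (the dataset and zvol names below are placeholders):

```
# recordsize applies to datasets (e.g. raw files on a directory storage),
# volblocksize applies to zvols (the default backing for VM disks on a Proxmox "zfspool" storage)
zfs get recordsize   tank/images           # placeholder dataset name
zfs get volblocksize tank/vm-100-disk-0    # placeholder zvol name
```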
Server usage is:
- containers for www/php (tons of small files, but inside a container disk file)
- containers for java/spring applications (they produce a lot of logs)
- containers for mysql/innodb databases (no explanation required)
- local backup/restore operations including compressing backups
- messing around with large gzip files (not every day, low priority)
Answer 1
Short answer: It really depends on your expected use case. As a general rule, the default 128K recordsize is a good choice on mechanical disks (where access latency is dominated by seek time + rotational delay). For an all-SSD pool, I would probably use 16K or at most 32K (only if the latter provides a significant compression efficiency increase for your data).
Long answer: With an HDD pool, I recommend sticking with the default 128K recordsize for datasets and using a 128K volblocksize for zvols as well. The rationale is that access latency for a 7.2K RPM HDD is dominated by seek time, which does not scale with recordsize/volblocksize. Let's do some math: a 7.2K HDD has an average seek time of 8.3ms, while reading a 128K block only takes ~1ms. So commanding a head seek (with an 8ms+ delay) to read a small 16K block seems wasteful, especially considering that for smaller reads/writes you are still impaired by read/modify/write latency. Moreover, a small recordsize means bigger metadata overhead and worse compression. So while InnoDB issues 16K IOs, and a dedicated dataset can use a 16K recordsize to avoid read/modify/write and write amplification, for a mixed-use dataset (i.e. one used not only for the DB itself but for more general workloads as well) I would suggest staying at 128K, especially considering the compression impact of a small recordsize.
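A quick back-of-the-envelope check of those numbers (the ~130 MB/s sequential throughput below is an illustrative assumption, not a measured figure):

```
# time to transfer one 128K record at ~130 MB/s, in milliseconds
echo 'scale=2; 128 * 1000 / (130 * 1024)' | bc    # ~0.96 ms, versus ~8.3 ms of average seek
```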
However, for an SSD pool I would use a much smaller volblocksize/recordsize, possibly in the range of 16-32K. The rationale is that SSDs have much lower access times but limited endurance, so writing a full 128K block for smaller writes seems excessive. Moreover, the I/O bandwidth amplification caused by a large recordsize is much more of a concern on high-IOPS devices such as modern SSDs (i.e. you risk saturating your bandwidth before reaching the IOPS limit).
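As a sketch of how this applies to a zvol-backed VM disk (pool name, zvol name and size are placeholders; note that volblocksize can only be set at creation time, and on Proxmox the corresponding knob is the blocksize option of the zfspool storage):

```
# manually creating a zvol with a 16K volblocksize (placeholder names/size)
zfs create -o volblocksize=16k -V 32G tank/vm-100-disk-1
zfs get volblocksize tank/vm-100-disk-1
```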
Answer 2
I recommend tuning if and when you encounter a problem.
ZFS defaults to 128K record sizes, and that is acceptable and valid for most configurations and applications.
Exceptions to this include:
- certain database applications, where a smaller value may be appropriate. The tradeoff is that compression will be far less effective, which may have a greater impact on performance than the higher transaction count (a quick check is sketched at the end of this answer)!
- large media workloads (e.g. video editing), where a larger value is useful
- specific workloads that fall outside the normal ZFS use cases
If you feel that the performance of the database benchmark is better with a certain record size, use it!
But have you tested with a realistic non-benchmarking workload to make sure you're adjusting for the right thing?
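On the compression tradeoff mentioned above, a quick way to see what you would actually give up is to compare the achieved ratios of the datasets involved (dataset names are placeholders):

```
# compare achieved compression on a 128K dataset vs. a 16K one (placeholder names)
zfs get recordsize,compression,compressratio tank/general tank/mysql
```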
Answer 3
For what it's worth, setting recordsize=16K is recommended by the OpenZFS documentation itself:
https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html#innodb
EDIT: I just reverted this setting after having it in place for less than 12 hours on a Proxmox server hosting a virtual server with quite a big database (>60GB of data). The server fell seriously behind in analyzing data. In fact, the 'z_rd_int_' processes jumped from low CPU usage to about 5% each, while the 'z_wr_int_' processes dropped in CPU usage - probably because less data was being processed.
However, changing the checksum algorithm to edonr (zfs set checksum=edonr vmpool) did have a positive impact: perf top no longer shows SHA256TransformBlocks as the top kernel function.
So the recommendation does not appear to be good in all cases - fortunately it can be reverted to the original settings.
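For anyone who wants to undo the same experiment, reverting both settings is straightforward (vmpool as above; note that already-written blocks keep their old recordsize/checksum until they are rewritten):

```
# put recordsize and checksum back to their defaults (only affects newly written data)
zfs inherit recordsize vmpool
zfs set checksum=on vmpool    # "on" selects the default algorithm (fletcher4)
```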