In this Document
  Symptoms
  Cause
  Solution
APPLIES TO:

Linux OS - Version Oracle Linux 7.4 and later
Linux x86-64
SYMPTOMS

The RHCK kernel with the default Automatic NUMA Balancing enabled can show high I/O wait times caused by NUMA hinting page faults during page migration.

On NUMA hardware, the access speed to main memory is determined by the location of the memory relative to the CPU. The performance of a workload depends on the application threads accessing data that is local to the CPU the thread is executing on. Automatic NUMA Balancing migrates data on demand to memory nodes that are local to the CPU accessing that data. When it is enabled, the command "cat /proc/sys/kernel/numa_balancing" returns 1.

NUMA balancing is achieved through the following steps:

1. A task scanner periodically scans a portion of a task's address space and marks the memory to force a page fault when the data is next accessed.
2. The next access to the data results in a NUMA hinting fault. Based on this fault, the data can be migrated to a memory node associated with the task accessing the memory.
3. To keep a task, the CPU it is using, and the memory it is accessing together, the scheduler groups tasks that share data.

It is these induced page faults for page migration that cause the high %iowait. Whether this overhead is offset by the benefit of tasks accessing local node memory depends on the workload.
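As a quick check before looking at the performance data below, the following sketch confirms whether Automatic NUMA Balancing is active and collects the same kind of vmstat, mpstat, and ps evidence shown in this note. The sampling intervals and the exact mpstat/ps options are illustrative choices, not taken from the original data collection.

# 1 = Automatic NUMA Balancing enabled, 0 = disabled
cat /proc/sys/kernel/numa_balancing

# Watch the "b" (blocked tasks) and "wa" (%iowait) columns
vmstat 5 5

# Per-CPU %iowait
mpstat -P ALL 1 5

# List tasks currently in uninterruptible sleep (D state)
ps -eo pid,user,stat,wchan:20,time,cmd | awk '$3 ~ /^D/'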
Performance data shows tasks blocking with high %iowait, but no correspondingly heavy disk read/write activity from system reads and writes that would explain it.

vmstat shows blocked tasks ("b") and high I/O wait ("wa"):

zzz ***Mon Dec 28 11:25:22 CST 2020
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r    b swpd      free   buff     cache  si so    bi  bo     in     cs us sy id wa st
24  426    0 502696000 506280 296023456   0  0   208  48      0      0  2  1 97  0  0
36  543    0 502876320 506280 296249664   0  0 10923 591 161479 128670  9  7 19 64  0
14 1023    0 502625120 506280 296354304   0  0  6363 490 167416 111990  7  7  8 78  0

zzz ***Mon Dec 28 11:26:33 CST 2020
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r    b swpd      free   buff     cache  si so    bi   bo     in     cs us sy id wa st
25  133    0 563774592 506280 294375328   0  0   208   48      0      0  2  1 97  0  0
50  115    0 563888192 506280 294376992   0  0 38491 3054 190037 169273 12  8 72  8  0
40    5    0 561866688 506280 294369888   0  0 46107 1024 196046 181620 13  7 72  8  0

ps shows tasks in D state (uninterruptible sleep):

poracle 406661 1 19 1.4 1.0 87820580 11342140 wait_o D 05:00:03 00:05:34 oraclecums2 (LOCAL=NO)
poracle 379722 1 19 4.9 1.0 87814416 10938380 wait_o D 08:56:48 00:07:24 oraclecums2 (LOCAL=NO)
poracle 375667 1 19 2.4 1.0 87817524 11148184 wait_o D 08:55:43 00:03:44 oraclecums2 (LOCAL=NO)
poracle 374255 1 19 1.3 1.0 87830900 11263564 wait_o D 08:55:00 00:02:03 oraclecums2 (LOCAL=NO)

mpstat shows high %iowait:

Linux 3.10.0-693.11.6.el7.x86_64 (hostname)   12/28/2020   _x86_64_   (176 CPU)

11:35:20 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
11:35:21 AM  all    9.23    0.00    7.35   77.18    0.00    0.77    0.00    0.00    0.00    5.47
11:35:21 AM    0   14.14    0.00   41.41   43.43    0.00    1.01    0.00    0.00    0.00    0.00
11:35:21 AM    1    4.55    0.00    2.27   93.18    0.00    0.00    0.00    0.00    0.00    0.00
11:35:21 AM    2    5.32    0.00    3.19   91.49    0.00    0.00    0.00    0.00    0.00    0.00
11:35:21 AM    3    5.15    0.00    5.15   89.69    0.00    0.00    0.00    0.00    0.00    0.00
11:35:21 AM    4   33.00    0.00   27.00   22.00    0.00   18.00    0.00    0.00    0.00    0.00
11:35:21 AM    5    4.08    0.00    5.10   88.78    0.00    2.04    0.00    0.00    0.00    0.00
11:35:21 AM    6    6.19    0.00    7.22   85.57    0.00    1.03    0.00    0.00    0.00    0.00
11:35:21 AM    7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
11:35:21 AM    8   39.60    0.00   12.87   32.67    0.00   14.85    0.00    0.00    0.00    0.00
11:35:21 AM    9    4.08    0.00    7.14   83.67    0.00    0.00    0.00    0.00    0.00    5.10
11:35:21 AM   10    6.19    0.00    4.12   89.69    0.00    0.00    0.00    0.00    0.00    0.00
11:35:21 AM   11    4.08    0.00    2.04   93.88    0.00    0.00    0.00    0.00    0.00    0.00
11:35:21 AM   12   36.00    0.00   11.00   50.00    0.00    3.00    0.00    0.00    0.00    0.00
11:35:21 AM   13    4.08    0.00   16.33   79.59    0.00    0.00    0.00    0.00    0.00    0.00
11:35:21 AM   14   10.42    0.00    4.17   85.42    0.00    0.00    0.00    0.00    0.00    0.00
11:35:21 AM   15    5.21    0.00    4.17   89.58    0.00    1.04    0.00    0.00    0.00    0.00
11:35:21 AM   16   32.35    0.00   11.76   53.92    0.00    1.96    0.00    0.00    0.00    0.00
11:35:21 AM   17    5.05    0.00   14.14   80.81    0.00    0.00    0.00    0.00    0.00    0.00
11:35:21 AM   18    7.14    0.00    8.16   84.69    0.00    0.00    0.00    0.00    0.00    0.00
11:35:21 AM   19    5.15    0.00    6.19   88.66    0.00    0.00    0.00    0.00    0.00    0.00
11:35:21 AM   20   29.29    0.00   17.17   52.53    0.00    1.01    0.00    0.00    0.00    0.00

NUMA statistics can be found in /proc/vmstat:

numa_pte_updates
  The number of base pages that were marked for NUMA hinting faults.

numa_huge_pte_updates
  The number of transparent huge pages that were marked for NUMA hinting faults. In combination with numa_pte_updates, the total address space that was marked can be calculated.

numa_hint_faults
  Records how many NUMA hinting faults were trapped.

numa_hint_faults_local
  Shows how many of the hinting faults were to local nodes. In combination with numa_hint_faults, the percentage of local versus remote faults can be calculated. A high percentage of local hinting faults indicates that the workload is closer to being converged.

numa_pages_migrated
  Records how many pages were migrated because they were misplaced. As migration is a copying operation, it contributes the largest part of the overhead created by NUMA balancing.

Note that the Oracle UEK4 kernel has numa_balancing turned off starting from kernel version 4.1.12-124.20.5, per Bug 28814880 - Enabling numa balancing causes high I/O wait on numa systems.

CAUSE

numa_balancing is on by default for the Oracle Linux 7 RHCK kernel, which induces high %iowait due to NUMA hinting page faults for page migration.

SOLUTION

Turn off numa_balancing:

echo 0 > /proc/sys/kernel/numa_balancing
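The echo above takes effect immediately but does not survive a reboot. Below is a minimal sketch of disabling NUMA balancing, persisting the setting with sysctl, and verifying the result; the sysctl.d file name is only an example, and the persistence and verification steps are additions for illustration rather than part of the original solution.

# Disable Automatic NUMA Balancing at runtime
echo 0 > /proc/sys/kernel/numa_balancing

# Persist the setting across reboots (example file name)
echo "kernel.numa_balancing = 0" > /etc/sysctl.d/90-disable-numa-balancing.conf
sysctl -p /etc/sysctl.d/90-disable-numa-balancing.conf

# Verify: the value should now read 0, and the numa_* counters in /proc/vmstat
# (numa_hint_faults, numa_pages_migrated, ...) should stop increasing
cat /proc/sys/kernel/numa_balancing
grep '^numa_' /proc/vmstat; sleep 60; grep '^numa_' /proc/vmstat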