* Re: scalability regressions related to hugetlb_fault() changes
From: Randy Dunlap
Date: 2022-03-24 21:55 UTC
To: Ray Fucillo, linux-kernel, linux-mm

[add linux-mm mailing list]

On 3/24/22 13:12, Ray Fucillo wrote:
> In moving to newer versions of the kernel, our customers have experienced
> dramatic new scalability problems in our database application, InterSystems
> IRIS. Our research has narrowed this down to new processes that attach to
> the database's shared memory segment incurring very long delays (in some
> cases ~100ms!) acquiring i_mmap_lock_read() in hugetlb_fault() as they
> fault in huge pages for the first time. The addition of this lock in
> hugetlb_fault() matches the kernel versions where we see this problem. It
> is not only the new process that suffers the delay; other processes back
> up behind it when the page fault occurs inside a critical section within
> the database application.
>
> Is there something that can be improved here?
>
> The read locks in hugetlb_fault() contend with write locks that are taken
> in very common application code paths: shmat(), process exit, fork() (not
> vfork()), shmdt(), and presumably others. So contention on the read lock
> in hugetlb_fault() turns out to be common. When the system is loaded,
> there will be many new processes faulting in pages that may block the
> write lock, which in turn blocks more readers faulting behind it, and so
> on... I don't think there's any support for shared page tables in hugetlb
> that would avoid the faults altogether.
>
> Switching to 1GB huge pages instead of 2MB is a good mitigation in
> reducing the frequency of faults, but not a complete solution.
>
> Thanks for considering.
>
> Ray

--
~Randy
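The convoy Ray describes can be modeled entirely in userspace. The sketch
below is not kernel code: a writer-preferring pthread rwlock stands in for
the inode's i_mmap_rwsem, the reader threads stand in for processes inside
hugetlb_fault(), and the single writer stands in for the shmdt()/exit/
truncate paths; all thread counts, iteration counts, and sleep times are
made up for illustration. It only shows how one long write-side hold stalls
many unrelated read-side "faults".

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_rwlock_t rwsem;
static pthread_mutex_t stat_lock = PTHREAD_MUTEX_INITIALIZER;
static double max_wait_ms;

static double now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

static void *faulting_thread(void *arg)
{
	for (int i = 0; i < 500; i++) {
		double start = now_ms();

		pthread_rwlock_rdlock(&rwsem);	/* models i_mmap_lock_read() in hugetlb_fault() */
		double waited = now_ms() - start;

		usleep(100);			/* fault handling under the read lock */
		pthread_rwlock_unlock(&rwsem);

		pthread_mutex_lock(&stat_lock);
		if (waited > max_wait_ms)
			max_wait_ms = waited;
		pthread_mutex_unlock(&stat_lock);
	}
	return arg;
}

static void *teardown_thread(void *arg)
{
	for (int i = 0; i < 20; i++) {
		pthread_rwlock_wrlock(&rwsem);	/* models shmdt()/exit/truncate paths */
		usleep(20000);			/* long teardown while holding it */
		pthread_rwlock_unlock(&rwsem);
		usleep(5000);
	}
	return arg;
}

int main(void)
{
	pthread_t readers[8], writer;
	pthread_rwlockattr_t attr;

	/* Prefer writers so readers queue behind a waiting writer, roughly
	 * like the kernel's rw_semaphore; glibc's default prefers readers. */
	pthread_rwlockattr_init(&attr);
	pthread_rwlockattr_setkind_np(&attr,
			PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
	pthread_rwlock_init(&rwsem, &attr);

	for (int i = 0; i < 8; i++)
		pthread_create(&readers[i], NULL, faulting_thread, NULL);
	pthread_create(&writer, NULL, teardown_thread, NULL);

	for (int i = 0; i < 8; i++)
		pthread_join(readers[i], NULL);
	pthread_join(writer, NULL);

	printf("worst read-side wait: %.1f ms\n", max_wait_ms);
	return 0;
}

Build with "cc -O2 -pthread"; the worst read-side wait tracks the length of
the write-side hold, which is the shape of the regression reported above.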
* Re: scalability regressions related to hugetlb_fault() changes
From: Mike Kravetz
Date: 2022-03-24 22:41 UTC
To: Ray Fucillo, linux-kernel, linux-mm

On 3/24/22 14:55, Randy Dunlap wrote:
> [add linux-mm mailing list]
>
> On 3/24/22 13:12, Ray Fucillo wrote:
>> In moving to newer versions of the kernel, our customers have experienced
>> dramatic new scalability problems in our database application,
>> InterSystems IRIS. Our research has narrowed this down to new processes
>> that attach to the database's shared memory segment incurring very long
>> delays (in some cases ~100ms!) acquiring i_mmap_lock_read() in
>> hugetlb_fault() as they fault in huge pages for the first time. The
>> addition of this lock in hugetlb_fault() matches the kernel versions
>> where we see this problem. It is not only the new process that suffers
>> the delay; other processes back up behind it when the page fault occurs
>> inside a critical section within the database application.
>>
>> Is there something that can be improved here?
>>
>> The read locks in hugetlb_fault() contend with write locks that are
>> taken in very common application code paths: shmat(), process exit,
>> fork() (not vfork()), shmdt(), and presumably others. So contention on
>> the read lock in hugetlb_fault() turns out to be common. When the system
>> is loaded, there will be many new processes faulting in pages that may
>> block the write lock, which in turn blocks more readers faulting behind
>> it, and so on... I don't think there's any support for shared page
>> tables in hugetlb that would avoid the faults altogether.
>>
>> Switching to 1GB huge pages instead of 2MB is a good mitigation in
>> reducing the frequency of faults, but not a complete solution.
>>
>> Thanks for considering.
>>
>> Ray

Hi Ray,

Acquiring i_mmap_rwsem in hugetlb_fault was added in the v5.7 kernel with
commit c0d0381ade79 "hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization". Ironically, this was added because of correctness
(possible data corruption) issues with huge pmd sharing (shared page tables
for hugetlb at the pmd level). It is used to synchronize the fault path,
which sets up the sharing, with the unmap (or other) paths that tear down
the sharing.

As mentioned in the commit message, it is 'possible' to approach this issue
in different ways, such as catching the races, then cleaning up, backing
out and retrying. Adding the synchronization seemed to be the most direct
and least error prone approach.

I also seem to remember thinking about the possibility of avoiding the
synchronization if pmd sharing is not possible. That may be a relatively
easy way to speed things up. Not sure if pmd sharing comes into play in
your customer environments; my guess would be yes (shared mapping ranges
more than 1GB in size and aligned to 1GB).

It has been a couple of years since c0d0381ade79; I will take some time to
look into alternatives and/or improvements.

Also, do you have any specifics about the regressions your customers are
seeing? Specifically, which paths are holding i_mmap_rwsem in write mode
for long periods of time? I would expect something related to unmap.
Truncation can have long hold times, especially if there are many shared
mappings. Always worth checking specifics, but more likely this is a
general issue.

--
Mike Kravetz
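For readers unfamiliar with the eligibility rule Mike is referring to: pmd
sharing is only considered when a shared hugetlb mapping fully covers at
least one PUD-sized, PUD-aligned region (1GB on x86_64 with 2MB huge
pages). The userspace sketch below shows just that arithmetic; the 1GB
constant, helper name, and example addresses are assumptions for
illustration, not the kernel's actual code.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PUD_SIZE_SKETCH (1ULL << 30)		/* 1GB, assumed */
#define PUD_MASK_SKETCH (~(PUD_SIZE_SKETCH - 1))

/* Does [start, end) contain at least one full, 1GB-aligned 1GB chunk? */
static bool range_may_share_pmds(uint64_t start, uint64_t end)
{
	uint64_t base = (start + PUD_SIZE_SKETCH - 1) & PUD_MASK_SKETCH;

	return end >= base + PUD_SIZE_SKETCH;
}

int main(void)
{
	/* A hundreds-of-GB segment easily qualifies... */
	printf("%d\n", range_may_share_pmds(0x7f0000000000ULL,
					    0x7f0000000000ULL + (200ULL << 30)));
	/* ...a 512MB mapping does not. */
	printf("%d\n", range_may_share_pmds(0x7f0000000000ULL,
					    0x7f0000000000ULL + (512ULL << 20)));
	return 0;
}

A database segment of hundreds of GB trivially satisfies this condition,
which is why the "skip the lock when sharing is impossible" shortcut does
not help in this report.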
* Re: scalability regressions related to hugetlb_fault() changes
From: Ray Fucillo
Date: 2022-03-25 0:02 UTC
To: Mike Kravetz
Cc: Ray Fucillo, linux-kernel, linux-mm

> On Mar 24, 2022, at 6:41 PM, Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> I also seem to remember thinking about the possibility of avoiding the
> synchronization if pmd sharing is not possible. That may be a relatively
> easy way to speed things up. Not sure if pmd sharing comes into play in
> your customer environments; my guess would be yes (shared mapping ranges
> more than 1GB in size and aligned to 1GB).

Hi Mike,

This is one very large shared memory segment allocated at database startup.
It's common for it to be hundreds of GB. We allocate it with shmget(),
passing SHM_HUGETLB (when huge pages have been reserved for us). Not sure
if that answers...

> Also, do you have any specifics about the regressions your customers are
> seeing? Specifically, which paths are holding i_mmap_rwsem in write mode
> for long periods of time? I would expect something related to unmap.
> Truncation can have long hold times, especially if there are many shared
> mappings. Always worth checking specifics, but more likely this is a
> general issue.

We've seen the write lock originate from calling shmat(), shmdt(), and
process exit. We've also seen it from a fork() of one of the processes that
are attached to the shared memory segment. Some evidence suggests that fork
is a more costly case. However, while there are some important places where
we'd use fork(), it's the more unusual case because most process creation
will vfork() and execv() a new database process (which then attaches with
shmat()).
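For concreteness, the allocation pattern Ray describes looks roughly like
the following userspace sketch. The 4GB size, 0600 permissions, and use of
IPC_PRIVATE are illustrative only (the real segment is hundreds of GB and
keyed so other processes can attach); it assumes huge pages have been
reserved via vm.nr_hugepages and that the caller may use SHM_HUGETLB
(CAP_IPC_LOCK or membership in vm.hugetlb_shm_group).

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000	/* fallback if the libc headers lack it */
#endif

int main(void)
{
	size_t size = 4UL << 30;	/* 4GB; must be a multiple of the huge page size */

	/* One large segment backed by huge pages, created at "database startup". */
	int shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | SHM_HUGETLB | 0600);
	if (shmid < 0) {
		perror("shmget(SHM_HUGETLB)");
		return 1;
	}

	/* Each new database process attaches with shmat(); its first touch of
	 * every huge page goes through hugetlb_fault(). */
	void *addr = shmat(shmid, NULL, 0);
	if (addr == (void *)-1) {
		perror("shmat");
		return 1;
	}

	memset(addr, 0, size);		/* fault the pages in */

	shmdt(addr);			/* detach: one of the write-lock paths in the report */
	shmctl(shmid, IPC_RMID, NULL);
	return 0;
}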
* Re: scalability regressions related to hugetlb_fault() changes
From: Mike Kravetz
Date: 2022-03-25 4:40 UTC
To: Ray Fucillo
Cc: linux-kernel, linux-mm

On 3/24/22 17:02, Ray Fucillo wrote:
>> On Mar 24, 2022, at 6:41 PM, Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> I also seem to remember thinking about the possibility of avoiding the
>> synchronization if pmd sharing is not possible. That may be a relatively
>> easy way to speed things up. Not sure if pmd sharing comes into play in
>> your customer environments; my guess would be yes (shared mapping ranges
>> more than 1GB in size and aligned to 1GB).
>
> Hi Mike,
>
> This is one very large shared memory segment allocated at database
> startup. It's common for it to be hundreds of GB. We allocate it with
> shmget(), passing SHM_HUGETLB (when huge pages have been reserved for
> us). Not sure if that answers...

Yes, so there would be shared pmds for that large shared mapping. I assume
this is x86 or arm64, which are the only architectures that support shared
pmds. So the easy change of "don't take the semaphore if pmd sharing is not
possible" would not apply.

>> Also, do you have any specifics about the regressions your customers are
>> seeing? Specifically, which paths are holding i_mmap_rwsem in write mode
>> for long periods of time? I would expect something related to unmap.
>> Truncation can have long hold times, especially if there are many shared
>> mappings. Always worth checking specifics, but more likely this is a
>> general issue.
>
> We've seen the write lock originate from calling shmat(), shmdt(), and
> process exit. We've also seen it from a fork() of one of the processes
> that are attached to the shared memory segment. Some evidence suggests
> that fork is a more costly case. However, while there are some important
> places where we'd use fork(), it's the more unusual case because most
> process creation will vfork() and execv() a new database process (which
> then attaches with shmat()).

Thanks. I will continue to look at this. A quick check of the fork code
shows the semaphore held in read mode for the duration of the page table
copy.

--
Mike Kravetz
* Re: scalability regressions related to hugetlb_fault() changes
From: Ray Fucillo
Date: 2022-03-25 13:33 UTC
To: Mike Kravetz
Cc: Ray Fucillo, linux-kernel, linux-mm

> On Mar 25, 2022, at 12:40 AM, Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> I will continue to look at this. A quick check of the fork code shows the
> semaphore held in read mode for the duration of the page table copy.

Thank you for looking into it.

As a side note about fork(), for context and not to distract from the
regression at hand: there's some history here. We ran into problems circa
2005 where fork time grew linearly with the size of shared memory, and that
was resolved by letting the pages fault in the child. This was when hugetlb
was pretty new (and not used by us), and I see now that the fix explicitly
excluded hugetlb. Anyway, we now mostly use vfork() and only fork() in some
special cases, so improving just fork wouldn't fix the scalability
regression for us. But it does sound like fork() time might be getting
large again now that everyone is using very large shared segments with
hugetlb but generally hasn't switched to 1GB pages.

That old thread is: https://lkml.org/lkml/2005/8/24/190
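The process-creation pattern Ray describes (vfork() plus exec, so no page
tables are copied, with the new process attaching to the segment itself via
shmat()) looks roughly like the sketch below. The helper name and the
"/usr/bin/true" stand-in for the real database worker binary are made up
for illustration.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t spawn_db_process(const char *binary)
{
	pid_t pid = vfork();

	if (pid == 0) {
		/* Child: only exec or _exit is safe after vfork(); the exec'd
		 * process would then call shmget()/shmat() on its own. */
		execl(binary, binary, (char *)NULL);
		_exit(127);
	}
	return pid;	/* parent resumes once the child has exec'd or exited */
}

int main(void)
{
	pid_t pid = spawn_db_process("/usr/bin/true");

	if (pid < 0) {
		perror("vfork");
		return 1;
	}
	waitpid(pid, NULL, 0);
	return 0;
}

Because vfork() does not duplicate the parent's page tables, this path
avoids the long read-mode hold during the page table copy that Mike notes
for fork().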
* Re: scalability regressions related to hugetlb_fault() changes
From: Mike Kravetz
Date: 2022-03-28 18:30 UTC
To: Ray Fucillo
Cc: linux-kernel, linux-mm, Michal Hocko, Naoya Horiguchi, 'Aneesh Kumar', Kirill A. Shutemov

On 3/25/22 06:33, Ray Fucillo wrote:
>> On Mar 25, 2022, at 12:40 AM, Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> I will continue to look at this. A quick check of the fork code shows
>> the semaphore held in read mode for the duration of the page table copy.
>
> Thank you for looking into it.

Adding some mm people on cc:

Just a quick update on some thoughts and a possible approach.

Note that regressions were noted when code was originally added to take
i_mmap_rwsem at fault time. A limited way of addressing the issue was
proposed here:
https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@oracle.com/
I do not think such a change would help in this case, as the hugetlb pages
are used via a shared memory segment. Hence, sharing and pmd sharing are
happening.

After some thought, I believe the synchronization needed for pmd sharing as
outlined in commit c0d0381ade79 is limited to a single address
space/mm_struct. We only need to worry about one thread of a process
causing an unshare while another thread in the same process is faulting.
That is because the unshare only tears down the page tables in the calling
process. Also, the page table modifications associated with pmd sharing are
constrained by the virtual address range of a vma describing the sharable
area. Therefore, pmd sharing synchronization can be done at the vma level.

My 'plan' is to hang a rw_sema off the vm_private_data of hugetlb vmas that
can possibly have shared pmds. We will use this new semaphore instead of
i_mmap_rwsem at fault and pmd_unshare time. The only time we should see
contention on this semaphore is if one thread of a process is doing
something to cause unsharing for an address range while another thread is
faulting in the same range. This seems unlikely, and much, much less common
than one process unmapping pages while another process wants to fault them
in on a large shared area.

There will also be a little code shuffling, as the fault code is also
synchronized with truncation and hole punch via i_mmap_rwsem. But this is
much easier to address.

Comments or other suggestions welcome.

--
Mike Kravetz
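To make the proposed direction a bit more concrete, here is a rough
kernel-style sketch of the idea: a per-vma rw_semaphore hung off
vm_private_data, taken for read at fault time and for write when unsharing.
This is not a patch and is not buildable on its own; the structure and
helper names are invented for illustration, and reconciling it with the
existing uses of vm_private_data in hugetlb (reservation tracking) is
exactly the kind of "code shuffling" mentioned above.

/* Illustrative only; types come from <linux/mm_types.h> and <linux/rwsem.h>. */
struct hugetlb_vma_lock_sketch {
	struct rw_semaphore rw_sema;
};

/* Fault path: taken instead of i_mmap_lock_read(mapping). */
static void hugetlb_vma_lock_read_sketch(struct vm_area_struct *vma)
{
	struct hugetlb_vma_lock_sketch *vl = vma->vm_private_data;

	down_read(&vl->rw_sema);
}

/* pmd unshare path: taken instead of i_mmap_lock_write(mapping). */
static void hugetlb_vma_lock_write_sketch(struct vm_area_struct *vma)
{
	struct hugetlb_vma_lock_sketch *vl = vma->vm_private_data;

	down_write(&vl->rw_sema);
}

Because each process has its own vma (and thus its own semaphore) for the
shared segment, a teardown in one process no longer serializes against
faults in every other process; only the unlikely intra-process case of one
thread unsharing while another faults in the same range remains
synchronized, matching the contention described above.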