* Re: scalability regressions related to hugetlb_fault() changes
From: Randy Dunlap
Date: 2022-03-24 21:55 UTC
To: Ray Fucillo, linux-kernel, linux-mm

[add linux-mm mailing list]

On 3/24/22 13:12, Ray Fucillo wrote:
> In moving to newer versions of the kernel, our customers have experienced
> dramatic new scalability problems in our database application, InterSystems
> IRIS. Our research has narrowed this down to new processes that attach to
> the database's shared memory segment incurring very long delays (in some
> cases ~100ms!) acquiring i_mmap_lock_read() in hugetlb_fault() as they
> fault in huge pages for the first time. The addition of this lock in
> hugetlb_fault() matches the kernel versions where we see this problem. It
> is not only the new process that suffers the delay; other processes back
> up behind it when the page fault occurs inside a critical section within
> the database application.
>
> Is there something that can be improved here?
>
> The read locks in hugetlb_fault() contend with write locks that are taken
> in very common application code paths: shmat(), process exit, fork() (not
> vfork()), shmdt(), and presumably others. So contention on the read lock
> in hugetlb_fault() turns out to be common. When the system is loaded,
> there will be many new processes faulting in pages that may block the
> write lock, which in turn blocks more readers faulting behind it, and so
> on... I don't think there's any support for shared page tables in hugetlb
> that would avoid the faults altogether.
>
> Switching to 1GB huge pages instead of 2MB is a good mitigation in
> reducing the frequency of faults, but not a complete solution.
>
> Thanks for considering.
>
> Ray

--
~Randy
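The convoy Ray describes can be modeled entirely in userspace. The sketch
below is not kernel code: a writer-preferring pthread rwlock stands in for
the inode's i_mmap_rwsem, the reader threads stand in for processes inside
hugetlb_fault(), and the single writer stands in for the shmdt()/exit/
truncate paths; all thread counts, iteration counts, and sleep times are
made up for illustration. It only shows how one long write-side hold stalls
many unrelated read-side "faults".

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_rwlock_t rwsem;
static pthread_mutex_t stat_lock = PTHREAD_MUTEX_INITIALIZER;
static double max_wait_ms;

static double now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

static void *faulting_thread(void *arg)
{
	for (int i = 0; i < 500; i++) {
		double start = now_ms();

		pthread_rwlock_rdlock(&rwsem);	/* models i_mmap_lock_read() in hugetlb_fault() */
		double waited = now_ms() - start;

		usleep(100);			/* fault handling under the read lock */
		pthread_rwlock_unlock(&rwsem);

		pthread_mutex_lock(&stat_lock);
		if (waited > max_wait_ms)
			max_wait_ms = waited;
		pthread_mutex_unlock(&stat_lock);
	}
	return arg;
}

static void *teardown_thread(void *arg)
{
	for (int i = 0; i < 20; i++) {
		pthread_rwlock_wrlock(&rwsem);	/* models shmdt()/exit/truncate paths */
		usleep(20000);			/* long teardown while holding it */
		pthread_rwlock_unlock(&rwsem);
		usleep(5000);
	}
	return arg;
}

int main(void)
{
	pthread_t readers[8], writer;
	pthread_rwlockattr_t attr;

	/* Prefer writers so readers queue behind a waiting writer, roughly
	 * like the kernel's rw_semaphore; glibc's default prefers readers. */
	pthread_rwlockattr_init(&attr);
	pthread_rwlockattr_setkind_np(&attr,
			PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
	pthread_rwlock_init(&rwsem, &attr);

	for (int i = 0; i < 8; i++)
		pthread_create(&readers[i], NULL, faulting_thread, NULL);
	pthread_create(&writer, NULL, teardown_thread, NULL);

	for (int i = 0; i < 8; i++)
		pthread_join(readers[i], NULL);
	pthread_join(writer, NULL);

	printf("worst read-side wait: %.1f ms\n", max_wait_ms);
	return 0;
}

Build with "cc -O2 -pthread"; the worst read-side wait tracks the length of
the write-side hold, which is the shape of the regression reported above.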
* Re: scalability regressions related to hugetlb_fault() changes
From: Mike Kravetz
Date: 2022-03-24 22:41 UTC
To: Ray Fucillo, linux-kernel, linux-mm

On 3/24/22 14:55, Randy Dunlap wrote:
> [add linux-mm mailing list]
>
> On 3/24/22 13:12, Ray Fucillo wrote:
>> In moving to newer versions of the kernel, our customers have experienced
>> dramatic new scalability problems in our database application,
>> InterSystems IRIS. Our research has narrowed this down to new processes
>> that attach to the database's shared memory segment incurring very long
>> delays (in some cases ~100ms!) acquiring i_mmap_lock_read() in
>> hugetlb_fault() as they fault in huge pages for the first time. The
>> addition of this lock in hugetlb_fault() matches the kernel versions
>> where we see this problem. It is not only the new process that suffers
>> the delay; other processes back up behind it when the page fault occurs
>> inside a critical section within the database application.
>>
>> Is there something that can be improved here?
>>
>> The read locks in hugetlb_fault() contend with write locks that are
>> taken in very common application code paths: shmat(), process exit,
>> fork() (not vfork()), shmdt(), and presumably others. So contention on
>> the read lock in hugetlb_fault() turns out to be common. When the system
>> is loaded, there will be many new processes faulting in pages that may
>> block the write lock, which in turn blocks more readers faulting behind
>> it, and so on... I don't think there's any support for shared page
>> tables in hugetlb that would avoid the faults altogether.
>>
>> Switching to 1GB huge pages instead of 2MB is a good mitigation in
>> reducing the frequency of faults, but not a complete solution.
>>
>> Thanks for considering.
>>
>> Ray

Hi Ray,

Acquiring i_mmap_rwsem in hugetlb_fault was added in the v5.7 kernel with
commit c0d0381ade79 "hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization". Ironically, this was added because of correctness
(possible data corruption) issues with huge pmd sharing (shared page tables
for hugetlb at the pmd level). It is used to synchronize the fault path,
which sets up the sharing, with the unmap (or other) paths that tear down
the sharing.

As mentioned in the commit message, it is 'possible' to approach this issue
in different ways, such as catching the races, then cleaning up, backing
out and retrying. Adding the synchronization seemed to be the most direct
and least error prone approach.

I also seem to remember thinking about the possibility of avoiding the
synchronization if pmd sharing is not possible. That may be a relatively
easy way to speed things up. Not sure if pmd sharing comes into play in
your customer environments; my guess would be yes (shared mapping ranges
more than 1GB in size and aligned to 1GB).

It has been a couple of years since c0d0381ade79; I will take some time to
look into alternatives and/or improvements.

Also, do you have any specifics about the regressions your customers are
seeing? Specifically, which paths are holding i_mmap_rwsem in write mode
for long periods of time? I would expect something related to unmap.
Truncation can have long hold times, especially if there are many shared
mappings. Always worth checking specifics, but more likely this is a
general issue.

--
Mike Kravetz
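For readers unfamiliar with the eligibility rule Mike is referring to: pmd
sharing is only considered when a shared hugetlb mapping fully covers at
least one PUD-sized, PUD-aligned region (1GB on x86_64 with 2MB huge
pages). The userspace sketch below shows just that arithmetic; the 1GB
constant, helper name, and example addresses are assumptions for
illustration, not the kernel's actual code.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PUD_SIZE_SKETCH (1ULL << 30)		/* 1GB, assumed */
#define PUD_MASK_SKETCH (~(PUD_SIZE_SKETCH - 1))

/* Does [start, end) contain at least one full, 1GB-aligned 1GB chunk? */
static bool range_may_share_pmds(uint64_t start, uint64_t end)
{
	uint64_t base = (start + PUD_SIZE_SKETCH - 1) & PUD_MASK_SKETCH;

	return end >= base + PUD_SIZE_SKETCH;
}

int main(void)
{
	/* A hundreds-of-GB segment easily qualifies... */
	printf("%d\n", range_may_share_pmds(0x7f0000000000ULL,
					    0x7f0000000000ULL + (200ULL << 30)));
	/* ...a 512MB mapping does not. */
	printf("%d\n", range_may_share_pmds(0x7f0000000000ULL,
					    0x7f0000000000ULL + (512ULL << 20)));
	return 0;
}

A database segment of hundreds of GB trivially satisfies this condition,
which is why the "skip the lock when sharing is impossible" shortcut does
not help in this report.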
* Re: scalability regressions related to hugetlb_fault() changes
From: Ray Fucillo
Date: 2022-03-25 0:02 UTC
To: Mike Kravetz
Cc: Ray Fucillo, linux-kernel, linux-mm

> On Mar 24, 2022, at 6:41 PM, Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> I also seem to remember thinking about the possibility of avoiding the
> synchronization if pmd sharing is not possible. That may be a relatively
> easy way to speed things up. Not sure if pmd sharing comes into play in
> your customer environments; my guess would be yes (shared mapping ranges
> more than 1GB in size and aligned to 1GB).

Hi Mike,

This is one very large shared memory segment allocated at database startup.
It's common for it to be hundreds of GB. We allocate it with shmget(),
passing SHM_HUGETLB (when huge pages have been reserved for us). Not sure
if that answers...

> Also, do you have any specifics about the regressions your customers are
> seeing? Specifically, which paths are holding i_mmap_rwsem in write mode
> for long periods of time? I would expect something related to unmap.
> Truncation can have long hold times, especially if there are many shared
> mappings. Always worth checking specifics, but more likely this is a
> general issue.

We've seen the write lock originate from calling shmat(), shmdt(), and
process exit. We've also seen it from a fork() of one of the processes that
are attached to the shared memory segment. Some evidence suggests that fork
is a more costly case. However, while there are some important places where
we'd use fork(), it's the more unusual case because most process creation
will vfork() and execv() a new database process (which then attaches with
shmat()).
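For concreteness, the allocation pattern Ray describes looks roughly like
the following userspace sketch. The 4GB size, 0600 permissions, and use of
IPC_PRIVATE are illustrative only (the real segment is hundreds of GB and
keyed so other processes can attach); it assumes huge pages have been
reserved via vm.nr_hugepages and that the caller may use SHM_HUGETLB
(CAP_IPC_LOCK or membership in vm.hugetlb_shm_group).

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000	/* fallback if the libc headers lack it */
#endif

int main(void)
{
	size_t size = 4UL << 30;	/* 4GB; must be a multiple of the huge page size */

	/* One large segment backed by huge pages, created at "database startup". */
	int shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | SHM_HUGETLB | 0600);
	if (shmid < 0) {
		perror("shmget(SHM_HUGETLB)");
		return 1;
	}

	/* Each new database process attaches with shmat(); its first touch of
	 * every huge page goes through hugetlb_fault(). */
	void *addr = shmat(shmid, NULL, 0);
	if (addr == (void *)-1) {
		perror("shmat");
		return 1;
	}

	memset(addr, 0, size);		/* fault the pages in */

	shmdt(addr);			/* detach: one of the write-lock paths in the report */
	shmctl(shmid, IPC_RMID, NULL);
	return 0;
}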
* Re: scalability regressions related to hugetlb_fault() changes
From: Mike Kravetz
Date: 2022-03-25 4:40 UTC
To: Ray Fucillo
Cc: linux-kernel, linux-mm

On 3/24/22 17:02, Ray Fucillo wrote:
>> On Mar 24, 2022, at 6:41 PM, Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> I also seem to remember thinking about the possibility of avoiding the
>> synchronization if pmd sharing is not possible. That may be a relatively
>> easy way to speed things up. Not sure if pmd sharing comes into play in
>> your customer environments; my guess would be yes (shared mapping ranges
>> more than 1GB in size and aligned to 1GB).
>
> Hi Mike,
>
> This is one very large shared memory segment allocated at database
> startup. It's common for it to be hundreds of GB. We allocate it with
> shmget(), passing SHM_HUGETLB (when huge pages have been reserved for
> us). Not sure if that answers...

Yes, so there would be shared pmds for that large shared mapping. I assume
this is x86 or arm64, which are the only architectures that support shared
pmds. So the easy change of "don't take the semaphore if pmd sharing is not
possible" would not apply.

>> Also, do you have any specifics about the regressions your customers are
>> seeing? Specifically, which paths are holding i_mmap_rwsem in write mode
>> for long periods of time? I would expect something related to unmap.
>> Truncation can have long hold times, especially if there are many shared
>> mappings. Always worth checking specifics, but more likely this is a
>> general issue.
>
> We've seen the write lock originate from calling shmat(), shmdt(), and
> process exit. We've also seen it from a fork() of one of the processes
> that are attached to the shared memory segment. Some evidence suggests
> that fork is a more costly case. However, while there are some important
> places where we'd use fork(), it's the more unusual case because most
> process creation will vfork() and execv() a new database process (which
> then attaches with shmat()).

Thanks. I will continue to look at this. A quick check of the fork code
shows the semaphore held in read mode for the duration of the page table
copy.

--
Mike Kravetz
* Re: scalability regressions related to hugetlb_fault() changes
From: Ray Fucillo
Date: 2022-03-25 13:33 UTC
To: Mike Kravetz
Cc: Ray Fucillo, linux-kernel, linux-mm

> On Mar 25, 2022, at 12:40 AM, Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> I will continue to look at this. A quick check of the fork code shows the
> semaphore held in read mode for the duration of the page table copy.

Thank you for looking into it.

As a side note about fork(), for context and not to distract from the
regression at hand: there's some history here. We ran into problems circa
2005 where fork time grew linearly with the size of shared memory, and that
was resolved by letting the pages fault in the child. This was when hugetlb
was pretty new (and not used by us), and I see now that the fix explicitly
excluded hugetlb. Anyway, we now mostly use vfork() and only fork() in some
special cases, so improving just fork wouldn't fix the scalability
regression for us. But it does sound like fork() time might be getting
large again now that everyone is using very large shared segments with
hugetlb but generally hasn't switched to 1GB pages.

That old thread is: https://lkml.org/lkml/2005/8/24/190
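The process-creation pattern Ray describes (vfork() plus exec, so no page
tables are copied, with the new process attaching to the segment itself via
shmat()) looks roughly like the sketch below. The helper name and the
"/usr/bin/true" stand-in for the real database worker binary are made up
for illustration.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t spawn_db_process(const char *binary)
{
	pid_t pid = vfork();

	if (pid == 0) {
		/* Child: only exec or _exit is safe after vfork(); the exec'd
		 * process would then call shmget()/shmat() on its own. */
		execl(binary, binary, (char *)NULL);
		_exit(127);
	}
	return pid;	/* parent resumes once the child has exec'd or exited */
}

int main(void)
{
	pid_t pid = spawn_db_process("/usr/bin/true");

	if (pid < 0) {
		perror("vfork");
		return 1;
	}
	waitpid(pid, NULL, 0);
	return 0;
}

Because vfork() does not duplicate the parent's page tables, this path
avoids the long read-mode hold during the page table copy that Mike notes
for fork().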
* Re: scalability regressions related to hugetlb_fault() changes
From: Mike Kravetz
Date: 2022-03-28 18:30 UTC
To: Ray Fucillo
Cc: linux-kernel, linux-mm, Michal Hocko, Naoya Horiguchi, 'Aneesh Kumar', Kirill A. Shutemov

On 3/25/22 06:33, Ray Fucillo wrote:
>> On Mar 25, 2022, at 12:40 AM, Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> I will continue to look at this. A quick check of the fork code shows
>> the semaphore held in read mode for the duration of the page table copy.
>
> Thank you for looking into it.

Adding some mm people on cc:

Just a quick update on some thoughts and a possible approach.

Note that regressions were noted when code was originally added to take
i_mmap_rwsem at fault time. A limited way of addressing the issue was
proposed here:
https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@oracle.com/
I do not think such a change would help in this case, as the hugetlb pages
are used via a shared memory segment. Hence, sharing and pmd sharing are
happening.

After some thought, I believe the synchronization needed for pmd sharing as
outlined in commit c0d0381ade79 is limited to a single address
space/mm_struct. We only need to worry about one thread of a process
causing an unshare while another thread in the same process is faulting.
That is because the unshare only tears down the page tables in the calling
process. Also, the page table modifications associated with pmd sharing are
constrained by the virtual address range of a vma describing the sharable
area. Therefore, pmd sharing synchronization can be done at the vma level.

My 'plan' is to hang a rw_sema off the vm_private_data of hugetlb vmas that
can possibly have shared pmds. We will use this new semaphore instead of
i_mmap_rwsem at fault and pmd_unshare time. The only time we should see
contention on this semaphore is if one thread of a process is doing
something to cause unsharing for an address range while another thread is
faulting in the same range. This seems unlikely, and much, much less common
than one process unmapping pages while another process wants to fault them
in on a large shared area.

There will also be a little code shuffling, as the fault code is also
synchronized with truncation and hole punch via i_mmap_rwsem. But this is
much easier to address.

Comments or other suggestions welcome.

--
Mike Kravetz
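To make the proposed direction a bit more concrete, here is a rough
kernel-style sketch of the idea: a per-vma rw_semaphore hung off
vm_private_data, taken for read at fault time and for write when unsharing.
This is not a patch and is not buildable on its own; the structure and
helper names are invented for illustration, and reconciling it with the
existing uses of vm_private_data in hugetlb (reservation tracking) is
exactly the kind of "code shuffling" mentioned above.

/* Illustrative only; types come from <linux/mm_types.h> and <linux/rwsem.h>. */
struct hugetlb_vma_lock_sketch {
	struct rw_semaphore rw_sema;
};

/* Fault path: taken instead of i_mmap_lock_read(mapping). */
static void hugetlb_vma_lock_read_sketch(struct vm_area_struct *vma)
{
	struct hugetlb_vma_lock_sketch *vl = vma->vm_private_data;

	down_read(&vl->rw_sema);
}

/* pmd unshare path: taken instead of i_mmap_lock_write(mapping). */
static void hugetlb_vma_lock_write_sketch(struct vm_area_struct *vma)
{
	struct hugetlb_vma_lock_sketch *vl = vma->vm_private_data;

	down_write(&vl->rw_sema);
}

Because each process has its own vma (and thus its own semaphore) for the
shared segment, a teardown in one process no longer serializes against
faults in every other process; only the unlikely intra-process case of one
thread unsharing while another faults in the same range remains
synchronized, matching the contention described above.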