linux-mm.kvack.org archive mirror
From: Tariq Toukan <tariqt@mellanox.com>
To: Aaron Lu <aaron.lu@intel.com>
Cc: Linux Kernel Network Developers <netdev@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>,
	Mel Gorman <mgorman@techsingularity.net>,
	David Miller <davem@davemloft.net>,
	Jesper Dangaard Brouer <brouer@redhat.com>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	Alexei Starovoitov <ast@fb.com>,
	Saeed Mahameed <saeedm@mellanox.com>,
	Eran Ben Elisha <eranbe@mellanox.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@suse.com>
Subject: Re: Page allocator bottleneck
Date: Mon, 23 Apr 2018 11:54:57 +0300
Message-ID: <0dea4da6-8756-22d4-c586-267217a5fa63@mellanox.com>
In-Reply-To: <127df719-b978-60b7-5d77-3c8efbf2ecff@mellanox.com>



On 22/04/2018 7:43 PM, Tariq Toukan wrote:
> 
> 
> On 21/04/2018 11:15 AM, Aaron Lu wrote:
>> Sorry to bring up an old thread...
>>
> 
> I want to thank you very much for bringing this up!
> 
>> On Thu, Nov 02, 2017 at 07:21:09PM +0200, Tariq Toukan wrote:
>>>
>>>
>>> On 18/09/2017 12:16 PM, Tariq Toukan wrote:
>>>>
>>>>
>>>> On 15/09/2017 1:23 PM, Mel Gorman wrote:
>>>>> On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote:
>>>>>> Insights: Major degradation between #1 and #2, not getting anywhere
>>>>>> close to line rate! Degradation is fixed between #2 and #3. This is
>>>>>> because the page allocator cannot sustain the higher allocation rate.
>>>>>> In #2, we also see that the addition of rings (cores) reduces BW (!!),
>>>>>> as a result of increasing congestion over shared resources.
>>>>>>
>>>>>
>>>>> Unfortunately, no surprises there.
>>>>>
>>>>>> Congestion in this case is very clear. When monitored in perf
>>>>>> top: 85.58% [kernel] [k] queued_spin_lock_slowpath
>>>>>>
>>>>>
>>>>> While it's not proven, the most likely candidate is the zone lock
>>>>> and that should be confirmed using a call-graph profile. If so, then
>>>>> the suggestion to tune the size of the per-cpu allocator would
>>>>> mitigate the problem.
>>>>>
>>>> Indeed, I tuned the per-cpu allocator and the bottleneck is relieved.
>>>>
>>>
>>> Hi all,
>>>
>>> After leaving this task for a while doing other tasks, I got back to it
>>> now and see that the good behavior I observed earlier was not stable.
>>
>> I recently posted a patchset to reduce zone->lock contention for order-0
>> pages. It eliminates almost 80% of the zone->lock contention for the
>> will-it-scale/page_fault1 testcase when tested on a 2-socket Intel
>> Skylake server, and it doesn't require PCP size tuning, so it should have
>> some effect on your workload where one CPU does allocation while
>> another does free.
>>
> 
> That is great news. In our driver's memory scheme (and many others as 
> well) we allocate only order-0 pages (the only flow that does not do 
> that yet in upstream will do so very soon, we already have the patches 
> in our internal branch).
> Allocation of order-0 pages is not only the common case, but is the only 
> type of allocation in our data-path. Let's optimize it!
> 
> 
>> It did this by some disruptive changes:
>> 1 on free path, it skipped doing merge (so could be bad for mixed
>>   workloads where both 4K and high order pages are needed);
> 
> I think there are so many advantages to not using high order 
> allocations, especially in production servers that are not rebooted for 
> long periods and become fragmented.
> AFAIK, the community direction (at least in networking) is to use order-0
> pages in the datapath, so optimizing their allocation is a very good idea.
> Of course, we need to evaluate possible performance degradations, and see
> how important these use cases are.
> 
>> 2 on allocation path, it avoided touching multiple cachelines.
>>
> 
> Great!
> 
>> RFC v2 patchset:
>> https://lkml.org/lkml/2018/3/20/171
>>
>> repo:
>> https://github.com/aaronlu/linux zone_lock_rfc_v2
>>
> 
> I will check them out first thing tomorrow!
> 
> p.s., I will be on vacation for a week starting Tuesday.
> I hope I can make some progress before that :)
> 
> Thanks,
> Tariq
> 

Hi,

I ran my tests with your patches.
Initial BW numbers are significantly higher than those I documented
earlier in this mail thread.
For example, in driver #2 (see the original mail), with 6 rings, I
now get 92Gbps (slightly below line rate) compared to 64Gbps
back then.

However, there have been many kernel changes since then, so I need to
isolate the effect of your changes. I am not sure I can finish this today,
but I will surely get to it next week after I'm back from vacation.

Still, when I increase the scale (more rings, i.e. more cpus), I see 
that queued_spin_lock_slowpath gets to 60%+ cpu. Still high, but lower 
than it used to be.

This should be solved at the root by the (orthogonal) changes planned in
the network subsystem, which will change the SKB allocation/free scheme so
that SKBs are released on the originating cpu.

Thanks,
Tariq

>>> Recall: I work with a modified driver that allocates a page (4K) per packet
>>> (MTU=1500), in order to simulate the stress on the page allocator in 200Gbps
>>> NICs.
>>>
>>> Performance is good as long as pages are available in the allocating core's
>>> PCP.
>>> The issue is that pages are allocated on one core, then freed on another,
>>> making it hard for the PCP to work efficiently, and both the allocating
>>> core and the freeing core need to access the buddy allocator very often.
>>>
>>> I'd like to share with you some testing numbers:
>>>
>>> Test: ./super_netperf 128 -H 24.134.0.51 -l 1000
>>>
>>> 100% cpu on all cores, top func in perf:
>>>     84.98%  [kernel]  [k] queued_spin_lock_slowpath
>>>
>>> system wide (all cores)
>>>     1135941  kmem:mm_page_alloc
>>>     2606629  kmem:mm_page_free
>>>           0  kmem:mm_page_alloc_extfrag
>>>     4784616  kmem:mm_page_alloc_zone_locked
>>>        1337  kmem:mm_page_free_batched
>>>     6488213  kmem:mm_page_pcpu_drain
>>>     8925503  net:napi_gro_receive_entry
>>>
>>> Two types of cores:
>>> A core mostly running napi (8 such cores):
>>>      221875  kmem:mm_page_alloc
>>>       17100  kmem:mm_page_free
>>>           0  kmem:mm_page_alloc_extfrag
>>>      766584  kmem:mm_page_alloc_zone_locked
>>>          16  kmem:mm_page_free_batched
>>>          35  kmem:mm_page_pcpu_drain
>>>     1340139  net:napi_gro_receive_entry
>>>
>>> Other core, mostly running user application (40 such):
>>>           2  kmem:mm_page_alloc
>>>       38922  kmem:mm_page_free
>>>           0  kmem:mm_page_alloc_extfrag
>>>           1  kmem:mm_page_alloc_zone_locked
>>>           8  kmem:mm_page_free_batched
>>>      107289  kmem:mm_page_pcpu_drain
>>>          34  net:napi_gro_receive_entry
>>>
>>>
>>> As you can see, sync overhead is enormous.
>>>
>>> PCP-wise, a key improvement in such scenarios would be reached if we could
>>> (1) keep and handle the allocated page on the same cpu, or (2) somehow get
>>> the page back to the allocating core's PCP in a fast-path, without going
>>> through the regular buddy allocator paths.
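
To illustrate the alloc-on-one-core, free-on-another pattern described in the
quoted mail, here is a toy model in Python. It is only a sketch under stated
assumptions, not kernel code: the names Zone, PerCpuList and simulate are
hypothetical, the batch/high values (31 and 186) are illustrative, and each
access to the shared pool stands in for one zone->lock acquisition.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """Shared free-page pool; every access models one zone->lock acquisition."""
    free_pages: int
    lock_acquisitions: int = 0

    def take(self, n: int) -> int:
        self.lock_acquisitions += 1
        n = min(n, self.free_pages)
        self.free_pages -= n
        return n

    def give(self, n: int) -> None:
        self.lock_acquisitions += 1
        self.free_pages += n


@dataclass
class PerCpuList:
    """Toy per-CPU page cache (PCP). batch/high are assumed values."""
    zone: Zone
    batch: int = 31
    high: int = 186
    count: int = 0

    def alloc(self) -> None:
        if self.count == 0:
            # Local cache is empty: refill one batch from the shared pool.
            self.count += self.zone.take(self.batch)
        if self.count == 0:
            raise MemoryError("shared pool exhausted")
        self.count -= 1

    def free(self) -> None:
        self.count += 1
        if self.count >= self.high:
            # Local cache is too full: drain one batch to the shared pool.
            self.zone.give(self.batch)
            self.count -= self.batch


def simulate(pages: int, cross_cpu: bool) -> int:
    """Allocate `pages` on an RX core; free each on the same or another core."""
    zone = Zone(free_pages=100_000)
    rx_core = PerCpuList(zone)
    app_core = PerCpuList(zone)
    for _ in range(pages):
        rx_core.alloc()
        (app_core if cross_cpu else rx_core).free()
    return zone.lock_acquisitions


if __name__ == "__main__":
    print("same cpu :", simulate(31_000, cross_cpu=False))
    print("cross cpu:", simulate(31_000, cross_cpu=True))
```

In this model, freeing on the allocating core keeps the page in its local
cache, so the shared pool is touched only for the initial refill. Freeing on a
different core forces a refill on one side and a drain on the other for every
batch, which is the behavior options (1) and (2) above try to avoid.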

