From: Jarno Rajahalme <jrajahalme@nicira.com>
To: Jesse Gross <jesse@nicira.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Konstantin Khlebnikov <khlebnikov@yandex-team.ru>,
	"dev@openvswitch.org" <dev@openvswitch.org>,
	Pravin Shelar <pshelar@nicira.com>,
	"David S. Miller" <davem@davemloft.net>,
	netdev <netdev@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-mm@kvack.org
Subject: Re: [ovs-dev] [PATCH] ovs: do not allocate memory from offline numa node
Date: Fri, 9 Oct 2015 08:54:35 -0700
Message-ID: <ECF39603-F56D-483A-A398-480C28C93F97@nicira.com>
In-Reply-To: <CAEP_g=9bqj_CKMTvd4dHTS+J82u7idtqa_PFA9=-CmO2ZcUMow@mail.gmail.com>



> On Oct 8, 2015, at 4:03 PM, Jesse Gross <jesse@nicira.com> wrote:
> 
> On Wed, Oct 7, 2015 at 10:47 AM, Jarno Rajahalme <jrajahalme@nicira.com> wrote:
>> 
>>> On Oct 6, 2015, at 6:01 PM, Jesse Gross <jesse@nicira.com> wrote:
>>> 
>>> On Mon, Oct 5, 2015 at 1:25 PM, Alexander Duyck
>>> <alexander.duyck@gmail.com> wrote:
>>>> On 10/05/2015 06:59 AM, Vlastimil Babka wrote:
>>>>> 
>>>>> On 10/02/2015 12:18 PM, Konstantin Khlebnikov wrote:
>>>>>> 
>>>>>> When openvswitch tries to allocate memory from offline numa node 0:
>>>>>> stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | __GFP_ZERO,
>>>>>> 0)
>>>>>> It catches VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid))
>>>>>> [ replaced with VM_WARN_ON(!node_online(nid)) recently ] in linux/gfp.h
>>>>>> This patch disables numa affinity in this case.
>>>>>> 
>>>>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>>>>> 
>>>>> 
>>>>> ...
>>>>> 
>>>>>> diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
>>>>>> index f2ea83ba4763..c7f74aab34b9 100644
>>>>>> --- a/net/openvswitch/flow_table.c
>>>>>> +++ b/net/openvswitch/flow_table.c
>>>>>> @@ -93,7 +93,8 @@ struct sw_flow *ovs_flow_alloc(void)
>>>>>> 
>>>>>>     /* Initialize the default stat node. */
>>>>>>     stats = kmem_cache_alloc_node(flow_stats_cache,
>>>>>> -                      GFP_KERNEL | __GFP_ZERO, 0);
>>>>>> +                      GFP_KERNEL | __GFP_ZERO,
>>>>>> +                      node_online(0) ? 0 : NUMA_NO_NODE);
>>>>> 
>>>>> 
>>>>> Stupid question: can node 0 become offline between this check, and the
>>>>> VM_WARN_ON? :) BTW what kind of system has node 0 offline?
>>>> 
>>>> 
>>>> Another question to ask would be is it possible for node 0 to be online, but
>>>> be a memoryless node?
>>>> 
>>>> I would say you are better off just making this call kmem_cache_alloc.  I
>>>> don't see anything that indicates the memory has to come from node 0, so
>>>> adding the extra overhead doesn't provide any value.
>>> 
>>> I agree that this at least makes me wonder, though I actually have
>>> concerns in the opposite direction - I see assumptions about this
>>> being on node 0 in net/openvswitch/flow.c.
>>> 
>>> Jarno, since you originally wrote this code, can you take a look to see
>>> if everything still makes sense?
>> 
>> We keep the pre-allocated stats node at array index 0, which is initially used by all CPUs. If CPUs from multiple NUMA nodes start updating the stats, we allocate additional stats nodes (up to one per NUMA node), and the CPUs on node 0 keep using the preallocated entry. If stats cannot be allocated from a CPU's local node, the CPUs on that node keep using the entry at index 0. Currently the code in net/openvswitch/flow.c will retry the local allocation repeatedly, which may not be optimal when the local node has no memory.
>> 
>> Allocating the memory for index 0 from a node other than node 0, as discussed here, just means that the CPUs on node 0 will keep using non-local memory for stats. In a scenario where there are CPUs on two nodes (0, 1), but only node 1 has memory, a shared flow entry will still end up having separate stats memory allocated for both nodes, but both allocations would reside on node 1. However, there is still a high likelihood that the two allocations would not share a cache line, which should prevent the CPUs on the two nodes from invalidating each other’s caches. Based on this I do not see a problem with relaxing the memory allocation for the default stats node. If node 0 has memory, however, it would be better to allocate the memory from node 0.
> 
> Thanks for going through all of that.
> 
> It seems like the question that is being raised is whether it actually
> makes sense to try to get the initial memory on node 0, especially
> since it seems to introduce some corner cases? Is there any reason why
> the flow is more likely to hit node 0 than a randomly chosen one?
> (Assuming that this is a multinode system, otherwise it's kind of a
> moot point.) We could have a separate pointer to the default allocated
> memory, so it wouldn't conflict with memory that was intentionally
> allocated for node 0.

It would still be preferable to know which NUMA node the default stats node was allocated from, and to store it at the corresponding index in the array. We could then add a new “default stats node index” that would be used to locate the default entry in the array of pointers we already have. That way we would avoid an extra allocation and separate processing of the default stats node.
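
As a rough userspace sketch of the "default stats node index" idea (not the actual net/openvswitch code): the names struct flow, default_stats_node, flow_stats_for, MAX_NODES and the alloc_on_node hook are all hypothetical, calloc() stands in for the kernel slab allocator, and locking is left out. For brevity it allocates a node-local entry on first use, whereas the existing code only does so once CPUs from several nodes update the same flow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_NODES 4 /* hypothetical stand-in for the number of NUMA nodes */

struct flow_stats {
	unsigned long packets;
	unsigned long bytes;
};

struct flow {
	int default_stats_node;              /* slot holding the preallocated entry */
	struct flow_stats *stats[MAX_NODES]; /* one slot per node, NULL if unused */
};

/* Hypothetical allocator hook: returns NULL when `node` has no memory. */
typedef struct flow_stats *(*alloc_on_node_fn)(int node);

/* Example hooks: a node with memory, and a memoryless node. */
static struct flow_stats *alloc_ok(int node)
{
	(void)node;
	return calloc(1, sizeof(struct flow_stats));
}

static struct flow_stats *alloc_fail(int node)
{
	(void)node;
	return NULL;
}

/* Create a flow whose default stats entry is stored in the slot of the
 * node that actually supplied the memory (`actual_node`), not in slot 0. */
static struct flow *flow_create(int actual_node)
{
	struct flow *f;

	assert(actual_node >= 0 && actual_node < MAX_NODES);
	f = calloc(1, sizeof(*f));
	if (!f)
		return NULL;
	f->stats[actual_node] = calloc(1, sizeof(struct flow_stats));
	if (!f->stats[actual_node]) {
		free(f);
		return NULL;
	}
	f->default_stats_node = actual_node;
	return f;
}

/* Pick the stats entry for a CPU on `node`: use the node-local entry if it
 * exists, otherwise try to allocate one, otherwise fall back to the default. */
static struct flow_stats *flow_stats_for(struct flow *f, int node,
					 alloc_on_node_fn alloc_on_node)
{
	assert(node >= 0 && node < MAX_NODES);
	if (!f->stats[node] && alloc_on_node)
		f->stats[node] = alloc_on_node(node);
	return f->stats[node] ? f->stats[node]
			      : f->stats[f->default_stats_node];
}

static void flow_destroy(struct flow *f)
{
	for (int i = 0; i < MAX_NODES; i++)
		free(f->stats[i]); /* free(NULL) is a no-op */
	free(f);
}
```

With the default entry recorded at its real node index, CPUs on that node get local memory without a second allocation, and CPUs on a memoryless node simply fall through to the default entry.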

  Jarno


