From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 48B786B005A for ; Thu, 25 Jun 2009 15:21:42 -0400 (EDT) Received: from wpaz21.hot.corp.google.com (wpaz21.hot.corp.google.com [172.24.198.85]) by smtp-out.google.com with ESMTP id n5PJMGVU002445 for ; Thu, 25 Jun 2009 20:22:17 +0100 Received: from pxi9 (pxi9.prod.google.com [10.243.27.9]) by wpaz21.hot.corp.google.com with ESMTP id n5PJMDsw030866 for ; Thu, 25 Jun 2009 12:22:14 -0700 Received: by pxi9 with SMTP id 9so311631pxi.1 for ; Thu, 25 Jun 2009 12:22:13 -0700 (PDT) Date: Thu, 25 Jun 2009 12:22:12 -0700 (PDT) From: David Rientjes Subject: Re: [PATCH 0/5] Huge Pages Nodes Allowed In-Reply-To: <1245896060.6439.159.camel@lts-notebook> Message-ID: References: <20090616135228.25248.22018.sendpatchset@lts-notebook> <20090617130216.GF28529@csn.ul.ie> <1245258954.6235.58.camel@lts-notebook> <1245842724.6439.19.camel@lts-notebook> <1245896060.6439.159.camel@lts-notebook> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: Mel Gorman , linux-mm@kvack.org, Andrew Morton , Nishanth Aravamudan , Adam Litke , Andy Whitcroft , eric.whitney@hp.com, Ranjit Manomohan List-ID: On Wed, 24 Jun 2009, Lee Schermerhorn wrote: > Would having cpusets constrain huge page pool allocation meet your > needs? > It would, but it seems like an unnecessary inconvenience. It would require the admin task to join a cpuset to allocate hugepages for an application while allocating them. It also is more difficult to expand a cpuset to include a new node that has a specific threshold of hugepages available if writing to /sys/devices/system/node/node*/hugepages-kB/nr_hugepages is used to preallocate hugepages to determine which node has the least fragmentation to support such an allocation in the first place (and then freeing them if they cannot be allocated). It also doesn't support users who don't have CONFIG_CPUSETS, but do mbind their memory to a subset of nodes that need hugepages while others do not. > > This could become pretty cryptic: > > > > hugepagesz=2M hugepages=(0:10,1:20) hugepagesz=1G \ > > hugepages=(2:10,3:10) > > > > and I assume we'd use `count' of 99999 for nodes of unknown sizes where we > > simply want to allocate as many hugepages as possible. > > If one needed that capability--"allocate as many as possible"--then, > yes, I guess any ridiculously large count would do the trick. > I remember on older kernels that large hugepages= values would cause the system not to boot because subsequent kernel allocations would fail because it's oom. We need to do hugepages= allocation as early as possible to avoid additional fragmentation later, but not enough to completely oom the kernel. The only way to prevent that is with a maximum hugepage watermark (which would no longer be global but per zone with the hugepages=(node:count,...) support to allow at least a certain threshold of memory to be free to the kernel for boot. > > We'd still need to support hugepages=N for large NUMA machines so we don't > > have to specify the same number of hugepages per node for a true > > interleave, which would require an extremely large command line. And then > > the behavior of > > > > hugepagesz=1G hugepages=(0:10,1:20) hugepages=30 > > > > needs to be defined. In that case, does hugepages=30 override the > > previous settings if this system only has dual nodes? If so, for SGI's 1K > > node systems it's going to be difficult to specify many nodes with 10 > > hugepages and a few with 20. So perhaps hugepages=(node:count,...) should > > increment or decrement the hugepages= value, if specified? > > Mel mentioned that we probably don't need boot command line hugepage > allocation all that much with lumpy reclaim, etc. I can see his point. Understood, especially with the complexity of specifying them on the command line in the first place :) The only concern I have is for users who want hugepages preallocated on a specific set of nodes at boot and are required to use much higher hugepages= values to allocate on all system nodes and then just free the pages on the nodes they aren't interested in. > If we can't allocate all the hugepages we need from an early init script > or similar, we probably don't have enough memory anyway. For > compatibility, I supposed we need to retain the hugepages= parameter. > And, we've added the hugepagesz parameter, so we need to retain that. > But, maybe we should initially limit per node allocations to sysfs node > attributes post boot? > Agreed, it solves the early boot oom failures as well. > ------------- > Related question: do you think we need per node overcommit limits? I do, because applications constrained to an exclusive cpuset will only be able to allocate from its set of allowable nodes anyway, so the global overcommit limits aren't in effect. There needs to be a mechanism to allow such allocations to take place for such constrained tasks. The only reason I've proposed these hugepage tunables to be attributes of the system's nodes and not attributes of the individual cpusets is because non-exclusive cpusets may share nodes among siblings and parents will always share nodes with children, so the tunables could become inconsistent with one another. For all other purposes, they really are a characteristic of the cpuset's job, however. > I'm > having difficulty understanding what the semantics of the global limit > would be with per node limits--i.e., how would one distribute the global > limit across nodes [for backwards compatibility]. With nr_hugepages, > today we just do a best effort to distribute the requested number of > pages over the on-line nodes. If we fail to allocate that many, we > don't remember the initial request, just how many we actually allocated > where ever they landed. But, I don't see how that works with limits. I > suppose we could arrange that if you don't specify a per node limit, the > global limit applies when attempting to allocate a surplus page on a > given node. If you do [so specify], then the respective node limit > applies, whether or not the sum of per node surplus pages exceeds the > global limit. > Hmm, yes, that's a concern. I think the easiest way to do it would be to respect both the global and node surplus limits when the node limit is 0, and respect the node surplus limit when it is positive. The setting of either the global limit or the node limit does not change the other. This deals with the cpuset case where applications constrained to a cpuset may only allocate from their own nodes. This would allow the global limit to exceed the sum of the node limits and for hugepage allocation to fail, but that would have required a node limit to be set on each online node. In such a situation, the global limit has, in effect, been obsoleted and its value no longer matters. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org