linux-mm.kvack.org archive mirror
* Re: kswapd @ 60-80% CPU during heavy HD i/o.
@ 2000-05-02 17:26 frankeh
  2000-05-02 17:45 ` Rik van Riel
  2000-05-02 17:53 ` Andrea Arcangeli
  0 siblings, 2 replies; 31+ messages in thread
From: frankeh @ 2000-05-02 17:26 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: riel, Roger Larsson, linux-kernel, linux-mm

It makes sense to me to make the number of pools configurable and not
tie them directly to the number of nodes in a NUMA system.
In particular, allow memory pools (i.e. instances of pg_data_t) to be
smaller than a node.

The smart thing that I see has to happen is to allow a set of processes
to be attached to a set of memory pools, with the OS enforcing
allocation within those constraints. I brought this up before, and I
think Andrea proposed something similar. Allocation should take place
in those pools along the allocation levels given by the GFP mask: first
allocate at HIGH across all specified pools and, if unsuccessful, fall
back to the previous level.
With each pool we should associate a kswapd.
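
Roughly, the allocation walk I have in mind looks like the sketch
below. None of these names exist in the kernel today (mem_binding,
alloc_from_pool and highest_zone_level are placeholders); it is only
meant to illustrate the idea:

/* Illustration only -- nothing here exists in 2.3.99.  A task carries
 * a mask of pools it may use; the allocator tries the highest allowed
 * zone level on every permitted pool before falling back a level. */
extern struct page *alloc_from_pool(int node, int zone_level, int order);
extern int highest_zone_level(int gfp_mask);

struct mem_binding {
     unsigned long pool_mask;   /* bit n set => may allocate from pool n */
};

static struct page *alloc_page_in_binding(struct mem_binding *b,
                                          int gfp_mask, int order)
{
     int level, node;
     struct page *page;

     for (level = highest_zone_level(gfp_mask); level >= 0; level--)
          for (node = 0; node < 8 * sizeof(b->pool_mask); node++)
               if (b->pool_mask & (1UL << node)) {
                    page = alloc_from_pool(node, level, order);
                    if (page)
                         return page;
               }
     return NULL;   /* caller kicks the pools' kswapds and balances */
}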

Making the size of the pools configurable allows us to control the
velocity at which we can swap out. Standard queueing theory: if we
can't get the desired throughput, increase the number of servers, here
kswapd.

Comments...

-- Hubertus




Andrea Arcangeli <andrea@suse.de>@kvack.org on 05/02/2000 12:20:41 PM

Sent by:  owner-linux-mm@kvack.org


To:   riel@nl.linux.org
cc:   Roger Larsson <roger.larsson@norran.net>,
      linux-kernel@vger.rutgers.edu, linux-mm@kvack.org
Subject:  Re: kswapd @ 60-80% CPU during heavy HD i/o.



On Tue, 2 May 2000, Rik van Riel wrote:

>That's a very bad idea.

However, the lru_cache definitely has to be per-node and not global as
it is now in 2.3.99-pre6 and pre7-1, or you won't be able to do the
smart things I was mentioning some days ago on linux-mm with NUMA.

My current tree looks like this:

#define LRU_SWAP_CACHE   0
#define LRU_NORMAL_CACHE 1
#define NR_LRU_CACHE     2
typedef struct lru_cache_s {
     struct list_head heads[NR_LRU_CACHE];
     unsigned long nr_cache_pages; /* pages in the lrus */
     unsigned long nr_map_pages; /* pages temporarily out of the lru */
     /* keep the lock in a separate cacheline to avoid ping-pong in SMP */
     spinlock_t lock ____cacheline_aligned_in_smp;
} lru_cache_t;

struct bootmem_data;
typedef struct pglist_data {
     int nr_zones;
     zone_t node_zones[MAX_NR_ZONES];
     gfpmask_zone_t node_gfpmask_zone[NR_GFPINDEX];
     lru_cache_t lru_cache;
     struct page *node_mem_map;
     unsigned long *valid_addr_bitmap;
     struct bootmem_data *bdata;
     unsigned long node_start_paddr;
     unsigned long node_start_mapnr;
     unsigned long node_size;
     int node_id;
     struct pglist_data *node_next;
     spinlock_t freelist_lock ____cacheline_aligned_in_smp;
} pg_data_t;
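
Just to show how the per-node lru would be used, a page would go onto
its node's lru roughly like below. This helper is only a sketch (it is
not in my tree) and it assumes page->lru is the list_head linking the
page into the lru:

/* sketch only: put a page on its own node's lru, taking only that
 * node's lock */
static inline void node_lru_cache_add(pg_data_t *pgdat, struct page *page,
                                      int lru)
{
     lru_cache_t *cache = &pgdat->lru_cache;

     spin_lock(&cache->lock);
     list_add(&page->lru, &cache->heads[lru]);
     cache->nr_cache_pages++;
     spin_unlock(&cache->lock);
}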

Stay tuned...

Andrea



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: kswapd @ 60-80% CPU during heavy HD i/o.
@ 2000-05-02 18:46 frankeh
  0 siblings, 0 replies; 31+ messages in thread
From: frankeh @ 2000-05-02 18:46 UTC (permalink / raw)
  To: riel; +Cc: linux-mm

Hi, Rik...


Rik van Riel <riel@conectiva.com.br> on 05/02/2000 02:15:18 PM

   Please respond to riel@nl.linux.org

   To:  Hubertus Franke/Watson/IBM@IBMUS
   cc:  Andrea Arcangeli <andrea@suse.de>, Roger Larsson
        <roger.larsson@norran.net>, linux-kernel@vger.rutgers.edu,
        linux-mm@kvack.org
   Subject:     Re: kswapd @ 60-80% CPU during heavy HD i/o.



   On Tue, 2 May 2000 frankeh@us.ibm.com wrote:

   > It makes sense to me to make the number of pools configurable
   > and not tie them directly to the number of nodes in a NUMA
   > system. In particular, allow memory pools (i.e. instances of
   > pg_data_t) to be smaller than a node.

   *nod*



   We should have different memory zones per node on
   Intel-handi^Wequipped NUMA machines.

Wouldn't that be orthogonal....
Anyway, I believe x86 NUMA machines will exist in the future, so I am not
ready to trash them right now, whether I like their architecture or not.



   > The smart thing that I see has to happen is to allow a set of
   > processes to be attached to a set of memory pools, with the OS
   > enforcing allocation within those constraints. I brought this up
   > before, and I think Andrea proposed something similar. Allocation
   > should take place in those pools along the allocation levels given
   > by the GFP mask: first allocate at HIGH across all specified pools
   > and, if unsuccessful, fall back to the previous level.

   That idea is broken if you don't do balancing of VM load between
   zones.



   > With each pool we should associate a kswapd.

   How will local page replacement help you if the node next door
   has practically unloaded virtual memory? You need to do global
   page replacement of some sort...

You wouldn't balance a zone until you have checked the same level (e.g.
HIGHMEM) on all the specified nodes. Then, and only then, do you fall
back. So we aren't doing any local page replacement unless I cannot
satisfy a page request within the given resource set.
That means something along the lines of the following pseudo code:

   for each zone level (highest allowed by the GFP mask first)
        for each node in the resource set {
             zone = &pgdat[node].node_zones[zonelevel];
             if (zone->free_pages > threshold) {
                  alloc_page(zone);
                  return;
             }
             set the kswapd_required flag;   /* kick */
        }

   /* couldn't allocate a page in the desired resource set,
      so start balancing */
   balance_zones();


Now balancing zones kicks the kswapds or helps out... global balancing
can take place by servicing the pg_data_t with the highest number of
kicks (sketched below).
I think it is OK to have pools with unused memory lying around if a
particular resource set does not include those pools. How else are you
planning to control locality and affinity within memory other than by
using resource sets?
We take the same approach in the kernel; for instance, we have a
minimum file cache size, because we know that we can increase
throughput by doing so.
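
For illustration, picking the pool to service could look like this
(kswapd_kicks is a made-up per-pool counter, not a field in the posted
pg_data_t, and pgdat_list stands for the list of pools):

/* sketch: service the pool whose kswapd was kicked most often first */
static pg_data_t *most_kicked_pool(pg_data_t *pgdat_list)
{
     pg_data_t *pgdat, *worst = NULL;
     unsigned long max_kicks = 0;   /* kswapd_kicks is hypothetical */

     for (pgdat = pgdat_list; pgdat; pgdat = pgdat->node_next) {
          if (pgdat->kswapd_kicks > max_kicks) {
               max_kicks = pgdat->kswapd_kicks;
               worst = pgdat;
          }
     }
     return worst;
}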


   > Making the size of the pools configurable allows us to control
   > the velocity at which we can swap out. Standard queueing theory:
   > if we can't get the desired throughput, increase the number of
   > servers, here kswapd.

   What we _could_ do is have one (or maybe even a few) kswapds
   doing global replacement with io-less and more fine-grained
   swap_out() and shrink_mmap() functions, and per-node kswapds
   taking care of the IO and maybe even a per-node inactive list
   (though that would probably be *bad* for page replacement).

That is workable .......


   Then again, if your machine can't get the desired throughput,
   how would adding kswapds help??? Have you taken a look at
   mm/page_alloc.c::alloc_pages()? If kswapd can't keep up, the
   biggest memory consumers will lend a hand and prevent the
   rest of the system from thrashing too much.

Correct...

However, having finer-grained pools also allows you to deal with
potential lock contention, which is one of the biggest impediments to
scaling up. The characteristics of NUMA machines are large memory and a
large number of CPUs. This implies that there will be increased lock
contention, for instance on the lock that protects the memory pool.
Increased lock contention can also arise from increased lock hold time,
which I assume is somewhat related to the size of the memory. So
decreasing lock hold time by limiting the number of pages that are
managed per pool could remove an emerging bottleneck.
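
To illustrate the point (pool_of() and __rmqueue_pool() are made-up
names): with one freelist_lock per pool, an allocation only serializes
against its own pool, so the work done under the lock scales with the
pool size rather than with total memory.

extern pg_data_t *pool_of(int node);
extern struct page *__rmqueue_pool(pg_data_t *pgdat, int order);

/* sketch only: allocate from the local pool under that pool's lock */
static struct page *alloc_from_local_pool(int node, int order)
{
     pg_data_t *pgdat = pool_of(node);
     struct page *page;

     spin_lock(&pgdat->freelist_lock);
     page = __rmqueue_pool(pgdat, order);
     spin_unlock(&pgdat->freelist_lock);
     return page;
}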


   regards,

   Rik
   --
   The Internet is not a network of computers. It is a network
   of people. That is its real strength.

   Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
   http://www.conectiva.com/        http://www.surriel.com/


regards...

Hubertus


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2000-05-04 22:38 UTC | newest]

Thread overview: 31+ messages
-- links below jump to the message on this page --
     [not found] <390E1534.B33FF871@norran.net>
2000-05-01 23:23 ` kswapd @ 60-80% CPU during heavy HD i/o Rik van Riel
2000-05-01 23:33   ` David S. Miller
2000-05-02  0:07     ` Rik van Riel
2000-05-02  0:23       ` David S. Miller
2000-05-02  1:03         ` Rik van Riel
2000-05-02  1:13           ` David S. Miller
2000-05-02  1:31             ` Rik van Riel
2000-05-02  1:51             ` Andrea Arcangeli
2000-05-03 17:11         ` [PATCHlet] " Rik van Riel
2000-05-02  7:56       ` michael
2000-05-02 16:17   ` Roger Larsson
2000-05-02 15:43     ` Rik van Riel
2000-05-02 16:20       ` Andrea Arcangeli
2000-05-02 17:06         ` Rik van Riel
2000-05-02 21:14           ` Stephen C. Tweedie
2000-05-02 21:42             ` Rik van Riel
2000-05-02 22:34               ` Stephen C. Tweedie
2000-05-04 12:37               ` [PATCH][RFC] Alternate shrink_mmap Roger Larsson
2000-05-04 14:34                 ` Rik van Riel
2000-05-04 22:38                   ` [PATCH][RFC] Another shrink_mmap Roger Larsson
2000-05-04 15:25                 ` [PATCH][RFC] Alternate shrink_mmap Roger Larsson
2000-05-04 18:30                   ` Rik van Riel
2000-05-04 20:44                     ` Roger Larsson
2000-05-04 18:59                       ` Rik van Riel
2000-05-04 22:29                         ` Roger Larsson
2000-05-02 18:03       ` kswapd @ 60-80% CPU during heavy HD i/o Roger Larsson
2000-05-02 17:37         ` Rik van Riel
2000-05-02 17:26 frankeh
2000-05-02 17:45 ` Rik van Riel
2000-05-02 17:53 ` Andrea Arcangeli
2000-05-02 18:46 frankeh
