Date: Mon, 27 May 2019 16:12:23 +0200
From: Michal Hocko
To: Konstantin Khlebnikov
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Vladimir Davydov,
    Johannes Weiner, Tejun Heo, Andrew Morton, Mel Gorman, Roman Gushchin,
    linux-api@vger.kernel.org
Subject: Re: [PATCH RFC] mm/madvise: implement MADV_STOCKPILE (kswapd from user space)
Message-ID: <20190527141223.GD1658@dhcp22.suse.cz>
References: <155895155861.2824.318013775811596173.stgit@buzz>
In-Reply-To: <155895155861.2824.318013775811596173.stgit@buzz>

[Cc linux-api. Please always Cc this list when proposing a new user-visible
API. Keeping the rest of the email intact for reference.]

On Mon 27-05-19 13:05:58, Konstantin Khlebnikov wrote:
> Memory cgroup has no background memory reclaimer. Reclaiming after passing
> high-limit blocks the task because it works synchronously in task-work.
>
> This implements manual kswapd-style memory reclaim initiated by userspace.
> It reclaims both physical memory and cgroup pages. It works in the context
> of the task that calls madvise(), thus CPU time is accounted correctly.
>
> Interface:
>
> ret = madvise(ptr, size, MADV_STOCKPILE)
>
> Returns:
>   0 - ok, free memory >= size
>   -EINVAL - not supported
>   -ENOMEM - not enough memory/cgroup limit
>   -EINTR - interrupted by pending signal
>   -EAGAIN - cannot reclaim enough memory
>
> Argument 'size' is interpreted as the size of required free memory.
> Implementation triggers direct reclaim while the amount of free memory is
> lower than that size. Argument 'ptr' could point to a vma for specifying
> numa allocation policy; right now it should be NULL.
>
> Usage scenario: independent thread or standalone daemon estimates rate of
> allocations and calls MADV_STOCKPILE in loop to prepare free pages.
> Thus fast path avoids allocation latency induced by direct reclaim.
>
> We are using this embedded into memory allocator based on MADV_FREE.
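For readers on linux-api, a minimal sketch of how a caller might act on the
return codes listed above. This is an illustration under stated assumptions,
not code from the patch: stockpile_goal(), stockpile_once(), advise_fn and
fake_advise() are hypothetical names, and the mapping from errno to action is
one possible policy. The advice value 20 is the one proposed in this RFC; it
is not in any released uapi header, and mainline later assigned 20 to
MADV_COLD, so the sketch takes the advise call as a function pointer and
ships a stand-in instead of calling madvise() directly.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumption: value proposed in the RFC. Mainline uses 20 for MADV_COLD,
 * so never pass this to madvise() on a kernel without the patch. */
#define STOCKPILE_ADVICE 20

/* Hypothetical policy: what the pacing loop should do after one call. */
enum stockpile_action {
	STOCKPILE_OK,		/* 0: free memory >= size */
	STOCKPILE_RETRY,	/* EINTR or EAGAIN: try again next interval */
	STOCKPILE_SHRINK,	/* ENOMEM: goal exceeds memory or cgroup limit */
	STOCKPILE_DISABLE,	/* EINVAL or anything else: not supported */
};

/* Same shape as madvise(2), so the real call can be plugged in. */
typedef int (*advise_fn)(void *addr, size_t len, int advice);

/* Bytes expected to be allocated during one interval at the given rate. */
size_t stockpile_goal(size_t rate_bytes_per_sec, unsigned int interval_ms)
{
	return (size_t)((unsigned long long)rate_bytes_per_sec *
			interval_ms / 1000);
}

/* Issue one stockpile request and translate errno into an action. */
enum stockpile_action stockpile_once(advise_fn advise, size_t goal_bytes)
{
	assert(advise != NULL);

	/* ptr must be NULL in the proposed interface */
	if (advise(NULL, goal_bytes, STOCKPILE_ADVICE) == 0)
		return STOCKPILE_OK;

	switch (errno) {
	case EINTR:
	case EAGAIN:
		return STOCKPILE_RETRY;
	case ENOMEM:
		return STOCKPILE_SHRINK;
	default:
		return STOCKPILE_DISABLE;
	}
}

/* Stand-in for madvise() so the sketch runs without the patch applied. */
static int fake_errno;

int fake_advise(void *addr, size_t len, int advice)
{
	(void)addr;
	(void)len;
	(void)advice;

	if (fake_errno == 0)
		return 0;
	errno = fake_errno;
	return -1;
}
```

On a kernel carrying this patch, the daemon loop would presumably pass
madvise itself as the advise_fn and sleep for the chosen interval between
calls.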
>
>
> Demonstration in memory cgroup with limit 1G:
>
> touch zero
> truncate -s 5G zero
>
> Without stockpile:
>
> perf stat -e vmscan:* md5sum zero
>
>  Performance counter stats for 'md5sum zero':
>
>          0      vmscan:mm_vmscan_kswapd_sleep
>          0      vmscan:mm_vmscan_kswapd_wake
>          0      vmscan:mm_vmscan_wakeup_kswapd
>          0      vmscan:mm_vmscan_direct_reclaim_begin
>      10147      vmscan:mm_vmscan_memcg_reclaim_begin
>          0      vmscan:mm_vmscan_memcg_softlimit_reclaim_begin
>          0      vmscan:mm_vmscan_direct_reclaim_end
>      10147      vmscan:mm_vmscan_memcg_reclaim_end
>          0      vmscan:mm_vmscan_memcg_softlimit_reclaim_end
>      99910      vmscan:mm_shrink_slab_start
>      99910      vmscan:mm_shrink_slab_end
>      39654      vmscan:mm_vmscan_lru_isolate
>          0      vmscan:mm_vmscan_writepage
>      39652      vmscan:mm_vmscan_lru_shrink_inactive
>          2      vmscan:mm_vmscan_lru_shrink_active
>      19982      vmscan:mm_vmscan_inactive_list_is_low
>
>       10.886832585 seconds time elapsed
>
>        8.928366000 seconds user
>        1.935212000 seconds sys
>
> With stockpile:
>
> stockpile 100 10 & # up to 100M every 10ms
> perf stat -e vmscan:* md5sum zero
>
>  Performance counter stats for 'md5sum zero':
>
>          0      vmscan:mm_vmscan_kswapd_sleep
>          0      vmscan:mm_vmscan_kswapd_wake
>          0      vmscan:mm_vmscan_wakeup_kswapd
>          0      vmscan:mm_vmscan_direct_reclaim_begin
>          0      vmscan:mm_vmscan_memcg_reclaim_begin
>          0      vmscan:mm_vmscan_memcg_softlimit_reclaim_begin
>          0      vmscan:mm_vmscan_direct_reclaim_end
>          0      vmscan:mm_vmscan_memcg_reclaim_end
>          0      vmscan:mm_vmscan_memcg_softlimit_reclaim_end
>          0      vmscan:mm_shrink_slab_start
>          0      vmscan:mm_shrink_slab_end
>          0      vmscan:mm_vmscan_lru_isolate
>          0      vmscan:mm_vmscan_writepage
>          0      vmscan:mm_vmscan_lru_shrink_inactive
>          0      vmscan:mm_vmscan_lru_shrink_active
>          0      vmscan:mm_vmscan_inactive_list_is_low
>
>       10.469776675 seconds time elapsed
>
>        8.976261000 seconds user
>        1.491378000 seconds sys
>
> Signed-off-by: Konstantin Khlebnikov
> ---
>  include/linux/memcontrol.h             |    6 +++++
>  include/uapi/asm-generic/mman-common.h |    2 ++
>  mm/madvise.c                           |   39 ++++++++++++++++++++++++++++++
>  mm/memcontrol.c                        |   41 ++++++++++++++++++++++++++++++++
>  tools/vm/Makefile                      |    2 +-
>  tools/vm/stockpile.c                   |   30 +++++++++++++++++++++++
>  6 files changed, 119 insertions(+), 1 deletion(-)
>  create mode 100644 tools/vm/stockpile.c
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index bc74d6a4407c..25325f18ad55 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -517,6 +517,7 @@ unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
>  }
>
>  void mem_cgroup_handle_over_high(void);
> +int mem_cgroup_stockpile(unsigned long goal_pages);
>
>  unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg);
>
> @@ -968,6 +969,11 @@ static inline void mem_cgroup_handle_over_high(void)
>  {
>  }
>
> +static inline int mem_cgroup_stockpile(unsigned long goal_page)
> +{
> +	return 0;
> +}
> +
>  static inline void mem_cgroup_enter_user_fault(void)
>  {
>  }
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index abd238d0f7a4..675145864fee 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -64,6 +64,8 @@
>  #define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
>  #define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
>
> +#define MADV_STOCKPILE	20		/* stockpile free pages */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 628022e674a7..f908b08ecc9f 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -686,6 +686,41 @@ static int madvise_inject_error(int behavior,
>  }
>  #endif
>
> +static long madvise_stockpile(unsigned long start, size_t len)
> +{
> +	unsigned long goal_pages, progress;
> +	struct zonelist *zonelist;
> +	int ret;
> +
> +	if (start)
> +		return -EINVAL;
> +
> +	goal_pages = len >> PAGE_SHIFT;
> +
> +	if (goal_pages > totalram_pages() - totalreserve_pages)
> +		return -ENOMEM;
> +
> +	ret = mem_cgroup_stockpile(goal_pages);
> +	if (ret)
> +		return ret;
> +
> +	/* TODO: use vma mempolicy */
> +	zonelist = node_zonelist(numa_node_id(), GFP_HIGHUSER);
> +
> +	while (global_zone_page_state(NR_FREE_PAGES) <
> +			goal_pages + totalreserve_pages) {
> +
> +		if (signal_pending(current))
> +			return -EINTR;
> +
> +		progress = try_to_free_pages(zonelist, 0, GFP_HIGHUSER, NULL);
> +		if (!progress)
> +			return -EAGAIN;
> +	}
> +
> +	return 0;
> +}
> +
>  static long
>  madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  		unsigned long start, unsigned long end, int behavior)
> @@ -728,6 +763,7 @@ madvise_behavior_valid(int behavior)
>  	case MADV_DODUMP:
>  	case MADV_WIPEONFORK:
>  	case MADV_KEEPONFORK:
> +	case MADV_STOCKPILE:
>  #ifdef CONFIG_MEMORY_FAILURE
>  	case MADV_SOFT_OFFLINE:
>  	case MADV_HWPOISON:
> @@ -834,6 +870,9 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  		return madvise_inject_error(behavior, start, start + len_in);
>  #endif
>
> +	if (behavior == MADV_STOCKPILE)
> +		return madvise_stockpile(start, len);
> +
>  	write = madvise_need_mmap_write(behavior);
>  	if (write) {
>  		if (down_write_killable(&current->mm->mmap_sem))
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e50a2db5b4ff..dc23dc6bbeb3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2276,6 +2276,47 @@ void mem_cgroup_handle_over_high(void)
>  	current->memcg_nr_pages_over_high = 0;
>  }
>
> +int mem_cgroup_stockpile(unsigned long goal_pages)
> +{
> +	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +	unsigned long limit, nr_free, progress;
> +	struct mem_cgroup *memcg, *pos;
> +	int ret = 0;
> +
> +	pos = memcg = get_mem_cgroup_from_mm(current->mm);
> +
> +retry:
> +	if (signal_pending(current)) {
> +		ret = -EINTR;
> +		goto out;
> +	}
> +
> +	limit = min(pos->memory.max, pos->high);
> +	if (goal_pages > limit) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	nr_free = limit - page_counter_read(&pos->memory);
> +	if ((long)nr_free < (long)goal_pages) {
> +		progress = try_to_free_mem_cgroup_pages(pos,
> +				goal_pages - nr_free, GFP_HIGHUSER, true);
> +		if (progress || nr_retries--)
> +			goto retry;
> +		ret = -EAGAIN;
> +		goto out;
> +	}
> +
> +	nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +	pos = parent_mem_cgroup(pos);
> +	if (pos)
> +		goto retry;
> +
> +out:
> +	css_put(&memcg->css);
> +	return ret;
> +}
> +
>  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		      unsigned int nr_pages)
>  {
> diff --git a/tools/vm/Makefile b/tools/vm/Makefile
> index 20f6cf04377f..e5b5bc0d9421 100644
> --- a/tools/vm/Makefile
> +++ b/tools/vm/Makefile
> @@ -1,7 +1,7 @@
>  # SPDX-License-Identifier: GPL-2.0
>  # Makefile for vm tools
>  #
> -TARGETS=page-types slabinfo page_owner_sort
> +TARGETS=page-types slabinfo page_owner_sort stockpile
>
>  LIB_DIR = ../lib/api
>  LIBS = $(LIB_DIR)/libapi.a
> diff --git a/tools/vm/stockpile.c b/tools/vm/stockpile.c
> new file mode 100644
> index 000000000000..245e24f293ec
> --- /dev/null
> +++ b/tools/vm/stockpile.c
> @@ -0,0 +1,30 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <sys/mman.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +#include <errno.h>
> +#include <err.h>
> +
> +#ifndef MADV_STOCKPILE
> +# define MADV_STOCKPILE	20
> +#endif
> +
> +int main(int argc, char **argv)
> +{
> +	int interval;
> +	size_t size;
> +	int ret;
> +
> +	if (argc != 3)
> +		errx(1, "usage: %s ", argv[0]);
> +
> +	size = atol(argv[1]) << 20;
> +	interval = atoi(argv[2]) * 1000;
> +
> +	while (1) {
> +		ret = madvise(NULL, size, MADV_STOCKPILE);
> +		if (ret && errno != EAGAIN)
> +			err(2, "madvise(NULL, %zu, MADV_STOCKPILE)", size);
> +		usleep(interval);
> +	}
> +}
>

-- 
Michal Hocko
SUSE Labs