From: Wei Xu <weixugc@google.com>
Date: Fri, 13 May 2022 00:21:16 -0700
Subject: Re: RFC: Memory Tiering Kernel Interfaces (v2)
To: "ying.huang@intel.com"
Cc: Andrew Morton, Greg Thelen, "Aneesh Kumar K.V", Yang Shi, Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple, Baolin Wang, Feng Tang, Jonathan Cameron, Davidlohr Bueso, Dan Williams, David Rientjes, Linux MM, Brice Goglin, Hesham Almatary
On Fri, May 13, 2022 at 12:04 AM ying.huang@intel.com wrote:
>
> On Thu, 2022-05-12 at 23:36 -0700, Wei Xu wrote:
> > On Thu, May 12, 2022 at 8:25 PM ying.huang@intel.com wrote:
> > >
> > > On Wed, 2022-05-11 at 23:22 -0700, Wei Xu wrote:
> > > >
> > > > Memory Allocation for Demotion
> > > > ==============================
> > > >
> > > > To allocate a new page as the demotion target for a page, the kernel
> > > > calls the allocation function (__alloc_pages_nodemask) with the
> > > > source page node as the preferred node and the union of all lower
> > > > tier nodes as the allowed nodemask.  The actual target node selection
> > > > then follows the allocation fallback order that the kernel has
> > > > already defined.
> > > >
> > > > The pseudo code looks like:
> > > >
> > > >     targets = NODE_MASK_NONE;
> > > >     src_nid = page_to_nid(page);
> > > >     src_tier = node_tier_map[src_nid];
> > > >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> > > >             nodes_or(targets, targets, memory_tiers[i]);
> > > >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > > >
> > > > The mempolicy of the cpuset, vma and owner task of the source page
> > > > can be set to refine the demotion target nodemask, e.g. to prevent
> > > > demotion or select a particular allowed node as the demotion target.
> > >
> > > Consider a system with 3 tiers.  If we want to demote some pages from
> > > tier 0, the desired behavior is:
> > >
> > > - Allocate pages from tier 1
> > > - If there are not enough free pages in tier 1, wake up the kswapd of
> > >   tier 1 so it demotes some pages from tier 1 to tier 2
> > > - If there are still not enough free pages in tier 1, allocate pages
> > >   from tier 2.
> > >
> > > In this way, tier 0 will have the hottest pages, while tier 2 will
> > > have the coldest pages.
> >
> > When we are already in the allocation path for the demotion of a page
> > from tier 0, I think we'd better not block this allocation to wait for
> > kswapd to demote pages from tier 1 to tier 2.  Instead, we should
> > directly allocate from tier 2.  Meanwhile, this demotion can wake up
> > kswapd to demote from tier 1 to tier 2 in the background.
>
> Yes.  That's what I want, too.  My original words may be misleading.
>
> > > With your proposed method, the behavior when demoting from tier 0 is:
> > >
> > > - Allocate pages from tier 1
> > > - If there are not enough free pages in tier 1, allocate pages in
> > >   tier 2
> > >
> > > The kswapd of tier 1 will not be woken up until there are not enough
> > > free pages in tier 2.  For quite a long time, there is not much
> > > hot/cold differentiation between tier 1 and tier 2.
> >
> > This is true with the current allocation code.  But I think we can make
> > some changes for demotion allocations.  For example, we can add a
> > GFP_DEMOTE flag and update the allocation function to wake up kswapd
> > when this flag is set and we need to fall back to another node.
> >
> > > This isn't hard to fix: just call __alloc_pages_nodemask() for each
> > > tier one by one, following the page allocation fallback order.
> >
> > That would have worked, except that there is an example earlier, in
> > which it is actually preferred for some nodes to demote to their tier
> > + 2, not tier + 1.
> >
> > More specifically, the example is:
> >
> >                      20
> >    Node 0 (DRAM)  ------  Node 1 (DRAM)
> >     |    |                  |    |
> >     |    | 30           120 |    |
> >     |    v                  v    | 100
> > 100 |      Node 2 (PMEM)         |
> >     |            |               |
> >     |            | 100           |
> >      \           v               v
> >        ->    Node 3 (Large Mem)
> >
> > Node distances:
> >   node   0    1    2    3
> >      0  10   20   30  100
> >      1  20   10  120  100
> >      2  30  120   10  100
> >      3 100  100  100   10
> >
> > 3 memory tiers are defined:
> >   tier 0: 0-1
> >   tier 1: 2
> >   tier 2: 3
> >
> > The demotion fallback order is:
> >   node 0: 2, 3
> >   node 1: 3, 2
> >   node 2: 3
> >   node 3: empty
> >
> > Note that even though node 3 is in tier 2 and node 2 is in tier 1,
> > node 1 (tier 0) still prefers node 3 as its first demotion target, not
> > node 2.
>
> Yes.  I understand that we need to support this use case.  We can use
> the tier order in the allocation fallback list instead of going from
> small to large.  That is, for node 1, the tier order for demotion is
> tier 2, then tier 1.

That could work, too, though I feel it might be simpler and more
efficient (no repeated calls to __alloc_pages() for the same
allocation) to modify __alloc_pages() itself.  Anyway, we can discuss
this more when it comes to the implementation of this demotion
allocation function.  I believe this should not affect the general
memory tiering interfaces proposed here.

> Best Regards,
> Huang, Ying