linux-mm.kvack.org archive mirror
From: Fengguang Wu <fengguang.wu@intel.com>
To: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Huang Ying <ying.huang@intel.com>,
	Zhang Yi <yi.z.zhang@linux.intel.com>,
	kvm@vger.kernel.org, Dave Hansen <dave.hansen@intel.com>,
	Liu Jingqi <jingqi.liu@intel.com>, Fan Du <fan.du@intel.com>,
	Dong Eddie <eddie.dong@intel.com>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-accelerators@lists.ozlabs.org,
	Linux Memory Management List <linux-mm@kvack.org>,
	Peng Dong <dongx.peng@intel.com>, Yao Yuan <yuan.yao@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dan Williams <dan.j.williams@intel.com>,
	Mel Gorman <mgorman@suse.de>
Subject: Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
Date: Tue, 29 Jan 2019 10:00:44 +0800
Message-ID: <20190129020044.a5h3wjjqsf4tnwbs@wfg-t540p.sh.intel.com>
In-Reply-To: <20190128174239.0000636b@huawei.com>

Hi Jonathan,

Thanks for showing the gap on tracking hot accesses from devices.

On Mon, Jan 28, 2019 at 05:42:39PM +0000, Jonathan Cameron wrote:
>On Wed, 2 Jan 2019 12:21:10 +0000
>Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
>
>> On Fri, 28 Dec 2018 20:52:24 +0100
>> Michal Hocko <mhocko@kernel.org> wrote:
>>
>> > [Ccing Mel and Andrea]
>> >
>
>Hi,
>
>I just wanted to highlight this section as I didn't feel we really addressed this
>in the earlier conversation.
>
>> * Hot pages may not be hot just because the host is using them a lot.  It would be
>>   very useful to have a means of adding information available from accelerators
>>   beyond simple accessed bits (dreaming ;)  One problem here is translation
>>   caches (ATCs) as they won't normally result in any updates to the page accessed
>>   bits.  The arm SMMU v3 spec for example makes it clear (though it's kind of
>>   obvious) that the ATS request is the only opportunity to update the accessed
>>   bit.  The nasty option here would be to periodically flush the ATC to force
>>   the access bit updates via repeats of the ATS request (ouch).
>>   That option only works if the iommu supports updating the accessed flag
>>   (optional on SMMU v3 for example).

If ATS-based updates are supported, we could trigger them when the
/proc/PID/idle_pages file is closed, since we already flush the TLB at
that point. For example,

[PATCH 15/21] ept-idle: EPT walk for virtual machine

        ept_idle_release():
          kvm_flush_remote_tlbs(kvm);

[PATCH 17/21] proc: introduce /proc/PID/idle_pages

        mm_idle_release():
          flush_tlb_mm(mm);

The flush cost is close to the "minimal necessary" in our current usage
model, where a user space scan+migration daemon runs a loop like this:

loop:
        walk page table N times:
                open,read,close /proc/PID/idle_pages
                (flushes TLB on file close)
                sleep for a short interval
        sort and migrate hot pages
        sleep for a while
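
As a rough illustration only, one iteration of the loop above could be
sketched in Python as below. The scan and migrate callbacks are
hypothetical placeholders: scan() stands in for the open/read/close of
/proc/PID/idle_pages (whose record format is not described in this
mail), and migrate() stands in for whatever migration call the daemon
uses. The rounds, interval and min_hits parameters are likewise
assumptions made for the sketch.

```python
import time
from collections import Counter
from typing import Callable, Iterable, List

# Hypothetical callback types; neither is defined by the patch set.
ScanFn = Callable[[int], Iterable[int]]       # pid -> addresses seen accessed
MigrateFn = Callable[[int, List[int]], None]  # pid, hot page addresses


def scan_and_migrate_once(pid: int, scan: ScanFn, migrate: MigrateFn,
                          rounds: int = 4, interval: float = 0.1,
                          min_hits: int = 3,
                          sleep: Callable[[float], None] = time.sleep) -> List[int]:
    """Walk the page table `rounds` times, then migrate the hot pages."""
    hits: Counter = Counter()
    for _ in range(rounds):
        # One open/read/close cycle; in the real interface the close is
        # the point where the TLB gets flushed.
        hits.update(set(scan(pid)))
        sleep(interval)  # short pause between walks

    # Sort by hit count (hottest first), breaking ties by address, and
    # keep only pages seen accessed in at least `min_hits` walks.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    hot = [addr for addr, n in ranked if n >= min_hits]
    if hot:
        migrate(pid, hot)
    return hot
```

The outer "sleep for a while" is left to the caller, so the same
function can be driven from a timer or from a plain while loop.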

>If we ignore the IOMMU hardware update issue which will simply need to be addressed
>by future hardware if these techniques become common, how do we address the
>Address Translation Cache issue without potentially causing big performance
>problems by flushing the cache just to force an accessed bit update?
>
>These devices are frequently used with PRI and Shared Virtual Addressing
>and can be accessing most of your memory without you having any visibility
>of it in the page tables (as they aren't walked if your ATC is well matched
>in size to your use case).
>
>Classic example would be accelerated DB walkers like the CCIX demo
>Xilinx has shown at a few conferences.   The whole point of those is that
>most of the time only your large set of database walkers is using your
>memory and they have translations cached for a good part of what
>they are accessing.  Flushing that cache could hurt a lot.
>Pinning pages hurts for all the normal flexibility reasons.
>
>Last thing we want is to be migrating these pages that can be very hot but
>in an invisible fashion.

If there is some other way to get hotness for special device memory,
the user space daemon may be extended to cover that, perhaps by
querying another new kernel interface.

By driving hotness accounting and migration in user space, we gain
this kind of flexibility. From the daemon's point of view,
/proc/PID/idle_pages provides one common way to get "accessed" bits,
and hence hotness, though the daemon does not need to depend solely
on it.
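
To make the flexibility point concrete, here is a minimal Python
sketch of a daemon that merges per-page hit counts from several
pluggable sources. Everything in it is an assumption for illustration:
the HotnessSource type, and the idea that a device-side source would
return counts keyed by page address, are not part of the patch set or
of any existing kernel interface.

```python
from collections import Counter
from typing import Callable, Dict, Iterable

# Hypothetical: a source maps a pid to {page address: observed hit count}.
HotnessSource = Callable[[int], Dict[int, int]]


def combined_hotness(pid: int, sources: Iterable[HotnessSource]) -> Dict[int, int]:
    """Sum the hit counts reported by every source for one process."""
    total: Counter = Counter()
    for source in sources:
        # Counter.update() adds counts instead of overwriting them, so a
        # page reported by two sources ends up with the sum of both.
        total.update(source(pid))
    return dict(total)
```

With this shape, a page that is hot only through device accesses still
shows up in the merged result, as long as some source reports it.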

Thanks,
Fengguang


