From: Andres Lagar-Cavilla <andreslc@google.com>
To: Vladimir Davydov <vdavydov@parallels.com>,
Paolo Bonzini <pbonzini@redhat.com>,
kvm@vger.kernel.org, Eric Northup <digitaleric@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Minchan Kim <minchan@kernel.org>,
Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@suse.cz>, Greg Thelen <gthelen@google.com>,
Michel Lespinasse <walken@google.com>,
David Rientjes <rientjes@google.com>,
Pavel Emelyanov <xemul@parallels.com>,
Cyrill Gorcunov <gorcunov@openvz.org>,
Jonathan Corbet <corbet@lwn.net>,
linux-api@vger.kernel.org, linux-doc@vger.kernel.org,
linux-mm@kvack.org, cgroups@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH -mm v8 5/7] mmu-notifier: add clear_young callback
Date: Wed, 15 Jul 2015 12:16:33 -0700 [thread overview]
Message-ID: <CAJu=L58yzBr8+XaV90x+S60YnJzd7Yr2fDEgaQ0bcCKpwzSAhw@mail.gmail.com> (raw)
In-Reply-To: <82693bd5b5dbf4e65657fa22288942650aa04a0a.1436967694.git.vdavydov@parallels.com>
On Wed, Jul 15, 2015 at 6:54 AM, Vladimir Davydov
<vdavydov@parallels.com> wrote:
> In the scope of the idle memory tracking feature, which is introduced by
> the following patch, we need to clear the referenced/accessed bit not
> only in primary, but also in secondary ptes. The latter is required in
> order to estimate wss of KVM VMs. At the same time we want to avoid
> flushing tlb, because it is quite expensive and it won't really affect
> the final result.
>
> Currently, there is no function for clearing pte young bit that would
> meet our requirements, so this patch introduces one. To achieve that we
> have to add a new mmu-notifier callback, clear_young, since there is no
> method for testing-and-clearing a secondary pte w/o flushing tlb. The
> new method is not mandatory and currently only implemented by KVM.
>
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Added Paolo Bonzini, kvm list, Eric Northup.
> ---
> include/linux/mmu_notifier.h | 44 ++++++++++++++++++++++++++++++++++++++++++++
> mm/mmu_notifier.c | 17 +++++++++++++++++
> virt/kvm/kvm_main.c | 18 ++++++++++++++++++
> 3 files changed, 79 insertions(+)
>
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 61cd67f4d788..a5b17137c683 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -66,6 +66,16 @@ struct mmu_notifier_ops {
> unsigned long end);
>
> /*
> + * clear_young is a lightweight version of clear_flush_young. Like the
> + * latter, it is supposed to test-and-clear the young/accessed bitflag
> + * in the secondary pte, but it may omit flushing the secondary tlb.
> + */
> + int (*clear_young)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long start,
> + unsigned long end);
> +
> + /*
> * test_young is called to check the young/accessed bitflag in
> * the secondary pte. This is used to know if the page is
> * frequently used without actually clearing the flag or tearing
> @@ -203,6 +213,9 @@ extern void __mmu_notifier_release(struct mm_struct *mm);
> extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
> unsigned long start,
> unsigned long end);
> +extern int __mmu_notifier_clear_young(struct mm_struct *mm,
> + unsigned long start,
> + unsigned long end);
> extern int __mmu_notifier_test_young(struct mm_struct *mm,
> unsigned long address);
> extern void __mmu_notifier_change_pte(struct mm_struct *mm,
> @@ -231,6 +244,15 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
> return 0;
> }
>
> +static inline int mmu_notifier_clear_young(struct mm_struct *mm,
> + unsigned long start,
> + unsigned long end)
> +{
> + if (mm_has_notifiers(mm))
> + return __mmu_notifier_clear_young(mm, start, end);
> + return 0;
> +}
> +
> static inline int mmu_notifier_test_young(struct mm_struct *mm,
> unsigned long address)
> {
> @@ -311,6 +333,28 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
> __young; \
> })
>
> +#define ptep_clear_young_notify(__vma, __address, __ptep) \
> +({ \
> + int __young; \
> + struct vm_area_struct *___vma = __vma; \
> + unsigned long ___address = __address; \
> + __young = ptep_test_and_clear_young(___vma, ___address, __ptep);\
> + __young |= mmu_notifier_clear_young(___vma->vm_mm, ___address, \
> + ___address + PAGE_SIZE); \
> + __young; \
> +})
> +
> +#define pmdp_clear_young_notify(__vma, __address, __pmdp) \
> +({ \
> + int __young; \
> + struct vm_area_struct *___vma = __vma; \
> + unsigned long ___address = __address; \
> + __young = pmdp_test_and_clear_young(___vma, ___address, __pmdp);\
> + __young |= mmu_notifier_clear_young(___vma->vm_mm, ___address, \
> + ___address + PMD_SIZE); \
> + __young; \
> +})
> +
> #define ptep_clear_flush_notify(__vma, __address, __ptep) \
> ({ \
> unsigned long ___addr = __address & PAGE_MASK; \
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 3b9b3d0741b2..5fbdd367bbed 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -123,6 +123,23 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
> return young;
> }
>
> +int __mmu_notifier_clear_young(struct mm_struct *mm,
> + unsigned long start,
> + unsigned long end)
> +{
> + struct mmu_notifier *mn;
> + int young = 0, id;
> +
> + id = srcu_read_lock(&srcu);
> + hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> + if (mn->ops->clear_young)
> + young |= mn->ops->clear_young(mn, mm, start, end);
> + }
> + srcu_read_unlock(&srcu, id);
> +
> + return young;
> +}
> +
> int __mmu_notifier_test_young(struct mm_struct *mm,
> unsigned long address)
> {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 05148a43ef9c..61500cb028a3 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -388,6 +388,23 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
> return young;
> }
>
> +static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long start,
> + unsigned long end)
> +{
> + struct kvm *kvm = mmu_notifier_to_kvm(mn);
> + int young, idx;
For reclaim, the clear_flush_young notifier may blow up the secondary
pte to estimate the access pattern, depending on hardware support (EPT
access bits available in Haswell onwards, not sure about AMD, PPC,
etc).
This is ok, because it's reclaim, we need to know the access pattern,
chances are the page is a goner anyway.
However, not so sure about that cost in this context. Depending on
user-space, this will periodically tear down all EPT tables in the
system. That's tricky.
So please add a note to that effect, so in the fullness of time kvm
may be able to refuse enacting this notifier based on performance/VM
priority/foo concerns.
> +
> + idx = srcu_read_lock(&kvm->srcu);
> + spin_lock(&kvm->mmu_lock);
> + young = kvm_age_hva(kvm, start, end);
Also please add a comment along the lines of no one really knowing
when and if to flush the secondary tlb.
We might come up with a heuristic later, or leave up to the regular
system cadence. We just don't know at the moment.
Andres
> + spin_unlock(&kvm->mmu_lock);
> + srcu_read_unlock(&kvm->srcu, idx);
> +
> + return young;
> +}
> +
> static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
> struct mm_struct *mm,
> unsigned long address)
> @@ -420,6 +437,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
> .invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
> .invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
> .clear_flush_young = kvm_mmu_notifier_clear_flush_young,
> + .clear_young = kvm_mmu_notifier_clear_young,
> .test_young = kvm_mmu_notifier_test_young,
> .change_pte = kvm_mmu_notifier_change_pte,
> .release = kvm_mmu_notifier_release,
> --
> 2.1.4
>
--
Andres Lagar-Cavilla | Google Kernel Team | andreslc@google.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2015-07-15 19:16 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-07-15 13:54 [PATCH -mm v8 0/7] idle memory tracking Vladimir Davydov
2015-07-15 13:54 ` [PATCH -mm v8 1/7] memcg: add page_cgroup_ino helper Vladimir Davydov
2015-07-15 18:59 ` Andres Lagar-Cavilla
2015-07-15 13:54 ` [PATCH -mm v8 2/7] hwpoison: use page_cgroup_ino for filtering by memcg Vladimir Davydov
2015-07-15 19:00 ` Andres Lagar-Cavilla
2015-07-15 13:54 ` [PATCH -mm v8 3/7] memcg: zap try_get_mem_cgroup_from_page Vladimir Davydov
2015-07-15 13:54 ` [PATCH -mm v8 4/7] proc: add kpagecgroup file Vladimir Davydov
2015-07-15 19:03 ` Andres Lagar-Cavilla
2015-07-16 9:28 ` Vladimir Davydov
2015-07-16 19:04 ` Andres Lagar-Cavilla
2015-07-17 9:27 ` Vladimir Davydov
2015-07-15 13:54 ` [PATCH -mm v8 5/7] mmu-notifier: add clear_young callback Vladimir Davydov
2015-07-15 19:16 ` Andres Lagar-Cavilla [this message]
2015-07-16 11:35 ` Paolo Bonzini
2015-07-15 13:54 ` [PATCH -mm v8 6/7] proc: add kpageidle file Vladimir Davydov
2015-07-15 19:42 ` Andres Lagar-Cavilla
2015-07-16 9:53 ` Vladimir Davydov
2015-07-15 13:54 ` [PATCH -mm v8 7/7] proc: export idle flag via kpageflags Vladimir Davydov
2015-07-15 19:17 ` Andres Lagar-Cavilla
2015-07-15 20:47 ` [PATCH -mm v8 0/7] idle memory tracking Andres Lagar-Cavilla
2015-07-16 10:02 ` Vladimir Davydov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAJu=L58yzBr8+XaV90x+S60YnJzd7Yr2fDEgaQ0bcCKpwzSAhw@mail.gmail.com' \
--to=andreslc@google.com \
--cc=akpm@linux-foundation.org \
--cc=cgroups@vger.kernel.org \
--cc=corbet@lwn.net \
--cc=digitaleric@google.com \
--cc=gorcunov@openvz.org \
--cc=gthelen@google.com \
--cc=hannes@cmpxchg.org \
--cc=kvm@vger.kernel.org \
--cc=linux-api@vger.kernel.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.cz \
--cc=minchan@kernel.org \
--cc=pbonzini@redhat.com \
--cc=raghavendra.kt@linux.vnet.ibm.com \
--cc=rientjes@google.com \
--cc=vdavydov@parallels.com \
--cc=walken@google.com \
--cc=xemul@parallels.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox