Date: Thu, 3 Oct 2019 14:17:08 +0300
From: "Kirill A. Shutemov"
To: Thomas Hellström (VMware)
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, torvalds@linux-foundation.org,
    Thomas Hellstrom, Andrew Morton, Matthew Wilcox, Will Deacon, Peter Zijlstra,
    Rik van Riel, Minchan Kim, Michal Hocko, Huang Ying, Jérôme Glisse
Subject: Re: [PATCH v3 2/7] mm: Add a walk_page_mapping() function to the pagewalk code
Message-ID: <20191003111708.sttkkrhiidleivc6@box>
References: <20191002134730.40985-1-thomas_os@shipmail.org> <20191002134730.40985-3-thomas_os@shipmail.org>
In-Reply-To: <20191002134730.40985-3-thomas_os@shipmail.org>

On Wed, Oct 02, 2019 at 03:47:25PM +0200, Thomas Hellström (VMware) wrote:
> From: Thomas Hellstrom
> 
> For users that want to traverse all page table entries pointing into a
> region of a struct address_space mapping, introduce a walk_page_mapping()
> function.
> 
> The walk_page_mapping() function will initially be used for dirty-
> tracking in virtual graphics drivers.
> 
> Cc: Andrew Morton
> Cc: Matthew Wilcox
> Cc: Will Deacon
> Cc: Peter Zijlstra
> Cc: Rik van Riel
> Cc: Minchan Kim
> Cc: Michal Hocko
> Cc: Huang Ying
> Cc: Jérôme Glisse
> Cc: Kirill A. Shutemov
> Signed-off-by: Thomas Hellstrom
> ---
>  include/linux/pagewalk.h |  9 ++++
>  mm/pagewalk.c            | 99 +++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 107 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index bddd9759bab9..6ec82e92c87f 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -24,6 +24,9 @@ struct mm_walk;
>   *			"do page table walk over the current vma", returning
>   *			a negative value means "abort current page table walk
>   *			right now" and returning 1 means "skip the current vma"
> + * @pre_vma:		if set, called before starting walk on a non-null vma.
> + * @post_vma:		if set, called after a walk on a non-null vma, provided
> + *			that @pre_vma and the vma walk succeeded.
>   */
>  struct mm_walk_ops {
>  	int (*pud_entry)(pud_t *pud, unsigned long addr,
> @@ -39,6 +42,9 @@ struct mm_walk_ops {
>  			      struct mm_walk *walk);
>  	int (*test_walk)(unsigned long addr, unsigned long next,
>  			struct mm_walk *walk);
> +	int (*pre_vma)(unsigned long start, unsigned long end,
> +		       struct mm_walk *walk);
> +	void (*post_vma)(struct mm_walk *walk);
>  };
>  
>  /**
> @@ -62,5 +68,8 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>  		void *private);
>  int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
>  		void *private);
> +int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
> +		      pgoff_t nr, const struct mm_walk_ops *ops,
> +		      void *private);
>  
>  #endif /* _LINUX_PAGEWALK_H */
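For readers wiring this up, a minimal caller sketch may help. Only
walk_page_mapping(), pre_vma() and post_vma() come from this patch; the
count_* names and the "count present PTEs" idea are made up here purely
for illustration and are not part of this series:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagewalk.h>

struct count_ctx {
	unsigned long ptes;
};

/* pte_entry callback: count present PTEs mapping the range */
static int count_pte(pte_t *pte, unsigned long addr, unsigned long next,
		     struct mm_walk *walk)
{
	struct count_ctx *ctx = walk->private;

	if (pte_present(*pte))
		ctx->ptes++;
	return 0;
}

/* called before each vma is walked; a non-zero return aborts the walk */
static int count_pre_vma(unsigned long start, unsigned long end,
			 struct mm_walk *walk)
{
	/* stage per-vma state here if needed */
	return 0;
}

/* called after the vma walk succeeded; cannot return an error */
static void count_post_vma(struct mm_walk *walk)
{
	/* commit per-vma state here */
}

static const struct mm_walk_ops count_ops = {
	.pte_entry = count_pte,
	.pre_vma   = count_pre_vma,
	.post_vma  = count_post_vma,
};

static unsigned long count_mapped_ptes(struct address_space *mapping,
				       pgoff_t first_index, pgoff_t nr)
{
	struct count_ctx ctx = { 0 };
	int err;

	/* walk_page_mapping() asserts that i_mmap_rwsem is held */
	i_mmap_lock_read(mapping);
	err = walk_page_mapping(mapping, first_index, nr, &count_ops, &ctx);
	i_mmap_unlock_read(mapping);

	return err ? 0 : ctx.ptes;
}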
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index d48c2a986ea3..658d1e5ec428 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -253,13 +253,23 @@ static int __walk_page_range(unsigned long start, unsigned long end,
>  {
>  	int err = 0;
>  	struct vm_area_struct *vma = walk->vma;
> +	const struct mm_walk_ops *ops = walk->ops;
> +
> +	if (vma && ops->pre_vma) {
> +		err = ops->pre_vma(start, end, walk);
> +		if (err)
> +			return err;
> +	}
>  
>  	if (vma && is_vm_hugetlb_page(vma)) {
> -		if (walk->ops->hugetlb_entry)
> +		if (ops->hugetlb_entry)
>  			err = walk_hugetlb_range(start, end, walk);
>  	} else
>  		err = walk_pgd_range(start, end, walk);
>  
> +	if (vma && ops->post_vma)
> +		ops->post_vma(walk);
> +
>  	return err;
>  }
>  
> @@ -285,11 +295,17 @@ static int __walk_page_range(unsigned long start, unsigned long end,
>   *  - <0 : failed to handle the current entry, and return to the caller
>   *         with error code.
>   *
> + *
>   * Before starting to walk page table, some callers want to check whether
>   * they really want to walk over the current vma, typically by checking
>   * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
>   * purpose.
>   *
> + * If operations need to be staged before and committed after a vma is walked,
> + * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
> + * since it is intended to handle commit-type operations, can't return any
> + * errors.
> + *
>   * struct mm_walk keeps current values of some common data like vma and pmd,
>   * which are useful for the access from callbacks. If you want to pass some
>   * caller-specific data to callbacks, @private should be helpful.
> @@ -376,3 +392,84 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
>  		return err;
>  	return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
>  }
> +
> +/**
> + * walk_page_mapping - walk all memory areas mapped into a struct address_space.
> + * @mapping: Pointer to the struct address_space
> + * @first_index: First page offset in the address_space
> + * @nr: Number of incremental page offsets to cover
> + * @ops: operation to call during the walk
> + * @private: private data for callbacks' usage
> + *
> + * This function walks all memory areas mapped into a struct address_space.
> + * The walk is limited to only the given page-size index range, but if
> + * the index boundaries cross a huge page-table entry, that entry will be
> + * included.
> + *
> + * Also see walk_page_range() for additional information.
> + *
> + * Locking:
> + *   This function can't require that the struct mm_struct::mmap_sem is held,
> + *   since @mapping may be mapped by multiple processes. Instead
> + *   @mapping->i_mmap_rwsem must be held. This might have implications in the
> + *   callbacks, and it's up to the caller to ensure that the
> + *   struct mm_struct::mmap_sem is not needed.
> + *
> + *   Also this means that a caller can't rely on the struct
> + *   vm_area_struct::vm_flags to be constant across a call,
> + *   except for immutable flags. Callers requiring this shouldn't use
> + *   this function.
> + *
> + *   If @mapping allows faulting of huge pmds and puds, it is desirable
> + *   that its huge_fault() handler blocks while this function is running on
> + *   @mapping. Otherwise a race may occur where the huge entry is split when
> + *   it was intended to be handled in a huge entry callback. This requires an
> + *   external lock, for example that @mapping->i_mmap_rwsem is held in
> + *   write mode in the huge_fault() handlers.

Em. No. We have ptl for this. It's the only lock required (plus mmap_sem
on read) to split a PMD entry into a PTE table. And it can happen not only
from the fault path.

If you care about a compound page being split under you, take a pin or
lock the page. That will block split_huge_page().

The suggestion to block the fault path is not viable (and it will not
happen magically just because of this comment).
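To make that concrete, a rough sketch of the kind of stabilization I mean
(illustration only, not code from this series): holding an extra reference
and the page lock across the critical section is enough, since the split
path takes the page lock itself and split_huge_page() typically fails with
-EBUSY when it sees unexpected extra pins.

#include <linux/mm.h>
#include <linux/pagemap.h>

static void operate_on_stable_compound_page(struct page *page)
{
	struct page *head = compound_head(page);

	get_page(head);		/* extra pin: split_huge_page() will refuse */
	lock_page(head);	/* the split path also needs the page lock */

	/* ... work that relies on the compound page staying intact ... */

	unlock_page(head);
	put_page(head);
}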
> + */
> +int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
> +		      pgoff_t nr, const struct mm_walk_ops *ops,
> +		      void *private)
> +{
> +	struct mm_walk walk = {
> +		.ops = ops,
> +		.private = private,
> +	};
> +	struct vm_area_struct *vma;
> +	pgoff_t vba, vea, cba, cea;
> +	unsigned long start_addr, end_addr;
> +	int err = 0;
> +
> +	lockdep_assert_held(&mapping->i_mmap_rwsem);
> +	vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index,
> +				  first_index + nr - 1) {
> +		/* Clip to the vma */
> +		vba = vma->vm_pgoff;
> +		vea = vba + vma_pages(vma);
> +		cba = first_index;
> +		cba = max(cba, vba);
> +		cea = first_index + nr;
> +		cea = min(cea, vea);
> +
> +		start_addr = ((cba - vba) << PAGE_SHIFT) + vma->vm_start;
> +		end_addr = ((cea - vba) << PAGE_SHIFT) + vma->vm_start;
> +		if (start_addr >= end_addr)
> +			continue;
> +
> +		walk.vma = vma;
> +		walk.mm = vma->vm_mm;
> +
> +		err = walk_page_test(vma->vm_start, vma->vm_end, &walk);
> +		if (err > 0) {
> +			err = 0;
> +			break;
> +		} else if (err < 0)
> +			break;
> +
> +		err = __walk_page_range(start_addr, end_addr, &walk);
> +		if (err)
> +			break;
> +	}
> +
> +	return err;
> +}
> -- 
> 2.20.1
> 

-- 
 Kirill A. Shutemov