Subject: Re: [PATCH v3 2/7] mm: Add a walk_page_mapping() function to the pagewalk code
To: "Kirill A. Shutemov"
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, torvalds@linux-foundation.org,
 Thomas Hellstrom, Andrew Morton, Matthew Wilcox, Will Deacon, Peter Zijlstra,
 Rik van Riel, Minchan Kim, Michal Hocko, Huang Ying, Jérôme Glisse
References: <20191002134730.40985-1-thomas_os@shipmail.org>
 <20191002134730.40985-3-thomas_os@shipmail.org>
 <20191003111708.sttkkrhiidleivc6@box>
From: Thomas Hellström (VMware)
Organization: VMware Inc.
Date: Thu, 3 Oct 2019 13:32:45 +0200
In-Reply-To: <20191003111708.sttkkrhiidleivc6@box>

On 10/3/19 1:17 PM, Kirill A. Shutemov wrote:
> On Wed, Oct 02, 2019 at 03:47:25PM +0200, Thomas Hellström (VMware) wrote:
>> From: Thomas Hellstrom
>>
>> For users that want to traverse all page table entries pointing into a
>> region of a struct address_space mapping, introduce a walk_page_mapping()
>> function.
>>
>> The walk_page_mapping() function will initially be used for dirty-
>> tracking in virtual graphics drivers.
>>
>> Cc: Andrew Morton
>> Cc: Matthew Wilcox
>> Cc: Will Deacon
>> Cc: Peter Zijlstra
>> Cc: Rik van Riel
>> Cc: Minchan Kim
>> Cc: Michal Hocko
>> Cc: Huang Ying
>> Cc: Jérôme Glisse
>> Cc: Kirill A. Shutemov
>> Signed-off-by: Thomas Hellstrom
>> ---
>>  include/linux/pagewalk.h |  9 ++++
>>  mm/pagewalk.c            | 99 +++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 107 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
>> index bddd9759bab9..6ec82e92c87f 100644
>> --- a/include/linux/pagewalk.h
>> +++ b/include/linux/pagewalk.h
>> @@ -24,6 +24,9 @@ struct mm_walk;
>>   *			"do page table walk over the current vma", returning
>>   *			a negative value means "abort current page table walk
>>   *			right now" and returning 1 means "skip the current vma"
>> + * @pre_vma:		if set, called before starting walk on a non-null vma.
>> + * @post_vma:		if set, called after a walk on a non-null vma, provided
>> + *			that @pre_vma and the vma walk succeeded.
>>   */
>>  struct mm_walk_ops {
>>  	int (*pud_entry)(pud_t *pud, unsigned long addr,
>> @@ -39,6 +42,9 @@ struct mm_walk_ops {
>>  			 struct mm_walk *walk);
>>  	int (*test_walk)(unsigned long addr, unsigned long next,
>>  			 struct mm_walk *walk);
>> +	int (*pre_vma)(unsigned long start, unsigned long end,
>> +		       struct mm_walk *walk);
>> +	void (*post_vma)(struct mm_walk *walk);
>>  };
>>
>>  /**
>> @@ -62,5 +68,8 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>>  		    void *private);
>>  int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
>>  		  void *private);
>> +int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
>> +		      pgoff_t nr, const struct mm_walk_ops *ops,
>> +		      void *private);
>>
>>  #endif /* _LINUX_PAGEWALK_H */
>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>> index d48c2a986ea3..658d1e5ec428 100644
>> --- a/mm/pagewalk.c
>> +++ b/mm/pagewalk.c
>> @@ -253,13 +253,23 @@ static int __walk_page_range(unsigned long start, unsigned long end,
>>  {
>>  	int err = 0;
>>  	struct vm_area_struct *vma = walk->vma;
>> +	const struct mm_walk_ops *ops = walk->ops;
>> +
>> +	if (vma && ops->pre_vma) {
>> +		err = ops->pre_vma(start, end, walk);
>> +		if (err)
>> +			return err;
>> +	}
>>
>>  	if (vma && is_vm_hugetlb_page(vma)) {
>> -		if (walk->ops->hugetlb_entry)
>> +		if (ops->hugetlb_entry)
>>  			err = walk_hugetlb_range(start, end, walk);
>>  	} else
>>  		err = walk_pgd_range(start, end, walk);
>>
>> +	if (vma && ops->post_vma)
>> +		ops->post_vma(walk);
>> +
>>  	return err;
>>  }
>>
>> @@ -285,11 +295,17 @@ static int __walk_page_range(unsigned long start, unsigned long end,
>>   * - <0 : failed to handle the current entry, and return to the caller
>>   *        with error code.
>>   *
>> + *
>>   * Before starting to walk page table, some callers want to check whether
>>   * they really want to walk over the current vma, typically by checking
>>   * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
>>   * purpose.
>>   *
>> + * If operations need to be staged before and committed after a vma is walked,
>> + * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
>> + * since it is intended to handle commit-type operations, can't return any
>> + * errors.
>> + *
>>   * struct mm_walk keeps current values of some common data like vma and pmd,
>>   * which are useful for the access from callbacks. If you want to pass some
>>   * caller-specific data to callbacks, @private should be helpful.
>> @@ -376,3 +392,84 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
>>  		return err;
>>  	return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
>>  }
>> +
>> +/**
>> + * walk_page_mapping - walk all memory areas mapped into a struct address_space.
>> + * @mapping: Pointer to the struct address_space
>> + * @first_index: First page offset in the address_space
>> + * @nr: Number of incremental page offsets to cover
>> + * @ops: operation to call during the walk
>> + * @private: private data for callbacks' usage
>> + *
>> + * This function walks all memory areas mapped into a struct address_space.
>> + * The walk is limited to only the given page-size index range, but if
>> + * the index boundaries cross a huge page-table entry, that entry will be
>> + * included.
>> + *
>> + * Also see walk_page_range() for additional information.
>> + *
>> + * Locking:
>> + *   This function can't require that the struct mm_struct::mmap_sem is held,
>> + *   since @mapping may be mapped by multiple processes. Instead
>> + *   @mapping->i_mmap_rwsem must be held. This might have implications in the
>> + *   callbacks, and it's up to the caller to ensure that the
>> + *   struct mm_struct::mmap_sem is not needed.
>> + *
>> + *   Also this means that a caller can't rely on the struct
>> + *   vm_area_struct::vm_flags to be constant across a call,
>> + *   except for immutable flags. Callers requiring this shouldn't use
>> + *   this function.
>> + *
>> + *   If @mapping allows faulting of huge pmds and puds, it is desirable
>> + *   that its huge_fault() handler blocks while this function is running on
>> + *   @mapping. Otherwise a race may occur where the huge entry is split when
>> + *   it was intended to be handled in a huge entry callback. This requires an
>> + *   external lock, for example that @mapping->i_mmap_rwsem is held in
>> + *   write mode in the huge_fault() handlers.
> Em. No. We have ptl for this. It's the only lock required (plus mmap_sem
> on read) to split PMD entry into PTE table. And it can happen not only
> from fault path.
>
> If you care about splitting compound page under you, take a pin or lock a
> page. It will block split_huge_page().
>
> Suggestion to block fault path is not viable (and it will not happen
> magically just because of this comment).
>
I was specifically thinking of this:

https://elixir.bootlin.com/linux/latest/source/mm/pagewalk.c#L103

If a huge pud is concurrently faulted in here, it will immediately get
split without getting processed in pud_entry(). An external lock would
protect against that, but that's perhaps a bug in the pagewalk code?
For pmds the situation is not the same, since when pte_entry is used,
all pmds will unconditionally get split.

There's a similar, more scary race in

https://elixir.bootlin.com/linux/latest/source/mm/memory.c#L3931

It looks like if a concurrent thread faults in a huge pud just after the
test for pud_none in that pmd_alloc, things might go pretty bad.

/Thomas
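
For illustration, here is a minimal sketch (not part of the patch) of how a
caller might use the new walk_page_mapping() interface with the pre_vma()/
post_vma() callbacks. The clean_record_* names and the per-PTE logic are
hypothetical; the real users are the dirty-tracking helpers later in this
series, and whether i_mmap_rwsem is taken in read or write mode depends on
which races the caller must guard against, as discussed above.

/*
 * Usage sketch only: counts dirty PTEs covering a range of an
 * address_space.  A real dirty-tracking user would also modify the
 * PTEs (e.g. write-protect or clear the dirty bit) and flush TLBs.
 */
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagewalk.h>

struct clean_record_walk {
	unsigned long dirty;		/* dirty PTEs seen during the walk */
};

static int clean_record_pte(pte_t *pte, unsigned long addr,
			    unsigned long next, struct mm_walk *walk)
{
	struct clean_record_walk *crw = walk->private;

	/* Hypothetical per-PTE work. */
	if (pte_dirty(*pte))
		crw->dirty++;

	return 0;
}

static int clean_record_pre_vma(unsigned long start, unsigned long end,
				struct mm_walk *walk)
{
	/* Stage per-vma work here; a nonzero return aborts the walk. */
	return 0;
}

static void clean_record_post_vma(struct mm_walk *walk)
{
	/* Commit-type work only; this callback cannot return an error. */
}

static const struct mm_walk_ops clean_record_ops = {
	.pte_entry	= clean_record_pte,
	.pre_vma	= clean_record_pre_vma,
	.post_vma	= clean_record_post_vma,
};

static unsigned long clean_record_range(struct address_space *mapping,
					pgoff_t first_index, pgoff_t nr)
{
	struct clean_record_walk crw = { .dirty = 0 };

	/* walk_page_mapping() requires i_mmap_rwsem, not mmap_sem. */
	i_mmap_lock_read(mapping);
	walk_page_mapping(mapping, first_index, nr, &clean_record_ops, &crw);
	i_mmap_unlock_read(mapping);

	return crw.dirty;
}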