From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.1 required=3.0 tests=DKIM_INVALID,DKIM_SIGNED, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY, SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 87AC9C33C99 for ; Tue, 7 Jan 2020 08:24:55 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id D2D772077B for ; Tue, 7 Jan 2020 08:24:54 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (1024-bit key) header.d=yandex-team.ru header.i=@yandex-team.ru header.b="FJbXwlcC" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org D2D772077B Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=yandex-team.ru Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 6D21C8E001B; Tue, 7 Jan 2020 03:24:54 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 65A3C8E0001; Tue, 7 Jan 2020 03:24:54 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 570F98E001B; Tue, 7 Jan 2020 03:24:54 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0108.hostedemail.com [216.40.44.108]) by kanga.kvack.org (Postfix) with ESMTP id 3E57C8E0001 for ; Tue, 7 Jan 2020 03:24:54 -0500 (EST) Received: from smtpin18.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with SMTP id DDA433AA7 for ; Tue, 7 Jan 2020 08:24:53 +0000 (UTC) X-FDA: 76350152466.18.honey82_47b0bc928063a X-HE-Tag: honey82_47b0bc928063a X-Filterd-Recvd-Size: 10408 Received: from forwardcorp1p.mail.yandex.net (forwardcorp1p.mail.yandex.net [77.88.29.217]) by imf16.hostedemail.com (Postfix) with ESMTP for ; Tue, 7 Jan 2020 08:24:52 +0000 (UTC) Received: from mxbackcorp2j.mail.yandex.net (mxbackcorp2j.mail.yandex.net [IPv6:2a02:6b8:0:1619::119]) by forwardcorp1p.mail.yandex.net (Yandex) with ESMTP id 9B2E62E12D7; Tue, 7 Jan 2020 11:24:49 +0300 (MSK) Received: from myt4-18a966dbd9be.qloud-c.yandex.net (myt4-18a966dbd9be.qloud-c.yandex.net [2a02:6b8:c00:12ad:0:640:18a9:66db]) by mxbackcorp2j.mail.yandex.net (mxbackcorp/Yandex) with ESMTP id LSUJrHcg5s-OnPmhG6S; Tue, 07 Jan 2020 11:24:49 +0300 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yandex-team.ru; s=default; t=1578385489; bh=amHGw551aXYgpduMpWoLaSTOVPA3lpbL4Xgfo74IJig=; h=In-Reply-To:Message-ID:From:Date:References:To:Subject:Cc; b=FJbXwlcCEB4AnEKJOUVo14DxScHcGHY+IluuIVQybbuM+wKe6qQN9pCvHvA0gtvUt RzOMDcWu6qMvzFljHIZFEn7aBSX9dD3Xox/5cGCGtsCE3sw6TQJ4Fgeq1cXOPT53Dp nqlQ09p++3R3VTFcf9YJTUdy6WnWeX1Rs+04icks= Authentication-Results: mxbackcorp2j.mail.yandex.net; dkim=pass header.i=@yandex-team.ru Received: from unknown (unknown [2a02:6b8:b080:6407::1:5]) by myt4-18a966dbd9be.qloud-c.yandex.net (smtpcorp/Yandex) with ESMTPSA id hHwwf5o8nZ-OmVO8eOS; Tue, 07 Jan 2020 11:24:49 +0300 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)) (Client certificate not present) Subject: Re: [PATCH] mm/rmap: fix reusing mergeable anon_vma as parent when fork To: "lixinhai.lxh@gmail.com" , Konstantin Khlebnikov Cc: "linux-mm@kvack.org" , akpm , "linux-kernel@vger.kernel.org" , "richardw.yang" , "kirill.shutemov" References: <157830736034.8148.7070851958306750616.stgit@buzz> <2020010710441027026650@gmail.com> From: Konstantin Khlebnikov Message-ID: Date: Tue, 7 Jan 2020 11:24:47 +0300 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.2.2 MIME-Version: 1.0 In-Reply-To: <2020010710441027026650@gmail.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-CA Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 07/01/2020 05.44, lixinhai.lxh@gmail.com wrote: > On 2020-01-07=C2=A0at 04:35=C2=A0Konstantin Khlebnikov=C2=A0wrote: >> On Mon, Jan 6, 2020 at 1:42 PM Konstantin Khlebnikov >> wrote: >>> >>> This fixes couple misconceptions in commit 4e4a9eb92133 ("mm/rmap.c: = reuse >>> mergeable anon_vma as parent when fork"). >>> >>> First problem caused by initialization order in dup_mmap(): vma->vm_p= rev >>> is set after calling anon_vma_fork(). Thus in anon_vma_fork() it poin= ts to >>> previous VMA in parent mm. This is fixed by rearrangement in dup_mmap= (). >>> >>> If in parent VMAs: SRC1 SRC2 .. SRCn share anon-vma ANON0, then after= fork >>> before all patches in child process related VMAs: DST1 DST2 .. DSTn w= ill >>> use different anon-vmas: ANON1 ANON2 .. ANONn. Before this patch only= DST1 >>> will fork new ANON1 and following DST2 .. DSTn will share parent's AN= ON0. >>> With this patch DST1 will create new ANON1 and DST2 .. DSTn will shar= e it. >>> >>> Also this patch moves sharing logic out of anon_vma_clone() into more >>> specific anon_vma_fork() because this supposed to work only at fork()= . >>> Function anon_vma_clone() is more generic is also used at splitting V= MAs. >>> >>> Second problem is hidden behind first one: assumption "Parent has vm_= prev, >>> which implies we have vm_prev" is wrong if first VMA in parent mm has= set >>> flag VM_DONTCOPY. Luckily prev->anon_vma doesn't dereference NULL poi= nter >>> because in current code 'prev' actually is same as 'pprev'. To avoid = that >>> this patch just checks pointer and compares vm_start to verify relati= on >>> between previous VMAs in parent and child. >>> >>> Signed-off-by: Konstantin Khlebnikov >>> Fixes: 4e4a9eb92133 ("mm/rmap.c: reuse mergeable anon_vma as parent w= hen fork") >> >> Oops, I've forgot to mention that Li Xinhai >> found and reported this suspicious code. Sorry. >> >> Reported-by: Li Xinhai >> Link: https://lore.kernel.org/linux-mm/CALYGNiNzz+dxHX0g5-gNypUQc3B=3D= 8_Scp53-NTOh=3DzWsdUuHAw@mail.gmail.com/T/#t >> >=20 > Can we change the interface > int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pv= ma), > to > int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pv= ma, struct vm_area_struct *pcvma), > and 'pcvma' means previous child vma. > so highlight the use of that vma, and the current code sequence for lin= king 'tmp' vma > in dup_mmap() is not changed(in case some code would have dependency on= that > linking sequence) There should be no dependency on linking sequence. But we could generalize sharing: cloned vma could share prev anon-vma (or any other actually) if anon_vma->parent =3D=3D src->anon_vma. This is more clear than sharing only between related vmas. >=20 > Another issue is for linking the avc for the reused anon_vma. anon_vma_= clone() > use the iteration > list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma), > to link avc for child vma, and it is unable to reach the resued anon_vm= a because > that is from the previous vma not from parent vma. So, in anon_vma_fork= (), > we need to setup the=C2=A0avc link for vma->anon. Oh, yes. That's another example where current code miraculously stays cor= rect. >=20 >>> --- >>> =C2=A0 kernel/fork.c |=C2=A0=C2=A0=C2=A0 4 ++-- >>> =C2=A0 mm/rmap.c=C2=A0=C2=A0=C2=A0=C2=A0 |=C2=A0=C2=A0 25 ++++++++++= ++------------- >>> =C2=A0 2 files changed, 14 insertions(+), 15 deletions(-) >>> >>> diff --git a/kernel/fork.c b/kernel/fork.c >>> index 2508a4f238a3..04ee5e243f65 100644 >>> --- a/kernel/fork.c >>> +++ b/kernel/fork.c >>> @@ -548,6 +548,8 @@ static __latent_entropy int dup_mmap(struct mm_st= ruct *mm, >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 if (retval) >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= goto fail_nomem_policy; >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 tmp->vm_mm =3D mm; >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 tmp->vm_prev =3D prev;=C2=A0=C2=A0=C2=A0 /* anon_vma_fork= use this */ >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 tmp->vm_next =3D NULL; >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 retval =3D dup_userfaultfd(tmp, &uf); >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 if (retval) >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= goto fail_nomem_anon_vma_fork; >>> @@ -559,7 +561,6 @@ static __latent_entropy int dup_mmap(struct mm_st= ruct *mm, >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 } else if (anon_vma_fork(tmp, mpnt)) >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= goto fail_nomem_anon_vma_fork; >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 tmp->vm_flags &=3D ~(VM_LOCKED | VM_LOCKONFAU= LT); >>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 tmp->vm_next =3D tmp->vm_prev =3D NULL; >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 file =3D tmp->vm_file; >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 if (file) { >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= struct inode *inode =3D file_inode(file); >>> @@ -592,7 +593,6 @@ static __latent_entropy int dup_mmap(struct mm_st= ruct *mm, >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 */ >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 *pprev =3D tmp; >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 pprev =3D &tmp->vm_next; >>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 tmp->vm_prev =3D prev; >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 prev =3D tmp; >>> >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 __vma_link_rb(mm, tmp, rb_link, rb_parent); >>> diff --git a/mm/rmap.c b/mm/rmap.c >>> index b3e381919835..77b3aa38d5c2 100644 >>> --- a/mm/rmap.c >>> +++ b/mm/rmap.c >>> @@ -269,19 +269,6 @@ int anon_vma_clone(struct vm_area_struct *dst, s= truct vm_area_struct *src) >>> =C2=A0 { >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct anon_vma_cha= in *avc, *pavc; >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct anon_vma *ro= ot =3D NULL; >>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct vm_area_struct *prev =3D= dst->vm_prev, *pprev =3D src->vm_prev; >>> - >>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* >>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * If parent share anon_vm= a with its vm_prev, keep this sharing in in >>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * child. >>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * >>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * 1. Parent has vm_prev, = which implies we have vm_prev. >>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * 2. Parent and its vm_pr= ev have the same anon_vma. >>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 */ >>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (!dst->anon_vma && src->anon= _vma && >>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pprev &= & pprev->anon_vma =3D=3D src->anon_vma) >>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 dst->anon_vma =3D prev->anon_vma; >>> - >>> >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 list_for_each_entry= _reverse(pavc, &src->anon_vma_chain, same_vma) { >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 struct anon_vma *anon_vma; >>> @@ -334,6 +321,7 @@ int anon_vma_clone(struct vm_area_struct *dst, st= ruct vm_area_struct *src) >>> =C2=A0=C2=A0 */ >>> =C2=A0 int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_= struct *pvma) >>> =C2=A0 { >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct vm_area_struct *prev =3D= vma->vm_prev, *pprev =3D pvma->vm_prev; >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct anon_vma_cha= in *avc; >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct anon_vma *an= on_vma; >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 int error; >>> @@ -345,6 +333,17 @@ int anon_vma_fork(struct vm_area_struct *vma, st= ruct vm_area_struct *pvma) >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* Drop inherited a= non_vma, we'll reuse existing or allocate new. */ >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 vma->anon_vma =3D N= ULL; >>> >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * If parent shares anon_v= ma with its vm_prev, keep this sharing. >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * Previous VMA could be m= issing or not match previuos in parent >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * if VM_DONTCOPY is set: = compare vm_start to avoid this case. >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 */ >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (pvma->anon_vma && pprev && = prev && >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pprev->= anon_vma =3D=3D pvma->anon_vma && >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pprev->= vm_start =3D=3D prev->vm_start) >>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 vma->anon_vma =3D prev->anon_vma; >>> + >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * First, atta= ch the new VMA to the parent VMA's anon_vmas, >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * so rmap can= find non-COWed pages in child processes. >>> >> >