From: Barry Song <21cnbao@gmail.com>
Date: Tue, 28 Nov 2023 13:11:19 +1300
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
To: Ryan Roberts
Cc: akpm@linux-foundation.org, andreyknvl@gmail.com, anshuman.khandual@arm.com, ardb@kernel.org, catalin.marinas@arm.com, david@redhat.com, dvyukov@google.com, glider@google.com, james.morse@arm.com, jhubbard@nvidia.com, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mark.rutland@arm.com, maz@kernel.org, oliver.upton@linux.dev, ryabinin.a.a@gmail.com, suzuki.poulose@arm.com, vincenzo.frascino@arm.com, wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org, yuzenghui@huawei.com, yuzhao@google.com, ziy@nvidia.com
In-Reply-To: <755343a1-ce94-4d38-8317-0925e2dae3bc@arm.com>
References: <20231115163018.1303287-2-ryan.roberts@arm.com> <20231127055414.9015-1-v-songbaohua@oppo.com> <755343a1-ce94-4d38-8317-0925e2dae3bc@arm.com>
On Mon, Nov 27, 2023 at 10:24 PM Ryan Roberts wrote:
>
> On 27/11/2023 05:54, Barry Song wrote:
> >> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >> +		pte_t *dst_pte, pte_t *src_pte,
> >> +		unsigned long addr, unsigned long end,
> >> +		int *rss, struct folio **prealloc)
> >>  {
> >>  	struct mm_struct *src_mm = src_vma->vm_mm;
> >>  	unsigned long vm_flags = src_vma->vm_flags;
> >>  	pte_t pte = ptep_get(src_pte);
> >>  	struct page *page;
> >>  	struct folio *folio;
> >> +	int nr = 1;
> >> +	bool anon;
> >> +	bool any_dirty = pte_dirty(pte);
> >> +	int i;
> >>
> >>  	page = vm_normal_page(src_vma, addr, pte);
> >> -	if (page)
> >> +	if (page) {
> >>  		folio = page_folio(page);
> >> -	if (page && folio_test_anon(folio)) {
> >> -		/*
> >> -		 * If this page may have been pinned by the parent process,
> >> -		 * copy the page immediately for the child so that we'll always
> >> -		 * guarantee the pinned page won't be randomly replaced in the
> >> -		 * future.
> >> -		 */
> >> -		folio_get(folio);
> >> -		if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> >> -			/* Page may be pinned, we have to copy. */
> >> -			folio_put(folio);
> >> -			return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> >> -						 addr, rss, prealloc, page);
> >> +		anon = folio_test_anon(folio);
> >> +		nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> >> +						end, pte, &any_dirty);
> >
> > in case we have a large folio with 16 CONTPTE basepages, and userspace
> > do madvise(addr + 4KB * 5, DONTNEED);
>
> nit: if you are offsetting by 5 pages from addr, then below I think you mean
> page0~page4 and page6~15?
>
> > thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
> > will return 15. in this case, we should copy page0~page3 and page5~page15.
>
> No, I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
> not how it's intended to work. The function is scanning forwards from the current
> pte until it finds the first pte that does not fit in the batch - either because
> it maps a PFN that is not contiguous, or because the permissions are different
> (although this is being relaxed a bit; see the conversation with DavidH against
> this same patch).
>
> So the first time through this loop, folio_nr_pages_cont_mapped() will return 5
> (page0~page4), then the next time through the loop we will go through the
> !present path and process the single swap marker. Then the 3rd time through the
> loop folio_nr_pages_cont_mapped() will return 10.

One case we have met by running hundreds of real phones is as below,

static int
copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
	       pmd_t *dst_pmd, pmd_t *src_pmd,
	       unsigned long addr, unsigned long end)
{
	...
	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
	if (!dst_pte) {
		ret = -ENOMEM;
		goto out;
	}
	src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
	if (!src_pte) {
		pte_unmap_unlock(dst_pte, dst_ptl);
		/* ret == 0 */
		goto out;
	}
	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
	orig_src_pte = src_pte;
	orig_dst_pte = dst_pte;
	arch_enter_lazy_mmu_mode();

	do {
		/*
		 * We are holding two locks at this point - either of them
		 * could generate latencies in another task on another CPU.
		 */
		if (progress >= 32) {
			progress = 0;
			if (need_resched() ||
			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
				break;
		}
		ptent = ptep_get(src_pte);
		if (pte_none(ptent)) {
			progress++;
			continue;
		}

The above iteration can break when progress >= 32. For example, if all PTEs
are none at the beginning, we hit progress >= 32 and may break while sitting
at the 8th pte of a run of 16 PTEs which might become CONTPTE after we
release the PTL.

Since we are releasing PTLs, next time when we take the PTL again, those
pte_none() entries might have become pte_cont(). Are you then going to copy
CONTPTE starting from the 8th pte, immediately breaking the hardware's rule
that CONTPTEs must be consistent?

pte0  - pte_none
pte1  - pte_none
...
pte7  - pte_none
pte8  - pte_cont
...
pte15 - pte_cont

So we did some modification to avoid a break in the middle of PTEs which can
potentially become CONTPTE.
do {
	/*
	 * We are holding two locks at this point - either of them
	 * could generate latencies in another task on another CPU.
	 */
	if (progress >= 32) {
		progress = 0;
#ifdef CONFIG_CONT_PTE_HUGEPAGE
		/*
		 * XXX: don't release the ptl at an unaligned address, as a
		 * cont_pte might form while the ptl is released; this causes
		 * a double-map
		 */
		if (!vma_is_chp_anonymous(src_vma) ||
		    (vma_is_chp_anonymous(src_vma) &&
		     IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE)))
#endif
		if (need_resched() ||
		    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
			break;
	}

We could only reproduce the above issue by running thousands of phones.

Does your code survive this problem?

> Thanks,
> Ryan
>
> > but the current code is copying page0~page14, right? unless we immediately
> > split_folio to basepages in zap_pte_range(), we will have problems?
> >
> >> +
> >> +	for (i = 0; i < nr; i++, page++) {
> >> +		if (anon) {
> >> +			/*
> >> +			 * If this page may have been pinned by the
> >> +			 * parent process, copy the page immediately for
> >> +			 * the child so that we'll always guarantee the
> >> +			 * pinned page won't be randomly replaced in the
> >> +			 * future.
> >> +			 */
> >> +			if (unlikely(page_try_dup_anon_rmap(
> >> +					page, false, src_vma))) {
> >> +				if (i != 0)
> >> +					break;
> >> +				/* Page may be pinned, we have to copy. */
> >> +				return copy_present_page(
> >> +					dst_vma, src_vma, dst_pte,
> >> +					src_pte, addr, rss, prealloc,
> >> +					page);
> >> +			}
> >> +			rss[MM_ANONPAGES]++;
> >> +			VM_BUG_ON(PageAnonExclusive(page));
> >> +		} else {
> >> +			page_dup_file_rmap(page, false);
> >> +			rss[mm_counter_file(page)]++;
> >> +		}

Thanks
Barry