From: Yang Shi
Date: Mon, 22 May 2023 16:54:36 -0700
Subject: Re: [PATCH 27/31] mm/khugepaged: allow pte_offset_map[_lock]() to fail
To: Hugh Dickins
Cc: Andrew Morton, Mike Kravetz, Mike Rapoport, "Kirill A. Shutemov",
 Matthew Wilcox, David Hildenbrand, Suren Baghdasaryan, Qi Zheng,
 Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao,
 Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price,
 SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin,
 Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual, Pasha Tatashin,
 Miaohe Lin, Minchan Kim, Christoph Hellwig, Song Liu,
 Thomas Hellstrom, linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <68a97fbe-5c1e-7ac6-72c-7b9c6290b370@google.com>

On Sun, May 21, 2023 at 10:24 PM Hugh Dickins wrote:
>
> __collapse_huge_page_swapin(): don't drop the map after every pte, it
> only has to be dropped by do_swap_page(); give up if pte_offset_map()
> fails; trace_mm_collapse_huge_page_swapin() at the end, with result;
> fix comment on returned result; fix vmf.pgoff, though it's not used.
>
> collapse_huge_page(): use pte_offset_map_lock() on the _pmd returned
> from clearing; allow failure, but it should be impossible there.
> hpage_collapse_scan_pmd() and collapse_pte_mapped_thp() allow for
> pte_offset_map_lock() failure.
>
> Signed-off-by: Hugh Dickins

Reviewed-by: Yang Shi

A nit below:

> ---
>  mm/khugepaged.c | 72 +++++++++++++++++++++++++++++++++----------------
>  1 file changed, 49 insertions(+), 23 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 732f9ac393fc..49cfa7cdfe93 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -993,9 +993,8 @@ static int check_pmd_still_valid(struct mm_struct *mm,
>   * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
>   *
>   * Called and returns without pte mapped or spinlocks held.
> - * Note that if false is returned, mmap_lock will be released.
> + * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
>   */
> -
>  static int __collapse_huge_page_swapin(struct mm_struct *mm,
>                                         struct vm_area_struct *vma,
>                                         unsigned long haddr, pmd_t *pmd,
> @@ -1004,23 +1003,35 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>         int swapped_in = 0;
>         vm_fault_t ret = 0;
>         unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
> +       int result;
> +       pte_t *pte = NULL;
>
>         for (address = haddr; address < end; address += PAGE_SIZE) {
>                 struct vm_fault vmf = {
>                         .vma = vma,
>                         .address = address,
> -                       .pgoff = linear_page_index(vma, haddr),
> +                       .pgoff = linear_page_index(vma, address),
>                         .flags = FAULT_FLAG_ALLOW_RETRY,
>                         .pmd = pmd,
>                 };
>
> -               vmf.pte = pte_offset_map(pmd, address);
> -               vmf.orig_pte = *vmf.pte;
> -               if (!is_swap_pte(vmf.orig_pte)) {
> -                       pte_unmap(vmf.pte);
> -                       continue;
> +               if (!pte++) {
> +                       pte = pte_offset_map(pmd, address);
> +                       if (!pte) {
> +                               mmap_read_unlock(mm);
> +                               result = SCAN_PMD_NULL;
> +                               goto out;
> +                       }
>                 }
> +
> +               vmf.orig_pte = *pte;
> +               if (!is_swap_pte(vmf.orig_pte))
> +                       continue;
> +
> +               vmf.pte = pte;
>                 ret = do_swap_page(&vmf);
> +               /* Which unmaps pte (after perhaps re-checking the entry) */
> +               pte = NULL;
>
>                 /*
>                  * do_swap_page returns VM_FAULT_RETRY with released mmap_lock.
> @@ -1029,24 +1040,29 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>                  * resulting in later failure.
>                  */
>                 if (ret & VM_FAULT_RETRY) {
> -                       trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
>                         /* Likely, but not guaranteed, that page lock failed */
> -                       return SCAN_PAGE_LOCK;
> +                       result = SCAN_PAGE_LOCK;

With the per-VMA lock, this may not be true anymore, at least not until
the per-VMA lock supports swap faults. It may be better to use a more
general failure code, for example SCAN_FAIL. But you don't have to
change it in this patch; I can send a follow-up once the series lands
in mm-unstable (a rough, untested sketch of what I mean is at the end
of this mail).

> +                       goto out;
>                 }
>                 if (ret & VM_FAULT_ERROR) {
>                         mmap_read_unlock(mm);
> -                       trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> -                       return SCAN_FAIL;
> +                       result = SCAN_FAIL;
> +                       goto out;
>                 }
>                 swapped_in++;
>         }
>
> +       if (pte)
> +               pte_unmap(pte);
> +
>         /* Drain LRU add pagevec to remove extra pin on the swapped in pages */
>         if (swapped_in)
>                 lru_add_drain();
>
> -       trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 1);
> -       return SCAN_SUCCEED;
> +       result = SCAN_SUCCEED;
> +out:
> +       trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result);
> +       return result;
>  }
>
>  static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
> @@ -1146,9 +1162,6 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>                                 address + HPAGE_PMD_SIZE);
>         mmu_notifier_invalidate_range_start(&range);
>
> -       pte = pte_offset_map(pmd, address);
> -       pte_ptl = pte_lockptr(mm, pmd);
> -
>         pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>         /*
>          * This removes any huge TLB entry from the CPU so we won't allow
> @@ -1163,13 +1176,18 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>         mmu_notifier_invalidate_range_end(&range);
>         tlb_remove_table_sync_one();
>
> -       spin_lock(pte_ptl);
> -       result = __collapse_huge_page_isolate(vma, address, pte, cc,
> -                                             &compound_pagelist);
> -       spin_unlock(pte_ptl);
> +       pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> +       if (pte) {
> +               result = __collapse_huge_page_isolate(vma, address, pte, cc,
> +                                                     &compound_pagelist);
> +               spin_unlock(pte_ptl);
> +       } else {
> +               result = SCAN_PMD_NULL;
> +       }
>
>         if (unlikely(result != SCAN_SUCCEED)) {
> -               pte_unmap(pte);
> +               if (pte)
> +                       pte_unmap(pte);
>                 spin_lock(pmd_ptl);
>                 BUG_ON(!pmd_none(*pmd));
>                 /*
> @@ -1253,6 +1271,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>         memset(cc->node_load, 0, sizeof(cc->node_load));
>         nodes_clear(cc->alloc_nmask);
>         pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> +       if (!pte) {
> +               result = SCAN_PMD_NULL;
> +               goto out;
> +       }
> +
>         for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>              _pte++, _address += PAGE_SIZE) {
>                 pte_t pteval = *_pte;
> @@ -1622,8 +1645,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>          * lockless_pages_from_mm() and the hardware page walker can access page
>          * tables while all the high-level locks are held in write mode.
>          */
> -       start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
>         result = SCAN_FAIL;
> +       start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> +       if (!start_pte)
> +               goto drop_immap;
>
>         /* step 1: check all mapped PTEs are to the right huge page */
>         for (i = 0, addr = haddr, pte = start_pte;
> @@ -1697,6 +1722,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>
> abort:
>         pte_unmap_unlock(start_pte, ptl);
> +drop_immap:
>         i_mmap_unlock_write(vma->vm_file->f_mapping);
>         goto drop_hpage;
>  }
> --
> 2.35.3
>
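P.S. On the nit above: the follow-up I have in mind would just
generalize the VM_FAULT_RETRY branch, roughly like the sketch below.
This is untested and only illustrative; the exact comment wording, and
whether SCAN_PAGE_LOCK stays useful for the plain mmap_lock path, can
be sorted out once the per-VMA lock work settles.

--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ ... @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
                if (ret & VM_FAULT_RETRY) {
-                       /* Likely, but not guaranteed, that page lock failed */
-                       result = SCAN_PAGE_LOCK;
+                       /*
+                        * Not necessarily the page lock: with the per-VMA
+                        * lock, do_swap_page() also returns VM_FAULT_RETRY
+                        * when it cannot handle a swap fault under the VMA
+                        * lock, so report a generic failure instead.
+                        */
+                       result = SCAN_FAIL;
                        goto out;
                }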