From: Jiaqi Yan <jiaqiyan@google.com>
Date: Tue, 28 Mar 2023 17:12:55 -0700
Subject: Re: [PATCH v11 3/3] mm/khugepaged: recover from poisoned file-backed memory
To: Yang Shi
Cc: kirill.shutemov@linux.intel.com, kirill@shutemov.name, tongtiangen@huawei.com,
	tony.luck@intel.com, naoya.horiguchi@nec.com, linmiaohe@huawei.com,
	linux-mm@kvack.org, akpm@linux-foundation.org, osalvador@suse.de,
	wangkefeng.wang@huawei.com, stevensd@chromium.org, hughd@google.com
References: <20230327211548.462509-1-jiaqiyan@google.com> <20230327211548.462509-4-jiaqiyan@google.com>
On Tue, Mar 28, 2023 at 9:02 AM Yang Shi wrote:
>
> On Mon, Mar 27, 2023 at 2:16 PM Jiaqi Yan wrote:
> >
> > Make collapse_file roll back when copying pages failed.
> > More concretely:
> > - extract copying operations into a separate loop
> > - postpone the updates for nr_none until both scanning and copying
> >   succeeded
> > - postpone joining small xarray entries until both scanning and copying
> >   succeeded
> > - postpone the update operations to NR_XXX_THPS until both scanning and
> >   copying succeeded
> > - for non-SHMEM file, roll back filemap_nr_thps_inc if scan succeeded but
> >   copying failed
> >
> > Tested manually:
> > 0. Enable khugepaged on system under test. Mount tmpfs at /mnt/ramdisk.
> > 1. Start a two-thread application. Each thread allocates a chunk of
> >    non-huge memory buffer from /mnt/ramdisk.
> > 2. Pick 4 random buffer addresses (2 in each thread) and inject
> >    uncorrectable memory errors at the physical addresses.
> > 3. Signal both threads to make their memory buffers collapsible, i.e.
> >    calling madvise(MADV_HUGEPAGE).
> > 4. Wait and then check kernel log: khugepaged is able to recover from
> >    poisoned pages by skipping them.
> > 5. Signal both threads to inspect their buffer contents and make sure
> >    there is no data corruption.
> >
> > Signed-off-by: Jiaqi Yan
>
> Reviewed-by: Yang Shi
>
> A nit below:
>
> > ---
> >  mm/khugepaged.c | 86 +++++++++++++++++++++++++++++++------------------
> >  1 file changed, 54 insertions(+), 32 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index bef68286345c8..38c1655ce0a9e 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1874,6 +1874,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >  {
> >         struct address_space *mapping = file->f_mapping;
> >         struct page *hpage;
> > +       struct page *page;
> > +       struct page *tmp;
> > +       struct folio *folio;
> >         pgoff_t index = 0, end = start + HPAGE_PMD_NR;
> >         LIST_HEAD(pagelist);
> >         XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
> > @@ -1918,8 +1921,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >
> >         xas_set(&xas, start);
> >         for (index = start; index < end; index++) {
> > -               struct page *page = xas_next(&xas);
> > -               struct folio *folio;
> > +               page = xas_next(&xas);
> >
> >                 VM_BUG_ON(index != xas.xa_index);
> >                 if (is_shmem) {
> > @@ -2099,12 +2101,8 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >                 put_page(page);
> >                 goto xa_unlocked;
> >         }
> > -       nr = thp_nr_pages(hpage);
> >
> > -       if (is_shmem)
> > -               __mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
> > -       else {
> > -               __mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
> > +       if (!is_shmem) {
> >                 filemap_nr_thps_inc(mapping);
> >                 /*
> >                  * Paired with smp_mb() in do_dentry_open() to ensure
> > @@ -2115,21 +2113,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >                 smp_mb();
> >                 if (inode_is_open_for_write(mapping->host)) {
> >                         result = SCAN_FAIL;
> > -                       __mod_lruvec_page_state(hpage, NR_FILE_THPS, -nr);
> >                         filemap_nr_thps_dec(mapping);
> > -                       goto xa_locked;
> >                 }
> >         }
> > -
> > -       if (nr_none) {
> > -               __mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
> > -               /* nr_none is always 0 for non-shmem. */
> > -               __mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
> > -       }
> > -
> > -       /* Join all the small entries into a single multi-index entry */
> > -       xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> > -       xas_store(&xas, hpage);
> >  xa_locked:
> >         xas_unlock_irq(&xas);
> >  xa_unlocked:
> > @@ -2142,21 +2128,36 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >         try_to_unmap_flush();
> >
> >         if (result == SCAN_SUCCEED) {
> > -               struct page *page, *tmp;
> > -               struct folio *folio;
> > -
> >                 /*
> >                  * Replacing old pages with new one has succeeded, now we
> > -                * need to copy the content and free the old pages.
> > +                * attempt to copy the contents.
> >                  */
> >                 index = start;
> > -               list_for_each_entry_safe(page, tmp, &pagelist, lru) {
> > +               list_for_each_entry(page, &pagelist, lru) {
> >                         while (index < page->index) {
> >                                 clear_highpage(hpage + (index % HPAGE_PMD_NR));
> >                                 index++;
> >                         }
> > -                       copy_highpage(hpage + (page->index % HPAGE_PMD_NR),
> > -                                     page);
> > +                       if (copy_mc_highpage(hpage + (page->index % HPAGE_PMD_NR),
> > +                                            page) > 0) {
> > +                               result = SCAN_COPY_MC;
> > +                               break;
> > +                       }
> > +                       index++;
> > +               }
> > +               while (result == SCAN_SUCCEED && index < end) {
> > +                       clear_highpage(hpage + (index % HPAGE_PMD_NR));
> > +                       index++;
> > +               }
> > +       }
> > +
> > +       nr = thp_nr_pages(hpage);
> > +       if (result == SCAN_SUCCEED) {
> > +               /*
> > +                * Copying old pages to huge one has succeeded, now we
> > +                * need to free the old pages.
> > +                */
> > +               list_for_each_entry_safe(page, tmp, &pagelist, lru) {
> >                         list_del(&page->lru);
> >                         page->mapping = NULL;
> >                         page_ref_unfreeze(page, 1);
> > @@ -2164,12 +2165,23 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >                         ClearPageUnevictable(page);
> >                         unlock_page(page);
> >                         put_page(page);
> > -                       index++;
> >                 }
> > -               while (index < end) {
> > -                       clear_highpage(hpage + (index % HPAGE_PMD_NR));
> > -                       index++;
> > +
> > +               xas_lock_irq(&xas);
> > +               if (is_shmem)
> > +                       __mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
> > +               else
> > +                       __mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
> > +
> > +               if (nr_none) {
> > +                       __mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
> > +                       /* nr_none is always 0 for non-shmem. */
> > +                       __mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
> >                 }
> > +               /* Join all the small entries into a single multi-index entry. */
> > +               xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> > +               xas_store(&xas, hpage);
> > +               xas_unlock_irq(&xas);
> >
> >                 folio = page_folio(hpage);
> >                 folio_mark_uptodate(folio);
> > @@ -2187,8 +2199,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >                 unlock_page(hpage);
> >                 hpage = NULL;
> >         } else {
> > -               struct page *page;
> > -
> >                 /* Something went wrong: roll back page cache changes */
> >                 xas_lock_irq(&xas);
> >                 if (nr_none) {
> > @@ -2222,6 +2232,18 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >                         xas_lock_irq(&xas);
> >                 }
> >                 VM_BUG_ON(nr_none);
> > +               /*
> > +                * Undo the updates of filemap_nr_thps_inc for non-SHMEM
> > +                * file only. This undo is not needed unless failure is
> > +                * due to SCAN_COPY_MC.
> > +                *
> > +                * Paired with smp_mb() in do_dentry_open() to ensure the
> > +                * update to nr_thps is visible.
> > +                */
> > +               smp_mb();
> > +               if (!is_shmem && result == SCAN_COPY_MC)
> > +                       filemap_nr_thps_dec(mapping);
>
> I think the memory barrier should be after the dec.
Ah, I will move it into the if block and put it after filemap_nr_thps_dec.

>
> > +
> >                 xas_unlock_irq(&xas);
> >
> >                 hpage->mapping = NULL;
> > --
> > 2.40.0.348.gf938b09366-goog
> >