From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f71.google.com (mail-oi0-f71.google.com [209.85.218.71]) by kanga.kvack.org (Postfix) with ESMTP id 93F666B0003 for ; Mon, 16 Jul 2018 20:28:34 -0400 (EDT) Received: by mail-oi0-f71.google.com with SMTP id b8-v6so44809675oib.4 for ; Mon, 16 Jul 2018 17:28:34 -0700 (PDT) Received: from tyo162.gate.nec.co.jp (tyo162.gate.nec.co.jp. [114.179.232.162]) by mx.google.com with ESMTPS id x7-v6si19383732oix.414.2018.07.16.17.28.32 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 16 Jul 2018 17:28:33 -0700 (PDT) From: Naoya Horiguchi Subject: Re: [PATCH v1 1/2] mm: fix race on soft-offlining free huge pages Date: Tue, 17 Jul 2018 00:25:19 +0000 Message-ID: <20180717002518.GA4673@hori1.linux.bs1.fc.nec.co.jp> References: <1531452366-11661-1-git-send-email-n-horiguchi@ah.jp.nec.com> <1531452366-11661-2-git-send-email-n-horiguchi@ah.jp.nec.com> <20180713133545.658173ca953e7d2a8a4ee6bd@linux-foundation.org> In-Reply-To: <20180713133545.658173ca953e7d2a8a4ee6bd@linux-foundation.org> Content-Language: ja-JP Content-Type: text/plain; charset="iso-2022-jp" Content-ID: Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: "linux-mm@kvack.org" , Michal Hocko , "xishi.qiuxishi@alibaba-inc.com" , "zy.zhengyi@alibaba-inc.com" , "linux-kernel@vger.kernel.org" On Fri, Jul 13, 2018 at 01:35:45PM -0700, Andrew Morton wrote: > On Fri, 13 Jul 2018 12:26:05 +0900 Naoya Horiguchi wrote: >=20 > > There's a race condition between soft offline and hugetlb_fault which > > causes unexpected process killing and/or hugetlb allocation failure. > >=20 > > The process killing is caused by the following flow: > >=20 > > CPU 0 CPU 1 CPU 2 > >=20 > > soft offline > > get_any_page > > // find the hugetlb is free > > mmap a hugetlb file > > page fault > > ... > > hugetlb_fault > > hugetlb_no_page > > alloc_huge_page > > // succeed > > soft_offline_free_page > > // set hwpoison flag > > mmap the hugetlb file > > page fault > > ... > > hugetlb_fault > > hugetlb_no_page > > find_lock_page > > return VM_FAULT_HWPO= ISON > > mm_fault_error > > do_sigbus > > // kill the process > >=20 > >=20 > > The hugetlb allocation failure comes from the following flow: > >=20 > > CPU 0 CPU 1 > >=20 > > mmap a hugetlb file > > // reserve all free page but don't fau= lt-in > > soft offline > > get_any_page > > // find the hugetlb is free > > soft_offline_free_page > > // set hwpoison flag > > dissolve_free_huge_page > > // fail because all free hugepages are reserved > > page fault > > ... > > hugetlb_fault > > hugetlb_no_page > > alloc_huge_page > > ... > > dequeue_huge_page_node_exa= ct > > // ignore hwpoisoned hugep= age > > // and finally fail due to= no-mem > >=20 > > The root cause of this is that current soft-offline code is written > > based on an assumption that PageHWPoison flag should beset at first to > > avoid accessing the corrupted data. This makes sense for memory_failur= e() > > or hard offline, but does not for soft offline because soft offline is > > about corrected (not uncorrected) error and is safe from data lost. > > This patch changes soft offline semantics where it sets PageHWPoison fl= ag > > only after containment of the error page completes successfully. > >=20 > > ... > > > > --- v4.18-rc4-mmotm-2018-07-10-16-50/mm/memory-failure.c > > +++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/memory-failure.c > > @@ -1598,8 +1598,18 @@ static int soft_offline_huge_page(struct page *p= age, int flags) > > if (ret > 0) > > ret =3D -EIO; > > } else { > > - if (PageHuge(page)) > > - dissolve_free_huge_page(page); > > + /* > > + * We set PG_hwpoison only when the migration source hugepage > > + * was successfully dissolved, because otherwise hwpoisoned > > + * hugepage remains on free hugepage list, then userspace will > > + * find it as SIGBUS by allocation failure. That's not expected > > + * in soft-offlining. > > + */ >=20 > This comment is unclear. What happens if there's a hwpoisoned page on > the freelist? The allocator just skips it and looks for another page?=20 Yes, this is what the allocator does. > Or does the allocator return the poisoned page, it gets mapped and > userspace gets a SIGBUS when accessing it? If the latter (or the > former!), why does the comment mention allocation failure? The mention to allocation failure was unclear, I'd like to replace with below, is it clearer? + /* + * We set PG_hwpoison only when the migration source hugepage + * was successfully dissolved, because otherwise hwpoisoned + * hugepage remains on free hugepage list. The allocator ignores + * such a hwpoisoned page so it's never allocated, but it could + * kill a process because of no-memory rather than hwpoison. + * Soft-offline never impacts the userspace, so this is undesired. + */ Thanks, Naoya Horiguchi=