From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <0389eac1-af68-56b5-696d-581bb56878b9@redhat.com>
Date: Wed, 11 May 2022 17:11:17 +0200
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.8.0
To: Miaohe Lin, Oscar Salvador
Cc: =?UTF-8?B?SE9SSUdVQ0hJIE5BT1lBKOWggOWPoyDnm7TkuZ8p?=, Naoya Horiguchi, "linux-mm@kvack.org", Andrew Morton, Mike Kravetz, Yang Shi, Muchun Song, "linux-kernel@vger.kernel.org"
References: <20220427042841.678351-1-naoya.horiguchi@linux.dev>
 <54399815-10fe-9d43-7ada-7ddb55e798cb@redhat.com>
 <20220427122049.GA3918978@hori.linux.bs1.fc.nec.co.jp>
 <20220509072902.GB123646@hori.linux.bs1.fc.nec.co.jp>
 <6a5d31a3-c27f-f6d9-78bb-d6bf69547887@huawei.com>
 <465902dc-d3bf-7a93-da04-839faddcd699@huawei.com>
From: David Hildenbrand
Organization: Red Hat
Subject: Re: [RFC PATCH v1 0/4] mm, hwpoison: improve handling workload related to hugetlb and memory_hotplug
In-Reply-To: <465902dc-d3bf-7a93-da04-839faddcd699@huawei.com>
Content-Language: en-US
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk

On 09.05.22
12:53, Miaohe Lin wrote:
> On 2022/5/9 17:58, Oscar Salvador wrote:
>> On Mon, May 09, 2022 at 05:04:54PM +0800, Miaohe Lin wrote:
>>>>> So that leaves us with either
>>>>>
>>>>> 1) Fail offlining -> no need to care about reonlining
>>>
>>> Maybe fail offlining will be a better alternative as we can get rid of many races
>>> between memory failure and memory offline? But no strong opinion. :)
>>
>> If taking care of those races is not a herculean effort, I'd go with
>> allowing offlining + disallow re-onlining.
>> Mainly because memory RAS stuff.
>
> This does make sense to me. Thanks. We can try to solve those races if
> offlining + disallow re-onlining is applied. :)
>
>>
>> Now, to the re-onlining thing, we'll have to come up with a way to check
>> whether a section contains hwpoisoned pages, so we do not have to go
>> and check every single page, as that will be really suboptimal.
>
> Yes, we need a stable and cheap way to do that.

My simplistic approach would be a flag/indicator in the memory block
device that records whether any page in the memory block was hwpoisoned.
That's easy to check during memory onlining, and we can fail onlining if
it is set.

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 084d67fd55cc..3d0ef812e901 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -183,6 +183,9 @@ static int memory_block_online(struct memory_block *mem)
 	struct zone *zone;
 	int ret;
 
+	if (mem->hwpoisoned)
+		return -EHWPOISON;
+
 	zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group,
 				  start_pfn, nr_pages);

Once the problematic DIMM actually gets unplugged, the memory block
devices get removed as well. So when hotplugging a new DIMM in the same
location, we could online that memory again.

Another place to store that would be the memory section; we'd then have
to check all underlying sections here. We're a bit short on flags in the
memory section, I think, but sections are eventually easier to look up
from other code than memory block devices.

-- 
Thanks,

David / dhildenb
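
To make the two storage options above easier to compare, here is a small
userspace model. It is only a sketch, not kernel code: struct model_block,
SECTIONS_PER_BLOCK and all model_* functions are hypothetical names made up
for illustration, and the real struct memory_block and memory sections look
different. It assumes the flag is set once when a page is poisoned and is
never cleared for the lifetime of the block.

```c
/* Userspace model only; all names here are hypothetical, not kernel API. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef EHWPOISON
#define EHWPOISON 133 /* Linux errno value; fallback for other libcs */
#endif

#define SECTIONS_PER_BLOCK 4 /* assumption; the real ratio is arch-dependent */

struct model_block {
	bool online;
	bool hwpoisoned;                             /* option A: one flag per block */
	bool section_hwpoisoned[SECTIONS_PER_BLOCK]; /* option B: one flag per section */
};

/* Record that some page in the given section of this block was poisoned. */
void model_mark_hwpoison(struct model_block *mem, size_t section)
{
	assert(mem != NULL && section < SECTIONS_PER_BLOCK);
	mem->hwpoisoned = true;
	mem->section_hwpoisoned[section] = true;
}

/* Option A: a single check of the block-level flag before onlining. */
int model_online_block_flag(struct model_block *mem)
{
	assert(mem != NULL);
	if (mem->hwpoisoned)
		return -EHWPOISON;
	mem->online = true;
	return 0;
}

/* Option B: walk all underlying sections and refuse if any is flagged. */
int model_online_section_flags(struct model_block *mem)
{
	assert(mem != NULL);
	for (size_t i = 0; i < SECTIONS_PER_BLOCK; i++) {
		if (mem->section_hwpoisoned[i])
			return -EHWPOISON;
	}
	mem->online = true;
	return 0;
}
```

In this model, a replacement DIMM corresponds to a freshly zero-initialized
struct model_block, so both variants allow onlining it again, while a block
that was marked via model_mark_hwpoison() keeps failing with -EHWPOISON.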