linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Oscar Salvador <osalvador@suse.de>
To: "Luck, Tony" <tony.luck@intel.com>
Cc: David Hildenbrand <david@redhat.com>,
	Borislav Petkov <bp@alien8.de>,
	Yazen Ghannam <yazen.ghannam@amd.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: Machine check recovery broken in v6.9-rc1
Date: Sun, 7 Apr 2024 06:51:51 +0200	[thread overview]
Message-ID: <ZhImZwKhp-CZ0MFN@localhost.localdomain> (raw)
In-Reply-To: <SJ1PR11MB608323D7E6113B78A35F4999FC012@SJ1PR11MB6083.namprd11.prod.outlook.com>

On Sun, Apr 07, 2024 at 12:08:30AM +0000, Luck, Tony wrote:
> Oscar.
> 
> Both the 6.1 and 6.9-rc2 patches make the BUG (and subsequent issues) go away.

Thanks for the switf test Tony!

> Here's what's happening.
> 
> When the machine check occurs there's a scramble from various subsystems
> to report the memory error.
> 
> ghes_do_memory_failure() calls memory_failure_queue() which later
> calls memory_failure() from a kernel thread. Side note: this happens TWICE
> for each error. Not sure yet if this is a BIOS issue logging more than once.
> or some Linux issues in acpi/apei/ghes.c code.
> 
> uc_decode_notifier() [called from a different kernel thread] also calls
> do_memory_failure()
> 
> Finally kill_me_maybe() [called from task_work on return to the application
> when returning from the machine check handler] also calls memory_failure()
> 
> do_memory_failure() is somewhat prepared for multiple reports of the same
> error. It uses an atomic test and set operation to mark the page as poisoned.
> 
> First called to report the error does all the real work. Late arrivals take a
> shorter path, but may still take some action(s) depending on the "flags"
> passed in:
> 
>         if (TestSetPageHWPoison(p)) {
>                 pr_err("%#lx: already hardware poisoned\n", pfn);
>                 res = -EHWPOISON;
>                 if (flags & MF_ACTION_REQUIRED)
>                         res = kill_accessing_process(current, pfn, flags);
>                 if (flags & MF_COUNT_INCREASED)
>                         put_page(p);
>                 goto unlock_mutex;
>         }

Thanks for the detailed explanation.

> In this case the last to arrive has MF_ACTION_REQUIRED set, so calls
> kill_accessing_process() ... which is in the stack trace that led to the:
> 
>    kernel BUG at include/linux/swapops.h:88!
> 
> I'm not sure that I fully understand your patch. I guess that it is making sure to
> handle the case that the page has already been marked as poisoned?

Basically what is happening is:

1) We mark the page as HWPoison
2) We see that the page is mapped by someone
3) We try to unmap it, and in the process we create a hwpoison swap entry.
   See the following chunk from try_to_unmap_one():

   "
    if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
            pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
            if (folio_test_hugetlb(folio)) {
                    hugetlb_count_sub(folio_nr_pages(folio), mm);
                    set_huge_pte_at(mm, address, pvmw.pte, pteval,
                                    hsz);
            } else {
                    dec_mm_counter(mm, mm_counter(folio));
                    set_pte_at(mm, address, pvmw.pte, pteval);
            }
	    ...
    }
   "
4) Now there is a second memory event (maybe the previous one has
   already finished, I do not think it matters for the sake of this
   problem)
5) The second event sees that the page has already been marked as
   HWPoison but since it has MF_ACTION_REQUIRED specified, it
   goes to kill_accessing_process() to do what its name says.
6) We walk the page tables of the accessing process to see if it has
   the poisoned pfn.
7) check_hwpoisoned_entry()
   (which is called from
   walk_page_range()->walk_{pgd,p4d,pud,pmd}_range()->ops->pmd_entry())
   checks whether any of the ptes is poisoned.
8) Since the previous MCE event unmapped the page, pte_present() == 0,
   so we want to get the swap entry, and this is where it falls off the
   cliff.
   See check_hwpoisoned_entry()

   static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
                                     unsigned long poisoned_pfn, struct to_kill *tk)
   {
          unsigned long pfn = 0;

          if (pte_present(pte)) {
                  pfn = pte_pfn(pte);
          } else {
                  swp_entry_t swp = pte_to_swp_entry(pte);

                  if (is_hwpoison_entry(swp))
			pfn = swp_offset_pfn(swp);
         }
	 ...
   }

is_hwpoison_entry() returns true (remember the make_hwpoison_entry()
call we did?)
But when we try to get the pfn from the swap entry, we stumble upon the
VM_BUG_ON(), because is_pfn_swap_entry() only checks for:
 is_migration_entry()
 is_device_private_entry()
 is_device_exclusive_entry()

but it should also check for is_hwpoison_entry().
Since it does not, is_pfn_swap_entry() returns false in our case,
leading to the VM_BUG_ON.

Note that this should only matter in environments where CONFIG_DEBUG_VM
is set.

I hope I shed some light in here.

> Anyway ... thanks for the quick fix. I hope the above helps write a good
> commit message to get this applied and backported to stable.
> 
> Tested-by: Tony Luck <tony.luck@intel.com>

Thanks again Tony, much appreciated.

I will write the patch and most likely send it out either today in the
afternoon or tomorrow early in the
morning.


-- 
Oscar Salvador
SUSE Labs


      parent reply	other threads:[~2024-04-07  4:52 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-04 22:05 Tony Luck
2024-04-04 23:39 ` Tony Luck
2024-04-05  7:19 ` David Hildenbrand
2024-04-05 15:05   ` Luck, Tony
2024-04-05 23:58     ` Tony Luck
2024-04-06  2:18       ` Oscar Salvador
2024-04-06  3:54         ` Oscar Salvador
2024-04-06  4:13           ` Oscar Salvador
2024-04-07  0:08             ` Luck, Tony
2024-04-07  3:59               ` Miaohe Lin
2024-04-07  4:51               ` Oscar Salvador [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZhImZwKhp-CZ0MFN@localhost.localdomain \
    --to=osalvador@suse.de \
    --cc=bp@alien8.de \
    --cc=david@redhat.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=naoya.horiguchi@nec.com \
    --cc=tony.luck@intel.com \
    --cc=yazen.ghannam@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox