From: Anchal Agarwal <anchalag@amazon.com>
To: <boris.ostrovsky@oracle.com>
Cc: "tglx@linutronix.de" <tglx@linutronix.de>,
"mingo@redhat.com" <mingo@redhat.com>,
"bp@alien8.de" <bp@alien8.de>, "hpa@zytor.com" <hpa@zytor.com>,
"jgross@suse.com" <jgross@suse.com>,
"linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"sstabellini@kernel.org" <sstabellini@kernel.org>,
"konrad.wilk@oracle.com" <konrad.wilk@oracle.com>,
"roger.pau@citrix.com" <roger.pau@citrix.com>,
"axboe@kernel.dk" <axboe@kernel.dk>,
"davem@davemloft.net" <davem@davemloft.net>,
"rjw@rjwysocki.net" <rjw@rjwysocki.net>,
"len.brown@intel.com" <len.brown@intel.com>,
"pavel@ucw.cz" <pavel@ucw.cz>,
"peterz@infradead.org" <peterz@infradead.org>,
"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
"vkuznets@redhat.com" <vkuznets@redhat.com>,
"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
<Woodhouse@dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com>,
David <dwmw@amazon.co.uk>,
"benh@kernel.crashing.org" <benh@kernel.crashing.org>,
<anchalag@amazon.com>, <aams@amazon.com>
Subject: Re: [PATCH v3 01/11] xen/manage: keep track of the on-going suspend mode
Date: Fri, 21 May 2021 05:26:50 +0000
Message-ID: <20210521052650.GA19056@dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com>
In-Reply-To: <8cd59d9c-36b1-21cf-e59f-40c5c20c65f8@oracle.com>
On Thu, Oct 01, 2020 at 08:43:58AM -0400, boris.ostrovsky@oracle.com wrote:
> >>>>>>> Also, wrt KASLR stuff, that issue is still seen sometimes but I haven't had
> >>>>>>> bandwidth to dive deep into the issue and fix it.
> >>>> So what's the plan there? You first mentioned this issue early this year and judged by your response it is not clear whether you will ever spend time looking at it.
> >>>>
> >>> I do want to fix it and did do some debugging earlier this year just haven't
> >>> gotten back to it. Also, wanted to understand if the issue is a blocker to this
> >>> series?
> >>
> >> Integrating code with known bugs is less than ideal.
> >>
> > So for this series to be accepted, KASLR needs to be fixed along with other
> > comments of course?
>
>
> Yes, please.
>
>
>
> >>> I had some theories when debugging around this like if the random base address picked by kaslr for the
> >>> resuming kernel mismatches the suspended kernel and just jogging my memory, I didn't find that as the case.
> >>> Another hunch was if physical address of registered vcpu info at boot is different from what suspended kernel
> >>> has and that can cause CPU's to get stuck when coming online.
> >>
> >> I'd think if this were the case you'd have 100% failure rate. And we are also re-registering vcpu info on xen restore and I am not aware of any failures due to KASLR.
> >>
> > What I meant there wrt VCPU info was that VCPU info is not unregistered during hibernation,
> > so Xen still remembers the old physical addresses for the VCPU information, created by the
> > booting kernel. But since the hibernation kernel may have different physical
> > addresses for VCPU info and if mismatch happens, it may cause issues with resume.
> > During hibernation, the VCPU info register hypercall is not invoked again.
>
>
> I still don't think that's the cause but it's certainly worth having a look.
>
Hi Boris,
Apologies for only getting back to this now, after last year's discussion.
I dug deeper into the above statement, and that is indeed what is happening.
I did some debugging around KASLR and hibernation using reboot mode.
My debug prints show that whenever the vcpu_info* address assigned to a secondary vcpu
in xen_vcpu_setup() at boot differs from the one in the image, resume gets stuck for that vcpu
in bringup_cpu(). That means &per_cpu(xen_vcpu_info, cpu) has a different address at boot than after
control jumps into the image.
I could not get any prints once it got stuck in bringup_cpu(), and
I have no way to send a sysrq to the guest or to get a kdump.
This address change is not observed in every hibernate-resume cycle. I am not sure whether it is a bug or
expected behavior.
I am also considering that it may be a bug in the Xen code that is triggered only when
KASLR is enabled, but I do not have enough data to prove that.
Is it a coincidence that this always happens for the 1st vcpu?
Moreover, since the hypervisor is not aware that the guest is hibernated, and in reboot mode it looks like a regular shutdown to dom0,
would re-registering vcpu_info for the secondary vcpus even be plausible? I could definitely use some advice on debugging this further.
Some printks from my debugging:
At boot:
xen_vcpu_setup: xen_have_vcpu_info_placement=1 cpu=1, vcpup=0xffff9e548fa560e0, info.mfn=3996246 info.offset=224,
When the image loads, it ends up in this condition:
	xen_vcpu_setup()
	{
		...
		if (xen_hvm_domain()) {
			if (per_cpu(xen_vcpu, cpu) == &per_cpu(xen_vcpu_info, cpu))
				return 0;
		}
		...
	}
xen_vcpu_setup: checking mfn on resume cpu=1, info.mfn=3934806 info.offset=224, &per_cpu(xen_vcpu_info, cpu)=0xffff9d7240a560e0
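As a sanity check on the two prints above, here is a small userspace sketch (not part of the series) that turns each info.mfn/info.offset pair into a physical address and compares it with the printed pointer. It assumes 4 KiB pages, that mfn equals the guest frame number in an HVM guest, and that the printed pointers are direct-map addresses, so that pointer minus physical address gives the direct-map base. The helper names are made up for illustration.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                          /* assumption: 4 KiB pages */
#define PAGE_SIZE  (UINT64_C(1) << PAGE_SHIFT)

/* Hypothetical helper: physical address named by an (mfn, offset) pair. */
uint64_t vcpu_info_phys(uint64_t mfn, uint64_t offset)
{
	assert(offset < PAGE_SIZE);            /* offset must stay inside the frame */
	return (mfn << PAGE_SHIFT) + offset;
}

/* Hypothetical helper: base implied if vaddr is a direct-map address of phys. */
uint64_t implied_directmap_base(uint64_t vaddr, uint64_t phys)
{
	return vaddr - phys;
}

/* The low PAGE_SHIFT bits of the pointer should equal info.offset. */
int offset_matches(uint64_t vaddr, uint64_t offset)
{
	return (vaddr & (PAGE_SIZE - 1)) == offset;
}

/* Print the decoded values for one debug line. */
void decode(const char *when, uint64_t vaddr, uint64_t mfn, uint64_t offset)
{
	uint64_t phys = vcpu_info_phys(mfn, offset);

	printf("%s: phys=0x%" PRIx64 " base=0x%" PRIx64 " offset_ok=%d\n",
	       when, phys, implied_directmap_base(vaddr, phys),
	       offset_matches(vaddr, offset));
}

/*
 * Example calls, using the values from the prints above:
 *   decode("boot",   UINT64_C(0xffff9e548fa560e0), 3996246, 224);
 *   decode("resume", UINT64_C(0xffff9d7240a560e0), 3934806, 224);
 */
```

With those values the boot print decodes to physical address 0x3cfa560e0 and the resume print to 0x3c0a560e0, i.e. the two vcpu_info pages are 0xf000 frames (240 MiB) apart. The implied bases also differ (0xffff9e50c0000000 vs 0xffff9d6e80000000). If the assumptions hold, that would be consistent with the boot kernel and the image placing the per-cpu area at different physical addresses, not only different virtual ones.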
This was tested on a c4.2xlarge [8 vcpu, 15GB mem] instance with a 5.10 kernel running
in the guest.
Thanks,
Anchal.
>
> -boris
>
>