From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=KGKx=AK=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-1.2 required=3.0 tests=DKIM_ADSP_ALL,DKIM_INVALID,
	DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,
	SPF_PASS,USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id E7A51C433E1
	for <linux-mm@archiver.kernel.org>; Mon, 29 Jun 2020 19:21:15 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 9BDB520702
	for <linux-mm@archiver.kernel.org>; Mon, 29 Jun 2020 19:21:15 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=fail reason="signature verification failed" (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="TQrtjf2t"
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 9BDB520702
Authentication-Results: mail.kernel.org; dmarc=fail (p=quarantine dis=none) header.from=amazon.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id 2B2E08D0005; Mon, 29 Jun 2020 15:21:15 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 2629B6B0068; Mon, 29 Jun 2020 15:21:15 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 152618D0005; Mon, 29 Jun 2020 15:21:15 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0190.hostedemail.com [216.40.44.190])
	by kanga.kvack.org (Postfix) with ESMTP id F3A5E6B0037
	for <linux-mm@kvack.org>; Mon, 29 Jun 2020 15:21:14 -0400 (EDT)
Received: from smtpin30.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay03.hostedemail.com (Postfix) with ESMTP id 83CE28248047
	for <linux-mm@kvack.org>; Mon, 29 Jun 2020 19:21:14 +0000 (UTC)
X-FDA: 76983217668.30.fall92_5e1805c26e71
Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251])
	by smtpin30.hostedemail.com (Postfix) with ESMTP id 48EF1180B3C8E
	for <linux-mm@kvack.org>; Mon, 29 Jun 2020 19:21:14 +0000 (UTC)
X-HE-Tag: fall92_5e1805c26e71
X-Filterd-Recvd-Size: 16498
Received: from smtp-fw-33001.amazon.com (smtp-fw-33001.amazon.com [207.171.190.10])
	by imf33.hostedemail.com (Postfix) with ESMTP
	for <linux-mm@kvack.org>; Mon, 29 Jun 2020 19:21:13 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
  d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209;
  t=1593458474; x=1624994474;
  h=date:from:to:cc:message-id:references:mime-version:
   content-transfer-encoding:in-reply-to:subject;
  bh=kLlA5RN03Xc0rLgTC7sQtfZJgq55+f+4l+tD0cUCuiI=;
  b=TQrtjf2twU9L2fFI780jIm7e8B9dsvfahcSTtyBGmo8s5kfNbNlQa8py
   SS6EuXoFnbUEaltepowZ89rKkG4DrBrAoXqcz7q+Lm0EQ+qUcUe38tANL
   6mkKl6tGyhMPcRbKPNzrTUPwu1dN1hv/Tbg7Qotxpj2E1ph7UViC7vcTE
   w=;
IronPort-SDR: TFk9Ap13BSopn0WzM+ofsaWep3goJJ1oYevrRH/gIxCL0UUWZBH5DxVgfAHkikcgrsKcEgr33D
 uj0sRPYPNVsQ==
X-IronPort-AV: E=Sophos;i="5.75,295,1589241600"; 
   d="scan'208";a="54787643"
Subject: Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation]
Received: from sea32-co-svc-lb4-vlan3.sea.corp.amazon.com (HELO email-inbound-relay-2a-6e2fc477.us-west-2.amazon.com) ([10.47.23.38])
  by smtp-border-fw-out-33001.sea14.amazon.com with ESMTP; 29 Jun 2020 19:20:55 +0000
Received: from EX13MTAUEE002.ant.amazon.com (pdx4-ws-svc-p6-lb7-vlan2.pdx.amazon.com [10.170.41.162])
	by email-inbound-relay-2a-6e2fc477.us-west-2.amazon.com (Postfix) with ESMTPS id 942DDA2269;
	Mon, 29 Jun 2020 19:20:53 +0000 (UTC)
Received: from EX13D08UEE004.ant.amazon.com (10.43.62.182) by
 EX13MTAUEE002.ant.amazon.com (10.43.62.24) with Microsoft SMTP Server (TLS)
 id 15.0.1497.2; Mon, 29 Jun 2020 19:20:35 +0000
Received: from EX13MTAUEE002.ant.amazon.com (10.43.62.24) by
 EX13D08UEE004.ant.amazon.com (10.43.62.182) with Microsoft SMTP Server (TLS)
 id 15.0.1497.2; Mon, 29 Jun 2020 19:20:35 +0000
Received: from dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com
 (172.22.96.68) by mail-relay.amazon.com (10.43.62.224) with Microsoft SMTP
 Server id 15.0.1497.2 via Frontend Transport; Mon, 29 Jun 2020 19:20:35 +0000
Received: by dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com (Postfix, from userid 4335130)
	id 5419940348; Mon, 29 Jun 2020 19:20:35 +0000 (UTC)
Date: Mon, 29 Jun 2020 19:20:35 +0000
From: Anchal Agarwal <anchalag@amazon.com>
To: Roger Pau =?iso-8859-1?Q?Monn=E9?= <roger.pau@citrix.com>
CC: Boris Ostrovsky <boris.ostrovsky@oracle.com>, "tglx@linutronix.de"
	<tglx@linutronix.de>, "mingo@redhat.com" <mingo@redhat.com>, "bp@alien8.de"
	<bp@alien8.de>, "hpa@zytor.com" <hpa@zytor.com>, "x86@kernel.org"
	<x86@kernel.org>, "jgross@suse.com" <jgross@suse.com>,
	"linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org>, "linux-mm@kvack.org"
	<linux-mm@kvack.org>, "Kamata, Munehisa" <kamatam@amazon.com>,
	"sstabellini@kernel.org" <sstabellini@kernel.org>, "konrad.wilk@oracle.com"
	<konrad.wilk@oracle.com>, "axboe@kernel.dk" <axboe@kernel.dk>,
	"davem@davemloft.net" <davem@davemloft.net>, "rjw@rjwysocki.net"
	<rjw@rjwysocki.net>, "len.brown@intel.com" <len.brown@intel.com>,
	"pavel@ucw.cz" <pavel@ucw.cz>, "peterz@infradead.org" <peterz@infradead.org>,
	"Valentin, Eduardo" <eduval@amazon.com>, "Singh, Balbir" <sblbir@amazon.com>,
	"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
	"vkuznets@redhat.com" <vkuznets@redhat.com>, "netdev@vger.kernel.org"
	<netdev@vger.kernel.org>, "linux-kernel@vger.kernel.org"
	<linux-kernel@vger.kernel.org>, "Woodhouse, David" <dwmw@amazon.co.uk>,
	"benh@kernel.crashing.org" <benh@kernel.crashing.org>
Message-ID: <20200629192035.GA13195@dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com>
References: <7FD7505E-79AA-43F6-8D5F-7A2567F333AB@amazon.com>
 <20200604070548.GH1195@Air-de-Roger>
 <20200616214925.GA21684@dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com>
 <20200617083528.GW735@Air-de-Roger>
 <20200619234312.GA24846@dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com>
 <20200622083846.GF735@Air-de-Roger>
 <20200623004314.GA28586@dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com>
 <20200623081903.GP735@Air-de-Roger>
 <20200625183659.GA26586@dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com>
 <20200626091239.GA735@Air-de-Roger>
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Disposition: inline
In-Reply-To: <20200626091239.GA735@Air-de-Roger>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-Rspamd-Queue-Id: 48EF1180B3C8E
X-Spamd-Result: default: False [0.00 / 100.00]
X-Rspamd-Server: rspam05
Content-Transfer-Encoding: quoted-printable
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Fri, Jun 26, 2020 at 11:12:39AM +0200, Roger Pau Monn=E9 wrote:
> CAUTION: This email originated from outside of the organization. Do not=
 click links or open attachments unless you can confirm the sender and kn=
ow the content is safe.
>=20
>=20
>=20
> On Thu, Jun 25, 2020 at 06:36:59PM +0000, Anchal Agarwal wrote:
> > On Tue, Jun 23, 2020 at 10:19:03AM +0200, Roger Pau Monn=E9 wrote:
> > >
> > >
> > >
> > > On Tue, Jun 23, 2020 at 12:43:14AM +0000, Anchal Agarwal wrote:
> > > > On Mon, Jun 22, 2020 at 10:38:46AM +0200, Roger Pau Monn=E9 wrote=
:
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Jun 19, 2020 at 11:43:12PM +0000, Anchal Agarwal wrote:
> > > > > > On Wed, Jun 17, 2020 at 10:35:28AM +0200, Roger Pau Monn=E9 w=
rote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Jun 16, 2020 at 09:49:25PM +0000, Anchal Agarwal wr=
ote:
> > > > > > > > On Thu, Jun 04, 2020 at 09:05:48AM +0200, Roger Pau Monn=E9=
 wrote:
> > > > > > > > > On Wed, Jun 03, 2020 at 11:33:52PM +0000, Agarwal, Anch=
al wrote:
> > > > > > > > > >     > +             xenbus_dev_error(dev, err, "Freez=
ing timed out;"
> > > > > > > > > >     > +                              "the device may =
become inconsistent state");
> > > > > > > > > >
> > > > > > > > > >     Leaving the device in this state is quite bad, as=
 it's in a closed
> > > > > > > > > >     state and with the queues frozen. You should make=
 an attempt to
> > > > > > > > > >     restore things to a working state.
> > > > > > > > > >
> > > > > > > > > > You mean if backend closed after timeout? Is there a =
way to know that? I understand it's not good to
> > > > > > > > > > leave it in this state however, I am still trying to =
find if there is a good way to know if backend is still connected after t=
imeout.
> > > > > > > > > > Hence the message " the device may become inconsisten=
t state".  I didn't see a timeout not even once on my end so that's why
> > > > > > > > > > I may be looking for an alternate perspective here. m=
ay be need to thaw everything back intentionally is one thing I could thi=
nk of.
> > > > > > > > >
> > > > > > > > > You can manually force this state, and then check that =
it will behave
> > > > > > > > > correctly. I would expect that on a failure to disconne=
ct from the
> > > > > > > > > backend you should switch the frontend to the 'Init' st=
ate in order to
> > > > > > > > > try to reconnect to the backend when possible.
> > > > > > > > >
> > > > > > > > From what I understand forcing manually is, failing the f=
reeze without
> > > > > > > > disconnect and try to revive the connection by unfreezing=
 the
> > > > > > > > queues->reconnecting to backend [which never got diconnec=
ted]. May be even
> > > > > > > > tearing down things manually because I am not sure what s=
tate will frontend
> > > > > > > > see if backend fails to to disconnect at any point in tim=
e. I assumed connected.
> > > > > > > > Then again if its "CONNECTED" I may not need to tear down=
 everything and start
> > > > > > > > from Initialising state because that may not work.
> > > > > > > >
> > > > > > > > So I am not so sure about backend's state so much, lets s=
ay if  xen_blkif_disconnect fail,
> > > > > > > > I don't see it getting handled in the backend then what w=
ill be backend's state?
> > > > > > > > Will it still switch xenbus state to 'Closed'? If not wha=
t will frontend see,
> > > > > > > > if it tries to read backend's state through xenbus_read_d=
river_state ?
> > > > > > > >
> > > > > > > > So the flow be like:
> > > > > > > > Front end marks XenbusStateClosing
> > > > > > > > Backend marks its state as XenbusStateClosing
> > > > > > > >     Frontend marks XenbusStateClosed
> > > > > > > >     Backend disconnects calls xen_blkif_disconnect
> > > > > > > >        Backend fails to disconnect, the above function re=
turns EBUSY
> > > > > > > >        What will be state of backend here?
> > > > > > >
> > > > > > > Backend should stay in state 'Closing' then, until it can f=
inish
> > > > > > > tearing down.
> > > > > > >
> > > > > > It disconnects the ring after switching to connected state to=
o.
> > > > > > > >        Frontend did not tear down the rings if backend do=
es not switches the
> > > > > > > >        state to 'Closed' in case of failure.
> > > > > > > >
> > > > > > > > If backend stays in CONNECTED state, then even if we mark=
 it Initialised in frontend, backend
> > > > > > >
> > > > > > > Backend will stay in state 'Closing' I think.
> > > > > > >
> > > > > > > > won't be calling connect(). {From reading code in fronten=
d_changed}
> > > > > > > > IMU, Initialising will fail since backend dev->state !=3D=
 XenbusStateClosed plus
> > > > > > > > we did not tear down anything so calling talk_to_blkback =
may not be needed
> > > > > > > >
> > > > > > > > Does that sound correct?
> > > > > > >
> > > > > > > I think switching to the initial state in order to try to a=
ttempt a
> > > > > > > reconnection would be our best bet here.
> > > > > > >
> > > > > > It does not seems to work correctly, I get hung tasks all ove=
r and all the
> > > > > > requests to filesystem gets stuck. Backend does shows the sta=
te as connected
> > > > > > after xenbus_dev_suspend fails but I think there may be somet=
hing missing.
> > > > > > I don't seem to get IO interrupts thereafter i.e hitting the =
function blkif_interrupts.
> > > > > > I think just marking it initialised may not be the only thing=
.
> > > > > > Here is a short description of what I am trying to do:
> > > > > > So, on timeout:
> > > > > >     Switch XenBusState to "Initialized"
> > > > > >     unquiesce/unfreeze the queues and return
> > > > > >     mark info->connected =3D BLKIF_STATE_CONNECTED
> > > > >
> > > > > If xenbus state is Initialized isn't it wrong to set info->conn=
ected
> > > > > =3D=3D CONNECTED?
> > > > >
> > > > Yes, you are right earlier I was marking it explicitly but that w=
as not right,
> > > > the connect path for blkfront will do that.
> > > > > You should tear down all the internal state (like a proper clos=
e)?
> > > > >
> > > > Isn't that similar to disconnecting in the first place that faile=
d during
> > > > freeze? Do you mean re-try to close but this time re-connect afte=
r close
> > > > basically do everything you would at "restore"?
> > >
> > > Last time I checked blkfront supported reconnections (ie: disconnec=
t
> > > from a backend and connect again). I was assuming we could apply th=
e
> > > same here on timeout, and just follow the same path where the front=
end
> > > waits indefinitely for the backend to close and then attempts to
> > > reconnect.
> > >
> > > > Also, I experimented with that and it works intermittently. I wan=
t to take a
> > > > step back on this issue and ask few questions here:
> > > > 1. Is fixing this recovery a blocker for me sending in a V2 versi=
on?
> > >
> > > At the end of day it's your feature. I would certainly prefer for i=
t
> > > to work as good as possible, this being a recovery in case of failu=
re
> > > just make sure it does something sane (ie: crash/close the frontend=
)
> > > and add a TODO note.
> > >
> > > > 2. In our 2-3 years of supporting this feature at large scale we =
haven't seen this issue
> > > > where backend fails to disconnect. What we are trying to do here =
is create a
> > > > hypothetical situation where we leave backend in Closing state an=
d try and see how it
> > > > recovers. The reason why I think it "may not" occur and the timeo=
ut of 5HZ is
> > > > sufficient is because we haven't come across even a single use-ca=
se where it
> > > > caused hibernation to fail.
> > > > The reason why I think "it may" occur is if we are running a real=
ly memory
> > > > intensive workload and ring is busy and is unable to complete all=
 the requests
> > > > in the given timeout. This is very unlikely though.
> > >
> > > As said above I would generally prefer for code to handle possible
> > > failures the best way, and hence I think here it would be nice to
> > > fallback to the normal disconnect path and just wait for the backen=
d
> > > to close.
> > >
> > Do you mind throwing some light in here, what that path may be, if it=
s
> > straight forward to fix I would like to debug it a bit more. May be I=
 am
> > missing some of the context here.
>=20
> So the frontend should do:
>=20
> - Switch to Closed state (and cleanup everything required).
> - Wait for backend to switch to Closed state (must be done
>   asynchronously, handled in blkback_changed).
> - Switch frontend to XenbusStateInitialising, that will in turn force
>   the backend to switch to XenbusStateInitWait.
> - After that it should just follow the normal connection procedure.
>=20
> I think the part that's missing is the frontend doing the state change
> to XenbusStateInitialising when the backend switches to the Closed
> state.
>=20
> > I was of the view we may just want to mark frontend closed which shou=
ld do
> > the job of freeing resources and then following the same flow as
> > blkfront_restore. That does not seems to work correctly 100% of the t=
ime.
>=20
> I think the missing part is that you must wait for the backend to
> switch to the Closed state, or else the switch to
> XenbusStateInitialising won't be picked up correctly by the backend
> (because it's still doing it's cleanup).
>=20
> Using blkfront_restore might be an option, but you need to assert the
> backend is in the initial state before using that path.
>
Yes, I agree, and I make sure the frontend only switches to
XenbusStateInitialising once the backend is disconnected, i.e. once it
sees the backend as Closed. An msleep in a loop is not that graceful,
but it works. The issue, which may require more debugging, is:
1. Hibernate the instance -> Closing fails. This is an artificially
   created situation: the frontend is not marked Closed in the first
   place during freezing.
2. The system comes back up fine, restored to 'backend connected'.
3. Re-run (1) without a reboot.
4. (3) fails to recover. Freezing does not fail at all, which is weird
   because it should time out, as it passes through the same path. It
   hits a BUG in talk_to_blkback() and the instance crashes.

Anyway, I just wanted to paint a picture: there may be something more
happening here that needs persistent debugging.
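
For reference, the recovery sequence you describe above can be sketched
as a small user-space state machine. This is only an illustration under
assumptions: the enum, the struct and both function names are made up
for this mail and are not the blkfront code or the xenbus API, and ring
setup/teardown is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for XenbusState values, not the kernel enum. */
enum xb_state {
	XB_INITIALISING,
	XB_INIT_WAIT,
	XB_INITIALISED,
	XB_CONNECTED,
	XB_CLOSING,
	XB_CLOSED,
};

/* Hypothetical frontend bookkeeping for the recovery path. */
struct frontend {
	enum xb_state state;	/* state the frontend advertises */
	bool recovering;	/* a freeze timed out, reconnect is pending */
};

/* Freeze timed out: switch to Closed and remember to reconnect later. */
void frontend_freeze_timed_out(struct frontend *fe)
{
	assert(fe != NULL);
	fe->state = XB_CLOSED;
	fe->recovering = true;
}

/*
 * Called on every backend state change (the role blkback_changed plays
 * in the thread). Returns true once the frontend is connected again.
 */
bool frontend_backend_changed(struct frontend *fe, enum xb_state backend)
{
	assert(fe != NULL);
	switch (backend) {
	case XB_CLOSED:
		/* Restart only after the backend finished tearing down. */
		if (fe->recovering && fe->state == XB_CLOSED)
			fe->state = XB_INITIALISING;
		break;
	case XB_INIT_WAIT:
		/* Normal connect path; ring setup is elided here. */
		if (fe->state == XB_INITIALISING)
			fe->state = XB_INITIALISED;
		break;
	case XB_CONNECTED:
		if (fe->state == XB_INITIALISED) {
			fe->state = XB_CONNECTED;
			fe->recovering = false;
		}
		break;
	default:
		/* Backend still Closing (or anything else): keep waiting. */
		break;
	}
	return fe->state == XB_CONNECTED;
}
```

The point of the sketch is that the switch to Initialising is driven by
the backend reaching Closed, not by a timer, so a backend that is still
in Closing simply leaves the frontend waiting.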
> > > You likely have this very well tuned to your own environment and
> > > workloads, since this will now be upstream others might have more
> > > contended systems where it could start to fail.
> > >
> > I agree, however, this is also from the testing I did with 100 of run=
s
> > outside of EC2 running few tests of my own.
> > > > 3) Also, I do not think this may be straight forward to fix and e=
xpect
> > > > hibernation to work flawlessly in subsequent invocations. I am op=
en to
> > > > all suggestions.
> > >
> > > Right, adding a TODO would seem appropriate then.
> > >
> > Just to double check, I will send in a V2 with this marked as TO-DO?
>=20
> I think that's fine. Please clearly describe what's missing, so
> others know what they might have to implement.
>=20
Ack.
> Thanks, Roger.
>=20
Thanks,
Anchal