Date: Fri, 6 Mar 2020 18:40:33 +0000
From: Anchal Agarwal
To: Roger Pau Monné
Subject: Re: [RFC PATCH v3 06/12] xen-blkfront: add callbacks for PM suspend and hibernation
Message-ID: <20200306184033.GA25358@dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com>
References: <890c404c585d7790514527f0c021056a7be6e748.1581721799.git.anchalag@amazon.com> <20200221142445.GZ4679@Air-de-Roger>
In-Reply-To: <20200221142445.GZ4679@Air-de-Roger>

On Fri, Feb 21, 2020 at 03:24:45PM +0100, Roger Pau Monné wrote:
> On Fri, Feb 14, 2020 at 11:25:34PM +0000, Anchal Agarwal wrote:
> > From: Munehisa Kamata
> >
> > Add freeze, thaw and restore callbacks for PM suspend and hibernation
> > support. All frontend drivers that need to use PM_HIBERNATION/PM_SUSPEND
> > events need to implement these xenbus_driver callbacks.
> > The freeze handler stops the block-layer queue and disconnects the
> > frontend from the backend while freeing ring_info and associated resources.
> > The restore handler re-allocates ring_info and re-connects to the
> > backend, so the rest of the kernel can continue to use the block device
> > transparently. The handlers are also used for both PM suspend and
> > hibernation, so that we can keep the existing suspend/resume callbacks
> > for Xen suspend without modification. Before disconnecting from the
> > backend, we need to prevent any new IO from being queued and wait for
> > existing IO to complete. Freeze/unfreeze of the queues will guarantee
> > that there are no requests in use on the shared ring.
> >
> > Note: for older backends, if a backend doesn't have commit '12ea729645ace'
> > ("xen/blkback: unmap all persistent grants when frontend gets disconnected"),
> > the frontend may see a massive amount of grant table warnings when freeing
> > resources:
> > [   36.852659] deferring g.e. 0xf9 (pfn 0xffffffffffffffff)
> > [   36.855089] xen:grant_table: WARNING: g.e. 0x112 still in use!
> >
> > In this case, persistent grants would need to be disabled.
> >
> > [Anchal Changelog: Removed timeout/request during blkfront freeze.
> > Fixed major part of the code to work with blk-mq]
> > Signed-off-by: Anchal Agarwal
> > Signed-off-by: Munehisa Kamata
> > ---
> >  drivers/block/xen-blkfront.c | 119 ++++++++++++++++++++++++++++++++---
> >  1 file changed, 112 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> > index 478120233750..d715ed3cb69a 100644
> > --- a/drivers/block/xen-blkfront.c
> > +++ b/drivers/block/xen-blkfront.c
> > @@ -47,6 +47,8 @@
> >  #include
> >  #include
> >  #include
> > +#include
> > +#include
> >
> >  #include
> >  #include
> > @@ -79,6 +81,8 @@ enum blkif_state {
> >  	BLKIF_STATE_DISCONNECTED,
> >  	BLKIF_STATE_CONNECTED,
> >  	BLKIF_STATE_SUSPENDED,
> > +	BLKIF_STATE_FREEZING,
> > +	BLKIF_STATE_FROZEN
> >  };
> >
> >  struct grant {
> > @@ -220,6 +224,7 @@ struct blkfront_info
> >  	struct list_head requests;
> >  	struct bio_list bio_list;
> >  	struct list_head info_list;
> > +	struct completion wait_backend_disconnected;
> >  };
> >
> >  static unsigned int nr_minors;
> > @@ -261,6 +266,7 @@ static DEFINE_SPINLOCK(minor_lock);
> >  static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo);
> >  static void blkfront_gather_backend_features(struct blkfront_info *info);
> >  static int negotiate_mq(struct blkfront_info *info);
> > +static void __blkif_free(struct blkfront_info *info);
>
> I'm not particularly fond of adding underscore prefixes to functions,
> I would rather use a more descriptive name if possible.
> blkif_free_{queues/rings} maybe?
>
Apologies for the delayed response as I was OOTO. Appreciate your feedback.
Will fix.
> >
> >  static int get_id_from_freelist(struct blkfront_ring_info *rinfo)
> >  {
> > @@ -995,6 +1001,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
> >  	info->sector_size = sector_size;
> >  	info->physical_sector_size = physical_sector_size;
> >  	blkif_set_queue_limits(info);
> > +	init_completion(&info->wait_backend_disconnected);
> >
> >  	return 0;
> >  }
> > @@ -1218,6 +1225,8 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
> >  /* Already hold rinfo->ring_lock. */
> >  static inline void kick_pending_request_queues_locked(struct blkfront_ring_info *rinfo)
> >  {
> > +	if (unlikely(rinfo->dev_info->connected == BLKIF_STATE_FREEZING))
> > +		return;
>
> Do you really need this check here?
>
> The queue will be frozen and quiesced in blkfront_freeze when the state
> is set to BLKIF_STATE_FREEZING, and then the call to
> blk_mq_start_stopped_hw_queues is just a noop as long as the queue is
> quiesced (see blk_mq_run_hw_queue).
>
You are right. Will fix it. May have skipped this part of the patch when
fixing blkfront_freeze.
> >  	if (!RING_FULL(&rinfo->ring))
> >  		blk_mq_start_stopped_hw_queues(rinfo->dev_info->rq, true);
> >  }
> > @@ -1341,8 +1350,6 @@ static void blkif_free_ring(struct blkfront_ring_info *rinfo)
> >
> >  static void blkif_free(struct blkfront_info *info, int suspend)
> >  {
> > -	unsigned int i;
> > -
> >  	/* Prevent new requests being issued until we fix things up. */
> >  	info->connected = suspend ?
> >  		BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
> > @@ -1350,6 +1357,13 @@ static void blkif_free(struct blkfront_info *info, int suspend)
> >  	if (info->rq)
> >  		blk_mq_stop_hw_queues(info->rq);
> >
> > +	__blkif_free(info);
> > +}
> > +
> > +static void __blkif_free(struct blkfront_info *info)
> > +{
> > +	unsigned int i;
> > +
> >  	for (i = 0; i < info->nr_rings; i++)
> >  		blkif_free_ring(&info->rinfo[i]);
> >
> > @@ -1553,8 +1567,10 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
> >  	struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)dev_id;
> >  	struct blkfront_info *info = rinfo->dev_info;
> >
> > -	if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
> > -		return IRQ_HANDLED;
> > +	if (unlikely(info->connected != BLKIF_STATE_CONNECTED)) {
> > +		if (info->connected != BLKIF_STATE_FREEZING)
>
> Please fold this into the previous if condition:
>
> if (unlikely(info->connected != BLKIF_STATE_CONNECTED &&
>              info->connected != BLKIF_STATE_FREEZING))
>     return IRQ_HANDLED;
>
ACK
> > +	}
> >
> >  	spin_lock_irqsave(&rinfo->ring_lock, flags);
> >  again:
> > @@ -2020,6 +2036,7 @@ static int blkif_recover(struct blkfront_info *info)
> >  	struct bio *bio;
> >  	unsigned int segs;
> >
> > +	bool frozen = info->connected == BLKIF_STATE_FROZEN;
>
> Please place this together with the rest of the local variable
> declarations.
>
ACK
> >  	blkfront_gather_backend_features(info);
> >  	/* Reset limits changed by blk_mq_update_nr_hw_queues(). */
> >  	blkif_set_queue_limits(info);
> > @@ -2046,6 +2063,9 @@ static int blkif_recover(struct blkfront_info *info)
> >  		kick_pending_request_queues(rinfo);
> >  	}
> >
> > +	if (frozen)
> > +		return 0;
>
> I have to admit my memory is fuzzy here, but don't you need to
> re-queue requests in case the backend has different limits of indirect
> descriptors per request for example?
>
> Or do we expect that the frontend is always going to be resumed on the
> same backend, and thus features won't change?
>
So, to understand your question better here: AFAIU the maximum number of
indirect grefs is fixed by the backend, but the frontend can issue requests
with any number of indirect segments as long as it's less than the number
provided by the backend. So by your question, do you mean this max number,
MAX_INDIRECT_SEGMENTS (256) on the backend, can change?
> > +
> >  	list_for_each_entry_safe(req, n, &info->requests, queuelist) {
> >  		/* Requeue pending requests (flush or discard) */
> >  		list_del_init(&req->queuelist);
> > @@ -2359,6 +2379,7 @@ static void blkfront_connect(struct blkfront_info *info)
> >
> >  		return;
> >  	case BLKIF_STATE_SUSPENDED:
> > +	case BLKIF_STATE_FROZEN:
> >  		/*
> >  		 * If we are recovering from suspension, we need to wait
> >  		 * for the backend to announce it's features before
> > @@ -2476,12 +2497,37 @@ static void blkback_changed(struct xenbus_device *dev,
> >  		break;
> >
> >  	case XenbusStateClosed:
> > -		if (dev->state == XenbusStateClosed)
> > +		if (dev->state == XenbusStateClosed) {
> > +			if (info->connected == BLKIF_STATE_FREEZING) {
> > +				__blkif_free(info);
> > +				info->connected = BLKIF_STATE_FROZEN;
> > +				complete(&info->wait_backend_disconnected);
> > +				break;
> > +			}
> > +
> >  			break;
> > +		}
> > +
> > +		/*
> > +		 * We may somehow receive backend's Closed again while thawing
> > +		 * or restoring and it causes thawing or restoring to fail.
> > +		 * Ignore such unexpected state anyway.
> > +		 */
> > +		if (info->connected == BLKIF_STATE_FROZEN &&
> > +		    dev->state == XenbusStateInitialised) {
>
> I'm not sure you need the extra dev->state == XenbusStateInitialised.
> If the frontend is in state BLKIF_STATE_FROZEN you can likely ignore
> the notification of the backend switched to closed state, regardless
> of the frontend state?
>
I see. Sounds plausible; I will do my set of testing and figure out if it
does not break anything.
> > +			dev_dbg(&dev->dev,
> > +				"ignore the backend's Closed state: %s",
> > +				dev->nodename);
> > +			break;
> > +		}
> >  		/* fall through */
> >  	case XenbusStateClosing:
> > -		if (info)
> > -			blkfront_closing(info);
> > +		if (info) {
> > +			if (info->connected == BLKIF_STATE_FREEZING)
> > +				xenbus_frontend_closed(dev);
> > +			else
> > +				blkfront_closing(info);
> > +		}
> >  		break;
> >  	}
> >  }
> > @@ -2625,6 +2671,62 @@ static void blkif_release(struct gendisk *disk, fmode_t mode)
> >  	mutex_unlock(&blkfront_mutex);
> >  }
> >
> > +static int blkfront_freeze(struct xenbus_device *dev)
> > +{
> > +	unsigned int i;
> > +	struct blkfront_info *info = dev_get_drvdata(&dev->dev);
> > +	struct blkfront_ring_info *rinfo;
> > +	/* This would be a reasonable timeout as used in xenbus_dev_shutdown() */
> > +	unsigned int timeout = 5 * HZ;
> > +	int err = 0;
> > +
> > +	info->connected = BLKIF_STATE_FREEZING;
> > +
> > +	blk_mq_freeze_queue(info->rq);
> > +	blk_mq_quiesce_queue(info->rq);
>
> Don't you need to also drain the queue and make sure it's empty?
>
blk_mq_freeze_queue and blk_mq_quiesce_queue should take care of running the
HW queues synchronously and making sure all the ongoing dispatches have
finished. Did I understand your question right?
> > +
> > +	for (i = 0; i < info->nr_rings; i++) {
> > +		rinfo = &info->rinfo[i];
> > +
> > +		gnttab_cancel_free_callback(&rinfo->callback);
> > +		flush_work(&rinfo->work);
> > +	}
> > +
> > +	/* Kick the backend to disconnect */
> > +	xenbus_switch_state(dev, XenbusStateClosing);
> > +
> > +	/*
> > +	 * We don't want to move forward before the frontend is disconnected
> > +	 * from the backend cleanly.
> > +	 */
> > +	timeout = wait_for_completion_timeout(&info->wait_backend_disconnected,
> > +					      timeout);
> > +	if (!timeout) {
> > +		err = -EBUSY;
> > +		xenbus_dev_error(dev, err, "Freezing timed out;"
> > +				 "the device may become inconsistent state");
> > +	}
> > +
> > +	return err;
> > +}
> > +
> > +static int blkfront_restore(struct xenbus_device *dev)
> > +{
> > +	struct blkfront_info *info = dev_get_drvdata(&dev->dev);
> > +	int err = 0;
> > +
> > +	err = talk_to_blkback(dev, info);
> > +	blk_mq_unquiesce_queue(info->rq);
> > +	blk_mq_unfreeze_queue(info->rq);
> > +
> > +	if (err)
> > +		goto out;
>
> There's no need for an out label here, just return err, or even
> simpler:
>
OK.
> if (!err)
>     blk_mq_update_nr_hw_queues(&info->tag_set, info->nr_rings);
>
> return err;
>
> Thanks, Roger.
>
Thanks,
Anchal