* [PATCH] swap: add block io poll in swapin path
@ 2017-05-04 20:42 Shaohua Li
2017-05-04 20:53 ` Jens Axboe
2017-05-05 3:07 ` Huang, Ying
0 siblings, 2 replies; 10+ messages in thread
From: Shaohua Li @ 2017-05-04 20:42 UTC (permalink / raw)
To: linux-mm; +Cc: Andrew Morton, Kernel-team, Tim Chen, Huang Ying, Jens Axboe
For fast flash disks, async IO can introduce overhead because of the
context switch. block-mq now supports IO polling, which improves
performance and latency a lot. Swapin is a good place to use this
technique, because the task is waiting for the swapin page to continue
execution.
In my virtual machine, directly reading 4k of data from an NVMe device
with iopoll is about 60% faster than without polling. With iopoll support
in the swapin path, my microbenchmark (a task doing random memory writes)
is about 10% ~ 25% faster. CPU utilization increases a lot though, to 2x
and even 3x; this will depend on disk speed. While iopoll in swapin isn't
intended for all use cases, it's a win for latency-sensitive workloads
with a high-speed swap disk. The block layer has a knob to control
polling at runtime. If polling isn't enabled in the block layer, there
should be no noticeable change in swapin.
The swapin readahead might read several pages in at the same time and
form a big IO request. Since that IO will take longer, it doesn't make
sense to poll, so the patch only does iopoll for single-page swapin.
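For reference, a minimal sketch of the kind of raw polled 4k read used
for the 60% number above (my reconstruction, not the exact test program;
it assumes a glibc that provides the preadv2() wrapper and RWF_HIPRI, and
that polling was enabled first via
"echo 1 > /sys/block/<dev>/queue/io_poll"):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct iovec iov;
	void *buf;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <block device>\n", argv[0]);
		return 1;
	}
	/* O_DIRECT so the read actually hits the device */
	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (posix_memalign(&buf, 4096, 4096))
		return 1;
	iov.iov_base = buf;
	iov.iov_len = 4096;
	/* RWF_HIPRI: poll for completion instead of sleeping on the irq */
	if (preadv2(fd, &iov, 1, 0, RWF_HIPRI) < 0)
		perror("preadv2");
	free(buf);
	close(fd);
	return 0;
}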
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
include/linux/swap.h | 5 +++--
mm/madvise.c | 4 ++--
mm/page_io.c | 20 ++++++++++++++++++--
mm/swap_state.c | 10 ++++++----
mm/swapfile.c | 2 +-
5 files changed, 30 insertions(+), 11 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ba58824..c589e6c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -331,7 +331,7 @@ extern void kswapd_stop(int nid);
#include <linux/blk_types.h> /* for bio_end_io_t */
/* linux/mm/page_io.c */
-extern int swap_readpage(struct page *);
+extern int swap_readpage(struct page *, bool do_poll);
extern int swap_writepage(struct page *page, struct writeback_control *wbc);
extern void end_swap_bio_write(struct bio *bio);
extern int __swap_writepage(struct page *page, struct writeback_control *wbc,
@@ -362,7 +362,8 @@ extern void free_page_and_swap_cache(struct page *);
extern void free_pages_and_swap_cache(struct page **, int);
extern struct page *lookup_swap_cache(swp_entry_t);
extern struct page *read_swap_cache_async(swp_entry_t, gfp_t,
- struct vm_area_struct *vma, unsigned long addr);
+ struct vm_area_struct *vma, unsigned long addr,
+ bool do_poll);
extern struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
struct vm_area_struct *vma, unsigned long addr,
bool *new_page_allocated);
diff --git a/mm/madvise.c b/mm/madvise.c
index 25b78ee..8eda184 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -205,7 +205,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
continue;
page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
- vma, index);
+ vma, index, false);
if (page)
put_page(page);
}
@@ -246,7 +246,7 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
}
swap = radix_to_swp_entry(page);
page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
- NULL, 0);
+ NULL, 0, false);
if (page)
put_page(page);
}
diff --git a/mm/page_io.c b/mm/page_io.c
index 23f6d0d..464cf16 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -117,6 +117,7 @@ static void swap_slot_free_notify(struct page *page)
static void end_swap_bio_read(struct bio *bio)
{
struct page *page = bio->bi_io_vec[0].bv_page;
+ struct task_struct *waiter = bio->bi_private;
if (bio->bi_error) {
SetPageError(page);
@@ -133,6 +134,7 @@ static void end_swap_bio_read(struct bio *bio)
out:
unlock_page(page);
bio_put(bio);
+ wake_up_process(waiter);
}
int generic_swapfile_activate(struct swap_info_struct *sis,
@@ -329,11 +331,13 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
return ret;
}
-int swap_readpage(struct page *page)
+int swap_readpage(struct page *page, bool do_poll)
{
struct bio *bio;
int ret = 0;
struct swap_info_struct *sis = page_swap_info(page);
+ blk_qc_t qc;
+ struct block_device *bdev;
VM_BUG_ON_PAGE(!PageSwapCache(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -372,9 +376,21 @@ int swap_readpage(struct page *page)
ret = -ENOMEM;
goto out;
}
+ bdev = bio->bi_bdev;
+ bio->bi_private = current;
bio_set_op_attrs(bio, REQ_OP_READ, 0);
count_vm_event(PSWPIN);
- submit_bio(bio);
+ qc = submit_bio(bio);
+ while (do_poll) {
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ if (!PageLocked(page))
+ break;
+
+ if (!blk_mq_poll(bdev_get_queue(bdev), qc))
+ break;
+ }
+ __set_current_state(TASK_RUNNING);
+
out:
return ret;
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 539b888..7c0a66c 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -404,14 +404,14 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
* the swap entry is no longer in use.
*/
struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
- struct vm_area_struct *vma, unsigned long addr)
+ struct vm_area_struct *vma, unsigned long addr, bool do_poll)
{
bool page_was_allocated;
struct page *retpage = __read_swap_cache_async(entry, gfp_mask,
vma, addr, &page_was_allocated);
if (page_was_allocated)
- swap_readpage(retpage);
+ swap_readpage(retpage, do_poll);
return retpage;
}
@@ -488,11 +488,13 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
unsigned long start_offset, end_offset;
unsigned long mask;
struct blk_plug plug;
+ bool do_poll = true;
mask = swapin_nr_pages(offset) - 1;
if (!mask)
goto skip;
+ do_poll = false;
/* Read a page_cluster sized and aligned cluster around offset. */
start_offset = offset & ~mask;
end_offset = offset | mask;
@@ -503,7 +505,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
for (offset = start_offset; offset <= end_offset ; offset++) {
/* Ok, do the async read-ahead now */
page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
- gfp_mask, vma, addr);
+ gfp_mask, vma, addr, false);
if (!page)
continue;
if (offset != entry_offset)
@@ -514,7 +516,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
lru_add_drain(); /* Push any new pages onto the LRU now */
skip:
- return read_swap_cache_async(entry, gfp_mask, vma, addr);
+ return read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll);
}
int init_swap_address_space(unsigned int type, unsigned long nr_pages)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4f6cba1..04516c1 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1719,7 +1719,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
swap_map = &si->swap_map[i];
entry = swp_entry(type, i);
page = read_swap_cache_async(entry,
- GFP_HIGHUSER_MOVABLE, NULL, 0);
+ GFP_HIGHUSER_MOVABLE, NULL, 0, false);
if (!page) {
/*
* Either swap_duplicate() failed because entry
--
2.9.3
* Re: [PATCH] swap: add block io poll in swapin path
2017-05-04 20:42 [PATCH] swap: add block io poll in swapin path Shaohua Li
@ 2017-05-04 20:53 ` Jens Axboe
2017-05-04 21:27 ` Shaohua Li
2017-05-05 3:07 ` Huang, Ying
1 sibling, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2017-05-04 20:53 UTC (permalink / raw)
To: Shaohua Li, linux-mm; +Cc: Andrew Morton, Kernel-team, Tim Chen, Huang Ying
On 05/04/2017 02:42 PM, Shaohua Li wrote:
> For fast flash disks, async IO can introduce overhead because of the
> context switch. block-mq now supports IO polling, which improves
> performance and latency a lot. Swapin is a good place to use this
> technique, because the task is waiting for the swapin page to continue
> execution.
Nifty!
> In my virtual machine, directly reading 4k of data from an NVMe device
> with iopoll is about 60% faster than without polling. With iopoll support
> in the swapin path, my microbenchmark (a task doing random memory writes)
> is about 10% ~ 25% faster. CPU utilization increases a lot though, to 2x
> and even 3x; this will depend on disk speed. While iopoll in swapin isn't
> intended for all use cases, it's a win for latency-sensitive workloads
> with a high-speed swap disk. The block layer has a knob to control
> polling at runtime. If polling isn't enabled in the block layer, there
> should be no noticeable change in swapin.
Did you try with hybrid polling enabled? We should be able to achieve
most of the latency win at much less CPU cost with that.
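For reference, polling and hybrid polling are per-queue sysfs settings;
assuming the swap device is nvme0n1 (just an example name):

    echo 1 > /sys/block/nvme0n1/queue/io_poll        # allow polling
    echo 0 > /sys/block/nvme0n1/queue/io_poll_delay  # 0 = adaptive hybrid,
                                                     # -1 = classic spin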
--
Jens Axboe
* Re: [PATCH] swap: add block io poll in swapin path
2017-05-04 20:53 ` Jens Axboe
@ 2017-05-04 21:27 ` Shaohua Li
2017-05-04 21:29 ` Jens Axboe
0 siblings, 1 reply; 10+ messages in thread
From: Shaohua Li @ 2017-05-04 21:27 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-mm, Andrew Morton, Kernel-team, Tim Chen, Huang Ying
On Thu, May 04, 2017 at 02:53:59PM -0600, Jens Axboe wrote:
> On 05/04/2017 02:42 PM, Shaohua Li wrote:
> > For fast flash disks, async IO can introduce overhead because of the
> > context switch. block-mq now supports IO polling, which improves
> > performance and latency a lot. Swapin is a good place to use this
> > technique, because the task is waiting for the swapin page to continue
> > execution.
>
> Nifty!
>
> > In my virtual machine, directly reading 4k of data from an NVMe device
> > with iopoll is about 60% faster than without polling. With iopoll support
> > in the swapin path, my microbenchmark (a task doing random memory writes)
> > is about 10% ~ 25% faster. CPU utilization increases a lot though, to 2x
> > and even 3x; this will depend on disk speed. While iopoll in swapin isn't
> > intended for all use cases, it's a win for latency-sensitive workloads
> > with a high-speed swap disk. The block layer has a knob to control
> > polling at runtime. If polling isn't enabled in the block layer, there
> > should be no noticeable change in swapin.
>
> Did you try with hybrid polling enabled? We should be able to achieve
> most of the latency win at much less CPU cost with that.
Hybrid polling is much slower than classic polling in my test; I tried
different settings. Maybe that's because this is a VM, though.
Thanks,
Shaohua
* Re: [PATCH] swap: add block io poll in swapin path
2017-05-04 21:27 ` Shaohua Li
@ 2017-05-04 21:29 ` Jens Axboe
2017-05-04 23:23 ` Chen, Tim C
0 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2017-05-04 21:29 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-mm, Andrew Morton, Kernel-team, Tim Chen, Huang Ying
On 05/04/2017 03:27 PM, Shaohua Li wrote:
> On Thu, May 04, 2017 at 02:53:59PM -0600, Jens Axboe wrote:
>> On 05/04/2017 02:42 PM, Shaohua Li wrote:
>>> For fast flash disks, async IO can introduce overhead because of the
>>> context switch. block-mq now supports IO polling, which improves
>>> performance and latency a lot. Swapin is a good place to use this
>>> technique, because the task is waiting for the swapin page to continue
>>> execution.
>>
>> Nifty!
>>
>>> In my virtual machine, directly reading 4k of data from an NVMe device
>>> with iopoll is about 60% faster than without polling. With iopoll support
>>> in the swapin path, my microbenchmark (a task doing random memory writes)
>>> is about 10% ~ 25% faster. CPU utilization increases a lot though, to 2x
>>> and even 3x; this will depend on disk speed. While iopoll in swapin isn't
>>> intended for all use cases, it's a win for latency-sensitive workloads
>>> with a high-speed swap disk. The block layer has a knob to control
>>> polling at runtime. If polling isn't enabled in the block layer, there
>>> should be no noticeable change in swapin.
>>
>> Did you try with hybrid polling enabled? We should be able to achieve
>> most of the latency win at much less CPU cost with that.
>
> Hybrid polling is much slower than classic polling in my test; I tried
> different settings. Maybe that's because this is a VM, though.
It's probably a VM issue; I bet the timed sleeps are just too slow to be
useful in a VM.
--
Jens Axboe
* RE: [PATCH] swap: add block io poll in swapin path
2017-05-04 21:29 ` Jens Axboe
@ 2017-05-04 23:23 ` Chen, Tim C
2017-05-05 2:24 ` Jens Axboe
0 siblings, 1 reply; 10+ messages in thread
From: Chen, Tim C @ 2017-05-04 23:23 UTC (permalink / raw)
To: Jens Axboe, Shaohua Li; +Cc: linux-mm, Andrew Morton, Kernel-team, Huang, Ying
>-----Original Message-----
>From: Jens Axboe [mailto:axboe@fb.com]
>Sent: Thursday, May 04, 2017 2:29 PM
>To: Shaohua Li
>Cc: linux-mm@kvack.org; Andrew Morton; Kernel-team@fb.com; Chen, Tim C;
>Huang, Ying
>Subject: Re: [PATCH] swap: add block io poll in swapin path
>
>On 05/04/2017 03:27 PM, Shaohua Li wrote:
>> On Thu, May 04, 2017 at 02:53:59PM -0600, Jens Axboe wrote:
>>> On 05/04/2017 02:42 PM, Shaohua Li wrote:
>>>> For fast flash disks, async IO can introduce overhead because of the
>>>> context switch. block-mq now supports IO polling, which improves
>>>> performance and latency a lot. Swapin is a good place to use this
>>>> technique, because the task is waiting for the swapin page to
>>>> continue execution.
>>>
>>> Nifty!
>>>
>>>> In my virtual machine, directly reading 4k of data from an NVMe device
>>>> with iopoll is about 60% faster than without polling. With iopoll
>>>> support in the swapin path, my microbenchmark (a task doing random
>>>> memory writes) is about 10% ~ 25% faster. CPU utilization increases a
>>>> lot though, to 2x and even 3x; this will depend on disk speed. While
>>>> iopoll in swapin isn't intended for all use cases, it's a win for
>>>> latency-sensitive workloads with a high-speed swap disk. The block
>>>> layer has a knob to control polling at runtime. If polling isn't
>>>> enabled in the block layer, there should be no noticeable change in swapin.
>>>
>>> Did you try with hybrid polling enabled? We should be able to achieve
>>> most of the latency win at much less CPU cost with that.
>>
>> Hybrid polling is much slower than classic polling in my test; I tried
>> different settings. Maybe that's because this is a VM, though.
>
>It's probably a VM issue; I bet the timed sleeps are just too slow to be useful in a
>VM.
>
The speedup is quite nice.
The high CPU utilization is somewhat of a concern, but it is directly
proportional to the poll time, i.e. the latency of the drive's response.
The latest generation of SSD drives has latency that has improved by a
factor of 7 or more over the previous generation, so the poll time could
go down quite a bit, depending on what drive you were using in your test.
What is the latency and the kind of drive you're using?
Tim
* Re: [PATCH] swap: add block io poll in swapin path
2017-05-04 23:23 ` Chen, Tim C
@ 2017-05-05 2:24 ` Jens Axboe
0 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2017-05-05 2:24 UTC (permalink / raw)
To: Chen, Tim C, Shaohua Li; +Cc: linux-mm, Andrew Morton, Kernel-team, Huang, Ying
On 05/04/2017 05:23 PM, Chen, Tim C wrote:
>> -----Original Message-----
>> From: Jens Axboe [mailto:axboe@fb.com]
>> Sent: Thursday, May 04, 2017 2:29 PM
>> To: Shaohua Li
>> Cc: linux-mm@kvack.org; Andrew Morton; Kernel-team@fb.com; Chen, Tim C;
>> Huang, Ying
>> Subject: Re: [PATCH] swap: add block io poll in swapin path
>>
>> On 05/04/2017 03:27 PM, Shaohua Li wrote:
>>> On Thu, May 04, 2017 at 02:53:59PM -0600, Jens Axboe wrote:
>>>> On 05/04/2017 02:42 PM, Shaohua Li wrote:
>>>>> For fast flash disks, async IO can introduce overhead because of the
>>>>> context switch. block-mq now supports IO polling, which improves
>>>>> performance and latency a lot. Swapin is a good place to use this
>>>>> technique, because the task is waiting for the swapin page to
>>>>> continue execution.
>>>>
>>>> Nifty!
>>>>
>>>>> In my virtual machine, directly reading 4k of data from an NVMe device
>>>>> with iopoll is about 60% faster than without polling. With iopoll
>>>>> support in the swapin path, my microbenchmark (a task doing random
>>>>> memory writes) is about 10% ~ 25% faster. CPU utilization increases a
>>>>> lot though, to 2x and even 3x; this will depend on disk speed. While
>>>>> iopoll in swapin isn't intended for all use cases, it's a win for
>>>>> latency-sensitive workloads with a high-speed swap disk. The block
>>>>> layer has a knob to control polling at runtime. If polling isn't
>>>>> enabled in the block layer, there should be no noticeable change in swapin.
>>>>
>>>> Did you try with hybrid polling enabled? We should be able to achieve
>>>> most of the latency win at much less CPU cost with that.
>>>
>>> Hybrid polling is much slower than classic polling in my test; I tried
>>> different settings. Maybe that's because this is a VM, though.
>>
>> It's probably a VM issue; I bet the timed sleeps are just too slow to be useful
>> in a VM.
>>
>
> The speedup is quite nice. The high CPU utilization is somewhat of a
> concern, but it is directly proportional to the poll time, i.e. the
> latency of the drive's response. The latest generation of SSD drives has
> latency that has improved by a factor of 7 or more over the previous
> generation, so the poll time could go down quite a bit, depending on
> what drive you were using in your test.
That was my point with the hybrid comment. In hybrid mode, there's no
reason why we can't get the same latencies as pure polling, at a
drastically reduced overhead. The latencies of the drive should not
matter, as we use the actual completion times to decide how long to
sleep and spin.
There's room for a bit of improvement, though. We should be tracking the
time it takes to do the sleep+wakeup, and factor that into our wait
cycle. Currently we just blindly use half the average completion time.
But even with that, testing by others has shown basically identical
latencies with hybrid polling, burning only half a core instead of a
full one. Compared to strict sync irq-driven mode, that's still a bit
higher in terms of CPU, but not really that much.
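Roughly, the sleep half of that looks like the following (an illustrative
sketch only, not the actual blk-mq code; mean_completion_ns() is a made-up
stand-in for the real per-queue poll statistics):

static void hybrid_poll_sleep(struct request_queue *q)
{
	/* hypothetical helper: mean completion time from recent stats */
	u64 half_mean = mean_completion_ns(q) / 2;
	ktime_t kt;

	if (!half_mean)
		return;	/* no stats gathered yet: spin from the start */

	kt = ktime_set(0, half_mean);
	set_current_state(TASK_UNINTERRUPTIBLE);
	schedule_hrtimeout(&kt, HRTIMER_MODE_REL);
	__set_current_state(TASK_RUNNING);
	/* the caller now enters the classic spin-poll loop */
}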
--
Jens Axboe
* Re: [PATCH] swap: add block io poll in swapin path
2017-05-04 20:42 [PATCH] swap: add block io poll in swapin path Shaohua Li
2017-05-04 20:53 ` Jens Axboe
@ 2017-05-05 3:07 ` Huang, Ying
2017-05-05 5:12 ` Shaohua Li
1 sibling, 1 reply; 10+ messages in thread
From: Huang, Ying @ 2017-05-05 3:07 UTC (permalink / raw)
To: Shaohua Li
Cc: linux-mm, Andrew Morton, Kernel-team, Tim Chen, Huang Ying, Jens Axboe
Hi, Shaohua,
Shaohua Li <shli@fb.com> writes:
> For fast flash disks, async IO can introduce overhead because of the
> context switch. block-mq now supports IO polling, which improves
> performance and latency a lot. Swapin is a good place to use this
> technique, because the task is waiting for the swapin page to continue
> execution.
>
> In my virtual machine, directly reading 4k of data from an NVMe device
> with iopoll is about 60% faster than without polling. With iopoll support
> in the swapin path, my microbenchmark (a task doing random memory writes)
> is about 10% ~ 25% faster.
How many concurrent processes/threads were writing memory in your test?
In general, I think polling is a good way to reduce swap in latency for
high-speed NVMe disks.
I have a question. If the load on the NVMe disk is high, for example if
quite some swap out occurs at the same time as swap in, the latency of
swap in may be much higher too. With such a high maximum swap in latency,
the overhead of polling may be high as well. For example, it may not be
an issue to poll for 10us, but it is more serious if we poll for 500us or
1ms. Is there some way to resolve this? Can we set a threshold for
polling, so that if we poll for more than the threshold, we go to sleep?
Best Regards,
Huang, Ying
> CPU utilization increases a lot though, to 2x and even 3x; this will
> depend on disk speed. While iopoll in swapin isn't intended for all use
> cases, it's a win for latency-sensitive workloads with a high-speed swap
> disk. The block layer has a knob to control polling at runtime. If
> polling isn't enabled in the block layer, there should be no noticeable
> change in swapin.
>
> The swapin readahead might read several pages in at the same time and
> form a big IO request. Since that IO will take longer, it doesn't make
> sense to poll, so the patch only does iopoll for single-page swapin.
>
> Cc: Tim Chen <tim.c.chen@intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Jens Axboe <axboe@fb.com>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
> include/linux/swap.h | 5 +++--
> mm/madvise.c | 4 ++--
> mm/page_io.c | 20 ++++++++++++++++++--
> mm/swap_state.c | 10 ++++++----
> mm/swapfile.c | 2 +-
> 5 files changed, 30 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index ba58824..c589e6c 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -331,7 +331,7 @@ extern void kswapd_stop(int nid);
> #include <linux/blk_types.h> /* for bio_end_io_t */
>
> /* linux/mm/page_io.c */
> -extern int swap_readpage(struct page *);
> +extern int swap_readpage(struct page *, bool do_poll);
> extern int swap_writepage(struct page *page, struct writeback_control *wbc);
> extern void end_swap_bio_write(struct bio *bio);
> extern int __swap_writepage(struct page *page, struct writeback_control *wbc,
> @@ -362,7 +362,8 @@ extern void free_page_and_swap_cache(struct page *);
> extern void free_pages_and_swap_cache(struct page **, int);
> extern struct page *lookup_swap_cache(swp_entry_t);
> extern struct page *read_swap_cache_async(swp_entry_t, gfp_t,
> - struct vm_area_struct *vma, unsigned long addr);
> + struct vm_area_struct *vma, unsigned long addr,
> + bool do_poll);
> extern struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
> struct vm_area_struct *vma, unsigned long addr,
> bool *new_page_allocated);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 25b78ee..8eda184 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -205,7 +205,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
> continue;
>
> page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
> - vma, index);
> + vma, index, false);
> if (page)
> put_page(page);
> }
> @@ -246,7 +246,7 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
> }
> swap = radix_to_swp_entry(page);
> page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
> - NULL, 0);
> + NULL, 0, false);
> if (page)
> put_page(page);
> }
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 23f6d0d..464cf16 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -117,6 +117,7 @@ static void swap_slot_free_notify(struct page *page)
> static void end_swap_bio_read(struct bio *bio)
> {
> struct page *page = bio->bi_io_vec[0].bv_page;
> + struct task_struct *waiter = bio->bi_private;
>
> if (bio->bi_error) {
> SetPageError(page);
> @@ -133,6 +134,7 @@ static void end_swap_bio_read(struct bio *bio)
> out:
> unlock_page(page);
> bio_put(bio);
> + wake_up_process(waiter);
> }
>
> int generic_swapfile_activate(struct swap_info_struct *sis,
> @@ -329,11 +331,13 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
> return ret;
> }
>
> -int swap_readpage(struct page *page)
> +int swap_readpage(struct page *page, bool do_poll)
> {
> struct bio *bio;
> int ret = 0;
> struct swap_info_struct *sis = page_swap_info(page);
> + blk_qc_t qc;
> + struct block_device *bdev;
>
> VM_BUG_ON_PAGE(!PageSwapCache(page), page);
> VM_BUG_ON_PAGE(!PageLocked(page), page);
> @@ -372,9 +376,21 @@ int swap_readpage(struct page *page)
> ret = -ENOMEM;
> goto out;
> }
> + bdev = bio->bi_bdev;
> + bio->bi_private = current;
> bio_set_op_attrs(bio, REQ_OP_READ, 0);
> count_vm_event(PSWPIN);
> - submit_bio(bio);
> + qc = submit_bio(bio);
> + while (do_poll) {
> + set_current_state(TASK_UNINTERRUPTIBLE);
> + if (!PageLocked(page))
> + break;
> +
> + if (!blk_mq_poll(bdev_get_queue(bdev), qc))
> + break;
> + }
> + __set_current_state(TASK_RUNNING);
> +
> out:
> return ret;
> }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 539b888..7c0a66c 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -404,14 +404,14 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> * the swap entry is no longer in use.
> */
> struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> - struct vm_area_struct *vma, unsigned long addr)
> + struct vm_area_struct *vma, unsigned long addr, bool do_poll)
> {
> bool page_was_allocated;
> struct page *retpage = __read_swap_cache_async(entry, gfp_mask,
> vma, addr, &page_was_allocated);
>
> if (page_was_allocated)
> - swap_readpage(retpage);
> + swap_readpage(retpage, do_poll);
>
> return retpage;
> }
> @@ -488,11 +488,13 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> unsigned long start_offset, end_offset;
> unsigned long mask;
> struct blk_plug plug;
> + bool do_poll = true;
>
> mask = swapin_nr_pages(offset) - 1;
> if (!mask)
> goto skip;
>
> + do_poll = false;
> /* Read a page_cluster sized and aligned cluster around offset. */
> start_offset = offset & ~mask;
> end_offset = offset | mask;
> @@ -503,7 +505,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> for (offset = start_offset; offset <= end_offset ; offset++) {
> /* Ok, do the async read-ahead now */
> page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> - gfp_mask, vma, addr);
> + gfp_mask, vma, addr, false);
> if (!page)
> continue;
> if (offset != entry_offset)
> @@ -514,7 +516,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>
> lru_add_drain(); /* Push any new pages onto the LRU now */
> skip:
> - return read_swap_cache_async(entry, gfp_mask, vma, addr);
> + return read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll);
> }
>
> int init_swap_address_space(unsigned int type, unsigned long nr_pages)
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4f6cba1..04516c1 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1719,7 +1719,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
> swap_map = &si->swap_map[i];
> entry = swp_entry(type, i);
> page = read_swap_cache_async(entry,
> - GFP_HIGHUSER_MOVABLE, NULL, 0);
> + GFP_HIGHUSER_MOVABLE, NULL, 0, false);
> if (!page) {
> /*
> * Either swap_duplicate() failed because entry
* Re: [PATCH] swap: add block io poll in swapin path
2017-05-05 3:07 ` Huang, Ying
@ 2017-05-05 5:12 ` Shaohua Li
2017-05-05 6:02 ` Huang, Ying
0 siblings, 1 reply; 10+ messages in thread
From: Shaohua Li @ 2017-05-05 5:12 UTC (permalink / raw)
To: Huang, Ying; +Cc: linux-mm, Andrew Morton, Kernel-team, Tim Chen, Jens Axboe
On Fri, May 05, 2017 at 11:07:54AM +0800, Huang, Ying wrote:
> Hi, Shaohua,
>
> Shaohua Li <shli@fb.com> writes:
>
> > For fast flash disks, async IO can introduce overhead because of the
> > context switch. block-mq now supports IO polling, which improves
> > performance and latency a lot. Swapin is a good place to use this
> > technique, because the task is waiting for the swapin page to continue
> > execution.
> >
> > In my virtual machine, directly reading 4k of data from an NVMe device
> > with iopoll is about 60% faster than without polling. With iopoll support
> > in the swapin path, my microbenchmark (a task doing random memory writes)
> > is about 10% ~ 25% faster.
>
> How many concurrent processes/threads were writing memory in your test?
I tried 1 thread and 8 threads.
> In general, I think polling is a good way to reduce swap in latency for
> high-speed NVMe disks.
>
> I have a question. If the load on the NVMe disk is high, for example if
> quite some swap out occurs at the same time as swap in, the latency of
> swap in may be much higher too. With such a high maximum swap in
> latency, the overhead of polling may be high as well. For example, it
> may not be an issue to poll for 10us, but it is more serious if we poll
> for 500us or 1ms. Is there some way to resolve this? Can we set a
> threshold for polling, so that if we poll for more than the threshold,
> we go to sleep?
Hybrid polling could help. By default, hybrid polling sleeps for half the
average IO latency before it starts spinning. But it will not work very
well if the latency becomes very large. Hybrid polling has an interface
that allows userspace to configure the poll threshold, but since the
latency varies over time, it would be very hard to set a single threshold
for all workloads.
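(That interface is /sys/block/<dev>/queue/io_poll_delay: 0 selects the
adaptive half-of-mean sleep, and a positive value is a fixed sleep in
microseconds before spinning, e.g. "echo 50 >
/sys/block/<dev>/queue/io_poll_delay"; the 50us is just an example value.)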
Thanks,
Shaohua
>
> Best Regards,
> Huang, Ying
>
> > CPU utilization increases a lot though, to 2x and even 3x; this will
> > depend on disk speed. While iopoll in swapin isn't intended for all use
> > cases, it's a win for latency-sensitive workloads with a high-speed swap
> > disk. The block layer has a knob to control polling at runtime. If
> > polling isn't enabled in the block layer, there should be no noticeable
> > change in swapin.
> >
> > The swapin readahead might read several pages in at the same time and
> > form a big IO request. Since that IO will take longer, it doesn't make
> > sense to poll, so the patch only does iopoll for single-page swapin.
> >
> > Cc: Tim Chen <tim.c.chen@intel.com>
> > Cc: Huang Ying <ying.huang@intel.com>
> > Cc: Jens Axboe <axboe@fb.com>
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> > include/linux/swap.h | 5 +++--
> > mm/madvise.c | 4 ++--
> > mm/page_io.c | 20 ++++++++++++++++++--
> > mm/swap_state.c | 10 ++++++----
> > mm/swapfile.c | 2 +-
> > 5 files changed, 30 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index ba58824..c589e6c 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -331,7 +331,7 @@ extern void kswapd_stop(int nid);
> > #include <linux/blk_types.h> /* for bio_end_io_t */
> >
> > /* linux/mm/page_io.c */
> > -extern int swap_readpage(struct page *);
> > +extern int swap_readpage(struct page *, bool do_poll);
> > extern int swap_writepage(struct page *page, struct writeback_control *wbc);
> > extern void end_swap_bio_write(struct bio *bio);
> > extern int __swap_writepage(struct page *page, struct writeback_control *wbc,
> > @@ -362,7 +362,8 @@ extern void free_page_and_swap_cache(struct page *);
> > extern void free_pages_and_swap_cache(struct page **, int);
> > extern struct page *lookup_swap_cache(swp_entry_t);
> > extern struct page *read_swap_cache_async(swp_entry_t, gfp_t,
> > - struct vm_area_struct *vma, unsigned long addr);
> > + struct vm_area_struct *vma, unsigned long addr,
> > + bool do_poll);
> > extern struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
> > struct vm_area_struct *vma, unsigned long addr,
> > bool *new_page_allocated);
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 25b78ee..8eda184 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -205,7 +205,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
> > continue;
> >
> > page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
> > - vma, index);
> > + vma, index, false);
> > if (page)
> > put_page(page);
> > }
> > @@ -246,7 +246,7 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
> > }
> > swap = radix_to_swp_entry(page);
> > page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
> > - NULL, 0);
> > + NULL, 0, false);
> > if (page)
> > put_page(page);
> > }
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index 23f6d0d..464cf16 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -117,6 +117,7 @@ static void swap_slot_free_notify(struct page *page)
> > static void end_swap_bio_read(struct bio *bio)
> > {
> > struct page *page = bio->bi_io_vec[0].bv_page;
> > + struct task_struct *waiter = bio->bi_private;
> >
> > if (bio->bi_error) {
> > SetPageError(page);
> > @@ -133,6 +134,7 @@ static void end_swap_bio_read(struct bio *bio)
> > out:
> > unlock_page(page);
> > bio_put(bio);
> > + wake_up_process(waiter);
> > }
> >
> > int generic_swapfile_activate(struct swap_info_struct *sis,
> > @@ -329,11 +331,13 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
> > return ret;
> > }
> >
> > -int swap_readpage(struct page *page)
> > +int swap_readpage(struct page *page, bool do_poll)
> > {
> > struct bio *bio;
> > int ret = 0;
> > struct swap_info_struct *sis = page_swap_info(page);
> > + blk_qc_t qc;
> > + struct block_device *bdev;
> >
> > VM_BUG_ON_PAGE(!PageSwapCache(page), page);
> > VM_BUG_ON_PAGE(!PageLocked(page), page);
> > @@ -372,9 +376,21 @@ int swap_readpage(struct page *page)
> > ret = -ENOMEM;
> > goto out;
> > }
> > + bdev = bio->bi_bdev;
> > + bio->bi_private = current;
> > bio_set_op_attrs(bio, REQ_OP_READ, 0);
> > count_vm_event(PSWPIN);
> > - submit_bio(bio);
> > + qc = submit_bio(bio);
> > + while (do_poll) {
> > + set_current_state(TASK_UNINTERRUPTIBLE);
> > + if (!PageLocked(page))
> > + break;
> > +
> > + if (!blk_mq_poll(bdev_get_queue(bdev), qc))
> > + break;
> > + }
> > + __set_current_state(TASK_RUNNING);
> > +
> > out:
> > return ret;
> > }
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 539b888..7c0a66c 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -404,14 +404,14 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > * the swap entry is no longer in use.
> > */
> > struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > - struct vm_area_struct *vma, unsigned long addr)
> > + struct vm_area_struct *vma, unsigned long addr, bool do_poll)
> > {
> > bool page_was_allocated;
> > struct page *retpage = __read_swap_cache_async(entry, gfp_mask,
> > vma, addr, &page_was_allocated);
> >
> > if (page_was_allocated)
> > - swap_readpage(retpage);
> > + swap_readpage(retpage, do_poll);
> >
> > return retpage;
> > }
> > @@ -488,11 +488,13 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > unsigned long start_offset, end_offset;
> > unsigned long mask;
> > struct blk_plug plug;
> > + bool do_poll = true;
> >
> > mask = swapin_nr_pages(offset) - 1;
> > if (!mask)
> > goto skip;
> >
> > + do_poll = false;
> > /* Read a page_cluster sized and aligned cluster around offset. */
> > start_offset = offset & ~mask;
> > end_offset = offset | mask;
> > @@ -503,7 +505,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > for (offset = start_offset; offset <= end_offset ; offset++) {
> > /* Ok, do the async read-ahead now */
> > page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> > - gfp_mask, vma, addr);
> > + gfp_mask, vma, addr, false);
> > if (!page)
> > continue;
> > if (offset != entry_offset)
> > @@ -514,7 +516,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >
> > lru_add_drain(); /* Push any new pages onto the LRU now */
> > skip:
> > - return read_swap_cache_async(entry, gfp_mask, vma, addr);
> > + return read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll);
> > }
> >
> > int init_swap_address_space(unsigned int type, unsigned long nr_pages)
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 4f6cba1..04516c1 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1719,7 +1719,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
> > swap_map = &si->swap_map[i];
> > entry = swp_entry(type, i);
> > page = read_swap_cache_async(entry,
> > - GFP_HIGHUSER_MOVABLE, NULL, 0);
> > + GFP_HIGHUSER_MOVABLE, NULL, 0, false);
> > if (!page) {
> > /*
> > * Either swap_duplicate() failed because entry
* Re: [PATCH] swap: add block io poll in swapin path
2017-05-05 5:12 ` Shaohua Li
@ 2017-05-05 6:02 ` Huang, Ying
2017-05-05 13:47 ` Jens Axboe
0 siblings, 1 reply; 10+ messages in thread
From: Huang, Ying @ 2017-05-05 6:02 UTC (permalink / raw)
To: Shaohua Li
Cc: Huang, Ying, linux-mm, Andrew Morton, Kernel-team, Tim Chen, Jens Axboe
Shaohua Li <shli@fb.com> writes:
> On Fri, May 05, 2017 at 11:07:54AM +0800, Huang, Ying wrote:
>> Hi, Shaohua,
>>
>> Shaohua Li <shli@fb.com> writes:
>>
>> > For fast flash disks, async IO can introduce overhead because of the
>> > context switch. block-mq now supports IO polling, which improves
>> > performance and latency a lot. Swapin is a good place to use this
>> > technique, because the task is waiting for the swapin page to continue
>> > execution.
>> >
>> > In my virtual machine, directly reading 4k of data from an NVMe device
>> > with iopoll is about 60% faster than without polling. With iopoll support
>> > in the swapin path, my microbenchmark (a task doing random memory writes)
>> > is about 10% ~ 25% faster.
>>
>> How many concurrent processes/threads were writing memory in your test?
>
> I tried 1 thread and 8 threads.
If we can measure the latency distribution during the test, it may help
us evaluate whether my question below is important or not.
pmbench (https://sourceforge.net/projects/pmbench/) may be helpful here.
>> In general, I think polling is a good way to reduce swap in latency for
>> high-speed NVMe disks.
>>
>> I have a question. If the load on the NVMe disk is high, for example if
>> quite some swap out occurs at the same time as swap in, the latency of
>> swap in may be much higher too. With such a high maximum swap in
>> latency, the overhead of polling may be high as well. For example, it
>> may not be an issue to poll for 10us, but it is more serious if we poll
>> for 500us or 1ms. Is there some way to resolve this? Can we set a
>> threshold for polling, so that if we poll for more than the threshold,
>> we go to sleep?
>
> Hybrid polling could help. By default, hybrid polling sleeps for half
> the average IO latency before it starts spinning. But it will not work
> very well if the latency becomes very large. Hybrid polling has an
> interface that allows userspace to configure the poll threshold, but
> since the latency varies over time, it would be very hard to set a
> single threshold for all workloads.
If my understanding is correct, hybrid polling inserts some sleep before
the polling, but does not restrict the duration of the polling itself.
This helps CPU usage, but may not help much for very long latencies. How
about adding another threshold to restrict the maximum polling time? For
example, the sleep time + max polling time could be 1.5 * mean latency.
That way most IO requests could be serviced by polling, and for very long
latencies, polling could be restricted to reduce CPU usage.
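Just to illustrate, on top of the poll loop in this patch it could look
like the following (a sketch only; max_poll_ns would come from the
threshold above):

	u64 deadline = ktime_get_ns() + max_poll_ns;

	while (do_poll) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (!PageLocked(page))
			break;
		if (ktime_get_ns() > deadline) {
			/* give up polling; the irq path will wake us up */
			io_schedule();
			break;
		}
		if (!blk_mq_poll(bdev_get_queue(bdev), qc))
			break;
	}
	__set_current_state(TASK_RUNNING);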
Best Regards,
Huang, Ying
> Thanks,
> Shaohua
>
>>
>> Best Regards,
>> Huang, Ying
>>
>> > CPU utilization increases a lot though, to 2x and even 3x; this will
>> > depend on disk speed. While iopoll in swapin isn't intended for all use
>> > cases, it's a win for latency-sensitive workloads with a high-speed swap
>> > disk. The block layer has a knob to control polling at runtime. If
>> > polling isn't enabled in the block layer, there should be no noticeable
>> > change in swapin.
>> >
>> > The swapin readahead might read several pages in at the same time and
>> > form a big IO request. Since that IO will take longer, it doesn't make
>> > sense to poll, so the patch only does iopoll for single-page swapin.
>> >
>> > Cc: Tim Chen <tim.c.chen@intel.com>
>> > Cc: Huang Ying <ying.huang@intel.com>
>> > Cc: Jens Axboe <axboe@fb.com>
>> > Signed-off-by: Shaohua Li <shli@fb.com>
>> > ---
>> > include/linux/swap.h | 5 +++--
>> > mm/madvise.c | 4 ++--
>> > mm/page_io.c | 20 ++++++++++++++++++--
>> > mm/swap_state.c | 10 ++++++----
>> > mm/swapfile.c | 2 +-
>> > 5 files changed, 30 insertions(+), 11 deletions(-)
>> >
>> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> > index ba58824..c589e6c 100644
>> > --- a/include/linux/swap.h
>> > +++ b/include/linux/swap.h
>> > @@ -331,7 +331,7 @@ extern void kswapd_stop(int nid);
>> > #include <linux/blk_types.h> /* for bio_end_io_t */
>> >
>> > /* linux/mm/page_io.c */
>> > -extern int swap_readpage(struct page *);
>> > +extern int swap_readpage(struct page *, bool do_poll);
>> > extern int swap_writepage(struct page *page, struct writeback_control *wbc);
>> > extern void end_swap_bio_write(struct bio *bio);
>> > extern int __swap_writepage(struct page *page, struct writeback_control *wbc,
>> > @@ -362,7 +362,8 @@ extern void free_page_and_swap_cache(struct page *);
>> > extern void free_pages_and_swap_cache(struct page **, int);
>> > extern struct page *lookup_swap_cache(swp_entry_t);
>> > extern struct page *read_swap_cache_async(swp_entry_t, gfp_t,
>> > - struct vm_area_struct *vma, unsigned long addr);
>> > + struct vm_area_struct *vma, unsigned long addr,
>> > + bool do_poll);
>> > extern struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
>> > struct vm_area_struct *vma, unsigned long addr,
>> > bool *new_page_allocated);
>> > diff --git a/mm/madvise.c b/mm/madvise.c
>> > index 25b78ee..8eda184 100644
>> > --- a/mm/madvise.c
>> > +++ b/mm/madvise.c
>> > @@ -205,7 +205,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
>> > continue;
>> >
>> > page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
>> > - vma, index);
>> > + vma, index, false);
>> > if (page)
>> > put_page(page);
>> > }
>> > @@ -246,7 +246,7 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
>> > }
>> > swap = radix_to_swp_entry(page);
>> > page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
>> > - NULL, 0);
>> > + NULL, 0, false);
>> > if (page)
>> > put_page(page);
>> > }
>> > diff --git a/mm/page_io.c b/mm/page_io.c
>> > index 23f6d0d..464cf16 100644
>> > --- a/mm/page_io.c
>> > +++ b/mm/page_io.c
>> > @@ -117,6 +117,7 @@ static void swap_slot_free_notify(struct page *page)
>> > static void end_swap_bio_read(struct bio *bio)
>> > {
>> > struct page *page = bio->bi_io_vec[0].bv_page;
>> > + struct task_struct *waiter = bio->bi_private;
>> >
>> > if (bio->bi_error) {
>> > SetPageError(page);
>> > @@ -133,6 +134,7 @@ static void end_swap_bio_read(struct bio *bio)
>> > out:
>> > unlock_page(page);
>> > bio_put(bio);
>> > + wake_up_process(waiter);
>> > }
>> >
>> > int generic_swapfile_activate(struct swap_info_struct *sis,
>> > @@ -329,11 +331,13 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
>> > return ret;
>> > }
>> >
>> > -int swap_readpage(struct page *page)
>> > +int swap_readpage(struct page *page, bool do_poll)
>> > {
>> > struct bio *bio;
>> > int ret = 0;
>> > struct swap_info_struct *sis = page_swap_info(page);
>> > + blk_qc_t qc;
>> > + struct block_device *bdev;
>> >
>> > VM_BUG_ON_PAGE(!PageSwapCache(page), page);
>> > VM_BUG_ON_PAGE(!PageLocked(page), page);
>> > @@ -372,9 +376,21 @@ int swap_readpage(struct page *page)
>> > ret = -ENOMEM;
>> > goto out;
>> > }
>> > + bdev = bio->bi_bdev;
>> > + bio->bi_private = current;
>> > bio_set_op_attrs(bio, REQ_OP_READ, 0);
>> > count_vm_event(PSWPIN);
>> > - submit_bio(bio);
>> > + qc = submit_bio(bio);
>> > + while (do_poll) {
>> > + set_current_state(TASK_UNINTERRUPTIBLE);
>> > + if (!PageLocked(page))
>> > + break;
>> > +
>> > + if (!blk_mq_poll(bdev_get_queue(bdev), qc))
>> > + break;
>> > + }
>> > + __set_current_state(TASK_RUNNING);
>> > +
>> > out:
>> > return ret;
>> > }
>> > diff --git a/mm/swap_state.c b/mm/swap_state.c
>> > index 539b888..7c0a66c 100644
>> > --- a/mm/swap_state.c
>> > +++ b/mm/swap_state.c
>> > @@ -404,14 +404,14 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>> > * the swap entry is no longer in use.
>> > */
>> > struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>> > - struct vm_area_struct *vma, unsigned long addr)
>> > + struct vm_area_struct *vma, unsigned long addr, bool do_poll)
>> > {
>> > bool page_was_allocated;
>> > struct page *retpage = __read_swap_cache_async(entry, gfp_mask,
>> > vma, addr, &page_was_allocated);
>> >
>> > if (page_was_allocated)
>> > - swap_readpage(retpage);
>> > + swap_readpage(retpage, do_poll);
>> >
>> > return retpage;
>> > }
>> > @@ -488,11 +488,13 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>> > unsigned long start_offset, end_offset;
>> > unsigned long mask;
>> > struct blk_plug plug;
>> > + bool do_poll = true;
>> >
>> > mask = swapin_nr_pages(offset) - 1;
>> > if (!mask)
>> > goto skip;
>> >
>> > + do_poll = false;
>> > /* Read a page_cluster sized and aligned cluster around offset. */
>> > start_offset = offset & ~mask;
>> > end_offset = offset | mask;
>> > @@ -503,7 +505,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>> > for (offset = start_offset; offset <= end_offset ; offset++) {
>> > /* Ok, do the async read-ahead now */
>> > page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
>> > - gfp_mask, vma, addr);
>> > + gfp_mask, vma, addr, false);
>> > if (!page)
>> > continue;
>> > if (offset != entry_offset)
>> > @@ -514,7 +516,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>> >
>> > lru_add_drain(); /* Push any new pages onto the LRU now */
>> > skip:
>> > - return read_swap_cache_async(entry, gfp_mask, vma, addr);
>> > + return read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll);
>> > }
>> >
>> > int init_swap_address_space(unsigned int type, unsigned long nr_pages)
>> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> > index 4f6cba1..04516c1 100644
>> > --- a/mm/swapfile.c
>> > +++ b/mm/swapfile.c
>> > @@ -1719,7 +1719,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
>> > swap_map = &si->swap_map[i];
>> > entry = swp_entry(type, i);
>> > page = read_swap_cache_async(entry,
>> > - GFP_HIGHUSER_MOVABLE, NULL, 0);
>> > + GFP_HIGHUSER_MOVABLE, NULL, 0, false);
>> > if (!page) {
>> > /*
>> > * Either swap_duplicate() failed because entry
* Re: [PATCH] swap: add block io poll in swapin path
2017-05-05 6:02 ` Huang, Ying
@ 2017-05-05 13:47 ` Jens Axboe
0 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2017-05-05 13:47 UTC (permalink / raw)
To: Huang, Ying, Shaohua Li; +Cc: linux-mm, Andrew Morton, Kernel-team, Tim Chen
On 05/05/2017 12:02 AM, Huang, Ying wrote:
>> Hybrid polling could help. By default, hybrid polling sleeps for half
>> the average IO latency before it starts spinning. But it will not work
>> very well if the latency becomes very large. Hybrid polling has an
>> interface that allows userspace to configure the poll threshold, but
>> since the latency varies over time, it would be very hard to set a
>> single threshold for all workloads.
>
> If my understanding is correct, hybrid polling inserts some sleep before
> the polling, but does not restrict the duration of the polling itself.
> This helps CPU usage, but may not help much for very long latencies. How
> about adding another threshold to restrict the maximum polling time? For
> example, the sleep time + max polling time could be 1.5 * mean latency.
> That way most IO requests could be serviced by polling, and for very
> long latencies, polling could be restricted to reduce CPU usage.
I don't think that's a bad idea at all; there's definitely room for
improvement in how long to sleep and when to stop completely. The stats
track min/avg/max for a given window of time, so it would not be too
hard to implement an appropriate backoff as well.
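Something along these lines for the sleep-time calculation (illustrative
only; both helpers are hypothetical stand-ins for the blk-mq stats):

static u64 hybrid_sleep_ns(struct request_queue *q)
{
	u64 half_mean = mean_completion_ns(q) / 2;
	u64 overhead = sleep_wakeup_cost_ns(q);	/* measured sleep+wakeup cost */

	/* don't oversleep: leave room for the wakeup itself */
	return half_mean > overhead ? half_mean - overhead : 0;
}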
--
Jens Axboe