linux-mm.kvack.org archive mirror
From: Minchan Kim <minchan@kernel.org>
To: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>, Nitin Gupta <ngupta@vflare.org>,
	Jerome Marchand <jmarchan@redhat.com>,
	Ganesh Mahendran <opensource.ganesh@gmail.com>
Subject: Re: [PATCH v1 2/2] zram: remove init_lock in zram_make_request
Date: Fri, 30 Jan 2015 23:41:56 +0900	[thread overview]
Message-ID: <20150130144145.GA2840@blaptop> (raw)
In-Reply-To: <20150129070835.GD2555@swordfish>

Hello, Sergey

On Thu, Jan 29, 2015 at 04:08:35PM +0900, Sergey Senozhatsky wrote:
> On (01/29/15 15:35), Minchan Kim wrote:
> >
> > As you told, the data was not stable.
> >
> yes. fread test was always slower, and the rest was mostly slower.
> 
> 
> > Anyway, when I read down_read implementation, it's one atomic instruction.
> > Hmm, it seems to be better than srcu_read_lock, which does more things.
> >
> srcu looks heavier, agree.

ffffffff8172c350 <down_read>:
ffffffff8172c350:       e8 7b 3f 00 00          callq  ffffffff817302d0 <__fentry__>
ffffffff8172c355:       55                      push   %rbp
ffffffff8172c356:       48 89 e5                mov    %rsp,%rbp
ffffffff8172c359:       53                      push   %rbx
ffffffff8172c35a:       48 89 fb                mov    %rdi,%rbx
ffffffff8172c35d:       48 83 ec 08             sub    $0x8,%rsp
ffffffff8172c361:       e8 9a e0 ff ff          callq  ffffffff8172a400 <_cond_resched>
ffffffff8172c366:       48 89 d8                mov    %rbx,%rax
ffffffff8172c369:       f0 48 ff 00             lock incq (%rax)
ffffffff8172c36d:       79 05                   jns    ffffffff8172c374 <down_read+0x24>
ffffffff8172c36f:       e8 5c e7 c4 ff          callq  ffffffff8137aad0 <call_rwsem_down_read_failed>
ffffffff8172c374:       48 83 c4 08             add    $0x8,%rsp
ffffffff8172c378:       5b                      pop    %rbx
ffffffff8172c379:       5d                      pop    %rbp
ffffffff8172c37a:       c3                      retq   


ffffffff810eeec0 <__srcu_read_lock>:
ffffffff810eeec0:       e8 0b 14 64 00          callq  ffffffff817302d0 <__fentry__>
ffffffff810eeec5:       48 8b 07                mov    (%rdi),%rax
ffffffff810eeec8:       55                      push   %rbp
ffffffff810eeec9:       48 89 e5                mov    %rsp,%rbp
ffffffff810eeecc:       83 e0 01                and    $0x1,%eax
ffffffff810eeecf:       48 63 d0                movslq %eax,%rdx
ffffffff810eeed2:       48 8b 4f 08             mov    0x8(%rdi),%rcx
ffffffff810eeed6:       65 48 ff 04 d1          incq   %gs:(%rcx,%rdx,8)
ffffffff810eeedb:       0f ae f0                mfence 
ffffffff810eeede:       48 83 c2 02             add    $0x2,%rdx
ffffffff810eeee2:       48 8b 4f 08             mov    0x8(%rdi),%rcx
ffffffff810eeee6:       65 48 ff 04 d1          incq   %gs:(%rcx,%rdx,8)
ffffffff810eeeeb:       5d                      pop    %rbp
ffffffff810eeeec:       c3                      retq   

Yes, __srcu_read_lock is a little heavier, but the difference in instruction
count is nowhere near enough to explain a 10% gap. Another suspect is
_cond_resched, but I don't think that's it either: our test was CPU-intensive,
so I don't think scheduling latency affects total bandwidth.

A more likely culprit is your data pattern.
It seems you didn't use scramble_buffers=0 and zero_buffers in fio, so fio
fills the buffers with random data, and zram bandwidth then varies with the
compression/decompression ratio.
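
For reference, the relevant knobs look like this in a fio job file (a
sketch, not your exact job; the filename, job name, and sizes here are
illustrative):

```ini
; With scramble_buffers=0 and zero_buffers, fio submits all-zero I/O
; buffers, so the zram compression ratio is fixed and runs are comparable.
[zram-test]
filename=/dev/zram0
rw=randread
size=1g
scramble_buffers=0
zero_buffers
```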

I ran your fio script with the above options added on my real 4-CPU machine.
(Note: the fio shipped with Ubuntu is old and doesn't handle those two
options well, so I had to update to a recent fio, which fixed it.)

Another fio quirk: the loops option seems to work for the write test with
overwrite=1 but not for the read test, so I used perf stat -r to check the
stddev.

In addition, I skipped the first run to avoid the noise of file creation,
and increased the test size from 400m to 1G.
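
The invocation is essentially this (a sketch; -r 10 repeats the whole run
ten times and prints the mean and stddev for each event):

```sh
perf stat -r 10 fio test-fio-randread.txt
```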

1) randread

= vanilla =

 Performance counter stats for 'fio test-fio-randread.txt' (10 runs):

       4713.879241      task-clock (msec)         #    3.160 CPUs utilized            ( +-  0.62% )
             1,131      context-switches          #    0.240 K/sec                    ( +-  2.83% )
                23      cpu-migrations            #    0.005 K/sec                    ( +-  4.40% )
            15,767      page-faults               #    0.003 M/sec                    ( +-  0.03% )
    15,134,497,088      cycles                    #    3.211 GHz                      ( +-  0.15% ) [83.36%]
    10,763,665,604      stalled-cycles-frontend   #   71.12% frontend cycles idle     ( +-  0.22% ) [83.34%]
     6,896,294,076      stalled-cycles-backend    #   45.57% backend  cycles idle     ( +-  0.29% ) [66.67%]
     9,898,608,791      instructions              #    0.65  insns per cycle        
                                                  #    1.09  stalled cycles per insn  ( +-  0.07% ) [83.33%]
     1,852,167,485      branches                  #  392.918 M/sec                    ( +-  0.07% ) [83.34%]
        14,864,143      branch-misses             #    0.80% of all branches          ( +-  0.16% ) [83.34%]

       1.491813361 seconds time elapsed                                          ( +-  0.62% )

= srcu =

 Performance counter stats for 'fio test-fio-randread.txt' (10 runs):

       4752.790715      task-clock (msec)         #    3.166 CPUs utilized            ( +-  0.48% )
             1,179      context-switches          #    0.248 K/sec                    ( +-  1.56% )
                26      cpu-migrations            #    0.005 K/sec                    ( +-  3.91% )
            15,764      page-faults               #    0.003 M/sec                    ( +-  0.02% )
    15,263,869,915      cycles                    #    3.212 GHz                      ( +-  0.25% ) [83.32%]
    10,935,658,177      stalled-cycles-frontend   #   71.64% frontend cycles idle     ( +-  0.38% ) [83.33%]
     7,067,290,320      stalled-cycles-backend    #   46.30% backend  cycles idle     ( +-  0.46% ) [66.64%]
     9,896,513,423      instructions              #    0.65  insns per cycle        
                                                  #    1.11  stalled cycles per insn  ( +-  0.07% ) [83.33%]
     1,847,612,285      branches                  #  388.743 M/sec                    ( +-  0.07% ) [83.38%]
        14,814,815      branch-misses             #    0.80% of all branches          ( +-  0.24% ) [83.37%]

       1.501284082 seconds time elapsed                                          ( +-  0.50% )

srcu is 0.63% worse here, but the difference is really marginal.

2) randwrite

= vanilla =

 Performance counter stats for 'fio test-fio-randwrite.txt' (10 runs):

       6283.823490      task-clock (msec)         #    3.332 CPUs utilized            ( +-  0.44% )
             1,536      context-switches          #    0.245 K/sec                    ( +-  2.10% )
                25      cpu-migrations            #    0.004 K/sec                    ( +-  3.79% )
            15,914      page-faults               #    0.003 M/sec                    ( +-  0.02% )
    20,408,942,915      cycles                    #    3.248 GHz                      ( +-  0.40% ) [83.34%]
    14,398,424,739      stalled-cycles-frontend   #   70.55% frontend cycles idle     ( +-  0.62% ) [83.36%]
     9,513,822,555      stalled-cycles-backend    #   46.62% backend  cycles idle     ( +-  0.62% ) [66.65%]
    13,507,376,783      instructions              #    0.66  insns per cycle        
                                                  #    1.07  stalled cycles per insn  ( +-  0.05% ) [83.36%]
     3,155,423,934      branches                  #  502.150 M/sec                    ( +-  0.05% ) [83.34%]
        18,381,090      branch-misses             #    0.58% of all branches          ( +-  0.16% ) [83.34%]

       1.885926070 seconds time elapsed                                          ( +-  0.61% )

= srcu =

 Performance counter stats for 'fio test-fio-randwrite.txt' (10 runs):

       6152.997119      task-clock (msec)         #    3.304 CPUs utilized            ( +-  0.29% )
             1,422      context-switches          #    0.231 K/sec                    ( +-  3.45% )
                28      cpu-migrations            #    0.004 K/sec                    ( +-  7.47% )
            15,921      page-faults               #    0.003 M/sec                    ( +-  0.02% )
    19,862,315,430      cycles                    #    3.228 GHz                      ( +-  0.09% ) [83.33%]
    13,872,541,761      stalled-cycles-frontend   #   69.84% frontend cycles idle     ( +-  0.12% ) [83.34%]
     9,074,883,552      stalled-cycles-backend    #   45.69% backend  cycles idle     ( +-  0.19% ) [66.71%]
    13,494,854,651      instructions              #    0.68  insns per cycle        
                                                  #    1.03  stalled cycles per insn  ( +-  0.03% ) [83.37%]
     3,148,938,955      branches                  #  511.773 M/sec                    ( +-  0.04% ) [83.33%]
        17,701,249      branch-misses             #    0.56% of all branches          ( +-  0.23% ) [83.34%]

       1.862543230 seconds time elapsed                                          ( +-  0.35% )

srcu is 1.24% better.

3) randrw

= vanilla =

 Performance counter stats for 'fio test-fio-randrw.txt' (10 runs):

       5609.976477      task-clock (msec)         #    3.249 CPUs utilized            ( +-  0.34% )
             1,407      context-switches          #    0.251 K/sec                    ( +-  0.96% )
                25      cpu-migrations            #    0.004 K/sec                    ( +-  5.37% )
            15,906      page-faults               #    0.003 M/sec                    ( +-  0.05% )
    18,090,560,346      cycles                    #    3.225 GHz                      ( +-  0.35% ) [83.36%]
    12,885,393,954      stalled-cycles-frontend   #   71.23% frontend cycles idle     ( +-  0.53% ) [83.33%]
     8,570,185,547      stalled-cycles-backend    #   47.37% backend  cycles idle     ( +-  0.59% ) [66.67%]
    11,771,620,352      instructions              #    0.65  insns per cycle        
                                                  #    1.09  stalled cycles per insn  ( +-  0.05% ) [83.35%]
     2,508,014,871      branches                  #  447.063 M/sec                    ( +-  0.05% ) [83.34%]
        18,585,638      branch-misses             #    0.74% of all branches          ( +-  0.23% ) [83.35%]

       1.726691239 seconds time elapsed                                          ( +-  0.40% )

= srcu =

 Performance counter stats for 'fio test-fio-randrw.txt' (10 runs):

       5475.312828      task-clock (msec)         #    3.246 CPUs utilized            ( +-  0.59% )
             1,399      context-switches          #    0.255 K/sec                    ( +-  1.46% )
                24      cpu-migrations            #    0.004 K/sec                    ( +-  6.27% )
            15,916      page-faults               #    0.003 M/sec                    ( +-  0.04% )
    17,583,197,041      cycles                    #    3.211 GHz                      ( +-  0.11% ) [83.33%]
    12,352,657,985      stalled-cycles-frontend   #   70.25% frontend cycles idle     ( +-  0.16% ) [83.33%]
     8,173,164,212      stalled-cycles-backend    #   46.48% backend  cycles idle     ( +-  0.19% ) [66.70%]
    11,780,176,340      instructions              #    0.67  insns per cycle        
                                                  #    1.05  stalled cycles per insn  ( +-  0.05% ) [83.36%]
     2,506,722,383      branches                  #  457.823 M/sec                    ( +-  0.06% ) [83.35%]
        18,436,877      branch-misses             #    0.74% of all branches          ( +-  0.18% ) [83.32%]

       1.686877512 seconds time elapsed                                          ( +-  0.43% )

srcu is 2.3% better.
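
For the record, the percentages above follow directly from the mean elapsed
times; a quick Python check (numbers copied from the perf stat output above):

```python
# Mean elapsed times in seconds, copied from the perf stat runs above.
times = {
    "randread":  {"vanilla": 1.491813361, "srcu": 1.501284082},
    "randwrite": {"vanilla": 1.885926070, "srcu": 1.862543230},
    "randrw":    {"vanilla": 1.726691239, "srcu": 1.686877512},
}

def srcu_delta_pct(t):
    # Positive: srcu took longer (worse); negative: srcu was faster (better).
    return (t["srcu"] - t["vanilla"]) / t["vanilla"] * 100

for name, t in times.items():
    print(f"{name}: srcu {srcu_delta_pct(t):+.2f}% elapsed time vs vanilla")
```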

srcu beats down_read overall, but I don't trust either result: in perf
record, [up|down]_read and srcu_read_[lock|unlock] are really minor (about
0.5% of samples), so I think the difference is marginal. (For example, even
if we removed srcu_read_[un]lock entirely, we would gain only about 1%.)
So I don't think it's worth it.

Okay, if you still have concerns about the data, how about this?
It would use even fewer instructions than [up|down]_read, so I guess it
would remove your performance concern. But I don't believe it would make a
significant difference, either.
Hope it addresses your concern.

Thanks.


Thread overview: 47+ messages
2015-01-28  8:15 [PATCH 1/2] zram: free meta table in zram_meta_free Minchan Kim
2015-01-28  8:15 ` [PATCH v1 2/2] zram: remove init_lock in zram_make_request Minchan Kim
2015-01-28 14:56   ` Sergey Senozhatsky
2015-01-28 15:04     ` Sergey Senozhatsky
2015-01-28 23:33     ` Minchan Kim
2015-01-29  1:57       ` Sergey Senozhatsky
2015-01-29  2:01         ` Minchan Kim
2015-01-29  2:22           ` Sergey Senozhatsky
2015-01-29  5:28             ` Minchan Kim
2015-01-29  6:06               ` Sergey Senozhatsky
2015-01-29  6:35                 ` Minchan Kim
2015-01-29  7:08                   ` Sergey Senozhatsky
2015-01-30 14:41                     ` Minchan Kim [this message]
2015-01-31 11:31                       ` Sergey Senozhatsky
2015-02-01 14:50                       ` Sergey Senozhatsky
2015-02-01 15:04                         ` Sergey Senozhatsky
2015-02-02  1:43                           ` Minchan Kim
2015-02-02  1:59                             ` Sergey Senozhatsky
2015-02-02  2:45                               ` Minchan Kim
2015-02-02  3:47                                 ` Sergey Senozhatsky
2015-02-02  1:30                         ` Minchan Kim
2015-02-02  1:48                           ` Sergey Senozhatsky
2015-02-02  2:44                             ` Minchan Kim
2015-02-02  4:01                               ` Sergey Senozhatsky
2015-02-02  4:28                                 ` Minchan Kim
2015-02-02  5:09                                   ` Sergey Senozhatsky
2015-02-02  5:18                                     ` Minchan Kim
2015-02-02  5:28                                       ` Sergey Senozhatsky
2015-02-02  5:10                                   ` Minchan Kim
2015-01-30  0:20                   ` Sergey Senozhatsky
2015-01-29 13:48   ` Ganesh Mahendran
2015-01-29 15:12     ` Sergey Senozhatsky
2015-01-30  7:52       ` Ganesh Mahendran
2015-01-30  8:08         ` Sergey Senozhatsky
2015-01-31  8:50           ` Ganesh Mahendran
2015-01-31 11:07             ` Sergey Senozhatsky
2015-01-31 12:59               ` Ganesh Mahendran
2015-01-28 14:19 ` [PATCH 1/2] zram: free meta table in zram_meta_free Sergey Senozhatsky
2015-01-28 23:17   ` Minchan Kim
2015-01-29  1:49     ` Ganesh Mahendran
2015-02-02  3:41 [PATCH v1 2/2] zram: remove init_lock in zram_make_request Minchan Kim
2015-02-02  5:59 ` Sergey Senozhatsky
2015-02-02  6:18   ` Minchan Kim
2015-02-02  7:06     ` Sergey Senozhatsky
2015-02-03  1:54       ` Sergey Senozhatsky
2015-02-03  3:02         ` Minchan Kim
2015-02-03  3:56           ` Sergey Senozhatsky
