LKML Archive on lore.kernel.org
* SMP performance degradation with sysbench
From: Lorenzo Allegrucci @ 2007-02-25 17:44 UTC
To: linux-kernel; +Cc: Ingo Molnar, Suparna Bhattacharya, Jens Axboe

Hi lkml,

according to the test below (sysbench) Linux seems to have scalability
problems beyond 8 client threads:
http://jeffr-tech.livejournal.com/6268.html#cutid1
http://jeffr-tech.livejournal.com/5705.html
Hardware is an 8-core amd64 system and jeffr seems willing to try more
Linux versions on that machine.
Anyway, is there anyone who can reproduce this?
* Re: SMP performance degradation with sysbench
From: Rik van Riel @ 2007-02-25 23:46 UTC
To: Lorenzo Allegrucci
Cc: linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe

Lorenzo Allegrucci wrote:
> Hi lkml,
>
> according to the test below (sysbench) Linux seems to have scalability
> problems beyond 8 client threads:
> http://jeffr-tech.livejournal.com/6268.html#cutid1
> http://jeffr-tech.livejournal.com/5705.html
> Hardware is an 8-core amd64 system and jeffr seems willing to try more
> Linux versions on that machine.
> Anyway, is there anyone who can reproduce this?

I have reproduced it on a quad core test system.

With 4 threads (on 4 cores) I get a high throughput, with
approximately 58% user time and 42% system time.

With 8 threads (on 4 cores) I get way lower throughput,
with 37% user time, 29% system time and 35% idle time!

The maximum time taken per query also increases from
0.0096s to 0.5273s. Ouch!

I don't know if this is MySQL, glibc or the Linux kernel,
but something strange is going on...

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
* Re: SMP performance degradation with sysbench
From: Nick Piggin @ 2007-02-26 13:36 UTC
To: Rik van Riel
Cc: Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe

Rik van Riel wrote:
> Lorenzo Allegrucci wrote:
>
>> Hi lkml,
>>
>> according to the test below (sysbench) Linux seems to have scalability
>> problems beyond 8 client threads:
>> http://jeffr-tech.livejournal.com/6268.html#cutid1
>> http://jeffr-tech.livejournal.com/5705.html
>> Hardware is an 8-core amd64 system and jeffr seems willing to try more
>> Linux versions on that machine.
>> Anyway, is there anyone who can reproduce this?
>
> I have reproduced it on a quad core test system.
>
> With 4 threads (on 4 cores) I get a high throughput, with
> approximately 58% user time and 42% system time.
>
> With 8 threads (on 4 cores) I get way lower throughput,
> with 37% user time, 29% system time 35% idle time!
>
> The maximum time taken per query also increases from
> 0.0096s to 0.5273s. Ouch!
>
> I don't know if this is MySQL, glibc or Linux kernel,
> but something strange is going on...

Like you, I'm also seeing idle time start going up as threads increase.

I initially thought this was a problem with the multiprocessor scheduler,
because the pattern is exactly like some artificat in the load balancing.

However, after looking at the stats, and testing a couple of things, I
think it may not be after all.

I've reproduced this on an 8-socket/16-way dual core Opteron. So far what
I am seeing is that MySQL is having trouble putting enough load into the
scheduler. Virtually all of the sleep time is coming from
unix_stream_recvmsg, which seems to be how the client and server threads
communicate. There doesn't seem to be any other tell-tale event that the
database is blocking on. It seems like it might at least partially be a
problem with MySQL thread/connection management.

I found a couple of interesting issues so far.

Firstly, the MySQL version that I'm using (5.0.26-Max) is making lots of
calls to sched_setscheduler, attempting to fiddle with SCHED_OTHER
priority in what looks like an attempt to boost CPU time while holding
some resource. All these calls actually fail, because you cannot change
SCHED_OTHER priority like that. Adding a hack to make it fall through to
set_user_nice provides a boost which eliminates the cliff (but a downward
degradation is still there).

Secondly, I've raised the thread numbers from 16 to 32 for my system,
which also provides a bit more (although it doesn't help the downward
slope).

Combined, it looks like around 30-40% improvement past 16 threads. It
isn't anything like making up for the dropoff seen in the blog link, but
different systems, different mysql version... I wonder how close we are
with this hack in place?

Attached is a graph of my numbers, from 1 to 32 clients. plain = 2.6.20.1,
sched is with the attached sched patch, and thread is with 32 rather than
16 clients.

Anyway, I'll keep experimenting. If anyone from MySQL wants to help look
at this, send me a mail (eg. especially with the sched_setscheduler issue,
you might be able to do something better).

Nick

--
SUSE Labs, Novell Inc.
[-- Attachment #2: graph.png, image/png (throughput numbers from 1 to 32 clients) --]

[-- Attachment #3: mysql-hack.patch --]

--- kernel/sched.c.orig	2007-02-26 11:46:46.849841000 +0100
+++ kernel/sched.c	2007-02-26 12:04:09.283056000 +0100
@@ -4227,8 +4227,6 @@ recheck:
 	    (p->mm && param->sched_priority > MAX_USER_RT_PRIO-1) ||
 	    (!p->mm && param->sched_priority > MAX_RT_PRIO-1))
 		return -EINVAL;
-	if (is_rt_policy(policy) != (param->sched_priority != 0))
-		return -EINVAL;
 
 	/*
 	 * Allow unprivileged RT tasks to decrease priority:
@@ -4302,6 +4300,13 @@ recheck:
 
 	rt_mutex_adjust_pi(p);
 
+	if (!is_rt_policy(policy)) {
+		if (param->sched_priority == 8)
+			set_user_nice(p, -20);
+		else
+			set_user_nice(p, param->sched_priority-6);
+	}
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(sched_setscheduler);
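The failing calls Nick describes are easy to demonstrate outside MySQL: for SCHED_OTHER, Linux only accepts a sched_priority of 0, so a nonzero value gets EINVAL, which is exactly the check the hack above removes. Below is a minimal sketch of that behaviour and of the setpriority() fallback the patch emulates in the kernel; it is not from the thread, and the priority value 8 is just the one the patch special-cases.

/*
 * Illustrative only: show that SCHED_OTHER with a nonzero priority is
 * rejected, and that nice levels are changed via setpriority() instead.
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sched.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void)
{
	struct sched_param sp = { .sched_priority = 8 };

	/* Roughly what the 5.0.26-Max binary appears to attempt. */
	if (sched_setscheduler(0, SCHED_OTHER, &sp) == -1)
		printf("sched_setscheduler(SCHED_OTHER, prio=8): %s\n",
		       strerror(errno));	/* EINVAL on Linux */

	/* The effect the hack fakes up in the kernel: adjust nice instead. */
	if (setpriority(PRIO_PROCESS, 0, -20) == -1)
		printf("setpriority(-20): %s (raising priority needs root)\n",
		       strerror(errno));

	return 0;
}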
* Re: SMP performance degradation with sysbench 2007-02-26 13:36 ` Nick Piggin @ 2007-02-26 13:41 ` Nick Piggin 2007-02-26 22:04 ` Pete Harlan ` (2 subsequent siblings) 3 siblings, 0 replies; 45+ messages in thread From: Nick Piggin @ 2007-02-26 13:41 UTC (permalink / raw) To: Nick Piggin Cc: Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe Nick Piggin wrote: > Rik van Riel wrote: > >> Lorenzo Allegrucci wrote: >> >>> Hi lkml, >>> >>> according to the test below (sysbench) Linux seems to have scalability >>> problems beyond 8 client threads: >>> http://jeffr-tech.livejournal.com/6268.html#cutid1 >>> http://jeffr-tech.livejournal.com/5705.html >>> Hardware is an 8-core amd64 system and jeffr seems willing to try more >>> Linux versions on that machine. >>> Anyway, is there anyone who can reproduce this? >> >> >> >> I have reproduced it on a quad core test system. >> >> With 4 threads (on 4 cores) I get a high throughput, with >> approximately 58% user time and 42% system time. >> >> With 8 threads (on 4 cores) I get way lower throughput, >> with 37% user time, 29% system time 35% idle time! >> >> The maximum time taken per query also increases from >> 0.0096s to 0.5273s. Ouch! >> >> I don't know if this is MySQL, glibc or Linux kernel, >> but something strange is going on... > > > Like you, I'm also seeing idle time start going up as threads increase. > > I initially thought this was a problem with the multiprocessor scheduler, > because the pattern is exactly like some artificat in the load balancing. "artificat" Wow. I must need some sleep :) Please excuse any other typos! -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench 2007-02-26 13:36 ` Nick Piggin 2007-02-26 13:41 ` Nick Piggin @ 2007-02-26 22:04 ` Pete Harlan 2007-02-26 22:36 ` Dave Jones 2007-02-28 1:27 ` Nish Aravamudan 2007-03-12 22:00 ` Anton Blanchard 3 siblings, 1 reply; 45+ messages in thread From: Pete Harlan @ 2007-02-26 22:04 UTC (permalink / raw) To: Nick Piggin Cc: Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe On Tue, Feb 27, 2007 at 12:36:04AM +1100, Nick Piggin wrote: > I found a couple of interesting issues so far. Firstly, the MySQL > version that I'm using (5.0.26-Max) is making lots of calls to FYI, MySQL fixed some scalability problems in version 5.0.30, as mentioned here: http://www.mysqlperformanceblog.com/2007/01/03/innodb-benchmarks/ It may be worth using more recent sources than 5.0.26 if tracking down scaling problems in MySQL. --Pete ---------------------------------- Pete Harlan ArtSelect, Inc. harlan@artselect.com http://www.artselect.com ArtSelect is a subsidiary of a21, Inc. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench 2007-02-26 22:04 ` Pete Harlan @ 2007-02-26 22:36 ` Dave Jones 2007-02-27 0:32 ` Hiro Yoshioka 0 siblings, 1 reply; 45+ messages in thread From: Dave Jones @ 2007-02-26 22:36 UTC (permalink / raw) To: Pete Harlan Cc: Nick Piggin, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe On Mon, Feb 26, 2007 at 04:04:01PM -0600, Pete Harlan wrote: > On Tue, Feb 27, 2007 at 12:36:04AM +1100, Nick Piggin wrote: > > I found a couple of interesting issues so far. Firstly, the MySQL > > version that I'm using (5.0.26-Max) is making lots of calls to > > FYI, MySQL fixed some scalability problems in version 5.0.30, as > mentioned here: > > http://www.mysqlperformanceblog.com/2007/01/03/innodb-benchmarks/ > > It may be worth using more recent sources than 5.0.26 if tracking down > scaling problems in MySQL. The blog post that originated this discussion ran tests on 5.0.33 Not that the mysql version should really matter. The key point here is that FreeBSD and Linux were running the *same* version, and FreeBSD was able to handle the situation better somehow. Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench
From: Hiro Yoshioka @ 2007-02-27 0:32 UTC
To: Dave Jones, Pete Harlan, Nick Piggin, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe
Cc: Hiro Yoshioka

Howdy,

MySQL 5.0.26 had some scalability issues, which were solved as of 5.0.32:
http://ossipedia.ipa.go.jp/capacity/EV0612260303/
(written in Japanese, but you can read the graph. We compared
5.0.24 vs 5.0.32)

The following is oprofile data:

==> cpu=8-mysql=5.0.32-gcc=3.4/oprofile-eu=2200-op=default-none/opreport-l.txt <==
CPU: Core Solo / Duo, speed 2666.76 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Unhalted clock cycles) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples   %        app name             symbol name
47097502  16.8391  libpthread-2.3.4.so  pthread_mutex_trylock
19636300   7.0207  libpthread-2.3.4.so  pthread_mutex_unlock
18600010   6.6502  mysqld               rec_get_offsets_func
18121328   6.4790  mysqld               btr_search_guess_on_hash
11453095   4.0949  mysqld               row_search_for_mysql

MySQL tries to get a mutex and spends about 16.8% of CPU time on it on an
8-core machine.

I think there is a lot of room for improvement in the MySQL implementation.

On 2/27/07, Dave Jones <davej@redhat.com> wrote:
> On Mon, Feb 26, 2007 at 04:04:01PM -0600, Pete Harlan wrote:
> > On Tue, Feb 27, 2007 at 12:36:04AM +1100, Nick Piggin wrote:
> > > I found a couple of interesting issues so far. Firstly, the MySQL
> > > version that I'm using (5.0.26-Max) is making lots of calls to
> >
> > FYI, MySQL fixed some scalability problems in version 5.0.30, as
> > mentioned here:
> >
> > http://www.mysqlperformanceblog.com/2007/01/03/innodb-benchmarks/
> >
> > It may be worth using more recent sources than 5.0.26 if tracking down
> > scaling problems in MySQL.
>
> The blog post that originated this discussion ran tests on 5.0.33
> Not that the mysql version should really matter. The key point here
> is that FreeBSD and Linux were running the *same* version, and
> FreeBSD was able to handle the situation better somehow.
>
> Dave
>
> --
> http://www.codemonkey.org.uk

Regards,
  Hiro
--
Hiro Yoshioka
mailto:hyoshiok at miraclelinux.com
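The pthread_mutex_trylock entry at the top of that profile is what you would expect from many threads polling one hot lock with trylock before giving up. A minimal sketch of that pattern follows; it is not taken from MySQL, and the thread and iteration counts are arbitrary.

/* Worst case: every thread spins on one mutex with trylock. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 1000000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long counter;

static void *worker(void *arg)
{
	for (int i = 0; i < ITERS; i++) {
		while (pthread_mutex_trylock(&lock) != 0)
			;			/* burns CPU in the trylock path */
		counter++;			/* trivially short critical section */
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

int main(int argc, char **argv)
{
	int nthreads = argc > 1 ? atoi(argv[1]) : 8;
	pthread_t tid[nthreads];

	for (int i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (int i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	printf("%d threads, counter = %lu\n", nthreads, counter);
	return 0;
}

Timing this (compiled with -pthread) for growing thread counts shows throughput flattening once the lock saturates, no matter how many cores are available, which is the same shape as the sysbench curves.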
* Re: SMP performance degradation with sysbench 2007-02-27 0:32 ` Hiro Yoshioka @ 2007-02-27 0:43 ` Rik van Riel 2007-02-27 4:03 ` Hiro Yoshioka 0 siblings, 1 reply; 45+ messages in thread From: Rik van Riel @ 2007-02-27 0:43 UTC (permalink / raw) To: hyoshiok Cc: Dave Jones, Pete Harlan, Nick Piggin, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe Hiro Yoshioka wrote: > Howdy, > > MySQL 5.0.26 had some scalability issues and it solved since 5.0.32 > http://ossipedia.ipa.go.jp/capacity/EV0612260303/ > (written in Japanese but you may read the graph. We compared > 5.0.24 vs 5.0.32) > > The following is oprofile data > ==> > cpu=8-mysql=5.0.32-gcc=3.4/oprofile-eu=2200-op=default-none/opreport-l.txt > <== > CPU: Core Solo / Duo, speed 2666.76 MHz (estimated) > Counted CPU_CLK_UNHALTED events (Unhalted clock cycles) with a unit > mask of 0x00 (Unhalted core cycles) count 100000 > samples % app name symbol name > 47097502 16.8391 libpthread-2.3.4.so pthread_mutex_trylock > 19636300 7.0207 libpthread-2.3.4.so pthread_mutex_unlock > 18600010 6.6502 mysqld rec_get_offsets_func > 18121328 6.4790 mysqld btr_search_guess_on_hash > 11453095 4.0949 mysqld row_search_for_mysql > > MySQL tries to get a mutex but it spends about 16.8% of CPU on 8 core > machine. > > I think there are a lot of room to be inproved in MySQL implementation. That's one aspect. The other aspect of the problem is that when the number of threads exceeds the number of CPU cores, Linux no longer manages to keep the CPUs busy and we get a lot of idle time. On the other hand, with the number of threads being equal to the number of CPU cores, we are 100% CPU bound... -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench
From: Hiro Yoshioka @ 2007-02-27 4:03 UTC
To: riel
Cc: davej, harlan, nickpiggin, l_allegrucci, linux-kernel, mingo, suparna, jens.axboe, hyoshiok

Hi,

From: Rik van Riel <riel@redhat.com>
> Hiro Yoshioka wrote:
> > Howdy,
> >
> > MySQL 5.0.26 had some scalability issues and it solved since 5.0.32
> > http://ossipedia.ipa.go.jp/capacity/EV0612260303/
> > (written in Japanese but you may read the graph. We compared
> > 5.0.24 vs 5.0.32)
snip
> > MySQL tries to get a mutex but it spends about 16.8% of CPU on 8 core
> > machine.
> >
> > I think there are a lot of room to be inproved in MySQL implementation.
>
> That's one aspect.
>
> The other aspect of the problem is that when the number of
> threads exceeds the number of CPU cores, Linux no longer
> manages to keep the CPUs busy and we get a lot of idle time.
>
> On the other hand, with the number of threads being equal to
> the number of CPU cores, we are 100% CPU bound...

I have a question. If so, what is the difference, from the kernel's
point of view, between SMP CPUs and CPU cores?

Another question. When the number of threads exceeds the number of
CPU cores, we may get a lot of idle time. Then a workaround for MySQL
would be to not create more threads than the number of CPU cores.
Is that right?

Regards,
  Hiro
--
Hiro Yoshioka
CTO/Miracle Linux Corporation
http://blog.miraclelinux.com/yume/
* Re: SMP performance degradation with sysbench 2007-02-27 4:03 ` Hiro Yoshioka @ 2007-02-27 4:31 ` Rik van Riel 2007-02-27 8:14 ` J.A. Magallón 0 siblings, 1 reply; 45+ messages in thread From: Rik van Riel @ 2007-02-27 4:31 UTC (permalink / raw) To: Hiro Yoshioka Cc: davej, harlan, nickpiggin, l_allegrucci, linux-kernel, mingo, suparna, jens.axboe Hiro Yoshioka wrote: > Hi, > > From: Rik van Riel <riel@redhat.com> >> Hiro Yoshioka wrote: >>> Howdy, >>> >>> MySQL 5.0.26 had some scalability issues and it solved since 5.0.32 >>> http://ossipedia.ipa.go.jp/capacity/EV0612260303/ >>> (written in Japanese but you may read the graph. We compared >>> 5.0.24 vs 5.0.32) > snip >>> MySQL tries to get a mutex but it spends about 16.8% of CPU on 8 core >>> machine. >>> >>> I think there are a lot of room to be inproved in MySQL implementation. >> That's one aspect. >> >> The other aspect of the problem is that when the number of >> threads exceeds the number of CPU cores, Linux no longer >> manages to keep the CPUs busy and we get a lot of idle time. >> >> On the other hand, with the number of threads being equal to >> the number of CPU cores, we are 100% CPU bound... > > I have a question. If so, what is the difference of kernel's > view between SMP and CPU cores? None. Each schedulable entity (whether a fully fledged CPU core or an SMT/HT thread) is treated the same. > Another question. When the number of threads exceeds the number of > CPU cores, we may get a lot of idle time. Then a workaround of > MySQL is that do not creat threads which exceeds the number > of CPU cores. Is it right? Not really, that would make it impossible for MySQL to handle more simultaneous database queries than the system has CPUs. Besides, it looks like this is not a problem in MySQL per se (it works on FreeBSD) but some bug in Linux. -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. ^ permalink raw reply [flat|nested] 45+ messages in thread
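For completeness, here is a small sketch (not from the thread) of how a program can query the count Rik is referring to; each core and each SMT/HT sibling shows up as one schedulable CPU, with no separate notion of sockets at this level.

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sched.h>

int main(void)
{
	long online = sysconf(_SC_NPROCESSORS_ONLN);
	cpu_set_t set;
	int usable = 0;

	/* CPUs this process is actually allowed to run on. */
	if (sched_getaffinity(0, sizeof(set), &set) == 0)
		for (int i = 0; i < CPU_SETSIZE; i++)
			if (CPU_ISSET(i, &set))
				usable++;

	printf("online CPUs: %ld, usable by this process: %d\n", online, usable);
	return 0;
}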
* Re: SMP performance degradation with sysbench
From: J.A. Magallón @ 2007-02-27 8:14 UTC
To: Rik van Riel
Cc: Hiro Yoshioka, davej, harlan, nickpiggin, l_allegrucci, linux-kernel, mingo, suparna, jens.axboe

On Mon, 26 Feb 2007 23:31:29 -0500, Rik van Riel <riel@redhat.com> wrote:

> Hiro Yoshioka wrote:
> > Hi,
> >
> > From: Rik van Riel <riel@redhat.com>
> >> Hiro Yoshioka wrote:
> >>> Howdy,
> >>>
> >>> MySQL 5.0.26 had some scalability issues and it solved since 5.0.32
> >>> http://ossipedia.ipa.go.jp/capacity/EV0612260303/
> >>> (written in Japanese but you may read the graph. We compared
> >>> 5.0.24 vs 5.0.32)
> > snip
> >>> MySQL tries to get a mutex but it spends about 16.8% of CPU on 8 core
> >>> machine.
> >>>
> >>> I think there are a lot of room to be inproved in MySQL implementation.
> >> That's one aspect.
> >>
> >> The other aspect of the problem is that when the number of
> >> threads exceeds the number of CPU cores, Linux no longer
> >> manages to keep the CPUs busy and we get a lot of idle time.
> >>
> >> On the other hand, with the number of threads being equal to
> >> the number of CPU cores, we are 100% CPU bound...
> >
> > I have a question. If so, what is the difference of kernel's
> > view between SMP and CPU cores?
>
> None. Each schedulable entity (whether a fully fledged
> CPU core or an SMT/HT thread) is treated the same.
>

And what are the SMT and Multi-Core scheduling options in the kernel
config for? Because of this thread I re-read the help text, and it looks
like one could de-select the SMT scheduler option, still get a working
SMP system, and see what difference it makes. I suppose it's related to
migration and cache flushing and so on, but where could I get more
details? And, stranger still, what is the difference between the
multi-core and normal SMP configs?

> > Another question. When the number of threads exceeds the number of
> > CPU cores, we may get a lot of idle time. Then a workaround of
> > MySQL is that do not creat threads which exceeds the number
> > of CPU cores. Is it right?
>
> Not really, that would make it impossible for MySQL to
> handle more simultaneous database queries than the system
> has CPUs.
>

I don't know MySQL internals, but you assume one thread per query.
What if it's more like Apache, with one long-living thread serving
several connections? It's the same to answer 4+4 queries as 8 at half
the speed, isn't it?

> Besides, it looks like this is not a problem in MySQL
> per se (it works on FreeBSD) but some bug in Linux.
>

--
J.A. Magallon <jamagallon()ono!com>     \  Software is like sex:
                                         \  It's better when it's free
Mandriva Linux release 2007.1 (Cooker) for i586
Linux 2.6.19-jam07 (gcc 4.1.2 20070115 (prerelease) (4.1.2-0.20070115.1mdv2007.1)) #2 SMP PREEMPT
* Re: SMP performance degradation with sysbench 2007-02-27 8:14 ` J.A. Magallón @ 2007-02-27 14:02 ` Rik van Riel 2007-02-27 14:56 ` Paulo Marques 2007-02-27 19:05 ` Lorenzo Allegrucci 0 siblings, 2 replies; 45+ messages in thread From: Rik van Riel @ 2007-02-27 14:02 UTC (permalink / raw) To: "J.A. Magallón" Cc: Hiro Yoshioka, davej, harlan, nickpiggin, l_allegrucci, linux-kernel, mingo, suparna, jens.axboe J.A. Magallón wrote: > On Mon, 26 Feb 2007 23:31:29 -0500, Rik van Riel <riel@redhat.com> wrote: > >> Hiro Yoshioka wrote: >>> Another question. When the number of threads exceeds the number of >>> CPU cores, we may get a lot of idle time. Then a workaround of >>> MySQL is that do not creat threads which exceeds the number >>> of CPU cores. Is it right? >> Not really, that would make it impossible for MySQL to >> handle more simultaneous database queries than the system >> has CPUs. >> > > I don't know myqsl internals, but you assume one thread per query. > If its more like Apache, one long living thread for several connections ? Yes, they are longer lived client connections. One thread per connection, just like Apache. > Its the same to answer 4+4 queries than 8 at half the speed, isn't it ? That still doesn't fix the potential Linux problem that this benchmark identified. To clarify: I don't care as much about MySQL performance as I care about identifying and fixing this potential bug in Linux. -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. ^ permalink raw reply [flat|nested] 45+ messages in thread
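For readers not familiar with the model being discussed, below is a stripped-down sketch of a one-thread-per-connection server; the port number and the echo loop are invented placeholders, not MySQL's real connection handling.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

static void *handle_client(void *arg)
{
	int fd = (int)(long)arg;
	char buf[512];
	ssize_t n;

	/* One long-lived thread serves every query on this connection. */
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		if (write(fd, buf, n) < 0)	/* placeholder for query handling */
			break;
	close(fd);
	return NULL;
}

int main(void)
{
	int srv = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(3307),	/* made-up port */
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};

	if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(srv, 128) < 0) {
		perror("bind/listen");
		return 1;
	}

	for (;;) {
		int fd = accept(srv, NULL, NULL);
		pthread_t tid;

		if (fd < 0)
			continue;
		pthread_create(&tid, NULL, handle_client, (void *)(long)fd);
		pthread_detach(tid);
	}
}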
* Re: SMP performance degradation with sysbench
From: Paulo Marques @ 2007-02-27 14:56 UTC
To: Rik van Riel
Cc: "J.A. Magallón", Hiro Yoshioka, davej, harlan, nickpiggin, l_allegrucci, linux-kernel, mingo, suparna, jens.axboe

Rik van Riel wrote:
> J.A. Magallón wrote:
>> [...]
>> Its the same to answer 4+4 queries than 8 at half the speed, isn't it ?
>
> That still doesn't fix the potential Linux problem that this
> benchmark identified.
>
> To clarify: I don't care as much about MySQL performance as
> I care about identifying and fixing this potential bug in
> Linux.

IIRC a long time ago there was a change in the scheduler to prevent a
low prio task running on a sibling of a hyperthreaded processor from
slowing down a higher prio task on another sibling of the same processor.

Basically the scheduler would put the low prio task to sleep during an
adequate task slice to allow the other sibling to run at full speed for
a while.

I don't know the scheduler code well enough, but comments like this one
make me think that the change is still in place:

> /*
>  * If an SMT sibling task has been put to sleep for priority
>  * reasons reschedule the idle task to see if it can now run.
>  */
> if (rq->nr_running) {
> 	resched_task(rq->idle);
> 	ret = 1;
> }

If that is the case, turning off CONFIG_SCHED_SMT would solve the problem.

--
Paulo Marques - www.grupopie.com

"The face of a child can say it all, especially the
mouth part of the face."
* Re: SMP performance degradation with sysbench 2007-02-27 14:56 ` Paulo Marques @ 2007-02-27 20:40 ` Nish Aravamudan 2007-02-28 2:21 ` Bill Davidsen 1 sibling, 0 replies; 45+ messages in thread From: Nish Aravamudan @ 2007-02-27 20:40 UTC (permalink / raw) To: Paulo Marques Cc: Rik van Riel, "J.A. Magallón", Hiro Yoshioka, davej, harlan, nickpiggin, l_allegrucci, linux-kernel, mingo, suparna, jens.axboe On 2/27/07, Paulo Marques <pmarques@grupopie.com> wrote: > Rik van Riel wrote: > > J.A. Magallón wrote: > >>[...] > >> Its the same to answer 4+4 queries than 8 at half the speed, isn't it ? > > > > That still doesn't fix the potential Linux problem that this > > benchmark identified. > > > > To clarify: I don't care as much about MySQL performance as > > I care about identifying and fixing this potential bug in > > Linux. > > IIRC a long time ago there was a change in the scheduler to prevent a > low prio task running on a sibling of a hyperthreaded processor to slow > down a higher prio task on another sibling of the same processor. > > Basically the scheduler would put the low prio task to sleep during an > adequate task slice to allow the other sibling to run at full speed for > a while. > > I don't know the scheduler code well enough, but comments like this one > make me think that the change is still in place: <snip> > If that is the case, turning off CONFIG_SCHED_SMT would solve the problem. To chime in here, I was attempting to reproduce this on an 8-way Xeon box (4 dual-core). SCHED_SMT and SCHED_MC on led to scaling issues when above 4 threads (4 threads was the peak). To the point, where I couldn't break 1000 transactions per second. Turning both off (with 2.6.20.1) gives much better performance through 16 threads. I am now running for the cases from 17 to 32 to see if I can reproduce the problem at hand. I'll regenerate my data and post numbers soon. I don't know if anyone else has those on in their kernel .config, but I'd suggest turning them off, as Paulo said. Thanks, Nish ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench 2007-02-27 14:56 ` Paulo Marques 2007-02-27 20:40 ` Nish Aravamudan @ 2007-02-28 2:21 ` Bill Davidsen 2007-02-28 2:52 ` Nish Aravamudan 1 sibling, 1 reply; 45+ messages in thread From: Bill Davidsen @ 2007-02-28 2:21 UTC (permalink / raw) To: Paulo Marques Cc: Rik van Riel, "J.A. Magallón", Hiro Yoshioka, davej, harlan, nickpiggin, l_allegrucci, linux-kernel, mingo, suparna, jens.axboe Paulo Marques wrote: > Rik van Riel wrote: >> J.A. Magallón wrote: >>> [...] >>> Its the same to answer 4+4 queries than 8 at half the speed, isn't it ? >> >> That still doesn't fix the potential Linux problem that this >> benchmark identified. >> >> To clarify: I don't care as much about MySQL performance as >> I care about identifying and fixing this potential bug in >> Linux. > > IIRC a long time ago there was a change in the scheduler to prevent a > low prio task running on a sibling of a hyperthreaded processor to slow > down a higher prio task on another sibling of the same processor. > > Basically the scheduler would put the low prio task to sleep during an > adequate task slice to allow the other sibling to run at full speed for > a while. > > I don't know the scheduler code well enough, but comments like this one > make me think that the change is still in place: > >> /* >> * If an SMT sibling task has been put to sleep for priority >> * reasons reschedule the idle task to see if it can now run. >> */ >> if (rq->nr_running) { >> resched_task(rq->idle); >> ret = 1; >> } > > If that is the case, turning off CONFIG_SCHED_SMT would solve the problem. > That may be the case, but in my opinion if this helps it doesn't "solve" the problem, because the real problem is that a process which is not on a HT is being treated as if it were. Note that Intel does make multicore HT processors, and hopefully when this code works as intended it will result in more total throughput. My supposition is that it currently is NOT working as intended, since disabling SMT scheduling is reported to help. A test with MC on and SMT off would be informative for where to look next. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench 2007-02-28 2:21 ` Bill Davidsen @ 2007-02-28 2:52 ` Nish Aravamudan 2007-03-01 0:20 ` Nish Aravamudan 0 siblings, 1 reply; 45+ messages in thread From: Nish Aravamudan @ 2007-02-28 2:52 UTC (permalink / raw) To: Bill Davidsen Cc: Paulo Marques, Rik van Riel, "J.A. Magallón", Hiro Yoshioka, davej, harlan, nickpiggin, l_allegrucci, linux-kernel, mingo, suparna, jens.axboe On 2/27/07, Bill Davidsen <davidsen@tmr.com> wrote: > Paulo Marques wrote: > > Rik van Riel wrote: > >> J.A. Magallón wrote: > >>> [...] > >>> Its the same to answer 4+4 queries than 8 at half the speed, isn't it ? > >> > >> That still doesn't fix the potential Linux problem that this > >> benchmark identified. > >> > >> To clarify: I don't care as much about MySQL performance as > >> I care about identifying and fixing this potential bug in > >> Linux. > > > > IIRC a long time ago there was a change in the scheduler to prevent a > > low prio task running on a sibling of a hyperthreaded processor to slow > > down a higher prio task on another sibling of the same processor. > > > > Basically the scheduler would put the low prio task to sleep during an > > adequate task slice to allow the other sibling to run at full speed for > > a while. > > > > I don't know the scheduler code well enough, but comments like this one > > make me think that the change is still in place: > > > >> /* > >> * If an SMT sibling task has been put to sleep for priority > >> * reasons reschedule the idle task to see if it can now run. > >> */ > >> if (rq->nr_running) { > >> resched_task(rq->idle); > >> ret = 1; > >> } > > > > If that is the case, turning off CONFIG_SCHED_SMT would solve the problem. > > > That may be the case, but in my opinion if this helps it doesn't "solve" > the problem, because the real problem is that a process which is not on > a HT is being treated as if it were. > > Note that Intel does make multicore HT processors, and hopefully when > this code works as intended it will result in more total throughput. My > supposition is that it currently is NOT working as intended, since > disabling SMT scheduling is reported to help. It does help, but we still drop off, clearly. Also, that's my baseline, so I'm not able to reproduce the *sharp* dropoff from the blog post yet. > A test with MC on and SMT off would be informative for where to look next. I'm rebooting my box with 2.6.20.1 and exactly this setup now. Thanks, Nish ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench 2007-02-28 2:52 ` Nish Aravamudan @ 2007-03-01 0:20 ` Nish Aravamudan 0 siblings, 0 replies; 45+ messages in thread From: Nish Aravamudan @ 2007-03-01 0:20 UTC (permalink / raw) To: Bill Davidsen Cc: Paulo Marques, Rik van Riel, J.A. Magallón, Hiro Yoshioka, davej, harlan, nickpiggin, l_allegrucci, linux-kernel, mingo, suparna, jens.axboe [-- Attachment #1: Type: text/plain, Size: 2012 bytes --] On 2/27/07, Nish Aravamudan <nish.aravamudan@gmail.com> wrote: > On 2/27/07, Bill Davidsen <davidsen@tmr.com> wrote: > > Paulo Marques wrote: > > > Rik van Riel wrote: > > >> J.A. Magallón wrote: > > >>> [...] > > >>> Its the same to answer 4+4 queries than 8 at half the speed, isn't it ? > > >> > > >> That still doesn't fix the potential Linux problem that this > > >> benchmark identified. > > >> > > >> To clarify: I don't care as much about MySQL performance as > > >> I care about identifying and fixing this potential bug in > > >> Linux. > > > > > > IIRC a long time ago there was a change in the scheduler to prevent a > > > low prio task running on a sibling of a hyperthreaded processor to slow > > > down a higher prio task on another sibling of the same processor. > > > > > > Basically the scheduler would put the low prio task to sleep during an > > > adequate task slice to allow the other sibling to run at full speed for > > > a while. <snip> > > > If that is the case, turning off CONFIG_SCHED_SMT would solve the problem. <snip> > > Note that Intel does make multicore HT processors, and hopefully when > > this code works as intended it will result in more total throughput. My > > supposition is that it currently is NOT working as intended, since > > disabling SMT scheduling is reported to help. > > It does help, but we still drop off, clearly. Also, that's my > baseline, so I'm not able to reproduce the *sharp* dropoff from the > blog post yet. > > > A test with MC on and SMT off would be informative for where to look next. > > I'm rebooting my box with 2.6.20.1 and exactly this setup now. Here are the results: idle.png: average % idle over 120s runs from 1 to 32 threads transactions.png: TPS over 120s runs from 1 to 32 threads Hope the data is useful. All I can conclude right now is that SMT appears to help (contradicting what I said earlier), but that MC seems to have no effect (or no substantial effect). Thanks, Nish [-- Attachment #2: idle.png --] [-- Type: image/png, Size: 5482 bytes --] [-- Attachment #3: transactions.png --] [-- Type: image/png, Size: 6653 bytes --] ^ permalink raw reply [flat|nested] 45+ messages in thread
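The idle percentages in these graphs come from sar; for anyone who wants to reproduce them without it, roughly the same number can be computed from the aggregate "cpu" line in /proc/stat. A small sketch (mine, not part of the thread; the 5 second interval is arbitrary):

#include <stdio.h>
#include <unistd.h>

/* Fields on the "cpu" line: user nice system idle iowait irq softirq steal */
static void read_stat(unsigned long long v[8])
{
	FILE *f = fopen("/proc/stat", "r");

	if (!f)
		return;
	fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
	       &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]);
	fclose(f);
}

int main(void)
{
	unsigned long long a[8] = {0}, b[8] = {0}, total = 0;

	read_stat(a);
	sleep(5);
	read_stat(b);

	for (int i = 0; i < 8; i++)
		total += b[i] - a[i];

	/* v[3] is idle; iowait (v[4]) is accounted separately. */
	if (total)
		printf("%%idle over 5s: %.1f  %%iowait: %.1f\n",
		       100.0 * (b[3] - a[3]) / total,
		       100.0 * (b[4] - a[4]) / total);
	return 0;
}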
* Re: SMP performance degradation with sysbench 2007-02-27 14:02 ` Rik van Riel 2007-02-27 14:56 ` Paulo Marques @ 2007-02-27 19:05 ` Lorenzo Allegrucci 2007-03-01 16:57 ` Lorenzo Allegrucci 1 sibling, 1 reply; 45+ messages in thread From: Lorenzo Allegrucci @ 2007-02-27 19:05 UTC (permalink / raw) To: Rik van Riel Cc: "J.A. Magallón", Hiro Yoshioka, davej, harlan, nickpiggin, linux-kernel, mingo, suparna, jens.axboe On Tue, 2007-02-27 at 09:02 -0500, Rik van Riel wrote: > J.A. Magallón wrote: > > On Mon, 26 Feb 2007 23:31:29 -0500, Rik van Riel <riel@redhat.com> wrote: > > > >> Hiro Yoshioka wrote: > > >>> Another question. When the number of threads exceeds the number of > >>> CPU cores, we may get a lot of idle time. Then a workaround of > >>> MySQL is that do not creat threads which exceeds the number > >>> of CPU cores. Is it right? > >> Not really, that would make it impossible for MySQL to > >> handle more simultaneous database queries than the system > >> has CPUs. > >> > > > > I don't know myqsl internals, but you assume one thread per query. > > If its more like Apache, one long living thread for several connections ? > > Yes, they are longer lived client connections. One thread > per connection, just like Apache. > > > Its the same to answer 4+4 queries than 8 at half the speed, isn't it ? > > That still doesn't fix the potential Linux problem that this > benchmark identified. > > To clarify: I don't care as much about MySQL performance as > I care about identifying and fixing this potential bug in > Linux. Here http://people.freebsd.org/~kris/scaling/mysql.html Kris Kennaway talks about a patch for FreeBSD 7 which addresses poor scalability of file descriptor locking and that it's responsible for almost all of the performance and scaling improvements. Chiacchiera con i tuoi amici in tempo reale! http://it.yahoo.com/mail_it/foot/*http://it.messenger.yahoo.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench 2007-02-27 19:05 ` Lorenzo Allegrucci @ 2007-03-01 16:57 ` Lorenzo Allegrucci 0 siblings, 0 replies; 45+ messages in thread From: Lorenzo Allegrucci @ 2007-03-01 16:57 UTC (permalink / raw) To: Rik van Riel Cc: "J.A. Magallón", Hiro Yoshioka, davej, harlan, nickpiggin, linux-kernel, mingo, suparna, jens.axboe On Tue, 2007-02-27 at 20:05 +0100, Lorenzo Allegrucci wrote: > On Tue, 2007-02-27 at 09:02 -0500, Rik van Riel wrote: > > That still doesn't fix the potential Linux problem that this > > benchmark identified. > > > > To clarify: I don't care as much about MySQL performance as > > I care about identifying and fixing this potential bug in > > Linux. > > Here http://people.freebsd.org/~kris/scaling/mysql.html Kris Kennaway > talks about a patch for FreeBSD 7 which addresses poor scalability > of file descriptor locking and that it's responsible for almost all > of the performance and scaling improvements. How does Linux scale with many threads contending for file descriptor lock? Has anyone tried to run the test with oprofile? ^ permalink raw reply [flat|nested] 45+ messages in thread
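A quick way to put load on exactly the path Lorenzo is asking about, should someone want to oprofile it, is to have many threads of one process open and close descriptors in a loop, since they all share a single file descriptor table. A sketch, not from the thread; the file name and counts are arbitrary:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define ITERS 200000

static void *worker(void *arg)
{
	for (int i = 0; i < ITERS; i++) {
		/* allocate and release a slot in the shared fd table */
		int fd = open("/dev/null", O_RDONLY);

		if (fd >= 0)
			close(fd);
	}
	return NULL;
}

int main(int argc, char **argv)
{
	int nthreads = argc > 1 ? atoi(argv[1]) : 8;
	pthread_t tid[nthreads];

	for (int i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (int i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	puts("done; compare wall time and profiles as the thread count grows");
	return 0;
}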
* Re: SMP performance degradation with sysbench 2007-02-26 13:36 ` Nick Piggin 2007-02-26 13:41 ` Nick Piggin 2007-02-26 22:04 ` Pete Harlan @ 2007-02-28 1:27 ` Nish Aravamudan 2007-02-28 2:22 ` Nick Piggin 2007-03-12 22:00 ` Anton Blanchard 3 siblings, 1 reply; 45+ messages in thread From: Nish Aravamudan @ 2007-02-28 1:27 UTC (permalink / raw) To: Nick Piggin Cc: Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe [-- Attachment #1: Type: text/plain, Size: 2138 bytes --] On 2/26/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Rik van Riel wrote: > > Lorenzo Allegrucci wrote: > > > >> Hi lkml, > >> > >> according to the test below (sysbench) Linux seems to have scalability > >> problems beyond 8 client threads: > >> http://jeffr-tech.livejournal.com/6268.html#cutid1 > >> http://jeffr-tech.livejournal.com/5705.html > >> Hardware is an 8-core amd64 system and jeffr seems willing to try more > >> Linux versions on that machine. > >> Anyway, is there anyone who can reproduce this? > > > > > > I have reproduced it on a quad core test system. > > > > With 4 threads (on 4 cores) I get a high throughput, with > > approximately 58% user time and 42% system time. > > > > With 8 threads (on 4 cores) I get way lower throughput, > > with 37% user time, 29% system time 35% idle time! > > > > The maximum time taken per query also increases from > > 0.0096s to 0.5273s. Ouch! > > > > I don't know if this is MySQL, glibc or Linux kernel, > > but something strange is going on... > > Like you, I'm also seeing idle time start going up as threads increase. > > I initially thought this was a problem with the multiprocessor scheduler, > because the pattern is exactly like some artificat in the load balancing. > > However, after looking at the stats, and testing a couple of things, I > think it may not be after all. > > I've reproduced this on a 8-socket/16-way dual core Opteron. So far what > I am seeing is that MySQL is having trouble putting enough load into the > scheduler. Here are some graphs from the 4-socket/8-way Xeon box (no SMT, no MC in .config) I posted about earlier. transactions.png resembles Nick's results pretty closely, in that a drop-off occurs, at the same # of threads, too. That seems weird to me, but I haven't thought about it too closely. Shouldn't Nick's be dropping off closer to 16 threads (that would be 1 per core, then, right?) idle.png is the average % idle according to sar over each run from 1 to 32 threads. This appears to confirm what Rik was seeing. Not sure if my data is hurting or helping, but this box remains available for further tests. Thanks, Nish [-- Attachment #2: transactions.png --] [-- Type: image/png, Size: 3837 bytes --] [-- Attachment #3: idle.png --] [-- Type: image/png, Size: 3349 bytes --] ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench 2007-02-28 1:27 ` Nish Aravamudan @ 2007-02-28 2:22 ` Nick Piggin 2007-02-28 2:51 ` Nish Aravamudan 0 siblings, 1 reply; 45+ messages in thread From: Nick Piggin @ 2007-02-28 2:22 UTC (permalink / raw) To: Nish Aravamudan Cc: Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe Nish Aravamudan wrote: > On 2/26/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >> Rik van Riel wrote: >> > Lorenzo Allegrucci wrote: >> > >> >> Hi lkml, >> >> >> >> according to the test below (sysbench) Linux seems to have scalability >> >> problems beyond 8 client threads: >> >> http://jeffr-tech.livejournal.com/6268.html#cutid1 >> >> http://jeffr-tech.livejournal.com/5705.html >> >> Hardware is an 8-core amd64 system and jeffr seems willing to try more >> >> Linux versions on that machine. >> >> Anyway, is there anyone who can reproduce this? >> > >> > >> > I have reproduced it on a quad core test system. >> > >> > With 4 threads (on 4 cores) I get a high throughput, with >> > approximately 58% user time and 42% system time. >> > >> > With 8 threads (on 4 cores) I get way lower throughput, >> > with 37% user time, 29% system time 35% idle time! >> > >> > The maximum time taken per query also increases from >> > 0.0096s to 0.5273s. Ouch! >> > >> > I don't know if this is MySQL, glibc or Linux kernel, >> > but something strange is going on... >> >> Like you, I'm also seeing idle time start going up as threads increase. >> >> I initially thought this was a problem with the multiprocessor scheduler, >> because the pattern is exactly like some artificat in the load balancing. >> >> However, after looking at the stats, and testing a couple of things, I >> think it may not be after all. >> >> I've reproduced this on a 8-socket/16-way dual core Opteron. So far what >> I am seeing is that MySQL is having trouble putting enough load into the >> scheduler. > > > Here are some graphs from the 4-socket/8-way Xeon box (no SMT, no MC > in .config) I posted about earlier. > > transactions.png resembles Nick's results pretty closely, in that a > drop-off occurs, at the same # of threads, too. That seems weird to > me, but I haven't thought about it too closely. Shouldn't Nick's be > dropping off closer to 16 threads (that would be 1 per core, then, > right?) I don't think it is exactly a matter of processes >= cores, but rather just a general problem at higher concurrency. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench 2007-02-28 2:22 ` Nick Piggin @ 2007-02-28 2:51 ` Nish Aravamudan 0 siblings, 0 replies; 45+ messages in thread From: Nish Aravamudan @ 2007-02-28 2:51 UTC (permalink / raw) To: Nick Piggin Cc: Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe On 2/27/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Nish Aravamudan wrote: > > On 2/26/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > >> Rik van Riel wrote: > >> > Lorenzo Allegrucci wrote: > >> > > >> >> Hi lkml, > >> >> > >> >> according to the test below (sysbench) Linux seems to have scalability > >> >> problems beyond 8 client threads: > >> >> http://jeffr-tech.livejournal.com/6268.html#cutid1 > >> >> http://jeffr-tech.livejournal.com/5705.html > >> >> Hardware is an 8-core amd64 system and jeffr seems willing to try more > >> >> Linux versions on that machine. > >> >> Anyway, is there anyone who can reproduce this? > >> > > >> > > >> > I have reproduced it on a quad core test system. > >> > > >> > With 4 threads (on 4 cores) I get a high throughput, with > >> > approximately 58% user time and 42% system time. > >> > > >> > With 8 threads (on 4 cores) I get way lower throughput, > >> > with 37% user time, 29% system time 35% idle time! > >> > > >> > The maximum time taken per query also increases from > >> > 0.0096s to 0.5273s. Ouch! > >> > > >> > I don't know if this is MySQL, glibc or Linux kernel, > >> > but something strange is going on... > >> > >> Like you, I'm also seeing idle time start going up as threads increase. > >> > >> I initially thought this was a problem with the multiprocessor scheduler, > >> because the pattern is exactly like some artificat in the load balancing. > >> > >> However, after looking at the stats, and testing a couple of things, I > >> think it may not be after all. > >> > >> I've reproduced this on a 8-socket/16-way dual core Opteron. So far what > >> I am seeing is that MySQL is having trouble putting enough load into the > >> scheduler. > > > > > > Here are some graphs from the 4-socket/8-way Xeon box (no SMT, no MC > > in .config) I posted about earlier. > > > > transactions.png resembles Nick's results pretty closely, in that a > > drop-off occurs, at the same # of threads, too. That seems weird to > > me, but I haven't thought about it too closely. Shouldn't Nick's be > > dropping off closer to 16 threads (that would be 1 per core, then, > > right?) > > I don't think it is exactly a matter of processes >= cores, but rather > just a general problem at higher concurrency. Ok, thanks for the clarification. -Nish ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench
From: Anton Blanchard @ 2007-03-12 22:00 UTC
To: Nick Piggin
Cc: Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe

Hi Nick,

> Anyway, I'll keep experimenting. If anyone from MySQL wants to help look
> at this, send me a mail (eg. especially with the sched_setscheduler issue,
> you might be able to do something better).

I took a look at this today and figured I'd document it:

http://ozlabs.org/~anton/linux/sysbench/

Bottom line: it looks like issues in the glibc malloc library; replacing
it with the google malloc library fixes the negative scaling:

# apt-get install libgoogle-perftools0
# LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld

Anton
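Anton's page has the full analysis; for anyone who wants to compare the two allocators the same way on their own machine, a trivial threaded malloc/free load (not Anton's actual test; sizes and counts are arbitrary) can be timed once normally and once under the LD_PRELOAD line above.

#include <pthread.h>
#include <stdlib.h>

#define ITERS 500000
#define SLOTS 64

static void *worker(void *arg)
{
	void *slot[SLOTS] = { NULL };

	for (int i = 0; i < ITERS; i++) {
		int s = i % SLOTS;

		free(slot[s]);			/* free(NULL) is a no-op first time round */
		slot[s] = malloc(16 + (i % 512));	/* mixed small sizes */
	}
	for (int s = 0; s < SLOTS; s++)
		free(slot[s]);
	return NULL;
}

int main(int argc, char **argv)
{
	int nthreads = argc > 1 ? atoi(argv[1]) : 16;
	pthread_t tid[nthreads];

	for (int i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (int i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	return 0;
}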
* Re: SMP performance degradation with sysbench 2007-03-12 22:00 ` Anton Blanchard @ 2007-03-13 5:11 ` Nick Piggin 2007-03-13 9:45 ` Andrea Arcangeli 2007-03-13 6:00 ` Eric Dumazet 2007-03-14 0:36 ` Nish Aravamudan 2 siblings, 1 reply; 45+ messages in thread From: Nick Piggin @ 2007-03-13 5:11 UTC (permalink / raw) To: Anton Blanchard Cc: Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe Anton Blanchard wrote: > > Hi Nick, > > >>Anyway, I'll keep experimenting. If anyone from MySQL wants to help look >>at this, send me a mail (eg. especially with the sched_setscheduler issue, >>you might be able to do something better). > > > I took a look at this today and figured Id document it: > > http://ozlabs.org/~anton/linux/sysbench/ > > Bottom line: it looks like issues in the glibc malloc library, replacing > it with the google malloc library fixes the negative scaling: > > # apt-get install libgoogle-perftools0 > # LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld Hi Anton, Very cool. Yeah I had come to the conclusion that it wasn't a kernel issue, and basically was afraid to look into userspace ;) That bogus setscheduler thing must surely have never worked, though. I wonder if FreeBSD avoids the scalability issue because it is using SCHED_RR there, or because it has a decent threaded malloc implementation. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench 2007-03-13 5:11 ` Nick Piggin @ 2007-03-13 9:45 ` Andrea Arcangeli 2007-03-13 10:06 ` Nick Piggin 0 siblings, 1 reply; 45+ messages in thread From: Andrea Arcangeli @ 2007-03-13 9:45 UTC (permalink / raw) To: Nick Piggin Cc: Anton Blanchard, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe On Tue, Mar 13, 2007 at 04:11:02PM +1100, Nick Piggin wrote: > Hi Anton, > > Very cool. Yeah I had come to the conclusion that it wasn't a kernel > issue, and basically was afraid to look into userspace ;) btw, regardless of what glibc is doing, still the cpu shouldn't go idle IMHO. Even if we're overscheduling and trashing over the mmap_sem with threads (no idea if other OS schedules the task away when they find the other cpu in the mmap critical section), or if we've overscheduling with futex locking, the cpu usage should remain 100% system time in the worst case. The only explanation for going idle legitimately could be on HT cpus where HT may hurt more than help but on real multicore it shouldn't happen. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench 2007-03-13 9:45 ` Andrea Arcangeli @ 2007-03-13 10:06 ` Nick Piggin 2007-03-13 10:31 ` Andrea Arcangeli 0 siblings, 1 reply; 45+ messages in thread From: Nick Piggin @ 2007-03-13 10:06 UTC (permalink / raw) To: Andrea Arcangeli Cc: Anton Blanchard, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe Andrea Arcangeli wrote: > On Tue, Mar 13, 2007 at 04:11:02PM +1100, Nick Piggin wrote: > >>Hi Anton, >> >>Very cool. Yeah I had come to the conclusion that it wasn't a kernel >>issue, and basically was afraid to look into userspace ;) > > > btw, regardless of what glibc is doing, still the cpu shouldn't go > idle IMHO. Even if we're overscheduling and trashing over the mmap_sem > with threads (no idea if other OS schedules the task away when they > find the other cpu in the mmap critical section), or if we've > overscheduling with futex locking, the cpu usage should remain 100% > system time in the worst case. The only explanation for going idle > legitimately could be on HT cpus where HT may hurt more than help but > on real multicore it shouldn't happen. > Well ignoring the HT issue, I was seeing lots of idle time simply because userspace could not keep up enough load to the scheduler. There simply were fewer runnable tasks than CPU cores. But it wasn't a case of all CPUs going idle, just most of them ;) -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench 2007-03-13 10:06 ` Nick Piggin @ 2007-03-13 10:31 ` Andrea Arcangeli 2007-03-13 10:37 ` Nick Piggin 0 siblings, 1 reply; 45+ messages in thread From: Andrea Arcangeli @ 2007-03-13 10:31 UTC (permalink / raw) To: Nick Piggin Cc: Anton Blanchard, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe On Tue, Mar 13, 2007 at 09:06:14PM +1100, Nick Piggin wrote: > Well ignoring the HT issue, I was seeing lots of idle time simply > because userspace could not keep up enough load to the scheduler. > There simply were fewer runnable tasks than CPU cores. When you said idle I thought idle and not waiting for I/O. Waiting for I/O would be hardly a kernel issue ;). If they're not waiting for I/O and they're not scheduling in userland with nanosleep/pause, the cpu shouldn't go idle. Even if they're calling sched_yield in a loop the cpu should account for zero idle time as far as I can tell. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench 2007-03-13 10:31 ` Andrea Arcangeli @ 2007-03-13 10:37 ` Nick Piggin 2007-03-13 10:57 ` Andrea Arcangeli 0 siblings, 1 reply; 45+ messages in thread From: Nick Piggin @ 2007-03-13 10:37 UTC (permalink / raw) To: Andrea Arcangeli Cc: Anton Blanchard, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe Andrea Arcangeli wrote: > On Tue, Mar 13, 2007 at 09:06:14PM +1100, Nick Piggin wrote: > >>Well ignoring the HT issue, I was seeing lots of idle time simply >>because userspace could not keep up enough load to the scheduler. >>There simply were fewer runnable tasks than CPU cores. > > > When you said idle I thought idle and not waiting for I/O. Waiting for > I/O would be hardly a kernel issue ;). If they're not waiting for I/O > and they're not scheduling in userland with nanosleep/pause, the cpu > shouldn't go idle. Even if they're calling sched_yield in a loop the > cpu should account for zero idle time as far as I can tell. Well it wasn't iowait time. From Anton's analysis, I would probably say it was time waiting for either the glibc malloc mutex or MySQL heap mutex. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench 2007-03-13 10:37 ` Nick Piggin @ 2007-03-13 10:57 ` Andrea Arcangeli 2007-03-13 11:12 ` Nick Piggin 0 siblings, 1 reply; 45+ messages in thread From: Andrea Arcangeli @ 2007-03-13 10:57 UTC (permalink / raw) To: Nick Piggin Cc: Anton Blanchard, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe On Tue, Mar 13, 2007 at 09:37:54PM +1100, Nick Piggin wrote: > Well it wasn't iowait time. From Anton's analysis, I would probably > say it was time waiting for either the glibc malloc mutex or MySQL > heap mutex. So it again makes little sense to me that this is idle time, unless some userland mutex has a usleep in the slow path which would be very wrong, in the worst case they should yield() (yield can still waste lots of cpu if two tasks in the slow paths calls it while the holder is not scheduled, but at least it wouldn't be idle time). Idle time is suspicious for a kernel issue in the scheduler or some userland inefficiency (the latter sounds more likely). ^ permalink raw reply [flat|nested] 45+ messages in thread
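Andrea's accounting point can be seen with two toy lock flavours: a waiter that spins and yields is charged as user/system time, while a waiter that blocks in the kernel (futex_wait, as contended pthread mutexes do) leaves the CPU idle when nothing else is runnable. A sketch of mine, with arbitrary counts, meant to be watched with vmstat or top while it runs:

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ITERS 2000000

static volatile int busy;			/* toy test-and-set lock word */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static int use_yield;
static unsigned long counter;

static void *worker(void *arg)
{
	for (int i = 0; i < ITERS; i++) {
		if (use_yield) {
			while (__sync_lock_test_and_set(&busy, 1))
				sched_yield();	/* waiting shows up as CPU time */
			counter++;
			__sync_lock_release(&busy);
		} else {
			pthread_mutex_lock(&mtx); /* waiters sleep in futex_wait */
			counter++;
			pthread_mutex_unlock(&mtx);
		}
	}
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid[8];

	use_yield = argc > 1 && strcmp(argv[1], "yield") == 0;
	for (int i = 0; i < 8; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (int i = 0; i < 8; i++)
		pthread_join(tid[i], NULL);
	printf("counter = %lu\n", counter);
	return 0;
}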
* Re: SMP performance degradation with sysbench 2007-03-13 10:57 ` Andrea Arcangeli @ 2007-03-13 11:12 ` Nick Piggin 2007-03-13 11:40 ` Eric Dumazet 2007-03-13 11:42 ` Andrea Arcangeli 0 siblings, 2 replies; 45+ messages in thread From: Nick Piggin @ 2007-03-13 11:12 UTC (permalink / raw) To: Andrea Arcangeli Cc: Anton Blanchard, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe Andrea Arcangeli wrote: > On Tue, Mar 13, 2007 at 09:37:54PM +1100, Nick Piggin wrote: > >>Well it wasn't iowait time. From Anton's analysis, I would probably >>say it was time waiting for either the glibc malloc mutex or MySQL >>heap mutex. > > > So it again makes little sense to me that this is idle time, unless > some userland mutex has a usleep in the slow path which would be very > wrong, in the worst case they should yield() (yield can still waste > lots of cpu if two tasks in the slow paths calls it while the holder > is not scheduled, but at least it wouldn't be idle time). They'll be sleeping in futex_wait in the kernel, I think. One thread will hold the critical mutex, some will be off doing their own thing, but importantly there will be many sleeping for the mutex to become available. > Idle time is suspicious for a kernel issue in the scheduler or some > userland inefficiency (the latter sounds more likely). That is what I first suspected, because the dropoff appeared to happen exactly after we saturated the CPU count: it seems like a scheduler artifact. However, I tested with a bigger system and actually the idle time comes before we saturate all CPUs. Also, increasing the aggressiveness of the load balancer did not drop idle time at all, so it is not a case of some runqueues idle while others have many threads on them. I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose glibc allocator. But I wonder if there are other improvements that glibc can do here? -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 45+ messages in thread
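One direction a general-purpose allocator can take, and roughly the idea tcmalloc is built on, is a per-thread cache of freed blocks so the common case never touches a shared lock. A deliberately tiny sketch of that idea follows (mine, not glibc or tcmalloc code: fixed 64-byte blocks, a bounded cache, no coalescing).

#include <pthread.h>
#include <stdlib.h>

#define BLKSZ 64
#define CACHE 32

struct blk { struct blk *next; };

static __thread struct blk *tcache;	/* per-thread free list */
static __thread int tcount;
static struct blk *global_pool;		/* shared fallback pool */
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

static void *tc_alloc(void)
{
	struct blk *b = tcache;

	if (b) {				/* fast path: no locking at all */
		tcache = b->next;
		tcount--;
		return b;
	}
	pthread_mutex_lock(&pool_lock);		/* slow path: shared pool */
	b = global_pool;
	if (b)
		global_pool = b->next;
	pthread_mutex_unlock(&pool_lock);
	return b ? (void *)b : malloc(BLKSZ);
}

static void tc_free(void *p)
{
	struct blk *b = p;

	if (tcount < CACHE) {			/* fast path: keep it local */
		b->next = tcache;
		tcache = b;
		tcount++;
		return;
	}
	pthread_mutex_lock(&pool_lock);		/* overflow to the shared pool */
	b->next = global_pool;
	global_pool = b;
	pthread_mutex_unlock(&pool_lock);
}

int main(void)
{
	void *p = tc_alloc();

	tc_free(p);
	return 0;
}

The point is only that the hot paths take no lock at all, so contended threads never end up parked in futex_wait in the first place.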
* Re: SMP performance degradation with sysbench 2007-03-13 11:12 ` Nick Piggin @ 2007-03-13 11:40 ` Eric Dumazet 2007-03-13 11:56 ` Nick Piggin 2007-03-13 11:42 ` Andrea Arcangeli 1 sibling, 1 reply; 45+ messages in thread From: Eric Dumazet @ 2007-03-13 11:40 UTC (permalink / raw) To: Nick Piggin Cc: Andrea Arcangeli, Anton Blanchard, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe
On Tuesday 13 March 2007 12:12, Nick Piggin wrote:
>
> I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
> glibc allocator. But I wonder if there are other improvements that glibc
> can do here?
I cooked a patch some time ago to speed up threaded apps and got no feedback.
http://lkml.org/lkml/2006/8/9/26
Maybe we have to wait for 32-core CPUs before thinking about cache line bouncing...
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench 2007-03-13 11:40 ` Eric Dumazet @ 2007-03-13 11:56 ` Nick Piggin 0 siblings, 0 replies; 45+ messages in thread From: Nick Piggin @ 2007-03-13 11:56 UTC (permalink / raw) To: Eric Dumazet Cc: Andrea Arcangeli, Anton Blanchard, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe Eric Dumazet wrote: > On Tuesday 13 March 2007 12:12, Nick Piggin wrote: > >>I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose >>glibc allocator. But I wonder if there are other improvements that glibc >>can do here? > > > I cooked a patch some time ago to speedup threaded apps and got no feedback. Well that doesn't help in this case. I tested and the mmap_sem contention is not an issue. > http://lkml.org/lkml/2006/8/9/26 > > Maybe we have to wait for 32 core cpu before thinking of cache line > bouncings... The idea is a good one, and I was half way through implementing similar myself at one point (some java apps hit this badly). It is just horribly sad that futexes are supposed to implement a _scalable_ thread synchronisation mechanism, whilst fundamentally relying on an mm-wide lock to operate. I don't like your interface, but then again, the futex interface isn't exactly pretty anyway. You should resubmit the patch, and get the glibc guys to use it. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench 2007-03-13 11:12 ` Nick Piggin 2007-03-13 11:40 ` Eric Dumazet @ 2007-03-13 11:42 ` Andrea Arcangeli 2007-03-13 12:02 ` Eric Dumazet 2007-03-13 12:08 ` Nick Piggin 1 sibling, 2 replies; 45+ messages in thread From: Andrea Arcangeli @ 2007-03-13 11:42 UTC (permalink / raw) To: Nick Piggin Cc: Anton Blanchard, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe
On Tue, Mar 13, 2007 at 10:12:19PM +1100, Nick Piggin wrote:
> They'll be sleeping in futex_wait in the kernel, I think. One thread
> will hold the critical mutex, some will be off doing their own thing,
> but importantly there will be many sleeping for the mutex to become
> available.
The initial assumption was that there was zero idle time with threads = cpus, and the idle time showed up only when the number of threads increased to double the number of cpus. If the idle time didn't increase with the number of threads, nothing would be suspect.
> However, I tested with a bigger system and actually the idle time
> comes before we saturate all CPUs. Also, increasing the aggressiveness
> of the load balancer did not drop idle time at all, so it is not a case
> of some runqueues idle while others have many threads on them.
It'd be interesting to see the sysrq+t after the idle time increased.
> I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
> glibc allocator. But I wonder if there are other improvements that glibc
> can do here?
My wild guess is that they're allocating memory after taking futexes. If they do, something like this will happen:

	taskA			taskB			taskC
	user lock
				mmap_sem lock
	mmap_sem -> schedule
							user lock -> schedule

If taskB weren't there triggering more random thrashing on the mmap_sem, the lock holder wouldn't wait and taskC wouldn't wait either. I suspect the real fix is not to allocate memory or to run other expensive syscalls that can block inside the futex critical sections...
^ permalink raw reply [flat|nested] 45+ messages in thread
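(Andrea's scenario in concrete terms. The functions below are hypothetical and not taken from MySQL; they only illustrate why an allocation inside a user-level critical section couples every thread queued on the mutex to mmap_sem traffic.)

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Problematic: malloc() may grow or trim an arena heap via mmap()/mprotect(),
 * which take mm->mmap_sem, while table_lock is held -- so every waiter on
 * table_lock also ends up waiting on whoever holds mmap_sem. */
void insert_row_slow(const char *row, size_t len)
{
	pthread_mutex_lock(&table_lock);
	char *copy = malloc(len);		/* can block on mmap_sem */
	if (copy)
		memcpy(copy, row, len);
	/* ... link copy into the shared structure (elided) ... */
	pthread_mutex_unlock(&table_lock);
}

/* Preferred: do the allocation before entering the critical section. */
void insert_row_better(const char *row, size_t len)
{
	char *copy = malloc(len);
	if (!copy)
		return;
	memcpy(copy, row, len);
	pthread_mutex_lock(&table_lock);
	/* ... link copy into the shared structure (elided) ... */
	pthread_mutex_unlock(&table_lock);
}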
* Re: SMP performance degradation with sysbench 2007-03-13 11:42 ` Andrea Arcangeli @ 2007-03-13 12:02 ` Eric Dumazet 2007-03-13 12:27 ` Jakub Jelinek 0 siblings, 1 reply; 45+ messages in thread From: Eric Dumazet @ 2007-03-13 12:02 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Anton Blanchard, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe
On Tuesday 13 March 2007 12:42, Andrea Arcangeli wrote:
> My wild guess is that they're allocating memory after taking
> futexes. If they do, something like this will happen:
>
>	taskA			taskB			taskC
>	user lock
>				mmap_sem lock
>	mmap_sem -> schedule
>							user lock -> schedule
>
> If taskB weren't there triggering more random thrashing on the
> mmap_sem, the lock holder wouldn't wait and taskC wouldn't wait either.
>
> I suspect the real fix is not to allocate memory or to run other
> expensive syscalls that can block inside the futex critical sections...
glibc malloc uses arenas, and trylock() only. It should not block, because if an arena is already locked the thread automatically chooses another arena, and might create a new one if necessary.
But yes, mmap_sem contention is a big problem, because it's also taken by the futex code (unfortunately).
^ permalink raw reply [flat|nested] 45+ messages in thread
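(A rough sketch of the arena selection Eric describes -- simplified from the idea rather than copied from glibc's arena_get(); the structure and function names here are made up. As the next message points out, the free path is different: it has to lock the arena the chunk came from, with a plain blocking lock.)

#include <pthread.h>
#include <stddef.h>

struct arena {
	pthread_mutex_t lock;
	struct arena *next;		/* circular list of arenas */
	/* ... free lists, top chunk, ... */
};

static struct arena *pick_arena(struct arena *preferred)
{
	struct arena *a = preferred;

	do {
		if (pthread_mutex_trylock(&a->lock) == 0)
			return a;	/* got an uncontended arena, no sleeping */
		a = a->next;
	} while (a != preferred);

	/* All arenas busy: glibc would mmap() a brand new heap and arena here
	   rather than block; elided in this sketch. */
	return NULL;
}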
* Re: SMP performance degradation with sysbench 2007-03-13 12:02 ` Eric Dumazet @ 2007-03-13 12:27 ` Jakub Jelinek 0 siblings, 0 replies; 45+ messages in thread From: Jakub Jelinek @ 2007-03-13 12:27 UTC (permalink / raw) To: Eric Dumazet Cc: Andrea Arcangeli, Nick Piggin, Anton Blanchard, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe On Tue, Mar 13, 2007 at 01:02:44PM +0100, Eric Dumazet wrote: > On Tuesday 13 March 2007 12:42, Andrea Arcangeli wrote: > > > My wild guess is that they're allocating memory after taking > > futexes. If they do, something like this will happen: > > > > taskA taskB taskC > > user lock > > mmap_sem lock > > mmap sem -> schedule > > user lock -> schedule > > > > If taskB wouldn't be there triggering more random trashing over the > > mmap_sem, the lock holder wouldn't wait and task C wouldn't wait too. > > > > I suspect the real fix is not to allocate memory or to run other > > expensive syscalls that can block inside the futex critical sections... > > glibc malloc uses arenas, and trylock() only. It should not block because if > an arena is already locked, thread automatically chose another arena, and > might create a new one if necessary. Well, only when allocating it uses trylock, free uses normal lock. glibc malloc will by default use the same arena for all threads, only when it sees contention during allocation it gives different threads different arenas. So, e.g. if mysql did all allocations while holding some global heap lock (thus glibc wouldn't see any contention on allocation), but freeing would be done outside of application's critical section, you would see contention on main arena's lock in the free path. Calling malloc_stats (); from e.g. atexit handler could give interesting details, especially if you recompile glibc malloc with -DTHREAD_STATS=1. Jakub ^ permalink raw reply [flat|nested] 45+ messages in thread
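(One way to collect the per-arena statistics Jakub mentions without touching the database code: build this as a small shared object and LD_PRELOAD it into mysqld, so the stats are dumped at exit. A sketch; malloc_stats() writes its per-arena system/in-use byte counts to stderr.)

/* build: gcc -O2 -fPIC -shared -o mallocstats.so mallocstats.c */
#include <malloc.h>
#include <stdlib.h>

static void dump_malloc_stats(void)
{
	malloc_stats();		/* per-arena statistics, printed to stderr */
}

__attribute__((constructor))
static void install_stats_hook(void)
{
	atexit(dump_malloc_stats);
}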
* Re: SMP performance degradation with sysbench 2007-03-13 11:42 ` Andrea Arcangeli 2007-03-13 12:02 ` Eric Dumazet @ 2007-03-13 12:08 ` Nick Piggin 2007-03-14 23:33 ` Siddha, Suresh B 1 sibling, 1 reply; 45+ messages in thread From: Nick Piggin @ 2007-03-13 12:08 UTC (permalink / raw) To: Andrea Arcangeli Cc: Anton Blanchard, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe Andrea Arcangeli wrote: > On Tue, Mar 13, 2007 at 10:12:19PM +1100, Nick Piggin wrote: > >>They'll be sleeping in futex_wait in the kernel, I think. One thread >>will hold the critical mutex, some will be off doing their own thing, >>but importantly there will be many sleeping for the mutex to become >>available. > > > The initial assumption was that there was zero idle time with threads > = cpus and the idle time showed up only when the number of threads > increased to the double the number of cpus. If the idle time wouldn't > increase with the number of threads, nothing would be suspect. Well I think more threads ~= more probability that this guy is going to be preempted while holding the mutex? This might be why FreeBSD works much better, because it looks like MySQL actually will set RT scheduling for those processes that take critical resources. >>However, I tested with a bigger system and actually the idle time >>comes before we saturate all CPUs. Also, increasing the aggressiveness >>of the load balancer did not drop idle time at all, so it is not a case >>of some runqueues idle while others have many threads on them. > > > It'd be interesting to see the sysrq+t after the idle time > increased. > > >>I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose >>glibc allocator. But I wonder if there are other improvements that glibc >>can do here? > > > My wild guess is that they're allocating memory after taking > futexes. If they do, something like this will happen: > > taskA taskB taskC > user lock > mmap_sem lock > mmap sem -> schedule > user lock -> schedule > > If taskB wouldn't be there triggering more random trashing over the > mmap_sem, the lock holder wouldn't wait and task C wouldn't wait too. > > I suspect the real fix is not to allocate memory or to run other > expensive syscalls that can block inside the futex critical sections... I would agree that it points to MySQL scalability issues, however the fact that such large gains come from tcmalloc is still interesting. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 45+ messages in thread
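(What "set RT scheduling while holding a critical resource" can look like in the simplest case -- an illustrative sketch only: on Linux this needs CAP_SYS_NICE or root, and it is shown to clarify the idea Nick refers to, not as a recommendation or as MySQL's actual code.)

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static pthread_mutex_t hot_lock = PTHREAD_MUTEX_INITIALIZER;

void critical_section(void)
{
	struct sched_param rt = { .sched_priority = 1 };
	struct sched_param other = { .sched_priority = 0 };

	pthread_mutex_lock(&hot_lock);
	/* Try not to be preempted while other threads queue on hot_lock.
	   Fails with EPERM without the right privilege. */
	if (sched_setscheduler(0, SCHED_RR, &rt) != 0)
		perror("sched_setscheduler");

	/* ... work done under the lock ... */

	sched_setscheduler(0, SCHED_OTHER, &other);
	pthread_mutex_unlock(&hot_lock);
}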
* Re: SMP performance degradation with sysbench 2007-03-13 12:08 ` Nick Piggin @ 2007-03-14 23:33 ` Siddha, Suresh B 2007-03-20 2:29 ` Zhang, Yanmin 0 siblings, 1 reply; 45+ messages in thread From: Siddha, Suresh B @ 2007-03-14 23:33 UTC (permalink / raw) To: Nick Piggin Cc: Andrea Arcangeli, Anton Blanchard, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe
On Tue, Mar 13, 2007 at 05:08:59AM -0700, Nick Piggin wrote:
> I would agree that it points to MySQL scalability issues, however the
> fact that such large gains come from tcmalloc is still interesting.
What glibc version are you, Anton and the others using?
Does that version have this fix included?
Dynamically size mmap treshold if the program frees mmaped blocks.
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/malloc/malloc.c.diff?r1=1.158&r2=1.159&cvsroot=glibc
thanks, suresh
^ permalink raw reply [flat|nested] 45+ messages in thread
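(For testing against glibc builds that predate the dynamic-threshold change, the static knobs can be pinned by hand; the values below are arbitrary examples, not tuning advice. As the later messages in this thread show, the trim threshold is only consulted for the main arena in this glibc, so this does not by itself fix the arena problem.)

#include <malloc.h>

int main(void)
{
	/* Serve larger blocks from the heap instead of mmap(), and trim the
	   heap back to the OS less eagerly.  The same can be done without a
	   rebuild via the MALLOC_MMAP_THRESHOLD_ and MALLOC_TRIM_THRESHOLD_
	   environment variables. */
	mallopt(M_MMAP_THRESHOLD, 4 * 1024 * 1024);
	mallopt(M_TRIM_THRESHOLD, 4 * 1024 * 1024);

	/* ... rest of the program ... */
	return 0;
}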
* Re: SMP performance degradation with sysbench 2007-03-14 23:33 ` Siddha, Suresh B @ 2007-03-20 2:29 ` Zhang, Yanmin 2007-04-02 2:59 ` Zhang, Yanmin 0 siblings, 1 reply; 45+ messages in thread From: Zhang, Yanmin @ 2007-03-20 2:29 UTC (permalink / raw) To: Siddha, Suresh B Cc: Nick Piggin, Andrea Arcangeli, Anton Blanchard, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe
On Wed, 2007-03-14 at 16:33 -0700, Siddha, Suresh B wrote:
> On Tue, Mar 13, 2007 at 05:08:59AM -0700, Nick Piggin wrote:
> > I would agree that it points to MySQL scalability issues, however the
> > fact that such large gains come from tcmalloc is still interesting.
>
> What glibc version are you, Anton and the others using?
>
> Does that version have this fix included?
>
> Dynamically size mmap treshold if the program frees mmaped blocks.
>
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/malloc/malloc.c.diff?r1=1.158&r2=1.159&cvsroot=glibc
Last week, I reproduced it on RHEL4U3 with glibc 2.3.4-2.19. Today, I installed RHEL5GA and reproduced it again. RHEL5GA uses glibc 2.5-12, which already includes the dynamic mmap threshold patch, so that patch doesn't resolve the issue. The problem is really related to glibc's multi-threaded malloc/free behaviour.
My Paxville machine has 16 logical CPUs (dual core + HT). I disabled HT by hot-removing the last 8 logical processors. I captured the scheduler statistics. With sysbench at thread=8 (the best performance), about 3.4% of context switches are caused by __down_read/__down_write_nested. At thread=10 (past the peak, where performance drops), the percentage becomes 11.83%.
I captured the thread status with gdb. At thread=10, usually 2 threads are calling mprotect/mmap; at thread=8, no threads are calling mprotect/mmap. Such captures are somewhat random, but I tried many times. I think the increased percentage of context switches related to __down_read/__down_write_nested is caused by mprotect/mmap: both take the mm semaphore (mmap_sem), and contention on that semaphore drags performance down. strace shows mysqld often calls mprotect/mmap with the same length, 61440 bytes. That's more evidence.
gdb showed the mprotect is called via init_io_malloc=>my_malloc=>malloc=>_int_malloc=>mprotect, and the mmap via _int_free=>mmap. I checked the glibc sources and found the real call chains are malloc=>_int_malloc=>grow_heap=>mprotect and _int_free=>heap_trim=>mmap.
I guess the transaction processing of mysql/sysbench goes like this: mysql accepts a connection and allocates a block for it; after processing a couple of transactions, sysbench closes the connection; then the procedure restarts.
So why are there so many mprotect/mmap calls? glibc uses arenas to speed up malloc/free in multi-threaded environments, but mp_.trim_threshold only controls the main arena. In _int_free, FASTBIN_CONSOLIDATION_THRESHOLD might help, but it's a fixed value. The *ROOT CAUSE* is that the dynamic thresholds don't apply to non-main arenas.
To verify my idea, I created a small patch: when freeing a block, always check mp_.trim_threshold even if the block is not in the main arena. The patch is just meant to verify the idea, not to be the final solution.
--- glibc-2.5-20061008T1257_bak/malloc/malloc.c	2006-09-08 00:06:02.000000000 +0800
+++ glibc-2.5-20061008T1257/malloc/malloc.c	2007-03-20 07:41:03.000000000 +0800
@@ -4607,10 +4607,13 @@ _int_free(mstate av, Void_t* mem)
     } else {
       /* Always try heap_trim(), even if the top chunk is not
          large, because the corresponding heap might go away. */
+      if ((unsigned long)(chunksize(av->top)) >=
+          (unsigned long)(mp_.trim_threshold)) {
       heap_info *heap = heap_for_ptr(top(av));

       assert(heap->ar_ptr == av);
       heap_trim(heap, mp_.top_pad);
+      }
     }
   }

With the patch, I recompiled glibc and reran sysbench/mysql. The result is good: when the thread count is larger than 8, the tps and response time (avg) stay smooth and don't drop severely.
Is anyone able to test it on an AMD machine?
Yanmin
^ permalink raw reply [flat|nested] 45+ messages in thread
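(A stand-alone way to look for the behaviour Yanmin describes, independent of MySQL. This is a hypothetical test program, not something posted in the thread, and whether it shows the same mprotect()/mmap() churn depends on the glibc build; run it under "strace -f -c -e trace=mmap,munmap,mprotect ./a.out" and compare the call counts with and without the glibc change.)

/* build: gcc -O2 -pthread repro.c */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 16
#define ITERS    100000
#define BLOCK    61440		/* the allocation size seen in the mysqld strace */

static void *worker(void *arg)
{
	int i;

	(void)arg;
	for (i = 0; i < ITERS; i++) {
		char *p = malloc(BLOCK);
		if (!p)
			break;
		memset(p, 0, BLOCK);	/* touch the pages so the heap really grows */
		free(p);		/* may trigger heap_trim() in a non-main arena */
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}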
* Re: SMP performance degradation with sysbench 2007-03-20 2:29 ` Zhang, Yanmin @ 2007-04-02 2:59 ` Zhang, Yanmin 0 siblings, 0 replies; 45+ messages in thread From: Zhang, Yanmin @ 2007-04-02 2:59 UTC (permalink / raw) To: Siddha, Suresh B Cc: Nick Piggin, Andrea Arcangeli, Anton Blanchard, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe, drepper
On Tue, 2007-03-20 at 10:29 +0800, Zhang, Yanmin wrote:
> On Wed, 2007-03-14 at 16:33 -0700, Siddha, Suresh B wrote:
> > On Tue, Mar 13, 2007 at 05:08:59AM -0700, Nick Piggin wrote:
> > > I would agree that it points to MySQL scalability issues, however the
> > > fact that such large gains come from tcmalloc is still interesting.
> >
> > What glibc version are you, Anton and the others using?
> >
> > Does that version have this fix included?
> >
> > Dynamically size mmap treshold if the program frees mmaped blocks.
> >
> > http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/malloc/malloc.c.diff?r1=1.158&r2=1.159&cvsroot=glibc
> The *ROOT CAUSE* is that the dynamic thresholds don't apply to non-main arenas.
>
> To verify my idea, I created a small patch: when freeing a block, always
> check mp_.trim_threshold even if the block is not in the main arena. The
> patch is just meant to verify the idea, not to be the final solution.
>
> --- glibc-2.5-20061008T1257_bak/malloc/malloc.c	2006-09-08 00:06:02.000000000 +0800
> +++ glibc-2.5-20061008T1257/malloc/malloc.c	2007-03-20 07:41:03.000000000 +0800
> @@ -4607,10 +4607,13 @@ _int_free(mstate av, Void_t* mem)
>      } else {
>        /* Always try heap_trim(), even if the top chunk is not
>           large, because the corresponding heap might go away. */
> +      if ((unsigned long)(chunksize(av->top)) >=
> +          (unsigned long)(mp_.trim_threshold)) {
>        heap_info *heap = heap_for_ptr(top(av));
>
>        assert(heap->ar_ptr == av);
>        heap_trim(heap, mp_.top_pad);
> +      }
>      }
>    }
>
I sent a new patch to the glibc maintainer but didn't get a response, so I'm resending it here.
The glibc arenas exist to decrease malloc/free contention among threads, but an arena shrinks aggressively and therefore also grows aggressively. When heaps grow, mprotect is called; when heaps shrink, mmap is called. In the kernel, both mmap and mprotect need to hold the write lock of mm->mmap_sem, which introduces new contention. That contention largely cancels out the benefit of the arenas. Here is a new patch to address this issue.

Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
---
--- glibc-2.5-20061008T1257_bak/malloc/malloc.c	2006-09-08 00:06:02.000000000 +0800
+++ glibc-2.5-20061008T1257/malloc/malloc.c	2007-03-30 09:01:18.000000000 +0800
@@ -4605,12 +4605,13 @@ _int_free(mstate av, Void_t* mem)
       sYSTRIm(mp_.top_pad, av);
 #endif
     } else {
-      /* Always try heap_trim(), even if the top chunk is not
-         large, because the corresponding heap might go away. */
-      heap_info *heap = heap_for_ptr(top(av));
-
-      assert(heap->ar_ptr == av);
-      heap_trim(heap, mp_.top_pad);
+      if ((unsigned long)(chunksize(av->top)) >=
+          (unsigned long)(mp_.trim_threshold)) {
+        heap_info *heap = heap_for_ptr(top(av));
+
+        assert(heap->ar_ptr == av);
+        heap_trim(heap, mp_.top_pad);
+      }
     }
   }
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench 2007-03-12 22:00 ` Anton Blanchard 2007-03-13 5:11 ` Nick Piggin @ 2007-03-13 6:00 ` Eric Dumazet 2007-03-14 0:36 ` Nish Aravamudan 2 siblings, 0 replies; 45+ messages in thread From: Eric Dumazet @ 2007-03-13 6:00 UTC (permalink / raw) To: Anton Blanchard Cc: Nick Piggin, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe
Anton Blanchard a écrit :
> Hi Nick,
>
>> Anyway, I'll keep experimenting. If anyone from MySQL wants to help look
>> at this, send me a mail (eg. especially with the sched_setscheduler issue,
>> you might be able to do something better).
>
> I took a look at this today and figured Id document it:
>
> http://ozlabs.org/~anton/linux/sysbench/
>
> Bottom line: it looks like issues in the glibc malloc library, replacing
> it with the google malloc library fixes the negative scaling:
>
> # apt-get install libgoogle-perftools0
> # LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld
Hi Anton, thanks for the report.
glibc certainly has many scalability problems. One of the known problems is its (ab)use of mmap() to allocate one (yes: one!) page every time you fopen() a file, followed by a munmap() at fclose() time. mmap()/munmap() should be avoided like the plague in multithreaded programs.
^ permalink raw reply [flat|nested] 45+ messages in thread
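(The fopen() claim is easy to check on a given glibc build; illustrative test code, not from the thread. The stdio buffer is only allocated on first I/O, hence the fgetc(); run it as "strace -e trace=mmap,munmap ./a.out" and look for an mmap()/munmap() pair per iteration on the affected builds.)

#include <stdio.h>

int main(void)
{
	int i;

	for (i = 0; i < 100; i++) {
		FILE *f = fopen("/etc/hosts", "r");
		if (!f)
			return 1;
		fgetc(f);	/* force the stream buffer to be allocated */
		fclose(f);
	}
	return 0;
}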
* Re: SMP performance degradation with sysbench 2007-03-12 22:00 ` Anton Blanchard 2007-03-13 5:11 ` Nick Piggin 2007-03-13 6:00 ` Eric Dumazet @ 2007-03-14 0:36 ` Nish Aravamudan 2007-03-14 1:00 ` Eric Dumazet 2 siblings, 1 reply; 45+ messages in thread From: Nish Aravamudan @ 2007-03-14 0:36 UTC (permalink / raw) To: Anton Blanchard Cc: Nick Piggin, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe On 3/12/07, Anton Blanchard <anton@samba.org> wrote: > > Hi Nick, > > > Anyway, I'll keep experimenting. If anyone from MySQL wants to help look > > at this, send me a mail (eg. especially with the sched_setscheduler issue, > > you might be able to do something better). > > I took a look at this today and figured Id document it: > > http://ozlabs.org/~anton/linux/sysbench/ > > Bottom line: it looks like issues in the glibc malloc library, replacing > it with the google malloc library fixes the negative scaling: > > # apt-get install libgoogle-perftools0 > # LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld Quick datapoint, still collecting data and trying to verify it's always the case: on my 8-way Xeon, I'm actually seeing *much* worse performance with libtcmalloc.so compared to mainline. Am generating graphs and such still, but maybe someone else with x86_64 hardware could try the google PRELOAD and see if it helps/hurts (to rule out tester stupidity)? Thanks, Nish ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench 2007-03-14 0:36 ` Nish Aravamudan @ 2007-03-14 1:00 ` Eric Dumazet 2007-03-14 1:09 ` Nish Aravamudan 0 siblings, 1 reply; 45+ messages in thread From: Eric Dumazet @ 2007-03-14 1:00 UTC (permalink / raw) To: Nish Aravamudan Cc: Anton Blanchard, Nick Piggin, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe Nish Aravamudan a écrit : > On 3/12/07, Anton Blanchard <anton@samba.org> wrote: >> >> Hi Nick, >> >> > Anyway, I'll keep experimenting. If anyone from MySQL wants to help >> look >> > at this, send me a mail (eg. especially with the sched_setscheduler >> issue, >> > you might be able to do something better). >> >> I took a look at this today and figured Id document it: >> >> http://ozlabs.org/~anton/linux/sysbench/ >> >> Bottom line: it looks like issues in the glibc malloc library, replacing >> it with the google malloc library fixes the negative scaling: >> >> # apt-get install libgoogle-perftools0 >> # LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld > > Quick datapoint, still collecting data and trying to verify it's > always the case: on my 8-way Xeon, I'm actually seeing *much* worse > performance with libtcmalloc.so compared to mainline. Am generating > graphs and such still, but maybe someone else with x86_64 hardware > could try the google PRELOAD and see if it helps/hurts (to rule out > tester stupidity)? I wish I had a 8-way test platform :) Anyway, could you post some oprofile results ? ^ permalink raw reply [flat|nested] 45+ messages in thread
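(For reference, the kind of profile Eric is asking for can be collected with a generic oprofile session along these lines; paths and event defaults vary by distribution, and these are not the exact commands used elsewhere in the thread.)

  # opcontrol --no-vmlinux        (or --vmlinux=/path/to/vmlinux for kernel symbols)
  # opcontrol --start
    ... run the sysbench OLTP workload ...
  # opcontrol --stop
  # opreport -l /usr/sbin/mysqld | head -20
  # opcontrol --shutdown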
* Re: SMP performance degradation with sysbench 2007-03-14 1:00 ` Eric Dumazet @ 2007-03-14 1:09 ` Nish Aravamudan 0 siblings, 0 replies; 45+ messages in thread From: Nish Aravamudan @ 2007-03-14 1:09 UTC (permalink / raw) To: Eric Dumazet Cc: Anton Blanchard, Nick Piggin, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe On 3/13/07, Eric Dumazet <dada1@cosmosbay.com> wrote: > Nish Aravamudan a écrit : > > On 3/12/07, Anton Blanchard <anton@samba.org> wrote: > >> > >> Hi Nick, > >> > >> > Anyway, I'll keep experimenting. If anyone from MySQL wants to help > >> look > >> > at this, send me a mail (eg. especially with the sched_setscheduler > >> issue, > >> > you might be able to do something better). > >> > >> I took a look at this today and figured Id document it: > >> > >> http://ozlabs.org/~anton/linux/sysbench/ > >> > >> Bottom line: it looks like issues in the glibc malloc library, replacing > >> it with the google malloc library fixes the negative scaling: > >> > >> # apt-get install libgoogle-perftools0 > >> # LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld > > > > Quick datapoint, still collecting data and trying to verify it's > > always the case: on my 8-way Xeon, I'm actually seeing *much* worse > > performance with libtcmalloc.so compared to mainline. Am generating > > graphs and such still, but maybe someone else with x86_64 hardware > > could try the google PRELOAD and see if it helps/hurts (to rule out > > tester stupidity)? > > I wish I had a 8-way test platform :) > > Anyway, could you post some oprofile results ? Hopefully soon -- want to still make sure I'm not doing something dumb. Am also hoping to get some of the gdb backtraces like Anton had. Thanks, Nish ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench [not found] ` <fa.XocsudxlGplKh0kloTtA0juPwtA@ifi.uio.no> @ 2007-02-28 0:20 ` Robert Hancock 2007-02-28 1:32 ` Hiro Yoshioka 0 siblings, 1 reply; 45+ messages in thread From: Robert Hancock @ 2007-02-28 0:20 UTC (permalink / raw) To: hyoshiok Cc: Dave Jones, Pete Harlan, Nick Piggin, Rik van Riel, Lorenzo Allegrucci, linux-kernel, Ingo Molnar, Suparna Bhattacharya, Jens Axboe
Hiro Yoshioka wrote:
> Howdy,
>
> MySQL 5.0.26 had some scalability issues and it solved since 5.0.32
> http://ossipedia.ipa.go.jp/capacity/EV0612260303/
> (written in Japanese but you may read the graph. We compared
> 5.0.24 vs 5.0.32)
>
> The following is oprofile data
> ==>
> cpu=8-mysql=5.0.32-gcc=3.4/oprofile-eu=2200-op=default-none/opreport-l.txt
> <==
> CPU: Core Solo / Duo, speed 2666.76 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Unhalted clock cycles) with a unit
> mask of 0x00 (Unhalted core cycles) count 100000
> samples   %        app name             symbol name
> 47097502  16.8391  libpthread-2.3.4.so  pthread_mutex_trylock
> 19636300   7.0207  libpthread-2.3.4.so  pthread_mutex_unlock
> 18600010   6.6502  mysqld               rec_get_offsets_func
> 18121328   6.4790  mysqld               btr_search_guess_on_hash
> 11453095   4.0949  mysqld               row_search_for_mysql
>
> MySQL tries to get a mutex but it spends about 16.8% of CPU on 8 core
> machine.
Curious that it calls pthread_mutex_trylock (as opposed to pthread_mutex_lock) so often. Maybe they're doing some kind of mutex lock busy-looping?
-- Robert Hancock Saskatoon, SK, Canada To email, remove "nospam" from hancockr@nospamshaw.ca Home Page: http://www.roberthancock.com/
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: SMP performance degradation with sysbench 2007-02-28 0:20 ` Robert Hancock @ 2007-02-28 1:32 ` Hiro Yoshioka 0 siblings, 0 replies; 45+ messages in thread From: Hiro Yoshioka @ 2007-02-28 1:32 UTC (permalink / raw) To: hancockr Cc: davej, harlan, nickpiggin, riel, l_allegrucci, linux-kernel, mingo, suparna, jens.axboe, hyoshiok From: Robert Hancock <hancockr@shaw.ca> Subject: Re: SMP performance degradation with sysbench Date: Tue, 27 Feb 2007 18:20:25 -0600 Message-ID: <45E4CAC9.4070504@shaw.ca> > Hiro Yoshioka wrote: > > Howdy, > > > > MySQL 5.0.26 had some scalability issues and it solved since 5.0.32 > > http://ossipedia.ipa.go.jp/capacity/EV0612260303/ > > (written in Japanese but you may read the graph. We compared > > 5.0.24 vs 5.0.32) > > > > The following is oprofile data > > ==> > > cpu=8-mysql=5.0.32-gcc=3.4/oprofile-eu=2200-op=default-none/opreport-l.txt > > <== > > CPU: Core Solo / Duo, speed 2666.76 MHz (estimated) > > Counted CPU_CLK_UNHALTED events (Unhalted clock cycles) with a unit > > mask of 0x00 (Unhalted core cycles) count 100000 > > samples % app name symbol name > > 47097502 16.8391 libpthread-2.3.4.so pthread_mutex_trylock > > 19636300 7.0207 libpthread-2.3.4.so pthread_mutex_unlock > > 18600010 6.6502 mysqld rec_get_offsets_func > > 18121328 6.4790 mysqld btr_search_guess_on_hash > > 11453095 4.0949 mysqld row_search_for_mysql > > > > MySQL tries to get a mutex but it spends about 16.8% of CPU on 8 core > > machine. > > Curious that it calls pthread_mutex_trylock (as opposed to > pthread_mutex_lock) so often. Maybe they're doing some kind of mutex > lock busy-looping? Yes, it is. Regards, Hiro ^ permalink raw reply [flat|nested] 45+ messages in thread
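(The busy-looping Hiro confirms is the usual spin-then-yield pattern; below is a sketch of the general idea, modelled loosely on what InnoDB-style mutexes do rather than on the actual MySQL source. Every failed pthread_mutex_trylock() in the spin phase is a sample charged to that function, which is consistent with the 16.8% figure in the profile above.)

#include <pthread.h>
#include <sched.h>

#define SPIN_ROUNDS 20

void spin_then_yield_lock(pthread_mutex_t *m)
{
	int i;

	for (;;) {
		for (i = 0; i < SPIN_ROUNDS; i++) {
			if (pthread_mutex_trylock(m) == 0)
				return;		/* acquired while spinning */
			/* a real implementation inserts a short delay here */
		}
		/* give up the CPU before spinning again; a real implementation
		   would typically block on an event/condition instead */
		sched_yield();
	}
}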