Sunday, November 15, 2009

100,000 tasklets: Stackless and Go

This is the place to leave comments regarding my essay 100,000 tasklets: Stackless and Go. I repeated a timing test from Go using Stackless and showed that Stackless was faster than Go, yet Go should have the advantage of being a compiled language designed for just this task. I also questioned why the performance of the Go compiler was interesting, given that it seems to be slower than gcc.


Karl said...

There's an ongoing discussion on Ars Technica's Programmer Symposium forum. Metasyntactic works at google and outlines the focus on compilation speed on page 3 or 4. I don't remember the concurrency speed coming up, but you can probably get a decent answer from Meta if he's feeling up for it.

Johann Visagie said...

Hi Andrew. Antoine and I were also discussing the whole Go thing with some befuddlement ranging from the actual point of the exercise to, well, the performance.

We were mostly looking at things from the point of view of Haskell, which has become tantalisingly close to being a viable "systems programming language" despite the fact that the space usage signature of lazy algorithms is still poorly understood in some cases.

Yet despite the fact that Haskell in the current stat of its development wouldn't try to make exaggerated claims about execution speed or memory efficiency, it beats the pants off Go's 6g compiler on the computer language shootout at So… huh?

One bizarre take-home from Rob Pike's talk is the fact that they seem have focussed on compiler efficiency (as opposed to efficiency of compiled code) because they believe that most people choose "dynamic" languages because they don't like waiting for a compiler. Which (if true) is a simply monumental misconception. (If you ask me, many people opt for dynamic languages because they're tired of dealing with ancient, poorly designed type systems. But that's besides the point.)

Andrew Dalke said...

Hi Karl. Thanks for the pointer, though I had hope the lazyweb would have worked for me and just told me the answer. ;)

Reading through that thread, the Go assertion seems to be "fast for large programs with lots of dependencies where the C and Java program analysis isn't good enough to figure out quickly what needs to be compiled."

The problem is, the videos only point out "look at how fast Go is on this small set of code." Which seems to be beside the point.

Anonymous said...

I always look forward for your articles. This one is again as interesting as always. Thanks!

Don Stewart said...

The Haskell version:

Note this is a modified version of the famous 'thread-ring' benchmark from the Language Shootout.

tunginobi said...

Compilation speed seems to also have taken the upper hand over binary sizes. As I've written in my blog, hello world is 569 KB, while a 30-something line HTTP file server I wrote is 990 KB!

I understand it's still early days for Go, but is compilation speed really that important?

Andrew Dalke said...

Thanks Anonymous!

To Don Stewart, I've been participating in the reddit forum, as you've likely seen.

To tunginobi, I've read your blog post. Interesting about the sizes, and I think your conclusion is correct.

As for Johann, that's an interesting viewpoint, and one I hadn't thought of. The view from Metasyntactic on Ars Technica was really more an issue of dependency couplings in very large code bases, which of course is not a problem in Python.

I still don't get it, but then perhaps I'm too deep into the waters of Python, blessed it be. (All hail the Python!)

Leonardo Santagada said...

One of the things that go does is tasklet migration and maybe balancing on threads, so while you have the same abstractions on stackless python, you don't get the same performance on anything bigger than a single core processor.

This could be fixed, but you would also have to remove the GIL or do something like distributing tasklets to diferent processes and managing interprocess channels.

Andrew Dalke said...

Hi Leonardo! I know exactly what you're saying, but I want to point out that my question is why Pike emphasized the performance of Go's concurrency demo, when it was likely quite a bit slower than Stackless Python, which isn't a compiled language and wasn't built from scratch for concurrency.

The demo he showed cannot be sped up by multiple processors because there's really only one active goroutine at a time. And according to what Richard found in the source code, Go currently defaults to a single kernel thread for goroutines, with other threads used for blocking calls.

If there was a timing test demo which made more effective use of multiple cores then I would like to see it, but from comments I've seen so far, increasing the number of kernel threads ends up reducing the overall performance.

Leonardo Santagada said...

I think all your answer is very interesting and important information, it should be somewhere on your article :)

And if it only uses other threads to do sync io then you could probably compare it to twisted, if I understand correctly there is support for that with twisted deferreds.

So it is not an erlang killer after all :)

Chuck Simmons said...

The speed of compling becomes important when you are compiling millions of lines of code, which frequently happens at Google. There are huge sub-systems that constantly get pulled in. Oracle also has this problem.

The reason this is a problem is because there are O(N^2) (or worse) algorithms when compiling these subsystems. 'go' aims to fix this problem of hour-long project recompilations.

'go' is concerned with dynamic program execution. That's why it is a typed langugage. (Typing also provides much of the redundancy that python unittests provide.) 'go' aims to keep the typing from interfering with the benefits that python or ruby bring when they avoid the redundancy of strong typing.

'go' is intended to fill a niche between C and python. For 'go' to be successful, it must provide performance comparable to 'c' (e.g. within 20%; certainly not a factor of 10), but it must also preserve or improve the "fun" of programming in python. For people adding code to large systems, reducing the compile time is very important.

Unfortunately, there isn't enough 'go' code to actually validate how fast 'go' is on millions of lines of code, so Rob is trying to extrapolate from the data he currently has available.

Overall, the trollish nitpicks that speed-of-compilation has only been demonstrated on a small amount of code, or that there is some particular algorithm for which 'go' is slow aren't that interesting. 'go' has plenty of time to fix speed-of-execution problems for particular algorithms.

If you're having fun programming in Haskell or Stackless Python or Erlang or whatever, keep having fun. Go isn't trying to make you stop using your favorite language. Go is a research project that is still early days.

Andrew Dalke said...

A result of my posting was finding out just how bad C/C++ is for large code bases. The biggest systems I've ever worked on are only 1/2 million LOC and without the multi-project dependencies I've since learned that Google has.

(Another result was Stackless users found out just how fast Stackless was on that sort of task, which they didn't expect.)

I still didn't like that PIke emphasized the performance with little basis for it. But *shrug*, there's little I can do about that.

Narrow Road said...
This comment has been removed by the author.
Magnet.Code said...

if "go" works on Windows it would be interesting to see if the diff way windows handles threading could lower the differences you see here on the python/go result.

Hadyan Mardhi Fadlillah said...

Nice Info, I like your blog.

keep update !!

visit this site:

Denny said...

Hi friend,
Thank you sharing some info
Have a nice day

Unknown said...

I think you might have forgotten one important aspect here: goroutines may run across multiple OS threads ("The global setting GOMAXPROCS controls how many threads Go scheduler will use to run goroutines not blocked in system calls. " ~ effective go), whereas SP tasklets run within the same Python thread (more specifically, execution context, of which there can only be 1/process anyway because of the GIL). Thread synchronization has its cost.

I'd suggest first setting GOMAXPROCS to 1 and comparing the timing for an accurate comparison between the two languages.

Andrew Dalke said...

Hi Unknown.

When I wrote that some 3 years ago, GOMAXPROCS was hard-coded to 1 because there was a problem with the N:M coding.

Even now, I see that GOMAXPROCS is a per-application setting, since some programs don't scale up with the number of processors, because of the extra context switch overhead.

As I recall, Stackless supports multiple stackless threads per pthread, though I'm not sure about that. In any case, the GIL is released when calling out to blocking code, so many I/O bound programs using Stackless can take advantage of multiple threads, and CPU bound tasks with limited I/O can use the multiprocessing module.

But yes, pure (C)Python code with lots of mutable state can't take advantage of multiple CPUs.

Jayson Vantuyl said...
This comment has been removed by the author.
Jayson Vantuyl said...

With respect to Addendum #1, Go does not crash, though why it doesn't is a bit involved.

Go has two subtly different concepts: arrays and slices. Arrays are the underlying data structure that resides in memory. Slices are typically how you use them. In fact, 90% of the time you never touch arrays, just slices.

How does this impact sort? As it turns out, sort.go defers everything to the Swap implementation on your interface. If that interface is defined with a value receiver, then it gets a copy of the slice.

Even though it's modifying an underlying array, it won't cause a crash due to concurrent mutation of the slice (though the data in the array could cause later crashes if it gets raced-to-death), as the slice being modified belongs to it alone. All of the example interfaces are defined this way.

That said, if one were to define your interface badly, you could have issues. Slices, maps, and pointers are not threadsafe (intentionally, for performance).

I haven't tested any of this, mind you. This is just my impression from the following resources:

* The Sort Implementation
* Blog Post on Slices
* Memory Model Documentation
* A Particularly Salient E-mail Thread

In these cases, you could segfault the runtime, though it would go down with some decent debugging, I think.

With respect to the lack of exceptions, Go has "errors" and "panics", which subsume the exception use-cases.

In general, it's type-safety and multi-value returns make 90% of error-handling cases not require any particularly unusual control structures. See here.

For large-scale control patterns, there are still panics. They are typically useful for higher-level stuff than just errors. See here.

I'm hard-pressed to articulate when you'd choose between the two, but I've got a pretty good intuitive sense of it.

Basically, the sentiment is that using exceptions for error handling requires you to use an incredibly disruptive control structure for what is effectively a very normal operation. Multi-value returns get you the 80% for normal error conditions.

For that remaining 20% where the stack-unwinding behavior is desirable, Go provides the generally useful defer behavior which is useful for non-error cases (i.e. try-finally), and uses panic-recover to provide the remaining, behavior.

sdgsdg said...

FYI the results are still the same. I've just compiled a Stackless 3.4.2 for this and checked against go1.7.3. I plan to do some more investigations just to see different scenarios.
Here are the numbers:

$ time ./stackless/bin/python3.4 ./

real 0m0.276s
user 0m0.236s
sys 0m0.036s

$ time ./speedtest

real 0m0.982s
user 0m0.612s
sys 0m0.976s