I know what the problem is
Where we are going we need all the courage we can get
When I first read the comment that our shiny new gcc 6.2 C++14 compiler has slow printf and scanf, I immediately knew what I would do the next weekend. The new MinGW switched to a custom implementation, and I didn't need anything else to tell me what the problem was. Of course their format string parsing is too slow, they copy data around countless times, no file buffering is used, and they have enough levels of indirection to make finding the real code feel like a treasure hunt. What else could the problem be, after all?
Keeping these thoughts in mind, I patiently waited for a weekend. When I finally opened the MinGW source and searched for the printf function, I was pleasantly surprised. The real implementation was only a couple of hops away, and soon enough I was looking at the heart of the printf code. MinGW did try to conceal it a bit by naming it mingw_pformat.c, but a ~100Kb source file was a very clear hint as to where all the action is. The format string parsing was straightforward and located in a single big function. Everything was passed by pointer and there were hardly any levels of indirection. But the data was clear: MinGW printf is twice as slow as msvcrt.dll for integer numbers, and scanf is even worse. Increasing the file buffer size didn't help at all. Reluctantly, I had to admit that the problem was somewhere else. But what could it possibly be?
Locked out of performance
A new wonderful multithreaded world
The journey to performance promised not to be as straightforward as I had thought, so I needed a quick way to make changes in the MinGW CRT library and test their performance. Looking through the library symbols, I found that the MinGW versions of printf and scanf live in the mingwex library, so I just needed to find out how to build it. Looking for build files brought bad news: MinGW uses automake and autoconf. I don't know a way to extract a single library from this monstrosity, but no one can stop us from creating our own build files! The list of all source files in the library was clearly visible in the build files, and nothing else is really needed. Time to dust off the CMake book on my desk and get to work. A few hours later I had a working build of the mingwex library and a benchmark program that uses it instead of the system one.
With our bases covered, it is time to read through the pformat code searching for possible performance issues. At the bottom of the call stack I invariably found the __pformat_putc function, which in turn calls fputc to write into the file. Aha, there it is! Everybody knows that writing data symbol by symbol is slow. But why is it slow, exactly? Does it flush the buffer after every symbol? Documentation says otherwise. And if fputc is buffered just like everything else, we shouldn't see any slowdown beyond a function call. The only way to find out is to look at the fputc source code. You will not find it in the MinGW folder, though. Unlike printf, fputc doesn't have a MinGW variant, so the call goes straight to the msvcrt.dll library. Its source code is included in the Visual Studio 6.0 installation, but we can use the source of its more modern versions as an approximation. So how do you write a symbol to a file? You check the file stream for errors (like end-of-file), create an SEH frame, take a lock, call _fputc_nolock, and release the lock in the SEH finally block. Now it all makes sense. In the multithreaded world we need to take locks on every file operation, and those SEH frames don't look cheap either. But printf is itself a file operation and so already takes the locks. Why take them again and again for every character? We don't have to, and replacing fputc with _fputc_nolock solves the printf performance problem.
The same trick can be done for scanf, but this time we have two functions to worry about: getc and... ungetc. Replacing getc with _getc_nolock (and adding the missing locks to scanf) immediately gives a huge performance boost, but replacing ungetc causes a compile error. It turns out _ungetc_nolock is only present in post-msvcrt runtime libraries. There is still a way to increase performance, though. We can write the pushed-back characters into an internal buffer instead and do all the real ungetc calls at once at the end of the function. This way we don't need to lock the file when an ungetc is immediately followed by a getc. Moreover, this internal buffer already exists for exactly this purpose, to support the string variants of scanf: you can't ungetc into a user-provided string (it can be in read-only memory), and the authors hadn't noticed that the function only ever tries to put back exactly the same symbols that were there before.
Cutting corners and hopefully not blowing up on a pad
Even with getc and ungetc locking under control, scanf was still 2.3x slower on my integer numbers test. Unfortunately, there is more going on here than in the printf case. The next breakthrough happened when I thought about the internal ungetc buffer for the umpteenth time. It is wastefully large: 1024 elements of type int. Not that stack space is precious — it takes a single instruction to allocate any amount of stack space — but the whole thing gets initialized with zeros. Stop, why are we doing that? Initializing structures with memset is convenient, but what would happen if we initialized only what we need, i.e. everything except this large buffer? With that change, scanf came in at just below 2x slower.
The next improvement came from a simple thought: we got rid of the file locks, but there might be other locks hiding in function calls. Luckily, there are very few candidates in the scanf code — just malloc/free and localeconv. Both turned out to be a problem. The only reason malloc is used in the integer parsing code is to copy the input into a string in order to call the standard strtoXX function on it. The string buffer is lazily allocated, so every scanf run needs at least one malloc call. Allocating a 256-byte buffer on the stack and using malloc only when a larger buffer is needed helped a lot. The problem with localeconv is that the locale is per-thread, so the function has to read a thread-local variable, which is not very fast on Windows. localeconv was being called twice, once for the decimal separator and once for the thousands separator, but a single call yields both. After applying both changes, the scanf integer test reached the 1.5x slowdown mark.
It is not a compiler
With all the easy performance problems gone, it was time to try a profiler. Normally one starts with a profiler, but we are trying to speed up a MinGW library, which means the Visual Studio profiler can't understand its symbols, and the MinGW profiler needs to be compiled first — which turned out to be a long quest in and of itself. Unfortunately, even the MinGW profiler only shows useful per-function information, while its per-line information is complete nonsense. Still, looking through the per-function data gave an unexpected gift: some short functions were not being inlined. Adding inline, and separating the code that is good for inlining from the code that can stay in a function, squeezed out a bit more speed. The final performance on the integer benchmark is 1.3-1.35x slower than the native version.
Drowning in floating point numbers
There is no easy road to correct rounding
The situation for floating point numbers is very different. On a floating point test, the MinGW version of scanf is ~1.9x slower while printf is ~1.6x faster. Unfortunately, scanf is very hard to improve. Both the native version (at least in modern versions of the runtime) and the MinGW version implement correct rounding and therefore have to use big integers and very complex algorithms. The profiler shows that a significant amount of time is spent allocating memory for these big integers (which also requires locking, by the way), but there is no easy fix for that. The native version in the modern C/C++ runtime uses fixed-width big integers capable of handling the worst case and allocates them on the stack instead (without zero-initialization, of course). It doesn't handle long double, though, and has to be more careful about how many big integers are used.
For now this performance problem remains unresolved. Any substantial change to the floating point parsing code had better come with good tests.
When can I use it on Codeforces?
Making a reality check
The new printf/scanf are available in the GNU G++14 6.2.0 compiler as of yesterday. I ran simple integer and floating point benchmarks on the Codeforces servers via the custom invocation tab to test real-world speed. The integer benchmark behaves similarly to my machine, while the floating point times are quite different (both MinGW printf and scanf are slower, but the performance gap is smaller). In the table below, mingw means the new C++14 compiler, msvcrt means the new C++14 compiler using the msvcrt functions by including stdio.h first (should be similar to the existing gcc 5.2 C++11 compiler), and vs 2010 means the Visual Studio compiler using the 2010 variant of msvcrt.
Integer benchmark. 10M integers, printf("%d")/scanf("%d")

          printf    scanf
mingw:    1482ms    1919ms
msvcrt:   1310ms    1560ms
vs 2010:  1779ms    1154ms

Floating point benchmark. 5M doubles, printf("%.1lf")/scanf("%lf")

          printf    scanf
mingw:    2153ms    4977ms
msvcrt:   1762ms    3198ms
vs 2010:  3807ms    2558ms