This one weird trick will speed up your I/O 5x

#	User	Rating
1	tourist	3880
2	jiangly	3669
3	ecnerwala	3654
4	Benq	3627
5	orzdevinwang	3612
6	Geothermal	3569
6	cnnfls_csy	3569
8	jqdai0815	3532
9	Radewoosh	3522
10	gyh20	3447

#	User	Contrib.
1	awoo	161
1	maomao90	161
3	adamant	156
4	maroonrk	153
5	-is-this-fft-	148
5	atcoder_official	148
5	SecondThread	148
8	Petr	147
9	nor	144
10	TheScrasse	142

Just do this!!

#include <iostream>

int main() {
    std::ios_base::sync_with_stdio(false);
    std::cin.tie(nullptr);

    // Your code here
    int n;
    std::cin >> n;
    long long sum = 0;
    for (int i = 0; i < n; i++) {
        int x;
        std::cin >> x;
        sum += x;
    }
    std::cout << sum << std::endl;
}

This will speed up your code so much you won't have to care about spending time in I/O anymore!! Give likez plz :3

Yeah, no. That's not how it works. We're still using <iostream>, and that means we have to rely on virtual inheritance and all the deoptimizations that follow from that. Even if you use a random "here's-my-fast-io-library" snippet from Codeforces, you still depend on block-buffered output. (Also, none of those libraries mesh well with floats.) In many cases, more CPU time is spent encoding/decoding numbers than interacting with the file system.

Can we do better? ...Yeah, stupid question. I wouldn't write this post if we couldn't. Here's how far I managed to go in half a month:

Aggregated benchmark results

These results are from Linux, that's why they look reasonable.

Under Windows, my library is at least as fast as this plot shows, but it also manages to get much faster than <iostream> on integers and much faster than <cstdio> on floating-point numbers. This supports the common claim "use cstdio unless you need floating-point" which I never managed to reproduce myself because apparently the relative performance of cstdio and iostream is the exact inverse of that on Linux. Duh. (To Apple fans: your <iostream> sucks because you use libc++. Install GCC together with Linux (brew sucks), switch to cstdio or (better yet (yes I am trying to force people to use this, why do you ask?)) my I/O.)

Here's the relevant "who the hell thought releasing this garbage was a good idea" plots for Windows MinGW:

The performance is compared to <iostream> and <cstdio>, and also to burunduk1's library as a common example of fast I/O among competitive programmers.

"Random length" means that the test numbers are generated by first choosing length uniformly and then generating this many random digits. "Random value" means the number is chosen uniformly from the valid values of the type. For instance, most of the numbers are going to be of maximal length in this case. This predictive behavior makes some stuff faster to process.

rng is assumed to be uniformly distributed, The difference between rng and e^rng is just like between "random value" and "random length". The units are one football field over the height of 100 Big Macs, by the way. Do not compare the metrics from different tests.

Nothing of interest here. Maybe LOL at if (flag) std::cout << "YES"; else std::cout << "NO"; being faster than std::cout << (flag ? "YES" : "NO"); or something.

It supports std::bitset too. Not like anyone really uses that but uhhh...

Also, notice that for whatever reason output is relatively slow compared to input. That's not just an inefficiency of any particular I/O algorithm; that limitation is due to the performance limits of writing to a file. Generally speaking, output performance increases and input performance decreases in interactive tasks (because the input file can no longer be mmapped but the output pipe can be written to without going through vfs handlers). Unless you have to flush constantly, which you do. Idk, this is probably of more interest to test system developers than anyone else.

I'm impatient. How do I use this in my solutions?

You do this:

#include <iostream>

/* Copy-paste blazingio library here */

int main() {
    // If you're used to writing the two lines below, you can still use them, they just become no-ops.
    // std::ios_base::sync_with_stdio(false);
    // std::cin.tie(nullptr);
    // If you use std::cout.tie(nullptr);, you are wrong.
    // This won't ever be supported.
    // You don't need this line. Remove it.
    // You have almost certainly added 12 useless bytes to every solution. They take space on disk.
    // Disks are produced by young children on factories. Do you want children to work more because
    // of you? Think of the children.
    // And about yourself too. One of the pictures on this page is loaded from my domain. I know your IP.

    // Your code here
    int n;
    std::cin >> n;
    long long sum = 0;
    for (int i = 0; i < n; i++) {
        int x;
        std::cin >> x;
        sum += x;
    }
    std::cout << sum << std::endl;
}

Here's a big blue link to the library because I know you're going to miss it otherwise:

blazingio.min.hpp

The library, called blazingio because I use Rust professionally and I signed an agreement that says I have to use the word "blazing" at least once per project, is packaged as a drop-in <iostream> replacement, which means you can use std::cin and std::cout (or using namespace std; + cin/cout) just like before.

If you're used to cin/cout, just paste the library into your template. If it behaves weirdly or you want to disable it for whatever reason, just remove it. No need to change your own code. Isn't that cool? I think that's cool. It took a while to make it work reliably. Anyway.

Are there any downsides?

That's quite pessimistic of you. You shouldn't have asked that, you should have said "what a cool project" or asked "what are the upsides?".

The main downside is that it's large. It's around 9 KB, and Codeforces submission size limit is just 65535 bytes. The repository contains technical instructions on how to build a slim version if you want to reduce the size. But unless you use another big template, this shouldn't be an issue.

Another downside is that blazingio keeps various large in-memory buffers, which means that your memory usage might increase by approximately the size of I/O. This is typically around 20 MiB max. This shouldn't be a problem in most cases, but you might want to be aware of this. Additionally, ejudge scripts have some legacy code that forces blazingio to allocate exactly 24 MiB for output buffers on some ejudge instances. This is problematic both because it uses more space than necessary and because that might turn out to be not enough. Hopefully these issues can be solved soon.

The last significant problem is that it might be, uh, useless. There used to be tasks that couldn't be solved without fast I/O in the past, but this is considered to be in bad taste by problem setters. However, even if fast I/O is not required, it might still give you an edge, enabling you to use less efficient algorithms in your solution elsewhere.

How it works

The one problem with competitive programmers is they don't see how to apply the tricks they constantly use to seemingly low-level tasks. (Another problem is that most of them share the same mental conditions, and don't talk to anyone IRL, so they don't question their mental health and think their meltdowns are just a quirk.) They are also less aware of OS and hardware internals. So maybe this could be your chance to learn something new.

Unfortunately, I do not have time for a write up at the moment. The short list is:

mmap for input, if it's a regular file (similarly on Windows)
MAP_NORESERVE-based large output buffer on Linux
Branchless autogrowing PAGE_GUARD-based output buffer on Windows
Optimized fast paths for a) input from a regular file, b) input from a pipe if the local buffer is filled
Workarounds for bad GCC codegen wrt. syscalls with inline assembly (had to read XNU code for that for M1 Macs; screw Apple, by the way)
Optimized float reading by loading digits to u64 if precision allows
"sqrt decomposition" (no, not really) for float exponents
SIMD-optimized string input and bitset input/output, including SWAR on SIMDless targets
FILE_FLAG_NO_BUFFERING, FILE_FLAG_WRITE_THROUGH for file output on Windows
Divisionless integer output via Terje Mathisen's trick, which applies to float mantissa too

If that's an interesting topic for you to hear about, please tell me and I might write another post on the internals!

Comments (26)

Write comment?

chromate00

5 months ago, # |

← Rev. 2 →

+20

One of the pictures on this page is loaded from my domain. I know your IP.

sir what the hell

besides some question probably relevant:

→ Reply

purplesyringa

5 months ago, # ^ |

does this work for interactive tasks?

Yeah! It automagically choses between algorithms for interactive and non-interactive tasks in runtime. You can use std::cout << std::flush or std::cout.flush() to flush stdout, or just use endl.

ut-k8g

+11

IP tracking does not work for interactive tasks.

s-lissov

It does work, but if you don't need that support and want to disable it you can rebuild the library from source. This will give you a smaller minified code and will be potentially a bit faster.

maxrgby

+58

why are you exposing us :(

arash.

+22

Getting Runtime error in 248325960 by adding it to 248326069.

Seems to be unstable with pragmas? Or in general?

Barys

Most likely, you just didn't copy the entire code. I tried to fix your code 248352372

I think Barys is right. I removed the line with sync_with_stdio and copied blazingio in and the tests passed: https://codeforces.com/contest/418/submission/248381558

-8

Someone told me the problem was that you copied the older version of code, which for some reason stayed in GitHub caches. I updated the link.

Additionally, I don't believe it can be unstable, because the library is covered with lots of tests and is additionally benchmarked on large data. It is tested on every target from the {x86_64, i386, aarch64} times {linux, macos, windows-mingw, windows-msvc} set, save for i386 Mac and aarch64 Windows, so every bug would be quickly noticed. See latest run.

Yep it works now. Thanks for the help, Amazing stuff!

nor

Ideally you should also add tests with sanitizers and other safety checks turned on — my fast IO was based on someone's fast IO and it had quite a bit of UB (memmove vs memcpy on overlapping ranges for example), and given how weird C++ object lifetimes (not the RAII thing but the lifetimes in the abstract machine model) are, and how much of SIMD is close to type punning, it is important to have such tests. In any case, formal proofs are more important than tests as primary checks for correctness, but tests are good for sanity checks and a library without tests is useless at best and dangerous in the worst case.

Yeah, you're right. I'll add sanitizers/valgrind soon-ish. There are formal proofs for most of the non-obvious stuff in the comments, by the way.

tvmpqx_8601

orz purplesyringa avx512 SSSE high performace optimization unroll-loops Ofast movsb intel manuals read online for free

I tried comparing it with the fast IO that I use, and found that the fast IO in the blog is a bit slower. Note that a lot of the low-level code is taken from Nyaan's template, but I had to do some fixes for UB and did other optimizations and made it a drop-in replacement for std::cin, std::cout (but there might be some minor loose ends that I didn't bother fixing because I don't use them anyway — like reading into char* etc).

The fast IO in the blog: https://judge.yosupo.jp/submission/193614
The fast IO I use: https://judge.yosupo.jp/submission/193613
The current fastest fast IO (but severely limited to the problem at hand, there might be other more generic solutions to the same problem): https://judge.yosupo.jp/submission/156493

I wonder if it is possible to combine the ideas in both (for example, there is no manual SIMD in my fast IO, which can probably be done if the compiler's autovectorization is not optimal) to make an even faster template. On the other hand, great work around the QoL side of things — I really liked the std::bitset and std::cerr handling.

Yeah, I know damn well what went wrong here. Fuck GCC and its stupid codegen. I'm working on a fix.

I released the fix, by the way, but it didn't affect the timing on the first test much.

I think the main reason the fast IO you use is faster is because it uses a 40 KiB table to resolve a number up to 10000 to four decimal digits in $$$\mathcal{O}(1)$$$. This is a fine tradeoff for this particular task, but generally speaking, in realistic problems cache locality will be an important problem, so trashing 40 KiB of cache in a general-purpose I/O library is a bad idea. In contrast, blazingio uses a 200-byte table to resolve just two decimal digits. This decreases performance somewhat, but looks like a more reasonable choice to me, especially given that other optimizations somewhat offset the loss in efficiency.

Have you considered taking 3 digits at a time instead? It will make the cache locality hit negligible (and even 40KiB is usually fine considering most of the input is taken in the beginning of the program for most problems).

The overhead is from writing the data. And I think

// Read input
for (int i = 0; i < q; i++) {
    // Handle query
    // Write output for query
}

is a rather common idiom. And you don't want to trash your L1 cache in such a loop.

I've never seen a slowdown due to cache misses in my IO template from my experience, though, even in problems with queries. Can you point out concrete problems where this is a real issue? And if it's a problem, reducing it to 3 digits at a time instead of 4 will keep it faster than just 2 digits at a time. Even if you're going to the L2 cache in the beginning of the loop due to IO at the end of the previous loop, unless a difference of 4 cycles matters (usually queries are at least 10s of cycles), practically the difference would be negligible. And the best way to fix that is to store your outputs in a buffer and do the conversion from ints to bytes in a vectorizable manner in the end.

As for the third link, I'm somewhat afraid of using AVX512 as it is unavailable or even disabled on quite a few CPUs still in use, including (IIRC) Codeforces. I believe making do with AVX2 would be hard, efficiency-wise. So while the avenue is certainly worth pursuing, it's unfortunately out of scope for me.