# Reimplementing `cat` in Rust
The motivation behind this post is twofold: I am not very familiar with Rust's file I/O, and I need to test my custom web server with an actual post.
## GNU Cat
I first decided to figure out how GNU `cat` functions. Below is an extremely stripped-down version of it:
```c
// <various includes>
#define IO_BUFSIZE (256 * 1024)

void gnu_cat(char* path) {
  struct stat stats;
  int fd = open(path, O_RDONLY);
  fstat(fd, &stats);
  // Smallest size that is divisible by st_blksize and >= IO_BUFSIZE
  ptrdiff_t opt_insize = stats.st_blksize + (IO_BUFSIZE - 1)
                         - (IO_BUFSIZE - 1) % stats.st_blksize;
  posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
  char* input_buf = aligned_alloc(getpagesize(), opt_insize);
  while (true) {
    ptrdiff_t n_read = read(fd, input_buf, opt_insize);
    ptrdiff_t n_write = 0;
    while (n_write < n_read) {
      // Resume from the end of the last (possibly partial) write
      n_write += write(STDOUT_FILENO, input_buf + n_write, n_read - n_write);
    }
    if (n_read == 0) {
      break;
    }
  }
}
```
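The `opt_insize` rounding can be checked in isolation. This helper is my own, not GNU's; it computes the smallest multiple of the block size at or above the target, matching the arithmetic above:

```rust
// Round `target` up to the smallest multiple of `blksize` that is
// >= target, mirroring the opt_insize arithmetic in gnu_cat.
fn round_up(target: u64, blksize: u64) -> u64 {
  blksize + (target - 1) - (target - 1) % blksize
}

fn main() {
  // With a 4 KiB block size, the 256 KiB target is already aligned.
  assert_eq!(round_up(256 * 1024, 4096), 256 * 1024);
  // One byte over rolls up to the next block boundary.
  assert_eq!(round_up(256 * 1024 + 1, 4096), 256 * 1024 + 4096);
}
```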
The GNU authors stated that they selected 256KiB as the read size by testing on a variety of (now old) CPUs using the following script:
```sh
for i in $(seq 0 10); do
  bs=$((1024*2**$i))
  printf "%7s=" $bs
  timeout --foreground -sINT 2 \
    dd bs=$bs if=/dev/zero of=/dev/null 2>&1 \
    | sed -n 's/.* \([0-9.]* [GM]B\/s\)/\1/p'
done
```
According to the above, the optimal read size is 512KiB on my laptop.
## Rust Comparison
Below is the naive Rust translation:

```rust
// <various imports>
fn basic_cat(path: &str) -> anyhow::Result<()> {
  let mut file = File::open(path)?;
  let mut buf = String::new();
  file.read_to_string(&mut buf)?;
  println!("{}", buf);
  Ok(())
}
```
The above diverges from GNU `cat`:

- We allocate the entire file into memory before writing it.
- Since our buffer is a `String`, it incurs checks for UTF-8 validity.
- Rust's `io::stdout()` is line-buffered by default. Thus, our program can be either memory-intensive or CPU-intensive depending on the frequency of `\n`.
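That last point can be worked around without any file-descriptor tricks. A minimal sketch (my own, not from the post's programs): funnel the writes through a single `BufWriter` over a locked handle, so each `\n` no longer forces a flush of the line-buffered stdout:

```rust
use std::io::{self, BufWriter, Write};

// Batch many short writes in one BufWriter so newlines do not each
// trigger a flush of the line-buffered stdout underneath.
fn write_lines(out: &mut impl Write, n: usize) -> io::Result<()> {
  let mut buf = BufWriter::new(out);
  for i in 0..n {
    writeln!(buf, "line {}", i)?;
  }
  buf.flush()
}

fn main() -> io::Result<()> {
  // Locking once also avoids re-acquiring the stdout lock per write.
  let mut stdout = io::stdout().lock();
  write_lines(&mut stdout, 3)
}
```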
Alternatively, we can write:
```rust
// <various imports>
const IO_BUFSIZE: u64 = 256 * 1024;

fn buffered_cat(path: &str) -> anyhow::Result<()> {
  let mut file = File::open(path)?;
  let blksize = file.metadata()?.blksize();
  // Smallest size that is divisible by blksize and >= IO_BUFSIZE
  let opt_insize = blksize + (IO_BUFSIZE - 1) - (IO_BUFSIZE - 1) % blksize;
  let mut stdout = unsafe { File::from_raw_fd(1) };
  // vec![0; n] zero-initializes the buffer
  let mut input_buf: Vec<u8> = vec![0; opt_insize as usize];
  loop {
    let n_read = file.read(input_buf.as_mut_slice())?;
    if n_read == 0 {
      break;
    }
    let mut n_write = 0;
    while n_write < n_read {
      // Resume from the end of the last (possibly partial) write
      n_write += stdout.write(&input_buf[n_write..n_read])?;
    }
  }
  Ok(())
}
```
The above is much closer to GNU `cat`. The translation is actually quite nice, and even the singular `unsafe` block has fairly intuitive behaviour. But we are redoing the work of `std`. Instead of buffering manually, we can wrap the file descriptors with `io::BufReader` and `io::BufWriter`.
```rust
// <various imports>
const IO_BUFSIZE: u64 = 256 * 1024;

fn wrapped_cat(path: &str) -> anyhow::Result<()> {
  let file = File::open(path)?;
  let blksize = file.metadata()?.blksize();
  let opt_insize = blksize + (IO_BUFSIZE - 1) - (IO_BUFSIZE - 1) % blksize;
  let mut reader = BufReader::with_capacity(opt_insize as usize, file);
  let mut stdout = BufWriter::new(unsafe { File::from_raw_fd(1) });
  io::copy(&mut reader, &mut stdout)?;
  Ok(())
}
```
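The same `io::copy` plumbing works over any reader/writer pair, which makes it easy to sanity-check in memory. A small sketch of my own (the capacity here is arbitrary, not the post's `opt_insize`):

```rust
use std::io::{self, BufReader, BufWriter, Write};

// Run the same BufReader -> io::copy -> BufWriter pipeline as
// wrapped_cat, but over in-memory buffers instead of file descriptors.
fn copy_buffered(data: &[u8]) -> io::Result<Vec<u8>> {
  let mut reader = BufReader::with_capacity(8 * 1024, data);
  let mut writer = BufWriter::new(Vec::new());
  io::copy(&mut reader, &mut writer)?;
  writer.flush()?;
  Ok(writer.into_inner().expect("flushed BufWriter cannot fail"))
}

fn main() -> io::Result<()> {
  let out = copy_buffered(b"hello, cat")?;
  assert_eq!(out, b"hello, cat");
  Ok(())
}
```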
We are almost done! But we have one more external code base to compare against...
## Rust Coreutils
Obviously, I was not the first person to port `cat` to Rust. The project uutils/coreutils is a complete Rust port of all of GNU coreutils. Below is a simplified version of their code:
```rust
fn uutils_cat(path: &str) -> anyhow::Result<()> {
  use nix::{
    fcntl::{self, OFlag},
    sys::stat::Mode,
  };
  const SPLICE_SIZE: usize = 1024 * 128;

  /// An alias of fcntl::splice that sets both offsets to None.
  fn splice(
    fd1: impl AsFd,
    fd2: impl AsFd,
    amount: usize,
  ) -> anyhow::Result<usize> {
    Ok(fcntl::splice(
      &fd1,
      None,
      &fd2,
      None,
      amount,
      fcntl::SpliceFFlags::empty(),
    )?)
  }

  let fd = fcntl::open(path, OFlag::O_RDONLY, Mode::S_IWUSR)?;
  let (read, write) = nix::unistd::pipe()?;
  // Wrap stdout once; constructing it inside the loop would close fd 1
  // each time the temporary File was dropped.
  let stdout = unsafe { File::from_raw_fd(1) };
  loop {
    let n_read = splice(&fd, &write, SPLICE_SIZE)?;
    if n_read == 0 {
      return Ok(());
    }
    let mut n_write = 0;
    while n_write != n_read {
      n_write += splice(&read, &stdout, n_read - n_write)?;
    }
  }
}
```
The syscall `splice` avoids copying between userspace and kernel space (GNU `cat` employed a similar syscall, `copy_file_range`, but it did not function as expected). Note that there is a fallback if `splice` does not work.
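The shape of that fallback can be sketched without `nix` at all. This is my own simplification, not uutils' exact code: try the fast kernel-side copy once and, if the kernel rejects it, drop back to an ordinary userspace copy:

```rust
use std::io::{self, ErrorKind, Read, Write};

// Try a fast kernel-side copy (e.g. splice) and fall back to a plain
// userspace copy if the kernel rejects it. The `fast_copy` hook is
// hypothetical; uutils calls splice(2) there.
fn copy_with_fallback<R: Read, W: Write>(
  mut src: R,
  mut dst: W,
  fast_copy: impl Fn(&mut R, &mut W) -> io::Result<u64>,
) -> io::Result<u64> {
  match fast_copy(&mut src, &mut dst) {
    // splice(2) fails with EINVAL when an fd does not support it
    Err(e) if e.kind() == ErrorKind::InvalidInput => {
      io::copy(&mut src, &mut dst)
    }
    other => other,
  }
}

fn main() -> io::Result<()> {
  let src: &[u8] = b"fallback path";
  let mut dst = Vec::new();
  // Simulate a filesystem on which splice always fails.
  let n = copy_with_fallback(src, &mut dst, |_src, _dst| {
    Err(io::Error::new(ErrorKind::InvalidInput, "splice unsupported"))
  })?;
  assert_eq!(n, 13);
  assert_eq!(dst, b"fallback path");
  Ok(())
}
```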
## Benchmarking
I measured performance via `hyperfine`, testing on a large and a small file. I recorded these results on a Legion Slim 5 16AHP9, on an EXT4 filesystem with a 4 KiB block size, using the release profile for both programs. The benchmarking software is published on GitHub and can be run with a single command. Please try it out! I'm curious what results others get.
Notably, `gnu_cat` was significantly faster on smaller files, although its advantage disappeared with the larger file.
```
./gnu small.txt ran
  1.39 ± 0.19 times faster than ./buffered small.txt
  1.41 ± 0.18 times faster than ./basic small.txt
  1.43 ± 0.21 times faster than ./uutils small.txt
  1.44 ± 0.18 times faster than ./wrapped small.txt

./wrapped large.txt ran
  1.02 ± 0.08 times faster than ./uutils large.txt
  1.06 ± 0.09 times faster than ./gnu large.txt
  1.41 ± 0.17 times faster than ./buffered large.txt
  2.97 ± 0.21 times faster than ./basic large.txt
```
I was surprised by how much faster `gnu_cat` was for smaller files. Initially I blamed the lack of error checking in the C version, but after adding return-value checks, the measurements remained identical. Testing without `anyhow` did not change the results either, and changing compilers or removing `posix_fadvise` and `aligned_alloc` had no effect. I eventually realized that the C version was not zero-initializing its input buffer, which caused the majority of the performance disparity. Adding a `memset(input_buf, 0, opt_insize)` eliminated most of the difference.
The updated times are:
```
./gnu small.txt ran
  1.15 ± 0.11 times faster than ./basic small.txt
  1.16 ± 0.15 times faster than ./buffered small.txt
  1.17 ± 0.12 times faster than ./uutils small.txt
  1.18 ± 0.15 times faster than ./wrapped small.txt

./wrapped large.txt ran
  1.02 ± 0.11 times faster than ./uutils large.txt
  1.02 ± 0.09 times faster than ./gnu large.txt
  1.39 ± 0.19 times faster than ./buffered large.txt
  2.93 ± 0.24 times faster than ./basic large.txt
```
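For contrast, the Rust buffers were zeroed from the start; `vec![0; n]` guarantees a zero-initialized allocation, matching the behaviour the C version only gained after adding `memset`. A tiny check of my own:

```rust
// vec![0; n] allocates and zero-fills in one step, so the Rust
// versions never read from uninitialized memory.
fn zeroed_buf(n: usize) -> Vec<u8> {
  vec![0; n]
}

fn main() {
  let buf = zeroed_buf(4 * 1024);
  assert_eq!(buf.len(), 4 * 1024);
  assert!(buf.iter().all(|&b| b == 0));
}
```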
I was also fairly surprised that `wrapped_cat` and `gnu_cat` performed so much better than `buffered_cat`; I had thought they executed much the same code. Inspecting the programs under `strace` revealed that `wrapped_cat` actually employs a different syscall, `sendfile`, which seems to provide a large performance improvement.
I never did figure out the disparity between `buffered_cat` and `gnu_cat`, though. They seem to use the same number of reads and writes, so it appears to be some kind of runtime disparity in the Rust code. I wonder what the performance would be under `no_std`.
.
## Notes
I removed the syscall `copy_file_range()` from the code for `gnu_cat`. It is intended to copy between two file descriptors without a userspace copy, similar to `splice`, so I thought it would speed up the code substantially. However, it is finicky and does not seem to work for `stdout` on my machine.
In addition, I formatted the Rust code using two spaces instead of four; I figured the smaller code blocks would display better on mobile. If not, please let me know!