TL;DR: I had a bad encounter with Go pointers and I don’t think I’ll ever trust this language. The documentation is nice though.
I was solving Day 3, Part 1. The gist of the problem is that there is a grid of characters, some of which form numbers and some are symbols, and you have to find the numbers adjacent to symbols. I came up with an algorithm that scrolls down the input, so that I can read the input line by line and only keep three lines in memory at a time.
I implemented it, did some unit tests then tried it on the small sample input. Everything worked out. I then ran it on the real input and got the wrong answer. I went back and checked for edge cases. Everything looked to be accounted for. I started to print out line by line results. Everything looked good. I started to get frustrated. I looked through about 50% of the lines in the file and they were all being processed correctly. Did I misinterpret what a “symbol” was?
Time-traveling code
By accident I noticed something, out of the corner of my eye as it were. It looked like the occasional line was out of order. Through the usual creative and chaotic debugging process I could replicate the issue and verify it was not a figment of my imagination. The essence of the issue in my code was a loop like this:
```go
scanner := bufio.NewScanner(file)
var this_line, next_line []byte
for scanner.Scan() {
	this_line = next_line
	next_line = scanner.Bytes()
	fmt.Println(string(this_line))
}
```
For small inputs, it would read files just fine. I made a test file with an X moving along a diagonal, so that correctness is easy to see at a glance. For small inputs, things worked as expected.
But as soon as I made the lines longer, i.e. we have to read more bytes per line …
We can see clearly that line 7 is read in place of line 3, line 11 in place of line 7, and so on. Or rather, when we go to print the variable `this_line`, something quite mysterious has happened to its value: it travels forward in time and grabs a line from the future. If you’ve ever used C or C++, this will immediately remind you of a pointer issue.
Go did something funny with pointers (yes, Go has user-accessible pointers). If you look at the code above, it is not apparent that `scanner` is a pointer type, because methods and fields are accessed with the same dot notation either way.
Similarly, reading the documentation for scanner.Bytes(), the return value is a funky concept called a slice. We can read up about Go arrays and slices nicely here.
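The essential point about slices is that two of them can share one backing array. A minimal demonstration (standalone, not from my puzzle code):

```go
package main

import "fmt"

func main() {
	buf := []byte("hello")
	alias := buf // a second slice header, but the same backing array
	buf[0] = 'J'
	fmt.Println(string(alias)) // "Jello": the alias sees the mutation

	owned := make([]byte, len(buf))
	copy(owned, buf) // copy gives us our own backing array
	buf[0] = 'Y'
	fmt.Println(string(owned)) // still "Jello": the copy is unaffected
}
```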
From this we can make a very good guess at what is happening: in plain C/C++ terms, scanner.Bytes() is returning a pointer into an array we do not control. We simply got lucky with the smaller data set: the reader pulls in a chunk of the file in one go, so we were pointing at different parts of the same array, which had not yet been overwritten. The larger dataset borks us.
The solution turns out to be using copy to copy the data out immediately, giving us new arrays that we own.
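Applied to the loop above, the fix might look like this (`copyBytes` is a helper name I made up; an inline make + copy works the same way):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// copyBytes returns a fresh slice containing b's bytes, so the result
// survives whatever later happens to b's backing array.
func copyBytes(b []byte) []byte {
	return append([]byte(nil), b...) // idiomatic shorthand for make + copy
}

func main() {
	file := strings.NewReader("aaa\nbbb\nccc\n") // stand-in for the real file
	scanner := bufio.NewScanner(file)
	var this_line, next_line []byte
	for scanner.Scan() {
		this_line = next_line
		next_line = copyBytes(scanner.Bytes()) // own the bytes before the next Scan
		fmt.Println(string(this_line))
	}
	fmt.Println(string(next_line)) // don't forget the final buffered line
}
```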
Interesting. Do you think this is a fundamental problem in all languages that support a concept of object references? I know people step on it in Python, Java, C++, and even C (to a lesser extent, thanks to the rigid dereferencing syntax). You should be safe with Fortran and pure functional languages, I think.
In C/C++ I’m ready for manual memory management, and, at least in C, functions tend to follow a convention where you allocate the memory yourself and then pass the pointer into the function.
I would say it was an issue of ergonomics here: I wasn’t expecting to deal with pointer issues in Go.
You are too kind to C/C++, not all code follows this convention. In fact, many C++ libraries will allocate objects internally: `std::string a = b;` will create a copy and `std::string_view a = b;` will not. You cannot really tell one from another without looking up documentation.
Your point is taken that in C++, when using third-party libraries, you have to tread carefully regarding whether the data you pass in is mutated or not. Return values are a bit easier to decipher. My recollection (I haven’t written C++ for some time) is that compilers got better about warning on unintended side effects like mutation.
In C++ my recollection is that the default for builtins, like a vector of strings or a vector of structs, is to copy. This is safer, but it’s a performance hit. I was looking to speed up some code I wrote, and a profile showed an inner loop where a large vector was being copied on each iteration. This is better than losing track of pointers, IMO, because the code is correct from the start, and profiling will show you where you have to do tricky things with references or pointers if you need the speed.