because, even though the space taken by the filesystem is the fault of the filesystem, one needs to consider the minimum information requirements of stating starts and ends of files, specially when stuff is split into multiple files.
I would have actually considered the file size information as part of the file size instead (for both the input and the output) because, for a binary file, which can include a string of bits which might match an EOF, causing a falsely ended file, would be a problem. And as such, the contestant didn’t go checking for character == EOF, but used the function that truly tells whether the end of file is reached, which would, then be using the file system’s file size information.
Since the input file was a 3145728 bytes and the output files would have been smaller than that, I would go with 22 bits to store the file size information. This would be in favour of the contestant as:
That would be the minimum (hyh) number of bits required to store the file size, making it as easy as possible for the contestant to make more files
You could actually go with 2 bits, if you predefine MiB to be the unit, but that would make it harder for the contestant, because they will be unable to present file sizes less than 1 MiB, and would have to increase the file size information bits
On the other hand, had the contestant decided to break the file between bits (instead at byte ends), instead of bytes (which, from the code, I think they didn’t) the file size information would require an additional 3 bits.
Now, using this logic, if I check the result:
From the result claimed by the contestant, there were 44 extra bytes (352 bits) remaining.
+ 22 bits for the input file size information
- 22*219 bits for the output file size information because 219 files
so the contestant succeeds by 352 + 22 − (22 × 219) = −4444 bits.
In other words, fails by 4444 bits.
Now of course, the output file size information might be representable in a smaller number of bits, but to calculate that, I would require downloading the file (which I am not in the mood for.
And in that case, you would require additional information to tell the file size bits. So;
5 bits for the number 22 in the input
5 bits for the size of the file size information (I am feeling this won’t give significant gains) and rest of the bits as stated in the first 5 bits, as the file size bits
you waste bits for every file size requiring more than 16 bits to store the file size information
it is possible to get a net gain with this, as qalc says, log(3145728 / 219, 2) = (ln(1048576) − ln(73)) / ln(2) ≈ 13.81017544
But even then, you have 352 + 5 + 22 − (5 + (14 × 219)) = −2692 for the best case scenario in which all output file sizes manage to be under 14 bits of file size informations.
More realistically, it would be something around 352 + 5 + 22 − ((5 + 14) × 219) = −3782 because you will the the 5 bits for every file, separately, with the 14 in this case, be a changing value for every file, giving a possibly smaller number.
If instead going with the naive 8 bit EOF that the offerer desired, well, going with 2 consecutive characters instead of a single one, seems doable. As long as you are able to find enough of said 2 characters.
After going on a little google search, I seem to think that in a 3MiB file, there would be either 47 or 383 (depending upon which of my formulae was correct) possible occurrences of the same 2 character combination. Well, you’d need to find the correct combination.
But of course, that’s not exactly compression for a binary file, as I said before, as an EOF is not good enough.
Nice stuff.
I got sold on the :
because, even though the space taken by the filesystem is the fault of the filesystem, one needs to consider the minimum information requirements of stating starts and ends of files, specially when stuff is split into multiple files.
I would have actually considered the file size information as part of the file size instead (for both the input and the output) because, for a binary file, which can include a string of bits which might match an
EOF, causing a falsely ended file, would be a problem. And as such, the contestant didn’t go checking forcharacter == EOF, but used the function that truly tells whether the end of file is reached, which would, then be using the file system’s file size information.Since the input file was a 3145728 bytes and the output files would have been smaller than that, I would go with 22 bits to store the file size information. This would be in favour of the contestant as:
On the other hand, had the contestant decided to break the file between bits (instead at byte ends), instead of bytes (which, from the code, I think they didn’t) the file size information would require an additional 3 bits.
Now, using this logic, if I check the result:
From the result claimed by the contestant, there were 44 extra bytes (352 bits) remaining.
+ 22 bits for the input file size information - 22*219 bits for the output file size information because 219 files
so the contestant succeeds by
352 + 22 − (22 × 219) = −4444bits. In other words, fails by 4444 bits.Now of course, the output file size information might be representable in a smaller number of bits, but to calculate that, I would require downloading the file (which I am not in the mood for.
And in that case, you would require additional information to tell the file size bits. So;
22in the inputqalcsays,log(3145728 / 219, 2) = (ln(1048576) − ln(73)) / ln(2) ≈ 13.81017544But even then, you have
352 + 5 + 22 − (5 + (14 × 219)) = −2692for the best case scenario in which all output file sizes manage to be under 14 bits of file size informations. More realistically, it would be something around352 + 5 + 22 − ((5 + 14) × 219) = −3782because you will the the 5 bits for every file, separately, with the14in this case, be a changing value for every file, giving a possibly smaller number.If instead going with the naive 8 bit
EOFthat the offerer desired, well, going with 2 consecutive characters instead of a single one, seems doable. As long as you are able to find enough of said 2 characters.After going on a little google search, I seem to think that in a 3MiB file, there would be either 47 or 383 (depending upon which of my formulae was correct) possible occurrences of the same 2 character combination. Well, you’d need to find the correct combination.
But of course, that’s not exactly compression for a binary file, as I said before, as an
EOFis not good enough.