Find X-largest values in a large file
https://github.com/fghpdf/X-largest.git
Example:
1426828011 9
1426828028 350
1426828037 25
1426828056 231
1426828058 109
1426828066 111
First, we will maintain a min-heap of size X.
Then, we can put elements into the heap.
If the size of the heap is bigger than X, we can remove the top element of the heap.
So the time complexity and space complexity depends on the size of the heap.
| Complexity | Value |
|---|---|
| Time | O(NlogX) |
| Space | O(X) |
We can split the huge file into many smaller files by hash.
Then we can read each file and put the elements into the heap.
Now we have the top x elements of each small file.
Finally, we can merge the top x elements and get the top x elements of the huge file.
By the way, I prepared a tool to generate a huge file.
It will generate a unique identifier by snowflake and a random number into the huge file.
The tool will pick the line whose number >= 99900 and write it into a new file.
This file named gen_top_x.txt can help us to test more conveniently.
So we can calculate the number of files by the number of cores.
Assume a core can take 16 times the memory of the original dataset.
And we can use 1GB of memory.
So our formula is:
tmpFileSize := MaxMemorySize / (coreCount * 16)
processNum := fileSize / tmpFileSize
e.g. we have 1 GB of the huge file, and we have 4 cores.
temp file size = 1 GB / (16 * 4) = 0.015625 GB = 16 MB
process num = 1 GB / 16 MB = 64
Second, you can run the following command to run:
make build
You will get two bin files in the dist folder.
./dist/main.bin -h to get help for this command../dist/main.bin -file=<file path> -x=<top xx> to get the top x elements of the file../dist/main.bin -x=<top x> to get the top x elements of the stdin.make test
The test will generate a huge file and then get the top x elements of the huge file.
It will take a long time to generate a huge file and split it, just relax โ๏ธ.
When everything is done, you can get the coverage of the program.

The test report is:
PASS
coverage: 66.8% of statements
total (statements) 87.8%
The test program can cover the main process of the program.
Integration test is main_test.go.
Unit tests are topx/*_test.go.
./dist/main.bin -file=tmp/gen_records.txt -x=394
| Step | Time |
|---|---|
| Split 165 Files | 4m1.5s |
| Top X | 21.75s |