The purpose of DataHarvester is to scan huge quantities of documents to extract textual patterns. This can be an intensive process, and therefore needs an efficient algorithm so that it completes in the fastest time possible.
The test set used for bench testing was a set of 5,317 OCR’d Adobe PDF documents occupying 10.9 gigabytes of disk space, with file sizes ranging from 10 KB to almost 300 MB.
The machine used was a Sony laptop with a 2.2 GHz quad-core i7 processor.
Initial bench tests proved very efficient, extracting 1,362 matches from all documents in 44 minutes and 42 seconds using a single-threaded process.
Concurrent threading has since been incorporated, with the amount of CPU resource it uses made configurable. This allows DataHarvester to be tuned to the specification of the host machine, so that it uses as much of the available processing power as possible.
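The idea of a configurable worker count can be illustrated with a minimal Python sketch. This is not DataHarvester's actual implementation, which is not described here: the function names (`extract_matches`, `harvest`), the `workers` parameter and the e-mail pattern are all hypothetical, and the documents are assumed to be already available as plain text.

```python
import os
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pattern for illustration only; the patterns DataHarvester
# actually searches for are not stated in the text.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def extract_matches(text, pattern=EMAIL_PATTERN):
    """Return every non-overlapping match of the pattern in one document."""
    return pattern.findall(text)


def harvest(documents, workers=None, pattern=EMAIL_PATTERN):
    """Scan documents concurrently using a configurable number of threads."""
    if workers is None:
        # Assumed default: one worker per logical CPU on the host machine.
        workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with the documents.
        per_document = list(
            pool.map(lambda text: extract_matches(text, pattern), documents)
        )
    # Flatten the per-document lists into a single list of matches.
    return [match for matches in per_document for match in matches]


if __name__ == "__main__":
    sample = ["mail alice@example.com now", "nothing here"]
    print(harvest(sample, workers=2))
```

Setting `workers` to 1 reproduces single-threaded behaviour, while a higher value spreads the documents across more threads. In CPython specifically, CPU-bound matching may gain more from `ProcessPoolExecutor`, which has the same interface but needs a module-level function in place of the lambda.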
A further test, run on the same set of documents with concurrent threading enabled, completed in 19 minutes 56 seconds. This is roughly 2.2 times faster than the previous test, a significant improvement.
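The two timings can be compared with a little arithmetic. The short Python sketch below uses only the figures quoted above; the helper name `to_seconds` is introduced purely for illustration.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes-and-seconds timing to whole seconds."""
    return minutes * 60 + seconds


single_thread = to_seconds(44, 42)  # 2682 seconds
multi_thread = to_seconds(19, 56)   # 1196 seconds

speedup = single_thread / multi_thread
time_saved = 1 - multi_thread / single_thread

print(f"speed-up: {speedup:.2f}x, time saved: {time_saved:.0%}")
# prints "speed-up: 2.24x, time saved: 55%"
```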