Following on from Bench Test I, a further test has been conducted on a huge 1 terabyte document store.
The test proved highly successful, scanning a total of 650,720 files and returning 63,663 matches. The software proved to be reliable, robust, and efficient: the run was smooth and consistent, with no errors and no memory issues.
The testing of DataHarvester has proved very successful and sets the standard for eSensible data management software in the near future.
The purpose of DataHarvester is to scan huge quantities of documents to extract textual patterns. This can be an intensive process, and it therefore requires an efficient algorithm that completes in the shortest time possible.
The test set used for bench testing was a set of 5,317 OCR’d Adobe PDF documents amounting to 10.9 gigabytes of disk space, with file sizes ranging from 10 KB to almost 300 MB.
The machine used was a Sony laptop with a 2.2 GHz quad-core i7 processor.
DataHarvester Bench Testing
Initial bench tests proved very efficient, extracting 1,362 matches from all documents in a time of 44 minutes and 42 seconds. This run used a single thread.
Concurrent processor threading has been incorporated to allow configuration of CPU resources. DataHarvester can therefore be configured to match the specification of the host machine, so that it uses as much of the available processing power as possible.
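The idea of a configurable worker count can be illustrated with a minimal Python sketch. This is not DataHarvester's implementation, which is not described here. The names `count_matches`, `scan_documents`, and `workers`, and the sample texts and pattern, are hypothetical and chosen only for illustration.

```python
import os
import re
from concurrent.futures import ThreadPoolExecutor


def count_matches(text, pattern):
    """Count non-overlapping occurrences of a compiled pattern in one text."""
    return sum(1 for _ in pattern.finditer(text))


def scan_documents(texts, pattern, workers=None):
    """Scan every text using a configurable number of worker threads.

    workers=1 mimics a single-thread run; workers=None falls back to the
    number of logical CPUs reported by the host machine.
    """
    if workers is None:
        workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with texts.
        return list(pool.map(lambda t: count_matches(t, pattern), texts))


# Hypothetical sample data standing in for the text of OCR'd documents.
SAMPLE_TEXTS = [
    "Invoice INV-1001 and INV-1002 were paid.",
    "No identifiers here.",
    "See INV-2000.",
]
SAMPLE_PATTERN = re.compile(r"INV-\d+")

if __name__ == "__main__":
    print(scan_documents(SAMPLE_TEXTS, SAMPLE_PATTERN, workers=4))  # [2, 0, 1]
```

In CPython, threads mainly help when the work is dominated by reading files. For CPU-bound pattern matching, `ProcessPoolExecutor` from the same module offers the same interface and could be swapped in, provided a module-level function is passed to `map()` instead of a lambda.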
A further test on the same set of documents completed in 19 minutes 56 seconds, a significant improvement on the previous test and less than half the time of the single-thread run.
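The two timings can be compared directly. The short Python sketch below works through the arithmetic, assuming both runs processed the same 5,317 documents under otherwise comparable conditions. The variable names are illustrative only.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes-and-seconds duration to whole seconds."""
    return minutes * 60 + seconds


DOCUMENTS = 5317                    # size of the bench-test document set

first_run = to_seconds(44, 42)      # single-thread run: 2682 s
second_run = to_seconds(19, 56)     # further run: 1196 s

speedup = first_run / second_run        # how many times faster
reduction = 1 - second_run / first_run  # fraction of time saved

if __name__ == "__main__":
    print(f"Speed-up: {speedup:.2f}x")                 # Speed-up: 2.24x
    print(f"Time saved: {reduction:.0%}")              # Time saved: 55%
    print(f"First run: {DOCUMENTS / first_run:.2f} documents per second")
    print(f"Second run: {DOCUMENTS / second_run:.2f} documents per second")
```

On these figures the second run is roughly 2.2 times faster than the first.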