DataHarvester Bench Testing


The purpose of DataHarvester is to scan large quantities of documents and extract textual patterns. This can be an intensive process, so it requires an efficient algorithm that completes the scan in the shortest time possible.

The test set used for bench testing was 5,317 OCR’d Adobe PDF documents occupying 10.9 gigabytes of disk space, with file sizes ranging from 10 KB to almost 300 MB.

The machine used was a Sony laptop with a 2.2 GHz quad-core Intel i7 processor.

Initial bench tests proved very efficient, extracting 1,362 matches from all documents in 44 minutes and 42 seconds. This run used a single-threaded process.

Concurrent threading has since been incorporated, with the number of threads configurable to suit the CPU resources available. This allows DataHarvester to be tuned to the specification of the host machine and so use as much of its processing power as possible.
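
To illustrate the idea of a configurable worker count, here is a minimal Python sketch. It is not DataHarvester's actual code: the names `harvest` and `find_matches`, the `workers` parameter, and the use of regular expressions over in-memory text are all assumptions made for the example.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def find_matches(text, pattern):
    """Return every match of the compiled pattern in one document's text."""
    return pattern.findall(text)


def harvest(documents, regex, workers=1):
    """Scan a list of document texts with a configurable number of workers.

    Hypothetical helper: workers=1 approximates a single-thread run,
    while higher values spread the documents across a thread pool.
    Matches are returned in the same order as the input documents.
    """
    pattern = re.compile(regex)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_document = list(
            pool.map(lambda text: find_matches(text, pattern), documents)
        )
    # Flatten the per-document lists into a single list of matches.
    return [match for matches in per_document for match in matches]


if __name__ == "__main__":
    docs = ["ref AB-1234 and CD-5678", "nothing here", "EF-0001"]
    print(harvest(docs, r"[A-Z]{2}-\d{4}", workers=4))
```

In CPython, threads mainly help when the work is dominated by reading files; for CPU-bound matching, swapping in `ProcessPoolExecutor` (with a module-level function instead of the lambda) would be the usual alternative.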

A further test with concurrent threading enabled processed the same set of documents in 19 minutes 56 seconds, roughly 2.2 times faster than the single-threaded test.
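
The size of the improvement can be checked from the two timings reported above. The short Python calculation below uses only those figures; the function names are illustrative.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes-and-seconds timing to total seconds."""
    return minutes * 60 + seconds


def compare(single_s, threaded_s):
    """Return (speedup factor, fraction of time saved)."""
    return single_s / threaded_s, 1 - threaded_s / single_s


single = to_seconds(44, 42)    # single-thread run: 2682 seconds
threaded = to_seconds(19, 56)  # concurrent run: 1196 seconds

speedup, saved = compare(single, threaded)
print(f"{speedup:.2f}x faster, {saved:.0%} less time")
# prints "2.24x faster, 55% less time"
```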

More information on DataHarvester.

Copyright secured by Digiprove © 2014
