Storage Technologies

Our storage solutions are fully customizable and well suited to striping data across multiple networks, not just across multiple disks.

Maintenance Bot

The Maintenance Bot is used as a general tool for maintaining our storage array of collected domains. It performs the following tasks:

  • Expiration of domains - Walks the storage array and expires domains and their data (placing them on a list to be re-crawled) based on a configurable date threshold.
  • Force expire list of domains - Expires the domains on a given list from the storage array, forcing the Spider Engine to download them again.
  • Compute load of storage array - Computes the distribution level of the storage array and the average load per machine.
  • Collate data for reanalysis - Takes all existing data from a specified storage array location and re-tests it with the Scan Bot, using the current tool set.
  • Integrate remote data into storage array - Integrates data from a remote location into the existing storage array using the current cluster configuration.
  • Index a storage node - Walks the storage node and updates the link index for each domain, refreshing the inbound and outbound links of each domain found, including image references.
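The two expiration tasks can be illustrated with a minimal Python sketch. It is not the Maintenance Bot's actual code: the `last_crawled` mapping, the `expire_domains` function, and the sample domains are hypothetical names chosen for illustration, and the real storage array layout is not described here.

```python
from datetime import datetime, timedelta


def expire_domains(last_crawled, threshold, now, forced=()):
    """Return the sorted list of domains to queue for re-crawling.

    last_crawled: dict of domain -> datetime of last crawl (hypothetical layout)
    threshold:    timedelta; domains crawled longer ago than this are expired
    forced:       domains to expire regardless of age
    """
    cutoff = now - threshold
    # Expire by the configurable date threshold.
    stale = {d for d, seen in last_crawled.items() if seen < cutoff}
    # Force-expire listed domains, but only those actually held in the array.
    stale |= {d for d in forced if d in last_crawled}
    return sorted(stale)


now = datetime(2024, 1, 31)
last_crawled = {
    "example.com": datetime(2024, 1, 30),
    "example.org": datetime(2023, 12, 1),
    "example.net": datetime(2024, 1, 25),
}
recrawl = expire_domains(last_crawled, timedelta(days=30), now,
                         forced=["example.net"])
print(recrawl)  # ['example.net', 'example.org']
```

Here `example.org` is expired by age and `example.net` by the forced list, while `example.com` stays in place.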

Storage Array:

The Storage Array is how the content of crawled domains is kept on disk. The spiders use this information to determine whether a remote site has new content, and it is also used to run sections of the database through new algorithms for testing. The entire system consists of off-the-shelf hardware and custom software algorithms that form a distributed, load-balanced storage system. The array may be load balanced across n nodes (where n is any prime number; nodes can be thought of as virtual drives) and x machines (where x is any number greater than 0). Each machine has very modest system requirements (Celeron 1 GHz or faster, 256 MB of RAM, and IDE drives). Larger machines may be used as larger caches, handling the duties of more than one system.
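One plausible way to picture the node and machine layout is sketched below in Python. The hashing scheme is an assumption: the text does not say how domains are assigned to nodes, so CRC32 modulo n and a round-robin node-to-machine mapping stand in for it. The function names (`node_for`, `machine_for`, `load_report`) and the sample domains are hypothetical.

```python
import zlib
from collections import Counter


def node_for(domain, n_nodes):
    # CRC32 is a stand-in hash (assumption); unlike Python's built-in
    # hash(), it gives the same result on every run.
    return zlib.crc32(domain.encode("utf-8")) % n_nodes


def machine_for(node, n_machines):
    # Round-robin assignment of nodes to machines (assumption).
    return node % n_machines


def load_report(domains, n_nodes, n_machines):
    """Count domains per node and per machine, plus the average per machine."""
    per_node = Counter(node_for(d, n_nodes) for d in domains)
    per_machine = Counter()
    for node, count in per_node.items():
        per_machine[machine_for(node, n_machines)] += count
    average = len(domains) / n_machines
    return per_node, per_machine, average


# 7 nodes (a prime, as the text requires) spread over 3 machines.
domains = [f"site{i}.example" for i in range(1000)]
per_node, per_machine, average = load_report(domains, n_nodes=7, n_machines=3)
for machine in sorted(per_machine):
    print(machine, per_machine[machine], f"avg={average:.1f}")
```

The per-machine counts compared against the average give a rough picture of the kind of figure the "Compute load of storage array" task reports.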

We currently operate 63 nodes across 10 machines on one of our test racks. The system can also be configured for redundancy and failure tolerance through striping across systems. Each machine may handle the storage assigned to it as well as parts of the data from other nodes, so in the event of a single machine failure no data is lost across the array. The system is written so that any type of application can use it for distributed storage of large volumes of information.
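The single-machine-failure guarantee can be checked with a small Python sketch. It uses plain mirroring (each node kept on a primary machine and on the next machine) as a stand-in for the striping scheme, which the text does not detail; `placement` and `survives_failure` are hypothetical names.

```python
def placement(n_nodes, n_machines):
    """Map each node to a (primary, replica) pair of machines.

    Mirroring onto the next machine is an assumption made for illustration.
    """
    if n_machines < 2:
        raise ValueError("redundancy needs at least two machines")
    return {
        node: (node % n_machines, (node + 1) % n_machines)
        for node in range(n_nodes)
    }


def survives_failure(layout, failed_machine):
    # Every node must still have a copy on some machine that did not fail.
    return all(
        any(machine != failed_machine for machine in machines)
        for machines in layout.values()
    )


# The test-rack figures from the text: 63 nodes across 10 machines.
layout = placement(63, 10)
all_ok = all(survives_failure(layout, m) for m in range(10))
print(all_ok)  # True
```

Because the primary and the replica always sit on different machines, losing any one machine leaves at least one copy of every node.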

Storage Reporter:

The Storage Reporter shows graphically the load of each of the storage arrays relative to the others. The chart may be rotated in three dimensions for easier viewing.

Uncomplicated solutions for categorized URLs

Technology at work

We believe that technologies that are easy to deploy and manage are essential to our partners' success.