Storage Technologies
Our storage solutions are fully customizable and are well suited to striping data
across multiple networks, not just across multiple disks.
Maintenance Bot:
The Maintenance Bot is a general-purpose tool
for maintaining our storage array of collected domains. It performs the following
tasks:
- Expiration of domains - Walk the storage array and expire domains
and their data (placing them on a list to be re-crawled) based upon a configurable
date threshold.
- Force expire list of domains - Expire the domains on a given list
from the storage array. This forces the Spider Engine to download them
again.
- Compute load of storage array - Compute the distribution level of the
storage array and the average load per machine.
- Collate data for reanalysis - Take all existing data from a specified
storage array location and re-test it with the Scan Bot, using the current
tool set.
- Integrate remote data into storage array - Integrate data from a remote
location into the existing storage array using the current cluster configuration.
- Index a storage node - Walk the storage node and update the link
index for each domain. This updates the inbound and outbound links for each domain
found, including image references.
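
As an illustration of the first and third tasks above, here is a minimal Python sketch. It is not the Maintenance Bot's actual code: the in-memory layout (a dict of nodes, each mapping a domain to a record with a "crawled" timestamp), the function names, and the choice to flag a domain as expired rather than delete its data are all assumptions made for the example. Load is measured here simply as the number of domains held per machine.

```python
from datetime import datetime, timedelta

def expire_domains(array, now, max_age):
    """Flag every domain whose last crawl is older than max_age as expired.

    Returns the sorted list of domains queued for re-crawling.
    """
    recrawl = []
    for node in array.values():              # walk every storage node
        for domain, record in node.items():
            if now - record["crawled"] > max_age:
                record["expired"] = True     # assumption: flag, do not delete
                recrawl.append(domain)
    return sorted(recrawl)

def compute_load(array, node_to_machine):
    """Return (domains stored per machine, average load per machine)."""
    per_machine = {}
    for node_id, node in array.items():
        machine = node_to_machine[node_id]   # hypothetical node-to-machine map
        per_machine[machine] = per_machine.get(machine, 0) + len(node)
    average = sum(per_machine.values()) / len(per_machine)
    return per_machine, average
```

The date threshold is passed in as max_age (for example, timedelta(days=14)), which keeps it configurable as the task description requires.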
Storage Array:
The Storage Array is how the content of crawled domains
is kept on disk. The spiders use this information to determine whether
a remote site has new content, and it is also used to run sections of the database through
new algorithms for testing. The entire system consists of off-the-shelf hardware
and custom software algorithms that together form a distributed, load-balanced storage system. The
array may be load balanced across n (where n is any prime number)
nodes (nodes can be thought of as virtual drives) and x
(where x can be any number greater than 0) machines. Each machine has very
modest system requirements (Celeron 1 GHz +, 256 MB of RAM, and IDE drives). Larger
machines may be used as larger caches handling the duties of more than one system.
We currently operate 63 nodes across 10 machines on one of our
test racks. The system can also be configured to provide redundancy and tolerate failures by
striping data across systems. Each machine may handle the storage assigned to it as well as
parts of the data from other nodes, so that in the event of a single machine failure no
data is lost across the array. The system is written so that
any type of application can use it for distributed storage of large volumes
of information.
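
The placement scheme described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the array's actual algorithm: hashing the domain name with SHA-1, assigning nodes to machines by modulo, and keeping one extra copy on the next machine are all hypothetical choices, as are the names and the node count of 61 (picked only because it is prime).

```python
import hashlib

NUM_NODES = 61      # hypothetical prime node count (n)
NUM_MACHINES = 10   # hypothetical machine count (x)

def node_for(domain, num_nodes=NUM_NODES):
    """Map a domain to a node using a stable hash of its name."""
    digest = hashlib.sha1(domain.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

def machines_for(node, num_machines=NUM_MACHINES):
    """Return the machines holding a node: a primary plus one replica."""
    primary = node % num_machines
    if num_machines == 1:
        return [primary]                     # no second machine to replicate to
    return [primary, (primary + 1) % num_machines]

def survives_single_failure(num_nodes, num_machines):
    """Check that every node stays reachable when any one machine fails."""
    for failed in range(num_machines):
        for node in range(num_nodes):
            if all(m == failed for m in machines_for(node, num_machines)):
                return False
    return True
```

Because each node lives on two different machines in this sketch, losing any single machine leaves at least one copy of every node available. A stable hash is used instead of Python's built-in hash() so that a domain maps to the same node on every run.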
Storage Reporter:
The Storage Reporter graphically shows the load of
each of the storage arrays relative to the others. The chart may be rotated in three
dimensions for easier viewing.