Spider Technologies
Emerald has been creating network-related technologies for over xxx years.
Web Spider:
Emerald's web spider can be configured in a variety of ways, allowing the user to
dictate everything from whether to obey robots.txt to how many pages per domain to
crawl and what type of data to collect. The web spider can store this data in our
storage array or in a configured output folder.
Options explained:
- Obey Robots.txt - It is not always in the user's best interest to obey
robots.txt. Many sites containing questionable content will attempt to use this
file to keep spiders/bots from finding that content. The primary reason a site uses
this file is to keep search engine spiders from indexing data the site does not
want to appear in search results, or to keep them out of image folders. Our spiders
are intelligent enough not to drain bandwidth from the sites they visit, as they
will not attempt to download just anything.
- How many pages per domain to crawl - This number is adjustable
mainly because a large number of sites have a very large page volume. Sites such
as CNN.com may have several hundred pages available to web surfers at any given
time. It is important to choose a number that will not tax these sites but is high
enough to obtain the data required for analysis.
- Type of data to collect - The spider is capable of downloading
anything and everything from web sites. This, however, is likely not a good idea,
mainly for the sake of the site owner's bandwidth. By specifying the data to collect,
the user lets the spider be selective about what it downloads and stores, thus
increasing speed and saving storage space. For example, you may not wish to
download images, but perhaps you would like a record of the images that were on
the site. The spider can be configured to handle this.
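As a rough illustration of how the three options above might fit together, here is a minimal Python sketch. It is not Emerald's actual configuration format or API: the names `SpiderConfig`, `CrawlPolicy`, `may_fetch`, `action_for`, the `ExampleSpider` user agent, and the default values are all assumptions made for this example. Only `urllib.robotparser` and `urllib.parse` are real standard-library modules.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


@dataclass
class SpiderConfig:
    # Hypothetical field names mirroring the options described above.
    obey_robots: bool = True
    max_pages_per_domain: int = 100
    download_types: set = field(default_factory=lambda: {"html"})
    record_only_types: set = field(default_factory=lambda: {"image"})
    output_folder: str = "./spider-output"


class CrawlPolicy:
    """Decides whether a URL may be fetched and what to do with its content."""

    def __init__(self, config):
        self.config = config
        self.pages_seen = {}  # domain -> number of pages fetched so far
        self.robots = {}      # domain -> parsed robots.txt rules

    def load_robots(self, domain, robots_txt):
        # Parse robots.txt text that was already retrieved for this domain.
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[domain] = parser

    def may_fetch(self, url, user_agent="ExampleSpider"):
        domain = urlparse(url).netloc
        # Enforce the per-domain page limit first.
        if self.pages_seen.get(domain, 0) >= self.config.max_pages_per_domain:
            return False
        # Consult robots.txt only when the option is switched on.
        parser = self.robots.get(domain)
        if self.config.obey_robots and parser is not None:
            if not parser.can_fetch(user_agent, url):
                return False
        return True

    def record_fetch(self, url):
        domain = urlparse(url).netloc
        self.pages_seen[domain] = self.pages_seen.get(domain, 0) + 1

    def action_for(self, content_type):
        # "download" stores the data, "record" only notes that it existed.
        if content_type in self.config.download_types:
            return "download"
        if content_type in self.config.record_only_types:
            return "record"
        return "skip"
```

With the default settings in this sketch, `action_for("image")` returns `"record"`, which corresponds to keeping a record of a site's images without downloading them.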
Spider Engine:
The spider engine acts as a manager for web spiders. It can be configured to monitor
each web spider's processor usage, memory usage, time active, and much more. The
spider engine feeds the web spiders with sites, launching them according to configurable
schedules as well as prioritizing the sites visited using primary, secondary, and
tertiary lists.
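The scheduling and prioritization described above could be sketched along the following lines. This is an assumption-laden illustration, not the engine's real design: `SpiderEngine`, `add_site`, `launch_due`, `finish`, the `max_active` limit, and the numeric `not_before` timestamps are hypothetical, and resource monitoring (processor, memory, time active) is deliberately left out.

```python
import heapq
import itertools

# Lower number means higher priority; tier names follow the text above.
PRIORITY = {"primary": 0, "secondary": 1, "tertiary": 2}


class SpiderEngine:
    """Hands sites to spiders in tier order, respecting a simple schedule."""

    def __init__(self, max_active=2):
        self.max_active = max_active       # hypothetical cap on running spiders
        self.active = []                   # sites currently being crawled
        self._queue = []                   # heap of (tier, order, not_before, site)
        self._order = itertools.count()    # tie-breaker keeps insertion order

    def add_site(self, site, tier, not_before=0.0):
        # not_before is the earliest time at which this site may be crawled.
        entry = (PRIORITY[tier], next(self._order), not_before, site)
        heapq.heappush(self._queue, entry)

    def launch_due(self, now):
        """Start spiders for sites that are due, highest tier first."""
        launched = []
        deferred = []
        while self._queue and len(self.active) < self.max_active:
            entry = heapq.heappop(self._queue)
            not_before, site = entry[2], entry[3]
            if not_before <= now:
                self.active.append(site)
                launched.append(site)
            else:
                deferred.append(entry)  # not scheduled yet; put back later
        for entry in deferred:
            heapq.heappush(self._queue, entry)
        return launched

    def finish(self, site):
        # Called when a spider completes, freeing a slot for the next site.
        self.active.remove(site)
```

A heap keyed on the tier keeps primary sites ahead of secondary and tertiary ones, while sites whose scheduled time has not yet arrived are set aside and re-queued so they do not block lower tiers.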