Spider Technologies

Emerald has been creating network-related technologies for over xxx years.

Web Spider:

Emerald's web spider can be configured in a variety of ways. The user can dictate everything from whether to obey robots.txt to how many pages per domain to crawl and what type of data to collect. The web spider can store this data in our storage array or in a configured output folder.

Options explained:

  • Obey Robots.txt - It is not always in your best interest to obey robots.txt. Many sites containing questionable content use this file to keep spiders and bots from finding that content. The primary reason a site uses this file is to keep search engine spiders from indexing data the owner does not want to appear in search results, or to keep them out of image folders. Our spiders are intelligent enough not to drain bandwidth from the sites they visit, as they will not attempt to download just anything.
  • How many pages per domain to crawl - This number is adjustable mainly because a large number of sites have a very high page volume. Sites such as CNN.com may have several hundred pages available to web surfers at any given time. It is important to choose a number that will not tax these sites but is high enough to obtain the data required for analysis.
  • Type of data to collect - The spider is capable of downloading anything and everything from web sites. This, however, is likely not a good idea, mainly for the sake of the site owner's bandwidth. By specifying the data to collect, you let the spider be selective about what it downloads and stores, which increases speed and saves storage space. For example, you may not wish to download images, but you might like a record of the images that were on the site. The spider can be configured to handle this.
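
As a rough illustration of how the three options above might interact, here is a minimal Python sketch. It is not Emerald's actual configuration format or API: the names SpiderConfig, decide, download_types, and record_only_types are hypothetical, and the default values are arbitrary. The robots.txt check uses Python's standard urllib.robotparser module.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


@dataclass
class SpiderConfig:
    """Hypothetical settings object mirroring the three options above."""
    obey_robots: bool = True
    max_pages_per_domain: int = 200
    # Content categories to download in full (assumed labels).
    download_types: set = field(default_factory=lambda: {"html"})
    # Content categories to note by URL only, without downloading.
    record_only_types: set = field(default_factory=lambda: {"image"})


def decide(config, url, content_type, pages_seen, robots=None):
    """Return 'download', 'record', or 'skip' for a single URL.

    pages_seen maps a domain to the number of pages already crawled there.
    robots is an optional, already-loaded RobotFileParser for that domain.
    """
    domain = urlparse(url).netloc
    # Respect the per-domain page budget so large sites are not taxed.
    if pages_seen.get(domain, 0) >= config.max_pages_per_domain:
        return "skip"
    # Consult robots.txt only when the option is switched on.
    if config.obey_robots and robots is not None:
        if not robots.can_fetch("*", url):
            return "skip"
    if content_type in config.download_types:
        return "download"
    if content_type in config.record_only_types:
        return "record"
    return "skip"


# Example rules, parsed from memory instead of fetched over the network.
example_robots = RobotFileParser()
example_robots.parse(["User-agent: *", "Disallow: /private/"])
```

With the defaults, decide(SpiderConfig(), "http://example.com/logo.png", "image", {}, example_robots) yields "record": the image is noted but not downloaded. Setting obey_robots=False makes the same call ignore the Disallow rule for pages under /private/.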

Spider Engine:

The spider engine acts as a manager for web spiders. It can be configured to monitor each web spider's processor usage, memory usage, time active, and much more. The spider engine feeds the web spiders with sites, launches them according to configurable schedules, and prioritizes the sites visited using primary, secondary, and tertiary lists.
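
The prioritization described above could be sketched along these lines in Python. This is only an illustration under stated assumptions: the class name SiteFeeder and its methods are hypothetical, and the strict ordering (drain the primary list before the secondary, and the secondary before the tertiary) is an assumption, since the exact policy is not described here. Scheduling and resource monitoring are left out.

```python
from collections import deque

# Priority lists, highest priority first.
PRIORITIES = ("primary", "secondary", "tertiary")


class SiteFeeder:
    """Hypothetical feeder that hands sites to spiders in priority order."""

    def __init__(self):
        # One first-in, first-out queue per priority list.
        self.queues = {name: deque() for name in PRIORITIES}

    def add(self, site, priority="tertiary"):
        """Queue a site on one of the three lists."""
        if priority not in self.queues:
            raise ValueError(f"unknown priority: {priority!r}")
        self.queues[priority].append(site)

    def next_site(self):
        """Return the next site to crawl, or None when all lists are empty."""
        for name in PRIORITIES:
            if self.queues[name]:
                return self.queues[name].popleft()
        return None

    def next_batch(self, free_spiders):
        """Return up to free_spiders sites, one per idle spider."""
        batch = []
        while len(batch) < free_spiders:
            site = self.next_site()
            if site is None:
                break
            batch.append(site)
        return batch
```

A strict ordering like this is simple, but it can starve the lower lists if the primary list is refilled constantly. A weighted or round-robin policy would be a reasonable alternative.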

Uncomplicated solutions for categorized URLs

Technology at work

We believe that making technologies that are easy to deploy and manage is essential to our partners' success.