Why choose Emerald Shield's domain list
Our category list is updated every 15 minutes, every day.
About our list:
Emerald's categorized lists consist of domain names, not URLs. We chose to use domains rather
than URLs because we wanted to rate each domain's overall intent. For example, CNN’s
site contains sports, finance, and entertainment information, but the overall intent
of the site is to provide news.
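A minimal sketch of what a domain-level lookup could look like, assuming a simple in-memory table. The `CATEGORIES` table and the `categorize` function are hypothetical names used for illustration, not part of Emerald's product. Whatever page a URL points at, the answer comes from the domain it lives on.

```python
from urllib.parse import urlsplit

# Hypothetical category table keyed by domain, not by URL.
CATEGORIES = {
    "cnn.com": "news",
}

def categorize(url):
    """Return the category of the URL's domain, or None if it is unlisted.

    The URL must include a scheme (e.g. "http://") so the host can be parsed.
    """
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Try the full host name first, then each parent domain in turn,
    # stopping before the bare top-level label.
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in CATEGORIES:
            return CATEGORIES[candidate]
    return None
```

With this table, `categorize("http://www.cnn.com/sport/scores.html")` and `categorize("http://www.cnn.com/finance/")` both resolve to the single rating for `cnn.com`.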
We have been crawling sites for our spam filter and other internal products
since 2001. No other company has the history of crawling these types of sites
for the explicit purpose of categorizing them. We started using our Stop And
Dig technology for our spam filter before most anti-spam companies even looked at the
URLs in a message body.
Publicly available lists claim to contain millions of unique sites;
in fact, most do not. We merged the complete DMOZ database and found
that it contains only 1.2 million unique domains, not the 5+ million sites claimed.
Much of the difference can be explained by free hosting sites (thousands of “sites” may
sit on one domain) and blog hosts (which generally have many blog “sites” hosted
at one location). We also merged down one DMOZ category for a client and
found that 8% of the domains had expired and been purchased for domain parking schemes.
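The gap between listed "sites" and unique domains can be illustrated with a short sketch. The URLs below are made-up examples, and `count_sites_per_host` is a hypothetical helper. It groups by host name only, so it would not merge sites that a hosting provider places on separate subdomains; that would need a public-suffix-aware parser.

```python
from collections import Counter
from urllib.parse import urlsplit

def count_sites_per_host(urls):
    """Count how many listed URLs share each host name."""
    hosts = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if host:
            hosts[host.lower()] += 1
    return hosts

# Made-up listing: six "sites", but only three distinct hosts.
listing = [
    "http://freehost.example/~alice/",
    "http://freehost.example/~bob/",
    "http://freehost.example/~carol/",
    "http://blogs.example/dave",
    "http://blogs.example/erin",
    "http://standalone.example/",
]

per_host = count_sites_per_host(listing)
print(len(listing), "listed sites,", len(per_host), "unique hosts")
```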
Porn sites and other operations that trade in expired domains often target
domains listed in the DMOZ because they know those domains will get traffic.
The most recent trend is “parked domains”. The buyers purchase expired domains or common
misspellings of popular domains and place a parked page with advertising there.
They hope that the user will click on an ad or continue the search using
their page. They make money from referrals to search engines and from ad placements.
Recrawl every 45 days
At Emerald, we aim to expire and re-crawl every domain
in our database at least once every 45 days. Some domains get re-crawled sooner (if
they returned an error code when we last crawled them). We are also working on new
systems that will let us detect when a domain has changed owners or server locations.
This will allow us to detect changes in domains more rapidly and get them re-classified.
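The re-crawl policy described above might be sketched as follows. The 45-day interval comes from the text; the one-day retry interval and the reading of "error code" as an HTTP status of 400 or higher are assumptions made for illustration, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

RECRAWL_INTERVAL = timedelta(days=45)     # stated in the text
ERROR_RETRY_INTERVAL = timedelta(days=1)  # assumption: retry errors after a day

def next_crawl_due(last_crawled, last_status):
    """Return when a domain should next be crawled.

    last_status is assumed to be the HTTP status code of the previous crawl.
    """
    if last_status >= 400:
        # The last crawl failed, so come back sooner than the normal cycle.
        return last_crawled + ERROR_RETRY_INTERVAL
    return last_crawled + RECRAWL_INTERVAL

def is_due(last_crawled, last_status, now):
    """True if the domain should be expired and re-crawled at time `now`."""
    return now >= next_crawl_due(last_crawled, last_status)
```

A domain crawled successfully on 1 January would come due on 15 February, while one that returned an error would come due the next day.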
As of January 2008, Emerald has 1,960,620 domains in our whitelisted
categories and 1,456,540 domains in our blacklisted categories.
We currently spend a large part of our time crawling domain kiting sites: sites
that will live less than 4 days. They are "reserved" at a registrar, then
spam or some other mechanism is used to refer users to the domain. The domain is never
paid for, and the registrar drops it after 4 days. We have over 8 million
domains that are currently inactive but have been used this way in the past year
alone.
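As a rough illustration of the kiting pattern, a domain's lifetime can be checked against the 4-day window. The function name, the constant, and the idea of having registration and drop timestamps on hand are all assumptions for this sketch; it does not describe how Emerald detects kited domains.

```python
from datetime import datetime, timedelta

MAX_KITED_LIFETIME = timedelta(days=4)  # the 4-day window from the text

def looks_kited(registered_at, dropped_at):
    """True if a domain was dropped within the 4-day window.

    dropped_at is None while the domain is still registered.
    """
    if dropped_at is None:
        return False
    return dropped_at - registered_at <= MAX_KITED_LIFETIME
```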
Several partner companies already use our lists as the power behind their products.
They trust our list for their mission-critical systems. Over
2 million users today use part of our technology through these partners.
Trapping more malicious sites
In mid-2007 we added another layer to our scanning technology. We started
scanning for malicious software by downloading all the JavaScript, EXEs, ZIPs, etc.
that we find on a site. We then run those files through two different antivirus
engines. If we find that the site is deploying malicious software, it is immediately
added to the illegal activities category and pulled from any other categories.
This new technique has resulted in our finding over 15,000 websites that distributed
viruses, trojans, and other malicious software to users. Many of these sites
looked legitimate. Video codec downloads, movie viewers, etc. were all offered
to get users to install their bad software.
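The scan-and-recategorize step could be sketched like this. The engines here are stand-in callables, not real antivirus APIs, and treating a hit from either engine as enough to flag the site is an assumption, since the text does not say how the two results are combined. All names are hypothetical.

```python
ILLEGAL = "illegal activities"

def recategorize_if_malicious(domain, files, engines, categories):
    """Scan downloaded files and recategorize the domain on any detection.

    files:      iterable of bytes objects downloaded from the site
    engines:    list of callables taking bytes and returning True if malicious
                (stand-ins for real antivirus engines)
    categories: dict mapping domain -> set of category names
    """
    for data in files:
        # Assumption: a hit from either engine is enough to flag the site.
        if any(engine(data) for engine in engines):
            # Replace every existing category with the illegal activities one.
            categories[domain] = {ILLEGAL}
            return True
    return False

# Toy engines that match on made-up byte signatures.
def engine_a(data):
    return b"FAKE-SIGNATURE-A" in data

def engine_b(data):
    return b"FAKE-SIGNATURE-B" in data

categories = {"codec-site.example": {"entertainment", "downloads"}}
downloads = [b"harmless script", b"installer with FAKE-SIGNATURE-B inside"]
recategorize_if_malicious("codec-site.example", downloads,
                          [engine_a, engine_b], categories)
```

After the call, `categories["codec-site.example"]` holds only the illegal activities category, matching the rule that a flagged site is pulled from every other category.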