Analyzing the Web

The background story

I wanted to investigate the current state of the web by analysing its most successful companies. Companies in this analysis are represented by a unique domain plus extension (domain.com). I'll refer to these as 'websites'.

How do these websites compare in terms of innovation and location? Are the most successful websites from America? Do Chinese websites make up a majority of the internet? Is the web made of porn? Are websites implementing easy-access RSS feeds?

This page describes a quantitative analysis of web data gathered from various sources. It also lists the rough stats, which you can use as long as the source is appropriately credited (www.yvoschaap.com).

To compile a list of the most significant websites I used the Alexa top 10,000 websites ranking. The top 10,000 websites represent ..% of total internet traffic (source). The top 500 websites represent 45% of internet traffic. Over 40% of Alexa data is gathered from users outside the United States (cite).
Reach is defined by XXX. I calculate the percentage of reach by treating the summed reach of all websites analyzed as 100%.
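The percentage calculation can be sketched as below. This is a minimal illustration, not the script used for this page: the function name `reach_share` is made up, and the two input numbers are taken from the stats listed further down (total reach, and the summed reach of US-owned domains).

```python
def reach_share(group_reach: int, total_reach: int) -> float:
    """Return a group's reach as a percentage of the summed reach of all analyzed websites."""
    if total_reach <= 0:
        raise ValueError("total_reach must be positive")
    return 100.0 * group_reach / total_reach


# Numbers copied from the stats on this page.
TOTAL_REACH = 6_683_133  # "total reach data"
US_REACH = 2_550_105     # "biggest reach: Us"

us_share = reach_share(US_REACH, TOTAL_REACH)
print(f"US share of reach: {us_share:.0f}%")
```

Because the total is the sum over the analyzed websites only, these percentages describe shares within the top 10,000, not shares of the whole internet.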

With the list of the top 10,000 websites in hand, I fetched each website's homepage and looked for three things: an RSS feed, a stylesheet, and an advertising network.
In addition, I gathered data about domain reach, views, adult content, language, inlinks, and location (ownership).
The downside is that only top-level domains with their extension (domain.com) are counted as unique websites. To add an extra level of analysis, Google, Yahoo, and eBay get special treatment: the reach and views of all their domain extensions are added together. Microsoft gets one listing that combines live.com, msn.*, microsoft.com and hotmail.com.
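A rough sketch of how such grouping could work is shown below. The matching rules (brand name as the first label of the domain, plus a fixed set of Microsoft domains) are my assumption for illustration, the names `brand_of` and `aggregate` are hypothetical, and the sample rows use made-up numbers rather than real Alexa data.

```python
from collections import defaultdict

# Assumed grouping rules, for illustration only.
BRAND_PREFIXES = ("google", "yahoo", "ebay")
MICROSOFT_DOMAINS = {"live.com", "microsoft.com", "hotmail.com"}


def brand_of(domain: str) -> str:
    """Map a domain to the brand it is counted under, or to itself."""
    domain = domain.lower()
    first_label = domain.split(".")[0]
    if domain in MICROSOFT_DOMAINS or first_label == "msn":
        return "microsoft"
    if first_label in BRAND_PREFIXES:
        return first_label
    return domain


def aggregate(rows):
    """Sum reach and views per brand; rows are (domain, reach, views) tuples."""
    totals = defaultdict(lambda: {"reach": 0, "views": 0, "websites": 0})
    for domain, reach, views in rows:
        entry = totals[brand_of(domain)]
        entry["reach"] += reach
        entry["views"] += views
        entry["websites"] += 1
    return dict(totals)


# Made-up sample rows, not real measurements.
sample_rows = [
    ("google.com", 100, 10),
    ("google.de", 20, 2),
    ("msn.com", 30, 3),
    ("hotmail.com", 15, 1),
    ("example.org", 5, 1),
]
grouped = aggregate(sample_rows)
print(grouped["google"])
print(grouped["microsoft"])
```

Matching on the first label is deliberately crude: it would also catch an unrelated domain that merely starts with a brand name, so a real run would need a hand-checked list.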

Alexa's data over-represents English-speaking users. Also, 30% of adult site owners don't identify themselves. And I've heard stories about websites blocking access from Alexa toolbar users.

The textual analysis of the top 10,000 websites, which represent XX% of total internet traffic, is here.

 
Example stats:
Most adult content distributing countries
Most websites owned by country
Biggest reach
Most linked websites
Total reach of top 100, top 1,000 and top 10,000
Google, Yahoo, Microsoft, eBay total domain reach (all extensions)
% AdSense publishers
Other advertising networks
 
 
Domain
# total reach data: 6,683,133
# total views data: 631,286
# total domains data: 9,999 (almost 10,000)
####################################
# identified adult: 923  (5% total reach)
####################################
# identified advertising network: 1,038
####################################
# with country data: 6,141
####################################
# with language data: 9,095 
####################################
# adult sites: Us --> 30.4% (281)
# adult sites: (no country data) --> 27.8% (257)
# adult sites: Canada --> 6.6% (61)
# adult sites: Netherlands --> 4.6% (42)
# adult sites: Spain --> 3.1% (29)
# adult sites: United Kingdom --> 2.5% (23)
# adult sites: France --> 1.8% (17)
# adult sites: Japan --> 1.5% (14)
# adult sites: Brazil --> 1.2% (11)
# adult sites: British Virgin Islands --> 1% (9)
# adult sites: Russia --> 1% (9)
# adult sites: Czech Republic --> 1% (9)
# adult sites: Panama --> 1% (9)
# adult sites: Cyprus --> 0.9% (8)
# adult sites: Dominica --> 0.9% (8)
####################################
# most domains: Us --> 43.5% (2673)
# most domains: China --> 9% (554)
# most domains: Canada --> 4.7% (288)
# most domains: Germany --> 3.8% (232)
# most domains: United Kingdom --> 3.5% (215)
# most domains: France --> 2.7% (168)
# most domains: Spain --> 2.3% (144)
# most domains: Japan --> 2.2% (136)
# most domains: Netherlands --> 1.8% (111)
# most domains: Hong Kong --> 1.7% (103)
# most domains: Brazil --> 1.4% (85)
# most domains: Saudi Arabia --> 1.3% (77)
# most domains: Czech Republic --> 1.2% (75)
# most domains: Australia --> 1.2% (73)
# most domains: Taiwan --> 1.1% (65)
####################################
# pop language: English --> 54.8% (4988)
# pop language: Chinese --> 13.9% (1264)
# pop language: Spanish --> 5% (452)
# pop language: Japanese --> 3.6% (330)
# pop language: Arabic --> 3.3% (300)
# pop language: Taiwanese --> 3% (273)
# pop language: German --> 2.4% (216)
# pop language: Russian --> 2.3% (212)
# pop language: French --> 1.6% (146)
# pop language: Portuguese --> 1.4% (129)
# pop language: Turkish --> 1.2% (110)
# pop language: Polish --> 0.9% (84)
# pop language: Korean --> 0.7% (68)
# pop language: Cantonese (Hong Kong) --> 0.7% (65)
# pop language: Czech (cs-CZ) --> 0.6% (59)
####################################
# biggest reach: Us --> 38% 2550105 (domains: 2673)
# biggest reach: China --> 9% 599752 (domains: 554)
# biggest reach: Canada --> 2% 123392 (domains: 288)
# biggest reach: Germany --> 2% 115299 (domains: 232)
# biggest reach: United Kingdom --> 2% 103722 (domains: 215)
# biggest reach: France --> 1% 82012 (domains: 168)
# biggest reach: Hong Kong --> 1% 73018 (domains: 103)
# biggest reach: Brazil --> 1% 71420 (domains: 85)
# biggest reach: Japan --> 1% 64453 (domains: 136)
# biggest reach: Spain --> 1% 51775 (domains: 144)
# biggest reach: Beijing, Prc 100020 --> 1% 38750 (domains: 1)
# biggest reach: Netherlands --> 1% 35151 (domains: 111)
# biggest reach: Czech Republic --> 1% 34286 (domains: 75)
# biggest reach: Australia --> 0% 32629 (domains: 73)
# biggest reach: Taiwan --> 0% 30377 (domains: 65)
####################################
# biggest reach US/state: CA --> 37% 931958 (websites: 699)
# biggest reach US/state: WA --> 22% 560566 (websites: 137)
# biggest reach US/state: NY --> 9% 219325 (websites: 322)
# biggest reach US/state: FL --> 5% 130848 (websites: 174)
# biggest reach US/state: (no state data) --> 3% 70310 (websites: 82)
# biggest reach US/state: MA --> 2% 60825 (websites: 110)
# biggest reach US/state: TX --> 2% 57423 (websites: 96)
# biggest reach US/state: GA --> 2% 44164 (websites: 51)
# biggest reach US/state: VA --> 2% 41124 (websites: 58)
# biggest reach US/state: IL --> 2% 41067 (websites: 101)
# biggest reach US/state: UT --> 1% 30466 (websites: 43)
# biggest reach US/state: NC --> 1% 29446 (websites: 36)
# biggest reach US/state: NV --> 1% 27765 (websites: 52)
# biggest reach US/state: NJ --> 1% 27098 (websites: 81)
# biggest reach US/state: PA --> 1% 25574 (websites: 43)
####################################
# most linked: geocities.com --> 2015878
# most linked: google.com --> 361472
# most linked: miibeian.gov.cn --> 258647
# most linked: adobe.com --> 240727
# most linked: microsoft.com --> 232108
# most linked: amazon.com --> 192276
# most linked: macromedia.com --> 160126
# most linked: wikipedia.org --> 118273
# most linked: apple.com --> 111403
# most linked: statcounter.com --> 99139
# most linked: blogger.com --> 95404
# most linked: phpbb.com --> 93470
# most linked: yahoo.com --> 91035
# most linked: angelfire.com --> 90120
# most linked: alibaba.com --> 87010
# most linked: cnn.com --> 80177
# most linked: taobao.com --> 79103
# most linked: nytimes.com --> 76138
# most linked: mapquest.com --> 72668
# most linked: baidu.com --> 70275
# most linked: bbc.co.uk --> 69876
# most linked: wordpress.org --> 67746
# most linked: myspace.com --> 65789
# most linked: flickr.com --> 62809
# most linked: imdb.com --> 61826
####################################
# total reach of rank below 100: 37% (2439485)
####################################
# total reach of rank between 100 and 1,000: 26% (1731080)
####################################
# total reach of rank between 1,000 and 10,000: 37% (2504213)
####################################
# total views google (all extension): 9% (56786) (websites: 72)
# total views yahoo (all extension): 12% (78681) (websites: 4)
# total views msn/live/hotmail/microsoft: 5% (29419) (websites: 27)
####################################
# total reach google (all extension): 8% (562990) (websites: 72)
# total reach yahoo (all extension): 5% (338695) (websites: 5)
# total reach msn/live/hotmail/microsoft: 4% (280393) (websites: 27)
# total reach ebay (all extensions): 1% (61331) (websites: 22)
####################################
# most used banner network: 48.1% (499) --> google (reach: 183195)
# most used banner network: 34.4% (357) --> doubleclick (reach: 426936)
# most used banner network: 3.4% (35) --> google doubleclick (reach: 29182)
# most used banner network: 2.7% (28) --> google tribalfusion (reach: 10254)
# most used banner network: 2.3% (24) --> adbrite (reach: 5426)
# most used banner network: 2.1% (22) --> tribalfusion (reach: 14756)
# most used banner network: 1.4% (15) --> yahoo (reach: 6804)
# most used banner network: 1.3% (14) --> valueclick (reach: 13036)
# most used banner network: 1% (10) --> rightmedia (reach: 25675)
# most used banner network: 0.7% (7) --> google adbrite (reach: 2029)
# most used banner network: 0.4% (4) --> revsc (reach: 4659)
# most used banner network: 0.3% (3) --> google rightmedia (reach: 1021)
# most used banner network: 0.3% (3) --> yahoo doubleclick (reach: 1643)
# most used banner network: 0.2% (2) --> doubleclick tribalfusion (reach: 554)
# most used banner network: 0.2% (2) --> rightmedia doubleclick (reach: 499)
####################################
# total reach google network (excl. google.* ): 55.6% (websites: 577) (reach: 226879)
# total reach doubleclick network: 38.8% (websites: 403) (reach: 461146)
####################################
# stylesheet usage 58%  (5757)
####################################
# rss usage 10% (1023)

I don't represent any of the listed websites, nor any of the advertising networks.

Stylesheets found by: "<style" || "rel=stylesheet"
RSS found by: "application/rss+xml" || "rss.xml"
Advertising networks found by characteristics of their ad code. Some websites use specialised advertising network code and are therefore excluded.
Homepage is the "/" page, or the first page it redirects to (some homepages redirect to another page).
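The stylesheet and RSS checks above amount to simple substring tests on the homepage HTML. Here is a minimal sketch of that idea, not the original script: the function names are hypothetical, and lowercasing the page and stripping quote characters before matching (so that rel="stylesheet" also matches "rel=stylesheet") is my assumption. It works on an HTML string you have already downloaded.

```python
def normalise(html: str) -> str:
    """Lowercase the page and drop quote characters (an assumed pre-processing step)."""
    return html.lower().replace('"', "").replace("'", "")


def has_stylesheet(html: str) -> bool:
    """True if the page has an inline <style> block or links a stylesheet."""
    page = normalise(html)
    return "<style" in page or "rel=stylesheet" in page


def has_rss(html: str) -> bool:
    """True if the page advertises an RSS feed or mentions an rss.xml file."""
    page = normalise(html)
    return "application/rss+xml" in page or "rss.xml" in page


def detect_features(html: str) -> dict:
    """Run both checks on one homepage."""
    return {"stylesheet": has_stylesheet(html), "rss": has_rss(html)}


# Made-up homepage snippets for illustration.
with_both = (
    '<html><head><link rel="stylesheet" href="/main.css">'
    '<link rel="alternate" type="application/rss+xml" href="/feed">'
    "</head></html>"
)
plain = "<html><body>plain page</body></html>"

print(detect_features(with_both))
print(detect_features(plain))
```

Substring tests like these are fast but blunt: a page that loads its stylesheet or feed link through JavaScript would be counted as having neither.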