Thanks to everybody for the positive feedback on GetContentSize, which I presented Friday.
I logged the tool's results to a database, so I'm able to present a few interesting observations/statistics. (I've been away from my computer for a few days, so I apologize for not having this earlier.)
Total Web pages examined: 4,296
(Pages ranged from news sites to blogs to, yes, porn sites.)
Average percent text content: 21.57%
Highest percent text content: This page (86.57%)
Lowest percent text content: This page (0.02%)
Average percent text content for URLs ending in ".com" or ".com/": 17.71%
(I figured this might be a decent way to narrow down the results to commercial home pages.)
Average page size: 27,910 bytes
If there's another statistic you'd like to see, post a comment here, and I'll query the database to get it, as long as the statistic is obtainable by MySQL. The logged fields are: URL, page size (in bytes), percent content and the date/time.
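For anyone wondering what those queries might look like, here is a minimal sketch. It is not the actual setup: the table name (gcs_log) and column names are hypothetical, the sample rows are made up, and Python's built-in sqlite3 module stands in for MySQL so the example runs on its own. The aggregate functions and LIKE patterns used here are ordinary SQL that MySQL also supports.

```python
import sqlite3

# Hypothetical schema: the post lists the logged fields (URL, page size,
# percent content, date/time) but not the real table or column names.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gcs_log ("
    "url TEXT, page_bytes INTEGER, pct_content REAL, logged_at TEXT)"
)

# Made-up sample rows, for illustration only.
rows = [
    ("http://example.com/", 30000, 10.0, "2002-11-08 10:00:00"),
    ("http://example.org/blog/", 20000, 40.0, "2002-11-08 10:05:00"),
    ("http://example.net", 10000, 25.0, "2002-11-09 09:00:00"),
    ("http://shop.example.com", 40000, 20.0, "2002-11-09 09:30:00"),
]
conn.executemany("INSERT INTO gcs_log VALUES (?, ?, ?, ?)", rows)


def summary(conn):
    """Compute the same kinds of statistics reported in the post."""
    total, avg_pct, max_pct, min_pct, avg_bytes = conn.execute(
        "SELECT COUNT(*), AVG(pct_content), MAX(pct_content), "
        "MIN(pct_content), AVG(page_bytes) FROM gcs_log"
    ).fetchone()
    # URLs ending in ".com" or ".com/", as a rough proxy for home pages.
    (avg_com,) = conn.execute(
        "SELECT AVG(pct_content) FROM gcs_log "
        "WHERE url LIKE '%.com' OR url LIKE '%.com/'"
    ).fetchone()
    return {
        "total_pages": total,
        "avg_pct": avg_pct,
        "max_pct": max_pct,
        "min_pct": min_pct,
        "avg_bytes": avg_bytes,
        "avg_pct_dot_com": avg_com,
    }


stats = summary(conn)
```

Anything that can be phrased as a filter plus an aggregate over those four fields fits the same pattern.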
Comments
Posted by Barry Parr on November 13, 2002, at 4:36 a.m.:
I think it would be useful to know the ratios by decile. What was the ratio of the site at the bottom 10%, bottom 20%, etc...
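One way to read this request is as a list of decile cutoffs: the percent-content value of the page sitting at the 10% mark, the 20% mark, and so on. The sketch below assumes the ratios have already been pulled out of the database into a plain list and uses the nearest-rank definition of a percentile, which is only one of several reasonable conventions. The function name is hypothetical.

```python
def decile_cutoffs(percents):
    """Return {10: value, 20: value, ..., 90: value} using nearest rank.

    percents is any iterable of percent-text-content values.
    """
    ordered = sorted(percents)
    if not ordered:
        raise ValueError("need at least one value")
    n = len(ordered)
    cutoffs = {}
    for p in range(10, 100, 10):
        # Nearest rank: the smallest 1-based rank covering p% of the pages,
        # computed as ceil(p * n / 100) with integer arithmetic.
        rank = (p * n + 99) // 100
        cutoffs[p] = ordered[rank - 1]
    return cutoffs


# Made-up ratios, for illustration only.
sample = [5.0, 1.0, 3.0, 2.0, 4.0]
cutoffs = decile_cutoffs(sample)
```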
Posted by Devon on November 13, 2002, at 1:50 p.m.:
Would you be able to create a small search engine so people can find out what pages had similar ratios? Like, if I typed in "http://cnnsi.com/", it would give me its ratio and pages that had a ratio within 2% or something? That could be interesting.
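A lookup like this could be sketched roughly as follows. It assumes the logged data is available as a mapping from URL to percent text content, and it interprets "within 2%" as two percentage points rather than a relative difference. The function name, URLs, and values are all made up for illustration.

```python
def similar_pages(pages, url, tolerance=2.0):
    """Return (ratio, neighbours) for url, or None if url was never logged.

    pages maps URL -> percent text content. neighbours lists the other
    URLs whose ratio is within `tolerance` percentage points, closest first.
    """
    if url not in pages:
        return None
    target = pages[url]
    neighbours = sorted(
        (
            other
            for other, pct in pages.items()
            if other != url and abs(pct - target) <= tolerance
        ),
        key=lambda other: abs(pages[other] - target),
    )
    return target, neighbours


# Made-up data, for illustration only.
pages = {
    "http://a.example/": 20.0,
    "http://b.example/": 21.5,
    "http://c.example/": 19.0,
    "http://d.example/": 30.0,
}
result = similar_pages(pages, "http://a.example/")
```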
Posted by Ben on November 13, 2002, at 5:18 p.m.:
This may just be a tweak to the JavaScript on the front end, but I think it would be nice to have variations of the tool, such as a GCS-Lite bookmarklet, which just spits out the ratio number in an alert box... or a multiple-page input, so that I could run it on several pages at once (or a spider engine so that it could be run by a developer/administrator on their own site)... or a GCS-dex that spiders all relevant news sources each day and lists them all.