Earlier today, Steve Outing of E-Media Tidbits wrote about "page bloat behind the scenes" -- the fact that many major news sites have incredibly bulky HTML under the hood. Then Barry Parr followed up, posting content-to-code ratios for six major news sites.
This topic intrigued me, so I threw together an application that calculates the ratio of text content to total page size for a given Web page. It'll strip all the HTML, JavaScript and CSS, and determine how much of the document is actual text.
I gave it the unsexy name GetContentSize, and I've put it online so everybody can play around with it, just for kicks. (All you fans of object-oriented PHP can also download the source code.)
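For the curious, the core calculation is simple enough to sketch in a few lines of JavaScript. (This is a rough approximation of the idea, not the actual PHP source — the real tool is downloadable separately.)

```javascript
// Rough sketch of the GetContentSize idea: strip scripts, styles and
// tags, then compare the remaining text's length to the full page's.
function contentRatio(html) {
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, '') // drop inline JavaScript
    .replace(/<style[\s\S]*?<\/style>/gi, '')   // drop inline CSS
    .replace(/<[^>]*>/g, '')                    // drop remaining tags
    .replace(/&[#\w]+;/g, ' ');                 // treat entities as one space
  return text.length / html.length;
}
```

Feed it a page's raw HTML and multiply by 100 to get the percentages shown below.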
GetContentSize will tell you, for example, that CNN.com's home page -- just the page itself, not any attached images, JavaScript classes or style sheets -- weighs 47,000 bytes but only devotes 8.70 percent of that to text content. (Really makes you wonder what the other 91.3 percent of the document accomplishes.)
Other interesting ratios:
- dallasnews.com -- 5.85%
- abcnews.com -- 5.94%
- news.bbc.co.uk/1/hi/uk/default.stm -- 7.54%
- chicagotribune.com -- 8.86%
- washingtonpost.com -- 9.70%
- nytimes.com -- 10.85%
- latimes.com -- 10.95%
- boston.com -- 14.70%
Not surprisingly, blogs outperformed news sites considerably -- probably because blogs tend to use CSS to separate content from code, and they're more text-driven than news-site home pages anyway. Some examples:
- dashes.com/anil/ -- 25.94%
- doc.weblogs.com -- 26.18%
- kottke.org -- 40.00%
- ashbykuhlman.net -- 44.63%
- holovaty.com -- 46.67%
- simon.incutio.com -- 46.93%
- hypergene.net/blog/ -- 59.03%
Of course, having a low ratio isn't necessarily horrible. HTML structure is important. And this tool doesn't account for photos (which are an important part of many sites' content) or JavaScript-generated content. Still, I think something's wrong when less than 10 percent of a Web page's raw code is devoted to text content. Load time and rendering time remain important concerns.
UPDATE, Nov. 14, 12:14 AM: I've done a bit of follow-up analysis and have changed my methodology slightly.
Comments
Posted by Anil on November 8, 2002, at 7:10 a.m.:
Aw, man. Now I'm so embarrassed! My sidebar full of links is killing my ratio. Is this going to be a new kind of status, like body fat ratio?
Posted by Jan! on November 8, 2002, at 3:08 p.m.:
Hey, I tested it on my blog-to-be [1] and was surprised at its very low ratio. There isn't much content yet, being just a mockup, but a quick view of the source made me realize I use a lot of title attributes. Shouldn't those, and others like alt, be counted as well?
Anyway, very nice tool!
[1] http://jan.moesen.nu/.temp/20021004_blog_mockup/
Posted by Adrian on November 8, 2002, at 3:41 p.m.:
It can be argued that links (and their title attributes) are content, so maybe the tool should include links in its analysis. What does everyone think?
Posted by Ben on November 8, 2002, at 4:26 p.m.:
What do you mean by that? It seems to be counting links (like navigational text links and such)... Should ALT text be considered content? Excellent tool, Adrian. Quick turnaround, too!
Posted by Adrian on November 8, 2002, at 4:28 p.m.:
Sorry for the confusion: I meant maybe it should include the contents of the A tag -- the href, title and alt.
Glad you like it!
Posted by Carl on November 8, 2002, at 5:43 p.m.:
Define content. I suggest we call it content if it is meant to be seen by visitors. In that case, alt tags and the content of the links (the part the viewer sees, between the > and </a> ) would be included. The href= and such would not, as that is meant to be seen by the browser. Make sense?
Posted by Ben on November 8, 2002, at 6:32 p.m.:
I would vote for TITLE= and ALT=, because while they may not necessarily be viewed, they are viewABLE. I would venture to say links would throw off the ratio, due to the unknown and infinite length of URLs.
Posted by Devon on November 8, 2002, at 6:42 p.m.:
This is awesome. I found that news.google.com is 24% content. I'm wondering how cool this service would be if you logged every 'search' done and what percentage each site turned out to be. I don't know, I just like to brainstorm.
Posted by David Wertheimer on November 8, 2002, at 6:45 p.m.:
Weblog: 40.67%
Economist.com: 10.33%
Weblogs don't have ads and eight editors cramming in content. Pretty self-explanatory. Both my sites are on a par with their respective medians, though. I'll take it.
Posted by Adrian on November 8, 2002, at 6:57 p.m.:
Devon: Great idea. It now logs all the search URLs, page sizes, percentages and date/time. Eventually (possibly tonight) I'll make a public interface to that information.
Posted by Jay Small on November 8, 2002, at 8:40 p.m.:
This is a useful metric (and that DallasNews.com ratio exemplifies why I'm in the middle of a recoding project for the Belo sites). But I also like to measure the total weight of a page, including all scripts, images, style sheets etc.
That's easy enough to do in Internet Explorer (haven't tried it, but I bet there's a way in Mozilla, too). Just pick the Web page you want to weigh, and do File - Save As... then select "Web Page - Complete." You'll get a copy of the HTML document as served, plus a folder containing all the files required to render the disk copy of the page properly.
Get the properties of the page and the folder. You then have a combined page weight.
This method, as well, revealed results that showed it's time to put the pages I deal with in my day job on a diet.
Posted by Mike on November 8, 2002, at 10:15 p.m.:
You have to be VERY careful using the method Jay explains above. When IE saves the page, it can and will make changes to your HTML that can throw off the size of your page (one page I saved went from 50K to 82K). You may also want to save a "View Source" copy of the HTML page for comparison's sake.
Posted by Jay Small on November 8, 2002, at 11:07 p.m.:
The changes IE makes are to things such as img tags, to allow the browser to load in images from locally cached copies. But I have never noticed a big difference in file size between the IE-saved version and a true source version. Note also that the "file size" and the "size on disk" will vary based on the way the disk is formatted.
I just checked and you can also try this method using Mozilla and compare the results. Or you can, as Mike suggests, compare to a true-source file. Regardless, the best reason to use a method such as this is to get the whole picture on page weight -- HTML file, images, scripts and CSS. It's great for comparison.
Posted by Shayne on November 8, 2002, at 11:13 p.m.:
Don't forget the ROI component to all this. Leaner pages = less bandwidth served.
Adrian, you are the shizzle ... nice work.
Posted by John Roberts on November 9, 2002, at 6:58 a.m.:
Interesting tool, no doubt. For sites that post "stories" on their home page (as many blogs do), the % is bound to be higher... which leads one to wonder just how much scanning is a good thing. Should news sites put more content (several paras of a story) on their FD even if they have fewer links / stories? That ties back to the recent thread about home page bloat, which was focused on # of links. Do you want people to read a home page or scan it? I expect scanning is the de facto activity... but I often don't get past the home page of blogs, nor do I need/want to. Hmmm...
Posted by Marek Prokop on November 9, 2002, at 12:27 p.m.:
Great tool, thank you Adrian. However I'd definitely add the TITLE attribute to the content. My blog (in Czech, sorry) is full of titles for links, abbrs and acronyms and thus scores only 40% ;)
Posted by Garçon on November 10, 2002, at 4:09 a.m.:
Get Content Size Bookmarklet -- enjoy it.
Posted by kpaul on November 10, 2002, at 7:16 a.m.:
Thanks to Adrian for the original programming time and Garçon for making it a bookmarklet. This is indeed quite useful. Using tables for design, my ratio is currently low... Yet more reason to sit down and make the switch. Anyone have any 'must read' guides to making the switch from tables to divs?
Maybe we should start up a 'switch' campaign for compliance. "Before, I had bloated table based design, but then I made the switch and my pages size has never been smaller..." :)
Posted by Kiruba.com on November 10, 2002, at 8:46 a.m.:
Adrian, my sincere thanks for spending time on this tool. It's been very useful. Just one small suggestion: you say you don't include photos in the calculation. Don't you think photos come under 'content'?
Posted by Jeevan on November 10, 2002, at 9:19 a.m.:
Kiruba--adding that would boost p0rn sites' "content" ratings from 3% to 90%. ;-)
Posted by Anil on November 10, 2002, at 6:13 p.m.:
It seems if you're not including photos as part of content, perhaps it makes sense to include just .jpeg images. Since a lot of .gifs are for text labels and other things of arguable utility, it might be a good compromise to just include .jpegs which are usually used for actual photographs.
Posted by Garçon on November 11, 2002, at 4:23 a.m.:
I think Adrian's script is heavily used, and the server answers slowly. So I rewrote the script in JavaScript; you can use the new bookmarklet to calculate content size offline. The source differs a little, but... watch this!
Posted by Rob on November 11, 2002, at 8:03 a.m.:
Great work, Adrian! Fun tool. I put up a little form that lets me type in URLs as I go, and I decided to look at the University of Missouri's web pages to see how they fared. I've posted about the results: on average, only about 9% of the pages I tested were text.
Posted by anand on November 11, 2002, at 10:25 a.m.:
ooh la la. My site scores 71% :-)
neat tool.
Posted by Thierry on November 11, 2002, at 7:51 p.m.:
Great tool; it led me to look at the source of some high-ratio pages.
Why doesn't this main page (http://holovaty.com) display any CSS structure in Netscape 4.7 (Mac and PC)? Do other browsers render it correctly?
Thanks!
Posted by kpaul on November 11, 2002, at 8:07 p.m.:
Well, I've finally started tinkering with a table-less design. Not looking too bad so far. Not the greatest on NN4.x and still some kinks in IE5.x mac, but I see the light at the end of the tunnel! :) Would've been easier, I imagine, if I wasn't so keen on the 3-col design.
My efforts can be seen here. While I haven't gone live with it yet, I'm definitely planning on it in the near future. I've shaved about 18k from the file size so far...
Posted by Carl on November 12, 2002, at 4:46 p.m.:
kpaul - your middle column covers much of your right column on my 15 inch screen, and has a white background. But looks like a solid start.
Glish has some 3-column css layouts, although I'm not sure they are up-to-date with the latest bug workarounds. http://glish.com/css/
Posted by Adrian on November 12, 2002, at 7:18 p.m.:
Thierry: This site doesn't display a pretty layout in Netscape 4.x because that browser does not render stylesheet-based layouts correctly. Content, though, is still accessible in Netscape 4 and in all other browsers back to the 1.0 generation. (Fancy handheld Web browsers, too.)
If you're using Netscape 4, I pity you, but I don't disdain you. You'll still be able to access this site.
Posted by snowsuit on November 12, 2002, at 8:44 p.m.:
I'm surprised at my ratio (49%), especially considering that my site is table-based and not strict CSS. Great utility. Thanks!
Posted by JS on November 13, 2002, at 11:53 p.m.:
Neato tool. My blog is 43.93% - but now that I know, seems like I should be able to get it to at least 50% since I do aim for at least a half-assed job.
Posted by Richard Edwards on November 19, 2002, at 3:33 a.m.:
How about this for a quick ratio in Internet Explorer:
javascript:alert(document.body.innerText.length/document.body.innerHTML.length)
For convenience, I made this a bookmarklet in my Links bar.
Posted by Mr. Farlops on November 21, 2002, at 11:52 p.m.:
Perhaps there ought to be a list or something--the 40+ club. To qualify you have to have pages that meet or exceed a 40% content to markup ratio.
I agree with the others, ALT, CITE, LONGDESC, SUMMARY, TITLE, etc. values should count as content too.
Some suggestions on the what-counts-as-content issue:
HREF values are problematic, but I think you could do a simple test: if an HREF points to internal pages or document fragments, don't count it, because it's navigational; if it points to external stuff, it's content. However, always count the text bounded by the A tag as content, whether its HREF is internal or external.
An idea for whether images count would be to put in a penalizing function that discounts all images without ALT or LONGDESC attributes or with null ALT attributes.
You may also want to penalize tables without SUMMARY values or that lack TH, that should shut out most layout tables or improperly constructed data tables. Data tables that pass the criteria count as content too.
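(The HREF rule above is mechanical enough to sketch. This is a hypothetical illustration, not part of the actual tool; the function name and siteHost parameter are made up for the example.)

```javascript
// Hedged sketch of the heuristic above: anchor text always counts as
// content; the href's characters count only when the link is external.
function linkContentLength(href, linkText, siteHost) {
  let counted = linkText.length; // visible link text is always content
  try {
    // Resolve relative hrefs and fragments against the site's own host.
    const url = new URL(href, 'http://' + siteHost + '/');
    if (url.host !== siteHost) counted += href.length; // external: count URL
  } catch (e) {
    // Malformed hrefs are treated as navigational: count text only.
  }
  return counted;
}
```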
Anyway definitely a nifty tool!
(Luckily, thanks to the good markup and design practices of the kind folks at the WaSP, I've been a member of the 40+ club since 1998!)
Posted by Martin on December 12, 2002, at 7:50 a.m.:
Any thoughts as to what content % might be considered too high?
Posted by Camilo on October 22, 2003, at 5:17 p.m.:
Agh. I am down there with Anil, on the 25%! Perhaps all the links and metadata?
Posted by Mike on December 17, 2003, at 6:08 p.m.:
thanks for keeping this up, and sharing the code!