The following is a proposed standard for bringing more semantic structure to articles on the Web. In our efforts to deliver quality content without the surrounding clutter, we've seen that the Web is a pretty messy place. We hope that by providing some simple guidelines we can help publishers make their content a little more presentable with Readability while also making the Web a bit more semantic.
By and large, you'll find that our guidelines just follow other specifications. We lean heavily on the work of the hNews microformat as well as the new elements provided within HTML5. If anything is unclear, please refer to the hNews microformat specification as well as this handy guide to semantic elements in HTML5, from Mark Pilgrim's Dive Into HTML5.
hentry denotes the beginning of an Entry, which will be the wrapper within which all of our content is found.
entry-title class denotes the title of the Article. This is intentionally distinct from the title tag, which often differs due to organization name or SEO content.
entry-content class denotes what part of the article is the body content. Readability will use this as the body, if found.
entry-summary class denotes the lede, subhead or dek of the Article. If this exists, it should be content distinct from the title or content of the article that gives a brief summary—one or two sentences—of the article itself.
byline vcard denotes who wrote the article, typically a person. The fn class within it denotes the person's full name. See hCard for more info.
The source of the article is the organization or group backing the article. If the source is solely an individual, naming that individual will suffice (and you may append source-org to the author vcard).
source-org also follows the hCard spec.
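To show how the classes above fit together, here is a minimal sketch of a marked-up entry. The headline, names, and text are placeholders, and the choice of wrapper and child elements (div, h1, p, span) is our assumption; only the class names come from the descriptions above.

```html
<!-- Sketch only: element choices and all text are placeholders. -->
<div class="hentry">
  <h1 class="entry-title">Example Article Title</h1>
  <p class="entry-summary">A one- or two-sentence summary of the article.</p>
  <!-- The author, marked up as an hCard with a full name (fn) -->
  <p class="byline vcard">By <span class="fn">Jane Doe</span></p>
  <!-- The backing organization, also marked up as an hCard -->
  <p class="source-org vcard"><span class="fn org">Example News</span></p>
  <div class="entry-content">
    <p>The body of the article goes here.</p>
  </div>
</div>
```

If the source is solely the author, the two hCards could likely be merged by adding source-org to the byline element, as noted above.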
Use the article tag to wrap an entry. It's semantic and easy for Readability to spot.
We’ll be looking for time elements with the pubdate attribute within articles we process. This will help us understand when the article was published. To quote the HTML5 Working Draft, the pubdate attribute “is a boolean attribute. If specified, it indicates that the date and time given by the element is the publication date and time of the nearest ancestor article element, or, if the element has no ancestor article element, of the document as a whole.”
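As a rough illustration, an entry wrapped in an article tag with a publication time might look like the sketch below. The date, title, and body text are made-up placeholders, and pubdate is used as described in the Working Draft text quoted above.

```html
<!-- Sketch only: the date and all text are placeholders. -->
<article class="hentry">
  <h1 class="entry-title">Example Article Title</h1>
  <!-- pubdate is boolean: its presence marks this as the article's publication time -->
  <time datetime="2011-03-15T09:30:00-05:00" pubdate>March 15, 2011</time>
  <div class="entry-content">
    <p>The body of the article goes here.</p>
  </div>
</article>
```

Because the time element sits inside the article, the date applies to that article rather than to the document as a whole.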
<aside>, <header>, <nav> and <footer>
By using these tags, you can provide a big head start in figuring out what is not the primary content of the page.
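One possible page skeleton using these tags is sketched below. The layout and contents are assumptions for illustration; the point is only that everything outside the article is clearly labeled as something other than primary content.

```html
<!-- Sketch only: layout and contents are placeholders. -->
<body>
  <header>Site name and logo</header>
  <nav>
    <a href="/">Home</a>
    <a href="/archive">Archive</a>
  </nav>
  <article class="hentry">
    <h1 class="entry-title">Example Article Title</h1>
    <div class="entry-content">
      <p>The body of the article goes here.</p>
    </div>
  </article>
  <aside>Links to related stories</aside>
  <footer>Copyright notice and contact links</footer>
</body>
```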
<figure> and <figcaption>
These tags should be used for media related to an article, which lets us pull that media cleanly into the article's flow. Images are most typical, but other media is also allowed, as per the W3C spec: “The element can thus be used to annotate illustrations, diagrams, photos, code listings, etc.” Please note that Readability may strip content such as Flash and images, depending on user preference.
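A minimal sketch of an image with a caption, placed inside the article body, could look like this. The image path, alt text, and caption are placeholders.

```html
<!-- Sketch only: the image path and all text are placeholders. -->
<div class="entry-content">
  <p>A paragraph of the article that the image relates to.</p>
  <figure>
    <img src="/images/example-photo.jpg" alt="Description of the photo">
    <figcaption>A short caption describing the photo.</figcaption>
  </figure>
  <p>The article continues after the figure.</p>
</div>
```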
This is a special class that explicitly tells Readability (and other parsers) to ignore the content within it. It can be used on any element. This is currently the only Readability-specific directive.
This is a special class that explicitly tells Readability (and other parsers) that the content within it is related to the article's body content. This is particularly useful in cases where you have content that should be an asset in a figure tag, but you can't yet switch to HTML5.
A comment class will help Readability to better filter (or, in the future, display) extraneous comments from an article's text.
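For example, reader comments that follow an article might be marked up as in the sketch below. Only the comment class comes from the description above; the wrapping element, its id, and the placement after the article are our assumptions.

```html
<!-- Sketch only: the "comments" id, the wrapper, and all text are hypothetical. -->
<article class="hentry">
  <h1 class="entry-title">Example Article Title</h1>
  <div class="entry-content">
    <p>The body of the article goes here.</p>
  </div>
</article>
<div id="comments">
  <div class="comment">
    <p>First reader comment.</p>
  </div>
  <div class="comment">
    <p>Second reader comment.</p>
  </div>
</div>
```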