Content from 2015-01

Synthetic feeds

posted on 2015-01-12

Not every site that publishes articles has a feed. Even when feed readers were at their height, not every site had a feed, and now that feed readers are, if not declining, certainly marginalized, it cannot be safely expected that every interesting new site will have a feed.

And while sites usually do have feeds, in many if not most cases the feed is an unintentional side effect of using a platform or CMS that has feeds on by default. We cannot assume that this will be so forever.

(An inauspicious sign: the Roundtable blog at Lapham’s Quarterly, which uses the feed icon in its branding, does not actually offer a feed.)

On the other hand, social networks are on the ascendant, and search engines are not so much ascendant as enthroned. Semantic markup is increasingly common, to improve presentation in search engines and social media – a motivation that seems unlikely to slacken. (Not to mention the new semantic elements introduced by HTML5 – article, time, &c.)

Widespread use of semantic markup means that a large subset of the information in a feed can now reliably be gleaned directly from ordinary web pages. And, even if semantic markup is lacking, we have good general algorithms for content extraction.
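As a minimal sketch of what such gleaning looks like, here is a standard-library parser that pulls a title and an HTML5 machine-readable date out of a marked-up page. The sample page and field names are illustrative, not TBRSS's actual extraction code.

```python
from html.parser import HTMLParser

class EntryScanner(HTMLParser):
    """Collect the basics a feed entry needs: a title and a publication time."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.published = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "time":
            # The HTML5 <time datetime="..."> attribute is machine-readable.
            self.published = dict(attrs).get("datetime")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

page = """
<html><head><title>An Example Post</title></head>
<body><article>
  <h1>An Example Post</h1>
  <time datetime="2015-01-12">January 12, 2015</time>
  <p>Body text.</p>
</article></body></html>
"""

scanner = EntryScanner()
scanner.feed(page)
print(scanner.title)      # An Example Post
print(scanner.published)  # 2015-01-12
```

A real extractor would also look at Open Graph and schema.org properties, and fall back on general content-extraction heuristics when markup is absent.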

So TBRSS is introducing a new and highly experimental feature: synthetic feeds. If you try to add a page, and that page does not contain a link to a feed, you now have the option to create a feed directly from that page.

The semantics are simple: we scan the page for metadata and links; we resolve a certain number of those links in a certain order (this is the most experimental part), fetch the linked pages, and process them into entries.
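The steps above can be sketched as follows. The fetch function here is a dict lookup standing in for a real, rate-limited HTTP client, and the URLs, link-ordering policy (document order, bounded count), and entry shape are all illustrative assumptions, not the production design.

```python
import re

# Stand-in for the web: root page plus two linked pages.
SITE = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a> '
                           '<a href="http://other.com/x">offsite</a>',
    "http://example.com/a": "<title>Post A</title>",
    "http://example.com/b": "<title>Post B</title>",
}

def fetch(url):
    return SITE.get(url, "")

def links(base, html):
    # Resolve relative hrefs against the root and keep only same-site links.
    out = []
    for href in re.findall(r'href="([^"]+)"', html):
        url = base.rstrip("/") + href if href.startswith("/") else href
        if url.startswith(base):
            out.append(url)
    return out

def synthesize(root, limit=10):
    # Root page supplies metadata and links; each linked page becomes an entry.
    entries = []
    for url in links(root, fetch(root))[:limit]:
        m = re.search(r"<title>(.*?)</title>", fetch(url))
        entries.append({"link": url, "title": m.group(1) if m else url})
    return entries

feed = synthesize("http://example.com/")
print([e["title"] for e in feed])  # ['Post A', 'Post B']
```

Note that the offsite link is dropped: only pages within the site linked from the root page become entries, which matches the one-to-one mapping described below.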

(This is completely different from something like page2rss. They monitor a single page for changes and report the differences within that page. For TBRSS the root page of the synthetic feed is only of interest as a source of metadata and links. The entries are one-to-one with the pages in the site linked to from the root page.)

This is crawling, and we respect the same rules as any crawler: robots.txt and the ROBOTS meta tag, rate-limiting our requests, &c.
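Crawler politeness of this sort can be sketched with the standard library: honor robots.txt via urllib.robotparser and space out requests. The robots.txt content, the "TBRSS" user-agent string, and the one-second delay are illustrative assumptions.

```python
import time
from urllib.robotparser import RobotFileParser

# Parse an (illustrative) robots.txt that disallows one directory.
rp = RobotFileParser()
rp.parse("""User-agent: *
Disallow: /private/
""".splitlines())

def polite_fetch(url, last=[0.0], delay=1.0):
    if not rp.can_fetch("TBRSS", url):
        return None  # disallowed by robots.txt
    wait = last[0] + delay - time.monotonic()
    if wait > 0:
        time.sleep(wait)  # rate-limit: at most one request per `delay` seconds
    last[0] = time.monotonic()
    return "fetched %s" % url  # stand-in for a real HTTP GET

print(polite_fetch("http://example.com/private/x"))  # None
```

A production crawler would also check per-agent rules, the Crawl-delay hint where present, and the ROBOTS meta tag in each fetched page, but the shape is the same.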

There is of course also the question of copyright. Sites that provide feeds are implicitly giving us permission to display their content. We respect this for synthetic feeds the same way as for truncated feeds: we do not display the content (only a summary) unless the user explicitly asks to see the full content.

Again, synthetic feeds are experimental; but the experiment is in progress.

This blog covers lisp, code


Unless otherwise credited all material copyright by Paul M. Rodriguez