Recent Content

Open source

posted on 2014-03-31

TBRSS does a lot of interesting things – it is a feed fetcher, a feed reader, and a sophisticated text analyzer – but it is not a large program, usually around 10,000 lines of code. It stays small because, whenever possible, I move functionality into separate libraries. Lately, I have been getting some of these libraries ready for release.

TBRSS is built on open source software; naturally I want to do my part. But I see very little value in the fashionable idea of open sourcing the application. Applications are compromises: they run in a particular environment, on particular hardware, and reflect a path-dependent history of hacks and trade-offs.

As a rule, interesting functionality should be moved into libraries, and those libraries should be open-sourced. It’s only fair; it’s a sleazeless form of self-promotion; and it’s best for the library, both because it permits outside contributions, and because, in bringing it up to publishable standards, you make it better and more maintainable.

Our first release is the foundation for all the others – Serapeum, “utilities beyond Alexandria.” The README explains the purpose and reasoning behind yet another utility library.

As other libraries are prepared, they will be released on the TBRSS GitHub.

Weekend Reading

posted on 2014-01-12

One thing I’ve learned from TBRSS is that very little gets written or published on the weekends (where “weekend” includes Friday afternoons). By Sunday one’s reading list becomes depressingly thin.

So I’ve introduced a very simple but very useful feature. Over the weekend, the algorithm that ranks entries adapts itself to give less weight to recency. The result is a reading list with a longer window, one that reflects the best of the week rather than the past few days.

Call it “weekend reading.”

(An ideal implementation would be able to infer the country of origin of every feed and adjust its idea of what the “weekend” is to the vagaries of cultures and time zones. But in practice, as far as I can tell, the net as a whole still moves to the rhythm of the American workweek.)
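
To make the idea concrete, here is a minimal sketch in Python. It is not the actual TBRSS ranking code: the exponential decay, the two half-life values, the `quality` input, and the function names are all assumptions chosen for illustration. The only thing taken from the description above is that recency counts for less on the weekend, and that the weekend starts on Friday afternoon.

```python
from datetime import datetime, timedelta

# Hypothetical half-lives: how long it takes an entry's recency weight to halve.
WEEKDAY_HALF_LIFE = timedelta(days=1)
WEEKEND_HALF_LIFE = timedelta(days=3)


def is_weekend(now: datetime) -> bool:
    """Treat Friday afternoon through Sunday as the weekend."""
    weekday = now.weekday()  # Monday is 0, Sunday is 6
    return weekday in (5, 6) or (weekday == 4 and now.hour >= 12)


def score(quality: float, published: datetime, now: datetime) -> float:
    """Weight a quality score by recency, decaying more slowly on weekends."""
    half_life = WEEKEND_HALF_LIFE if is_weekend(now) else WEEKDAY_HALF_LIFE
    age = max(now - published, timedelta(0))
    # Dividing one timedelta by another yields a plain float.
    return quality * 0.5 ** (age / half_life)
```

With these assumed numbers, a three-day-old entry keeps half its weight on a Sunday but only an eighth of it on a Wednesday, which is what widens the window.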

Tags

posted on 2013-12-19

TBRSS has fewer features than other feed readers: this is by design, because most of those “features” implicitly benefit the operators of the feed reader at the cost of its users, by promoting compulsive behavior.

I used to think that tagging (or categorizing, or labeling) feeds was one such misfeature.

The heart of TBRSS is the reading list. However many feeds you subscribe to, when you log into TBRSS what you see is a reading list of the 10 entries most worth reading.

I think of that 10 as very optimistic. It would be remarkable for ten things worth reading at all, let alone ten things worth reading for any particular person, to be published in the same few days.

Separate reading lists for separate tags, I thought, would just surface more noise. And tags, in particular, are problematic: they offload to the user distinctions of relevance and quality that should be made by the code itself.

But that view is idealized. The fact is, we all have areas of special interest where our standards are relatively low. Running a website, I need to keep up with certain technologies, and to do that I need to keep up with certain sources and authors, even if they fail to crack the top ten. (Every time I visit Hacker News, that’s a bug.)

Besides that, there was another consideration: I wanted to make sure that the OPML file you download from TBRSS is reasonably close to the OPML file you upload – and since that means internal support for tags, it might as well become a feature.

Galleries

posted on 2013-11-29

TBRSS has a new feature: you can browse all the pictures in your feeds as a single gallery. There are more or less tortuous ways to justify such a feature for a “reader for readers”, but the truth is, I missed pictures.

An Introduction to SubToMe (for My Competitors)

posted on 2013-10-27

A few days ago I integrated TBRSS with SubToMe. The idea behind SubToMe is to “grow the RSS pie,” for the benefit of all feed readers, by creating an open equivalent to one of the most basic affordances of closed platforms like Tumblr, Twitter, &c. – the convenient “follow” button.

The clever thing here is that there is no server, no database, no protocol; SubToMe is a JavaScript application that runs in the browser, maintaining a list of feed readers in localStorage. When the user clicks a SubToMe button (or uses the SubToMe browser extension), a modal dialog appears with their list of feed readers; they choose, and SubToMe redirects them.

The process for registering an application with SubToMe is simple, but not quite as simple as it looks: I had to refer to the source code to settle some doubts. So, in the spirit of “growing the RSS pie,” here are some notes addressed to my present and future competitors.

SubToMe only needs two pieces of information: the name of your application, and a URL template to construct the redirect. In the case of TBRSS, for example:

name  TBRSS
URL   https://tbrss.com/subscribe?url={url}

The endpoint should take the URL given, and return a form for the end user to confirm their subscription. (Don’t omit the form: without the extra step you are performing a CSRF against yourself.)

I trust your application already has a name.

The template must take at least one of three parameters: {url}, the location of the page where the end user used SubToMe; {feeds}, a comma-separated list of feeds extracted from the page; and {feed}, the first of {feeds}.

But, of the three, only {url} is useful. SubToMe’s feed extractor is simpleminded – equivalent to link[rel~=alternate][href] – and of course it has to be, since browsers and servers limit the length of URLs.

Presumably you can do better. Just fetch the {url} and do your own extraction.
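
As a starting point, here is a sketch of doing your own extraction in Python, using only the standard library. It is not how TBRSS does it. It assumes you have already fetched the page at {url}, and it improves on the bare selector in just one way, by keeping only links that declare an RSS or Atom type. The names `FeedLinkParser` and `find_feeds` are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Media types treated as feeds in this sketch (an assumption, not a standard list).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect hrefs of <link rel="alternate"> elements that declare a feed type."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()  # rel is a space-separated list
        is_feed = attr.get("type", "").strip().lower() in FEED_TYPES
        if "alternate" in rels and attr.get("href") and is_feed:
            # Resolve relative hrefs against the page's own URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def find_feeds(html: str, base_url: str) -> list[str]:
    """Return absolute feed URLs advertised in the page, in document order."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

A real extractor would go further (sniffing the fetched documents, trying common feed paths), but even the type check keeps alternate-language links out of the results.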

Since SubToMe runs in the browser, you will have to register your application once per user, with an iframe in your HTML:

<iframe style="display:none;" src="https://www.subtome.com/register.html?name=<Name of your application>&url=<url of the subscription handler>"></iframe>

For TBRSS, for example:

<iframe style="display:none;" src="https://www.subtome.com/register.html?name=TBRSS&url=https%3A%2F%2Ftbrss.com%2Fsubscribe%3Furl%3D%7Burl%7D"></iframe>

(Remember to URL-encode your endpoint.)
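
If you would rather generate the src than encode it by hand, a short Python sketch follows. The helper name `registration_url` is made up for this example; the register page and its two query parameters are the ones shown in the iframe above. Note that the braces of the template get percent-encoded along with everything else.

```python
from urllib.parse import quote

REGISTER_PAGE = "https://www.subtome.com/register.html"


def registration_url(name: str, template: str) -> str:
    """Build the iframe src, percent-encoding both values (braces included)."""
    # safe="" forces ":", "/", "?" and "=" to be encoded as well.
    return f"{REGISTER_PAGE}?name={quote(name, safe='')}&url={quote(template, safe='')}"
```

Calling `registration_url("TBRSS", "https://tbrss.com/subscribe?url={url}")` reproduces the src attribute in the TBRSS example.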

You must serve the iframe repeatedly, since there is no way to check whether a user has already been registered. The iframe is cheap – SubToMe uses an appcache, so the only overhead should be a 304 Not Modified from the manifest. Still, a request is a request, so we only serve the iframe once per session.

Nothing in the design of SubToMe prevents you from simply registering every visitor. I’m not sure if this is intentional. The more feed readers are registered, the more potentially useful SubToMe becomes; but there is no mechanism for the end user to prefer one feed reader to another. TBRSS stops at registering logged-in users.

The name SubToMe (short for “Subscribe To Me” – nothing to do with tomes or submission) is awkward; but of course we live in latter days, and all the good names were taken long before we came.

Expanding entries

posted on 2013-10-16

I’ve added an experimental feature: a button to expand entries from truncated feeds inline. It is far from perfect, but is already very useful.

Extracting the content from the soup of credits, ads, and comments that is a modern blog entry or news article is much harder than it looks, but there is a very good, very general solution which treats the notional writer as a Markov process. The whole paper is worth reading, but I’ll quote the following, because it is one of the insights behind the whole project of TBRSS:

The use of full sentences usually means the author wants to make a more or less complex statement which needs grammatical constructs, long explanations etc. Extensive coding is required because the author (sender) does not expect that the audience (receivers) understand the information without explanation.

To put it another way: the intention to communicate is something that machines can recognize, and very reliably, because when we actually want to communicate, we have a lot to say.
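
To show how little machinery that recognition needs, here is a toy sketch in Python. It is not the method from the paper and not what TBRSS runs: the two features (word count and words per sentence), the thresholds, and the function name are assumptions picked only to illustrate the idea that prose comes in long, full sentences and boilerplate does not.

```python
import re

# A run of sentence-ending punctuation followed by whitespace or the end of the block.
SENTENCE_END = re.compile(r"[.!?]+(?:\s|$)")


def looks_like_content(
    block: str, min_words: int = 15, min_words_per_sentence: float = 8.0
) -> bool:
    """Guess whether a text block is prose, using only shallow features."""
    words = block.split()
    if len(words) < min_words:
        return False  # short blocks are usually navigation, credits, or ads
    sentences = max(len(SENTENCE_END.findall(block)), 1)
    return len(words) / sentences >= min_words_per_sentence
```

A navigation bar fails the first test, a row of one-word share buttons fails the second, and a paragraph like the one quoted above passes both.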

Code & math

posted on 2013-08-15

TBRSS now integrates Highlight.js and MathJax. That means that code snippets in most programming languages get displayed with syntax highlighting, and mathematical notation using both inline TeΧ and MathML gets typeset using a full-blown mathematics-specific display engine. (That includes the pre-rendered LaTeΧ in WordPress-powered blogs like What’s New.)

Postscript: MathJax turns out to be somewhat oversold. For small amounts of mathematics, it works fine, but I can certainly see why it’s not in universal use among mathematicians: with large amounts of math, it’s appallingly slow. For the moment I’ve compromised by disabling it except when the browser supports MathML, which is much faster than the default HTML+CSS rendering.

Languages

posted on 2013-07-17

Another surprise from seeing actual people’s actual lists of subscriptions is how common polyglots are. Better support for languages was always something I planned for, but it turned out to be urgent.

The problem is that an algorithm that correctly ranks content for substance within one language is not necessarily valid for ranking content across different languages.

The primary reading list – the one you see when you first log in – will remain multilingual. But, when applicable, the reading list will now display a menu for filtering your reading list by language.

You can see what the menu looks like on the Top page.

The language is controlled by a lang query parameter, so you can bookmark the reading list for a particular language by appending the abbreviated language name as an argument to the URL. Your English reading list, for example, is at https://tbrss.com/reading-list?lang=en.

Robots.txt

posted on 2013-07-01

Part of the reasoning behind offering temporary accounts is to get my hands on as many actual feeds as possible. Since yesterday I’ve learned another dozen ways a feed can go wrong.

The biggest lesson so far is that it’s a mistake to pay attention to robots.txt. Roughly 7% of new subscriptions are for feeds on hosts where robot exclusion policies forbid access to the feed.

This is presumably not what the feed authors intend. So, although TBRSS will still obey crawl-delay, if one is specified, it will ignore the access rules.
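
A sketch of that policy in Python, using the standard library's robots.txt parser. This is not the TBRSS fetcher; the user agent string, the function name `fetch_policy`, and the shape of the returned dictionary are assumptions for illustration. The access rules are still evaluated, but only so that the exclusion can be counted, never so that it can block the fetch.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-feed-fetcher"  # hypothetical user agent


def fetch_policy(robots_txt: str, feed_url: str, default_delay: float = 0.0) -> dict:
    """Honor Crawl-delay from robots.txt while ignoring Allow/Disallow rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(USER_AGENT)  # None when no delay is specified
    return {
        "fetch": True,  # access rules never prevent fetching a subscribed feed
        "delay": float(delay) if delay is not None else default_delay,
        # Recorded for statistics only, e.g. to measure how common exclusion is.
        "excluded_by_rules": not parser.can_fetch(USER_AGENT, feed_url),
    }
```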

A feed reader for readers

posted on 2013-06-30

I didn’t think I missed RSS. In the bad old days I read between 200 and 600 feeds at a time, for some definition of reading. I was always behind, and in the end I defaulted. When I logged back in and saw that long left column and all the unread counts – none of it made any sense.

The news of Google Reader’s demise made me stop and remember. When I left RSS, I had good intentions. I packed the few that were always worth reading – every single post – and resolved to keep up by hand, the old-fashioned way, clicking one link at a time. But resolution was not enough. Nothing matches having everything in one column in one place.

Google Reader is dead. The throne is empty. This is a good time to ask what a feed reader could be. I saw a scramble to build replacements for Google Reader. I wondered: why not do something better?

So I sat down and started building one. I named it TBRSS (To Be Read + RSS). Names matter; they mark commitments. This one would be a feed reader for readers.

I knew I was never going back to feeds on the left, posts on the right, mark as read. What else?

Feed readers descend from email clients. A feed reader is one big inbox for the Internet. At some point, someone had the thought – strange to say – that the Internet would be so much better if only everything were more like email. First principle for a reformed RSS reader: it should look as little like an inbox as possible.

What else? Start with the obvious: RSS readers look like email clients. What should they look like? The answer is in the name. Readers are for reading; perhaps they should emphasize readability. A feed reader, being made of blogs, should look like a blog; and not just like a blog, but like blogs should look.

But wait. Why bother? RSS is dying. Let it die.

It is true, social media and social news have divided the inheritance of RSS. In some ways they have improved on it. But in making the division they left the most important parts out. The user of a feed reader is distinguished by intellectual curiosity. The last thing we want is to train another machine to provide more of the same. The last thing we want is another bubble.

How do social media and social news improve on feed readers?

The precise overlap between feed readers and social media depends on the reader and the medium; but social media never have mark-as-read.

This is important. It sends a different message. RSS readers have limited demands and high expectations. There are a finite number of posts to be read, but you are meant to read them all. Social media have limitless demands and low expectations. The scrolling is infinite, but also bottomless: the only fixed point is the top. The media that have something like an unread count use it to nag you about the new posts appearing at the top. The rest may not be accessible at all, except through scripted scrolling.

Social news is not a descendant of RSS, but a clear case of convergent evolution. They feed on the same ecosystem and fill the same niche. Both are made up of a mix of blogs and news; both provide for keeping up with events and ideas. But social news does not rely on chronological order.

Of course the headlines on a social news site are a function of time. There is a codified prejudice against old headlines that gradually weighs them down until they sink out of sight. But meanwhile the front page, drawing from the new and the less new, orders its headlines according to other principles.

The lesson of social media is negative: don’t have mark-as-read. The lesson of social news is positive: use other principles to supplement chronological order. And if my feed reader is going to organize my feeds, I want only one standard: I only want the substance. I want, at least, what remains after the filler and the noise are left out.

What is substance? There are some obvious correlates. A feed where the author posts infrequently, at length, and with a command of grammar and style, stands a good chance of being substantive.

This sounds like an obvious application for machine learning, but I wanted to see how far I could get with simple metrics, and the results have been better than I hoped.

(It’s suggestive that in the problem of extracting the main content from complex webpages, the most successful project, Boilerpipe, relies completely on “shallow text features.” And in clustering (content vs boilerplate) and ranking (substance vs filler) we are facing much the same problem.)

Given this success, what to do with it?

Watching software eat the world, I’ve thought about startups. Occasionally something makes sense to me: providing a service, for a fee. I like the idea of building something for myself, making it as useful as I can, and charging other people to use it.

Not just for money (although for money), but because like-minded people, people who would want to use such a service, might have good ideas.

TBRSS is not intended to be the only feed reader. It is designed around one particular experience. I want to sit down, open the site, see what’s new, and read it: not to get it out of the way, but for all the reasons it is good to read in the first place. And then I want to close it, and feel neither the panicky sense of having missed something, nor the grim satisfaction of having made it to the end. I want to leave looking forward to what will show up next.

It is that experience that I intend to spend my time and effort perfecting. In the end, if you want something else, you should look elsewhere.

But, in the meantime, stay and try it. You don’t have to sign up first. You can upload an OPML file or start adding feeds right away, on a temporary account. And of course – if you like it – then sign up.



Unless otherwise credited all material copyright by Paul M. Rodriguez