Web feeds from Nepal
Friday, March 24th, 2006
If you want to get to the point of this post, skip to this paragraph.
From Wikipedia:
A web feed is a document (often XML-based) which contains content items, often summaries of stories or weblog posts with web links to longer versions. Weblogs and news websites are common sources for web feeds, but feeds are also used to deliver structured information ranging from weather data to “top ten” lists of hit tunes. The two main web feed formats are RSS (which is older and far more widely used) and Atom (a newer format that has just completed the IETF standardization process.)
RSS is a family of web feed formats, specified in XML and used for Web syndication. RSS is used by (among other things) news websites, weblogs and podcasting. The abbreviation is variously used to refer to the following standards:
- Rich Site Summary (RSS 0.91)
- RDF Site Summary (RSS 0.9 and 1.0)
- Really Simple Syndication (RSS 2.0)
In less technical terms, the RSS “feed” for a site is basically a dynamic list of updates to the site that can be read using a program called an aggregator, without actually having to visit the site in a browser. There are many aggregator programs out there and even more aggregator sites that will read RSS feeds for you (for example, Google, Yahoo! and Microsoft’s preview of start.com). However, the easiest way that I have found to read simple RSS feeds is to use browsers that have built-in aggregators, such as Mozilla Firefox and Safari (for Macs). For example, if you are reading this blog in Mozilla Firefox, you will see a small orange icon in the address bar (or in the bottom right corner for some older versions). You can click this icon to add the RSS feed for my blog to your Bookmarks folder or your Bookmarks Toolbar Folder (the latter is a popular choice) so that when you click that bookmark, you will be presented with a list of the latest updates to this site.
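To make the idea of an aggregator a little more concrete, here is a minimal sketch in Python of the core step: reading an RSS 2.0 document and listing the title and link of each item. The sample feed and the function name `list_updates` are made up for illustration; a real aggregator would also download the feed and remember which items you have already seen.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 document, standing in for what a site would serve.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>http://example.com/</link>
    <description>Made-up feed for illustration</description>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""


def list_updates(feed_xml):
    """Return (title, link) pairs for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    updates = []
    for item in root.iter("item"):
        updates.append((item.findtext("title"), item.findtext("link")))
    return updates


for title, link in list_updates(SAMPLE_FEED):
    print(title, "->", link)
```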
This process of web syndication is especially useful for sites such as blogs and news sites because their content changes frequently. Most popular blogging software and services (for example, WordPress, Movable Type, Blogger) have built-in code to automatically generate RSS feeds, and many news (and non-news) sites such as BBC News, Fark and Slashdot have RSS feeds.
So what has all that got to do with me? These days, I have come to expect most news sites that I read to have RSS feeds, but none of the major news sites from back home in Nepal that I regularly read (Nepalnews, Kantipur Online, Nepali Times) offer RSS feeds. So I decided to try to build RSS feeds for these sites. The process basically involves looking at the sources of these sites, finding out where the news items occur and how to identify them, and writing some PHP to extract that information into a valid XML-based RSS feed. I could have chosen either of the two web feed formats, RSS or Atom; I chose RSS 2.0, which is the more popular of the two. I also used HTMLParser, a PHP script by Jose Solorzano that parses HTML and returns tags and inline text.
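The steps described above (find the news items in the page source, pull out their text and links, write them out as RSS 2.0) can be sketched roughly as follows. This is not the PHP script behind the feed; it is a small Python stand-in using only the standard library. The sample page, the `headline` class, the `news.example` address and the names `HeadlineParser`, `extract_headlines` and `build_rss` are all assumptions made up for illustration, since the real markup of the news sites is not shown here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

# Made-up page source; a real site's markup would differ.
SAMPLE_PAGE = """
<html><body>
  <div class="nav"><a href="/about.html">About</a></div>
  <a class="headline" href="/news/1.html">First headline</a>
  <a class="headline" href="/news/2.html">Second headline</a>
</body></html>
"""


class HeadlineParser(HTMLParser):
    """Collect (text, href) for links marked with the assumed class."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._href = None   # href of the headline link being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "headline":
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_headlines(page_html):
    parser = HeadlineParser()
    parser.feed(page_html)
    parser.close()
    return parser.items


def build_rss(site_title, site_url, items):
    """Serialize the items as an RSS 2.0 document (returned as a string)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # title, link and description are the required channel elements.
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = "Unofficial feed for " + site_url
    for title, href in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        # Turn site-relative links into absolute ones for the aggregator.
        ET.SubElement(item, "link").text = urljoin(site_url, href)
    return ET.tostring(rss, encoding="unicode")


headlines = extract_headlines(SAMPLE_PAGE)
print(build_rss("Example news", "http://news.example/", headlines))
```

The fragile part is the rule for identifying a news item. Here it is a single class name; on a real page it has to be worked out by reading the source, and it will need updating whenever the site changes its layout.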
The first result of this effort is the Nepalnews RSS feed. To use this feed in Firefox, go to Bookmarks -> Manage Bookmarks… and then open File -> New Live Bookmark… from the menu. Enter the URL http://gridley.res.carleton.edu/~tuladhaa/projects/rss/nepalnewsrss.php as the feed location and place the bookmark in your Bookmarks Toolbar Folder. You should now be able to see a list of the news items from Nepalnews.com when you click on your bookmark. The same URL can also be used in any other aggregator program that supports RSS.
I am still working on making RSS feeds for the other news sites, but that appears to be harder, mainly because the news items are somewhat spread around their pages. Once the feeds are done, they will be available from my projects page.
One thing that I had read about before, but realized more strongly while doing this, is the importance of using semantically significant tags when writing any markup document, especially webpages. For example, having an h1 tag to mark a level one heading, as opposed to a decoration tag such as font or a non-descriptive class, makes it much easier to extract the headings from a page. In the end it all comes down to how well the format (for example, the layout and colors on a page) has been separated from content (the actual text on a page).
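A small Python sketch of the difference, using two made-up snippets with the same visible text: one marks its heading with h1, the other with font. The helper `text_of` and both snippets are hypothetical and only meant to illustrate the point.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside every element with the given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many matching tags are currently open
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.found.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.found[-1] += data


def text_of(page_html, tag):
    collector = TagTextCollector(tag)
    collector.feed(page_html)
    collector.close()
    return [text.strip() for text in collector.found]


SEMANTIC = '<h1>Election results</h1><p>Read the <em>full</em> story.</p>'
PRESENTATIONAL = ('<font size="5">Election results</font>'
                  '<p>Read the <font color="red">full</font> story.</p>')

# Asking for h1 gives exactly the heading.
print(text_of(SEMANTIC, "h1"))
# Asking for font also picks up text that was merely colored red.
print(text_of(PRESENTATIONAL, "font"))
```

With the font version, the script has to guess which decorated text is a heading (by size, color or position), and that guess is exactly the kind of rule that breaks when a site is redesigned.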

