feed-abstract gem updated to support Twitter RSS and Atom

I updated my feed-abstract gem to support Twitter RSS/Atom: it now automatically parses hashtags and turns them into RSS item subjects/categories. Huzzah! This is pretty fun, as it allows tweets to be aggregated into TagTeam seamlessly, where they can be remixed, archived, and searched by tag.
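To give a feel for the hashtag-to-category idea, here's a minimal Ruby sketch. It is not the gem's actual code: the `hashtag_categories` helper, the regex, and the choice to downcase and de-duplicate are all my own assumptions for illustration.

```ruby
# Hypothetical helper, not part of feed-abstract's real API.
def hashtag_categories(text)
  # Grab every run of word characters that follows a '#',
  # downcase it, and drop duplicates while keeping first-seen order.
  text.scan(/#(\w+)/).flatten.map(&:downcase).uniq
end

tweet = "Shipping a new #Rails app with #solr and #rails goodness"
p hashtag_categories(tweet)
# prints ["rails", "solr"]
```

Each returned string would then become one subject/category on the feed item.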

You can get at Twitter RSS/Atom via URLs like:

https://search.twitter.com/search.atom?q=<URL-encoded hashtag>

so:

https://search.twitter.com/search.atom?q=%23rails

I’m sure there are more search parameters available too. If you want RSS, just change the “.atom” to “.rss”.
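If you'd rather build these URLs in code, something like the following should do it. The `twitter_search_feed_url` helper is a hypothetical name of mine; the endpoint and the `q` parameter are just the ones shown above, and the encoding comes from Ruby's standard `uri` library.

```ruby
require 'uri'

# Hypothetical helper; endpoint and 'q' parameter are taken from the URLs above.
def twitter_search_feed_url(hashtag, format = 'atom')
  # encode_www_form_component percent-encodes the leading '#' as %23.
  "https://search.twitter.com/search.#{format}?q=#{URI.encode_www_form_component(hashtag)}"
end

puts twitter_search_feed_url('#rails')
# prints https://search.twitter.com/search.atom?q=%23rails
puts twitter_search_feed_url('#rails', 'rss')
# prints https://search.twitter.com/search.rss?q=%23rails
```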

TagTeam close to 1.0

A long, complicated project of mine (under the direction of Peter Suber and the auspices of the Harvard Library Lab) is nearing its release date – TagTeam (source, demo site).

TagTeam is an RSS/Atom/RDF aggregator that allows administrators to remix and republish feeds on multiple levels. It also allows for the filtering of tags – additions, substitutions, and removals in a flexible “tiered” filtering system.
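To make the "tiered" filtering idea a little more concrete, here's a rough Ruby sketch. It is not TagTeam's actual implementation: `TagFilter`, `apply_tiers`, and the "hub" and "feed" tier names are all hypothetical, and I'm assuming tiers are simply applied in order, each working on the output of the previous one.

```ruby
# All names here are hypothetical and only illustrate the idea.
TagFilter = Struct.new(:action, :tag, :replacement) do
  # Return a new tag list with this single filter applied.
  def apply(tags)
    case action
    when :add     then tags | [tag]                                      # addition
    when :remove  then tags - [tag]                                      # removal
    when :replace then tags.map { |t| t == tag ? replacement : t }.uniq  # substitution
    else tags
    end
  end
end

# Run every filter in every tier, in order, feeding each result forward.
def apply_tiers(tags, tiers)
  tiers.reduce(tags) do |current, filters|
    filters.reduce(current) { |acc, filter| filter.apply(acc) }
  end
end

hub_filters  = [TagFilter.new(:replace, 'oa', 'open-access')]
feed_filters = [TagFilter.new(:add, 'harvard'), TagFilter.new(:remove, 'spam')]

p apply_tiers(%w[oa spam policy], [hub_filters, feed_filters])
# prints ["open-access", "policy", "harvard"]
```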

It uses the feed-abstract gem I wrote to create a “common object graph” between the different feed formats – this has been a huge time saver and made feed parsing much more reliable.

YaCy – a p2p search engine

So I’m running a YaCy node – which is a pretty awesome project to create a search engine indexed “by the people, for the people.”

YaCy provides a Java servent that can index internal resources and external web pages. You have MANY controls over what it indexes, how it indexes it, and the resources allocated to it. There are tons of built-in analytics and logging for the stats geek in you.

It’s still rough, but seems damned promising. A bonus – it uses jQuery and Solr.

I really like the idea of indexing all the content you care about and also providing that index to the world at large to search, but I have concerns over the long-term impact of more ‘bots crawling the web. I would like to see YaCy figure out a way to minimize its impact on a global level – if every YaCy node is indexing the same sites, it could easily escalate to a DDoS-level problem. Perhaps they’re already working on this issue.

fulltext wildcard searching with Ruby/Rails and Sunspot

I love Sunspot for full-text searching in Rails apps, but it took me a while to figure out how to do left-bound wildcard searching in full-text indexed fields.

So – if we’re searching for “collis” in a set of fulltext indexed fields, the default Solr config supplied by Sunspot only matches when you search for the entire word. To get “colli” or “coll” to return records with “collis” in the fulltext index, you just need to modify the Solr config (in $RAILS_ROOT/solr/conf/schema.xml), changing:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

to:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

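which essentially makes the index-time analyzer store left-bound (edge) n-grams for every indexed term, while query terms are left whole. The change only affects documents indexed after it, so existing records need a reindex. This taught me: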

  1. Solr/Lucene/Sunspot rock, and
  2. I have more to learn about Solr config, because the schema.xml looks like it exposes some very powerful search juju.
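To see why the index-time n-grams make prefix searches match, here's a small Ruby sketch that mimics what a front-edge n-gram filter does to a single term. It's an illustration only, not Solr's code: `edge_ngrams` is a hypothetical helper, with defaults mirroring the minGramSize and maxGramSize values from the config above.

```ruby
# Hypothetical helper that imitates a front-edge n-gram filter for one term.
def edge_ngrams(term, min: 1, max: 50)
  term  = term.downcase              # mirrors the lowercase filter
  upper = [term.length, max].min     # never ask for more characters than exist
  (min..upper).map { |n| term[0, n] }
end

index_terms = edge_ngrams('Collis')
p index_terms
# prints ["c", "co", "col", "coll", "colli", "collis"]

p index_terms.include?('coll')   # prints true  (a left-bound prefix matches)
p index_terms.include?('ollis')  # prints false (a suffix does not)
```

Since the query analyzer leaves the search term whole, a query for “coll” is just a plain term lookup that happens to hit one of the stored n-grams.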

Thanks to Arndt Lehmann’s tip on this page.