You are viewing a read-only archive of the Blogs.Harvard network. Learn more.

URL Management and Acts As Nested Set

August 28th, 2008

I’ve been tasked with redesigning our database to optimize the way we store and search for URLs. My first attempt involved Acts_as_Tree. But I’ve since, thrown that out in favor of Acts_as_Nested_Set.

With the nested set model, we can retrieve a single path without having multiple self-joins. The full tree is retrieved through the use of a self-join that links parents with nodes on the basis that a node’s lft value will always appear between its parent’s lft and rgt values. For a more in depth explination of how nested sets work, check out this article on Managing Hierarchical Data in MySQL.

You can’t look up nested sets without tripping over the plugin called BetterNestedSet. BetterNestedSet is an extension of ActsAsNestedSet that provides an enhanced acts_as_nested_set mixin for ActiveRecord.

>> script/plugin install svn://rubyforge.org/var/svn/betternestedset/tags/stable/

So below I’ve detailed my implementation of how to store and retrieve urls using Acts_As_Nested_Set ad the BetterNestedSets plugin. There are two tables, Domains and Directories. I split this out because in most cases we’re really not interested in the directory structure of a url. Rather than clutter up the Domains table with additional rows, we can simply access the Directories table by domain_id if we find the need.

Domain levels are separated by a dot, or period symbol. According to ICANN, domain names can have a 255 character limit and the maximum character number of a supported second level and lower level domains is 63. So by splitting out a url by subdomain, you’ll never do a string search over more than 63 characters. In addition, for performance we should specifiy in the database that the string column be set to a char(63) as opposed to a varchar(255) since varchar is used to keep table sizes down and char is used to make searches faster.

class CreateDomains < ActiveRecord::Migration
def self.up
create_table :domains do |t|
t.int :parent_id
t.string :subdomain, :limit => 63, :null => false
t.int :lft
t.int :rgt
t.timestamps
end
end
def self.down
drop_table :domains
end
end
class CreateDirectories < ActiveRecord::Migration
def self.up
create_table :paths do |t|
t.int :parent_id
t.int :domain_id
t.int :lft
t.int :rgt
t.string :directory, :limit => 63, :null => false
t.timestamps
end
end
def self.down
drop_table :directories
end
end

To store my Urls in my new database tables, I created a method that loops through an array of urls and calls the find_or_create_nestedset method where the meat of the code is stored. The reason I’m showing you this simple loop is to demonstrate the Addressable::URI class and the heuristic_parse method used to parse the urls into a standardized format. You can also use the URI::extract method.

You might want to note that I’m catching the InvalidURIError that is thrown for non standard urls. This is a must when working with user entered urls.

#This method loops through an array of urls and calls
#the find_or_create_nestedset method on each
def self.parse_urls(urls)
require 'rubygems'
require 'addressable/uri'
#Loop through urls and create sitetree records
urls.each do |url|
begin
uri = Addressable::URI.heuristic_parse(url)
domain = find_or_create_nestedset(uri)
#
rescue Addressable::URI::InvalidURIError
#skip the record for now
next
end
end
end

The find_or_create_nestedset method parses a uri into the domain table and (if the uri contains a directory structure as well) into a separate directory table. The domain string is read right to left to find or create a domain record, which means the TLD (top level domain e.g. .com, .net, .org) will always be a root node.

Note: I did try using a dynamic finder rather than checking to see if the find_by returned a record or not. While finding a record by subdomain or parent_id is fine, creating a record throws an error because the dynamic finder assumes that you are trying to assign a parent_id on create. Like the id column in a rails table, you’re not allowed to assign the parent_id column of a nested set. I couldn’t find an option that would allow me to bypass this issue, but please let me know if there’s a better way.

#This method parses uri into the
#domain/directory nested set structure
def self.find_or_create_nestedset(uri)
return Directory.new if uri.host.nil?
#   * normalize the uri
#   * Parse the domain and insert into the domains table
#   * Parse the directories and insert into the directories table
#normalize uri
uri.normalize!
#strip off leading www
sitename = uri.host.start_with?('www.') ? uri.host.gsub('www.', '').downcase : uri.host.downcase
#begin transation
Domain.transaction do
#domains
domain = Domain.new() #initialize
parent = Domain.new() #initialize
#split up the domains into an array
subdomains = sitename.split('.').delete_if {|i| i.empty? }.reverse
#loop through the domain strings right to left to find or create a domain record
subdomains.each do |subdomain|
node = Domain.find_by_subdomain_and_parent_id(subdomain, parent.id)
if node.nil?
node = Domain.create(:subdomain => subdomain, :parent_id => parent.id)
node.move_to_child_of parent unless parent.new_record?  #for acts_as_nested_set
end
parent = node
end
domain = parent
#directories
node = Directory.new() #reinitialize
parent = Directory.new() #reinitialize
#split path directories into an array
patharray = uri.path.strip.split('/').delete_if {|i| i.empty? }
#remove extensions (e.g. foo.html)
patharray.pop unless uri.extname.empty?
return domain if patharray.empty?
#loop through directories left to right to find or create a directory record
patharray.each do |directory|
node = Directory.find_by_directory_and_domain_id_and_parent_id(directory, domain.id, parent.id)
if node.nil?
node = Directory.create(:directory => directory, :domain_id => domain.id, :parent_id => parent.id)
node.move_to_child_of parent unless parent.new_record? #for acts_as_nested_set
end
parent = node
end
directory = parent
end
return domain
end

Now that are tables are populated, we can create the models to access our urls in pleasing ways.

class Domain < ActiveRecord::Base
has_many   :directories
acts_as_nested_set
#
def self.all_parents
self.find(:all, :conditions => "rgt  lft + 1")
end
#
def self.all_leaves
self.find(:all, :conditions => "rgt = lft + 1")
end
#
def host
subdomain + ancestors.collect {|i| "." + i.subdomain }.to_s
end
#
def urls
return host.collect if directories.empty?
directories.collect {|i| i.url }
end
#
def tld
self.root
end
#
end
class Directory < ActiveRecord::Base
belongs_to :domain
acts_as_nested_set :scope => :domain
#
def subdir
return "" if directory.nil?
path = ancestors.reverse.collect {|i| "/" + i.directory.to_s if !i.directory.nil?}.to_s
path + "/" + directory
end
#
def url
domain.host + subdir
end
#
end

Given the leaf record of our nested set, our host method returns the a full url string by using the BetterNestedSet ancestor method. Our urls method will return all the url directories found under a domain. This is useful because now you have access to a bunch of information about a uri that can be accessed rather quickly. The key here is speed. With any Domain object, I can quickly return all the urls in my databases that contain the word “mychildren” with one line of code:

Domain.find_by_subdomain("mychildren").urls

If you want the “mychildren” root domain and we know the tld is “.com”, we can trim away significant sections of the set rather than doing a full table scan. For example:

dotcom = Domain.find_by_subdomain_and_parent_id("com",nil)
domaindotcom = Domain.find_by_subdomain_and_parent_id("mychildren",dotcom);
#
#return all the subdomains of "mychildren.com"
puts domaindotcom.urls
#
# output:
# lillian.mychildren.com
# george.harry.mychildren.com
# mark.mychildren.com

Entry Filed under: Ruby on Rails

2 Comments Add your own

  • 1. lianaleahy  |  September 10th, 2008 at 11:27 am

    P.S. Don’t forget to add acts_as_nested_set to your model. And if you have multiple roots in your tree, you’ll need to add the scope. (e.g. :scope => :root_id)

    And by the way, to migrate an old table… simply add the lft and rgt columns. Then in you code add a “.renumber_all”. This will fill in the new columns with their proper values.

  • 2. lianaleahy  |  September 19th, 2008 at 10:24 am

    I found a solution to the problem detailed in my note above. has_many :through comes through again!

    Josh created a patch last year for an enhancement to Rails dynamic finders. The new feature is that if you pass a hash to the finder, it will still use only the values named in the finder to find the object, but it will create a new object using all the values. Check out the link below for more details:

    http://blog.hasmanythrough.com/2007/3/14/dynamic-finders-with-hash-attributes-for-creation

    Nice!

Leave a Comment

Required

Required, hidden

Some HTML allowed:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>