URL Management and Acts As Nested Set
August 28th, 2008
I’ve been tasked with redesigning our database to optimize the way we store and search for URLs. My first attempt involved Acts_as_Tree. But I’ve since, thrown that out in favor of Acts_as_Nested_Set.
With the nested set model, we can retrieve a single path without having multiple self-joins. The full tree is retrieved through the use of a self-join that links parents with nodes on the basis that a node’s lft value will always appear between its parent’s lft and rgt values. For a more in depth explination of how nested sets work, check out this article on Managing Hierarchical Data in MySQL.
You can’t look up nested sets without tripping over the plugin called BetterNestedSet. BetterNestedSet is an extension of ActsAsNestedSet that provides an enhanced acts_as_nested_set mixin for ActiveRecord.
>> script/plugin install svn://rubyforge.org/var/svn/betternestedset/tags/stable/
So below I’ve detailed my implementation of how to store and retrieve urls using Acts_As_Nested_Set ad the BetterNestedSets plugin. There are two tables, Domains and Directories. I split this out because in most cases we’re really not interested in the directory structure of a url. Rather than clutter up the Domains table with additional rows, we can simply access the Directories table by domain_id if we find the need.
Domain levels are separated by a dot, or period symbol. According to ICANN, domain names can have a 255 character limit and the maximum character number of a supported second level and lower level domains is 63. So by splitting out a url by subdomain, you’ll never do a string search over more than 63 characters. In addition, for performance we should specifiy in the database that the string column be set to a char(63) as opposed to a varchar(255) since varchar is used to keep table sizes down and char is used to make searches faster.
class CreateDomains < ActiveRecord::Migration def self.up create_table :domains do |t| t.int :parent_id t.string :subdomain, :limit => 63, :null => false t.int :lft t.int :rgt t.timestamps end end def self.down drop_table :domains end end
class CreateDirectories < ActiveRecord::Migration def self.up create_table :paths do |t| t.int :parent_id t.int :domain_id t.int :lft t.int :rgt t.string :directory, :limit => 63, :null => false t.timestamps end end def self.down drop_table :directories end end
To store my Urls in my new database tables, I created a method that loops through an array of urls and calls the find_or_create_nestedset method where the meat of the code is stored. The reason I’m showing you this simple loop is to demonstrate the Addressable::URI class and the heuristic_parse method used to parse the urls into a standardized format. You can also use the URI::extract method.
You might want to note that I’m catching the InvalidURIError that is thrown for non standard urls. This is a must when working with user entered urls.
#This method loops through an array of urls and calls #the find_or_create_nestedset method on each def self.parse_urls(urls) require 'rubygems' require 'addressable/uri' #Loop through urls and create sitetree records urls.each do |url| begin uri = Addressable::URI.heuristic_parse(url) domain = find_or_create_nestedset(uri) # rescue Addressable::URI::InvalidURIError #skip the record for now next end end end
The find_or_create_nestedset method parses a uri into the domain table and (if the uri contains a directory structure as well) into a separate directory table. The domain string is read right to left to find or create a domain record, which means the TLD (top level domain e.g. .com, .net, .org) will always be a root node.
Note: I did try using a dynamic finder rather than checking to see if the find_by returned a record or not. While finding a record by subdomain or parent_id is fine, creating a record throws an error because the dynamic finder assumes that you are trying to assign a parent_id on create. Like the id column in a rails table, you’re not allowed to assign the parent_id column of a nested set. I couldn’t find an option that would allow me to bypass this issue, but please let me know if there’s a better way.
#This method parses uri into the #domain/directory nested set structure def self.find_or_create_nestedset(uri) return Directory.new if uri.host.nil? # * normalize the uri # * Parse the domain and insert into the domains table # * Parse the directories and insert into the directories table #normalize uri uri.normalize! #strip off leading www sitename = uri.host.start_with?('www.') ? uri.host.gsub('www.', '').downcase : uri.host.downcase #begin transation Domain.transaction do #domains domain = Domain.new() #initialize parent = Domain.new() #initialize #split up the domains into an array subdomains = sitename.split('.').delete_if {|i| i.empty? }.reverse #loop through the domain strings right to left to find or create a domain record subdomains.each do |subdomain| node = Domain.find_by_subdomain_and_parent_id(subdomain, parent.id) if node.nil? node = Domain.create(:subdomain => subdomain, :parent_id => parent.id) node.move_to_child_of parent unless parent.new_record? #for acts_as_nested_set end parent = node end domain = parent #directories node = Directory.new() #reinitialize parent = Directory.new() #reinitialize #split path directories into an array patharray = uri.path.strip.split('/').delete_if {|i| i.empty? } #remove extensions (e.g. foo.html) patharray.pop unless uri.extname.empty? return domain if patharray.empty? #loop through directories left to right to find or create a directory record patharray.each do |directory| node = Directory.find_by_directory_and_domain_id_and_parent_id(directory, domain.id, parent.id) if node.nil? node = Directory.create(:directory => directory, :domain_id => domain.id, :parent_id => parent.id) node.move_to_child_of parent unless parent.new_record? #for acts_as_nested_set end parent = node end directory = parent end return domain end
Now that are tables are populated, we can create the models to access our urls in pleasing ways.
class Domain < ActiveRecord::Base has_many :directories acts_as_nested_set # def self.all_parents self.find(:all, :conditions => "rgt lft + 1") end # def self.all_leaves self.find(:all, :conditions => "rgt = lft + 1") end # def host subdomain + ancestors.collect {|i| "." + i.subdomain }.to_s end # def urls return host.collect if directories.empty? directories.collect {|i| i.url } end # def tld self.root end # end
class Directory < ActiveRecord::Base belongs_to :domain acts_as_nested_set :scope => :domain # def subdir return "" if directory.nil? path = ancestors.reverse.collect {|i| "/" + i.directory.to_s if !i.directory.nil?}.to_s path + "/" + directory end # def url domain.host + subdir end # end
Given the leaf record of our nested set, our host method returns the a full url string by using the BetterNestedSet ancestor method. Our urls method will return all the url directories found under a domain. This is useful because now you have access to a bunch of information about a uri that can be accessed rather quickly. The key here is speed. With any Domain object, I can quickly return all the urls in my databases that contain the word “mychildren” with one line of code:
Domain.find_by_subdomain("mychildren").urls
If you want the “mychildren” root domain and we know the tld is “.com”, we can trim away significant sections of the set rather than doing a full table scan. For example:
dotcom = Domain.find_by_subdomain_and_parent_id("com",nil) domaindotcom = Domain.find_by_subdomain_and_parent_id("mychildren",dotcom); # #return all the subdomains of "mychildren.com" puts domaindotcom.urls # # output: # lillian.mychildren.com # george.harry.mychildren.com # mark.mychildren.com
Entry Filed under: Ruby on Rails
2 Comments Add your own
Leave a Comment
Some HTML allowed:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>
1. lianaleahy | September 10th, 2008 at 11:27 am
P.S. Don’t forget to add acts_as_nested_set to your model. And if you have multiple roots in your tree, you’ll need to add the scope. (e.g. :scope => :root_id)
And by the way, to migrate an old table… simply add the lft and rgt columns. Then in you code add a “.renumber_all”. This will fill in the new columns with their proper values.
2. lianaleahy | September 19th, 2008 at 10:24 am
I found a solution to the problem detailed in my note above. has_many :through comes through again!
Josh created a patch last year for an enhancement to Rails dynamic finders. The new feature is that if you pass a hash to the finder, it will still use only the values named in the finder to find the object, but it will create a new object using all the values. Check out the link below for more details:
http://blog.hasmanythrough.com/2007/3/14/dynamic-finders-with-hash-attributes-for-creation
Nice!