Al Hoang

December 14, 2005

Humor for the day

Filed under: Uncategorized — @ 8:32 am

December 3, 2005

New free fonts for Vietnamese Nom characters

Filed under: tiengviet — hoanga @ 10:54 pm

For all 20 of the people on the planet probably interested in this…


Two new fonts have been announced by the Institute of Vietnamese Studies in
California for Vietnamese Nom characters. Of particular interest is that these
free fonts between them also cover the CJK-A and CJK-B extensions to Unicode.
(First seen on the Unihan mailing list.)

I’ll have to check this out soon.

Chinese Computing News Link

Link to the free fonts

Chinese_rules.cf - A Chinese ruleset for SpamAssassin

Filed under: Uncategorized — @ 10:46 pm

This puts to rest some of my complaints about how SpamAssassin deals with spam
in Chinese.


Chinese_rules.cf is a third party drop-in custom rule set for SpamAssassin to catch spam written in Chinese. Due to there is no rule for Chinese mail before, SpamAssassin can not catch Chinese spam effectively. Chinese_rules.cf is the first rule set to catch Chinese spam for SpamAssassin. The Chinese_rules.cf is built based on a very new and luxuriant Chinese spam database own by CCERT. It is updated once a week, therefore, it is able to catch very new spam.

Check it out

A better way to run Postgresql on Windows

Filed under: Uncategorized — @ 11:25 am



Getting Postgresql to run under Windows has been tricky for some time.
At one point you definitely needed cygwin before it would run correctly,
but it seems there are now native binaries at last. Even better, there is a
nice cross-platform GUI tool for working with Postgresql called
pgAdmin. There is also an all-in-one installer for Postgresql and the
GUI together. Highly recommended.

Check it out here

NLP techniques applied to Vietnamese

Filed under: Uncategorized — @ 11:25 am

I’ve been getting interested in computational linguistics lately and have
been trying to read up on the fundamentals. It involves quite a bit of
probability, which I’m not sure I really understand that well, but
I imagine it’ll get easier to understand if I work with it a bit more.

Most of the examples in the textbook are focused on English, so I thought it
would be interesting to apply them to Vietnamese. While looking around for
Vietnamese grammars that I could use as a model for tagging
a Vietnamese corpus, I stumbled across the website for the
Vietnam Lexicography Centre.
It looks like a great resource for Vietnamese computational linguistics issues.
There’s a paper describing an effort to define the tags for part-of-speech
(POS) tagging in Vietnamese, titled Lexical descriptions for Vietnamese
language processing. It’s a very good description of the problems facing
computational linguists trying to tackle Vietnamese (hint: the number
is very low).

Here are some brief highlights:

  • The classification methods that would serve as the necessary model for
    POS tagging are still under discussion among Vietnamese linguists.
  • There have been some other efforts, but without an implementation available
    it’s impossible to check the effectiveness of those efforts.
  • They are trying to adhere to a model called the MULTEXT model (which
    itself is an effort to help standardize some parts of linguistics databases
    so researchers can actually exchange their data more easily).
  • Few resources or tools for Vietnamese text analysis are in the public
    research domain.

I think the authors of the paper make a good point. Publicly available
resources for linguistic processing research are very important. I think
the problem is magnified far more for languages that are not ‘economic
powerhouse’ languages such as English, French, or Japanese. What I mean
by ‘economic powerhouse’ is that the cultures that use these languages have
a much more stable financial base to build research efforts and derive
some sort of financial benefit from tools that are built. However, for
Vietnamese or Cambodian I’d argue that the economic incentives for doing
NLP research in these languages are quite low. In order to advance research
in these languages I feel it is even MORE necessary to have the tools and
as much of the data as possible available for public research usage. Without
such efforts I get the feeling much of the research will not make progress
since people will be too busy building fundamental building blocks rather
than slowly stacking the building blocks on top of one another to find out
what works and what doesn’t.

One last note: from a cursory inspection, it seems that to become a competent
Vietnamese linguist you’ll need a decent command of Vietnamese (not
surprising) and, perhaps slightly more surprising, some literacy in
French to catch research that hasn’t been translated into English yet
(if it ever will be).

Resources I found:

Lexical descriptions for Vietnamese language processing (PDF)

Vietnamese linguistic and Cultural Information (A little light imo)

Wikipedia article on the Vietnamese language

Vietnamese Grammar Project (Seems dead and definitely far from complete)

Mon-khmer.com’s description of Vietnamese

Postgresql on cygwin giving you bad system call?

Filed under: Uncategorized — @ 11:24 am

You’ve probably done the whole song and dance of getting the cygrunserver
set up, and postgresql still just gives you a bad system call and no love?
Like me and the author of the article I link to, you probably forgot to set
the CYGWIN env variable to server:


export CYGWIN=server


Thanks Mr Schneider!
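
If you already keep other options in CYGWIN, a plain export will overwrite them. Here is a minimal sketch of one way around that, assuming a POSIX-style shell and that CYGWIN holds a space-separated list of options. The function name ensure_cygwin_server is made up for illustration and is not part of cygwin or postgresql.

```shell
# Hypothetical helper: make sure "server" is among the CYGWIN options
# without throwing away whatever is already set.
ensure_cygwin_server() {
    case " ${CYGWIN:-} " in
        *" server "*)
            # "server" is already listed, nothing to change
            ;;
        *)
            # append "server", keeping any existing options in front of it
            CYGWIN="${CYGWIN:+$CYGWIN }server"
            ;;
    esac
    export CYGWIN
}

ensure_cygwin_server
echo "CYGWIN=$CYGWIN"
```

Dropping something like this into your shell startup file should save you from forgetting it the next time, though I have only sketched it here rather than run it against a real postgresql install.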
