04 June 2005
This post may be outdated, as it was written in 2005. The links may be broken and the code may no longer work. Leave a comment if needed.

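Google recently launched a new service, Google Sitemap.
The idea is that a webmaster creates an XML document on their own site, and Google's spider then fetches that document to see what has been updated on the site.
I don't understand why it doesn't simply crawl RSS/Atom feeds, which would save writing yet another document.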

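Perhaps RSS/Atom feeds carry too much verbose data; the XML that Google Sitemap requires has very few tags. For the detailed requirements, see https://www.google.com/webmasters/sitemaps/docs/en/protocol.html
I did a quick hack on my own Eplanet.
Because my own needs are modest (for example, I don't have to escape XML special characters), the code is very simple (it doesn't use any XML:: modules).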
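
The hack skips escaping because the URLs here contain no XML special characters. If they did (a query string with `&`, say), a minimal sketch of the escaping step could look like the following. The name `xml_escape` is a hypothetical helper and the example URL is made up; neither is part of Eplanet.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original hack): replace the five
# characters XML treats specially with their predefined entities,
# so the text is safe inside an element such as <loc>.
sub xml_escape {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
        "'" => '&apos;',
    );
    # A single pass, so the '&' of an inserted entity is not escaped again.
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# Example URL is an assumption for illustration only.
print xml_escape('http://www.example.com/view?id=1&lang=en'), "\n";
```

With something like this in place, the value would be passed through `xml_escape` before being placed in the XML.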

# declare the vars
my $data; # buffer that accumulates the sitemap XML

# normal settings
my $file = $c->config->{build_root} . "/sitemaps.xml";
my $site_prefix = 'http://www.fayland.org/journal';
# fetch the 20 most recent entries from the database
my @topics = Eplanet::M::CDBI::Cms->retrieve_from_sql(qq{
    1=1 ORDER BY cms_id DESC LIMIT 0, 20
});
# for the data format, see https://www.google.com/webmasters/sitemaps/docs/en/protocol.html
$data = '<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">'."\n";

# for every topic
foreach my $topic (@topics) {
    my $topic_name = $topic->get('cms_file');
    my $date = $topic->{'cms_mod_data'} || $topic->{'cms_cre_data'};
    # format the date
    $date = &format_date($date);
    # add to the data
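    # NOTE: the body of this <url> entry was cut off in the post; the <loc>
    # and <lastmod> lines and the URL layout below are an assumed reconstruction.
    $data .= "<url>
<loc>$site_prefix/$topic_name</loc>
<lastmod>$date</lastmod>
</url>\n";
}
# finish the data format
$data .= '</urlset>';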

# save to the file
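open(my $fh, '>', $file) or die "Cannot open $file: $!";
flock($fh, 2) or die "Cannot lock $file: $!"; # 2 = LOCK_EX (exclusive lock)
print {$fh} $data;
close($fh) or die "Cannot close $file: $!";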
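
The code above calls `format_date`, but its body is not shown in this post. As a rough sketch only, here is what such a helper could look like, assuming the database stores the timestamp as Unix epoch seconds (an assumption; the real column format is not given). The name `format_w3c_date` is hypothetical. It produces a W3C Datetime string in UTC, the style of date used for `<lastmod>`.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical stand-in for format_date (the original is not shown).
# Assumes $epoch is Unix epoch seconds and returns a UTC timestamp
# in W3C Datetime form: YYYY-MM-DDThh:mm:ssZ.
sub format_w3c_date {
    my ($epoch) = @_;
    return strftime('%Y-%m-%dT%H:%M:%SZ', gmtime($epoch));
}

print format_w3c_date(0), "\n"; # prints 1970-01-01T00:00:00Z
```

If the column instead holds a string such as a MySQL DATETIME, the helper would have to parse that string first.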

Once that is finished, submit the URL. Done!
