11 June 2011
BerkeleyDB is quite popular and nice to use.

There is an interesting chapter on it in The Architecture of Open Source Applications that is worth reading: http://www.aosabook.org/en/bdb.html

The BerkeleyDB Perl module is harder to get on Windows. On Linux you can usually install it with apt-get or yum, e.g.: $ sudo apt-get install libberkeleydb-perl

Most modules come down to two questions: "When to use it?" and "How to use it?".

BerkeleyDB is suitable when:
1. you have a VERY big file and want to manipulate it, for example remove duplicate lines or sort on a string inside each line, but you don't have enough memory.
2. you want to share data between forked processes. It can be an option there.
3. and more ...

Case A: you have enough disk, but limited memory

use FindBin qw/$Bin/;
use BerkeleyDB;

my $berkeleydb_temp_file = "$Bin/tmp.berkeleydb"; # temp file for BerkeleyDB
tie my %data, 'BerkeleyDB::Hash',
    -Filename => $berkeleydb_temp_file,
    -Flags    => DB_CREATE|DB_TRUNCATE
        or die "Cannot create file: $! $BerkeleyDB::Error\n";
open(my $fh, '<', 'RealBigFile.log') or die "Can't open: $!";
while (my $line = <$fh>) {
    ## %data can grow into a really big hash here
    ## we only need the first and the last line that match each $pattern
    ## (get_pattern() stands for your own function that extracts the key from a line)
    my $pattern = get_pattern($line);
    if (exists $data{"start_$pattern"}) {
        $data{"end_$pattern"} = $line;
    } else {
        $data{"start_$pattern"} = $line;
    }
}
close($fh);
# now work on %data
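
To make Case A easier to try, here is a minimal self-contained sketch of the same idea. The sample log content, the file names and the body of get_pattern() are all assumptions made up for illustration (the original only names get_pattern() without showing it). It also shows one way to walk the result with each(), which fetches one pair at a time instead of building the whole key list in memory.

```perl
use strict;
use warnings;
use BerkeleyDB;
use File::Temp qw/tempdir/;

my $dir = tempdir(CLEANUP => 1);

# Hypothetical helper: treat the first whitespace-separated field as the pattern.
sub get_pattern {
    my ($line) = @_;
    my ($first) = split ' ', $line;
    return $first;
}

# A tiny stand-in for RealBigFile.log so the sketch runs on its own.
my $log = "$dir/sample.log";
open(my $out, '>', $log) or die "Can't write $log: $!";
print {$out} "req1 start\n", "req2 start\n", "req1 middle\n", "req1 done\n";
close($out);

# The hash lives on disk, not in memory.
tie my %data, 'BerkeleyDB::Hash',
    -Filename => "$dir/tmp.berkeleydb",
    -Flags    => DB_CREATE|DB_TRUNCATE
        or die "Cannot create file: $! $BerkeleyDB::Error\n";

open(my $fh, '<', $log) or die "Can't open $log: $!";
while (my $line = <$fh>) {
    chomp $line;
    my $pattern = get_pattern($line);
    if (exists $data{"start_$pattern"}) {
        $data{"end_$pattern"} = $line;    # overwritten until the last match
    } else {
        $data{"start_$pattern"} = $line;  # first match only
    }
}
close($fh);

# Walk the tied hash pair by pair.
my $seen = 0;
while (my ($key, $value) = each %data) {
    print "$key => $value\n";
    $seen++;
}
```

On a really big file, prefer each() (or a cursor) over keys(), because keys() has to return every key as one list.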

Case B: in forks
I shared some tips on Parallel::ForkManager previously. In that article I suggested Parallel::Scoreboard and Cache::FastMmap, but really, BerkeleyDB is a good choice too.

But it is not so easy to write the correct code at first glance. If you don't use cds_lock, you may get wrong results. I wrote two tests (hosted on github):

~/git/fayland.org/blogs$ perl right-forks-BerkeleyDB.t
ok 1
1..1
~/git/fayland.org/blogs$ perl wrong-forks-BerkeleyDB.t
not ok 1
#   Failed test at wrong-forks-BerkeleyDB.t line 30.
#          got: '656'
#     expected: '1000'
1..1
# Looks like you failed 1 test of 1.
~/git/fayland.org/blogs$ perl right-forks-BerkeleyDB.t
ok 1
1..1
~/git/fayland.org/blogs$ perl wrong-forks-BerkeleyDB.t
not ok 1
#   Failed test at wrong-forks-BerkeleyDB.t line 30.
#          got: '661'
#     expected: '1000'
1..1
# Looks like you failed 1 test of 1.

Without the lock, the results are wrong and differ from run to run (656 and 661 instead of 1000 in the output above).

The relevant snippet:

# DB_INIT_CDB turns on Concurrent Data Store locking for the environment
my $env = BerkeleyDB::Env->new(
    -Home   => $tmp_dir,
    -Flags  => DB_CREATE|DB_INIT_CDB|DB_INIT_MPOOL,
) or die "cannot open environment: $BerkeleyDB::Error\n";
my $db = tie my %data, 'BerkeleyDB::Hash',
    -Filename => $berkeleydb_temp_file,
    -Flags    => DB_CREATE,
    -Env      => $env
        or die "Cannot create file: $! $BerkeleyDB::Error\n";

# hold the write lock around every update
my $lock = $db->cds_lock();
$data{$i} = $i * 2;
$db->db_sync();
$lock->cds_unlock();
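
For context, here is a rough sketch of how that snippet might sit inside forking code. It is not the test from my github repo. It uses plain fork() instead of Parallel::ForkManager, the object interface (db_put) instead of the tied hash, and lets every child open its own handles after the fork. The worker counts, the file name shared.db and the open_db() helper are assumptions made up for the example.

```perl
use strict;
use warnings;
use BerkeleyDB;
use File::Temp qw/tempdir/;

my $tmp_dir    = tempdir(CLEANUP => 1);
my $workers    = 4;    # assumed values, small enough for a quick run
my $per_worker = 25;

# Hypothetical helper: open the CDS environment and the hash inside it.
sub open_db {
    my $env = BerkeleyDB::Env->new(
        -Home  => $tmp_dir,
        -Flags => DB_CREATE|DB_INIT_CDB|DB_INIT_MPOOL,
    ) or die "cannot open environment: $BerkeleyDB::Error\n";
    my $db = BerkeleyDB::Hash->new(
        -Filename => 'shared.db',    # relative to -Home
        -Flags    => DB_CREATE,
        -Env      => $env,
    ) or die "Cannot create file: $! $BerkeleyDB::Error\n";
    return ($env, $db);
}

# Create the environment and the file once, then close before forking.
{
    my ($env, $db) = open_db();
    undef $db;
    undef $env;
}

my @pids;
for my $w (0 .. $workers - 1) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # child: open its own handles, write its own key range
        my ($env, $db) = open_db();
        for my $i ($w * $per_worker .. ($w + 1) * $per_worker - 1) {
            my $lock = $db->cds_lock();
            $db->db_put($i, $i * 2) == 0
                or die "put failed: $BerkeleyDB::Error\n";
            $db->db_sync();
            $lock->cds_unlock();
        }
        undef $db;
        undef $env;
        exit 0;
    }
    push @pids, $pid;
}
waitpid($_, 0) for @pids;

# parent: reopen and count what the children stored
my ($env, $db) = open_db();
my $count = 0;
my ($k, $v) = ('', '');
my $cursor = $db->db_cursor();
$count++ while $cursor->c_get($k, $v, DB_NEXT) == 0;
$cursor->c_close();
print "stored $count keys\n";
```

Opening the handles in each child is a cautious choice on my side, since database handles opened before a fork are not something I would rely on sharing.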

BerkeleyDB is not only for Hash, it also supports Btree, Recno, Queue and others.
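
As a small illustration of Btree, the sketch below uses it for the "remove duplicates and sort" case from the list at the top. The file name and the sample words are made up. It relies on Btree's default comparison, which orders keys as plain byte strings.

```perl
use strict;
use warnings;
use BerkeleyDB;
use File::Temp qw/tempdir/;

my $dir = tempdir(CLEANUP => 1);

tie my %sorted, 'BerkeleyDB::Btree',
    -Filename => "$dir/sorted.db",
    -Flags    => DB_CREATE
        or die "Cannot create file: $! $BerkeleyDB::Error\n";

# Storing a key twice just overwrites it, so duplicates collapse.
$sorted{$_} = 1 for qw/pear apple fig apple banana/;

# A Btree hands its keys back in sorted order.
my @keys;
while (my ($key) = each %sorted) {
    push @keys, $key;
}
print "$_\n" for @keys;
```

With a real big file you would print each key inside the loop instead of collecting them in @keys.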

I hope it helps when you meet the same issues in your daily life.

Enjoy, Thanks

