10 April 2011

WWW::Selenium is a good choice when you have to test your own web application or work with a webpage full of JavaScript.

It's a pity that it doesn't support the Selenium 2.0 API right now, but it's still a good tool to play with.

Below are some notes that may help if you run into the same issues I did.

1. The tests may fail, but you don't need to worry about it.
Actually, I never got the tests to pass or the SYNOPSIS to work, because they use wait_for_page_to_load against Google. Try a Yahoo! search instead and it works.
One way to avoid the timeout error from wait_for_page_to_load is to use wait_for_element_present instead. wait_for_element_present works much better for me.

2. Saving an image
Big thanks to Stack Overflow; here is some example code.

use FindBin qw($Bin);
use MIME::Base64;

my $js = <<'JS';
var img = this.browserbot.findElement("//img[contains(@src,'captchaData')]");

var canvas = document.createElement("canvas");
canvas.width = img.width;
canvas.height = img.height;

var ctx = canvas.getContext("2d");
ctx.drawImage(img, 0, 0);

var dataURL = canvas.toDataURL("image/png");

// the value of the last expression is what get_eval returns
dataURL.replace(/^data:image\/(png|jpg);base64,/, "");
JS

my $val = $sel->get_eval($js);
my $bin_data = decode_base64($val); # from use MIME::Base64;
# the canvas was exported as image/png, so save it as .png
open(my $fh, '>', "$Bin/test.png") or die "Can't write $Bin/test.png: $!";
binmode $fh;
print $fh $bin_data;
close $fh;

I'm using Firefox 3.6 and it works pretty well. canvas.toDataURL produces a base64-encoded string, which you can decode and then save.

Using JavaScript to save the image is a pretty interesting solution.
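
The decode step on the Perl side can be tried on its own, without a browser. This is only a minimal sketch: the data URL below is a made-up stand-in for what canvas.toDataURL would return, and its payload is just the 8-byte PNG signature, not a real image.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Stand-in for the string canvas.toDataURL("image/png") would return.
# Assumption: the payload is only the PNG signature, not a real image.
my $png_signature = "\x89PNG\r\n\x1a\n";
my $data_url = 'data:image/png;base64,' . encode_base64($png_signature, '');

# Same prefix stripping as the JavaScript replace() call, done in Perl.
(my $payload = $data_url) =~ s{^data:image/(?:png|jpg);base64,}{};

# Decode back to raw bytes; this is what gets written to the file.
my $bin_data = decode_base64($payload);
printf "decoded %d bytes, PNG signature ok: %s\n",
    length($bin_data), ($bin_data eq $png_signature ? 'yes' : 'no');
```

Checking the first 8 bytes like this is a cheap way to confirm that what came back from get_eval really is a PNG before saving it.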

3. How to get an element's HTML?
This module doesn't hand the DOM HTML back to you directly. Instead, you always need to use JavaScript to get what you want. Yes, JavaScript is powerful.

my $html = $sel->get_eval('this.browserbot.findElement("id=whatever").innerHTML');

Anyway, Selenium is rather slow. I'm trying to use WWW::Mechanize::Firefox, but I don't have X on my server right now. (I tried to install the mozrepl plugin from the command line, but no luck yet.)

I'll keep you posted. Thanks.

all about sphinx

03 April 2011

For those who don't know Sphinx yet: Sphinx (http://sphinxsearch.com/) is a full-text search engine.

1. geo distance

Geo distance has become quite popular in location-based applications recently. Say you have a user table with lat/lon data, and you want to find the nearest people or order results by distance.

It's pretty hard to do with SQL but very easy with Sphinx. A sample:

* the config file:
sql_query = SELECT user_id, radians(longitude) as longitude, radians(latitude) as latitude FROM user_location
sql_attr_float = longitude
sql_attr_float = latitude
* perl sample code:
use Sphinx::Search; # for SPH_SORT_EXTENDED

my $pi = atan2(1,1) * 4;
sub deg2rad {
    my ($deg) = @_;
    return ($deg * $pi / 180);
}
$sphinx->SetSortMode(SPH_SORT_EXTENDED, '@geodist ASC');
#### $lat/$lon is the point you want to search from
$sphinx->SetGeoAnchor('latitude', 'longitude', deg2rad($lat), deg2rad($lon));
#### $radius is how far you want to search, assumed here to be in miles
my $circle = $radius * 1609.344; # miles -> meters, since @geodist is in meters
$sphinx->SetFilterFloatRange('@geodist', 0, $circle);
my $ret = $sphinx->Query();
It's simple: Sphinx does all the magic for you. The main thing to know is that Sphinx can do it. :)
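
To sanity-check the distances coming back in @geodist, a plain great-circle calculation in Perl is handy. This is only a sketch under stated assumptions: it uses the haversine formula on a spherical Earth with a mean radius of 6,371,000 m, so Sphinx's own numbers may differ slightly, and the two points are made up for illustration.

```perl
use strict;
use warnings;

my $pi = atan2(1, 1) * 4;
sub deg2rad { return $_[0] * $pi / 180 }

# Haversine great-circle distance in meters between two lat/lon points
# given in degrees. Assumption: spherical Earth, radius 6,371,000 m.
sub haversine_m {
    my ($lat1, $lon1, $lat2, $lon2) = map { deg2rad($_) } @_;
    my $dlat = $lat2 - $lat1;
    my $dlon = $lon2 - $lon1;
    my $h = sin($dlat / 2) ** 2
          + cos($lat1) * cos($lat2) * sin($dlon / 2) ** 2;
    return 6_371_000 * 2 * atan2(sqrt($h), sqrt(1 - $h));
}

# Hypothetical points: one degree of latitude apart on the same meridian.
my $d = haversine_m(0, 0, 1, 0);
printf "%.0f m, %.2f miles\n", $d, $d / 1609.344;
```

The last line also shows the unit conversion in the other direction: meters divided by 1609.344 gives miles.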

2. haproxy as load balancer

When you have many Sphinx servers, one choice is to use a distributed index as described in the docs. The other is to put a load balancer in front of them.
I'm not saying the built-in distributed index is bad or anything like that; actually, I haven't tried it yet. Below are just my two cents from using haproxy with Sphinx servers.

* When you put 'log   local0' in the conf, don't forget the syslog side (a Google search will turn up the docs): add 'local0.* /var/log/haproxy.log' to the syslog config and add "-r" to the syslogd options, i.e. SYSLOGD_OPTIONS="-m 0 -r"

Actually, I'm not 100% satisfied with haproxy.
* haproxy doesn't provide stats for TCP the way it does for HTTP, at least as far as I tested with haproxy 1.4.13.
* haproxy can't check the status of Sphinx's searchd. That's not haproxy's fault: searchd doesn't have a simple way to verify whether it's working smoothly or has hit a fatal error. It can be worked around with a Perl script, but I'd like searchd to have this built in.

But it really works pretty well. I may write a Perl script as the 'check' in haproxy so that it can fail over automatically; then I think I'll be happier.

3. new rotating way

Doing a full index on every Sphinx server is really dumb: it puts a heavy load on the underlying MySQL server. Big thanks to the Sphinx forum, I found the approach below, and it works pretty well.

* Create a new index section in the conf:
index XXX {
    source = XXX
    path    = /var/data/sphinx/XXX
}
index XXX_new : XXX {
    path = /var/data/sphinx/new/XXX.new
}
* Never run indexer --all on XXX; instead, always run it with
indexer --all --config /path/to/above.conf XXX_new
* Run a bash or Perl script to do the full index: after running the command above, copy the XXX.new.sp* files to their destination with cp or scp, then send kill -1 (-1 is SIGHUP) to the searchd pid, which you can get with `cat /the/pid/file/in/conf/XXX.pid`.
* NOTE: when you make the ->Query API call, you have to pass XXX as the second argument (the index name). Otherwise it will search XXX;XXX_new, which is wasteful. (Or you can try starting searchd with --index XXX.)

The real magic here is this:
If you run indexer --all with XXX, it creates XXX.new.sp* files, and once you send the SIGHUP (via --rotate) or restart searchd, the XXX.new.sp* files become XXX.sp*.
So you can run indexer --all with XXX_new instead: it generates XXX.new.sp* files that are the same as the ones you would get with --all XXX. Copy them to the directory where XXX.sp* lives, and a SIGHUP turns them into XXX.sp* without any problem.

Well, not fun, but maybe useful when you're in the same situation.


Don't let JS block your scrape

18 January 2011

Sometimes you use LWP::UserAgent or WWW::Mechanize to scrape webpages, and those pages contain some JavaScript code that sets cookies, which the site then requires before it lets you continue.

For example, one site has some JS code like:

<script type="text/javascript">function test(){
// complicated js code to generate different code each time
document.cookie = "TS884e96_75=" + "b0f056f808cab30029f1dfed8117af84:"
 + chlg + ":" + slt + ":" + crc + ";Max-Age=3600;path=/";
}</script>
<body onload="test()">

OK, now you're totally blocked out.

Luckily we have JE, which I talked about in my PerlChina Advent article last year: http://advent.perlchina.org/2010/JE.html
It's pretty amazing, and the solution is very simple:

use WWW::Mechanize;
use JE;

# cookie_jar => {} builds an in-memory cookie jar for the UA
my $ua = WWW::Mechanize->new(
    agent => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv: Gecko/20101203 Firefox/3.6.13',
    stack_depth => 1,
    autocheck  => 0,
    cookie_jar => {},
);
my $resp = $ua->get($url);

# get the js code
my ($js) = ($ua->content =~ /\<script type=\"text\/javascript\"\>(.*?)\<\/script\>/s);
$js =~ s/document\.cookie \=/return/s;
$js .= "\ntest();"; ## use return and run it for JE
## get js and set cookie
my $j = JE->new;
my $v = $j->eval($js);

$resp->header('Set-Cookie', $v->value);
# step 4 in the explanation: let the cookie jar pick the cookie up from the response
$ua->cookie_jar->extract_cookies($resp);


Pretty simple, and that's almost all you need to write. A little explanation:
1. cookie_jar => {} builds an in-memory cookie jar; read HTTP::Cookies for more details.
2. We convert "document.cookie =" to "return" in the JavaScript code so that we can get the value by evaluating the JS.
3. On the HTTP::Response, we set the Set-Cookie header to do the job the JS would have done for us.
4. The cookie_jar's (HTTP::Cookies) extract_cookies adds that cookie to the WWW::Mechanize UA.
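
Step 2 is the part worth trying in isolation, since it needs nothing but a regex. The sketch below is an illustration only: the page content is a made-up stand-in with a trivial script body, and it stops before the JE call, printing the rewritten JavaScript that would be handed to an interpreter.

```perl
use strict;
use warnings;

# Hypothetical page content, shaped like the example in the post but with
# a trivial script body so the result is easy to inspect.
my $content = <<'HTML';
<html><head>
<script type="text/javascript">function test(){
var chlg = "abc";
document.cookie = "name=" + chlg + ";path=/";
}</script>
</head><body onload="test()"></body></html>
HTML

# Pull out the first inline script block.
my ($js) = $content =~ m{<script type="text/javascript">(.*?)</script>}s;
die "no inline script found\n" unless defined $js;

# Turn the cookie assignment into a return value, then call the function
# as the last statement so that evaluating the code yields the cookie string.
$js =~ s/document\.cookie\s*=/return/;
$js .= "\ntest();";

print $js, "\n";
```

Printing the rewritten code before evaluating it makes it easy to spot pages where the script does not match the expected shape, for example when the cookie is set in more than one place.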

Have fun and enjoy!



11 January 2011

I uploaded a new module to CPAN: WebService::IPRental

It's basically a Perl port of the IP Rental API: http://www.iprental.com/apidoc/

I really admire that PHP has a nice WSDL SOAP client, whereas SOAP::WSDL's wsdl2perl sometimes sucks. Anyway, we can always make the same request with SOAP::Lite by checking the php


And the simple mapping is:

1. ->on_action( sub { 'urn:IPRentalSoapAPIAction' } ) is for the SOAPAction header
2. ->ns( 'urn:IPRentalSoapAPI', 'ns1' ) registers a new namespace
3. Simply use ->outputxml(1), because that's more reliable (my preference)

Maybe not many people use the IP Rental service, but people will feel that CPAN is great when they search for a module like this and find it.


tips around Parallel::ForkManager

08 January 2011

Parallel::ForkManager is my choice for 'forks'. It's simple to use, fits my needs and is well maintained.

I use it frequently: for example, in scraping jobs with Tor, or when there are a lot of db rows to process. Forking is needed to speed up the whole process when we have enough resources.

Here are some tips I use with Parallel::ForkManager.

1. Scope::OnExit
We know we always need to call $pm->finish; in the child so that we don't get something like 'Cannot start another process while you are in the child process'.
It's not an issue if you have simple code without many 'next' calls in the loop.
But it can be very troublesome if you have lots of 'next' in the loop: you have to call ->finish before every next. That's stupid, and Scope::OnExit can save you.

use Scope::OnExit;

foreach my $part (@parts) {
    $pm->start and next; # do the fork
    ## runs whenever the scope exits, including via 'next'
    on_scope_exit {
        $pm->finish; # Terminates the child process
    };
    # do whatever you like, call 'next' whenever you want.
}
2. List::MoreUtils
Say you have a big list to process. The simple approach is to fork on every element, but then you need to init or clone every object in each forked child, which is sometimes expensive. So how about this: divide the big list into $FORK_CHILDREN parts, then fork on each part. That way you init/clone only $FORK_CHILDREN times instead of scalar(@big_list) times. The part function from List::MoreUtils does the job here:

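use List::MoreUtils qw(part);

my @big_list = (1 .. 10000); # from file, database or whatever
my $i = 0;
my @parts = part { $i++ % $FORK_CHILDREN } @big_list;
foreach my $part (@parts) {
    $pm->start and next; # do the fork
    ## runs whenever the scope exits, including via 'next'
    on_scope_exit {
        $pm->finish; # Terminates the child process
    };
    # init dbh/ua etc.

    # defined() so that a false element such as 0 doesn't end the loop early
    while (defined(my $ele = shift @$part)) {
        # process $ele
    }
}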
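
To see what the round-robin split actually produces, here is a small core-Perl sketch. round_robin_parts is a hypothetical helper written only for illustration; it mimics the part { $i++ % $n } call without needing List::MoreUtils installed.

```perl
use strict;
use warnings;

# Core-Perl stand-in for: part { $i++ % $n } @list
# (hypothetical helper; the post itself uses List::MoreUtils::part).
sub round_robin_parts {
    my ($n, @list) = @_;
    my @parts;
    my $i = 0;
    # Element 0 goes to part 0, element 1 to part 1, ... wrapping at $n.
    push @{ $parts[ $i++ % $n ] }, $_ for @list;
    return @parts;
}

my @parts = round_robin_parts(3, 1 .. 10);
print join(',', @$_), "\n" for @parts;
# prints:
# 1,4,7,10
# 2,5,8
# 3,6,9
```

Each part is an array reference, which is why the loop in the post shifts from @$part. The parts differ in size by at most one element, so the children get roughly equal work.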

3. DBI and LWP::UserAgent clone
I don't know whether it's wise to call DBI->connect or LWP::UserAgent->new in the child code, but usually we can do

my $dbh = $odbh->clone();
my $ua2 = $ua->clone(); # will copy the cookies and referer etc.
4. Sharing variables between parent and children
I don't like threads::shared, and I don't like IPC.
Usually a cache solution can do the trick: anything from a simple txt file (with locking), to Parallel::Scoreboard, to my choice, Cache::FastMmap.
Sample code below:

my $cache = Cache::FastMmap->new;
my @array = (1 .. 10); # in parent
$cache->set($cache_key, \@array);

### then in forked child after ->start
        $cache->get_and_set( $cache_key, sub {
            my $v = $_[1];
            push @$v, $value_in_child;
            return $v;
        } );

### after $pm->wait_all_children;
my $array_ref = $cache->get($cache_key);

get_and_set does the trick here. Anyway, that's just my solution; it won't fit every situation.

That's all for Parallel::ForkManager. Hope it helps when you want to use it.


new start

06 January 2011

I haven't blogged for years, since before Blogger stopped supporting FTP publishing. (Check my old blog at http://fayland.org/blog/)

It's a new year, a new start, and I think I should start a new blog.

I moved my service from DreamHost to Linode and set up blog.fayland.org. Hope I can write something useful soon.

cy. Thanks


02 January 2010

This one is my first module of 2010: WWW::Google::Contacts

It's nothing big; it just implements the Google Contacts Data API.

So enjoy!


The End of 2009 CN Perl Advent Calendar

25 December 2009

I'm really very happy that we got it done today. The last article is perlthanks, from me, and we didn't miss a single day: 25 tips in 25 days.

I published 18 articles in total, really wow! They include ack, autodie, dzil, local::lib, Devel::NYTProf, Padre, pip, Plack, REPL, perlthanks and more.
Check them out if you missed them. :)


2009 CN Perl Advent Calendar

01 December 2009

We just published the first article of our first advent calendar, from me. :)



Enjoy! Thanks

PSGI and Plack

23 October 2009

The most exciting thing in the Perl world today is PSGI.

The homepage is http://plackperl.org/

I tried it today. Here are some tips for Win32 users.
* Don't try to run under the Standalone server; alarm is not supported on Win32.
* After installing Plack (maybe with notest install Plack), try installing Plack::Server::AnyEvent or Plack::Server::ServerSimple.
* Finally, to get to know Plack a bit, cd to the CPAN build dir or download the tar.gz.
C:\strawberry\cpan\build\Plack-0.9006-ZEaDbK\eg\dot-psgi>plackup -a Hello.psgi -s AnyEvent
Accepting requests at
C:\strawberry\cpan\build\Plack-0.9006-ZEaDbK\eg\dot-psgi>plackup -a Hello.psgi -s ServerSimple
Plack::Server::ServerSimple: You can connect to your server at http://localhost:5000/
- - [23/十月/2009 22:19:15] "GET / HTTP/1.1" 200 11 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20090824 Firefox/3.5.3 GTB5"
- - [23/十月/2009 22:19:15] "GET /favicon.ico HTTP/1.1" 200 11 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20090824 Firefox/3.5.3 GTB5"
Try visiting the server and you'll get an access log like the one above.

I think Plack::Middleware::AccessLog should use &POSIX::setlocale(&POSIX::LC_ALL); on qw( .ISO8859-1 .ISO_8859-15 .US-ASCII .UTF-8 ) to avoid the Chinese 十月 (October) in the log.

You can try more with all the .psgi files under eg/dot-psgi, and you might get some ideas for writing a Middleware. :)


(Updates on Oct 24th)
miyagawa applied the patch with some changes and it works pretty well now.
Thanks very much. :)