
Re: [SAGE] simple database problem



At 12:42 PM -0500 2005-01-03, Andrew Hume wrote:

>  	the problem is that on Linux (actually, i could just stop here,
>  couldn't I?), the 'print all' operation can take 30mins or more
>  on a busy machine, (busy here means lots of I/O) as opposed to
>  the normal 2-3secs, apparently because of the random seeking around
>  in the database file. performance is significantly helped by simply
>  running 'wc db.dbm' just prior to using the database.
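
	A guess at why the 'wc' trick helps: reading the file from start
to end pulls its blocks into the operating system's page cache with
sequential I/O, so the later random lookups are served from memory
rather than from disk.  A minimal sketch of the same warm-up in
Python follows.  The function name and chunk size are my own and
purely illustrative, and whether the pages stay cached depends on
how much memory pressure the machine is under.

```python
def warm_page_cache(path, chunk_size=1 << 20):
    """Read a file sequentially and discard the data.

    Returns the number of bytes read. The only purpose is the side
    effect: the kernel may keep the blocks it just read in its page
    cache, which can speed up later random access to the same file.
    """
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            total += len(chunk)
    return total
```

	Calling warm_page_cache("db.dbm") just before opening the
database would stand in for the 'wc db.dbm' step.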

	I must confess that I don't know a whole lot about *dbm.  All I
know is that it has always seemed slow and non-scalable for the
sorts of things I've tried to do with it in the past, in comparison
to Berkeley DB.

	In my own experience, with a million e-mail addresses in a *dbm
file, the system becomes dead-dog slow when you try to handle an
operational e-mail load.  Substitute db, and you can't slow the
system down even with 10 million e-mail addresses and a much higher
load.  I even threw 100 million e-mail addresses at the problem, and
the system was not measurably slower than with 10 million.  Of course,
they weren't 100 million real e-mail addresses, so db may have been
able to exploit the random methods I was using to generate the input
in order to optimize performance at those levels, but the difference
between *dbm and db on just one million real addresses was quite
extreme.


	Now, I'm sure that this sounds like a case of "if you only have a
hammer", but I'm curious to know why you chose to use *dbm instead
of Berkeley DB?

	Among other things, I know that db will try to cache the entire 
database in memory, which may or may not be a good thing, depending 
on your application (although in your case, I think it would probably 
be good).  I also know that db gives you lots of options in terms of 
storage methods used, and b-tree may be best for some applications, 
while a hash may be better for others.  Contrariwise, *dbm doesn't 
give you any storage method choices that I know of.
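
	To illustrate the trade-off between the two access methods, here
is a rough sketch using only the Python standard library.  This is
not the Berkeley DB API: a dict stands in for a hash table and a
sorted key list stands in for a b-tree, and the addresses and the
range_scan helper are made up for the example.

```python
import bisect

# Hypothetical sample data: address -> record id.
records = {
    "carol@example.org": 3,
    "alice@example.com": 1,
    "bob@example.net": 2,
}

# Hash-style access: fast exact-key lookup, but no useful key order.
hash_store = dict(records)

# B-tree-style access: keys are kept sorted, so range scans are cheap.
sorted_keys = sorted(records)


def range_scan(lo, hi):
    """Return all keys k with lo <= k < hi, in sorted order."""
    start = bisect.bisect_left(sorted_keys, lo)
    end = bisect.bisect_left(sorted_keys, hi)
    return sorted_keys[start:end]
```

	If all you ever do is look up one address at a time, the hash
side is the natural fit.  If you need things like "every key between
a and c", or an ordered walk of the whole file, the sorted side is
what makes that cheap.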


	Anyway, I don't think that I have any solutions to your specific 
problems, but I am curious to know why *dbm was chosen over Berkeley 
DB.

-- 
Brad Knowles, <brad@stop.mail-abuse.org>

"Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety."

     -- Benjamin Franklin (1706-1790), reply of the Pennsylvania
     Assembly to the Governor, November 11, 1755

   SAGE member since 1995.  See <http://www.sage.org/> for more info.