Jun 5, 2012

Slow start of Java servers on a Linux VM


During the server boot sequence, the Java process hangs with no apparent I/O or CPU activity.
Running "cat /proc/sys/kernel/random/entropy_avail" prints a low number (< 100).
Usually this means that the server is trying to read random data from /dev/random, and the read blocks.
A sample stack of such a process might look similar to this:

...
    at java/io/FileInputStream.read(FileInputStream.java:220)
    at sun/security/provider/NativePRNG$RandomIO.readFully(NativePRNG.java:185)
    at sun/security/provider/NativePRNG$RandomIO.implGenerateSeed(NativePRNG.java:202)
    ^-- Holding lock: java/lang/Object@0x9e5d2e80[biased lock]
    at sun/security/provider/NativePRNG$RandomIO.access$300(NativePRNG.java:108)
    at sun/security/provider/NativePRNG.engineGenerateSeed(NativePRNG.java:102)
    at java/security/SecureRandom.generateSeed(SecureRandom.java:495)
...
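If you want to check for this condition from a script rather than by hand, here is a minimal Python sketch of the same entropy_avail check. The function names and the LOW_WATERMARK constant are my own for illustration; the threshold of 100 is just the rule of thumb from this post, not a kernel constant.

```python
from pathlib import Path

ENTROPY_PATH = "/proc/sys/kernel/random/entropy_avail"
LOW_WATERMARK = 100  # rule of thumb from this post, not a kernel constant


def read_entropy(path=ENTROPY_PATH):
    """Return the kernel's current entropy estimate, in bits."""
    return int(Path(path).read_text().strip())


def is_starving(bits, threshold=LOW_WATERMARK):
    """True when the pool is low enough that /dev/random reads may block."""
    return bits <= threshold


if __name__ == "__main__":
    bits = read_entropy()
    state = "starving" if is_starving(bits) else "ok"
    print(f"entropy_avail = {bits} bits ({state})")
```

The path is a parameter only so the function can be pointed at a test file; on a real machine the default is what you want.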

What happens?

Linux keeps an estimate of how much "entropy" is available, decreases it as random data is read, and blocks reads from /dev/random when there is none left.
Entropy regeneration depends on entropy sources: semi-random events such as network card, disk, keyboard, and mouse signals. On a machine without a keyboard, mouse, or display (a virtual machine, for example), the kernel has fewer sources of randomness, and regeneration can be slow.

What can I do?

A blocking random source might make sense in a security-sensitive environment, such as production servers, but in most cases it is pointless on a dev/test VM and just wastes your time.
Attaching a hardware random noise generator or redefining the randomness source to /dev/urandom are possible solutions, but there is a simple hack: this script (http://pastebin.com/jxEDbbXK).
It copies data from /dev/urandom to /dev/random, feeding it with "fake entropy" and thus unblocking pending reads from /dev/random.

The script should be run as root or with sudo (to be able to write into /dev/random). Upon completion it prints the random bit count before and after the injection. Usually, a number <= 100 means that your system was "starving". It is possible to run it as a cron job, but I usually just run it manually before or during a service restart.
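For readers who cannot reach the pastebin link, here is a rough Python sketch of the same idea. It is not a copy of the linked script. It assumes the injection goes through the RNDADDENTROPY ioctl, because a plain write to /dev/random mixes the bytes into the pool but does not raise the entropy counter. The request number below is the value for the generic Linux ioctl encoding (x86, ARM); other architectures may differ. The function names and the 256-byte chunk size are my own choices.

```python
import fcntl
import struct

# From <linux/random.h>: _IOW('R', 0x03, int[2]) in the generic ioctl encoding.
# Assumption: x86/ARM; some architectures encode ioctl numbers differently.
RNDADDENTROPY = 0x40085203
ENTROPY_PATH = "/proc/sys/kernel/random/entropy_avail"


def read_entropy():
    """Return the kernel's current entropy estimate, in bits."""
    with open(ENTROPY_PATH) as f:
        return int(f.read().strip())


def pack_pool_info(data, entropy_bits):
    """Build a struct rand_pool_info: entropy_count, buf_size, then the bytes."""
    return struct.pack(f"ii{len(data)}s", entropy_bits, len(data), data)


def inject(n_bytes=256):
    """Credit n_bytes of /dev/urandom output to the pool. Needs root."""
    with open("/dev/urandom", "rb") as src:
        data = src.read(n_bytes)
    with open("/dev/random", "wb") as pool:
        # The claimed entropy (8 bits per byte) is "fake": it is only urandom output.
        fcntl.ioctl(pool, RNDADDENTROPY, pack_pool_info(data, n_bytes * 8))


if __name__ == "__main__":
    before = read_entropy()
    inject()
    print(f"entropy bits: {before} -> {read_entropy()}")
```

The chunk is kept small on purpose: Python's fcntl.ioctl copies an immutable bytes argument into a 1024-byte buffer, so larger payloads would need a mutable buffer instead.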

Disclaimer: it's probably a bad idea to run this script in a production environment! The random data used to generate cryptographic keys for SSL/SSH becomes "less random".
