David Holmes
2018-11-20 00:01:09 UTC
Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
There is an internal management API that reports CPU times for
NonJavaThreads (NJTs). That functionality requires a valid/live target
thread so that we can use its pthread_t identity to obtain its CPU clock
via pthread_getcpuclockid().
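
For readers unfamiliar with the underlying call, here is a minimal sketch of reading a thread's CPU time through its pthread_t on Linux/glibc. The helper name thread_cpu_time_ns is hypothetical and is not the HotSpot code; it only illustrates why the target must be a live thread.

```cpp
#include <cassert>
#include <cstdint>
#include <pthread.h>
#include <time.h>

// Hypothetical helper, not the HotSpot implementation.
// Stores the CPU time consumed by `target` (in nanoseconds) into *out_ns.
// `target` must identify a thread that has started and has not terminated;
// a pthread_t that is not (or is no longer) backed by a live thread is
// exactly the invalid target described above.
bool thread_cpu_time_ns(pthread_t target, std::int64_t* out_ns) {
  assert(out_ns != nullptr);
  clockid_t cid;
  if (pthread_getcpuclockid(target, &cid) != 0) {
    return false;  // no CPU clock available for this thread
  }
  timespec ts;
  if (clock_gettime(cid, &ts) != 0) {
    return false;
  }
  *out_ns = static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
  return true;
}
```

Calling it on pthread_self() is always safe, since the calling thread is necessarily live.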
There is an iteration mechanism for NJTs in which the NJT is registered
during its constructor and de-registered during its destructor. A thread
that has only been constructed has not yet executed and so is not a
valid target for this management API. This seems to be the cause of
failures reported in this bug (and JDK-8213434). Registering an NJT only
when it starts executing is an appealing fix for this, but that impacts
all current users of the NJT list and straight away causes a problem
with the BarrierSet initialization logic. So I don't attempt that.
Instead the first part of the fix is for ThreadTimesClosure::do_thread
to skip threads that have not yet executed - which we can recognize by
seeing an uninitialized (i.e. zero) stackbase.
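
The skip can be pictured with a small standalone model. Everything here (ModelThread, ModelTimesClosure, count_reported) is a hypothetical stand-in and not the code in the webrev; the only idea taken from the fix is "a zero stack base means the thread has not yet executed".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a registered NonJavaThread.
struct ModelThread {
  const char* name;
  std::uintptr_t stack_base;  // stays 0 until the thread starts executing
  std::int64_t cpu_ns;        // pretend CPU time, meaningful only once started
};

// Models a closure applied to every registered thread.
struct ModelTimesClosure {
  std::vector<const char*> names;
  std::vector<std::int64_t> times;

  void do_thread(const ModelThread& t) {
    if (t.stack_base == 0) {
      return;  // constructed but never ran: not a valid target, skip it
    }
    names.push_back(t.name);
    times.push_back(t.cpu_ns);
  }
};

// Applies the closure to the whole list and returns how many threads it reported.
std::size_t count_reported(const std::vector<ModelThread>& registered) {
  ModelTimesClosure closure;
  for (const ModelThread& t : registered) {
    closure.do_thread(t);
  }
  assert(closure.names.size() == closure.times.size());
  return closure.names.size();
}
```

The list itself is left untouched, so other users of the iteration still see every registered thread; only this one closure filters.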
A second part of the fix, which can be deferred to a separate RFE for
NJT lifecycle management if desired, tackles the problem of encountering
a terminated thread during iteration - which can also lead to SEGVs.
This can arise because NJTs are not actually "destructed", even if they
terminate, and so they never get removed from the NJT list. Calling
destructors is problematic because the code using these NJTs assumes they
are always valid. So the fix in this case is to move the de-registering
from the NJT list out of the destructor and into the Thread::call_run()
method so it is done before a thread actually terminates. This can be
considered a first step in cleaning up the NJT lifecycle, where the
remaining steps touch on a lot of areas and so need to be handled
separately e.g. see JDK-8087340 for shutting down WorkGang GC worker
threads.
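
The shape of the second part can be sketched in the same way. This is a single-threaded model with hypothetical names (ModelList, ModelThread); the real NJT list needs synchronization against concurrent iteration, which is deliberately left out here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct ModelThread;

// Hypothetical registry standing in for the NJT list (no locking shown).
struct ModelList {
  std::vector<ModelThread*> threads;

  void add(ModelThread* t) { threads.push_back(t); }

  void remove(ModelThread* t) {
    threads.erase(std::remove(threads.begin(), threads.end(), t), threads.end());
    assert(std::find(threads.begin(), threads.end(), t) == threads.end());
  }

  std::size_t size() const { return threads.size(); }
};

struct ModelThread {
  ModelList& list;
  bool ran = false;

  // Registration still happens at construction time.
  explicit ModelThread(ModelList& l) : list(l) { list.add(this); }

  // Stand-in for the thread's real work.
  void run() { ran = true; }

  // Models the entry wrapper: de-register after run() returns and before the
  // thread terminates, instead of relying on a destructor that is never called.
  void call_run() {
    run();
    list.remove(this);
  }
};
```

With this ordering an iteration over the list can no longer encounter a thread that has already terminated, even though the object itself is never destructed.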
Testing: tiers 1-3
I should point out that I've been unable to reproduce this failure
locally, even after thousands of runs. I'm hoping Zhengyu can test this
in the conditions reported in JDK-8213434.
Thanks,
David