
Issues with cluster management refactoring
An instance can no longer recover itself, and it will strand in-progress triggers if the server restarts within the cluster check-in interval. This happens because the whole firstCheckIn block in findFailedInstances() is effectively ignored:
a. The block is hit, this instance is potentially added to the failed list, and firstCheckIn is set to false.
b. The instance state is overwritten with a null recoverer.
c. The list of failed instances from a. is thrown away.
d. findFailedInstances() is called again (from within the lock), but this time firstCheckIn is false, so our instance can't be added to the failed list. As a result it can neither recover itself nor tell whether a prior recoverer is dead or still in progress.
The solution I came up with is to move the setting of the firstCheckIn flag up to the JobStoreCMT/TX level, so that findFailedInstances() always handles our instance correctly the first time through. In addition, I skip the double-checked locking if firstCheckIn is true, so that we can't have problems with an in-progress recoverer. (Though heaven forbid we ever have a false positive after that first check-in, now that recovery deletes the state...)
One other small thing I noticed: inside the firstCheckIn block of findFailedInstances(), the instance is actually added to the failed list twice if recoverer is null.
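The sequence a. to d. and the proposed fix can be illustrated with a small, self-contained model. This is only a sketch and not the Quartz source: the Node class, findFailed(), checkInBuggy() and checkInFixed() are hypothetical names, and only the "own instance on first check-in" rule is modeled. Detection of other dead instances, the database lock and the recoverer column are all left out.

```java
import java.util.ArrayList;
import java.util.List;

public class FirstCheckInSketch {

    /** Hypothetical, simplified model of one clustered scheduler instance. */
    static final class Node {
        final String instanceId;
        boolean firstCheckIn = true;

        Node(String instanceId) {
            this.instanceId = instanceId;
        }

        /**
         * Stand-in for findFailedInstances(). On the first check-in the
         * instance reports itself as failed (exactly once) so that it can
         * recover its own stranded triggers.
         */
        List<String> findFailed(boolean finderClearsFlag) {
            List<String> failed = new ArrayList<>();
            if (firstCheckIn) {
                failed.add(instanceId);
            }
            if (finderClearsFlag) {
                firstCheckIn = false; // the behavior the report complains about
            }
            return failed;
        }
    }

    /** Flow as described in the report: the finder itself clears the flag. */
    static List<String> checkInBuggy(Node node) {
        List<String> unlocked = node.findFailed(true); // (a) flag cleared here
        if (unlocked.isEmpty()) {
            return unlocked;
        }
        // (b), (c): state is overwritten and the first list is discarded.
        // (d): the second call, made under the lock, sees firstCheckIn == false.
        return node.findFailed(true);
    }

    /** Proposed flow: the caller owns the flag and skips the pre-check on first check-in. */
    static List<String> checkInFixed(Node node) {
        List<String> failed;
        if (node.firstCheckIn) {
            failed = node.findFailed(false); // single pass, no double check
        } else {
            List<String> unlocked = node.findFailed(false);
            failed = unlocked.isEmpty() ? unlocked : node.findFailed(false);
        }
        node.firstCheckIn = false; // cleared only after the result is used
        return failed;
    }

    public static void main(String[] args) {
        System.out.println("buggy: " + checkInBuggy(new Node("1"))); // buggy: []
        System.out.println("fixed: " + checkInFixed(new Node("1"))); // fixed: [1]
    }
}
```

In the buggy flow the instance never appears in the list that is actually acted on, which matches the stranded-trigger symptom. In the fixed flow it appears once on the first check-in and not on later ones.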
org.quartz.scheduler.instanceName = DefaultQuartzScheduler
org.quartz.scheduler.rmi.export = false
org.quartz.scheduler.rmi.proxy = false
org.quartz.scheduler.wrapJobExecutionInUserTransaction = false
org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount = 10
org.quartz.threadPool.threadPriority = 5
org.quartz.threadPool.threadsInheritContextClassLoaderOfInitializingThread = true
org.quartz.jobStore.misfireThreshold = 60000
# CLUSTERING
# ----------
org.quartz.scheduler.instanceName = MyClusteredScheduler
org.quartz.scheduler.instanceId = 1
org.quartz.jobStore.isClustered = true
# JOB STORE CONFIG
# ----------------
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.tablePrefix = QRTZ_
org.quartz.jobStore.dataSource = QUARTZ
org.quartz.jobStore.useProperties = true
# DATA SOURCE CONFIG
# ------------------
org.quartz.dataSource.QUARTZ.driver = oracle.jdbc.driver.OracleDriver
org.quartz.dataSource.QUARTZ.URL = ?????????????
org.quartz.dataSource.QUARTZ.user = ?????
org.quartz.dataSource.QUARTZ.password = ?????
org.quartz.dataSource.QUARTZ.maxConnections = 5
If I terminate the scheduler without allowing its running jobs to complete, and one or more records in the QRTZ_TRIGGERS table are left with TRIGGER_STATE = "ACQUIRED" after termination, then those triggers are not restarted when the scheduler is restarted.
Note that this problem occurs when only one scheduler is running in the cluster. If I turn off clustering and run a standalone scheduler, the stranded triggers are restarted.
On the surface, this appears to be related to this issue. I'm stumped...
Eric
e.gagnon@computer.org
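To confirm the symptom described in the comment above, one option is to list the triggers still marked ACQUIRED after a restart. The following is a minimal sketch using plain JDBC. The class and method names are hypothetical, the table prefix QRTZ_ is taken from the configuration above, and the JDBC URL, user and password are not given in the report, so they must be passed as arguments. The matching JDBC driver must also be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class StrandedTriggerCheck {

    /** Builds the query for a given table prefix (QRTZ_ in the config above). */
    static String acquiredTriggersSql(String tablePrefix) {
        return "SELECT TRIGGER_NAME, TRIGGER_GROUP FROM " + tablePrefix
                + "TRIGGERS WHERE TRIGGER_STATE = ?";
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 3) {
            System.err.println("usage: StrandedTriggerCheck <jdbcUrl> <user> <password>");
            return;
        }
        try (Connection con = DriverManager.getConnection(args[0], args[1], args[2]);
             PreparedStatement ps = con.prepareStatement(acquiredTriggersSql("QRTZ_"))) {
            ps.setString(1, "ACQUIRED");
            try (ResultSet rs = ps.executeQuery()) {
                // Print every trigger that is still in the ACQUIRED state.
                while (rs.next()) {
                    System.out.println(rs.getString("TRIGGER_GROUP") + "."
                            + rs.getString("TRIGGER_NAME"));
                }
            }
        }
    }
}
```

If this prints any rows while no scheduler is executing jobs, those triggers are candidates for the stranded state described above.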