cannot reproduce the reported best result "2Channel2logit" #2

Open
opened 2019-08-26 13:55:05 +08:00 by chendongliang87 · 7 comments
chendongliang87 commented 2019-08-26 13:55:05 +08:00 (Migrated from github.com)

I've managed to train a model using the `firmas` (4,000 signers) dataset, but training has some problems.

I followed these preprocessing steps:

  1. Use `preprocess_image.py` to binarize `firmas` invertedly (black background, white signature strokes).
  2. Generate the training and validation lists by running `generate_list_firmas.py`.
  3. Start training with `run.py` with correct path settings (training hyperparameters unchanged).
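Step 1's inverted binarization can be sketched roughly as follows (NumPy only; the threshold value and exact logic of `preprocess_image.py` are assumptions, not the repository's code):

```python
import numpy as np

def binarize_inverted(gray, threshold=128):
    # Inverted binarization as described in step 1: black background (0)
    # with white signature strokes (255). Pixels darker than the threshold
    # are treated as ink; the actual script's threshold may differ.
    return np.where(gray < threshold, 255, 0).astype(np.uint8)
```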

The training threw an exception at `model.py` line 266:

        metric_ops = tf.metrics.auc(labels_reversal, distance_norm)
        tf.summary.scalar('auc', metric_ops[1])

and I suspect the reason is that `tf.div`'s divisor is zero. After working around this, training continues, but the loss stops improving just after step 300:
evaluation_auc = 0.4823507, global_step = 0, loss = 0.6020629, sec_at_spe = 0.086208425 ... global_step = 300, negative_distance = 0.0, positive_distance = 0.0
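One way to guard the suspected zero divisor, sketched framework-agnostically in NumPy (the real fix would go into the `distance_norm` computation in `model.py`; the epsilon value is an assumption):

```python
import numpy as np

def safe_normalize(distances, eps=1e-8):
    # If every distance in the batch is zero, a plain division by the
    # batch max yields NaN and the AUC range assertion fires; clamping
    # the divisor keeps predictions inside [0, 1].
    denom = max(float(np.max(distances)), eps)
    return distances / denom
```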

Thanks a million if you could advise how to reproduce your result.

dlutkaka commented 2019-08-27 08:56:45 +08:00 (Migrated from github.com)

Hi, the work can be reproduced on TensorFlow 1.7. Did you run the code on TF 2.0?
Please tell me if it still doesn't work on TF 1.7.
Actually, binarizing and inverting have only a small effect.

chendongliang87 commented 2019-08-27 10:04:03 +08:00 (Migrated from github.com)

I'm using TF 1.14, not 2.0; switching to 2.0 would require big changes.
Let me try to reproduce with TF 1.7.

BTW, I also noticed some messy code in `input_fn`. I will open another issue to clarify. Thanks!

chendongliang87 commented 2019-09-23 15:31:25 +08:00 (Migrated from github.com)

After I changed to TF 1.7.1 with CUDA 9.0, I got the following error, which I think may be due to the same `nan`:

2019-09-23 06:39:05.397952: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 48733 of 50000
2019-09-23 06:39:12.911159: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:129] Shuffle buffer filled.
INFO:tensorflow:Saving checkpoints for 1 into experiments/augment_lr_1-3/model.ckpt.
INFO:tensorflow:step = 0, loss = 0.6931714
INFO:tensorflow:auc = 0.47116327, negative_distance = 0.0004723185, positive_distance = 0.0005186035
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
    status, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [predictions must be in [0, 1]] [Condition x >= y did not hold element-wise:x (div:0) = ] [[-nan][-nan][-nan]...] [y (auc/Cast/x:0) = ] [0]
	 [[Node: auc/assert_greater_equal/Assert/AssertGuard/Assert = Assert[T=[DT_STRING, DT_STRING, DT_FLOAT, DT_STRING, DT_FLOAT], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](auc/assert_greater_equal/Assert/AssertGuard/Assert/Switch, auc/assert_greater_equal/Assert/AssertGuard/Assert/data_0, auc/assert_greater_equal/Assert/AssertGuard/Assert/data_1, auc/assert_greater_equal/Assert/AssertGuard/Assert/Switch_1, auc/assert_greater_equal/Assert/AssertGuard/Assert/data_3, auc/assert_greater_equal/Assert/AssertGuard/Assert/Switch_2)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 43, in <module>
    estimator.train(lambda: input_fn(params, is_training=True, repeating=1, is_augment=True))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 355, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 903, in _train_model
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 546, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1022, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1113, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1098, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1170, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 950, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1140, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [predictions must be in [0, 1]] [Condition x >= y did not hold element-wise:x (div:0) = ] [[-nan][-nan][-nan]...] [y (auc/Cast/x:0) = ] [0]
	 [[Node: auc/assert_greater_equal/Assert/AssertGuard/Assert = Assert[T=[DT_STRING, DT_STRING, DT_FLOAT, DT_STRING, DT_FLOAT], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](auc/assert_greater_equal/Assert/AssertGuard/Assert/Switch, auc/assert_greater_equal/Assert/AssertGuard/Assert/data_0, auc/assert_greater_equal/Assert/AssertGuard/Assert/data_1, auc/assert_greater_equal/Assert/AssertGuard/Assert/Switch_1, auc/assert_greater_equal/Assert/AssertGuard/Assert/data_3, auc/assert_greater_equal/Assert/AssertGuard/Assert/Switch_2)]]

Caused by op 'auc/assert_greater_equal/Assert/AssertGuard/Assert', defined at:
  File "run.py", line 43, in <module>
    estimator.train(lambda: input_fn(params, is_training=True, repeating=1, is_augment=True))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 355, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 824, in _train_model
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 805, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/Code/models.py", line 266, in model_fn_signature
    metric_ops = tf.metrics.auc(labels_reversal, distance_norm)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/metrics_impl.py", line 662, in auc
    labels, predictions, thresholds, weights)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/metrics_impl.py", line 470, in _confusion_matrix_at_thresholds
    message='predictions must be in [0, 1]'),
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/check_ops.py", line 725, in assert_greater_equal
    return control_flow_ops.Assert(condition, data, summarize=summarize)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/tf_should_use.py", line 118, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 180, in Assert
    guarded_assert = cond(condition, no_op, true_assert, name="AssertGuard")
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 432, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2056, in cond
    orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1897, in BuildCondBranch
    original_result = fn()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 178, in true_assert
    condition, data, summarize, name="Assert")
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 51, in _assert
    name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): assertion failed: [predictions must be in [0, 1]] [Condition x >= y did not hold element-wise:x (div:0) = ] [[-nan][-nan][-nan]...] [y (auc/Cast/x:0) = ] [0]
	 [[Node: auc/assert_greater_equal/Assert/AssertGuard/Assert = Assert[T=[DT_STRING, DT_STRING, DT_FLOAT, DT_STRING, DT_FLOAT], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](auc/assert_greater_equal/Assert/AssertGuard/Assert/Switch, auc/assert_greater_equal/Assert/AssertGuard/Assert/data_0, auc/assert_greater_equal/Assert/AssertGuard/Assert/data_1, auc/assert_greater_equal/Assert/AssertGuard/Assert/Switch_1, auc/assert_greater_equal/Assert/AssertGuard/Assert/data_3, auc/assert_greater_equal/Assert/AssertGuard/Assert/Switch_2)]]
dlutkaka commented 2019-10-09 15:18:13 +08:00 (Migrated from github.com)

It looks like the loss became `nan` after the first batch. I have never seen this before; training runs have always been stable for me.
What's your learning rate? Have you tried a smaller lr?
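Besides a smaller lr, clipping the gradient norm is a common remedy when a loss goes to `nan` within the first few batches; a minimal NumPy sketch (not the repository's optimizer setup, and the clip threshold is an assumption):

```python
import numpy as np

def clipped_sgd_step(w, grad, lr=1e-4, clip_norm=1.0):
    # Scale the gradient down when its norm exceeds clip_norm, so a
    # single bad batch cannot blow the weights up to NaN.
    norm = float(np.linalg.norm(grad))
    if norm > clip_norm:
        grad = grad * (clip_norm / norm)
    return w - lr * grad
```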

myw8 commented 2019-12-13 11:18:19 +08:00 (Migrated from github.com)

I think the author has not open-sourced the `2Channel2logit` loss; `_loss_inception_2logits` is not `2Channel2logit`.
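One plausible reading of a "2-logit" loss, purely as an illustration (this is a hypothetical sketch, not the repository's `_loss_inception_2logits`; whether the two match is exactly what this thread debates):

```python
import numpy as np

def two_logit_loss(logits, labels):
    # Softmax cross-entropy over two logits (genuine vs. forged),
    # computed with the usual max-shift for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```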

dlutkaka commented 2019-12-27 09:38:13 +08:00 (Migrated from github.com)

> I think the author has not open-sourced the `2Channel2logit` loss; `_loss_inception_2logits` is not `2Channel2logit`.

Actually, it is...

CatcherInThePy commented 2022-04-28 20:58:28 +08:00 (Migrated from github.com)

> After I changed to TF 1.7.1 with CUDA 9.0, I got the following error, which I think may be due to the same `nan`:

2019-09-23 06:39:05.397952: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 48733 of 50000
2019-09-23 06:39:12.911159: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:129] Shuffle buffer filled.
INFO:tensorflow:Saving checkpoints for 1 into experiments/augment_lr_1-3/model.ckpt.
INFO:tensorflow:step = 0, loss = 0.6931714
INFO:tensorflow:auc = 0.47116327, negative_distance = 0.0004723185, positive_distance = 0.0005186035
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
    status, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [predictions must be in [0, 1]] [Condition x >= y did not hold element-wise:x (div:0) = ] [[-nan][-nan][-nan]...] [y (auc/Cast/x:0) = ] [0]
	 [[Node: auc/assert_greater_equal/Assert/AssertGuard/Assert = Assert[T=[DT_STRING, DT_STRING, DT_FLOAT, DT_STRING, DT_FLOAT], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](auc/assert_greater_equal/Assert/AssertGuard/Assert/Switch, auc/assert_greater_equal/Assert/AssertGuard/Assert/data_0, auc/assert_greater_equal/Assert/AssertGuard/Assert/data_1, auc/assert_greater_equal/Assert/AssertGuard/Assert/Switch_1, auc/assert_greater_equal/Assert/AssertGuard/Assert/data_3, auc/assert_greater_equal/Assert/AssertGuard/Assert/Switch_2)]]


...

InvalidArgumentError (see above for traceback): assertion failed: [predictions must be in [0, 1]] [Condition x >= y did not hold element-wise:x (div:0) = ] [[-nan][-nan][-nan]...] [y (auc/Cast/x:0) = ] [0]
	 [[Node: auc/assert_greater_equal/Assert/AssertGuard/Assert = Assert[T=[DT_STRING, DT_STRING, DT_FLOAT, DT_STRING, DT_FLOAT], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](auc/assert_greater_equal/Assert/AssertGuard/Assert/Switch, auc/assert_greater_equal/Assert/AssertGuard/Assert/data_0, auc/assert_greater_equal/Assert/AssertGuard/Assert/data_1, auc/assert_greater_equal/Assert/AssertGuard/Assert/Switch_1, auc/assert_greater_equal/Assert/AssertGuard/Assert/data_3, auc/assert_greater_equal/Assert/AssertGuard/Assert/Switch_2)]]

Any solutions or updates on this error? I get the same error with the same TensorFlow versions using the CEDAR dataset. Trying smaller learning rates or changing the batch size did not help.
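Until the `nan` source is found, one stop-gap (again a sketch in NumPy, not the repository's code) is to mask non-finite predictions before any AUC computation, so the `[0, 1]` range assertion cannot fire:

```python
import numpy as np

def auc_safe_inputs(labels, preds):
    # Drop pairs whose prediction is NaN/Inf and clip the remainder into
    # [0, 1]. This silences the assertion but does not fix the underlying
    # numerical problem, which still needs to be diagnosed.
    mask = np.isfinite(preds)
    return labels[mask], np.clip(preds[mask], 0.0, 1.0)
```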
