Reviews For Paper
Paper ID 3362
Title Convolutional Sequence to Sequence Model for Human Dynamics

Masked Reviewer ID: Assigned_Reviewer_1
Review:
Question 
[Paper Summary] What is the paper about? Please, be concise (3 to 5 sentences) This paper presents an approach to human motion modeling based on convolutional neural networks. In this encoder-decoder approach, a convolutional long term encoder and a short term encoder are used to encode the long term hidden variable and the short term hidden variable. Experiments show better performance on the Human3.6M and CMU Motion Capture datasets.
[Paper Strengths] Please discuss, justifying your comments with the appropriate level of details, the strengths of the paper (i.e. novelty, theoretical approach and/or technical correctness, adequate evaluation, clarity, etc). For instance, a theoretical paper may need no experiments, while a paper with a new approach may require comparisons to existing methods. A seemingly reasonable approach is proposed in this paper. Compared to existing methods, this work seems to achieve better results.
[Paper Weaknesses] Please discuss, justifying your comments with the appropriate level of details, the weaknesses of the paper (i.e. lack of novelty – given references to prior work-, lack of novelty, technical errors, or/and insufficient evaluation, etc). Note: If you think there is an error in the paper, please explain why it is an error. Also remember that theoretical results/ideas are essential to CVPR (some theoretical papers may not need to have experiments). If the theory is novel and interesting, but the results did not outperform other existing algorithms, it is not necessarily a reason to reject. It is not appropriate to ask for comparisons with unpublished papers and papers published after the CVPR deadline. In all cases, please be polite and constructive. CVPR 2018 policy on dual submission and arxiv appears at: http://cvpr2018.thecvf.com/submission/main_conference/author_guidelines. However, regarding the novelty and the experimental results, I have the following concerns:

1) The authors should demonstrate the respective roles of the long-term encoder and the short-term encoder in the experiments. Using the same operations with different sequence lengths in these two encoders is very weird.

2) How is the effectiveness of the Rect Conv verified? Why not apply Rect Conv along the temporal axis, which could capture long temporal dependencies?

3) It is very strange to treat a skeleton sequence as a two-dimensional map encoded by convolutional networks; such a sequence should be modelled by a recurrent structure. Moreover, better experimental results obtained with more tuning cannot demonstrate the advantages of CNNs.

4) What is the motivation of this work on motion modelling? Is it just to prove that CNNs are much better?

5) Why not cite the much better results from RRNN [15], e.g., 0.27 for walking (80 ms) in Table 1?

6) There are some typos, e.g., it should be RRNN, not RNN, in Table 2.
[Preliminary Rating] Please rate the paper according to one of the following six choices: Weak Reject
[Preliminary Evaluation] Please indicate to the AC, your fellow reviewers, and the authors your current opinion on the paper. Please tell the ACs what points you think have the most weight in your reviews and summary, and why. This paper lacks sufficient novelty. The effectiveness of the network structure should be verified by more experiments. It is not suitable for acceptance at CVPR.
[Confidence] Very Confident - to stress that you are pretty sure about your conclusions (e.g., you are an expert who works in the paper's area).
[Final Recommendation] Please provide your final recommendation by taking into consideration the rebuttal, other reviews and discussions. I thank the authors for addressing the concerns with more facts and explanations, e.g., on Rect Conv, the motivation, and some results of RRNN. I do not appreciate Reviewer_3 complaining further about personal opinions on how to model sequences with a CNN or an RNN. Actually, I hope to hear more insights from the authors on the problem. Whether a CNN or an RNN is used is not the key point.
[Final Rating] Please rate the paper according to one of the following six choices: Poster

Masked Reviewer ID: Assigned_Reviewer_2
Review:
Question 
[Paper Summary] What is the paper about? Please, be concise (3 to 5 sentences) Briefly, this paper presents a convolutional sequence-to-sequence model for human motion prediction. The proposed model captures two types (i.e., long-term and short-term) of temporal dependencies. Specifically, the authors impose a carefully designed hierarchical CNN to process sequential human pose data and further regress the target predictions. Extensive experiments demonstrate the superior performance of the proposed method over the competing methods on two public benchmarks.
[Paper Strengths] Please discuss, justifying your comments with the appropriate level of details, the strengths of the paper (i.e. novelty, theoretical approach and/or technical correctness, adequate evaluation, clarity, etc). For instance, a theoretical paper may need no experiments, while a paper with a new approach may require comparisons to existing methods. 1. The proposed model is moderately novel in explicitly introducing the long-term and short-term latent variables to learn both spatial dependencies and temporal dynamical information in a sequential fashion.
2. The experimental evaluations are sufficient to demonstrate the effectiveness of the proposed model. The detailed component analyses clarify the contribution of each component.


[Paper Weaknesses] Please discuss, justifying your comments with the appropriate level of details, the weaknesses of the paper (i.e. lack of novelty – given references to prior work-, lack of novelty, technical errors, or/and insufficient evaluation, etc). Note: If you think there is an error in the paper, please explain why it is an error. Also remember that theoretical results/ideas are essential to CVPR (some theoretical papers may not need to have experiments). If the theory is novel and interesting, but the results did not outperform other existing algorithms, it is not necessarily a reason to reject. It is not appropriate to ask for comparisons with unpublished papers and papers published after the CVPR deadline. In all cases, please be polite and constructive. CVPR 2018 policy on dual submission and arxiv appears at: http://cvpr2018.thecvf.com/submission/main_conference/author_guidelines. 1. The proposed model has no forgetting mechanism in its temporal dependency modeling. This may limit its performance on some special human actions that have discontinuous pose sequences.
2. The introduced adversarial regularizer is in an awkward position. According to Fig. 4, its contribution is quite limited. However, it significantly increases the computational complexity of the proposed model in the training phase. Besides, it is unfair to compare the proposed method with competing methods that do not employ this regularizer.
[Preliminary Rating] Please rate the paper according to one of the following six choices: Borderline
[Preliminary Evaluation] Please indicate to the AC, your fellow reviewers, and the authors your current opinion on the paper. Please tell the ACs what points you think have the most weight in your reviews and summary, and why. The paper is badly written. There are plenty of spelling and grammar errors in the paper. For instance, "sued" --> "used" in lines 272~273.

Although the experimental results are promising, the proposed convolutional sequence-to-sequence model is not well presented. I will give a higher rating if the authors can address the following concerns.
[Rebuttal Requests] Please pose questions you want to be answered in the rebuttal. 1. I am confused by Eq. (6). What does it mean? Please give more clarification and explanation.
2. Please provide a time-efficiency comparison between the proposed method and the competing methods.
3. Please provide an ablation study on the convolutional encoding module (CEM).
[Confidence] Confident - to stress that you are mostly sure about your conclusions (e.g., you are not an expert but can distinguish good work from bad work in that area).
[Final Recommendation] Please provide your final recommendation by taking into consideration the rebuttal, other reviews and discussions. Many thanks to the authors for their feedback and for the experiments they conducted during the rebuttal period.

Briefly, although this work has a moderate level of novelty and achieves promising results over all the compared methods, I think it needs significant improvements in both writing and explanations. Many concerns remain after reading the authors' feedback and are listed as follows:

1) I still believe the authors' method may have limited performance on some special human actions that have discontinuous pose sequences. The forgetting mechanism of the authors' method is just a local information extraction without learning, and is totally different from that of LSTM.

2) There is no intuition or sufficient explanation behind Eq. (6). According to the authors' feedback, Eq. (6) leverages the decoder to output a residual value, as in [15], inside the spatial decoder. However, the authors have pointed out the drawbacks of the residual unit ([15] used for longer term prediction) in the introduction section (please see lines 102~107).

3) Directly employing the adversarial regularizer without validation is not convincing. If the adversarial regularizer contributes to generating qualitatively plausible motions, the authors should additionally perform visual comparisons in their paper.

Therefore, I am sorry that I could not give a higher rating and tend to reject this paper. I encourage the authors to revise their paper and submit to another venue.
[Final Rating] Please rate the paper according to one of the following six choices: Weak Reject

Masked Reviewer ID: Assigned_Reviewer_3
Review:
Question 
[Paper Summary] What is the paper about? Please, be concise (3 to 5 sentences) The paper proposes a convolutional architecture for human motion prediction. As opposed to most previous work, which is based on RNNs, this architecture basically treats a series of poses as a matrix and learns space-invariant filters that correlate motion across time and space. There are actually 2 networks -- one that looks at the entire sequence, and thus learns a global descriptor, and one that looks at a small window, and recurrently predicts the following pose. The network is evaluated on H3.6M and on a subset of the CMU mocap dataset.
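
To make the "series of poses as a matrix" view in the summary concrete, here is a minimal sketch, not the authors' architecture. It assumes rows are time frames and columns are joint values, and it uses a hand-written kernel in place of learned filters. The helper name `conv2d_valid` and the toy data are hypothetical.

```python
def conv2d_valid(matrix, kernel):
    """Slide `kernel` over `matrix` (no padding, stride 1) and return the
    resulting feature map. The same kernel weights are reused at every
    position, which is what makes the filter position-invariant."""
    rows, cols = len(matrix), len(matrix[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for i in range(rows - k_rows + 1):
        out_row = []
        for j in range(cols - k_cols + 1):
            acc = 0.0
            for di in range(k_rows):
                for dj in range(k_cols):
                    acc += matrix[i + di][j + dj] * kernel[di][dj]
            out_row.append(acc)
        out.append(out_row)
    return out


# Assumed layout: 4 time frames (rows) x 3 joint values (columns).
poses = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0],
    [4.0, 5.0, 6.0],
]

# Hand-written 2x2 kernel: frame-to-frame difference summed over two
# adjacent joint values, so it mixes time and space in a single filter.
kernel = [
    [-1.0, -1.0],
    [1.0, 1.0],
]

features = conv2d_valid(poses, kernel)
```

In a trained model the kernel weights would be learned, and several such layers would be stacked. The sketch only shows how one filter correlates neighbouring frames and neighbouring joints at once.
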
[Paper Strengths] Please discuss, justifying your comments with the appropriate level of details, the strengths of the paper (i.e. novelty, theoretical approach and/or technical correctness, adequate evaluation, clarity, etc). For instance, a theoretical paper may need no experiments, while a paper with a new approach may require comparisons to existing methods. ## Novelty+
To the best of my knowledge, this is the first paper to use a convolutional architecture to model human motion, although the approach has been used before to model time series (e.g. speech in WaveNet). The long and short-term convnets are an interesting touch that I have not seen before.

## Evaluation+
Evaluation is consistent with previous work, and the authors went one step further and compared against RRNN (the strongest baseline) on the CMU mocap dataset as well. This new method shows improved performance in nearly all cases.

## Clarity+
The paper is easy to read, clearly describes the improvements and adequately contextualizes the contributions within previous work. Notation is concise and only somewhat hard to follow when explaining the small window size of the short-term network (C).
[Paper Weaknesses] Please discuss, justifying your comments with the appropriate level of details, the weaknesses of the paper (i.e. lack of novelty – given references to prior work-, lack of novelty, technical errors, or/and insufficient evaluation, etc). Note: If you think there is an error in the paper, please explain why it is an error. Also remember that theoretical results/ideas are essential to CVPR (some theoretical papers may not need to have experiments). If the theory is novel and interesting, but the results did not outperform other existing algorithms, it is not necessarily a reason to reject. It is not appropriate to ask for comparisons with unpublished papers and papers published after the CVPR deadline. In all cases, please be polite and constructive. CVPR 2018 policy on dual submission and arxiv appears at: http://cvpr2018.thecvf.com/submission/main_conference/author_guidelines. ## Complexity-
One thing I don't like about this paper is the addition of an "adversarial regularizer", which is mentioned nowhere in the abstract or introduction, and is only introduced at the end of Section 3. IMO this adds unnecessary complexity to the model and, as Fig 4 shows, results in marginal improvements.

## Evaluation
How does the zero-velocity (no prediction) baseline ([15]) compare to these new results on CMU?
Adding it would really help contextualize the results.
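
For reference, the zero-velocity baseline simply repeats the last observed pose for every future frame. Below is a minimal sketch under stated assumptions: poses are flat lists of floats, and the error is a plain per-frame Euclidean distance, which may differ from the metric used in the paper. The function names and toy data are hypothetical.

```python
def zero_velocity_baseline(observed, horizon):
    """Predict `horizon` future frames by copying the last observed pose."""
    if not observed:
        raise ValueError("need at least one observed frame")
    last = observed[-1]
    return [list(last) for _ in range(horizon)]


def mean_l2_error(predicted, target):
    """Mean per-frame Euclidean distance (an assumed, illustrative metric)."""
    total = 0.0
    for p, t in zip(predicted, target):
        total += sum((a - b) ** 2 for a, b in zip(p, t)) ** 0.5
    return total / len(predicted)


# Toy data: two observed frames and two ground-truth future frames,
# each pose holding two values.
observed = [[0.0, 0.0], [0.1, 0.2]]
future = [[0.1, 0.2], [0.4, 0.6]]

prediction = zero_velocity_baseline(observed, horizon=2)
error = mean_l2_error(prediction, future)
```

Because the baseline has no parameters, reporting it alongside learned models shows how much of a method's short-horizon accuracy comes from actual prediction rather than from the slow change between consecutive poses.
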

## Figures
* Figure 2, on the right, shows a standard diagram of a convnet. I'd get rid of it, since this has appeared over and over in the computer vision literature at this point. I'd rather expand the diagram on the left.
* Please don't use red/green for inputs/predictions. 8% of the population can't follow those diagrams. See http://colorbrewer2.org/ for palettes that are colourblind-safe.

## Clarity / general comments
L44: the others -> others
L120/L188: claims that most RNN works are sequence-to-sequence -> not true! ERD and SRNN were trained with a per-timestep minimization, as is common in speech modelling (when minimizing perplexity). [15] is the only RNN-based previous work that uses a sequence-to-sequence protocol during training.
L199 says that [15] used an LSTM in the encoder. This is false, both encoder and decoder in [15] are GRUs (they even have tied weights!)
L254-260 repeats the end of S2. I'd remove it.
L270. ... pace ...) --> ... pace, etc.)
L513. recoRded
L529. missing citation


[Preliminary Rating] Please rate the paper according to one of the following six choices: Poster
[Preliminary Evaluation] Please indicate to the AC, your fellow reviewers, and the authors your current opinion on the paper. Please tell the ACs what points you think have the most weight in your reviews and summary, and why. Overall happy with the paper. I'd like to see the zero-velocity baseline, but overall I think the paper is pretty solid.
[Rebuttal Requests] Please pose questions you want to be answered in the rebuttal. *Could the authors please comment on the zero-velocity baseline? For the CMU experiments, how does it perform?
*Could the authors also comment on why they decided to use an adversarial regularizer? It seems like an afterthought. What is the numerical impact at 80, 160, 320 ms? Could this variant be tabulated as well?
[Confidence] Very Confident - to stress that you are pretty sure about your conclusions (e.g., you are an expert who works in the paper's area).
[Final Recommendation] Please provide your final recommendation by taking into consideration the rebuttal, other reviews and discussions. I think the review of R1 is unnecessarily harsh. R1's main criticism is that the approach is "weird", and R1 asks whether the point of using a CNN is only to prove that CNNs are better -- as I explained in the discussion, I don't think it is our place to judge papers based on how weird they are, or to reject them because they don't fit our preconceptions of how problems "should be" attacked (R1 seems to believe that RNNs are the one and only acceptable way here).

I think this paper deserves to be presented at CVPR, and is in fact the most solid paper of my entire batch.

The authors should clarify that the adversarial regularization is mostly for qualitative results, otherwise it seems very sketchy (as R2 and I pointed out).
[Final Rating] Please rate the paper according to one of the following six choices: Poster