Neural networks have achieved success in a wide range of perceptual tasks. However, they are often reported to be ineffective at problems that require higher-level reasoning. Experiments on two recently released video question-answering datasets, CLEVRER and CATER, show that neural networks cannot adequately reason about the spatio-temporal and compositional structure of visual scenes.
On the other hand, neuro-symbolic models, which combine neural components with symbolic reasoning techniques to predict, explain, and consider counterfactual possibilities, are assumed to be much better suited to these tasks than purely neural networks. Such a model leverages several independently learned modules, such as:
- A neural network ‘perceptual’ front-end to detect objects
- A dynamics module to infer objects’ behavior over time
- A statistical semantic parser that translates the questions into a symbolic representation
- A hand-coded symbolic executor that interprets inputs and predicts answers
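
To make the division of labor between these four modules concrete, here is a minimal toy sketch in Python. It is not the implementation of any published model: every name in it (`SceneObject`, `perceive`, `predict_positions`, `parse_question`, `execute`, `answer`) is hypothetical, the "perception" step simply reads pre-annotated objects instead of running a neural network, the dynamics module is a constant-velocity rollout in one dimension, and the parser only understands two hard-coded question templates. The point is only to show how the output of each stage feeds the next.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical symbolic description of one detected object."""
    name: str
    color: str
    position: float  # 1-D position, to keep the toy example small
    velocity: float  # displacement per time step


def perceive(frame: List[dict]) -> List[SceneObject]:
    """Stand-in for the neural front-end: the 'frame' is already annotated."""
    return [SceneObject(**raw) for raw in frame]


def predict_positions(objects: List[SceneObject], steps: int) -> Dict[str, List[float]]:
    """Stand-in for the dynamics module: constant-velocity rollout."""
    return {
        o.name: [o.position + o.velocity * t for t in range(steps + 1)]
        for o in objects
    }


def parse_question(question: str) -> Tuple[str, ...]:
    """Stand-in for the semantic parser: question text -> tiny symbolic program."""
    words = question.lower().rstrip("?").split()
    if words[:2] == ["how", "many"]:
        # "How many red objects are there?" -> ("count", "red")
        return ("count", words[2])
    if words[0] == "do" and words[-1] == "collide":
        # "Do ball and cube collide?" -> ("collide", "ball", "cube")
        return ("collide", words[1], words[3])
    raise ValueError(f"unsupported question: {question!r}")


def execute(program, objects, trajectories, tolerance: float = 0.5):
    """Hand-coded symbolic executor: runs the program on the symbolic scene."""
    op, *args = program
    if op == "count":
        return sum(1 for o in objects if o.color == args[0])
    if op == "collide":
        first, second = (trajectories[name] for name in args)
        # Two objects "collide" if they are within `tolerance` at the same step.
        return any(abs(a - b) <= tolerance for a, b in zip(first, second))
    raise ValueError(f"unknown operation: {op!r}")


def answer(frame: List[dict], question: str, steps: int = 5):
    """Chain the four modules: perceive -> predict -> parse -> execute."""
    objects = perceive(frame)
    trajectories = predict_positions(objects, steps)
    return execute(parse_question(question), objects, trajectories)


# Made-up scene used only for illustration.
EXAMPLE_FRAME = [
    {"name": "ball", "color": "red", "position": 0.0, "velocity": 1.0},
    {"name": "cube", "color": "blue", "position": 4.0, "velocity": -1.0},
    {"name": "cone", "color": "red", "position": 10.0, "velocity": 0.0},
]

print(answer(EXAMPLE_FRAME, "How many red objects are there?"))
print(answer(EXAMPLE_FRAME, "Do ball and cube collide?"))
```

Because each stage exchanges explicit symbols (object attributes, trajectories, a small program), any one of them could be swapped out or inspected on its own. That modularity is what the list above describes, and it is also why such pipelines depend on hand-designed interfaces between the modules.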