عجفت الغور
structured state space modeling
ml
- sequential image classification
- images are flattened into pixel sequences of length 1024 (e.g. 32x32 sCIFAR); see the sketch below
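a minimal sketch (my addition, not from the talk) of that flattening step, assuming 32x32 inputs as in sequential CIFAR:

```python
import numpy as np

# flatten a 32x32 image into a 1D sequence of length 1024, so the model
# sees one pixel (vector) per timestep instead of a 2D grid
image = np.random.rand(32, 32, 3)     # placeholder image (H, W, C)
sequence = image.reshape(32 * 32, 3)  # (1024, 3): one RGB pixel per step
print(sequence.shape)                 # (1024, 3)
```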
- keyword spotting
- 1s audio clips to be classified into keyword classes
- speech is difficult b/c of the high sampling rate
- 1s clips at 16kHz = 16k samples
- mfcc coefficients
- filter banks reduce the sequence length by ~100x; see the sketch below
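a hedged sketch of that ~100x reduction (the hop and filter settings here are my illustrative choices, not necessarily the paper's preprocessing):

```python
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr).astype(np.float32)  # placeholder 1s clip = 16k samples
# mfcc features with a 10ms hop (160 samples) compress 16,000 samples
# into ~100 frames
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=160)
print(y.shape, "->", mfcc.shape)  # (16000,) -> (20, 101)
```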
- irregular continuous data
- missing values, different sampling frequencies
- long range arena
- are A matrices constructed to be stable, or do stable matrices lead to a zero final state?
- do all matrices derive from the hippo matrix?
- close to unitary: are the eigenvalues controlled so that the state doesn't blow up?
- stability doesn't come from the matrix alone; it also comes from the discretization (see the sketch below)
- memorization forces stability: the matrix is derived to memorize, and to memorize it must be stable
- not every stable matrix will do well
- a randomly initialized A matrix won't work
- actual random initialization NaNs out
- control the eigenvalues to be close to 1 (near the unit circle)
- then it becomes stable enough to train
- performs well when the initialization stays close to hippo
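to make the discretization point concrete: a sketch (my addition, following the standard HiPPO-LegS construction) that builds the theory-derived A, discretizes it with the bilinear transform A_bar = (I - dt/2 A)^-1 (I + dt/2 A), and compares discrete eigenvalue magnitudes against a random A:

```python
import numpy as np

def make_hippo(n):
    # HiPPO-LegS matrix: A[i,j] = -sqrt(2i+1)sqrt(2j+1) for i > j,
    # A[i,i] = -(i+1), and 0 above the diagonal
    p = np.sqrt(1 + 2 * np.arange(n))
    return -(np.tril(np.outer(p, p)) - np.diag(np.arange(n)))

def discretize(a, dt):
    # bilinear transform: A_bar = (I - dt/2 A)^-1 (I + dt/2 A)
    i = np.eye(a.shape[0])
    return np.linalg.solve(i - dt / 2 * a, i + dt / 2 * a)

n, dt = 64, 1e-2
for name, a in [("hippo", make_hippo(n)), ("random", np.random.randn(n, n))]:
    eigs = np.linalg.eigvals(discretize(a, dt))
    print(name, "max |eig| of discrete A:", np.abs(eigs).max())
# hippo's discrete eigenvalues stay inside the unit circle (stable over
# 16k steps); a random A typically has |eig| > 1, so the state explodes
# and training NaNs out
```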
- why not regular benchmarks
- what are the limitations
- text may actually be more difficult
- text is very discrete, and it's unclear how well this would work
- NarrativeQA?
- question answering over long-range tasks?
- theory of structured matrices
- algorithmic linear algebra / structured matrices
- theory of memorization with hippo
- other approximations
- hippo is a family that’s not fully understood
- can hippo be learned via EM?
- high level intuitions
- why does it do better?
- not forgetting things? more complex things?
- what does learning more complex things mean?
- constructed to not forget
- long referential sequences, long context sequences
- tradeoffs? on discrete and shorter data, maybe transformers are better
- maybe CNNs are more efficient
- competitive with RNNs and CNNs across the board
- testing with synthetic datasets?
- the A/B/C/D matrices "bias" the model towards memorization
- A/B are initialized by the theory -> memorization
- C/D are more general deep learning params -> exploiting what's been memorized (see the sketch below)
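a hedged sketch of that division of labor (my own simplification, not the actual S4 implementation, which uses a special parameterization for efficiency): A/B are fixed by the theory at init, while C/D are ordinary trainable readout parameters:

```python
import numpy as np

def make_hippo(n):
    # theory-derived HiPPO-LegS A matrix (see the sketch further up)
    p = np.sqrt(1 + 2 * np.arange(n))
    return -(np.tril(np.outer(p, p)) - np.diag(np.arange(n)))

n, dt = 64, 1e-2
I, A = np.eye(n), make_hippo(n)
A_bar = np.linalg.solve(I - dt / 2 * A, I + dt / 2 * A)        # theory init
B_bar = np.linalg.solve(I - dt / 2 * A, dt * np.ones((n, 1)))  # B = 1s is illustrative
C = np.random.randn(1, n) * 0.1  # learned readout: exploits the memorized state
D = np.zeros((1, 1))             # learned skip term

def ssm_scan(u):
    # linear recurrence: x_k = A_bar x_{k-1} + B_bar u_k ; y_k = C x_k + D u_k
    x, ys = np.zeros((n, 1)), []
    for u_k in u:
        x = A_bar @ x + B_bar * u_k
        ys.append((C @ x + D * u_k).item())
    return np.array(ys)

print(ssm_scan(np.sin(np.linspace(0, 8, 1024))).shape)  # (1024,)
```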
- previous work: https://arxiv.org/abs/1611.01569