عجفت الغور

structured state space modeling

ml

  • sequential image classification
    • e.g. sequential CIFAR-style setups that flatten images into pixel sequences of length 1024 (sketch below)
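A minimal sketch of that flattening, assuming a sequential-CIFAR-style setup (the array below is random, not real data):

```python
import numpy as np

# Stand-in for one 32x32 grayscale image; sequential CIFAR has
# 32x32 = 1024 pixel positions (color adds 3 channels per step).
image = np.random.rand(32, 32)

# The model never sees the 2D grid: the image is read out pixel by
# pixel as a length-1024 sequence.
sequence = image.reshape(-1, 1)
assert sequence.shape == (1024, 1)
```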
  • keyword spotting
    • 1s audio clips to be classified into keyword classes
    • speech is difficult b/c of the high sampling rate
    • 1s clips at 16kHz -> 16k samples
      • MFCC coefficients
      • filter banks reduce sequence length by ~100x (sketch below)
    • irregular continuous data
    • missing values at different frequencies
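A sketch of the sequence-length arithmetic, assuming librosa for the MFCC step (the hop length here is an assumption, chosen to give ~10ms frames):

```python
import numpy as np
import librosa  # assumed available; any MFCC implementation works

sr = 16000                                   # 1s clip at 16kHz
y = np.random.randn(sr).astype(np.float32)   # stand-in for a real clip

# Mel filter banks + DCT compress each ~10ms window into a handful of
# coefficients, so 16k raw samples become on the order of 100 frames.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=160)
print(mfcc.shape)  # (20, 101): over 100x shorter than the raw waveform
```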
  • Long Range Arena benchmark
  • are the A matrices constructed to be stable, or do stable matrices lead to a zero final state?
    • do all good A matrices derive from the HiPPO matrix?
    • close to unitary: are the eigenvalues controlled so that the state doesn't blow up?
    • stability comes partly from the matrix itself, but also from the discretization
      • by fiat of memorization it has to be stable: the matrix is derived to memorize, and to memorize it must be stable
    • not every stable matrix will do well
    • a randomly initialized A matrix won't work (see the sketch after this list)
      • actual random initialization NaNs out
      • controlling the eigenvalues to be close to 1
      • becomes stable enough to train
        • but doesn't perform well
      • performs well only when closely tied to HiPPO
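A sketch of the stability point: the HiPPO-LegS construction below follows the Annotated S4, and the bilinear discretization shows why its eigenvalues stay controlled while a random A blows up over a length-1024 sequence:

```python
import numpy as np

def make_hippo(N):
    # HiPPO-LegS matrix (as in the Annotated S4): lower triangular with
    # eigenvalues -1, ..., -N on the negative real axis, so the
    # continuous-time system is stable by construction.
    p = np.sqrt(1 + 2 * np.arange(N))
    A = p[:, None] * p[None, :]
    return -(np.tril(A) - np.diag(np.arange(N)))

def discretize(A, dt=1.0 / 1024):
    # Bilinear transform: maps left-half-plane eigenvalues into the unit
    # circle, so stability survives the continuous -> discrete step.
    I = np.eye(A.shape[0])
    return np.linalg.solve(I - dt / 2 * A, I + dt / 2 * A)

N = 64
for name, A in [("hippo", make_hippo(N)), ("random", np.random.randn(N, N))]:
    radius = np.abs(np.linalg.eigvals(discretize(A))).max()
    # Spectral radius > 1 means the state grows exponentially over the
    # sequence -- the "NaN out" failure mode for random initialization.
    print(f"{name}: spectral radius {radius:.4f}, ^1024 ~ {radius**1024:.2e}")
```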
  • why not regular benchmarks
    • what are the limitations
      • text may actually be more difficult
      • text is very discrete, and it's unclear how well it would work
    • NarrativeQA?
      • question answering over long-range contexts?
  • theory of structured matrices
    • algorithmic linear algebra / structured matrices (see the sketch after this list)
  • theory of memorization with HiPPO
    • other approximations
      • HiPPO is a family of matrices that's not fully understood
    • can HiPPO be learned via EM?
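One concrete piece of structure that is understood: HiPPO-LegS is a rank-1 perturbation of a normal matrix, which is what S4's fast algorithms exploit. A quick numerical check (construction as in S4; N = 16 is arbitrary):

```python
import numpy as np

def make_hippo(N):
    p = np.sqrt(1 + 2 * np.arange(N))
    A = p[:, None] * p[None, :]
    return -(np.tril(A) - np.diag(np.arange(N)))

N = 16
A = make_hippo(N)
P = np.sqrt(np.arange(N) + 0.5)   # the rank-1 correction used in S4

# Adding P P^T turns the HiPPO matrix into a normal matrix
# (skew-symmetric plus -1/2 I): the "normal plus low-rank" structure.
S = A + np.outer(P, P)
assert np.allclose(S + S.T, -np.eye(N))  # symmetric part is just -1/2 I
assert np.allclose(S @ S.T, S.T @ S)     # definition of a normal matrix
```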
  • high-level intuitions
    • why does it do better?
      • not forgetting things? learning more complex things?
        • what does "learning more complex things" mean?
        • constructed not to forget
        • long referential sequences, long-context sequences
        • tradeoffs? on discrete and shorter data, maybe transformers are better
          • maybe CNNs are more efficient
          • it's competitive with RNNs and CNNs across the board
      • testing with synthetic datasets?
    • the A/B/C/D matrices "bias" the model towards memorization (see the sketch after this list)
    • A/B are initialized by the theory -> memorization
    • C/D are more general deep-learning params -> exploiting what it's memorized
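A sketch of that split, using the standard discrete state-space recurrence; the shapes and initializations below are illustrative stand-ins, not the real S4 parameterization:

```python
import numpy as np

N, L = 64, 1024                  # state size, sequence length
rng = np.random.default_rng(0)

# A/B come from the theory and bias the state x toward being a
# compressed memory of the input history u_0..u_k.
A_bar = np.eye(N) * 0.99         # stand-in for a discretized HiPPO A
B_bar = rng.normal(size=(N, 1))

# C/D are ordinary learned parameters that decide how the memorized
# state is read out for the task at hand.
C = rng.normal(size=(1, N))
D = rng.normal(size=(1, 1))

u = rng.normal(size=(L, 1))      # input sequence
x = np.zeros((N, 1))
ys = []
for k in range(L):
    x = A_bar @ x + B_bar * u[k]          # memorize: write into the state
    ys.append((C @ x + D * u[k]).item())  # exploit: read the state out
```

In the real model A-bar comes from discretizing the HiPPO matrix rather than a scaled identity; the point is only that the recurrence writes history into x, while C/D just read it out.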
  • previous work: https://arxiv.org/abs/1611.01569