
fasttext

Tags: ml, nlp

Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. “Enriching Word Vectors with Subword Information.” arXiv, June 19, 2017. https://doi.org/10.48550/arXiv.1607.04606.

  • Based on character n-grams: each word is treated as a bag of character n-grams plus the word itself, and its vector is the sum of the n-gram vectors (see the sketch after this list)
    • uses numerically stable transformations, working in log space to avoid underflow
    • very simple model, essentially a one-layer neural network (a skip-gram variant with subword units)
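
A minimal sketch of the subword idea: wrap each word in boundary markers, extract its character n-grams, and sum the n-gram vectors to get the word vector. The dictionary-based embedding table here is an assumption for illustration; real fastText hashes n-grams into a fixed number of buckets instead of storing each one.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word` wrapped in boundary markers, as in the paper."""
    w = f"<{word}>"
    grams = [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]
    return grams + [w]  # the whole word (with markers) is kept as its own unit

# Hypothetical embedding table mapping n-gram -> vector (illustration only).
rng = np.random.default_rng(0)
table = {}

def word_vector(word, dim=4):
    vecs = []
    for g in char_ngrams(word):
        if g not in table:
            table[g] = rng.normal(size=dim)
        vecs.append(table[g])
    return np.sum(vecs, axis=0)  # word vector = sum of its n-gram vectors

print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
print(word_vector("where"))
```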

Joulin, Armand, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. “FastText.zip: Compressing Text Classification Models.” arXiv, December 12, 2016. https://doi.org/10.48550/arXiv.1612.03651.

  • Also very simple: compresses the embedding vectors with a codebook, i.e. product quantization (a sketch follows this list)

  • notes that “the norms of the vectors are widely spread, typically with a ratio of 1000 between the max and the min”

  • turns out you can prune the most common words (those with the lowest entropy)

    • you can also prune the vectors that don’t add very much (i.e. those with low norms; see the norm-ranking sketch below)
  • the prunables: periods and commas are so common they carry little information (low entropy), while words like “mediocre”, “disappointing”, or “so-so” add little value (low norm)
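
A minimal product-quantization sketch in plain NumPy, using Lloyd-style k-means per subspace: each vector is split into subvectors, each subspace learns its own codebook of 256 centroids, and a vector is then stored as one byte per subspace. The paper also deals with the wide norm spread quoted above by normalizing vectors and quantizing norms separately; that step is omitted here.

```python
import numpy as np

def pq_train(X, n_subspaces=4, n_centroids=256, iters=10, seed=0):
    """Train a product quantizer: split vectors into subvectors and run
    k-means independently in each subspace (plain Lloyd iterations)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    ds = d // n_subspaces
    codebooks = []
    for s in range(n_subspaces):
        sub = X[:, s * ds:(s + 1) * ds]
        cent = sub[rng.choice(n, n_centroids, replace=False)]
        for _ in range(iters):
            # assign each subvector to its nearest centroid, then recenter
            dists = ((sub[:, None, :] - cent[None, :, :]) ** 2).sum(-1)
            assign = dists.argmin(1)
            for k in range(n_centroids):
                pts = sub[assign == k]
                if len(pts):
                    cent[k] = pts.mean(0)
        codebooks.append(cent)
    return codebooks

def pq_encode(X, codebooks):
    """Each vector becomes one small code per subspace (1 byte for 256 centroids)."""
    ds = X.shape[1] // len(codebooks)
    codes = []
    for s, cent in enumerate(codebooks):
        sub = X[:, s * ds:(s + 1) * ds]
        dists = ((sub[:, None, :] - cent[None, :, :]) ** 2).sum(-1)
        codes.append(dists.argmin(1))
    return np.stack(codes, axis=1).astype(np.uint8)

def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the selected centroids."""
    return np.hstack([codebooks[s][codes[:, s]] for s in range(len(codebooks))])

X = np.random.default_rng(1).normal(size=(1000, 64)).astype(np.float32)
cb = pq_train(X)
codes = pq_encode(X, cb)   # 1000 x 4 bytes instead of 1000 x 64 floats
X_hat = pq_decode(codes, cb)
print(codes.shape, float(((X - X_hat) ** 2).mean()))
```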
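
And a sketch of the norm-ranking idea behind pruning, with a made-up vocabulary and embedding matrix (the low norms for punctuation and weak sentiment words are injected by hand here purely for illustration). The paper's actual selection is more involved than a plain norm sort; this only shows the ranking step.

```python
import numpy as np

# Made-up vocabulary and embedding matrix for illustration.
vocab = np.array([".", ",", "mediocre", "so-so", "great", "terrible", "refund", "awful"])
rng = np.random.default_rng(1)
E = rng.normal(size=(len(vocab), 16))
E[[0, 1]] *= 0.05   # pretend punctuation learned near-zero vectors
E[[2, 3]] *= 0.2    # pretend weak sentiment words have small norms too

norms = np.linalg.norm(E, axis=1)
k = 4                          # budget: keep only k vectors
order = np.argsort(norms)
keep, drop = order[-k:], order[:-k]   # keep the k largest-norm vectors
print("kept:", vocab[np.sort(keep)].tolist())
print("dropped:", vocab[np.sort(drop)].tolist())
```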