Jensen Shannon Distance

The Jensen Shannon Distance is the last and the complex algorithm used inside sugaroid bot. The equation for finding Jensen Shannon Distance is not directly used within

\[\begin{split}D(M || Q) = \sum M(i) \log \frac {M(i)}{Q(i)} \\ JSD (M || Q) = \frac 12 \sum (\log \frac {M(i)}{\frac12M(i) + Q(i)}) + \log(\frac{Q(i)}{\frac 12 M(i) + Q(i)}))\end{split}\]

This being a complex and CPU intensive process, is handled systematically by a Natural Language Processing library with Industrial Processing support, viz, SpaCy. The SpaCy library handles this effectively by loading data from en_core_web_sm and en_core_web_lg

The difference between sm and lg is that, en_core_web_sm is collection of all the word in the dictionary with vectors only and weighs 7.5 MB. The en_core_web_lg weighs 880 MB, and has data for tensors too. This dataset is more efficient because, the data so obtained has tensor data and this helps to correctly measure Jensen Shannon Distance.

The JSD is internally implemented in an nlp object called LanguageProcessor and handles most of the complex conversations inside sugaroid.brain.utils.LanguageProcessor is a signed class with two methods tokenize and similarity The similarity method return the resultant net vector displacement of the given vectors.