Databases and Training¶
Sugaroid uses an sqlite3 database for portability. All responses are
saved explicitly and used to train sugaroid. Sugaroid has two types of
training:

1. Supervised training
2. Unsupervised training
Supervised training¶
Supervised training uses a list of proper responses, most commonly
collected from the Stanford Question Answering Dataset (Natural) (SQuAD
2.0 from Stanford NLP, attribution to Rajpurkar & Jia et al. ’18). Other
responses are manually trained from interactions during testing. All the
responses are saved to ~/.config/sugaroid/sugaroid.db, which is opened in
read-only mode in production to prevent people from tampering with the
dataset. During local testing, it is possible to teach sugaroid a
sequence of responses using the Naive Bayes algorithm; these responses
are appended to the SQL database.
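The read-only behavior described above can be illustrated with Python's standard sqlite3 module. This is a minimal sketch, not sugaroid's actual code: the helper name open_read_only is hypothetical, and it assumes the database file already exists.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path):
    """Open an existing SQLite file so that any write attempt is rejected.

    Hypothetical helper for illustration; not part of sugaroid.
    """
    # as_uri() builds a file: URI; mode=ro asks SQLite for a read-only connection.
    uri = Path(db_path).expanduser().resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

With a connection opened this way, SELECT statements work as usual, while INSERT, UPDATE, or DELETE raise sqlite3.OperationalError.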
Unsupervised Training¶
Unsupervised training uses a community-collected dataset. The data comes
from the community, through the sugaroid.srevinsaju.me instance hosted on
Microsoft Azure, with the frontend on AWS. This data is also appended to
the SQL database, as in supervised training, but it is saved with lower
confidence (0.1 * confidence_from_statement), because community data
still needs to undergo refining.
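The effect of the 0.1 factor can be worked through in a few lines. This is only a sketch of the arithmetic: the function names and the tuple layout are hypothetical, and it does not claim to reproduce how sugaroid actually ranks responses.

```python
COMMUNITY_WEIGHT = 0.1  # the factor quoted in the text above


def community_confidence(confidence_from_statement):
    """Scale down the confidence of a community-provided statement."""
    return COMMUNITY_WEIGHT * confidence_from_statement


def best_response(candidates):
    """Pick the text with the highest effective confidence.

    candidates: iterable of (text, confidence, from_community) tuples.
    This tuple layout is an assumption made for the example.
    """
    def effective(candidate):
        _text, confidence, from_community = candidate
        return community_confidence(confidence) if from_community else confidence

    return max(candidates, key=effective)[0]
```

Under this weighting, a community response with a raw confidence of 0.9 is stored at about 0.09, so a curated response at 0.5 still outranks it.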
sqlite3¶
Sugaroid’s database backend is sqlite3, rather than the more conventional
MySQL or MariaDB adapters. sqlite3 was chosen for its portability alone.
Despite the higher IO cost of sqlite3, community data collection becomes
easier because an sqlite3 database is, more or less, a single file.
Another problem it solves is that operating systems treat file paths
differently, in particular with respect to case: mysql is case
insensitive on Windows, but case sensitive on GNU/Linux/UNIX. Using
sqlite3 keeps this behavior consistent across platforms.
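Because the whole database lives in one file, taking a copy of it is straightforward. The sketch below uses the backup method of Python's standard sqlite3 module; the function name snapshot and both paths are hypothetical, and this is not taken from sugaroid itself.

```python
import sqlite3


def snapshot(src_path, dest_path):
    """Copy every table of one SQLite file into another file.

    Hypothetical helper for illustration; not part of sugaroid.
    """
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Connection.backup copies the full database, page by page.
        src.backup(dest)
    finally:
        dest.close()
        src.close()
```

Going through the backup API rather than a plain file copy gives a consistent result even if another connection writes to the source while the copy is running.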
Privacy policy¶
Sugaroid collects data from its users, which is then used for training.
Collection happens through cookies from the first response you provide
to sugaroid (on the web interface), or once the bot is added to your
Discord channel (on the Discord adapter). However, your data is
completely safe, and is not collected for training purposes if sugaroid
is (i) self-hosted or (ii) run as a desktop / command line app. All data
on the desktop version is still appended to the database in your local
configuration folder: for example, ~/.config/sugaroid/sugaroid.db on
Linux and C:\Users\foobar\AppData\Local\sugaroid\sugaroid.db on Windows.
Note: The AppData folder is normally hidden on Windows; enable “Show all hidden folders” manually to see it.
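The two locations above can be computed from the environment. This is a hedged sketch: the function default_db_path is hypothetical, it assumes that LOCALAPPDATA points at the AppData\Local folder on Windows, and it treats every non-Windows platform like the Linux example, which the text above does not confirm.

```python
import os
import sys
from pathlib import PurePosixPath, PureWindowsPath


def default_db_path(platform=None, env=None):
    """Return the sugaroid.db location described above for a platform.

    Hypothetical helper; platform and env can be passed in for testing.
    """
    platform = sys.platform if platform is None else platform
    env = os.environ if env is None else env
    if platform.startswith("win"):
        # Assumption: LOCALAPPDATA is C:\Users\<name>\AppData\Local.
        base = PureWindowsPath(env["LOCALAPPDATA"])
        return str(base / "sugaroid" / "sugaroid.db")
    # Assumption: HOME is set and ~/.config is used, as in the Linux example.
    base = PurePosixPath(env["HOME"]) / ".config"
    return str(base / "sugaroid" / "sugaroid.db")
```

Called without arguments, the function reads the current platform and environment, so it raises KeyError if the expected variable is not set.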
Investigating data from the database¶
In some cases you may want to analyze the data stored in the database,
or do some debugging. In all such cases, knowing the path to sugaroid.db
is very useful. All you need is an sqlite3 binary, which is available
for all platforms. Download sqlite3 for your platform, then start
investigating with:
$ sqlite3 ~/.config/sugaroid/sugaroid.db
This opens an interactive prompt where you can enter SQL statements.
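The same kind of inspection can be done from Python when no sqlite3 binary is at hand. The sketch below lists every table with its row count by querying sqlite_master, the catalog table that SQLite itself maintains. The function name is hypothetical, and no assumption is made about which tables sugaroid creates.

```python
import sqlite3


def table_row_counts(db_path):
    """Return {table_name: row_count} for every table in an SQLite file."""
    # Note: sqlite3.connect creates an empty file if db_path does not exist.
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            # Quote the table name as an identifier, doubling embedded quotes.
            quoted = '"{}"'.format(name.replace('"', '""'))
            counts[name] = conn.execute(
                "SELECT count(*) FROM {}".format(quoted)
            ).fetchone()[0]
        return counts
    finally:
        conn.close()
```

Pass it the expanded path to sugaroid.db (for example via os.path.expanduser) to get a quick overview before writing more specific queries.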
Sugaroid stores data in the following files, the first of which is the
main database:

* ~/.config/sugaroid/sugaroid.db
* ~/.config/sugaroid/sugaroid.trainer.json
* ~/.config/sugaroid/sugaroid_internal.db
* ~/.config/sugaroid/data.json
Along with SQL, JSON files are also used, but for configuration only.