Miniature Breach Index
I spent a long time trying to see how small I could go when setting up a breach index environment. The type of software I was looking at is called a "full-text search engine", if you want to look into the topic further. This is a companion post to my "Breach Data Infrastructure" blog. This post discusses a small form factor for breach indexing: self-contained on a microSD card.
Considerations
Most of my time on this project (more of a curiosity) was spent testing various full-text search engines (mostly open source). There are two main types of search engine databases: in-memory and on-disk [1]. "An in-memory database stores the data in memory and uses disk for backup, while an on-disk database stores the data on disk and uses memory for caching" [2]. For a small-scale environment, on-disk databases are preferred, as the available CPU and RAM might not meet the requirements of an in-memory engine. Here is what I tested, with a small summary of features/downsides (caveat: I was testing these on a Pop!_OS VM, so your results may vary):
Name | Link
---|---
Apache Solr | https://solr.apache.org/
ZincSearch | https://zincsearch-docs.zinc.dev/
Quickwit | https://quickwit.io/
DocFetcher | https://docfetcher.sourceforge.io/en/index.html
Datashare | https://datashare.icij.org/
Here is some other software I tried that did not end up working for me either. I did not go into detail on these, as I did not spend much time with them:
There is a Python search engine library called Whoosh that seemed promising for this use case: https://whoosh.readthedocs.io/en/latest/intro.html. I did a bit of testing on Windows, and the closest alternatives I saw that could have worked on Windows were:
https://www.voidtools.com/downloads/ (only seems to index file names, not content)
https://puggle.sourceforge.net/index.html (portable version at: https://puggle.sourceforge.net/portable_manual.html) (not tested)
https://docfetcher.sourceforge.io/en/download.html (not tested)
https://www.listary.com/ (only seems to index file names)
Since DocFetcher worked on Linux, I assume it might be the only option from this list that would actually work on Windows.
MicroSD
After testing all the aforementioned software, the only one that worked at microSD scale on Linux was DocFetcher. I downloaded the Linux file from https://docfetcher.sourceforge.io/en/download.html and unzipped it. Before running the GUI, I had to break down the data. One issue I noted with DocFetcher was that it could index files as large as 170 MB, but it had a really rough time displaying a file of that size in the GUI. I realized that if I broke a breach down into 50 MB parts, it would be easier for it to highlight exactly where the text matches without running the memory or CPU up. I ran the following on a breach:
-C = put at most SIZE bytes of records per output file
--numeric-suffixes = same as -d, but allow setting the start value
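For illustration, the same kind of size-capped split that keeps lines intact can be sketched in Python. This is not the command used above, just a rough equivalent of the idea behind -C; the function name, the part_ prefix, and the two-digit suffix are hypothetical choices.

```python
from pathlib import Path

def split_by_bytes(src, out_dir, max_bytes=50 * 1024 * 1024, prefix="part_"):
    """Split src into files of at most max_bytes without breaking a line.

    Unlike split -C, a single line longer than max_bytes is kept whole
    in its own file instead of being cut.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    out = None
    written = 0
    with open(src, "rb") as fh:
        for line in fh:
            # Start a new part when the next line would exceed the cap.
            if out is None or (written and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                path = out_dir / f"{prefix}{len(parts):02d}"
                out = open(path, "wb")
                parts.append(path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

Calling split_by_bytes("breach_name.csv", "parts") would write parts/part_00, parts/part_01, and so on. As with split, only the first part carries the CSV header.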
The split command took 10 minutes 30 seconds to complete on 4.9 GB of CSV data. For context, this is first of all a microSD card (reads and writes are on the slower side), and it is also formatted as NTFS (which Linux has to be configured to read/write).
After the command completed, I moved all of the output files into one folder. I ran DocFetcher with ./DocFetcher-GTK3.sh in the DocFetcher folder. From there, I right-clicked in the "Search Scope" area to add a folder and selected my folder with all the 50 MB breach files. Here is my config:
My CPU was fluctuating between 5% and 10% usage during indexing. This is not a downside, just a normal part of the procedure that I wanted to point out. Indexing ended up taking 13 minutes to complete. However, when I searched for a string, I got an instant result. I would say 50 MB might be a bit on the bigger side: the software starts to lag when displaying 50 MB worth of data (160,000 to 300,000 lines of CSV text). My safe recommendation would be ~25 MB, so roughly 100,000 lines.
For the previous approach, I used the split command to split the data. That led to files not having headers after they were split (only the first file had a header). To mitigate this issue, we can use miller and its split function: https://miller.readthedocs.io/en/latest/reference-verbs/index.html#split [5]. Excerpt from the miller documentation:
For miller, we don't have the option to set a file size (in bytes), but we can choose how many records per file (-n) or how many files we want in total (-m). I chose the records-per-file option with a value of 80,000. Half of the 160,000-line lower bound we got previously should be good enough for display.
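As a rough check on that choice, the line counts seen with the 50 MB files give an average line size, which in turn gives the expected file size for 80,000 records. This is a small sketch, assuming decimal megabytes (1 MB = 1,000,000 bytes) and that the 160,000 to 300,000 range is representative of the data; the helper name is only for illustration.

```python
MB = 1_000_000  # assumption: decimal megabytes

def bytes_per_line(file_bytes, line_count):
    """Average size of one record, in bytes."""
    return file_bytes / line_count

# The 50 MB parts held between 160,000 and 300,000 lines.
largest = bytes_per_line(50 * MB, 160_000)   # 312.5 bytes per line
smallest = bytes_per_line(50 * MB, 300_000)  # about 166.7 bytes per line

records = 80_000
low_mb = records * smallest / MB   # about 13.3 MB
high_mb = records * largest / MB   # 25.0 MB
print(f"{records} records is roughly {low_mb:.1f} to {high_mb:.1f} MB per file")
```

Under those assumptions, 80,000 records per file should land at or below the ~25 MB size that displayed comfortably.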
The miller software in the apt repository does not have the split function. The GitHub release does. Link: https://github.com/johnkerl/miller/releases.
Here was the command I ran:
mlr --csv --from breach_name.csv split -n 80000
This ran for about 7 minutes, but since one line in the CSV file had an extra column, the command stopped at about 75% completion. About 3.7 GB of data had already been split by then. I still wanted to use this to see how big the DocFetcher index would get with this data. The DocFetcher folder without any indexing was 90.4 MB. After indexing 3.7 GB of data, the folder increased to 3.0 GB. Not bad at all; this is about 81% of the size of the indexed data. I was able to get a search result instantly like before, and could also see the full file after a couple of seconds. My recommendation would be 80,000 records per file.
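One way around a single malformed line stopping the whole run is to set aside rows whose column count does not match the header and keep going. The sketch below is not Miller and not what I ran; it is a minimal standard-library stand-in that repeats the header in every part. The function name, the split_ prefix, and the rejects.csv file name are hypothetical.

```python
import csv
from pathlib import Path

def split_csv(src, out_dir, records_per_file=80_000, prefix="split_"):
    """Split a CSV into parts of N records, repeating the header in each part.

    Rows whose column count differs from the header go to rejects.csv
    instead of stopping the run.
    Returns (list of part paths, number of rejected rows).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts, rejected = [], 0
    out = writer = None
    count = 0
    with open(src, newline="", encoding="utf-8", errors="replace") as fh, \
         open(out_dir / "rejects.csv", "w", newline="", encoding="utf-8") as rej:
        reader = csv.reader(fh)
        reject_writer = csv.writer(rej)
        header = next(reader, None)
        if header is None:
            return parts, rejected
        for row in reader:
            if len(row) != len(header):
                # Malformed row (extra or missing column): set it aside.
                reject_writer.writerow(row)
                rejected += 1
                continue
            if writer is None or count >= records_per_file:
                if out is not None:
                    out.close()
                path = out_dir / f"{prefix}{len(parts) + 1}.csv"
                out = open(path, "w", newline="", encoding="utf-8")
                writer = csv.writer(out)
                writer.writerow(header)
                parts.append(path)
                count = 0
            writer.writerow(row)
            count += 1
    if out is not None:
        out.close()
    return parts, rejected
```

The rejected rows can then be reviewed by hand, and since DocFetcher does not care about the format, rejects.csv can still be indexed alongside the clean parts.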
Issue #1
While indexing a large set of files (2000+, ~13 MB each), the indexing ran into an issue. I ended up not keeping the indexed data. Snippet of issue:
I was not able to find a solution for this error. I assume it has something to do with the amount of memory provided to the application.
Issue #2
I was indexing 1468 files, which took about 3 hours and 39 minutes (see image below). At 1290 files, the indexing started running into errors with files.
I did hit the "Update Index" option on the index after it completed, but this did not resolve the issue. I did see a potential solution for this on the wiki (https://sourceforge.net/p/docfetcher/wiki/FAQ/):
I have not tried this completely, but increasing the RAM did mitigate some memory issues I was dealing with.
Suggestions
Based on the issues above, and my overall experience with DocFetcher, here are some suggestions I will give:
If possible, filter (split, sort, etc.) data on a PC/laptop to make the read/write much quicker
Edit the DocFetcher file (DocFetcher-GTKX.sh on Linux and misc/DocFetcher.bat on Windows) to increase the RAM and give it more memory to work with
The data does not need to be formatted in a certain way (CSV, JSON, etc.). However, add the extension(s) to the index settings so it does not overlook your files at index time
Potential Solution: Quickwit
I was going to show a demo of Quickwit on the microSD card, but because one line in the file was malformed, it threw off the ingestion. They do have an API that can handle bulk data, if you want to check it out: https://quickwit.io/docs/reference/es_compatible_api/#_bulk--batch-ingestion-endpoint. Running it off of a microSD card is doable. As breach data is not guaranteed to be formatted a certain way (for example, the "Collection" series), this would not really work for me. If you have breach data that is all formatted properly, I would recommend giving Quickwit a shot. It does have a clean interface.
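For anyone who wants to try the bulk route, here is a minimal sketch of preparing a request body from CSV text. It assumes the endpoint accepts the Elasticsearch-style format of alternating action and document lines with a create action (check the linked docs for what your Quickwit version accepts). The function name and the index name are hypothetical, and the sketch only builds the body; it does not send anything.

```python
import csv
import io
import json

def csv_to_bulk_ndjson(csv_text, index_id):
    """Turn CSV text into an Elasticsearch-style bulk body (NDJSON).

    Each well-formed row becomes two lines: an action line and the document.
    Rows whose column count differs from the header are skipped and counted.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return "", 0
    lines, skipped = [], 0
    for row in reader:
        if len(row) != len(header):
            skipped += 1
            continue
        # Assumption: a "create" action naming the target index.
        lines.append(json.dumps({"create": {"_index": index_id}}))
        lines.append(json.dumps(dict(zip(header, row))))
    body = "\n".join(lines) + "\n" if lines else ""
    return body, skipped

sample = "email,password\na@example.com,pw1\nbad,row,extra\nb@example.com,pw2\n"
body, skipped = csv_to_bulk_ndjson(sample, "breach-demo")
print(body, end="")
print("skipped rows:", skipped)
```

Skipping malformed rows up front keeps one bad line from throwing off the whole batch, at the cost of losing those rows unless they are logged somewhere else.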
Conclusion
I set out to find out whether breach data indexing was possible on a microSD card. I was able to figure it out and test it with some breach data. It does need a bit of trial and error to get it to work, but it is completely possible. This is obviously not a realistic scenario for breach data indexing, but it was something I was curious about and wanted to prove was possible.
Sources