Breach Data Indexing
Background
I was participating in a TraceLabs event and was using ripgrep to search the data in a breach compilation. ripgrep is much faster than grep; however, the search still took me 50 minutes to complete. For events like TraceLabs, 50 minutes is not time you have to waste, especially when it is being done for a good cause. Talking with AccessOSINT, we discussed how Michael Bazzell has mentioned in his podcast having really quick searches through breach and stealer log data. This got me motivated to actually look into solutions for searching through breach data much faster.
Search Engines
Whenever I think about search engines, the first things that come to mind are Google or DuckDuckGo. These services search an index of URLs and provide answers based on your search query. I ended up using self-hosted search engines myself to parse through breach data. These search engines index all of your data, so you can search through it much faster. However, before using a search engine, you have to format the data so it is digestible by the search engine. I will use two search engines in this blog: Meilisearch and Elasticsearch (more specifically, the ELK stack).
Meilisearch
Meilisearch is a search engine that can be hosted locally to parse through data you put into it. Read the known limitations before you go with this solution. Here are the commands and steps I used to get it up and running:
Setup
The first time you run Meilisearch, you will get a master key. Save it somewhere, as you will need it later.
This is the command I ended up using for set up:
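A setup along these lines works; the master key value shown here is a placeholder for the one generated on your first run:

```
# download the Meilisearch binary using the official install script
curl -L https://install.meilisearch.com | sh

# launch Meilisearch, protecting the API with your saved master key
./meilisearch --master-key="<MASTER_KEY>"
```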
At this point, you should be able to access the web interface (http://localhost:7700 by default):
Here you enter the master key you got earlier. You have now set up Meilisearch.
Creating an Index and adding Data
Indexes are basically the main group that the data falls under (Meilisearch's documentation uses movies as an example). Under the index, you then have all of your data. Meilisearch accepts the following data formats: JSON, NDJSON, and CSV. Each has its own syntax for sending it into Meilisearch.
I will be creating an index called "breach". You could have a separate index for each breach, but I feel like one location is easier to deal with in the long run.
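A request along these lines creates the index; the localhost address assumes a default local install, and the master key is the one saved during setup:

```
# create an index called "breach" with "id" as its primary key
curl -X POST 'http://localhost:7700/indexes' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <MASTER_KEY>' \
  --data-binary '{ "uid": "breach", "primaryKey": "id" }'
```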
You should see output confirming that the index creation task has been enqueued.
Visiting the web portal, you should now see the "breach" index listed.
We have now created the index, but have no data in the index.
Data Format
Not every breach will have the same fields. In order to make them the same, you will have to format the data. Linux makes that easy with tools such as awk, sed, etc. I like to write my own scripts so I can customize them to my usage.
In Meilisearch, "The primary field is a special field that must be present in all documents. Its attribute is the primary key and its value is the document id. It uniquely identifies each document in an index, ensuring that it is impossible to have two exactly identical documents present in the same index".
Since we will only have one index, each primary field value has to be unique. An easy way to do this is to add a new column that has a number or unique identifier. We can go from 1 to X, with X being the id of the last line in the breach index. Another way to do this is to add a random string before each line:
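A sketch of the random-string approach, assuming a line-per-record file called breach.txt (uuidgen ships with util-linux on most distributions):

```
# prepend a random identifier to every line (noticeably slow on large files)
while IFS= read -r line; do
  printf '%s,%s\n' "$(uuidgen)" "$line"
done < breach.txt > breach_random_ids.txt
```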
This, in my opinion, adds far more overhead to a file than going from 1 to X does. In addition, you can never be sure that there will not be a collision between two random strings in the long run.
I ended up going with something like this, but modified for each use case:
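A minimal sketch of the numbering approach, assuming a headerless CSV with email and password columns; Meilisearch expects the first line of a CSV to be a header naming the attributes, so one is added at the same time:

```
# add a header row plus a sequential id column
awk 'BEGIN { FS=OFS=","; print "id,email,password" } { print NR, $0 }' breach.csv > breach_indexed.csv
```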
Even if the fields are different in each file, as long as there is a unique id, it should work.
To add data, use the POST HTTP request. An example would look like the following:
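A minimal version, assuming a default local instance, the master key from earlier, and the CSV produced above:

```
# push a CSV file into the "breach" index
curl -X POST 'http://localhost:7700/indexes/breach/documents' \
  -H 'Content-Type: text/csv' \
  -H 'Authorization: Bearer <MASTER_KEY>' \
  -T breach_indexed.csv
```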
-T can be replaced by --data-binary. For big files (larger than 1 GB), -T ended up working for me, so I stuck with it.
Viewing Tasks
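Document additions are processed asynchronously, so you can check on their progress through the tasks endpoint. A minimal request against the same local instance:

```
# list recent tasks and their statuses (enqueued, processing, succeeded, failed)
curl -X GET 'http://localhost:7700/tasks' \
  -H 'Authorization: Bearer <MASTER_KEY>'
```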
Deleting Data
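If you need to start over, you can remove either the documents or the whole index. A sketch against the same local instance:

```
# delete every document in the "breach" index but keep the index itself
curl -X DELETE 'http://localhost:7700/indexes/breach/documents' \
  -H 'Authorization: Bearer <MASTER_KEY>'

# or delete the index entirely
curl -X DELETE 'http://localhost:7700/indexes/breach' \
  -H 'Authorization: Bearer <MASTER_KEY>'
```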
ELK Stack
From the website, "Elasticsearch is a search and analytics engine. Logstash is a server‑side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a "stash" like Elasticsearch. Kibana lets users visualize data with charts and graphs in Elasticsearch."
Setup
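If you do not already have the stack running, a minimal single-node setup with Docker looks something like this; the image versions are assumptions, and security is disabled purely for a local lab:

```
docker network create elastic

# Elasticsearch on port 9200
docker run -d --name elasticsearch --net elastic -p 9200:9200 \
  -e "discovery.type=single-node" -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:8.11.1

# Kibana on port 5601, pointed at the Elasticsearch container
docker run -d --name kibana --net elastic -p 5601:5601 \
  -e "ELASTICSEARCH_HOSTS=http://elasticsearch:9200" \
  docker.elastic.co/kibana/kibana:8.11.1
```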
I would recommend using Postman, which makes it easier to make HTTP requests through a web interface.
Creating an Index and adding Data
Although I will be using Postman, I will be adding the full commands I use below.
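Creating the index itself is a single request; the example below assumes a local cluster on port 9200 with security disabled (add credentials if your cluster requires them):

```
# create an empty index called "breach" with default settings
curl -X PUT 'http://localhost:9200/breach'
```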
I then started to clean up a breach I have in CSV format so that I could upload it to Elasticsearch. I only had two fields: email and plain-text password. I was able to upload it to Elasticsearch using the following Logstash config:
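A pipeline along these lines reads the CSV and sends each row to the "breach" index; the file path and host are assumptions, and the column names match the two fields above:

```
input {
  file {
    path => "/path/to/breach.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  csv {
    separator => ","
    columns => ["email", "password"]
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "breach"
  }
}
```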
Run Logstash:
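Assuming the config above is saved as breach.conf and Logstash was installed from the official packages:

```
# run the pipeline; Logstash keeps tailing the file until you stop it
/usr/share/logstash/bin/logstash -f /path/to/breach.conf
```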
Uploading millions of records took a while. After this, I created an index pattern in Kibana (Stack Management -> Index patterns -> Create index pattern). I chose the name of the index I had already created in Elasticsearch ("breach").
After the index pattern is created, clicking on the "breach" index pattern shows the fields that were extracted.
The data is now visible in Discover under Analytics.
The data is now searchable through the web UI, as well as through Postman or the command line:
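For example, a query from the command line might look like this (the email address is just a placeholder):

```
# full-text match against the email field of the "breach" index
curl -X GET 'http://localhost:9200/breach/_search' \
  -H 'Content-Type: application/json' \
  -d '{ "query": { "match": { "email": "user@example.com" } } }'
```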
You can go all out with Kibana dashboards, adding statistics and graphs on emails, passwords, etc., but that is out of the scope of this blog.
Recommendations
Name all breaches in the following syntax: breach_<name of breach>
You can then use one index pattern to cover all breaches while still giving each breach its own index.
Clean data to one format
This way all the data is organized (I use CSV, with the "," as the delimiter)
Save backups of the unedited breach
There is a chance that your clean-up might have missed a password or another field
Also, you can start over from square one in case the data gets corrupted
Statistics
| grep -i | ripgrep -aFiN | Search Engine | File Size (Raw) |
| --- | --- | --- | --- |
| 0.038s | 0.028s | 0.029s (Meilisearch) | 122 MB |
| 16.681s | 9.564s | 983ms (Meilisearch) | 6.6 GB |
| 22.303s | 28.367s | 883ms (Elasticsearch) | 9.4 GB |
The Elasticsearch search in the last row was run against roughly 313,900,000 documents. 9.4 GB of raw data ended up being 65.1 GB of indexed data.
Sources