Vector DBs for Data Protection

I was recently reading up on vector embeddings and started thinking about different use cases for them. One interesting use case is Data Detection and Response. The thought exercise was this: if we can track data access patterns or data lineage and represent them as embeddings, then the embedding of an anomalous or nefarious access pattern should have very little similarity to the embeddings of legitimate ones.

Let’s say our files have attributes like channel and owner, and that some valid actions on files are copy, rename, zip, etc.

We could represent a legitimate file copy as JSON, e.g.

Copying a file on the desktop from one directory to another

{
	"source": {
		"channel": "desktop",
		"owner": "corporate"
	},
	"action": "copy",
	"dest": {
		"channel": "desktop",
		"owner": "corporate"
	}
}

Here are a few more legitimate scenarios captured as JSON:

Download a file from Salesforce and upload it to a corporate drive

{
	"source": {
		"channel": "salesforce",
		"owner": "corporate"
	},
	"action": "upload",
	"dest": {
		"channel": "gdrive",
		"owner": "corporate"
	}
}

Download a file from personal email and upload it to a personal drive

{
	"source": {
		"channel": "email",
		"owner": "personal"
	},
	"action": "upload",
	"dest": {
		"channel": "gdrive",
		"owner": "personal"
	}
}

You get the idea…

The next step would be to convert the events above into vector embeddings and store them in a vector DB like Pinecone or something similar.
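To make that step concrete, here is a minimal sketch of the encoding side. It rests on several assumptions: a hand-rolled one-hot encoding stands in for a real embedding model, the vocabularies (`CHANNELS`, `OWNERS`, `ACTIONS`) are hypothetical and simply collected from the examples in this post, and a plain Python list stands in for the vector DB. None of it is a real vector DB client API.

```python
# Hypothetical vocabularies, collected from the example events in this post.
CHANNELS = ["desktop", "salesforce", "email", "gdrive"]
OWNERS = ["corporate", "personal"]
ACTIONS = ["copy", "rename", "zip", "upload"]


def one_hot(values, vocabulary):
    """Return a 0/1 list marking which vocabulary entries appear in values."""
    return [1.0 if item in values else 0.0 for item in vocabulary]


def embed_event(event):
    """Turn one file event into a fixed-length vector.

    This is a stand-in for a learned embedding, not a real model.
    """
    actions = event["action"]
    if isinstance(actions, str):  # a single action may be a plain string
        actions = [actions]
    return (
        one_hot([event["source"]["channel"]], CHANNELS)
        + one_hot([event["source"]["owner"]], OWNERS)
        + one_hot(actions, ACTIONS)
        + one_hot([event["dest"]["channel"]], CHANNELS)
        + one_hot([event["dest"]["owner"]], OWNERS)
    )


legitimate_events = [
    {"source": {"channel": "desktop", "owner": "corporate"},
     "action": "copy",
     "dest": {"channel": "desktop", "owner": "corporate"}},
    {"source": {"channel": "salesforce", "owner": "corporate"},
     "action": "upload",
     "dest": {"channel": "gdrive", "owner": "corporate"}},
    {"source": {"channel": "email", "owner": "personal"},
     "action": "upload",
     "dest": {"channel": "gdrive", "owner": "personal"}},
]

# An in-memory list stands in for the vector DB in this sketch.
index = [embed_event(event) for event in legitimate_events]
print(len(index), "vectors of length", len(index[0]))
```

With a real embedding model the vectors would be dense and learned rather than one-hot, but the overall shape (event in, fixed-length vector out, vector stored) would stay the same.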

Then, when a new action happens, we could create an embedding for it and run a cosine-similarity search against the vector DB. If nothing stored there is close to it, the action is an outlier and we alert on it, e.g.

Download a file from Salesforce, rename it, zip it, and upload it to a personal drive

{
	"source": {
		"channel": "salesforce",
		"owner": "corporate"
	},
	"action": ["rename", "zip"],
	"dest": {
		"channel": "gdrive",
		"owner": "personal"
	}
}
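The outlier check itself could be sketched roughly like this. The same caveats apply: `embed_event` is a hypothetical one-hot encoder standing in for a real embedding model, the brute-force loop stands in for a vector DB similarity query, and the `0.8` threshold is an arbitrary assumption that would need tuning on real data.

```python
import math

# Hypothetical vocabularies, collected from the example events in this post.
CHANNELS = ["desktop", "salesforce", "email", "gdrive"]
OWNERS = ["corporate", "personal"]
ACTIONS = ["copy", "rename", "zip", "upload"]


def embed_event(event):
    """One-hot stand-in for a learned embedding of a file event."""
    actions = event["action"]
    if isinstance(actions, str):
        actions = [actions]

    def one_hot(values, vocabulary):
        return [1.0 if item in values else 0.0 for item in vocabulary]

    return (
        one_hot([event["source"]["channel"]], CHANNELS)
        + one_hot([event["source"]["owner"]], OWNERS)
        + one_hot(actions, ACTIONS)
        + one_hot([event["dest"]["channel"]], CHANNELS)
        + one_hot([event["dest"]["owner"]], OWNERS)
    )


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_outlier(event, known_vectors, threshold=0.8):
    """Flag the event if even its nearest known vector is below the threshold."""
    vector = embed_event(event)
    best = max((cosine_similarity(vector, k) for k in known_vectors), default=0.0)
    return best < threshold, best


legitimate_events = [
    {"source": {"channel": "desktop", "owner": "corporate"},
     "action": "copy",
     "dest": {"channel": "desktop", "owner": "corporate"}},
    {"source": {"channel": "salesforce", "owner": "corporate"},
     "action": "upload",
     "dest": {"channel": "gdrive", "owner": "corporate"}},
    {"source": {"channel": "email", "owner": "personal"},
     "action": "upload",
     "dest": {"channel": "gdrive", "owner": "personal"}},
]
index = [embed_event(event) for event in legitimate_events]

suspicious_event = {
    "source": {"channel": "salesforce", "owner": "corporate"},
    "action": ["rename", "zip"],
    "dest": {"channel": "gdrive", "owner": "personal"},
}

flagged, best = is_outlier(suspicious_event, index)
print("outlier:", flagged, "nearest similarity:", round(best, 3))
```

In this toy setup the suspicious event's nearest neighbour is the Salesforce-to-corporate-drive upload, at a similarity of roughly 0.55, so it falls below the assumed threshold and gets flagged. A legitimate event that has never been seen before would be flagged too, so the set of stored events and the threshold both matter a lot.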

Most products doing similar things seem to be using graph DBs. Curious if there are any products out there that do something similar via embeddings/vector DBs.