[PoC,0/1] LLM enriched CVE entries

Message ID 20251205091632.1268768-1-skandigraun@gmail.com

Gyorgy Sarvari Dec. 5, 2025, 9:16 a.m. UTC
Hello,

tl;dr: This is a proof of concept about using an LLM to enrich CVE feeds. 
Links are somewhere down below to see the difference. If you think it's 
useful, please say so. If you think it's not useful, please say so too.

Motivation: the CVE checker associates CVEs with recipes based on the 
CPE information in the CVE entry. Unfortunately there are quite a few 
CVE entries missing this information entirely, making it impossible to 
associate them with any recipes. Looking at this year so far, over 66000 
CVEs have been opened, of which over 15000 are missing CPEs. Older entries 
seem to have a better CPE-to-CVE ratio, but for this PoC I'm mostly 
interested in the latest vulnerabilities.

The idea: in case the CPE information is missing, try to derive it from 
the human-language description and the reference links of the CVE, using 
an LLM. The intuition is that a good portion of the derived data would be 
usable, and even though it wouldn't be perfect, it would catch more valid 
CVEs than the checker does without it.

Here I would like to describe my experiments and experiences with the above idea.

The setup: I used a local LLM with ollama, running the llama3.1:8b model. 
The hardware is a Ryzen 5950X, 32GB RAM and an RTX 3060 with 12GB VRAM, 
running EndeavourOS. The prompt is part of the patch.
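For reference, talking to the local ollama instance boils down to something 
like this (a sketch using ollama's /api/generate endpoint on its default 
port; the request options shown are assumptions, the real code is in the 
patch):

import json
import requests

# Sketch: send the prompt to a local ollama instance and parse the reply.
# 11434 is ollama's default port; "format": "json" asks it to constrain
# the output to valid JSON.
def derive_cpe(prompt):
    resp = requests.post("http://localhost:11434/api/generate",
                         json={"model": "llama3.1:8b",
                               "prompt": prompt,
                               "stream": False,
                               "format": "json"},
                         timeout=300)
    resp.raise_for_status()
    answer = json.loads(resp.json()["response"])
    return answer.get("CPE_ID")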

The initial load takes a very long time (that's why I restricted the amount 
of CVEs processed for this PoC). Deriving a missing CPE ID takes about a 
second. My initial approach was to work on all post-2020 CVEs. It took about 
an hour to process up to the start of 2024, but it failed during a test run 
(due to a processing bug in my code; the performance remained unchanged). 
After fixing my code I restricted the processing to only the entries from 
2025 - I finally wanted to see some results and whether it makes any 
difference. Processing the current year ultimately took 280 minutes. (I have 
also skipped the CVEs associated with the linux kernel to speed up the 
process.)

The process was fully using 2 CPU cores, indicating that it could be 
parallelized, but looking further it turned out that my GPU was the 
bottleneck: nvidia-smi showed 96-98% utilization, and ollama was using 
~5.5GB of VRAM. (Another idea would be to group the CVEs before sending 
them to the LLM, but it's untested - see the sketch below.)
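If someone wants to experiment with that grouping idea, it could look 
roughly like this (purely hypothetical, not part of the patch):

# Hypothetical, untested: group several CVEs into a single prompt and ask
# for a JSON object mapping each CVE ID to its derived CPE_ID, to spend
# less GPU time on per-request overhead.
def build_batch_prompts(cves, batch_size=10):
    for i in range(0, len(cves), batch_size):
        batch = cves[i:i + batch_size]
        lines = ["%s: %s" % (c["id"], c["description"]) for c in batch]
        yield ("For each CVE below, derive its CPE 2.3 identifier. Answer "
               "as JSON mapping the CVE ID to its CPE_ID.\n" + "\n".join(lines))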

While processing the entries I noticed that the model produced some strange 
hallucinations - I did expect incorrect vendor/product details, but strangely 
it also produced incorrect data structures in its response; for example, it 
sometimes returns an "OVP_ID" key instead of "CPE_ID". This only affects a 
very small portion of the entries, and currently I drop these. It also tends 
to drop asterisks from the final CPE ID, which happens quite frequently, in 
around 20% of the cases. I've heard that intimidating the LLM with some 
F-bombs in the prompt helps, but I haven't tried that yet - for now I try to 
salvage the ones that only miss one asterisk (see the sketch below).
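The salvaging works roughly like this (a sketch that only pads a missing 
trailing asterisk; the exact heuristics in the patch may differ):

# Sketch of the post-processing mentioned above; the real heuristics live
# in the patch. A well-formed CPE 2.3 string splits into 13 colon-separated
# fields: cpe:2.3:part:vendor:product:version:update:edition:language:
# sw_edition:target_sw:target_hw:other.
def salvage_cpe(answer):
    cpe = answer.get("CPE_ID")   # replies with made-up keys (e.g. OVP_ID) are dropped
    if not cpe or not cpe.startswith("cpe:2.3:"):
        return None
    fields = cpe.split(":")
    if len(fields) == 13:
        return cpe
    if len(fields) == 12:
        return cpe + ":*"        # exactly one trailing asterisk missing - pad it
    return None                  # anything worse is dropped as unusable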

This also means that the CPE derivation is not 100% reproducible - but the 
idea is that only a small portion is affected, and we are only interested in 
a small portion anyway. In the code you might also notice that I do some 
postprocessing of the response... well, LLMs are like that today, I guess. 
There is no free lunch.

Here are some links to the actual results, generated on a few-days-old 
oe-core. Only CVE-2025-* entries were processed with the LLM:

Without LLM, cve-summary.json (~30MB uncompressed): https://sarvari.me/yocto/llm/without_llm.json.tar.gz
Without LLM, human readable unpatched extract: https://sarvari.me/yocto/llm/no-llm.txt
With LLM, cve-summary.json (~30MB uncompressed): https://sarvari.me/yocto/llm/with_llm.json.tar.gz
With LLM, human readable unpatched extract: https://sarvari.me/yocto/llm/llm.txt
With LLM + commentary about each LLM-guessed entry (LibreOffice Calc, I recommend this): https://sarvari.me/yocto/llm/llm_cves_with_comments.ods

This patch is just a proof of concept. I'm not sure if/how it could be 
integrated into the project's infra - the initial load in particular is 
very heavy, and the patch requires GPU(s).

As a personal opinion, the final result turned out to be better than I 
initially expected. But I can't deny that it also brings noise, especially 
due to missing/incorrectly extracted version info. And of course it doesn't 
make the CVE info complete - it just extends what we have now.

If you would like to try it out, I recommend deleting/backing up the 
downloads/CVE_CHECK2/nvdfkie_1-1.db file beforehand. (Delete it afterwards 
as well, because the patch changes the schema a bit.)

What do you think? If you are not completely averse to LLMs and this idea, 
I could spend more time on this, and submit something that's more than a PoC.

Looking forward to all kinds of feedback, opinions and recommendations, be 
they positive or negative.

Thank you for coming to my TED talk.

---

Gyorgy Sarvari (1):
  cve-update-db: bolt LLM on top of it

 meta/classes/cve-check.bbclass                |   6 +-
 .../recipes-core/meta/cve-update-db-native.bb | 138 ++++++++++++++++--
 2 files changed, 131 insertions(+), 13 deletions(-)