From patchwork Fri Dec 5 09:16:31 2025
X-Patchwork-Submitter: Gyorgy Sarvari
X-Patchwork-Id: 2028
From: Gyorgy Sarvari <skandigraun@gmail.com>
To: openembedded-core@lists.openembedded.org
Subject: [PoC 0/1] LLM enriched CVE entries
Date: Fri, 5 Dec 2025 10:16:31 +0100
Message-ID: <20251205091632.1268768-1-skandigraun@gmail.com>
X-Groupsio-URL: https://lists.openembedded.org/g/openembedded-core/message/227343

Hello,
tl;dr: This is a proof of concept about using an LLM to enrich CVE feeds. Links to the results are further down. If you think it's useful, please say so. If you think it's not useful, please say so too.

Motivation: the CVE checker associates CVEs with recipes based on the CPE information in the CVE entry. Unfortunately quite a few CVE entries are missing this information entirely, making it impossible to associate them with any recipe. Looking at this year so far, over 66,000 CVEs have been opened, of which more than 15,000 are missing CPEs. Older entries seem to have a better CPE-to-CVE ratio, but for this PoC I'm mostly interested in the latest vulnerabilities.

The idea: when the CPE information is missing, try to derive it from the human-language description and the reference links of the CVE, using an LLM. The intuition is that a good portion of the derived data would be usable, and even though it wouldn't be perfect, it would catch more valid CVEs than we do now.

Below I describe my experiments and experiences with this idea.

The setup: I used a local LLM with ollama, running the llama3.1:8b model. The hardware is a Ryzen 5950X, 32GB RAM, and an RTX 3060 with 12GB VRAM, running EndeavourOS. The prompt is part of the patch. The initial load takes a very long time (that's why I restricted the scope for this PoC). Deriving a single missing CPE ID takes about a second.

My initial approach was to process all post-2020 CVEs. It took about an hour to get to the start of 2024, but the run failed during testing (due to a processing bug in my code; performance was unaffected). After fixing my code I restricted processing to entries from 2025 only - I finally wanted some results to see if it makes any difference. Processing the current year ultimately took 280 minutes. (I also skipped the CVEs associated with the Linux kernel to speed up the process.)
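For illustration, the derivation step could look roughly like this - a minimal sketch assuming ollama's /api/generate endpoint and a hypothetical prompt (the real prompt and response handling live in the patch, and may differ):

```python
import json
import urllib.request

# Hypothetical prompt, not the one from the patch. The double braces
# escape the literal JSON example inside str.format().
PROMPT_TEMPLATE = (
    "Derive the CPE 2.3 identifier of the software affected by this CVE. "
    'Reply only with JSON of the form {{"CPE_ID": '
    '"cpe:2.3:a:vendor:product:version:*:*:*:*:*:*:*"}}.\n'
    "Description: {description}"
)

def build_prompt(description: str) -> str:
    return PROMPT_TEMPLATE.format(description=description)

def derive_cpe(description, model="llama3.1:8b",
               url="http://localhost:11434/api/generate"):
    """Send one CVE description to a local ollama server and parse the reply."""
    payload = json.dumps({
        "model": model,
        "prompt": build_prompt(description),
        "stream": False,
        "format": "json",  # ask ollama to constrain the output to JSON
    }).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        answer = json.loads(resp.read())["response"]
    try:
        return json.loads(answer).get("CPE_ID")
    except json.JSONDecodeError:
        # The model occasionally returns malformed JSON; drop those entries.
        return None
```

As noted above, even with "format": "json" the model can hallucinate the key name (e.g. "OVP_ID"), so the caller still has to treat a missing "CPE_ID" as a dropped entry.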
The process fully used 2 CPU cores, suggesting it could be parallelized, but on closer inspection the GPU turned out to be the bottleneck: nvidia-smi showed 96-98% utilization, with ollama using ~5.5GB VRAM. (Another idea would be to group CVEs together before sending them to the LLM, but that's untested.)

While processing the entries I noticed that the model produced some strange hallucinations. I did expect incorrect vendor/product details, but surprisingly it also produced incorrect data structures in its response - for example, instead of the "CPE_ID" key it sometimes returns "OVP_ID". This affects only a very small portion of the responses, and currently I drop these.

It also tends to drop some asterisks from the final CPE ID. This happens quite frequently, in around 20% of the cases. I've heard that intimidating the LLM with some F-bombs in the prompt helps, but I haven't tried that yet - for now I try to salvage the IDs that are only missing one asterisk. This also means that the CPE derivation is not 100% reproducible - but the idea is that only a small portion is non-reproducible, and we are also only interested in a small portion. In the code you might also notice that I do some postprocessing of the response... well, LLMs are like that today, I guess. There is no free lunch.

Here are some links with the actual results, executed on a few-days-old oe-core. Only CVE-2025-* entries were processed with the LLM:

Without LLM, cve-summary.json (~30MB uncompressed): https://sarvari.me/yocto/llm/without_llm.json.tar.gz
Without LLM, human-readable unpatched extract: https://sarvari.me/yocto/llm/no-llm.txt
With LLM, cve-summary.json (~30MB uncompressed): https://sarvari.me/yocto/llm/with_llm.json.tar.gz
With LLM, human-readable unpatched extract: https://sarvari.me/yocto/llm/llm.txt
With LLM + commentary about each LLM-guessed entry (LibreOffice Calc, I recommend this): https://sarvari.me/yocto/llm/llm_cves_with_comments.ods

This patch is just a proof of concept.
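The asterisk-salvage step mentioned earlier could be sketched like this - a hypothetical helper assuming well-formed CPE 2.3 strings (13 colon-separated fields, ignoring escaped colons inside field values); the patch's actual postprocessing may differ:

```python
def salvage_cpe(cpe):
    """Try to repair an LLM-generated CPE 2.3 ID with one dropped asterisk.

    A CPE 2.3 ID splits into 13 colon-separated fields
    (cpe:2.3:part:vendor:product:version:update:edition:language:
    sw_edition:target_sw:target_hw:other). If exactly one trailing
    field is missing, pad it with "*"; anything shorter is dropped.
    Simplification: real CPE values may contain escaped colons ("\\:"),
    which a plain split() would miscount.
    """
    fields = cpe.split(":")
    if len(fields) == 13:
        return cpe              # already well-formed
    if len(fields) == 12:
        return cpe + ":*"       # salvage: one asterisk missing at the end
    return None                 # too damaged, drop the entry
```

This only recovers the "missing one trailing asterisk" case the mail describes; IDs missing a field in the middle would need vendor/product context to repair and are safer to drop.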
I'm not sure if or how it could be integrated into the project's infrastructure - the initial load in particular is very heavy, and the patch requires GPU(s).

As a personal opinion, the final result turned out better than I initially expected. But I can't deny that it also brings noise, especially due to missing or incorrectly extracted version info. And of course it doesn't make the CVE info complete - it just extends what we have now.

If you would like to try it out, I recommend deleting/backing up the downloads/CVE_CHECK2/nvdfkie_1-1.db file first. (Delete it after trying as well, because the patch changes the schema a bit.)

What do you think? If you are not completely averse to LLMs and this idea, I could spend more time on this and submit something that's more than a PoC.

Looking forward to all kinds of feedback, opinions and recommendations, be they positive or negative.

Thank you for coming to my TED talk.

---
Gyorgy Sarvari (1):
  cve-update-db: bolt LLM on top of it

 meta/classes/cve-check.bbclass                |   6 +-
 .../recipes-core/meta/cve-update-db-native.bb | 138 ++++++++++++++++--
 2 files changed, 131 insertions(+), 13 deletions(-)