Grep an XML value and export it when it contains digits

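I have an XML file that contains links to products and categories. Every category link ends in a word and a slash, for example https://url.com/category/subcatgory, and the links are enclosed in <loc> </loc>.

Every product link, however, ends in a 6-digit number, for example https://url.com/category/subcategory/product-name-of-something-154555

I am trying to grep for this while fetching the file with wget, so for now I am only experimenting with the grep part; I already know how to fetch the file and open it.

This is the code I have been running, but it exports all links, even the categories.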

grep -Po "(?<=<loc>)(.*)[0-9]{6}/(?=</loc>)" nameofmyfile.xml

I did manage to grep every 6-digit code with this:

grep -oP "(?<=<loc>)*[0-9]{6}/(?=</loc>)" nameofmyfile.xml

But I also need the part of the link in front of it, because running this only gives me: 666444/.

The file is structured like this:

<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>https://somelink.com/category/building-materials/concrete/hand-tools/</loc>
        <lastmod>2022-09-11T02:10:42+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145890/</loc>
        <lastmod>2022-09-11T02:11:06+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145489/</loc>
        <lastmod>2022-09-11T02:11:14+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/building-materials/concrete/hand-tools/hammer/hammer-145488/</loc>
        <lastmod>2022-09-11T02:10:42+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/inside/heating/floor-heating/pert-222-010274/</loc>
        <lastmod>2022-09-11T02:11:06+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/building-materials/paint/</loc>
        <lastmod>2022-09-11T02:11:14+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/building-materials/screws-and-nails/</loc>
        <lastmod>2022-09-11T02:10:42+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/building-materials/concrete/power-toools/</loc>
        <lastmod>2022-09-11T02:11:06+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/inside/heating/floor-heating/pert-182-010272/</loc>
        <lastmod>2022-09-11T02:11:14+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/inside/heating/floor-heating/pert-202-010273/</loc>
        <lastmod>2022-09-11T02:10:42+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/inside/bathroom/</loc>
        <lastmod>2022-09-11T02:11:06+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>https://somelink.com/category/inside/pipes/draining-pipes-168544/</loc>
        <lastmod>2022-09-11T02:11:14+02:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
</urlset>

How can I extract all links that end in -XXXXXX/ and skip the others? They are inside <loc> </loc>.

Reply from a uj5u.com user:

If you only want to get the digits using grep:

<loc>[^>]*\K\d{6}(?=/</loc>)

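Explanation

<loc> matches literally
[^>]* optionally matches any characters except >
\K forgets what has been matched so far
\d{6} matches 6 digits
(?=/</loc>) is a positive lookahead that asserts /</loc> directly to the right

See the regex demo.

Example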

grep -Po "<loc>[^>]*\K\d{6}(?=/</loc>)" nameofmyfile.xml

Output (for the sample file above)

145890
145489
145488
010274
010272
010273
168544
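
Since the question asks for the whole link rather than only the digits, the same idea can be turned around: keep the URL in the match and use lookarounds only for the <loc> tags. The following is a minimal sketch, not part of the original answer. The function name product_links, the sample text, and the example.com URL are placeholders, and it assumes a grep built with PCRE support (-P), such as GNU grep.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print only the URLs that end in a dash, six digits and a slash.
# (?<=<loc>) and (?=</loc>) keep the tags out of the printed match.
product_links() {
  grep -Po '(?<=<loc>)[^<]*-\d{6}/(?=</loc>)' "$@"
}

# Small sample in the same shape as the sitemap in the question.
sample='<url>
    <loc>https://somelink.com/category/building-materials/paint/</loc>
</url>
<url>
    <loc>https://somelink.com/category/inside/pipes/draining-pipes-168544/</loc>
</url>'

# With no file argument the helper reads standard input.
printf '%s\n' "$sample" | product_links
# prints: https://somelink.com/category/inside/pipes/draining-pipes-168544/

# The download can be piped straight in (placeholder URL):
# wget -qO- 'https://example.com/sitemap.xml' | product_links
```

Calling product_links nameofmyfile.xml works the same way on a saved file, because any arguments are passed through to grep.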

Reply from a uj5u.com user:

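xmllint can be used after wget to get the links that end in -<6 numbers>. The trick is to replace every digit with an underscore and then look for -______ in the result. Here cat stands in for wget, and the trailing - tells xmllint to read from standard input:

cat tmp.xml | xmllint --xpath '//*[local-name()="loc" and contains(translate(.,"0123456789","__________"), "-______")]/text()' -

Result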

https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145890/
https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145489/
https://somelink.com/category/building-materials/concrete/hand-tools/hammer/hammer-145488/
https://somelink.com/category/inside/heating/floor-heating/pert-222-010274/
https://somelink.com/category/inside/heating/floor-heating/pert-182-010272/
https://somelink.com/category/inside/heating/floor-heating/pert-202-010273/
https://somelink.com/category/inside/pipes/draining-pipes-168544/

Or save the wget output to a tmp file and run the query in xmllint's shell mode:

(echo "setrootns"; echo 'cat //defaultns:loc[contains(translate(.,"0123456789","__________"), "-______")]/text()') | xmllint  --shell tmp.xml | grep -v ' ----'

Result

/ > setrootns
/ > cat //defaultns:loc[contains(translate(.,"0123456789","__________"), "-______")]/text()
https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145890/
https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145489/
https://somelink.com/category/building-materials/concrete/hand-tools/hammer/hammer-145488/
https://somelink.com/category/inside/heating/floor-heating/pert-222-010274/
https://somelink.com/category/inside/heating/floor-heating/pert-182-010272/
https://somelink.com/category/inside/heating/floor-heating/pert-202-010273/
https://somelink.com/category/inside/pipes/draining-pipes-168544/
/ >
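
Note that contains() accepts the -______ pattern anywhere in the URL, not only at the end. XPath 1.0 has no ends-with(), so one way to anchor the test is to compare the last 8 characters of the masked string with -______/. This is a sketch under assumptions, not part of the original answer: the variable names and the shortened sample document are made up for illustration, and xmllint (libxml2) must be installed.

```shell
#!/bin/sh
# Mask every digit with "_", then compare the last 8 characters
# (dash + six underscores + slash) instead of using contains().
xpath='//*[local-name()="loc" and substring(translate(., "0123456789", "__________"), string-length(.) - 7) = "-______/"]/text()'

# Shortened sample document in the sitemap namespace (illustrative only).
sample_xml='<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://somelink.com/category/inside/bathroom/</loc></url>
  <url><loc>https://somelink.com/category/inside/pipes/draining-pipes-168544/</loc></url>
</urlset>'

# The trailing "-" makes xmllint read the document from standard input.
printf '%s\n' "$sample_xml" | xmllint --xpath "$xpath" -
```

With this predicate, a URL ending in seven or more digits no longer qualifies, because the eighth character from the end would be an underscore rather than a dash.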

Reply from a uj5u.com user:

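Using your sample data, try the following awk code, written and tested with GNU awk. In short: RS (the record separator) is set to the regex (^|\n[[:space:]]+)<loc>[^<]*<\\/loc>\n, so each <loc>.....</loc> line that it matches ends up in RT. The main block runs only when RT is non-empty, splits RT on - and /, and checks whether the piece just before the trailing slash and </loc> consists of exactly 6 digits. If it does, that piece is printed.

awk -v RS='(^|\n[[:space:]]+)<loc>[^<]*<\\/loc>\n' '
RT{                              # RT holds the text matched by RS (GNU awk only)
  num=split(RT,arr,"[-/]")       # split the <loc> line on - and /
  if(arr[num-2]~/^[0-9]{6}$/){   # piece just before the trailing slash
     print arr[num-2]
  }
}
'  Input_file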

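Here is an online demo of the regex used.

Note: On the regex demo site, the capturing group was changed to a non-capturing group and the / is not doubly escaped, to suit that site. The regex exactly as written in the code above should only be used with GNU awk.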
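
RT is a GNU awk extension, so the code above does not carry over to other awk implementations. Where only a POSIX awk is available, a line-oriented sketch along the following lines may be enough. It is not part of the original answer, the function name loc_links is hypothetical, and it assumes that each <loc> element sits on a single line, as in the sample file. It prints the whole link rather than only the six digits.

```shell
#!/bin/sh
# Hypothetical helper: use <loc> and </loc> as field separators, so field 2 is
# the URL, and keep it when it ends in a dash, six digits and a slash.
# The six [0-9] classes are spelled out because some awks lack {6} intervals.
loc_links() {
  awk -F'</?loc>' '
    NF > 2 && $2 ~ /-[0-9][0-9][0-9][0-9][0-9][0-9]\/$/ { print $2 }
  ' "$@"
}

printf '%s\n' \
  '        <loc>https://somelink.com/category/inside/bathroom/</loc>' \
  '        <loc>https://somelink.com/category/inside/heating/floor-heating/pert-202-010273/</loc>' |
  loc_links
# prints: https://somelink.com/category/inside/heating/floor-heating/pert-202-010273/
```

Running loc_links Input_file applies the same filter to a saved file.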
