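I have an XML file that contains links to products and categories. Each category link ends with a word and a slash, e.g. https://url.com/category/subcatgory, and every link is enclosed in <loc> </loc>.
Each product link, however, ends with 6 digits, e.g. https://url.com/category/subcategory/product-name-of-something-154555
I want to grep for these while fetching the file with wget, so for now I am only experimenting with the grep part; I already know how to fetch and open the file.
This is the command I have been running, but it outputs all the links, including the categories: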
grep -Po "(?<=<loc>)(.*)[0-9]{6}/(?=</loc>)" nameofmyfile.xml
I did manage to grep every 6-digit code with this command:
grep -oP "(?<=<loc>)*[0-9]{6}/(?=</loc>)" nameofmyfile.xml
But I also need the part of the link that comes before the digits, because running it only gives me: 666444/.
The file is structured like this:
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://somelink.com/category/building-materials/concrete/hand-tools/</loc>
<lastmod>2022-09-11T02:10:42+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145890/</loc>
<lastmod>2022-09-11T02:11:06+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145489/</loc>
<lastmod>2022-09-11T02:11:14+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/building-materials/concrete/hand-tools/hammer/hammer-145488/</loc>
<lastmod>2022-09-11T02:10:42+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/inside/heating/floor-heating/pert-222-010274/</loc>
<lastmod>2022-09-11T02:11:06+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/building-materials/paint/</loc>
<lastmod>2022-09-11T02:11:14+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/building-materials/screws-and-nails/</loc>
<lastmod>2022-09-11T02:10:42+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/building-materials/concrete/power-toools/</loc>
<lastmod>2022-09-11T02:11:06+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/inside/heating/floor-heating/pert-182-010272/</loc>
<lastmod>2022-09-11T02:11:14+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/inside/heating/floor-heating/pert-202-010273/</loc>
<lastmod>2022-09-11T02:10:42+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/inside/bathroom/</loc>
<lastmod>2022-09-11T02:11:06+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://somelink.com/category/inside/pipes/draining-pipes-168544/</loc>
<lastmod>2022-09-11T02:11:14+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
</urlset>
How can I extract all the links that end in -XXXXXX/ and skip the others? They are inside <loc> </loc>.
A uj5u.com user replied:
If you only want to get the numbers with grep:
<loc>[^>]*\K\d{6}(?=/</loc>)
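Explanation
<loc>  matches literally
[^>]*  optionally matches any characters except >
\K  forgets what has been matched so far
\d{6}  matches 6 digits
(?=/</loc>)  positive lookahead, asserts /</loc> directly to the right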
Example
grep -Po "<loc>[^>]*\K\d{6}(?=/</loc>)" nameofmyfile.xml
Output
145890
145489
145488
010274
010272
010273
168544
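The command above prints only the six digits, while the question asks for the full links. A possible variation, sketched here under the assumption that GNU grep with -P support is available, moves \K right after the opening tag so the whole URL is kept. The file name sample.xml and its shortened contents are made up for illustration.

```shell
# Build a small sample file (hypothetical name) that mirrors the sitemap in the question.
cat > sample.xml <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://somelink.com/category/building-materials/paint/</loc>
</url>
<url>
<loc>https://somelink.com/category/inside/heating/floor-heating/pert-222-010274/</loc>
</url>
<url>
<loc>https://somelink.com/category/inside/pipes/draining-pipes-168544/</loc>
</url>
</urlset>
EOF

# <loc>\K     match the opening tag, then drop it from the reported match
# [^<]*       the URL body, up to the next tag
# -\d{6}/     a hyphen, six digits and the trailing slash
# (?=</loc>)  the closing tag must follow, but is not part of the match
links=$(grep -Po '<loc>\K[^<]*-\d{6}/(?=</loc>)' sample.xml)
printf '%s\n' "$links"
```

The same pattern should also work on a stream, for example wget -qO- "$sitemap_url" | grep -Po '<loc>\K[^<]*-\d{6}/(?=</loc>)', where sitemap_url is a placeholder for the real sitemap address.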
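A uj5u.com user replied:
xmllint can be used after wget to get the links that end in -<6 numbers>. The trick is to replace every digit with an underscore and then test for the resulting pattern.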
cat tmp.xml | xmllint --xpath '//*[local-name()="loc" and contains(translate(.,"0123456789","__________"), "-______")]/text()' -
Result
https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145890/
https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145489/
https://somelink.com/category/building-materials/concrete/hand-tools/hammer/hammer-145488/
https://somelink.com/category/inside/heating/floor-heating/pert-222-010274/
https://somelink.com/category/inside/heating/floor-heating/pert-182-010272/
https://somelink.com/category/inside/heating/floor-heating/pert-202-010273/
https://somelink.com/category/inside/pipes/draining-pipes-168544/
Or save the wget output to a tmp file:
(echo "setrootns"; echo 'cat //defaultns:loc[contains(translate(.,"0123456789","__________"), "-______")]/text()') | xmllint --shell tmp.xml | grep -v ' ----'
Result
/ > setrootns
/ > cat //defaultns:loc[contains(translate(.,"0123456789","__________"), "-______")]/text()
https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145890/
https://somelink.com/category/building-materials/concrete/hand-tools/screws/screws-145489/
https://somelink.com/category/building-materials/concrete/hand-tools/hammer/hammer-145488/
https://somelink.com/category/inside/heating/floor-heating/pert-222-010274/
https://somelink.com/category/inside/heating/floor-heating/pert-182-010272/
https://somelink.com/category/inside/heating/floor-heating/pert-202-010273/
https://somelink.com/category/inside/pipes/draining-pipes-168544/
/ >
A uj5u.com user replied:
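With the samples you have shown, please try the following awk code. Written and tested in GNU awk. A simple explanation: RS (the record separator) is set to the regex (^|\n[[:space:]]*)<loc>[^<]*<\\/loc>\n, so each matched <loc>.....</loc> line ends up in RT. The main block runs only when RT is non-empty and splits its value on - and /. It then checks whether the part after the last - (before the final slash and </loc>) consists of 6 digits and, if so, prints it as required.
awk -v RS='(^|\n[[:space:]]*)<loc>[^<]*<\\/loc>\n' '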
RT{
num=split(RT,arr,"[-/]")
if(arr[num-2]~/^[0-9]{6}$/){
print arr[num-2]
}
}
' Input_file
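Note: in the online demo of this regex, the capturing group is changed to a non-capturing group and the / is not double-escaped, to suit the regex demo site. The regex as written in the code above should be used only in GNU awk.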
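The GNU awk code above relies on RT and prints only the number. For comparison, here is a sketch that prints the whole link and avoids GNU-only features. It assumes each <loc> element sits on a single line, as in the sample. The file name sample.xml and its shortened contents are made up for illustration.

```shell
# Build a small sample file (hypothetical name) that mirrors the sitemap in the question.
cat > sample.xml <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://somelink.com/category/building-materials/concrete/hand-tools/</loc>
</url>
<url>
<loc>https://somelink.com/category/building-materials/concrete/hand-tools/hammer/hammer-145488/</loc>
</url>
<url>
<loc>https://somelink.com/category/inside/heating/floor-heating/pert-182-010272/</loc>
</url>
</urlset>
EOF

# Split each line on < and >, so for "<loc>URL</loc>" field 2 is "loc" and field 3 is the URL.
# The six digits are spelled out because some awk implementations lack {6} interval support.
links=$(awk -F'[<>]' '
$2 == "loc" && $3 ~ /-[0-9][0-9][0-9][0-9][0-9][0-9]\/$/ { print $3 }
' sample.xml)
printf '%s\n' "$links"
```

Because the test is anchored with $, only links whose last path segment ends in a hyphen and six digits are printed. Category links are skipped.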