Scraping a Single Taobao Product Page with PHP

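Taobao data can be retrieved through the API that Taobao provides. If you only need public information such as a product's image and name for display on your own site, PHP's file_get_contents function is enough.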

Approach:

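file_get_contents(url) fetches the page at the given URL (for example http://www.baidu.com) and returns its source as a single string. Combined with regular-expression functions such as preg_match and preg_replace, you can then extract specific elements (div, img, and so on) from that page. The prerequisite is that Taobao's single-product page has a fixed structure; for example, the img element of the 500px product image has the id J_ImgBooth.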

Implementation (fetching the 500px image, name, price, attributes, and description):

The code is as follows:
$text = file_get_contents("http://item.taobao.com/item.htm?id=2380347279"); // store the content of the page at this URL in $text
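file_get_contents returns false when the request fails, so it helps to check the result before running any regex on it. The following is a minimal sketch under stated assumptions: the helper name fetchPage, the timeout, and the user-agent string are all hypothetical and not part of the original article.

```php
<?php
// Hypothetical helper (not from the article): fetch a page and return its
// raw source, or null if the request fails.
function fetchPage(string $url, int $timeout = 10): ?string
{
    // The 'http' options only affect http(s) URLs; the values are illustrative.
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeout,
            'user_agent' => 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)',
        ],
    ]);

    // file_get_contents returns false on failure; @ suppresses the warning
    // so that the caller can handle the error through the return value.
    $raw = @file_get_contents($url, false, $context);

    return $raw === false ? null : $raw;
}
```

With this helper, the fetch step could be written as $text = fetchPage("http://item.taobao.com/item.htm?id=2380347279"); followed by a check that $text is not null. The returned string is left in the page's original encoding, so the iconv calls in the later steps still apply.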

A. Getting the 500px image:

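The code is as follows:
preg_match('/<img[^>]*id="J_ImgBooth"[^>]*src="([^"]*)"[^>]*>/', $text, $img);
// Match the img tag whose id is J_ImgBooth; $img[0] is the whole img tag and $img[1] is the image URL. The pattern assumes that id appears before src inside the tag.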
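The image step can be tried offline against a small piece of markup. This is only a sketch: the sample HTML, the example URL, and the helper name extractImageSrcById are assumptions made for illustration, and the real Taobao markup may differ.

```php
<?php
// Sample markup written for this example; the real page may differ.
$text = '<div><img alt="demo" id="J_ImgBooth" src="http://img.example.com/item_500.jpg" /></div>';

// Hypothetical helper: return the src of the <img> with the given id, or null.
function extractImageSrcById(string $html, string $id): ?string
{
    // preg_quote escapes regex metacharacters in the id. The pattern still
    // assumes the id attribute comes before src inside the tag.
    $pattern = '/<img[^>]*id="' . preg_quote($id, '/') . '"[^>]*src="([^"]*)"[^>]*>/i';

    // preg_match returns 1 on a match, 0 on no match, and false on error.
    if (preg_match($pattern, $html, $m) === 1) {
        return $m[1];
    }
    return null;
}

$src = extractImageSrcById($text, 'J_ImgBooth');
echo $src === null ? "image not found\n" : $src . "\n";
```

Returning null when nothing matches makes a change in the page structure visible to the caller instead of silently producing an empty value.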

B. Getting the name:

The code is as follows:
preg_match('/<title>([^<>]*)<\/title>/', $text, $title);
// The product name in the page body has no distinctive class or id, which makes it hard to match with a regex, so we take the content of the <title> tag instead. The title is usually the product name (with some discrepancies in practice). $title[0] is the whole title tag and $title[1] is its content.
$title = iconv('GBK', 'UTF-8', $title[1]);
// If your site uses UTF-8, the text must be converted (Taobao pages are GBK-encoded).

C. Getting the price:

The code is as follows:
preg_match('/<([a-z]+)[^>]*id="J_StrPrice"[^>]*>([^<]*)<\/\\1>/is', $text, $price);
// Likewise, $price[2] is the content of the tag whose id is J_StrPrice, $price[0] is the whole tag, and $price[1] is the tag name (strong).
$price = floatval($price[2]); // convert the value to a number before storing it in a database

D. Getting the attributes:

Everything so far sits inside a single tag, so one regex per item is enough. It is a different matter when the target looks like this:

The code is as follows:
…
…
<ul>
…
</ul>
…
…
…

A specific div like this contains an unknown number of nested tags, which makes it very hard to capture with a regex. The closest thing I found online is "/<([a-z]+)[^>]*>([^<>]|(?R))*<\/\\1>/", which uses recursion to match balanced tag pairs, but it cannot target one particular tag, so I had no easy way to capture the div with class="attributes". Taobao's pages have one helpful property, though: their tag structure is essentially fixed. The element that follows the attributes div is either the div with id="description" or the div with class="box J_TBox", so a workaround gets us the attributes markup.

The code is as follows:
preg_match('/<(div)[^>]*class="attributes"[^>]*>.*<\/\\1>/is', $text, $text0);
// The greedy .* makes this match run from the attributes div to the last </div> on the page; the attributes markup is the front part of $text0[0].

$text1 = preg_replace('/<\/div>[^<]*<(div)[^>]*id="description"[^>]*>.*<\/\\1>/is', '', $text0[0]);
// Matches from the </div> just before the description div through to the end and replaces it with '' (that is, deletes it). If the description div immediately follows the attributes div, we are done. The closing </div> of the attributes div is removed as well.

$attributes = preg_replace('/<\/div>[^<]*<(div)[^>]*class="box J_TBox"[^>]*>.*<\/\\1>/is', '', $text1);
// If the box J_TBox div immediately follows the attributes div, this step strips it instead. If the description div came right after the attributes div, this pattern matches nothing and the step does nothing.

E. Getting the description:

After the steps above you may think that any tag on a Taobao page is easy to get (I thought so too). But applying the same method to the description yields only the placeholder text "描述加载中" ("description loading"). The description is not in the page source; it is loaded from elsewhere on Taobao by JavaScript after the page opens.

So we can imitate the page and include its scripts. Not sure which scripts are needed to load the description? Including all of them is safe. Not sure which divs must be present for loading to work? Take a copy of the source and remove divs step by step; you will find that the divs wrapping this placeholder text

The code is as follows:
描述加载中

are required for the description to load. The code is then:

The code is as follows:
preg_match_all('/<script[^>]*>[^<]*<\/script>/is', $text, $content); // the page's script tags
$content = $content[0];
// The required wrapper divs go around the placeholder text in this string.
$description = '
   描述加载中
  ';
foreach ($content as $v) { $description .= iconv('GBK', 'UTF-8', $v); }
// Put $description into your page and the description loads automatically. Note that if several product descriptions are placed on the same page, only one of them will load.
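Section D notes that regular expressions struggle with a div that contains an unknown number of nested tags. One alternative, which the original article does not use, is to parse the page with PHP's bundled DOM extension and select elements with XPath. The sketch below runs on sample markup written for this example: the ids and class names follow the article, but the surrounding structure and the helper name firstText are assumptions. The description is still loaded by JavaScript, so this approach does not retrieve it either.

```php
<?php
// Sample markup written for this example; the ids and classes follow the
// article, but the surrounding structure is an assumption.
$html = <<<HTML
<html><head><title>Demo product</title></head><body>
<strong id="J_StrPrice">128.50</strong>
<div class="attributes"><ul><li>Brand: Demo</li><li>Color: Red</li></ul></div>
<div id="description">loading</div>
</body></html>
HTML;

$doc = new DOMDocument();
// Real pages are rarely valid HTML, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Hypothetical helper: text of the first node matching an XPath query, or null.
function firstText(DOMXPath $xpath, string $query): ?string
{
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$title = firstText($xpath, '//title');
$priceText = firstText($xpath, '//*[@id="J_StrPrice"]');
$price = $priceText === null ? null : floatval($priceText);

// The parser tracks nesting, so the attributes div can be selected directly
// no matter how many tags it contains.
$attributesNode = $xpath->query('//div[@class="attributes"]')->item(0);
$attributesHtml = $attributesNode === null ? '' : $doc->saveHTML($attributesNode);

// Collect the text of each list item inside the attributes div.
$attributes = [];
foreach ($xpath->query('//div[@class="attributes"]//li') as $li) {
    $attributes[] = trim($li->textContent);
}

echo $title, "\n";
echo $price, "\n";
echo implode(' | ', $attributes), "\n";
```

The sample is plain ASCII on purpose. For a real GBK-encoded page, the source would need to be converted or carry a charset declaration before loadHTML is called, because the parser otherwise guesses the encoding.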