This is my first time working with Python, so the code is very simple. I used PyCharm as the IDE with Python 3.4, which is convenient.

Some Python modules depend on other packages at install time. To get around this, first run

pip install wheel

Then download the whl file and install it directly:

pip install lxml-3.5.0-cp34-none-win32.whl

Define a class that describes the record to be saved:

class CnblogArticle:
    def __init__(self):
        self.num = ''
        self.category = ''
        self.title = ''
        self.author = ''
        self.postTime = ''
        self.articleComment = ''
        self.articleView = ''

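The CSDN blog channel has only 18 pages, so the script parses 18 pages. The main function contains both a multithreaded version (the commented-out part) and a plain sequential version.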
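As an alternative to managing Thread objects by hand, the multithreaded variant can be sketched with the standard library's ThreadPoolExecutor. This is only a sketch: parse_all, page_url and fake_parse_page are hypothetical names, max_workers=4 is an arbitrary choice, and the stand-in parser replaces the real parsePage so the example runs without network access. The page/page_num convention (URL page i, label i+1) follows the script in this post.

```python
from concurrent.futures import ThreadPoolExecutor

PAGE_COUNT = 18  # number of pages mentioned in the text


def page_url(page):
    # Same URL pattern as the script in this post
    return 'http://blog.csdn.net/?&page={}'.format(page)


def parse_all(parse_page, page_count=PAGE_COUNT, max_workers=4):
    """Call parse_page(url, page_num) for every page using a small thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(parse_page, page_url(i), i + 1)
                   for i in range(page_count)]
        # result() re-raises any exception that happened in the worker thread
        return [future.result() for future in futures]


def fake_parse_page(url, page_num):
    # Hypothetical stand-in for CnblogUtils.parsePage; no network access
    return page_num, url


results = parse_all(fake_parse_page)
print(len(results), results[0])
```

A bounded pool keeps the number of simultaneous requests small, which may be gentler on a slow connection than starting 18 threads at once.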

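Note: each item is delimited by class=blog_list. Most items contain an element with class=category, but a few do not. This case has to be handled, otherwise the script raises an error.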

<div class="blog_list">
<h1>
<a href="/other/index.html" class="category">[综合]</a>
<a name="49786427" href="http://blog.csdn.net/matrix_space/article/details/49786427" target="_blank">Python: scikit-image canny 边缘检测</a>

<img src="http://static.blog.csdn.net/images/icon-zhuanjia.gif" class="blog-icons" alt="专家" title="专家">
</h1>

<dl>
<dt>
<a href="http://blog.csdn.net/matrix_space" target="_blank">
<img src="http://avatar.csdn.net/F/9/7/3_shinian1987.jpg" alt="shinian1987" />
</a>
</dt>
<dd>这个用例说明canny 边缘检测的用法

import numpy as np
import matplotlib.pyplot as plt
from scipy import ndimage as ndi
from skimage import feature


# Generate noisy image of a square
im = np.zeros((128, 128))
im[3...
</dd>
</dl>
<p>
<a class="tag" href="/tag/details.html?tag=python" target="_blank">python</a>
</p>
<div class="about_info">
<span class="fr digg" id="digg_49786427" blog="1164951" digg="0" bury="0"></span>
<span class="fl">
<a href="http://blog.csdn.net/matrix_space" target="_blank" class="user_name">shinian1987</a>
<span class="time">3小时前</span>
<a href="http://blog.csdn.net/matrix_space/article/details/49786427" target="_blank" class="view">阅读(104)</a>
<a href="http://blog.csdn.net/matrix_space/article/details/49786427#comments" target="_blank" class="comment">评论(0)</a>
</span>
</div>
</div>
<div class="blog_list">
<h1>
<a name="50524490" href="http://blog.csdn.net/u010579068/article/details/50524490" target="_blank">STL_算法 for_each 和 transform 比较</a>

</h1>

<dl>
<dt>
<a href="http://blog.csdn.net/u010579068" target="_blank">
<img src="http://avatar.csdn.net/9/9/B/3_u010579068.jpg" alt="u010579068" />
</a>
</dt>
<dd>C++ Primer 学习中。。。

&#160;

简单记录下我的学习过程&#160;(代码为主)







所有容器适用
/**----------------------------------------------------------------------------------
for_each &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;速度快 &#160; &#160; &#160; &#160; &#160; &#160; &#160;...
</dd>
</dl>
<p>
<a class="tag" href="/tag/details.html?tag=STL_算法" target="_blank">STL_算法</a>
<a class="tag" href="/tag/details.html?tag=for_each" target="_blank">for_each</a>
<a class="tag" href="/tag/details.html?tag=transform" target="_blank">transform</a>
<a class="tag" href="/tag/details.html?tag=STL" target="_blank">STL</a>
</p>
<div class="about_info">
<span class="fr digg" id="digg_50524490" blog="1499803" digg="0" bury="0"></span>
<span class="fl">
<a href="http://blog.csdn.net/u010579068" target="_blank" class="user_name">u010579068</a>
<span class="time">3小时前</span>
<a href="http://blog.csdn.net/u010579068/article/details/50524490" target="_blank" class="view">阅读(149)</a>
<a href="http://blog.csdn.net/u010579068/article/details/50524490#comments" target="_blank" class="comment">评论(0)</a>
</span>
</div>
</div>
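The two items above differ only in whether the class=category link is present. A minimal sketch of one way to handle both cases is to look up each link separately and test for None, rather than inspecting the first a element. extract_category_and_title and SAMPLE are hypothetical names, the HTML is a trimmed copy of the sample above, and the built-in html.parser backend is used here only so the sketch does not need lxml.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the two sample items: the first has a category link, the second does not
SAMPLE = """
<div class="blog_list"><h1>
<a href="/other/index.html" class="category">[综合]</a>
<a name="49786427" href="http://blog.csdn.net/matrix_space/article/details/49786427">Python: scikit-image canny 边缘检测</a>
</h1></div>
<div class="blog_list"><h1>
<a name="50524490" href="http://blog.csdn.net/u010579068/article/details/50524490">STL_算法 for_each 和 transform 比较</a>
</h1></div>
"""


def extract_category_and_title(item):
    """Return (category, title) for one blog_list item; category is None when absent."""
    category_link = item.find('a', 'category')
    title_link = item.find('a', attrs={'name': True})
    category = category_link.get_text(strip=True) if category_link else None
    title = title_link.get_text(strip=True) if title_link else None
    return category, title


soup = BeautifulSoup(SAMPLE, 'html.parser')
results = [extract_category_and_title(item)
           for item in soup.find_all('div', 'blog_list')]
for category, title in results:
    print(category, title)
```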

The Beautiful Soup 4.2.0 documentation can be read directly on the official site.

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
import urllib.request
import time
import threading


class CnblogUtils(object):
    def __init__(self):
        # Browser-like User-Agent sent with every request
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'}
        # Names of all authors seen so far
        self.contentAll = set()

    def getPage(self, url=None):
        # Download the page and parse it with the lxml parser
        request = urllib.request.Request(url, headers=self.headers)
        response = urllib.request.urlopen(request)
        soup = BeautifulSoup(response.read(), "lxml")
        return soup

    def parsePage(self, url=None, page_num=None):
        soup = self.getPage(url)
        itemBlog = soup.find_all('div', 'blog_list')
        for i, itemSingle in enumerate(itemBlog):
            # One CnblogArticle record per blog_list item
            cnArticle = CnblogArticle()
            cnArticle.num = i
            cnArticle.author = itemSingle.find('a', 'user_name').string
            cnArticle.postTime = itemSingle.find('span', 'time').string
            cnArticle.articleComment = itemSingle.find('a', 'comment').string
            cnArticle.articleView = itemSingle.find('a', 'view').string
            # If the first link has a class attribute, it is the category link
            if itemSingle.find('a').has_attr('class'):
                cnArticle.category = itemSingle.find('a', 'category').string
                cnArticle.title = itemSingle.find('a', attrs={'name': True}).string
            else:
                cnArticle.category = "None"
                cnArticle.title = itemSingle.find('a').string
            self.contentAll.add(str(cnArticle.author))
            self.writeFile(page_num, cnArticle.num, cnArticle.author, cnArticle.postTime,
                           cnArticle.articleComment, cnArticle.articleView,
                           cnArticle.category, cnArticle.title)

    def writeFile(self, page_num, num, author, postTime, articleComment, articleView, category, title):
        # Append one tab-separated line per article to a.txt
        fields = ['page_num is {}'.format(page_num), num, author, postTime,
                  articleComment, articleView, category, title]
        with open("a.txt", 'a', encoding='utf-8') as f:
            f.write('\t'.join(str(field) for field in fields) + '\n')


def main(thread_num):
    # time.perf_counter() replaces time.clock(), which is deprecated since Python 3.3
    start = time.perf_counter()
    cnblog = CnblogUtils()
    '''
    thread_list = []
    for i in range(0, thread_num):
        thread_list.append(threading.Thread(target=cnblog.parsePage, args=('http://blog.csdn.net/?&page={}'.format(i), i + 1)))
    for thread in thread_list:
        thread.start()
    for thread in thread_list:
        thread.join()
    print(cnblog.contentAll)
    '''
    for i in range(0, 18):
        cnblog.parsePage('http://blog.csdn.net/?&page={}'.format(i), i + 1)
    end = time.perf_counter()
    print('time = {}'.format(end - start))


if __name__ == '__main__':
    main(18)

Program output:

page_num is 1    0    foruok    18分钟前    评论(0)    阅读(0)    [编程语言]    Windows下从源码编译SKIA
page_num is 1 1 u013467442 31分钟前 评论(0) 阅读(3) [编程语言] Cubieboard学习资源
page_num is 1 2 tuke_tuke 32分钟前 评论(0) 阅读(15) [移动开发] UI组件之AdapterView及其子类关系,Adapter接口及其实现类关系
page_num is 1 3 xiaominghimi 53分钟前 评论(0) 阅读(51) [移动开发] 【COCOS2D-X 备注篇】ASSETMANAGEREX使用异常解决备注->CHECK_JNI/CC‘JAVA.LANG.NOCLASSDEFFOUNDERROR’
page_num is 1 4 shinian1987 1小时前 评论(0) 阅读(64) [综合] Python: scikit-image canny 边缘检测
page_num is 1 5 u010579068 1小时前 评论(0) 阅读(90) None STL_算法 for_each 和 transform 比较
page_num is 1 6 u013467442 1小时前 评论(0) 阅读(94) [编程语言] OpenGLES2.0着色器语言glsl
page_num is 1 7 u013467442 1小时前 评论(0) 阅读(89) [编程语言] OpenGl 坐标转换
page_num is 1 8 AaronGZK 1小时前 评论(0) 阅读(95) [编程语言] bzoj4390【Usaco2015 Dec】Max Flow
page_num is 1 9 AaronGZK 1小时前 评论(0) 阅读(95) [编程语言] bzoj1036【ZJOI2008】树的统计Count
page_num is 1 10 danhuang2012 1小时前 评论(0) 阅读(90) [编程语言] Node.js如何处理健壮性
page_num is 1 11 EbowTang 1小时前 评论(0) 阅读(102) [编程语言]
<LeetCode OJ> 121. Best Time to Buy and Sell Stock
page_num is 1 12 cartzhang 2小时前 评论(0) 阅读(98) [架构设计] 给虚幻4添加内存跟踪功能
page_num is 1 13 u013595419 2小时前 评论(0) 阅读(93) [综合] 第2章第1节练习题3 共享栈的基本操作
page_num is 1 14 ghostbear 2小时前 评论(0) 阅读(115) [系统运维] Dynamics CRM 2016 Series: Overview
page_num is 1 15 u014723529 2小时前 评论(0) 阅读(116) [编程语言] 将由BeanUtils的getProperty方法返回的Date对象的字符串表示还原为对象
page_num is 1 16 Evankaka 2小时前 评论(1) 阅读(142) [架构设计] Jenkins详细安装与构建部署使用教程
page_num is 1 17 Evankaka 2小时前 评论(0) 阅读(141) [编程语言] Ubuntu安装配置JDK、Tomcat、SVN服务器

When the network is slow, the multithreaded version may raise errors.
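One possible way to make the download step more tolerant of a slow connection is to add a timeout and a few retries around urlopen. The following is a rough sketch under assumptions, not part of the original script: fetch_with_retry and its parameters (retries, timeout, delay, opener) are hypothetical names, and the default values are arbitrary. The opener parameter exists only so the function can be exercised without network access.

```python
import time
import urllib.request


def fetch_with_retry(url, retries=3, timeout=10, delay=1.0,
                     opener=urllib.request.urlopen):
    """Return the response body, retrying on network errors (assumes retries >= 1).

    url may be a string or a urllib.request.Request, as accepted by urlopen.
    """
    for attempt in range(retries):
        try:
            response = opener(url, timeout=timeout)
            return response.read()
        except OSError:
            # URLError and socket timeouts are both subclasses of OSError
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(delay)
```

In getPage, the body returned by such a helper could be passed to BeautifulSoup in place of response.read().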

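Once the data has been collected, it can be analyzed, or used for a deeper crawl, for example by using each author name to fetch that author's blog.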
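As a small illustration of the kind of analysis this allows, the sketch below counts posts per author from lines in the tab-separated format that writeFile produces. count_posts_by_author is a hypothetical helper, and the sample lines are taken from the output shown above, with tabs assumed as separators.

```python
from collections import Counter


def count_posts_by_author(lines):
    """Count articles per author; the author is the third tab-separated field."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 8:
            continue  # skip malformed or wrapped lines
        counts[fields[2]] += 1
    return counts


sample_lines = [
    'page_num is 1\t6\tu013467442\t1小时前\t评论(0)\t阅读(94)\t[编程语言]\tOpenGLES2.0着色器语言glsl\n',
    'page_num is 1\t7\tu013467442\t1小时前\t评论(0)\t阅读(89)\t[编程语言]\tOpenGl 坐标转换\n',
    'page_num is 1\t8\tAaronGZK\t1小时前\t评论(0)\t阅读(95)\t[编程语言]\tbzoj4390【Usaco2015 Dec】Max Flow\n',
]
counts = count_posts_by_author(sample_lines)
print(counts.most_common(1))
```

To run it on real data, the lines can be read from a.txt opened with the same encoding that was used when writing it.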
