The Self-Cultivation of a 40+ Veteran Coder

Common IDEA Plugins and Configuration

2017-11-20

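Color theme: Monokai_2
Java class template

/**
 * @author: your name
 * @date: ${DATE} ${TIME}
 * @className: ${NAME}
 * @description:
 */

Plugins
- Alibaba: search the plugin marketplace for "alibaba" (reference)
- FindBugs-IDEA
- GsonFormat (reference 1, reference 2)
- Translation: uses Youdao Translate (reference)
- HighlightBracketPair: bracket highlighting (reference)
Global JDK version setting
File-Other Setting-Default Project Structure
Settings:
- Draw a separator line between methods
  setting-editor-general-appearance-show method separators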

Scraping the Main Posts on the First 100 Pages of Hupu's Buxingjie Board with asyncio Coroutines and aiohttp

2017-08-17

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import asyncio
import aiohttp
import time
from pyquery import PyQuery as pq
import sys
import codecs
import json


async def get_post_url(url):
    '''
    Collect every post link found on one list page.
    '''
    async with aiohttp.ClientSession() as client:
        async with client.get(url) as response:
            body = await response.text(encoding="utf-8")
            # print(body)
            post_list = parser(body)
            # post_urls is the module-level list defined further down
            post_urls.extend(post_list)


async def get_post_info(url):
    '''
    Fetch a post by its link and extract the title, author, and posting time.
    '''
    async with aiohttp.ClientSession() as client:
        async with client.get(url) as response:
            body = await response.text(encoding="utf-8")
            post_info = paser_post(body)
            if post_info is not None:
                post_info["url"] = url
                post_lists.append(post_info)


def paser_post(html):
    '''
    Parse a post page.
    '''
    post_info = {}
    doc = pq(html)
    main_post = doc('div#tpc')
    post_author = main_post.find('div.author a.u').text()
    post_time = main_post.find('div.author span.stime').text()
    post_title = doc('h1#j_data').text()
    post_info["title"] = post_title
    post_info['time'] = post_time
    post_info['author'] = post_author
    if not post_title and not post_time and not post_author:
        return None
    return post_info


def parser(html):
    '''
    Parse a list page.
    '''
    post_list = []
    doc = pq(html)
    links_item = doc('table[id="pl"]').find('tbody').find('tr[mid]')
    for link_item in links_item.items():
        post_link = link_item.find('td.p_title').find('a').attr('href')
        post_link = "https://bbs.hupu.com" + post_link
        post_list.append(post_link)
    return post_list


# Record the start time
start_time = time.time()
# List that stores the scraped post data
post_lists = []
# List of every post link
post_urls = []
# Create the event loop
loop = asyncio.new_event_loop()

# Schedule the first 100 Buxingjie list pages on the event loop and fetch
# them concurrently to collect every post link
urls = [
    "https://bbs.hupu.com/bxj-postdate-{}".format(i) for i in range(1, 101)]
# Wrap each coroutine in a task: asyncio.wait() no longer accepts bare
# coroutines on Python 3.11+
tasks = [loop.create_task(get_post_url(url)) for url in urls]
loop.run_until_complete(asyncio.wait(tasks))
# Print the total number of posts
print(len(post_urls))
# Schedule the post links in batches of 1000 and fetch their contents
for i in range(0, len(post_urls), 1000):
    batch = post_urls[i:i + 1000]
    print(i)
    end_time = time.time()
    print("elapsed time", end_time - start_time)
    tasks = [loop.create_task(get_post_info(url)) for url in batch]
    loop.run_until_complete(asyncio.wait(tasks))
loop.close()

# Save all the data
post_dicts = {"posts": post_lists, "length": len(post_lists)}
with codecs.open("post.json", "w", "utf-8") as f:
    f.write(json.dumps(post_dicts, indent=True))
end_time = time.time()

print("elapsed time", end_time - start_time)

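Without logging in, Hupu only shows the first 100 pages of posts.
There were 11,594 main posts in total, and the whole run took 358 seconds,
which is roughly 32 posts per second.
Running 500 coroutines and running 1000 coroutines gave about the same speed,
so this is probably the limit for a single IP.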
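
The script above caps concurrency by slicing the links into batches of 1000. Another common way to compare different coroutine counts is an asyncio.Semaphore, sketched below under some assumptions: fetch_stub is a hypothetical stand-in that sleeps instead of calling aiohttp, and bounded_fetch and its limit parameter are illustrative names that do not appear in the original script.

```python
import asyncio


async def fetch_stub(url, delay=0.01):
    # Hypothetical stand-in for get_post_info(): sleeps instead of
    # making an HTTP request, then returns the URL it was given.
    await asyncio.sleep(delay)
    return url


async def bounded_fetch(urls, limit):
    # At most `limit` fetch_stub() calls are in flight at any moment.
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def worker(url):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fetch_stub(url)
            finally:
                in_flight -= 1

    # gather() returns results in the same order as the input URLs.
    results = await asyncio.gather(*(worker(u) for u in urls))
    return results, peak


if __name__ == "__main__":
    demo_urls = [
        "https://bbs.hupu.com/bxj-postdate-{}".format(i) for i in range(1, 51)]
    results, peak = asyncio.run(bounded_fetch(demo_urls, limit=10))
    print(len(results), peak)
```

Changing limit makes it easy to rerun the same job at 500 or 1000 concurrent requests and compare the timings, without changing how the links are sliced.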

Common Bash Commands and Tools

2017-07-18

Tools
- 1. thefuck
  Automatically corrects a mistyped command
  - Install
    brew install thefuck
  - Usage
    pythn
    # zsh: command not found: pythn
    fuck
    # python [enter/↑/↓/ctrl+c]
- 2. tldr
  Concise hints and examples for bash commands
  - Install
    brew install tldr
  - Usage
    tldr tar
- 3. mycli
  A MySQL command-line client with auto-completion and syntax highlighting
  - Install
    brew install mycli
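Common commands

Command	Purpose
history	Show the command-line history; then use !n (n is the command number) to run that command again
ctrl-a	Move the cursor to the start of the line
ctrl-e	Move the cursor to the end of the line
alt-b and alt-f	Move the cursor one word at a time
pstree -p	Show the process tree
ps -ef | grep python	Show the python processes
netstat -lntp or ss -plat	Check which processes are listening on ports
alias	alias ll='ls -latr' creates a new command alias ll
ln -s	Create a symbolic link
df -h	Show disk partition usage
exec $SHELL	Restart the shell

Tips

One of the fastest ways to delete a large number of files

mkdir empty && rsync -r --delete empty/ some-dir && rmdir some-dir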

Python Code for Logging In to Zhihu, Working as of 20170506

2017-05-06

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup
import time
import json
import os


url = 'https://www.zhihu.com'
loginURL = 'https://www.zhihu.com/login/email'
headers = {
    "User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0',
    "Referer": "https://www.zhihu.com/",
    'Host': 'www.zhihu.com',
}
data = {
    'email': 'xxxxx',
    'password': 'xxxxx',
    'rememberme': "true",
}
s = requests.session()
if os.path.exists('cookiefile'):
    with open('cookiefile') as f:
        cookie = json.load(f)
    s.cookies.update(cookie)
    req1 = s.get(url, headers=headers)
    # Write zhihu.html so the login result can be checked by hand
    with open('zhihu.html', 'wb') as f:
        f.write(req1.content)
else:
    req = s.get(url, headers=headers)
    print(req)
    soup = BeautifulSoup(req.text, "html.parser")
    xsrf = soup.find('input', {'name': '_xsrf', 'type': 'hidden'}).get('value')
    data['_xsrf'] = xsrf
    timestamp = int(time.time() * 1000)
    captchaURL = 'https://www.zhihu.com/captcha.gif?=' + \
        str(timestamp) + "&type=login"
    print(captchaURL)
    with open('zhihucaptcha.gif', 'wb') as f:
        captchaREQ = s.get(captchaURL, headers=headers)
        f.write(captchaREQ.content)
    loginCaptcha = input('input captcha:\n').strip()
    data['captcha'] = loginCaptcha
    print(data)
    loginREQ = s.post(loginURL, headers=headers, data=data)
    # A falsy 'r' in the JSON response is treated as a successful login
    if not loginREQ.json()['r']:
        print(s.cookies.get_dict())
        with open('cookiefile', 'w') as f:
            json.dump(s.cookies.get_dict(), f)
    else:
        print('login fail')
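
The cookie caching in the script (reuse cookiefile when it exists, otherwise log in and save the cookies) can be isolated into two small helpers. This is only a sketch under assumptions: save_cookies and load_cookies are hypothetical names, the cookie dict is a made-up placeholder, and a temporary directory stands in for the working directory. It uses only the standard library, so it runs without requests.

```python
import json
import os
import tempfile


def save_cookies(cookies, path):
    # json.dump() writes str, so the file is opened in text mode.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cookies, f)


def load_cookies(path):
    # Returns None when no cookie file exists yet, which is the signal
    # to run the full login flow (captcha and POST) instead.
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "cookiefile")
    print(load_cookies(demo_path))
    # Placeholder cookie, not a real Zhihu cookie name or value.
    save_cookies({"session_id": "placeholder"}, demo_path)
    print(load_cookies(demo_path))
```

In the script above, the dict returned by load_cookies would be passed to s.cookies.update(), and s.cookies.get_dict() would be passed to save_cookies after a successful login.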

My Frequently Used Vim Commands

2017-04-17

Press Esc to enter normal mode. In this mode, the arrow keys or h, j, k, l move the cursor.

Key	Description
h	Left
l	Right
j	Down
k	Up
w	Next word
b	Previous word

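In normal mode, the keys below enter insert mode and let you start typing at the corresponding position. The last five (x, dd, D, gg, G) delete text or move the cursor without leaving normal mode.

Key	Description
i	Insert at the cursor
I	Insert at the start of the line
A	Append at the end of the line
a	Append after the cursor
o	Open a new line below the current line
O	Open a new line above the current line
x	Delete the character under the cursor
dd	Delete the whole line
D	Delete to the end of the line
gg	Move the cursor to the first line
G(Shift+g)	Move the cursor to the last line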

In normal mode, press : to enter command mode.

Command	Description
set nu	Show line numbers
q!	Force quit without saving
q	Quit
wq	Save and quit
wq!	Force save and quit

VS Code Configuration

2017-04-17

Press F1 and type task to open the task runner configuration file,
then enter:

{
    "version": "0.1.0",
    "command": "python",
    "isShellCommand": true,
    "args": ["${file}"],
    "showOutput": "always",
    "options":
        {
        "env":
            {
            "PYTHONIOENCODING": "UTF-8"
            }
        }
}
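
The PYTHONIOENCODING entry in the task above controls the encoding Python uses for stdout when output goes to a panel rather than a terminal. One way to check its effect, sketched here with a hypothetical helper name (child_stdout_encoding), is to start a child interpreter with the variable set and ask it what encoding it picked.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(env_value):
    # Run a child interpreter with PYTHONIOENCODING set and with stdout
    # redirected to a pipe, roughly as an editor's output panel would do.
    env = dict(os.environ, PYTHONIOENCODING=env_value)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    # Normalize spellings such as "UTF-8" and "utf8" to one canonical name.
    return codecs.lookup(out.decode("ascii").strip()).name


if __name__ == "__main__":
    print(child_stdout_encoding("UTF-8"))
```

If the reported encoding is not a UTF-8 variant, printing Chinese text from a task can fail with UnicodeEncodeError, which is the situation the env setting is meant to avoid.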

Frequently used extensions
Indent guides: Guides
File icons: vscode-icons
Indent lines: Indenticator
Settings

// Place your settings in this file to override the default settings
{
    "window.zoomLevel": 2,
    // Font size
    "editor.fontSize": 14,
    // Font family
    "editor.fontFamily": "Hack, Menlo, Monaco, 'Courier New', monospace",
    // Ruler at 80 characters
    "editor.rulers": [80],
    // Arguments passed in. Each argument is a separate item in the array.
    // autopep8 automatic formatting
    "python.formatting.autopep8Args": [
        "--max-line-length=80",
        "--indent-size=4"
    ],

    // Format the document upon saving.
    "python.formatting.formatOnSave": true,
    // Pylint messages to ignore
    "python.linting.pylintArgs": [
        "--include-naming-hint=n",
        "--disable=W0311",
        "--disable=C0103",
        "--disable=E1101",
        "--disable=C0111",
        "--disable=W0621"
    ],
    // Trim trailing whitespace
    "files.trimTrailingWhitespace": true,
    // Save automatically when the editor loses focus
    "files.autoSave": "onFocusChange",
    // Render indentation whitespace
    "editor.renderWhitespace": "boundary",
    "editor.renderLineHighlight": "line",
    // Files to hide
    "files.exclude": {
        "**/.git": true,
        "**/.svn": true,
        "**/.hg": true,
        "**/.DS_Store": true,
        ".vscode": true,
        "**/__pycache__": true,
        "**/**/*.pyc": true
    },
    // Hide the Open Editors pane
    "explorer.openEditors.visible": 0,
    // Do not accept suggestions on Enter
    "editor.acceptSuggestionOnEnter": false
}

This requires pylint and autopep8 to be installed.

Keybindings I use

[
    { "key": "f6", "command": "workbench.action.debug.continue",
                   "when": "inDebugMode" },
    { "key": "f6", "command": "workbench.action.debug.start",
                   "when": "!inDebugMode" },
    { "key": "f5", "command": "workbench.action.tasks.build" }
]

If autocomplete suggestions do not appear,
go to ~/.vscode/extensions/donjayamanne.python-0.5.5/pythonFiles/preview/jedi/parser, make a copy of grammar3.5.txt, and rename the copy to grammar3.6.txt.

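Setup and Software Installation After a Fresh Install of Mac OS 10.11 on a Hackintosh

2016-10-22

A freshly installed system needs software and configuration. I am recording the steps here so I can look them up when I need them again.

Suppressing the "Thunderbolt 1.2 Firmware Update" notice
On a Hackintosh, the App Store updates page keeps showing a "Thunderbolt 1.2 Firmware Update" reminder. To hide it, run this in iTerm2:

softwareupdate --ignore ThunderboltFirmwareUpdate1.2

Installing Xcode Command Line Tools
These are the command-line developer tools; installation takes about 10 minutes.

xcode-select --install

Installing zsh with oh-my-zsh; the command comes from its GitHub page

sh -c "$(curl -fsSL https://raw.github.com/robbyrussell/oh-my-zsh/master/tools/install.sh)"

Next, configure iTerm2: uncheck the two Confirm options under General so that iTerm2 does not block the Mac from shutting down. Under Profiles-Text, set both font sizes to 18 and the font to Hack.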

Change the theme and the prompt: run subl ~/.zshrc and change the theme on line 10 to half-life,
then append the following prompt code at the end of the file:

PROMPT=$'%{$purple%}%n%{$reset_color%} in %{$limegreen%}%~%{$reset_color%}$(ruby_prompt_info " with%{$fg[red]%} " v g "%{$reset_color%}")$vcs_info_msg_0_%{$orange%}%{$reset_color%} at %{$hotpink%}%* %{$orange%}
λ%{$reset_color%} '

Installing brew

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Use brew to install the tools you need,
such as tree, git, openssl, and mongodb.
tree is a tool that prints a directory tree on the Mac.

brew install tree
brew install git
brew install openssl
brew install python3
brew install mysql

Installing pip

sudo easy_install pip

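Behind the Great Firewall, switching pip to a mirror makes installing packages much faster. Create the file "~/.pip/pip.conf" yourself:

mkdir ~/.pip/
cd ~/.pip/
subl pip.conf

Then paste in the following:

[global]
trusted-host = pypi.douban.com
index-url = http://pypi.douban.com/simple/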
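
The same file can also be produced from a script, which may be convenient when setting up several machines. The sketch below uses the standard configparser module; write_pip_conf is a hypothetical helper name, and a temporary directory stands in for ~/.pip so that running it does not touch a real pip configuration.

```python
import configparser
import os
import tempfile


def write_pip_conf(directory, index_url, trusted_host):
    # Build the [global] section and write it to <directory>/pip.conf,
    # creating the directory if needed.
    os.makedirs(directory, exist_ok=True)
    config = configparser.ConfigParser()
    config["global"] = {
        "trusted-host": trusted_host,
        "index-url": index_url,
    }
    path = os.path.join(directory, "pip.conf")
    with open(path, "w", encoding="utf-8") as f:
        config.write(f)
    return path


if __name__ == "__main__":
    # Temporary directory used in place of ~/.pip for the demonstration.
    demo_dir = os.path.join(tempfile.mkdtemp(), ".pip")
    conf_path = write_pip_conf(
        demo_dir, "http://pypi.douban.com/simple/", "pypi.douban.com")
    with open(conf_path, encoding="utf-8") as f:
        print(f.read())
```

To write the real file, pass os.path.expanduser("~/.pip") as the directory.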

Use pip to install the libraries you need,
such as requests, beautifulsoup4, mysql-python, and virtualenv.

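Installing the Latest Hexo and Restoring Data

2016-09-14

Recently, whenever I ran an edited HTML page from WebStorm, the browser reported that it could not be opened, and changing the localhost port did not help. After a long search online I found that very few people had run into this. I suspected that something installed earlier on Mac OS X was conflicting with WebStorm, so I had no choice but to reinstall my Hackintosh system.

The installation went smoothly, and all the software installed and configured normally. Restoring the Hexo blog files was where I hit problems: the method I had used before no longer worked, so I reinstalled step by step following the latest instructions on the official Hexo site.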

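Update the GitHub SSH key. Run the command below in a terminal and press Enter at every prompt to generate the key:

ssh-keygen -t rsa -C "your_email@example.com"

cd ~/.ssh
subl id_rsa.pub

Copy the public key shown in Sublime Text and paste it into your GitHub account.

ssh -T git@github.com

This verifies that the SSH setup works.

git config --global user.name "username"
git config --global user.email "email address"

Running the commands above sets the Git user name and email, so hexo d can commit without asking for them on every deploy.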

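Install Node.js. The official Hexo site describes a command-line installation, but I find it easier to download an installer from the official Node.js site.
Download page
Install Hexo

npm install -g hexo-cli

Initialize

hexo init mrxin.github.io
cd mrxin.github.io
npm install

When that finishes, open _config.yml in Sublime Text and configure it, using the previous configuration file as a reference. Note that in the deploy section at the end, type used to be github and is now git.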

deploy:
  type: git
  repository: https://github.com/MrXin/MrXin.github.io.git
  branch: master

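Another important step is that "hexo-deployer-git" now has to be installed:

npm install hexo-deployer-git --save

Theme installation and configuration:
My favorite theme is "Landscape-plus". I don't know whether it is a bug in Hexo or in the theme, but tag and category pages only show 10 posts.
So I picked another, less attractive theme called maupassant (official site).
It is simple to use: follow the official steps, and at the end set the Duoshuo parameter to clutchbear.

Finally, copy the previously backed-up md files into the _posts directory, generate the static pages, and deploy them to GitHub.

hexo g
hexo d

RSS requires installing a plugin:

npm install hexo-generator-feed@1 --save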

Changing the zsh Command Prompt

2016-09-13

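After installing zsh, open a terminal and run
vim ~/.zshrc
to open the zsh configuration file.
Add the following at the bottom of .zshrc:
autoload -U compinit promptinit
compinit
promptinit

# Set the redhat theme as the default command prompt
prompt redhat
This enables tab completion and prompt themes.
Use the terminal command 'prompt -l' or 'prompt -p' to see the available themes.

After saving the change and restarting the terminal, you may see this warning:

zsh compinit: insecure directories, run compaudit for list.
Ignore insecure directories and continue [y] or abort compinit [n]?

Press y and the modified prompt appears.

To get rid of the warning on OS X 10.11, run

cd /usr/local/share/
sudo chmod -R 755 zsh
sudo chown -R root:staff zsh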

Fetching JD.com Product Information with Python

2016-08-23

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup

url = 'http://list.jd.com/list.html?cat=9987%2C653%2C655&page=1'
req = requests.get(url)


soup = BeautifulSoup(req.text, "html.parser")
items = soup.select('li.gl-item')
# print(len(items))
for item in items:
    sku = item.find('div')['data-sku']
    price_url = 'http://p.3.cn/prices/mgets?skuIds=J_' + str(sku)
    price = requests.get(price_url).json()[0]['p']
    nameinfo = item.find('div', class_="p-name").find('a')
    name = nameinfo['title']
    item_url = 'http:' + nameinfo['href']
    commit = item.find('div', class_="p-commit").find('a')
    commit_text = commit.get_text() if commit else ''
    # One line per product: SKU, price, name, URL, and the p-commit text
    print(sku, price, name, item_url, commit_text)

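The price is fetched as JSON from

http://p.3.cn/prices/mgets?skuIds=J_ + skuId

There are a few other ways to fetch JSON:

http://c0.3.cn/stock?skuId=965009&cat=652,829,854&area=1_2812_51141_0&extraParam={"originid":"1"}

Here skuId is the product ID, cat can be found in the product page, and area is the area code.
It returns the web price of the product and whether it is in stock locally.

http://item.m.jd.com/ware/thirdAddress.json?address=jd1356&wareId=965009&provinceId=1&cityId=2812&countryId=51141

Here wareId is the product ID, and provinceId, cityId, and countryId together give the area code. It returns the mobile-site price of the product and whether it is in stock locally.

http://pe.3.cn/prices/pcpmgets?skuids=965012&origin=5&area=1_2812_51141

Here skuids is the product ID and area is the area code.
It returns the product's web price along with its mobile and WeChat prices.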
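
The price endpoints listed above differ only in their query strings, so they can be built with small helpers. The following is a sketch, not a tested client: price_url, pcp_price_url, and extract_price are hypothetical names, the sample payload is invented for illustration, and the endpoints are the ones recorded in this post in 2016, which may have changed since. No network request is made.

```python
import json
from urllib.parse import urlencode


def price_url(sku_id):
    # Same URL the script builds for the p.3.cn price lookup.
    return "http://p.3.cn/prices/mgets?" + urlencode(
        {"skuIds": "J_" + str(sku_id)})


def pcp_price_url(sku_id, area, origin=5):
    # Web, mobile, and WeChat prices; the parameter names are taken
    # from the URL shown in this post.
    return "http://pe.3.cn/prices/pcpmgets?" + urlencode(
        {"skuids": sku_id, "origin": origin, "area": area})


def extract_price(payload_text):
    # The script reads the price as .json()[0]['p'], that is, the "p"
    # field of the first element of a JSON list.
    return json.loads(payload_text)[0]["p"]


if __name__ == "__main__":
    print(price_url(965009))
    print(pcp_price_url(965012, "1_2812_51141"))
    # Hypothetical response body, for illustration only.
    print(extract_price('[{"p": "1999.00"}]'))
```

Building the query with urlencode avoids hand-concatenating parameters and percent-encodes any value that needs it.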
