Note:
The previous two installments are mostly coding notes on Python and its pandas and numpy libraries.
The first installment also covers installing Jupyter Notebook; the second adds the data analysis (DA) workflow.
The notes repeat a little. Only commonly used code is recorded, and functions noted before have been supplemented in the older installments.
data wrangling
gather
Examples:
Download movie ratings from Rotten Tomatoes. link
Download reviews from a film critic's website. link
Word cloud plugin. link
Flat file structure. A file format: flat files contain tabular data in plain text format with one data record per line and each record or line having one or more fields. These fields are separated by delimiters, like commas, tabs, or colons.
If the data is separated by commas it is a CSV; by tabs, a TSV.
Flat files are human-readable.
Note: Excel files are not; the raw contents of an Excel file are not human-readable.
read_csv can also read a TSV by setting the separator.
import pandas as pd
df = pd.read_csv("bestofrt.tsv", sep='\t')
web scraping
Note: we used the os library to build file paths and save the files. Reading can also be done with the glob library.
requests
Use the requests library to make HTTP requests. link.
import requests
response = requests.get(url)  # url points at the file to download
A status code of 200 means the request succeeded.
The downloaded content now sits in response; write it out to a file.
import os
with open(os.path.join(folder_name, url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)
Check the folder's contents.
os.listdir(folder_name)
glob
Use glob to read plain-text files such as .txt.
glob.glob lists every path in a folder that matches a pattern.
import glob
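A minimal sketch of the pattern (the scratch folder and file names below are invented for the example):

```python
import glob
import os
import tempfile

# Hypothetical setup: a scratch folder holding two .txt files
folder = tempfile.mkdtemp()
for name in ('a.txt', 'b.txt'):
    open(os.path.join(folder, name), 'w').close()

# glob.glob returns a list of every path matching the pattern
paths = sorted(glob.glob(os.path.join(folder, '*.txt')))
```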
Beautiful Soup
First download the HTML file, then use Beautiful Soup to extract the data. link.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('file.html'), 'lxml')  # 'file.html' is a placeholder name
Use strip to remove surrounding whitespace.
'string'.strip()
Use replace to remove the comma in values like 100,000.
'string'.replace(',', '')
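Put together, a scraped number can be cleaned and converted in one chain (the raw value here is made up for the example):

```python
# A scraped value often arrives with stray whitespace and a thousands separator
raw = '  100,000\n'
cleaned = int(raw.strip().replace(',', ''))
```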
# List of dictionaries to build file by file and later convert to a DataFrame
API
The most convenient approach.
(I think of an API as an interface, like a USB port.)
import wptools
page = wptools.page('Mahatma_Gandhi').get()  # the article title here is just an example
Note: the data may also be stored as JSON. A JSON object corresponds to a Python dict, and a JSON array to a Python list.
For example, the page object above stores its data in JSON form.
page.data['image'][0]  # access first image
More on JSON and Python: link.
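A minimal sketch of that mapping (the JSON string below is invented):

```python
import json

# json.loads turns a JSON object into a dict and a JSON array into a list
text = '{"title": "Example", "images": ["poster.jpg", "still.jpg"]}'
data = json.loads(text)
first_image = data['images'][0]
```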
Note: when downloading text, confirm the encoding (e.g. utf-8).
For images, pay attention to which library you use; some can mangle the file. link
import requests
r = requests.get(image_url)  # image_url is a placeholder
# List of dictionaries to build and convert to a DataFrame later
First use the terminal to cd into the directory where you want to keep your files.
Then run:
jupyter notebook
New files can be created directly from the interface.
Note: Shift+Tab inside a call pops up the function's signature and docstring.
Zip files, doc:
import zipfile
with zipfile.ZipFile('archive.zip') as myzip:  # 'archive.zip' is a placeholder name
    myzip.extractall()
Once gathered, the data can be stored in a flat file or a database.
(File: e.g. a CSV. Database: e.g. one that SQL can query.)
df.to_csv('dataset.csv', index=False)
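A quick round-trip check on toy data; index=False keeps the row index out of the file:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'ranking': [1, 2], 'title': ['A', 'B']})
path = os.path.join(tempfile.gettempdir(), 'dataset.csv')
df.to_csv(path, index=False)        # write without the row index
df_roundtrip = pd.read_csv(path)    # reading it back gives an identical frame
```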
We can connect to a SQL database from Python. link.
from sqlalchemy import create_engine
engine = create_engine('sqlite:///dataset.db')  # hypothetical SQLite target
df.to_sql('table_name', engine, index=False)
assess
Focus on quality and tidiness. The goal is to understand the data and find its problems.
Dirty: quality issues in the data itself, i.e. completeness (missing values), duplicates, accuracy, validity (e.g. inconsistent units), consistency.
Messy: structural (tidiness) issues.
Use isnull to check for missing values.
df.columnName.isnull().sum()
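A tiny sketch with one missing value (the frame is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['a', np.nan, 'c']})
missing = df.name.isnull().sum()   # counts the NaN entries
```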
Record the data's issues at this step so they can be fixed later.
Plain inspection: index into a particular row.
second_row = df.loc[[1]]
Check uniqueness.
df_1.nunique()
Check for duplicates.
duplicateRowsDF = df_1[df_1.duplicated(['tweet_id'])]
df_2.describe()
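duplicated() flags every occurrence after the first; a toy example:

```python
import pandas as pd

df_1 = pd.DataFrame({'tweet_id': [10, 10, 11], 'text': ['x', 'x', 'y']})
# Every repeat of a tweet_id after its first appearance is flagged
duplicateRowsDF = df_1[df_1.duplicated(['tweet_id'])]
```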
clean
The first step is to make a copy.
df_clean = df.copy()
define
At this step, write down which problems need cleaning, phrased as verbs (the actions to take).
code
For more code, dig through the previous two installments of notes.
String slicing.
df['columnName'].str[2:]
String replacing.
df['columnName'].str.replace('oldValue', 'newValue')
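Both operations on concrete (made-up) data:

```python
import pandas as pd

s = pd.Series(['No.12', 'No.34'])
numbers = s.str[3:]                                 # slice off the 'No.' prefix
replaced = s.str.replace('No.', '#', regex=False)   # plain-text replacement
```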
When values are stored as column names, they should become values in a column of their own. Use melt.
treatments_clean = pd.melt(treatments_clean,
                           id_vars=['given_name', 'surname', 'hba1c_start', 'hba1c_end', 'hba1c_change'],
                           var_name='treatment', value_name='dose')  # var_name/value_name name the two new columns
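A toy version of the same reshape (the data is invented): the treatment names sit in the headers, and melt moves them down into a column:

```python
import pandas as pd

wide = pd.DataFrame({'given_name': ['zoe', 'leo'],
                     'auralin': ['41u', '-'],
                     'novodra': ['-', '30u']})
# Header names become values of 'treatment'; cell contents become 'dose'
tidy = pd.melt(wide, id_vars='given_name',
               var_name='treatment', value_name='dose')
```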
An example from Stack Overflow.
df = pd.DataFrame({'ID': {0: '0', 1: '1', 2: '2'}})
How to define a one-off function inline with lambda, from Stack Overflow.
Going in the other direction, use get_dummies.
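get_dummies spreads a column of categories back out into indicator columns (toy data):

```python
import pandas as pd

s = pd.Series(['auralin', 'novodra', 'auralin'])
dummies = pd.get_dummies(s)   # one indicator column per category
```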
Split the email address and phone number into two columns. Use extract.
patients_clean['phone_number'] = patients_clean.contact.str.extract(r'((?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})', expand=True)
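A smaller sketch with simpler (made-up) patterns for both fields:

```python
import pandas as pd

contact = pd.Series(['tim@example.com 123-456-7890'])
# Each capture group becomes a column of the extracted frame
phone = contact.str.extract(r'(\d{3}-\d{3}-\d{4})', expand=True)
email = contact.str.extract(r'([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+)', expand=True)
```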
Case conversion.
df.columnName = df.columnName.str.lower()
Use rename to change column names.
df_clean = df_clean.rename(columns={'oldName1': 'newName1',
                                    'oldName2': 'newName2'})
Use value_counts to get a quick look at how values are distributed within a column.
df.columnName.value_counts()
Changing values inside a column:
to_replace = ['valueWannaReplace1', 'valueWannaReplace2']  # a plain Python list of the values to replace
df.columnName = df.columnName.replace(to_replace, 'newValue')
Attached: an example from Stack Overflow.
Selecting several rows and changing them.
(isin matches by value.)
df[df['A'].isin([3, 6])]  # select the rows where column A is 3 or 6
Using the index instead.
df_1[df_1.index.isin([231, 837, 885, 1480])]  # the same idea, but matching on the index rather than a column
Use assert to test a condition.
for value in asap_list:
    assert value in df.columnName.values  # hypothetical body: check each value appears in the column
merge.
treatments_clean = pd.merge(treatments_clean, adverse_reactions_clean,
                            on=['given_name', 'surname'], how='left')
# Mapping from full state name to abbreviation
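A sketch of that mapping with two made-up entries; replace applies the dict and leaves unmapped values alone:

```python
import pandas as pd

# Hypothetical two-entry mapping from full state name to abbreviation
state_abbrev = {'California': 'CA', 'New York': 'NY'}
s = pd.Series(['California', 'New York', 'CA'])
abbrev = s.replace(state_abbrev)   # values missing from the dict pass through unchanged
```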
Data type conversion.
# To category
df.columnName = df.columnName.astype('category')
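astype handles the common conversions; a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'state': ['CA', 'NY', 'CA'], 'year': ['1999', '2000', '2001']})
df.state = df.state.astype('category')   # repeated labels -> category
df.year = df.year.astype(int)            # numeric strings -> integers
```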
Use pad to pad a string.
# Strip all " ", "-", "(", ")", and "+" and store each number without any formatting.
# Pad the phone number with a 1 if the length of the number is 10 digits (we want the country code).
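One way to do both steps (the numbers are invented; \D matches any non-digit, and pad left-fills to 11 characters):

```python
import pandas as pd

nums = pd.Series(['(123) 456-7890', '+1 (555) 123-4567'])
# Drop every non-digit, then left-pad to 11 digits with the country code 1
cleaned = nums.str.replace(r'\D+', '', regex=True).str.pad(11, fillchar='1')
```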
To remove rows, a not-equal filter is enough.
patients_clean = patients_clean[patients_clean.surname != 'Doe']
test
Check for duplicates.
# Patient ID should be the only duplicate column
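A sketch of that check on two made-up frames: concatenate the column names of both and ask which ones repeat; only patient_id should surface:

```python
import pandas as pd

patients = pd.DataFrame(columns=['patient_id', 'given_name', 'surname'])
treatments = pd.DataFrame(columns=['patient_id', 'treatment', 'dose'])

# Column names that appear in both frames show up as duplicates
all_columns = pd.Series(list(patients) + list(treatments))
dupes = all_columns[all_columns.duplicated()]
```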
Link.
Check every row that has a null value.
patients[patients['address'].isnull()]
analysis and visualization
The earlier installments of notes cover more.
For example, use value_counts() to draw a pie chart.
df_clean.StartDate.value_counts().plot(kind="pie")
附录:read and write files with python
f = open('my_path/my_file.txt', 'r')  # f is a Python file object; 'r' means read-only
f = open('my_path/my_file.txt', 'w')  # 'w' means write; creates my_file.txt if it does not exist
f.close()  # close the file when done
with closes the file for you automatically.
with open('my_path/my_file.txt', 'r') as f:
    file_data = f.read()
f.readline()  # reads a single line from the file
def create_cast_list(filename):
    cast_list = []
    with open(filename) as f:
        for line in f:
            cast_list.append(line.split(',')[0].strip())  # hypothetical body: keep the text before the first comma
    return cast_list