Examples: downloading movie reviews from Rotten Tomatoes (link); downloading reviews from a film critic's website (link); a word cloud plugin (link).
Flat File Structure. A file format: flat files contain tabular data in plain text format, with one data record per line and each record or line having one or more fields. These fields are separated by delimiters like commas, tabs, or colons.
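As a quick illustration, a minimal sketch of reading a flat file with pandas; the file name reviews.tsv and its tab delimiter are assumptions for the example:

import pandas as pd

# Hypothetical tab-separated flat file: one record per line, fields split by tabs
df = pd.read_csv('reviews.tsv', sep='\t')
df.head()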
import os

# Make directory if it doesn't already exist
folder_name = 'theFolderName'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)
import requests

url = 'the_address'
response = requests.get(url)
A response status code of 200 means the request succeeded. The downloaded content now sits in response.
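A minimal sketch of that check:

# 200 means the request succeeded
if response.status_code == 200:
    print('Download OK')

Then write the content out to a file, naming it after the last piece of the URL: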
with open(os.path.join(folder_name, url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)
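The same pattern scales to many files in a loop; a sketch assuming a hypothetical list of URLs named urls:

# urls is a hypothetical list of file URLs to download
for url in urls:
    response = requests.get(url)
    with open(os.path.join(folder_name, url.split('/')[-1]), mode='wb') as file:
        file.write(response.content)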
Check the folder's contents:
os.listdir(folder_name)
glob
Use glob to read batches of files such as .txt files. glob.glob takes a filename pattern and returns the matching paths inside a folder.
import glob
import pandas as pd

# List of dictionaries to build file by file and later convert to a DataFrame
df_list = []
for ebert_review in glob.glob('ebert_reviews/*.txt'):
    with open(ebert_review, encoding='utf-8') as file:  # check the source web page for the right encoding
        title = file.readline()[:-1]  # readline grabs one line; [:-1] drops the trailing newline character
        review_url = file.readline()[:-1]
        review_text = file.read()

    # Append to list of dictionaries
    df_list.append({'title': title,
                    'review_url': review_url,
                    'review_text': review_text})

df = pd.DataFrame(df_list, columns=['title', 'review_url', 'review_text'])  # finally convert to a DataFrame
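A quick sanity check on the assembled DataFrame:

df.head()   # first few parsed reviews
df.shape    # (number of files read, 3)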
from bs4 import BeautifulSoup

with open('fileName.html') as file:
    soup = BeautifulSoup(file, 'lxml')  # first we make a soup
    # the soup holds everything from the HTML file, uncleaned

# Then use find to locate the contents of a tag we want.
soup.find('title')  # returns the whole element with the title tag
soup.find('title').contents[0]  # returns the first piece of content inside the title tag
soup.find('title').contents[0][:-len(' the thing wanna delete')]  # trim the string
# List of dictionaries to build file by file and later convert to a DataFrame
df_list = []
folder = 'rt_html'
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        soup = BeautifulSoup(file, 'lxml')
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        audience_score = soup.find('div', class_='audience-score meter').find('span').contents[0][:-1]
        num_audience_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor')
        num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2].strip().replace(',', '')
        # Append to list of dictionaries
        df_list.append({'title': title,
                        'audience_score': int(audience_score),
                        'number_of_audience_ratings': int(num_audience_ratings)})

df = pd.DataFrame(df_list, columns=['title', 'audience_score', 'number_of_audience_ratings'])
import wptools
import requests
from PIL import Image
from io import BytesIO

# List of dictionaries to build and convert to a DataFrame later
df_list = []
image_errors = {}
for title in title_list:  # title_list is a pre-built list of strings
    try:
        # This cell is slow so print ranking to gauge time remaining
        ranking = title_list.index(title) + 1
        print(ranking)
        page = wptools.page(title, silent=True)
        images = page.get().data['image']
        # First image is usually the poster
        first_image_url = images[0]['url']
        r = requests.get(first_image_url)
        # Download movie poster image
        i = Image.open(BytesIO(r.content))
        image_file_format = first_image_url.split('.')[-1]
        i.save(folder_name + "/" + str(ranking) + "_" + title + '.' + image_file_format)
        # Append to list of dictionaries
        df_list.append({'ranking': int(ranking),
                        'title': title,
                        'poster_url': first_image_url})
    # Not best practice to catch all exceptions but fine for this short script
    except Exception as e:
        print(str(ranking) + "_" + title + ": " + str(e))
        image_errors[str(ranking) + "_" + title] = images

# The try block guards the loop; afterwards, review which images errored
for key in image_errors.keys():
    print(key)
from sqlalchemy import create_engine

# Create SQLAlchemy Engine and empty bestofrt database
# bestofrt.db will not show up in the Jupyter Notebook dashboard yet
engine = create_engine('sqlite:///bestofrt.db')

# Store cleaned master DataFrame ('df') in a table called master in bestofrt.db
# bestofrt.db will be visible now in the Jupyter Notebook dashboard
df.to_sql('master', engine, index=False)
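Reading the table back out of the database works the same way in reverse; a minimal sketch:

# Read the master table back into a DataFrame
df_from_db = pd.read_sql('SELECT * FROM master', engine)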
# [a-zA-Z] to signify emails in this dataset all start and end with letters
patients_clean['email'] = patients_clean.contact.str.extract(
    r'([a-zA-Z][a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+[a-zA-Z])', expand=True)
# Note: axis=1 denotes that we are referring to a column, not a row
patients_clean = patients_clean.drop('contact', axis=1)
# Mapping from full state name to abbreviation
state_abbrev = {'California': 'CA',
                'New York': 'NY',
                'Illinois': 'IL',
                'Florida': 'FL',
                'Nebraska': 'NE'}
# Function to apply
def abbreviate_state(patient):
    if patient['state'] in state_abbrev.keys():
        abbrev = state_abbrev[patient['state']]
        return abbrev
    else:
        return patient['state']
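Applying the function row by row then writes the abbreviations back; a sketch assuming patients_clean has a state column:

# axis=1 passes each row to abbreviate_state
patients_clean['state'] = patients_clean.apply(abbreviate_state, axis=1)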
# To category
patients_clean.assigned_sex = patients_clean.assigned_sex.astype('category')
patients_clean.state = patients_clean.state.astype('category')

# To datetime
patients_clean.birthdate = pd.to_datetime(patients_clean.birthdate)

# Strip u and to integer
treatments_clean.dose_start = treatments_clean.dose_start.str.strip('u').astype(int)
treatments_clean.dose_end = treatments_clean.dose_end.str.strip('u').astype(int)
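To confirm the conversions took effect, inspect the dtypes:

patients_clean.dtypes     # assigned_sex and state should now be category; birthdate datetime64[ns]
treatments_clean.dtypes   # dose_start and dose_end should now be int64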
Use str.pad to pad strings.
# Strip all " ", "-", "(", ")", and "+" and store each number without any formatting.
# Pad the phone number with a 1 if the number is 10 digits long (we want the country code).
patients_clean.phone_number = patients_clean.phone_number.str.replace(r'\D+', '', regex=True).str.pad(11, fillchar='1')
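A tiny check of how str.pad behaves (the sample number is made up); it pads on the left by default:

pd.Series(['2345678901']).str.pad(11, fillchar='1')   # -> '12345678901'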
# Patient ID should be the only duplicate column
all_columns = pd.Series(list(patients_clean) + list(treatments_clean))  # patients_clean and treatments_clean are DataFrames
all_columns[all_columns.duplicated()]