0%

关于在python中写SQL的笔记

来源Linkedin learning.
万物皆可接API!!

忘记装matplotlib.

1
python -m pip install -U matplotlib




SQLite

Create a SQL lite database using python.

Add data

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
import sqlite3

# create connection object with the connect function
# save the connection object in a variable called "connection"

connection = sqlite3.connect('movies.db') # 这里的movies.db可以改成任何的directory

# since we do not any database yet, this function will create the database
# next time this code will just connect with the database movies.db

# we create a cursor(光标)
# cursor有点像pointer

cursor = connection.cursor()

# we need to put sql command in tripe quotes

cursor.execute(''' CREATE TABLE IF NOT EXISTS Movies
(Title TEXT, Director TEXT, Year INT)''')

connection.commit() # commit create table changes to the database
connection.close()

Save as movies.py.
Execute command.



1
python3 movies.py

这时候就已经生成一个movies.db在对应的directory中.



Add more data example.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
import sqlite3

connection = sqlite3.connect('movies.db')

cursor = connection.cursor()

cursor.execute(" INSERT INTO Movies VALUES
('Taxi Driver','Martin Scorsese', 1976) ")

cursor.execute(" SELECT * FROM Movies")

# to retrieve it in python, we need not only select data
# we can use fetchone function to retrieve a single row
# or use fetchall

print(cursor.fetchone()) # will print the taxi driver line

connection.commit()
connection.close()



Add more lines at once

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
import sqlite3

connection = sqlite3.connect('movies.db')

cursor = connection.cursor()

famousFilms = [('Pulp Fiction', 'Quentin Tarantino', 1994),
('Back to the Future', 'Steven Spielberg', 1985),
('Moonrise Kingdom', 'Wes Anderson', 2012)]

cursor.executemany('INSERT INTO Movies VALUES (?,?,?)', famousFiles)
# 三个问号对应每行数据的三个part

records = cursor.execute(" SELECT * FROM Movies")
#records储存了select的return

print(cursor.fetchall())

# 也可以用loop来fetch one by one

# for record in records:
# print(record)

connection.commit()
connection.close()

因为cursor works like a pointer.所以当cursor跑过result了之后,就不能再用cursor在iterate了.
的意思是,我们已经用fetchall把select的东西跑完了,再用fetchone,是不会回到第一行的数据的,会return empty.
所以把东西存在records里,是做一个copy,不会动cursor里的东西.
Cursor resets after each execute, execute many command.



Tuple

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
import sqlite3

connection = sqlite3.connect('movies.db')

cursor = connection.cursor()

release_year = (1985, )

cursor.execute("SELECT * FROM Movies WHERE year=?", release_year)

print(cursor.fetchall())

connection.commit()
connection.close()

Tuples (in python) are used to store multiple items in a single variable.

mytuple = (“apple”, “banana”, “cherry”)

这里,我们要找数据中year为1985年的数据.
使用tuple的instead of string是为了prevent SQL injection attack.
如果要用python的variable来比较,得像这样,use question mark placeholder,做一个tuple values as the second argument.



Object-relational mapping(ORM)

ORM也是可以query and manipulate data的一种方式.
也有很多libraries implement ORM-like functionality.
其中有SQLAlchemy,它是一个SQL toolkit.

要使用SQLAlchemy ORM,首先要装pip.

1
python3 -m pip install --user --upgrade pip

视频推荐先装个virtual环境再在环境里装SQLAlchemy.

1
python3 -m pip install --user virtualenv
1
python3 -m venv sqlalchemy-workspace

在对应的directory下生成了sqlalchemy-workspace的文件夹.
cd进去后,

1
source bin/activate

active虚拟环境,再,

1
pip install sqlalchemy

btw离开虚拟环境是,

1
deactivate


然后我们可以使用sqlalchemy了.

Import and add data

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
import sqlalchemy as db

engine = db.create_engine('sqlitle:///movies.db')

# engine is not same as python database APIs connection object
# engine allows your application to have multiple database connections
# and it manages those connections

connection = engine.connect()
# this connection is actually a proxy for the true python database API connection

metadata = db.MetaData()
# this metadata will hold all info about our table

movies = db.Table('Movies', metadata, autoload=True, autoload_with=engine)

query = db.select([movies])

result_proxy = connection.execute(query)
# this result proxy proxies the cursor object from the python database API
# we can use the result proxy to retrieve the data

result_set = result_proxy.fetchall()
# result set is the actual data

print(result_set[0]) # asscess to the first object
print(result_set[:2]) # first two objects

# add another row of data
query = movies.insert().values(Title="Psycho", Director="Alfred Hitchcock", Year="1960")
connection.execute(query) # pass the query object to connection to execute

# check the database
query = db.select([movies])
result_proxy = connection.execute(query)
result_set = result_proxy.fetchall()

# print all
print(result_set)



Create table

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

import sqlalchemy as db

engine = db.create_engine('sqlitle:///users-sqlalchemy.db')
metadata = db.MetaData()
connection = engine.connect()

users = db.Table('Users', metadata,
db.Column('user_id', db.Integer, primary_key=True),
db.Column('first_name', db.Text),
db.Column('last_name', db.Text),
db.Column('email_address', db.Text))

metadata.create_all(engine) # will create the table
insertion_query = users.insert().values([
{"first_name":"Tina","last_name":"Mccoy","email_address":"tmccoyl@gmail.com"}
])

connection.execute(insertion_query)

selection_query = db.select([users.column.email_address])
selection_result = connection.execute(selection_query)

print(selection_result.fetchall())





MySQL

Mysql will put database on a server instead of directly on the file system.
Follows a client/server relationship. Need to send requests to the server to access or modify data in the database.
In order to send requests to the database, we need to install the correct driver.

视频讲首先要下载MySQL community.
然后install + set root password via installer.

当在command line里输入mysql或者mysql.server没有被认出来时,得更改PATH variable.
在mac中可以:

1
sudo nano /etc/path

加入mysql的path

1
2
/usr/local/mysql/bin
/usr/local/mysql/support-files


记得重启terminal.

要写mysql得把server给run起来.

1
2
sudo mysql.server start
sudo mysql.server stop


MySQL shell

Login:

1
mysql -u root -p

然后可以直接在command line上写mysql.

1
CREATE DATABASE projects;

以下都是command line里的mysql
create了之后. To configure the databse we’re working with.

1
use projects;
1
CREATE TABLE projects(project_id INT(11) NOT NULL AUTO_INCREMENT, title VARCHAR(30), description VARCHAR(255), PRIMARY_KEY(project_id));

查看生成的table.

1
show tables;
1
CREATE TABLE tasks(task_id INT(11) NOT NULL AUTO_INCREMENT, project_id INT(11) NOT NULL, description VARCHAR(255), PRIMARY_KEY(task_id), FOREIGN KEY(project_id) REFERENCES project(project_id));
1
INSERT INTO projects(title, description) VALUES ("Organize Photos", "Organize old iPhone photos by year");
1
2
INSERT INTO tasks(description,project_id) VALUES ("Organize 2020 Photos", "1");
INSERT INTO tasks(description,project_id) VALUES ("Organize 2019 Photos", "1");

Project_id是auto increment的所以是1和2.

1
INSERT INTO projects(title, description) VALUES ("Read more", "Read a book a month this year");
1
INSERT INTO tasks(description,project_id) VALUES ("Read the huntress", "2");

With the task table, having that foreign key constraint, we cannot inset a task without a corresponding project ID. We will also have to delete tasks before we delete the corresponding project. With special operators, such as on delete cascade, we can avoid this.

1
SELECT * FROM projects;
1
SELECT * FROM tasks;

顺便一提出去mysql环境:

1
\q


Python

要在python里写,首先要下载compliant module.

进入虚拟环境.

1
2
3
virtualenv mysql-workspace
cd mysql-workspace
source bin/active
1
pip install mysql-connector-python

在此workspace(directory)下建立mysql-database.py
就可以开始写python了.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
import mysql.connector as mysql

def connect(db_name):
try:
return mysql.connect(
host='localhost',
user='root',
password='password', # the password to your database
database=db_name
)
except Error as e:
print(e)

if __name__ == '__main__':
db = connect("projects")
cursor = db.cursor()

cursor.execute("SELECT * FROM projects")
project_records = cursor.fetchall()
print(project_records)

db.close()

For the host parameter we could put in some remote server running in AWS, but initially with MySQL installation, the host is where the database lives. Our application will send requests back and forth, to this location our localhost.

Run the file.

1
python3 mysql-database.py



Create function to replace SQL command:

Create function

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
import mysql.connector as mysql

def connect(db_name):
try:
return mysql.connect(
host='localhost',
user='root',
password='password', # the password to your database
database=db_name
)
except Error as e:
print(e)

def add_new_project(cursor, project_title, project_description, task_descriptions):
project_data = (project_title, project_description)
cursor.execute('''INSERT INTO projects(title, description)
VALUES (%s, %s)''', project_data ) # triple quotes is for code in different lines

project_id = cursor.lastrowid # the 自动生成的id

tasks_data = []
for description in task_descriptions:
task_data = (project_id, description)
tasks_data.append(task_data)

cursor.executemany('''INSERT INTO tasks(project_id, description)
VALUES (%s, %s)''', tasks_data)

if __name__ == '__main__':
db = connect("projects")
cursor = db.cursor()

tasks = ["Clean bathroom", "Clean kitchen", "Clean living room"]

add_new_project(cursor, "Clean house","Clean house by room",tasks)
db.commit() #commit database changes

cursor.execute("SELECT * FROM projects")
project_records = cursor.fetchall()
print(project_records)

cursor.execute("SELECT * FROM tasks")
tasks_records = cursor.fetchall()
print(tasks_records)

db.close()



注意以上的code中hard code成分还是很多,用了column name直接进入了code.
可塑性很差,以下改良.

SQLAlchemy

Get into the mysql.

1
mysql -u root -p
1
CREATE DATABASE household;
1
\q

建立好table退出.



进入个有sqlalchemy的虚拟环境.

1
virtualenv mysql-sqlalchemy-workspace
1
cd mysql-sqlalchemy-workspace
1
source bin/active

然后装一下主角.

1
pip install mysql-connector-python
1
pip install sqlalchemy

虽然不理解但视频是这么说的:

Now let’s connect to our database using SQLAlchemy from a Python file.
We’ll be using the ORM components of SQLAlchemy.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
from sqlalchemy import Column, Integer, String, ForeignKey
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker, relationship
from sqlalchemy.ext.declaractive import declarative_base

engine = create_engine("mysql+mysqlconnector://root:password@localhost:3306/household", echo=True)

# 这句话我也没听懂
# to connect our database
# syntax is the database type along with the driver
# database type -> mysql
# plus(+) the driver -> mysqlconnector
# then give user and password
# replace the password into your actual password
# then the host for our database
# by default it's at port 3306
# household is the database we just create

# echo=True makes SQLAlchemy points out SQL statement

base = declarative_base()

class Project(Base): # also can be called as models
__tablename__ = 'projects'
__table_args__ = {'schema':'household'}

project_id = Column(Integer, primary_key=True)
title = Column(String(length=50))
description = Column(String(length=50))

def __repr__(self):
return "<Project(title'{0}', description='{1}')>".format(
self.title, self.description
)

clase Task(Base):
__tablename__ = 'tasks'
__table_args__ = {'schema':'household'}

task_id = Column(Integer, primary_key=True)
project_id = Column(Integer, ForeignKey('household.projects.project_id'))
description = Column(String(length=50))

# define foreign key relationship
project = relationship("Project")
# notice we use project the name of the class not the name of the table

def __repr__(self):
return "<Task(description='{0}')>".format(
self.description
)

Base.metadata.create_all(engine) # pass in the engine
# for the create all function, if the table already exist
# then SQLAlchemy will use the already existing table rather than creating new tables

session_maker = sessionmaker()
session_maker.configure(bind=engine)
session = session_maker()

# session can create data transaction
# transaction is a set of all or nothing queries
# we either want all of them to run or none of them to run
# we never leave the database in half done state

# create the data
organize_closet_project = Project(title='Organize closet', description='In a cool way')
# insert into database
session.add(organize_closet_project)
session.commit()

tasks = [Task(project_id=organize_closet_project.project_id, description='Decide what close to donate'),
Task(project_id=organize_closet_project.project_id, description='Organize winter clothes'),
Task(project_id=organize_closet_project.project_id, description='Organize summer clothes')]

# to make project_id work, we need to commit the project edition first
# we have to do this because auto incrementation of the primary key doesn't happen until after
# the project has been added to the database
# once the project insertion is committed, the database will initialize that project_id attribute
# the project object itself will also be updated so project_id will no longer be null

session.bulk_save_objects(tasks) # to save a lot of data at one time
session.commit()

# select the organized closet project
our_project = session.query(Project).filter_by(title='Organize closet').first()
# this could return lots of results, but we add first for specific instance
print(our_project)

# retrieve all of the task data
our_task = session.query(Task).all()
print(our_task)



Add CSV

We have a database called ‘sales’.

记得.

1
pip install mysql-connector-python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
import mysql.connector as mysql
import csv

connect = mysql.connect(
host='localhost',
user='root',
password='password', # the password to your database
database='sales',
allow_local_infile=True)

cursor = connection.cursor()

create_query = '''CREATE TABLE salesperson(
id INT(255) NOT NULL AUTO_INCREMENT,
first_name VARCHAR(255) NOT NULL,
last_name VARCHAR(255) NOT NULL,
email_address VARCHAR(255) NOT NULL,
city VARCHAR(255) NOT NULL,
state VARCHAR(255) NOT NULL,
PRIMARY KEY(id))'''

cursor.execute("DROP TABLE IF EXISTS salesperson")

cursor.execute(create_query)

# this will read everything into string
# with open('./salesperson.csv', 'r') as f: # read only format
# csv_data = csv.reader(f)
# for row in csv_data:
# print(row)
# row_tuple = tuple(row)
# cursor.execute('INSERT INTO salesperson(first_name, last_name, email_address, city, state) VALUES("%s","%s","%s","%s","%s")', row_tuple)
## we do have to insert each row individually since we have to do that tuple conversion

q = '''LOAD DATA LOCAL INFILE '/Users/.../salesperson.csv' INTO TABLE salesperson FIELDS TERMINATED BY ',' ENCLOSED by '"' (first_name, last_name, email_address, city, state);'''

cursor.execute(q)

connection.commit()

curosr.execute("SELECT * FROM salesperson LIMIT 10")
print(cursor.fetchall())

connection.close()



ETL

因为大多是讲API如何在python中使用,tool>knowledge的成分多,就没有继续记下去了.
但这一块讲ETL的挺好所以单独记一下.

ETL, also known as extract, transformation and load.
Idea is we extract or we read the data from it source, database or spreadsheet. Or we extract it from applications of IoT data streams, and then we do things to transform the data.
We can think of each data source as a puzzle piece, and the transformation is the operations that help us reshape those puzzle pieces so that they all fit together in a logical way.
Finally once we have the data transformed then we can load it into a relational database, at which point we can then start using SQL to analyze it.

Transformation is a board topic, it could be as simple as removing out the white spaces or tabs at the beginning or end of the text.
We also have to reformat dates. Different system use different representation for dates, so we want to make sure we have a consistent way of looking at dates.
Also two different system that talks about the same thing, say for example, the departments in your organization they might use different codes to refer to them. So one application might use two letter codes to refer to departments and another application might spell out the whole name of the application. When transform, we need to pick the standard way of categorizing data or standard sets of codes and make sure we consistently use those.