Assigned: 20 October 2015
Due: 3 November 2015
# setup
import sqlite3
import comp521
import math
check, report = comp521.start('A3')
# these are the variables you will use below. You must use the variables because
# they will have different values when I grade your assignment.
from A3vars import *
Suppose you have a roster from a census return with $N$ entries in alphabetical order by first name. Each page has $P$ entries.
1. What is the lower bound $B_l$ and the upper bound $B_u$ on the number of pages that must be examined to find a given first name using a binary search? Assume that having computed a page number, you can open directly to that page.
# using variables N and P compute your answers and leave them in variables
# Bl and Bu
Bl = 1
Bu = math.ceil(math.log(N/P,2))
print 'lower bound =', Bl
print 'upper bound =', Bu
check('Bl', Bl, points=5)
check('Bu', Bu, points=5)
lower bound = 1 upper bound = 10.0 Bl correct Bu correct
2. Now suppose a 1-page table of contents is added to the roster. It lists the first "first name" on each of the other pages (Assume no first name has more than 1 page of entries). How many pages that must be examined to look up a given first name?
# leave your answer in variable A2.
A2 = 2
print 'A2 =', A2
check('A2', A2, points=5)
A2 = 2 A2 correct
3. Now suppose that the roster is reorganized such that each page contains a list of names with the same initials. For example, 'Thomas Jam Andrew' falls on the same page as 'Travis Jam Anand' (both have initials 'TJA'). Assume that no pattern takes more than one page, and there are a total of $M$ pages. Finally, assume that given the pattern you can go directly to the page. How many pages must be checked to find a particular name?
# compute your answer into A3
A3 = 1
check('A3', A3, points=5)
A3 correct
4. Using the same organization as question 3, assuming a page still has room for up to $P$ entries, what is the occupancy percentage of an average page in the new roster?
# compute your answer into A4
A4 = N/float(M*P)*100
print 'A4 =', A4
check('A4', A4, points=5)
A4 = 51.2 A4 correct
5. Now suppose we add a two-level index to the roster. The one-page primary index lists the first name on each of the secondary index pages. The secondary index pages list the first name on each of the roster pages. Assume no first name has more than 1 page of entries. How many pages must be examined to look up a given first name?
# compute your answer into A5
A5 = 3
print 'A5 =', A5
check('A5', A5, points=5)
A5 = 3 A5 correct
Consider a disk with a total disk capacity of $C_{disk}$ TB, a surface capacity of $C_{surf}$ GB, and a sector size of $C_{sect}$ bytes. There are $S_t$ sectors per track. The average seek time is $T_{seek}$ seconds.
I'm using the binary versions of Tera, Mega, etc. $T = 2^{40}$, $G = 2^{30}$, $M = 2^{20}$, etc.
6. What is the capacity of a single track in MB?
# compute your answer into A6
A6 = (Csect*St)/(math.pow(2,20))
print 'A6 =', A6
check('A6', A6, points=5)
A6 = 16.0 A6 correct
7. How many tracks are on each surface?
# compute your answer into A7
A7 = Csurf*1024/((Csect*St)/(math.pow(2,20)))
print 'A7 =', A7
check('A7', A7, points=5)
A7 = 8192.0 A7 correct
8. How many platters are in this disk? Each platter has 2 surfaces.
# compute your answer into A8
A8 = Cdisk*1024/(2*Csurf)
print 'A8 =', A8
check('A8', A8, points=5)
A8 = 8 A8 correct
9. How many cylinders are in this disk?
# compute your answer into A9
A9 = Csurf*1024/(((Csect*St)/(math.pow(2,20))))
print 'A9 =', A9
check('A9', A9, points=5)
A9 = 8192.0 A9 correct
10. If the maximum rotational delay is $T_{rotation}$ what is the spinning speed of the disk in revolutions per minute (RPM)?
# compute your answer into A10
A10 = (1/Trotation)*60
print 'A10 =', A10
check('A10', A10, points=5)
A10 = 6000.0 A10 correct
11. Suppose that a file containing $N_{file}$ records of 250 bytes each is to be stored on such a disk, and that no record is allowed to span two blocks. Assume a block size of $C_{block}$ bytes. How many blocks are required to store the entire file?
# Leave your answer in A11
A11 = Nfile/(math.ceil(Cblock/250))
print 'A11 =', A11
check('A11', A11, points=5)
A11 = 62500.0 A11 correct
Download and unzip actors.db.zip to produce an actors.db file in your working folder. This database has the following schema.
CREATE TABLE Actors (
aid INTEGER PRIMARY KEY,
FName TEXT,
LName TEXT);
CREATE TABLE Casts (
aid INTEGER,
mid INTEGER,
role TEXT,
PRIMARY KEY(aid, mid),
FOREIGN KEY(aid) REFERENCES Actors
FOREIGN KEY(mid) REFERENCES Movies);
CREATE TABLE Movies ( mid INTEGER PRIMARY KEY,
Title TEXT,
Year INTEGER);
In this part, you will try to speed up a few queries using indexes.
The basic syntax for creating indexes using SQLite is: CREATE INDEX index_name on table_name(column1, column2,...)
# this is a helper function to clear indexes for timing without them
# no need to change it
def dropIndexes(dbname):
'''drop any indexes that we may have created'''
db = sqlite3.connect(dbname)
cursor = db.cursor()
cursor.execute("""
select name from sqlite_master
where type = 'index' and sql not null
""")
for row in cursor.fetchall():
name = row[0]
print 'dropping', name
cursor.execute('drop index if exists %s' % name)
db.commit()
db.close()
# this is a helper function to measure the time required for a query to run
# no need to change it.
import time
def timeQuery(cursor, query, repeat=1):
T = 0
for i in range(repeat):
Tstart = time.time()
cursor.execute(query)
rowCounter = len(cursor.fetchall())
Tend = time.time()
T += Tend - Tstart
return (T/repeat, rowCounter)
# a helper to time several queries
# no need to change it
def runQueries(dbName, queries):
'''time queries from a list'''
db = sqlite3.connect(dbName)
cursor = db.cursor()
times = []
for q in queries:
t, rows = timeQuery(cursor, q, 5)
print q
print 'returned %d rows in %f seconds' % (rows, t)
times.append(t)
db.close()
return times
Here are the queries we'll run. Add your own if you like.
queries = [
"""
SELECT LName, FName FROM Actors WHERE LName = "Smith"
""",
"""
SELECT m.title FROM Movies m
WHERE m.Year > 2000 AND m.Year < 2002
""",
"""
SELECT A.LName, A.FName, M.year, M.title
FROM Actors A, Movies M, Casts C
WHERE A.aid = C.aid and M.mid = C.mid and
A.LName = 'Bishop' and
M.year >= 2000 and M.year <= 2002 and
C.role = 'Himself'
ORDER BY A.LName, A.FName, M.year
"""
]
# get the base time for the queries without indexes
# you shouldn't need to change this code
# though you might find it useful as I did to move the db to another drive
dbName = 'actors.db'
dropIndexes(dbName)
TimeWithout = runQueries(dbName, queries)
dropping lastNameIndex dropping movieYearIndex dropping actorIDIndex dropping movieIDIndex dropping bro dropping dude SELECT LName, FName FROM Actors WHERE LName = "Smith" returned 5773 rows in 0.230371 seconds SELECT m.title FROM Movies m WHERE m.Year > 2000 AND m.Year < 2002 returned 11039 rows in 0.098444 seconds SELECT A.LName, A.FName, M.year, M.title FROM Actors A, Movies M, Casts C WHERE A.aid = C.aid and M.mid = C.mid and A.LName = 'Bishop' and M.year >= 2000 and M.year <= 2002 and C.role = 'Himself' ORDER BY A.LName, A.FName, M.year returned 11 rows in 0.221059 seconds
Can you do better adding indexes? I was able to get the speedups shown below by adding two indexes on single columns. Experiment with it. I actually found one index that reduced the speedup (a bad thing) from about 800 to about 1.3. You'll want to avoid that one.
# can you do better with an index?
dropIndexes(dbName)
db = sqlite3.connect(dbName)
cursor = db.cursor()
# create your indexes here
cursor.execute("CREATE INDEX lastNameIndex ON Actors(LName)")
cursor.execute("CREATE INDEX movieYearIndex ON Movies(Year)")
# your work should go between this comment and the one above
db.commit()
db.close()
# now get the time for the queries with indexes
TimeWith = runQueries(dbName, queries)
SELECT LName, FName FROM Actors WHERE LName = "Smith" returned 5773 rows in 0.009840 seconds SELECT m.title FROM Movies m WHERE m.Year > 2000 AND m.Year < 2002 returned 11039 rows in 0.033627 seconds SELECT A.LName, A.FName, M.year, M.title FROM Actors A, Movies M, Casts C WHERE A.aid = C.aid and M.mid = C.mid and A.LName = 'Bishop' and M.year >= 2000 and M.year <= 2002 and C.role = 'Himself' ORDER BY A.LName, A.FName, M.year returned 11 rows in 0.001804 seconds
# I'm computing the speedup here
for i in range(len(queries)):
print i, TimeWithout[i]/TimeWith[i]
0 23.4110221645 1 2.92755146029 2 122.549895583
Now I want you to report your results. The times for the queries without indexes should be in variable TimeWithout
and with indexes in TimeWith
. I want you to compute the speedup (TimeWithout[i] / TimeWith[i]) for each query. If you're doing it right, you should be able to achieve speedups of greater than 1 (duh) but they will vary considerably depending on your disk drives, processor speed, memory size, etc.
Write your report by changing the numbers in the second column of the table below.
Query | Speedup |
---|---|
0 | 24.6 |
1 | 3.1 |
2 | 783 |
This part will count 40 points but I'm not including in the checker results.
onyen = 'jpuccio'
collaborators = ['Patrick Lung', 'Zen Yang']
report(onyen, collaborators)
A10 A11 A2 A3 A4 A5 A6 A7 A8 A9 Bl Bu Report for jpuccio Collaborators: ['Patrick Lung', 'Zen Yang'] 12 of 12 correct, 60 of 60 points
Click the button below to submit your assignment. Watch for a confirmation message that your notebook was successfully uploaded. You may submit as often as you wish, only the last submission will count.
dropIndexes(dbName)
dropping lastNameIndex dropping movieYearIndex