python - Randomly sample rows from a file based on times in columns
This is a bit complex, and I'd appreciate any help! I am trying to randomly sample rows from a .csv file. Essentially, I want a resulting file of unique locations (locations are specified by the easting and northing columns of the data file, below). I want to randomly pull 1 location per 12-hour period per sessiondate in this file (the 12-hour periods are divided into: between 0631 and 1829 hours, and between 1830 and 0630 hours; the times are given as start: and end: in the data file, below). However, if any 2 locations are within 6 hours of each other (based on their start: time), that location should be tossed and a new location randomly drawn, and the sampling should continue until no new locations are drawn (i.e., sampling without replacement). I have been trying to do this with Python, but my experience is limited. I tried first putting each row into a dictionary, and then each row into a list, as follows:
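    import random
    import csv

    # Read every line of the file and split it into its comma-separated fields.
    # (The original opened the file in universal-newline mode "U", which
    # Python 3 no longer needs or accepts.)
    rows = []
    with open('file.csv') as f:
        for line in f:
            rows.append(line.strip().split(','))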
I'm unsure where to go from here: how to sample from these lists in the way I need to, and then write them to an output file with my 'unique' locations.
Here are the top few lines of my data file:
    sessiondate  start:  end:   easting  northing
    27-apr-07    18:00   21:45  174739   9785206
    28-apr-07    18:00   21:30  171984   9784738
    28-apr-07    18:00   21:30  171984   9784738
    28-apr-07    18:00   21:30  171984   9784738
    28-apr-07    18:00   21:30  171984   9784738
It gets a bit complicated because some of the observations span midnight, so they may be on different dates but can still be within 6 hours of each other (which is why I have this criterion), for example:
    sessiondate  start:  end:   easting  northing
    27-apr-07    22:30   23:25  171984   9784738
    28-apr-07    0:25    1:30   174739   9785206
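The midnight case above stops being a problem once the sessiondate and start: columns are combined into a single datetime before comparing. The following is only a minimal sketch of that idea; the helper name parse_start is made up for illustration, and it assumes the date and time formats shown in the sample rows (and an English locale for the month abbreviation).

```python
from datetime import datetime, timedelta


def parse_start(session_date, start):
    """Combine the sessiondate and start: columns into one datetime."""
    return datetime.strptime(session_date + " " + start, "%d-%b-%y %H:%M")


# The two example rows that fall on either side of midnight.
first = parse_start("27-apr-07", "22:30")
second = parse_start("28-apr-07", "0:25")

# Subtracting datetimes gives a timedelta, so the date change is handled for us.
gap = abs(second - first)
print(gap)
print(gap < timedelta(hours=6))
```

Because the comparison is done on full datetimes rather than on the time column alone, the two rows are correctly seen as less than 6 hours apart even though their dates differ.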
Here's a solution. I made a few changes to the data (the location, to make it easier to eyeball the results). I create a dict of dates pointing to another dict of locations, which in turn points to a list of selected rows.
    from datetime import datetime, timedelta

    data = """sessiondate start: end: easting northing
    27-apr-07 18:00 21:45 a 1
    27-apr-07 18:00 21:30 g 2
    28-apr-07 18:00 21:30 b 2
    28-apr-07 18:00 21:30 b 2
    28-apr-07 18:00 21:30 b 2
    29-apr-07 8:00 11:30 c 3
    29-apr-07 20:00 21:30 c 3
    29-apr-07 20:00 21:30 c 3
    30-apr-07 8:00 10:30 d 4
    30-apr-07 16:00 17:30 e 5
    30-apr-07 14:00 21:30 f 6
    30-apr-07 18:00 21:30 f 6
    """

    # selected maps: start datetime -> {location -> [rows]}
    selected = {}
    for line in data.split("\n"):
        if "session" in line:   # skip the header row
            continue
        if not line.strip():    # skip blank lines
            continue

        tmp = [x for x in line.split() if x]
        raw_dt = " ".join([tmp[0], tmp[1]]).strip()
        curr_dt = datetime.strptime(raw_dt, "%d-%b-%y %H:%M")
        loc = (tmp[-2], tmp[-1])

        found = False
        for dt in selected:
            # absolute time difference between this row and a stored start time
            diff = dt - curr_dt
            if dt < curr_dt:
                diff = curr_dt - dt
            # print(dt, curr_dt, diff, diff <= timedelta(hours=12), loc, loc in selected[dt])
            if diff <= timedelta(hours=12):
                if loc not in selected[dt]:
                    selected[dt].setdefault(loc, []).append(tmp)
                    found = True
                else:
                    found = True
        if not found:
            if curr_dt not in selected:
                selected[curr_dt] = {}
            if loc not in selected[curr_dt]:
                selected[curr_dt][loc] = [tmp,]

    # if the output needs to be sorted
    rows = sorted(x for k in selected for l in selected[k] for x in selected[k][l])
    for row in rows:
        print(" ".join(row))
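The code above groups rows by time window but does not draw anything at random. As a complement, here is a rough sketch of the random draw described in the question. It is not the answerer's method, and it rests on several assumptions that are mine: a period is identified by the sessiondate as written in the file plus a day (0631 to 1829) or night label, "within 6 hours" is taken to mean strictly less than 6 hours between start: times, and a location that has already been picked is not picked again. The names parse_start, period_key and sample_rows are hypothetical.

```python
import random
from datetime import datetime, time, timedelta


def parse_start(row):
    """Combine the sessiondate and start: fields into one datetime."""
    return datetime.strptime(row[0] + " " + row[1], "%d-%b-%y %H:%M")


def period_key(row):
    """Assumed grouping: (sessiondate, 'day') for 0631-1829, else (sessiondate, 'night')."""
    start = parse_start(row).time()
    label = "day" if time(6, 31) <= start <= time(18, 29) else "night"
    return (row[0], label)


def sample_rows(rows, min_gap=timedelta(hours=6), rng=random):
    """Pick at most one row per period, rejecting draws that break the rules."""
    groups = {}
    for row in rows:
        groups.setdefault(period_key(row), []).append(row)

    chosen = []
    for key in groups:
        candidates = groups[key][:]
        rng.shuffle(candidates)  # trying rows in random order = drawing without replacement
        for row in candidates:
            start = parse_start(row)
            loc = (row[3], row[4])
            too_close = any(abs(start - parse_start(c)) < min_gap for c in chosen)
            seen = any(loc == (c[3], c[4]) for c in chosen)  # assumed uniqueness rule
            if not too_close and not seen:
                chosen.append(row)
                break  # one location per period
    return chosen


# The sample rows from the question, split on whitespace.
text = """sessiondate start: end: easting northing
27-apr-07 18:00 21:45 174739 9785206
28-apr-07 18:00 21:30 171984 9784738
28-apr-07 18:00 21:30 171984 9784738
27-apr-07 22:30 23:25 171984 9784738
28-apr-07 0:25 1:30 174739 9785206"""
rows = [line.split() for line in text.splitlines()[1:]]

chosen = sample_rows(rows, rng=random.Random(0))  # seeded only to make runs repeatable
for row in chosen:
    print(" ".join(row))
```

Periods are visited in the order they first appear in the file, so earlier periods get first pick; shuffling the period order as well would remove that bias. For a real .csv file, the rows could be read with csv.reader and the chosen rows written back out with csv.writer.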