Task : Finding the top-k users
The first feature that she wants to add is rewards for top-k users of the app each month. Specifically, she has written a script that generates, at the end of each month, a file containing the time spent (in milli seconds) by each user on the app. She wants to find the top-k most active users (based on the time they spend on the app) and give them rewards each month. If the two users have spent exactly the same amount of time, the user with smaller ID (the user who had registered at the app earlier) should be be given preference. She initially implemented merge sort algorithm that sorts all the users based on the time they spend on the app and retrieved the first k users. Let U be the total number of users. This algorithm clearly takes O(U log U) time to retrieve the top-k users assuming that comparing the time spent by two users takes O(1). She is concerned that this algorithm would be too slow when her app becomes a major hit (and the number of users becomes huge).
This task is to implement an algorithm that reports the top-k users in O(U log k) and uses O(U) space assuming that comparing the time spent by two users takes O(1). Note that you cannot assume that the maximum possible time spent by any user on the app is a constant, i.e., counting/radix sort does not meet the time and space complexity requirements.
Input
The input file named timeSpent.txt records the time spent (in milli seconds) by each user. Each line in the file has two values separated by a colon :. The first value is the user ID and the second value is the time spent by the user. Below are the first 5 lines from the file.
- 1:9695819
- 2:5004755
- 3:611496
- 4:2006676
- 5:423596
The user with ID 1 spent 9695819 ms on the app, the user with ID 2 spent 5004755 ms and so on.
Output
Your solution for this task must be written in a file named topk.py. The program must read from the file timeSpent.txt and ask the user to enter a value of k. Then, it must print the top-k users (in sorted order) in O(U log k) time and using O(U) space. Below is a sample output that your program must generate.
Example:
Enter the value of k: 10
#1: User ID: 61 Time spent: 9999446
#2: User ID: 7199 Time spent: 9998848
#3: User ID: 5666 Time spent: 9997642
#4: User ID: 4655 Time spent: 9996306
#5: User ID: 5922 Time spent: 9996306
#6: User ID: 7392 Time spent: 9994884
#7: User ID: 653 Time spent: 9994178
#8: User ID: 3117 Time spent: 9993810
The above output shows the top-10 users in descending order of the time they spent on the app. Note that users with IDs 4655 and 5922 spent exactly the same amount of time on the app. However, the user 4655 is ranked higher because its ID is smaller. Your program will be tested on a different version of timeSpent.txt, e.g., the number of users in timeSpent.txt (and/or the time they spend on the app) may be different.
import heapq def Heapify(data, n, i): ''' Find maximum data in a list If found the maximum data swap the position. Input: data, Length of a list, root Output: Maximum data Time complexity : O(k log n) where n is user list Space complexity: O(1) ''' largst = i # Initialize largest as root left = 2 * i + 1 right = 2 * i + 2 if left < n and data[i][1] < data[left][1]: largst = left if right < n and data[largst][1] < data[right][1]: largst = right if largst != i: data[i],data[largst] = data[largst],data[i] Heapify(data, n, largst) def heapSort(data,heap): """ Find maximum data in a list If found the maximum data swap the position. Input: data, Length of a list, root Output: Maximum data Time complexity : O(k log n) Space complexity: O(1) """ n = len(data) klen = len(heap) # Build a maxheap for i in range(n, -1, -1): Heapify(data, n, i) # One by one extract elements j = 0 for i in range(klen-1, 0, -1): data[i], data[0] = data[0], data[i] sort = data[i] heap[j] = sort Heapify(data, i, 0) j+=1 return heap def order(heap): """ Sorts User ID if thier time spent is equal Input: Tuple in a data list Output: Sorted data Time complexity: O(k) where k is the list size Space complexity: O(1) """ k = len(heap) for i in range(1,k): if heap[i][1]==heap[i-1][1] and heap[i][0]>heap[i-1][0]: heap[i], heap[i-1] = heap[i-1], heap[i] return heap def heap_Search(data, k ): """ Find top k tiems in data Input: Data and size of k Output: Sorted Top k """ heap = [[0,0]] for item in range(len(data)): x = data[item][1] y = data[item] if len(heap) < k or x > heap[0][1]: if len(heap) == k : heapq.heappop(heap) heapq.heappush( heap, y ) if heap[0]==[0,0]: del heap[0] return heap def openfile(): file = open("timeSpent.txt",'r') lines = file.readlines() file.close() for i in range(len(lines)): lines[i] = lines[i].rstrip("\n") for i in range(len(lines)): lines[i] = lines[i].split(":") lines[i][0] = int(lines[i][0]) lines[i][1] = int(lines[i][1]) return lines def main(): data = openfile() k = int(input("Enter the value of k:")) k +=1 heap = [None]*k heap = heapSort(data,heap) # Remove None for i in range(k): if heap[i] is None: del heap[i] # Sort User ID heap = order(heap) #heap2 = heap_Search(data,k) for i in range(len(heap)): print ("#",i+1,"User ID",heap[i][0], "Time spent:".rjust(10), heap[i][1]) if __name__=="__main__": main()
Algorithm Time complexity
1) openfile() function will read input text file and splitting the text file’s data into an array in this below format. It takes O(N) where N is the total number of users. data = [[User ID, Time spent]]
2) When Heap pass to heap_Search() function will take the first item of the data list. From the data list, heap_Search() will limit the heap list’s size to K size. If the heap is full heappop will remove the smallest element on the heap and push the current element as the new smallest element. This function will take U log K times to fulfil heap list where K is top K and U is the total number of users.
Worst case time complexity: O (U log K)
Worst case space complexity: O(1)
3) When the data pass to heapSort() function, the heapify() function will find the longest
spent time over total users and put that into new heap array until K times. If heap array is full, order() function will sort the array by User ID in ascending order if the user ID has same time spent recording. This takes (k log n) times.
Worst case time complexity: O (K log U)
Worst case space complexity: O(1)