Python coding: Heap Sorting example (Movie buddies)

FIT2004
Spread the love

Task : Finding the top-k users

The first feature that she wants to add is rewards for top-k users of the app each month. Specifically, she has written a script that generates, at the end of each month, a file containing the time spent (in milli seconds) by each user on the app. She wants to find the top-k most active users (based on the time they spend on the app) and give them rewards each month. If the two users have spent exactly the same amount of time, the user with smaller ID (the user who had registered at the app earlier) should be be given preference. She initially implemented merge sort algorithm that sorts all the users based on the time they spend on the app and retrieved the first k users. Let U be the total number of users. This algorithm clearly takes O(U log U) time to retrieve the top-k users assuming that comparing the time spent by two users takes O(1). She is concerned that this algorithm would be too slow when her app becomes a major hit (and the number of users becomes huge).


This task is to implement an algorithm that reports the top-k users in O(U log k) and uses O(U) space assuming that comparing the time spent by two users takes O(1). Note that you cannot assume that the maximum possible time spent by any user on the app is a constant, i.e., counting/radix sort does not meet the time and space complexity requirements.

Input

The input file named timeSpent.txt records the time spent (in milli seconds) by each user. Each line in the file has two values separated by a colon :. The first value is the user ID and the second value is the time spent by the user. Below are the first 5 lines from the file.

  • 1:9695819
  • 2:5004755
  • 3:611496
  • 4:2006676
  • 5:423596

The user with ID 1 spent 9695819 ms on the app, the user with ID 2 spent 5004755 ms and so on.

Output

Your solution for this task must be written in a file named topk.py. The program must read from the file timeSpent.txt and ask the user to enter a value of k. Then, it must print the top-k users (in sorted order) in O(U log k) time and using O(U) space. Below is a sample output that your program must generate.

Example:

Enter the value of k: 10

#1: User ID: 61 Time spent: 9999446

#2: User ID: 7199 Time spent: 9998848

#3: User ID: 5666 Time spent: 9997642

#4: User ID: 4655 Time spent: 9996306

#5: User ID: 5922 Time spent: 9996306

#6: User ID: 7392 Time spent: 9994884

#7: User ID: 653 Time spent: 9994178

#8: User ID: 3117 Time spent: 9993810

The above output shows the top-10 users in descending order of the time they spent on the app. Note that users with IDs 4655 and 5922 spent exactly the same amount of time on the app. However, the user 4655 is ranked higher because its ID is smaller. Your program will be tested on a different version of timeSpent.txt, e.g., the number of users in timeSpent.txt (and/or the time they spend on the app) may be different.

import heapq

def Heapify(data, n, i):
    '''
    Find maximum data in a list
    If found the maximum data swap the position. 
    
    Input: data, Length of a list, root
    Output: Maximum data
    Time complexity : O(k log n) where n is user list
    Space complexity: O(1)
    '''
    
    largst = i                  # Initialize largest as root
    left = 2 * i + 1
    right = 2 * i + 2     

    if left < n and data[i][1] < data[left][1]:
        largst = left
 
    if right < n and data[largst][1] < data[right][1]:
        largst = right

    if largst != i:
        data[i],data[largst] = data[largst],data[i]
        
        Heapify(data, n, largst)
 
def heapSort(data,heap):
    """
    Find maximum data in a list
    If found the maximum data swap the position. 
    
    Input: data, Length of a list, root
    Output: Maximum data
    Time complexity : O(k log n) 
    Space complexity: O(1)
    
    """
    n = len(data)
    klen = len(heap)
    
    # Build a maxheap
    for i in range(n, -1, -1):
        Heapify(data, n, i)
        
    # One by one extract elements
    j = 0
    for i in range(klen-1, 0, -1):
        data[i], data[0] = data[0], data[i] 
        sort = data[i]
        heap[j] = sort
        Heapify(data, i, 0)
        j+=1
    return heap


def order(heap):
    """
    Sorts User ID if thier time spent is equal 
    
    Input: Tuple in a data list
    Output: Sorted data
    Time complexity: O(k) where k is the list size
    Space complexity: O(1)
    """
    k = len(heap)
    for i in range(1,k):
        if heap[i][1]==heap[i-1][1] and heap[i][0]>heap[i-1][0]:
            heap[i], heap[i-1] = heap[i-1], heap[i]
    return heap 
    


def heap_Search(data, k ):
    """
    Find top k tiems in data
    
    Input: Data and size of k
    Output: Sorted Top k 
    """
    heap = [[0,0]]
    
    for item in range(len(data)):
        x = data[item][1]
        y = data[item]

        if len(heap) < k or x > heap[0][1]:
            if len(heap) == k  :
                heapq.heappop(heap)
            heapq.heappush( heap, y )
        if heap[0]==[0,0]:
            del heap[0]
    return heap

  
def openfile():
    file = open("timeSpent.txt",'r')
    lines = file.readlines()
    file.close()
    
    for i in range(len(lines)):
        lines[i] = lines[i].rstrip("\n")
    for i in range(len(lines)):
        lines[i] = lines[i].split(":")
        lines[i][0] = int(lines[i][0])
        lines[i][1] = int(lines[i][1])
    return lines

def main():
    data = openfile()

    k = int(input("Enter the value of k:"))
    k +=1
    heap = [None]*k
    heap = heapSort(data,heap)
    
    # Remove None
    for i in range(k):
        if heap[i] is None:
            del heap[i]
          
    # Sort User ID
    heap = order(heap)
    #heap2 = heap_Search(data,k)
    
    for i in range(len(heap)):
        print ("#",i+1,"User ID",heap[i][0], "Time spent:".rjust(10), heap[i][1])
                  
if __name__=="__main__":
    main()
    


Algorithm Time complexity

1) openfile() function will read input text file and splitting the text file’s data into an array in this below format. It takes O(N) where N is the total number of users. data = [[User ID, Time spent]] 

2) When Heap pass to heap_Search() function will take the first item of the data list. From the data list, heap_Search() will limit the heap list’s size to K size. If the heap is full heappop will remove the smallest element on the heap and push the current element as the new smallest element. This function will take U log K times to fulfil heap list where K is top K and U is the total number of users. 

Worst case time complexity: O (U log K) 

Worst case space complexity: O(1) 

3) When the data pass to heapSort() function, the heapify() function will find the longest 

spent time over total users and put that into new heap array until K times. If heap array is full, order() function will sort the array by User ID in ascending order if the user ID has same time spent recording. This takes (k log n) times. 

Worst case time complexity: O (K log U) 

Worst case space complexity: O(1) 


By Python Helper

Author: preservsun

Leave a Reply

Your email address will not be published. Required fields are marked *