Mar 6, 2019

Python: open a very big file, Python memory usage


Opening a very big file in Python: 

#Say you are looping through a big 2 TB log file
logfile = open("huge_log_file.txt", "r")
info_lines = [(line, len(line)) for line in logfile if line.startswith("INFO")]
#This builds a huge list - it costs RAM, and the list could hold close to 2 TB of content

logfile = open("huge_log_file.txt", "r")
info_lines = ((line, len(line)) for line in logfile if line.startswith("INFO"))
#This builds a generator object instead - memory efficient
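
For example, a minimal sketch (assuming the same huge_log_file.txt) that consumes the generator one item at a time, so only the current line is ever held in memory:

#Consume the generator lazily: only one (line, length) pair exists at a time
logfile = open("huge_log_file.txt", "r")
info_lines = ((line, len(line)) for line in logfile if line.startswith("INFO"))

count = 0
total_chars = 0
for line, length in info_lines:
    count += 1
    total_chars += length
logfile.close()

print("INFO lines: %d, total characters: %d" % (count, total_chars))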


Opening a file (in read mode) does NOT implicitly read or load its contents into memory, even when you do so using Python's context management protocol (the with keyword).

e.g., if you iterate over it line by line:

with open('huge_log_file.txt', 'r') as f:
    for each_line in f:
        do_something_each_line(each_line)

then your peak memory utilization shouldn't be much larger than the longest line in the file.
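
If you want to verify that claim yourself, a rough sketch using the standard library's tracemalloc module (Python 3 only) might look like this; it tracks the longest line seen and the peak memory traced during the loop:

import tracemalloc

tracemalloc.start()
longest = 0
with open('huge_log_file.txt', 'r') as f:
    for each_line in f:
        longest = max(longest, len(each_line))
current, peak = tracemalloc.get_traced_memory()
print("longest line: %d chars, peak traced memory: %d bytes" % (longest, peak))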

If you really are reading the full content of the file into a data structure such as a list, then it's no wonder that your RAM usage peaks like that. 
It's not that Python puts the full contents of the file in RAM; you do.
e.g.,

#This loads everything into memory, because you are storing every matching line in a list called info_lines
info_lines = [(line, len(line)) for line in logfile if line.startswith("INFO")]


#Memory efficient - this returns a generator instead
info_lines = ((line, len(line)) for line in logfile if line.startswith("INFO"))
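
As a rough sketch of how such a generator would actually be consumed without ever building a list (the output name filtered_info.txt is just illustrative), you can stream the matching lines straight into another file:

#Stream matching lines to an output file; memory stays bounded by roughly one line,
#no matter how large the log is
with open("huge_log_file.txt", "r") as src, open("filtered_info.txt", "w") as dst:
    info_lines = ((line, len(line)) for line in src if line.startswith("INFO"))
    for line, length in info_lines:
        dst.write(line)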



path = '/tmp/huge_log_file.txt'

#This is fine - the file is consumed one line at a time
with open(path, 'r') as fh:
    for each in fh:
        print(each)

#This is a blunder - readlines() loads everything into memory and can crash the process
#with open(path, 'r') as fh:
#    lines = fh.readlines()
#    print(len(lines))
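
One caveat: if the file has extremely long lines (or no newlines at all), iterating line by line won't cap memory either. A possible alternative, sketched below, is to read fixed-size chunks; the 1 MB chunk size and the process_chunk helper are only placeholders:

#Read fixed-size chunks so memory use is capped at roughly chunk_size,
#regardless of how long the lines are
chunk_size = 1024 * 1024  #1 MB per read; tune as needed
with open(path, 'r') as fh:
    while True:
        chunk = fh.read(chunk_size)
        if not chunk:
            break
        process_chunk(chunk)  #placeholder for your own per-chunk handling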

