Reading and Writing Files#
What you will learn in this lesson:
Open a file
Read a file
Write to a file
Introduction#
So far we have only used data (e.g. lists of numbers, dictionaries, etc.) already defined in our code. However, an enormous amount of data is stored in separate files. These files can contain text, weather data, traffic data, socioeconomic data, literary works, and more.
Reading from a file is essential to almost any data science research question. For example, you may need to write a program that reads some text from a file to feed that information to a Large Language Model (LLM).
Alternatively, you may want to write text, tabular data, etc. generated by your code to a separate file, so that you do not lose it when your program finishes and do not need to generate it again the next time you run your code.
Opening a File#
Before we can read data from an external file, or write data to it, we need to open that file.
In Python we can do that with the open function. The syntax to open a file is the following:
with open("path/to/filename", "Access_Mode") as [some name]:
    # Some code for input (reading) / output (writing) operation(s)
Let’s break this down a bit:
- We begin with the keyword with. This ensures proper resource management by automatically closing the file once the indented block finishes, even if an error occurs inside it.
- Next, we use the open function, which here takes two arguments:
  - The path to the file we want to open.
  - The mode in which we want to open the file (e.g., read, write, etc.).
- We follow this with as, which binds the file object returned by open to a name.
- After the as, we provide that name.
- The statement ends with a colon (:).
- Finally, we indent the block of code where we perform the input or output operations on the file.
# Simple example
with open('datascience_41model.txt') as f:
    print(f)
<_io.TextIOWrapper name='datascience_41model.txt' mode='r' encoding='UTF-8'>
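To see what with does for us, the following minimal sketch checks the closed attribute of the file object inside and after the block. The file name demo.txt and the temporary directory are assumptions made only so that the example runs on its own. The encoding is passed explicitly because the default shown in the output above can differ between platforms.

```python
import os
import tempfile

# Hypothetical file, created in a temporary directory just for this sketch
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write("hello")
    closed_inside = f.closed   # False: the file is still open inside the block

closed_outside = f.closed      # True: 'with' closed the file when the block ended

print(closed_inside, closed_outside)  # False True

# Remove the demonstration file and directory
os.remove(path)
os.rmdir(tmp_dir)
```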
File Modes#
Common file modes in Python include:
- 'r': Read (default). Opens the file for reading; the file must already exist.
- 'w': Write. Opens the file for writing, creating it if it does not exist and truncating (clearing) it if it does.
- 'a': Append. Opens the file for writing, creating it if it does not exist and adding content to the end without clearing it.
- 'r+': Read and write. Opens an existing file for both reading and writing.
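As a rough illustration of how these modes differ, the sketch below applies each of them in turn to the same file. The file name modes_demo.txt is hypothetical, and the file lives in a temporary directory so nothing is left behind.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "modes_demo.txt")  # hypothetical file name

with open(path, "w") as f:      # 'w' creates the file because it does not exist yet
    f.write("abc")

with open(path, "w") as f:      # 'w' again: the previous content is discarded
    f.write("123")

with open(path, "a") as f:      # 'a': new content goes after the existing content
    f.write("456")

with open(path, "r+") as f:     # 'r+': read and write on an existing file
    before = f.read()           # reading everything moves the position to the end
    f.write("789")              # so this write lands after the existing content

with open(path) as f:           # no mode given, so 'r' is used
    after = f.read()

print(before)  # 123456
print(after)   # 123456789

# Remove the demonstration file and directory
os.remove(path)
os.rmdir(tmp_dir)
```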
Reading from a File#
To read, we open the file in 'r' mode, or pass no mode at all, since files are opened in read mode by default.
There are different ways to read from a file:
Use the method read of the file object to read an entire file.
with open('datascience_41model.txt', 'r') as f:
    content = f.read()
print(content)
Data Science is a complex and evolving field, but most agree that it can be defined as a combination of expertise drawn from three broad areascomputer science and technology, math and statistics, and domain knowledge -- with the purpose of extracting knowledge and value from data.
Beyond this, the field is often defined as a series of practical activities ranging from the cleaning and wrangling of data, to its analysis and use to infer models, to the visual and rhetorical representation of results to stakeholders and decision-makers.
This essay proposes a model of data science that goes beyond laundry-list definitions to get at the specific nature of data science and help distinguish it from adjacent fields such as computer science and statistics.
We define data science as an interdisciplinary field comprising four broad areas of expertise: value, design, systems, and analytics. A fifth area, practice, integrates the other four in specific contexts of domain knowledge. We call this the 4+1 model of data science.
Together, these areas belong to every data science project, even if they are often unconnected and siloed in the academy.
Pass a number to the method read of the file object to read only part of a file. Here, read(217) returns at most the first 217 characters.
with open('datascience_41model.txt', 'r') as f:
    content_partial = f.read(217)
print(content_partial)
Data Science is a complex and evolving field, but most agree that it can be defined as a combination of expertise drawn from three broad areascomputer science and technology, math and statistics, and domain knowledge
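Each call to read with a size continues from where the previous call stopped, and returns an empty string once the end of the file is reached. A minimal sketch of reading a file in fixed-size pieces, assuming a small hypothetical file chunks_demo.txt created just for the example:

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "chunks_demo.txt")  # hypothetical file name

with open(path, "w") as f:
    f.write("abcdefghij")

chunks = []
with open(path, "r") as f:
    while True:
        chunk = f.read(4)        # read at most 4 characters
        if chunk == "":          # an empty string means end of file
            break
        chunks.append(chunk)

print(chunks)  # ['abcd', 'efgh', 'ij']

# Remove the demonstration file and directory
os.remove(path)
os.rmdir(tmp_dir)
```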
Use the method readlines to read a file and return a list of strings, where each element corresponds to a line (including its trailing newline character):
with open('datascience_41model.txt', 'r') as f:
    content = f.readlines()
print(content)
['Data Science is a complex and evolving field, but most agree that it can be defined as a combination of expertise drawn from three broad areascomputer science and technology, math and statistics, and domain knowledge -- with the purpose of extracting knowledge and value from data. \n', '\n', 'Beyond this, the field is often defined as a series of practical activities ranging from the cleaning and wrangling of data, to its analysis and use to infer models, to the visual and rhetorical representation of results to stakeholders and decision-makers. \n', '\n', 'This essay proposes a model of data science that goes beyond laundry-list definitions to get at the specific nature of data science and help distinguish it from adjacent fields such as computer science and statistics. \n', '\n', 'We define data science as an interdisciplinary field comprising four broad areas of expertise: value, design, systems, and analytics. A fifth area, practice, integrates the other four in specific contexts of domain knowledge. We call this the 4+1 model of data science. \n', '\n', 'Together, these areas belong to every data science project, even if they are often unconnected and siloed in the academy.\n']
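The list returned by readlines keeps the newline character at the end of each line, and blank lines show up as elements that contain only a newline. One possible way to tidy this up is sketched below, using a small hypothetical file lines_demo.txt rather than the essay above.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "lines_demo.txt")  # hypothetical file name

with open(path, "w") as f:
    f.write("first line\n\nsecond line\n")

with open(path, "r") as f:
    raw_lines = f.readlines()

# Drop the trailing newline from every line
clean_lines = [line.rstrip("\n") for line in raw_lines]
# Keep only the lines that still contain text
non_empty = [line for line in clean_lines if line != ""]

print(raw_lines)    # ['first line\n', '\n', 'second line\n']
print(clean_lines)  # ['first line', '', 'second line']
print(non_empty)    # ['first line', 'second line']

# Remove the demonstration file and directory
os.remove(path)
os.rmdir(tmp_dir)
```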
Iterate through a file object to read line by line.
Sometimes our external file is too large to read all at once. Instead, we can read it line by line. In this case, we treat the opened file as an iterator and loop over its lines with a for-loop:
with open('datascience_41model.txt', 'r') as f:
    for line in f:
        print(line)
Data Science is a complex and evolving field, but most agree that it can be defined as a combination of expertise drawn from three broad areascomputer science and technology, math and statistics, and domain knowledge -- with the purpose of extracting knowledge and value from data.
Beyond this, the field is often defined as a series of practical activities ranging from the cleaning and wrangling of data, to its analysis and use to infer models, to the visual and rhetorical representation of results to stakeholders and decision-makers.
This essay proposes a model of data science that goes beyond laundry-list definitions to get at the specific nature of data science and help distinguish it from adjacent fields such as computer science and statistics.
We define data science as an interdisciplinary field comprising four broad areas of expertise: value, design, systems, and analytics. A fifth area, practice, integrates the other four in specific contexts of domain knowledge. We call this the 4+1 model of data science.
Together, these areas belong to every data science project, even if they are often unconnected and siloed in the academy.
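Each line produced by the loop still ends with its own newline character, and print adds another one, so the printed text comes out with an extra blank line after every line. The sketch below passes end="" to print to avoid that, and also shows that we can compute something about a file (here the number of lines and the length of the longest one) while holding only one line in memory at a time. The file name loop_demo.txt is hypothetical.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "loop_demo.txt")  # hypothetical file name

with open(path, "w") as f:
    f.write("alpha\nbeta\ngamma\n")

line_count = 0
longest = 0
with open(path, "r") as f:
    for number, line in enumerate(f, start=1):
        print(number, line, end="")   # the line already ends with a newline
        line_count = number
        longest = max(longest, len(line.rstrip("\n")))

print(line_count, longest)  # 3 5

# Remove the demonstration file and directory
os.remove(path)
os.rmdir(tmp_dir)
```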
Writing to a File#
To write, we will mostly use the method write of the file object.
For this type of operation, we can adopt two different modes:
'w': To write after truncating (clearing) the file if it exists.
# Write mode
with open('test.txt', 'w') as f:
    for i in range(1, 5):
        f.write(str(i))
with open('test.txt', 'r') as f:
    content = f.read()
print(content)
########################################
# Do not worry about this part right now
import os
if os.path.exists('test.txt'):
    os.remove('test.txt')
#########################################
1234
We can be quite flexible with what we write:
with open('test2.txt', 'w') as f:
    for i in range(1, 5):
        f.write(str(i))
        f.write("\n")
with open('test2.txt', 'r') as f:
    content = f.read()
print(content)
########################################
# Do not worry about this part right now
import os
if os.path.exists('test2.txt'):
    os.remove('test2.txt')
#########################################
1
2
3
4
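There is more than one way to put each item on its own line. As a sketch, the two alternatives below should produce the same file content as the loop above: the first builds the whole text with join and writes it once, the second uses the file argument of print, which adds the newline for us. The file names are hypothetical.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path_join = os.path.join(tmp_dir, "join_demo.txt")    # hypothetical file name
path_print = os.path.join(tmp_dir, "print_demo.txt")  # hypothetical file name

numbers = range(1, 5)

# Alternative 1: build the whole text first, then write it once
with open(path_join, "w") as f:
    f.write("\n".join(str(i) for i in numbers) + "\n")

# Alternative 2: let print convert each value and add the newline
with open(path_print, "w") as f:
    for i in numbers:
        print(i, file=f)

with open(path_join, "r") as f:
    content_join = f.read()
with open(path_print, "r") as f:
    content_print = f.read()

print(content_join == content_print)  # True

# Remove the demonstration files and directory
os.remove(path_join)
os.remove(path_print)
os.rmdir(tmp_dir)
```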
'a': To add content to the end of the file without clearing it.
# First, write mode
with open('test3.txt', 'w') as f:
    for i in range(1, 5):
        f.write(str(i))
with open('test3.txt', 'r') as f:
    content = f.read()
print(content)
# Then append more numbers to this file
with open('test3.txt', 'a') as f:
    for i in range(5, 10):
        f.write(str(i))
with open('test3.txt', 'r') as f:
    content = f.read()
print(content)
########################################
# Do not worry about this part right now
import os
if os.path.exists('test3.txt'):
    os.remove('test3.txt')
#########################################
1234
123456789
If the file does not exist, 'a' creates it, so on a missing or empty file 'a' gives the same result as 'w':
with open('test4.txt', 'a') as f:
    for i in range(1, 5):
        f.write(str(i))
with open('test4.txt', 'r') as f:
    content = f.read()
print(content)
########################################
# Do not worry about this part right now
import os
if os.path.exists('test4.txt'):
    os.remove('test4.txt')
#########################################
1234
write only accepts text data, i.e. strings, so other types such as integers must be converted first (for example with str).
# This will fail
with open('test5.txt', 'w') as f:
    try:
        for i in range(1, 5):
            f.write(i)
    except TypeError as e:
        print(e)
########################################
# Do not worry about this part right now
# (the file is removed only after the with block has closed it)
import os
if os.path.exists('test5.txt'):
    os.remove('test5.txt')
#########################################
write() argument must be str, not int
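A common way around this error is to build a string before calling write, either with str or with an f-string, which also lets us control the formatting. The sketch below assumes some made-up readings and a hypothetical file report_demo.txt.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "report_demo.txt")  # hypothetical file name

# Made-up values, used only to illustrate the conversion to text
readings = [("Bilbao", 18.456), ("Paris", 21)]

with open(path, "w") as f:
    for city, temperature in readings:
        # The f-string turns the number into text with one decimal place
        f.write(f"{city}: {temperature:.1f}\n")
    # str() converts the integer before it is joined to the rest of the text
    f.write(str(len(readings)) + " readings\n")

with open(path, "r") as f:
    report = f.read()

print(report)

# Remove the demonstration file and directory
os.remove(path)
os.rmdir(tmp_dir)
```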
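We could also use the method writelines which, as the counterpart of readlines, writes multiple strings (for example, from a list) in a single call. Note that it does not add newline characters, so each string must already contain its own.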
list_string = ["You are in Berlin \n", "You are in Paris \n", "You are in Bilbao \n"]
with open('test6.txt', 'w') as f:
    f.writelines(list_string)
with open('test6.txt', 'r') as f:
    content = f.read()
print(content)
########################################
# Do not worry about this part right now
import os
if os.path.exists('test6.txt'):
    os.remove('test6.txt')
#########################################
You are in Berlin 
You are in Paris 
You are in Bilbao 
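Despite its name, writelines does not put anything between the strings it writes. The sketch below, with a hypothetical file colors_demo.txt and made-up data, shows what happens when the strings have no newline of their own, and one way to add the newlines while writing.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "colors_demo.txt")  # hypothetical file name

colors = ["red", "green", "blue"]   # note: no newline characters here

with open(path, "w") as f:
    f.writelines(colors)            # the strings are written back to back
with open(path, "r") as f:
    glued = f.read()

with open(path, "w") as f:
    # Add the newline to each string as it is written
    f.writelines(color + "\n" for color in colors)
with open(path, "r") as f:
    separated = f.read()

print(repr(glued))      # 'redgreenblue'
print(repr(separated))  # 'red\ngreen\nblue\n'

# Remove the demonstration file and directory
os.remove(path)
os.rmdir(tmp_dir)
```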
Practice exercises#
1- Create a list of cities: ['Bilbao', 'London', 'Paris', 'Budapest'].
2- Write each city from the list to a file called "best_cities.txt", ensuring each city is on a new line.
3- Read the created file to make sure the outcome is correct.
# Your answers here
1- Add two more cities that you think are the best, or that you simply like, to the previous file "best_cities.txt". Ensure again that each new city is on a new line.
2- Read the updated file to make sure the outcome is correct.
# Your answers here