Digital Education Resources - Vanderbilt Libraries Digital Lab
Previous lesson: Reading and writing CSV files or
optional lesson on NumPy arrays
This lesson introduces the pandas library for Python and one of the data structures that it introduces: Series. Series are one-dimensional objects built from NumPy arrays. This lesson covers the way that Series items are designated and how Series are sliced using .loc()
and .iloc()
. It also discusses the distinction between views and copies, and ways that changes to Series can be made to persist.
Learning objectives At the end of this lesson, the learner will:
.loc[]
and .iloc[]
..loc[]
and .iloc[]
.Total video time: 33m 36s
Lesson Jupyter notebook at GitHub
The conventional import statement for pandas is:
import pandas as pd
All of the examples in the rest of this lesson assume that you’ve executed this statement and assigned pandas to the abbreviation pd
.
Dictionaries are addressable by key:
states_dict = {'OH': 'Ohio', 'TN': 'Tennessee', 'AZ': 'Arizona', 'PA': 'Pennsylvania', 'AK': 'Alaska'}
print(states_dict['TN'])
Lists are addressable by integer index:
animal_list = ['lizard', 'spider', 'worm', 'bee']
print(animal_list[2])
The form of the statement to create a Series from a Python dictionary is:
series_name = pd.Series(values_dictionary)
pandas Series are similar to both lists and dictionaries since they can be indexed by either a label index (similar to a dictionary key) or an integer position index (similar to the index of a list).
states_series = pd.Series(states_dict)
print(states_series[2])
print(states_series['TN'])
NOTE: As with other Python indexing, the integer position index is zero-based (counting starts with zero). So an integer index of 2 locates the 3rd item in the Series.
To avoid ambiguity, the indexing system can be made explicit by indexing using .loc[]
for label indices and .iloc[]
for integer position indexing:
states_series.loc['TN']
states_series.iloc[2]
Do not associate the “i” in .iloc[]
with the word “index”, since “index” is used generically to describe the label index. It’s better to think of the “i” as representing “integer”.
A pandas Series is composed of a NumPy array with an associated label index object. Basing the Series data structure on a NumPy array allows the Series to be used in vectorized operations and provides better performance than with basic Python data structures like lists and dictionaries.
The .values
attribute of a Series returns the NumPy array and the .index
attribute returns the label index. Both are iterable.
In addition to specifying a particular item in a series, .loc[]
and .iloc[]
can be used to extract a part of the series (a “slice”) based on one of several mechanisms. The resulting slice is also a pandas Series, and the label indices from the original Series are maintained in the slice.
Series slices can be denoted using a range of either label indices or integer positions:
states_series.loc['TN':'PA']
states_series.iloc[1:4]
Slices denoted by integer position do not include the last numbered item, but slices denoted by labels include the last labeled item. In the examples above, the item with the index PA
will be included in the first slice but the item in position 4 will not be included in the second slice.
Slices can also be denoted by a list of labels or integers:
states_series.loc[ ['TN', 'AK', 'OH'] ]
states_series.iloc[ [1, 3, 0] ]
The order of items in the slice will be the order of the indices in the list and all listed indices will be included.
The items to be included in a slice can also be selected using a boolean Series passed into .loc[]
. The label indices of the boolean series should match the label indices of the Series to be sliced on an item by item basis. This is usually accomplished by generating the boolean Series by performing a vectorized operation that produces a boolean on the Series to be sliced.
For example, the method .startswith()
will produce a Series of booleans based on whether each corresponding string value from a source Series begins with the string passed in as an argument. The following expression generates a Series of booleans with a True
value when the string value starts with the character A
:
states_series.str.startswith('A')
If the resulting boolean Series is passed into .loc[]
, the resulting slice will include each item in the source Series for which the condition is True
.
The label indices are maintained for the values in the resulting slice.
When a slice is generated from a Series, its values are not independent from the values in the original Series. The slice is a view
of the original Series and making changes to the slice can affect the original series. (This is a general behavior of Python objects that are derived from other mutable Python objects.)
If we want the slice to be independent from the original series we can use the .copy()
method to make the slice an actual copy of the original series rather than a view of it.
states_slice_view = states_series.loc['TN':'PA']
states_slice_copy = states_series.loc['TN':'PA'].copy()
If we execute a slicing operation or some other operation that acts on a Series, such as .sort_values()
, a view is returned and will be displayed below the cell in the notebook. However, that action does not result in any permanent change, since the result is simply being displayed. This is different from the behavior of the .sort()
list method, which results in a change to the original list when it is executed.
There are three ways that we can make the change generated by the operation persist:
1. Create a copy and assign a new name to the resulting object:
sorted_series = states_series.sort_values().copy()
2. Assign the resulting object to the name of the original object. Since the new object replaces the old one, it is not necessary to use the .copy()
method.
state_series = states_series.sort_values()
3. Use an inplace=True
argument in the method. When using this strategy, nothing is returned by the method, so no assignment operator is necessary. The result of the method replaces the original object.
states_series.sort_values(inplace=True)
NOTE: as of October 2022, there is some discussion of deprecation the inplace
argument since it is considered to have some unintended pitfalls. So it is probably better to use the second strategy.
Additional notes about sorting Series:
.sort_index()
method can be used to sort the series by the label index instead of by the value.ascending=False
argument can be used to sort in descending order instead of the default ascending order. When the series items are strings, this produces reverse alphabetical order.If a Series is created from an object without assigned labels (e.g. a list or NumPy array), pandas will generate label indices from a sequence of integers starting with zero.
animal_list = ['lizard', 'spider', 'worm', 'bee']
animal_series = pd.Series(animal_list)
In the initial Series, the label index and integer (position) index will be the same for every item. However, if the Series is modified by sorting or slicing, the originally assigned integer label indices will remain associated with their values, but the integer position index will start with 0 for the first item.
Since this situation can be confusing, it is desirable to assign some kind of unique label identifier to the items when the Series is created. This can be done using an index
argument.
id_animal_series = pd.Series(animal_list, index=['a', 'b', 'c', 'd'])
The following exercises use data queried from Wikidata. To acquire the data, run the first cell in the example notebook:
import requests
def get_wikidata():
query_string = '''select distinct ?qid ?label where {
?qid wdt:P195 wd:Q18563658. # must be in the collection Vanderbilt University Fine Arts Gallery
?qid wdt:P18 ?image. # item must be depicted in Wikimedia Commons
?qid rdfs:label ?label. # get the label
filter(lang(?label)='en') # filter the labels to include only English
}'''
response = requests.get('https://query.wikidata.org/sparql', params={'query' : query_string}, headers={'Accept': 'application/sparql-results+json'})
data = response.json()
results = data['results']['bindings']
urls = []
labels = []
for result in results:
urls.append(result['qid']['value'])
labels.append(result['label']['value'])
return urls, labels
urls, labels = get_wikidata()
The result is a list of URLs and labels for artworks, which can be used to generate a series using the code in the second cell in the example notebook:
import pandas as pd
labels_series = pd.Series(labels, index=urls)
labels_series.head()
NOTE: In the notebook environment, the URLs in the output should be hyperlinked. So you should be able to click on them to see what the artworks look like.
1. Determine how many items are in the series using the len()
function. Display the first 10 and the last 10 items in the Series.
2. Use .loc[]
to display the item that has the label index http://www.wikidata.org/entity/Q105096111
.
3. Use .iloc[]
to slice the Series to include the 4th through 6th items in the series (should include 3 results). Pay attention to the fact that counting starts with zero and that the list integer index you specify isn’t included in the range.
4. Slice the Series by boolean condition to include only labels that contain the word jar
. The method for doing this is similar to the .startswith()
method used in the examples: .contains()
. You need to apply it to the .str
attribute, like this: .str.contains('jar')
.
5. Make the results case insensitive by passing in a case=False
argument into the .contains()
method. How many additional results does this produce?
6. Create a copy of the slice and assign it the name jars
. Sort the Series of jars by index and by value.
7. Create a script that allows the user to enter the string they would like to search for within the labels. Try your script with soba
and virgin
.
Next lesson: DataFrame manipulation
Revised 2022-11-01
Questions? Contact us
License: CC BY 4.0.
Credit: "Vanderbilt Libraries Digital Lab - www.library.vanderbilt.edu"