Thursday, October 27, 2022

Better NameError messages for Python

Python 3.11 is barely out and already the 3.12 alpha has some improvements for NameError messages. I suspect that these will be backported to 3.11 in time for the next release. 

On Ideas Python Discussion, Pamela Fox suggested that it might be useful to consider a potentially missing import when a NameError is raised. Thus, instead of having

>>> stream = io.StringIO()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'io' is not defined. Did you mean 'id'?

one would see

>>> stream = io.StringIO()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'io' is not defined. Did you mean 'id'? Or did you forget to import 'io'?

Of course, something like this was already done by friendly/friendly-traceback (aka Friendly). However, in this particular case, the output provided by Friendly contained too much information; this has since been fixed.

To no one's surprise, Pablo Galindo Salgado came up with a version of this for Python, where names found in sys.stdlib_module_names were considered and potentially added, with the result as initially suggested by Pamela Fox. Pamela then made a second suggestion to see if names of popular third-party libraries could also be considered. This, for now, appears to be out of scope for Python.
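To make the idea concrete, here is a minimal sketch of such a check. It is not CPython's actual implementation; the function name stdlib_import_hint is made up for this illustration, and the only real API used is sys.stdlib_module_names (available since Python 3.10).

```python
import sys


def stdlib_import_hint(name):
    """Return a hint if `name` is a standard library module name, else None.

    Hypothetical helper, written only to illustrate the idea.
    """
    # sys.stdlib_module_names exists on Python 3.10+; fall back to an
    # empty set on older versions so that no hint is given there.
    stdlib_names = getattr(sys, "stdlib_module_names", frozenset())
    if name in stdlib_names:
        return f"Or did you forget to import '{name}'?"
    return None


print(stdlib_import_hint("io"))             # a hint on Python 3.10+
print(stdlib_import_hint("spam_and_eggs"))  # None
```

Note that this only looks at names: it says nothing about whether the module can actually be imported on the current platform.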

This set the stage for a friendly (pun intended) competition...

I decided to revise what I had done for Friendly in such cases and found some room for improvement. First, let's look at a couple of examples (with screenshots) of the new behaviour for Python.


As we can see with the first example, Python first makes suggestions about potential typos ('io' instead of 'id') followed by the suggestion about a missing import. Note that 'id' is a builtin that is never used with a dotted attribute.
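The remark about 'id' suggests a simple filter: when the unknown name is followed by an attribute, a typo candidate is only plausible if it actually has that attribute. The sketch below illustrates this under stated assumptions. The function typo_candidates is hypothetical, and difflib.get_close_matches is used as a stand-in for whatever similarity measure Python or Friendly really use.

```python
import builtins
import difflib


def typo_candidates(name, attribute=None, namespace=None):
    """Return known names that resemble `name`.

    If `attribute` is given (as in `name.attribute`), keep only the
    candidates whose object has that attribute.
    Hypothetical helper, written only to illustrate the idea.
    """
    known = dict(vars(builtins))
    known.update(namespace or {})
    close = difflib.get_close_matches(name, list(known), n=5, cutoff=0.5)
    if attribute is None:
        return close
    return [c for c in close if hasattr(known[c], attribute)]


# Looking at the bare name 'io', the builtin 'id' is a close match ...
print("id" in typo_candidates("io"))                 # True
# ... but 'id' has no StringIO attribute, so it is filtered out here.
print(typo_candidates("io", attribute="StringIO"))   # []
```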

The second example suggests a missing import only. However, as I am using Windows, this module does not exist.

Can Friendly do better?  Note that Friendly can be used with Python 3.6+ (including Python 3.12), all of which would show the same output. I've chosen to use Python 3.10 for this example, as I will explain near the end of this post.


The message included in the Python traceback does not include the additional hint about a missing import in this case. However, Friendly adds it on its own.  Note that it does not suggest 'id' as a potential typo. But what if we had made such a typo?


Here, Friendly does make the suggestion about a potential typo.  What about the second example given above?



Friendly also starts with sys.stdlib_module_names, but then checks with importlib.util.find_spec() to see if the module can actually be located.
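A rough sketch of that two-step check follows. This is not Friendly's actual code: the helper names is_importable and import_hint and the wording of the hint are invented for this example. Only sys.stdlib_module_names, sys.modules and importlib.util.find_spec() are real APIs.

```python
import importlib.util
import sys


def is_importable(name):
    """Return True if a top-level module called `name` can be located."""
    if name in sys.modules:  # already imported somewhere in this process
        return True
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for unusual entries; treat them as not found.
        return False


def import_hint(name):
    """Suggest an import only if the module can really be found.

    Hypothetical helper, written only to illustrate the idea.
    """
    if not is_importable(name):
        return None
    stdlib_names = getattr(sys, "stdlib_module_names", frozenset())
    origin = "standard library" if name in stdlib_names else "installed"
    return f"Did you forget to import '{name}' ({origin} module)?"


print(import_hint("io"))
print(import_hint("no_such_module_xyz_12345"))  # None
```

Because the check relies on find_spec() rather than on a fixed list of names, the same function gives no hint for a standard library module that is unavailable on the current platform, and it also covers third-party modules that happen to be installed.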

It can also find potentially relevant third-party modules that are installed, but not yet imported.


Using importlib.util.find_spec() allows us to implement Pamela Fox's suggestion about suggesting third-party modules that are installed.

However, we can do even better with some dedicated code. To demonstrate this, I need to use the latest addition to the "friendly-traceback family" - which I have only tested with Python 3.10 so far.


I'll likely have more to say about friendly_pandas in the near future.

Final thoughts

For those who are excited about the improved tracebacks in Python 3.11 and PEP 657 (Fine-grained error locations in tracebacks), but who cannot yet install Python 3.11, please note that Friendly can already do something similar, if not better, with any Python version 3.6.1+.



I say "better" because, unlike Python's traceback, the information is not limited to a single line of code:




Tuesday, October 18, 2022

pandas' SettingWithCopyWarning: did I get it right?

I am just beginning to learn pandas and am looking to provide some automated help. From what I read, it appears that SettingWithCopyWarning is something that confuses many people. Is the following correct?

In [2]:
df = pd.DataFrame([[10, 20, 30], [40, 50., 60]],
                  index=list("ab"),
                  columns=list("xyz"))
In [3]:
df.loc["b"]["x"] = 99
`SettingWithCopyWarning`: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
In [4]:
# What is SettingWithCopyWarning ?
what()
Pandas occasionally emits a SettingWithCopyWarning when you use       
'chained indexing', either directly or indirectly, and you then attempt
to assign a value to the result. By 'direct chained indexing', we mean
that your code contains something like:                               

...[index_1][index_2] = ...                                           

During the first extraction using [index_1], pandas found that the    
series to be created contained values of different types. It          
automatically created a new series converting all values to a common  
type. The second indexing, [index_2] was then done on this copy instead
of the original dataframe. Thus, the assignment was not done on the
original dataframe, which caused Pandas to emit this warning.         

An 'indirect chained indexing' essentially amounts to the same problem
except that the second indexing is not done on the same line as that  
which was done to extract the first series.                           
In [5]:
# Can I get more specific information for what I just did?
why()
You used direct chained indexing of a dataframe which made a copy of  
the original content of the dataframe. If you try to assign a value to
that copy, the original dataframe will not be modified. Instead of    
doing a direct chained indexing                                       

df.loc["b"]["x"] ...                                                  

try:                                                                  

df.loc["b", "x"] ...                                                  
In [6]:
# What about if I tried to use indirect chaining. 
# There are two possibilities
series = df.loc["b"]
series["x"] = 99
`SettingWithCopyWarning`: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
In [7]:
where()
Warning issued on line 4 of code block [6].                                                         

       1| # What about if I tried to use indirect chaining.  
       2| # There are two possibilities
       3| series = df.loc["b"]
     > 4| series["x"] = 99
In [8]:
why()
I suspect that you used indirect chained indexing of a dataframe.     
First, you likely created a series using something like:              

series = df.loc[...]                                                  

This made a copy of the data contained in the dataframe. Next, you    
indexed that copy                                                     

series["x"]                                                           

This had no effect on the original dataframe. If your goal is to      
modify the value of the original dataframe, try something like the    
following instead:                                                    

df.loc[..., "x"]                                                      
In [9]:
# What if I do things in a different order
series_1 = df["x"]
series_1.loc["b"] = 99
`SettingWithCopyWarning`: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
In [10]:
where()
Warning issued on line 3 of code block [9].                                                         

       1| # What if I do things in a different order
       2| series_1 = df["x"]
     > 3| series_1.loc["b"] = 99
In [11]:
why()
I suspect that you used indirect chained indexing of a dataframe.     
First, you likely created a series using something like:              

series_1 = df[...]                                                    

This made a copy of the data contained in the dataframe. Next, you    
indexed that copy                                                     

series_1.loc["b"]                                                     

This had no effect on the original dataframe. If your goal is to      
modify the value of the original dataframe, try something like the    
following instead:                                                    

df.loc[..., "b"]                                                      
In [12]:
# What if I had multiple dataframes?
df2 = df.copy()
series = df.loc["b"]
series["x"] = 99
`SettingWithCopyWarning`: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
In [13]:
where()
Warning issued on line 4 of code block [12].                                                        

       2| df2 = df.copy()
       3| series = df.loc["b"]
     > 4| series["x"] = 99
In [14]:
why()
In your code, you have the following dataframes: {'df2', 'df'}. I do  
not know which one is causing the problem here; I will use the name   
df2 as an example.                                                    

I suspect that you used indirect chained indexing of a dataframe.     
First, you likely created a series using something like:              

series = df2.loc[...]                                                 

This made a copy of the data contained in the dataframe. Next, you    
indexed that copy                                                     

series["x"]                                                           

This had no effect on the original dataframe. If your goal is to      
modify the value of the original dataframe, try something like the    
following instead:                                                    

df2.loc[..., "x"]
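
As a sanity check on the explanations above, here is a small sketch that avoids triggering the warning itself. It assumes only that pandas is installed; the variable names row and row_copy are introduced for this illustration. It shows that extracting row "b" produces a series with a single common dtype, that writing to a copy leaves the dataframe alone, and that a single .loc call does modify it.

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30], [40, 50., 60]],
                  index=list("ab"),
                  columns=list("xyz"))

# Column "y" holds floats, so extracting row "b" has to build a new
# series with one common dtype instead of exposing the original data.
row = df.loc["b"]
print(row.dtype)          # float64

# Assigning into an explicit copy leaves the dataframe untouched.
row_copy = df.loc["b"].copy()
row_copy["x"] = 99
print(df.loc["b", "x"])   # 40

# A single .loc call with (row label, column label) writes to df itself.
df.loc["b", "x"] = 99
print(df.loc["b", "x"])   # 99
```

Note that .loc takes the row label first and the column label second, so for the series_1 = df["x"] example above, where "b" is a row label, the single-step form would be df.loc["b", "x"].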