{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "# Gestionando la ausencia de datos\n", "\n", "Ocurre con frecuencia que disponemos de catálogos de datos donde hay muestras incompletas.\n", "Por ejemplo, los datos obtenidos a partir de encuestas donde se registran preguntas sin responder o sensores que no proporcionan ningún valor viable, etc.\n", "\n", "**Hay que aceptarlo y saber gestionarlo**\n", "\n", "```Pandas``` asigna el valor o el código NaN (Not a Number) a los valores desconocidos. Más especificamente, los objetos son designados como: None y las fechas como NaT.\n", "\n", "Las operaciones que involucren este tipo de datos internamente han de manejar los correspondientes códigos: NaN, None o NaT. ¿Cómo afecta un NaN a una media aritmética?\n", "\n", "En este capítulo trabajaremos con esta típología de valores.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 1. 2. nan 4. 5. 6. nan 8.]\n", "nan\n" ] } ], "source": [ "import numpy as np\n", "\n", "datos = np.array([1,2,np.nan,4,5,6,np.nan,8])\n", "print(datos)\n", "\n", "print(datos.mean())\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Volvemos a trabajor con el conjunto de datos ``data/who.csv``." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"data/who.csv\")\n", "df.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Como ya sabéis através de la API se puede obtener una descripción más detallada de las posibilidades de cada método de Python, y en especial de los métodos de Pandas. Para cargar un fichero de tamaño elevado es recomendable cargar aquellos atributos que nos interesen desde un principio usando el argumento: ``usecols``\n", "\n", "[Enlace a documentación](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Country Urban_population_growth\n", "0 Afghanistan 5.44\n", "1 Albania 2.21\n", "2 Algeria 2.61\n", "3 Andorra NaN\n", "4 Angola 4.14\n" ] } ], "source": [ "df = pd.read_csv(\"data/who.csv\", usecols=[\"Country\",\"Urban_population_growth\"])\n", "print(df[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ¿Que valor corresponde a un NA del `DataFrame`? La función `isna()`\n", "\n", "La función `isna()` de pandas se utiliza para detectar valores ausentes en un DataFrame o Series.\n", "Devuelve un objeto del mismo tamaño, donde cada posición contiene `True` si el valor es `NaN`,\n", "`None` o está faltante, y `False` en caso contrario.\n", "\n", "Es especialmente útil para:\n", "- Contar valores ausentes (`df.isna().sum()`).\n", "- Filtrar filas con datos faltantes (`df[df['columna'].isna()]`).\n", "- Explorar la presencia de valores nulos antes de aplicar métodos de imputación.\n", "\n", "Mas información en la documentación: [Enlace](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html\n", ")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Country 0\n", "Urban_population_growth 14\n", "dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isna().sum()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['Urban_population_growth'], dtype='object')\n", "------------------------------\n", "Country False\n", "Urban_population_growth False\n", "dtype: bool\n" ] } ], "source": [ "#¿Qué columnas tienen datos sin valor: NaN, NaT, None?\n", "# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html\n", "\n", "print(df.columns[df.isna().any()])\n", "\n", "# Equivale a preguntar si ¿existe algún valor positivo dentro de esas series?\n", "print(\"-\"*30)\n", "print(df.any().isna())" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Country Urban_population_growth\n", "0 False False\n", "1 False False\n", "2 False False\n", "3 False True\n", "4 False False\n" ] } ], "source": [ "#No dudéis en ejecutar \"partes\" (dividamos la instrucción para comprenderla)\n", "print(df.isna()[:5])" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Country 202\n", "Urban_population_growth 188\n", "dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#¿Cuántas muestras son correctas? \n", "df.notna().sum()\n", "# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.notna.html\n", "# y de cuantas muestras?\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Country 202\n", "Urban_population_growth 188\n", "dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.notnull().sum() #ambas funcionas son equivalentes en Pandas, no en numpy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tratando la ausencia de datos\n", "- Ignorando: \"Hay X muestras válidas de tantas\"\n", "- Rellenando: reemplazar muestras desconocidas por otros valores: media, valor neutro, etc." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Country Urban_population_growth\n", "0 Afghanistan 5.44\n", "1 Albania 2.21\n", "2 Algeria 2.61\n", "3 Andorra 0.00\n", "4 Angola 4.14\n" ] } ], "source": [ "#La manera más optima de remplazar estos valores es con la función: fillna\n", "print(df.fillna(0)[:5])\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Si queremos que nuestra variable de dataframe contenga dichas asignaciones recordad asignar la operación a la variable pertinente o a una nueva\n", "df = df.fillna(0) " ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Maneras de rellenar una serie con datos NA\n", "\n", "Cuando los dataframes contienen números la operabildad con valores perdidos puede gestionarse de manera más eficiente. Pongamos un ejemplo:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " one two three\n", "a 0.883893 0.195865 0.357537\n", "b -2.343262 -1.084833 0.559696\n", "c 0.939469 -0.978481 0.503097\n", "d 0.406414 0.323461 -0.493411\n", "e -0.792017 -0.842368 -1.279503\n" ] } ], "source": [ "import numpy as np\n", "\n", "np.random.seed(20)\n", "\n", "#Creamos un dataframe \n", "df = pd.DataFrame(np.random.randn(5, 3), \n", " index=['a', 'b', 'c', 'd', 'e'],\n", " columns=['one', 'two', 'three'])\n", "print(df)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " one two three\n", "a 0.883893 0.195865 0.357537\n", "b -2.343262 NaN 0.559696\n", "c 0.939469 NaN 0.503097\n", "d 0.406414 0.323461 -0.493411\n", "e -0.792017 NaN -1.279503\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/6j/7gfvt_29797dypw8t1wttblw0000gn/T/ipykernel_66062/13172693.py:2: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!\n", "You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.\n", "A typical example is when you are setting values in a column of a DataFrame, like:\n", "\n", "df[\"col\"][row_indexer] = value\n", "\n", "Use `df.loc[row_indexer, \"col\"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", "\n", " df.two[df.two<0]=np.nan\n" ] } ], "source": [ "#Creamos valores NaN para testear \n", "df.two[df.two<0]=np.nan\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "Podemos usar ```fillna``` para rellenar de diversas maneras la serie o series. Por ejemplo, usando una operación de agregación como la media" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " one two three\n", "a 0.883893 0.195865 0.357537\n", "b -2.343262 NaN 0.559696\n", "c 0.939469 NaN 0.503097\n", "d 0.406414 0.323461 -0.493411\n", "e -0.792017 NaN -1.279503\n", "---------------------------------\n", " one two three\n", "a 0.883893 0.195865 0.357537\n", "b -2.343262 0.259663 0.559696\n", "c 0.939469 0.259663 0.503097\n", "d 0.406414 0.323461 -0.493411\n", "e -0.792017 0.259663 -1.279503\n" ] } ], "source": [ "print(df)\n", "print(\"-\"*33)\n", "print(df.fillna(df.mean()))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " one two three\n", "a 0.883893 0.195865 0.357537\n", "b -2.343262 HOLA 0.559696\n", "c 0.939469 HOLA 0.503097\n", "d 0.406414 0.323461 -0.493411\n", "e -0.792017 HOLA -1.279503\n", "---------------------------------\n", " one two three\n", "a 0.883893 0.195865 0.357537\n", "b -2.343262 0.883893 0.559696\n", "c 0.939469 0.883893 0.503097\n", "d 0.406414 0.323461 -0.493411\n", "e -0.792017 0.883893 -1.279503\n" ] } ], "source": [ "#Con un valor en concreto del propio dataframe\n", "print(df.fillna(\"HOLA\"))\n", "print(\"-\"*33)\n", "print(df.fillna(df.loc[\"a\", [\"one\"]].values[0]))" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "#### Podemos rellenar con datos interpolados\n", "\n", "En la documentación vemos una serie de ejemplos: [Interpolate](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.interpolate.html)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " one two three\n", "a 0.883893 0.195865 0.357537\n", "b -2.343262 NaN 0.559696\n", "c 0.939469 NaN 0.503097\n", "d 0.406414 0.323461 -0.493411\n", "e -0.792017 NaN -1.279503\n", "-----------------------------------\n", " one two three\n", "a 0.883893 0.195865 0.357537\n", "b -2.343262 0.238397 0.559696\n", "c 0.939469 0.280929 0.503097\n", "d 0.406414 0.323461 -0.493411\n", "e -0.792017 0.323461 -1.279503\n" ] } ], "source": [ "print(df)\n", "print(\"-\"*35)\n", "print(df.interpolate())" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " one two three\n", "a 0.883893 0.195865 0.357537\n", "b -2.343262 -0.891783 0.559696\n", "c 0.939469 0.721283 0.503097\n", "d 0.406414 0.323461 -0.493411\n", "e -0.792017 -1.035760 -1.279503\n", "----------------------------------------------------------------------\n", "-0.8917828081181468\n" ] } ], "source": [ "print(df.interpolate(axis=1)) # Tomemos como referencia el valor NA de (b,\"two\")\n", "print(\"--\"*35)\n", "print(df.mean(axis=1).b)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 0.195865\n", "1 0.195865\n", "2 0.195865\n", "3 0.323461\n", "4 0.323461\n", "Name: two, dtype: float64\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/6j/7gfvt_29797dypw8t1wttblw0000gn/T/ipykernel_66062/2480900594.py:3: FutureWarning: Series.interpolate with method=pad is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.\n", " print(df.two.interpolate(method=\"pad\"))\n" ] } ], "source": [ "# Para usar otro tipo de interpolaciones es recomendable tener un índice numérico por cuestiones de frecuencia en el método de interpolación\n", "df.index = range(len(df))\n", "print(df.two.interpolate(method=\"pad\"))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 0.195865\n", "1 0.195865\n", "2 0.323461\n", "3 0.323461\n", "4 NaN\n", "Name: two, dtype: float64\n" ] } ], "source": [ "print(df.two.interpolate(method=\"nearest\"))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Valores interpolados:3\n" ] } ], "source": [ "print(\"Valores interpolados:\" + str(df.two.interpolate().count()-df.two.count()))" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Eliminación de valores NA\n", "\n", "Existen operaciones para la eliminación de valores NA" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " one two three\n", "0 0.883893 0.195865 0.357537\n", "1 -2.343262 NaN 0.559696\n", "2 0.939469 NaN 0.503097\n", "3 0.406414 0.323461 -0.493411\n", "4 -0.792017 NaN -1.279503\n", "-----------------------------------\n", " one two three\n", "0 0.883893 0.195865 0.357537\n", "3 0.406414 0.323461 -0.493411\n" ] } ], "source": [ "print(df)\n", "print(\"-\"*35)\n", "print(df.dropna())\n", "\n", "# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "O bien, podemos borrar cambiando el eje ``axis=0`` o 1" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
onethree
00.8838930.357537
1-2.3432620.559696
20.9394690.503097
30.406414-0.493411
4-0.792017-1.279503
\n", "
" ], "text/plain": [ " one three\n", "0 0.883893 0.357537\n", "1 -2.343262 0.559696\n", "2 0.939469 0.503097\n", "3 0.406414 -0.493411\n", "4 -0.792017 -1.279503" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna(axis=1)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Ejercicios" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "**1) Del fichero who.csv, contabiliza cuántos paises tienen algun valor NaN.**" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": true, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CountryCountryIDContinentAdolescent fertility rate (%)Adult literacy rate (%)Gross national income per capita (PPP international $)Net primary school enrolment ratio female (%)Net primary school enrolment ratio male (%)Population (in thousands) totalPopulation annual growth rate (%)...Total_CO2_emissionsTotal_incomeTotal_reservesTrade_balance_goods_and_servicesUnder_five_mortality_from_CMEUnder_five_mortality_from_IHMEUnder_five_mortality_rateUrban_populationUrban_population_growthUrban_population_pct_of_total
0FalseFalseFalseFalseFalseTrueTrueTrueFalseFalse...FalseTrueTrueTrueFalseFalseFalseFalseFalseFalse
1FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
2FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
3FalseFalseFalseTrueTrueTrueFalseFalseFalseFalse...TrueTrueTrueTrueTrueTrueTrueTrueTrueTrue
4FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
..................................................................
197FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
198FalseFalseFalseTrueTrueTrueTrueTrueTrueTrue...FalseFalseTrueTrueFalseFalseFalseFalseFalseFalse
199FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
200FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
201FalseFalseFalseFalseFalseTrueFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
\n", "

202 rows × 358 columns

\n", "
" ], "text/plain": [ " Country CountryID Continent Adolescent fertility rate (%) \\\n", "0 False False False False \n", "1 False False False False \n", "2 False False False False \n", "3 False False False True \n", "4 False False False False \n", ".. ... ... ... ... \n", "197 False False False False \n", "198 False False False True \n", "199 False False False False \n", "200 False False False False \n", "201 False False False False \n", "\n", " Adult literacy rate (%) \\\n", "0 False \n", "1 False \n", "2 False \n", "3 True \n", "4 False \n", ".. ... \n", "197 False \n", "198 True \n", "199 False \n", "200 False \n", "201 False \n", "\n", " Gross national income per capita (PPP international $) \\\n", "0 True \n", "1 False \n", "2 False \n", "3 True \n", "4 False \n", ".. ... \n", "197 False \n", "198 True \n", "199 False \n", "200 False \n", "201 True \n", "\n", " Net primary school enrolment ratio female (%) \\\n", "0 True \n", "1 False \n", "2 False \n", "3 False \n", "4 False \n", ".. ... \n", "197 False \n", "198 True \n", "199 False \n", "200 False \n", "201 False \n", "\n", " Net primary school enrolment ratio male (%) \\\n", "0 True \n", "1 False \n", "2 False \n", "3 False \n", "4 False \n", ".. ... \n", "197 False \n", "198 True \n", "199 False \n", "200 False \n", "201 False \n", "\n", " Population (in thousands) total Population annual growth rate (%) ... \\\n", "0 False False ... \n", "1 False False ... \n", "2 False False ... \n", "3 False False ... \n", "4 False False ... \n", ".. ... ... ... \n", "197 False False ... \n", "198 True True ... \n", "199 False False ... \n", "200 False False ... \n", "201 False False ... \n", "\n", " Total_CO2_emissions Total_income Total_reserves \\\n", "0 False True True \n", "1 False False False \n", "2 False False False \n", "3 True True True \n", "4 False False False \n", ".. ... ... ... \n", "197 False False False \n", "198 False False True \n", "199 False False False \n", "200 False False False \n", "201 False False False \n", "\n", " Trade_balance_goods_and_services Under_five_mortality_from_CME \\\n", "0 True False \n", "1 False False \n", "2 False False \n", "3 True True \n", "4 False False \n", ".. ... ... \n", "197 False False \n", "198 True False \n", "199 False False \n", "200 False False \n", "201 False False \n", "\n", " Under_five_mortality_from_IHME Under_five_mortality_rate \\\n", "0 False False \n", "1 False False \n", "2 False False \n", "3 True True \n", "4 False False \n", ".. ... ... \n", "197 False False \n", "198 False False \n", "199 False False \n", "200 False False \n", "201 False False \n", "\n", " Urban_population Urban_population_growth Urban_population_pct_of_total \n", "0 False False False \n", "1 False False False \n", "2 False False False \n", "3 True True True \n", "4 False False False \n", ".. ... ... ... \n", "197 False False False \n", "198 False False False \n", "199 False False False \n", "200 False False False \n", "201 False False False \n", "\n", "[202 rows x 358 columns]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"data/who.csv\")\n", "df.isna()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "**1b) Ordena el anterior resultado para identificar cuál es el pais con mayor número de campos desconocidos.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "-" } }, "source": [ "**2) who.csv, Selecciona la primera, tercera y decima columna, de las filas comprendidas entre la 100 y la 150.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "**2b) ¿Cuántos valores NaN hay presentes?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "**2c) Crea un nuevo dataframe donde los NaN sean cero.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "**2d) Elimina aquellas filas de la anterior selección donde haya NaN.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Series Temporales\n", "Las series temporales son muestras de valores tomadas a lo largo del tiempo con un muestreo generalmente equidistante. Por ejemplo, información económica, demográfica, meteorológica; registros de seguridad, actividad, etc.\n", "\n", "La biblioteca Pandas gestiona las series temporales utilizando el índice: una fecha (`datetime`):\n", "https://docs.python.org/es/3/library/datetime.html\n", "\n", "\n", "El índice de un _dataframe_ es el pilar básico de acceso a los valores, por lo que su uso simplifica procesos de filtrado, selección, interpolación, etc.\n", "\n", "Enlace a la documentación: [TimeSeries](https://pandas.pydata.org/docs/user_guide/timeseries.html)" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "collapsed": true, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " date temperaturemin temperaturemax precipitation snowfall \\\n", "0 2015-04-08 62.1 84.0 0.00 0.0 \n", "1 2015-04-20 63.0 78.1 0.28 0.0 \n", "2 2015-04-26 45.0 54.0 0.02 0.0 \n", "3 2015-04-28 39.0 69.1 0.00 0.0 \n", "4 2015-05-03 46.9 79.0 0.00 0.0 \n", "\n", " snowdepth avgwindspeed fastest2minwinddir fastest2minwindspeed \\\n", "0 0.0 5.82 40.0 29.97 \n", "1 0.0 11.86 180.0 21.92 \n", "2 0.0 5.82 50.0 12.97 \n", "3 0.0 2.68 40.0 12.08 \n", "4 0.0 2.68 200.0 12.08 \n", "\n", " fastest5secwinddir ... drizzle snow freezingrain smokehaze thunder \\\n", "0 30.0 ... No No No Yes No \n", "1 170.0 ... No No No No Yes \n", "2 40.0 ... No No No No No \n", "3 40.0 ... No No No No No \n", "4 210.0 ... No No No No No \n", "\n", " highwind hail blowingsnow dust freezingfog \n", "0 No No No No No \n", "1 No No No No No \n", "2 No No No No No \n", "3 No No No No No \n", "4 No No No No No \n", "\n", "[5 rows x 28 columns]\n" ] } ], "source": [ "import pandas as pd\n", "df = pd.read_csv(\"data/rdu-weather-history.csv\",sep=\";\") \n", "#Qué contiene el fichero\n", "print(df.head())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(df.date[0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(df.date.sort_values())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nosotros solo cubriremos los aspectos básicos de estos tipos de datos; lo que queremos es poder responder preguntas similares a las siguientes:\n", "- ¿Cómo podría obtener la temperatura media de un año?\n", "- ¿Cómo podría obtener la temperatura más alta de todos los meses de julio?\n", "\n", "En primer lugar, se ha de transformar el índice en una Fecha:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pandas import DatetimeIndex\n", "\n", "import pandas as pd\n", "df = pd.read_csv(\"data/rdu-weather-history.csv\",sep=\";\") \n", "\n", "df.index = DatetimeIndex(df[\"date\"])\n", "df.sort_index(inplace=True)\n", "df.head()\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df.drop(columns=\"date\")\n", "# df.drop(columns=\"date\",inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Indexing and Slicing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.index.day" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.index.weekday" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "business_days = df[df.index.weekday < 5] # 0:4 Monday to Friday\n", "business_days[[\"temperaturemax\"]].head(8)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[\"2014\"].index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[\"2014-01-03\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[\"2014-01\":\"2016-06\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[\"2018-01-01\":\"2018-01-15\", [\"temperaturemin\", \"temperaturemax\"]].head()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Date range: {df.index.min()} to {df.index.max()}\")\n", "print(f\"Total days: {(df.index.max() - df.index.min()).days}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "january_data = df[df.index.month == 1]\n", "january_data[[\"temperaturemin\", \"temperaturemax\"]].head()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Aggregations\n", "df.loc[\"2015\"].temperaturemin.mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Conditional operatives\n", "df.loc[\"2015\"].temperaturemin.min() > df.loc[\"2016\"].temperaturemin.min() # returns a numpy scalar\n", "\n", "#bool(df.loc[\"2015\"].temperaturemin.min() > df.loc[\"2016\"].temperaturemin.min())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resampling \n", "\n", "Doc https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html\n", "\n", "Offset strings:\n", "https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects\n", "\n", "\n", "### Reducing frequency\n", "\n" ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " temperaturemin temperaturemax\n", "date \n", "2007-01-31 34.922581 55.535484\n", "2007-02-28 30.067857 51.532143\n", "2007-03-31 42.161290 69.541935\n", "2007-04-30 47.593333 72.243333\n", "2007-05-31 56.041935 80.287097\n", " temperaturemax precipitation\n", " mean max min sum\n", "date \n", "2007-01-31 55.535484 73.9 35.1 3.12\n", "2007-02-28 51.532143 73.9 34.0 1.74\n", "2007-03-31 69.541935 87.1 48.9 3.52\n", "2007-04-30 72.243333 87.1 48.0 3.88\n", "2007-05-31 80.287097 91.9 57.0 1.43\n" ] } ], "source": [ "# Daily data to Monthly \n", "monthly_avg = df[[\"temperaturemin\", \"temperaturemax\"]].resample(\"ME\").mean()\n", "\n", "print(monthly_avg.head())\n", "#\n", "weekly_avg = df[[\"temperaturemin\", \"temperaturemax\"]].resample(\"W\").mean()\n", "quarterly_avg = df[[\"temperaturemin\", \"temperaturemax\"]].resample(\"QE\").mean()\n", "monthly_stats = df[[\"temperaturemax\", \"precipitation\"]].resample(\"ME\").agg({\n", " \"temperaturemax\": [\"mean\", \"max\", \"min\"],\n", " \"precipitation\": \"sum\"\n", "})\n", "print(monthly_stats.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Increasing frequency" ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
temperaturemaxprecipitation
date
2018-01-0128.20.00
2018-01-0232.00.00
2018-01-0332.00.06
2018-01-0430.20.00
2018-01-0531.10.00
\n", "
" ], "text/plain": [ " temperaturemax precipitation\n", "date \n", "2018-01-01 28.2 0.00\n", "2018-01-02 32.0 0.00\n", "2018-01-03 32.0 0.06\n", "2018-01-04 30.2 0.00\n", "2018-01-05 31.1 0.00" ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_df = df.loc[\"2018-01-01\":\"2018-01-31\", [\"temperaturemax\", \"precipitation\"]].copy()\n", "sample_df.head()" ] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " temperaturemax precipitation\n", "date \n", "2018-01-01 00:00:00 28.2 0.0\n", "2018-01-01 01:00:00 28.2 0.0\n", "2018-01-01 02:00:00 28.2 0.0\n", "2018-01-01 03:00:00 28.2 0.0\n", "2018-01-01 04:00:00 28.2 0.0\n", "2018-01-01 05:00:00 28.2 0.0\n", "2018-01-01 06:00:00 28.2 0.0\n", "2018-01-01 07:00:00 28.2 0.0\n", "2018-01-01 08:00:00 28.2 0.0\n", "2018-01-01 09:00:00 28.2 0.0\n", "2018-01-01 10:00:00 28.2 0.0\n", "2018-01-01 11:00:00 28.2 0.0\n", "2018-01-01 12:00:00 28.2 0.0\n", "2018-01-01 13:00:00 28.2 0.0\n", "2018-01-01 14:00:00 28.2 0.0\n", "2018-01-01 15:00:00 28.2 0.0\n", "2018-01-01 16:00:00 28.2 0.0\n", "2018-01-01 17:00:00 28.2 0.0\n", "2018-01-01 18:00:00 28.2 0.0\n", "2018-01-01 19:00:00 28.2 0.0\n", "2018-01-01 20:00:00 28.2 0.0\n", "2018-01-01 21:00:00 28.2 0.0\n", "2018-01-01 22:00:00 28.2 0.0\n", "2018-01-01 23:00:00 28.2 0.0\n", "2018-01-02 00:00:00 32.0 0.0\n", "2018-01-02 01:00:00 32.0 0.0\n", "2018-01-02 02:00:00 32.0 0.0\n" ] } ], "source": [ "hourly_ffill = sample_df.resample(\"h\").ffill()\n", "print(hourly_ffill.head(27))" ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " temperaturemax precipitation\n", "date \n", "2018-01-01 00:00:00 28.2 0.0\n", "2018-01-01 01:00:00 32.0 0.0\n", "2018-01-01 02:00:00 32.0 0.0\n", "2018-01-01 03:00:00 32.0 0.0\n", "2018-01-01 04:00:00 32.0 0.0\n", "2018-01-01 05:00:00 32.0 0.0\n", "2018-01-01 06:00:00 32.0 0.0\n", "2018-01-01 07:00:00 32.0 0.0\n", "2018-01-01 08:00:00 32.0 0.0\n", "2018-01-01 09:00:00 32.0 0.0\n", "2018-01-01 10:00:00 32.0 0.0\n", "2018-01-01 11:00:00 32.0 0.0\n", "2018-01-01 12:00:00 32.0 0.0\n", "2018-01-01 13:00:00 32.0 0.0\n", "2018-01-01 14:00:00 32.0 0.0\n", "2018-01-01 15:00:00 32.0 0.0\n", "2018-01-01 16:00:00 32.0 0.0\n", "2018-01-01 17:00:00 32.0 0.0\n", "2018-01-01 18:00:00 32.0 0.0\n", "2018-01-01 19:00:00 32.0 0.0\n", "2018-01-01 20:00:00 32.0 0.0\n", "2018-01-01 21:00:00 32.0 0.0\n", "2018-01-01 22:00:00 32.0 0.0\n", "2018-01-01 23:00:00 32.0 0.0\n" ] } ], "source": [ "\n", "hourly_bfill = sample_df.resample(\"h\").bfill()\n", "print(hourly_bfill.head(24))" ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " temperaturemax precipitation\n", "date \n", "2018-01-01 00:00:00 28.200000 0.0\n", "2018-01-01 01:00:00 28.358333 0.0\n", "2018-01-01 02:00:00 28.516667 0.0\n", "2018-01-01 03:00:00 28.675000 0.0\n", "2018-01-01 04:00:00 28.833333 0.0\n", "2018-01-01 05:00:00 28.991667 0.0\n", "2018-01-01 06:00:00 29.150000 0.0\n", "2018-01-01 07:00:00 29.308333 0.0\n", "2018-01-01 08:00:00 29.466667 0.0\n", "2018-01-01 09:00:00 29.625000 0.0\n", "2018-01-01 10:00:00 29.783333 0.0\n", "2018-01-01 11:00:00 29.941667 0.0\n", "2018-01-01 12:00:00 30.100000 0.0\n", "2018-01-01 13:00:00 30.258333 0.0\n", "2018-01-01 14:00:00 30.416667 0.0\n", "2018-01-01 15:00:00 30.575000 0.0\n", "2018-01-01 16:00:00 30.733333 0.0\n", "2018-01-01 17:00:00 30.891667 0.0\n", "2018-01-01 18:00:00 31.050000 0.0\n", "2018-01-01 19:00:00 31.208333 0.0\n", "2018-01-01 20:00:00 31.366667 0.0\n", "2018-01-01 21:00:00 31.525000 0.0\n", "2018-01-01 22:00:00 31.683333 0.0\n", "2018-01-01 23:00:00 31.841667 0.0\n" ] } ], "source": [ "hourly_interp = sample_df.resample(\"h\").interpolate(method=\"linear\")\n", "print(hourly_interp.head(24))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Moving averages" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " temperaturemax temp_max_7d_avg\n", "date \n", "2007-01-01 68.0 NaN\n", "2007-01-02 55.9 NaN\n", "2007-01-03 62.1 NaN\n", "2007-01-04 69.1 NaN\n", "2007-01-05 72.0 NaN\n", "2007-01-06 71.1 NaN\n", "2007-01-07 66.0 66.314286\n", "2007-01-08 66.9 66.157143\n", "2007-01-09 55.0 66.028571\n", "2007-01-10 44.1 63.457143\n" ] } ], "source": [ "df[\"temp_max_7d_avg\"] = df[\"temperaturemax\"].rolling(window=7).mean()\n", "print(df[[\"temperaturemax\", \"temp_max_7d_avg\"]].head(10))" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " mean std min max\n", "date \n", "2007-01-01 NaN NaN NaN NaN\n", "2007-01-02 NaN NaN NaN NaN\n", "2007-01-03 NaN NaN NaN NaN\n", "2007-01-04 NaN NaN NaN NaN\n", "2007-01-05 NaN NaN NaN NaN\n", "2007-01-06 NaN NaN NaN NaN\n", "2007-01-07 66.314286 5.659042 55.9 72.0\n", "2007-01-08 66.157143 5.619566 55.9 72.0\n", "2007-01-09 66.028571 5.896811 55.0 72.0\n", "2007-01-10 63.457143 10.228857 44.1 72.0\n" ] } ], "source": [ "rolling_stats = df[\"temperaturemax\"].rolling(window=7).agg({\n", " \"mean\": \"mean\",\n", " \"std\": \"std\",\n", " \"min\": \"min\",\n", " \"max\": \"max\"\n", "})\n", "print(rolling_stats.head(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Shifting and lagging" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
temperaturemaxtemp_max_lag1
date
2007-01-0168.0NaN
2007-01-0255.968.0
2007-01-0362.155.9
2007-01-0469.162.1
2007-01-0572.069.1
2007-01-0671.172.0
2007-01-0766.071.1
2007-01-0866.966.0
2007-01-0955.066.9
2007-01-1044.155.0
\n", "
" ], "text/plain": [ " temperaturemax temp_max_lag1\n", "date \n", "2007-01-01 68.0 NaN\n", "2007-01-02 55.9 68.0\n", "2007-01-03 62.1 55.9\n", "2007-01-04 69.1 62.1\n", "2007-01-05 72.0 69.1\n", "2007-01-06 71.1 72.0\n", "2007-01-07 66.0 71.1\n", "2007-01-08 66.9 66.0\n", "2007-01-09 55.0 66.9\n", "2007-01-10 44.1 55.0" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"temp_max_lag1\"] = df[\"temperaturemax\"].shift(1)\n", "df[[\"temperaturemax\", \"temp_max_lag1\"]].head(10)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
temperaturemaxtemp_max_lag1
date
2007-01-0168.055.9
2007-01-0255.962.1
2007-01-0362.169.1
2007-01-0469.172.0
2007-01-0572.071.1
2007-01-0671.166.0
2007-01-0766.066.9
2007-01-0866.955.0
2007-01-0955.044.1
2007-01-1044.151.1
\n", "
" ], "text/plain": [ " temperaturemax temp_max_lag1\n", "date \n", "2007-01-01 68.0 55.9\n", "2007-01-02 55.9 62.1\n", "2007-01-03 62.1 69.1\n", "2007-01-04 69.1 72.0\n", "2007-01-05 72.0 71.1\n", "2007-01-06 71.1 66.0\n", "2007-01-07 66.0 66.9\n", "2007-01-08 66.9 55.0\n", "2007-01-09 55.0 44.1\n", "2007-01-10 44.1 51.1" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"temp_max_lag1\"] = df[\"temperaturemax\"].shift(-1)\n", "df[[\"temperaturemax\", \"temp_max_lag1\"]].head(10)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " temperaturemax temp_max_diff\n", "date \n", "2007-01-01 68.0 NaN\n", "2007-01-02 55.9 -12.1\n", "2007-01-03 62.1 6.2\n", "2007-01-04 69.1 7.0\n", "2007-01-05 72.0 2.9\n", "2007-01-06 71.1 -0.9\n", "2007-01-07 66.0 -5.1\n", "2007-01-08 66.9 0.9\n", "2007-01-09 55.0 -11.9\n", "2007-01-10 44.1 -10.9\n" ] } ], "source": [ "df[\"temp_max_diff\"] = df[\"temperaturemax\"].diff()\n", "print(df[[\"temperaturemax\", \"temp_max_diff\"]].head(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data ranges\n" ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]" ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(range(0,10))" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 2, 4, 6, 8]\n" ] } ], "source": [ "print(list(range(0,10,2)))" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DatetimeIndex(['2018-01-07', '2018-01-14', '2018-01-21', '2018-01-28',\n", " '2018-02-04', '2018-02-11', '2018-02-18', '2018-02-25'],\n", " dtype='datetime64[ns]', freq='W-SUN')" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.date_range('01/01/2018',periods=8,freq='W')" ] }, { "cell_type": "code", "execution_count": 131, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DatetimeIndex(['2007-01-01', '2007-01-02', '2007-01-03', '2007-01-04',\n", " '2007-01-05', '2007-01-06', '2007-01-07', '2007-01-08',\n", " '2007-01-09', '2007-01-10',\n", " ...\n", " '2019-06-14', '2019-06-15', '2019-06-16', '2019-06-17',\n", " '2019-06-18', '2019-06-19', '2019-06-20', '2019-06-21',\n", " '2019-06-22', '2019-06-23'],\n", " dtype='datetime64[ns]', length=4557, freq='D')" ] }, "execution_count": 131, "metadata": {}, "output_type": "execute_result" } ], "source": [ "date_range = pd.date_range(start=df.index.min(), end=df.index.max(), freq=\"D\")\n", "date_range" ] }, { "cell_type": "code", "execution_count": 132, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Total possible days: 4557\n", "Days with data: 4557\n", "Missing days: 0\n" ] } ], "source": [ "missing_dates = date_range.difference(df.index)\n", "print(f\"\\nTotal possible days: {len(date_range)}\")\n", "print(f\"Days with data: {len(df)}\")\n", "print(f\"Missing days: {len(missing_dates)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Actividades" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A. Crea una nueva columna de estación: Invierno, Primavera, Verano, Otoño" ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Winter\n", "Spring\n", " season\n", "date \n", "2007-01-01 Winter\n", "2007-01-02 Winter\n", "2007-01-03 Winter\n", "2007-01-04 Winter\n", "2007-01-05 Winter\n", "2007-01-06 Winter\n", "2007-01-07 Winter\n", "2007-01-08 Winter\n", "2007-01-09 Winter\n", "2007-01-10 Winter\n" ] } ], "source": [ "fecha = df.index[0]\n", "#20 DIC and 21 MAR -> Winter\n", "def get_season(fecha):\n", " if fecha.month >= 12 or fecha.month <= 3:\n", " return \"Winter\"\n", " elif fecha.month >= 3 and fecha.month <= 6:\n", " return \"Spring\"\n", " elif fecha.month >= 6 and fecha.month <= 9:\n", " return \"Summer\"\n", " else:\n", " return \"Autumn\"\n", "\n", "print(get_season(df.index[0]))\n", "print(get_season(df.index[100]))\n", "\n", "df[\"season\"] = df.index.map(get_season)\n", "print(df[[\"season\"]].head(10))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "B. ¿Qué estadísticos tenemos por estación?\n", "- temperaturemax: mean, max, min, std\n", "- temperaturemin: mean, max, min\n", "- precipitacion: sum" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "C. ¿Cuántas veces ha nevado por año (`snowfall`)?" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datetemperaturemintemperaturemaxprecipitationsnowfallsnowdepthavgwindspeedfastest2minwinddirfastest2minwindspeedfastest5secwinddir...drizzlesnowfreezingrainsmokehazethunderhighwindhailblowingsnowdustfreezingfog
02015-04-0862.184.00.000.00.05.8240.029.9730.0...NoNoNoYesNoNoNoNoNoNo
12015-04-2063.078.10.280.00.011.86180.021.92170.0...NoNoNoNoYesNoNoNoNoNo
22015-04-2645.054.00.020.00.05.8250.012.9740.0...NoNoNoNoNoNoNoNoNoNo
32015-04-2839.069.10.000.00.02.6840.012.0840.0...NoNoNoNoNoNoNoNoNoNo
42015-05-0346.979.00.000.00.02.68200.012.08210.0...NoNoNoNoNoNoNoNoNoNo
\n", "

5 rows × 28 columns

\n", "
" ], "text/plain": [ " date temperaturemin temperaturemax precipitation snowfall \\\n", "0 2015-04-08 62.1 84.0 0.00 0.0 \n", "1 2015-04-20 63.0 78.1 0.28 0.0 \n", "2 2015-04-26 45.0 54.0 0.02 0.0 \n", "3 2015-04-28 39.0 69.1 0.00 0.0 \n", "4 2015-05-03 46.9 79.0 0.00 0.0 \n", "\n", " snowdepth avgwindspeed fastest2minwinddir fastest2minwindspeed \\\n", "0 0.0 5.82 40.0 29.97 \n", "1 0.0 11.86 180.0 21.92 \n", "2 0.0 5.82 50.0 12.97 \n", "3 0.0 2.68 40.0 12.08 \n", "4 0.0 2.68 200.0 12.08 \n", "\n", " fastest5secwinddir ... drizzle snow freezingrain smokehaze thunder \\\n", "0 30.0 ... No No No Yes No \n", "1 170.0 ... No No No No Yes \n", "2 40.0 ... No No No No No \n", "3 40.0 ... No No No No No \n", "4 210.0 ... No No No No No \n", "\n", " highwind hail blowingsnow dust freezingfog \n", "0 No No No No No \n", "1 No No No No No \n", "2 No No No No No \n", "3 No No No No No \n", "4 No No No No No \n", "\n", "[5 rows x 28 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df = pd.read_csv(\"data/rdu-weather-history.csv\",sep=\";\")\n", "\n", "df.head()\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.int64(49)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df[\"snowfall\"]>0).sum()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAh8AAAHGCAYAAADUhOmrAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjcsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvTLEjVAAAAAlwSFlzAAAPYQAAD2EBqD+naQAAIkBJREFUeJzt3Qu4FVXdP/DFRS4qoKAoCAiZpYSaFzSUV6UoMvKRstKyJPOSiZlaKpRYZAleKsp8pCxDEtRudhEz01LMO5DmJbmoKGlIpR5Q5Gic+T9rvf9zHg6CZu8+a+/Z5/N5nnnO3jObPT/W2TPnu9esmelQFEURAAAy6ZhrRQAAwgcAkJ2eDwAgK+EDAMhK+AAAshI+AICshA8AICvhAwDIqnOoMU1NTeHpp58OPXr0CB06dKh2OQDAfyBes3T16tWhf//+oWPHjuUKHzF4DBw4sNplAAD/heXLl4cBAwaUK3zEHo/m4nv27FntcgCA/8CqVatS50Hz3/FShY/mQy0xeAgfAFAu/8mQCQNOAYCshA8AICvhAwDISvgAALISPgCArIQPACAr4QMAyEr4AACyEj4AgKyEDwAgK+EDAMhK+AAAshI+AICshA8AIKvOeVcH/zeDJ86taBMumza2ou8HwOvT8wEAZCV8AABZCR8AQFbCBwCQlfABAGQlfAAAWQkfAEBWwgcAkJXwAQBkJXwAAFkJHwBAVsIHAJCV8AEAZCV8AABZCR8AQFbCBwCQlfABAGQlfAAAWQkfAEBWwgcAkJXwAQBkJXwAAFkJHwBAVsIHAJCV8AEAZCV8AABZCR8AQG2Hj3nz5oVDDz009O/fP3To0CH88pe/bLW8KIpwzjnnhH79+oXu3buH0aNHhyVLllSyZgCgPYWPF198Meyxxx7hkksu2ejyCy64IHznO98JM2bMCHfffXfYYostwpgxY8LatWsrUS8AUHKd3+g/OOSQQ9K0MbHXY/r06eHss88Ohx12WJo3a9assN1226UekiOPPPL/XjEAUGoVHfPx+OOPhxUrVqRDLc169eoV9ttvv3DnnXdu9N80NjaGVatWtZoAgPpV0fARg0cUezrWF583L9vQ1KlTU0BpngYOHFjJkgCAGlP1s10mTZoUGhoaWqbly5dXuyQAoCzhY/vtt08/n3nmmVbz4/PmZRvq2rVr6NmzZ6sJAKhfFQ0fQ4YMSSHj5ptvbpkXx3DEs15GjBhRyVUBAO3lbJcXXnghLF26tNUg0/vuuy/07t07DBo0KJx66qnha1/7Wth5551TGJk8eXK6Jsi4ceMqXTsA0B7Cx/z588OoUaNanp9++unp5/jx48PMmTPDmWeema4FcsIJJ4Tnn38+jBw5Mtxwww2hW7dula0cACilDkW8OEcNiYdp4lkvcfCp8R9saPDEuRVtlGXTxmpkgMx/v6t+tgsA0L4IHwBAVsIHAJCV8AEAZCV8AABZCR8AQFbCBwCQlfABAGQlfAAAWQkfAEBWwgcAkJXwAQBkJXwAAFkJHwBAVsIHAJCV8AEAZCV8AABZCR8AQFbCBwCQlfABAGQlfAAAWQkfAEBWwgcAkJXwAQBkJXwAAFkJHwBAVsIHAJCV8AEAZCV8AABZCR8AQFbCBwCQlfABAGQlfAAAWQkfAEBWwgcAkJXwAQBkJXwAAFkJHwBAVsIHAJCV8AEAZCV8AABZCR8AQFbCBwCQlfABAGQlfAAAWQkfAEBWwgcAkJXwAQBkJXwAAFkJHwBAVsIHAJCV8AEAZCV8AADlDh/r1q0LkydPDkOGDAndu3cPO+20Uzj33HNDURSVXhUAUEKdK/2G559/frj00kvDFVdcEd72treF+fPnh2OOOSb06tUrnHLKKZVeHQDQ3sPHHXfcEQ477LAwduzY9Hzw4MHhqquuCvfcc0+lVwUAlFDFD7vsv//+4eabbw6LFy9Oz++///7wpz/9KRxyyCEbfX1jY2NYtWpVqwkAqF8V7/mYOHFiChC77LJL6NSpUxoD8vWvfz0cddRRG3391KlTw5QpUypdBgDQXno+fvKTn4TZs2eHOXPmhIULF6axHxdddFH6uTGTJk0KDQ0NLdPy5csrXRIAUM89H2eccUbq/TjyyCPT89122y088cQTqYdj/Pjxr3p9165d0wQAtA8V7/lYs2ZN6Nix9dvGwy9NTU2VXhUAUEIV7/k49NBD0xiPQYMGpVNt//znP4dvfvOb4VOf+lSlVwUAlFDFw8fFF1+cLjJ20kknhZUrV4b+/fuHT3/60+Gcc86p9KoAgBKqePjo0aNHmD59epoAADbk3i4AQFbCBwCQlfABAGQlfAAAWQkfAEBWwgcAkJXwAQBkJXwAAFkJHwBAVsIHAJCV8AEAZCV8AABZCR8AQFbCBwCQlfABAGQlfAAAWQkfAEBWwgcAkJXwAQBkJXwAAFkJHwBAVsIHAJCV8AEAZCV8AABZCR8AQFad864OAGhLgyfOrfh7Lps2tqLvp+cDAMhK+AAAshI+AICshA8AICvhAwDISvgAALISPgCArIQPACAr4QMAyEr4AACyEj4AgKyEDwAgK+EDAMhK+AAAshI+AICshA8AICvhAwDISvgAALISPgCArIQPACAr4QMAyEr4AACyEj4AgKyEDwAgK+EDABA+AID6pecDACh/+HjqqafCxz/+8dCnT5/QvXv3sNtuu4X58+e3xaoAgJLpXOk3fO6558IBBxwQRo0aFX7729+GbbfdNixZsiRsvfXWlV4VAFBCFQ8f559/fhg4cGD40Y9+1DJvyJAhlV4NAFBSFT/s8utf/zrss88+4cMf/nDo27dv2HPPPcNll122ydc3NjaGVatWtZoAgPpV8Z6Pxx57LFx66aXh9NNPD1/84hfDvffeG0455ZTQpUuXMH78+Fe9furUqWHKlCmVLgMgi8ET51b8PZdNG9vuaqR9qXjPR1NTU9hrr73Ceeedl3o9TjjhhHD88ceHGTNmbPT1kyZNCg0NDS3T8uXLK10SAFDP4aNfv35h6NChrebtuuuu4cknn9zo67t27Rp69uzZagIA6lfFw0c802XRokWt5i1evDjsuOOOlV4VAFBCFQ8fp512WrjrrrvSYZelS5eGOXPmhO9///thwoQJlV4VAFBCFQ8fw4cPD9dee2246qqrwrBhw8K5554bpk+fHo466qhKrwoAKKGKn+0Svf/9708TAMCG3NsFAMhK+AAAhA8AoH7p+QAAshI+AICshA8AICvhAwDISvgAALISPgCArIQPACAr4QMAyEr4AACyEj4AgKyEDwAgK+EDAMhK+AAAshI+AICshA8AICvhAwDISvgAALISPgCArIQPACAr4QMAyEr4AACyEj4AgKyEDwAgK+EDAMhK+AAAshI+AICshA8AICvhAwDISvgAALISPgCArIQPACAr4QMAyEr4AACyEj4AgKyEDwAgK+EDAMhK+AAAshI+AICshA8AICvhAwDISvgAALISPgCArIQPACAr4QMAyEr4AACyEj4AgKyEDwAgK+EDAMhK+AAAshI+AID6Ch/Tpk0LHTp0CKeeempbrwoAaO/h49577w3f+973wu67796WqwEASqTNwscLL7wQjjrqqHDZZZeFrbfeuq1WAwCUTJuFjwkTJoSxY8eG0aNHv+brGhsbw6pVq1pNAED96twWb3r11VeHhQsXpsMur2fq1KlhypQpbVEGUHKDJ86t6Pstmza2ou8H1EjPx/Lly8PnPve5MHv27NCtW7fXff2kSZNCQ0NDyxT/PQBQvyre87FgwYKwcuXKsNdee7XMW7duXZg3b1747ne/mw6zdOrUqWVZ165d0wQAtA8VDx/vete7wgMPPNBq3jHHHBN22WWXcNZZZ7UKHgBA+1Px8NGjR48wbNiwVvO22GKL0KdPn1fNBwDaH1c4BQDKf7bLhm655ZYcqwEASkDPBwCQlfABAGQlfAAAWQkfAEBWwgcAkJXwAQBkJXwAAFkJHwBAVsIHAJCV8AEAZCV8AABZCR8AQFbCBwCQlfABAGQlfAAAWQkfAEBWwgcAkJXwAQBkJXwAAFkJHwBAVsIHAJCV8AEAZCV8AABZCR8AQFbCBwCQVee8q6NWDZ44t+LvuWza2Iq/J0A12VdWhp4PACAr4QMAyEr4AACyEj4AgKyEDwAgK+EDAMhK+AAAshI+AICshA8AICvhAwDISvgAALISPgCArIQPACAr4QMAyEr4AACyEj4AgKyEDwAgK+EDAMhK+AAAshI+AICshA8AICvhAwDISvgAALISPgCArIQPACAr4QMAyEr4AADKHT6mTp0ahg8fHnr06BH69u0bxo0bFxYtWlTp1QAAJVXx8HHrrbeGCRMmhLvuuiv8/ve/D6+88kp4z3veE1588cVKrwoAKKHOlX7DG264odXzmTNnph6QBQsWhAMPPLDSqwMA2nv42FBDQ0P62bt3740ub2xsTFOzVatWtXVJAEC9ho+mpqZw6qmnhgMOOCAMGzZsk2NEpkyZ8l+vY/DEuaGSlk0bGyqt0jW2VZ1Uht831Ma2Yz/ZTs92iWM/HnzwwXD11Vdv8jWTJk1KvSPN0/Lly9uyJACgXns+Tj755HDdddeFefPmhQEDBmzydV27dk0TANA+VDx8FEURPvvZz4Zrr7023HLLLWHIkCGVXgUAUGKd2+JQy5w5c8KvfvWrdK2PFStWpPm9evUK3bt3r/TqAID2Pubj0ksvTWM3Dj744NCvX7+W6Zprrqn0qgCAEmqTwy4AAJvi3i4AQFbCBwCQlfABAGQlfAAAWQkfAEBWwgcAkJXwAQBkJXwAAFkJHwBAVsIHAJCV8AEAZCV8AABZCR8AQFbCBwCQlfABAGQlfAAAWQkfAEBWwgcAkJXwAQBkJXwAAFkJHwBAVsIHAJCV8AEAZCV8AABZCR8AQFad864OqAWDJ86t+Hsumza24u8J1Cc9HwBAVsIHAJCV8AEAZCV8AABZCR8AQFbCBwCQlfABAGQlfAAAWQkfAEBWwgcAkJXwAQBkJXwAAFkJHwBAVsIHAJCV8AEAZCV8AABZCR8AQFbCBwCQlfABAGQlfAAAWQkfAEBWwgcAkJXwAQBkJXwAAFkJHwBAVsIHAJCV8AEA1Ef4uOSSS8LgwYNDt27dwn777RfuueeetloVANDew8c111wTTj/99PDlL385LFy4MOyxxx5hzJgxYeXKlW2xOgCgvYePb37zm+H4448PxxxzTBg6dGiYMWNG2HzzzcPll1/eFqsDAEqkc6Xf8OWXXw4LFiwIkyZNapnXsWPHMHr06HDnnXe+6vWNjY1patbQ0JB+rlq16j9aX1PjmlBJ/+l634hK19gWdZahxsjvuzbbMfL7rt229PtuX23ZVKUam19TFMXrv2FRYU899VRca3HHHXe0mn/GGWcU++6776te/+Uvfzm93qQNfAZ8BnwGfAZ8BkLp22D58uWvmxUq3vPxRsUekjg+pFlTU1N49tlnQ58+fUKHDh0qso6YxgYOHBiWL18eevbsGWpRGWosS51q1Ja1pgyfybLUqcbabcvY47F69erQv3//131txcPHNttsEzp16hSeeeaZVvPj8+233/5Vr+/atWua1rfVVluFthAbt1Y3qDLVWJY61agta00ZPpNlqVONtdmWvXr1qs6A0y5duoS999473Hzzza16M+LzESNGVHp1AEDJtMlhl3gYZfz48WGfffYJ++67b5g+fXp48cUX09kvAED71ibh44gjjgj/+Mc/wjnnnBNWrFgR3v72t4cbbrghbLfddqEa4mGdeM2RDQ/v1JIy1FiWOtWoLWtNGT6TZalTjfXRlh3iqNPsawUA2i33dgEAshI+AICshA8AICvhAwDISvgAALISPmpIvApsPDW5Fq1bty7VF0+hrlXxpoSLFi1KU/MNCmkf4kl78TNay2bOnOlzCfUaPnbbbbdw7rnnpmvV16p475oPfehDYdCgQeEzn/lM2mked9xxoV+/fmGHHXYI+++/f/j73/8easHcuXPDgQceGLbYYot0vf54ifx4+ftPfOIT4cknnwy14Ac/+EEYOnRo6N27d/q5/uMf/vCH1S6vVK6//vr0WTzzzDPDI4880mrZc889F975zneGavr3v/8dzj777HDQQQel6xNEF154Ydhyyy3D5ptvni5uGO+sXYtOOOGE8PTTT4dacc8997QKbNddd11q17gPiheInDVrVqhFS5YsSVfMXrp0aagFPXr0CMcee2y44447Qhms2yCkx8/BXXfd1eru8lkUdaZDhw5Fnz59ik6dOhVjxowpfvaznxWvvPJKUUs+9alPFcOGDSsuvvji4qCDDioOO+ywYvfddy/+9Kc/pbsBDx8+vDj66KOrXWYxa9asokePHsXnP//54ktf+lKx/fbbFxMnTiwuvfTSVPc222xTLF68uKo1XnDBBcXmm2+e6vrjH/9YPPzww2mKjydNmlRsscUWxYUXXljUsvvuu6/o2LFjtcsoZs+enbabsWPHFiNHjiy6detWXHnllS3LV6xYUfU6zz777GK77bYrTj/99GLo0KHFiSeeWAwcODDVecUVVxQ77LBDcf7551e1xq233nqjU9w39erVq+V5tcXf5TPPPJMe//rXv07P437nkksuKY477riic+fOxS9+8Yuq1njeeecVN910U3r87LPPFu9617tSO8Yp1vve9763eO6556paY6zlbW97W/q5yy67FBdddFGxcuXKotYsW7as2HvvvdM2HtutoaGhGD16dEt7vulNbyoWLVqUrZ66DB9PPfVUce211xaHHnpo2oC23Xbb9Ac0/lGqBf369Stuv/32lh16rPnGG29sWR5DSNyJVlvckK6++uqW5/fee28xYMCAoqmpKT0/4ogjig984ANVrLAoBg0aVFxzzTWbXB7rj3+caj18xM9Atb397W8vvv3tb7c8j+0aw9sPfvCDmgkfcQf5m9/8Jj1esmRJqmf9z2isOQb7atpyyy1TgJs5c2bL9KMf/Sjt9L/+9a+3zKu2+JlrDh8xbMYAv75Y6zve8Y6imuL+ZuHChelxDER77rlnev7SSy+l7SbWd+yxx9ZEO8Z6Tj755KJ3795Fly5dig9+8IPF9ddf37K/rLbDDz88fWmM289HPvKR4oADDigOPvjg4m9/+1vx9NNPpy/r48aNy1ZPXYaP5g0qio0a0/POO++cdlQjRowofvjDH1a1xvhNPabQZptttlnxwAMPtDx/7LHH0k6/2rp37148/vjjrebFMBfDXXT33XcXW221VVFN8dv5a4XKhx56KP0/qikGtNea3vnOd1b9j3oUP3Pxs7e+P/zhD+mPaeztqoXwEX/fTz75ZKvnf/3rX1uex/pjb101xVDU3Hu5evXqVttO/DzW4r6yb9++xfz581stf+SRR6q+fXft2rVlXzl48ODi1ltvbbU81hy/zNXS35y1a9cWc+bMSb00cXuJAWry5MlFtW277bbFn//85/T4+eefT3XfdtttLcsXLFiQehVzqbsxHx06dGj1PI6jmDRpUli8eHE6TrjTTjuFU045JVTTzjvvnI6vRr/97W9Dt27dwo033tiy/He/+10YMmRIqLbBgweH+fPntzxfuHBh6NixY8s9euK4ildeeaWKFYYwfPjwMG3atDQWYGPHNs8///z0mmr6zW9+E9auXZtuNb2xKY5XqAXxltpxUPH6Ro0alT6rZ5xxRrj44otDtcX2ev7551ue77XXXumYe7N43HrDfUBub37zm9Px/zg+Kt7X6vbbbw+16uGHHw5/+ctfQvfu3dPdxze0se0qpx133DE8+OCD6XH8vXbu3Pp2ZJ06dUo3La2mDT9v8T4pH/3oR8NNN90UHn300fDJT34yDTautrX/fx8UxW0mtt36207c/tesWZOvoKLObJhCNyYe66qmeHw6dsG++c1vTsn+pz/9adG/f//UFXbkkUemLrvvfve7RbXFGuIx6jPPPLM455xzUo3rd3HG/0fsBq2m+++/P41FieN8Yi9CHAMQp/g4zovfitbvVaqG3XbbreXQxcbEbyPV7lGI4tij+HvemDiGJvaMVLvOUaNGveYhi5/85CfpuHatuPnmm9OhwTj+KPZw1lrPR/x9Nh/z/9a3vtVq+VVXXZXG1VRTHK+16667pt6kb3zjG6nneunSpS29XPGwwYc+9KGa/5tTC4de3vGOd6QxU9Hll1+eejnWP9T21a9+Neu20yZ3ta2mONo9pvjXEhNeNR111FGpVyGOMB4xYkQ6uyWemRG/wcfk+f3vfz/9P6ptwoQJqafjyiuvTN8oY4KfPHlyy/J99903zJkzp6o17r777qlXK9YY2/Oxxx5L8+O3zq997WvhYx/7WNV/33vvvXfqNYoj4jcmflOKZz5V22mnnbbJEfsHH3xw6sGp9hkQM2bMCJttttkml8eeuHimTq2IZwfF3/3xxx+fzhiL3zZrxeOPP97q+YY9cPGsobPOOitU0xe+8IV0Vl3cP8Ze62XLloW3vOUtqQck9srEnq+rrrqqqjXGs65er/ey2r1x0Ve+8pUwbty4cMEFF6T9euxhj5/LP/zhD+n5vffem3V/7q620MZicIuHgOKpoMAb99e//jUd/otfLuLhoXg4/YADDgijR4+uiT/sZbFs2bKwYMGC9IUofgGOh1kvueSS9KV37Nix6TBrLnUbPmIqfuihh1ou2hW/Ccf0/FrfmnIrQ41lqnNT34TjNVNqoWeh7OLnIF6nopbbUo3tqy3LQDtuQlFn1q1bl65JEUdpNx/LbJ7ivHjMK75GjfXRlmW5hsZrideheeKJJ4paV4a2VGP7asuXX3655redMrRjNfZDdXe2y8SJE9OYiTh+InbRxZHQcYqP45kPcVk8+0WN9dGW9SD2KtXC2U1QNvFsHdtOOfdDdXfYJR4SuOKKK8KYMWM2ujwOsjn66KNfdUphTmWosSx1xgFnr+Wll15KA1Jr+b4f999/f/p/VLvGMrSlGttXW5Zh26mHdqxGW9bd2S6rV69O9yDZlDhQqdrnhZehxrLUGb/5HHnkkZtM7HG8R9zwq+k/2TnVgjK0pRrbV1uWYdspQzvWYlvWXc9HHLEbB/jMnj07bLPNNq2W/fOf/0w3RIunuzVf5EuN5W7LeAOseAprvEHfxtx3331pZHc1v3XEi8i93s7psssuq/o3ozK0pRrbV1uWYdspQzvWYlvWXc9HvA7A+973vvStPN7htvlqnPHQwAMPPJDO0qjmH8uy1FiWOuPpdosWLdrk8ngFv3hX3moaNmxY2G+//V5z5xQ3+morQ1uqsX21ZRm2nTK0Yy22Zd31fETxPPA4HiFedGr900PjBb3e8573pAuqVFsZaixTnbXsc5/7XLoWwfTp0ze6PF6COd7G/o9//GP22qCW2Xbqty3rMnwAALWr7g67NLvnnnvCnXfe2erberyMebVvMla2GstS58ZqjL0z8RLw1F9bqrF9tWUZaMd23vOxcuXKcPjhh6c7ScYr860/TiHeIyAen/v5z38e+vbtq0ZtmVWt75xsO+2nHctUZ61vO2Vqx5pqy6LOHH744enOh4888sirlsV5+++/f9XvgliGGstSZxlqjHe8HDlyZLoy7I477ljsu+++aYqP47y47PXuiplDGdpSje2rLcuw7ZShHWuxLesufGy55ZbFwoULN7l8/vz56TXVVIYay1JnGWosy86pDG2pxvbVlmXYdsrQjrXYlnU35iPennzVqlWveeGs+JpqKkONZamzDDXGs4XmzZsX3vrWt75qWZz3ne98J92yvtrK0JZqbF9tWYZtpwztWIttWXfnSR5xxBFh/Pjx4dprr231gYiP47xjjjkmfPSjH1WjtsymLDsn2077acey1FmGbacM7ViTbVnUmbVr1xYnnnhi0aVLl3QnwW7duqUpPo7zPvOZz6TXqFFb5nLSSSel46q/+MUvioaGhpb58XGcN3jw4OLkk08uqs22037asSx1lmHbKUM71mJb1t3ZLs1iwluwYEGrEb3xErc9e/YMtaIMNZalzlqusbGxMZx66qnh8ssvT5er79KlS5r/8ssvh86dO6dLM3/rW9+q+je4MrRlMzW2j7Ys07ZTy+1Yi21Zt+EDak2t75ygVtl26rAtizq0Zs2a4rbbbiseeuihVy176aWXiiuuuKKotjLUWJY6y1BjWZShLdXYvtqyDLTjG1d34WPRokUt5y3HY24HHnhg8dRTT7UsX7FiRZqvRm2ZUxl2Trad9tOOZaqz1redsrRjrbVl3Z3tctZZZ6W798WrzsU7DcY7Co4cOTJdaa5WlKHGstRZhhoXL14cdt1113Rny3h34IMOOig8/fTTLcsbGhrSiPhqK0NbqrF9tWUZtp0ytGNNtmVRZ/r27Vv85S9/aXne1NSURiIPGjSoePTRR2sihZahxrLUWYYax40bV4wdO7b4xz/+USxZsiQ9HjJkSPHEE0+k5bVQY1naUo3tqy3LsO2UoR1rsS3rLnz06NGjePjhh181f8KECcWAAQOKefPmVf2DUIYay1JnGWosy86pDG2pxvbVlmXYdsrQjrXYlnUXPoYPH17MmjVro8vih2Grrbaq+gehDDWWpc4y1FiWnVMZ2lKN7asty7DtlKEda7Et6y58nHfeecUhhxyyyeXxgi9xYFA1laHGstRZhhrLsnMqQ1uqsX21ZRm2nTK0Yy22pet8QBubOnVquO2228L111+/0eUnnXRSmDFjRmhqavK7ANtOu9gPCR8AQFZ1d6otAFDbhA8AICvhAwDISvgAKubggw9Od84EeC3CB1AVt9xyS+jQoUN4/vnn/QagnRE+AICshA/gv/Liiy+Go48+Omy55ZahX79+4Rvf+Ear5T/+8Y/DPvvsk260tf3224ePfexj6eZb0bJly8KoUaPS46233jr1gHzyk59Mz+N1BuI1CYYMGRK6d+8e9thjj/Czn/3MbwnqiPAB/FfOOOOMcOutt4Zf/epX4cYbb0yHURYuXNiy/JVXXgnnnntuuP/++8Mvf/nLFDiaA8bAgQPDz3/+8/Q43gn073//e/j2t7+dnsfgMWvWrHTBo4ceeiicdtpp4eMf/3haF1AfXGQMeMNeeOGF0KdPn3DllVeGD3/4w2nes88+GwYMGBBOOOGEMH369Ff9m/nz54fhw4eH1atXp96SGFZi78dzzz0Xttpqq/SaxsbG0Lt373DTTTeFESNGtPzb4447LqxZsybMmTPHbwvqQOdqFwCUz6OPPhpefvnlsN9++7XMi6HhrW99a8vzBQsWhK985Sup5yMGjObLNj/55JNh6NChG33fpUuXppDx7ne/u9X8uK4999yzzf4/QF7CB9Am40HGjBmTptmzZ4dtt902hY74PAaJ1+pRiebOnRt22GGHVsu6du3qNwV1QvgA3rCddtopbLbZZuHuu+8OgwYNSvNi78bixYvDQQcdFB555JHwr3/9K0ybNi2N72g+7LK+Ll26pJ/r1q1rmRd7RGLIiEElvg9Qn4QP4A2LYzaOPfbYNOg0jv3o27dv+NKXvhQ6dvzfMewxkMRwcfHFF4cTTzwxPPjgg2nw6fp23HHHdJbLddddF973vvelM1vimTFf+MIX0iDTeJhm5MiRoaGhIdx+++2hZ8+eYfz48X5bUAec7QL8Vy688MLwP//zP+HQQw8No0ePTkFh7733TsviYZaZM2eGn/70p6k3I/aAXHTRRa3+fTysMmXKlDBx4sSw3XbbhZNPPjnNjyFl8uTJ6ayXXXfdNbz3ve9Nh2HiqbdAfXC2CwCQlZ4PACAr4QMAyEr4AACyEj4AgKyEDwAgK+EDAMhK+AAAshI+AICshA8AICvhAwDISvgAALISPgCAkNP/A3BmP+Svs3UsAAAAAElFTkSuQmCC", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df2 = df[df[\"snowfall\"]>0]\n", "df2.head()\n", "\n", "df2.index = pd.to_datetime(df2[\"date\"])\n", "\n", "df2[\"snowfall\"].groupby(df2.index.year).count().plot(kind=\"bar\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "D. ¿En qué año se han registrado más nieve (`snowdepth`)? " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "E. Crea un dataframe que contenga la temperatura máxima de julio por cada año." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "date\n", "2007 97.0\n", "2008 100.9\n", "2009 97.0\n", "2010 100.9\n", "2011 99.0\n", "2012 105.1\n", "2013 93.9\n", "2014 97.0\n", "2015 100.0\n", "2016 93.0\n", "2017 93.0\n", "2018 96.1\n", "2019 93.0\n", "Name: temperaturemax, dtype: float64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"data/rdu-weather-history.csv\",sep=\";\")\n", "df.index = pd.to_datetime(df[\"date\"])\n", "\n", "dfjulio = df[df.index.month==6]\n", "dfjulio.head()\n", "\n", "dfjulio.groupby(dfjulio.index.year)[\"temperaturemax\"].max()\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "F. ¿Cuál fue la temperaturemax de cada principio de mes?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pivotación de tablas " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pivotar una tabla consiste en organizar las columnas a filas o las filas a columnas. Con ello disponemos los datos *transpuestos* a la modelización original.\n", "\n", "Enlace a la documentación:\n", "- https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html\n", "- https://pandas.pydata.org/docs/reference/api/pandas.pivot.html" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Municipio Categoria Values\n", "0 muni0 Inscritos 4\n", "1 muni0 Censo 3\n", "2 muni0 Población 4\n", "3 muni1 Inscritos 7\n", "4 muni1 Censo 1\n", "5 muni1 Población 4\n", "6 muni2 Inscritos 5\n", "7 muni2 Censo 1\n", "8 muni2 Población 7\n", "9 muni3 Inscritos 7\n", "10 muni3 Censo 7\n", "11 muni3 Población 8\n", "12 muni4 Inscritos 7\n", "13 muni4 Censo 2\n", "14 muni4 Población 6\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "samples=5\n", "df= pd.DataFrame(\n", " {\n", " \"Municipio\":np.repeat([\"muni%i\"%i for i in range(samples)],3) ,\n", " \"Categoria\" :[\"Inscritos\",\"Censo\",\"Población\"]*(samples),\n", " \"Values\" : np.random.randint(1,10,samples*3)\n", " })\n", " \n", " \n", "print(df)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "Could not convert ['muni0muni1muni2muni3muni4' 'CensoCensoCensoCensoCenso'] to numeric", "output_type": "error", "traceback": [ "\u001b[31m---------------------------------------------------------------------------\u001b[39m", "\u001b[31mKeyError\u001b[39m Traceback (most recent call last)", "\u001b[36mFile \u001b[39m\u001b[32m~/Projects/TTAD_master/.venv/lib/python3.12/site-packages/pandas/core/indexes/base.py:3812\u001b[39m, in \u001b[36mIndex.get_loc\u001b[39m\u001b[34m(self, key)\u001b[39m\n\u001b[32m 3811\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m3812\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_engine\u001b[49m\u001b[43m.\u001b[49m\u001b[43mget_loc\u001b[49m\u001b[43m(\u001b[49m\u001b[43mcasted_key\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 3813\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n", "\u001b[36mFile \u001b[39m\u001b[32mpandas/_libs/index.pyx:167\u001b[39m, in \u001b[36mpandas._libs.index.IndexEngine.get_loc\u001b[39m\u001b[34m()\u001b[39m\n", "\u001b[36mFile \u001b[39m\u001b[32mpandas/_libs/index.pyx:196\u001b[39m, in \u001b[36mpandas._libs.index.IndexEngine.get_loc\u001b[39m\u001b[34m()\u001b[39m\n", "\u001b[36mFile \u001b[39m\u001b[32mpandas/_libs/hashtable_class_helper.pxi:7088\u001b[39m, in \u001b[36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[39m\u001b[34m()\u001b[39m\n", "\u001b[36mFile \u001b[39m\u001b[32mpandas/_libs/hashtable_class_helper.pxi:7096\u001b[39m, in \u001b[36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[39m\u001b[34m()\u001b[39m\n", "\u001b[31mKeyError\u001b[39m: 'Categoria'", "\nThe above exception was the direct cause of the following exception:\n", "\u001b[31mKeyError\u001b[39m Traceback (most recent call last)", "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[73]\u001b[39m\u001b[32m, line 1\u001b[39m\n\u001b[32m----> \u001b[39m\u001b[32m1\u001b[39m df[\u001b[43mdf\u001b[49m\u001b[43m[\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mCategoria\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m]\u001b[49m==\u001b[33m\"\u001b[39m\u001b[33mCenso\u001b[39m\u001b[33m\"\u001b[39m].mean()\n", "\u001b[36mFile \u001b[39m\u001b[32m~/Projects/TTAD_master/.venv/lib/python3.12/site-packages/pandas/core/frame.py:4113\u001b[39m, in \u001b[36mDataFrame.__getitem__\u001b[39m\u001b[34m(self, key)\u001b[39m\n\u001b[32m 4111\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m.columns.nlevels > \u001b[32m1\u001b[39m:\n\u001b[32m 4112\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._getitem_multilevel(key)\n\u001b[32m-> \u001b[39m\u001b[32m4113\u001b[39m indexer = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mcolumns\u001b[49m\u001b[43m.\u001b[49m\u001b[43mget_loc\u001b[49m\u001b[43m(\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 4114\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m is_integer(indexer):\n\u001b[32m 4115\u001b[39m indexer = [indexer]\n", "\u001b[36mFile \u001b[39m\u001b[32m~/Projects/TTAD_master/.venv/lib/python3.12/site-packages/pandas/core/indexes/base.py:3819\u001b[39m, in \u001b[36mIndex.get_loc\u001b[39m\u001b[34m(self, key)\u001b[39m\n\u001b[32m 3814\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(casted_key, \u001b[38;5;28mslice\u001b[39m) \u001b[38;5;129;01mor\u001b[39;00m (\n\u001b[32m 3815\u001b[39m \u001b[38;5;28misinstance\u001b[39m(casted_key, abc.Iterable)\n\u001b[32m 3816\u001b[39m \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28many\u001b[39m(\u001b[38;5;28misinstance\u001b[39m(x, \u001b[38;5;28mslice\u001b[39m) \u001b[38;5;28;01mfor\u001b[39;00m x \u001b[38;5;129;01min\u001b[39;00m casted_key)\n\u001b[32m 3817\u001b[39m ):\n\u001b[32m 3818\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m InvalidIndexError(key)\n\u001b[32m-> \u001b[39m\u001b[32m3819\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m(key) \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01merr\u001b[39;00m\n\u001b[32m 3820\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m:\n\u001b[32m 3821\u001b[39m \u001b[38;5;66;03m# If we have a listlike key, _check_indexing_error will raise\u001b[39;00m\n\u001b[32m 3822\u001b[39m \u001b[38;5;66;03m# InvalidIndexError. Otherwise we fall through and re-raise\u001b[39;00m\n\u001b[32m 3823\u001b[39m \u001b[38;5;66;03m# the TypeError.\u001b[39;00m\n\u001b[32m 3824\u001b[39m \u001b[38;5;28mself\u001b[39m._check_indexing_error(key)\n", "\u001b[31mKeyError\u001b[39m: 'Categoria'" ] } ], "source": [ "df[df[\"Categoria\"]==\"Censo\"].mean() # !" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Values
Categoria
Censo2.8
Inscritos6.0
Población5.8
\n", "
" ], "text/plain": [ " Values\n", "Categoria \n", "Censo 2.8\n", "Inscritos 6.0\n", "Población 5.8" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# indexcolumn, Grouper, array, or list of the previous\n", "# Keys to group by on the pivot table index. If a list is passed, it can contain any of the other types (except list). \n", "# If an array is passed, it must be the same length as the data and will be used in the same manner as column values.\n", "pd.pivot_table(df, index=['Categoria'],values=\"Values\")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(14.0)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = pd.pivot_table(df, index=['Municipio'],columns=[\"Categoria\"])\n", "df2[(\"Values\",\"Censo\")].sum()\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "df2.reset_index(inplace=True)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MunicipioValues
CategoriaCensoInscritosPoblación
0muni03.04.04.0
1muni11.07.04.0
2muni21.05.07.0
3muni37.07.08.0
4muni42.07.06.0
\n", "
" ], "text/plain": [ " Municipio Values \n", "Categoria Censo Inscritos Población\n", "0 muni0 3.0 4.0 4.0\n", "1 muni1 1.0 7.0 4.0\n", "2 muni2 1.0 5.0 7.0\n", "3 muni3 7.0 7.0 8.0\n", "4 muni4 2.0 7.0 6.0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "ename": "NameError", "evalue": "name 'df2' is not defined", "output_type": "error", "traceback": [ "\u001b[31m---------------------------------------------------------------------------\u001b[39m", "\u001b[31mKeyError\u001b[39m Traceback (most recent call last)", "\u001b[36mFile \u001b[39m\u001b[32m~/Projects/TTAD_master/.venv/lib/python3.12/site-packages/pandas/core/indexes/base.py:3812\u001b[39m, in \u001b[36mIndex.get_loc\u001b[39m\u001b[34m(self, key)\u001b[39m\n\u001b[32m 3811\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m3812\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_engine\u001b[49m\u001b[43m.\u001b[49m\u001b[43mget_loc\u001b[49m\u001b[43m(\u001b[49m\u001b[43mcasted_key\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 3813\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n", "\u001b[36mFile \u001b[39m\u001b[32mpandas/_libs/index.pyx:167\u001b[39m, in \u001b[36mpandas._libs.index.IndexEngine.get_loc\u001b[39m\u001b[34m()\u001b[39m\n", "\u001b[36mFile \u001b[39m\u001b[32mpandas/_libs/index.pyx:196\u001b[39m, in \u001b[36mpandas._libs.index.IndexEngine.get_loc\u001b[39m\u001b[34m()\u001b[39m\n", "\u001b[36mFile \u001b[39m\u001b[32mpandas/_libs/hashtable_class_helper.pxi:7088\u001b[39m, in \u001b[36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[39m\u001b[34m()\u001b[39m\n", "\u001b[36mFile \u001b[39m\u001b[32mpandas/_libs/hashtable_class_helper.pxi:7096\u001b[39m, in \u001b[36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[39m\u001b[34m()\u001b[39m\n", "\u001b[31mKeyError\u001b[39m: 'Censo'", "\nThe above exception was the direct cause of the following exception:\n", "\u001b[31mKeyError\u001b[39m Traceback (most recent call last)", "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[81]\u001b[39m\u001b[32m, line 1\u001b[39m\n\u001b[32m----> \u001b[39m\u001b[32m1\u001b[39m \u001b[43mdf2\u001b[49m\u001b[43m[\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mCenso\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m]\u001b[49m.mean() \u001b[38;5;66;03m#Alerta, pivotar también genera multi-columas/indices. \u001b[39;00m\n", "\u001b[36mFile \u001b[39m\u001b[32m~/Projects/TTAD_master/.venv/lib/python3.12/site-packages/pandas/core/frame.py:4112\u001b[39m, in \u001b[36mDataFrame.__getitem__\u001b[39m\u001b[34m(self, key)\u001b[39m\n\u001b[32m 4110\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m is_single_key:\n\u001b[32m 4111\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m.columns.nlevels > \u001b[32m1\u001b[39m:\n\u001b[32m-> \u001b[39m\u001b[32m4112\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_getitem_multilevel\u001b[49m\u001b[43m(\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 4113\u001b[39m indexer = \u001b[38;5;28mself\u001b[39m.columns.get_loc(key)\n\u001b[32m 4114\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m is_integer(indexer):\n", "\u001b[36mFile \u001b[39m\u001b[32m~/Projects/TTAD_master/.venv/lib/python3.12/site-packages/pandas/core/frame.py:4170\u001b[39m, in \u001b[36mDataFrame._getitem_multilevel\u001b[39m\u001b[34m(self, key)\u001b[39m\n\u001b[32m 4168\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34m_getitem_multilevel\u001b[39m(\u001b[38;5;28mself\u001b[39m, key):\n\u001b[32m 4169\u001b[39m \u001b[38;5;66;03m# self.columns is a MultiIndex\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m4170\u001b[39m loc = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mcolumns\u001b[49m\u001b[43m.\u001b[49m\u001b[43mget_loc\u001b[49m\u001b[43m(\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 4171\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(loc, (\u001b[38;5;28mslice\u001b[39m, np.ndarray)):\n\u001b[32m 4172\u001b[39m new_columns = \u001b[38;5;28mself\u001b[39m.columns[loc]\n", "\u001b[36mFile \u001b[39m\u001b[32m~/Projects/TTAD_master/.venv/lib/python3.12/site-packages/pandas/core/indexes/multi.py:3059\u001b[39m, in \u001b[36mMultiIndex.get_loc\u001b[39m\u001b[34m(self, key)\u001b[39m\n\u001b[32m 3056\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m mask\n\u001b[32m 3058\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(key, \u001b[38;5;28mtuple\u001b[39m):\n\u001b[32m-> \u001b[39m\u001b[32m3059\u001b[39m loc = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_get_level_indexer\u001b[49m\u001b[43m(\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mlevel\u001b[49m\u001b[43m=\u001b[49m\u001b[32;43m0\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[32m 3060\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m _maybe_to_slice(loc)\n\u001b[32m 3062\u001b[39m keylen = \u001b[38;5;28mlen\u001b[39m(key)\n", "\u001b[36mFile \u001b[39m\u001b[32m~/Projects/TTAD_master/.venv/lib/python3.12/site-packages/pandas/core/indexes/multi.py:3410\u001b[39m, in \u001b[36mMultiIndex._get_level_indexer\u001b[39m\u001b[34m(self, key, level, indexer)\u001b[39m\n\u001b[32m 3407\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mslice\u001b[39m(i, j, step)\n\u001b[32m 3409\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m3410\u001b[39m idx = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_get_loc_single_level_index\u001b[49m\u001b[43m(\u001b[49m\u001b[43mlevel_index\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mkey\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 3412\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m level > \u001b[32m0\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._lexsort_depth == \u001b[32m0\u001b[39m:\n\u001b[32m 3413\u001b[39m \u001b[38;5;66;03m# Desired level is not sorted\u001b[39;00m\n\u001b[32m 3414\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(idx, \u001b[38;5;28mslice\u001b[39m):\n\u001b[32m 3415\u001b[39m \u001b[38;5;66;03m# test_get_loc_partial_timestamp_multiindex\u001b[39;00m\n", "\u001b[36mFile \u001b[39m\u001b[32m~/Projects/TTAD_master/.venv/lib/python3.12/site-packages/pandas/core/indexes/multi.py:2999\u001b[39m, in \u001b[36mMultiIndex._get_loc_single_level_index\u001b[39m\u001b[34m(self, level_index, key)\u001b[39m\n\u001b[32m 2997\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m -\u001b[32m1\u001b[39m\n\u001b[32m 2998\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m2999\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mlevel_index\u001b[49m\u001b[43m.\u001b[49m\u001b[43mget_loc\u001b[49m\u001b[43m(\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m)\u001b[49m\n", "\u001b[36mFile \u001b[39m\u001b[32m~/Projects/TTAD_master/.venv/lib/python3.12/site-packages/pandas/core/indexes/base.py:3819\u001b[39m, in \u001b[36mIndex.get_loc\u001b[39m\u001b[34m(self, key)\u001b[39m\n\u001b[32m 3814\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(casted_key, \u001b[38;5;28mslice\u001b[39m) \u001b[38;5;129;01mor\u001b[39;00m (\n\u001b[32m 3815\u001b[39m \u001b[38;5;28misinstance\u001b[39m(casted_key, abc.Iterable)\n\u001b[32m 3816\u001b[39m \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28many\u001b[39m(\u001b[38;5;28misinstance\u001b[39m(x, \u001b[38;5;28mslice\u001b[39m) \u001b[38;5;28;01mfor\u001b[39;00m x \u001b[38;5;129;01min\u001b[39;00m casted_key)\n\u001b[32m 3817\u001b[39m ):\n\u001b[32m 3818\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m InvalidIndexError(key)\n\u001b[32m-> \u001b[39m\u001b[32m3819\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m(key) \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01merr\u001b[39;00m\n\u001b[32m 3820\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m:\n\u001b[32m 3821\u001b[39m \u001b[38;5;66;03m# If we have a listlike key, _check_indexing_error will raise\u001b[39;00m\n\u001b[32m 3822\u001b[39m \u001b[38;5;66;03m# InvalidIndexError. Otherwise we fall through and re-raise\u001b[39;00m\n\u001b[32m 3823\u001b[39m \u001b[38;5;66;03m# the TypeError.\u001b[39;00m\n\u001b[32m 3824\u001b[39m \u001b[38;5;28mself\u001b[39m._check_indexing_error(key)\n", "\u001b[31mKeyError\u001b[39m: 'Censo'" ] } ], "source": [ "df2[\"Censo\"].mean() #! Alerta, pivotar también genera multi-columas/indices. " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Values
Categoria
Censo14
Inscritos30
Población29
\n", "
" ], "text/plain": [ " Values\n", "Categoria \n", "Censo 14\n", "Inscritos 30\n", "Población 29" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# aggfuncfunction, list of functions, dict, default “mean”\n", "# If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names\n", "# (inferred from the function objects themselves).\n", "pd.pivot_table(df, index=['Categoria'],values=\"Values\", aggfunc=\"sum\")\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Region Product Quarter Sales Units\n", "0 North A Q1 100 10\n", "1 North B Q1 150 15\n", "2 North C Q1 200 20\n", "3 South A Q2 120 12\n", "4 South B Q2 160 16\n", "5 South C Q2 210 21\n", "6 East A Q3 110 11\n", "7 East B Q3 170 17\n", "8 East C Q3 190 19\n" ] } ], "source": [ "# Multi_index \n", "sales_data = pd.DataFrame({\n", " 'Region': ['North', 'North', 'North', 'South', 'South', 'South', 'East', 'East', 'East'],\n", " 'Product': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],\n", " 'Quarter': ['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2', 'Q3', 'Q3', 'Q3'],\n", " 'Sales': [100, 150, 200, 120, 160, 210, 110, 170, 190],\n", " 'Units': [10, 15, 20, 12, 16, 21, 11, 17, 19]\n", "})\n", "\n", "print(sales_data)\n" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Quarter Q1 Q2 Q3\n", "Region Product \n", "East A NaN NaN 110.0\n", " B NaN NaN 170.0\n", " C NaN NaN 190.0\n", "North A 100.0 NaN NaN\n", " B 150.0 NaN NaN\n", " C 200.0 NaN NaN\n", "South A NaN 120.0 NaN\n", " B NaN 160.0 NaN\n", " C NaN 210.0 NaN\n" ] } ], "source": [ "pivot_multi_index = pd.pivot_table(sales_data,\n", " index=['Region', 'Product'],\n", " columns='Quarter',\n", " values='Sales')\n", "print(pivot_multi_index)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Quarter Q1 Q2 Q3\n", "Region Product \n", "East A 0.0 0.0 110.0\n", " B 0.0 0.0 170.0\n", " C 0.0 0.0 190.0\n", "North A 100.0 0.0 0.0\n", " B 150.0 0.0 0.0\n", " C 200.0 0.0 0.0\n", "South A 0.0 120.0 0.0\n", " B 0.0 160.0 0.0\n", " C 0.0 210.0 0.0\n" ] } ], "source": [ "# Incomplete data\n", "pivot_multi_index = pd.pivot_table(sales_data,\n", " index=['Region', 'Product'],\n", " columns='Quarter',\n", " values='Sales',\n", " fill_value=0)\n", "print(pivot_multi_index)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Actividades" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Nombre Apellidos Altura Sexo Nacimiento Cof Categoria\n", "0 Will Smith 1.43 M 10/10/1920 0.19 laboral\n", "1 Jon Snow 1.98 M 10/1/1970 0.98 laboral\n", "2 Laia Ramirez 1.87 F 09/10/1987 0.76 cap6\n", "3 Luzy Raim 1.67 F 23/07/1979 0.56 cap6\n", "4 Fein Mang 1.78 M 12/03/1937 0.27 cap6\n", "5 Victor Colom 1.78 M 22/09/1957 0.97 cap8\n" ] } ], "source": [ "# A\n", "df_experiment = pd.read_csv(\"data/experiment.csv\")\n", "print(df_experiment)\n", "\n", "## Get: \n", "# - Index: 'Categoria'\n", "# - Columns: 'Sexo'\n", "# - Values: 'Altura'\n", "# - Aggregations: ['mean', 'min', 'max', 'std']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Unnamed: 0 Dni Nom CP Ciutat \\\n", "0 0 H61414629 María Dolores Arjona Jove 7800 Eivissa \n", "1 1 S3138381C Núria Quirós 7511 Ruberts \n", "2 2 J8698188C Miguel José María Gil Vargas 7340 Alaro \n", "3 3 A48821615 Jordi Chaves Bustamante 7609 Bellavista \n", "4 4 U0247281I Jana Rosa Collado Menéndez 7006 Palma \n", ".. ... ... ... ... ... \n", "995 995 W8757353A Joan Téllez Pera 7529 Ariany \n", "996 996 H72967045 Álex Javier Campos Palomar 7701 Mao \n", "997 997 N9389400D Èric José Guerrero Hernando 7800 Eivissa \n", "998 998 C61296679 Èric Pinedo 7009 Palma \n", "999 999 H90312224 Pol Iglesias Domingo 7609 Bellavista \n", "\n", " Sexe Tipus certificat cat Punts \n", "0 M B 73 \n", "1 F A 40 \n", "2 M A 45 \n", "3 F B 40 \n", "4 M B 86 \n", ".. ... ... ... \n", "995 F B 32 \n", "996 M B 72 \n", "997 F B 14 \n", "998 M A 1 \n", "999 F A 4 \n", "\n", "[1000 rows x 8 columns]\n" ] } ], "source": [ "# B\n", "df_groups = pd.read_csv(\"data/data_groups.csv\")\n", "print(df_groups)\n", "\n", "# Crear a pivot table con:\n", "# - mostrando los puntos medios por ciudad y tipo de certificado\n", "# - limitando a las 5 ciudades con mas población\n", "# - incluyendo los márgenes!! MARGINS ¿Investiga que son los MARGINS y MARGINS_NAME?\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.10" } }, "nbformat": 4, "nbformat_minor": 1 }