{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Agrupaciones de datos\n", "\n", "Las agrupaciones son operaciones necesarias para analizar datos, ya que permiten extraer información en función de datos categóricos de nuestro dataframe.\n", "\n", "Cargaremos los datos llamados _experiment.csv_ que podemos encontrar en el siguiente [enlace](https://github.com/wisaaco/TTAD/tree/main/data/)" ] }, { "cell_type": "code", "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "slide" }, "ExecuteTime": { "end_time": "2025-11-14T08:12:59.745855Z", "start_time": "2025-11-14T08:12:59.240653Z" } }, "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "df= pd.read_csv(\"../../data/Pandas/experiment.csv\")\n", "print(df)" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Nombre Apellidos Altura Sexo Nacimiento Cof Categoria\n", "0 Will Smith 1.43 M 10/10/1920 0.19 laboral\n", "1 Jon Snow 1.98 M 10/1/1970 0.98 laboral\n", "2 Laia Ramirez 1.87 F 09/10/1987 0.76 cap6\n", "3 Luzy Raim 1.67 F 23/07/1979 0.56 cap6\n", "4 Fein Mang 1.78 M 12/03/1937 0.27 cap6\n", "5 Victor Colom 1.78 M 22/09/1957 0.97 cap8\n" ] } ], "execution_count": 1 }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "En el siguiente ejemplo agrupamos los datos según el sexo de la persona mediante el método `groupby` que devuelve un `DataFrame` agrupado:" ] }, { "cell_type": "code", "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "-" }, "ExecuteTime": { "end_time": "2025-11-14T08:13:00.879969Z", "start_time": "2025-11-14T08:13:00.876274Z" } }, "source": [ "bySex = df.groupby('Sexo')\n", "bySex" ], "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 2 }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "El atributo `groups` nos muestra los grupos hemos creado:" ] }, { "cell_type": "code", "metadata": { "pycharm": { "name": "#%%\n" }, "ExecuteTime": { "end_time": "2025-11-14T08:13:01.773017Z", "start_time": "2025-11-14T08:13:01.769855Z" } }, "source": [ "# Podemos saber los grupos realizados y que índices del dataframe tienen.\n", "bySex.groups # nos proporciona un diccionario\n" ], "outputs": [ { "data": { "text/plain": [ "{'F': [2, 3], 'M': [0, 1, 4, 5]}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 3 }, { "cell_type": "code", "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" }, "ExecuteTime": { "end_time": "2025-11-14T08:13:02.213029Z", "start_time": "2025-11-14T08:13:02.194939Z" } }, "source": [ "bySex.describe()" ], "outputs": [ { "data": { "text/plain": [ " Altura Cof \\\n", " count mean std min 25% 50% 75% max count mean \n", "Sexo \n", "F 2.0 1.7700 0.141421 1.67 1.7200 1.77 1.82 1.87 2.0 0.6600 \n", "M 4.0 1.7425 0.228674 1.43 1.6925 1.78 1.83 1.98 4.0 0.6025 \n", "\n", " \n", " std min 25% 50% 75% max \n", "Sexo \n", "F 0.141421 0.56 0.61 0.66 0.7100 0.76 \n", "M 0.431383 0.19 0.25 0.62 0.9725 0.98 " ], "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AlturaCof
countmeanstdmin25%50%75%maxcountmeanstdmin25%50%75%max
Sexo
F2.01.77000.1414211.671.72001.771.821.872.00.66000.1414210.560.610.660.71000.76
M4.01.74250.2286741.431.69251.781.831.984.00.60250.4313830.190.250.620.97250.98
\n", "
" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 4 }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Esto nos permite realizar operaciones de filtrado con base a los grupos que hemos creado:" ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2025-11-14T08:13:03.058260Z", "start_time": "2025-11-14T08:13:03.054623Z" } }, "source": [ "dfM = df.loc[bySex.groups['M'].values] #Recorda que \"loc\" accedeix per index de fila\n", "dfM" ], "outputs": [ { "data": { "text/plain": [ " Nombre Apellidos Altura Sexo Nacimiento Cof Categoria\n", "0 Will Smith 1.43 M 10/10/1920 0.19 laboral\n", "1 Jon Snow 1.98 M 10/1/1970 0.98 laboral\n", "4 Fein Mang 1.78 M 12/03/1937 0.27 cap6\n", "5 Victor Colom 1.78 M 22/09/1957 0.97 cap8" ], "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NombreApellidosAlturaSexoNacimientoCofCategoria
0WillSmith1.43M10/10/19200.19laboral
1JonSnow1.98M10/1/19700.98laboral
4FeinMang1.78M12/03/19370.27cap6
5VictorColom1.78M22/09/19570.97cap8
\n", "
" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 5 }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2025-11-14T08:13:13.214632Z", "start_time": "2025-11-14T08:13:13.208823Z" } }, "source": [ "df[df.Sexo==\"M\"] # es lo mismo!" ], "outputs": [ { "data": { "text/plain": [ " Nombre Apellidos Altura Sexo Nacimiento Cof Categoria\n", "0 Will Smith 1.43 M 10/10/1920 0.19 laboral\n", "1 Jon Snow 1.98 M 10/1/1970 0.98 laboral\n", "4 Fein Mang 1.78 M 12/03/1937 0.27 cap6\n", "5 Victor Colom 1.78 M 22/09/1957 0.97 cap8" ], "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NombreApellidosAlturaSexoNacimientoCofCategoria
0WillSmith1.43M10/10/19200.19laboral
1JonSnow1.98M10/1/19700.98laboral
4FeinMang1.78M12/03/19370.27cap6
5VictorColom1.78M22/09/19570.97cap8
\n", "
" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 6 }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Funciones de agregación en grupos.\n", "\n", "El método 'aggregate' nos permite crear variables de agregación en la tabla obtenida con 'groupby'. Indicaremos la información que deseamos obtener de cada columna utilizando un diccionario. Especificamos la función que aplicaremos a los datos de cada grupo en cada columna para obtener un único valor.\n", "\n", "- Ref: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.aggregate.html" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "Para ejemplificar esta sección, agruparemos el _dataframe_ por `Categoria` laboral. En este caso para la columna `Altura` consultamos la suma de las alturas del grupo y `Cof` la media.\n", "\n", "La función `aggregate` nos permite crear variables de agregación sobre la tabla obtenida con `groupby`. Indicaremos la información que queremos obtener de cada columna con un diccionario. Especificamos la función que vamos a aplicar a los datos de cada grupo en cada columna para obtener un único valor." ] }, { "cell_type": "code", "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" }, "ExecuteTime": { "end_time": "2025-11-14T08:13:14.301171Z", "start_time": "2025-11-14T08:13:14.293704Z" } }, "source": [ "dfg = df.groupby([\"Categoria\"]).aggregate({\n", " \"Altura\":\"sum\",\n", " \"Cof\":\"mean\"})\n", "\n", "dfg" ], "outputs": [ { "data": { "text/plain": [ " Altura Cof\n", "Categoria \n", "cap6 5.32 0.530\n", "cap8 1.78 0.970\n", "laboral 3.41 0.585" ], "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AlturaCof
Categoria
cap65.320.530
cap81.780.970
laboral3.410.585
\n", "
" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 7 }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2025-11-14T08:13:15.051012Z", "start_time": "2025-11-14T08:13:15.048162Z" } }, "source": [ "dfg.index" ], "outputs": [ { "data": { "text/plain": [ "Index(['cap6', 'cap8', 'laboral'], dtype='object', name='Categoria')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 8 }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "**Podemos aplicar un gran número de funciones de agregación:**\n", "\n", "- [Funciones estadísticas](https://docs.scipy.org/doc/numpy/reference/routines.statistics.html): mean, std, ...\n", "\n", "- [Funciones matemáticas](https://docs.scipy.org/doc/numpy/reference/routines.math.html): sum, prod, ...\n", "\n", "- Otras funciones: max, min, ...\n", "\n", "- [Documentación](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.aggregate.html)\n" ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2025-11-14T08:13:15.855878Z", "start_time": "2025-11-14T08:13:15.853676Z" } }, "source": [ "type(dfg) #Alerta! Una agregación genera un dataframe y por lo tanto podemos seguir aplicando lo que ya sabemos" ], "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 9 }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2025-11-14T08:13:16.595836Z", "start_time": "2025-11-14T08:13:16.590930Z" } }, "source": [ "dfg[dfg.Altura>3]" ], "outputs": [ { "data": { "text/plain": [ " Altura Cof\n", "Categoria \n", "cap6 5.32 0.530\n", "laboral 3.41 0.585" ], "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AlturaCof
Categoria
cap65.320.530
laboral3.410.585
\n", "
" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 10 }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Agrupaciones de múltiples columnas\n", "\n", "También se pueden realizar agrupaciones de múltiples columnas. Se crean todas las combinaciones de las diversas columnas que existen en el DataFrame. Veamos un ejemplo:" ] }, { "cell_type": "code", "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "-" }, "ExecuteTime": { "end_time": "2025-11-14T08:13:17.745696Z", "start_time": "2025-11-14T08:13:17.741696Z" } }, "source": [ "gr = df.groupby(['Sexo',\"Categoria\"]).mean(numeric_only=True)\n", "gr\n", "\n" ], "outputs": [ { "data": { "text/plain": [ " Altura Cof\n", "Sexo Categoria \n", "F cap6 1.770 0.660\n", "M cap6 1.780 0.270\n", " cap8 1.780 0.970\n", " laboral 1.705 0.585" ], "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AlturaCof
SexoCategoria
Fcap61.7700.660
Mcap61.7800.270
cap81.7800.970
laboral1.7050.585
\n", "
" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 11 }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Si queremos realizar un conteo de los elementos, debemos seleccionar" ] }, { "cell_type": "code", "metadata": { "pycharm": { "name": "#%%\n" }, "scrolled": true, "ExecuteTime": { "end_time": "2025-11-14T08:13:18.900938Z", "start_time": "2025-11-14T08:13:18.896173Z" } }, "source": [ "gr = df.groupby(['Sexo',\"Categoria\"])[\"Sexo\"].count()\n", "print(gr)" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sexo Categoria\n", "F cap6 2\n", "M cap6 1\n", " cap8 1\n", " laboral 2\n", "Name: Sexo, dtype: int64\n" ] } ], "execution_count": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiindice \n", "\n", "A veces, un índice no es suficiente para expresar la meta-información que identifica una o varias columnas. Por ejemplo, una coordenada está compuesta por la latitud y la longitud.\n", "\n", "Un 'multiíndice' es una jerarquía de índices.\n", "\n", "Agrupar según diferentes criterios resulta en un multiíndice." ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2025-11-14T08:13:20.002080Z", "start_time": "2025-11-14T08:13:19.997867Z" } }, "source": [ "gr = df.groupby(['Sexo',\"Categoria\"]).count()\n", "gr.index" ], "outputs": [ { "data": { "text/plain": [ "MultiIndex([('F', 'cap6'),\n", " ('M', 'cap6'),\n", " ('M', 'cap8'),\n", " ('M', 'laboral')],\n", " names=['Sexo', 'Categoria'])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 13 }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2025-11-14T08:13:21.764160Z", "start_time": "2025-11-14T08:13:21.760510Z" } }, "source": [ "gr.loc[(\"M\",:)] # Como accedemos en rango en un multiindice? " ], "outputs": [ { "ename": "SyntaxError", "evalue": "invalid syntax (263982132.py, line 1)", "output_type": "error", "traceback": [ " \u001B[36mCell\u001B[39m\u001B[36m \u001B[39m\u001B[32mIn[14]\u001B[39m\u001B[32m, line 1\u001B[39m\n\u001B[31m \u001B[39m\u001B[31mgr.loc[(\"M\",:)] # Como accedemos en rango en un multiindice?\u001B[39m\n ^\n\u001B[31mSyntaxError\u001B[39m\u001B[31m:\u001B[39m invalid syntax\n" ] } ], "execution_count": 14 }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2025-11-14T08:13:22.938106Z", "start_time": "2025-11-14T08:13:22.932545Z" } }, "source": [ "gr.loc[\"F\"] # primer index" ], "outputs": [ { "data": { "text/plain": [ " Nombre Apellidos Altura Nacimiento Cof\n", "Categoria \n", "cap6 2 2 2 2 2" ], "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NombreApellidosAlturaNacimientoCof
Categoria
cap622222
\n", "
" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 15 }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2025-11-14T08:13:24.064872Z", "start_time": "2025-11-14T08:13:23.741714Z" } }, "source": [ "gr.loc[\"cap6\"] # dependent index" ], "outputs": [ { "ename": "KeyError", "evalue": "'cap6'", "output_type": "error", "traceback": [ "\u001B[31m---------------------------------------------------------------------------\u001B[39m", "\u001B[31mKeyError\u001B[39m Traceback (most recent call last)", "\u001B[36mFile \u001B[39m\u001B[32m~/PycharmProjects/TTAD/.venv/lib/python3.12/site-packages/pandas/core/indexes/base.py:3812\u001B[39m, in \u001B[36mIndex.get_loc\u001B[39m\u001B[34m(self, key)\u001B[39m\n\u001B[32m 3811\u001B[39m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[32m-> \u001B[39m\u001B[32m3812\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43m_engine\u001B[49m\u001B[43m.\u001B[49m\u001B[43mget_loc\u001B[49m\u001B[43m(\u001B[49m\u001B[43mcasted_key\u001B[49m\u001B[43m)\u001B[49m\n\u001B[32m 3813\u001B[39m \u001B[38;5;28;01mexcept\u001B[39;00m \u001B[38;5;167;01mKeyError\u001B[39;00m \u001B[38;5;28;01mas\u001B[39;00m err:\n", "\u001B[36mFile \u001B[39m\u001B[32mpandas/_libs/index.pyx:167\u001B[39m, in \u001B[36mpandas._libs.index.IndexEngine.get_loc\u001B[39m\u001B[34m()\u001B[39m\n", "\u001B[36mFile \u001B[39m\u001B[32mpandas/_libs/index.pyx:196\u001B[39m, in \u001B[36mpandas._libs.index.IndexEngine.get_loc\u001B[39m\u001B[34m()\u001B[39m\n", "\u001B[36mFile \u001B[39m\u001B[32mpandas/_libs/hashtable_class_helper.pxi:7088\u001B[39m, in \u001B[36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001B[39m\u001B[34m()\u001B[39m\n", "\u001B[36mFile \u001B[39m\u001B[32mpandas/_libs/hashtable_class_helper.pxi:7096\u001B[39m, in \u001B[36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001B[39m\u001B[34m()\u001B[39m\n", "\u001B[31mKeyError\u001B[39m: 'cap6'", "\nThe above exception was the direct cause of the following exception:\n", "\u001B[31mKeyError\u001B[39m Traceback (most recent call last)", "\u001B[36mCell\u001B[39m\u001B[36m \u001B[39m\u001B[32mIn[16]\u001B[39m\u001B[32m, line 1\u001B[39m\n\u001B[32m----> \u001B[39m\u001B[32m1\u001B[39m \u001B[43mgr\u001B[49m\u001B[43m.\u001B[49m\u001B[43mloc\u001B[49m\u001B[43m[\u001B[49m\u001B[33;43m\"\u001B[39;49m\u001B[33;43mcap6\u001B[39;49m\u001B[33;43m\"\u001B[39;49m\u001B[43m]\u001B[49m \u001B[38;5;66;03m# dependent index\u001B[39;00m\n", "\u001B[36mFile \u001B[39m\u001B[32m~/PycharmProjects/TTAD/.venv/lib/python3.12/site-packages/pandas/core/indexing.py:1192\u001B[39m, in \u001B[36m_LocationIndexer.__getitem__\u001B[39m\u001B[34m(self, key)\u001B[39m\n\u001B[32m 1190\u001B[39m maybe_callable = com.apply_if_callable(key, \u001B[38;5;28mself\u001B[39m.obj)\n\u001B[32m 1191\u001B[39m maybe_callable = \u001B[38;5;28mself\u001B[39m._check_deprecated_callable_usage(key, maybe_callable)\n\u001B[32m-> \u001B[39m\u001B[32m1192\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43m_getitem_axis\u001B[49m\u001B[43m(\u001B[49m\u001B[43mmaybe_callable\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43maxis\u001B[49m\u001B[43m=\u001B[49m\u001B[43maxis\u001B[49m\u001B[43m)\u001B[49m\n", "\u001B[36mFile \u001B[39m\u001B[32m~/PycharmProjects/TTAD/.venv/lib/python3.12/site-packages/pandas/core/indexing.py:1432\u001B[39m, in \u001B[36m_LocIndexer._getitem_axis\u001B[39m\u001B[34m(self, key, axis)\u001B[39m\n\u001B[32m 1430\u001B[39m \u001B[38;5;66;03m# fall thru to straight lookup\u001B[39;00m\n\u001B[32m 1431\u001B[39m \u001B[38;5;28mself\u001B[39m._validate_key(key, axis)\n\u001B[32m-> \u001B[39m\u001B[32m1432\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43m_get_label\u001B[49m\u001B[43m(\u001B[49m\u001B[43mkey\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43maxis\u001B[49m\u001B[43m=\u001B[49m\u001B[43maxis\u001B[49m\u001B[43m)\u001B[49m\n", "\u001B[36mFile \u001B[39m\u001B[32m~/PycharmProjects/TTAD/.venv/lib/python3.12/site-packages/pandas/core/indexing.py:1382\u001B[39m, in \u001B[36m_LocIndexer._get_label\u001B[39m\u001B[34m(self, label, axis)\u001B[39m\n\u001B[32m 1380\u001B[39m \u001B[38;5;28;01mdef\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[34m_get_label\u001B[39m(\u001B[38;5;28mself\u001B[39m, label, axis: AxisInt):\n\u001B[32m 1381\u001B[39m \u001B[38;5;66;03m# GH#5567 this will fail if the label is not present in the axis.\u001B[39;00m\n\u001B[32m-> \u001B[39m\u001B[32m1382\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43mobj\u001B[49m\u001B[43m.\u001B[49m\u001B[43mxs\u001B[49m\u001B[43m(\u001B[49m\u001B[43mlabel\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43maxis\u001B[49m\u001B[43m=\u001B[49m\u001B[43maxis\u001B[49m\u001B[43m)\u001B[49m\n", "\u001B[36mFile \u001B[39m\u001B[32m~/PycharmProjects/TTAD/.venv/lib/python3.12/site-packages/pandas/core/generic.py:4315\u001B[39m, in \u001B[36mNDFrame.xs\u001B[39m\u001B[34m(self, key, axis, level, drop_level)\u001B[39m\n\u001B[32m 4312\u001B[39m index = \u001B[38;5;28mself\u001B[39m.index\n\u001B[32m 4314\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(index, MultiIndex):\n\u001B[32m-> \u001B[39m\u001B[32m4315\u001B[39m loc, new_index = \u001B[43mindex\u001B[49m\u001B[43m.\u001B[49m\u001B[43m_get_loc_level\u001B[49m\u001B[43m(\u001B[49m\u001B[43mkey\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mlevel\u001B[49m\u001B[43m=\u001B[49m\u001B[32;43m0\u001B[39;49m\u001B[43m)\u001B[49m\n\u001B[32m 4316\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m drop_level:\n\u001B[32m 4317\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m lib.is_integer(loc):\n\u001B[32m 4318\u001B[39m \u001B[38;5;66;03m# Slice index must be an integer or None\u001B[39;00m\n", "\u001B[36mFile \u001B[39m\u001B[32m~/PycharmProjects/TTAD/.venv/lib/python3.12/site-packages/pandas/core/indexes/multi.py:3309\u001B[39m, in \u001B[36mMultiIndex._get_loc_level\u001B[39m\u001B[34m(self, key, level)\u001B[39m\n\u001B[32m 3307\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m indexer, maybe_mi_droplevels(indexer, ilevels)\n\u001B[32m 3308\u001B[39m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[32m-> \u001B[39m\u001B[32m3309\u001B[39m indexer = \u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43m_get_level_indexer\u001B[49m\u001B[43m(\u001B[49m\u001B[43mkey\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mlevel\u001B[49m\u001B[43m=\u001B[49m\u001B[43mlevel\u001B[49m\u001B[43m)\u001B[49m\n\u001B[32m 3310\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m (\n\u001B[32m 3311\u001B[39m \u001B[38;5;28misinstance\u001B[39m(key, \u001B[38;5;28mstr\u001B[39m)\n\u001B[32m 3312\u001B[39m \u001B[38;5;129;01mand\u001B[39;00m \u001B[38;5;28mself\u001B[39m.levels[level]._supports_partial_string_indexing\n\u001B[32m 3313\u001B[39m ):\n\u001B[32m 3314\u001B[39m \u001B[38;5;66;03m# check to see if we did an exact lookup vs sliced\u001B[39;00m\n\u001B[32m 3315\u001B[39m check = \u001B[38;5;28mself\u001B[39m.levels[level].get_loc(key)\n", "\u001B[36mFile \u001B[39m\u001B[32m~/PycharmProjects/TTAD/.venv/lib/python3.12/site-packages/pandas/core/indexes/multi.py:3410\u001B[39m, in \u001B[36mMultiIndex._get_level_indexer\u001B[39m\u001B[34m(self, key, level, indexer)\u001B[39m\n\u001B[32m 3407\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28mslice\u001B[39m(i, j, step)\n\u001B[32m 3409\u001B[39m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[32m-> \u001B[39m\u001B[32m3410\u001B[39m idx = \u001B[38;5;28;43mself\u001B[39;49m\u001B[43m.\u001B[49m\u001B[43m_get_loc_single_level_index\u001B[49m\u001B[43m(\u001B[49m\u001B[43mlevel_index\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mkey\u001B[49m\u001B[43m)\u001B[49m\n\u001B[32m 3412\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m level > \u001B[32m0\u001B[39m \u001B[38;5;129;01mor\u001B[39;00m \u001B[38;5;28mself\u001B[39m._lexsort_depth == \u001B[32m0\u001B[39m:\n\u001B[32m 3413\u001B[39m \u001B[38;5;66;03m# Desired level is not sorted\u001B[39;00m\n\u001B[32m 3414\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(idx, \u001B[38;5;28mslice\u001B[39m):\n\u001B[32m 3415\u001B[39m \u001B[38;5;66;03m# test_get_loc_partial_timestamp_multiindex\u001B[39;00m\n", "\u001B[36mFile \u001B[39m\u001B[32m~/PycharmProjects/TTAD/.venv/lib/python3.12/site-packages/pandas/core/indexes/multi.py:2999\u001B[39m, in \u001B[36mMultiIndex._get_loc_single_level_index\u001B[39m\u001B[34m(self, level_index, key)\u001B[39m\n\u001B[32m 2997\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m -\u001B[32m1\u001B[39m\n\u001B[32m 2998\u001B[39m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[32m-> \u001B[39m\u001B[32m2999\u001B[39m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43mlevel_index\u001B[49m\u001B[43m.\u001B[49m\u001B[43mget_loc\u001B[49m\u001B[43m(\u001B[49m\u001B[43mkey\u001B[49m\u001B[43m)\u001B[49m\n", "\u001B[36mFile \u001B[39m\u001B[32m~/PycharmProjects/TTAD/.venv/lib/python3.12/site-packages/pandas/core/indexes/base.py:3819\u001B[39m, in \u001B[36mIndex.get_loc\u001B[39m\u001B[34m(self, key)\u001B[39m\n\u001B[32m 3814\u001B[39m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28misinstance\u001B[39m(casted_key, \u001B[38;5;28mslice\u001B[39m) \u001B[38;5;129;01mor\u001B[39;00m (\n\u001B[32m 3815\u001B[39m \u001B[38;5;28misinstance\u001B[39m(casted_key, abc.Iterable)\n\u001B[32m 3816\u001B[39m \u001B[38;5;129;01mand\u001B[39;00m \u001B[38;5;28many\u001B[39m(\u001B[38;5;28misinstance\u001B[39m(x, \u001B[38;5;28mslice\u001B[39m) \u001B[38;5;28;01mfor\u001B[39;00m x \u001B[38;5;129;01min\u001B[39;00m casted_key)\n\u001B[32m 3817\u001B[39m ):\n\u001B[32m 3818\u001B[39m \u001B[38;5;28;01mraise\u001B[39;00m InvalidIndexError(key)\n\u001B[32m-> \u001B[39m\u001B[32m3819\u001B[39m \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;167;01mKeyError\u001B[39;00m(key) \u001B[38;5;28;01mfrom\u001B[39;00m\u001B[38;5;250m \u001B[39m\u001B[34;01merr\u001B[39;00m\n\u001B[32m 3820\u001B[39m \u001B[38;5;28;01mexcept\u001B[39;00m \u001B[38;5;167;01mTypeError\u001B[39;00m:\n\u001B[32m 3821\u001B[39m \u001B[38;5;66;03m# If we have a listlike key, _check_indexing_error will raise\u001B[39;00m\n\u001B[32m 3822\u001B[39m \u001B[38;5;66;03m# InvalidIndexError. Otherwise we fall through and re-raise\u001B[39;00m\n\u001B[32m 3823\u001B[39m \u001B[38;5;66;03m# the TypeError.\u001B[39;00m\n\u001B[32m 3824\u001B[39m \u001B[38;5;28mself\u001B[39m._check_indexing_error(key)\n", "\u001B[31mKeyError\u001B[39m: 'cap6'" ] } ], "execution_count": 16 }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2025-11-14T08:13:24.646580Z", "start_time": "2025-11-14T08:13:24.640335Z" } }, "source": [ "# Necesitamos invocar un IndexSlice.\n", "gr.loc[pd.IndexSlice[:, 'cap6'],:] \n", "# https://pandas.pydata.org/docs/reference/api/pandas.IndexSlice.html" ], "outputs": [ { "data": { "text/plain": [ " Nombre Apellidos Altura Nacimiento Cof\n", "Sexo Categoria \n", "F cap6 2 2 2 2 2\n", "M cap6 1 1 1 1 1" ], "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NombreApellidosAlturaNacimientoCof
SexoCategoria
Fcap622222
Mcap611111
\n", "
" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "En estas situaciones donde queremos acceder a esos datos, la forma más sencilla es simplemente eliminar el indice:" ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2025-11-14T08:13:26.046167Z", "start_time": "2025-11-14T08:13:26.038778Z" } }, "source": [ "gr = df.groupby(['Sexo',\"Categoria\"]).count().reset_index()\n", "gr" ], "outputs": [ { "data": { "text/plain": [ " Sexo Categoria Nombre Apellidos Altura Nacimiento Cof\n", "0 F cap6 2 2 2 2 2\n", "1 M cap6 1 1 1 1 1\n", "2 M cap8 1 1 1 1 1\n", "3 M laboral 2 2 2 2 2" ], "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SexoCategoriaNombreApellidosAlturaNacimientoCof
0Fcap622222
1Mcap611111
2Mcap811111
3Mlaboral22222
\n", "
" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 18 }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2025-11-14T08:13:26.596827Z", "start_time": "2025-11-14T08:13:26.593947Z" } }, "source": [ "gr.index = gr[\"Categoria\"]\n", "print(gr)" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Sexo Categoria Nombre Apellidos Altura Nacimiento Cof\n", "Categoria \n", "cap6 F cap6 2 2 2 2 2\n", "cap6 M cap6 1 1 1 1 1\n", "cap8 M cap8 1 1 1 1 1\n", "laboral M laboral 2 2 2 2 2\n" ] } ], "execution_count": 19 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Agregaciones avanzadas\n", "\n", "Sobre una agregación, podemos realizar operaciones más allá de las aritméticas.
\n", "\n", "Por ejemplo, si deseamos crear un histograma de la distribución de tipos de certificados por ciudad y código postal, necesitamos crear una lista para cada grupo." ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2025-11-14T08:14:00.888234Z", "start_time": "2025-11-14T08:14:00.883072Z" } }, "source": [ "df = pd.read_csv(\"../../data/Pandas/data_groups.csv\")\n", "print(df.head())\n", "print(df.shape)\n" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Unnamed: 0 Dni Nom CP Ciutat Sexe \\\n", "0 0 H61414629 María Dolores Arjona Jove 7800 Eivissa M \n", "1 1 S3138381C Núria Quirós 7511 Ruberts F \n", "2 2 J8698188C Miguel José María Gil Vargas 7340 Alaro M \n", "3 3 A48821615 Jordi Chaves Bustamante 7609 Bellavista F \n", "4 4 U0247281I Jana Rosa Collado Menéndez 7006 Palma M \n", "\n", " Tipus certificat cat Punts \n", "0 B 73 \n", "1 A 40 \n", "2 A 45 \n", "3 B 40 \n", "4 B 86 \n", "(1000, 8)\n" ] } ], "execution_count": 23 }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2025-11-14T08:14:01.940552Z", "start_time": "2025-11-14T08:14:01.933641Z" } }, "source": [ "dfc = df.groupby([\"Ciutat\",\"CP\"])[\"Tipus certificat cat\"].apply(list)\n", "dfc" ], "outputs": [ { "data": { "text/plain": [ "Ciutat CP \n", "Alaro 7340 [A, C, B, A, A, B, A, D, A, B, A, B, B, A, B, ...\n", "Ariany 7529 [B, A, B, A, A, B, A, B, A, B, A, A, C, A, B, ...\n", "Bellavista 7609 [B, C, A, A, A, C, A, A, B, A, A, A, B, A, C, ...\n", "Binissalem 7350 [A, B, B, C, C, A, A, A, C, A, C, A, B, B, D, ...\n", "Eivissa 7800 [B, A, A, B, B, B, A, A, A, A, B, C, A, A, B, ...\n", "La Savina 7870 [A, A, A, B, A, A, B, B, A, A, B, A, B, B, A, ...\n", "Mao 7701 [B, C, B, C, B, C, A, B, B, B, A, C, C, A, C, ...\n", " 7703 [B, A, B, C, A, A, B, A, C, B, A, B, C, A, B, ...\n", "Palma 7006 [B, A, B, B, A, A, A, B, B, A, A, B, B, A, B, ...\n", " 7009 [A, D, A, B, C, A, A, C, B, A, A, A, C, A, A, ...\n", " 7013 [A, A, B, C, C, C, C, A, A, A, B, A, C, C, A, ...\n", "Pedruscada 7590 [A, B, A, A, A, A, B, B, C, B, B, B, C, A, A, ...\n", "Ruberts 7511 [A, A, A, B, C, A, B, A, A, B, B, A, C, B, B, ...\n", "Name: Tipus certificat cat, dtype: object" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 24 }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2025-11-14T08:14:02.850469Z", "start_time": "2025-11-14T08:14:02.848281Z" } }, "source": [ "tipusAlaro = dfc.loc[pd.IndexSlice[\"Alaro\",7340]]\n", "\n", "values, counts = np.unique(tipusAlaro, return_counts=True)\n", "print(values) # Tipus \n", "print(counts) # quantitat\n", "print(\"-\"*40)\n", "distribucioAlaro = dict(zip(values,counts)) ## Que fa el ZIP?!\n", "print(distribucioAlaro)" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['A' 'B' 'C' 'D']\n", "[57 42 16 5]\n", "----------------------------------------\n", "{np.str_('A'): np.int64(57), np.str_('B'): np.int64(42), np.str_('C'): np.int64(16), np.str_('D'): np.int64(5)}\n" ] } ], "execution_count": 25 }, { "metadata": {}, "cell_type": "markdown", "source": [ "En pandas, además de utilizar las funciones de agregación predefinidas como `sum()`, `mean()`, `min()` o `max()`, también podemos emplear funciones propias (definidas por el usuario) para realizar agregaciones más flexibles y adaptadas a nuestras necesidades.\n", "\n", "Estas funciones personalizadas pueden ser cualquier función de Python que reciba una serie de datos como entrada y devuelva un único valor como resultado. Esto permite aplicar cálculos específicos, métricas avanzadas o transformaciones que no están disponibles de forma nativa en pandas." ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2025-11-14T08:14:04.444517Z", "start_time": "2025-11-14T08:14:04.438161Z" } }, "source": [ "df2 = df.groupby([\"Ciutat\",\"CP\"]).agg({\"Tipus certificat cat\": np.size}) # lambda !\n", "\n", "print(df2)" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Tipus certificat cat\n", "Ciutat CP \n", "Alaro 7340 120\n", "Ariany 7529 113\n", "Bellavista 7609 98\n", "Binissalem 7350 93\n", "Eivissa 7800 91\n", "La Savina 7870 90\n", "Mao 7701 63\n", " 7703 45\n", "Palma 7006 36\n", " 7009 23\n", " 7013 37\n", "Pedruscada 7590 106\n", "Ruberts 7511 85\n" ] } ], "execution_count": 26 }, { "metadata": {}, "cell_type": "markdown", "source": [ "#### Funciones Lambda\n", "\n", "En **Python**, las **funciones lambda** permiten definir funciones pequeñas y anónimas de forma compacta. Se utilizan especialmente cuando necesitamos una función sencilla para una operación puntual y no queremos crear una función completa con `def`.\n", "\n", "Una función lambda se escribe en una sola línea y puede recibir uno o varios argumentos, pero solo puede contener una expresión, cuyo resultado se devuelve automáticamente. Gracias a su simplicidad, las lambdas son muy útiles en operaciones como filtrado, transformación o ordenación de datos, así como al trabajar con funciones de orden superior como `map()`, `filter()` o `sorted()`.\n", "\n", "Su uso hace el código más expresivo y conciso en situaciones donde definir una función tradicional sería innecesariamente largo. Sin embargo, se recomienda utilizarlas únicamente para tareas simples, manteniendo la legibilidad del código como prioridad." ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2025-11-14T08:15:37.394421Z", "start_time": "2025-11-14T08:15:37.392008Z" } }, "source": [ "a = np.array([10,10,4,5,7,8,12,4507,30])\n", "b = list(map(lambda i:i**2+2*i,a))\n", "\n", "def f1(i):\n", " return i**2+2\n", "\n", "b = list(map(f1,a))\n", "\n", "print(b)" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[np.int64(102), np.int64(102), np.int64(18), np.int64(27), np.int64(51), np.int64(66), np.int64(146), np.int64(20313051), np.int64(902)]\n" ] } ], "execution_count": 29 }, { "metadata": {}, "cell_type": "markdown", "source": "El siguiente código muestra cómo definir una función personalizada en Python para utilizarla como parte de una agregación en pandas. En este caso, la función `miBarem50p` recibe una serie y cuenta cuántos valores son mayores que 50. Posteriormente, esta función se aplica dentro de un `groupby` mediante una función `lambda`, lo que permite integrar la lógica propia dentro del proceso de agregación junto con otras funciones como `np.size`. Esto hace posible obtener métricas específicas adaptadas a nuestras necesidades durante el análisis de los datos." }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2025-11-14T08:14:05.975198Z", "start_time": "2025-11-14T08:14:05.966396Z" } }, "source": [ "def miBarem50p(serie):\n", " up50list = []\n", " for value in serie.values:\n", " if value>50:\n", " up50list.append(value)\n", " return len(up50list)\n", "\n", "\n", "df3 = df.groupby([\"Ciutat\",\"CP\"]).agg(\n", " {\"Punts\": [lambda x: miBarem50p(x), np.size]})\n", "\n", "print(df3)" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Punts \n", " size\n", "Ciutat CP \n", "Alaro 7340 61 120\n", "Ariany 7529 56 113\n", "Bellavista 7609 43 98\n", "Binissalem 7350 50 93\n", "Eivissa 7800 44 91\n", "La Savina 7870 41 90\n", "Mao 7701 34 63\n", " 7703 18 45\n", "Palma 7006 19 36\n", " 7009 10 23\n", " 7013 19 37\n", "Pedruscada 7590 65 106\n", "Ruberts 7511 36 85\n" ] } ], "execution_count": 28 }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Ejercicios" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "**1) Usando el fichero WHO.csv, ¿Cuál es el volumen total de CO2 emitido por cada continente?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "**2) ¿Cuál es el número de paises por continente?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "**3) Del conjunto \"who.csv\" selecciona 30 paises al azar y sobre ellos calcula la media de \"Net primary school enrolment ratio female (%)\" agrupados por: Continente**\n", "\n", "Nota: la selección de 30 paises aleatoria ha de ser reproducible" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "**3b) Repite la anterior actividad pero ahora con todos los paises. ¿Sale la misma media?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**4) Calcula la cantidad de ayuda recibida por cada municipio en función del númeto total de habitantes. (v2)**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataframe de Municipios:\n", " Nombre Código Postal Población\n", "0 Municipio1 60494 39232\n", "1 Municipio2 65125 15315\n", "2 Municipio3 15306 34075\n", "3 Municipio4 43936 10127\n", "4 Municipio5 77013 19470\n", "5 Municipio6 73691 10158\n", "6 Municipio7 63075 7214\n", "7 Municipio8 49755 41525\n", "8 Municipio9 72468 17417\n", "9 Municipio10 56930 35902\n", "\n", "Dataframe de Ayudas:\n", " Nombre Ayuda Económica (en euros) Número de Beneficiarios\n", "0 Municipio10 9989 72\n", "1 Municipio3 1230 23\n", "2 Municipio5 2528 48\n", "3 Municipio2 7534 80\n", "4 Municipio2 1018 47\n", "5 Municipio6 9086 100\n", "6 Municipio8 6458 25\n", "7 Municipio9 4996 80\n", "8 Municipio2 6328 52\n", "9 Municipio6 2031 79\n", "10 Municipio7 4130 36\n", "11 Municipio6 4632 87\n", "12 Municipio10 4909 80\n", "13 Municipio4 3334 85\n", "14 Municipio9 9896 46\n", "15 Municipio8 8339 66\n", "16 Municipio8 2494 21\n", "17 Municipio9 2318 86\n", "18 Municipio5 6243 59\n", "19 Municipio1 9322 50\n" ] } ], "source": [ "# V2. Con necesidad de agrupar\n", "import pandas as pd\n", "import random\n", "\n", "random.seed(0)\n", "\n", "nombres = [f'Municipio{i}' for i in range(1, 11)] \n", "\n", "data_municipios = {\n", " 'Nombre': nombres,\n", " 'Código Postal': [random.randint(10000, 99999) for _ in range(10)],\n", " 'Población': [random.randint(1000, 50000) for _ in range(10)] # Añadimos un atributo aleatorio, en este caso \"Población\"\n", "}\n", "\n", "df_municipios = pd.DataFrame(data_municipios)\n", "\n", "\n", "data_ayudas = {\n", " 'Nombre': [random.choice(nombres) for _ in range(20)],\n", " 'Ayuda Económica (en euros)': [random.randint(1000, 10000) for _ in range(20)],\n", " 'Número de Beneficiarios': [random.randint(10, 100) for _ in range(20)]\n", "}\n", "\n", "df_ayudas = pd.DataFrame(data_ayudas)\n", "\n", "print(\"Dataframe de Municipios:\")\n", "print(df_municipios)\n", "\n", "print(\"\\nDataframe de Ayudas:\")\n", "print(df_ayudas)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**5) Agrupa los datos por el nombre de de la escuela.**\n", "- ¿Qué escuela tiene más infantes?\n", "- ¿Qué escuela tiene los infantes más altos?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = {\n", " 'school': ['s001', 's002', 's003', 's001', 's002', 's004'],\n", " 'class': ['V', 'V', 'VI', 'VI', 'V', 'VI'],\n", " 'name': ['Alberto Franco', 'Gino Mcneill', 'Ryan Parkes', 'Eesha Hinton', 'Gino Mcneill', 'David Parkes'],\n", " 'date_Of_Birth': ['15/05/2002', '17/05/2002', '16/02/1999', '25/09/1998', '11/05/2002', '15/09/1997'],\n", " 'age': [12, 12, 13, 13, 14, 12],\n", " 'height': [173, 192, 186, 167, 151, 159],\n", " 'weight': [35, 32, 33, 30, 31, 32],\n", " 'address': ['street1', 'street2', 'street3', 'street1', 'street2', 'street4']\n", "}\n", "\n", "df = pd.DataFrame(data, index=['S1', 'S2', 'S3', 'S4','S5','S6'])\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#TODO dejar esto pendiente para isaac futuro\n", "#hola\n", "#HOLA\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**6) Dado el siguiente listado de ventas:**\n", "- ¿Qué comprador ha gastado más?\n", "- ¿Qué vendedor ha hecho más ventas?" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Producto Precio Comprador Vendedor\n", "0 Producto 1 47 Juan Laura\n", "1 Producto 2 22 Maria Laura\n", "2 Producto 3 82 Pedro Elena\n", "3 Producto 4 19 Maria Miguel\n", "4 Producto 5 85 Juan Carlos\n", "5 Producto 6 15 Ana Miguel\n", "6 Producto 7 89 Juan Laura\n", "7 Producto 8 74 Maria Laura\n", "8 Producto 9 26 Juan Laura\n", "9 Producto 10 11 Pedro Elena\n", "10 Producto 11 86 Maria Elena\n", "11 Producto 12 81 Maria Laura\n", "12 Producto 13 16 Juan Miguel\n", "13 Producto 14 35 Ana Laura\n", "14 Producto 15 60 Ana Laura\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "# Crear datos aleatorios\n", "np.random.seed(1)\n", "n = 15\n", "productos = ['Producto ' + str(i) for i in range(1, n+1)]\n", "precios = np.random.randint(10, 100, n)\n", "compradores = np.random.choice(['Juan', 'Pedro', 'Maria', 'Ana'], n)\n", "vendedores = np.random.choice(['Carlos', 'Laura', 'Miguel', 'Elena'], n)\n", "\n", "# Crear DataFrame\n", "df_ventas = pd.DataFrame({\n", " 'Producto': productos,\n", " 'Precio': precios,\n", " 'Comprador': compradores,\n", " 'Vendedor': vendedores\n", "})\n", "\n", "print(df_ventas)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "#TODO\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**7) Fichero \"data/data_groups_cursos.csv\"; contiene cursos realizados por personas. Cada muestra corresponde a un curso.**\n", "Se pide un listado del número de cursos realizado por cada persona y una valoración de sus puntos según el tipo de curso realizado:\n", "- 'A': 3 puntos\n", "- 'B': 2 puntos\n", "- 'C': 1 punto\n", "- 'D': 0.5 punto\n", "- \"E\": 0.5 punto\n", "- \"F\": 1 punto" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# TODO" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10\n", "6\n" ] } ], "source": [ "a = np.array([1,2,3,4,5,6,10])\n", "print(a.max())\n", "print(a.argmax())\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categorización de datos\n", "\n", "TODO\n", "Valores ya categorizados?\n", "\n", "https://pandas.pydata.org/docs/user_guide/categorical.html\n", "\n", "\n", "\n", "### Función `cut`\n", "La función `cut` de Pandas es una herramienta útil para segmentar y discretizar datos en intervalos o categorías. Esta función es especialmente útil cuando desea convertir una variable numérica continua en una variable categórica al dividirla en intervalos o categorías específicas.\n", "\n", "[Documentación](https://pandas.pydata.org/docs/reference/api/pandas.cut.html)\n", "\n", "Crearemos un nuevo DataFrame para realizar pruebas:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
candidatnota
015
120
233
343
457
569
673
785
892
9104
\n", "
" ], "text/plain": [ " candidat nota\n", "0 1 5\n", "1 2 0\n", "2 3 3\n", "3 4 3\n", "4 5 7\n", "5 6 9\n", "6 7 3\n", "7 8 5\n", "8 9 2\n", "9 10 4" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "np.random.seed(0)\n", "df_random = pd.DataFrame({\"candidat\":np.arange(1,11), \"nota\":np.random.randint(0,11,size=10)})\n", "df_random" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A continuación, creamos una columna nueva que contiene la nota pero en escala categórica:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
candidatnotanotaCategorica
015mig
120dolent
233dolent
343dolent
457bona
569bona
673dolent
785mig
892dolent
9104mig
\n", "
" ], "text/plain": [ " candidat nota notaCategorica\n", "0 1 5 mig\n", "1 2 0 dolent\n", "2 3 3 dolent\n", "3 4 3 dolent\n", "4 5 7 bona\n", "5 6 9 bona\n", "6 7 3 dolent\n", "7 8 5 mig\n", "8 9 2 dolent\n", "9 10 4 mig" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_random[\"notaCategorica\"] = pd.cut(df_random.nota, 3, labels=[\"dolent\", \"mig\", \"bona\"])\n", "df_random" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
candidatnotanotaCategorica
457bona
569bona
\n", "
" ], "text/plain": [ " candidat nota notaCategorica\n", "4 5 7 bona\n", "5 6 9 bona" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_random[df_random.notaCategorica==df_random.notaCategorica.max()]" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
candidatnotanotaCategorica
015Be
120Suspens
233Suspens
343Suspens
457Be
569Excel·lent
673Suspens
785Be
892Suspens
9104Suspens
\n", "
" ], "text/plain": [ " candidat nota notaCategorica\n", "0 1 5 Be\n", "1 2 0 Suspens\n", "2 3 3 Suspens\n", "3 4 3 Suspens\n", "4 5 7 Be\n", "5 6 9 Excel·lent\n", "6 7 3 Suspens\n", "7 8 5 Be\n", "8 9 2 Suspens\n", "9 10 4 Suspens" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# mejora\n", "bins = [-1, 4, 7, 8.5, 10]\n", "labels = ['Suspens', 'Be', 'Notable', 'Excel·lent']\n", "df_random[\"notaCategorica\"] = pd.cut(df_random.nota, bins, labels=labels) # include_lowest=True\n", "df_random" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0. 0.25 0.5 0.75 1. ]\n", "----------\n", "0 (3.5, 5.0]\n", "1 (-0.001, 3.0]\n", "2 (-0.001, 3.0]\n", "3 (-0.001, 3.0]\n", "4 (5.0, 9.0]\n", "5 (5.0, 9.0]\n", "6 (-0.001, 3.0]\n", "7 (3.5, 5.0]\n", "8 (-0.001, 3.0]\n", "9 (3.5, 5.0]\n", "Name: nota, dtype: category\n", "Categories (4, interval[float64, right]): [(-0.001, 3.0] < (3.0, 3.5] < (3.5, 5.0] < (5.0, 9.0]]\n", "----------\n", "(-0.001, 3.0] 5\n", "(3.5, 5.0] 3\n", "(5.0, 9.0] 2\n", "(3.0, 3.5] 0\n", "Name: nota, dtype: int64\n", "0 5\n", "1 0\n", "2 3\n", "3 3\n", "4 7\n", "5 9\n", "6 3\n", "7 5\n", "8 2\n", "9 4\n", "Name: nota, dtype: int64\n" ] } ], "source": [ "# Podemos definir el intervalo que deseemos (como un Histograma)\n", "import numpy as np\n", "marks = np.arange(0,1.1,0.25)\n", "print(marks)\n", "print\n", "print(\"-\"*10)\n", "factors = pd.qcut(df_random.nota, marks)\n", "print(factors)\n", "print(\"-\"*10)\n", "print(pd.value_counts(factors))\n", "print(df_random.nota)" ] }, { "cell_type": "markdown", "metadata": {}, "source": "### Actividad" }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(4557, 28)\n", "count 4555.000000\n", "mean 0.013723\n", "std 0.214786\n", "min 0.000000\n", "25% 0.000000\n", "50% 0.000000\n", "75% 0.000000\n", "max 7.010000\n", "Name: snowfall, dtype: float64\n" ] } ], "source": [ "# Del fichero:\n", "\n", "import pandas as pd\n", "df = pd.read_csv(\"data/rdu-weather-history.csv\",sep=\";\") \n", "# print(df.head())\n", "# print(df.columns)\n", "print(df.shape)\n", "print(df.snowfall.describe())\n", "\n", "# values, repeticiones = np.unique(df.snowfall,return_counts=True)\n", "# print(values)\n", "# print(repeticiones)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Queremos discretizar la variable `snowfall` para tener 4 categorias de nevada.\n", "\n", "En primer lugar, usa la función `qcut` para hacer esta categorización. Haz una prueba y analiza por qué está función quizás no sea la mejor opción.
\n", "Ahora, crea 4 categorías utilizando la función `cut`:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
\n", "Isaac Lera and Gabriel Moya
\n", "Universitat de les Illes Balears
\n", "isaac.lera@uib.edu, gabriel.moya@uib.edu" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "my3110", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 1 }