This reference sheet contains the key objects that we use in DSCI 100, and a brief syntax example for each of the main packages. During the closed book exams, you will still have access to this page, so get familiar with it already now. There is no guarantee that every function or parameter in the textbook is covered here, but if you think there is something missing, please let us know and we can consider adding it.
Note that we have only described use cases relevant to DSCI 100.
For example, the function reference for a pandas data frame is pd.DataFrame(data_as_dict)
,
because in DSCI 100 we show you how to create data frames from a dictionary in this manner.
Although there are many other ways of creating data frames
and many parameters that you could use inside pd.DataFrame
,
we have opted to only include what we use in DSCI 100
to make this guide more useful for you.
Sometimes we have included the exact parameter name of a function,
e.g. drop(columns)
,
other times we have opted to included a more descriptive name,
e.g. agg(list_of_aggregations)
.
pandas
)A typical data frame operation would look something like this:
import pandas as pd
df.loc[
df['column2'] < 10,
['column2', 'column3']
].mean()
pandas
functions, prefix with pd.
Name | Description |
---|---|
DataFrame(data_as_dict) | A two-dimensional, heterogeneous tabular data structure with labeled axes |
Series(data_as_list) | A one-dimensional labeled array capable of holding data of various types |
crosstab(df) | Compute a cross-tabulation of two (or more) factors, e.g. a confusion matrix. |
concat(dfs_or_series) | Concatenate pandas objects (dfs, series) along a particular axis |
read_csv(filepath, sep) | Read data from a CSV file |
read_excel(filepath, sheet) | Read data from an Excel file |
read_html(filepath) | Parse HTML tables from a web page and return a list of DataFrames |
to_csv(filepath, index) | Write a DataFrame to a CSV file |
Data frame methods, prefix with the name of the data frame (e.g. df.
)
Name | Description |
---|---|
abs() | Convert to absolute value |
agg(list_of_aggregations) | Perform multiple aggregations (‘mean’, ‘median’, etc) |
apply(func) | Apply a function to every column or row in the df |
assign(newcol=oldcol*2) | Create a new column |
astype(dtype) | Convert the data type of a column |
describe() | Calculate common descriptive statistics |
drop(columns) | Remove the specified row(s) or column(s) |
dropna() | Remove the rows that contains NULL/NA values |
groupby(column_as_str) | Group rows together if share value in the specified column |
contains(string) | For searching for a str within the values of a df column |
isin(values) | For each value in a column, check if it is present in the specified values |
iloc[] | Select rows and columns by their integer positions |
info() | Print a concise summary of a df |
loc[] | Filter rows and select columns at the same time |
max() | Return the maximum value of the columns |
mean() | Return the mean value of the columns |
melt(id_vars, var_name, value_name) | Unpivot a df from wide to long format |
merge(df, on) | Merge data frames or series |
min() | Return the minimum value of the columns |
nlargest() | Return top n rows after sorting the df by a column’s highest value |
nsmallest() | Return top n rows after sorting the df by a column’s smallest value |
pivot(index, columns, values) | Turn df from long to wide |
quantile() | Return the value at the specified quantile |
query() | Filter a data frame based on row condition query, similar to []. |
rename(columns) | Rename columns in a df |
replace(to_replace, value) | Replace specific values to desired new values |
reset_index() | Reset the index of the df to a numerical range |
round() | Round values in a df |
sample(n, frac, replace) | Return a random sample of n rows from a df |
shape | Find the number of rows and cols of the df |
sort_values() | Order the rows of a df by the values of certain col |
str.startswith() | Test if the start of each str element in a column matches a pattern |
sum() | Return the sum of the values over the requested axis |
tail() | Return a specified number of rows from the end of the df |
unique() | See all unique values in a column |
value_counts() | Return a Series containing the frequency of each distinct value in the column |
Data base methods, perform on a database connection or a table
Name | Description |
---|---|
ibis.sqlite.connect(database, host, port, user, password) | Connect to a database |
list_tables() | List all tables in the database |
table() | Connect to a specific table in a database |
order_by() | The same as sort_values for a data frame |
execute() | Execute a database table operation to yield a data frame from the database |
altair
)A typical chart syntax would look something like this:
import altair as alt
alt.Chart(df).mark_point().encode(
x='column1',
y=alt.Y('column2').title('Column 2')
)
Name | Description |
---|---|
Chart(df) | Specity the data used for the chart |
Color, X, Y, … | Helpers for modifying the corresponding encoding, eg. adding a title |
axis(tickCount, format) | Modify the axis format of a single axis |
configure_axis(titleFontSize) | Modify the axis format of all axes |
mark_point(color, opacity, size) | Represent the data with the corresponding graphical mark |
bin(maxbins) | Group the values of a column into bins (buckets) |
‘count()’, ‘mean()’, etc | Special encoding strings to compute the count, mean, etc of a column |
datum(value) | A special encoding that creates data points directly instead of using a data frame |
facet(column_as_str, columns) | Create multiple views of a dataset where each panel contains a different subset |
disable_max_rows() | Disable the default maximum row limit for displaying data, which simplifies working with large data frames |
encode(x, y, color, …) | Specify how data columns are encoded as visual channels (x, y, color, etc) |
& (ampersand) | Concatenating charts vertically |
| (pipe) | Concatenating charts horizontally |
+ (plus) | Layer multiple charts on top of each other |
legend(orient) | Modify the legend format and properties for the chart |
properties(width, height) | Set various properties and configurations for the chart |
resolve_scale(x, y) | Control if scales are shared or independent between charts |
save(‘filename’) | Save the chart to a file |
scale(zero, type) | Modify scale properties for encoding channels in the chart |
stack() | Control if marks (e.g. bars of areas) should stack on top of each other |
title() | Add a title to the axis |
scikit-learn
)Setting up a typical scikit-learn model would look something like this:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
knn.score(X_test, y_test)
Name | Description |
---|---|
GridSearchCV(estimator, param_grid, cv) | Perform an exhaustive search over a hyperparameter grid to find the best combination using cross-validation. |
KMeans(n_clusters) | Initialize a K-Means clustering algorithm for grouping data points into clusters based on similarity. |
KNeighborsClassifier(n_neighbors) | Initialize a k-Nearest Neighbors classifier for classification tasks. |
KNeighborsRegressor(n_neighbors) | Initialize a k-Nearest Neighbors regressor for regression tasks. |
LinearRegression() | Initialize a Linear regression model for predicting continuous target values from input features. |
SimpleImputer() | Initialize a Imputation transformer for handling missing data using simple strategies. |
StandardScaler() | Initialize a Scaler for standardizing features by subtracting the mean and scaling to unit variance. |
best_params_ | Attribute in GridSearchCV containing the best hyperparameters found during the grid search. |
coef_ | Attribute in linear models (e.g., LinearRegression) containing the estimated coefficients of features. |
cross_validate() | Function for evaluating a model’s performance using cross-validation and returning multiple scores. |
cv_results_ | Attribute in GridSearchCV containing detailed results from cross-validation grid search. |
euclidean_distances(x, y) | Compute pairwise Euclidean distances between points in two datasets. |
fit(X, y) | Fit/train the model |
get_params() | Method to retrieve the hyperparameters of an estimator. |
inertia_ | Attribute in KMeans indicating the sum of squared distances from samples to their cluster centers. |
intercept_ | Attribute in linear models (e.g., LinearRegression) containing the intercept term of the model. |
make_column_selector(dtype_include) | Function for creating a column selector for transformers in a ColumnTransformer. |
make_column_transformer((transformer, list_of_columns), remainder) | Function for creating a composite transformer for feature preprocessing. |
make_pipeline(preprocessor, model) | Function for creating a composite estimator (pipeline) with specified preprocessing and model steps. |
mean_squared_error() | Function to calculate the mean squared error between true and predicted values for regression tasks. |
named_steps | Attribute in a Pipeline object providing access to individual steps by name. |
predict(X) | Method used to make predictions on new data samples for various estimators. |
score(X, y) | Method for calculating the accuracy or performance metric of a classifier or regressor. |
set_config() | Function to configure global scikit-learn settings and behavior. |
train_test_split(df, train_size, stratify) | Function for splitting a dataset into training and testing sets for model evaluation. |
Name | Description |
---|---|
[print(num**2) for num in range(10)] | A list comprehension to loop through the given range and perform an operation (here printing the square of the numbers) |
append() | Add elements to the end of a list or array |
enumerate(iterable) | Generate indices and values from an iterable |
print() | Display text or variable values to the console |
range(start, stop, step) | Create a range with regularly spaced values |
np.arange(start, stop, step) | Create an array with regularly spaced values |
np.array(list) | Create a NumPy array from a list or other iterable |
np.seed(seed) | Seed the random number generator for reproducibility |
np.sqrt(x) | Calculate the square root of a numeric value or array |