Linking the Datasets

Once we had concluded the initial exploration of our datasets and decided to focus on sardines, we looked for ways to relate the datasets to each other. After all, we had such a plethora of data that it was natural to try to get the most out of it and stop treating each dataset as an independent entity.

Relationship Between Sardine Larvae and Sardine Catch

import plotly.express as px
import pandas as pd
import numpy as np
import scipy
from scipy.stats import pearsonr
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression
from scipy import stats
import plotly.io as pio

pd.options.mode.chained_assignment = None

# load the sardine dataset and its 1-year-lagged counterpart
sardine_data = pd.read_csv("data/sardine_data.csv")
sardine_data2 = pd.read_csv("data/lagged_sardine_data.csv")
sardine_data = sardine_data.rename(columns={"Sardine Larvae lbs": "Count"})
sardine_data['Count'] = sardine_data['Count'].round(0)
sardine_data2 = sardine_data2.rename(columns={"Sardine Larvae lbs": "Count"})
sardine_data2['Count'] = sardine_data2['Count'].round(0)

# fit a simple linear regression of larvae abundance on fishery catch
X = sardine_data['CatchLbs'].values.reshape(-1,1)
Y = sardine_data['Count'].values.reshape(-1,1)
linear_regressor = LinearRegression()
linear_regressor.fit(X, Y)
Y_pred = linear_regressor.predict(X) # fitted values (the plot below uses plotly's own OLS trendline)

# flatten back to 1-D arrays for the Pearson correlation below
Y = np.array(Y).reshape(-1,)
X = np.array(X).reshape(-1,)

fig = px.scatter(sardine_data, x='CatchLbs', y='Count', trendline="ols", labels = {"Count" : 'Sardine Larvae Abundance'}, title ='Sardine Larvae lbs vs Sardine Catch')
fig.show()
print("Pearson Correlation:", stats.pearsonr(X, Y))
Pearson Correlation: (0.5965469231213448, 0.0002482500982253997)

Our results showcase a first numeric result, a Pearson correlation of around 0.5965, which indicates a moderate positive linear correlation between sardine catch and sardine larvae. From this we can infer that when more sardine larvae are caught in a given year, there are likely more sardines in the ocean able to mate, which in turn lay more larvae.

The second numeric result, 0.00024, is a p-value that tests whether these variables are correlated at all. The hypothesis test, at a 5% significance level, is as follows:
H0 (null hypothesis): there is no correlation between sardine catch and sardine larvae
H1 (alternative hypothesis): there is a correlation between sardine catch and sardine larvae

In simpler terms, if the p-value is below 0.05, we reject the null hypothesis and conclude that the two variables we are testing do in fact have a linear correlation with one another. Since our p-value is far below that threshold and the Pearson correlation is positive, we can conclude that there is a positive linear correlation between sardine larvae and sardine catch.
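
To make the decision rule explicit, here is a minimal sketch that reuses the sardine_data frame loaded above; the r, p_value, and alpha names are just local variables for the correlation statistic, the p-value, and the 5% threshold.

# minimal sketch: apply the 5% decision rule to the Pearson test above
r, p_value = stats.pearsonr(sardine_data['CatchLbs'], sardine_data['Count'])
alpha = 0.05
if p_value < alpha:
    print(f"r = {r:.4f}, p = {p_value:.5f} < {alpha}: reject H0, the correlation is statistically significant")
else:
    print(f"r = {r:.4f}, p = {p_value:.5f} >= {alpha}: fail to reject H0")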

Lagged Correlation and Analysis

Now, what if we want to see whether there is any connection between fish larvae and the adults they grow into being caught in the future? We can visualize this through a lagged correlation. According to [NOAA](https://www.fisheries.noaa.gov/species/pacific-sardine#:~:text=They reproduce at age 1,hatch in about 3 days), it takes about 1-2 years, depending on conditions, for the Pacific sardine to mature and become able to reproduce. Thus, we can set back the catch lbs data by 1 year to account for the time it takes for the sardine larvae to reach adulthood. We chose 1 year as our parameter based on this [article](http://calcofi.org/~calcofi/publications/calcofireports/v37/Vol_37_Butler_etal.pdf), which explains how, after the first population collapse of the 1940s, most Pacific sardines were generally able to reproduce at age 1, with some individuals able to do so even earlier.
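
We do not reproduce the exact preprocessing behind data/lagged_sardine_data.csv here, but a one-year lag of this kind could be built along the following lines. This is only a sketch, assuming sardine_data has one row per year and is sorted chronologically; the actual file may have been prepared differently, and the lagged variable here is hypothetical.

# hypothetical sketch of a 1-year lag: pair the larvae observed in year t
# with the fishery catch recorded in year t + 1
# (assumes one row per year, sorted in ascending year order)
lagged = sardine_data.copy()
lagged['CatchLbs'] = lagged['CatchLbs'].shift(-1)   # pull next year's catch onto this row
lagged = lagged.dropna(subset=['CatchLbs'])         # the final year has no following catch

With the one-year lag in place, we can plot and visualize our results as follows: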

X = sardine_data2['CatchLbs'].values.reshape(-1,1)
Y = sardine_data2['Count'].values.reshape(-1,1)
linear_regressor = LinearRegression()
linear_regressor.fit(X, Y)
Y_pred = linear_regressor.predict(X)
Y = np.array(Y).reshape(-1,)
X = np.array(X).reshape(-1,)

fig = px.scatter(sardine_data2, x='CatchLbs', y='Count', trendline="ols", labels = {"Count" : 'Sardine Larvae Abundance'}, title = 'Lagged Correlation for Sardine Larvae vs Sardine Catch')
fig.show()
print("Pearson Correlation:", stats.pearsonr(X, Y))
Pearson Correlation: (0.5876176838067256, 0.00040596039012906216)

Introducing Cross Correlation

We sought to investigate this lagged relationship in more detail. The tool we ended up settling on is called cross correlation. Cross correlation is similar to regular correlation, in the sense that it measures a relationship between two variables. However, cross correlation has the additional parameter of time. In other words, it allows one of the variables to be “offset” to see if there is a relationship when the time series aren’t perfectly aligned.
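
Conceptually, the computation is small: shift one of the series by k years, keep only the years where the two series still overlap, and recompute the Pearson correlation. Below is a minimal sketch of that idea, assuming two pandas Series indexed by year (the cross_corr function is illustrative only); the full implementation we actually ran on our datasets follows further down.

def cross_corr(larva, fishery, k):
    # relabel the catch from year t as belonging to year t - k, so that
    # larvae in year t line up with the fishery catch made k years later
    shifted = fishery.copy()
    shifted.index = shifted.index - k
    paired = pd.concat([larva, shifted], axis=1, join='inner').dropna()
    return paired.iloc[:, 0].corr(paired.iloc[:, 1])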

We thought this would be perfect for our data: if we saw a large increase in larval abundance in a particular year, for example, we would expect to see a larger fishery catch some time later, which would show up as a larger correlation at that specific year offset. Check out what we discovered by hitting the play button below!

COMMON_TO_SCIENTIFIC = {"Anchovy, northern": 'Engraulis.mordax', 
"Mackerel, jack": 'Trachurus.symmetricus', 
"Mackerel, Pacific": 'Scomber.japonicus',
"Opah": 'Lampridiformes1',
"Sardine, Pacific": 'Sardinops.sagax',
# "Yellowtail": 'Seriola.lalandi' # Not enough datapoints for yellowtail
}



def group_by_year(scientificName, commonName, treat_NaN_as_zeros = False):
    """
        Returns a DataFrame with columns [Year, Fishery, Larva], where Fishery and Larva hold the total pounds caught for the species in each year.

        scientificName: Scientific name of the target fish (for larval data)
        commonName: Common name of the target fish (for fishery data) 
        treat_NaN_as_zeros: setting it to True will treat a missing larval catch for a certain year as 0. Setting it to False will skip that year entirely (False by default)
    """

    # Find all catches with the current species
    catches_with_species = cleaned_fishery[cleaned_fishery['Species Name'] == commonName] 

    # interpret them as floats and group by sums of pounds per year
    catches_with_species.loc[:,'Year'] = catches_with_species['Year'].astype(float)
    catches_with_species.loc[:,'Pounds'] = catches_with_species['Pounds'].astype(float)
    catches_with_species = catches_with_species.groupby('Year').sum()


    # find all nonzero larval catches for this species after 1980 and sum them per year
    caught_species_larva = larva_orig[larva_orig[scientificName] != 0]
    caught_species_larva = caught_species_larva[caught_species_larva['year'] > 1980]
    caught_species_larva = caught_species_larva.groupby('year').sum()

    larva_array = caught_species_larva[scientificName]
    result = []

    for year, caught_pounds in catches_with_species['Pounds'].items():
        try:
            result.append(np.array([year, caught_pounds, larva_array[int(year)]])) # raises KeyError if there were no larval catches for this species in that year
        except KeyError:
            if treat_NaN_as_zeros:
                result.append(np.array([year, caught_pounds, 0])) # if we want to treat a nonexistent larval catch as 0

    # Result will be a 2 dimensional np array where the first column is a year, so construct a dataframe from it
    return pd.DataFrame(data = np.array(result), columns=["Year", "Fishery", "Larva"])

def local_correlation(df):
    """
        Returns a pearson correlation between the Fishery and Larva columns of the dataframe passed in
    """
    return scipy.stats.pearsonr(df['Fishery'], df['Larva'])



def offset_larva_catch(scientificName, commonName, offset, treat_NaN_as_zeros = False):
    """
        Returns a modified version of the dataset where the Fishery Catches are shifted later by the offset. For example, if a certain fish had n catches in 2008, and offset is 2, 
        the returned dataset would have n in 2010. This is useful in calculating correlation with offset year.

        scientificName: Scientific name of the target fish (for larval data)
        commonName: Common name of the target fish (for fishery data) 
        offset: the number of years to shift fishery catches by. If offset is negative, the larva data will be shifted back instead.
        treat_NaN_as_zeros: setting it to True will treat a missing larval catch for a certain year as 0. Setting it to False will skip that year entirely (False by default)
    """
    orig_dataset = group_by_year(scientificName, commonName, treat_NaN_as_zeros).to_numpy() # convert the dataset to numpy for easier indexing
    result = []
    if offset > 0:
        for i in range(len(orig_dataset) - abs(offset)):
            result.append(np.array([orig_dataset[i+offset][0], orig_dataset[i+offset][1], orig_dataset[i][2]])) #append the row with the offset fishery
    else:
        for i in range(len(orig_dataset) - abs(offset)):
            result.append(np.array([orig_dataset[i][0], orig_dataset[i][1], orig_dataset[i-offset][2]])) #if negative, append the row with the offset larva backwards
    if not result:
        return pd.DataFrame(columns=["Year", "Fishery", "Larva"])
    return pd.DataFrame(data = np.array(result), columns=["Year", "Fishery", "Larva"]) #convert back to dataframe and return it

    


larva_orig = pd.read_csv('data/Fishlarvaldata_Capstone_2021_FromAndrewThompson_updated 1804 1904 1507 1607 1601 1704 1604 1501 1407 1311 ichthyoplankton by line and station.csv')
fishery_updated = pd.read_csv('data/2232021_SummaryByQuarter_blockgrouping_87-20_210223_Redacted.csv')

# clean data 
cleaned_fishery = fishery_updated.dropna(how='any')
cleaned_fishery = cleaned_fishery[cleaned_fishery['Total Price'] != ' ']

curr_list = [] # correlation for each year offset (sardines)
scientific = 'Sardinops.sagax'
common = 'Sardine, Pacific'
for offset in range(-3, 8):
    offset_df = offset_larva_catch(scientific, common, offset) # shift the fishery catches by the current offset (in years)
    if len(offset_df) < 2:
        continue
    corr = local_correlation(offset_df) # Pearson correlation between fishery and larva at this offset
    curr_list.append(np.array([offset, corr[0]]))
curr_list = np.array(curr_list)




frames = []
for i in range(1, len(curr_list)):
    frames.append(go.Frame(data=[go.Scatter(x=curr_list[:i, 0], y=curr_list[:i, 1])]))

fig = go.Figure(
    data=[go.Scatter(x=curr_list[:1, 0], y=curr_list[:1, 1])],
    layout=go.Layout(
        xaxis=dict(range=[-3, 6], autorange=False),
        yaxis=dict(range=[0, 1], autorange=False),
        title="Correlation vs Years Offset (Sardines)",
        xaxis_title = "Years Offset (Fishery Catch year - Larval Catch Year)",
        yaxis_title = "Pearson Correlation",
        updatemenus=[dict(
            type="buttons",
            buttons=[dict(label="Play",
                          method="animate",
                          args=[None])])]
    ),
    frames=frames
)

fig.show()

As you can see, the correlation peaks when the fishery catch is offset several years after the larval catch. This appears to support our hypothesis that a high larval catch leads to a high fishery catch several years later.

If this is investigated further, it could have very exciting impacts, both environmental and economic. Being able to accurately predict future fishery catches could allow us to respond to climate threats more quickly, and allow companies to make more informed economic decisions based on how abundant the catch will be!
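
As a rough illustration of that idea, the lagged relationship could be refit in the opposite direction to anticipate next year's catch from this year's larval abundance. The sketch below does this with the sardine_data2 frame, assuming (as described above) that each of its rows pairs a year's larval abundance with the following year's catch; the larvae_to_catch model and example_larvae_lbs value are hypothetical, and this is an illustration rather than a validated forecast.

# hypothetical sketch: predict next year's catch from this year's larval abundance,
# using the 1-year-lagged pairs in sardine_data2 (illustration only, not a validated forecast)
larvae_to_catch = LinearRegression()
larvae_to_catch.fit(sardine_data2['Count'].values.reshape(-1, 1),  # larvae observed in year t
                    sardine_data2['CatchLbs'].values)              # catch recorded in year t + 1
example_larvae_lbs = 50000  # made-up larval abundance for illustration
predicted_catch = larvae_to_catch.predict([[example_larvae_lbs]])
print(f"Predicted catch one year out: {predicted_catch[0]:,.0f} lbs")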