The name of the project is Bibliometric Engine
It is a piece of software that takes as input a file in a particular format (CSV), containing citations between scholarly documents, each identified by a Digital Object Identifier (DOI)
The goal of the software is to run particular analyses and extractions on such data
The project must be implemented by a group of people
You need to
form groups of at least 3 and at most 4 people
choose a name for the group (yes: a name) - it will be used to publish the ranks of the best-performing projects (more info later)
communicate the name of the group and its members (including their emails) to me by sending an email to silvio.peroni@unibo.it
Final deadline: all groups must be ready by next Wednesday (16 December) at the latest
from <group_name> import *

class BibliometricEngine(object):
    def __init__(self, citations_file_path):
        self.data = process_citations(citations_file_path)

    def compute_impact_factor(self, dois, year):
        return do_compute_impact_factor(self.data, dois, year)

    def get_co_citations(self, doi1, doi2):
        return do_get_co_citations(self.data, doi1, doi2)

    def get_bibliographic_coupling(self, doi1, doi2):
        return do_get_bibliographic_coupling(self.data, doi1, doi2)

    def get_citation_network(self, start, end):
        return do_get_citation_network(self.data, start, end)

    def merge_graphs(self, g1, g2):
        return do_merge_graphs(self.data, g1, g2)

    def search_by_prefix(self, prefix, is_citing, dump):
        if dump is None:
            return do_search_by_prefix(self.data, prefix, is_citing)
        else:
            return do_search_by_prefix(dump, prefix, is_citing)

    def search(self, query, field, dump):
        if dump is None:
            return do_search(self.data, query, field)
        else:
            return do_search(dump, query, field)

    def filter_by_value(self, query, field, dump):
        if dump is None:
            return do_filter_by_value(self.data, query, field)
        else:
            return do_filter_by_value(dump, query, field)
You have to develop a Python file named after your group, where spaces are replaced by underscores and the whole filename is lowercase
E.g.: group Best group ever, file best_group_ever.py
The import statement must specify the module implemented
E.g.: from best_group_ever import *
Each group has to implement the functions that have been highlighted in red in the previous slide
def process_citations(citations_file_path)
It takes as input a comma-separated value (CSV) file, and returns a data structure containing all the data included in the CSV in some form
The data can be preprocessed, changed according to some empirical rule, ordered in a certain way, etc.
These data will be automatically provided in input to the other functions
citing                     | cited                         | creation   | timespan
---------------------------|-------------------------------|------------|---------
10.2964/jsik_2020_003      | 10.1007/s11192-019-03217-6    | 2020-02-29 | P0Y5M15D
10.1007/s11192-019-03311-9 | 10.1007/s11192-019-03217-6    | 2019-12-04 | P0Y2M20D
10.1007/s11192-019-03217-6 | 10.1007/978-3-030-00668-6_8   | 2019-09-14 | P1Y
10.1007/s11192-019-03217-6 | 10.1007/978-3-319-11955-7_42  | 2019-09-14 | P5Y
...                        | ...                           | ...        | ...
citing (string): the DOI of a citing article
cited (string): the DOI of a cited article
creation (string): the publication date of the citing article (date format: YYYY-MM-DD, YYYY-MM, or YYYY)
timespan (string): the difference between the publication date of the citing article and the publication date of the cited article (duration format: PnYnMnD, PnYnM, or PnY, with a leading - before P for negative durations)
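As a concrete starting point, process_citations could be sketched as follows. The choice of a list of dictionaries (one per CSV row) is only an assumption of this sketch; any other data structure (e.g. an indexed dictionary, or a preprocessed/ordered collection as the slide allows) is equally valid.

```python
# A minimal sketch of process_citations, under the assumption that the
# returned data structure is a plain list of dictionaries, one per CSV row,
# keyed by the column names (citing, cited, creation, timespan).
from csv import DictReader

def process_citations(citations_file_path):
    # DictReader maps each row to a dict using the header line as keys
    with open(citations_file_path, encoding="utf-8") as csv_file:
        return list(DictReader(csv_file))
```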
def do_compute_impact_factor(data, dois, year)
data: the data returned by process_citations
dois: a set of DOIs identifying articles
year: a string in format YYYY defining the year to consider
It returns a number which is the result of the computation of the Impact Factor (IF) for such documents. The IF of a set of documents dois on a year year is computed by counting the number of citations all the documents in dois have received in year year, and then dividing such a value by the number of documents in dois published in the previous two years (i.e. in year-1 and year-2).
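The definition above can be sketched in code. Two assumptions of this sketch (not stated in the slides): data is a list of row dictionaries with keys citing, cited and creation, and the publication year of an article is recovered from the creation field of the rows in which it appears as the citing article (articles that never cite anything therefore have no recoverable year here).

```python
# Sketch of do_compute_impact_factor under the assumptions stated above.

def publication_year(data, doi):
    # Assumed heuristic: the year of an article is the year of the rows
    # in which it appears as the citing article
    for row in data:
        if row["citing"] == doi:
            return row["creation"][:4]
    return None

def do_compute_impact_factor(data, dois, year):
    # Citations received in `year` by any article in `dois`
    received = sum(1 for row in data
                   if row["cited"] in dois and row["creation"][:4] == year)
    # Articles in `dois` published in the two previous years
    prev = {str(int(year) - 1), str(int(year) - 2)}
    published = sum(1 for doi in dois if publication_year(data, doi) in prev)
    # The behaviour when no articles were published in year-1/year-2 is not
    # specified by the slides; returning 0 here is an arbitrary choice
    return received / published if published else 0
```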
def do_get_co_citations(data, doi1, doi2)
data: the data returned by process_citations
doi1: the DOI string of the first article
doi2: the DOI string of the second article
It returns an integer defining how many times the two input documents are cited together by other documents.
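A co-citation count can be sketched as the number of distinct documents that cite both inputs, again assuming (an assumption of this sketch) that data is a list of row dictionaries with keys citing and cited:

```python
# Sketch of do_get_co_citations: count the distinct documents citing
# both doi1 and doi2.
def do_get_co_citations(data, doi1, doi2):
    citing1 = {row["citing"] for row in data if row["cited"] == doi1}
    citing2 = {row["citing"] for row in data if row["cited"] == doi2}
    # Documents that cite doi1 and doi2 together
    return len(citing1 & citing2)
```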
def do_get_bibliographic_coupling(data, doi1, doi2)
data: the data returned by process_citations
doi1: the DOI string of the first article
doi2: the DOI string of the second article
It returns an integer defining how many times the two input documents both cite the same document.
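Bibliographic coupling is the mirror image of co-citation: the count of documents that both inputs cite. A sketch, assuming a list-of-dictionaries data structure with keys citing and cited:

```python
# Sketch of do_get_bibliographic_coupling: count the distinct documents
# cited by both doi1 and doi2.
def do_get_bibliographic_coupling(data, doi1, doi2):
    cited1 = {row["cited"] for row in data if row["citing"] == doi1}
    cited2 = {row["cited"] for row in data if row["citing"] == doi2}
    # References shared by the two citing documents
    return len(cited1 & cited2)
```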
def do_get_citation_network(data, start, end)
data: the data returned by process_citations
start: a string defining the starting year to consider (format: YYYY)
end: a string defining the ending year to consider (format: YYYY) - it must be equal to or greater than start
It returns a directed graph containing all the articles involved in citations if both of them have been published within the input start-end interval (start and end included). Use the DOIs of the articles involved in citations as names of the nodes.
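The slides do not mandate a graph library, but the later mention of DiGraphs suggests NetworkX. Assuming NetworkX, a list-of-dictionaries data structure, and the same heuristic used for the Impact Factor (the publication year of an article is taken from the creation field of rows where it appears as citing), a sketch could be:

```python
# Sketch of do_get_citation_network using networkx (an assumption, not a
# requirement stated in the slides).
import networkx as nx

def publication_year(data, doi):
    # Assumed heuristic: year recovered from rows where `doi` is citing
    for row in data:
        if row["citing"] == doi:
            return int(row["creation"][:4])
    return None

def do_get_citation_network(data, start, end):
    g = nx.DiGraph()
    start, end = int(start), int(end)
    for row in data:
        y1 = publication_year(data, row["citing"])
        y2 = publication_year(data, row["cited"])
        # Add the edge only if both articles fall in [start, end]
        if y1 is not None and y2 is not None \
                and start <= y1 <= end and start <= y2 <= end:
            g.add_edge(row["citing"], row["cited"])
    return g
```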
def do_merge_graphs(data, g1, g2)
data: the data returned by process_citations
g1: the first graph to consider
g2: the second graph to consider
It returns a new graph being the merge of the two input graphs if these are of the same type (e.g. both DiGraphs). In case the types of the graphs are different, it returns None.
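Assuming NetworkX graphs (suggested by the DiGraph example, though not mandated), the merge can be sketched with nx.compose, which returns a new graph with the union of the nodes and edges of its arguments:

```python
# Sketch of do_merge_graphs using networkx.compose (an assumption of this
# sketch; any equivalent union of nodes and edges would do).
import networkx as nx

def do_merge_graphs(data, g1, g2):
    # Graphs of different types (e.g. DiGraph vs Graph) cannot be merged
    if type(g1) is not type(g2):
        return None
    return nx.compose(g1, g2)
```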
def do_search_by_prefix(data, prefix, is_citing)
data: the data returned by process_citations or by other search/filter activities
prefix: a string defining the precise prefix (i.e. the part before the first slash) of a DOI
is_citing: a boolean telling if the operation should be run on citing articles or not
It returns a sub-collection of citations in data where either the citing DOI (if is_citing is True) or the cited DOI (if is_citing is False) is characterised by the input prefix.
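Since the prefix is the part of the DOI before the first slash, the lookup reduces to an exact comparison on that part. A sketch, assuming a list-of-dictionaries data structure:

```python
# Sketch of do_search_by_prefix: keep the rows whose citing (or cited)
# DOI has exactly the given prefix.
def do_search_by_prefix(data, prefix, is_citing):
    field = "citing" if is_citing else "cited"
    # The prefix of a DOI is the part before the first slash
    return [row for row in data
            if row[field].split("/", 1)[0] == prefix]
```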
def do_search(data, query, field)
data: the data returned by process_citations or by other search/filter activities
query: a string defining the query to do on the data
field: a string defining the column (it can be either citing, cited, creation, or timespan) on which to run the query
It returns a sub-collection of citations in data where the query matched on the input field. It is possible to use wildcards in the query. If no wildcards are used, there must be a complete match with the string in query to return that citation in the results.
Multiple wildcards * can be used in query. E.g. World*Web looks for all the strings that match the word World, followed by zero or more characters, followed by the word Web (examples: World Wide Web, World Spider Web, etc.).
Boolean operators can be used: and, or, not, with the shape <tokens 1> <operator> <tokens 2>
All matches are case insensitive - e.g. specifying World as query will also match strings that contain world
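The wildcard part of the matching can be sketched by translating each * into the regular-expression pattern .* and requiring a full, case-insensitive match. This sketch covers only wildcards; handling the boolean operators (and, or, not) is left out and would need a small query parser on top of it.

```python
# Sketch of the wildcard matching for do_search. Boolean operators are
# intentionally not handled here.
import re

def match(query, value):
    # Each * matches zero or more characters; everything else is literal.
    # A complete (full) match is required when no wildcard is present.
    pattern = ".*".join(re.escape(token) for token in query.split("*"))
    return re.fullmatch(pattern, value, re.IGNORECASE) is not None

def do_search(data, query, field):
    # Assumes data is a list of row dictionaries
    return [row for row in data if match(query, row[field])]
```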
def do_filter_by_value(data, query, field)
data: the data returned by process_citations or by other search/filter activities
query: a string defining the query to do on the data
field: a string defining the column (it can be either citing, cited, creation, or timespan) on which to run the query
It returns a sub-collection of citations in data where the query matched on the input field. No wildcards are permitted in the query, only comparisons.
Comparison operators can be used in query: <, >, <=, >=, ==, !=, with the shape <operator> <tokens>
Boolean operators can be used: and, or, not, with the shape <tokens 1> <operator> <tokens 2>
All matches are case insensitive - e.g. specifying World as query will also match strings that contain world
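A single comparison of the shape <operator> <tokens> can be sketched by splitting the query on the first space and dispatching to the standard comparison operators. As with the search sketch, boolean operators are left out; note also that this sketch compares values as (lowercased) strings, which works for the date and duration formats shown above but is an assumption, not a stated requirement.

```python
# Sketch of do_filter_by_value for a single comparison; boolean operators
# are intentionally not handled here.
import operator

OPS = {"<": operator.lt, ">": operator.gt, "<=": operator.le,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

def do_filter_by_value(data, query, field):
    # Query shape: "<operator> <tokens>", e.g. ">= 2020-01-01"
    op_symbol, value = query.split(" ", 1)
    op = OPS[op_symbol]
    # Case-insensitive comparison on the string values of the field
    return [row for row in data if op(row[field].lower(), value.lower())]
```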
Use test-driven development to understand whether you are doing the right job
The project, i.e. the implementation of the functions including any additional ancillary functions developed, must be included in a single file named after your group, e.g. best_group_ever.py
The file must be sent by email to me at silvio.peroni@unibo.it
Submission: 2 days before the exam session (the whole group must attend the session) - e.g. send the project on the 27th of January for discussing it at the session of the 29th of January
Your project will be compared with the others in terms of efficiency (i.e. time needed for addressing specific tasks)
Maximum score = c-score + e-score + o-score = 16
All projects will run on large CSV files
Correctness of the result: c-score <= 4
Efficiency of the software: e-score <= 4; projects are ranked according to the time spent for addressing various tasks, and assigned an e-score of 4, 3, 2, or 1 according to their rank
Oral colloquium: -8 <= o-score <= 8; it is a personal score - each member of the group has their own
A total score of at least 2 is needed to pass the exam; otherwise additional effort is required (a new function)